Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add type mapping APIs to customize JSON value serialization/deserialization #30677

Closed
Tracked by #30731
roji opened this issue Apr 12, 2023 · 15 comments
Closed
Tracked by #30731

Add type mapping APIs to customize JSON value serialization/deserialization #30677

roji opened this issue Apr 12, 2023 · 15 comments
Assignees
Labels
area-json area-query area-type-mapping closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. type-enhancement
Milestone

Comments

@roji
Copy link
Member

roji commented Apr 12, 2023

When reading and writing JSON documents, we need to control the precise conversion from a .NET value to a JSON element, to be integrated in the larger JSON document. This is somewhat similar to GenerateNonNullSqlLiteral, but for values inside JSON. In simple cases (int, string), no special logic is needed; but in certain cases, we need to precisely control the representation. For example, a NetTopologySuite Geometry would likely need to be serialized as WKT, which isn't necessarily what ToString returns.

Note the interaction between this issue and #30604, which is about switching to the lower-level Utf8JsonReader/Utf8JsonWriter APIs rather than passing through DOM. If/when we do that, the type mappings would be the ones calling specific APIs e.g. on Utf8JsonWriter.

In addition to serialization/deserialization, we also need a hook for perform a SQL conversion from the JSON (text) value to the actual store type; this is needed when extracting a value from a JSON document and e.g. comparing them to a regular database column. For example, the JSON representation of a timestamp is ISO8601 (with a T), but in SQLite, a timestamp is typically represented as a timestamp text without the T (and so datetime() must be invoked). Similarly, to convert a geometry WKT representation to an actual SQL Server geometry type, a function to parse the WKT needs to be invoked. Moving the server-side conversion part to #30727.

Tests affected:

  • NorthwindJoinQueryTestBase.Join_local_string_closure_is_cached_correctly (string treated as enumerable of chars)
  • NorthwindJoinQueryTestBase.Join_local_bytes_closure_is_cached_correctly (same)
@AndriySvyryd
Copy link
Member

AndriySvyryd commented Apr 21, 2023

Consider using an abstraction that doesn't depend on a particular implementation, see CosmosSerializer.
For example, CoreTypeMapping could have T FromJson<T>(Stream) and void ToJson<T>(T, Stream) (and async versions) that would only need work for a T supported by a given type mapping and would instantiate (without allocating) a Utf8JsonReader/Utf8JsonWriter in the implementation.
To replace the underlying serializer a provider-specific ITypeMappingSourcePlugin implementation could be used (#28043)

The base implementation could call JsonSerializer.Serialize/Deserialize

We could allow JsonReaderOptions and JsonWriterOptions or a custom type in context options to control the default implementation.

@AndriySvyryd
Copy link
Member

Looking into this a bit more I don't think it's possible to create a performant abstraction that would allow to replace the implementation without massive amounts of code.

Some things to consider:

  • We can't control in what order the properties will come out of the query, so we can't get the correct type mapping until we read the property name.
  • When reading from a stream we need a callback to increase the buffer size in case a token doesn't fit.

So here's another proposal:

  • T FromJson<T>(ref Utf8JsonReader, Action<bool>?)
  • void ToJson<T>(Utf8JsonWriter, T)

For the server-side conversion consider #30729.

@roji
Copy link
Member Author

roji commented Apr 22, 2023

We could allow JsonReaderOptions and JsonWriterOptions or a custom type in context options to control the default implementation.

Opened #30744 to track that.

We can't control in what order the properties will come out of the query, so we can't get the correct type mapping until we read the property name.

Yeah; our core deserialization logic would be the one iterating over the properties, and calling the type mapping API only for the values of those properties based on the names.

When reading from a stream we need a callback to increase the buffer size in case a token doesn't fit.

Unless I'm mistaken, EF currently always reads the entire row into memory at the DbDataReader level (i.e. we don't specify CommandBehavior.SequentialAccess); one reason for that is that expression trees aren't compatible with async, which would be necessary if we wanted to stream the row.

However, it seems that Utf8JsonReader buffers by property anyway. In other words, according to this sample, you call Read() to get the next property, and if that returns false, that means there aren't enough bytes in the buffer and you need to get more. If that returns true, it seems that you're guaranteed to be able to read the next property; note that there's a GetString() API, but no TryGetString - which I understand to mean that string values are buffered, and you never need to deal with streaming the property value. If that's the case (we can verify), we don't need to worry about this at all.

T FromJson(ref Utf8JsonReader, Action?)
void ToJson(Utf8JsonWriter, T)

Looks good - though what's the Action<bool> for?

@AndriySvyryd
Copy link
Member

Looks good - though what's the Action for?

That's the callback to increase the buffer size.

If that returns true, it seems that you're guaranteed to be able to read the next property;

Utf8JsonReader doesn't buffer at all. If the current token doesn't fit in the passed in buffer you need to increase the buffer size to be able to keep the part that it tried to read so far. In other words, when reading from a stream the buffer must be able to fit the largest token.

@roji
Copy link
Member Author

roji commented Apr 22, 2023

@AndriySvyryd the question is whether a single token is buffered - the Utf8JsonReader seems to suggest it is... To read a single string property via Utf8JsonReader, I think the user does the following:

if (!reader.Read())
{
    // Not enough bytes for the next token!
    // Return to read more bytes from the stream etc.
}

var string = reader.GetString();

GetString() doesn't have a way of telling you that it didn't have enough bytes (e.g. there's no TryGetString). So if I'm getting it right, within a specific property value (which is what our type mapping API handles), we shouldn't need to worry about this. We can set up a quick experiment to prove this.

Am I misunderstanding the API?

@AndriySvyryd
Copy link
Member

GetString() doesn't have a way of telling you that it didn't have enough bytes (e.g. there's no TryGetString).

Right, but Read won't return true unless the next token is completely contained in the buffer. And we'll need to call it for collections of primitive types.

@AndriySvyryd
Copy link
Member

Perhaps a better signature would be
bool TryReadFromJson<T>(ref Utf8JsonReader, out T)

@AndriySvyryd
Copy link
Member

Perhaps a better signature would be
bool TryReadFromJson(ref Utf8JsonReader, out T)

No, this implies that the buffer should fit the entire value, not just the largest token.

How about this?

ref struct Utf8JsonReaderManager
{
    readonly Stream? _stream;
    byte[]? _buffer;
    ref Utf8JsonReader _currentReader;

    public ref Utf8JsonReader MoveNext();
}

T FromJson(ref Utf8JsonReaderManager)

@roji
Copy link
Member Author

roji commented Apr 22, 2023

GetString() doesn't have a way of telling you that it didn't have enough bytes (e.g. there's no TryGetString).

Right, but Read won't return true unless the next token is completely contained in the buffer. And we'll need to call it for collections of primitive types.

I think I understand where we have a disconnect... The APIs I was thinking of here were about the element type mapping, i.e. reading a timestamp, int or string (i.e. a primitive JSON token) - those shouldn't be a problem. But you're right that we need to think about the collection, and also about nested collections.

However, if we do want to allow streaming, don't we have a bigger general problem with value converters, regardless of JSON? In other words, since we're implementing serialization/deserialization of the primitive collection as a value converter, we'd need streaming support for value converters, which is something we don't have at the moment... Maybe let's discuss on Monday...

@roji
Copy link
Member Author

roji commented Apr 24, 2023

Design decision: the type mapping API should support streaming. We won't do real streaming (since we don't support that on e.g. DbDataReader because of LINQ expression tree limitations). However, if we implement optimized UTF8 Stream access via DbDataReader.GetStream() (should be supported at least with PostgreSQL, SQL Server), we can read small chunks out of that stream and hand those chunks to the type mapping.

@roji
Copy link
Member Author

roji commented Apr 24, 2023

ref struct Utf8JsonReaderManager

Yeah, generally looks good - though if we go in this direction, maybe it's possible to simply expose GetNextReader(), and encapsulate the details of reading more into the buffer and creating the new reader inside the manager? In other words, I use my reader until there's no more data; at that point I go to the manager and ask for the next reader - it reads from the stream into the buffer, creates a new reader of that and returns it to me.

If we want to, we should be able to make the manager API shape compatible with real streaming (in case we can make that work one day). That would mean making the manager a regular struct, and having a MoveNextAsync returning ValueTask<bool>, after which you call GetReader which returns the reader. The reader needs to be passed separately from the manager (since the manager is no longer a ref struct), and the caller also needs to tell the manager how much data the reader consumed before calling MoveNextAsync (some sort of Commit method that accepts the previous reader and gets BytesConsumed from it.

Whether it's worth the extra API shape complexity is a good question...

/cc @NinoFloris @vonzshik, this bears a bit of resemblence to our type handling stuff in Npgsql :)

@AndriySvyryd
Copy link
Member

maybe it's possible to simply expose GetNextReader(), and encapsulate the details of reading more into the buffer and creating the new reader inside the manager? In other words, I use my reader until there's no more data; at that point I go to the manager and ask for the next reader - it reads from the stream into the buffer, creates a new reader of that and returns it to me.

I thought that's what my proposal was. What do you think needs to change in the API shape?

If we want to, we should be able to make the manager API shape compatible with real streaming

For async we'll need a separate method - FromJsonAsync and a different manager type anyway, so we can design it later.

@roji
Copy link
Member Author

roji commented Apr 25, 2023

I thought that's what my proposal was. What do you think needs to change in the API shape?

Ah, I misread your proposal as exposing the fields publicly. All good.

For async we'll need a separate method - FromJsonAsync and a different manager type anyway, so we can design it later.

Sure, though at that point it would be a breaking change for all existing type mappings.

One more idea regardless: we should can consider guaranteeing that Read() has already been called on the reader before the manager was passed to the API, and that it returned true. That would allow simple type mappings which just need a single token to just always read it from the reader, without dealing with the Read, getting a new reader, etc. Complex types (like collections) would need to deal with the complexity, but that will be very rare.

@AndriySvyryd
Copy link
Member

One more idea regardless: we should can consider guaranteeing that Read() has already been called on the reader before the manager was passed to the API, and that it returned true. That would allow simple type mappings which just need a single token to just always read it from the reader, without dealing with the Read, getting a new reader, etc. Complex types (like collections) would need to deal with the complexity, but that will be very rare.

Yes, we can always do this in Utf8JsonReaderManager when returning a new reader, to make the consuming logic simpler

roji added a commit to roji/efcore that referenced this issue Apr 26, 2023
Blocked on dotnet#30677 for serializing spatial types as WKT in JSON

Closes dotnet#30630
@roji roji changed the title Add type mapping APIs to customize JSON value serialization/deserialization and querying Add type mapping APIs to customize JSON value serialization/deserialization May 25, 2023
@roji
Copy link
Member Author

roji commented May 25, 2023

Note I moved the server-side conversion part of this to #30727, so this only tracks the serialization/deserialization component.

ajcvickers added a commit that referenced this issue Jul 15, 2023
Part of #30677

There is still work to do here, but this is basically working so I wanted to get it out for review.

Remaining work:
- Negative cases/exception messages
- Re-introduce checks for whether or not query supports primitive collections
- Cases where the primitive collection object exists and should be added to
- Custom element type mapping on the property
- Custom element reader/writer
- Custom collection reader/writer
- Use default element type mapping configured in ConfigureConventions
- Update Cosmos primitive collection mapping
- More query tests!
ajcvickers added a commit that referenced this issue Jul 17, 2023
Part of #30677

Builds on #31272

This is the case where the CLR types contains, for example, an `ObservableCollection<int>`. In this case, we populate this existing collection, rather than creating a new instance.

Same still needs to be done for relational materialization of primitive collections.
ajcvickers added a commit that referenced this issue Jul 19, 2023
Part of #30677

There is still work to do here, but this is basically working so I wanted to get it out for review.

Remaining work:
- Negative cases/exception messages
- Re-introduce checks for whether or not query supports primitive collections
- Cases where the primitive collection object exists and should be added to
- Custom element type mapping on the property
- Custom element reader/writer
- Custom collection reader/writer
- Use default element type mapping configured in ConfigureConventions
- Update Cosmos primitive collection mapping
- More query tests!
ajcvickers added a commit that referenced this issue Jul 19, 2023
Part of #30677

Builds on #31272

This is the case where the CLR types contains, for example, an `ObservableCollection<int>`. In this case, we populate this existing collection, rather than creating a new instance.

Same still needs to be done for relational materialization of primitive collections.
@ajcvickers ajcvickers added the closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. label Sep 6, 2023
@ajcvickers ajcvickers modified the milestones: 8.0.0, 8.0.0-rc2 Sep 6, 2023
@ajcvickers ajcvickers modified the milestones: 8.0.0-rc2, 8.0.0 Nov 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-json area-query area-type-mapping closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. type-enhancement
Projects
None yet
Development

No branches or pull requests

3 participants