
Handling (serialization) of nested containers #342

Open
primeos-work opened this issue Nov 22, 2023 · 8 comments

@primeos-work

primeos-work commented Nov 22, 2023

I haven't found any prior discussions on this, so I thought it might be useful to open an issue for it, as it's both a feature request and somewhat of a correctness bug.

The current behavior of this crate is to try to flatten nested containers:

https://docs.rs/csv/1.3.0/csv/struct.Writer.html#rules

The behavior of serialize is fairly simple:

  1. Nested containers (tuples, Vecs, structs, etc.) are always flattened (depth-first order).
  2. If has_headers is true and the type contains field names, then a header row is automatically generated.

However, some container types cannot be serialized, and if has_headers is true, there are some additional restrictions on the types that can be serialized. See below for details.
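Rule 1 (depth-first flattening) can be illustrated with a small stdlib-only sketch; the `Value` enum and `flatten` function below are hypothetical models for illustration, not this crate's API:

```rust
// Toy model of the flattening rule: nested containers contribute their
// leaves, depth-first, to a single flat record.
enum Value {
    Field(String),
    Container(Vec<Value>),
}

fn flatten(value: &Value, out: &mut Vec<String>) {
    match value {
        Value::Field(s) => out.push(s.clone()),
        Value::Container(items) => {
            for item in items {
                flatten(item, out);
            }
        }
    }
}

fn main() {
    // Something like a struct { a: 1, b: vec![4, 5] }:
    let record = Value::Container(vec![
        Value::Field("1".into()),
        Value::Container(vec![Value::Field("4".into()), Value::Field("5".into())]),
    ]);
    let mut fields = Vec::new();
    flatten(&record, &mut fields);
    assert_eq!(fields, vec!["1", "4", "5"]);
}
```

Note how the number of fields in the flat record depends on the length of the inner container, which is exactly what clashes with the "same number of fields per record" rule discussed below.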

This design decision is understandable as CSV is pretty limited and simply doesn't support nested containers (vs. JSON, etc.).
However, I do wonder if this is a good default as it interferes with correctness.

Disclaimer: My knowledge of CSV, Rust, and especially this crate is limited but here are my considerations:

The main problem seems to be that there is no official CSV specification. I've used https://en.wikipedia.org/wiki/Comma-separated_values and https://datatracker.ietf.org/doc/html/rfc4180 as references.
(Aside: Perhaps it would also make sense to document which specification this crate intends to conform to?)

A good rule seems to be the following:

All records should have the same number of fields, in the same order.

IMO this is a requirement for correctly parsing CSV files (with some exceptions like time series), especially if the first record is a "header" - otherwise there just isn't enough context to understand the structure of the data.

This crate seems to support that rule as well:
https://docs.rs/csv/1.3.0/csv/struct.Reader.html#error-handling

By default, all records in CSV data must have the same number of fields. If a record is found with a different number of fields than a prior record, then an error is returned. This behavior can be disabled by enabling flexible parsing via the flexible method on ReaderBuilder.

The writer also enforces that rule by default:
https://docs.rs/csv/1.3.0/csv/struct.WriterBuilder.html#method.flexible

When disabled (which is the default), writing CSV data will return an error if a record is written with a number of fields different from the number of fields written in a previous record.

So the default is to flatten nested containers (like a Vec in a struct) and error until flexible(false) is used.
It certainly makes sense that flexible defaults to true but I'd prefer if there was an option to support nested containers without changing the number of fields.

I propose the following ideas (not sure how feasible they are though):


1. Let the user supply a custom string function to handle the transformation

If possible, this crate would call that function each time it backtracks from such a nested container (optionally with the depth level?). The user could then choose an escaping technique to ensure that the string will represent a single CSV field. This approach is limited as the type information / context is lost, but one could at least do simple transformations like quoting and escaping or replacing the delimiters.

Using #254 as an example, the user could use that custom function to rewrite 4,5,6,7 (Vec<i64>) to, e.g., "4,5,6,7", 4;5;6;7, or [4,5,6,7] (like in the desired example). Additional recursion levels could be supported through nested quoting, more delimiters (in that case the function should be called with the depth level), or custom approaches.

Ideally, this would be pretty easy to integrate and flexible enough for most use cases.
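To make this proposal concrete, such a user-supplied transformation could look roughly like the following stdlib-only sketch (the function name and the RFC 4180-style quoting are my illustrative choices, not an API of this crate):

```rust
/// Hypothetical user-supplied transformation: collapse the fields of a
/// nested container into a single CSV field using RFC 4180-style quoting
/// (embedded quotes are doubled, the whole value is wrapped in quotes).
fn collapse_nested(fields: &[&str]) -> String {
    let joined = fields.join(",");
    format!("\"{}\"", joined.replace('"', "\"\""))
}

fn main() {
    // A Vec<i64> like [4, 5, 6, 7], already stringified, becomes one field:
    let collapsed = collapse_nested(&["4", "5", "6", "7"]);
    assert_eq!(collapsed, "\"4,5,6,7\"");
    println!("{collapsed}");
}
```

A depth-aware variant could switch delimiters (`,`, `;`, …) per recursion level instead of nesting quotes.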

2. Let the user supply a custom function that handles the nested containers

Similar to 1. but the function would get the raw data and produce the string. The type information would be preserved, but it would require reflection, increase the complexity, and could be considered out of scope.

3. Offer generic techniques to ensure nested containers can be put into a single CSV field

This crate would implement one (or multiple) of the (hopefully pretty universal) custom string functions mentioned in 1. (possibly forcing the introduction of additional constraints).


IMO my first proposal could be a decent tradeoff but I might've missed something.
What do you think @BurntSushi?

PS: I'm currently looking into a somewhat "exotic" use case with @ammernico (#254 (comment)) where the types come from an API specification and mainly consist of structures that contain some vectors. This use case makes it difficult to flatten/convert the vectors into a string before passing the data to this CSV crate.

PPS: Huge thanks for this very useful crate and amazing documentation! :)

@primeos-work primeos-work changed the title Handling of nested containers Handling (serialization) of nested containers Nov 22, 2023
@BurntSushi
Owner

BurntSushi commented Nov 22, 2023

(Aside: Perhaps it would also make sense to document which specification this crate intends to conform to?)

The docs could be better. The answer to this should probably be in the crate docs. But the answer is in other places:

You are indeed correct that there really is no one agreed upon CSV specification that one can adhere to. As the links above elaborate on, the problem with RFC 4180 is that it's too strict. Essentially, the CSV world needs its version of HTML 5 (nothing is invalid). RFC 4180 is like XHTML (lots are invalid). For pretty much exactly the same reason: tons of real world CSV data is messy, and erroring on it just isn't useful (in most cases).

It certainly makes sense that flexible defaults to true

Eh? flexible defaults to false for both Reader and Writer.

but I'd prefer if there was an option to support nested containers without changing the number of fields.

I propose the following ideas (not sure how feasible they are though):

All of your proposals appear to already be possible today. All you should need to do is implement Serialize (or Deserialize) for your type. Serialize it to a string, which CSV knows how to encode. Same for deserialization. Have you tried this approach? If not, why not? And if so, what didn't work about it?

My understanding is that folks run into this nested container issue and expect it to be resolved automatically. It can't be. Not in any way that I can see without choosing a new serialization format to layer inside of CSV fields. It can't be resolved automatically because CSV doesn't support nested data. If you need to serialize nested data, then this crate can't really figure that out for you (aside from a few special cases). You either need to figure out how to convert it to a flattened representation or how to encode richer data inside of a single CSV field.

I suspect this is why all three of your proposals appear to have a common thread between them: they ask for the user to resolve the nested container issue manually by supplying some kind of transformation function. (Your third proposal does suggest providing some sensible defaults, but presumably the user would still need to make a choice.) As far as I know, resolving the nested container issue manually is not actually a problem because I believe it can already be done via Serde's framework. So there really isn't an issue there.

Instead, what I suspect people would like is something like, "hey if I have a nested container I don't want to hear about it, just JSON serialize it and stuff it into a single CSV field." But I'm pretty philosophically opposed to something like that personally, and I don't perceive it as a common enough problem to really worry about too much.

@primeos-work
Author

The docs could be better. The answer to this should probably be in the crate docs. But the answer is in other places:

Oh, thanks, I missed that 🙈. I looked in several places (focusing on the Writer) but should've used the search functionality.
Putting a (brief) version into the crate docs would be great IMO.

As the links above elaborate on, the problem with RFC 4180 is that it's too strict.

Yes, I was focused on the Writer and not the Reader (as nested containers are basically unsolvable for the Reader due to the lack of a strict standard that supports such data structures).

For the Writer it should make sense to be as strict as possible, as much stronger guarantees can be made.
Or as RFC 4180 puts it (interoperability considerations):

Due to lack of a single specification, there are considerable
differences among implementations. Implementors should "be
conservative in what you do, be liberal in what you accept from
others" (RFC 793 [8]) when processing CSV files. An attempt at a
common definition can be found in Section 2.
(my interpretation/opinion would be conservative and strict)

Eh? flexible defaults to false for both Reader and Writer.

Oops, sorry - that's what I meant but I somehow wrote true instead of false.

All of your proposals appear to already be possible today. All you should need to do is implement Serialize (or Deserialize) for your type. Serialize it to a string, which CSV knows how to encode. Same for deserialization. Have you tried this approach? If not, why not? And if so, what didn't work about it?

Yes, that would be ideal. In my "exotic" use case (sorry that it was quite hidden and too brief at the bottom in the "PS:") this isn't easily possible (AFAIK) as the types get generated at compile time and depend on the API specification. It should theoretically be possible to either generate the types in advance (but this becomes problematic when the API specification changes) or to use reflection to transform the data at run time (this is what we're currently exploring but it adds complexity).
(But when I think more about it I could probably solve my particular use case with a generic Serialize implementation for Vec<T> (or for a few common Ts if a generic implementation isn't possible). I didn't realize this before so thanks a lot for the suggestion!)

I guess another limitation should be that this approach becomes problematic when serializing to multiple different formats as there can only be a single serde::ser::Serialize implementation and one would probably only want to use flattening for CSV? (But I haven't looked into this so far / only glanced over the Serialize documentation)

It can't be resolved automatically because CSV doesn't support nested data.

Yes, agreed :)

I suspect this is why all three of your proposals appear to have a common thread between them: they ask for the user to resolve the nested container issue manually by supplying some kind of transformation function.

Right

(Your third proposal does suggest providing some sensible defaults, but presumably the user would still need to make a choice.)

Yes. That could make sense if we find a few good, generic, and "universal" solutions but I guess it'd be better to just support the custom transformation function and put the possible solutions as examples in the documentation.

Instead, what I suspect people would like is something like, "hey if I have a nested container I don't want to hear about it, just JSON serialize it and stuff it into a single CSV field."

Ideally :) But I agree that this is just not possible with CSV.

Anyway, that custom user function (more of a hack than a proper solution) could make sense if the two potential limitations of the serde::ser::Serialize approach, that I listed, make sense / are valid. If not, then Serialize should be sufficient.

@BurntSushi
Owner

this isn't easily possible (AFAIK) as the type gets generated during compile time and depends on the API specification

You would likely need to generate the Serialize impls for those types.

I guess another limitation should be that this approach becomes problematic when serializing to multiple different formats as there can only be a single serde::ser::Serialize implementation and one would probably only want to use flattening for CSV? (But I haven't looked into this so far / only glanced over the Serialize documentation)

Yeah you probably need to use newtypes or even build up the infrastructure yourself to call different serialization functions. (This might require embedding the functions on the data type? Not sure.)

Anyway, that custom user function (more of a hack than a proper solution) could make sense if the two potential limitations of the serde::ser::Serialize approach, that I listed, make sense / are valid. If not, then Serialize should be sufficient.

My suspicion is that there is an isomorphism here where anything the csv crate can do is possible with Serialize. I will say though that I haven't even thought about how a custom user function would even work in the serializer as it exists today. It's possible that it would be quite hokey.

The other hesitance you'll run into here is that the Serde integration in this crate is absolutely hideous. I would encourage you to look at it. And the vast majority of all issues/bugs/feature-requests on this repository are related to Serde integration. It is just overall simultaneously miserable to support (for CSV specifically) but also extremely convenient. What this means is that even if you're stuck, it's unlikely there's any reasonable path forward in this crate itself in any reasonable time span.

@primeos-work
Author

You would likely need to generate the Serialize impls for those types.

Yes, my initial hope was that I could override the Serialize implementation for Vec<T> but I completely forgot that Rust doesn't allow this (for good reasons but it would've been a handy hack here).

Having to replace all vectors with a custom vector type is unfortunately likely a dealbreaker (at least with my limited Rust knowledge) for my use case since the types are structures with quite a few fields and everything gets generated from the API specification. At that point it would probably be much easier to use a different approach/hack to handle the serialization.

or even build up the infrastructure yourself to call different serialization functions

That would also be handy for my use case. I'd like to call a custom serialization function for Vec<T> / change the serialization but I haven't found a (clean) way to do so and I'm not sure if it's possible (based on https://stackoverflow.com/questions/60008192/how-to-implement-a-custom-serialization-only-for-serde-json it also doesn't look good).

I'll look around a bit more but it doesn't look good for avoiding custom types.


In case it helps someone: here's a PoC/hack showing how one could serialize vectors into a single CSV field using a custom type:

use anyhow::Result;
use csv::WriterBuilder;
use serde::Serialize;
use serde::Serializer;

struct MyVec<T>(Vec<T>);

impl<T> Serialize for MyVec<T>
where
    T: Serialize + std::fmt::Debug,
{
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        // Debug-format the inner Vec (e.g. "[1, 2, 3]") and emit it as a
        // single string, which the csv writer encodes as one quoted field.
        let s = format!("{:?}", self.0);
        serializer.serialize_str(&s)
    }
}

#[derive(serde::Serialize)]
struct Record {
    a: MyVec<i32>,
    b: MyVec<i32>,
}

fn main() -> Result<()> {
    let record = Record {
        a: MyVec(vec![1, 2, 3]),
        b: MyVec(vec![4, 5, 6]),
    };

    let mut wtr = WriterBuilder::new()
        .has_headers(false)
        .from_path("out.csv")?;
    wtr.serialize(record)?;
    wtr.flush()?;

    Ok(())
}

The result:

$ cat out.csv
"[1, 2, 3]","[4, 5, 6]"

@BurntSushi
Owner

I wasn't necessarily thinking about custom impls for vecs, but custom impls for your API types.

@BurntSushi
Owner

And serde_derive lets you specify custom serialization functions for individual fields without introducing your own newtype wrapper.

@primeos-work
Author

I wasn't necessarily thinking about custom impls for vecs, but custom impls for your API types.

Yes, but again, the problem is that the implementation/source code of the API types gets auto-generated by a crate and a custom build script (plus I wouldn't really like to write a serializer for a large struct just to modify how a few fields get serialized). But I'd call that an exotic use case and it's pretty broken tbh (I never liked the idea of it in the first place). That crate is also very limited and appears to be unmaintained, so we'd better replace that part :)

And serde_derive lets you specify custom serialization functions for individual fields without introducing your own newtype wrapper.

Right, the #[serde(deserialize_with = "path")] attribute (https://serde.rs/variant-attrs.html#deserialize_with) and its serialize_with counterpart seem quite useful for that purpose, thanks! :)
That should be ergonomic enough to solve such use cases with minimal effort (which should be the goal, at least IMO).

Anyway, we can close this issue if you want. I'm not that happy with the current default but, considering your responses, I also don't really see a significantly better solution anymore (and prohibiting nested containers by default for correctness doesn't sound great either).
I guess the only actionable TODO would be to improve the documentation a bit, but hopefully this issue will also be discoverable enough to help in the meantime.

@BurntSushi
Owner

Aye. I'll leave this issue open for now. I will likely (some day) revisit the Serde integration in this crate and see whether it can be improved holistically. At the very least, we'll get better docs.
