Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Confusing relationship between has_headers and set_headers #275

Open
aleb opened this issue Apr 22, 2022 · 14 comments
Open

Confusing relationship between has_headers and set_headers #275

aleb opened this issue Apr 22, 2022 · 14 comments
Labels

Comments

@aleb
Copy link

aleb commented Apr 22, 2022

What version of the csv crate are you using?

csv = "1.1.6"

Briefly describe the question, bug or feature request.

The header line is parsed as the rest of the lines, when it should not be parsed. This results in failure when the column names are parsed and validated with serde.

Include a complete program demonstrating a problem.

I need to use the following workaround to make it work:

pub fn my_deser<'de, D>(deserializer: D) -> Result<MyFieldType, D::Error>
where
    D: serde::Deserializer<'de>,
{
    let raw: String = serde::Deserialize::deserialize(deserializer)?;
    println!("{}", &raw);
    Ok(MyFieldType::create_instance())
}

#[derive(Serialize, Deserialize, Default, Debug)]
struct MyEntry {
    #[serde(deserialize_with = "my_deser", alias = "aaaa")]
    meaningful_name: MyFieldType,
}

fn parse_csv(bytes_reader: &[u8]) -> anyhow::Result<Vec<MyEntry>> {
    // Skip the first line containing the headers to workaround the CSV reader
    // failing to ignore the first line despite has_headers(true).
    let first_newline = bytes_reader.iter().position(|&c| c == b'\n').unwrap();
    let bytes_reader: &[u8] = &bytes_reader[first_newline..];
    let mut reader = ReaderBuilder::new()
        .has_headers(true)
        .flexible(true)
        .delimiter(b',')
        .quoting(true)
        .quote(b'"')
        .trim(Trim::All)
        .from_reader(bytes_reader);
    reader.set_headers(csv::StringRecord::from(
        &[
            // Note we can't use a meaningful name here because this column name
            // is parsed as a value.
            "aaaa",
        ][..],
    ));
    reader
        .into_deserialize()
        .map(|e| match e {
            Ok(entry) => Ok(entry),
            Err(err) => Err(anyhow::anyhow!("Failed deserializing row: {:#?}", err)),
        })
        .collect::<Result<Vec<MyEntry>, _>>()
}

This is the CSV:

"name"
"000000"

Does not matter that has_headers is set to true or false, seems broken.

Also, note if I'm not manually skipping the first line, "name" is also parsed despite using set_headers() and despite has_headers(true), seems a different issue?

@BurntSushi
Copy link
Owner

Please provide a complete program, along with the input, that I can compile and build that reproduces your problem.

@aleb
Copy link
Author

aleb commented Apr 27, 2022

The example below shows the issue. Notice the parsing generates two items instead of just one, despite has_headers(true).

The question is what should happen when both has_headers and set_headers are called. Currently in this case the first line of the CSV is parsed as content. But I would argue that it should be ignored instead:

  • has_headers should mean that the first line of the CSV has headers and it's NOT data.
  • set_headers should set or overwrite the names of the columns, without any implication about the first line of the CSV.

Two examples of bogus headers we're dealing with, where this behavior would help:

000000	27.03.2022	03:05:01
480C43	T-057	A332	Airbus	A330-200MRTT
"","","","","","","","","","","","","","",""
"480c43","T-057","Airbus","","A330-243 MRTT","A330","","","","Nato Multinational Mrtt Fleet","Multi","MMF","","",""

Cargo.toml:

[package]
name = "issue"
version = "0.1.0"
edition = "2021"

[dependencies]
csv = "1.1.6"
serde = { version = "1.0", features = ["derive"] }
anyhow = "1.0.42"

main.rs:

use csv::ReaderBuilder;
use serde::Deserialize;


#[derive(Clone, Debug, Deserialize, PartialEq)]
pub struct AircraftMetadata {
    pub icao: String,
    pub tail: String,
    pub type_designator: String,
    pub model: Vec<String>,
}

fn parse_mictronics_aircraft_database(bytes_reader: &[u8]) -> anyhow::Result<Vec<AircraftMetadata>> {
    let mut reader = ReaderBuilder::new()
        .has_headers(true)
        .flexible(true)
        .delimiter(b'\t')
        .from_reader(bytes_reader);
    reader.set_headers(csv::StringRecord::from(
        &["icao", "tail", "type_designator", "model"][..],
    ));
    reader
        .into_deserialize()
        .map(|e| match e {
            Ok(entry) => Ok(entry),
            Err(err) => Err(anyhow::anyhow!("Failed deserializing row: {:#?}", err)),
        })
        .collect::<Result<Vec<AircraftMetadata>, _>>()
}

fn main() {
    let csv = "000000	27.03.2022	03:05:01
480C43	T-057	A332	Airbus	A330-200MRTT";
    let actual = parse_mictronics_aircraft_database(csv.as_bytes()).unwrap();
    let expected = vec![
        AircraftMetadata {
            icao: "000000".to_string(),
            tail: "27.03.2022".to_string(),
            type_designator: "03:05:01".to_string(),
            model: vec![],
        },
        AircraftMetadata {
            icao: "480C43".to_string(),
            tail: "T-057".to_string(),
            type_designator: "A332".to_string(),
            model: vec!["Airbus".to_string(), "A330-200MRTT".to_string()],
        },
    ];
    assert_eq!(actual.len(), expected.len());
    actual.into_iter().zip(expected).for_each(|(a, b)| {
        assert_eq!(a, b);
    });
}

@BurntSushi
Copy link
Owner

I can't make heads or tails of the problem you're facing, sorry. The program you've given me runs just fine. So what's the issue here exactly?

@aleb
Copy link
Author

aleb commented Apr 27, 2022

Yes, please remove the first element of the let expected = vec![ to see how it should work but fails

@BurntSushi
Copy link
Owner

Ah I see. Yeah, this is documented behavior. As the docs for set_headers state:

Any automatic detection of headers is disabled.

The issue here is that your CSV file does have headers. You just don't want them. You want to set your own. So you should set has_headers(true) and then read and throw away the headers found in the file. Like this:

    let mut reader = ReaderBuilder::new()
        .has_headers(true)
        .flexible(true)
        .delimiter(b'\t')
        .from_reader(bytes_reader);
    // Read and throw away the headers:
    reader.headers()?;
    reader.set_headers(csv::StringRecord::from(
        &["icao", "tail", "type_designator", "model"][..],
    ));

Then your program works, with the first record omitted the assertions below pass.

Thinking on this a bit more, I think I understand the issue here. Namely, that if has_headers(true), then the first record should always be parsed as headers even if set_headers has been called. Setting aside the fact that the existing behavior is documented and changing it would be a breaking change, the main issue with it is that set_headers is meant to override header detection. So if set_headers didn't prevent the reader from reading headers, then what is it supposed to do with the headers once it reads them? I think the implication of your request here is that it reads them and then throws them away. I think it's better if that's explicit, which is what the current API requires you to do. And it's what the code I gave above does.

Maybe this deserves to be an example in the docs.

@aleb
Copy link
Author

aleb commented Apr 27, 2022

Thanks for taking the time to look into this!

Yes, the implication of my request is that it skips the first row when both has_headers(true) and set_headers() are called.

In the example above when has_headers(true) and set_headers(...) are called, both rows are parsed as data. But one gets an error if has_headers(false) is called instead of has_headers(true). The behavior is inconsistent since in both cases the first row is read as a data row, but in one case it succeeds and in the other it fails?

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Failed deserializing row: Error(
    Deserialize {
        pos: Some(
            Position {
                byte: 0,
                line: 1,
                record: 0,
            },
        ),
        err: DeserializeError {
            field: None,
            kind: Message(
                "invalid length 3, expected struct AircraftMetadata with 4 elements",
            ),
        },
    },
)', src/main.rs:34:69

@BurntSushi
Copy link
Owner

Because your header row doesn't fit into your serde type. There is no inconsistency here. In one case, you're trying to deserialize it into your type. In the other case, you're just reading it as a plain string record.

Why doesn't the code i have you work for you?

@aleb
Copy link
Author

aleb commented Apr 27, 2022

The issue I see boils down to the weirdness between has_headers and set_headers.

For example:

use csv::ReaderBuilder;
use serde::Deserialize;


#[derive(Clone, Debug, Deserialize, PartialEq)]
pub struct AircraftMetadata {
    pub icao: String,
    pub tail: String,
}

fn parse_mictronics_aircraft_database(has: bool, set: bool) -> Result<Vec<AircraftMetadata>, csv::Error> {
    println!();
    print!("has_header({})", has);
    let csv = "4,T";
    let mut reader = ReaderBuilder::new()
        .has_headers(has)
        .flexible(true)
        .delimiter(b',')
        .from_reader(csv.as_bytes());
    if set {
        reader.set_headers(csv::StringRecord::from(
            &["icao", "tail"][..],
        ));
        print!(" set_headers(...)");
    }
    println!();
    reader
        .into_deserialize()
        .collect::<Result<Vec<AircraftMetadata>, _>>()
}

fn main() {
    println!("{:#?}", parse_mictronics_aircraft_database(false, false)); // 1
    println!("{:#?}", parse_mictronics_aircraft_database(true, false)); // 0
    println!("{:#?}", parse_mictronics_aircraft_database(false, true)); // 2
    println!("{:#?}", parse_mictronics_aircraft_database(true, true)); // 1
}

The output makes sense when has_headers(false) (first two entries) but it can be confusing when has_headers(true).

Notice the third result has_header(false) set_headers(...) which has two entries! It seems the csv library shoves the forced headers at the start of the data and thus when has_headers(false) they are parsed as values.

one-does-not-simply-call-has-headersfalse-before-set-headers

$ cargo run 2>/dev/null
has_header(false)
Ok(
    [
        AircraftMetadata {
            icao: "4",
            tail: "T",
        },
    ],
)

has_header(true)
Ok(
    [],
)

has_header(false) set_headers(...)
Ok(
    [
        AircraftMetadata {
            icao: "icao",
            tail: "tail",
        },
        AircraftMetadata {
            icao: "4",
            tail: "T",
        },
    ],
)

has_header(true) set_headers(...)
Ok(
    [
        AircraftMetadata {
            icao: "4",
            tail: "T",
        },
    ],
)

Mentioning the _ = reader.headers() trick in the set_headers documentation would be good!

@aleb
Copy link
Author

aleb commented Apr 27, 2022

Maybe assert in set_headers that has_headers is true?

@aleb
Copy link
Author

aleb commented Apr 27, 2022

The output when let csv = "4"; (one column instead of two):

has_header(false)
Err(
    Error(
        Deserialize {
            pos: Some(
                Position {
                    byte: 0,
                    line: 1,
                    record: 0,
                },
            ),
            err: DeserializeError {
                field: None,
                kind: Message(
                    "invalid length 1, expected struct AircraftMetadata with 2 elements",
                ),
            },
        },
    ),
)

has_header(true)
Ok(
    [],
)

has_header(false) set_headers()
Err(
    Error(
        Deserialize {
            pos: Some(
                Position {
                    byte: 0,
                    line: 1,
                    record: 0,
                },
            ),
            err: DeserializeError {
                field: None,
                kind: Message(
                    "invalid length 1, expected struct AircraftMetadata with 2 elements",
                ),
            },
        },
    ),
)

has_header(true) set_headers()
Err(
    Error(
        Deserialize {
            pos: Some(
                Position {
                    byte: 0,
                    line: 1,
                    record: 0,
                },
            ),
            err: DeserializeError {
                field: None,
                kind: UnexpectedEndOfRow,
            },
        },
    ),
)

The errors make sense except the last one: "UnexpectedEndOfRow". Mentioning as it might be a potential source of confusion.

@BurntSushi
Copy link
Owner

Maybe assert in set_headers that has_headers is true?

This is a breaking change.

I agree that it is weird to call set_headers when has_headers is set to false. The behavior is consistent though. The reader is yielding any detected header row because the caller might call the headers method, which works regardless of whether has_headers is enabled or not.

I think the fundamental confusion here is that you're trying to form a connection between has_headers and set_headers. But there isn't one. The connection is between set_headers and headers.

But yeah, if you're setting has_headers to false, then it is exceptionally weird to be calling either set_headers or headers. My code I gave above does call headers, but this is equivalent to just calling read_record once and throwing away the result. In effect, you're skipping over the first record.

I'll mark this as a doc bug.

@BurntSushi BurntSushi added the doc label Apr 27, 2022
@aleb aleb changed the title The first line is parsed as the rest of the lines despite has_headers(true) Confusing relationship between has_headers and set_headers May 2, 2022
@aleb
Copy link
Author

aleb commented May 2, 2022

Thanks for considering to improve the documentation.

When using custom deserialization for custom field types it can be super confusing seeing the headers set with set_headers being attempted to be deserialized as a row.

How bad would it be to introduce a breaking change? This usage does not make any sense at all. :) Users can select any release they want if they prefer the past functionality.

@BurntSushi
Copy link
Owner

This confusion does not come anywhere remotely close to being bad enough to justify a breaking change or a csv 2.0 release. Sorry.

You've made it clear this behavior is confusing to you and that you don't like it. I understand. Let's improve the docs and move on. If and when there's ever a csv 2, we can try to revisit this behavior, perhaps with something as simple as an assert.

This usage does not make any sense at all.

It makes sense just fine if you consider that using set_headers when has_headers is false is "just" an unchecked illegal use of the API. Don't use set_headers when has_headers is false from now on and it looks like you should be in the clear.

@aleb
Copy link
Author

aleb commented May 2, 2022

I'm just curious how you think.

Ok, good luck!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants