Confusing relationship between `has_headers` and `set_headers` #275

aleb · 2022-04-22T17:55:45Z

What version of the `csv` crate are you using?

csv = "1.1.6"

Briefly describe the question, bug or feature request.

The header line is parsed as the rest of the lines, when it should not be parsed. This results in failure when the column names are parsed and validated with serde.

Include a complete program demonstrating a problem.

I need to use the following workaround to make it work:

pub fn my_deser<'de, D>(deserializer: D) -> Result<MyFieldType, D::Error>
where
    D: serde::Deserializer<'de>,
{
    let raw: String = serde::Deserialize::deserialize(deserializer)?;
    println!("{}", &raw);
    Ok(MyFieldType::create_instance())
}

#[derive(Serialize, Deserialize, Default, Debug)]
struct MyEntry {
    #[serde(deserialize_with = "my_deser", alias = "aaaa")]
    meaningful_name: MyFieldType,
}

fn parse_csv(bytes_reader: &[u8]) -> anyhow::Result<Vec<MyEntry>> {
    // Skip the first line containing the headers to workaround the CSV reader
    // failing to ignore the first line despite has_headers(true).
    let first_newline = bytes_reader.iter().position(|&c| c == b'\n').unwrap();
    let bytes_reader: &[u8] = &bytes_reader[first_newline..];
    let mut reader = ReaderBuilder::new()
        .has_headers(true)
        .flexible(true)
        .delimiter(b',')
        .quoting(true)
        .quote(b'"')
        .trim(Trim::All)
        .from_reader(bytes_reader);
    reader.set_headers(csv::StringRecord::from(
        &[
            // Note we can't use a meaningful name here because this column name
            // is parsed as a value.
            "aaaa",
        ][..],
    ));
    reader
        .into_deserialize()
        .map(|e| match e {
            Ok(entry) => Ok(entry),
            Err(err) => Err(anyhow::anyhow!("Failed deserializing row: {:#?}", err)),
        })
        .collect::<Result<Vec<MyEntry>, _>>()
}

This is the CSV:

"name"
"000000"

Does not matter that has_headers is set to true or false, seems broken.

Also, note if I'm not manually skipping the first line, "name" is also parsed despite using set_headers() and despite has_headers(true), seems a different issue?

The text was updated successfully, but these errors were encountered:

BurntSushi · 2022-04-22T17:59:24Z

Please provide a complete program, along with the input, that I can compile and build that reproduces your problem.

aleb · 2022-04-27T10:55:55Z

The example below shows the issue. Notice the parsing generates two items instead of just one, despite has_headers(true).

The question is what should happen when both has_headers and set_headers are called. Currently in this case the first line of the CSV is parsed as content. But I would argue that it should be ignored instead:

has_headers should mean that the first line of the CSV has headers and it's NOT data.
set_headers should set or overwrite the names of the columns, without any implication about the first line of the CSV.

Two examples of bogus headers we're dealing with, where this behavior would help:

000000	27.03.2022	03:05:01
480C43	T-057	A332	Airbus	A330-200MRTT

"","","","","","","","","","","","","","",""
"480c43","T-057","Airbus","","A330-243 MRTT","A330","","","","Nato Multinational Mrtt Fleet","Multi","MMF","","",""

Cargo.toml:

[package]
name = "issue"
version = "0.1.0"
edition = "2021"

[dependencies]
csv = "1.1.6"
serde = { version = "1.0", features = ["derive"] }
anyhow = "1.0.42"

main.rs:

use csv::ReaderBuilder;
use serde::Deserialize;


#[derive(Clone, Debug, Deserialize, PartialEq)]
pub struct AircraftMetadata {
    pub icao: String,
    pub tail: String,
    pub type_designator: String,
    pub model: Vec<String>,
}

fn parse_mictronics_aircraft_database(bytes_reader: &[u8]) -> anyhow::Result<Vec<AircraftMetadata>> {
    let mut reader = ReaderBuilder::new()
        .has_headers(true)
        .flexible(true)
        .delimiter(b'\t')
        .from_reader(bytes_reader);
    reader.set_headers(csv::StringRecord::from(
        &["icao", "tail", "type_designator", "model"][..],
    ));
    reader
        .into_deserialize()
        .map(|e| match e {
            Ok(entry) => Ok(entry),
            Err(err) => Err(anyhow::anyhow!("Failed deserializing row: {:#?}", err)),
        })
        .collect::<Result<Vec<AircraftMetadata>, _>>()
}

fn main() {
    let csv = "000000	27.03.2022	03:05:01
480C43	T-057	A332	Airbus	A330-200MRTT";
    let actual = parse_mictronics_aircraft_database(csv.as_bytes()).unwrap();
    let expected = vec![
        AircraftMetadata {
            icao: "000000".to_string(),
            tail: "27.03.2022".to_string(),
            type_designator: "03:05:01".to_string(),
            model: vec![],
        },
        AircraftMetadata {
            icao: "480C43".to_string(),
            tail: "T-057".to_string(),
            type_designator: "A332".to_string(),
            model: vec!["Airbus".to_string(), "A330-200MRTT".to_string()],
        },
    ];
    assert_eq!(actual.len(), expected.len());
    actual.into_iter().zip(expected).for_each(|(a, b)| {
        assert_eq!(a, b);
    });
}

BurntSushi · 2022-04-27T11:14:16Z

I can't make heads or tails of the problem you're facing, sorry. The program you've given me runs just fine. So what's the issue here exactly?

aleb · 2022-04-27T11:19:51Z

Yes, please remove the first element of the let expected = vec![ to see how it should work but fails

BurntSushi · 2022-04-27T11:27:34Z

Ah I see. Yeah, this is documented behavior. As the docs for set_headers state:

Any automatic detection of headers is disabled.

The issue here is that your CSV file does have headers. You just don't want them. You want to set your own. So you should set has_headers(true) and then read and throw away the headers found in the file. Like this:

    let mut reader = ReaderBuilder::new()
        .has_headers(true)
        .flexible(true)
        .delimiter(b'\t')
        .from_reader(bytes_reader);
    // Read and throw away the headers:
    reader.headers()?;
    reader.set_headers(csv::StringRecord::from(
        &["icao", "tail", "type_designator", "model"][..],
    ));

Then your program works, with the first record omitted the assertions below pass.

Thinking on this a bit more, I think I understand the issue here. Namely, that if has_headers(true), then the first record should always be parsed as headers even if set_headers has been called. Setting aside the fact that the existing behavior is documented and changing it would be a breaking change, the main issue with it is that set_headers is meant to override header detection. So if set_headers didn't prevent the reader from reading headers, then what is it supposed to do with the headers once it reads them? I think the implication of your request here is that it reads them and then throws them away. I think it's better if that's explicit, which is what the current API requires you to do. And it's what the code I gave above does.

Maybe this deserves to be an example in the docs.

aleb · 2022-04-27T12:28:18Z

Thanks for taking the time to look into this!

Yes, the implication of my request is that it skips the first row when both has_headers(true) and set_headers() are called.

In the example above when has_headers(true) and set_headers(...) are called, both rows are parsed as data. But one gets an error if has_headers(false) is called instead of has_headers(true). The behavior is inconsistent since in both cases the first row is read as a data row, but in one case it succeeds and in the other it fails?

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Failed deserializing row: Error(
    Deserialize {
        pos: Some(
            Position {
                byte: 0,
                line: 1,
                record: 0,
            },
        ),
        err: DeserializeError {
            field: None,
            kind: Message(
                "invalid length 3, expected struct AircraftMetadata with 4 elements",
            ),
        },
    },
)', src/main.rs:34:69

BurntSushi · 2022-04-27T13:36:03Z

Because your header row doesn't fit into your serde type. There is no inconsistency here. In one case, you're trying to deserialize it into your type. In the other case, you're just reading it as a plain string record.

Why doesn't the code i have you work for you?

aleb · 2022-04-27T17:28:55Z

The issue I see boils down to the weirdness between has_headers and set_headers.

For example:

use csv::ReaderBuilder;
use serde::Deserialize;


#[derive(Clone, Debug, Deserialize, PartialEq)]
pub struct AircraftMetadata {
    pub icao: String,
    pub tail: String,
}

fn parse_mictronics_aircraft_database(has: bool, set: bool) -> Result<Vec<AircraftMetadata>, csv::Error> {
    println!();
    print!("has_header({})", has);
    let csv = "4,T";
    let mut reader = ReaderBuilder::new()
        .has_headers(has)
        .flexible(true)
        .delimiter(b',')
        .from_reader(csv.as_bytes());
    if set {
        reader.set_headers(csv::StringRecord::from(
            &["icao", "tail"][..],
        ));
        print!(" set_headers(...)");
    }
    println!();
    reader
        .into_deserialize()
        .collect::<Result<Vec<AircraftMetadata>, _>>()
}

fn main() {
    println!("{:#?}", parse_mictronics_aircraft_database(false, false)); // 1
    println!("{:#?}", parse_mictronics_aircraft_database(true, false)); // 0
    println!("{:#?}", parse_mictronics_aircraft_database(false, true)); // 2
    println!("{:#?}", parse_mictronics_aircraft_database(true, true)); // 1
}

The output makes sense when has_headers(false) (first two entries) but it can be confusing when has_headers(true).

Notice the third result has_header(false) set_headers(...) which has two entries! It seems the csv library shoves the forced headers at the start of the data and thus when has_headers(false) they are parsed as values.

$ cargo run 2>/dev/null
has_header(false)
Ok(
    [
        AircraftMetadata {
            icao: "4",
            tail: "T",
        },
    ],
)

has_header(true)
Ok(
    [],
)

has_header(false) set_headers(...)
Ok(
    [
        AircraftMetadata {
            icao: "icao",
            tail: "tail",
        },
        AircraftMetadata {
            icao: "4",
            tail: "T",
        },
    ],
)

has_header(true) set_headers(...)
Ok(
    [
        AircraftMetadata {
            icao: "4",
            tail: "T",
        },
    ],
)

Mentioning the _ = reader.headers() trick in the set_headers documentation would be good!

aleb · 2022-04-27T17:32:24Z

Maybe assert in set_headers that has_headers is true?

aleb · 2022-04-27T17:32:29Z

The output when let csv = "4"; (one column instead of two):

has_header(false)
Err(
    Error(
        Deserialize {
            pos: Some(
                Position {
                    byte: 0,
                    line: 1,
                    record: 0,
                },
            ),
            err: DeserializeError {
                field: None,
                kind: Message(
                    "invalid length 1, expected struct AircraftMetadata with 2 elements",
                ),
            },
        },
    ),
)

has_header(true)
Ok(
    [],
)

has_header(false) set_headers()
Err(
    Error(
        Deserialize {
            pos: Some(
                Position {
                    byte: 0,
                    line: 1,
                    record: 0,
                },
            ),
            err: DeserializeError {
                field: None,
                kind: Message(
                    "invalid length 1, expected struct AircraftMetadata with 2 elements",
                ),
            },
        },
    ),
)

has_header(true) set_headers()
Err(
    Error(
        Deserialize {
            pos: Some(
                Position {
                    byte: 0,
                    line: 1,
                    record: 0,
                },
            ),
            err: DeserializeError {
                field: None,
                kind: UnexpectedEndOfRow,
            },
        },
    ),
)

The errors make sense except the last one: "UnexpectedEndOfRow". Mentioning as it might be a potential source of confusion.

BurntSushi · 2022-04-27T17:39:18Z

Maybe assert in set_headers that has_headers is true?

This is a breaking change.

I agree that it is weird to call set_headers when has_headers is set to false. The behavior is consistent though. The reader is yielding any detected header row because the caller might call the headers method, which works regardless of whether has_headers is enabled or not.

I think the fundamental confusion here is that you're trying to form a connection between has_headers and set_headers. But there isn't one. The connection is between set_headers and headers.

But yeah, if you're setting has_headers to false, then it is exceptionally weird to be calling either set_headers or headers. My code I gave above does call headers, but this is equivalent to just calling read_record once and throwing away the result. In effect, you're skipping over the first record.

I'll mark this as a doc bug.

aleb · 2022-05-02T09:29:33Z

Thanks for considering to improve the documentation.

When using custom deserialization for custom field types it can be super confusing seeing the headers set with set_headers being attempted to be deserialized as a row.

How bad would it be to introduce a breaking change? This usage does not make any sense at all. :) Users can select any release they want if they prefer the past functionality.

BurntSushi · 2022-05-02T11:37:45Z

This confusion does not come anywhere remotely close to being bad enough to justify a breaking change or a csv 2.0 release. Sorry.

You've made it clear this behavior is confusing to you and that you don't like it. I understand. Let's improve the docs and move on. If and when there's ever a csv 2, we can try to revisit this behavior, perhaps with something as simple as an assert.

This usage does not make any sense at all.

It makes sense just fine if you consider that using set_headers when has_headers is false is "just" an unchecked illegal use of the API. Don't use set_headers when has_headers is false from now on and it looks like you should be in the clear.

aleb · 2022-05-02T11:42:45Z

I'm just curious how you think.

Ok, good luck!

BurntSushi added the doc label Apr 27, 2022

aleb changed the title ~~The first line is parsed as the rest of the lines despite has_headers(true)~~ Confusing relationship between has_headers and set_headers May 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Confusing relationship between `has_headers` and `set_headers` #275

Confusing relationship between `has_headers` and `set_headers` #275

aleb commented Apr 22, 2022 •

edited

BurntSushi commented Apr 22, 2022

aleb commented Apr 27, 2022

BurntSushi commented Apr 27, 2022

aleb commented Apr 27, 2022 •

edited

BurntSushi commented Apr 27, 2022

aleb commented Apr 27, 2022

BurntSushi commented Apr 27, 2022

aleb commented Apr 27, 2022 •

edited

aleb commented Apr 27, 2022

aleb commented Apr 27, 2022

BurntSushi commented Apr 27, 2022

aleb commented May 2, 2022

BurntSushi commented May 2, 2022

aleb commented May 2, 2022

Confusing relationship between has_headers and set_headers #275

Confusing relationship between has_headers and set_headers #275

Comments

aleb commented Apr 22, 2022 • edited

What version of the csv crate are you using?

Briefly describe the question, bug or feature request.

Include a complete program demonstrating a problem.

BurntSushi commented Apr 22, 2022

aleb commented Apr 27, 2022

BurntSushi commented Apr 27, 2022

aleb commented Apr 27, 2022 • edited

BurntSushi commented Apr 27, 2022

aleb commented Apr 27, 2022

BurntSushi commented Apr 27, 2022

aleb commented Apr 27, 2022 • edited

aleb commented Apr 27, 2022

aleb commented Apr 27, 2022

BurntSushi commented Apr 27, 2022

aleb commented May 2, 2022

BurntSushi commented May 2, 2022

aleb commented May 2, 2022

Confusing relationship between `has_headers` and `set_headers` #275

Confusing relationship between `has_headers` and `set_headers` #275

aleb commented Apr 22, 2022 •

edited

What version of the `csv` crate are you using?

aleb commented Apr 27, 2022 •

edited

aleb commented Apr 27, 2022 •

edited