Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: Implement support for Encoding::BYTE_STREAM_SPLIT #4183

Closed
wants to merge 5 commits into from

Conversation

simonvandel
Copy link
Contributor

Which issue does this PR close?

Closes #4102.

Rationale for this change

What changes are included in this PR?

Implements decoding and encoding of BYTE_STREAM_SPLIT for f32 and f64.

Are there any user-facing changes?

Yes, now the BYTE_STREAM_SPLIT will not error when used as an encoding for a column.

@github-actions github-actions bot added the parquet Changes to the parquet crate label May 9, 2023
@simonvandel simonvandel changed the title Byte stream split Parquet: Implement support for Encoding::BYTE_STREAM_SPLIT May 9, 2023
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, I think this could do with some end-to-end test coverage, perhaps extending test_primitive_single_column_reader_test or something.

It would also be really nice to get an integration test, ideally with a file generated by pyarrow or something, not sure if parquet-testing has such a file yet - might be something to contribute there?

@@ -1882,6 +1935,7 @@ mod tests {
encoder.put(&v[..]).expect("ok to encode");
}
let bytes = encoder.flush_buffer().expect("ok to flush buffer");
println!("{:x?}", bytes.data());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

@tustvold
Copy link
Contributor

tustvold commented Jun 1, 2023

Marking as draft as waiting for feedback, feel free to mark as ready for review when you would like me to take another look

@tustvold tustvold marked this pull request as draft June 1, 2023 14:51
@ggreco
Copy link

ggreco commented Nov 7, 2023

I found this PR while googling for my problem:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: NYI("Encoding BYTE_STREAM_SPLIT is not supported")', /Users/gabry/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parquet-47.0.0/src/column/writer/mod.rs:258:58

I thought it was my problem since the rust crate page says:
image

... @tustvoid can you please update the README to describe which encoding are not yet implemented?

@tustvold
Copy link
Contributor

tustvold commented Nov 7, 2023

Thank you, that's a pure oversight - filed #5051

FWIW I believe this is the only one that isn't supported.

@ggreco
Copy link

ggreco commented Nov 7, 2023

Thank you, that's a pure oversight - filed #5051

FWIW I believe this is the only one that isn't supported.

I can confirm I tried all the one described that could be applied to integers and floating points in https://parquet.apache.org/docs/file-format/data-pages/encodings/ and the only one not working with the rust implementation is BYTE_STREAM_SPLIT. I've done this in the attempt to make my parquet files smaller.

I've never tried the encodings for BYTE_ARRAY type.

@tustvold
Copy link
Contributor

tustvold commented Nov 7, 2023

PR #5053

@mwlon
Copy link
Contributor

mwlon commented Jan 5, 2024

@tustvold What's holding this PR up? I'm also encountering the issue that byte_stream_split is unsupported.

I'm willing to make a PR of my own to do this if the problem is that @simonvandel is unresponsive.

@tustvold
Copy link
Contributor

tustvold commented Jan 5, 2024

I'm willing to make a PR of my own to do this

Always happy to review PRs, IIRC the major thing this PR was missing was adequate test coverage

@simonvandel
Copy link
Contributor Author

@tustvold What's holding this PR up? I'm also encountering the issue that byte_stream_split is unsupported.

I'm willing to make a PR of my own to do this if the problem is that @simonvandel is unresponsive.

@mwlon my need for this encoding disappeared, and so did my motivation to finish it. If you want to continue, feel free to do so

@mwlon
Copy link
Contributor

mwlon commented Jan 8, 2024

I've created a parquet-testing PR to facilitate this: apache/parquet-testing#45

@mwlon
Copy link
Contributor

mwlon commented Jan 10, 2024

parquet-testing PR is in; new PR for BYTE_STREAM_SPLIT implementation: #5293

@tustvold
Copy link
Contributor

Closed by #5293

@tustvold tustvold closed this Jan 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet: Support Encoding::BYTE_STREAM_SPLIT
4 participants