Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix an issue with multiple short list rowgroups using the Parquet chunked reader. #15342

Merged
merged 7 commits into from Mar 20, 2024

Conversation

nvdbaranec
Copy link
Contributor

Fixes #15306

The core issue here was that under certain conditions, the chunked reader could generate invalid page indices for list columns when using the chunked reader. This led to corruption in the decode kernels. The fix is fairly simple, but there's a decent amount of delta in this PR that is just name changes for clarity and some more comments/docs.

This affected the number of chunks generated in some of the very (unrealistically) constrained tests.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

First, row groups of lists being loaded together were not getting their end row counts computed correctly on a per-rowgroup basis.
The main consequence there was that we could potentially have generated splits that were larger than we would have liked. But downstream
from this was a second bug where we were generating incorrect page indices to be decoded, causing corruption in the decode kernels.
@nvdbaranec nvdbaranec added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Mar 19, 2024
@nvdbaranec nvdbaranec requested a review from a team as a code owner March 19, 2024 22:50
@nvdbaranec nvdbaranec marked this pull request as draft March 19, 2024 22:51
@nvdbaranec
Copy link
Contributor Author

Leaving as a draft until running it through all the Spark tests.

@nvdbaranec nvdbaranec marked this pull request as ready for review March 20, 2024 00:24
@nvdbaranec nvdbaranec added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Mar 20, 2024
@nvdbaranec nvdbaranec removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Mar 20, 2024
@nvdbaranec nvdbaranec added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Mar 20, 2024
@nvdbaranec
Copy link
Contributor Author

Adding do-not-merge until pending changes run through the spark integration tests.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few small suggestions, looks good otherwise
tricky stuff

cpp/src/io/parquet/reader_impl_chunking.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_chunking.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_chunking.cu Outdated Show resolved Hide resolved
@nvdbaranec nvdbaranec requested a review from vuule March 20, 2024 22:10
@nvdbaranec nvdbaranec removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Mar 20, 2024
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing the suggestions!

@nvdbaranec
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 08bd783 into rapidsai:branch-24.04 Mar 20, 2024
75 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
Status: Done
Status: No status
Development

Successfully merging this pull request may close these issues.

[BUG] libcudf generates invalid offsets for LIST column when loading Parquet file with chunked reader
3 participants