Fix an issue with multiple short list rowgroups using the Parquet chunked reader. #15342

nvdbaranec · 2024-03-19T22:50:58Z

The core issue was that, under certain conditions, the chunked reader could generate invalid page indices for list columns. This led to corruption in the decode kernels. The fix is fairly simple, but a decent amount of the delta in this PR is just name changes for clarity and some additional comments/docs.

The fix also changes the number of chunks generated in some of the very (unrealistically) constrained tests.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

There were two bugs. First, row groups of lists being loaded together were not getting their end row counts computed correctly on a per-rowgroup basis. The main consequence was that we could generate splits larger than intended. Second, downstream from this, we were generating incorrect page indices to be decoded, which caused corruption in the decode kernels.
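To make the per-rowgroup idea concrete, here is a minimal host-side sketch. It is not the libcudf implementation: `page_info`, `compute_end_rows`, and `find_start_page` are hypothetical names, and the clamping step assumes that per-page row counts for list columns may be estimates while each row group's total row count is known exactly. Under those assumptions, anchoring each row group at its own start row keeps an overestimate in one row group from shifting the end rows (and therefore the page indices) of the next.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified page descriptor (not the libcudf type).
struct page_info {
  std::size_t row_group;  // index of the row group this page belongs to
  std::size_t num_rows;   // (possibly estimated) number of rows in this page
};

// For each page, compute the global index one past its last row.
// Assumes `pages` holds one column's pages, ordered by row group.
std::vector<std::size_t> compute_end_rows(std::vector<page_info> const& pages,
                                          std::vector<std::size_t> const& row_group_num_rows)
{
  // Start row of each row group: exclusive prefix sum of the row group sizes.
  std::vector<std::size_t> rg_start(row_group_num_rows.size(), 0);
  for (std::size_t i = 1; i < rg_start.size(); ++i) {
    rg_start[i] = rg_start[i - 1] + row_group_num_rows[i - 1];
  }

  std::vector<std::size_t> end_rows;
  end_rows.reserve(pages.size());
  std::size_t current_rg = pages.empty() ? 0 : pages.front().row_group;
  std::size_t rows_in_rg = 0;  // running row count within the current row group
  for (auto const& p : pages) {
    assert(p.row_group < row_group_num_rows.size());
    if (p.row_group != current_rg) {
      // Restart the running count at every row group boundary.
      current_rg = p.row_group;
      rows_in_rg = 0;
    }
    rows_in_rg += p.num_rows;
    // Clamp to the row group's exact size so estimates cannot spill over.
    auto const clamped = std::min(rows_in_rg, row_group_num_rows[p.row_group]);
    end_rows.push_back(rg_start[p.row_group] + clamped);
  }
  return end_rows;
}

// Index of the first page whose end row is greater than `row`,
// i.e. the page that contains `row`. Returns end_rows.size() if none does.
std::size_t find_start_page(std::vector<std::size_t> const& end_rows, std::size_t row)
{
  auto const it = std::upper_bound(end_rows.begin(), end_rows.end(), row);
  return static_cast<std::size_t>(it - end_rows.begin());
}
```

For example, with two row groups of 5 and 4 rows and page estimates of 3, 3, 2, 2, a plain running total would place the second row group's first page at end row 8, while the per-rowgroup version yields end rows 3, 5, 7, 9, so a split starting at row 5 maps to page index 2.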

nvdbaranec · 2024-03-19T22:51:31Z

Leaving this as a draft until it has run through all the Spark tests.

nvdbaranec · 2024-03-20T19:19:44Z

Adding do-not-merge until the pending changes have run through the Spark integration tests.

vuule

few small suggestions, looks good otherwise
tricky stuff

vuule

Thank you for addressing the suggestions!

nvdbaranec · 2024-03-20T23:38:31Z

/merge

nvdbaranec added 2 commits March 19, 2024 15:25

Merge branch 'branch-24.04' into short_rg_chunk_fix

d0cb768

nvdbaranec added the bug, libcudf, Spark, and non-breaking labels Mar 19, 2024

nvdbaranec requested a review from a team as a code owner March 19, 2024 22:50

nvdbaranec requested review from vuule and davidwendt March 19, 2024 22:50

nvdbaranec marked this pull request as draft March 19, 2024 22:51

GregoryKimball assigned nvdbaranec Mar 19, 2024

nvdbaranec marked this pull request as ready for review March 20, 2024 00:24

ttnghia approved these changes Mar 20, 2024

Merge branch 'branch-24.04' into short_rg_chunk_fix

7d7c9d5

nvdbaranec added the 5 - DO NOT MERGE label Mar 20, 2024

nvdbaranec added 2 commits March 20, 2024 07:57

Fix indexing issue.

23934c1

Merge branch 'short_rg_chunk_fix' of github.com:nvdbaranec/cudf into short_rg_chunk_fix

d9fbba1

nvdbaranec removed the 5 - DO NOT MERGE label Mar 20, 2024

vuule reviewed Mar 20, 2024

Remove loop that is now redundant in find_start_index.

ec63720

nvdbaranec added the 5 - DO NOT MERGE label Mar 20, 2024

vuule reviewed Mar 20, 2024

Comment clarification. Simplified some logic in find_next_split to avoid unnecessarily looping in some cases.

9a8a9ba

nvdbaranec requested a review from vuule March 20, 2024 22:10

nvdbaranec removed the 5 - DO NOT MERGE label Mar 20, 2024

vuule approved these changes Mar 20, 2024

rapids-bot bot merged commit 08bd783 into rapidsai:branch-24.04 Mar 20, 2024
75 checks passed
