Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: don't truncate min/max statistics for float16 and decimal when writing file #5075

Closed
Jefffrey opened this issue Nov 14, 2023 · 2 comments · Fixed by #5154
Closed
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@Jefffrey
Copy link
Contributor

Jefffrey commented Nov 14, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

See discussion:

#5003 (comment)

#5003 added support for float16 type which is a logical type on top of fixed len byte array

When writing statistics, truncation can occur for binary physical type:

// We only truncate if the data is represented as binary
match self.descr.physical_type() {
Type::BYTE_ARRAY | Type::FIXED_LEN_BYTE_ARRAY => {
self.column_index_builder.append(
null_page,
self.truncate_min_value(stat.min_bytes()),
self.truncate_max_value(stat.max_bytes()),
self.page_metrics.num_page_nulls as i64,
);
}

Which might be troublesome for f16 type, if the column_index_truncate_length config is set to 1, as a truncated f16 wouldn't represent the min and max correctly anymore as it has a sort order different from fixed len byte array

Describe the solution you'd like

Ignore truncation for f16 when writing min/max statistics

Describe alternatives you've considered

Additional context

Do we need to worry about this for other types based on binary physical types? i.e. decimal

@Jefffrey Jefffrey added the enhancement Any new improvement worthy of a entry in the changelog label Nov 14, 2023
@Jefffrey
Copy link
Contributor Author

Potentially has relation to #5037?

@Jefffrey Jefffrey changed the title Parquet: don't truncate for float16 Parquet: don't truncate min/max statistics for float16 Nov 14, 2023
@Jefffrey Jefffrey changed the title Parquet: don't truncate min/max statistics for float16 Parquet: don't truncate min/max statistics for float16 when writing file Nov 14, 2023
@Jefffrey Jefffrey changed the title Parquet: don't truncate min/max statistics for float16 when writing file Parquet: don't truncate min/max statistics for float16 and decimal when writing file Dec 1, 2023
@tustvold
Copy link
Contributor

tustvold commented Jan 5, 2024

label_issue.py automatically added labels {'parquet'} from #5154

@tustvold tustvold added the parquet Changes to the parquet crate label Jan 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants