Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary columns do not receive truncated statistics #5037

Closed
emcake opened this issue Nov 4, 2023 · 9 comments · Fixed by #5076
Closed

Binary columns do not receive truncated statistics #5037

emcake opened this issue Nov 4, 2023 · 9 comments · Fixed by #5076
Labels
enhancement Any new improvement worthy of a entry in the changelog help wanted parquet Changes to the parquet crate

Comments

@emcake
Copy link
Contributor

emcake commented Nov 4, 2023

Describe the bug
#4389 introduced truncation on column indices for binary columns, where the min/max values for a binary column may be arbitrarily large. As noted, this matches the behaviour in parquet-mr for shortening columns.

However, the value in the statistics is written un-truncated. This differs from the behaviour of parquet-mr where the statistics are truncated too: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L715

To Reproduce
There is a test in delta-io/delta-rs#1805 which demonstrates this, but in general write a parquet file with a long binary column and observe that the stats for that column are not truncated.

Expected behavior
Matching parquet-mr, the statistics should be truncated as well.

Additional context
Found this when looking into delta-io/delta-rs#1805. delta-rs uses the column stats to serialize into the delta log, which leads to very bloated entries.

I think it is sufficient to just call truncate_min_value/truncate_max_value when creating the column metadata here: https://github.com/apache/arrow-rs/blob/master/parquet/src/column/writer/mod.rs#L858-L859 but I don't know enough about the internals of arrow to know if that change is correct.

@tustvold
Copy link
Contributor

tustvold commented Nov 5, 2023

Support for this was only added to the parquet standard 3 weeks ago - apache/parquet-format#216

TBC this will be a breaking API change, as it will break workloads expecting the statistics to not be truncated

@tustvold tustvold added enhancement Any new improvement worthy of a entry in the changelog help wanted and removed bug labels Nov 5, 2023
@emcake
Copy link
Contributor Author

emcake commented Nov 5, 2023

Got it. It looks like the change to enable statistics truncation was done to parquet-mr in 2019, but without the flag: apache/parquet-mr#696

So in principle, you couldn’t trust the completeness of binary column stats before then as there was no indication as to whether truncation had occurred or not.

@tustvold
Copy link
Contributor

tustvold commented Nov 5, 2023

Hmm... I also note that it is disabled by default, is this still the case?

Regardless I think we should probably only perform this in the context of apache/parquet-format#216 as whilst parquet-mr would appear to be configurable to perform binary truncation, I'm fairly confident there are applications that have implicit assumptions that this would break.

FYI @alamb my memory is hazy as to what forms of aggregate pushdown DF performs, and if we might need to introduce some notion of inexact statistics (if it doesn't already exist).

@emcake
Copy link
Contributor Author

emcake commented Nov 6, 2023

I'm happy to work up a PR that implements this in the same way, also disabled by default, for parquet-rs.

@tustvold
Copy link
Contributor

tustvold commented Nov 6, 2023

Thank you, that would be great

@alamb
Copy link
Contributor

alamb commented Nov 6, 2023

FYI @alamb my memory is hazy as to what forms of aggregate pushdown DF performs, and if we might need to introduce some notion of inexact statistics (if it doesn't already exist).

I think the recent work by @berkaysynnada to add https://github.com/apache/arrow-datafusion/blob/e95e3f89c97ae27149c1dd8093f91a5574210fe6/datafusion/common/src/stats.rs#L29-L36 might be relevant

However, I think it is likely we will/should eventually add another variant like

enum Precision {
  // The value is known to be within the range (it is at at most this large for Max, or at least this large for Min)
  // but the actual values may be lower/higher. 
  Bounded(ScalarValue)
}

I believe we have a similar usecase in IOx for when we want to ensure the bound includes the actual range, but could be larger (cc @NGA-TRAN )

@berkaysynnada
Copy link
Contributor

FYI @alamb my memory is hazy as to what forms of aggregate pushdown DF performs, and if we might need to introduce some notion of inexact statistics (if it doesn't already exist).

I think the recent work by @berkaysynnada to add https://github.com/apache/arrow-datafusion/blob/e95e3f89c97ae27149c1dd8093f91a5574210fe6/datafusion/common/src/stats.rs#L29-L36 might be relevant

However, I think it is likely we will/should eventually add another variant like

enum Precision {
  // The value is known to be within the range (it is at at most this large for Max, or at least this large for Min)
  // but the actual values may be lower/higher. 
  Bounded(ScalarValue)
}

I believe we have a similar usecase in IOx for when we want to ensure the bound includes the actual range, but could be larger (cc @NGA-TRAN )

I think so too, adding a range-specifying variant will pave the way for many things. While I have other high-priority tasks to address shortly, I'm always available to offer support if someone wishes to take this on. The variant I have in mind is as follows:

enum Precision {
  ...
  InBetween(Interval)
}

It will also be easier to use after updating intervals (planning to open the related PR in a few days).

@alamb
Copy link
Contributor

alamb commented Nov 7, 2023

I filed apache/datafusion#8078 with a proposal of a more precise way to represent inexact statistics

@tustvold
Copy link
Contributor

tustvold commented Jan 5, 2024

label_issue.py automatically added labels {'parquet'} from #5076

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog help wanted parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants