Binary columns do not receive truncated statistics #5037

emcake · 2023-11-04T22:49:22Z

Describe the bug
#4389 introduced truncation on column indices for binary columns, where the min/max values for a binary column may be arbitrarily large. As noted, this matches the behaviour in parquet-mr for shortening columns.

However, the value in the statistics is written un-truncated. This differs from the behaviour of parquet-mr where the statistics are truncated too: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L715

To Reproduce
There is a test in delta-io/delta-rs#1805 which demonstrates this, but in general write a parquet file with a long binary column and observe that the stats for that column are not truncated.

Expected behavior
Matching parquet-mr, the statistics should be truncated as well.

Additional context
Found this when looking into delta-io/delta-rs#1805. delta-rs uses the column stats to serialize into the delta log, which leads to very bloated entries.

I think it is sufficient to just call truncate_min_value/truncate_max_value when creating the column metadata here: https://github.com/apache/arrow-rs/blob/master/parquet/src/column/writer/mod.rs#L858-L859 but I don't know enough about the internals of arrow to know if that change is correct.
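
For illustration, here is a minimal sketch of what truncating a binary min/max pair involves, assuming plain unsigned byte-wise ordering. The helpers `truncate_min` and `truncate_max` are hypothetical names written for this example; they are not the `truncate_min_value`/`truncate_max_value` functions in arrow-rs, whose exact behaviour may differ (UTF-8 columns, for instance, would also need to respect character boundaries).

```rust
/// Hypothetical helper: shorten a lower bound by keeping a prefix.
/// Any prefix of `value` sorts less than or equal to `value` in
/// unsigned byte-wise order, so the result is still a valid minimum bound.
fn truncate_min(value: &[u8], max_len: usize) -> Vec<u8> {
    value[..value.len().min(max_len)].to_vec()
}

/// Hypothetical helper: shorten an upper bound.
/// A bare prefix would sort *below* the original value, so the last byte
/// that is not 0xFF is incremented to keep the result greater than or
/// equal to `value`. Returns `None` when no shorter upper bound exists
/// (every byte of the prefix is 0xFF); a caller would then keep the
/// untruncated value.
fn truncate_max(value: &[u8], max_len: usize) -> Option<Vec<u8>> {
    if value.len() <= max_len {
        // Already short enough: nothing to do.
        return Some(value.to_vec());
    }
    let mut prefix = value[..max_len].to_vec();
    while let Some(&last) = prefix.last() {
        if last == 0xFF {
            // Cannot increment 0xFF without overflow; drop it and retry.
            prefix.pop();
        } else {
            *prefix.last_mut().unwrap() += 1;
            return Some(prefix);
        }
    }
    None
}
```

With a limit of 3 bytes, `truncate_min(b"abcdefgh", 3)` yields `abc` and `truncate_max(b"abcdefgh", 3)` yields `abd`, so the truncated pair still encloses every value in the column while taking far less space in the footer.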

tustvold · 2023-11-05T09:32:19Z

Support for this was only added to the parquet standard 3 weeks ago - apache/parquet-format#216

To be clear, this would be a breaking API change, as it would break workloads that expect the statistics not to be truncated.

emcake · 2023-11-05T09:52:13Z

Got it. It looks like the change to enable statistics truncation was done to parquet-mr in 2019, but without the flag: apache/parquet-mr#696

So in principle, you couldn’t trust the completeness of binary column stats before then as there was no indication as to whether truncation had occurred or not.

tustvold · 2023-11-05T22:00:08Z

Hmm... I also note that it is disabled by default. Is this still the case?

Regardless I think we should probably only perform this in the context of apache/parquet-format#216 as whilst parquet-mr would appear to be configurable to perform binary truncation, I'm fairly confident there are applications that have implicit assumptions that this would break.

FYI @alamb my memory is hazy as to what forms of aggregate pushdown DF performs, and if we might need to introduce some notion of inexact statistics (if it doesn't already exist).

emcake · 2023-11-06T10:39:24Z

I'm happy to work up a PR that implements this in the same way, also disabled by default, for parquet-rs.

tustvold · 2023-11-06T10:46:00Z

Thank you, that would be great

alamb · 2023-11-06T18:44:13Z

> FYI @alamb my memory is hazy as to what forms of aggregate pushdown DF performs, and if we might need to introduce some notion of inexact statistics (if it doesn't already exist).

I think the recent work by @berkaysynnada to add https://github.com/apache/arrow-datafusion/blob/e95e3f89c97ae27149c1dd8093f91a5574210fe6/datafusion/common/src/stats.rs#L29-L36 might be relevant

However, I think it is likely we will/should eventually add another variant like

enum Precision {
  // The value is known to be within the range (it is at most this large for Max, or at least this large for Min)
  // but the actual values may be lower/higher.
  Bounded(ScalarValue)
}

I believe we have a similar use case in IOx for when we want to ensure the bound includes the actual range, but could be larger (cc @NGA-TRAN )
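
To make the distinction concrete, here is a rough sketch of how a bounded statistic could be used differently from an exact or inexact one. The enum below is a simplified stand-in written for this example, not DataFusion's actual type; `Bounded` is the hypothetical variant proposed above, and the method names `can_skip_greater_than` and `exact_value` are assumptions made up for illustration.

```rust
/// Simplified stand-in for a statistics precision enum.
/// `Bounded` is the hypothetical variant discussed above, not an existing API.
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    /// The statistic is the true value.
    Exact(T),
    /// The statistic is an estimate with no guarantee in either direction.
    Inexact(T),
    /// For a max: the true maximum is guaranteed to be <= this value.
    Bounded(T),
    /// No statistic is available.
    Absent,
}

impl<T: PartialOrd> Precision<T> {
    /// Interpreting `self` as a *max* statistic, report whether a predicate
    /// `col > threshold` can never match, so the data can be skipped.
    /// This is safe for exact values and for guaranteed upper bounds, but
    /// not for estimates.
    fn can_skip_greater_than(&self, threshold: &T) -> bool {
        match self {
            Precision::Exact(max) | Precision::Bounded(max) => max <= threshold,
            Precision::Inexact(_) | Precision::Absent => false,
        }
    }

    /// Only an exact statistic can answer an aggregate such as `MAX(col)`
    /// directly from metadata; a truncated (bounded) value cannot.
    fn exact_value(&self) -> Option<&T> {
        match self {
            Precision::Exact(v) => Some(v),
            _ => None,
        }
    }
}
```

Under this sketch a truncated max such as `Precision::Bounded(b"abd".to_vec())` still allows pruning for `col > "b"`, but `exact_value()` returns `None`, so an aggregate pushdown would have to read the data rather than trust the footer.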

berkaysynnada · 2023-11-06T20:30:33Z

> > FYI @alamb my memory is hazy as to what forms of aggregate pushdown DF performs, and if we might need to introduce some notion of inexact statistics (if it doesn't already exist).
>
> I think the recent work by @berkaysynnada to add https://github.com/apache/arrow-datafusion/blob/e95e3f89c97ae27149c1dd8093f91a5574210fe6/datafusion/common/src/stats.rs#L29-L36 might be relevant
>
> However, I think it is likely we will/should eventually add another variant like
>
> enum Precision {
>   // The value is known to be within the range (it is at most this large for Max, or at least this large for Min)
>   // but the actual values may be lower/higher.
>   Bounded(ScalarValue)
> }
>
> I believe we have a similar use case in IOx for when we want to ensure the bound includes the actual range, but could be larger (cc @NGA-TRAN )

I think so too, adding a range-specifying variant will pave the way for many things. While I have other high-priority tasks to address shortly, I'm always available to offer support if someone wishes to take this on. The variant I have in mind is as follows:

enum Precision {
  ...
  InBetween(Interval)
}

It will also be easier to use after updating intervals (planning to open the related PR in a few days).

alamb · 2023-11-07T10:35:15Z

I filed apache/datafusion#8078 with a proposal of a more precise way to represent inexact statistics

tustvold · 2024-01-05T11:28:49Z

label_issue.py automatically added labels {'parquet'} from #5076

emcake added the bug label Nov 4, 2023

emcake mentioned this issue Nov 4, 2023

Delta Stats for binary columns are not truncated delta-io/delta-rs#1805

Open

tustvold added the enhancement and help wanted labels and removed the bug label Nov 5, 2023

Jefffrey mentioned this issue Nov 6, 2023

Parquet: read/write f16 for Arrow #5003

Merged

alamb mentioned this issue Nov 7, 2023

Introduce a way to represent constrained statistics / bounds on values in Statistics apache/datafusion#8078

Open

Jefffrey mentioned this issue Nov 14, 2023

Parquet: don't truncate min/max statistics for float16 and decimal when writing file #5075

Closed

emcake mentioned this issue Nov 14, 2023

Enable truncation of binary statistics columns #5076

Merged

tustvold closed this as completed in #5076 Nov 15, 2023

emcake mentioned this issue Dec 19, 2023

chore: datafusion 34, arrow & parquet 49 delta-io/delta-rs#1983

Closed

tustvold added the parquet label Jan 5, 2024
