
Introduce a way to represent constrained statistics / bounds on values in Statistics #8078

Open · alamb opened this issue Nov 7, 2023 · 22 comments
Labels: enhancement (New feature or request)

@alamb commented Nov 7, 2023:

Is your feature request related to a problem or challenge?

This has come up a few times, most recently in discussions with @berkaysynnada on apache/arrow-rs#5037 (comment)

Use case 1 is that for large binary/string columns, formats like parquet allow storing a truncated value that does not actually appear in the data. Given that min/max values are stored in the metadata, storing truncated values keeps the size of the metadata down.

For example, for a string column that has very long values, it requires much less space to store a short value slightly lower than the actual minimum as the "minimum" statistics value, and one that is slightly higher than the actual maximum as the "maximum" statistics value.

For example:

| actual min in data | actual max in data | "min" value in statistics | "max" value in statistics |
|--------------------|--------------------|---------------------------|---------------------------|
| aaa......z         | qqq......q         | a                         | r                         |

There is a similar use case when applying a Filter, as described by @korowa on #5646 (comment), and we have a similar one in IOx where an operator may remove values but won't decrease the minimum value or increase the maximum value of any column.

Currently, Precision only represents Exact and Inexact; there is no way to represent "inexact, but bounded above/below".

Describe the solution you'd like

Per @berkaysynnada, I propose changing Precision::Inexact to a new variant Precision::Between, which would store an Interval of known min/maxes of the value.

enum Precision {
  ...
  /// The value is known to be in the specified interval
  Between(Interval)
}

This is a quite general formulation, and it can describe "how" inexact the values are.

This would have the benefit of being very expressive (Intervals can represent open/closed bounds, etc)
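To make the truncated-statistics use case concrete, here is a minimal runnable sketch (Interval and Precision here are toy stand-ins for illustration, not DataFusion's actual types):

/// Toy stand-in for DataFusion's `Interval` (illustration only).
#[derive(Debug)]
struct Interval {
    lower: String,
    upper: String,
}

#[derive(Debug)]
enum Precision {
    /// The value is known to be in the specified interval
    Between(Interval),
}

fn main() {
    // Parquet stored the truncated values "a" and "r" for a column whose real
    // min/max are long strings: the true minimum is only known to lie
    // somewhere in ["a", "r"], and likewise for the true maximum.
    let min_stat = Precision::Between(Interval {
        lower: "a".to_string(),
        upper: "r".to_string(),
    });
    println!("{min_stat:?}");
}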

Describe alternatives you've considered

There is also a possibility of introducing a simpler, but more limited version of these statistics, like:

enum Precision {
  // The value is known to be within the range (it is at most this large for Max,
  // or at least this large for Min), but the actual values may be lower/higher.
  Bounded(ScalarValue)
}
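For the truncated-string example above, the two statistics could then be written as (illustrative values, assuming this Bounded variant and ScalarValue's From<&str> impl):

min: Precision::Bounded(ScalarValue::from("a")) // actual min is at least "a"
max: Precision::Bounded(ScalarValue::from("r")) // actual max is at most "r"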

Additional context

No response

alamb added the enhancement (New feature or request) label on Nov 7, 2023
alamb changed the title from "Introduce a way to represent constrained staitistics" to "Introduce a way to represent constrained statistics / bounds on values in Statistics" on Nov 7, 2023
@tustvold commented Nov 7, 2023:

A further wrinkle that may be worth thinking about, even if the decision is to do nothing, is that some formats, including parquet, special-case certain floating point values. There is some discussion on apache/parquet-format#196 about making those statistics actually usable, and one potential outcome might be to just use totalOrder -- so it may be that no action is required, but just an FYI.

@berkaysynnada commented:

I have two proposals:

  1. If we have the cases where the lower bound of a range might be inexact (or perhaps absent) and the upper bound is exact—or vice versa—we could define the following (illustrated after the second snippet below):

enum Estimation {
    Exact(T),
    Inexact(T),
    Absent,
}
enum Precision {
    Point(Estimation),
    // min - max
    Range(Estimation, Estimation)
}
  2. If we don't need to consider the exactness of bounds and a simple interval suffices, this alternative could be used:
enum Precision {
    Exact(T),
    Inexact(T),
    Absent,
    Between(Interval)
}
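To illustrate proposal 1 with concrete (hypothetical) values -- this fragment assumes the Estimation/Precision enums from the first snippet, with T = u64:

// A statistic whose upper bound is known exactly while the lower bound is
// only an estimate:
let row_count = Precision::Range(
    Estimation::Inexact(10), // lower bound: only an estimate
    Estimation::Exact(100),  // upper bound: known with certainty
);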

@alamb commented Nov 7, 2023:

> enum Precision {
>     Exact(T),
>     Inexact(T),
>     Absent,
>     Between(Interval)
> }

@berkaysynnada under what circumstances would we use Inexact? All the cases I can think of for Inexact would now be handled by Between (potentially with unbounded intervals).

@alamb commented Nov 7, 2023:

Actually, upon more thought I think the best next step here is for me to prototype what this would look like. I'll do that and report back

@berkaysynnada commented:

> enum Precision {
>     Exact(T),
>     Inexact(T),
>     Absent,
>     Between(Interval)
> }
>
> @berkaysynnada under what circumstances would we use Inexact? All the cases I can think of for Inexact would now be handled by Between (potentially with unbounded intervals).

You are right, I cannot find any case to use Inexact. Any exact value can be relaxed toward the positive or negative side using intervals.

@ozankabak commented Nov 7, 2023:

In proposal two, you still need Inexact; e.g. cases involving selectivity. Assume that you have a column x whose bounds are [1, 100]. After applying a filter of x <= 10, the row count estimate is an inexact estimate of 0.1 * N where N was the estimate prior to the filter.
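As a concrete sketch of that arithmetic (uniform_selectivity is an illustrative helper under a uniform-distribution assumption, not a DataFusion API):

/// Fraction of rows expected to satisfy `x <= v` when x is assumed to be
/// uniformly distributed over [lo, hi].
fn uniform_selectivity(lo: f64, hi: f64, v: f64) -> f64 {
    ((v - lo) / (hi - lo)).clamp(0.0, 1.0)
}

fn main() {
    // x in [1, 100] filtered by x <= 10: selectivity of roughly 0.1, so the
    // output row count is the inexact estimate 0.1 * N -- a guess, not a bound.
    let n = 1000.0;
    println!("estimated rows: {}", uniform_selectivity(1.0, 100.0, 10.0) * n);
}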

Conversely: if you have Between, you do not need Exact anymore -- an Exact value is simply a Between value with a singleton interval inside.

BTW @berkaysynnada's first proposal is the more general one: It allows range estimations where either bound can be an exact or an inexact estimate. His second proposal is slightly simpler to implement (and leverages the interval library), but it constrains range estimations to exact bounds.

When we decide to support slightly more complicated filters, say x <= y, we will probably need range estimations where both sides could be inexact. Therefore, I think it makes sense to keep things flexible and go with the first proposal. Obv. happy to discuss other perspectives.

@alamb commented Nov 7, 2023:

I think there is a mismatch between the current use of Precision::Inexact and its documentation

Specifically, Precision::Inexact appears to be treated as a "conservative estimate" in several places (which is actually what we need in IOx), so perhaps another option would be to rename Precision::Inexact to Precision::Conservative and document that it is a conservative estimate of the actual value 🤔

For example the comments on Precision::Inexact say

https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L32-L33

However, it is used to skip processing files when the (inexact) row count is above the fetch/limit:

https://github.com/apache/arrow-datafusion/blob/87aeef5aa4d8f38a8328f8e51e530e6c9cd9afa9/datafusion/core/src/datasource/statistics.rs#L69-L73

I am pretty sure this is only valid if the estimate of the number of rows is conservative (aka the real value is at least as large as the statistics), but the code uses the value even for Precision::Inexact:

https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L42-L46

Another example is ColumnStatistics::is_singleton, which I think is also only correct if the statistics are conservative (aka the actual min is no lower than the reported min and the actual max is no larger than the reported max):

https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L281-L286
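To illustrate the hazard with a toy sketch (not the actual statistics.rs code), a singleton check is only sound when both bounds are exact, because inexact bounds can agree without the column being constant:

enum Precision<T> {
    Exact(T),
    Inexact(T),
}

/// Sound only for exact bounds: e.g. with truncated parquet statistics, the
/// reported min and max can coincide even though the data contains many
/// distinct values.
fn is_singleton_sound<T: PartialEq>(min: &Precision<T>, max: &Precision<T>) -> bool {
    matches!((min, max), (Precision::Exact(a), Precision::Exact(b)) if a == b)
}

fn main() {
    // Two inexact bounds that merely agree do NOT imply a constant column.
    let (min, max) = (Precision::Inexact("a"), Precision::Inexact("a"));
    assert!(!is_singleton_sound(&min, &max));
}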

@alamb commented Nov 7, 2023:

Another challenge I found with using Interval as proposed in option 2 of #8078 (comment) is more mechanical but real: Interval is in the datafusion_physical_expr crate but Precision is in datafusion_common, meaning I can't use Interval in Precision without moving code around. Not impossible to do, but a data point.

I will try prototyping option 1 suggested by @berkaysynnada in #8078 (comment) and see what I find.

@berkaysynnada commented:

> I think there is a mismatch between the current use of Precision::Inexact and its documentation
>
> Specifically, Precision::Inexact appears to be treated as a "conservative estimate" in several places (which is actually what we need in IOx), so perhaps another option would be to rename Precision::Inexact to Precision::Conservative and document that it is a conservative estimate of the actual value 🤔

I think the documentation is correct but you have found 2 bugs in the code:

  1. Here, we must ensure the values are exact.
    https://github.com/apache/arrow-datafusion/blob/87aeef5aa4d8f38a8328f8e51e530e6c9cd9afa9/datafusion/core/src/datasource/statistics.rs#L69-L73

  2. Singleton check must be for exact values only.
    https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L281-L286

I will open a PR now fixing these issues.

@berkaysynnada commented:

> Another challenge I found with using Interval as proposed in option 2 of #8078 (comment) is more mechanical but real: Interval is in the datafusion_physical_expr crate but Precision is in datafusion_common, meaning I can't use Interval in Precision without moving code around. Not impossible to do, but a data point.
>
> I will try prototyping option 1 suggested by @berkaysynnada in #8078 (comment) and see what I find.

Just so you know, I have completed the interval refactor and submitted it for our internal review. After that, interval arithmetic will be in datafusion_expr, and there will be no significant obstacles to moving it to datafusion_common.

@ozankabak commented Nov 9, 2023:

Any thoughts on @berkaysynnada's proposal, i.e.

enum PointEstimate {
    Exact(T),
    Inexact(T),
    Absent,
}
enum Precision {
    Point(PointEstimate),
    Range(PointEstimate, PointEstimate)
}

I think extending the Precision type like this, along with bugfixes like #8094, will address #8099.

BTW this exercise revealed that Precision is maybe not the best name. Maybe we should just use Estimate, which could either be a PointEstimate or a Range, which is a tuple of PointEstimates for the bounds.

@alamb commented Nov 9, 2023:

> Any thoughts on @berkaysynnada's proposal, i.e.

I really like it.

My plan was to spend time trying to code up some of these proposals and see if they fit in the existing framework (and thus potentially reveal in what cases Inexact would be used), etc.

However, I spent much more of my time this week on apache/arrow-rs#5050 and related discussions around #7994 and #8017 than I had planned, so I am sadly behind.

I hope to make progress tomorrow

@alamb commented Nov 10, 2023:

It is unlikely I will have time to work on this today -- I will pick it up next week

@alamb commented Nov 14, 2023:

I have been playing with and studying this code. While the suggestion from @ozankabak and @berkaysynnada in #8078 (comment) is very general and can represent many types of uncertainty in statistics, I haven't yet found cases where that full generality is important.

For example, I can't find (nor think of) an important case where the lower bound would be known with certainty but the upper bound was uncertain (as opposed to simply using TYPE::MAX as the upper bound).

Another example would be a use case where distinguishing between ranges like the following actually matters:

min: `PointEstimate::Absent`, max: `PointEstimate::Exact(value)`
min: `PointEstimate::Exact(TYPE::MIN)`, max: `PointEstimate::Exact(value)`

Thus I am going to prototype what adding a Bounded variant to Precision looks like.

I also plan to encapsulate more of the checks into Precision so that if we choose to go with a more general formulation we won't have to change as much of the rest of the code.

pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
    /// The exact value is known
    Exact(T),
    /// The exact value is not known, but the real value is known to be within
    /// the specified range: `lower <= value <= upper`. TODO: we could use
    /// `Interval` here instead, which could represent more complex cases (like
    /// open/closed bounds)
    Bounded { lower: T, upper: T },
    /// The value is not known exactly, but is likely close to this value.
    /// NOTHING can be assumed about the actual value in this case.
    Inexact(T),
    /// Nothing is known about the value
    #[default]
    Absent,
}
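For instance, the encapsulation could look something like this (min_possible is a hypothetical helper sketched against the enum above, assuming use std::fmt::Debug; it is not an existing DataFusion method). The fetch/limit check discussed earlier would then skip files only when a guaranteed lower bound reaches the limit:

impl<T: Debug + Clone + PartialEq + Eq + PartialOrd> Precision<T> {
    /// A value the real quantity is guaranteed to be at least as large as,
    /// if one is known.
    fn min_possible(&self) -> Option<&T> {
        match self {
            Precision::Exact(v) => Some(v),
            Precision::Bounded { lower, .. } => Some(lower),
            // An Inexact guess carries no guarantee in either direction.
            Precision::Inexact(_) | Precision::Absent => None,
        }
    }
}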

I'll report back here with how it goes shortly.

@ozankabak commented:

> When we decide to support slightly more complicated filters, say x <= y, we will probably need range estimations where both sides could be inexact. Therefore, I think it makes sense to keep things flexible and go with the first proposal. Obv. happy to discuss other perspectives.

Do we want to support something like this in the short term? Then we would need inexact bounds on both sides.

@alamb commented Nov 14, 2023:

> Do we want to support something like this in the short term? Then we would need inexact bounds on both sides.

Ok. Thank you for this point. I am still working through exactly what "inexact" means / when it would be used, so I probably misunderstand something.

I plan to work on this ticket in several PRs. In the first few I plan to centralize the use of Precision a bit more (e.g. #8172) so it is clearer how it is used

@alamb commented Nov 14, 2023:

After some thought, I think the key pieces of information we are trying to capture in statistics are:

  1. Known min/max values
  2. Estimated values that are imprecise (the result of applying some sort of estimate), but are better than just "somewhere between min/max"

I don't think the proposal (at least as I understand it) in #8078 (comment) can capture the known min/max values.

However, perhaps something like this could:

pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
    /// The exact value is known
    Exact(T),
    /// The exact value is not known, but the real value is **known** to be within
    /// the specified range: `lower <= value <= upper` and furthermore is estimated
    /// to be `estimate`. However, the real value may fall anywhere in the range
    /// TODO: we could use `Interval` here instead, which could represent more complex cases (like
    /// open/closed bounds rather than using T::MIN/T::MAX)
    Bounded { lower: T, estimate: T, upper: T },
    /// Nothing is known about the value
    #[default]
    Absent,
}

Maybe estimate should be Option<T> rather than just T 🤔

Example 1: Filter x <= 10

Take the example of a filter where you know the column x has bounds [1, 100] and the row_count is N.

After applying a filter of x <= 10 and assuming an even distribution, you get a selectivity of 0.1, so we can conclude the following about the output row_count:

  1. It is still between 0 and N (the max value hasn't been changed by the filter)
  2. We estimate that the value is near 0.1 * N given our cardinality estimation, but this is just an estimate

As I understand @berkaysynnada's proposal in #8078 (comment), it would represent the output either as:

Precision::PointEstimate(Inexact(0.1*N))

or

Precision::Range(
  PointEstimate::Exact(0), // min
  PointEstimate::Inexact(0.1 * N), // max
)

But neither captures the fact that we know for sure that the row count lies between 0 and N (the max wasn't changed by the filter). Using a Bounded variant could capture this information:

Precision::Bounded {
  lower: 0,
  estimate: 0.1 * N,
  upper: N
}

Here is the setup in pictures for those who like that

     Output Statistics for row_count
      value: estimated to be 0.1*N
      min: Exactly 0
      max: Exactly N (can not possibly be higher)
                               
             ▲                 
             │                 
┌────────────┴────────────┐    
│                         │    
│       FilterExec        │    
│         x <= 10         │    
│                         │    
└─────────────────────────┘    
             ▲                 
             │                 
             │                 
                               
    Input Statistics for row_count:
     value: Exactly N
     min: Exactly N
     max: Exactly N
                               

Example 2: x <= y

The other example raised above is a predicate like x <= y. I assume the challenge is how we represent the column ranges with this information. Let's say we know that x was between [50, 100] and y was between [0, 200], and there were N input rows.

Assuming even and uncorrelated distribution, we would get a selectivity estimate of 0.75

  │                                                      │                      
  │                                                      │                      
  │──────────────────────────────────────────────────────┤       y              
  │                                                      │       min: 0         
  │                                                      │       max: 200       
                                                                                
  0             │                      │                200                     
                │                      │                                        
                ├──────────────────────┤                         x              
                │                      │                         min: 50        
                │                      │                         max: 100       
                                                                                
                0                     200                                       
                                                                                
                                                                                
               **     Only range that can be true      **                       
                **                                    **                        
                 **************************************                         
                                                                                

We could again represent the output statistics with the proper ranges as well as the cardinality estimate like this, which captures both the fact that y can't be less than 50 and the 0.75 selectivity:

x: Bounded { lower: 50, estimate: 0.75 * N, upper: 100 }, 
y: Bounded { lower: 50, estimate: 0.75 * N, upper: 200 }

@ozankabak commented:

Thank you for investigating this. There are some parts I don't quite follow. Particularly,

> I don't think the proposal (at least as I understand it) in #8078 (comment) can capture the known min/max values.

I don't think this is entirely accurate. In addition to the two configurations you identified, it can also be:

Precision::Range(
  PointEstimate::Exact(0), // min
  PointEstimate::Exact(N), // max
)

which captures the known min/max values. However, this loses our inexact guess of 0.1 * N. I think what you are trying to convey is that @berkaysynnada's proposal can not represent the triple of (lower bound, guess, upper bound). If this is what you mean, I agree with you.

Example 2 (x <= y)

> x: Bounded { lower: 50, estimate: 0.75 * N, upper: 100 },
> y: Bounded { lower: 50, estimate: 0.75 * N, upper: 200 }

I don't understand what you mean by this. In this snippet, lower/upper refer to column bounds, while estimate refers to the row count. Did you mean to write (taking N = 100):

x: Bounded { lower: 50, estimate: 75, upper: 100 }, 
y: Bounded { lower: 50, estimate: 125, upper: 200 }
row_count: Bounded { lower: 0, estimate: 62.5, upper: 100}

where the 62.5 figure is derived from the factor (assuming uniform, uncorrelated distributions):

P(x <= y) = (1 / (100 - 50)) * ∫_{50}^{100} (200 - x) / 200 dx = (200 - 75) / 200

which is 5/8.
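For the curious, that factor can be sanity-checked numerically (a standalone sketch under the same uniform, uncorrelated assumption):

fn main() {
    // Enumerate a fine grid of (x, y) pairs with x spread over [50, 100]
    // and y spread over [0, 200], and count how often x <= y holds.
    let (mut hits, mut total) = (0u64, 0u64);
    for i in 0..1000 {
        let x = 50.0 + 50.0 * (i as f64 + 0.5) / 1000.0;
        for j in 0..1000 {
            let y = 200.0 * (j as f64 + 0.5) / 1000.0;
            total += 1;
            if x <= y {
                hits += 1;
            }
        }
    }
    println!("{}", hits as f64 / total as f64); // prints ~0.625, i.e. 5/8
}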

@ozankabak commented:

I like your proposal for the new type (reproduced below).

pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
    /// The exact value is known
    Exact(T),
    /// The exact value is not known, but the real value is **known** to be within
    /// the specified range: `lower <= value <= upper` and furthermore is estimated
    /// to be `estimate`. However, the real value may fall anywhere in the range
    /// TODO: we could use `Interval` here instead, which could represent more complex cases (like
    /// open/closed bounds rather than using T::MIN/T::MAX)
    Bounded { lower: T, estimate: T, upper: T },
    /// Nothing is known about the value
    #[default]
    Absent,
}

A few questions and remarks:

  1. Do we need the Exact variant anymore? It seems not. If the lower, upper and estimate values are the same, it is an exact estimate. This type can implement a method is_exact to check this (sketched after the code below).
  2. Do we need the Absent variant anymore? I think yes. We may want to differentiate between "no information" and "all possible values" (i.e. [MIN, MAX]) in certain cases.
  3. Should we use Interval in the Bounded variant? I think yes, but we should probably start off without it and migrate when we move the interval library outside of the physical expression crate (this week, hopefully).
  4. Should we still use the name Precision? IMO it is not a good name for this. We are making estimates, after all. I think we should use the right name for what we are doing and use the name Estimate.

Assuming these tweaks, the type would look like:

pub enum Estimate<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
    Range { lower: T, value: T, upper: T },
    /// Nothing is known about the value
    #[default]
    Absent,
}

or with ScalarValue, we'd have:

pub enum Estimate {
    Range { lower: ScalarValue, value: ScalarValue, upper: ScalarValue },
    /// Nothing is known about the value
    #[default]
    Absent,
}

Once Interval moves out of physical-expr, we can use:

pub enum Estimate {
    Range { bounds: Interval, value: ScalarValue },
    /// Nothing is known about the value
    #[default]
    Absent,
}
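A sketch of the is_exact method from point 1, written against the generic variant above (hypothetical, not an existing API; assumes use std::fmt::Debug;):

impl<T: Debug + Clone + PartialEq + Eq + PartialOrd> Estimate<T> {
    /// An estimate is exact when the range collapses to a single point.
    fn is_exact(&self) -> bool {
        match self {
            Estimate::Range { lower, value, upper } => lower == value && value == upper,
            Estimate::Absent => false,
        }
    }
}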

@alamb commented Nov 14, 2023:

> I think what you are trying to convey is that @berkaysynnada's proposal can not represent the triple of (lower bound, guess, upper bound). If this is what you mean, I agree with you.

Yes, this is what I was trying to say. Thank you

> I don't understand what you mean by this. In this snippet, lower/upper refer to column bounds, while estimate refers to the row count. Did you mean to write:

Yes, I think so (though I am not 100% sure about the 62.5 figure -- I need to double-check my math)

> Do we need the Exact variant anymore? It seems not. If the lower, upper and estimate values are the same, it is an exact estimate. This type can implement a method is_exact to check this.

That is a good point

> Do we need the Absent variant anymore? I think yes. We may want to differentiate between "no information" and "all possible values" (i.e. [MIN, MAX]) in certain cases.

I agree this is probably a useful special case.

> Should we use Interval in the Bounded variant? I think yes, but we should probably start off without it and migrate when we move the interval library outside of the physical expression crate (this week, hopefully).

Yes, the more I play with the code, the more I think using Interval is important to avoid reimplementing the same thing over and over again

> Should we still use the name Precision? IMO it is not a good name for this. We are making estimates, after all. I think we should use the right name for what we are doing and use the name Estimate.

I agree the term Precision is not a good match. I think Estimate is better, though I am still not entirely happy with it because in some cases the value will be known exactly rather than estimated. Though perhaps that case is technically just a very precise estimate (no error) 🤔

@alamb commented Nov 14, 2023:

FWIW what I am doing now is studying how the existing Precision type is used, to see what a change might look like, and filing small cleanup PRs along the way (so that when we go to change the implementation it will be a small change).

I will think on this more tonight and continue the work tomorrow.

Proposed PRs so far:
#8172
#8174
#8177

@alamb commented Nov 15, 2023:

Update here: I filed a larger Epic to track all the work that seems to be going on with statistics as it was getting pretty overwhelming to keep it all straight: #8227

My current conclusion is that the current state of the Statistics code in DataFusion makes further work on this ticket somewhat intractable for a while.

Thus, I plan to work on #8229 and #8228 first and then come back here
