BUG: Possible rounding error in quantile 'closest_observation' method #26656
Comments
I can confirm the issue. But given that the definition relies on gamma rather than g, I think the documentation for this method confuses the two, and the condition it states should be reworded.
I agree with the comment about gamma and g being confused. This could be clarified. The documentation is a little difficult to follow without reference to the original paper; with the paper it becomes clear what each named method is computing, and how it is doing this. On the topic of 0-based indexing I am of the opinion that it is not applicable here. The position index computed by the formulas provided by Hyndman and Fan is the k used to choose the k-th element from the ordered data. As such the order is in [1, n]. The current documentation gives the position i + g in terms of q, n, alpha and beta.
This rearranges to the definition in HF page 363 (with q == p), where the position k satisfies q = (k - alpha) / (n + 1 - alpha - beta), i.e. k = q*(n + 1 - alpha - beta) + alpha.
So the current documentation is referring to the 1-based order index. The value is computed in _compute_virtual_index as:
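Roughly, that computation has the following form; this is a sketch of the relationship rather than the verbatim NumPy helper, with the alpha/beta parameterisation assumed from the documentation:

```python
def compute_virtual_index(n, q, alpha, beta):
    """Sketch of the virtual index: the 1-based H&F position
    q*(n + 1 - alpha - beta) + alpha, shifted down by 1 so it can be
    used as a 0-based index into the sorted data."""
    return n * q + (alpha + q * (1 - alpha - beta)) - 1
```

With q == p this equals p*(n + 1 - alpha - beta) + alpha - 1, i.e. the H&F order index shifted down by 1.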
So the documentation follows Hyndman and Fan and the code adjusts the virtual index by 1. As such the documentation definition is out of sync with the implementation for the 'closest_observation' method, which rounds to the nearest odd order (but the nearest even 0-based index). Note that I do not have any issue with any choice of resolution for this mismatch; my intention was just to call attention to it. However, from a simplicity point of view it would be nice if the element returned by numpy for this method were the same element returned by other implementations. The only other implementation I tested was in R. However, Quantile (Wikipedia) indicates this method is also in SAS; I have not bothered to sign up to their site to get a trial. The Wikipedia table notes that for a tie (np - 1/2 an integer) you choose the nearest even integer to np. This is like using std::rint(n*p) with rounding mode FE_TONEAREST.
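In Python terms, the built-in round() rounds halves to the nearest even integer (the same behaviour as std::rint under FE_TONEAREST), so that selection rule can be sketched as follows; the helper name is illustrative, not part of NumPy:

```python
import numpy as np

def closest_observation_nearest_even(a, p):
    """Pick the order statistic nearest to n*p, breaking ties towards the
    even order statistic (H&F definition 3). A sketch, not the NumPy code."""
    a = np.sort(np.asarray(a))
    n = a.size
    k = round(n * p)          # round-half-to-even gives the 1-based order
    k = min(max(k, 1), n)     # keep the order statistic within [1, n]
    return a[k - 1]           # convert the 1-based order to a 0-based index

print(closest_observation_nearest_even([1, 2, 3], 0.5))  # 2, matching R and Octave
```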
Using Mathematica requires some parameters {{a, b}, {c, d}} to configure the quantile method:
But it does not match the R output as rounding is not to nearest even order (n=5 is different):
So that is not a reference implementation. Using Octave quantile matches the R output:
I note that both R and Octave use 1-based indexing, but this should be moot if we are simply interpreting the order of elements.
I have had a crash course in SAS and managed to get the following to run in their online SASStudio for the univariate procedure:
Changing the length of the input data gives the following for Q.P_50 (using SAS method 2):
So this computes the same result as R and Octave.
FWIW, I am OK with changing this but it needs a release note, and I am not sure it matters much (I doubt there is a reason for any choice besides that you have to make a choice).
I agree the choice does not matter much, but it would be nice to have the documentation consistent and without confusion (with either of the two choices). @aherbert Would you be willing to make a PR to address this?
I'll try to put in a PR next week to: match the documentation to the Hyndman and Fan paper; and update the 'closest_observation' implementation accordingly.
Detection of an even order statistic (1-based) must check for an odd index due to use of 0-based indexing. See numpy#26656
The issue has been addressed in #26959, so closing the issue.
Describe the issue:
For data size n and probability p the 'closest_observation' method should return the nearest integer to np.

Using Hyndman and Fan (1996), the quantile is Q(p) = (1 - gamma) x_(j) + gamma x_(j+1), where gamma is a function of j and g: j = floor(pn + m) and g = pn + m - j. The quantity pn + m is the real-valued index into the data in [1, n]. The constant m and the gamma function are dependent on the method.

Definition 3: k is the nearest integer to np. Since there is more than one way to choose the "nearest" when g = 0 (round up/round down), one approach is to choose the nearest even order statistic at g == 0. Set m = -0.5:

gamma = 0 if g == 0 and j is even
gamma = 1 otherwise
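For concreteness, a direct transcription of this definition with 1-based order statistics; the function name is illustrative only:

```python
import math
import numpy as np

def hf_definition_3(a, p):
    """H&F definition 3 with m = -0.5: take x_(j) if g == 0 and j is even,
    otherwise take x_(j+1). Order statistics are 1-based here."""
    a = np.sort(np.asarray(a))
    n = a.size
    pos = n * p - 0.5                # real-valued position pn + m in [1, n]
    j = math.floor(pos)              # 1-based order statistic below the position
    g = pos - j                      # fractional part
    gamma = 0 if (g == 0 and j % 2 == 0) else 1
    k = min(max(j + gamma, 1), n)    # chosen order, guarding the p -> 0 and p -> 1 edges
    return a[k - 1]

print(hf_definition_3([1, 2, 3], 0.5))  # 2: j = 1 is odd, so gamma = 1 and x_(2) is taken
```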
In the following simple case for n=3, p=0.5: pn + m = 3 * 0.5 - 0.5 = 1, so j = 1 and g = 0. In this case gamma should be 1 as j is odd. This results in choosing j+1 = 2, which is the closest even order statistic.

But the numpy code computes the index position with an additional offset of -1 for zero-based indexing, as:
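In effect (a paraphrase of the behaviour described in this report, not the verbatim NumPy source), the selection for the n=3 example amounts to:

```python
import numpy as np

n, p = 3, 0.5
virtual_index = n * p - 0.5 - 1      # pn + m shifted by -1 for 0-based indexing: 0.0
j0 = int(np.floor(virtual_index))    # 0-based index: 0
g = virtual_index - j0               # 0.0

# The check mirrors the 1-based rule "g == 0 and j is even" but is applied
# to the 0-based index, so it fires here for j0 = 0:
gamma = 0 if (g == 0 and j0 % 2 == 0) else 1
print(np.sort([1, 2, 3])[j0 + gamma])  # 1, whereas the 1-based rule selects 2
```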
Thus the input index to the closest_observation method is out by 1 and addresses the range [0, n) rather than [1, n]. This means the logic in function_base.py:_closest_observation(n, quantiles) is incorrect. It is testing gamma == 0 and that the index is even. It should be testing that the index is odd, to account for the fact that the index is zero-based, not one-based as in the definition of Hyndman and Fan for an order in [1, n].
Note

This may not be a bug but an implementation interpretation. The closest-neighbour definition is ambiguous when the real-valued index is an exact integer (i.e. at np - 0.5). As such this could be put down to a simple mismatch between definitions of the closest statistic when rounding is required, and the output from numpy and R are different here (see example below). However, when I tracked down the implementation in numpy, I believe the intention is to implement the definition as detailed by Hyndman and Fan, since it is testing that the index modulo 2 is 0. The error comes because the index is offset by 1 before this logic is applied.

Reproduce the code example:
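A minimal reproduction consistent with the description above (assumed to be equivalent to the original snippet):

```python
import numpy as np

# Median of three values with the 'closest_observation' method.
print(np.quantile([1, 2, 3], 0.5, method='closest_observation'))
```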
Error message:
Computes: 1

Expected result is 2. This is the most obvious case of the issue: the median of 3 values using quantile p=0.5 is the first value. It manifests at larger sizes too.
The issue can be seen when comparing numpy to R. Here is the output of small arrays from each; there is a mismatch on all odd length inputs:
numpy:
R:
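For reference, a sketch that regenerates this comparison for small sizes; the expected column applies the nearest-even rule of H&F definition 3 as a stand-in for the R type=3 output described above:

```python
import numpy as np

for n in range(2, 9):
    a = np.arange(1, n + 1)
    result = np.quantile(a, 0.5, method='closest_observation')
    # Round-half-to-even of n*p gives the order statistic that R type=3,
    # Octave and SAS method 2 select for the median here.
    expected = a[min(max(round(n * 0.5), 1), n) - 1]
    # On NumPy versions affected by this issue, the odd lengths disagree.
    print(n, result, expected)
```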
Python and NumPy Versions:
Runtime Environment:
Context for the issue:
This does not affect my work. This method is one of 3 discrete interpolation methods, and is not the default. It would affect any user who specifically changes the interpolation method.