Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Possible rounding error in quantile 'closest_observation' method #26656

Closed
aherbert opened this issue Jun 10, 2024 · 8 comments
Closed

BUG: Possible rounding error in quantile 'closest_observation' method #26656

aherbert opened this issue Jun 10, 2024 · 8 comments
Labels

Comments

@aherbert
Copy link
Contributor

aherbert commented Jun 10, 2024

Describe the issue:

For data size n and probability p the 'closest_observation' method should return the nearest integer to np.

Using Hyndman and Fan (1996):

Q(p) = (1-gamma) * X(j) + gamma * X(j+1)

gamma is a function of j and g:

j = floor(pn + m)
g = pn + m - j

pn + m is the real-value based index into the data in [1, n]. The constant m and the function are dependent on the method.

Definition 3:

k is the nearest integer to np. Since there is more than one way to choose the "nearest" when g=0 (round up/round down) one appraoch is to choose the nearest even order statistic at g == 0.

Set m = -0.5.
gamma = 0 if g == 0 and j is even
gamma = 1 otherwise

In the following simple case for n=3:

n=3, p=0.5
j = floor(0.5*3 - 0.5) = 1
g = 0.5*3 - 0.5 - 1 = 0

In this case gamma should be 1 as j is odd. This results in choosing j+1 = 2. This is the closest even order statistic.

But the numpy code computes the index position with an additional offset of -1 for zero-based indexing as:

index = 0.5*3 - 1 - 0.5

Thus the input index to the closest_observation method is out by 1 and the index addresses the range [0, n) rather than [1, n].

This means the logic in function_base.py:_closest_observation(n, quantiles) is incorrect.
It is testing gamma==0 and that the index is even.
It should be testing that the index is odd to account for the fact that the index is zero-based, not one-based as in the definition of Hyndman and Fan for an order of [1, n].

Note

This may not be a bug but an implementation interpretation. The closest neighbour definition is ambiguous when the real-valued index is an exact integer (i.e. np - 0.5). As such this could be put down to a simple mismatch between the definition of the closest statistic when rounding is required. The output from numpy and R are different here (see example below). However when I tracked down the implementation in numpy I believe the intension is to implement the definition as detailed by Hyndman and Fan as it is testing that the index modulus 2 is 0. The error comes because the index is offset by 1 before this logic is applied.

Reproduce the code example:

import numpy as np
np.quantile([1, 2, 3], 0.5, method='closest_observation')

Error message:

Computes:
1

Expected result is 2.

This is the most obvious case of the issue. The median of 3 values using quantile p=0.5 is the first value. It manifests at larger sizes too.

The issue can be seen when comparing numpy to R. Here is the output of small arrays from each; there is a mismatch on all odd length inputs:

numpy:

>>> for i in range(3, 11):
...   a = np.arange(1, 1+i)
...   print(f'{np.quantile(a, 0.5, method='closest_observation')} : {a}')
...
1 : [1 2 3]                          # mismatch
2 : [1 2 3 4]
3 : [1 2 3 4 5]                      # mismatch
3 : [1 2 3 4 5 6]
3 : [1 2 3 4 5 6 7]                  # mismatch
4 : [1 2 3 4 5 6 7 8]
5 : [1 2 3 4 5 6 7 8 9]              # mismatch
5 : [ 1  2  3  4  5  6  7  8  9 10]

R:

> require(stats)
> for (i in c(3:10)) {
  a = c(1:i)
  cat(paste(quantile(a, 0.5, type=3)), ": [", a, "]\n", sep=" ")
}

2 : [ 1 2 3 ]
2 : [ 1 2 3 4 ]
2 : [ 1 2 3 4 5 ]
3 : [ 1 2 3 4 5 6 ]
4 : [ 1 2 3 4 5 6 7 ]
4 : [ 1 2 3 4 5 6 7 8 ]
4 : [ 1 2 3 4 5 6 7 8 9 ]
5 : [ 1 2 3 4 5 6 7 8 9 10 ]

Python and NumPy Versions:

1.26.4
3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:35:20) [Clang 16.0.6 ]

Runtime Environment:

Python 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:35:20) [Clang 16.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy; numpy.show_runtime()
[{'numpy_version': '1.26.4',
  'python': '3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:35:20) '
            '[Clang 16.0.6 ]',
  'uname': uname_result(system='Darwin', node='gdsc029784.gdsc.susx.ac.uk', release='23.5.0', version='Darwin Kernel Version 23.5.0: Wed May  1 20:14:38 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020', machine='arm64')},
 {'simd_extensions': {'baseline': ['NEON', 'NEON_FP16', 'NEON_VFPV4', 'ASIMD'],
                      'found': ['ASIMDHP'],
                      'not_found': ['ASIMDFHM']}},
 {'architecture': 'VORTEX',
  'filepath': '/Users/ah403/miniforge3/envs/numpy/lib/libopenblas.0.dylib',
  'internal_api': 'openblas',
  'num_threads': 12,
  'prefix': 'libopenblas',
  'threading_layer': 'openmp',
  'user_api': 'blas',
  'version': '0.3.27'},
 {'filepath': '/Users/ah403/miniforge3/envs/numpy/lib/libomp.dylib',
  'internal_api': 'openmp',
  'num_threads': 12,
  'prefix': 'libomp',
  'user_api': 'openmp',
  'version': None}]

Context for the issue:

This does not affect my work. This method is one of 3 discrete interpolation methods, and is not the default. It would affect any user who specifically changes the interpolation method.

@eendebakpt
Copy link
Contributor

I can confirm the issue. np.quantile([1, 2, 3], 0.5, method='closest_observation') should return 2, if one follows Hyndman and Fan.

But given the fact that the definition relies on j being even, j being closely related to the index and Numpy using zero-based indexing while Hyndman and Fan uses 1-based indexing, the current implementation can be seen as a different interpretation.

The documentation on np.quantile seems have some problems either way. For example the definitions of gamma and g are confused in the docs:

image

I think this should read "and m and gamma are defined as follows" (with formulas adapted).

@aherbert
Copy link
Contributor Author

I agree with the comment about gamma and g being confused. This could be clarified. The documentation is a little difficult to follow without reference to the original paper. However with the original paper it becomes clear what each named method is computing, and how it is doing this.

On the topic of 0-based indexing I am of the opinion that it is not applicable here. The position index computed by the formulas provided by Hyndman and Fan is the k to choose the k-th element from ordered data. As such the order is in [1, n]. How you store the sorted elements is not relevant. E.g. 0 or 1 based indexing should not matter if selecting the k-th element (1st, 2nd, 3rd, 4th, ...). Any language data representation should be able to return the same k-th element if using the same definition to compute k.

The current documentation has:

i + g = q * (n - alpha - beta + 1) + alpha

This rearranges to the definition in HF page 363 (with q==p):

i + g = qn + q(1 - alpha - beta) + alpha

So the current documentation is referring to the 1-based order index. The value is computed here: _compute_virtual_index as:

i + g = qn + q(1 - alpha - beta) + alpha - 1

So the documentation follows Hyndman and Fan and the code adjusts the virtual index by 1. As such the documentation definition is out of sync with the implementation for the 'closest_observation' method, which rounds to the nearest odd order (but nearest even 0-based index).

Note that I do not have any issue with any choice of resolution for this mismatch. My intention was just to alert the fact of the mismatch. However from a simplest point of view it would be nice that the element returned by numpy for this method returns the same element as other implementations. The only other implementation I tested was in R. However Quantile (wikipedia) indicates this is also in SAS. I have not bothered to sign up to their site to get a trial. The wikipedia table notes you choose the nearest even integer to np - 1/2 for a tie. This is like using std::rint(n*p) with rounding mode E_TONEAREST.

@aherbert
Copy link
Contributor Author

Using Mathmatica requires some parameters {{a, b}, {c, d}} to configure the quantile method:

Quantile[{1, 2, 3}, 0.5, {{1/2, 0}, {0, 0}}]
2

But it does not match the R output as rounding is not to nearest even order (n=5 is different):

Quantile[{1, 2, 3, 4}, 0.5, {{1/2, 0}, {0, 0}}]
2
Quantile[{1, 2, 3, 4, 5}, 0.5, {{1/2, 0}, {0, 0}}]
3
Quantile[{1, 2, 3, 4, 5, 6}, 0.5, {{1/2, 0}, {0, 0}}]
3
Quantile[{1, 2, 3, 4, 5, 6, 7}, 0.5, {{1/2, 0}, {0, 0}}]
4

So that is not a reference implementation.

Using Octave quantile matches the R output:

octave:9> quantile([1 2 3]', 0.5, 1, 3)
ans = 2
octave:10> quantile([1 2 3 4]', 0.5, 1, 3)
ans = 2
octave:11> quantile([1 2 3 4 5]', 0.5, 1, 3)
ans = 2
octave:12> quantile([1 2 3 4 5 6]', 0.5, 1, 3)
ans = 3
octave:13> quantile([1 2 3 4 5 6 7]', 0.5, 1, 3)
ans = 4
octave:14> quantile([1 2 3 4 5 6 7 8]', 0.5, 1, 3)
ans = 4
octave:15> quantile([1 2 3 4 5 6 7 8 9]', 0.5, 1, 3)
ans = 4
octave:16> quantile([1 2 3 4 5 6 7 8 9 10]', 0.5, 1, 3)
ans = 5

I note that both R and octave use 1-based indexing. But this should be mute if we are simply interpreting the order of elements.

@aherbert
Copy link
Contributor Author

I have had a crash course in SAS and managed to get the following to run in their online SASStudio for the univariate procedure:

data A;
  input Value @@;
  datalines;
1 2 3
;

proc univariate data=A pctldef=2;
   var Value;
   output out=Q
   pctlpre = P_
   pctlpts = 50;
run;

proc print data=Q;

Changing the length of the input data obtains for Q.P_50 (using SAS method 2):

2: 1 2 3
2: 1 2 3 4
2: 1 2 3 4 5
3: 1 2 3 4 5 6
4: 1 2 3 4 5 6 7
4: 1 2 3 4 5 6 7 8

So this computes the same as R and octave.

@seberg
Copy link
Member

seberg commented Jun 13, 2024

FWIW, I am OK with changing this but it needs a release note and I am not sure it matters much (I doubt there is a reason for any choice besides that you have to make a choice).

@eendebakpt
Copy link
Contributor

I agree the choice does not matter much, but it would be nice to have the documentation consistent and without confusion (with either of the two choices). @aherbert Would you be willing to make a PR to address this?

@aherbert
Copy link
Contributor Author

I'll try and put in a PR next week to: match the documentation to the Hyndman and Fan paper; and update the closest_observation implementation.

aherbert added a commit to aherbert/numpy that referenced this issue Jun 20, 2024

Verified

This commit was signed with the committer’s verified signature.
Arvind2222 Arvind Mishra
…istic

Detection of an even order statistic (1-based) must check for an odd
index due to use of 0-based indexing.

See numpy#26656
aherbert added a commit to aherbert/numpy that referenced this issue Jun 20, 2024
…istic

Detection of an even order statistic (1-based) must check for an odd
index due to use of 0-based indexing.

See numpy#26656
aherbert added a commit to aherbert/numpy that referenced this issue Jun 20, 2024
Detection of an even order statistic (1-based) must check for an odd
index due to use of 0-based indexing.

See numpy#26656
aherbert added a commit to aherbert/numpy that referenced this issue Jun 20, 2024
Detection of an even order statistic (1-based) must check for an odd
index due to use of 0-based indexing.

See numpy#26656
aherbert added a commit to aherbert/numpy that referenced this issue Jun 21, 2024
Detection of an even order statistic (1-based) must check for an odd
index due to use of 0-based indexing.

See numpy#26656
aherbert added a commit to aherbert/numpy that referenced this issue Jun 24, 2024
Detection of an even order statistic (1-based) must check for an odd
index due to use of 0-based indexing.

See numpy#26656
charris pushed a commit to charris/numpy that referenced this issue Jul 16, 2024
Detection of an even order statistic (1-based) must check for an odd
index due to use of 0-based indexing.

See numpy#26656
@eendebakpt
Copy link
Contributor

The issue has been addressed in #26959, so closing the issue.

ArvidJB pushed a commit to ArvidJB/numpy that referenced this issue Nov 1, 2024
Detection of an even order statistic (1-based) must check for an odd
index due to use of 0-based indexing.

See numpy#26656
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants