BUG: Possible rounding error in quantile 'closest_observation' method #26656

aherbert · 2024-06-10T12:01:32Z

Describe the issue:

For data size n and probability p the 'closest_observation' method should return the nearest integer to np.

Using Hyndman and Fan (1996):

Q(p) = (1-gamma) * X(j) + gamma * X(j+1)

gamma is a function of j and g:

j = floor(pn + m)
g = pn + m - j

pn + m is the real-value based index into the data in [1, n]. The constant m and the function are dependent on the method.

Definition 3:

k is the nearest integer to np. Since there is more than one way to choose the "nearest" when g=0 (round up/round down) one appraoch is to choose the nearest even order statistic at g == 0.

Set m = -0.5.
gamma = 0 if g == 0 and j is even
gamma = 1 otherwise

In the following simple case for n=3:

n=3, p=0.5
j = floor(0.5*3 - 0.5) = 1
g = 0.5*3 - 0.5 - 1 = 0

In this case gamma should be 1 as j is odd. This results in choosing j+1 = 2. This is the closest even order statistic.

But the numpy code computes the index position with an additional offset of -1 for zero-based indexing as:

index = 0.5*3 - 1 - 0.5

Thus the input index to the closest_observation method is out by 1 and the index addresses the range [0, n) rather than [1, n].

This means the logic in function_base.py:_closest_observation(n, quantiles) is incorrect.
It is testing gamma==0 and that the index is even.
It should be testing that the index is odd to account for the fact that the index is zero-based, not one-based as in the definition of Hyndman and Fan for an order of [1, n].

Note

This may not be a bug but an implementation interpretation. The closest neighbour definition is ambiguous when the real-valued index is an exact integer (i.e. np - 0.5). As such this could be put down to a simple mismatch between the definition of the closest statistic when rounding is required. The output from numpy and R are different here (see example below). However when I tracked down the implementation in numpy I believe the intension is to implement the definition as detailed by Hyndman and Fan as it is testing that the index modulus 2 is 0. The error comes because the index is offset by 1 before this logic is applied.

Reproduce the code example:

import numpy as np
np.quantile([1, 2, 3], 0.5, method='closest_observation')

Error message:

Computes:
1

Expected result is 2.

This is the most obvious case of the issue. The median of 3 values using quantile p=0.5 is the first value. It manifests at larger sizes too.

The issue can be seen when comparing numpy to R. Here is the output of small arrays from each; there is a mismatch on all odd length inputs:

numpy:

>>> for i in range(3, 11):
...   a = np.arange(1, 1+i)
...   print(f'{np.quantile(a, 0.5, method='closest_observation')} : {a}')
...
1 : [1 2 3]                          # mismatch
2 : [1 2 3 4]
3 : [1 2 3 4 5]                      # mismatch
3 : [1 2 3 4 5 6]
3 : [1 2 3 4 5 6 7]                  # mismatch
4 : [1 2 3 4 5 6 7 8]
5 : [1 2 3 4 5 6 7 8 9]              # mismatch
5 : [ 1  2  3  4  5  6  7  8  9 10]

R:

> require(stats)
> for (i in c(3:10)) {
  a = c(1:i)
  cat(paste(quantile(a, 0.5, type=3)), ": [", a, "]\n", sep=" ")
}

2 : [ 1 2 3 ]
2 : [ 1 2 3 4 ]
2 : [ 1 2 3 4 5 ]
3 : [ 1 2 3 4 5 6 ]
4 : [ 1 2 3 4 5 6 7 ]
4 : [ 1 2 3 4 5 6 7 8 ]
4 : [ 1 2 3 4 5 6 7 8 9 ]
5 : [ 1 2 3 4 5 6 7 8 9 10 ]

Python and NumPy Versions:

1.26.4
3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:35:20) [Clang 16.0.6 ]

Runtime Environment:

Python 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:35:20) [Clang 16.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy; numpy.show_runtime()
[{'numpy_version': '1.26.4',
  'python': '3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:35:20) '
            '[Clang 16.0.6 ]',
  'uname': uname_result(system='Darwin', node='gdsc029784.gdsc.susx.ac.uk', release='23.5.0', version='Darwin Kernel Version 23.5.0: Wed May  1 20:14:38 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020', machine='arm64')},
 {'simd_extensions': {'baseline': ['NEON', 'NEON_FP16', 'NEON_VFPV4', 'ASIMD'],
                      'found': ['ASIMDHP'],
                      'not_found': ['ASIMDFHM']}},
 {'architecture': 'VORTEX',
  'filepath': '/Users/ah403/miniforge3/envs/numpy/lib/libopenblas.0.dylib',
  'internal_api': 'openblas',
  'num_threads': 12,
  'prefix': 'libopenblas',
  'threading_layer': 'openmp',
  'user_api': 'blas',
  'version': '0.3.27'},
 {'filepath': '/Users/ah403/miniforge3/envs/numpy/lib/libomp.dylib',
  'internal_api': 'openmp',
  'num_threads': 12,
  'prefix': 'libomp',
  'user_api': 'openmp',
  'version': None}]

Context for the issue:

This does not affect my work. This method is one of 3 discrete interpolation methods, and is not the default. It would affect any user who specifically changes the interpolation method.

The text was updated successfully, but these errors were encountered:

eendebakpt · 2024-06-12T12:47:03Z

I can confirm the issue. np.quantile([1, 2, 3], 0.5, method='closest_observation') should return 2, if one follows Hyndman and Fan.

But given the fact that the definition relies on j being even, j being closely related to the index and Numpy using zero-based indexing while Hyndman and Fan uses 1-based indexing, the current implementation can be seen as a different interpretation.

The documentation on np.quantile seems have some problems either way. For example the definitions of gamma and g are confused in the docs:

I think this should read "and m and gamma are defined as follows" (with formulas adapted).

aherbert · 2024-06-12T15:04:52Z

I agree with the comment about gamma and g being confused. This could be clarified. The documentation is a little difficult to follow without reference to the original paper. However with the original paper it becomes clear what each named method is computing, and how it is doing this.

On the topic of 0-based indexing I am of the opinion that it is not applicable here. The position index computed by the formulas provided by Hyndman and Fan is the k to choose the k-th element from ordered data. As such the order is in [1, n]. How you store the sorted elements is not relevant. E.g. 0 or 1 based indexing should not matter if selecting the k-th element (1st, 2nd, 3rd, 4th, ...). Any language data representation should be able to return the same k-th element if using the same definition to compute k.

The current documentation has:

i + g = q * (n - alpha - beta + 1) + alpha

This rearranges to the definition in HF page 363 (with q==p):

i + g = qn + q(1 - alpha - beta) + alpha

So the current documentation is referring to the 1-based order index. The value is computed here: _compute_virtual_index as:

i + g = qn + q(1 - alpha - beta) + alpha - 1

So the documentation follows Hyndman and Fan and the code adjusts the virtual index by 1. As such the documentation definition is out of sync with the implementation for the 'closest_observation' method, which rounds to the nearest odd order (but nearest even 0-based index).

Note that I do not have any issue with any choice of resolution for this mismatch. My intention was just to alert the fact of the mismatch. However from a simplest point of view it would be nice that the element returned by numpy for this method returns the same element as other implementations. The only other implementation I tested was in R. However Quantile (wikipedia) indicates this is also in SAS. I have not bothered to sign up to their site to get a trial. The wikipedia table notes you choose the nearest even integer to np - 1/2 for a tie. This is like using std::rint(n*p) with rounding mode E_TONEAREST.

aherbert · 2024-06-12T15:47:05Z

Using Mathmatica requires some parameters {{a, b}, {c, d}} to configure the quantile method:

Quantile[{1, 2, 3}, 0.5, {{1/2, 0}, {0, 0}}]
2

But it does not match the R output as rounding is not to nearest even order (n=5 is different):

Quantile[{1, 2, 3, 4}, 0.5, {{1/2, 0}, {0, 0}}]
2
Quantile[{1, 2, 3, 4, 5}, 0.5, {{1/2, 0}, {0, 0}}]
3
Quantile[{1, 2, 3, 4, 5, 6}, 0.5, {{1/2, 0}, {0, 0}}]
3
Quantile[{1, 2, 3, 4, 5, 6, 7}, 0.5, {{1/2, 0}, {0, 0}}]
4

So that is not a reference implementation.

Using Octave quantile matches the R output:

octave:9> quantile([1 2 3]', 0.5, 1, 3)
ans = 2
octave:10> quantile([1 2 3 4]', 0.5, 1, 3)
ans = 2
octave:11> quantile([1 2 3 4 5]', 0.5, 1, 3)
ans = 2
octave:12> quantile([1 2 3 4 5 6]', 0.5, 1, 3)
ans = 3
octave:13> quantile([1 2 3 4 5 6 7]', 0.5, 1, 3)
ans = 4
octave:14> quantile([1 2 3 4 5 6 7 8]', 0.5, 1, 3)
ans = 4
octave:15> quantile([1 2 3 4 5 6 7 8 9]', 0.5, 1, 3)
ans = 4
octave:16> quantile([1 2 3 4 5 6 7 8 9 10]', 0.5, 1, 3)
ans = 5

I note that both R and octave use 1-based indexing. But this should be mute if we are simply interpreting the order of elements.

aherbert · 2024-06-12T16:44:39Z

I have had a crash course in SAS and managed to get the following to run in their online SASStudio for the univariate procedure:

data A;
  input Value @@;
  datalines;
1 2 3
;

proc univariate data=A pctldef=2;
   var Value;
   output out=Q
   pctlpre = P_
   pctlpts = 50;
run;

proc print data=Q;

Changing the length of the input data obtains for Q.P_50 (using SAS method 2):

2: 1 2 3
2: 1 2 3 4
2: 1 2 3 4 5
3: 1 2 3 4 5 6
4: 1 2 3 4 5 6 7
4: 1 2 3 4 5 6 7 8

So this computes the same as R and octave.

seberg · 2024-06-13T12:08:42Z

FWIW, I am OK with changing this but it needs a release note and I am not sure it matters much (I doubt there is a reason for any choice besides that you have to make a choice).

eendebakpt · 2024-06-13T19:10:31Z

I agree the choice does not matter much, but it would be nice to have the documentation consistent and without confusion (with either of the two choices). @aherbert Would you be willing to make a PR to address this?

aherbert · 2024-06-14T10:19:29Z

I'll try and put in a PR next week to: match the documentation to the Hyndman and Fan paper; and update the closest_observation implementation.

…istic Detection of an even order statistic (1-based) must check for an odd index due to use of 0-based indexing. See numpy#26656

Detection of an even order statistic (1-based) must check for an odd index due to use of 0-based indexing. See numpy#26656

eendebakpt · 2024-08-09T14:32:54Z

The issue has been addressed in #26959, so closing the issue.

Detection of an even order statistic (1-based) must check for an odd index due to use of 0-based indexing. See numpy#26656

aherbert added the 00 - Bug label Jun 10, 2024

aherbert added a commit to aherbert/numpy that referenced this issue Jun 20, 2024

BUG: Quantile closest_observation to round to nearest even order

0890932

Detection of an even order statistic (1-based) must check for an odd index due to use of 0-based indexing. See numpy#26656

aherbert added a commit to aherbert/numpy that referenced this issue Jun 20, 2024

BUG: Quantile closest_observation to round to nearest even order

4ac0e4f

Detection of an even order statistic (1-based) must check for an odd index due to use of 0-based indexing. See numpy#26656

aherbert mentioned this issue Jun 20, 2024

BUG: Quantile closest_observation to round to nearest even order #26769

Merged

aherbert added a commit to aherbert/numpy that referenced this issue Jun 21, 2024

BUG: Quantile closest_observation to round to nearest even order

4e2accb

Detection of an even order statistic (1-based) must check for an odd index due to use of 0-based indexing. See numpy#26656

aherbert added a commit to aherbert/numpy that referenced this issue Jun 24, 2024

BUG: Quantile closest_observation to round to nearest even order

5a3ba2a

Detection of an even order statistic (1-based) must check for an odd index due to use of 0-based indexing. See numpy#26656

charris pushed a commit to charris/numpy that referenced this issue Jul 16, 2024

BUG: Quantile closest_observation to round to nearest even order

62b7a68

Detection of an even order statistic (1-based) must check for an odd index due to use of 0-based indexing. See numpy#26656

charris mentioned this issue Jul 16, 2024

BUG: Quantile closest_observation to round to nearest even order #26959

Merged

eendebakpt closed this as completed Aug 9, 2024

ArvidJB pushed a commit to ArvidJB/numpy that referenced this issue Nov 1, 2024

BUG: Quantile closest_observation to round to nearest even order

51ff0ac

Detection of an even order statistic (1-based) must check for an odd index due to use of 0-based indexing. See numpy#26656

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Possible rounding error in quantile 'closest_observation' method #26656

BUG: Possible rounding error in quantile 'closest_observation' method #26656

aherbert commented Jun 10, 2024 •

edited by ngoldbaum

Loading

eendebakpt commented Jun 12, 2024

aherbert commented Jun 12, 2024

aherbert commented Jun 12, 2024

aherbert commented Jun 12, 2024

seberg commented Jun 13, 2024

eendebakpt commented Jun 13, 2024

aherbert commented Jun 14, 2024

eendebakpt commented Aug 9, 2024

BUG: Possible rounding error in quantile 'closest_observation' method #26656

BUG: Possible rounding error in quantile 'closest_observation' method #26656

Comments

aherbert commented Jun 10, 2024 • edited by ngoldbaum Loading

Describe the issue:

Reproduce the code example:

Error message:

Python and NumPy Versions:

Runtime Environment:

Context for the issue:

eendebakpt commented Jun 12, 2024

aherbert commented Jun 12, 2024

aherbert commented Jun 12, 2024

aherbert commented Jun 12, 2024

seberg commented Jun 13, 2024

eendebakpt commented Jun 13, 2024

aherbert commented Jun 14, 2024

eendebakpt commented Aug 9, 2024

aherbert commented Jun 10, 2024 •

edited by ngoldbaum

Loading