
It is not reasonable for rand_distr::Zipf to return floating-point values #1323

Open · seekstar opened this issue Jun 28, 2023 · 8 comments
Labels: T-distributions (Topic: distributions)

seekstar commented Jun 28, 2023

The Zipf distribution (also known as the zeta distribution) is a discrete probability distribution that satisfies Zipf’s law: the frequency of an item is inversely proportional to its rank in a frequency table [1].

And numpy.random.Generator.zipf returns integers:

```python
>>> import numpy as np
>>> type(np.random.default_rng().zipf(4.0))
<class 'int'>
```

Therefore, I think it is not reasonable for `rand_distr::Zipf`, a distribution over integers, to return floating-point values. The problem with returning floating-point values is that they are not precise: if we cast the returned value to `usize` and use it as an index, an out-of-bounds error may occur due to floating-point error.

[1] https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.zipf.html
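For illustration, this is the kind of usage pattern the concern is about: a sketch against the rand_distr 0.4 / rand 0.8 API, where `Zipf::new(n, s)` takes `n: u64` but sampling yields a float.

```rust
use rand_distr::{Distribution, Zipf};

fn main() {
    let items = ["a", "b", "c", "d"];
    // Zipf over 1..=n with exponent s = 1.5.
    let zipf = Zipf::new(items.len() as u64, 1.5).unwrap();
    let x: f64 = zipf.sample(&mut rand::thread_rng());
    // Samples lie in 1.0..=n, so subtract 1 before indexing.
    // If rounding ever pushed x above n, this indexing would panic
    // with an out-of-bounds error.
    println!("{}", items[x as usize - 1]);
}
```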

dhardy (Member) commented Aug 21, 2023

Sorry for not responding sooner. Since our implementation is built on floating-point arithmetic, a different approach would be required. However...

`f64` exactly represents integers up to 2^53, which should be larger than any list index you will ever need. The Zipf implementation may produce precision errors sooner than this (I didn't check), but should still be precise enough for indexing any reasonably sized list.
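A quick demonstration of that representability boundary (a standalone sketch, not crate code):

```rust
fn main() {
    let limit = 1u64 << 53; // 2^53 = 9_007_199_254_740_992
    // Every integer up to 2^53 round-trips through f64 exactly...
    assert_eq!((limit - 1) as f64 as u64, limit - 1);
    assert_eq!(limit as f64 as u64, limit);
    // ...but 2^53 + 1 is not representable and rounds back down to 2^53.
    assert_eq!((limit + 1) as f64 as u64, limit);
}
```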

seekstar (Author) commented Aug 23, 2023

I agree that simply casting the returned `f64` to `usize` won't cause problems in most cases. But I think it is better for this crate to handle the floating-point error properly and return precise integers, so that users do not have to worry about it.

You might reference numpy's implementation, which also seems to be built on floating-point arithmetic but returns integers: https://github.com/numpy/numpy/blob/3032e84ff34f20def2ef4ebf9f8695947af3fd24/numpy/random/src/distributions/distributions.c#L1000

I don't know how they guarantee that the returned integer won't be out of bounds, though; there is no bounds check. Maybe math magic? But I think it would be more robust for rand_distr to do a bounds check before returning, because an out-of-bounds error typically makes the program panic.

Update: I just found that the numpy implementation does not have an n parameter like this crate's, so it only needs to guarantee that the returned integer fits into the integer type: https://github.com/numpy/numpy/blob/3032e84ff34f20def2ef4ebf9f8695947af3fd24/numpy/random/src/distributions/distributions.c#L1017

`rand_distr::Zipf` has an n parameter, so I think it should do a bounds check against n.

dhardy (Member) commented Aug 24, 2023

The first possible error is that we convert `n: u64` to `F` via `NumCast::from`. We should probably return an error in cases of precision loss or truncation. We could test by converting back to `u64`, or use another library (an `as` cast is not enough on its own).

Then we should choose which output types to support. It may be better to implement `Distribution<N>` for `N` in `u32`, `u64`, `usize` (we would need some trait bound, maybe `PrimInt`). We could replace `F` with `f64` if required, but it may not be a problem to keep `F: Float`.

Then, as you say, test that the output is in fact less than `n`. If not, I think we can just loop (theoretically this might bias the result, but such bias should not be significant relative to the precision of `F`).
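A minimal sketch of those two pieces, with a hypothetical `sample_float` closure standing in for the existing floating-point sampler (none of these names are the crate's actual API):

```rust
// Round-trip check for the n -> f64 conversion (sketch):
// if converting back changes the value, precision was lost.
fn checked_n_to_f64(n: u64) -> Option<f64> {
    let f = n as f64;
    if f as u64 == n { Some(f) } else { None }
}

// Resampling loop: reject any sample that rounds outside 1..=n (sketch).
fn sample_index(n: u64, mut sample_float: impl FnMut() -> f64) -> u64 {
    loop {
        let k = sample_float() as u64;
        if (1..=n).contains(&k) {
            return k;
        }
        // Out of range due to floating-point error: resample.
    }
}
```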

I can work on this if there are no other takers, but it isn't high on my priority list.

seekstar (Author) commented:

> The first possible error is that we convert `n: u64` to `F` via `NumCast::from`. We should probably return an error in cases of precision loss or truncation.

Let's think twice about this. I don't know much about the Zipfian distribution, but my impression is that when n gets really large, the probability of getting a large number is very small. So even if converting a large `u64` to `f64` is lossy, as long as we can guarantee that the probability distribution of the small numbers stays precise, we do not have to return an error.

> Then, as you say, test that the output is in fact less than `n`. If not, I think we can just loop (theoretically this might bias the result, but such bias should not be significant relative to the precision of `F`).

I don't know what "loop" actually means here, but I think it's fine to make the output n if it is greater than n. A large output is not common anyway when n is large.

dhardy (Member) commented Aug 25, 2023

> I don't know what "loop" actually means here

In this case, resample.

Intuitively, your suggestion to ignore the loss of precision on creation and to clamp the result makes sense. But if n is large enough not to be exactly representable by the floating-point format, then (except for one specific value of n) there will be at least one value less than n which can never be sampled, and n itself will be sampled too often. Still, given that we're talking about the extreme tail end of the distribution, probably none of this bias matters.

By the way, I notice that the output is in the range `1 ..= n` (inclusive of n), unlike the `0 .. n` that a lot of programmers are used to. Possibly this should be clarified in the docs.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021

(So, yes, I think we should follow your suggestion and clamp the output to n, and also my previous suggestion of making this generic over `n: N` so that the output can be a `usize`.)
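Sketched out, clamping combined with the generic-output idea might look something like this (illustrative only; `PrimInt` is the num-traits bound mentioned above, and none of this is the crate's actual code):

```rust
use num_traits::PrimInt;

// Convert a floating-point Zipf sample to an integer in 1..=n,
// clamping anything that rounding pushed out of range (sketch).
fn to_index<N: PrimInt>(x: f64, n: N) -> N {
    // NumCast::from (a supertrait of PrimInt) returns None if the
    // value does not fit in N; treat that like an overshoot.
    let k = N::from(x).unwrap_or(n);
    if k < N::one() {
        N::one()
    } else if k > n {
        n
    } else {
        k
    }
}

fn main() {
    assert_eq!(to_index(3.0_f64, 10_usize), 3);
    assert_eq!(to_index(11.2_f64, 10_usize), 10); // overshoot clamped to n
}
```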

seekstar (Author) commented:

Resampling makes sense; the numpy implementation also resamples if the result is out of bounds. But resampling theoretically introduces the possibility of an infinite loop, which might limit its use cases. So I think clamping the output to n is better. As you said: "given how we're talking about the extreme tail end of a distribution probably none of this bias matters".

> https://play.rust-lang.org/?version=stable&mode=debug&edition=2021

It's an empty playground. Maybe the playground cannot be used to share code?

> But also my previous suggestion of making this generic over `n: N` so that the output can be a `usize`.

This looks fine to me.

> By the way, I notice that the output is in the range `1 ..= n` (inclusive of n), unlike the `0 .. n` that a lot of programmers are used to. Possibly this should be clarified in the docs.

Couldn't agree more.

dhardy (Member) commented Aug 27, 2023

seekstar (Author) commented:

I think the worst case in our discussion is s = 1.0 rather than 2.0. Let's estimate how much bias there would be if we clamp the output to n.

According to Wikipedia, in this case the probability to get k is:

$$f(k, n) = \frac{1}{H_n} \frac{1}{k}$$

where

$$H_n = \sum_{k=1}^n \frac{1}{k}.$$

According to Wikipedia, as $n$ gets large, $H_n - \ln n$ converges to the Euler–Mascheroni constant $\gamma = 0.5772156649\cdots$, so $H_n \approx \ln n + \gamma$.

Let's assume that, due to floating-point error, the largest $\alpha n$ possible outputs (those greater than $(1 - \alpha) n$) are all mapped to a number $> n$ and clamped to $n$. Then the probability of getting one of those outputs is:

$$\sum_{k=(1 - \alpha) n + 1}^{n} \frac{1}{H_n}\,\frac{1}{k} = \frac{H_n - H_{(1 - \alpha) n}}{H_n} \approx \frac{\ln n - \ln\bigl((1 - \alpha) n\bigr)}{\ln n + \gamma} = -\frac{\ln (1 - \alpha)}{\ln n + \gamma}$$

Assuming $\alpha$ is small, $\ln (1 - \alpha) \approx -\alpha$, so the above probability is approximately:

$$\frac{\alpha}{\ln n + \gamma}$$

This is approximately the bias that would be added to the probability of getting n. I don't know whether it is acceptable, but resampling definitely solves this problem. Although personally I don't like the idea of introducing a possible infinite loop, I have to admit that in this case resampling is the more conservative choice.
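For a sense of scale, here is a quick numeric check of that estimate (the values of n and $\alpha$ are made-up examples, not measurements of the implementation):

```rust
fn main() {
    // Assumed example: n = 2^60, with the top 0.1% of outputs clamped.
    let n = (1u64 << 60) as f64;
    let alpha = 0.001;
    let gamma = 0.5772156649; // Euler–Mascheroni constant
    let bias = alpha / (n.ln() + gamma);
    // Extra probability mass piled onto n: roughly 2.4e-5.
    println!("bias ≈ {bias:e}");
}
```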

> but if n is large enough not to be exactly representable by the floating-point format, then (except for one specific value of n) there will be at least one value less than n which can never be sampled

This is still a problem. However, I believe it's better to let the user take that risk than to risk a denial of service.

dhardy added the T-distributions (Topic: distributions) label on Oct 31, 2023