
It is not reasonable for rand_distr::Zipf to return floating-point values #1323

Open · seekstar opened this issue Jun 28, 2023 · 8 comments
Labels: T-distributions (Topic: distributions)

seekstar commented Jun 28, 2023

The Zipf distribution (also known as the zeta distribution) is a discrete probability distribution that satisfies Zipf’s law: the frequency of an item is inversely proportional to its rank in a frequency table [1].

And numpy.random.Generator.zipf returns integers:

```python
>>> import numpy as np
>>> type(np.random.default_rng().zipf(4.0))
<class 'int'>
```

Therefore, I think it is not reasonable for `rand_distr::Zipf`, a distribution over integers, to return floating-point values. The problem with returning floating-point values is that they are not precise: if we cast the returned value to `usize` and use it as an index, an out-of-bounds error may occur due to floating-point error.

[1] https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.zipf.html
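For illustration, this is the kind of usage pattern the concern is about: a sketch against the rand_distr 0.4 / rand 0.8 API, where `Zipf::new(n, s)` takes `n: u64` but sampling yields a float.

```rust
use rand_distr::{Distribution, Zipf};

fn main() {
    let items = ["a", "b", "c", "d"];
    // Zipf over 1..=n with exponent s = 1.5.
    let zipf = Zipf::new(items.len() as u64, 1.5).unwrap();
    let x: f64 = zipf.sample(&mut rand::thread_rng());
    // Samples lie in 1.0..=n, so subtract 1 before indexing.
    // If rounding ever pushed x above n, this indexing would panic
    // with an out-of-bounds error.
    println!("{}", items[x as usize - 1]);
}
```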

dhardy (Member) commented Aug 21, 2023

Sorry for not responding sooner. Since our implementation is built on floating-point arithmetic, a different approach would be required. However...

`f64` exactly represents integers up to 2^53, which should be larger than any list index you will ever need. The Zipf implementation may produce precision errors sooner than this (I didn't check), but should still be precise enough for indexing any reasonably sized list.
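A quick demonstration of that representability boundary (a standalone sketch, not crate code):

```rust
fn main() {
    let limit = 1u64 << 53; // 2^53 = 9_007_199_254_740_992
    // Every integer up to 2^53 round-trips through f64 exactly...
    assert_eq!((limit - 1) as f64 as u64, limit - 1);
    assert_eq!(limit as f64 as u64, limit);
    // ...but 2^53 + 1 is not representable and rounds back down to 2^53.
    assert_eq!((limit + 1) as f64 as u64, limit);
}
```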

seekstar (Author) commented Aug 23, 2023

I agree that simply casting the returned `f64` to `usize` won't cause problems in most cases. But I think it is better for this crate to handle the floating-point error properly and return precise integers, so that users do not have to worry about it.

You might reference numpy's implementation, which also seems to be built on floating-point arithmetic but returns integers: https://github.com/numpy/numpy/blob/3032e84ff34f20def2ef4ebf9f8695947af3fd24/numpy/random/src/distributions/distributions.c#L1000

I don't know how they guarantee that the returned integer won't be out of bounds, though; there is no bounds check. Maybe math magic? But I think it would be more robust for rand_distr to do a bounds check before returning, because an out-of-bounds error typically makes the program panic.

Update: I just found that the numpy implementation does not have an n parameter like this crate's, so it only needs to guarantee that the returned integer fits into the integer type: https://github.com/numpy/numpy/blob/3032e84ff34f20def2ef4ebf9f8695947af3fd24/numpy/random/src/distributions/distributions.c#L1017

`rand_distr::Zipf` has an n parameter, so I think it should do a bounds check against n.

dhardy (Member) commented Aug 24, 2023

The first possible error is that we convert `n: u64` to `F` via `NumCast::from`. We should probably return an error in cases of precision loss or truncation. We could test by converting back to `u64`, or use another library (an `as` cast is not enough on its own).

Then we should choose which output types to support. It may be better to implement `Distribution<N>` for `N` in `u32`, `u64`, `usize` (we would need some trait bound, maybe `PrimInt`). We could replace `F` with `f64` if required, but it may not be a problem to keep `F: Float`.

Then, as you say, test that the output is in fact less than `n`. If not, I think we can just loop (theoretically this might bias the result, but such bias should not be significant relative to the precision of `F`).
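A minimal sketch of those two pieces, with a hypothetical `sample_float` closure standing in for the existing floating-point sampler (none of these names are the crate's actual API):

```rust
// Round-trip check for the n -> f64 conversion (sketch):
// if converting back changes the value, precision was lost.
fn checked_n_to_f64(n: u64) -> Option<f64> {
    let f = n as f64;
    if f as u64 == n { Some(f) } else { None }
}

// Resampling loop: reject any sample that rounds outside 1..=n (sketch).
fn sample_index(n: u64, mut sample_float: impl FnMut() -> f64) -> u64 {
    loop {
        let k = sample_float() as u64;
        if (1..=n).contains(&k) {
            return k;
        }
        // Out of range due to floating-point error: resample.
    }
}
```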

I can work on this if there are no other takers, but it isn't high on my priority list.

seekstar (Author) commented:

> The first possible error is that we convert `n: u64` to `F` via `NumCast::from`. We should probably return an error in cases of precision loss or truncation.

Let's think twice about this. I don't know much about the Zipfian distribution, but my impression is that when n gets really large, the probability of getting a large number is very small. So even if converting a large `u64` to `f64` is lossy, as long as we can guarantee that the probability distribution of the small numbers stays precise, we do not have to return an error.

> Then, as you say, test that the output is in fact less than `n`. If not, I think we can just loop (theoretically this might bias the result, but such bias should not be significant relative to the precision of `F`).

I don't know what "loop" actually means here, but I think it's fine to make the output n if it is greater than n. A large output is not common anyway when n is large.

dhardy (Member) commented Aug 25, 2023

> I don't know what "loop" actually means here

In this case, resample.

Intuitively, your suggestion to ignore the loss of precision on creation and to clamp the result makes sense. But if n is large enough not to be exactly representable by the floating-point format, then (except for one specific value of n) there will be at least one value less than n which can never be sampled, and n itself will be sampled too often. Still, given that we're talking about the extreme tail end of the distribution, probably none of this bias matters.

By the way, I notice that the output is in the range `1 ..= n` (inclusive of n), unlike the `0 .. n` that a lot of programmers are used to. Possibly this should be clarified in the docs.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021

(So, yes, I think we should follow your suggestion and clamp the output to n, and also my previous suggestion of making this generic over `n: N` so that the output can be a `usize`.)
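Sketched out, clamping combined with the generic-output idea might look something like this (illustrative only; `PrimInt` is the num-traits bound mentioned above, and none of this is the crate's actual code):

```rust
use num_traits::PrimInt;

// Convert a floating-point Zipf sample to an integer in 1..=n,
// clamping anything that rounding pushed out of range (sketch).
fn to_index<N: PrimInt>(x: f64, n: N) -> N {
    // NumCast::from (a supertrait of PrimInt) returns None if the
    // value does not fit in N; treat that like an overshoot.
    let k = N::from(x).unwrap_or(n);
    if k < N::one() {
        N::one()
    } else if k > n {
        n
    } else {
        k
    }
}

fn main() {
    assert_eq!(to_index(3.0_f64, 10_usize), 3);
    assert_eq!(to_index(11.2_f64, 10_usize), 10); // overshoot clamped to n
}
```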

seekstar (Author) commented:

Resampling makes sense; the numpy implementation also resamples if the result is out of bounds. But resampling theoretically introduces the possibility of an infinite loop, which might limit its use cases. So I think clamping the output to n is better. As you said: "given how we're talking about the extreme tail end of a distribution probably none of this bias matters".

> https://play.rust-lang.org/?version=stable&mode=debug&edition=2021

It's an empty playground. Maybe the playground cannot be used to share code?

> But also my previous suggestion of making this generic over `n: N` so that the output can be a `usize`.

This looks fine to me.

> By the way, I notice that the output is in the range `1 ..= n` (inclusive of n), unlike the `0 .. n` that a lot of programmers are used to. Possibly this should be clarified in the docs.

Couldn't agree more.

dhardy (Member) commented Aug 27, 2023

seekstar (Author) commented:

I think the worst case in our discussion is s = 1.0 rather than 2.0. Let's estimate how much bias there would be if we clamp the output to n.

According to Wikipedia, in this case the probability to get k is:

$$f(k, n) = \frac{1}{H_n} \frac{1}{k}$$

where

$$H_n = \sum_{k=1}^n \frac{1}{k}.$$

According to Wikipedia, as $n$ gets large, $H_n - \ln n$ converges to the Euler–Mascheroni constant $\gamma = 0.5772156649\cdots$, so $H_n \approx \ln n + \gamma$.

Let's assume that, due to floating-point error, the largest $\alpha n$ possible outputs (those greater than $(1 - \alpha) n$) are all mapped to a number $> n$ and clamped to $n$. Then the probability of getting one of those outputs is:

$$\sum_{k=(1 - \alpha) n + 1}^{n} \frac{1}{H_n}\,\frac{1}{k} = \frac{H_n - H_{(1 - \alpha) n}}{H_n} \approx \frac{\ln n - \ln\bigl((1 - \alpha) n\bigr)}{\ln n + \gamma} = -\frac{\ln (1 - \alpha)}{\ln n + \gamma}$$

Assuming $\alpha$ is small, $\ln (1 - \alpha) \approx -\alpha$, so the above probability is approximately:

$$\frac{\alpha}{\ln n + \gamma}$$

This is approximately the bias that would be added to the probability of getting n. I don't know whether it is acceptable, but resampling definitely solves this problem. Although personally I don't like the idea of introducing a possible infinite loop, I have to admit that in this case resampling is the more conservative choice.
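For a sense of scale, here is a quick numeric check of that estimate (the values of n and $\alpha$ are made-up examples, not measurements of the implementation):

```rust
fn main() {
    // Assumed example: n = 2^60, with the top 0.1% of outputs clamped.
    let n = (1u64 << 60) as f64;
    let alpha = 0.001;
    let gamma = 0.5772156649; // Euler–Mascheroni constant
    let bias = alpha / (n.ln() + gamma);
    // Extra probability mass piled onto n: roughly 2.4e-5.
    println!("bias ≈ {bias:e}");
}
```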

> but if n is large enough not to be exactly representable by the floating-point format, then (except for one specific value of n) there will be at least one value less than n which can never be sampled

This is still a problem. However, I believe it's better to let the user take that risk than to risk a denial of service.

dhardy added the T-distributions (Topic: distributions) label on Oct 31, 2023