
Vectorized xoshiro256++ #3 (Open)

vks opened this issue Nov 6, 2019 · 5 comments

Comments

@vks (Contributor) commented Nov 6, 2019

There is some discussion on how to vectorize xoshiro256++ at JuliaLang/julia#27614. The method relies on interleaving 4 xoshiro256++ generators. I implemented it and the results are impressive (see gen_bytes_fill):

test gen_bytes_chacha12           ... bench:     324,494 ns/iter (+/- 4,236) = 3155 MB/s
test gen_bytes_chacha20           ... bench:     490,442 ns/iter (+/- 16,214) = 2087 MB/s
test gen_bytes_chacha8            ... bench:     243,010 ns/iter (+/- 19,972) = 4213 MB/s
test gen_bytes_fill               ... bench:     105,350 ns/iter (+/- 1,456) = 9719 MB/s
test gen_bytes_pcg64mcg           ... bench:     321,665 ns/iter (+/- 7,854) = 3183 MB/s
test gen_bytes_splitmix64         ... bench:     233,973 ns/iter (+/- 1,859) = 4376 MB/s
test gen_bytes_xoshiro256plusplus ... bench:     343,911 ns/iter (+/- 6,580) = 2977 MB/s

The implementation is 3.3 times faster than the non-vectorized xoshiro256++ generator and more than 2.2 times faster than splitmix64 or ChaCha8. It is also faster than dSFMT. However, the state size is blown up to 128 bytes, which is almost as large as ChaCha's state (136 bytes).
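For concreteness, the interleaving idea can be sketched in safe Rust as four independent xoshiro256++ states stepped in lockstep, seeded via SplitMix64 as recommended for xoshiro. The type name `Xoshiro256PlusPlusX4` and the scalar lockstep loop (left to the compiler's auto-vectorizer rather than written with explicit SIMD intrinsics) are illustrative; this is not the benchmarked implementation.

```rust
/// One SplitMix64 step, the recommended seeding procedure for xoshiro.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9e37_79b9_7f4a_7c15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    z ^ (z >> 31)
}

/// Four interleaved xoshiro256++ generators (hypothetical type).
/// State is laid out word-major, s[word][lane], so the per-word update
/// touches four contiguous lanes -- the layout auto-vectorizers like.
pub struct Xoshiro256PlusPlusX4 {
    s: [[u64; 4]; 4], // 4 state words x 4 lanes x 8 bytes = 128 bytes
}

impl Xoshiro256PlusPlusX4 {
    pub fn seed_from_u64(seed: u64) -> Self {
        let mut sm = seed;
        let mut s = [[0u64; 4]; 4];
        // Seed lane by lane so each lane is an independent stream.
        for lane in 0..4 {
            for word in 0..4 {
                s[word][lane] = splitmix64(&mut sm);
            }
        }
        Xoshiro256PlusPlusX4 { s }
    }

    /// One step of all four generators; returns one u64 per lane.
    /// Each lane runs the standard xoshiro256++ recurrence.
    pub fn next4(&mut self) -> [u64; 4] {
        let mut out = [0u64; 4];
        for lane in 0..4 {
            out[lane] = self.s[0][lane]
                .wrapping_add(self.s[3][lane])
                .rotate_left(23)
                .wrapping_add(self.s[0][lane]);
            let t = self.s[1][lane] << 17;
            self.s[2][lane] ^= self.s[0][lane];
            self.s[3][lane] ^= self.s[1][lane];
            self.s[1][lane] ^= self.s[2][lane];
            self.s[0][lane] ^= self.s[3][lane];
            self.s[2][lane] ^= t;
            self.s[3][lane] = self.s[3][lane].rotate_left(45);
        }
        out
    }
}

fn main() {
    let mut rng = Xoshiro256PlusPlusX4::seed_from_u64(0);
    println!("{:?}", rng.next4());
}
```

Note that the four lanes are plain copies: interleaving buys throughput, not a longer period.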

@dhardy (Member) commented Nov 7, 2019

Impressive performance taking advantage of how "wide" modern CPUs are — but whether this is useful in practice is another question. For ciphers it does sometimes make sense to focus on "byte fill" performance, but is this useful for general-purpose random number generation, or some specific application?

@vks (Contributor, Author) commented Nov 7, 2019

I think in practice it replaces dSFMT, which used to be faster. Any application that needs several random numbers at a time benefits. For general-purpose number generation, using this with rand_core::block::BlockRng64 is probably faster than the non-SIMD version as well. (Edit: it's actually slower by a factor of two.)

Personally, I think ChaCha8 is a better choice. The performance is similar and the cryptographic strength makes it usable for more general purposes.

@dhardy (Member) commented Nov 7, 2019

> It's actually slower by a factor of two.

I imagine it's going to depend significantly on the actual application. You're using micro-benchmarks here? I actually think that for most applications (even ones that use RNGs heavily) the speed of the RNG is not very relevant.

@vks (Contributor, Author) commented Nov 7, 2019

I was referring to the gen_u64_* benchmarks in this repository. I agree that it is not a very relevant benchmark for most applications.

@vigna commented Nov 8, 2019

Wow, really nice! Note that (as I mentioned in the Julia discussion) dSFMT returns only 52 significant bits, restricting the values in [0..1) to half of the possible ones (an IEEE double has 53 bits of precision, as one bit is implicit).
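To make the precision point concrete, here is a standard pair of u64-to-f64 conversions (an illustration, not code from dSFMT): the first uses the top 53 bits of the input, so the result has a granularity of 2^-53; the second, mantissa-only trick places 52 random bits in a double in [1, 2) and subtracts 1, so its granularity is only 2^-52, i.e. half the representable values.

```rust
/// 53-bit conversion: use the top 53 bits, scale by 2^-53.
fn u64_to_f64_53(x: u64) -> f64 {
    (x >> 11) as f64 * (1.0 / (1u64 << 53) as f64)
}

/// 52-bit (dSFMT-style) conversion: OR 52 random bits into the mantissa
/// of a double in [1, 2), then subtract 1. Only 52 significant bits.
fn u64_to_f64_52(x: u64) -> f64 {
    f64::from_bits(0x3ff0_0000_0000_0000 | (x >> 12)) - 1.0
}

fn main() {
    let x = 0x0123_4567_89ab_cdefu64;
    println!("53-bit: {}", u64_to_f64_53(x));
    println!("52-bit: {}", u64_to_f64_52(x));
}
```

The smallest nonzero output of the first function is 2^-53, which the second function can never produce.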

I also think that the Julia guys found that an 8-fold unroll is even faster. But of course you're using a lot of space for copies (i.e., you are not getting a longer period, just faster generation).

And yes, in 90% of applications the speed of the PRNG is irrelevant.

I'm not particularly fond of block generation, but it is what is done in high-performance scientific computing. The Intel Math Kernel Library, for example, provides only block generation. If you generate one value at a time, the performance is abysmal, but for large batches it is unbeatable (thanks to heavy vectorization).
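The block-generation pattern described above (the same mechanism `rand_core::block::BlockRng64` wraps for single-value use) can be sketched as: refill a whole buffer with the fast, vectorizable inner loop, then hand out values one at a time. `BufferedRng` and its SplitMix64 refill loop are illustrative stand-ins, not the MKL or rand_core API; in a real implementation the refill loop is where the interleaved/vectorized generator would run.

```rust
/// A buffered generator: block refills amortize per-call overhead.
struct BufferedRng {
    state: u64,    // SplitMix64 state stands in for the SIMD generator
    buf: [u64; 64],
    pos: usize,
}

impl BufferedRng {
    fn new(seed: u64) -> Self {
        // pos == buf.len() marks the buffer as empty, forcing a refill.
        BufferedRng { state: seed, buf: [0; 64], pos: 64 }
    }

    /// Block refill: this tight loop is the part that benefits from
    /// vectorization in a real implementation.
    fn refill(&mut self) {
        for slot in self.buf.iter_mut() {
            self.state = self.state.wrapping_add(0x9e37_79b9_7f4a_7c15);
            let mut z = self.state;
            z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
            *slot = z ^ (z >> 31);
        }
        self.pos = 0;
    }

    /// Single-value interface on top of block generation.
    fn next_u64(&mut self) -> u64 {
        if self.pos == self.buf.len() {
            self.refill();
        }
        let v = self.buf[self.pos];
        self.pos += 1;
        v
    }
}

fn main() {
    let mut rng = BufferedRng::new(1);
    let vals: Vec<u64> = (0..4).map(|_| rng.next_u64()).collect();
    println!("{:?}", vals);
}
```

The trade-off is exactly the one described: per-value calls from an empty buffer pay the refill cost, while bulk consumers see only the throughput of the inner loop.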
