
Add getrandom_uninit(dest: &mut [MaybeUninit<u8>]) -> .... #291

Merged
josephlr merged 7 commits from briansmith:b/uninit2 into rust-random:master on Oct 21, 2022

Conversation

briansmith
Contributor

Add a public API for filling an &mut [MaybeUninit<u8>]. This will primarily serve as the building block for more typeful APIs for constructing random arrays.

Fixes #226.
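
As a rough illustration of the intended usage (a sketch only: the return type is elided in the title above, and this assumes the function hands back the initialized bytes as &mut [u8] on success; random_key is a hypothetical caller, not part of this PR):

use core::mem::MaybeUninit;

// Hypothetical caller: fill a stack buffer without zero-initializing it first.
fn random_key() -> Result<[u8; 32], getrandom::Error> {
    let mut buf = [MaybeUninit::<u8>::uninit(); 32];
    // Assumption: getrandom_uninit returns the now-initialized bytes on success.
    let bytes: &mut [u8] = getrandom::getrandom_uninit(&mut buf)?;
    let mut key = [0u8; 32];
    key.copy_from_slice(bytes);
    Ok(key)
}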

@josephlr
Member

Overall, I like this approach better than the one in #279. I think that reducing the amount of unsafe inside the library is good, if we can manage it.

@newpavlov what are your thoughts here?

@newpavlov
Member

@josephlr
I will try to review it this weekend. I still mildly prefer the raw API, mostly because [MaybeUninit<u8>] is quite inconvenient to work with on current stable (most helper methods are nightly-only), so in practice you often have to round-trip through raw pointers. But I guess [MaybeUninit<u8>] is more idiomatic and expresses the intent better, and safe chunking is a nice bonus as well.
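
For concreteness, the raw-pointer round-trip mentioned above looks roughly like the following sketch (fill_via_ptr is a hypothetical helper, not part of this crate, and the values written are placeholders):

use core::mem::MaybeUninit;

fn fill_via_ptr(dest: &mut [MaybeUninit<u8>]) {
    // On stable, drop down to a raw pointer instead of using the nightly-only
    // MaybeUninit slice helpers.
    let ptr = dest.as_mut_ptr() as *mut u8;
    for i in 0..dest.len() {
        // SAFETY: `ptr.add(i)` stays within `dest`, and writing a `u8` into a
        // `MaybeUninit<u8>` slot is always valid.
        unsafe { ptr.add(i).write(i as u8) };
    }
}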

@josephlr
Member

I've resolved all the comments that have been addressed, so the remaining open comments should be the only things that still need to change for this PR:

  • Implementation of slice_assume_init_mut
  • getrandom_uninit_slice -> getrandom_uninit
  • Handling of benches/tests

Other than that, this looks reasonable to me, and I would be open to merging it.

@briansmith
Contributor Author

Are we OK with regressing the performance of the normal getrandom for custom and node.js targets, at least temporarily?

For example, if we were using the custom implementation:

getrandom just directly calls __getrandom_custom
getrandom_uninit calls uninit_slice_fill_zero then __getrandom_custom
But how exactly we do this should be discussed in a smaller follow-up PR.

I agree with this.
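
For reference, the zero-filling shim mentioned in the quote above could look roughly like this (a sketch; the actual helper in src/util.rs may differ):

use core::mem::MaybeUninit;

fn uninit_slice_fill_zero(slice: &mut [MaybeUninit<u8>]) -> &mut [u8] {
    // Zero every byte so the slice is fully initialized.
    for byte in slice.iter_mut() {
        *byte = MaybeUninit::new(0);
    }
    // SAFETY: every element was just initialized above, and MaybeUninit<u8>
    // has the same layout as u8.
    unsafe { &mut *(slice as *mut [MaybeUninit<u8>] as *mut [u8]) }
}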

As for the potential node.js performance regression, I believe the right solution is to change the wasm-bindgen declaration of random_fill_sync to take (*mut u8, usize) or &mut [MaybeUninit<u8>] parameters instead of &mut [u8]. I don't know whether that requires changes to wasm-bindgen or whether it would be otherwise problematic.

The other question I have about this PR, though, is whether it actually makes a significant difference: under what realistic circumstances is getrandom_uninit actually faster than zeroizing a buffer and then calling getrandom? In some initial testing in an application I care a lot about, there is really no difference, because the input buffer is tiny and the zeroization of the buffer before calling getrandom is effectively free: the instructions get scheduled very nicely.

@briansmith
Contributor Author

briansmith commented Oct 17, 2022

More notes on the performance:

See rust-lang/rust#90595 and other bugs about MaybeUninit uses not being optimized the way we would expect. I'm not sure what the current situation is, as I haven't gotten around to benchmarking on the most recent Rust versions yet.

I think the common good use of getrandom_uninit would be something like this:

#![feature(maybe_uninit_as_bytes)]

use std::mem::MaybeUninit;

// Marker for types where any bit pattern is a valid value.
// SAFETY: implementors must guarantee that every byte sequence of the right
// length is a valid value of the type.
unsafe trait Pod {}
unsafe impl Pod for u8 {}
unsafe impl Pod for u64 {}
unsafe impl<T: Pod, const N: usize> Pod for [T; N] {}

#[inline(always)]
fn getrandom_array<T: Pod, const N: usize>() -> Result<[T; N], getrandom::Error> {
    let mut uninit: MaybeUninit<[T; N]> = MaybeUninit::uninit();
    // View the uninitialized array as `&mut [MaybeUninit<u8>]` and fill every
    // byte (`as_bytes_mut` is nightly-only: `maybe_uninit_as_bytes`).
    getrandom::getrandom_uninit(uninit.as_bytes_mut())?;
    // SAFETY: all bytes were initialized above, and `T: Pod` means any bit
    // pattern is a valid `[T; N]`.
    Ok(unsafe { uninit.assume_init() })
}

fn main() {
    let _bytes: [u8; 32] = getrandom_array().unwrap();
    let _bytes_array: [[u8; 8]; 32] = getrandom_array().unwrap();
    let _u64s: [u64; 6] = getrandom_array().unwrap();
}

I think such tiny array sizes are typical of what a good use of getrandom would use; applications that need large arrays of (securely) random bytes should be using a userspace CSPRNG in most cases.

@josephlr
Member

See rust-lang/rust#90595 and other bugs about MaybeUninit uses not being optimized the way we would expect. I'm not sure what the current situation is, as I haven't gotten around to benchmarking on the most recent Rust versions yet.

I patched in the benchmarks discussed above in 4719b5f and tested on the latest nightly.

For 16 / 32 / 256 bytes, the difference is negligible (within the margin of error):

test bench_16           ... bench:         248 ns/iter (+/- 4) = 64 MB/s
test bench_16_uninit    ... bench:         247 ns/iter (+/- 3) = 64 MB/s

test bench_32           ... bench:         249 ns/iter (+/- 4) = 128 MB/s
test bench_32_uninit    ... bench:         247 ns/iter (+/- 5) = 129 MB/s

test bench_256          ... bench:         788 ns/iter (+/- 16) = 324 MB/s
test bench_256_uninit   ... bench:         787 ns/iter (+/- 11) = 325 MB/s

For page-sized benchmarks, there's a consistent (over multiple runs) ~1% improvement:

test bench_page         ... bench:       6,849 ns/iter (+/- 101) = 598 MB/s
test bench_page_uninit  ... bench:       6,788 ns/iter (+/- 116) = 603 MB/s

For our "large" benchmarks (2 MiB), the tests get much noisier, sometimes bench is faster, sometimes bench_uninit is faster.

@newpavlov
Member

For our "large" benchmarks (2 MiB), the tests get much noisier, sometimes bench is faster, sometimes bench_uninit is faster.

At such large sizes, page faults may have a significant influence on the performance measurements, depending on the number of benchmark runs. It could be worth checking the results using pre-faulted, heap-allocated buffers.
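
A sketch of what pre-faulting could look like in the benches (prefaulted_buffer and the fixed 4 KiB page size are assumptions):

const PAGE_SIZE: usize = 4096;

fn prefaulted_buffer(len: usize) -> Vec<u8> {
    let mut buf = vec![0u8; len];
    // Write one byte per page so the kernel maps the pages before the
    // benchmark loop runs, keeping page-fault costs out of the measurements.
    for page in buf.chunks_mut(PAGE_SIZE) {
        page[0] = 0xAA;
    }
    buf
}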

@newpavlov
Member

newpavlov commented Oct 17, 2022

Are we OK with regressing the performance of the normal getrandom for custom and node.js targets, at least temporarily?

Personally, I think it's fine. The custom backend could be fixed in the next breaking release, while for node.js we would have to wait for changes in the bindings.

I wonder whether passing uninitialized buffers to __getrandom_custom is truly a breaking change. I don't think we ever explicitly stated a contract for it (i.e. it was never specified that it's safe to turn the pointer and length into a slice), and any sane practical implementation should only write to the buffer, returning Ok only once it has been fully filled.

The other question I have about this PR, though, is whether it actually makes a significant difference

As the measurements show, the difference is quite small. But I think it's still good practice to provide an API which allows users to trim all unnecessary overhead. I don't think zero-initializing buffers is a bottleneck for IO operations in most cases, but people have still done a fair bit of work on ReadBuf.

@briansmith
Contributor Author

Here is what a call to getrandom() looks like for a [u8; 48]:

example::call_getrandom:
        push    rbx
        sub     rsp, 48
        mov     rbx, rdi

        xorps   xmm0, xmm0                   ; Zero a register
        movaps  xmmword ptr [rsp + 32], xmm0 ; Zero 16 bytes
        movaps  xmmword ptr [rsp + 16], xmm0 ; Zero 16 bytes
        movaps  xmmword ptr [rsp], xmm0      ; Zero 16 bytes

        mov     rdi, rsp
        call    example::getrandom::getrandom
        movaps  xmm0, xmmword ptr [rsp]
        movaps  xmm1, xmmword ptr [rsp + 16]
        movaps  xmm2, xmmword ptr [rsp + 32]
        movups  xmmword ptr [rbx + 33], xmm2
        movups  xmmword ptr [rbx + 17], xmm1
        movups  xmmword ptr [rbx + 1], xmm0
        mov     byte ptr [rbx], 0
        mov     rax, rbx
        add     rsp, 48
        pop     rbx
        ret

The only difference between this assembly and the equivalent that calls getrandom_uninit is the sequence of 4 instructions labeled Zero above: 1 instruction to zero a register and 1 instruction per 16 bytes to copy that zero into memory. Godbolt link.
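
The source behind that snippet isn't inlined in the comment; a reconstruction consistent with the assembly above (the exact signature is an assumption) is roughly:

fn call_getrandom() -> Result<[u8; 48], getrandom::Error> {
    // The stack buffer is zeroed up front; these stores are the four
    // instructions labeled "Zero" in the assembly above.
    let mut buf = [0u8; 48];
    getrandom::getrandom(&mut buf)?;
    Ok(buf)
}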

If we were to reorder/execute the instructions leading up to the call optimally, then we should expect to see pretty much zero performance impact from those 4 instructions, as the stores to memory should happen in parallel with the prologue work in getrandom and its implementation. So we shouldn't really expect any significant performance benefit from the new function for the most commonly-expected uses.

Nonetheless, I'll push an updated version of the PR.

@briansmith briansmith marked this pull request as ready for review October 18, 2022 21:07
@briansmith
Contributor Author

I updated the PR.

@newpavlov wrote:

I wonder whether passing uninitialized buffers to __getrandom_custom is truly a breaking change. I don't think we ever explicitly stated a contract for it (i.e. it was never specified that it's safe to turn the pointer and length into a slice), and any sane practical implementation should only write to the buffer, returning Ok only once it has been fully filled.

getrandom's register_custom_getrandom! macro generates an implementation that requires an initialized buffer, since it constructs a &mut [u8] from (dest, len):

        extern "C" fn __getrandom_custom(dest: *mut u8, len: usize) -> u32 {
            let f: fn(&mut [u8]) -> Result<(), $crate::Error> = $path;
            let slice = unsafe { ::core::slice::from_raw_parts_mut(dest, len) };
            match f(slice) {
                Ok(()) => 0,
                Err(e) => e.code().get(),
            }
        }

@josephlr
Member

getrandom's register_custom_getrandom! macro generates an implementation that requires an initialized buffer, since it constructs a &mut [u8] from (dest, len):

This is correct. __getrandom_custom is an implementation detail and is not part of the crate's public, stable API. register_custom_getrandom! is part of the API; it takes functions of the form fn(&mut [u8]) -> Result<(), Error>, so using uninitialized buffers isn't directly an option. We can discuss whether we want to change things in a follow-up PR.
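
For readers unfamiliar with the custom backend, registering one currently looks roughly like this (a sketch; it requires the crate's custom feature, and my_backend is a placeholder, not a secure implementation):

use getrandom::{register_custom_getrandom, Error};

// Placeholder backend: a real one must fill `buf` with cryptographically
// secure random bytes before returning Ok(()).
fn my_backend(buf: &mut [u8]) -> Result<(), Error> {
    for b in buf.iter_mut() {
        *b = 4; // NOT random; illustration only
    }
    Ok(())
}

register_custom_getrandom!(my_backend);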

@josephlr (Member) left a comment

This looks good! I especially like the benchmark macro; it makes things much clearer.

I think the only unresolved issue is exactly how to handle the tests, but I can make those changes.
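
The benches themselves aren't reproduced in this thread; a macro in the same spirit (a sketch only; the real benches/mod.rs may differ, and the module names and sizes here are just matched to the results quoted later) might look like:

#![feature(test)]
extern crate test;

// Generates a pair of benchmarks (initialized vs. uninitialized buffer) per size.
macro_rules! bench_both {
    ($name:ident, $size:expr) => {
        mod $name {
            use std::mem::MaybeUninit;
            use test::Bencher;

            #[bench]
            fn bench_getrandom(b: &mut Bencher) {
                let mut buf = [0u8; $size];
                b.bytes = $size as u64;
                b.iter(|| {
                    getrandom::getrandom(&mut buf).unwrap();
                    test::black_box(&buf);
                });
            }

            #[bench]
            fn bench_getrandom_uninit(b: &mut Bencher) {
                let mut buf = [MaybeUninit::<u8>::uninit(); $size];
                b.bytes = $size as u64;
                b.iter(|| {
                    let init = getrandom::getrandom_uninit(&mut buf).unwrap();
                    test::black_box(init);
                });
            }
        }
    };
}

bench_both!(aes128, 16); // AES-128 key
bench_both!(p256, 32); // P-256 scalar
bench_both!(p384, 48); // P-384 scalar
bench_both!(page, 4096); // one page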

@josephlr
Member

josephlr commented Oct 19, 2022

@newpavlov I think all the style/implementation concerns have been addressed here. If you still really think having this functionality is a good idea (despite the almost imperceptible performance impact), I would be OK with merging it.

I added some additional benchmarks (for an AES-128 key, and for entire pages), and I ran them on my system with:

 RUSTFLAGS="-C opt-level=3 -C codegen-units=1 -C embed-bitcode=yes -C lto=fat -C target-cpu=native" cargo bench --jobs=1

to reduce variance and give the best performance. As before, there was no performance difference:

test aes128::bench_getrandom        ... bench:         248 ns/iter (+/- 6) = 64 MB/s
test aes128::bench_getrandom_uninit ... bench:         249 ns/iter (+/- 3) = 64 MB/s
test p256::bench_getrandom          ... bench:         204 ns/iter (+/- 2) = 156 MB/s
test p256::bench_getrandom_uninit   ... bench:         205 ns/iter (+/- 5) = 156 MB/s
test p384::bench_getrandom          ... bench:         291 ns/iter (+/- 5) = 164 MB/s
test p384::bench_getrandom_uninit   ... bench:         292 ns/iter (+/- 7) = 164 MB/s
test page::bench_getrandom          ... bench:       6,845 ns/iter (+/- 248) = 598 MB/s
test page::bench_getrandom_uninit   ... bench:       6,838 ns/iter (+/- 103) = 599 MB/s

briansmith and others added 4 commits October 20, 2022 14:05
Add a public API for filling an `&mut [MaybeUninit<u8>]`. This will primarily
serve as the building block for more typeful APIs for constructing random
arrays.

Increase the MSRV to 1.36, as `MaybeUninit` was added in that release.

Fixes rust-random#226.
Signed-off-by: Joe Richey <joerichey@google.com>
Signed-off-by: Joe Richey <joerichey@google.com>
Signed-off-by: Joe Richey <joerichey@google.com>
Signed-off-by: Joe Richey <joerichey@google.com>
@josephlr (Member) left a comment

@newpavlov @briansmith I think everything is resolved here. I've made a few minor changes (see commits) and rebased onto latest master.

Does this PR look good to you?

@briansmith briansmith force-pushed the b/uninit2 branch 2 times, most recently from ab78c81 to 8bf7c8d on October 20, 2022 at 23:57
@josephlr josephlr merged commit 47a59dd into rust-random:master Oct 21, 2022
@briansmith briansmith changed the title Add getrandom_uninit_slice(dest: &mut [MaybeUninit<u8>]) -> .... Add getrandom_uninit(dest: &mut [MaybeUninit<u8>]) -> .... Oct 21, 2022
josephlr added a commit that referenced this pull request Oct 21, 2022
#291 inadvertently removed this check.

See #104 for why this was added in the first place. Also, per our docs:

> If `dest` is empty, `getrandom` immediately returns success, making
> no calls to the underlying operating system.

Signed-off-by: Joe Richey <joerichey@google.com>
newpavlov pushed a commit that referenced this pull request Oct 21, 2022
#291 inadvertently removed this check.

See #104 for why this was added in the first place. Also, per our docs:

> If `dest` is empty, `getrandom` immediately returns success, making
> no calls to the underlying operating system.

Signed-off-by: Joe Richey <joerichey@google.com>
josephlr added a commit that referenced this pull request Oct 22, 2022
This reflects the changes in #291

Signed-off-by: Joe Richey <joerichey@google.com>
newpavlov pushed a commit that referenced this pull request Oct 22, 2022
This reflects the changes in #291
@josephlr josephlr mentioned this pull request Mar 24, 2023
josephlr added a commit that referenced this pull request Mar 25, 2023
Signed-off-by: Joe Richey <joerichey@google.com>
@josephlr josephlr mentioned this pull request Mar 25, 2023
@newpavlov newpavlov mentioned this pull request Apr 2, 2023
Successfully merging this pull request may close these issues:

Expose API interface for MaybeUninit<u8> slice