Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework benchmarks to make it easier to get assembly. #297

Merged
merged 2 commits into from Oct 21, 2022
Merged

Conversation

josephlr
Copy link
Member

@josephlr josephlr commented Oct 21, 2022

This change:

  • Moves the benchmarks from mod.rs to buffer.rs
  • Passes a &[u8] to test::black_box for both benchmarks
  • Move the inner loop we benchmark into an #[inline(never)] function
  • Includes instructions for getting the ASM for a specific benchmark

This should hopefully reduce the variance of these benchmarks and make it easier to figure out if we are emitting the assembly or IR we expect for a particular implementation.

Signed-off-by: Joe Richey joerichey@google.com

This naming makes more sense, esspecially if we add more benchmark
files.

Signed-off-by: Joe Richey <joerichey@google.com>
@josephlr
Copy link
Member Author

josephlr commented Oct 21, 2022

This PR came about because I discovered the amazing cargo-show-asm tool, and I wanted to figure out a way to make it easy to use this tool with our benchmarks.

After installing the tool, we can run cargo asm --bench buffer --release buffer::p384::bench_getrandom::inner. Then after some minor cleanups, we get:

buffer::p384::bench_getrandom::inner:
    push rbx
    sub rsp, 64

    ; Zero the buffer
    xorps xmm0, xmm0
    movaps xmmword ptr [rsp + 48], xmm0
    movaps xmmword ptr [rsp + 32], xmm0
    movaps xmmword ptr [rsp + 16], xmm0
    ; Call the funtion
    lea rbx, [rsp + 16]
    mov esi, 48
    mov rdi, rbx
    call qword ptr [rip + getrandom::imp::getrandom_inner@GOTPCREL]
    ; Check for error
    test eax, eax
    jne .LBB17_1
    ; test::black_box(slice);
    mov qword ptr [rsp], rbx
    mov qword ptr [rsp + 8], 48
    mov rax, rsp

    add rsp, 64
    pop rbx
    ret

We can see the effect of using getrandom_uninit:

buffer::p384::bench_getrandom_uninit::inner:
    push rbx
    sub rsp, 64

    lea rbx, [rsp + 16]
    mov esi, 48
    mov rdi, rbx
    call qword ptr [rip + getrandom::imp::getrandom_inner@GOTPCREL]

    test eax, eax
    jne .LBB18_1

    mov qword ptr [rsp], rbx
    mov qword ptr [rsp + 8], 48
    mov rax, rsp

    add rsp, 64
    pop rbx
    ret

As the benchmarks are compiled as separate crates, we can see the effect of inlining. Removing the #[inline] from getrandom_uninit() gives:

buffer::p384::bench_getrandom_uninit::inner:
    push rbx
    sub rsp, 80
    
    lea rbx, [rsp + 16]
    lea rsi, [rsp + 32]
    mov edx, 48
    mov rdi, rbx
    call qword ptr [rip + getrandom::getrandom_uninit@GOTPCREL]

    mov rax, qword ptr [rsp + 16]
    test rax, rax
    je .LBB18_1

    mov rcx, qword ptr [rsp + 24]
    mov qword ptr [rsp + 16], rax
    mov qword ptr [rsp + 24], rcx

    add rsp, 80
    pop rbx
    ret

We can also see that passing the entire array to test::black_box creates an additional copy:

buffer::p384::bench_getrandom_uninit::inner:
    sub rsp, 104

    lea rdi, [rsp + 56]
    mov esi, 48
    call qword ptr [rip + getrandom::imp::getrandom_inner@GOTPCREL]

    test eax, eax
    jne .LBB18_1

    ; 48 byte copy
    movups xmm0, xmmword ptr [rsp + 56]
    movups xmm1, xmmword ptr [rsp + 72]
    movups xmm2, xmmword ptr [rsp + 88]
    movaps xmmword ptr [rsp + 32], xmm2
    movaps xmmword ptr [rsp + 16], xmm1
    movaps xmmword ptr [rsp], xmm0
    mov rax, rsp

    add rsp, 104
    ret

@briansmith this relates to #291 (comment) about how the type you pass to black_box affects the code generation, so when comparing implementations, we should be sure to use the same types here.

This change:
  - Move the benchmarks from mod.rs to buffer.rs
  - Move the inner loop we benchmark into an `#[inline(never)]` function
  - Includes instructions for getting the ASM for a specific benchmark

This should hopefully reduce the variance of these benchmarks and make
it easier to figure out if we are emitting the assembly or IR we expect
for a particular implementation.

Signed-off-by: Joe Richey <joerichey@google.com>
@newpavlov newpavlov merged commit bd0654f into master Oct 21, 2022
@newpavlov newpavlov deleted the bench branch October 21, 2022 14:10
@briansmith
Copy link
Contributor

No major objections from me.

  • Passes a &[u8] to test::black_box for both benchmarks

I think most users don't really want a &[u8] but rather want a [u8; N] or even a stronger type (see the getrandom_array discussion). The getrandom benchmark does produce an [u8; N] but the getrandom_uninit benchmark never does. It is good that you got rid of the extra copy caused by passing the entire array to black_box but it would be better to pass a &[u8; N] instead. In reality the assume_init() call that would be required is almost definitely going to get optimized away; however, according to bug reports from last year that wasn't always the case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants