Rework benchmarks to make it easier to get assembly. #297

josephlr · 2022-10-21T07:21:04Z

This change:

Moves the benchmarks from mod.rs to buffer.rs
Passes a &[u8] to test::black_box for both benchmarks
Move the inner loop we benchmark into an #[inline(never)] function
Includes instructions for getting the ASM for a specific benchmark

This should hopefully reduce the variance of these benchmarks and make it easier to figure out if we are emitting the assembly or IR we expect for a particular implementation.

Signed-off-by: Joe Richey joerichey@google.com

This naming makes more sense, esspecially if we add more benchmark files. Signed-off-by: Joe Richey <joerichey@google.com>

josephlr · 2022-10-21T07:43:08Z

This PR came about because I discovered the amazing cargo-show-asm tool, and I wanted to figure out a way to make it easy to use this tool with our benchmarks.

After installing the tool, we can run cargo asm --bench buffer --release buffer::p384::bench_getrandom::inner. Then after some minor cleanups, we get:

buffer::p384::bench_getrandom::inner:
    push rbx
    sub rsp, 64

    ; Zero the buffer
    xorps xmm0, xmm0
    movaps xmmword ptr [rsp + 48], xmm0
    movaps xmmword ptr [rsp + 32], xmm0
    movaps xmmword ptr [rsp + 16], xmm0
    ; Call the funtion
    lea rbx, [rsp + 16]
    mov esi, 48
    mov rdi, rbx
    call qword ptr [rip + getrandom::imp::getrandom_inner@GOTPCREL]
    ; Check for error
    test eax, eax
    jne .LBB17_1
    ; test::black_box(slice);
    mov qword ptr [rsp], rbx
    mov qword ptr [rsp + 8], 48
    mov rax, rsp

    add rsp, 64
    pop rbx
    ret

We can see the effect of using getrandom_uninit:

buffer::p384::bench_getrandom_uninit::inner:
    push rbx
    sub rsp, 64

    lea rbx, [rsp + 16]
    mov esi, 48
    mov rdi, rbx
    call qword ptr [rip + getrandom::imp::getrandom_inner@GOTPCREL]

    test eax, eax
    jne .LBB18_1

    mov qword ptr [rsp], rbx
    mov qword ptr [rsp + 8], 48
    mov rax, rsp

    add rsp, 64
    pop rbx
    ret

As the benchmarks are compiled as separate crates, we can see the effect of inlining. Removing the #[inline] from getrandom_uninit() gives:

buffer::p384::bench_getrandom_uninit::inner:
    push rbx
    sub rsp, 80
    
    lea rbx, [rsp + 16]
    lea rsi, [rsp + 32]
    mov edx, 48
    mov rdi, rbx
    call qword ptr [rip + getrandom::getrandom_uninit@GOTPCREL]

    mov rax, qword ptr [rsp + 16]
    test rax, rax
    je .LBB18_1

    mov rcx, qword ptr [rsp + 24]
    mov qword ptr [rsp + 16], rax
    mov qword ptr [rsp + 24], rcx

    add rsp, 80
    pop rbx
    ret

We can also see that passing the entire array to test::black_box creates an additional copy:

buffer::p384::bench_getrandom_uninit::inner:
    sub rsp, 104

    lea rdi, [rsp + 56]
    mov esi, 48
    call qword ptr [rip + getrandom::imp::getrandom_inner@GOTPCREL]

    test eax, eax
    jne .LBB18_1

    ; 48 byte copy
    movups xmm0, xmmword ptr [rsp + 56]
    movups xmm1, xmmword ptr [rsp + 72]
    movups xmm2, xmmword ptr [rsp + 88]
    movaps xmmword ptr [rsp + 32], xmm2
    movaps xmmword ptr [rsp + 16], xmm1
    movaps xmmword ptr [rsp], xmm0
    mov rax, rsp

    add rsp, 104
    ret

@briansmith this relates to #291 (comment) about how the type you pass to black_box affects the code generation, so when comparing implementations, we should be sure to use the same types here.

This change: - Move the benchmarks from mod.rs to buffer.rs - Move the inner loop we benchmark into an `#[inline(never)]` function - Includes instructions for getting the ASM for a specific benchmark This should hopefully reduce the variance of these benchmarks and make it easier to figure out if we are emitting the assembly or IR we expect for a particular implementation. Signed-off-by: Joe Richey <joerichey@google.com>

briansmith · 2022-10-21T17:02:21Z

No major objections from me.

Passes a &[u8] to test::black_box for both benchmarks

I think most users don't really want a &[u8] but rather want a [u8; N] or even a stronger type (see the getrandom_array discussion). The getrandom benchmark does produce an [u8; N] but the getrandom_uninit benchmark never does. It is good that you got rid of the extra copy caused by passing the entire array to black_box but it would be better to pass a &[u8; N] instead. In reality the assume_init() call that would be required is almost definitely going to get optimized away; however, according to bug reports from last year that wasn't always the case.

josephlr requested a review from newpavlov October 21, 2022 07:21

Rename benches/mod.rs to benches/buffer.rs

359b978

This naming makes more sense, esspecially if we add more benchmark files. Signed-off-by: Joe Richey <joerichey@google.com>

josephlr force-pushed the bench branch from f1613f8 to c5c58c2 Compare October 21, 2022 07:26

josephlr force-pushed the bench branch from c5c58c2 to 9ac388e Compare October 21, 2022 07:59

newpavlov approved these changes Oct 21, 2022

View reviewed changes

newpavlov merged commit bd0654f into master Oct 21, 2022

newpavlov deleted the bench branch October 21, 2022 14:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework benchmarks to make it easier to get assembly. #297

Rework benchmarks to make it easier to get assembly. #297

josephlr commented Oct 21, 2022 •

edited

josephlr commented Oct 21, 2022 •

edited

briansmith commented Oct 21, 2022

Rework benchmarks to make it easier to get assembly. #297

Rework benchmarks to make it easier to get assembly. #297

Conversation

josephlr commented Oct 21, 2022 • edited

josephlr commented Oct 21, 2022 • edited

briansmith commented Oct 21, 2022

josephlr commented Oct 21, 2022 •

edited

josephlr commented Oct 21, 2022 •

edited