Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: enable cranelift (switch to nightly) #4498

Closed
wants to merge 3 commits into from

Conversation

dpc
Copy link
Contributor

@dpc dpc commented Mar 7, 2024

Not planning to land it right now.

On top of pending PRs, due to possibility of merge conflicts.

See https://blog.rust-lang.org/2023/11/09/parallel-rustc.html

On my machine I've noticed zero speedup. It might speed up someone's compilation time, but I guess it would require lots of cpu cores. (more than 20 I have?).

Use just bench-compilation to bench. If anyone sees a speedup and it seems beneficial, we could land. Otherwise I'm a bit dissapointed.

@dpc dpc requested review from a team as code owners March 7, 2024 22:13
@maan2003
Copy link
Member

maan2003 commented Mar 7, 2024

I've noticed zero speedup

because we have so many crates, all are already compiling in parallel.

if you compile one project then it will speed up.

like cargo b -p fedimint-cli

@dpc
Copy link
Contributor Author

dpc commented Mar 7, 2024

because we have so many crates, all are already compiling in parallel.

That's a rare use-case, a chunk of that time is already fast linking with mold, and then it will be like 2 seconds anyway.

@elsirion I remember you showing a pretty graph showing cpus were under-utilized and mentioning lots of cores. Do you happen to have a machine like that? BTW. What was that tool that generated that pretty graph?

image

@maan2003
Copy link
Member

maan2003 commented Mar 7, 2024

What was that tool that generated that pretty graph?

cargo build --timings maybe?

@dpc
Copy link
Contributor Author

dpc commented Mar 7, 2024

What was that tool that generated that pretty graph?

cargo --timings maybe?

See my edit above. I don't think that's it.

@maan2003
Copy link
Member

maan2003 commented Mar 7, 2024

that's nix run nixpkgs#btop

@maan2003
Copy link
Member

maan2003 commented Mar 7, 2024

also we need to bench cargo check instead of cargo build

because that is only runs frontend (which was parallelized in rustc multithread)

@dpc
Copy link
Contributor Author

dpc commented Mar 7, 2024

No extra threads:

Full check debug:       34.08   458.20  41.41
Incr check debug:       2.84    3.57    2.17
Full build debug:       62.37   1248.81 90.31
Incr build debug:       9.20    16.95   8.51
Full check release:     31.61   258.52  35.97
Incr check release:     9.30    14.43   2.28
Full build release:     84.01   1726.18 80.97
Incr build release:     43.24   990.00  27.75

THREADS=8

Full check debug:       31.59   524.34  56.68
Incr check debug:       3.44    6.48    5.62
Full build debug:       60.95   1378.73 116.51
Incr build debug:       9.54    22.44   12.21
Full check release:     31.79   325.78  52.02
Incr check release:     5.94    23.59   3.47
Full build release:     82.33   1820.43 105.97
Incr build release:     42.83   1018.34 32.54

THREADS=20

Full check debug:       32.78   525.24  56.50
Incr check debug:       3.49    6.55    5.63
Full build debug:       62.24   1365.69 114.37
Incr build debug:       9.00    22.19   11.84
Full check release:     31.29   319.97  52.68
Incr check release:     5.90    23.14   3.44
Full build release:     82.45   1836.50 110.86
Incr build release:     42.22   1031.65 32.38

Not going to fix the padding, too much work. :D

Notably when threads are enabled the sys increases (I guess caused by whatever the parallelism overhead), which makes what's already small and fast suffer a bit, canceling improvements.

I guess on system that have more, but slower cores the gain might be better(?).

@dpc dpc requested review from a team as code owners March 7, 2024 23:53
@douglaz
Copy link
Contributor

douglaz commented Mar 11, 2024

Master:

CPU: AMD Ryzen 9 5900X (24) @ 3.700GHz 

Date: 2024-03-11
Commit: 24e8e3a78f
                       total    user     sys
Full  check   debug:   51.72  599.06   48.98
Incr  check   debug:    3.75    5.32    2.81
Full  build   debug:   94.17 1582.06  106.13
Incr  build   debug:   13.05   24.85   10.57
Full  check release:   37.83  324.82   44.15
Incr  check release:   12.90   20.33    3.12
Full  build release:  118.29 2431.49   95.96
Incr  build release:   74.97 1463.55   35.07

This branch:

 CPU: AMD Ryzen 9 5900X (24) @ 3.700GHz 

Date: 2024-03-11
Commit: d66c68f708
                   	total	user	sys
Full check debug: 	46.99	708.62	72.70
Incr check debug: 	4.42	8.32	6.47
Full build debug: 	101.06	1876.18	152.78
Incr build debug: 	12.87	29.63	14.44
Full check release: 	46.30	438.94	68.37
Incr check release: 	7.82	30.79	4.42
Full build release: 	133.48	2511.73	138.47
Incr build release: 	70.44	1380.19	45.76

So this branch is slower in most stuff.

@elsirion
Copy link
Contributor

I see a slight improvement in release mode, but might be noise, debug is slightly slower:

Full check debug:       49.10   968.92  71.50
Incr check debug:       6.59    15.41   8.99
Full build debug:       84.02   2480.56 149.89
Incr build debug:       12.49   40.47   16.98
Full check release:     53.20   582.53  65.35
Incr check release:     10.51   44.11   3.95
Full build release:     114.46  3019.66 142.44
Incr build release:     53.65   1587.94 36.96

@dpc
Copy link
Contributor Author

dpc commented Mar 18, 2024

I was going to close it, but maybe I could try https://www.reddit.com/r/rust/comments/1bgyo8a/try_cranelift_codegen_backend_for_faster_compile/ while at it.

@dpc
Copy link
Contributor Author

dpc commented Mar 18, 2024

Nice!

Before cranelift:

                       total    user     sys
Full  check   debug:   27.67  488.49   50.46
Incr  check   debug:    3.53    7.27    6.07
Full  build   debug:   56.31 1245.94  101.50
Incr  build   debug:   10.13   25.44   14.60

After cranelift:

                       total    user     sys
Full  check   debug:   20.87  275.16   44.34
Incr  check   debug:    3.77    7.64    5.98
Full  build   debug:   29.92  395.84   65.06
Incr  build   debug:    7.75   23.76   11.82

@elsirion
Copy link
Contributor

Before:

Commit: be2815f684
                       total    user     sys
Full  check   debug:   50.83  782.93   51.03
Incr  check   debug:    4.77    7.67    2.82
Full  build   debug:   76.18 1951.05  107.73
Incr  build   debug:   11.91   29.63   10.66

After:

Commit: 0011b5374d                                                                                                                                                                                                                                                                        
                       total    user     sys                                                                                                                                                                                                                                              
Full  check   debug:   28.99  510.35   59.83                                                                                                                                                                                                                                              
Incr  check   debug:    7.23   17.43    9.65                                                                                                 
Full  build   debug:   39.40  719.91   83.11                                                                                                                                                                                                                                              
Incr  build   debug:   11.61   43.16   15.26                                                                                                 

I like! 😍

@dpc dpc changed the title chore: enable rust compiler parallel frontend [needs benchmarking] chore: enable cranelift & rust compiler parallel frontend (switch to nightly) Mar 18, 2024
@dpc
Copy link
Contributor Author

dpc commented Mar 18, 2024

It's green. It's only enabled in dev builds, so IMO low risk. Let's give it a spin!

@dpc dpc force-pushed the go-nightly branch 2 times, most recently from 07d39c1 to b0d83fd Compare March 18, 2024 22:03
@dpc dpc marked this pull request as draft March 18, 2024 22:05
@dpc
Copy link
Contributor Author

dpc commented Mar 18, 2024

Bummer.

> cat target-nix/devimint/logs/fedimintd-0.log
2024-03-18T22:05:11.213297Z  INFO consensus: Starting config gen
2024-03-18T22:05:11.213685Z  INFO net::peer::dkg: Created new config gen Api
2024-03-18T22:05:11.215604Z  INFO net::api: Starting api on ws://127.0.0.1:12969
2024-03-18T22:05:11.215874Z  INFO fedimintd::fedimintd: Metrics API listening on 127.0.0.1:12976
trap at Instance { def: Item(DefId(1:15511 ~ core[0bd7]::core_arch::x86::sha::_mm_sha1rnds4_epu32)), args: [0_i32] } (_ZN4core9core_arch3x86
3sha19_mm_sha1rnds4_epu3217h09747cb42999c042E): llvm.x86.sha1rnds4

@dpc
Copy link
Contributor Author

dpc commented Mar 18, 2024

OK, so this can be fixed by updating nightly. 🍀

But now the problem is that dev builds and tests are much slower. 5x or so. I don't think anyone wants to wait 1-2 minutes to open just mprocs :D

@elsirion
Copy link
Contributor

But now the problem is that dev builds and tests are much slower. 5x or so.

How so?

@dpc
Copy link
Contributor Author

dpc commented Mar 19, 2024

How so?

That's the whole thing with cranelift - it is supposed to be faster at the expense of optimization power. It only have basic optimizations currently, maybe it will get somewhat better in the future.

In our case it looks like consensus is getting extremely slow. Possibly just some crypto suffers from lack of certain optimization? Hard to tell. I tried increasing optimization level on certain things, but without success.

@bjorn3
Copy link

bjorn3 commented Mar 24, 2024

But now the problem is that dev builds and tests are much slower. 5x or so.

For cryptography related crates this project is using opt-level = 3:

fedimint/Cargo.toml

Lines 107 to 125 in 29cf90e

[profile.dev.package]
secp256k1 = { opt-level = 3 }
secp256k1-zkp = { opt-level = 3 }
secp256k1-sys = { opt-level = 3 }
secp256k1-zkp-sys = { opt-level = 3 }
bitcoin_hashes = { opt-level = 3 }
ff = { opt-level = 3 }
group = { opt-level = 3 }
pairing = { opt-level = 3 }
rand_core = { opt-level = 3 }
byteorder = { opt-level = 3 }
zeroize = { opt-level = 3 }
bls12_381 = { opt-level = 3 }
subtle = { opt-level = 3 }
ring = { opt-level = 3 }
fedimint-threshold-crypto = { opt-level = 3 }
aleph-bft-crypto = { opt-level = 3 }
aleph-bft-rmc = { opt-level = 3 }
aleph-bft-types = { opt-level = 3 }
I never quite benchmarked it, but I did expect cg_clif with optimizations enabled to be around the speed of LLVM opt-level = 1. It is in principle possible to compile these crates with LLVM by adding codegen-backend = "llvm" to each of these package profile overrides (so for example secp256k1 = { opt-level = 3, codegen-backend = "llvm" }.) Be aware however that there is currently an ABI incompatibility around 128bit integers: rust-lang/rustc_codegen_cranelift#1449 I haven't gotten around fixing it properly yet.

ps: This is the project you mentioned in https://lobste.rs/s/tjh7oy/cranelift_code_generation_comes_rust#c_gq7fp3, right?

Edit: Opened rust-lang/rustc_codegen_cranelift#1468 to track emitting a warning for this case to reduce confusion in the future.

@dpc
Copy link
Contributor Author

dpc commented Mar 24, 2024

@bjorn3 Oh, it's already possible to pick these per-crate? Awesome. I'll try it out and report back. Thanks for letting us know, and I'll sub these issues and keep an eye on it.

@dpc
Copy link
Contributor Author

dpc commented Mar 25, 2024

Did another round of testing, and pushed a best approach

Current master branch:

                       total    user     sys
Full  check   debug:   34.16  420.65   39.60
Incr  check   debug:    2.86    3.98    2.55
Full  build   debug:   61.04 1138.98   84.82
Incr  build   debug:    9.71   18.53    9.47

after with cranelift + parallel (but now with llvm for perf-heavy crates):

                       total    user     sys
Full  check   debug:   29.46  342.61  107.06
Incr  check   debug:    4.96   11.67   11.06
Full  build   debug:   40.04  487.93  126.50
Incr  build   debug:    8.91   29.45   17.05

just cranelift:

                       total    user     sys
Full  check   debug:   27.74  216.44   36.03
Incr  check   debug:    3.37    4.87    2.77
Full  build   debug:   38.91  336.45   54.94
Incr  build   debug:    7.87   17.36    7.19

no -O3 on local crypto crates (fedimint-threshold-crypto) + cranelift:

                       total    user     sys
Full  check   debug:   28.19  215.80   35.18
Incr  check   debug:    3.34    4.92    2.70
Full  build   debug:   39.73  335.34   54.93
Incr  build   debug:    8.69   17.95    7.95

O3 -> O2 + above:

                       total    user     sys
Full  check   debug:   28.17  212.10   34.99
Incr  check   debug:    3.32    4.85    2.73
Full  build   debug:   38.70  332.25   55.34
Incr  build   debug:    8.25   17.47    7.72

With this setup the env RUST_LOG=devimint=debug just devimint-env starts as fast as before, so seems like a pure win now.

@dpc dpc marked this pull request as ready for review March 25, 2024 23:53
@dpc dpc changed the title chore: enable cranelift & rust compiler parallel frontend (switch to nightly) chore: enable cranelift (switch to nightly) Mar 25, 2024
@dpc dpc enabled auto-merge March 27, 2024 20:32
Copy link
Contributor

@elsirion elsirion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we tested this on mac @justinmoon @dpc @maan2003 ? Merge if so.

@dpc dpc added this pull request to the merge queue Mar 29, 2024
@elsirion elsirion removed this pull request from the merge queue due to a manual request Mar 29, 2024
@justinmoon
Copy link
Contributor

When I checkout the branch and direnv runs, the nix environment doesn't build.

$ git fetch dpc
$ git checkout go-nightly
branch 'go-nightly' set up to track 'dpc/go-nightly' by rebasing.
Switched to a new branch 'go-nightly'
[1 copied (0.3 MiB), 0.1 MiB DL] querying source on https://cache.nixos.orgdirenv: ([/nix/store/wrnrvg7ix2bmgnm3v4qvgvipgivak2ym-direnv-2.32.3/bin/direnv export fish]) is taking a while to execute. Use CTRL-C to give up.
error:
       … while calling the 'derivationStrict' builtin

         at /builtin/derivation.nix:9:12: (source not available)

       … while evaluating derivation 'nix-shell'
         whose name attribute is located at /nix/store/3i3rncs75fid9hwai5p2nvwc4ngdnia7-source/pkgs/stdenv/generic/make-derivation.nix:348:7

       … while evaluating attribute 'nativeBuildInputs' of derivation 'nix-shell'

         at /nix/store/3i3rncs75fid9hwai5p2nvwc4ngdnia7-source/pkgs/stdenv/generic/make-derivation.nix:392:7:

          391|       depsBuildBuild              = elemAt (elemAt dependencies 0) 0;
          392|       nativeBuildInputs           = elemAt (elemAt dependencies 0) 1;
             |       ^
          393|       depsBuildTarget             = elemAt (elemAt dependencies 0) 2;

       (stack trace truncated; use '--show-trace' to show the full trace)

       error: attribute 'rustc-codegen-cranelift-preview' missing

       at /nix/store/6nf10v3z293qh6qy9rlm7rcvzrz4i6m1-source/lib/mkFenixToolchain.nix:52:21:

           51|       (map
           52|         (component: fenix.packages.${system}.${channel}.${component})
             |                     ^
           53|         components

@dpc
Copy link
Contributor Author

dpc commented Mar 29, 2024

OH. Of course. Great. I could make this component conditional, but then we would need to make the backend setting conditional on the current system, in Cargo.toml. And that is probably possibly only via build.rs setting up some feature flag, at which point it doesn't seem all that worthwhile.

@elsirion
Copy link
Contributor

Close the PR then for now?

@dpc dpc closed this Mar 30, 2024
@maan2003
Copy link
Member

can we have rustc wrapper that ignores the codegen-backend arguments on macos

@maan2003
Copy link
Member

61s => 40s on a fast system will be insane speed for slow systems.

@dpc
Copy link
Contributor Author

dpc commented Mar 31, 2024

can we have rustc wrapper that ignores the codegen-backend arguments on macos

There's only so much complexity we want to deal with.

In a couple of months MacOS support might be already implemented, and we can try-again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants