
new scheduler from RFC 5 #746

Merged: 33 commits, Aug 13, 2020

Conversation

@nikomatsakis (Member) commented Apr 8, 2020

Implementation of the scheduler described in rayon-rs/rfcs#5 -- modulo the fact that the RFC is mildly out of date. There is a walkthrough video available.

To Do List:

  • Fix the cargo lock
  • Address use of AtomicU64
  • Document the handling of rollover and wakeups and convince ourselves it's sound
  • Adopt and document the proposed scheme for the job event counter
  • Review RFC and list out the places where it differs from the branch

@@ -18,6 +18,7 @@ categories = ["concurrency"]
[dependencies]
num_cpus = "1.2"
lazy_static = "1"
crossbeam-channel = "0.4.2"
Member:

It's slightly disappointing to add a dependency just for usually-disabled logging, but I guess it doesn't add much since we're already invested in crossbeam. I also see that we can't just use std::sync::mpsc for that as it's !Sync.

Member Author:

My original intention was to collect the events in per-thread queues and then fire them off somewhere central only every once in a while, but I decided to "just try" crossbeam-channel and see how it performs. It turned out to perform pretty well and logging had no observable impact on performance. Later I found that in some use cases it performed less well and does have an observable impact. Maybe I should swap this out for Mutex<Channel> and we can revisit later?
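
For reference, here is a minimal sketch of the Mutex-wrapped channel alternative mentioned above (hypothetical `Event` type and structure, not the PR's actual logging code):

```rust
use std::sync::mpsc::channel;
use std::sync::Mutex;
use std::thread;

// Hypothetical event type; the real logger records scheduler events.
#[derive(Debug)]
enum Event {
    JobPosted { worker: usize },
}

fn main() {
    let (tx, rx) = channel::<Event>();
    // At the time of this discussion `Sender` was !Sync, so sharing one
    // sender across worker threads requires wrapping it in a Mutex.
    let tx = Mutex::new(tx);

    thread::scope(|s| {
        for worker in 0..4 {
            let tx = &tx;
            s.spawn(move || {
                // Each send briefly takes the lock; that contention is the
                // cost being weighed against a dedicated MPMC channel.
                tx.lock().unwrap().send(Event::JobPosted { worker }).unwrap();
            });
        }
    });
    drop(tx); // close the channel so the drain below terminates

    for event in rx {
        println!("{:?}", event);
    }
}
```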


@nikomatsakis The tracing crate might help here.

rayon-core/src/registry.rs: review thread resolved (outdated)
/// reaches zero.
///
/// NB. We use a `CountLatch` here because it has no lifetimes and is
/// meant for async use, but the count never gets higher than one.
Member:

This seems rather kludgy, but I guess it's not anything critical.

});
self.job_completed_latch.set();
self.job_completed_latch
.set_and_tickle_one(&self.registry, self.owner_thread_index);
Member:

I think this is another area where we're potentially invalidating &self, when the last spawn sets and its scope races to return. It's not an immediate problem, because these self.registry and self.owner_thread_index reads are completed as arguments before the actual set. But generally, it's still akin to rust-lang/rust#55005.

Member Author:

Is the concern here the (more or less unavoidable, without refactoring to use raw pointers) UB that rust-lang/rust#55005 describes, or the possibility of accidentally reordering things? It does seem like a subtle invariant that isn't clear from the code.

(Around the latter, I'm wondering if we can refactor in some way to make that less likely... but I don't know how yet.)
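
To make the invariant concrete, here is a minimal illustration (hypothetical `Latch`/`Registry` types, not rayon's actual API) of the defensive pattern of copying the needed fields into locals before the latch is set, so that nothing is read through `&self` afterwards:

```rust
// Once the latch is set, the thread blocked on it may return and free the
// structure that owns the latch, so nothing may be read through `&self`
// after that point. Reading into locals first makes that rule explicit.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

struct Latch(AtomicBool);

impl Latch {
    fn set(&self) {
        self.0.store(true, Ordering::Release);
    }
}

struct Registry;

impl Registry {
    fn tickle_worker(&self, _index: usize) {
        // Wake one sleeping worker (elided).
    }
}

struct ScopeLatch {
    latch: Latch,
    registry: Arc<Registry>,
    owner_thread_index: usize,
}

impl ScopeLatch {
    fn job_completed(&self) {
        // Read everything we need *before* setting the latch...
        let registry = Arc::clone(&self.registry);
        let owner = self.owner_thread_index;
        self.latch.set();
        // ...and use only the locals afterwards.
        registry.tickle_worker(owner);
    }
}

fn main() {
    let scope_latch = ScopeLatch {
        latch: Latch(AtomicBool::new(false)),
        registry: Arc::new(Registry),
        owner_thread_index: 0,
    };
    scope_latch.job_completed();
}
```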

/// - The bits 0x4444 are the number of **sleeping threads**.
///
/// See the struct `Counters` below.
value: AtomicU64,
Member:

As we noted in chat, AtomicU64 is only available starting in Rust 1.34 (which is close to a year old for compat), but even then it's not necessarily available on all architectures. We should probably stick to AtomicUsize, though we need to consider what that means for 32-bit.

AIUI the jobs/sleepy counters don't really care about true values, just comparing relative change, right? Will it be a problem if that has only an 8-bit rollover on 32-bit targets? Then the idle/sleeping threads seem like hard limits on our thread count, which we should enforce somewhere.

@nikomatsakis (Member Author):

OK, I started writing up documentation again in SLEEP.md. I expect to update the RFC with those docs eventually, but I figured we would want "living docs" in the repo regardless.

In the process, I'm pretty sure I found one bug having to do with our rollover algorithms. I'm also wondering whether the rollover protection that I added is "good enough" -- it's clearly possible for there to be missed events, which could lead to deadlock. I think before I judged them "sufficiently unlikely", but I'm not sure if I believe that anymore.

A few notes:

If a thread is announcing that it is sleepy, and it observes that incrementing the SEC would trigger rollover, it currently sets both the JEC and SEC to 0. But this seems wrong: the idea is that when a thread is in the sleepy state, it will ensure that JEC != SEC (and when another thread posts work, that thread will set JEC = SEC). But in this case, we've just gone sleepy, but we left the counters equal, so the other thread would assume it has nothing to do. I think we should set JEC = 0, SEC = 1 upon rollover instead.

But beyond that, I'm a bit confused by what I was thinking with the rollover scheme. In the code as written, when a thread is getting sleepy it increments the SEC and keeps the resulting value (let's call it SEC_A). Then, when it wants to sleep, it checks that JEC < SEC_A (otherwise it goes back to idle). But we could certainly have these events:

  • Thread A increments SEC to SEC_A == MAX.
  • Thread B rolls over SEC, setting it to 1 and JEC to 0.
  • Thread C posts work, setting JEC to 1.
  • Thread A tries to go to sleep, and observes that JEC < SEC_A, and hence concludes (incorrectly) that no new work was posted.

I see that in my original comment on the RFC, I described that part of the "go to sleep" protocol would be to check whether SEC < SEC_A, which would thus allow us to detect conditions like the above. But I don't seem to have implemented that check. Similarly, we could save the value of the JEC when we announce we are sleepy, and check whether JEC == JEC_A (i.e., has it changed at all), although that seems slightly riskier (see below).

However, both these schemes only give us "probabilistic" protection against rollover:

  • If we check that SEC >= SEC_A, it's possible that there have been enough cycles to bring SEC back up to exactly equal SEC_A.
  • If we store the old JEC value, that's in some sense even riskier. The old value might've been very low, maybe even 0, in which case a rollover might inadvertently get us confused.

All of this is making me think this scheme isn't great and we should use a different/simpler one. For example, we could add an extra AtomicUsize that acts as an "epochal" counter, incremented each time we roll over, and we would also read that and watch for changes.

If we did that, I was wondering if we truly need the SEC/JEC-- maybe we can get by with just a "sleepy bit" that we set to true? And then, when new work is posted, we clear the bit and increment JEC -- and if there will be rollover, we first increment some atomic epochal counter. Now, to check if new work was posted, we read the JEC/epochal counters and check whether they have changed. (I have to think a bit about the ordering of these two reads, since they can't be done atomically, but I think it would work out ok.) Or, maybe, to handle rollover we have the epoch counter guarded by a special mutex. When threads go to sleep (presumably rare and it's ok for it to be a bit costly), they also acquire this mutex before checking the JEC/SEC or whatever we're using. This way we can make adjustments to the epochal counter atomic.

Anyway, I have to go now. I'll tinker with this (or I'm happy to have others do so), but I'm fairly convinced we need a simpler and more "obviously correct" scheme here (esp. since the existing one is, well, just not correct, obviously or otherwise).

@nikomatsakis (Member Author):

Thought about this in the shower. Here is what I think we should do.

  1. Revamp the JEC/SEC setup. We now have just a JEC. When a thread goes sleepy, it reads the JEC. If the JEC is even (low bit is 0), it sets it to JEC+1 atomically (i.e., it increments the low bit). If the JEC is already odd, no action is needed. We remember the final value of the JEC. When a job is published, we check the JEC, and if it is odd, we make it even. When the thread goes to sleep, it checks whether the JEC has changed, and if so it returns to idle (see the sketch after this comment).
  2. Of course, that still permits rollover to cause missed events, however unlikely. If the job is injected from inside the threadpool, this is not such a big deal: we are less efficient than we could be, but all jobs still complete (and after all, this is very unlikely). If the job is injected externally, though, we can get deadlock, which seems bad. To fix that, let's just have the "busy loop" for an external job wake up every N ms (250? 500?) and quickly check whether all threads are sleeping. If so, we'll wake one up and keep waiting.

This setup seems

  1. obviously correct and efficient
  2. deadlock free
  3. "zero cost" from a CPU perspective, in that if you don't have active rayon jobs executing, all threads can go to sleep, no problem

@nikomatsakis (Member Author) commented Apr 12, 2020

So I implemented the scheme I proposed above of "ignore rollover", but I hit two roadblocks:

  1. I can't reproduce the problem of a deadlock, even with rollover made artificially very, very common (e.g., using only a 1-bit counter).
  2. While it would be easy to have an external thread wake up every so often and ensure that the target registry is not asleep, there is one other case to consider: the cross-registry case.

To handle the "external thread", we simply convert a call to wait to a call to wait_timeout_ms, and then have that thread poll every once in a while.

To handle the cross-registry case, though, the waiting is being done by a busy-loop on some other worker thread. That worker thread might even go to sleep while waiting. So we'd have to propagate this logic pretty deep, and I'm not thrilled about that.

But I realize now that maybe there's a simpler answer. After a thread increments the "sleeping threads" counter -- or perhaps only if it is the last sleeping thread -- it can simply read the "injector" one last time and wake back up if it finds anything in there. After all, the danger is that

  • thread A gets sleepy, records JEC_A as the previous value, and scans for work (finds none)
  • other threads manipulate JEC such that it is equal to JEC_A - 1
  • work is injected, incrementing JEC to JEC_A
  • thread A observes JEC, sees it is equal to JEC_A, and goes to sleep, despite work having been injected -- nobody ever wakes up to pick up that work

But if we just check for injected work after we increment the "sleeping threads" counter, then we would for sure catch this case. And the key point is that we must see the injected work, because it is that step which increments the JEC back to JEC_A. (Further, since we have incremented the "sleeping threads" counter, if there is work actively being injected as we fall asleep, the injector will either (a) first increment the JEC, in which case the sleepy thread will see the work in the final check, or (b) see that all threads are asleep.)

Now if only I could see the problem in action so that I could feel confident it was fixed. Also, I am thinking I should go find my book on different ways to implement mutexes and see if there are any obvious analogues to this; as I said, I feel like I'm reinventing the wheel here. Ah well.

@nikomatsakis (Member Author):

OK, so, I've been thinking about this and I concluded that the right approach here is:

First, the scheme I proposed is almost right but actually doesn't prevent deadlocks. The problem is that, so long as the person posting jobs just does a SeqCst read of the JEC, there is no requirement that the sleeping thread will see the job in the injection queue. This is because SeqCst reads don't synchronize with later writes, even though they must be strongly ordered. In order to guarantee that, the person posting the job must either make a write to the JEC, or else perhaps we can use fences.

This is the final straw that convinces me I am way overthinking this. I think we should pare back to the simpler scheme where each time new work appears, we increment the JEC, always. When threads get sleepy, they read the atomic counters. When they go to sleep, they (a) check that the JEC hasn't changed and (b) increment the number of sleeping threads. Presuming this succeeds, they must still check the injection queue (due to the potential for rollover), but this way they are guaranteed to see any new work, because they will have observed the incremented JEC.

We should measure that scheme and try to land it. Once we do that, we can tinker with other schemes. I feel like it would be great to have the majority of this work landed and then focus on just this one minor thing. In particular, I've been wanting to rethink some of the core things here.

I can make those changes in the next day or two I imagine.

@cuviper (Member) commented Apr 13, 2020

I'll need more time to consider and provide any useful feedback on your proposed counter changes, but at a high level, I'm very much in favor of simplifying the counter scheme for now. If you think your new way is viable, please go ahead.

@nikomatsakis (Member Author):

@cuviper OK -- so I pushed a series of commits that switch over (and document) a very simple scheme wherein we increment a counter each time there is new work (and similarly when a thread goes to sleep). I also include an argument for why the scheme is deadlock free.

Finally, I converted to a 32-bit (or 64-bit) counter using AtomicUsize. I gave 10 bits to the thread counts and used the remaining bits (either 12 or 44, depending on architecture) for the jobs-event-counter.
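
As a rough illustration of that packing (hypothetical constant names; rayon-core/src/sleep/counters.rs is the authoritative layout), the thread counts can live in the low bits and the JEC in the remaining high bits, so a wrapping add of one JEC unit overflows off the top of the word without disturbing the thread counts:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const THREADS_BITS: usize = 10;
const THREADS_MAX: usize = (1 << THREADS_BITS) - 1;

const SLEEPING_SHIFT: usize = 0;
const IDLE_SHIFT: usize = THREADS_BITS;
// The JEC gets whatever is left: 12 bits on 32-bit targets, 44 on 64-bit.
const JEC_SHIFT: usize = 2 * THREADS_BITS;

const ONE_SLEEPING: usize = 1 << SLEEPING_SHIFT;
const ONE_IDLE: usize = 1 << IDLE_SHIFT;
const ONE_JEC: usize = 1 << JEC_SHIFT;

fn sleeping_threads(word: usize) -> usize {
    (word >> SLEEPING_SHIFT) & THREADS_MAX
}

fn idle_threads(word: usize) -> usize {
    (word >> IDLE_SHIFT) & THREADS_MAX
}

fn jec(word: usize) -> usize {
    word >> JEC_SHIFT
}

fn main() {
    let counters = AtomicUsize::new(0);
    counters.fetch_add(ONE_IDLE, Ordering::SeqCst);     // bump the idle-thread count
    counters.fetch_add(ONE_SLEEPING, Ordering::SeqCst); // bump the sleeping-thread count
    // Bump the JEC when new work is posted; fetch_add wraps, and because the
    // JEC occupies the most-significant bits, wrapping cannot corrupt the
    // thread counts below it.
    let word = counters.fetch_add(ONE_JEC, Ordering::SeqCst) + ONE_JEC;
    println!(
        "sleeping={} idle={} jec={}",
        sleeping_threads(word),
        idle_threads(word),
        jec(word)
    );
}
```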

I think the history is now getting fairly messy, so it'd probably be good for me to do some squashing.

Regarding the RFC, I intend to edit it at some point to defer some of the grungy implementation details to this PR.

I was thinking it'd be good for us to talk also about what runtime measurements we want to be doing. For one thing, I should run some of our local demos and replot the results versus master, I suppose.

@cuviper (Member) commented May 6, 2020

I ran the demo benchmarks on Fedora 32, Ryzen 7 3800X (8 cores, 16 threads):

cargo benchcmp

$ TERM=xterm cargo benchcmp bench.master bench.rfc5
 name                                                                bench.master ns/iter   bench.rfc5 ns/iter     diff ns/iter   diff %  speedup
 factorial::factorial_iterator                                       15,019,860             14,981,310                  -38,550   -0.26%   x 1.00
 factorial::factorial_join                                           1,761,070              1,834,386                    73,316    4.16%   x 0.96
 factorial::factorial_par_iter                                       1,780,822              1,858,901                    78,079    4.38%   x 0.96
 factorial::factorial_recursion                                      2,352,684              2,315,756                   -36,928   -1.57%   x 1.02
 fibonacci::fibonacci_iterative                                      4                      4                                 0    0.00%   x 1.00
 fibonacci::fibonacci_join_1_2                                       8,251,419              48,123,890               39,872,471  483.22%   x 0.17
 fibonacci::fibonacci_join_2_1                                       8,512,663              53,287,189               44,774,526  525.98%   x 0.16
 fibonacci::fibonacci_recursive                                      9,102,656              9,555,715                   453,059    4.98%   x 0.95
 fibonacci::fibonacci_split_iterative                                44,577                 115,697                      71,120  159.54%   x 0.39
 fibonacci::fibonacci_split_recursive                                972,756                1,048,653                    75,897    7.80%   x 0.93
 find::size1::parallel_find_common                                   7,738                  8,735                           997   12.88%   x 0.89
 find::size1::parallel_find_first                                    6,609                  8,234                         1,625   24.59%   x 0.80
 find::size1::parallel_find_last                                     737,346                794,032                      56,686    7.69%   x 0.93
 find::size1::parallel_find_middle                                   469,103                468,806                        -297   -0.06%   x 1.00
 find::size1::parallel_find_missing                                  858,366                918,171                      59,805    6.97%   x 0.93
 find::size1::serial_find_common                                     1,656                  1,454                          -202  -12.20%   x 1.14
 find::size1::serial_find_first                                      0                      0                                 0     NaN%    x NaN
 find::size1::serial_find_last                                       4,244,938              4,256,404                    11,466    0.27%   x 1.00
 find::size1::serial_find_middle                                     2,824,711              2,819,731                    -4,980   -0.18%   x 1.00
 find::size1::serial_find_missing                                    4,284,326              4,236,018                   -48,308   -1.13%   x 1.01
 join_microbench::increment_all                                      46,730                 83,168                       36,438   77.98%   x 0.56
 join_microbench::increment_all_atomized                             448,542                1,757,363                 1,308,821  291.79%   x 0.26
 join_microbench::increment_all_max                                  55,702                 115,667                      59,965  107.65%   x 0.48
 join_microbench::increment_all_min                                  26,323                 28,125                        1,802    6.85%   x 0.94
 join_microbench::increment_all_serialized                           15,249                 14,372                         -877   -5.75%   x 1.06
 join_microbench::join_recursively                                   186,676                1,006,549                   819,873  439.20%   x 0.19
 life::bench::generations                                            54,585,512             53,882,103                 -703,409   -1.29%   x 1.01
 life::bench::par_bridge_generations                                 249,660,550            249,129,264                -531,286   -0.21%   x 1.00
 life::bench::par_iter_generations                                   39,055,642             37,692,462               -1,363,180   -3.49%   x 1.04
 map_collect::i_mod_10_to_i::with_collect                            4,229,609              4,877,449                   647,840   15.32%   x 0.87
 map_collect::i_mod_10_to_i::with_fold                               756,392                823,833                      67,441    8.92%   x 0.92
 map_collect::i_mod_10_to_i::with_fold_vec                           848,472                976,169                     127,697   15.05%   x 0.87
 map_collect::i_mod_10_to_i::with_linked_list_collect                10,974,249             12,807,461                1,833,212   16.70%   x 0.86
 map_collect::i_mod_10_to_i::with_linked_list_collect_vec            4,186,115              4,657,326                   471,211   11.26%   x 0.90
 map_collect::i_mod_10_to_i::with_linked_list_collect_vec_sized      4,183,000              4,685,410                   502,410   12.01%   x 0.89
 map_collect::i_mod_10_to_i::with_linked_list_map_reduce_vec_sized   4,158,983              4,578,303                   419,320   10.08%   x 0.91
 map_collect::i_mod_10_to_i::with_mutex                              60,831,368             62,258,970                1,427,602    2.35%   x 0.98
 map_collect::i_mod_10_to_i::with_mutex_vec                          10,288,376             10,456,321                  167,945    1.63%   x 0.98
 map_collect::i_mod_10_to_i::with_vec_vec_sized                      4,250,128              4,768,762                   518,634   12.20%   x 0.89
 map_collect::i_to_i::with_collect                                   7,416,966              7,799,783                   382,817    5.16%   x 0.95
 map_collect::i_to_i::with_fold                                      24,292,866             23,811,604                 -481,262   -1.98%   x 1.02
 map_collect::i_to_i::with_fold_vec                                  23,693,837             22,730,133                 -963,704   -4.07%   x 1.04
 map_collect::i_to_i::with_linked_list_collect                       13,364,740             14,965,954                1,601,214   11.98%   x 0.89
 map_collect::i_to_i::with_linked_list_collect_vec                   13,340,508             13,559,288                  218,780    1.64%   x 0.98
 map_collect::i_to_i::with_linked_list_collect_vec_sized             7,409,882              7,970,089                   560,207    7.56%   x 0.93
 map_collect::i_to_i::with_linked_list_map_reduce_vec_sized          7,316,685              7,924,447                   607,762    8.31%   x 0.92
 map_collect::i_to_i::with_mutex                                     113,630,171            111,344,803              -2,285,368   -2.01%   x 1.02
 map_collect::i_to_i::with_mutex_vec                                 39,554,842             39,194,169                 -360,673   -0.91%   x 1.01
 map_collect::i_to_i::with_vec_vec_sized                             7,235,517              8,054,156                   818,639   11.31%   x 0.90
 matmul::bench::bench_matmul_strassen                                3,574,132              6,477,846                 2,903,714   81.24%   x 0.55
 mergesort::bench::merge_sort_par_bench                              2,344,821              3,411,622                 1,066,801   45.50%   x 0.69
 mergesort::bench::merge_sort_seq_bench                              18,415,466             18,665,348                  249,882    1.36%   x 0.99
 nbody::bench::nbody_par_bridge                                      2,031,881              2,098,885                    67,004    3.30%   x 0.97
 nbody::bench::nbody_par_iter                                        2,070,939              2,143,938                    72,999    3.52%   x 0.97
 nbody::bench::nbody_parreduce                                       9,796,154              12,809,299                3,013,145   30.76%   x 0.76
 nbody::bench::nbody_seq                                             29,243,640             29,263,390                   19,750    0.07%   x 1.00
 pythagoras::euclid_faux_serial                                      31,540,467             31,044,935                 -495,532   -1.57%   x 1.02
 pythagoras::euclid_parallel_full                                    11,938,538             39,510,752               27,572,214  230.95%   x 0.30
 pythagoras::euclid_parallel_one                                     2,995,815              3,412,993                   417,178   13.93%   x 0.88
 pythagoras::euclid_parallel_outer                                   2,920,803              3,320,915                   400,112   13.70%   x 0.88
 pythagoras::euclid_parallel_weightless                              2,967,928              3,425,277                   457,349   15.41%   x 0.87
 pythagoras::euclid_serial                                           29,103,578             28,993,565                 -110,013   -0.38%   x 1.00
 quicksort::bench::quick_sort_par_bench                              6,277,953              6,101,155                  -176,798   -2.82%   x 1.03
 quicksort::bench::quick_sort_seq_bench                              29,361,464             28,993,169                 -368,295   -1.25%   x 1.01
 quicksort::bench::quick_sort_splitter                               6,372,673              6,460,001                    87,328    1.37%   x 0.99
 sieve::bench::sieve_chunks                                          4,067,397              3,026,409                -1,040,988  -25.59%   x 1.34
 sieve::bench::sieve_parallel                                        721,531                739,077                      17,546    2.43%   x 0.98
 sieve::bench::sieve_serial                                          4,613,450              3,463,130                -1,150,320  -24.93%   x 1.33
 sort::demo_merge_sort_ascending                                     218,644 (1829 MB/s)    123,819 (3230 MB/s)         -94,825  -43.37%   x 1.77
 sort::demo_merge_sort_big                                           7,058,215 (906 MB/s)   5,902,840 (1084 MB/s)    -1,155,375  -16.37%   x 1.20
 sort::demo_merge_sort_descending                                    183,722 (2177 MB/s)    226,363 (1767 MB/s)          42,641   23.21%   x 0.81
 sort::demo_merge_sort_mostly_ascending                              292,048 (1369 MB/s)    373,606 (1070 MB/s)          81,558   27.93%   x 0.78
 sort::demo_merge_sort_mostly_descending                             309,633 (1291 MB/s)    359,854 (1111 MB/s)          50,221   16.22%   x 0.86
 sort::demo_merge_sort_random                                        1,393,764 (286 MB/s)   1,276,739 (313 MB/s)       -117,025   -8.40%   x 1.09
 sort::demo_merge_sort_strings                                       4,421,584 (180 MB/s)   4,114,631 (194 MB/s)       -306,953   -6.94%   x 1.07
 sort::demo_quick_sort_big                                           3,310,727 (1933 MB/s)  3,855,372 (1660 MB/s)       544,645   16.45%   x 0.86
 sort::demo_quick_sort_mostly_ascending                              12,703,955 (31 MB/s)   11,501,109 (34 MB/s)     -1,202,846   -9.47%   x 1.10
 sort::demo_quick_sort_mostly_descending                             11,702,527 (34 MB/s)   6,906,252 (57 MB/s)      -4,796,275  -40.98%   x 1.69
 sort::demo_quick_sort_random                                        731,322 (546 MB/s)     980,309 (408 MB/s)          248,987   34.05%   x 0.75
 sort::demo_quick_sort_strings                                       2,340,535 (341 MB/s)   2,957,180 (270 MB/s)        616,645   26.35%   x 0.79
 sort::par_sort_ascending                                            51,688 (7738 MB/s)     78,275 (5110 MB/s)           26,587   51.44%   x 0.66
 sort::par_sort_big                                                  3,976,327 (1609 MB/s)  4,172,850 (1533 MB/s)       196,523    4.94%   x 0.95
 sort::par_sort_descending                                           133,710 (2991 MB/s)    169,129 (2365 MB/s)          35,419   26.49%   x 0.79
 sort::par_sort_expensive                                            21,147,049 (18 MB/s)   23,026,464 (17 MB/s)      1,879,415    8.89%   x 0.92
 sort::par_sort_mostly_ascending                                     271,652 (1472 MB/s)    361,887 (1105 MB/s)          90,235   33.22%   x 0.75
 sort::par_sort_mostly_descending                                    301,437 (1326 MB/s)    389,612 (1026 MB/s)          88,175   29.25%   x 0.77
 sort::par_sort_random                                               429,144 (932 MB/s)     508,791 (786 MB/s)           79,647   18.56%   x 0.84
 sort::par_sort_strings                                              1,035,431 (772 MB/s)   1,121,960 (713 MB/s)         86,529    8.36%   x 0.92
 sort::par_sort_unstable_ascending                                   24,392 (16398 MB/s)    31,183 (12827 MB/s)           6,791   27.84%   x 0.78
 sort::par_sort_unstable_big                                         2,072,515 (3088 MB/s)  2,229,743 (2870 MB/s)       157,228    7.59%   x 0.93
 sort::par_sort_unstable_descending                                  36,714 (10895 MB/s)    41,928 (9540 MB/s)            5,214   14.20%   x 0.88
 sort::par_sort_unstable_expensive                                   36,029,756 (11 MB/s)   35,144,806 (11 MB/s)       -884,950   -2.46%   x 1.03
 sort::par_sort_unstable_mostly_ascending                            152,678 (2619 MB/s)    185,464 (2156 MB/s)          32,786   21.47%   x 0.82
 sort::par_sort_unstable_mostly_descending                           171,122 (2337 MB/s)    218,151 (1833 MB/s)          47,029   27.48%   x 0.78
 sort::par_sort_unstable_random                                      234,282 (1707 MB/s)    268,847 (1487 MB/s)          34,565   14.75%   x 0.87
 sort::par_sort_unstable_strings                                     1,519,262 (526 MB/s)   1,902,943 (420 MB/s)        383,681   25.25%   x 0.80
 str_split::parallel_space_char                                      301,154                300,795                        -359   -0.12%   x 1.00
 str_split::parallel_space_fn                                        193,109                233,564                      40,455   20.95%   x 0.83
 str_split::serial_space_char                                        2,109,366              2,108,578                      -788   -0.04%   x 1.00
 str_split::serial_space_fn                                          1,287,708              1,261,710                   -25,998   -2.02%   x 1.02
 str_split::serial_space_str                                         1,516,780              1,570,580                    53,800    3.55%   x 0.97
 tsp::bench::dj10                                                    14,420,407             13,779,770                 -640,637   -4.44%   x 1.05
 vec_collect::vec_i::with_collect                                    767,485                801,403                      33,918    4.42%   x 0.96
 vec_collect::vec_i::with_collect_into_vec                           778,683                800,405                      21,722    2.79%   x 0.97
 vec_collect::vec_i::with_collect_into_vec_reused                    449,923                497,694                      47,771   10.62%   x 0.90
 vec_collect::vec_i::with_fold                                       7,701,578              8,191,063                   489,485    6.36%   x 0.94
 vec_collect::vec_i::with_linked_list_collect_vec                    4,805,092              4,926,453                   121,361    2.53%   x 0.98
 vec_collect::vec_i::with_linked_list_collect_vec_sized              4,803,618              4,976,286                   172,668    3.59%   x 0.97
 vec_collect::vec_i::with_linked_list_map_reduce_vec_sized           4,882,940              5,008,937                   125,997    2.58%   x 0.97
 vec_collect::vec_i::with_vec_vec_sized                              4,621,024              4,636,061                    15,037    0.33%   x 1.00
 vec_collect::vec_i_filtered::with_collect                           5,754,776              5,810,063                    55,287    0.96%   x 0.99
 vec_collect::vec_i_filtered::with_fold                              11,631,699             11,884,988                  253,289    2.18%   x 0.98
 vec_collect::vec_i_filtered::with_linked_list_collect_vec           8,717,749              8,776,768                    59,019    0.68%   x 0.99
 vec_collect::vec_i_filtered::with_linked_list_collect_vec_sized     8,747,970              8,886,064                   138,094    1.58%   x 0.98
 vec_collect::vec_i_filtered::with_linked_list_map_reduce_vec_sized  7,439,072              7,571,097                   132,025    1.77%   x 0.98
 vec_collect::vec_i_filtered::with_vec_vec_sized                     7,260,692              7,334,486                    73,794    1.02%   x 0.99

That's generally slower by a fair amount. The worst ones are microbenchmarks, which maybe we can write off, but I want to do some profiling to see if we can regain performance anywhere...

@cuviper (Member) commented May 6, 2020

I added #[inline] to a bunch of tiny methods, as I noticed them showing up distinctly in the perf report. It's probably a codegen-units issue that these don't otherwise inline automatically. This did help the benchmarks a little, but not near enough to close the gap to master.

The biggest smoking gun BY FAR is lock xadd for increment_and_read_jobs_event_counter, accounting for well over half of the perf samples in those badly regressed benchmarks. That default measurement was for cycles, but it also shows up very high for cache-misses. This is a single point of contention for every new job across the pool, and I'm not sure how to improve this besides "don't count that!"

I think we should pare back to the simpler scheme where each time new work appears, we increment the JEC, always.

I think this is proving painful.

BTW, what happens if the JEC increment overflows into the thread counters?

@cuviper (Member) commented May 6, 2020

BTW, what happens if the JEC increment overflows into the thread counters?

Never mind, somehow I was thinking of it like a single overflow bit, but of course a wrapping add with the large ONE_JEC will still leave the lower bits untouched. It's important that the JEC occupies the most-significant bits for this to work.

@nikomatsakis (Member Author):

@cuviper well -- eliminating that "per job increment" was indeed the focus of a lot of the work I was doing. I just gave up on it because I wanted something simple, but there are definitely options here. One of them is to use a fence.

@nikomatsakis (Member Author):

Maybe it's worth exploring them. I couldn't remember how much of a perf hit the increment was.

@cuviper (Member) commented May 12, 2020

One of them is to use a fence.

I think that would help, because it wouldn't force the same cache ping pongs. Mostly, I think we want the busy hot path to avoid writing any shared state.

I couldn't remember how much of a perf hit the increment was.

The worst of my results was fibonacci::fibonacci_join_2_1 at about 6x slower, and that one lock xadd instruction accounts for 85% of its perf report! I'll grant that this benchmark isn't a realistic use of join, very much a microbenchmark, but still...
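
To make the fence idea concrete, here is a simplified sketch (hypothetical names, a single global injection queue; rayon's actual implementation in rayon-core/src/sleep is more involved) of how the hot path can avoid writing shared counters while still guaranteeing that a posted job is never stranded with every thread asleep:

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};
use std::sync::Mutex;

static SLEEPING: AtomicUsize = AtomicUsize::new(0);
static QUEUE: Mutex<Vec<u32>> = Mutex::new(Vec::new());

/// Hot path: publish a job, then (after a fence) check whether anyone needs
/// waking. No read-modify-write of the shared counters here.
fn post_job(job: u32) {
    QUEUE.lock().unwrap().push(job);
    fence(Ordering::SeqCst);
    if SLEEPING.load(Ordering::Relaxed) > 0 {
        wake_one();
    }
}

/// Slow path: announce that we are about to sleep, then look one last time.
/// The fences ensure that at least one side observes the other: either we
/// see the freshly posted job here, or the poster sees our incremented
/// SLEEPING count and wakes us.
fn try_to_sleep() -> bool {
    SLEEPING.fetch_add(1, Ordering::Relaxed);
    fence(Ordering::SeqCst);
    if !QUEUE.lock().unwrap().is_empty() {
        SLEEPING.fetch_sub(1, Ordering::Relaxed);
        return false; // go back to idle and take the job
    }
    true // really block on a condvar (elided)
}

fn wake_one() {
    // In the real scheduler this notifies a sleeping worker's condvar.
    println!("waking a sleeping worker");
}

fn main() {
    post_job(42);
    assert!(!try_to_sleep()); // the job is visible, so we do not sleep
}
```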

@nikomatsakis (Member Author):

@cuviper yeah that's no good. I will say I feel somewhat validated. I remember working very hard to avoid this increment and I was trying to decide if that was time well spent. =) Let me try to pop back to my last scheme.

@nikomatsakis (Member Author):

OK, I got as far as documenting the fence-based algorithm, but I didn't implement it yet.

@nikomatsakis (Member Author):

@cuviper OK, new scheme is implemented.

@nikomatsakis (Member Author):

One minor point is that I'm pretty sure all the other "counter" operations can be Relaxed.

@cuviper (Member) commented May 15, 2020

New results:

cargo benchcmp

 name                                                                bench.master ns/iter   bench.pr759 ns/iter    diff ns/iter   diff %  speedup 
 factorial::factorial_iterator                                       14,645,522             14,729,042                   83,520    0.57%   x 0.99 
 factorial::factorial_join                                           1,776,752              1,823,807                    47,055    2.65%   x 0.97 
 factorial::factorial_par_iter                                       1,799,777              1,841,103                    41,326    2.30%   x 0.98 
 factorial::factorial_recursion                                      2,298,423              2,335,173                    36,750    1.60%   x 0.98 
 fibonacci::fibonacci_iterative                                      4                      4                                 0    0.00%   x 1.00 
 fibonacci::fibonacci_join_1_2                                       8,096,831              8,764,526                   667,695    8.25%   x 0.92 
 fibonacci::fibonacci_join_2_1                                       8,124,152              8,878,642                   754,490    9.29%   x 0.92 
 fibonacci::fibonacci_recursive                                      9,252,351              8,255,772                  -996,579  -10.77%   x 1.12 
 fibonacci::fibonacci_split_iterative                                43,382                 75,623                       32,241   74.32%   x 0.57 
 fibonacci::fibonacci_split_recursive                                953,599                1,001,126                    47,527    4.98%   x 0.95 
 find::size1::parallel_find_common                                   7,790                  7,976                           186    2.39%   x 0.98 
 find::size1::parallel_find_first                                    6,875                  6,972                            97    1.41%   x 0.99 
 find::size1::parallel_find_last                                     734,639                761,100                      26,461    3.60%   x 0.97 
 find::size1::parallel_find_middle                                   471,355                482,455                      11,100    2.35%   x 0.98 
 find::size1::parallel_find_missing                                  893,439                882,808                     -10,631   -1.19%   x 1.01 
 find::size1::serial_find_common                                     1,711                  1,720                             9    0.53%   x 0.99 
 find::size1::serial_find_first                                      0                      0                                 0     NaN%    x NaN 
 find::size1::serial_find_last                                       4,327,097              3,827,007                  -500,090  -11.56%   x 1.13 
 find::size1::serial_find_middle                                     2,838,806              2,549,607                  -289,199  -10.19%   x 1.11 
 find::size1::serial_find_missing                                    4,285,592              3,822,936                  -462,656  -10.80%   x 1.12 
 join_microbench::increment_all                                      46,079                 63,494                       17,415   37.79%   x 0.73 
 join_microbench::increment_all_atomized                             449,315                509,033                      59,718   13.29%   x 0.88 
 join_microbench::increment_all_max                                  56,146                 78,537                       22,391   39.88%   x 0.71 
 join_microbench::increment_all_min                                  25,772                 27,209                        1,437    5.58%   x 0.95 
 join_microbench::increment_all_serialized                           14,815                 14,479                         -336   -2.27%   x 1.02 
 join_microbench::join_recursively                                   184,157                233,549                      49,392   26.82%   x 0.79 
 life::bench::generations                                            53,578,093             54,848,141                1,270,048    2.37%   x 0.98 
 life::bench::par_bridge_generations                                 247,553,309            249,109,494               1,556,185    0.63%   x 0.99 
 life::bench::par_iter_generations                                   14,137,474             39,622,421               25,484,947  180.27%   x 0.36 
 map_collect::i_mod_10_to_i::with_collect                            4,177,557              4,486,990                   309,433    7.41%   x 0.93 
 map_collect::i_mod_10_to_i::with_fold                               812,312                773,760                     -38,552   -4.75%   x 1.05 
 map_collect::i_mod_10_to_i::with_fold_vec                           854,914                868,497                      13,583    1.59%   x 0.98 
 map_collect::i_mod_10_to_i::with_linked_list_collect                11,213,958             11,873,626                  659,668    5.88%   x 0.94 
 map_collect::i_mod_10_to_i::with_linked_list_collect_vec            4,178,495              4,398,594                   220,099    5.27%   x 0.95 
 map_collect::i_mod_10_to_i::with_linked_list_collect_vec_sized      4,215,000              4,533,158                   318,158    7.55%   x 0.93 
 map_collect::i_mod_10_to_i::with_linked_list_map_reduce_vec_sized   4,251,120              4,504,432                   253,312    5.96%   x 0.94 
 map_collect::i_mod_10_to_i::with_mutex                              60,681,069             61,813,390                1,132,321    1.87%   x 0.98 
 map_collect::i_mod_10_to_i::with_mutex_vec                          10,245,921             10,332,479                   86,558    0.84%   x 0.99 
 map_collect::i_mod_10_to_i::with_vec_vec_sized                      4,284,488              4,725,665                   441,177   10.30%   x 0.91 
 map_collect::i_to_i::with_collect                                   7,220,501              7,695,906                   475,405    6.58%   x 0.94 
 map_collect::i_to_i::with_fold                                      23,748,954             23,212,807                 -536,147   -2.26%   x 1.02 
 map_collect::i_to_i::with_fold_vec                                  22,850,378             22,636,976                 -213,402   -0.93%   x 1.01 
 map_collect::i_to_i::with_linked_list_collect                       13,114,792             14,248,324                1,133,532    8.64%   x 0.92 
 map_collect::i_to_i::with_linked_list_collect_vec                   13,403,286             13,825,386                  422,100    3.15%   x 0.97 
 map_collect::i_to_i::with_linked_list_collect_vec_sized             7,061,664              7,551,996                   490,332    6.94%   x 0.94 
 map_collect::i_to_i::with_linked_list_map_reduce_vec_sized          7,565,785              7,569,768                     3,983    0.05%   x 1.00 
 map_collect::i_to_i::with_mutex                                     112,889,083            112,339,800                -549,283   -0.49%   x 1.00 
 map_collect::i_to_i::with_mutex_vec                                 38,371,320             38,761,760                  390,440    1.02%   x 0.99 
 map_collect::i_to_i::with_vec_vec_sized                             7,305,026              7,804,322                   499,296    6.83%   x 0.94 
 matmul::bench::bench_matmul_strassen                                3,434,922              4,976,249                 1,541,327   44.87%   x 0.69 
 mergesort::bench::merge_sort_par_bench                              2,338,896              3,316,543                   977,647   41.80%   x 0.71 
 mergesort::bench::merge_sort_seq_bench                              18,373,829             18,518,555                  144,726    0.79%   x 0.99 
 nbody::bench::nbody_par_bridge                                      2,025,278              2,050,761                    25,483    1.26%   x 0.99 
 nbody::bench::nbody_par_iter                                        2,072,642              2,116,542                    43,900    2.12%   x 0.98 
 nbody::bench::nbody_parreduce                                       9,973,443              10,023,190                   49,747    0.50%   x 1.00 
 nbody::bench::nbody_seq                                             28,748,923             28,727,626                  -21,297   -0.07%   x 1.00 
 pythagoras::euclid_faux_serial                                      30,444,687             29,460,324                 -984,363   -3.23%   x 1.03 
 pythagoras::euclid_parallel_full                                    11,876,560             11,803,805                  -72,755   -0.61%   x 1.01 
 pythagoras::euclid_parallel_one                                     2,995,415              2,816,807                  -178,608   -5.96%   x 1.06 
 pythagoras::euclid_parallel_outer                                   2,881,370              2,712,551                  -168,819   -5.86%   x 1.06 
 pythagoras::euclid_parallel_weightless                              3,021,678              2,834,497                  -187,181   -6.19%   x 1.07 
 pythagoras::euclid_serial                                           29,132,487             28,501,183                 -631,304   -2.17%   x 1.02 
 quicksort::bench::quick_sort_par_bench                              6,251,009              6,081,659                  -169,350   -2.71%   x 1.03 
 quicksort::bench::quick_sort_seq_bench                              29,117,448             29,485,721                  368,273    1.26%   x 0.99 
 quicksort::bench::quick_sort_splitter                               6,370,056              6,406,468                    36,412    0.57%   x 0.99 
 sieve::bench::sieve_chunks                                          4,061,561              3,055,913                -1,005,648  -24.76%   x 1.33 
 sieve::bench::sieve_parallel                                        714,706                729,639                      14,933    2.09%   x 0.98 
 sieve::bench::sieve_serial                                          4,289,540              3,511,296                  -778,244  -18.14%   x 1.22 
 sort::demo_merge_sort_ascending                                     217,069 (1842 MB/s)    231,139 (1730 MB/s)          14,070    6.48%   x 0.94 
 sort::demo_merge_sort_big                                           7,145,096 (895 MB/s)   6,819,046 (938 MB/s)       -326,050   -4.56%   x 1.05 
 sort::demo_merge_sort_descending                                    181,665 (2201 MB/s)    233,225 (1715 MB/s)          51,560   28.38%   x 0.78 
 sort::demo_merge_sort_mostly_ascending                              291,864 (1370 MB/s)    361,619 (1106 MB/s)          69,755   23.90%   x 0.81 
 sort::demo_merge_sort_mostly_descending                             311,070 (1285 MB/s)    300,375 (1331 MB/s)         -10,695   -3.44%   x 1.04 
 sort::demo_merge_sort_random                                        1,436,608 (278 MB/s)   1,243,832 (321 MB/s)       -192,776  -13.42%   x 1.15 
 sort::demo_merge_sort_strings                                       4,386,624 (182 MB/s)   4,069,567 (196 MB/s)       -317,057   -7.23%   x 1.08 
 sort::demo_quick_sort_big                                           3,316,225 (1929 MB/s)  3,032,082 (2110 MB/s)      -284,143   -8.57%   x 1.09 
 sort::demo_quick_sort_mostly_ascending                              12,654,626 (31 MB/s)   11,562,529 (34 MB/s)     -1,092,097   -8.63%   x 1.09 
 sort::demo_quick_sort_mostly_descending                             11,114,743 (35 MB/s)   6,704,983 (59 MB/s)      -4,409,760  -39.67%   x 1.66 
 sort::demo_quick_sort_random                                        724,503 (552 MB/s)     993,356 (402 MB/s)          268,853   37.11%   x 0.73 
 sort::demo_quick_sort_strings                                       2,327,649 (343 MB/s)   2,948,341 (271 MB/s)        620,692   26.67%   x 0.79 
 sort::par_sort_ascending                                            50,540 (7914 MB/s)     76,347 (5239 MB/s)           25,807   51.06%   x 0.66 
 sort::par_sort_big                                                  3,769,812 (1697 MB/s)  4,111,607 (1556 MB/s)       341,795    9.07%   x 0.92 
 sort::par_sort_descending                                           128,036 (3124 MB/s)    163,724 (2443 MB/s)          35,688   27.87%   x 0.78 
 sort::par_sort_expensive                                            20,851,778 (19 MB/s)   23,503,317 (17 MB/s)      2,651,539   12.72%   x 0.89 
 sort::par_sort_mostly_ascending                                     268,649 (1488 MB/s)    356,279 (1122 MB/s)          87,630   32.62%   x 0.75 
 sort::par_sort_mostly_descending                                    292,990 (1365 MB/s)    392,945 (1017 MB/s)          99,955   34.12%   x 0.75 
 sort::par_sort_random                                               422,226 (947 MB/s)     505,128 (791 MB/s)           82,902   19.63%   x 0.84 
 sort::par_sort_strings                                              992,458 (806 MB/s)     1,112,891 (718 MB/s)        120,433   12.13%   x 0.89 
 sort::par_sort_unstable_ascending                                   23,190 (17248 MB/s)    22,985 (17402 MB/s)            -205   -0.88%   x 1.01 
 sort::par_sort_unstable_big                                         2,055,305 (3113 MB/s)  2,091,505 (3059 MB/s)        36,200    1.76%   x 0.98 
 sort::par_sort_unstable_descending                                  36,307 (11017 MB/s)    34,672 (11536 MB/s)          -1,635   -4.50%   x 1.05 
 sort::par_sort_unstable_expensive                                   36,380,942 (10 MB/s)   33,619,394 (11 MB/s)     -2,761,548   -7.59%   x 1.08 
 sort::par_sort_unstable_mostly_ascending                            150,884 (2651 MB/s)    186,718 (2142 MB/s)          35,834   23.75%   x 0.81 
 sort::par_sort_unstable_mostly_descending                           166,225 (2406 MB/s)    218,452 (1831 MB/s)          52,227   31.42%   x 0.76 
 sort::par_sort_unstable_random                                      229,825 (1740 MB/s)    272,936 (1465 MB/s)          43,111   18.76%   x 0.84 
 sort::par_sort_unstable_strings                                     1,526,156 (524 MB/s)   1,889,335 (423 MB/s)        363,179   23.80%   x 0.81 
 str_split::parallel_space_char                                      267,500                281,306                      13,806    5.16%   x 0.95 
 str_split::parallel_space_fn                                        218,197                210,287                      -7,910   -3.63%   x 1.04 
 str_split::serial_space_char                                        2,146,993              2,047,415                   -99,578   -4.64%   x 1.05 
 str_split::serial_space_fn                                          1,304,145              1,363,595                    59,450    4.56%   x 0.96 
 str_split::serial_space_str                                         1,533,204              1,466,027                   -67,177   -4.38%   x 1.05 
 tsp::bench::dj10                                                    13,718,929             14,202,541                  483,612    3.53%   x 0.97 
 vec_collect::vec_i::with_collect                                    749,277                775,878                      26,601    3.55%   x 0.97 
 vec_collect::vec_i::with_collect_into_vec                           759,869                769,618                       9,749    1.28%   x 0.99 
 vec_collect::vec_i::with_collect_into_vec_reused                    441,611                454,327                      12,716    2.88%   x 0.97 
 vec_collect::vec_i::with_fold                                       7,691,437              7,909,820                   218,383    2.84%   x 0.97 
 vec_collect::vec_i::with_linked_list_collect_vec                    4,667,130              4,851,748                   184,618    3.96%   x 0.96 
 vec_collect::vec_i::with_linked_list_collect_vec_sized              4,776,065              5,074,214                   298,149    6.24%   x 0.94 
 vec_collect::vec_i::with_linked_list_map_reduce_vec_sized           4,781,731              4,998,964                   217,233    4.54%   x 0.96 
 vec_collect::vec_i::with_vec_vec_sized                              4,581,574              5,255,446                   673,872   14.71%   x 0.87 
 vec_collect::vec_i_filtered::with_collect                           5,732,476              5,816,785                    84,309    1.47%   x 0.99 
 vec_collect::vec_i_filtered::with_fold                              11,416,172             12,013,625                  597,453    5.23%   x 0.95 
 vec_collect::vec_i_filtered::with_linked_list_collect_vec           8,242,170              8,739,936                   497,766    6.04%   x 0.94 
 vec_collect::vec_i_filtered::with_linked_list_collect_vec_sized     8,400,793              8,849,432                   448,639    5.34%   x 0.95 
 vec_collect::vec_i_filtered::with_linked_list_map_reduce_vec_sized  7,439,237              7,538,204                    98,967    1.33%   x 0.99 
 vec_collect::vec_i_filtered::with_vec_vec_sized                     7,259,405              7,279,394                    19,989    0.28%   x 1.00 

This makes the "idle state" private to the sleep module.
@cuviper (Member) commented Aug 13, 2020

Here are some more measurements to see if we're fixing what we intended. We have two demos that report CPU time, noop and life, and I have four machines at hand to compare with, in increasing size:

  • Intel i7-7600U: 2 cores, 4 threads, running RHEL 8.2
  • Intel i7-7700K: 4 cores, 8 threads, running Fedora 32
  • AMD Ryzen 7 3800X: 8 cores, 16 threads, running Fedora 32
  • Intel Xeon E5-2697 v3: 2 nodes, 28 cores, 56 threads, running Fedora 31

I've built with rustc 1.47.0-nightly (e5e33ebd2 2020-08-11), transferring the same binaries to all systems.

 noop cpu%              master    pr746
 Intel i7-7600U         15.6%     5.4%
 Intel i7-7700K         58.2%     5.2%
 AMD Ryzen 7 3800X      52.5%     2.2%
 Intel Xeon E5-2697 v3  2811.0%   10.2%

 life play cpu%         master serial  master parallel  pr746 serial  pr746 parallel
 Intel i7-7600U         27.4%          53.1%            27.1%         49.1%
 Intel i7-7700K         31.7%          92.0%            31.1%         70.8%
 AMD Ryzen 7 3800X      6.5%           56.8%            6.6%          30.8%
 Intel Xeon E5-2697 v3  19.7%          2052.2%          20.3%         601.1%

So noop looks great! The life results are also improved, though still bad compared to the serial runs. In both, the benefit is greater as we scale up the number of threads, which makes sense, since we'd be wasting less time on wakeup storms of work stealing.

@cuviper (Member) commented Aug 13, 2020

New benchmark results, on the 3800X:

cargo benchcmp

 name                                                                bench.master ns/iter   bench.pr746 ns/iter    diff ns/iter   diff %  speedup 
 factorial::factorial_iterator                                       14,531,786             14,470,046                  -61,740   -0.42%   x 1.00 
 factorial::factorial_join                                           1,752,942              1,847,871                    94,929    5.42%   x 0.95 
 factorial::factorial_par_iter                                       1,742,709              1,850,616                   107,907    6.19%   x 0.94 
 factorial::factorial_recursion                                      2,243,014              2,332,252                    89,238    3.98%   x 0.96 
 fibonacci::fibonacci_iterative                                      3                      4                                 1   33.33%   x 0.75 
 fibonacci::fibonacci_join_1_2                                       8,189,158              8,202,075                    12,917    0.16%   x 1.00 
 fibonacci::fibonacci_join_2_1                                       8,146,982              8,270,735                   123,753    1.52%   x 0.99 
 fibonacci::fibonacci_recursive                                      8,705,581              7,825,011                  -880,570  -10.12%   x 1.11 
 fibonacci::fibonacci_split_iterative                                44,621                 74,792                       30,171   67.62%   x 0.60 
 fibonacci::fibonacci_split_recursive                                949,586                991,311                      41,725    4.39%   x 0.96 
 find::size1::parallel_find_common                                   7,588                  7,878                           290    3.82%   x 0.96 
 find::size1::parallel_find_first                                    6,578                  7,172                           594    9.03%   x 0.92 
 find::size1::parallel_find_last                                     724,471                735,842                      11,371    1.57%   x 0.98 
 find::size1::parallel_find_middle                                   469,443                462,516                      -6,927   -1.48%   x 1.01 
 find::size1::parallel_find_missing                                  842,788                874,017                      31,229    3.71%   x 0.96 
 find::size1::serial_find_common                                     1,720                  1,387                          -333  -19.36%   x 1.24 
 find::size1::serial_find_first                                      0                      0                                 0     NaN%    x NaN 
 find::size1::serial_find_last                                       4,278,329              4,286,871                     8,542    0.20%   x 1.00 
 find::size1::serial_find_middle                                     2,805,498              2,806,315                       817    0.03%   x 1.00 
 find::size1::serial_find_missing                                    4,206,039              4,086,467                  -119,572   -2.84%   x 1.03 
 join_microbench::increment_all                                      46,801                 62,290                       15,489   33.10%   x 0.75 
 join_microbench::increment_all_atomized                             433,747                496,508                      62,761   14.47%   x 0.87 
 join_microbench::increment_all_max                                  55,392                 76,750                       21,358   38.56%   x 0.72 
 join_microbench::increment_all_min                                  26,200                 27,059                          859    3.28%   x 0.97 
 join_microbench::increment_all_serialized                           15,316                 14,445                         -871   -5.69%   x 1.06 
 join_microbench::join_recursively                                   180,683                227,287                      46,604   25.79%   x 0.79 
 life::bench::generations                                            52,596,441             53,611,895                1,015,454    1.93%   x 0.98 
 life::bench::par_bridge_generations                                 248,315,511            246,884,207              -1,431,304   -0.58%   x 1.01 
 life::bench::par_iter_generations                                   11,884,127             12,937,795                1,053,668    8.87%   x 0.92 
 map_collect::i_mod_10_to_i::with_collect                            4,188,679              4,429,994                   241,315    5.76%   x 0.95 
 map_collect::i_mod_10_to_i::with_fold                               711,377                728,586                      17,209    2.42%   x 0.98 
 map_collect::i_mod_10_to_i::with_fold_vec                           821,194                857,224                      36,030    4.39%   x 0.96 
 map_collect::i_mod_10_to_i::with_linked_list_collect                10,809,514             11,675,367                  865,853    8.01%   x 0.93 
 map_collect::i_mod_10_to_i::with_linked_list_collect_vec            4,110,314              4,361,166                   250,852    6.10%   x 0.94 
 map_collect::i_mod_10_to_i::with_linked_list_collect_vec_sized      4,191,064              4,379,095                   188,031    4.49%   x 0.96 
 map_collect::i_mod_10_to_i::with_linked_list_map_reduce_vec_sized   4,181,519              4,389,444                   207,925    4.97%   x 0.95 
 map_collect::i_mod_10_to_i::with_mutex                              56,567,441             56,735,622                  168,181    0.30%   x 1.00 
 map_collect::i_mod_10_to_i::with_mutex_vec                          10,320,281             10,352,213                   31,932    0.31%   x 1.00 
 map_collect::i_mod_10_to_i::with_vec_vec_sized                      4,397,088              4,636,135                   239,047    5.44%   x 0.95 
 map_collect::i_to_i::with_collect                                   7,160,814              7,450,344                   289,530    4.04%   x 0.96 
 map_collect::i_to_i::with_fold                                      23,905,518             24,094,097                  188,579    0.79%   x 0.99 
 map_collect::i_to_i::with_fold_vec                                  23,502,049             23,232,344                 -269,705   -1.15%   x 1.01 
 map_collect::i_to_i::with_linked_list_collect                       13,011,212             13,612,633                  601,421    4.62%   x 0.96 
 map_collect::i_to_i::with_linked_list_collect_vec                   13,035,005             12,919,410                 -115,595   -0.89%   x 1.01 
 map_collect::i_to_i::with_linked_list_collect_vec_sized             6,904,876              7,422,966                   518,090    7.50%   x 0.93 
 map_collect::i_to_i::with_linked_list_map_reduce_vec_sized          7,093,959              7,569,619                   475,660    6.71%   x 0.94 
 map_collect::i_to_i::with_mutex                                     112,945,254            113,298,748                 353,494    0.31%   x 1.00 
 map_collect::i_to_i::with_mutex_vec                                 36,613,332             36,669,144                   55,812    0.15%   x 1.00 
 map_collect::i_to_i::with_vec_vec_sized                             7,198,322              7,679,260                   480,938    6.68%   x 0.94 
 matmul::bench::bench_matmul_strassen                                3,535,206              4,813,317                 1,278,111   36.15%   x 0.73 
 mergesort::bench::merge_sort_par_bench                              2,268,185              3,442,401                 1,174,216   51.77%   x 0.66 
 mergesort::bench::merge_sort_seq_bench                              18,643,242             18,334,746                 -308,496   -1.65%   x 1.02 
 nbody::bench::nbody_par_bridge                                      2,005,564              2,021,870                    16,306    0.81%   x 0.99 
 nbody::bench::nbody_par_iter                                        2,010,222              2,056,137                    45,915    2.28%   x 0.98 
 nbody::bench::nbody_parreduce                                       9,925,492              10,276,874                  351,382    3.54%   x 0.97 
 nbody::bench::nbody_seq                                             29,206,768             28,258,410                 -948,358   -3.25%   x 1.03 
 pythagoras::euclid_faux_serial                                      30,064,300             28,485,534               -1,578,766   -5.25%   x 1.06 
 pythagoras::euclid_parallel_full                                    11,812,951             12,034,041                  221,090    1.87%   x 0.98 
 pythagoras::euclid_parallel_one                                     2,889,750              2,802,195                   -87,555   -3.03%   x 1.03 
 pythagoras::euclid_parallel_outer                                   2,845,040              2,693,966                  -151,074   -5.31%   x 1.06 
 pythagoras::euclid_parallel_weightless                              2,895,666              2,774,293                  -121,373   -4.19%   x 1.04 
 pythagoras::euclid_serial                                           27,304,977             27,234,940                  -70,037   -0.26%   x 1.00 
 quicksort::bench::quick_sort_par_bench                              5,997,710              6,262,990                   265,280    4.42%   x 0.96 
 quicksort::bench::quick_sort_seq_bench                              29,089,837             28,331,271                 -758,566   -2.61%   x 1.03 
 quicksort::bench::quick_sort_splitter                               6,527,590              6,396,389                  -131,201   -2.01%   x 1.02 
 sieve::bench::sieve_chunks                                          3,913,387              3,945,132                    31,745    0.81%   x 0.99 
 sieve::bench::sieve_parallel                                        901,661                756,036                    -145,625  -16.15%   x 1.19 
 sieve::bench::sieve_serial                                          4,648,983              4,329,193                  -319,790   -6.88%   x 1.07 
 sort::demo_merge_sort_ascending                                     213,965 (1869 MB/s)    151,896 (2633 MB/s)         -62,069  -29.01%   x 1.41 
 sort::demo_merge_sort_big                                           6,154,018 (1039 MB/s)  5,177,745 (1236 MB/s)      -976,273  -15.86%   x 1.19 
 sort::demo_merge_sort_descending                                    179,538 (2227 MB/s)    209,621 (1908 MB/s)          30,083   16.76%   x 0.86 
 sort::demo_merge_sort_mostly_ascending                              281,592 (1420 MB/s)    276,864 (1444 MB/s)          -4,728   -1.68%   x 1.02 
 sort::demo_merge_sort_mostly_descending                             307,374 (1301 MB/s)    296,143 (1350 MB/s)         -11,231   -3.65%   x 1.04 
 sort::demo_merge_sort_random                                        1,384,550 (288 MB/s)   1,238,308 (323 MB/s)       -146,242  -10.56%   x 1.12 
 sort::demo_merge_sort_strings                                       4,304,478 (185 MB/s)   3,940,837 (203 MB/s)       -363,641   -8.45%   x 1.09 
 sort::demo_quick_sort_big                                           3,249,416 (1969 MB/s)  2,973,821 (2152 MB/s)      -275,595   -8.48%   x 1.09 
 sort::demo_quick_sort_mostly_ascending                              12,468,065 (32 MB/s)   11,528,808 (34 MB/s)       -939,257   -7.53%   x 1.08 
 sort::demo_quick_sort_mostly_descending                             9,958,077 (40 MB/s)    6,686,446 (59 MB/s)      -3,271,631  -32.85%   x 1.49 
 sort::demo_quick_sort_random                                        718,416 (556 MB/s)     962,017 (415 MB/s)          243,601   33.91%   x 0.75 
 sort::demo_quick_sort_strings                                       2,370,272 (337 MB/s)   3,134,347 (255 MB/s)        764,075   32.24%   x 0.76 
 sort::par_sort_ascending                                            50,263 (7958 MB/s)     77,341 (5171 MB/s)           27,078   53.87%   x 0.65 
 sort::par_sort_big                                                  3,771,808 (1696 MB/s)  3,887,269 (1646 MB/s)       115,461    3.06%   x 0.97 
 sort::par_sort_descending                                           126,088 (3172 MB/s)    161,427 (2477 MB/s)          35,339   28.03%   x 0.78 
 sort::par_sort_expensive                                            21,029,097 (19 MB/s)   23,310,403 (17 MB/s)      2,281,306   10.85%   x 0.90 
 sort::par_sort_mostly_ascending                                     267,716 (1494 MB/s)    338,748 (1180 MB/s)          71,032   26.53%   x 0.79 
 sort::par_sort_mostly_descending                                    291,280 (1373 MB/s)    368,837 (1084 MB/s)          77,557   26.63%   x 0.79 
 sort::par_sort_random                                               435,852 (917 MB/s)     506,498 (789 MB/s)           70,646   16.21%   x 0.86 
 sort::par_sort_strings                                              978,728 (817 MB/s)     1,055,452 (757 MB/s)         76,724    7.84%   x 0.93 
 sort::par_sort_unstable_ascending                                   23,812 (16798 MB/s)    23,559 (16978 MB/s)            -253   -1.06%   x 1.01 
 sort::par_sort_unstable_big                                         1,579,825 (4051 MB/s)  1,851,460 (3456 MB/s)       271,635   17.19%   x 0.85 
 sort::par_sort_unstable_descending                                  36,582 (10934 MB/s)    34,996 (11429 MB/s)          -1,586   -4.34%   x 1.05 
 sort::par_sort_unstable_expensive                                   36,359,397 (11 MB/s)   35,260,026 (11 MB/s)     -1,099,371   -3.02%   x 1.03 
 sort::par_sort_unstable_mostly_ascending                            148,730 (2689 MB/s)    184,191 (2171 MB/s)          35,461   23.84%   x 0.81 
 sort::par_sort_unstable_mostly_descending                           165,326 (2419 MB/s)    213,143 (1876 MB/s)          47,817   28.92%   x 0.78 
 sort::par_sort_unstable_random                                      226,648 (1764 MB/s)    259,429 (1541 MB/s)          32,781   14.46%   x 0.87 
 sort::par_sort_unstable_strings                                     1,433,422 (558 MB/s)   1,957,598 (408 MB/s)        524,176   36.57%   x 0.73 
 str_split::parallel_space_char                                      256,840                278,045                      21,205    8.26%   x 0.92 
 str_split::parallel_space_fn                                        180,433                194,550                      14,117    7.82%   x 0.93 
 str_split::serial_space_char                                        1,888,174              1,952,812                    64,638    3.42%   x 0.97 
 str_split::serial_space_fn                                          1,257,385              1,326,921                    69,536    5.53%   x 0.95 
 str_split::serial_space_str                                         1,428,556              1,459,121                    30,565    2.14%   x 0.98 
 tsp::bench::dj10                                                    14,164,378             14,229,363                   64,985    0.46%   x 1.00 
 vec_collect::vec_i::with_collect                                    705,866                733,194                      27,328    3.87%   x 0.96 
 vec_collect::vec_i::with_collect_into_vec                           709,468                728,653                      19,185    2.70%   x 0.97 
 vec_collect::vec_i::with_collect_into_vec_reused                    392,204                409,836                      17,632    4.50%   x 0.96 
 vec_collect::vec_i::with_fold                                       7,551,684              8,263,352                   711,668    9.42%   x 0.91 
 vec_collect::vec_i::with_linked_list_collect_vec                    4,585,360              4,764,996                   179,636    3.92%   x 0.96 
 vec_collect::vec_i::with_linked_list_collect_vec_sized              4,755,684              4,915,955                   160,271    3.37%   x 0.97 
 vec_collect::vec_i::with_linked_list_map_reduce_vec_sized           4,726,986              4,907,798                   180,812    3.83%   x 0.96 
 vec_collect::vec_i::with_vec_vec_sized                              4,613,576              4,698,521                    84,945    1.84%   x 0.98 
 vec_collect::vec_i_filtered::with_collect                           5,640,776              5,730,659                    89,883    1.59%   x 0.98 
 vec_collect::vec_i_filtered::with_fold                              11,139,073             11,572,018                  432,945    3.89%   x 0.96 
 vec_collect::vec_i_filtered::with_linked_list_collect_vec           8,220,031              8,613,699                   393,668    4.79%   x 0.95 
 vec_collect::vec_i_filtered::with_linked_list_collect_vec_sized     8,380,939              8,832,856                   451,917    5.39%   x 0.95 
 vec_collect::vec_i_filtered::with_linked_list_map_reduce_vec_sized  7,422,325              7,459,916                    37,591    0.51%   x 0.99 
 vec_collect::vec_i_filtered::with_vec_vec_sized                     7,293,907              7,363,695                    69,788    0.96%   x 0.99 

Definitely some loss, but I think that's expected: thread ramp-up now takes longer, which doesn't bode well for benchmarks that only take a few milliseconds or less in the first place.
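To see why a longer ramp-up hits millisecond-scale benchmarks hardest, here is a hypothetical illustration (not one of the benchmarks above): the same tiny workload is timed once while the workers are cold and once while they are warm. With sleepier workers, the cold run pays a wake-up latency that is a large fraction of the total.

```rust
// Hypothetical cold-vs-warm timing of a ~millisecond parallel workload.
use rayon::prelude::*;
use std::time::Instant;

fn tiny_workload() -> u64 {
    (0..1_000_000u64)
        .into_par_iter()
        .map(|i| i.wrapping_mul(i))
        .sum()
}

fn main() {
    // Cold run: includes starting the global pool and waking sleeping workers.
    let t = Instant::now();
    let _ = tiny_workload();
    println!("cold: {:?}", t.elapsed());

    // Warm run: workers are already awake, so wake-up latency mostly disappears.
    let t = Instant::now();
    let _ = tiny_workload();
    println!("warm: {:?}", t.elapsed());
}
```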

@cuviper
Member

cuviper commented Aug 13, 2020

Here are results of running the demo benchmarks (i.e. rayon-demo foo bench):

Benchmark          master          pr746
life               13344859 ns     15918183 ns
life --size 1000   162352609 ns    166974902 ns
nbody              294959662 ns    298881450 ns
sieve              136603368 ns    136477603 ns
mergesort          1.4790496 s     1.4834818 s
quicksort          2.6739895 s     2.6493437 s
tsp dj10           4.383455 ms     4.953085 ms
tsp dj15           545.585831 ms   540.46472 ms

These are raw numbers, not normalized / averaged / etc., but it looks reasonable to me.

@cuviper
Member

cuviper commented Aug 13, 2020

bors r+

bors bot merged commit 09428ec into rayon-rs:master on Aug 13, 2020
multun mentioned this pull request on Aug 17, 2020