
ThreadPool::install and thread overhead #614

Closed

bluss opened this issue Dec 9, 2018 · 16 comments

@bluss
Contributor

bluss commented Dec 9, 2018

I've made a reproduction of an issue with ThreadPool::install and the overhead I see due to the install call itself. In the reduced problem, the function that runs in ThreadPool::install() is entirely serial, and the overhead is proportional to the number of threads in the pool.

Maybe this is totally overblown, since the overhead is only on the order of milliseconds, but I had hoped to be able to parallelize problems that are that fast to compute.

Example

Code on the branch rayon-pool-install-201812 in ndarray.

You can clone it and run the test like this:

NUM_THREADS=1 cargo run --release --example threadpool_install  --features=rayon
NUM_THREADS=16 cargo run --release --example threadpool_install  --features=rayon

The program is in examples/threadpool_install.rs and it calls ThreadPool::install repeatedly. The work inside the install takes about 3 milliseconds on its own. (Does this sound short?)
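A minimal sketch of the shape of the reproduction (the real program is the examples/threadpool_install.rs linked above; the workload here is just a stand-in):

```rust
use std::time::Instant;

fn main() {
    // Pool size comes from NUM_THREADS, as in the commands above.
    let threads: usize = std::env::var("NUM_THREADS")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(1);
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(threads)
        .build()
        .unwrap();

    let mut data = vec![1.0f64; 1_000_000];
    let start = Instant::now();
    for _ in 0..100 {
        // The closure is entirely serial: no join or parallel iterators inside.
        pool.install(|| {
            for x in data.iter_mut() {
                *x = (*x + 1.0).sqrt();
            }
        });
    }
    println!("{} threads: {:?}", threads, start.elapsed());
}
```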

I ran perf on the example executable with 16 threads and it says that rayon functions use significant time:

  61,53%  threadpool_inst  threadpool_install  [.] ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::mapv_inplace                                                                                     
  13,17%  threadpool_inst  threadpool_install  [.] <std::thread::local::LocalKey<T>>::with                                                                                                                  
   9,35%  threadpool_inst  threadpool_install  [.] rayon_core::registry::WorkerThread::steal                                                                                                                
   7,56%  threadpool_inst  threadpool_install  [.] crossbeam_epoch::default::HANDLE::__getit                                                                                                                
   5,92%  threadpool_inst  threadpool_install  [.] <crossbeam_deque::Stealer<T>>::steal                                                                                                                     
   0,94%  threadpool_inst  libc-2.24.so        [.] _int_free                                                                                                                                                
   0,37%  threadpool_inst  threadpool_install  [.] crossbeam_epoch::guard::Guard::new                                                                                                                       
   0,36%  threadpool_inst  [kernel.kallsyms]   [k] entry_SYSCALL_64                                                                                                                                         
   0,32%  threadpool_inst  threadpool_install  [.] crossbeam_epoch::internal::Global::try_advance                                                                                                           
   0,28%  threadpool_inst  threadpool_install  [.] rayon_core::registry::WorkerThread::wait_until_cold                                                          

If I change the thread pool to use 1 thread, I get a serene result where 99% of execution time is in the actual job:

  99,95%  threadpool_inst  threadpool_install  [.] ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::mapv_inplace                                                                                     
   0,03%  threadpool_inst  [kernel.kallsyms]   [k] entry_SYSCALL_64                                                                                                                                         
   0,01%  threadpool_inst  ld-2.24.so          [.] do_lookup_x                                                                                                                                              
   0,00%  threadpool_inst  libc-2.24.so        [.] _dl_addr                                                                                                                                                 
   0,00%  threadpool_inst  libpthread-2.24.so  [.] __pthread_mutex_unlock_usercnt                                                                                                                           
   0,00%  threadpool_inst  libc-2.24.so        [.] _int_malloc                                                                                                                                              
   0,00%  threadpool_inst  libpthread-2.24.so  [.] pthread_cond_destroy@@GLIBC_2.3.2                                                                                        

Actual Use Case

Of course, if I don't have anything parallel to do, I can avoid using rayon. I saw this problem when using a thread pool that was too big for just a few parallel jobs. My use case is so far just an experiment, and I was doing this (sketched below):

  1. Using a ThreadPool with 4 threads; the ThreadPool is owned by a lazy static for the duration of the program.
  2. Using Pool.install(|| f()) around all my loops
  3. Running a parallel iterator of length 2

The parallel iterator was only length 2 due to the problem's size. Here I noticed that a ThreadPool of 2 threads would have less overhead. With 4 threads for 2 jobs, the overhead would eat up any gains from parallelism.

The program should use 1 thread per physical core and can't adapt the number of threads (creating the thread pool is too slow). It needs to be ready for tasks that can be split into 2 or more parallel jobs.
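Roughly, the pattern looks like this (a sketch, not my actual code; std::sync::OnceLock stands in for the lazy static):

```rust
use std::sync::OnceLock;
use rayon::prelude::*;
use rayon::{ThreadPool, ThreadPoolBuilder};

// Process-wide pool, standing in for the lazy static described above.
fn pool() -> &'static ThreadPool {
    static POOL: OnceLock<ThreadPool> = OnceLock::new();
    POOL.get_or_init(|| ThreadPoolBuilder::new().num_threads(4).build().unwrap())
}

fn process(halves: &mut [Vec<f64>]) {
    // install() wraps the loop, but the parallel iterator is only length 2.
    pool().install(|| {
        halves.par_iter_mut().for_each(|half| {
            for x in half.iter_mut() {
                *x = x.sin();
            }
        });
    });
}
```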

I also seem to be able to reproduce this issue with the global pool. Just rayon::join with two parallel jobs in the whole program: if I run this with 2 threads, I get a speedup from 95µs to 61µs, but the speedup disappears if there are more threads in the global pool. So rayon can parallelize this problem well when the pool size matches the work; the extra idle threads are what eat up the gain.
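In essence the global-pool case is just this (a toy sketch; the timings above are from my real workload, and RAYON_NUM_THREADS controls the global pool size):

```rust
use std::time::Instant;

fn bump(v: &mut [f64]) {
    for x in v.iter_mut() {
        *x += 1.0;
    }
}

fn main() {
    // Compare e.g. RAYON_NUM_THREADS=2 against RAYON_NUM_THREADS=16.
    let mut left = vec![0.0f64; 100_000];
    let mut right = vec![0.0f64; 100_000];
    let start = Instant::now();
    rayon::join(|| bump(&mut left), || bump(&mut right));
    println!("join took {:?}", start.elapsed());
}
```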

@bluss
Contributor Author

bluss commented Dec 9, 2018

This issue seems related: #576

@cuviper
Member

cuviper commented Dec 9, 2018

Yeah, we need to improve the situation with underutilized thread pools.

@bluss
Contributor Author

bluss commented Dec 10, 2018

@cuviper I understand! Do you know of any implementations of join, with jobs scoped on the stack, but without job stealing? Or a way to implement that with rayon? Of course, in rayon, stealing is needed for join to ever use more than one thread.

@cuviper
Member

cuviper commented Dec 10, 2018

I think you could emulate this with crossbeam::scope or scoped_threadpool, but not quite as neatly packaged as rayon::join. I'm sure they could add such a convenience function though.

I'm not sure what you want rayon to do without stealing -- just run serially? Forcibly send the second job to another thread in the pool? The latter would require targeted wake-ups, which basically gets back to the main issue of #576.
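For reference, a rough sketch of the crossbeam::scope emulation (assuming the crossbeam crate as a dependency; it spawns a fresh thread per call, so it only pays off for jobs much larger than thread startup):

```rust
// `b` runs on a newly spawned scoped thread while `a` runs on the current
// one; both closures may borrow from the caller's stack because the scope
// joins all its threads before returning.
fn scoped_join<A, B, RA, RB>(a: A, b: B) -> (RA, RB)
where
    A: FnOnce() -> RA + Send,
    B: FnOnce() -> RB + Send,
    RA: Send,
    RB: Send,
{
    crossbeam::scope(|s| {
        let handle = s.spawn(|_| b());
        let ra = a();
        let rb = handle.join().unwrap();
        (ra, rb)
    })
    .unwrap()
}
```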

@bluss
Contributor Author

bluss commented Dec 10, 2018

Neither of them seem to support jobs stored on the stack, though.

I'm writing some experiments to get a better sense of what kinds of problems are involved; right now I'm breaking out rayon's join into a PoC where it directly sends stack jobs to other threads in a pool. Not that rayon should solve it that way, but it's a learning experience to work on. I don't know rayon well enough to make any suggestions, but it sounds like #576 is pretty much the same issue.

@cuviper
Member

cuviper commented Dec 10, 2018

Neither of them seem to support jobs stored on the stack, though.

Oh, I thought you just wanted scoped lifetimes. I guess you mean like our StackJob that doesn't touch the heap at all? In crossbeam's case, a little heap allocation will be dwarfed by dynamically starting a thread, but maybe scoped_threadpool could reasonably implement a more direct join.

@bluss
Contributor Author

bluss commented Dec 10, 2018

Sure, a thread pool and stack jobs and scoped threads, all at once. Feel free to close this issue if you want. I'll work on my prototype and see if it's viable.

@ghost

ghost commented Sep 27, 2020

I noticed the following lines in rayon's scheduler:

let num_to_wake = std::cmp::min(num_jobs, num_sleepers);
self.wake_any_threads(num_to_wake);

It seems if we create 5 jobs, then 5 threads get woken at once. That may be a bit excessive if those jobs are small. It's possible that just a single thread will quickly process all 5 jobs and then the remaining 4 threads have nothing to do. In other words, more time is spent waking threads than actually processing jobs.

A strategy some other schedulers (e.g. the Go scheduler) employ is to wake up just 1 thread if there are scheduled jobs. Then, if that thread finds a job for itself, it wakes another thread. Then that thread does the same thing, and the procedure goes on like that. So if we have 5 small jobs that are enough for a single thread, we'll only wake up one additional thread in vain.
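A toy illustration of that strategy (nothing to do with rayon's actual scheduler, just the shape of the idea):

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

// Pushing a job wakes exactly one sleeper; a worker that finds a job wakes
// one more sleeper before running it, so wake-ups fan out only while there
// is actually more work to pick up.
struct Queue {
    jobs: Mutex<VecDeque<Box<dyn FnOnce() + Send>>>,
    available: Condvar,
}

impl Queue {
    fn push(&self, job: Box<dyn FnOnce() + Send>) {
        self.jobs.lock().unwrap().push_back(job);
        // Wake a single sleeping worker, not every sleeper.
        self.available.notify_one();
    }

    fn worker_loop(&self) {
        loop {
            let job = {
                let mut jobs = self.jobs.lock().unwrap();
                while jobs.is_empty() {
                    jobs = self.available.wait(jobs).unwrap();
                }
                jobs.pop_front().unwrap()
            };
            // We found work for ourselves, so pass the baton: wake one more
            // sleeper in case there is more work queued behind this job.
            self.available.notify_one();
            job();
        }
    }
}
```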

@cuviper
Member

cuviper commented Sep 28, 2020

@stjepang I think you have a good point -- however, in practice I believe this only ever sees one job at a time. While Registry::inject can take an arbitrary slice of jobs, all of its current callers only pass a singleton slice. The other waker is WorkerThread::push, which is only dealing with one job as well.

@cuviper
Member

cuviper commented Sep 28, 2020

Then, if that thread finds a job for itself, it wakes another thread.

We do have this heuristic already in Sleep::work_found:

// If we were the last idle thread and other threads are still sleeping,
// then we should wake up another thread.
let threads_to_wake = self.counters.sub_inactive_thread();
self.wake_any_threads(threads_to_wake as u32);

And here's AtomicCounters::sub_inactive_thread:

// Current heuristic: whenever an inactive thread goes away, if
// there are any sleeping threads, wake 'em up.
let sleeping_threads = old_value.sleeping_threads();
std::cmp::min(sleeping_threads, 2)

But they seem to be conflicted between waking "another thread" (1) versus "any sleeping threads" (capped at 2).

@bluss
Contributor Author

bluss commented Sep 29, 2020

#746 has been merged! That means it is time to revisit this -- I haven't done so yet.

@nikomatsakis
Member

@cuviper it looks like we are starting 2 threads. I know I was tinkering with that as a way to help us “catch up” to spikes of work faster. I can't remember how effective these changes were (we should measure), but my memory is “a little, but not much”. I presently feel it's probably not worth it, and that it's better to just wake one thread.

@bluss
Contributor Author

bluss commented Dec 10, 2020

I've come back to testing newer rayon. Not with the exact same testcase yet (I'm sorry).

Some aspect of the same problem is still visible. The same thing has been formulated in newer issues: high CPU usage when there are more threads than tasks. Say, for example, there are 8 threads in the thread pool but only enough work for four of them. With enough threads, the extra ones eat up all the real time that could have been gained on the task, unfortunately.

Because other issues that handle the same topic are active, this one might as well be closed if you prefer.

@bluss
Contributor Author

bluss commented Dec 20, 2020

The current conclusion of my experiments is a published crate for a new thread pool (there are too many crates for this already!) called thread-tree, which is a very simple binary tree of worker threads. It seems to work the way I want for the benchmarks I run.
The problem that I saw, both with rayon and to some extent with my own thread pool experiments, was that there is overhead when there are two work items to run but four threads contending to pick them up (can we agree that this is the problem?).

If I have a binary tree of worker threads, then I always have a 1-1 channel sender/receiver to each thread, so this effect of contention between workers doesn't happen. Maybe it's an unfair conclusion, because in the MVP version of the implementation I only support 1, 2 or 4 parallel jobs.
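Very roughly, the idea looks like this (not the actual thread-tree code, which uses stack jobs taken from rayon-core; this sketch heap-allocates the jobs for brevity):

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

type Job = Box<dyn FnOnce() + Send>;

// Each worker owns a dedicated 1-1 channel, so a job sent to it is never
// contended for by several idle threads.
struct Worker {
    sender: Sender<Job>,
}

impl Worker {
    fn new() -> Worker {
        let (sender, receiver) = channel::<Job>();
        thread::spawn(move || {
            // Run jobs in arrival order; exit when the sender is dropped.
            for job in receiver {
                job();
            }
        });
        Worker { sender }
    }

    fn send(&self, job: Job) {
        self.sender.send(job).unwrap();
    }
}
```

In the real crate the workers form a binary tree and the jobs live on the sender's stack, which is the part adapted from rayon-core.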

I have a question for @nikomatsakis and @cuviper: This new crate is partly based on code that I took from rayon-core. That was very useful for me; it wouldn't have been possible otherwise (StackJob and its execution). Is it enough that I mention you as authors in a notice in the code and in the Readme? Is it ok if I also mention you in the Cargo.toml authors list? Or maybe that's not needed?

Thread-tree crate: https://docs.rs/thread-tree/
Matrixmultiply explanation: bluss/matrixmultiply#52

@bluss
Contributor Author

bluss commented Dec 21, 2020

The case described in the original report is fixed by the new scheduler, so I'll close. I've been wanting to close this anyway, even if it looks like I'm going with another solution due to similar issues.

bluss closed this as completed Dec 21, 2020
@cuviper
Member

cuviper commented Dec 22, 2020

Thanks for the followup @bluss!

Is it ok if I also mention you in the Cargo.toml authors list? Or maybe that's not needed?

If it were a close fork then shared authorship might be nice, but I think it's not needed since you just took a small piece, and the crate is substantially different.

I wish Rayon could be everything to everyone for parallelism, but I think it's OK to accept that different designs will be better for some workloads.
