
ThreadPool::install and thread overhead #614

Closed

bluss opened this issue Dec 9, 2018 · 16 comments

@bluss
Contributor

bluss commented Dec 9, 2018

I've made a reproduction of an issue with ThreadPool::install and the overhead I see due to the install call itself. In the reduced problem, the function that runs in ThreadPool::install() is entirely serial, and the overhead is proportional to the number of threads in the pool.

Maybe this is totally overblown, since the overhead is only on the order of milliseconds, but I had hoped to be able to parallelize problems that are that fast to compute.

Example

Code on the branch rayon-pool-install-201812 in ndarray.

You can clone it and run the test like this:

NUM_THREADS=1 cargo run --release --example threadpool_install  --features=rayon
NUM_THREADS=16 cargo run --release --example threadpool_install  --features=rayon

The program is in examples/threadpool_install.rs and it calls ThreadPool::install repeatedly. The work inside the install takes about 3 milliseconds on its own. (Does this sound short?)
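A minimal sketch of the shape of the reproduction (the real program is the examples/threadpool_install.rs linked above; the workload here is just a stand-in):

```rust
use std::time::Instant;

fn main() {
    // Pool size comes from NUM_THREADS, as in the commands above.
    let threads: usize = std::env::var("NUM_THREADS")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(1);
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(threads)
        .build()
        .unwrap();

    let mut data = vec![1.0f64; 1_000_000];
    let start = Instant::now();
    for _ in 0..100 {
        // The closure is entirely serial: no join or parallel iterators inside.
        pool.install(|| {
            for x in data.iter_mut() {
                *x = (*x + 1.0).sqrt();
            }
        });
    }
    println!("{} threads: {:?}", threads, start.elapsed());
}
```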

I ran perf on the example executable with 16 threads and it says that rayon functions use significant time:

  61,53%  threadpool_inst  threadpool_install  [.] ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::mapv_inplace                                                                                     
  13,17%  threadpool_inst  threadpool_install  [.] <std::thread::local::LocalKey<T>>::with                                                                                                                  
   9,35%  threadpool_inst  threadpool_install  [.] rayon_core::registry::WorkerThread::steal                                                                                                                
   7,56%  threadpool_inst  threadpool_install  [.] crossbeam_epoch::default::HANDLE::__getit                                                                                                                
   5,92%  threadpool_inst  threadpool_install  [.] <crossbeam_deque::Stealer<T>>::steal                                                                                                                     
   0,94%  threadpool_inst  libc-2.24.so        [.] _int_free                                                                                                                                                
   0,37%  threadpool_inst  threadpool_install  [.] crossbeam_epoch::guard::Guard::new                                                                                                                       
   0,36%  threadpool_inst  [kernel.kallsyms]   [k] entry_SYSCALL_64                                                                                                                                         
   0,32%  threadpool_inst  threadpool_install  [.] crossbeam_epoch::internal::Global::try_advance                                                                                                           
   0,28%  threadpool_inst  threadpool_install  [.] rayon_core::registry::WorkerThread::wait_until_cold                                                          

If I change the thread pool to use 1 thread, I get a serene result where 99% of execution time is in the actual job:

  99,95%  threadpool_inst  threadpool_install  [.] ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::mapv_inplace                                                                                     
   0,03%  threadpool_inst  [kernel.kallsyms]   [k] entry_SYSCALL_64                                                                                                                                         
   0,01%  threadpool_inst  ld-2.24.so          [.] do_lookup_x                                                                                                                                              
   0,00%  threadpool_inst  libc-2.24.so        [.] _dl_addr                                                                                                                                                 
   0,00%  threadpool_inst  libpthread-2.24.so  [.] __pthread_mutex_unlock_usercnt                                                                                                                           
   0,00%  threadpool_inst  libc-2.24.so        [.] _int_malloc                                                                                                                                              
   0,00%  threadpool_inst  libpthread-2.24.so  [.] pthread_cond_destroy@@GLIBC_2.3.2                                                                                        

Actual Use Case

Of course, if I don't have anything parallel to do, I can avoid using rayon. I saw this problem when using a thread pool that was too big for just a few parallel jobs. My use case is so far just an experiment, and I was doing this (sketched below):

  1. Using a ThreadPool with 4 threads; the ThreadPool is owned by a lazy static for the duration of the program.
  2. Using Pool.install(|| f()) around all my loops
  3. Running a parallel iterator of length 2

The parallel iterator was only length 2 due to the problem's size. Here I noticed that a ThreadPool of 2 threads would have less overhead. With 4 threads for 2 jobs, the overhead would eat up any gains from parallelism.

The program should use 1 thread per physical core and can't adapt the number of threads (creating the thread pool is too slow). It needs to be ready for tasks that can be split into 2 or more parallel jobs.
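Roughly, the pattern looks like this (a sketch, not my actual code; std::sync::OnceLock stands in for the lazy static):

```rust
use std::sync::OnceLock;
use rayon::prelude::*;
use rayon::{ThreadPool, ThreadPoolBuilder};

// Process-wide pool, standing in for the lazy static described above.
fn pool() -> &'static ThreadPool {
    static POOL: OnceLock<ThreadPool> = OnceLock::new();
    POOL.get_or_init(|| ThreadPoolBuilder::new().num_threads(4).build().unwrap())
}

fn process(halves: &mut [Vec<f64>]) {
    // install() wraps the loop, but the parallel iterator is only length 2.
    pool().install(|| {
        halves.par_iter_mut().for_each(|half| {
            for x in half.iter_mut() {
                *x = x.sin();
            }
        });
    });
}
```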

I also seem to be able to reproduce this issue with the global pool. Just rayon::join with two parallel jobs in the whole program: if I run this with 2 threads, I get a speedup from 95µs to 61µs, but the speedup disappears if there are more threads in the global pool. So rayon can parallelize this problem well when the pool size matches the work; the extra idle threads are what eat up the gain.
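In essence the global-pool case is just this (a toy sketch; the timings above are from my real workload, and RAYON_NUM_THREADS controls the global pool size):

```rust
use std::time::Instant;

fn bump(v: &mut [f64]) {
    for x in v.iter_mut() {
        *x += 1.0;
    }
}

fn main() {
    // Compare e.g. RAYON_NUM_THREADS=2 against RAYON_NUM_THREADS=16.
    let mut left = vec![0.0f64; 100_000];
    let mut right = vec![0.0f64; 100_000];
    let start = Instant::now();
    rayon::join(|| bump(&mut left), || bump(&mut right));
    println!("join took {:?}", start.elapsed());
}
```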

@bluss
Contributor Author

bluss commented Dec 9, 2018

This issue seems related: #576

@cuviper
Member

cuviper commented Dec 9, 2018

Yeah, we need to improve the situation with underutilized thread pools.

@bluss
Contributor Author

bluss commented Dec 10, 2018

@cuviper I understand! Do you know of any implementations of join, with jobs scoped on the stack, but without job stealing? Or a way to implement that with rayon? Of course, in rayon, stealing is needed for join to ever use more than one thread.

@cuviper
Member

cuviper commented Dec 10, 2018

I think you could emulate this with crossbeam::scope or scoped_threadpool, but not quite as neatly packaged as rayon::join. I'm sure they could add such a convenience function though.

I'm not sure what you want rayon to do without stealing -- just run serially? Forcibly send the second job to another thread in the pool? The latter would require targeted wake-ups, which basically gets back to the main issue of #576.
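For reference, a rough sketch of the crossbeam::scope emulation (assuming the crossbeam crate as a dependency; it spawns a fresh thread per call, so it only pays off for jobs much larger than thread startup):

```rust
// `b` runs on a newly spawned scoped thread while `a` runs on the current
// one; both closures may borrow from the caller's stack because the scope
// joins all its threads before returning.
fn scoped_join<A, B, RA, RB>(a: A, b: B) -> (RA, RB)
where
    A: FnOnce() -> RA + Send,
    B: FnOnce() -> RB + Send,
    RA: Send,
    RB: Send,
{
    crossbeam::scope(|s| {
        let handle = s.spawn(|_| b());
        let ra = a();
        let rb = handle.join().unwrap();
        (ra, rb)
    })
    .unwrap()
}
```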

@bluss
Contributor Author

bluss commented Dec 10, 2018

Neither of them seem to support jobs stored on the stack, though.

I'm writing some experiments to get a better sense of what kinds of problems are involved; right now I'm breaking out rayon's join into a PoC where it directly sends stack jobs to other threads in a pool. Not that rayon should solve it that way, but it's a learning experience to work on. I don't know rayon well enough to make any suggestions, but it sounds like #576 is pretty much the same issue.

@cuviper
Member

cuviper commented Dec 10, 2018

Neither of them seem to support jobs stored on the stack, though.

Oh, I thought you just wanted scoped lifetimes. I guess you mean like our StackJob that doesn't touch the heap at all? In crossbeam's case, a little heap allocation will be dwarfed by dynamically starting a thread, but maybe scoped_threadpool could reasonably implement a more direct join.

@bluss
Contributor Author

bluss commented Dec 10, 2018

Sure, a thread pool and stack jobs and scoped threads, all at once. Feel free to close this issue if you want. I'll work on my prototype and see if it's viable.

@ghost

ghost commented Sep 27, 2020

I noticed the following lines in rayon's scheduler:

let num_to_wake = std::cmp::min(num_jobs, num_sleepers);
self.wake_any_threads(num_to_wake);

It seems if we create 5 jobs, then 5 threads get woken at once. That may be a bit excessive if those jobs are small. It's possible that just a single thread will quickly process all 5 jobs and then the remaining 4 threads have nothing to do. In other words, more time is spent waking threads than actually processing jobs.

A strategy some other schedulers (e.g. the Go scheduler) employ is to wake up just 1 thread if there are scheduled jobs. Then, if that thread finds a job for itself, it wakes another thread. Then that thread does the same thing, and the procedure goes on like that. So if we have 5 small jobs that are enough for a single thread, we'll only wake up one additional thread in vain.
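A toy illustration of that strategy (nothing to do with rayon's actual scheduler, just the shape of the idea):

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

// Pushing a job wakes exactly one sleeper; a worker that finds a job wakes
// one more sleeper before running it, so wake-ups fan out only while there
// is actually more work to pick up.
struct Queue {
    jobs: Mutex<VecDeque<Box<dyn FnOnce() + Send>>>,
    available: Condvar,
}

impl Queue {
    fn push(&self, job: Box<dyn FnOnce() + Send>) {
        self.jobs.lock().unwrap().push_back(job);
        // Wake a single sleeping worker, not every sleeper.
        self.available.notify_one();
    }

    fn worker_loop(&self) {
        loop {
            let job = {
                let mut jobs = self.jobs.lock().unwrap();
                while jobs.is_empty() {
                    jobs = self.available.wait(jobs).unwrap();
                }
                jobs.pop_front().unwrap()
            };
            // We found work for ourselves, so pass the baton: wake one more
            // sleeper in case there is more work queued behind this job.
            self.available.notify_one();
            job();
        }
    }
}
```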

@cuviper
Member

cuviper commented Sep 28, 2020

@stjepang I think you have a good point -- however, in practice I believe this only ever sees one job at a time. While Registry::inject can take an arbitrary slice of jobs, all of its current callers only pass a singleton slice. The other waker is WorkerThread::push, which is only dealing with one job as well.

@cuviper
Member

cuviper commented Sep 28, 2020

Then, if that thread finds a job for itself, it wakes another thread.

We do have this heuristic already in Sleep::work_found:

// If we were the last idle thread and other threads are still sleeping,
// then we should wake up another thread.
let threads_to_wake = self.counters.sub_inactive_thread();
self.wake_any_threads(threads_to_wake as u32);

And here's AtomicCounters::sub_inactive_thread:

// Current heuristic: whenever an inactive thread goes away, if
// there are any sleeping threads, wake 'em up.
let sleeping_threads = old_value.sleeping_threads();
std::cmp::min(sleeping_threads, 2)

But they seem to be conflicted between waking "another thread" (1) versus "any sleeping threads" (capped at 2).

@bluss
Contributor Author

bluss commented Sep 29, 2020

#746 has been merged! That means it is time to revisit this -- I haven't done so yet.

@nikomatsakis
Member

@cuviper it looks like we are starting 2 threads. I know I was tinkering with that as a way to help us “catch up” to spikes of work faster. I can't remember how effective these changes were (we should measure), but my memory is “a little, but not much”. I presently feel it's probably not worth it, and that it's better to just wake one thread.

@bluss
Contributor Author

bluss commented Dec 10, 2020

I've come back to testing newer rayon. Not with the exact same testcase yet (I'm sorry).

Some aspect of the same problem is still visible. The same thing has been formulated in newer issues: high CPU usage when there are more threads than tasks. Say, for example, there are 8 threads in the thread pool but only enough work for four of them. With enough threads, the extra ones eat up all the real time that could have been gained on the task, unfortunately.

Because other issues that handle the same topic are active, this one might as well be closed if you prefer.

@bluss
Contributor Author

bluss commented Dec 20, 2020

The current conclusion of my experiments is a published crate for a new thread pool (there are too many crates for this already!) called thread-tree, which is a very simple binary tree of worker threads. It seems to work the way I want for the benchmarks I run.
The problem that I saw, both with rayon and to some extent with my own thread pool experiments, was that there is overhead when there are two work items to run but four threads contending to pick them up (can we agree that this is the problem?).

If I have a binary tree of worker threads, then I always have a 1-1 channel sender/receiver to each thread, so this effect of contention between workers doesn't happen. Maybe it's an unfair conclusion, because in the MVP version of the implementation I only support 1, 2 or 4 parallel jobs.
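Very roughly, the idea looks like this (not the actual thread-tree code, which uses stack jobs taken from rayon-core; this sketch heap-allocates the jobs for brevity):

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

type Job = Box<dyn FnOnce() + Send>;

// Each worker owns a dedicated 1-1 channel, so a job sent to it is never
// contended for by several idle threads.
struct Worker {
    sender: Sender<Job>,
}

impl Worker {
    fn new() -> Worker {
        let (sender, receiver) = channel::<Job>();
        thread::spawn(move || {
            // Run jobs in arrival order; exit when the sender is dropped.
            for job in receiver {
                job();
            }
        });
        Worker { sender }
    }

    fn send(&self, job: Job) {
        self.sender.send(job).unwrap();
    }
}
```

In the real crate the workers form a binary tree and the jobs live on the sender's stack, which is the part adapted from rayon-core.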

I have a question for @nikomatsakis and @cuviper: This new crate is partly based on code that I took from rayon-core. That was very useful for me; it wouldn't have been possible otherwise (StackJob and its execution). Is it enough that I mention you as authors in a notice in the code and in the Readme? Is it ok if I also mention you in the Cargo.toml authors list? Or maybe that's not needed?

Thread-tree crate: https://docs.rs/thread-tree/
Matrixmultiply explanation: bluss/matrixmultiply#52

@bluss
Contributor Author

bluss commented Dec 21, 2020

The case described in the original report is fixed by the new scheduler, so I'll close. I've been wanting to close this anyway, even if it looks like I'm going with another solution due to similar issues.

bluss closed this as completed Dec 21, 2020
@cuviper
Member

cuviper commented Dec 22, 2020

Thanks for the followup @bluss!

Is it ok if I also mention you in the Cargo.toml authors list? Or maybe that's not needed?

If it were a close fork then shared authorship might be nice, but I think it's not needed since you just took a small piece, and the crate is substantially different.

I wish Rayon could be everything to everyone for parallelism, but I think it's OK to accept that different designs will be better for some workloads.
