ThreadPool::install and thread overhead #614
This issue seems related: #576
Yeah, we need to improve the situation with underutilized thread pools.
@cuviper I understand! Do you know if there are any implementations of join -- scoped, on-the-stack jobs -- without job stealing? Or a way to implement that with rayon? Of course in rayon, stealing is needed for join to ever use more than one thread.
I think you could emulate this with […]. I'm not sure what you want rayon to do without stealing -- just run serially? Forcibly send the second job to another thread in the pool? The latter would require targeted wake-ups, which basically gets back to the main issue of #576.
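The "just run serially" option mentioned above can be sketched in a few lines. This is an illustrative stand-in, not rayon's API: a join that never hands work to another thread, keeping join's signature and its friendliness to scoped (stack) borrows.

```rust
// A minimal sketch (not rayon's API): "join" with no work stealing at all
// simply runs both closures one after the other on the calling thread.
// Borrowed (stack) data works naturally because nothing leaves this thread.
fn serial_join<A, B, RA, RB>(a: A, b: B) -> (RA, RB)
where
    A: FnOnce() -> RA,
    B: FnOnce() -> RB,
{
    let ra = a();
    let rb = b();
    (ra, rb)
}
```

It trivially supports jobs stored on the stack, but by construction it never uses more than one thread.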
Neither of them seem to support jobs stored on the stack, though. I'm writing some experiments that take me closer to understanding what kinds of problems are involved, right now breaking out rayon's join into a PoC where it directly sends stack jobs to other threads in a pool. Not that Rayon should solve it that way, but it's a learning experience to work on. I don't know Rayon well enough to make any suggestions, but it sounds like #576 is pretty much the same issue.
Oh, I thought you just wanted scoped lifetimes. I guess you mean like our […].
Sure, both a thread pool and stack jobs and scoped threads. Feel free to close this issue if you want. I'll work on my prototype and see if it's viable. |
I noticed the following lines in rayon's scheduler:

```rust
let num_to_wake = std::cmp::min(num_jobs, num_sleepers);
self.wake_any_threads(num_to_wake);
```

It seems that if we create 5 jobs, then 5 threads get woken at once. That may be a bit excessive if those jobs are small. It's possible that just a single thread will quickly process all 5 jobs and then the remaining 4 threads have nothing to do. In other words, more time is spent waking threads than actually processing jobs. A strategy some other schedulers (e.g. the Go scheduler) employ is to wake up just 1 thread if there are scheduled jobs. Then, if that thread finds a job for itself, it wakes another thread. Then that thread does the same thing, and the procedure goes on like that. So if we have 5 small jobs that are enough for a single thread, we'll only wake up one additional thread in vain.
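The cascading wake-up strategy described above can be sketched with a plain mutex-and-condvar queue. All names here are illustrative, not rayon's internals: the producer wakes only one sleeper per push, and a woken worker passes the wake-up along only if jobs remain after it takes one.

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

// Sketch of "wake one thread at a time": no thread is woken in vain for a
// queue that a single thread can drain on its own.
type Job = Box<dyn FnOnce() + Send>;

pub struct WakeOneQueue {
    jobs: Mutex<VecDeque<Job>>,
    cvar: Condvar,
}

impl WakeOneQueue {
    pub fn new() -> Self {
        WakeOneQueue {
            jobs: Mutex::new(VecDeque::new()),
            cvar: Condvar::new(),
        }
    }

    pub fn push(&self, job: Job) {
        self.jobs.lock().unwrap().push_back(job);
        self.cvar.notify_one(); // wake a single sleeper, not everyone
    }

    /// Run jobs forever; meant to be called from a worker thread.
    pub fn run_worker(&self) {
        loop {
            let job = {
                let mut jobs = self.jobs.lock().unwrap();
                while jobs.is_empty() {
                    jobs = self.cvar.wait(jobs).unwrap();
                }
                let job = jobs.pop_front().unwrap();
                if !jobs.is_empty() {
                    // Work is still queued: cascade the wake-up to one more sleeper.
                    self.cvar.notify_one();
                }
                job
            };
            job();
        }
    }
}
```

With 5 small jobs, the first worker may drain the whole queue before the second one even gets scheduled, so at most one extra thread is woken unnecessarily.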
@stjepang I think you have a good point -- however, in practice I believe this only ever sees one job at a time. While […]
We do have this heuristic already, in rayon-core/src/sleep/mod.rs (lines 95 to 98 at 8c9ada2). And here's the counterpart in rayon-core/src/sleep/counters.rs (lines 160 to 163 at 8c9ada2). But they seem to be conflicted between waking "another thread" (1) versus "any sleeping threads" (capped at 2).
#746 has been merged! That means it is time to revisit this -- I haven't done so yet. |
@cuviper it looks like we are starting 2 threads. I know I was tinkering with that as a way to help us "catch up" to spikes of work faster. I can't remember how effective these changes were -- we should measure -- but my memory is "a little, but not much". I presently feel it's probably not worth it, and that it's better to just wake one thread.
I've come back to testing newer rayon. Not with the exact same testcase yet (I'm sorry), but some aspect of the same problem is still visible. The same has been formulated in newer issues: high CPU usage when there are more threads than tasks. Say, for example, 8 threads in the thread pool but only enough work for four of them. With enough extra threads, they unfortunately eat up all the real-time gains the task could have made. Because other issues covering the same topic are active, this one might as well be closed if you prefer.
The current conclusion of my experiments is a published crate for a new thread pool (there are too many crates for this already!) called thread-tree, which is a very simple binary tree of worker threads. It seems to work the way I want for the benchmarks I run. If I have a binary tree of worker threads, then I always have a 1-1 channel sender/receiver to each thread, so this contention between workers doesn't happen. Maybe it's an unfair conclusion, because the MVP version of the implementation only supports 1, 2 or 4 parallel jobs.

I have a question for @nikomatsakis and @cuviper: this new crate is partly based on code that I took from rayon-core. That was very useful for me; it wouldn't have been possible otherwise (StackJob and its execution). Is it enough that I mention you as authors in a notice in the code and in the Readme? Is it ok if I also mention you in the Cargo.toml authors list, or maybe that's not needed?

Thread-tree crate: https://docs.rs/thread-tree/
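The 1-1 channel idea described above can be sketched with the standard library alone. This is illustrative, not thread-tree's actual API: each worker thread owns its own dedicated receiver, so handing it a job never contends with any other worker.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

// One dedicated worker per channel: the sender side is the only way to
// reach this thread, so there is no shared queue and no contention.
type Job = Box<dyn FnOnce() + Send>;

fn spawn_worker() -> (Sender<Job>, JoinHandle<()>) {
    let (tx, rx) = channel::<Job>();
    let handle = thread::spawn(move || {
        // The loop ends once every Sender has been dropped.
        for job in rx {
            job();
        }
    });
    (tx, handle)
}
```

A binary tree of such workers, as thread-tree does, gives each parent a private link to each child; the trade-off versus rayon is that idle workers can never steal work from busy ones.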
The case described in the original report is fixed by the new scheduler, so I'll close. I've been wanting to close this anyway, even if it looks like I'm going for another solution due to similar issues.
Thanks for the followup @bluss!
If it were a close fork then shared authorship might be nice, but I think it's not needed since you just took a small piece, and the crate is substantially different. I wish Rayon could be everything to everyone for parallelism, but I think it's OK to accept that different designs will be better for some workloads. |
I've made a reproduction of an issue with `ThreadPool::install` and the overhead I see due to the `install` call itself. In the reduced problem, the function that runs in `ThreadPool::install()` is entirely serial, and the overhead is proportional to the number of threads in the pool.

Maybe this is totally overblown -- the overhead is on the order of milliseconds -- but I had hoped to be able to parallelize problems that are that fast to compute.
Example
Code on the branch rayon-pool-install-201812 in ndarray.
You can clone it and run the test like this: […]
The program is in `examples/threadpool_install.rs` and it calls `ThreadPool::install` repeatedly. The work inside the install takes about 3 milliseconds on its own. (Does this sound short?)

I ran perf on the example executable with 16 threads and it says that rayon functions use significant time: […]
If I change the thread pool to use 1 thread, I get a serene result -- 99% of execution time is in the actual job: […]
Actual Use Case
Of course, if I don't have anything parallel to do, I can avoid using rayon. I saw this problem when using a too-big thread pool with just a few parallel jobs. My use case is so far just an experiment, and I was doing this around all my loops:

```rust
pool.install(|| f())
```

The parallel iterator was only length 2 due to the problem's size. Here I noticed that a ThreadPool of 2 threads would have less overhead. With 4 threads for 2 jobs, the overhead would eat up any gains from parallelism.
The program should use 1 thread per physical core and can't adapt the number of threads (creating the thread pool is too slow). It needs to be ready for tasks that can be split into 2 or more parallel jobs.
I also seem to be able to reproduce this issue with the global pool. Just `rayon::join` with two parallel jobs in the whole program: if I run it with 2 threads, I get a speedup from 95µs to 61µs, but the speedup disappears if there are more threads in the global pool. This makes it seem that rayon can parallelize this problem well -- when the pool isn't oversized!