
Add ThreadPool::broadcast #492

Merged: 6 commits merged into rayon-rs:master on Nov 16, 2022

Conversation

@cuviper (Member) commented Dec 14, 2017

A broadcast runs the closure on every thread in the pool, then collects
the results. It's scheduled somewhat like a very soft interrupt -- it
won't preempt a thread's local work, but will run before it goes to
steal from any other threads.

This can be used when you want to precisely split your work per-thread,
or to set or retrieve some thread-local data in the pool, e.g. #483.
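For illustration, here is a minimal usage sketch of the per-thread splitting idea, written against the BroadcastContext-based API that eventually landed (summarized near the end of this thread); the chunking scheme is just an example:

```rust
fn main() {
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .build()
        .unwrap();

    let data: Vec<u64> = (0..1_000u64).collect();

    // The closure runs exactly once on each worker thread; each thread
    // sums its own contiguous chunk of the input.
    let partial_sums: Vec<u64> = pool.broadcast(|ctx| {
        let chunk = data.len() / ctx.num_threads() + 1;
        data.chunks(chunk)
            .nth(ctx.index())
            .map_or(0, |c| c.iter().sum())
    });

    let total: u64 = partial_sums.iter().sum();
    assert_eq!(total, (0..1_000u64).sum::<u64>());
}
```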

@nikomatsakis (Member) left a comment


I like this overall. In terms of what we are committing to publicly, I suppose it's this:

You have the ability to schedule a closure to run exactly once on each thread. The priority is sort of undefined (as is typical for Rayon).

That doesn't seem like too much of a commitment. The main question is whether that behavior is well-defined if the size of the threadpool were to ever become dynamic, but I am more and more dubious of such a thing ever happening, and certainly not without some form of opt-in.

I wonder if it'd be useful to give the closure the number of threads as well? I guess that is readily accessible from the pool, so that's why we don't, right?

@cuviper (Member, Author) commented Dec 20, 2017

> In terms of what we are committing to publicly, I suppose it's this:
>
> You have the ability to schedule a closure to run exactly once on each thread. The priority is sort of undefined (as is typical for Rayon).
>
> That doesn't seem like too much of a commitment.

One thing to strengthen this is that racing broadcasts will be consistently ordered. That is, given simultaneous broadcasts A and B, then if one thread sees A before B, they all will, and vice versa. So for example, if both are setting the same TLS, every thread will set them in the same order.

It doesn't guarantee that every A will complete before any B starts, but you could use a Barrier in A if needed. Consistent ordering means you won't have to worry about a barrier in B also waiting, interleaved with A, and deadlocking.

We don't have to promise this, but I think it's powerful if we do.
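As a concrete illustration of the Barrier idea (a sketch only, written against the ThreadPool::broadcast API that eventually landed): sizing a Barrier to the pool makes every invocation of broadcast A rendezvous before any thread moves past it.

```rust
use std::sync::Barrier;

fn main() {
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .build()
        .unwrap();

    // One slot per worker: every thread's invocation of "A" waits here,
    // so all of A's per-thread setup finishes before any thread goes on
    // to other work (including a later broadcast B).
    let barrier = Barrier::new(pool.current_num_threads());
    pool.broadcast(|_ctx| {
        // ... per-thread setup for broadcast "A" ...
        barrier.wait();
    });
}
```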

> The main question is whether that behavior is well-defined if the size of the threadpool were to ever become dynamic, but I am more and more dubious of such a thing ever happening, and certainly not without some form of opt-in.

I'm also skeptical of this happening. Internally, we could at least make sure that threads are never removed while they have a broadcast waiting, but I guess new threads would just be left out.

> I wonder if it'd be useful to give the closure the number of threads as well? I guess that is readily accessible from the pool, so that's why we don't, right?

Yeah, it's easy to get, although that could be racy if there are dynamic threads. We could supply that value to indicate the number of times the broadcast will be called -- the number we're actually queuing -- independent of any new threads that may pop up.

@nikomatsakis (Member) commented, quoting @cuviper:

> Yeah, it's easy to get, although that could be racy if there are dynamic threads. We could supply that value to indicate the number of times the broadcast will be called -- the number we're actually queuing -- independent of any new threads that may pop up.

So actually, I think this is a case where it would make sense for us to supply a &BroadcastContext (or BroadcastContext) so that we can easily add more contextual information later. If we really want to future proof, it'd probably be &BroadcastContext<'_>, where the lifetime is currently just phantom.
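A hedged sketch of what such a context type could look like (field names here are illustrative, not the actual implementation); the phantom lifetime leaves room to later hold real borrows from the registry:

```rust
use std::marker::PhantomData;

// Illustrative only: a context handed to each broadcast invocation.
// The lifetime is currently phantom, so borrowed fields (e.g. into the
// pool's registry) could be added later without changing the signature.
pub struct BroadcastContext<'a> {
    index: usize,       // which worker thread is running this call
    num_threads: usize, // how many invocations were queued in total
    _marker: PhantomData<&'a ()>,
}

impl<'a> BroadcastContext<'a> {
    pub fn index(&self) -> usize {
        self.index
    }

    pub fn num_threads(&self) -> usize {
        self.num_threads
    }
}
```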

bors bot added a commit that referenced this pull request Jun 6, 2018
550: add bridge from Iterator to ParallelIterator r=cuviper a=QuietMisdreavus

Half of #46

This started getting reviewed in QuietMisdreavus/polyester#6, but I decided to move my work to Rayon proper.

This PR adds a new trait, `AsParallel`, an implementation on `Iterator + Send`, and an iterator adapter `IterParallel` that implements `ParallelIterator` with a similar "cache items as you go" methodology as Polyester. I introduced a new trait because `ParallelIterator` was implemented on `Range`, which is itself an `Iterator`.

The basic idea is that you would start with a quick sequential `Iterator`, call `.as_parallel()` on it, and be able to use `ParallelIterator` adapters after that point, to do more expensive processing in multiple threads.

The design of `IterParallel` is like this:

* `IterParallel` defers background work to `IterParallelProducer`, which implements `UnindexedProducer`.
* `IterParallelProducer` will split as many times as there are threads in the current pool. (I've been told that #492 is a better way to organize this, but until that's in, this is how I wrote it. `>_>`)
* When folding items, `IterParallelProducer` keeps a `Stealer` from `crossbeam-deque` (added as a dependency, but using the same version as `rayon-core`) to access a deque of items that have already been loaded from the iterator.
* If the `Stealer` is empty, a worker will attempt to lock the Mutex to access the source `Iterator` and the `Deque`.
  * If the Mutex is already locked, it will call `yield_now`. The implementation in polyester used a `synchronoise::SignalEvent` but I've been told that worker threads should not block. In lieu of #548, a regular spin-loop was chosen instead.
  * If the Mutex is available, the worker will load a number of items from the iterator (currently the number of threads squared, times two) before releasing the Mutex and continuing.
  * (If the Mutex is poisoned, the worker will just... stop. Is there a recommended approach here? `>_>`)

This design is effectively a first brush, has [the same caveats as polyester](https://docs.rs/polyester/0.1.0/polyester/trait.Polyester.html#implementation-note), probably needs some extra features in rayon-core, and needs some higher-level docs before I'm willing to let it go. However, I'm putting it here because it was not in the right place when I talked to @cuviper about it last time.

Co-authored-by: QuietMisdreavus <grey@quietmisdreavus.net>
Co-authored-by: Niko Matsakis <niko@alum.mit.edu>
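For context, a hedged sketch of the usage pattern this referenced PR describes, written against the bridge adapter that rayon ultimately shipped as `par_bridge()` (rather than the `AsParallel`/`as_parallel` names proposed above): a cheap sequential Iterator feeds a ParallelIterator that does the expensive work on the pool.

```rust
use rayon::prelude::*;

fn expensive(n: u64) -> u64 {
    // stand-in for real per-item work
    n.wrapping_mul(n).rotate_left(7)
}

fn main() {
    let total: u64 = (0..1_000u64)
        .filter(|n| n % 3 != 0) // cheap sequential stage
        .par_bridge()           // hand remaining items to the pool
        .map(expensive)         // expensive parallel stage
        .sum();
    println!("{total}");
}
```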
@cuviper force-pushed the broadcast branch 2 times, most recently from fd4cad2 to 4e05307 on October 3, 2018
@cuviper (Member, Author) commented Oct 3, 2018

I've rebased and added a context type. Let me know what you think!

@Zoxc commented Apr 12, 2019

I'd like to use this in rustc to collect thread-local data (it complements my WorkerLocal type well).
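A minimal sketch of that kind of thread-local collection, assuming the broadcast API from this PR (the counter here is just a stand-in for whatever per-thread data rustc would accumulate):

```rust
use std::cell::Cell;

thread_local! {
    // Stand-in for real per-thread data accumulated by ordinary jobs.
    static LOCAL_COUNT: Cell<u64> = Cell::new(0);
}

fn main() {
    let pool = rayon::ThreadPoolBuilder::new().build().unwrap();

    pool.install(|| {
        // ... ordinary parallel work that bumps LOCAL_COUNT on whichever
        // worker happens to run each job ...
    });

    // broadcast runs once on every worker, so it can read each thread's
    // local value and return all of them for aggregation.
    let counts: Vec<u64> = pool.broadcast(|_ctx| LOCAL_COUNT.with(|c| c.get()));
    let total: u64 = counts.iter().sum();
    println!("total = {total}");
}
```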

@cuviper (Member, Author) commented Apr 24, 2019

@Zoxc I've rebased this again, if you'd like to try it out with rustc.

@Zoxc commented Nov 1, 2020

There are two variations on this that may be useful too: spawn_broadcast, which would spawn tasks on all threads without waiting for them, and broadcast on a Scope, which would spawn the tasks on the Scope and could be considered a more general form of this PR.

I wonder if there's room for some generic code here. spawn_broadcast would wait until 'static ends before blocking (a.k.a. never), Scope::broadcast would wait until 'scope ends before blocking, and the broadcast in this PR would wait on some internal lifetime before blocking. If you consider join as spawning two tasks, you can draw similar parallels between spawn, Scope::spawn, and join.

@cuviper (Member, Author) commented Nov 2, 2020

Those variations make sense to me. The Scope::broadcast idea should be a pretty easy extension, just adding to the scope counter so they're part of the blocking set. For the static/unblocked version, I guess we could wrap the closure in an Arc and basically just spawn it out.

They could all use the same injection queues, at least.
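A rough user-level sketch of that Arc idea (not rayon internals; the real implementation would queue one job per worker rather than relying on plain spawn):

```rust
use std::sync::Arc;

// Clone one Arc'd closure per thread and spawn it without waiting,
// roughly the shape a fire-and-forget spawn_broadcast needs.
fn spawn_broadcast_sketch<F>(pool: &rayon::ThreadPool, op: F)
where
    F: Fn(usize) + Send + Sync + 'static,
{
    let op = Arc::new(op);
    for i in 0..pool.current_num_threads() {
        let op = Arc::clone(&op);
        // Note: plain spawn() does not pin the job to worker `i`; a real
        // broadcast queues exactly one job per worker instead.
        pool.spawn(move || op(i));
    }
}
```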

@cuviper force-pushed the broadcast branch 5 times, most recently from 6ec01f6 to b6f72f4 on June 14, 2022
@cuviper (Member, Author) commented Jun 14, 2022

I rebased again and added spawn_broadcast, including a scoped version. I need more docs/examples on that, but otherwise I think the code is in good shape here.

The last thing I'm thinking about is a specification of when a broadcast will run. Currently, it runs when the local deque is empty, before looking for jobs elsewhere. I can think of multiple options, from high to low "priority":

  • ASAP -- if we've blocked for any reason, run broadcasts first even once the blocked latch is ready.
  • After seeing that the latch is still blocked, but before popping from the local deque.
  • After the local deque, but before stealing from other threads. This is what we have now.
  • After thread stealing, but before popping from the global injector.
  • After the global injector, when there are no other sources of pending work.
  • After the whole pool is otherwise blocked/idle.

I'm not certain that the current choice is the best. These could all be supported if we kept distinct queues, and perhaps an enum argument on the broadcast methods to indicate the user's choice, but maybe that's overkill.
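Purely as illustration of what such an enum argument could look like (hypothetical; nothing like this was added), the variants mirror the priority list above, highest first:

```rust
// Hypothetical sketch only -- not part of the rayon API.
pub enum BroadcastPriority {
    Immediate,        // run ASAP, even once a blocked latch is ready
    BeforeLocalDeque, // after re-checking the latch, before local jobs
    BeforeStealing,   // after the local deque; the current behavior
    BeforeInjector,   // after stealing, before the global injector
    WhenThreadIdle,   // only when this thread has no other work
    WhenPoolIdle,     // only when the whole pool is otherwise idle
}
```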

@LoganDark commented Jun 18, 2022

> These could all be supported if we kept distinct queues, and perhaps an enum argument on the broadcast methods to indicate the user's choice, but maybe that's overkill.

It sounds like something that could be tuned, but I agree that it sounds like overkill because people won't know what to pick to get it to Just Work.

You could support an enum, or not, but the most important thing is that it's as fast as possible both when it's the only thing executing on the thread pool and when the thread pool is juggling lots and lots of work. Basically, even if you do have an enum for user choice, please have (and recommend) an API where the user does not have to choose, so that they can benefit from a tried and tested compromise.

"as fast as possible" may mean many things. Broadcasts completing before other work, or a compromise between timely execution and not ruining the efficiency of other tasks running on the thread pool. I'd personally advocate for the latter as a sane default.

@cuviper (Member, Author) commented Nov 16, 2022

Here's the current API summary:

pub fn broadcast<OP, R>(op: OP) -> Vec<R>
where
    OP: Fn(BroadcastContext<'_>) -> R + Sync,
    R: Send;

pub fn spawn_broadcast<OP>(op: OP)
where
    OP: Fn(BroadcastContext<'_>) + Send + Sync + 'static;

pub struct BroadcastContext<'a> { .. }

impl<'a> BroadcastContext<'a> {
    pub fn index(&self) -> usize;
    pub fn num_threads(&self) -> usize;
}

impl<'a> fmt::Debug for BroadcastContext<'a> { .. }

impl<'scope> Scope<'scope> {
    pub fn spawn_broadcast<BODY>(&self, body: BODY)
    where
        BODY: Fn(&Scope<'scope>, BroadcastContext<'_>) + Send + Sync + 'scope;
}

impl<'scope> ScopeFifo<'scope> {
    pub fn spawn_broadcast<BODY>(&self, body: BODY)
    where
        BODY: Fn(&ScopeFifo<'scope>, BroadcastContext<'_>) + Send + Sync + 'scope;
}

impl ThreadPool {
    pub fn broadcast<OP, R>(&self, op: OP) -> Vec<R>
    where
        OP: Fn(BroadcastContext<'_>) -> R + Sync,
        R: Send;

    pub fn spawn_broadcast<OP>(&self, op: OP)
    where
        OP: Fn(BroadcastContext<'_>) + Send + Sync + 'static;
}

I think that's pretty safe, and the current "priority" (before remote work-stealing) still feels like a reasonable default.
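For reference, a hedged usage sketch against that summary, exercising both the blocking and the fire-and-forget forms:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

fn main() {
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .build()
        .unwrap();

    // Blocking form: one result per thread, collected into a Vec.
    let labels: Vec<String> = pool.broadcast(|ctx| {
        format!("thread {} of {}", ctx.index(), ctx.num_threads())
    });
    assert_eq!(labels.len(), 4);

    // Fire-and-forget form: runs once per thread, nothing is collected.
    let hits = Arc::new(AtomicUsize::new(0));
    let hits_clone = Arc::clone(&hits);
    pool.spawn_broadcast(move |_ctx| {
        hits_clone.fetch_add(1, Ordering::Relaxed);
    });
}
```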

@cuviper (Member, Author) commented Nov 16, 2022

bors r+

bors bot merged commit 911d6d0 into rayon-rs:master on Nov 16, 2022
@cuviper deleted the broadcast branch on February 25, 2023