Stack-based thread-local storage #941

Open
jasonsewall opened this issue Jun 6, 2022 · 11 comments

@jasonsewall

jasonsewall commented Jun 6, 2022

I started a thread on users.rust-lang.org on this and I thought it would be good to bring some of the takeaways here.

The key thing is that there are cases where you want thread-local storage for the lifetime of certain tasks. The various thread-local storage crates are an option, but statically or heap-allocated storage has undesirable qualities when a stack-based solution is possible.

This is my (non-Rayon) solution; the `r` unbounded channel is the work-range queue. The final send is not ideal for performing a reduction, but one could imagine implementing a tree reduction with a similar mechanism.

    (0..nthreads).for_each(|_| {
        let r = r.clone();
        let finish_s = finish_s.clone();
        thread::spawn(move || {
            let mut my_stats = FancyStats::new();
            r.iter().for_each(|iter: Range<u32>| {
                iter.for_each(|x| my_stats.combine_from((x % 256) as u64))
            });
            finish_s.send(my_stats).unwrap();
        });
    });

I think the missing abstraction in Rayon for this is a nested task construct, where closures like the innermost ones above can borrow thread-local variables. I'm pretty new to Rust and Rayon, so I don't have an appreciation for how this fits with the implementation strategies used therein, but I do think this is a relatively common pattern. I'd be happy to work with you to explore possible solutions.

@cuviper
Member

cuviper commented Jun 7, 2022

One thing to keep in mind is that Rayon has absolutely no idea about your stack or what you're borrowing, because we don't have any compiler integration here. As a pure library, we only know basic type-system details about your code, and even that is only expressed in constraints like T: Send/Sync, F: FnOnce/FnMut/Fn, and lifetime requirements.

In your example, Rayon doesn't know anything about the FancyStats type or the my_stats instance on the stack; we just rely on rustc to check the constraints that we've set forth.

That doesn't mean this is impossible, per se, just that we may need to reframe how to think about this.

> I think the missing abstraction in Rayon for this is a nested task construct, where closures like the innermost ones above can borrow thread-local variables.

I'm not sure what you mean, because generally an inner closure can borrow from outer locals. But in a parallel context, you usually can't borrow mutably, as that makes the closure FnMut where we need Fn.

Can you sketch out a more rayon-like example, and try to give details how you think that thread-local would work?

@jasonsewall
Author

I think the challenge is that what I want to do requires that I have a per-thread context I can use from within a task.

I think that means I need to use a ThreadLocal facility (which obviously must be some sort of static storage or perhaps a TID-indexed heap entry). This might be 'rayon-like'.

The alternative, and which is closer to what I posted in my code example, is that I get the ability to create per-thread closures that then explicitly grab work from a queue (so that they can pass stack-local things to it). Rayon does allow me to create threads and it also has automatic work-stealing, but I don't see where they explicitly meet in any exposed interface.

    let (finish_s, finish_r) = channel();
    let (s, r) = unbounded();

    (0..nthreads).for_each(|_| {
        let r = r.clone();
        let finish_s = finish_s.clone();
        thread::spawn(move || {
            let mut my_stats = FancyStats::new();
            // Boxed closures so they can be sent through the channel and
            // consumed by value; a bare `&dyn FnOnce` cannot be called.
            r.iter().for_each(|f: Box<dyn FnOnce(&mut FancyStats) + Send>| {
                f(&mut my_stats);
            });
            finish_s.send(my_stats).unwrap(); // clumsy reduction, but that's a secondary point
        });
    });

Does that make sense?

@LoganDark

I JUST ran into this issue. I need to run an algorithm that needs some scratch space, but I only want to allocate 1 "scratch space" per thread. Then the algorithm can just reuse that scratch space.

This works without rayon but with rayon there's no way to "get the scratch space for this iteration/thread". thread_local! does not work because it doesn't provide mutable access.

@cuviper
Member

cuviper commented Jun 14, 2022

@LoganDark

> thread_local! does not work because it doesn't provide mutable access.

Rayon would have difficulty providing mutability as well, for much the same reason -- a globally accessible resource could be accessed by re-entrant code, even when "global" is thread-specific. Work-stealing makes this even worse, because code that is not at all recursive can still end up nested on the stack, when one part gets blocked and we steal another part of the "same" code to execute in the meantime.

@jasonsewall

> The alternative [...] is that I get the ability to create per-thread closures

I think you would like the broadcast API, #492, which was mentioned in your thread on the users forum. I have rebased and expanded that PR now, so hopefully it could be something we actually ship soon. I'd love your feedback!

@LoganDark

LoganDark commented Jun 14, 2022

> Rayon would have difficulty providing mutability as well, for much the same reason -- a globally accessible resource could be accessed by re-entrant code, even when "global" is thread-specific. Work-stealing makes this even worse, because code that is not at all recursive can still end up nested on the stack, when one part gets blocked and we steal another part of the "same" code to execute in the meantime.

In my example, each invocation of the closure passed to for_each needs its own mutable access to some "scratch space". The thread pool could keep 1 "scratch space" per thread and then give each invocation access while it executes. I don't see how work-stealing would be a problem here.

It wouldn't be globally accessible, just inside the closure. Like: you give an initializer to rayon which initializes the scratch space, to be executed in each thread. The type itself wouldn't need to be Send or Sync because it's not being shared between or sent across threads, though the initializer would need to be Send/Sync to be invoked by each thread.

It could be kind of like "zip" in that it passes an extra value to the for_each closure.

I'm bad at explaining but hopefully you can see my thought process.

@cuviper
Member

cuviper commented Jun 14, 2022

> The thread pool could keep 1 "scratch space" per thread and then give each invocation access while it executes. I don't see how work-stealing would be a problem here.

That depends on what you mean by "each invocation" -- where are each of these coming from? If it's a ParallelIterator::for_each, then there absolutely is work-stealing involved. We do have for_each_init and for_each_with that give &mut access across calls, but these are per-"split" rather than per-thread, because we don't know what work-stealing will do while we're setting that up.

The broadcast I mentioned would have exactly one invocation on each thread, but in that case you can just set up your "scratch space" as the first action of that closure.

@LoganDark

LoganDark commented Jun 14, 2022

> > The thread pool could keep 1 "scratch space" per thread and then give each invocation access while it executes. I don't see how work-stealing would be a problem here.
>
> That depends on what you mean by "each invocation" -- where are each of these coming from? If it's a ParallelIterator::for_each, then there absolutely is work-stealing involved. We do have for_each_init and for_each_with that give &mut access across calls, but these are per-"split" rather than per-thread, because we don't know what work-stealing will do while we're setting that up.
>
> The broadcast I mentioned would have exactly one invocation on each thread, but in that case you can just set up your "scratch space" as the first action of that closure.

I'm not saying there is no work-stealing, just that the work-stealing is not relevant.

At the end of the day each thread in the pool invokes the for_each closure a certain number of times. It can pass that closure a mutable reference to a piece of data kept by that thread. That data would have to be initialized separately for each thread in the thread pool, so it is effectively a "thread local". In reality it could easily be stored in the thread's stack and then dropped after the for_each is complete.

@cuviper
Member

cuviper commented Jun 15, 2022

> I'm not saying there is no work-stealing, just that the work-stealing is not relevant.

It is relevant because work-stealing makes your code implicitly re-entrant. If the first call to your code is holding that &mut T and calls into rayon for anything, it may get blocked and enter work-stealing, and we may steal another instance of that job that needs to be called with the same local &mut T. That would be mutable aliasing, which is strictly disallowed in Rust.

@LoganDark

LoganDark commented Jun 15, 2022

> > I'm not saying there is no work-stealing, just that the work-stealing is not relevant.
>
> It is relevant because work-stealing makes your code implicitly re-entrant. If the first call to your code is holding that &mut T and calls into rayon for anything, it may get blocked and enter work-stealing, and we may steal another instance of that job that needs to be called with the same local &mut T. That would be mutable aliasing, which is strictly disallowed in Rust.

I didn't know that rayon did coroutine stuff, that's wacky. Thanks for explaining.

(maybe this is solvable by, instead of using just one thread-local, using a thread-local "Vec of however many of these are needed at any one time", that is grown on demand... or actually some other data structure that can be grown without invalidating references)

@LoganDark

> I think you would like the broadcast API, #492, which was mentioned in your thread on the users forum. I have rebased and expanded that PR now, so hopefully it could be something we actually ship soon. I'd love your feedback!

I will add that this solves my problem because my algorithm is constant-time, so I can just split the work evenly. As long as there's an easy way to get the number of threads to do so.

@LoganDark

This isn't stack-based, but:

LoganDark/stackblur-iter@7996893#diff-b1a35a68f14e696205874893c07fd24fdb88882b47c23cc0e0c80a30c7d53759R117-R127

    let mut opses = vec![Some(VecDeque::new()); rayon::current_num_threads()];

    let get_scratch_opt = {
        let opses_uniq = unique::Unique::new(&mut opses[..] as *mut [Option<VecDeque<_>>]).unwrap();

        move || {
            let opses_ptr = opses_uniq.as_ptr();
            let thread = rayon::current_thread_index().unwrap();
            unsafe { (*opses_ptr).get_mut(thread).unwrap() }
        }
    };

Fun!
