
SegQueue::push() ... pop() keeps allocating. #398

Open
ralfbiedert opened this issue Jul 1, 2019 · 14 comments

@ralfbiedert

Background

To use rayon in game engine code, I was investigating why ThreadPool::install was allocating on every invocation. It turned out there are two sources of allocation:

  • One was a LockLatch that allocated a Mutex and a Condvar on every call.
  • The other was a larger, intermittent allocation happening every 32 frames. I traced this back to SegQueue.

Observation

If I run this simple test app:

use crossbeam_queue::SegQueue;

let x: SegQueue<u32> = SegQueue::new();

// `measure` is my allocation-logging helper (explained further down).
for _ in 0..100 {
    measure("push_pop", || {
        x.push(1);
        x.pop();
    });
}

I receive the following allocation events:

"push_pop" -- Events: 1, Bytes: 504
"push_pop" -- Events: 0, Bytes: 0
...
... 32 invocations later ...
... 
"push_pop" -- Events: 0, Bytes: 0
"push_pop" -- Events: 1, Bytes: 504

The allocation is caused by these lines:

if offset + 1 == BLOCK_CAP && next_block.is_none() {
    // This push writes into the last slot of the current block, so a fresh
    // block is allocated ahead of time for the pushes that follow.
    next_block = Some(Box::new(Block::<T>::new()));
}

This, in turn, is caused by the tail index just growing (it advances by 2 per push, as the trace shows):

offset 0 ... tail 0
offset 1 ... tail 2
offset 2 ... tail 4
offset 3 ... tail 6
[...]
offset 29 ... tail 58
offset 30 ... tail 60
offset 0 ... tail 64

Issue

I have to admit that I'm not 100% sure whether this is really a bug or just a "feature" of the algorithm, but I would naively expect that a loop of q.push(x); q.pop(); would not cause the SegQueue to reallocate.

@Chopinsky

Chopinsky commented Jul 2, 2019

So this is almost certainly expected behavior of the algorithm: when a Block runs out of free slots, a new block is allocated for the incoming data. That's where the magic number 32 comes into play -- it's the capacity of a Block. Even if we drop the pop from your test, we would still need to allocate at the same rate, as SegQueue won't reuse existing blocks.

Alternatively, this issue could be mitigated by putting retired blocks in a "junkyard" and pulling them back out when we need a new Block. However, this could raise security concerns (popped data would still be lying around somewhere), and maintaining the junkyard can be a headache as well (extra overhead that may very well dwarf the cost of the allocations it avoids).

@cuviper
Contributor

cuviper commented Jul 2, 2019

It seems wise to keep some available capacity when possible. A more general case could be made whenever it pops to an empty head block and there are N remaining full blocks (here N = 0) -- making that block available for new tail pushes might be nice. Then a single pop(); push() should never need to allocate.

@schets
Member

schets commented Jul 2, 2019 via email

@Chopinsky

Chopinsky commented Jul 2, 2019

A while ago I experimented with something like the junkyard idea, and even for a much simpler mpsc situation it was significantly more expensive in the average case than allocating. I remember some work being done here on an mpmc unbounded queue which would grow in a fashion similar to a vector, albeit with massive latency tails whenever a resize happened.

So instead of using an mpmc queue, can we just make the junkyard a fixed-size array (since we won't actually need an unlimited junkyard) and use a lockless structure (e.g. a naive AtomicBool per slot for in-use vs. free-to-use status) to guard the junkyard? This could help reduce allocations when frequent push/pop cycles happen in a short period of time, as described in the OP.
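
For illustration only (this is not code from the thread or from crossbeam), a minimal sketch of what such a fixed-size, lock-free junkyard could look like. Junkyard, donate and take are made-up names, B stands in for the queue's internal block type, and a null AtomicPtr per slot plays the role of the in-use flag:

use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Sketch only: a handful of slots, each holding either null or a pointer to a
// retired block that push() can reuse instead of allocating.
struct Junkyard<B> {
    slots: [AtomicPtr<B>; 4],
}

impl<B> Junkyard<B> {
    fn new() -> Self {
        Junkyard {
            slots: std::array::from_fn(|_| AtomicPtr::new(ptr::null_mut())),
        }
    }

    // pop() side: try to park a retired block in a free slot. Hands the block
    // back if every slot is occupied, so the caller can free it eagerly.
    fn donate(&self, block: Box<B>) -> Option<Box<B>> {
        let raw = Box::into_raw(block);
        for slot in &self.slots {
            if slot
                .compare_exchange(ptr::null_mut(), raw, Ordering::AcqRel, Ordering::Relaxed)
                .is_ok()
            {
                return None; // parked for later reuse
            }
        }
        Some(unsafe { Box::from_raw(raw) })
    }

    // push() side: grab a parked block, if any, before hitting the allocator.
    fn take(&self) -> Option<Box<B>> {
        for slot in &self.slots {
            let raw = slot.swap(ptr::null_mut(), Ordering::AcqRel);
            if !raw.is_null() {
                return Some(unsafe { Box::from_raw(raw) });
            }
        }
        None
    }
}

// A real version would also free any parked blocks in Drop (omitted here) and
// would have to reset a reused block's slots before handing it to push().

Whether the extra atomic traffic on a shared array actually beats a plain allocator call under contention is exactly the question raised above.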

@schets
Member

schets commented Jul 2, 2019 via email

@frostyplanet

@ralfbiedert Would you shed some light on how the measure() of allocations was done? I was looking into a memory problem in my own code and would be interested in a tool for this kind of problem.

@ralfbiedert
Author

That was just an ugly #[global_allocator] where I toggled logging on and off with some static atomic.

I uploaded some old code here; it will not compile, but if you look into achtung_allocation/src/lib.rs it should be relatively clear.
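
For anyone who just wants the general idea without digging through that repo: this is not the linked code, just a minimal sketch of a counting #[global_allocator] plus a measure() helper along the lines described above (the helper name matches the OP's snippet):

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

struct CountingAlloc;

static ENABLED: AtomicBool = AtomicBool::new(false);
static EVENTS: AtomicUsize = AtomicUsize::new(0);
static BYTES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Only count while logging is switched on.
        if ENABLED.load(Ordering::Relaxed) {
            EVENTS.fetch_add(1, Ordering::Relaxed);
            BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        }
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

// Run `f` with counting enabled and print how much it allocated.
fn measure(name: &str, f: impl FnOnce()) {
    EVENTS.store(0, Ordering::Relaxed);
    BYTES.store(0, Ordering::Relaxed);
    ENABLED.store(true, Ordering::Relaxed);
    f();
    ENABLED.store(false, Ordering::Relaxed);
    println!(
        "{:?} -- Events: {}, Bytes: {}",
        name,
        EVENTS.load(Ordering::Relaxed),
        BYTES.load(Ordering::Relaxed)
    );
}

The important detail is flipping the flag off again before printing, so the logging itself isn't counted.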

Feel free to open an issue in that repo if you have questions.

@frostyplanet

@ralfbiedert thanks very much for sharing

@iriri

iriri commented Mar 15, 2021

I was looking at the SegQueue implementation today and also noticed the somewhat unfortunate allocation behavior so I threw together a branch where up to one block is cached for reuse. This means that no additional block is allocated until the number of pushes exceeds the number of pops by BLOCK_CAP. Retaining more than one block is annoying as you have to accept some kind of compromise like never freeing a cached block until the entire queue is dropped, epoch collection, or FIFO block reuse. The first two are definitely non-starters as one of SegQueue's best qualities is its eager freeing.
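
The branch itself isn't inlined here, but the core of a one-block cache is small enough to sketch under some assumptions: invented names, and B standing in for the queue's internal block type. pop() parks the block it just retired, push() drains the slot before falling back to the allocator:

use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Sketch of a single cached block: one slot, null when empty.
struct BlockCache<B> {
    slot: AtomicPtr<B>,
}

impl<B> BlockCache<B> {
    fn new() -> Self {
        BlockCache { slot: AtomicPtr::new(ptr::null_mut()) }
    }

    // pop() side: cache the retired block unless one is already cached.
    fn put(&self, block: Box<B>) -> Option<Box<B>> {
        let raw = Box::into_raw(block);
        match self.slot.compare_exchange(
            ptr::null_mut(),
            raw,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => None,
            // Already holding a block: hand this one back to be freed eagerly.
            Err(_) => Some(unsafe { Box::from_raw(raw) }),
        }
    }

    // push() side: reuse the cached block if there is one.
    fn get(&self) -> Option<Box<B>> {
        let raw = self.slot.swap(ptr::null_mut(), Ordering::AcqRel);
        if raw.is_null() {
            None
        } else {
            Some(unsafe { Box::from_raw(raw) })
        }
    }
}

// As above, a real version would free a still-cached block in Drop and reset
// the block's slots before reuse; both are omitted in this sketch.

With something like this in place, alternating push/pop keeps hitting the cached block, and only a net growth of more than BLOCK_CAP elements forces a new allocation, which matches the behavior described above.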

The benchmarks show some mild improvements but I don't think they're really representative of typical use cases.

Linux x86-64 (16 cores)

master:

test unbounded::create    ... bench:         124 ns/iter (+/- 4)
test unbounded::inout     ... bench:          40 ns/iter (+/- 1)
test unbounded::mpmc      ... bench:     553,855 ns/iter (+/- 17,118)
test unbounded::mpsc      ... bench:   1,162,611 ns/iter (+/- 125,553)
test unbounded::oneshot   ... bench:         195 ns/iter (+/- 5)
test unbounded::par_inout ... bench:   2,224,968 ns/iter (+/- 55,949)
test unbounded::spmc      ... bench:   1,211,853 ns/iter (+/- 65,897)
test unbounded::spsc      ... bench:     956,677 ns/iter (+/- 47,411)

cache-segqueue-block:

test unbounded::create    ... bench:         122 ns/iter (+/- 3)
test unbounded::inout     ... bench:          40 ns/iter (+/- 1)
test unbounded::mpmc      ... bench:     518,356 ns/iter (+/- 23,531)
test unbounded::mpsc      ... bench:   1,130,890 ns/iter (+/- 70,117)
test unbounded::oneshot   ... bench:         207 ns/iter (+/- 5)
test unbounded::par_inout ... bench:   2,174,341 ns/iter (+/- 33,822)
test unbounded::spmc      ... bench:   1,115,574 ns/iter (+/- 65,119)
test unbounded::spsc      ... bench:     784,513 ns/iter (+/- 29,810)

Windows x86-64 (8 cores)

master:

test unbounded::create    ... bench:         105 ns/iter (+/- 1)
test unbounded::inout     ... bench:          47 ns/iter (+/- 0)
test unbounded::mpmc      ... bench:   1,620,820 ns/iter (+/- 116,951)
test unbounded::mpsc      ... bench:   5,490,485 ns/iter (+/- 184,075)
test unbounded::oneshot   ... bench:         215 ns/iter (+/- 4)
test unbounded::par_inout ... bench:   9,814,435 ns/iter (+/- 79,270)
test unbounded::spmc      ... bench:  47,891,120 ns/iter (+/- 57,213,579)
test unbounded::spsc      ... bench:   1,190,482 ns/iter (+/- 362,679)

cache-segqueue-block:

test unbounded::create    ... bench:         104 ns/iter (+/- 2)
test unbounded::inout     ... bench:          51 ns/iter (+/- 1)
test unbounded::mpmc      ... bench:   1,583,942 ns/iter (+/- 53,008)
test unbounded::mpsc      ... bench:   5,526,165 ns/iter (+/- 161,594)
test unbounded::oneshot   ... bench:         218 ns/iter (+/- 3)
test unbounded::par_inout ... bench:   9,757,505 ns/iter (+/- 211,407)
test unbounded::spmc      ... bench:  32,795,000 ns/iter (+/- 37,473,800)
test unbounded::spsc      ... bench:     888,141 ns/iter (+/- 169,239)

Another quirk in SegQueue is the lazy allocation of the first block. This is necessitated by SegQueue::new being a const fn, which honestly doesn't seem very useful right now.
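
For context, the main thing the const fn constructor enables is putting the queue directly in a static; a minimal example, assuming the current crossbeam_queue API:

use crossbeam_queue::SegQueue;

// Possible only because `SegQueue::new` is a `const fn`.
static QUEUE: SegQueue<u32> = SegQueue::new();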

@cuviper
Contributor

cuviper commented Mar 16, 2021

FWIW, rayon-rs/rayon#791 switched to crossbeam_deque::Injector, but at the time I saw that the code was almost identical to SegQueue, so I would guess that the same allocation questions arise. (It doesn't have the lazy first block though.) I hope your fixes could be applied to Injector too, because those performance numbers look promising!

@iriri

iriri commented Mar 17, 2021

If one were to limit the junkyard to the pointer size of the machine, you could simply use an atomic bitmask, which might be simple enough to beat allocation.

So I implemented a variant of this on an aptly named branch where the head and tail of an intentionally imperfect four-element ring buffer are stuffed into a single AtomicU64, and it doesn't seem to do any better on the existing benchmarks than the version that caches only a single block. The SPSC benchmark might be ~5% faster on Windows x86-64, but there's a lot of run-to-run variation, so it's hard to be sure.

FWIW, rayon-rs/rayon#791 switched to crossbeam_deque::Injector, but at the time I saw that the code was almost identical to SegQueue, so I would guess that the same allocation questions arise. (It doesn't have the lazy first block though.) I hope your fixes could be applied to Injector too, because those performance numbers look promising!

I haven't looked at deque.rs too closely yet but it's a little strange that it just inlines what appears to be more or less a superset of seg_queue.rs. But yes, the same optimization can be done there too.

I think the main difficulty here is coming up with some benchmarks to actually determine if this change is worth it or not. Concurrent queue benchmarks tend to be extremely easy to game as less concurrency usually means higher throughput.

@cuviper
Contributor

cuviper commented Mar 17, 2021

The timing variance might be interesting too, if you can measure that -- it will probably be more consistent in the happy path that simply keeps reusing the same allocation.

@iriri

iriri commented Oct 17, 2021

Decided to revisit this after some time and added a rough latency benchmark. There is some run-to-run variation (a lot, actually, in the case of master) but these results are fairly representative. I was hoping I could abandon the branch where I cache multiple blocks, but unfortunately it seems to do quite a bit better here than caching just one.
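
The actual benchmark lives on the branches above; purely as an illustration of what such a rough latency measurement can look like, here is a simplified SPSC-only sketch (not the real harness) that times batches of pops and reports min/max/mean/stddev in seconds, assuming the crossbeam-queue API where pop() returns an Option:

use crossbeam_queue::SegQueue;
use std::sync::Arc;
use std::thread;
use std::time::Instant;

fn main() {
    const TOTAL: u64 = 100_000;
    const BATCH: u64 = 1_000;

    let q = Arc::new(SegQueue::new());
    let producer = {
        let q = Arc::clone(&q);
        thread::spawn(move || {
            for i in 0..TOTAL {
                q.push(i);
            }
        })
    };

    // Time how long each batch of pops takes (including any spinning while
    // the queue is momentarily empty) and collect the samples.
    let mut samples = Vec::new();
    let mut received = 0u64;
    while received < TOTAL {
        let start = Instant::now();
        let mut batch = 0;
        while batch < BATCH && received < TOTAL {
            if q.pop().is_some() {
                received += 1;
                batch += 1;
            }
        }
        samples.push(start.elapsed().as_secs_f64());
    }
    producer.join().unwrap();

    let min = samples.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = samples.iter().cloned().fold(0.0_f64, f64::max);
    let mean = samples.iter().sum::<f64>() / samples.len() as f64;
    let var = samples.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / samples.len() as f64;
    println!(
        "spsc | min: {:.9}, max: {:.9}, mean: {:.9}, stddev: {:.9}",
        min, max, mean, var.sqrt()
    );
}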

Linux x86-64 (8 cores)

master:

spsc | min: 0.000021100, max: 0.000168200, mean: 0.000037867, stddev: 0.000000412
spsc | min: 0.000018300, max: 0.000390400, mean: 0.000033402, stddev: 0.000000869
mpsc | min: 0.000013600, max: 0.007470500, mean: 0.000469169, stddev: 0.000014732
mpsc | min: 0.000096200, max: 0.017162200, mean: 0.001090689, stddev: 0.000031671
spmc | min: 0.000016400, max: 0.000177900, mean: 0.000043798, stddev: 0.000000613
spmc | min: 0.000015800, max: 0.000243400, mean: 0.000043987, stddev: 0.000000597
mpmc | min: 0.000018300, max: 0.010767200, mean: 0.000644508, stddev: 0.000021174
mpmc | min: 0.000014700, max: 0.013832700, mean: 0.000513442, stddev: 0.000022520

Caching one block:

spsc | min: 0.000017100, max: 0.000142800, mean: 0.000026373, stddev: 0.000000295
spsc | min: 0.000017200, max: 0.000122100, mean: 0.000022872, stddev: 0.000000158
mpsc | min: 0.000031600, max: 0.002537700, mean: 0.000549827, stddev: 0.000005980
mpsc | min: 0.000017900, max: 0.018825900, mean: 0.000755237, stddev: 0.000023489
spmc | min: 0.000014900, max: 0.000116100, mean: 0.000023121, stddev: 0.000000142
spmc | min: 0.000015300, max: 0.000115700, mean: 0.000019334, stddev: 0.000000156
mpmc | min: 0.000020900, max: 0.020000700, mean: 0.000364887, stddev: 0.000019470
mpmc | min: 0.000020300, max: 0.008355400, mean: 0.000300422, stddev: 0.000011621

Caching four blocks:

spsc | min: 0.000014500, max: 0.000104800, mean: 0.000017722, stddev: 0.000000110
spsc | min: 0.000016400, max: 0.000136500, mean: 0.000019739, stddev: 0.000000360
mpsc | min: 0.000017500, max: 0.004407900, mean: 0.000139323, stddev: 0.000005644
mpsc | min: 0.000015900, max: 0.002580300, mean: 0.000097392, stddev: 0.000003047
spmc | min: 0.000021000, max: 0.000073600, mean: 0.000023748, stddev: 0.000000101
spmc | min: 0.000014200, max: 0.000393100, mean: 0.000019547, stddev: 0.000000539
mpmc | min: 0.000019400, max: 0.001835600, mean: 0.000142146, stddev: 0.000003811
mpmc | min: 0.000015600, max: 0.008020400, mean: 0.000162701, stddev: 0.000009684

Windows x86-64 (8 cores)

master:

spsc | min: 0.000016300, max: 0.000164700, mean: 0.000022681, stddev: 0.000000472
spsc | min: 0.000013900, max: 0.000142100, mean: 0.000018740, stddev: 0.000000325
mpsc | min: 0.000064100, max: 0.001707300, mean: 0.000293963, stddev: 0.000002431
mpsc | min: 0.000019300, max: 0.002056900, mean: 0.000606370, stddev: 0.000010348
spmc | min: 0.000014700, max: 0.000171900, mean: 0.000024700, stddev: 0.000000506
spmc | min: 0.000016000, max: 0.000197400, mean: 0.000024749, stddev: 0.000000516
mpmc | min: 0.000019300, max: 0.001680300, mean: 0.000331722, stddev: 0.000004057
mpmc | min: 0.000027200, max: 0.000484300, mean: 0.000231683, stddev: 0.000000722

Caching one block:

spsc | min: 0.000017500, max: 0.000274400, mean: 0.000022215, stddev: 0.000000417
spsc | min: 0.000014900, max: 0.000168100, mean: 0.000021970, stddev: 0.000000572
mpsc | min: 0.000020300, max: 0.001446700, mean: 0.000498424, stddev: 0.000005247
mpsc | min: 0.000018300, max: 0.001163600, mean: 0.000451278, stddev: 0.000004430
spmc | min: 0.000014500, max: 0.000221800, mean: 0.000022019, stddev: 0.000000546
spmc | min: 0.000014300, max: 0.000228200, mean: 0.000021656, stddev: 0.000000587
mpmc | min: 0.000018200, max: 0.006011800, mean: 0.000314497, stddev: 0.000006603
mpmc | min: 0.000016000, max: 0.002495500, mean: 0.000218906, stddev: 0.000004683

Caching four blocks:

spsc | min: 0.000015700, max: 0.000243100, mean: 0.000021335, stddev: 0.000000439
spsc | min: 0.000014600, max: 0.000147500, mean: 0.000019120, stddev: 0.000000433
mpsc | min: 0.000218700, max: 0.000928000, mean: 0.000434571, stddev: 0.000003554
mpsc | min: 0.000190500, max: 0.000869100, mean: 0.000253494, stddev: 0.000001212
spmc | min: 0.000014500, max: 0.000212200, mean: 0.000022515, stddev: 0.000000526
spmc | min: 0.000014700, max: 0.000245100, mean: 0.000021777, stddev: 0.000000517
mpmc | min: 0.000019200, max: 0.002800600, mean: 0.000231056, stddev: 0.000002596
mpmc | min: 0.000066100, max: 0.000900200, mean: 0.000288129, stddev: 0.000001246

@SUPERCILEX

@taiki-e Any updates on this?
