Make releasing objects back to Recycler faster #13174

Merged: 2 commits merged into netty:4.1 from chrisvest:41-recycler-opt on Feb 2, 2023

Conversation

@chrisvest (Contributor):

Motivation:
The Recycler implementation was changed in #11858 to rely on an MPSC queue implementation for delivering released objects back to their originating thread-local pool. Typically, though, the release happens from the same thread that claimed the object, so the overhead of a thread-safe release goes to waste.

Modification:
We add an unsynchronized ArrayDeque for batching claims out of the pooledHandles queue. This amortises the cost of claim calls.

We also re-introduce the concept of an owner thread (by default only when that thread is a FastThreadLocalThread), and release directly into the claim batch when the release happens on the owner thread.

Result:
The RecyclerBenchmark.recyclerGetAndRecycle benchmark sees a 27.4% improvement, and the RecyclerBenchmark.producerConsumer benchmark sees a 22.5% improvement.

Fixes #13153
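
To make the Modification section above concrete, here is a rough sketch of the claim/release shape it describes. This is not the actual Netty code: `ConcurrentLinkedQueue` stands in for the MPSC queue, the batch size constant is made up, and `LocalPoolSketch` is a hypothetical name; only the `batch`/`pooledHandles`/owner-thread roles follow the PR description.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch only; names and structure are simplified, not Netty's Recycler.
final class LocalPoolSketch<T> {
    private static final int BATCH_SIZE = 16; // made-up chunk size for the sketch

    private final ArrayDeque<T> batch = new ArrayDeque<>();               // unsynchronized, owner-thread only
    private final Queue<T> pooledHandles = new ConcurrentLinkedQueue<>(); // stand-in for the MPSC queue
    private final Thread owner = Thread.currentThread();                  // in the PR, only tracked for FastThreadLocalThreads

    T claim() {
        if (batch.isEmpty()) {
            // Refill the private batch with one pass over the thread-safe queue,
            // amortising its cost across many subsequent claim() calls.
            for (int i = 0; i < BATCH_SIZE; i++) {
                T handle = pooledHandles.poll();
                if (handle == null) {
                    break;
                }
                batch.addLast(handle);
            }
        }
        return batch.pollFirst(); // cheap single-threaded access; null means "allocate a new object"
    }

    void release(T handle) {
        if (Thread.currentThread() == owner) {
            // Fast path: released by the owner thread, so the handle goes straight
            // into the unsynchronized batch and skips the shared queue entirely.
            batch.addLast(handle);
        } else {
            // Slow path: a cross-thread release still goes through the shared queue.
            pooledHandles.offer(handle);
        }
    }
}
```

The point of the sketch is that both the refill loop in `claim()` and the owner-thread branch in `release()` touch the thread-safe queue far less often than a release-per-object scheme would.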

Previous performance:

Benchmark                                                    Mode  Cnt      Score     Error   Units
RecyclerBenchmark.plainNew                                   avgt   20      2.482 ±   0.028   ns/op
RecyclerBenchmark.plainNew:·gc.alloc.rate                    avgt   20  30743.573 ± 344.199  MB/sec
RecyclerBenchmark.plainNew:·gc.alloc.rate.norm               avgt   20     80.000 ±   0.001    B/op
RecyclerBenchmark.plainNew:·gc.count                         avgt   20   1347.000            counts
RecyclerBenchmark.plainNew:·gc.time                          avgt   20    790.000                ms
RecyclerBenchmark.producerConsumer                           avgt   20    256.606 ±  76.322   ns/op
RecyclerBenchmark.producerConsumer:consumer                  avgt   20    256.604 ±  76.322   ns/op
RecyclerBenchmark.producerConsumer:producer                  avgt   20    256.608 ±  76.322   ns/op
RecyclerBenchmark.producerConsumer:·gc.alloc.rate            avgt   20    108.235 ±  18.323  MB/sec
RecyclerBenchmark.producerConsumer:·gc.alloc.rate.norm       avgt   20     15.485 ±   6.790    B/op
RecyclerBenchmark.producerConsumer:·gc.count                 avgt   20      5.000            counts
RecyclerBenchmark.producerConsumer:·gc.time                  avgt   20      3.000                ms
RecyclerBenchmark.recyclerGetAndOrphan                       avgt   20      6.114 ±   0.052   ns/op
RecyclerBenchmark.recyclerGetAndOrphan:·gc.alloc.rate        avgt   20  12945.295 ± 110.092  MB/sec
RecyclerBenchmark.recyclerGetAndOrphan:·gc.alloc.rate.norm   avgt   20     83.000 ±   0.001    B/op
RecyclerBenchmark.recyclerGetAndOrphan:·gc.count             avgt   20    567.000            counts
RecyclerBenchmark.recyclerGetAndOrphan:·gc.time              avgt   20    323.000                ms
RecyclerBenchmark.recyclerGetAndRecycle                      avgt   20     19.754 ±   0.627   ns/op
RecyclerBenchmark.recyclerGetAndRecycle:·gc.alloc.rate       avgt   20      0.001 ±   0.003  MB/sec
RecyclerBenchmark.recyclerGetAndRecycle:·gc.alloc.rate.norm  avgt   20     ≈ 10⁻⁵              B/op
RecyclerBenchmark.recyclerGetAndRecycle:·gc.count            avgt   20        ≈ 0            counts

This change:

Benchmark                                                    Mode  Cnt      Score     Error   Units
RecyclerBenchmark.plainNew                                   avgt   20      2.476 ±   0.009   ns/op
RecyclerBenchmark.plainNew:·gc.alloc.rate                    avgt   20  30812.454 ± 116.539  MB/sec
RecyclerBenchmark.plainNew:·gc.alloc.rate.norm               avgt   20     80.000 ±   0.001    B/op
RecyclerBenchmark.plainNew:·gc.count                         avgt   20   1351.000            counts
RecyclerBenchmark.plainNew:·gc.time                          avgt   20    733.000                ms
RecyclerBenchmark.producerConsumer                           avgt   20    198.829 ±  10.838   ns/op
RecyclerBenchmark.producerConsumer:consumer                  avgt   20    198.828 ±  10.839   ns/op
RecyclerBenchmark.producerConsumer:producer                  avgt   20    198.829 ±  10.838   ns/op
RecyclerBenchmark.producerConsumer:·gc.alloc.rate            avgt   20     92.369 ±  25.226  MB/sec
RecyclerBenchmark.producerConsumer:·gc.alloc.rate.norm       avgt   20      9.802 ±   3.137    B/op
RecyclerBenchmark.producerConsumer:·gc.count                 avgt   20      3.000            counts
RecyclerBenchmark.producerConsumer:·gc.time                  avgt   20      2.000                ms
RecyclerBenchmark.recyclerGetAndOrphan                       avgt   20      6.850 ±   0.034   ns/op
RecyclerBenchmark.recyclerGetAndOrphan:·gc.alloc.rate        avgt   20  11554.170 ±  56.474  MB/sec
RecyclerBenchmark.recyclerGetAndOrphan:·gc.alloc.rate.norm   avgt   20     83.000 ±   0.001    B/op
RecyclerBenchmark.recyclerGetAndOrphan:·gc.count             avgt   20    507.000            counts
RecyclerBenchmark.recyclerGetAndOrphan:·gc.time              avgt   20    288.000                ms
RecyclerBenchmark.recyclerGetAndRecycle                      avgt   20     14.330 ±   0.010   ns/op
RecyclerBenchmark.recyclerGetAndRecycle:·gc.alloc.rate       avgt   20      0.001 ±   0.003  MB/sec
RecyclerBenchmark.recyclerGetAndRecycle:·gc.alloc.rate.norm  avgt   20     ≈ 10⁻⁵              B/op
RecyclerBenchmark.recyclerGetAndRecycle:·gc.count            avgt   20        ≈ 0            counts

@chrisvest (Contributor, Author):

Note that you need to set the io.netty.recycler.batchFastThreadLocalOnly system property to false in order to observe the benchmark results, since the benchmark does not use FastThreadLocalThreads.
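For anyone reproducing these numbers, a minimal way to flip that flag is shown below. The launcher class name is hypothetical; since Netty typically reads such properties in static initialisers, the property has to be set before io.netty.util.Recycler is first loaded, or passed as a JVM argument.

```java
// Run the benchmark JVM with the flag disabled, e.g.:
//   java -Dio.netty.recycler.batchFastThreadLocalOnly=false ...
// or set it programmatically before io.netty.util.Recycler is first loaded:
public final class RecyclerBenchmarkLauncher {
    public static void main(String[] args) throws Exception {
        System.setProperty("io.netty.recycler.batchFastThreadLocalOnly", "false");
        // ... start the JMH runner for RecyclerBenchmark from this plain (non-FastThreadLocal) thread ...
    }
}
```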

@normanmaurer (Member) left a comment:

Overall looks good to me!

@chrisvest (Contributor, Author):

@normanmaurer Applied your suggestions

@franz1981 (Contributor) left a comment:

Thanks Chris, sadly I am on sick leave today, so I cannot run the microbenchmarks or the end-to-end tests (but I will after the merge, just in case); just one note on something that I both like and see as "dangerous": the batchy drain into the array deque. The thread-local array deque is a black hole: once it drains a batch, that batch is no longer available to the shared queue, so maybe draining a single element at a time is enough. It could be the fever talking, but let me know if it makes sense.
It's not nice to pay twice on acquire, but the release was the costly part, so in the overall picture it shouldn't matter much.

@chrisvest (Contributor, Author):

@franz1981 Get well soon. Your reservation does not entirely make sense to me, though: the local pools are single-consumer, including the pooledHandles queue, so taking objects from the batch or from pooledHandles makes no difference to object availability in other threads. The batch drain amortises the thread-safe consumption so that we can poll the individual handles with simple sequential code.
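
A toy illustration of that single-consumer argument (not Netty code; `ConcurrentLinkedQueue` again stands in for the MPSC queue): any number of threads may produce into the shared queue via release, but only the owner thread ever polls it, so moving elements into a private batch takes nothing away from the other threads.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Toy demo of the single-consumer point; names are illustrative, not Netty's.
public final class SingleConsumerDemo {
    public static void main(String[] args) throws InterruptedException {
        Queue<Integer> shared = new ConcurrentLinkedQueue<>(); // stands in for pooledHandles
        ArrayDeque<Integer> batch = new ArrayDeque<>();        // owner-private, unsynchronized

        // Other threads only ever produce (release) into the shared queue...
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                shared.offer(i);
            }
        });
        producer.start();
        producer.join();

        // ...while only the owner thread consumes. Draining into the private batch
        // therefore removes nothing that another thread could otherwise have taken.
        Integer handle;
        while ((handle = shared.poll()) != null) {
            batch.addLast(handle);
        }
        System.out.println("batched " + batch.size() + " handles for cheap local claims");
    }
}
```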

@franz1981 (Contributor):

Totally right @chrisvest, I can blame the fever indeed 😉
Go ahead then, for me this is a wonderful improvement.

@normanmaurer merged commit 8a8337e into netty:4.1 on Feb 2, 2023
@normanmaurer (Member):

@chrisvest amazing change!

@normanmaurer added this to the 4.1.88.Final milestone on Feb 2, 2023
normanmaurer added a commit that referenced this pull request Feb 2, 2023
@chrisvest deleted the 41-recycler-opt branch on February 2, 2023 14:10
Closes #13153: "Makes Buffer's release faster"