Make releasing objects back to Recycler faster #13174

Merged: 2 commits merged into netty:4.1 from chrisvest:41-recycler-opt on Feb 2, 2023

Conversation

@chrisvest (Contributor):

Motivation:
The Recycler implementation was changed in #11858 to rely on an MPSC queue implementation for delivering released objects back to their originating thread-local pool. Typically, though, the release happens from the same thread that claimed the object, so the overhead of a thread-safe release goes to waste.

Modification:
We add an unsynchronized ArrayDeque for batching claims out of the pooledHandles queue. This amortises the cost of claim calls.

We also re-introduce the concept of an owner thread (by default only when that thread is a FastThreadLocalThread), and release directly into the claim batch when the release happens on the owner thread.

Result:
The RecyclerBenchmark.recyclerGetAndRecycle benchmark sees a 27.4% improvement, and the RecyclerBenchmark.producerConsumer benchmark sees a 22.5% improvement.

Fixes #13153
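
To make the Modification section above concrete, here is a rough sketch of the claim/release shape it describes. This is not the actual Netty code: `ConcurrentLinkedQueue` stands in for the MPSC queue, the batch size constant is made up, and `LocalPoolSketch` is a hypothetical name; only the `batch`/`pooledHandles`/owner-thread roles follow the PR description.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch only; names and structure are simplified, not Netty's Recycler.
final class LocalPoolSketch<T> {
    private static final int BATCH_SIZE = 16; // made-up chunk size for the sketch

    private final ArrayDeque<T> batch = new ArrayDeque<>();               // unsynchronized, owner-thread only
    private final Queue<T> pooledHandles = new ConcurrentLinkedQueue<>(); // stand-in for the MPSC queue
    private final Thread owner = Thread.currentThread();                  // in the PR, only tracked for FastThreadLocalThreads

    T claim() {
        if (batch.isEmpty()) {
            // Refill the private batch with one pass over the thread-safe queue,
            // amortising its cost across many subsequent claim() calls.
            for (int i = 0; i < BATCH_SIZE; i++) {
                T handle = pooledHandles.poll();
                if (handle == null) {
                    break;
                }
                batch.addLast(handle);
            }
        }
        return batch.pollFirst(); // cheap single-threaded access; null means "allocate a new object"
    }

    void release(T handle) {
        if (Thread.currentThread() == owner) {
            // Fast path: released by the owner thread, so the handle goes straight
            // into the unsynchronized batch and skips the shared queue entirely.
            batch.addLast(handle);
        } else {
            // Slow path: a cross-thread release still goes through the shared queue.
            pooledHandles.offer(handle);
        }
    }
}
```

The point of the sketch is that both the refill loop in `claim()` and the owner-thread branch in `release()` touch the thread-safe queue far less often than a release-per-object scheme would.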

Previous performance:

Benchmark                                                    Mode  Cnt      Score     Error   Units
RecyclerBenchmark.plainNew                                   avgt   20      2.482 ±   0.028   ns/op
RecyclerBenchmark.plainNew:·gc.alloc.rate                    avgt   20  30743.573 ± 344.199  MB/sec
RecyclerBenchmark.plainNew:·gc.alloc.rate.norm               avgt   20     80.000 ±   0.001    B/op
RecyclerBenchmark.plainNew:·gc.count                         avgt   20   1347.000            counts
RecyclerBenchmark.plainNew:·gc.time                          avgt   20    790.000                ms
RecyclerBenchmark.producerConsumer                           avgt   20    256.606 ±  76.322   ns/op
RecyclerBenchmark.producerConsumer:consumer                  avgt   20    256.604 ±  76.322   ns/op
RecyclerBenchmark.producerConsumer:producer                  avgt   20    256.608 ±  76.322   ns/op
RecyclerBenchmark.producerConsumer:·gc.alloc.rate            avgt   20    108.235 ±  18.323  MB/sec
RecyclerBenchmark.producerConsumer:·gc.alloc.rate.norm       avgt   20     15.485 ±   6.790    B/op
RecyclerBenchmark.producerConsumer:·gc.count                 avgt   20      5.000            counts
RecyclerBenchmark.producerConsumer:·gc.time                  avgt   20      3.000                ms
RecyclerBenchmark.recyclerGetAndOrphan                       avgt   20      6.114 ±   0.052   ns/op
RecyclerBenchmark.recyclerGetAndOrphan:·gc.alloc.rate        avgt   20  12945.295 ± 110.092  MB/sec
RecyclerBenchmark.recyclerGetAndOrphan:·gc.alloc.rate.norm   avgt   20     83.000 ±   0.001    B/op
RecyclerBenchmark.recyclerGetAndOrphan:·gc.count             avgt   20    567.000            counts
RecyclerBenchmark.recyclerGetAndOrphan:·gc.time              avgt   20    323.000                ms
RecyclerBenchmark.recyclerGetAndRecycle                      avgt   20     19.754 ±   0.627   ns/op
RecyclerBenchmark.recyclerGetAndRecycle:·gc.alloc.rate       avgt   20      0.001 ±   0.003  MB/sec
RecyclerBenchmark.recyclerGetAndRecycle:·gc.alloc.rate.norm  avgt   20     ≈ 10⁻⁵              B/op
RecyclerBenchmark.recyclerGetAndRecycle:·gc.count            avgt   20        ≈ 0            counts

This change:

Benchmark                                                    Mode  Cnt      Score     Error   Units
RecyclerBenchmark.plainNew                                   avgt   20      2.476 ±   0.009   ns/op
RecyclerBenchmark.plainNew:·gc.alloc.rate                    avgt   20  30812.454 ± 116.539  MB/sec
RecyclerBenchmark.plainNew:·gc.alloc.rate.norm               avgt   20     80.000 ±   0.001    B/op
RecyclerBenchmark.plainNew:·gc.count                         avgt   20   1351.000            counts
RecyclerBenchmark.plainNew:·gc.time                          avgt   20    733.000                ms
RecyclerBenchmark.producerConsumer                           avgt   20    198.829 ±  10.838   ns/op
RecyclerBenchmark.producerConsumer:consumer                  avgt   20    198.828 ±  10.839   ns/op
RecyclerBenchmark.producerConsumer:producer                  avgt   20    198.829 ±  10.838   ns/op
RecyclerBenchmark.producerConsumer:·gc.alloc.rate            avgt   20     92.369 ±  25.226  MB/sec
RecyclerBenchmark.producerConsumer:·gc.alloc.rate.norm       avgt   20      9.802 ±   3.137    B/op
RecyclerBenchmark.producerConsumer:·gc.count                 avgt   20      3.000            counts
RecyclerBenchmark.producerConsumer:·gc.time                  avgt   20      2.000                ms
RecyclerBenchmark.recyclerGetAndOrphan                       avgt   20      6.850 ±   0.034   ns/op
RecyclerBenchmark.recyclerGetAndOrphan:·gc.alloc.rate        avgt   20  11554.170 ±  56.474  MB/sec
RecyclerBenchmark.recyclerGetAndOrphan:·gc.alloc.rate.norm   avgt   20     83.000 ±   0.001    B/op
RecyclerBenchmark.recyclerGetAndOrphan:·gc.count             avgt   20    507.000            counts
RecyclerBenchmark.recyclerGetAndOrphan:·gc.time              avgt   20    288.000                ms
RecyclerBenchmark.recyclerGetAndRecycle                      avgt   20     14.330 ±   0.010   ns/op
RecyclerBenchmark.recyclerGetAndRecycle:·gc.alloc.rate       avgt   20      0.001 ±   0.003  MB/sec
RecyclerBenchmark.recyclerGetAndRecycle:·gc.alloc.rate.norm  avgt   20     ≈ 10⁻⁵              B/op
RecyclerBenchmark.recyclerGetAndRecycle:·gc.count            avgt   20        ≈ 0            counts

@chrisvest (Contributor, Author):

Note that you need to set the io.netty.recycler.batchFastThreadLocalOnly system property to false in order to observe the benchmark results, since the benchmark does not use FastThreadLocalThreads.
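For anyone reproducing these numbers, a minimal way to flip that flag is shown below. The launcher class name is hypothetical; since Netty typically reads such properties in static initialisers, the property has to be set before io.netty.util.Recycler is first loaded, or passed as a JVM argument.

```java
// Run the benchmark JVM with the flag disabled, e.g.:
//   java -Dio.netty.recycler.batchFastThreadLocalOnly=false ...
// or set it programmatically before io.netty.util.Recycler is first loaded:
public final class RecyclerBenchmarkLauncher {
    public static void main(String[] args) throws Exception {
        System.setProperty("io.netty.recycler.batchFastThreadLocalOnly", "false");
        // ... start the JMH runner for RecyclerBenchmark from this plain (non-FastThreadLocal) thread ...
    }
}
```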

@normanmaurer (Member) left a comment:

Overall looks good to me!

@chrisvest (Contributor, Author):

@normanmaurer Applied your suggestions

@franz1981 (Contributor) left a comment:

Thanks Chris, sadly I am on sick leave today, so I cannot run the microbenchmarks or the end-to-end tests (but I will after the merge, just in case); just one note on something that I both like and see as "dangerous": the batchy drain into the array deque. The thread-local array deque is a black hole: once it drains a batch, that batch is no longer available to the shared queue, so maybe draining a single element at a time is enough. It could be the fever talking, but let me know if it makes sense.
It's not nice to pay twice on acquire, but the release was the costly part, so in the overall picture it shouldn't matter much.

@chrisvest (Contributor, Author):

@franz1981 Get well soon. Your reservation does not entirely make sense to me, though: the local pools are single-consumer, including the pooledHandles queue, so taking objects from the batch or from pooledHandles makes no difference to object availability in other threads. The batch drain amortises the thread-safe consumption so that we can poll the individual handles with simple sequential code.
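
A toy illustration of that single-consumer argument (not Netty code; `ConcurrentLinkedQueue` again stands in for the MPSC queue): any number of threads may produce into the shared queue via release, but only the owner thread ever polls it, so moving elements into a private batch takes nothing away from the other threads.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Toy demo of the single-consumer point; names are illustrative, not Netty's.
public final class SingleConsumerDemo {
    public static void main(String[] args) throws InterruptedException {
        Queue<Integer> shared = new ConcurrentLinkedQueue<>(); // stands in for pooledHandles
        ArrayDeque<Integer> batch = new ArrayDeque<>();        // owner-private, unsynchronized

        // Other threads only ever produce (release) into the shared queue...
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                shared.offer(i);
            }
        });
        producer.start();
        producer.join();

        // ...while only the owner thread consumes. Draining into the private batch
        // therefore removes nothing that another thread could otherwise have taken.
        Integer handle;
        while ((handle = shared.poll()) != null) {
            batch.addLast(handle);
        }
        System.out.println("batched " + batch.size() + " handles for cheap local claims");
    }
}
```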

@franz1981 (Contributor):

Totally right @chrisvest, I can blame the fever indeed 😉
Go ahead then, for me this is a wonderful improvement.

@normanmaurer merged commit 8a8337e into netty:4.1 on Feb 2, 2023
@normanmaurer (Member):

@chrisvest amazing change!

@normanmaurer added this to the 4.1.88.Final milestone on Feb 2, 2023
normanmaurer added a commit that referenced this pull request Feb 2, 2023
@chrisvest deleted the 41-recycler-opt branch on February 2, 2023 14:10
Closes #13153: "Makes Buffer's release faster"