Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Makes Buffer's release faster #13153

Closed
franz1981 opened this issue Jan 26, 2023 · 5 comments · Fixed by #13174
Closed

Makes Buffer's release faster #13153

franz1981 opened this issue Jan 26, 2023 · 5 comments · Fixed by #13174

Comments

@franz1981
Copy link
Contributor

franz1981 commented Jan 26, 2023

After the changes in the Recycler, Recycler$LocalPool.release often appear in heavy load scenarios even for cases where the buffer is not escaping the event loop it has reused it.

The orginal Recycler implementation was very complex and with some wider effects on GC, hence I still think it was the right move, but according to the profiling data the JCTools offering (based on a atomic CAS) is still too much costy.
I suggest to record some information re the event loop where the allocation first happen and use a (event loop)-local stack/queue where short-circuit the buffer recycling.

In short:

  • allocation happen from event loop X: always use the event-loop-local pool first. If empty, use the "shared" one using the JCTools q (using relaxed poll, pretty cheap)
  • if release happen on the same originating event loop, there's no need to use any atomic construct and can release it to the event-loop-local pool (if not full? I'm open to suggestions here) or fall-back to the shared one, based on some policy.

It's not nice to have 2 separate structures, but the benefit from common usages should be significant

@franz1981
Copy link
Contributor Author

franz1981 commented Jan 26, 2023

Gently nerd sniping @chrisvest :D

@franz1981
Copy link
Contributor Author

franz1981 commented Jan 26, 2023

This discussion has started due to quarkusio/quarkus#30473 investigations made by @jfrantzius

@mostroverkhov
Copy link
Contributor

After the changes in the Recycler

Just for the info @franz1981 - if this is 4.1 related, can you point to pr/commit (sounds like this is recent change) ?

@franz1981
Copy link
Contributor Author

#11858 is the relevant work on it

@chrisvest
Copy link
Contributor

Alright, fine, I'll put up a PR 🙂

chrisvest added a commit to chrisvest/netty that referenced this issue Jan 31, 2023
Motivation:
The Recycler implementation was changed in netty#11858 to rely on an MPSC queue implementation for delivering released objects back to their originating thread local pool.
Typically, the release will often happen from the same thread that claimed the object, so the overhead of having a thread-safe release goes to waste.

Modification:
We add an unsynchronized ArrayDeque for batching claims out of the `pooledHandles`.
This amortises `claim` calls.

We then also re-introduce the concept of an owner thread (but by default only if said thread is a FastThreadLocalThread), and release directly into the claim `batch` if the release is from the owner thread.

Result:
The `RecyclerBenchmark.recyclerGetAndRecycle` benchmark sees a 27.4% improvement, and the `RecyclerBenchmark.producerConsumer` benchmark sees a 22.5% improvement.

Fixes netty#13153
normanmaurer added a commit that referenced this issue Feb 2, 2023
Motivation:
The Recycler implementation was changed in #11858 to rely on an MPSC queue implementation for delivering released objects back to their originating thread local pool.
Typically, the release will often happen from the same thread that claimed the object, so the overhead of having a thread-safe release goes to waste.

Modification:
We add an unsynchronized ArrayDeque for batching claims out of the `pooledHandles`.
This amortises `claim` calls.

We then also re-introduce the concept of an owner thread (but by default only if said thread is a FastThreadLocalThread), and release directly into the claim `batch` if the release is from the owner thread.

Result:
The `RecyclerBenchmark.recyclerGetAndRecycle` benchmark sees a 27.4% improvement, and the `RecyclerBenchmark.producerConsumer` benchmark sees a 22.5% improvement.

Fixes #13153

Co-authored-by: Norman Maurer <norman_maurer@apple.com>
normanmaurer added a commit that referenced this issue Feb 2, 2023
Motivation:
The Recycler implementation was changed in #11858 to rely on an MPSC queue implementation for delivering released objects back to their originating thread local pool.
Typically, the release will often happen from the same thread that claimed the object, so the overhead of having a thread-safe release goes to waste.

Modification:
We add an unsynchronized ArrayDeque for batching claims out of the `pooledHandles`.
This amortises `claim` calls.

We then also re-introduce the concept of an owner thread (but by default only if said thread is a FastThreadLocalThread), and release directly into the claim `batch` if the release is from the owner thread.

Result:
The `RecyclerBenchmark.recyclerGetAndRecycle` benchmark sees a 27.4% improvement, and the `RecyclerBenchmark.producerConsumer` benchmark sees a 22.5% improvement.

Fixes #13153

Co-authored-by: Norman Maurer <norman_maurer@apple.com>
franz1981 added a commit to franz1981/netty that referenced this issue Feb 22, 2023
Motivation:

Recycler's claim/release can be made faster by saving expensive volatile ops, when not needed.
For claim, always, while for release, if the owner thread is performing release itself.

Modification:

Replacing expensive volatile ops with ordered ones.

Result:

Faster Recycler's claim/release
chrisvest pushed a commit that referenced this issue Mar 7, 2023
Motivation:

Recycler's claim/release can be made faster by saving expensive volatile ops, when not needed. For claim, always, while for release, if the owner thread is performing release itself.

Modification:

Replacing expensive volatile ops with ordered ones.

Result:

Faster Recycler's claim/release
chrisvest pushed a commit that referenced this issue Mar 13, 2023
Motivation:

Recycler's claim/release can be made faster by saving expensive volatile ops, when not needed. For claim, always, while for release, if the owner thread is performing release itself.

Modification:

Replacing expensive volatile ops with ordered ones.

Result:

Faster Recycler's claim/release
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants