Faster Recycler's claim/release (Fixes #13153) #13220
Conversation
Despite the nice changes by @chrisvest in #13174, the recycler still pops up in profiling data for pooled-allocation-intensive scenarios; I've dug into the assembly and found the two code paths this PR is addressing. I'm convinced about netty/buffer/src/main/java/io/netty/buffer/AbstractReferenceCountedByteBuf.java Lines 109 to 114 in f9e765e
Note: netty/buffer/src/main/java/io/netty/buffer/PooledByteBuf.java Lines 170 to 182 in 1751440
I'll run some tests tomorrow vs #13174 (comment)
FYI
It's fine, unless:
I'm running some end-to-end tests on these changes, and I can tell the perf difference is quite shocking, given that this is such a hot op. Just a note about being "general purpose": currently the recycler always makes use of thread local(s), which are not that good with virtual threads and/or thread pools whose threads can be short-lived; hence, in order to be fully friendly to any alien usage (e.g. outside of the event loop), we should fix that one too. It's a separate issue and I can file one to track it.
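A minimal sketch of the thread-local problem being described (hypothetical names, not Netty's actual Recycler): each thread owns its cache, so a short-lived or virtual thread that releases an object and then dies strands that object, and every fresh thread starts cold:

```java
import java.util.ArrayDeque;

// Hypothetical sketch: a per-thread cache works well for long-lived event
// loop threads, but a short-lived thread that dies right after release()
// strands its cached objects, and each new thread starts with an empty cache.
class ThreadLocalBufferCache {
    private static final ThreadLocal<ArrayDeque<byte[]>> CACHE =
            ThreadLocal.withInitial(ArrayDeque::new);

    static byte[] claim() {
        byte[] buf = CACHE.get().pollFirst();
        return buf != null ? buf : new byte[256]; // cache miss: allocate
    }

    static void release(byte[] buf) {
        CACHE.get().addFirst(buf); // only ever visible to this thread
    }
}
```

On a pool of short-lived threads the `release`/`claim` pair above degenerates into plain allocation plus `ThreadLocal` bookkeeping, which is the usage the comment calls out.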
The failed tests seem unrelated to the changes of this PR: https://github.com/netty/netty/actions/runs/4179166529/jobs/7238847994
In a way, yes. The atomic state changes are not needed because objects are safely published cross-thread via the message passing queue. The reason we have a state field at all is to trap usage bugs (e.g. double-free), and the atomic ops are for trapping multi-threaded usage bugs (e.g. racing frees). I think these changes weaken those protections, and it would be good to take this into consideration in the analysis. For instance, in some cases (ref counting?) we might already have such protections in place and can avoid duplicating this work in Recycler. While in other cases we might need the Recycler to perform these checks.
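As a rough illustration of the trade-off being discussed (hypothetical names, not the PR's actual code): an atomic swap traps racing frees from two threads, whereas a plain check followed by an ordered store only traps sequential double-frees:

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Hypothetical handle showing guarded vs. unguarded recycling.
class Handle {
    static final int STATE_CLAIMED = 0;
    static final int STATE_AVAILABLE = 1;
    private static final AtomicIntegerFieldUpdater<Handle> STATE =
            AtomicIntegerFieldUpdater.newUpdater(Handle.class, "state");
    private volatile int state = STATE_CLAIMED;

    // Guarded: the atomic swap makes racing frees detectable, because
    // exactly one thread observes the old state.
    void toAvailable() {
        if (STATE.getAndSet(this, STATE_AVAILABLE) == STATE_AVAILABLE) {
            throw new IllegalStateException("Object has been recycled already.");
        }
    }

    // Unguarded: still traps a single-threaded double-free, but two racing
    // threads can both pass the check before either store becomes visible.
    void unguardedToAvailable() {
        if (state == STATE_AVAILABLE) {
            throw new IllegalStateException("Object has been recycled already.");
        }
        STATE.lazySet(this, STATE_AVAILABLE);
    }
}
```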
I am open to any suggestions here to not break API compatibility. I have added a further commit to speed up the reset of the ref count, which wasn't already shielded from unwanted changes; I have just relaxed the set so that the StoreLoad barrier after it doesn't happen.
@nitsanw please take a look as well. |
I don't pretend to have good context on
But it is also fair to say that violation of threading assumptions is not always defended. You could inflate the code slightly and add an asserting mode which is off by default, but which can detect bad usage when turned on. This may affect inlining decisions, so another alternative is to have an asserting subclass / method-handle replacement mechanism.
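A minimal sketch of such an off-by-default asserting mode, using plain Java `assert` statements (hypothetical names): with `-ea` the checks fire, without it they cost essentially nothing on the hot path:

```java
// Hypothetical sketch: usage checks behind `assert` cost nothing unless the
// JVM runs with -ea, so the default hot path stays lean while test runs can
// still trap double-release and claim-while-in-use bugs.
class PooledSlot {
    private boolean available = true;

    Object claim() {
        assert available : "claim of an object that is still in use";
        available = false;
        return this;
    }

    void release() {
        assert !available : "double release detected";
        available = true;
    }
}
```

As the comment notes, whether the extra branches (or the subclass variant) change inlining decisions would need to be checked with actual profiling.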
That's an option; the other is to extend the existing API with a relaxed/weak version of the same method and use it internally where it makes sense, i.e. for the buffer pooling use case, where the behaviour is single-writer, enforced by the prior ref-count check. I just have no idea whether Netty's public API check will complain about it, because I'm not actually removing/modifying any existing public method, but extending the API by adding a new one. wdyt @normanmaurer ?
Yeah, adding relaxed versions is what I had in mind. I think it's possible to add new methods to abstract classes without breaking compatibility, as long as they're not final. |
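For instance (a hypothetical sketch, not Netty's actual class hierarchy), the new relaxed method can ship with a body that falls back to the guarded one, so subclasses compiled against the old abstract class keep working and can override it later:

```java
// Hypothetical sketch of extending an abstract class compatibly: adding a
// non-final, non-abstract method breaks neither source nor binary
// compatibility, because existing subclasses simply inherit the default body.
abstract class ObjectPool<T> {
    public abstract T get();

    // Existing guarded API.
    public abstract void release(T obj);

    // New relaxed API: defaults to the fully guarded release; subclasses
    // that know their usage is single-writer can override with a cheaper path.
    public void unguardedRelease(T obj) {
        release(obj);
    }
}
```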
Lovely, thanks @chrisvest |
Yep what @chrisvest suggests should work |
The problem I see is that I should add the method to
I hope to have captured all relevant parts in Netty where it makes sense to apply it |
Had a couple of comments, but it looks good.
Have you got any benchmark numbers for this change?
```java
int prev = state;
if (prev == STATE_AVAILABLE) {
    throw new IllegalStateException("Object has been recycled already.");
}
```
Do we still need to look at the current state here?
Good point; I've kept it because the `unguarded` term here stands for "no atomic guard": I can add a javadoc to clarify it.
I have yet to run both the JMH (seriously) and the end-to-end one on this (I will, I promise). The producer/consumer JMH case exhibits a dramatic improvement (more than twice as fast!), probably because it doesn't use SW blackhole(s) (I know that NOT using them is not considered good practice, but SW-generated ones are not easy to use with nano benchmarks!)
One nit....
```java
void unguardedRelease(DefaultHandle<T> handle) {
    handle.unguardedToAvailable();
    Thread owner = this.owner;
    if (owner != null && Thread.currentThread() == owner && batch.size() < chunkSize) {
```
I think we can remove the null check here, as `Thread.currentThread()` will never return null:
```diff
- if (owner != null && Thread.currentThread() == owner && batch.size() < chunkSize) {
+ if (Thread.currentThread() == owner && batch.size() < chunkSize) {
```
Or was this to eliminate some overhead when `owner` is null?
:) yep
maybe adding a comment would make this more clear ;) ?
This is a fair point sir 🤣
I didn't modify this, but copied it from @chrisvest's original code. I'm now unifying both, but I'll wait for @chrisvest's comment in case we can replace the `owner` null check with your suggestion.
I have no preference either way. Null checks are cheap. For platform threads at least, getting the current thread is also very fast.
@chrisvest @normanmaurer I can now provide some numbers about this :)
@franz1981 let's show them then ;)
@normanmaurer ahah yes, I meant that now I have free cycles to run some tests :P
Motivation: Recycler's claim/release can be made faster by saving expensive volatile ops when not needed: for claim, always; for release, when the owner thread is performing the release itself. Modification: Replace expensive volatile ops with ordered ones. Result: Faster Recycler claim/release.
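A rough sketch of the "volatile vs ordered" distinction the commit message refers to (hypothetical names, not the PR's actual code): a volatile store implies a StoreLoad fence (e.g. a `lock`-prefixed instruction on x86), while `lazySet` only guarantees StoreStore ordering and omits that fence:

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Hypothetical sketch: both stores publish `state` correctly when the object
// is subsequently handed over through a message-passing queue (which supplies
// the release/acquire edge); the ordered store merely drops the expensive
// StoreLoad fence that a volatile store implies.
class RecycleState {
    private static final AtomicIntegerFieldUpdater<RecycleState> STATE =
            AtomicIntegerFieldUpdater.newUpdater(RecycleState.class, "state");
    volatile int state;

    void setVolatile(int v) {
        state = v;              // volatile store: StoreLoad fence on x86
    }

    void setOrdered(int v) {
        STATE.lazySet(this, v); // ordered store: no StoreLoad fence
    }
}
```

In single-threaded or single-writer paths the two are functionally interchangeable, which is why the change is only applied where the ownership check (or the prior ref-count check) already rules out racing writers.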
I've added a few other parts that can benefit from this "weaker" semantic (because they don't seem to escape/be shared, and appear as hot paths during allocation(s)). Please @chrisvest @normanmaurer validate whether my assumptions about the life-cycle of such instances are ok.
@normanmaurer I've performed a few rounds on a pinned localhost instance (single-threaded) of Netty vs a pure plaintext HTTP workload (with high pipelining, to stress the CPU; that's why localhost): I believe the improvement changes a lot depending on the type of load, and in general I see a delta of 5-10%, which is still pretty high. I'm now running some JMH benchmarks and I'll address your last comment, while awaiting your opinion on the parts I've decided to "relax" (which should indeed be un-shared and/or already guarded).
@chrisvest This is the improvement in this PR at 7a64def:
I've run the benchmark using
In the best-case scenario the improvement in the allocation path is quite decent, but similarly to the volatile benchmark at https://shipilev.net/blog/2014/nanotrusting-nanotime/, hammering the store buffer isn't a realistic use case, and such barriers tend to saturate, making tight loops slow down more than they would in the real world if some work happened for real, i.e. while accessing the data structure (read/write); still, for smallish and frequent allocations, the best-case scenario reports a pooled (and cache allocation) improvement of ~18% (!).
@njhill in case he is interested :)
@chrisvest feel free to park it longer if you need to check the parts I've decided to make "weaker" in terms of seq cst.
@franz1981 Thanks!
@franz1981 Will you make a Netty 5 PR as well?
This change seems to have introduced a build/compile failure:

Adding the following to `microbench/pom.xml` fixed it:

```xml
<dependency>
    <groupId>org.jctools</groupId>
    <artifactId>jctools-core</artifactId>
    <scope>compile</scope>
</dependency>
```

Should this PR have added the above dependency?
@eirbjo yes, good catch, can you send a PR? Try to check whether the shaded JCTools version we've got can be used instead of the direct dependency.
@franz1981 Sorry, I'm completely new to Netty, so I don't think I understand. In which Maven coordinates would I find this shaded JCTools? And do you mean that
How about I add a PR with the regular dependency, and we can continue the discussion there?
See #13325
Still not 100% sure which solution you are referring to. If you mean imports like
Can we continue this discussion in PR #13325?
Even if we could make this somewhat exotic setup work across Maven and IDEs, it is not clear to me what the benefits would be.
Motivation:

#13220 seems to have introduced a build/compile failure because of a missing Maven dependency on `jctools-core`. Adding this dependency to `microbench/pom.xml` fixes the compile failure of the `RecyclerBenchmark` class.

Modification:

Added the following dependency to `microbench/pom.xml`:

```xml
<dependency>
    <groupId>org.jctools</groupId>
    <artifactId>jctools-core</artifactId>
    <scope>compile</scope>
</dependency>
```

Result:

Build is successful.