
Faster Recycler's claim/release (Fixes #13153) #13220

Merged
merged 6 commits into netty:4.1 on Mar 7, 2023

Conversation

franz1981 (Contributor)

Motivation:

Recycler's claim/release can be made faster by avoiding expensive volatile operations when they are not needed: always for claim, and for release when the owner thread performs the release itself.

Modification:

Replacing expensive volatile ops with ordered ones.

Result:

Faster Recycler's claim/release
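The idea above can be sketched as follows. This is a minimal illustration, not Netty's actual code: the class, field, and method names are hypothetical. The guarded path pays for a volatile `getAndSet` (a full read-modify-write with a StoreLoad barrier), while the unguarded path uses an ordered `lazySet` store, which omits that barrier.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Illustrative sketch only: names mirror the idea of the change, not
// Netty's exact Recycler code.
public final class HandleSketch {
    private static final int STATE_AVAILABLE = 0;
    private static final int STATE_CLAIMED = 1;

    private static final AtomicIntegerFieldUpdater<HandleSketch> STATE =
            AtomicIntegerFieldUpdater.newUpdater(HandleSketch.class, "state");

    private volatile int state = STATE_AVAILABLE;

    // Before: a volatile read-modify-write on every claim (StoreLoad barrier).
    void guardedToClaimed() {
        STATE.getAndSet(this, STATE_CLAIMED);
    }

    // After: claim happens on the thread that just dequeued the handle, so an
    // ordered ("lazy") store is enough.
    void unguardedToClaimed() {
        STATE.lazySet(this, STATE_CLAIMED); // ordered store: no StoreLoad barrier
    }

    void unguardedToAvailable() {
        STATE.lazySet(this, STATE_AVAILABLE);
    }

    public static void main(String[] args) {
        HandleSketch h = new HandleSketch();
        h.unguardedToClaimed();
        System.out.println(h.state); // 1
        h.unguardedToAvailable();
        System.out.println(h.state); // 0
    }
}
```

On x86 the ordered store compiles down to a plain store, which is where the savings in the hot claim/release path come from.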


franz1981 commented Feb 14, 2023

Despite the nice changes by @chrisvest in #13174, the recycler still shows up in profiling data for pooled-allocation-intensive scenarios; I've dug into the assembly and found the two code paths this PR is addressing.

I'm convinced about the claim side, because it's supposed to act safely on the pooled instance, but not so convinced about the release one.
In theory it should be shielded by the ref count's release at:

private boolean handleRelease(boolean result) {
    if (result) {
        deallocate();
    }
    return result;
}

Note:

protected final void deallocate() {
    if (handle >= 0) {
        final long handle = this.handle;
        this.handle = -1;
        memory = null;
        chunk.decrementPinnedMemory(maxLength);
        chunk.arena.free(chunk, tmpNioBuf, handle, maxLength, cache);
        tmpNioBuf = null;
        chunk = null;
        cache = null;
        recycle();
    }
}

deallocate will call recycle only if the reference count is "correct"; in short, we can safely assume that recycle shouldn't need any atomic volatile operation (the getAndSet of state) in any case, not just when the owner is releasing it.
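A minimal sketch of this shielding argument (all names hypothetical, not Netty's actual code): the sequentially consistent refCnt decrement is the real guard. Only the single thread that drops the count to zero reaches deallocate(), so the recycle() that follows runs in a quiescent, single-writer state, and an ordered store suffices.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Hypothetical sketch: a ref-counted pooled object whose recycle step is
// shielded by the atomic ref-count decrement that precedes it.
public final class RefCountGuardSketch {
    private static final AtomicIntegerFieldUpdater<RefCountGuardSketch> STATE =
            AtomicIntegerFieldUpdater.newUpdater(RefCountGuardSketch.class, "state");

    private final AtomicInteger refCnt = new AtomicInteger(1);
    private volatile int state; // 0 = claimed, 1 = available

    boolean release() {
        if (refCnt.decrementAndGet() == 0) { // atomic: at most one winner
            deallocate();
            return true;
        }
        return false;
    }

    private void deallocate() {
        recycle(); // only ever reached once per claim/release cycle
    }

    private void recycle() {
        STATE.lazySet(this, 1); // relaxed: shielded by the refCnt decrement
    }

    public static void main(String[] args) {
        RefCountGuardSketch buf = new RefCountGuardSketch();
        System.out.println(buf.release()); // true: count hit zero, recycled
        System.out.println(buf.state);     // 1: available again
    }
}
```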


franz1981 commented Feb 14, 2023

I'll run some tests tomorrow vs #13174 (comment)

chrisvest (Contributor)

FYI Recycler is general-purpose, so it cannot rely on implementation details of ByteBuf sub-classes.


franz1981 commented Feb 14, 2023

FYI Recycler is general-purpose, so it cannot rely on implementation details of ByteBuf sub-classes.

It's fine, unless:

  • we use some subclass or special method that only buffers can use, providing a relaxed semantic (if recycle happens on a quiescent state, due to a prior sequentially consistent release, I see no point in paying for such costly ops twice)
  • we give up on the recycle improvement (assuming it could be used by any threads, concurrently, and we need to guarantee its consistency), while claim is fine to use it

I'm running some end-to-end tests on these changes, and I can tell the perf difference is quite striking, given that it is such a hot op.

Just a note about being "general purpose": currently the recycler always makes use of thread local(s), which are not that good with virtual threads and/or thread pools whose threads can be short-lived; hence, in order to be fully friendly to any alien usage, e.g. outside of the event loop, we should fix that one too. It's a separate issue and I can file one to track it.

franz1981 (Author)

The failed tests seem unrelated to the changes in this PR: https://github.com/netty/netty/actions/runs/4179166529/jobs/7238847994

chrisvest (Contributor)

in short, we can safely assume that recycle shouldn't need any atomic volatile operation

In a way, yes. The atomic state changes are not needed because objects are safely published cross-thread via the message passing queue.

The reason we have a state field at all is to trap usage bugs (e.g. double-free), and the atomic ops are for trapping multi-threaded usage bugs (e.g. racing frees).

I think these changes weaken those protections, and it would be good to take this into consideration in the analysis. For instance, in some cases (ref counting?) we might already have such protections in place and can avoid duplicating this work in Recycler. While in other cases we might need the Recycler to perform these checks.
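A sketch of the protection being weighed here (names illustrative, loosely modeled on the state check discussed later in the review): the atomic getAndSet observes the previous state, which is what lets a double-free or racing free be trapped instead of silently corrupting the pool.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Hypothetical sketch of the guarded release path: the atomic swap returns
// the previous state, so a second free (sequential or racing) is detected.
public final class GuardedHandle {
    private static final int STATE_CLAIMED = 0;
    private static final int STATE_AVAILABLE = 1;

    private static final AtomicIntegerFieldUpdater<GuardedHandle> STATE =
            AtomicIntegerFieldUpdater.newUpdater(GuardedHandle.class, "state");

    private volatile int state = STATE_CLAIMED;

    void toAvailable() {
        int prev = STATE.getAndSet(this, STATE_AVAILABLE);
        if (prev == STATE_AVAILABLE) {
            // A relaxed store could not make this check atomic with the write.
            throw new IllegalStateException("Object has been recycled already.");
        }
    }

    public static void main(String[] args) {
        GuardedHandle h = new GuardedHandle();
        h.toAvailable(); // first free: fine
        try {
            h.toAvailable(); // double-free: trapped
        } catch (IllegalStateException e) {
            System.out.println("double-free detected");
        }
    }
}
```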

franz1981 (Author)

While in other cases we might need the Recycler to perform these checks.

I am open to any suggestions here that don't break API compatibility.

I have added a further commit to speed up the reset of the ref count, which wasn't already shielded from unwanted changes; I have just relaxed the set so that the StoreLoad barrier after it doesn't happen.

normanmaurer (Member)

@nitsanw please take a look as well.


nitsanw commented Feb 16, 2023

I don't pretend to have good context on Recycler usage. I think the point made by @chrisvest here is a valid one:

the atomic ops are for trapping multi-threaded usage bugs (e.g. racing frees). I think these changes weaken those protections

But it is also fair to say that violation of threading assumptions is not always defended.

You could inflate the code slightly and add an asserting mode which is off by default, but which can detect bad usage when turned on. This may affect inlining decisions, so another alternative is to have an asserting subclass/method handle replacement mechanism.
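The off-by-default asserting mode could be sketched like this (the system-property name and class are made up for illustration, and a plain check stands in for the real atomic one): the constant flag lets the JIT compile the guard away entirely when it is disabled.

```java
// Hypothetical sketch of an asserting mode that is off by default.
public final class AssertingRecyclerSketch {
    // Hypothetical flag; a real implementation would pick its own property name.
    private static final boolean GUARD =
            Boolean.getBoolean("sketch.recycler.guardedRecycle");

    private int state; // 0 = claimed, 1 = available; plain field for the sketch

    void toAvailable() {
        if (GUARD) {
            // Guarded path: trap double-frees when the mode is enabled.
            if (state == 1) {
                throw new IllegalStateException("double-free");
            }
        }
        state = 1; // relaxed path: just store
    }

    public static void main(String[] args) {
        AssertingRecyclerSketch h = new AssertingRecyclerSketch();
        h.toAvailable();
        h.toAvailable(); // with the guard off (default), this goes undetected
        System.out.println("guard off: no trap");
    }
}
```

As nitsanw notes, branching on a static final boolean keeps the fast path clean, at the cost of slightly larger method bodies that may influence inlining.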

franz1981 (Author)

You could inflate the code slightly and add an asserting mode which is off by default, but which can detect bad usage when turned on. This may affect inlining decisions, so another alternative is to have an asserting subclass/method handle replacement mechanism.

That's an option; the other is to extend the existing API with a relaxed/weak version of the same method and use it internally where it makes sense, i.e. for the buffer-pooling use case, where the single-writer behaviour is enforced by the prior ref count check. I just have no idea whether Netty's public API check will complain, because I'm not removing/modifying any existing public method, only extending the API by adding a new one. wdyt @normanmaurer ?

chrisvest (Contributor)

Yeah, adding relaxed versions is what I had in mind. I think it's possible to add new methods to abstract classes without breaking compatibility, as long as they're not final.
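The compatibility idea can be sketched as follows (a hypothetical Handle type, not Netty's actual API): a new non-final method with a default implementation delegating to the old one means subclasses compiled against the old API keep working unchanged.

```java
// Hypothetical sketch: extending an abstract class with a relaxed variant
// without breaking existing subclasses.
public class CompatSketch {
    public abstract static class Handle<T> {
        public abstract void recycle(T obj);

        // New in a later release: non-final, defaults to the guarded version,
        // so existing subclasses compile and link unchanged.
        public void unguardedRecycle(T obj) {
            recycle(obj);
        }
    }

    // A subclass written against the *old* API still works.
    static final class LegacyHandle extends Handle<String> {
        String last;

        @Override
        public void recycle(String obj) {
            last = obj;
        }
    }

    public static void main(String[] args) {
        LegacyHandle h = new LegacyHandle();
        h.unguardedRecycle("buf"); // falls through to the old recycle()
        System.out.println(h.last);
    }
}
```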

franz1981 (Author)

Lovely, thanks @chrisvest

normanmaurer (Member)

Yep, what @chrisvest suggests should work


franz1981 commented Feb 17, 2023

The problem I see is that I'd have to add the method to Handle, which could be either the ObjectPool one or the Recycler one; both are public interfaces :"(
Let me check if I can find a way to do things differently

@franz1981 franz1981 marked this pull request as ready for review February 17, 2023 14:58
franz1981 (Author)

I hope I have captured all the relevant parts of Netty where it makes sense to apply this

@chrisvest (Contributor) left a comment

Had a couple of comments, but it looks good.

Have you got any benchmark numbers for this change?

Comment on lines +289 to +292

int prev = state;
if (prev == STATE_AVAILABLE) {
    throw new IllegalStateException("Object has been recycled already.");
}
chrisvest:

Do we still need to look at the current state here?

franz1981:

Good point; I've kept it because the "unguarded" term here stands for "no atomic guard": I can add a javadoc comment to clarify it

common/src/main/java/io/netty/util/Recycler.java (resolved)

franz1981 commented Feb 17, 2023

@chrisvest

Have you got any benchmark numbers for this change?

I have yet to run both the JMH benchmark (seriously) and the end-to-end one on this (I will, I promise).
I've used the JMH one to understand what's going on (basically as a perfasm logging test), and what I've noticed is that recyclerGetAndRecycle vs recyclerGetAndUnguardedRecycle doesn't show the expected improvement from unguarded, because other, much more costly operations make noise,
i.e. we often end up in the JCTools slow path, which causes a CAS over producerLimit :( (and we use the JCTools queues way more than I was expecting, instead of hitting the ArrayDeque). The other noise source is the blackhole implementation for Java 8; recently Alexey S added the option of compiler-assisted blackholes, which helps reduce the noise in these "nano" benchmarks. The software variant just costs more than the improvement brought, making it invisible.

Meanwhile, the producer/consumer JMH case exhibits a dramatic improvement (more than twice as fast!), probably because it doesn't use software blackhole(s) (I know that NOT using them is not considered good practice, but software-generated ones are not easy to use with nano benchmarks!)

@normanmaurer (Member) left a comment

One nit....

void unguardedRelease(DefaultHandle<T> handle) {
    handle.unguardedToAvailable();
    Thread owner = this.owner;
    if (owner != null && Thread.currentThread() == owner && batch.size() < chunkSize) {
normanmaurer:

I think we can remove the null check here, as Thread.currentThread() will never return null

Suggested change
- if (owner != null && Thread.currentThread() == owner && batch.size() < chunkSize) {
+ if (Thread.currentThread() == owner && batch.size() < chunkSize) {

normanmaurer:

Or was this to eliminate some overhead when owner is null ?

franz1981:

:) yep

normanmaurer:

maybe adding a comment would make this more clear ;) ?

franz1981:

This is a fair point sir 🤣

franz1981:

I didn't modify this; it's copied from @chrisvest's original code. I'm now unifying both, but I'll wait for @chrisvest's comment in case we can replace the owner null check with your suggestion

chrisvest:

I have no preference either way. Null checks are cheap. For platform threads at least, getting the current thread is also very fast.

franz1981 (Author)

@chrisvest @normanmaurer I can now provide some numbers about this :)

normanmaurer (Member)

@franz1981 let's show them ;) ?

franz1981 (Author)

@normanmaurer ahah yes, I meant that I now have free cycles to run some tests :P


franz1981 commented Feb 22, 2023

I've added a few other parts that can benefit from this "weaker" semantic (because they don't seem to escape/be shared, and they appear on the hot path during allocations).

Please @chrisvest @normanmaurer validate whether my assumptions about the life-cycle of such instances are ok


franz1981 commented Feb 22, 2023

@normanmaurer I've run a few rounds against a pinned localhost instance (single-threaded) of Netty with a pure plaintext HTTP workload (with high pipelining, to stress the CPU; that's why localhost):
This PR delivers Requests/sec: 673747.21 vs the baseline's Requests/sec: 610041.22.

I believe the improvement changes a lot depending on the type of load; in general I see a delta of 5-10%, which is still pretty high.

I'm now running some JMH benchmarks and I'll address your last comment, while awaiting your opinion on the parts I've decided to "relax" (which should indeed be un-shared and/or already guarded)


franz1981 commented Feb 22, 2023

@chrisvest This is the improvement in PooledByteBufAllocatorBenchmark:

baseline:

Benchmark                                                Mode  Cnt        Score        Error  Units
PooledByteBufAllocatorBenchmark.allocateAndFree         thrpt   10  4982570.184 ± 267783.435  ops/s

this PR at 7a64def:

Benchmark                                                Mode  Cnt        Score        Error  Units
PooledByteBufAllocatorBenchmark.allocateAndFree         thrpt   10  5880516.884 ± 390441.996  ops/s

I've run the benchmark using --jvmArgs="-Dio.netty.allocator.useCacheForAllThreads=true" in order to use thread-local pool caches (which make use of local pools as well!).

In the best-case scenario the improvement in the allocation path is quite decent. But, similarly to the volatile benchmark in https://shipilev.net/blog/2014/nanotrusting-nanotime/, hammering the store buffer isn't a realistic use case, and such barriers tend to saturate, making tight loops slow down more than they would in the real world if some real work happened, e.g. accessing the data structure (read/write). Still, for smallish and frequent allocations, the best-case scenario reports a pooled (and cached allocation) improvement of ~18% (!).
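The two stores being compared can be illustrated as follows. This is deliberately not a benchmark (timing tight loops like this is exactly the pitfall the Shipilëv post describes; use JMH for real numbers): both loops end with the same value, and the only difference is the StoreLoad barrier that set() pays on every iteration and lazySet() avoids.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustration only: volatile store vs ordered store. Semantically both loops
// leave the counter at n; the cost difference is the per-store barrier.
public final class StoreSketch {
    public static void main(String[] args) {
        final int n = 1_000_000;
        AtomicLong volatileStore = new AtomicLong();
        AtomicLong orderedStore = new AtomicLong();
        for (int i = 1; i <= n; i++) {
            volatileStore.set(i);    // volatile store: StoreLoad barrier each time
        }
        for (int i = 1; i <= n; i++) {
            orderedStore.lazySet(i); // ordered store: plain store on x86
        }
        System.out.println(volatileStore.get());
        System.out.println(orderedStore.get());
    }
}
```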

franz1981 (Author)

@njhill in case he is interested :)



franz1981 commented Mar 6, 2023

@chrisvest feel free to park it longer if you need to check the parts I've decided to make "weaker" in terms of sequential consistency.
Or let me know if I should run some more tests to prove the perf effectiveness

@chrisvest chrisvest merged commit c353f4f into netty:4.1 Mar 7, 2023
chrisvest (Contributor)

@franz1981 Thanks!

chrisvest (Contributor)

@franz1981 Will you make a Netty 5 PR as well?

chrisvest pushed a commit that referenced this pull request Mar 13, 2023

eirbjo commented Apr 11, 2023

This change seems to have introduced a build/compile failure:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.0:compile (default-compile) on project netty-microbench: Compilation failure: Compilation failure: 
[ERROR] netty/microbench/src/main/java/io/netty/microbench/util/RecyclerBenchmark.java:[21,25] error: package org.jctools.queues does not exist
[ERROR] netty/microbench/src/main/java/io/netty/microbench/util/RecyclerBenchmark.java:[22,32] error: package org.jctools.queues.atomic does not exist

Adding the following to microbench/pom.xml fixes the problem:

```xml
<dependency>
  <groupId>org.jctools</groupId>
  <artifactId>jctools-core</artifactId>
  <scope>compile</scope>
</dependency>
```

Should this PR have added the above dependency?

franz1981 (Author)

@eirbjo yes, good catch, can you send a PR? Try to check whether the shaded JCTools version we have can be used instead of the direct dependency


eirbjo commented Apr 11, 2023

@eirbjo yes, good catch, can you send a PR? Try to check whether the shaded JCTools version we have can be used instead of the direct dependency

@franz1981 Sorry, I'm completely new to Netty, so I don't think I understand. In which Maven coordinates would I find this shaded JCTools? And do you mean that RecyclerBenchmark should be updated to use shaded packages? I don't think IDEs would detect that very well?

How about I add a PR with the regular dependency, and we can continue the discussion there?


eirbjo commented Apr 11, 2023

How about I add a PR with the regular dependency, and we can continue the discussion there?

See #13325

franz1981 (Author)

@eirbjo we currently have shaded JCTools, see

<shadedPattern>io.netty.util.internal.shaded.org.jctools.</shadedPattern>

I wonder if you can reference the shaded one instead and hence depend just on the shaded version of JCTools


eirbjo commented Apr 11, 2023

I wonder if you can reference the shaded one instead and hence depend just on the shaded version of JCTools

Still not 100% sure which solution you are referring to. If you mean imports like import io.netty.util.internal.shaded.org.jctools.queues.MpmcArrayQueue;, then that does not seem to work:

netty/microbench/src/main/java/io/netty/microbench/util/RecyclerBenchmark.java:[21,55] error: cannot find symbol
[ERROR]   symbol:   class MpmcArrayQueue
[ERROR]   location: package io.netty.util.internal.shaded.org.jctools.queues

Can we continue this discussion in PR #13325 ?


eirbjo commented Apr 11, 2023

I wonder if you can reference the shaded one instead and hence depend just on the shaded version of JCTools

Even if we could make this somewhat exotic setup work across Maven and IDEs, it is not clear to me what the benefits would be.

chrisvest pushed a commit that referenced this pull request Apr 18, 2023
Motivation:

#13220  seems to have introduced a build/compile failure because of a missing Maven dependency on `jctools-core`. Adding this dependency to `microbench/pom.xml` fixes the compile failure of the `RecyclerBenchmark` class.

Modification:

Added the following dependency to `microbench/pom.xml`:

```xml
<dependency>
  <groupId>org.jctools</groupId>
  <artifactId>jctools-core</artifactId>
  <scope>compile</scope>
</dependency>
```

Result:

Build is successful.
chrisvest pushed a commit that referenced this pull request Jun 16, 2023