Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Recycler faster on OpenJ9 #13357

Merged
merged 2 commits into from
Apr 24, 2023
Merged

Conversation

chrisvest
Copy link
Contributor

Motivation:
To avoid reference processing overhead while at the same time ensuring that we clear Recycler data structures that were used by now-terminated threads, we need to check if the LocalPool owner thread has terminated.

We used to do this by calling Thread.getState, but this turns out to be quite expensive on OpenJ9, because they need to halt the target thread before its state can be safely inspected. See the haltThreadForInspection call in Java_java_lang_Thread_getStateImpl.

This turns out to be a scalability bottleneck in some deployments.

Modification:
We can get the same semantics as Thread.getState() == Thread.State.TERMINATED by checking !Thread.isAlive().

The isAlive call is cheap on both OpenJ9 and OpenJDK.

Result:
We no longer see the scalability bottleneck observed in #13347 (comment)

Copy link
Contributor

@franz1981 franz1981 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done!
Just curious ..any other call sites where we use anything similar?

@chrisvest
Copy link
Contributor Author

Previously we were getting these numbers from RecyclerBenchmark:

OpenJDK 11
Benchmark                                                             Mode  Cnt      Score     Error   Units
RecyclerBenchmark.plainNew                                            avgt   20      3.800 ±   0.028   ns/op
RecyclerBenchmark.producerConsumer                                    avgt   20    111.901 ±  25.402   ns/op
RecyclerBenchmark.producerConsumer:consumer                           avgt   20    111.901 ±  25.403   ns/op
RecyclerBenchmark.producerConsumer:producer                           avgt   20    111.901 ±  25.402   ns/op
RecyclerBenchmark.recyclerGetAndOrphan                                avgt   20      8.719 ±   0.046   ns/op
RecyclerBenchmark.recyclerGetAndRecycle                               avgt   20     15.095 ±   1.930   ns/op
RecyclerBenchmark.recyclerGetAndUnguardedRecycle                      avgt   20     13.404 ±   0.028   ns/op
RecyclerBenchmark.unguardedProducerConsumer                           avgt   20     84.549 ±   2.762   ns/op
RecyclerBenchmark.unguardedProducerConsumer:unguardedConsumer         avgt   20     84.548 ±   2.762   ns/op
RecyclerBenchmark.unguardedProducerConsumer:unguardedProducer         avgt   20     84.549 ±   2.763   ns/op

OpenJ9 11
Benchmark                                                      Mode  Cnt     Score     Error   Units
RecyclerBenchmark.plainNew                                     avgt   20     4.827 ±   0.009   ns/op
RecyclerBenchmark.producerConsumer                             avgt   20  5790.174 ± 206.705   ns/op
RecyclerBenchmark.producerConsumer:consumer                    avgt   20  5789.867 ± 206.621   ns/op
RecyclerBenchmark.producerConsumer:producer                    avgt   20  5790.481 ± 206.789   ns/op
RecyclerBenchmark.recyclerGetAndOrphan                         avgt   20     8.352 ±   0.014   ns/op
RecyclerBenchmark.recyclerGetAndRecycle                        avgt   20    11.705 ±   0.030   ns/op
RecyclerBenchmark.recyclerGetAndUnguardedRecycle               avgt   20    11.908 ±   0.023   ns/op
RecyclerBenchmark.unguardedProducerConsumer                    avgt   20  5890.974 ± 413.419   ns/op
RecyclerBenchmark.unguardedProducerConsumer:unguardedConsumer  avgt   20  5890.096 ± 413.297   ns/op
RecyclerBenchmark.unguardedProducerConsumer:unguardedProducer  avgt   20  5891.852 ± 413.542   ns/op

With this PR we're getting these numbers:

OpenJDK 11
Benchmark                                                             Mode  Cnt      Score     Error   Units
RecyclerBenchmark.plainNew                                            avgt   20      3.820 ±   0.030   ns/op
RecyclerBenchmark.producerConsumer                                    avgt   20    202.900 ±   2.375   ns/op
RecyclerBenchmark.producerConsumer:consumer                           avgt   20    202.900 ±   2.375   ns/op
RecyclerBenchmark.producerConsumer:producer                           avgt   20    202.899 ±   2.375   ns/op
RecyclerBenchmark.recyclerGetAndOrphan                                avgt   20      8.735 ±   0.084   ns/op
RecyclerBenchmark.recyclerGetAndRecycle                               avgt   20     12.943 ±   0.018   ns/op
RecyclerBenchmark.recyclerGetAndUnguardedRecycle                      avgt   20     13.398 ±   0.018   ns/op
RecyclerBenchmark.unguardedProducerConsumer                           avgt   20    198.116 ±   3.083   ns/op
RecyclerBenchmark.unguardedProducerConsumer:unguardedConsumer         avgt   20    198.116 ±   3.083   ns/op
RecyclerBenchmark.unguardedProducerConsumer:unguardedProducer         avgt   20    198.115 ±   3.083   ns/op

OpenJ9 11
Benchmark                                                      Mode  Cnt     Score    Error   Units
RecyclerBenchmark.plainNew                                     avgt   20     4.815 ±  0.017   ns/op
RecyclerBenchmark.producerConsumer                             avgt   20   121.618 ± 14.873   ns/op
RecyclerBenchmark.producerConsumer:consumer                    avgt   20   121.619 ± 14.873   ns/op
RecyclerBenchmark.producerConsumer:producer                    avgt   20   121.618 ± 14.873   ns/op
RecyclerBenchmark.recyclerGetAndOrphan                         avgt   20     8.365 ±  0.010   ns/op
RecyclerBenchmark.recyclerGetAndRecycle                        avgt   20    11.708 ±  0.034   ns/op
RecyclerBenchmark.recyclerGetAndUnguardedRecycle               avgt   20    11.941 ±  0.111   ns/op
RecyclerBenchmark.unguardedProducerConsumer                    avgt   20   106.812 ± 27.806   ns/op
RecyclerBenchmark.unguardedProducerConsumer:unguardedConsumer  avgt   20   106.811 ± 27.806   ns/op
RecyclerBenchmark.unguardedProducerConsumer:unguardedProducer  avgt   20   106.812 ± 27.806   ns/op

So OpenJ9 improves a lot, but OpenJDK regresses.

Motivation:
To avoid reference processing overhead while at the same time ensuring that we clear Recycler data structures that were used by now-terminated threads, we need to check if the LocalPool owner thread has terminated.

We used to do this by calling `Thread.getState`, but this turns out to be quite expensive on OpenJ9, because they need to halt the target thread before its state can be safely inspected.
See the `haltThreadForInspection` call in [`Java_java_lang_Thread_getStateImpl`](https://github.com/eclipse-openj9/openj9/blob/55b9267740559406b8471000cd3a3b6f681dc095/runtime/jcl/common/thread.cpp#L76-L102).

This turns out to be a scalability bottleneck in some deployments.

Modification:
We can get the same semantics as `Thread.getState() == Thread.State.TERMINATED` by checking `!Thread.isAlive()`.

The `isAlive` call is cheap on both [OpenJ9](https://github.com/eclipse-openj9/openj9/blob/03967d8cc24dc1f60168110c9d207d2dff57765a/jcl/src/java.base/share/classes/java/lang/Thread.java#L756-L761) and [OpenJDK](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/Thread.java#L1798-L1815).

Result:
We no longer see the scalability bottleneck observed in netty#13347 (comment)
@chrisvest
Copy link
Contributor Author

We can get the best of both by checking if we're on OpenJ9:

OpenJDK 11
Benchmark                                                             Mode  Cnt      Score    Error   Units
RecyclerBenchmark.plainNew                                            avgt   20      3.772 ±  0.018   ns/op
RecyclerBenchmark.producerConsumer                                    avgt   20    133.622 ±  2.163   ns/op
RecyclerBenchmark.producerConsumer:consumer                           avgt   20    133.622 ±  2.164   ns/op
RecyclerBenchmark.producerConsumer:producer                           avgt   20    133.622 ±  2.163   ns/op
RecyclerBenchmark.recyclerGetAndOrphan                                avgt   20      8.717 ±  0.041   ns/op
RecyclerBenchmark.recyclerGetAndRecycle                               avgt   20     12.893 ±  0.015   ns/op
RecyclerBenchmark.recyclerGetAndUnguardedRecycle                      avgt   20     13.219 ±  0.145   ns/op
RecyclerBenchmark.unguardedProducerConsumer                           avgt   20     69.392 ± 13.071   ns/op
RecyclerBenchmark.unguardedProducerConsumer:unguardedConsumer         avgt   20     69.392 ± 13.071   ns/op
RecyclerBenchmark.unguardedProducerConsumer:unguardedProducer         avgt   20     69.392 ± 13.071   ns/op

OpenJ9 11
Benchmark                                                      Mode  Cnt     Score    Error   Units
RecyclerBenchmark.plainNew                                     avgt   20     4.808 ±  0.007   ns/op
RecyclerBenchmark.producerConsumer                             avgt   20   135.377 ± 11.939   ns/op
RecyclerBenchmark.producerConsumer:consumer                    avgt   20   135.378 ± 11.939   ns/op
RecyclerBenchmark.producerConsumer:producer                    avgt   20   135.377 ± 11.938   ns/op
RecyclerBenchmark.recyclerGetAndOrphan                         avgt   20     8.371 ±  0.012   ns/op
RecyclerBenchmark.recyclerGetAndRecycle                        avgt   20    11.710 ±  0.032   ns/op
RecyclerBenchmark.recyclerGetAndUnguardedRecycle               avgt   20    11.835 ±  0.038   ns/op
RecyclerBenchmark.unguardedProducerConsumer                    avgt   20    86.019 ± 14.293   ns/op
RecyclerBenchmark.unguardedProducerConsumer:unguardedConsumer  avgt   20    86.019 ± 14.293   ns/op
RecyclerBenchmark.unguardedProducerConsumer:unguardedProducer  avgt   20    86.019 ± 14.293   ns/op

@chrisvest
Copy link
Contributor Author

any other call sites where we use anything similar?

I don't see any other calls to Thread.getState, but we have many isAlive calls.

@franz1981
Copy link
Contributor

I am surprised that we have regressed on Jdk, it doesn't seem anything but a load there - you checked on x86?

@chrisvest
Copy link
Contributor Author

I only checked on aarch

@theRealAph
Copy link

I only checked on aarch

What hardware? Such a heavy impact for a volatile read is odd.

@chrisvest
Copy link
Contributor Author

@theRealAph M1 Pro, macOS

@franz1981
Copy link
Contributor

franz1981 commented Apr 22, 2023

@chrisvest @theRealAph https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/9a751dc19fae78ce58fb0eb176522070c992fb6f/jdk/src/share/classes/java/lang/Thread.java#L1003

That's why! Ah what an horror story weekend 🥶

@theRealAph
Copy link

theRealAph commented Apr 22, 2023

@chrisvest @theRealAph https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/9a751dc19fae78ce58fb0eb176522070c992fb6f/jdk/src/share/classes/java/lang/Thread.java#L1003

That's why! Ah what an horror story weekend cold_face

Ah, it's going through a java<->native transition. So, a slowdown makes sense.
But, but... why is it being called with such extremely high frequency?

Copy link
Member

@trustin trustin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once this comment is addressed.

Co-authored-by: Trustin Lee <t@motd.kr>
@normanmaurer normanmaurer merged commit 97a7ed0 into netty:4.1 Apr 24, 2023
12 of 14 checks passed
@normanmaurer
Copy link
Member

@chrisvest thanks a lot!

normanmaurer pushed a commit that referenced this pull request Apr 24, 2023
Motivation:
To avoid reference processing overhead while at the same time ensuring that we clear Recycler data structures that were used by now-terminated threads, we need to check if the LocalPool owner thread has terminated.

We used to do this by calling `Thread.getState`, but this turns out to be quite expensive on OpenJ9, because they need to halt the target thread before its state can be safely inspected.
See the `haltThreadForInspection` call in [`Java_java_lang_Thread_getStateImpl`](https://github.com/eclipse-openj9/openj9/blob/55b9267740559406b8471000cd3a3b6f681dc095/runtime/jcl/common/thread.cpp#L76-L102).

This turns out to be a scalability bottleneck in some deployments.

Modification:
We can get the same semantics as `Thread.getState() == Thread.State.TERMINATED` by checking `!Thread.isAlive()`.

The `isAlive` call is cheap on both [OpenJ9](https://github.com/eclipse-openj9/openj9/blob/03967d8cc24dc1f60168110c9d207d2dff57765a/jcl/src/java.base/share/classes/java/lang/Thread.java#L756-L761) and [OpenJDK](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/Thread.java#L1798-L1815).

Result:
We no longer see the scalability bottleneck observed in #13347 (comment)



Co-authored-by: Trustin Lee <t@motd.kr>
@chrisvest chrisvest deleted the 41-recycler-opt branch April 24, 2023 14:55
@chrisvest chrisvest added this to the 4.1.92.Final milestone Apr 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants