-
-
Notifications
You must be signed in to change notification settings - Fork 924
Optimize Class#subclasses #8462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
facd83a
to
8c01c95
Compare
There were several issues with the old implementation: * Iterating by creating a Set and an Iterator adds significant allocation overhead. Better to use Map#forEach. * ConcurrentWeakHashMap is a very old implementation that has several inefficiencies as well as no custom implementation of Map#forEach (it falls back on the default impl which creates a Set and Iterator). Instead we use a synchronized WeakHashMap. * The initial array was created with a default size of 4 elements. For many classes this will be insufficient, leading to excessive array allocation and copying as it grows. Instead we use the current class's "subclasses" size as an initial estimate. * When no subclasses appear to be present, bail out early with an empty array that does minimal allocation. Performance is significantly better for zero, small, and large subclass lists. 1M * Object.subclasses, with 83 elements: Before: 2.020000 0.040000 2.060000 ( 1.823539) 1.980000 0.030000 2.010000 ( 1.737808) 1.830000 0.010000 1.840000 ( 1.738772) 1.730000 0.020000 1.750000 ( 1.740675) 1.720000 0.000000 1.720000 ( 1.707344) After: 1.340000 0.020000 1.360000 ( 1.034745) 0.930000 0.000000 0.930000 ( 0.918107) 0.920000 0.010000 0.930000 ( 0.914828) 0.920000 0.000000 0.920000 ( 0.922137) 0.920000 0.000000 0.920000 ( 0.915440) 10M * Numeric.subclasses, with 4 elements: Before: 0.930000 0.030000 0.960000 ( 0.789997) 0.640000 0.010000 0.650000 ( 0.621404) 0.620000 0.010000 0.630000 ( 0.614404) 0.630000 0.000000 0.630000 ( 0.629492) 0.630000 0.010000 0.640000 ( 0.608538) After: 0.720000 0.010000 0.730000 ( 0.559470) 0.460000 0.000000 0.460000 ( 0.454176) 0.510000 0.000000 0.510000 ( 0.429875) 0.430000 0.010000 0.440000 ( 0.434487) 0.430000 0.000000 0.430000 ( 0.427971) Fixes jruby#8457
8c01c95
to
0d39d52
Compare
@mohamedhafez Could you give this branch a try and see how the performance looks for you? |
Sure, can I just cherry-pick this commit onto a branch based on 9.4.9.0 so I'm sure everything else is compatible with my app? |
@mohamedhafez Yes, should apply just fine to 9.4.9.0. |
Reporting back from @mohamedhafez showed that the hard locking here is a big bottleneck for a highly-concurrent app, so I'm looking at lighter-weight locking options. This patch was an attempt to use a ReentrantReadWriteLock, but it has overhead of its own and made things worse: https://gist.github.com/headius/30578ea0e26c9566462e70d963b2cd29 I'll attempt a lock-free version. |
A second attempt with a volatile field and atomic updates improves over the synchronized version but uses a lot of background CPU spinning (a better impl would park the thread but then we have to track waiters, etc). In the same gist as above: https://gist.github.com/headius/30578ea0e26c9566462e70d963b2cd29 |
Getting this accurate is not important, but most weak collections will attempt to get an accurate count at increased expense. This patch just saves the last list size and uses that for future list allocations.
A third attempt makes the |
5223c84
to
680d2e9
Compare
This is similar to the CRuby subclasses list except for two key features: * We use weak references instead of pointer values; the GC vacates references for us. * The linked list structure is immutable and concurrency-safe. This impl is thread-safe and lock-free due to the immutable linked list structure. Adding a new class creates a new head node and atomically reassigns it into the class. Removing a class finds that element and vacates its reference. Replacing a class first removes the old and then adds the new. Traversing is a matter of walking the chain and omitting vacated references. Periodically, the list must be rebuilt without dead references. This is hardcoded currently to be when the list contains more than 25% vacated references. Adding a class is an O(1) operation. Removal, replacement, and traversal are amortized O(n). This structure is also lighter-weight than either the original ConcurrentWeakHashMap or any implementation of WeakHashMap provided by the JDK, plus it has no lock overhead and very little synchronization overhead.
680d2e9
to
6c6982c
Compare
The MetaClass constructor already does this.
I've pushed a new implementation of the subclasses collection that is just a linked list of weak references. The list is immutable so the only synchronization required is an atomic update of the head reference. Adding a class is O(1) and just attaches a new head. Removal, replacement, and traversal are O(n) linked list walks. Periodically (currently hardcoded to 25% evacuated references) the list gets rebuilt at the end of traversal, removing any dead links but preserving previous links with the same weak references; this allows the remove operation to simply This is the fastest option so far for 9.4.9.0:
9.4.10.0
CRuby 3.3
|
I've pushed a refactoring that gets all benchmarks faster than 9.4.9.0, although sometimes just barely. The larger Object.subclasses cases are even faster: 9.4.9.0:
New impl:
|
* Split large methods into smaller ones. * Eliminate repeated null-checking. * Always use estimate to create array/list, but add fast path to newArray that reuses zero-length array. * Add estimates for "all subclasses" and "all descendants" used by the old form. * Only update estimates for the target class, not for subclasses. This refactored impl is consistently faster than 9.4.9.0: 9.4.9.0: 1 thread Numeric.subclasses 1.281k (±10.4%) i/s - 6.348k in 5.015677s 1 thread Object.subclasses 57.555 (± 1.7%) i/s - 290.000 in 5.042078s 5 thread Numeric.subclasses 1.866k (±12.8%) i/s - 9.243k in 5.026434s 5 thread Object.subclasses 156.215 (± 3.8%) i/s - 780.000 in 5.000381s 10 thread Numeric.subclasses 1.151k (±11.8%) i/s - 5.700k in 5.027341s 10 thread Object.subclasses 182.022 (±15.9%) i/s - 864.000 in 5.076040s 50 thread Numeric.subclasses 276.258 (±33.3%) i/s - 1.122k in 5.030777s 50 thread Object.subclasses 149.436 (±16.1%) i/s - 705.000 in 5.005946s New impl: 1 thread Numeric.subclasses 1.871k (± 9.9%) i/s - 9.313k in 5.028326s 1 thread Object.subclasses 194.344 (± 2.6%) i/s - 988.000 in 5.087741s 5 thread Numeric.subclasses 1.966k (±12.9%) i/s - 9.750k in 5.044797s 5 thread Object.subclasses 475.950 (± 8.4%) i/s - 2.392k in 5.065777s 10 thread Numeric.subclasses 1.170k (±16.9%) i/s - 5.720k in 5.238812s 10 thread Object.subclasses 356.624 (±41.5%) i/s - 1.470k in 5.065979s 50 thread Numeric.subclasses 338.594 (±13.9%) i/s - 1.665k in 5.075536s 50 thread Object.subclasses 251.153 (±10.4%) i/s - 1.250k in 5.025052s
62c27a8
to
2976caf
Compare
This is ready to merge any time. This is about the best I think we can do right now, all things considered. We discussed on Matrix a bit about how Rails really should not be generating these lists over and over again, but that is a larger optimization to explore and implement. If they use this same logic on CRuby and JRuby, we should at least not be slower at it. |
My implementation of a new subclasses structure in jruby#8462 was a bit too "clever" leading to a potential leak of subclass references. To avoid using a ReferenceQueue, I tied the "cleaning" of the subclasses list to encountering a percentage of empty subclass references during traversal. However if traversal only walks concrete subclass references (via Class#subclasses), empty non-concrete (singleton, included, prepended) subclass references could accumulate. This would only be cleared out with a full (not concrete-only) subclass walk, which only happened for a few internal uses of this list. The patch here reverts to a more reliable ReferenceQueue-based approach: * When the first subclass is added, a ReferenceQueue is also created for the subclass weak references. * When adding or removing subsequent subclasses, a non-null queue poll indicates a full clean is necessary before proceeding. This amortizes the cleaning of the list to only fire after references have been evacuated and additional mutations of the list are requested. The changes here appear to have minimal impact on mutating the subclass list, since references will tend to be evacuated in batches. The heap occupancy of a fast singleton-creating loop tends to be a very large sawtooth, climbing rapidly into 1GB heap territory before collection and cleaning bring it back down, but this too is not unlike the old weak map-based implementation.
My implementation of a new subclasses structure in jruby#8462 was a bit too "clever" leading to a potential leak of subclass references. To avoid using a ReferenceQueue, I tied the "cleaning" of the subclasses list to encountering a percentage of empty subclass references during traversal. However if traversal only walks concrete subclass references (via Class#subclasses), empty non-concrete (singleton, included, prepended) subclass references could accumulate. This would only be cleared out with a full (not concrete-only) subclass walk, which only happened for a few internal uses of this list. The patch here reverts to the Map-based approach previously used with two changes: * Instead of using a WeakHashMap-based implementation as found in our ConcurrentWeakHashMap, this uses a synchronized WeakIdentityHashMap, since RubyClass instances should only be compared by identity. This reduces the overhead of using the map. * Concrete subclasses are tracked separately in a second map, to allow fast performance of the user-facing Class#subclasses. Both map fields are lazily initialized: * If the operation is for read and the field is null, they are initialized to Collections.EMPTY_MAP. * If the operation is for write and the field is null or Collections.EMPTY_MAP, they are initialized to a new synchronized WeakIdentityHashMap. Performance remains similar to the linked list-based implementation and has memory and GC characteristics comparable to the pre-linked list logic.
My implementation of a new subclasses structure in jruby#8462 was a bit too "clever" leading to a potential leak of subclass references. To avoid using a ReferenceQueue, I tied the "cleaning" of the subclasses list to encountering a percentage of empty subclass references during traversal. However if traversal only walks concrete subclass references (via Class#subclasses), empty non-concrete (singleton, included, prepended) subclass references could accumulate. This would only be cleared out with a full (not concrete-only) subclass walk, which only happened for a few internal uses of this list. The patch here reverts to the Map-based approach previously used with two changes: * Instead of a fully-concurrent or synchronized WeakHashMap, we use read/write locking to favor mostly-read operations like Class#subclasses and hierarchy-walking. * Concrete subclasses are tracked separately in a second map, to allow fast performance of the user-facing Class#subclasses. Both map fields are lazily initialized: * If the operation is for read and the field is null, they are initialized to Collections.EMPTY_MAP. * If the operation is for write and the field is null or Collections.EMPTY_MAP, they are initialized to a new read/write locking WeakHashMap.
There were several issues with the old implementation:
Performance is significantly better for zero, small, and large subclass lists.
1M * Object.subclasses, with 83 elements:
Before:
2.020000 0.040000 2.060000 ( 1.823539)
1.980000 0.030000 2.010000 ( 1.737808)
1.830000 0.010000 1.840000 ( 1.738772)
1.730000 0.020000 1.750000 ( 1.740675)
1.720000 0.000000 1.720000 ( 1.707344)
After:
1.340000 0.020000 1.360000 ( 1.034745)
0.930000 0.000000 0.930000 ( 0.918107)
0.920000 0.010000 0.930000 ( 0.914828)
0.920000 0.000000 0.920000 ( 0.922137)
0.920000 0.000000 0.920000 ( 0.915440)
10M * Numeric.subclasses, with 4 elements:
Before:
0.930000 0.030000 0.960000 ( 0.789997)
0.640000 0.010000 0.650000 ( 0.621404)
0.620000 0.010000 0.630000 ( 0.614404)
0.630000 0.000000 0.630000 ( 0.629492)
0.630000 0.010000 0.640000 ( 0.608538)
After:
0.720000 0.010000 0.730000 ( 0.559470)
0.460000 0.000000 0.460000 ( 0.454176)
0.510000 0.000000 0.510000 ( 0.429875)
0.430000 0.010000 0.440000 ( 0.434487)
0.430000 0.000000 0.430000 ( 0.427971)
Fixes #8457