Optimize Class#subclasses #8462

headius · 2024-11-25T21:32:39Z

There were several issues with the old implementation:

* Iterating by creating a Set and an Iterator adds significant allocation overhead. Better to use Map#forEach.
* ConcurrentWeakHashMap is a very old implementation that has several inefficiencies as well as no custom implementation of Map#forEach (it falls back on the default impl which creates a Set and Iterator). Instead we use a synchronized WeakHashMap.
* The initial array was created with a default size of 4 elements. For many classes this will be insufficient, leading to excessive array allocation and copying as it grows. Instead we use the current class's "subclasses" size as an initial estimate.
* When no subclasses appear to be present, bail out early with an empty array that does minimal allocation.
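The points above can be illustrated with a minimal sketch. It assumes a hypothetical `SubclassSnapshot` holder with `Object` keys; the class, field, and method names are illustrative only and are not taken from the JRuby source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical holder class; JRuby's real fields and types differ.
class SubclassSnapshot {
    // WeakHashMap overrides forEach to walk its table directly, and the
    // synchronized wrapper runs that forEach while holding the map's lock.
    private final Map<Object, Boolean> subclasses =
            Collections.synchronizedMap(new WeakHashMap<>());

    void add(Object subclass) {
        subclasses.put(subclass, Boolean.TRUE);
    }

    List<Object> snapshot() {
        // Use the current map size as an initial capacity estimate.
        int estimate = subclasses.size();

        // Bail out early with a shared empty list when nothing appears to be present.
        if (estimate == 0) return Collections.emptyList();

        List<Object> result = new ArrayList<>(estimate);
        // No Set or Iterator is allocated for the traversal.
        subclasses.forEach((klass, ignored) -> result.add(klass));
        return result;
    }
}
```

Calling `snapshot()` on a fresh instance returns the shared empty list; after a few `add` calls it returns a list presized from the map's size.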

Performance is significantly better for zero, small, and large subclass lists.

1M * Object.subclasses, with 83 elements:

Before:

2.020000 0.040000 2.060000 ( 1.823539)
1.980000 0.030000 2.010000 ( 1.737808)
1.830000 0.010000 1.840000 ( 1.738772)
1.730000 0.020000 1.750000 ( 1.740675)
1.720000 0.000000 1.720000 ( 1.707344)

After:

1.340000 0.020000 1.360000 ( 1.034745)
0.930000 0.000000 0.930000 ( 0.918107)
0.920000 0.010000 0.930000 ( 0.914828)
0.920000 0.000000 0.920000 ( 0.922137)
0.920000 0.000000 0.920000 ( 0.915440)

10M * Numeric.subclasses, with 4 elements:

Before:

0.930000 0.030000 0.960000 ( 0.789997)
0.640000 0.010000 0.650000 ( 0.621404)
0.620000 0.010000 0.630000 ( 0.614404)
0.630000 0.000000 0.630000 ( 0.629492)
0.630000 0.010000 0.640000 ( 0.608538)

After:

0.720000 0.010000 0.730000 ( 0.559470)
0.460000 0.000000 0.460000 ( 0.454176)
0.510000 0.000000 0.510000 ( 0.429875)
0.430000 0.010000 0.440000 ( 0.434487)
0.430000 0.000000 0.430000 ( 0.427971)

Fixes #8457

headius · 2024-11-25T22:33:21Z

@mohamedhafez Could you give this branch a try and see how the performance looks for you?

mohamedhafez · 2024-11-25T23:19:30Z

Sure, can I just cherry-pick this commit onto a branch based on 9.4.9.0 so I'm sure everything else is compatible with my app?

headius · 2024-11-25T23:45:06Z

@mohamedhafez Yes, should apply just fine to 9.4.9.0.

headius · 2024-11-26T21:13:28Z

Feedback from @mohamedhafez showed that the hard locking here is a big bottleneck for a highly concurrent app, so I'm looking at lighter-weight locking options.

This patch was an attempt to use a ReentrantReadWriteLock, but it has overhead of its own and made things worse: https://gist.github.com/headius/30578ea0e26c9566462e70d963b2cd29

I'll attempt a lock-free version.

headius · 2024-11-26T21:53:45Z

A second attempt with a volatile field and atomic updates improves over the synchronized version but uses a lot of background CPU spinning (a better impl would park the thread but then we have to track waiters, etc).

In the same gist as above: https://gist.github.com/headius/30578ea0e26c9566462e70d963b2cd29

headius · 2024-11-27T00:02:54Z

A third attempt makes the subclasses field a read-only collection that is copied on write whenever it needs to be updated. This gives very fast performance for Class#subclasses but makes modifying the subclass list very expensive. It is also potentially not quite thread-safe, since WeakHashMap.get may mutate the underlying map if there's a vacated reference.
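A rough sketch of the copy-on-write idea follows. It is not the code from the gist; `CopyOnWriteSubclasses` and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical copy-on-write holder; not the code from the gist.
class CopyOnWriteSubclasses {
    // Readers see whichever read-only map was published last.
    private volatile Map<Object, Boolean> subclasses = Collections.emptyMap();

    // Every update copies the whole map, so writes cost O(n).
    synchronized void add(Object subclass) {
        Map<Object, Boolean> copy = new WeakHashMap<>(subclasses);
        copy.put(subclass, Boolean.TRUE);
        subclasses = Collections.unmodifiableMap(copy);
    }

    // Reads take no lock. Caveat: WeakHashMap read methods such as size, get,
    // and forEach may expunge stale entries internally, so concurrent readers
    // are not strictly safe.
    List<Object> snapshot() {
        Map<Object, Boolean> current = subclasses; // single volatile read
        List<Object> result = new ArrayList<>(current.size());
        current.forEach((klass, ignored) -> result.add(klass));
        return result;
    }
}
```

The trade-off is visible in the shape of the code: `snapshot()` is a volatile read plus a traversal, while `add` allocates and fills a new map each time.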

This is similar to the CRuby subclasses list except for two key features:

* We use weak references instead of pointer values; the GC vacates references for us.
* The linked list structure is immutable and concurrency-safe.

This impl is thread-safe and lock-free due to the immutable linked list structure. Adding a new class creates a new head node and atomically reassigns it into the class. Removing a class finds that element and vacates its reference. Replacing a class first removes the old and then adds the new. Traversing is a matter of walking the chain and omitting vacated references.

Periodically, the list must be rebuilt without dead references. This is hardcoded currently to be when the list contains more than 25% vacated references.

Adding a class is an O(1) operation. Removal, replacement, and traversal are amortized O(n).

This structure is also lighter-weight than either the original ConcurrentWeakHashMap or any implementation of WeakHashMap provided by the JDK, plus it has no lock overhead and very little synchronization overhead.
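The structure described above might look roughly like the following sketch. `WeakSubclassList`, `Node`, and `nodeCount` are hypothetical names, and this is not the class that was pushed to the branch; it only shows the add, remove, traverse, and rebuild steps under the stated 25% threshold.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of an immutable weak-reference list; not JRuby's class.
class WeakSubclassList<T> {
    // Links never change after construction; only the weak reference can be cleared.
    private static final class Node<T> {
        final WeakReference<T> ref;
        final Node<T> next;
        Node(WeakReference<T> ref, Node<T> next) { this.ref = ref; this.next = next; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    // O(1): create a new head node and atomically swap it in.
    void add(T subclass) {
        WeakReference<T> ref = new WeakReference<>(subclass);
        head.updateAndGet(old -> new Node<>(ref, old));
    }

    // O(n): find the element and vacate its reference.
    void remove(T subclass) {
        for (Node<T> n = head.get(); n != null; n = n.next) {
            if (n.ref.get() == subclass) { n.ref.clear(); return; }
        }
    }

    // O(n): walk the chain, skip vacated references, rebuild if over 25% are vacated.
    List<T> toList() {
        Node<T> start = head.get();
        List<T> live = new ArrayList<>();
        List<WeakReference<T>> liveRefs = new ArrayList<>();
        int total = 0;
        for (Node<T> n = start; n != null; n = n.next) {
            total++;
            T value = n.ref.get();
            if (value != null) { live.add(value); liveRefs.add(n.ref); }
        }
        if ((total - live.size()) * 4 > total) rebuild(start, liveRefs);
        return live;
    }

    // Reuses the same WeakReference objects so a clear() made through an older
    // copy of the list stays visible. Skips the swap if another thread added a node.
    private void rebuild(Node<T> seenHead, List<WeakReference<T>> liveRefs) {
        Node<T> rebuilt = null;
        for (int i = liveRefs.size() - 1; i >= 0; i--) {
            rebuilt = new Node<>(liveRefs.get(i), rebuilt);
        }
        head.compareAndSet(seenHead, rebuilt);
    }

    // Counts nodes including vacated ones; for illustration only.
    int nodeCount() {
        int count = 0;
        for (Node<T> n = head.get(); n != null; n = n.next) count++;
        return count;
    }
}
```

In this sketch a failed `compareAndSet` simply leaves the old chain in place, and a later traversal gets another chance to rebuild.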

headius · 2024-11-27T18:42:54Z

I've pushed a new implementation of the subclasses collection that is just a linked list of weak references. The list is immutable so the only synchronization required is an atomic update of the head reference. Adding a class is O(1) and just attaches a new head. Removal, replacement, and traversal are O(n) linked list walks. Periodically (currently hardcoded to 25% evacuated references) the list gets rebuilt at the end of traversal, removing any dead links but preserving previous links with the same weak references; this allows the remove operation to simply clear weak references knowing that will be visible on any thread's copy of the list.

This is the fastest option so far for Object.subclasses but it is occasionally, strangely slower for the smaller Numeric.subclasses case. Benchmark results are below for 9.4.9.0, 9.4.10.0 (master), and CRuby 3.3 (benchmark will be committed soon):

9.4.9.0:

1 thread Numeric.subclasses
                          1.292k (±14.6%) i/s -      6.240k in   5.011785s
1 thread Object.subclasses
                         56.428  (± 3.5%) i/s -    285.000  in   5.056858s
5 thread Numeric.subclasses
                          1.906k (±18.1%) i/s -      9.265k in   5.000260s
5 thread Object.subclasses
                        157.154  (± 4.5%) i/s -    795.000  in   5.068326s
10 thread Numeric.subclasses
                          1.130k (± 6.6%) i/s -      5.700k in   5.064841s
10 thread Object.subclasses
                        170.303  (± 5.9%) i/s -    850.000  in   5.009226s
50 thread Numeric.subclasses
                        323.931  (± 4.3%) i/s -      1.632k in   5.047326s
50 thread Object.subclasses
                        139.149  (± 3.6%) i/s -    705.000  in   5.072644s

9.4.10.0:

1 thread Numeric.subclasses
                          2.025k (±17.0%) i/s -      9.869k in   5.031959s
1 thread Object.subclasses
                        198.861 (±10.1%) i/s -    988.000 in   5.094656s
5 thread Numeric.subclasses
                          1.744k (±27.6%) i/s -      8.190k in   5.021846s
5 thread Object.subclasses
                        367.245 (±10.1%) i/s -      1.820k in   5.015953s
10 thread Numeric.subclasses
                        990.057 (±11.4%) i/s -      4.898k in   5.024722s
10 thread Object.subclasses
                        372.548 (±11.0%) i/s -      1.855k in   5.040239s
50 thread Numeric.subclasses
                        323.640 (± 5.3%) i/s -      1.632k in   5.059211s
50 thread Object.subclasses
                        216.725 (± 5.1%) i/s -      1.092k in   5.051484s

CRuby 3.3:

1 thread Numeric.subclasses
                        999.470 (± 2.4%) i/s    (1.00 ms/i) -      5.047k in   5.052801s
1 thread Object.subclasses
                         86.706 (± 1.2%) i/s   (11.53 ms/i) -    440.000 in   5.076068s
5 thread Numeric.subclasses
                        796.605 (± 7.4%) i/s    (1.26 ms/i) -      3.956k in   5.011233s
5 thread Object.subclasses
                         84.310 (± 3.6%) i/s   (11.86 ms/i) -    424.000 in   5.035321s
10 thread Numeric.subclasses
                        656.311 (± 9.8%) i/s    (1.52 ms/i) -      3.237k in   5.012322s
10 thread Object.subclasses
                         83.197 (± 2.4%) i/s   (12.02 ms/i) -    416.000 in   5.002811s
50 thread Numeric.subclasses
                        363.659 (± 6.3%) i/s    (2.75 ms/i) -      1.833k in   5.067702s
50 thread Object.subclasses
                         72.960 (± 2.7%) i/s   (13.71 ms/i) -    371.000 in   5.089632s

headius · 2024-11-27T20:45:00Z

I've pushed a refactoring that gets all benchmarks faster than 9.4.9.0, although sometimes just barely. The larger Object.subclasses cases are even faster:

9.4.9.0:

1 thread Numeric.subclasses
                          1.281k (±10.4%) i/s -      6.348k in   5.015677s
1 thread Object.subclasses
                         57.555  (± 1.7%) i/s -    290.000  in   5.042078s
5 thread Numeric.subclasses
                          1.866k (±12.8%) i/s -      9.243k in   5.026434s
5 thread Object.subclasses
                        156.215  (± 3.8%) i/s -    780.000  in   5.000381s
10 thread Numeric.subclasses
                          1.151k (±11.8%) i/s -      5.700k in   5.027341s
10 thread Object.subclasses
                        182.022  (±15.9%) i/s -    864.000  in   5.076040s
50 thread Numeric.subclasses
                        276.258  (±33.3%) i/s -      1.122k in   5.030777s
50 thread Object.subclasses
                        149.436  (±16.1%) i/s -    705.000  in   5.005946s

New impl:

1 thread Numeric.subclasses
                          1.871k (± 9.9%) i/s -      9.313k in   5.028326s
1 thread Object.subclasses
                        194.344 (± 2.6%) i/s -    988.000 in   5.087741s
5 thread Numeric.subclasses
                          1.966k (±12.9%) i/s -      9.750k in   5.044797s
5 thread Object.subclasses
                        475.950 (± 8.4%) i/s -      2.392k in   5.065777s
10 thread Numeric.subclasses
                          1.170k (±16.9%) i/s -      5.720k in   5.238812s
10 thread Object.subclasses
                        356.624 (±41.5%) i/s -      1.470k in   5.065979s
50 thread Numeric.subclasses
                        338.594 (±13.9%) i/s -      1.665k in   5.075536s
50 thread Object.subclasses
                        251.153 (±10.4%) i/s -      1.250k in   5.025052s

* Split large methods into smaller ones.
* Eliminate repeated null-checking.
* Always use estimate to create array/list, but add fast path to newArray that reuses zero-length array.
* Add estimates for "all subclasses" and "all descendants" used by the old form.
* Only update estimates for the target class, not for subclasses.

This refactored impl is consistently faster than 9.4.9.0; the benchmark results are the ones listed above.

headius · 2024-11-27T21:50:04Z

This is ready to merge any time. This is about the best I think we can do right now, all things considered.

We discussed a bit on Matrix how Rails really should not be generating these lists over and over again, but that is a larger optimization to explore and implement. If they use this same logic on CRuby and JRuby, we should at least not be slower at it.

My implementation of a new subclasses structure in jruby#8462 was a bit too "clever" leading to a potential leak of subclass references. To avoid using a ReferenceQueue, I tied the "cleaning" of the subclasses list to encountering a percentage of empty subclass references during traversal. However if traversal only walks concrete subclass references (via Class#subclasses), empty non-concrete (singleton, included, prepended) subclass references could accumulate. This would only be cleared out with a full (not concrete-only) subclass walk, which only happened for a few internal uses of this list.

The patch here reverts to a more reliable ReferenceQueue-based approach:

* When the first subclass is added, a ReferenceQueue is also created for the subclass weak references.
* When adding or removing subsequent subclasses, a non-null queue poll indicates a full clean is necessary before proceeding.

This amortizes the cleaning of the list to only fire after references have been evacuated and additional mutations of the list are requested. The changes here appear to have minimal impact on mutating the subclass list, since references will tend to be evacuated in batches. The heap occupancy of a fast singleton-creating loop tends to be a very large sawtooth, climbing rapidly into 1GB heap territory before collection and cleaning bring it back down, but this too is not unlike the old weak map-based implementation.

The patch here reverts to the Map-based approach previously used with two changes:

* Instead of using a WeakHashMap-based implementation as found in our ConcurrentWeakHashMap, this uses a synchronized WeakIdentityHashMap, since RubyClass instances should only be compared by identity. This reduces the overhead of using the map.
* Concrete subclasses are tracked separately in a second map, to allow fast performance of the user-facing Class#subclasses.

Both map fields are lazily initialized:

* If the operation is for read and the field is null, they are initialized to Collections.EMPTY_MAP.
* If the operation is for write and the field is null or Collections.EMPTY_MAP, they are initialized to a new synchronized WeakIdentityHashMap.

Performance remains similar to the linked list-based implementation and has memory and GC characteristics comparable to the pre-linked list logic.

The patch here reverts to the Map-based approach previously used with two changes:

* Instead of a fully-concurrent or synchronized WeakHashMap, we use read/write locking to favor mostly-read operations like Class#subclasses and hierarchy-walking.
* Concrete subclasses are tracked separately in a second map, to allow fast performance of the user-facing Class#subclasses.

Both map fields are lazily initialized:

* If the operation is for read and the field is null, they are initialized to Collections.EMPTY_MAP.
* If the operation is for write and the field is null or Collections.EMPTY_MAP, they are initialized to a new read/write locking WeakHashMap.
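The ReferenceQueue-based cleaning described above (create the queue with the first subclass, then treat a non-null poll on later mutations as the signal for a full clean) can be sketched as follows. This is an illustration under simplifying assumptions: `QueueCleanedSubclasses` is a hypothetical name, and a synchronized `ArrayList` stands in for whatever structure the patch actually uses.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; a synchronized ArrayList stands in for the real structure.
class QueueCleanedSubclasses<T> {
    private ReferenceQueue<T> queue; // created lazily with the first subclass
    private final List<WeakReference<T>> refs = new ArrayList<>();

    // Returns the reference only so a demo can simulate collection by the GC.
    synchronized WeakReference<T> add(T subclass) {
        if (queue == null) {
            queue = new ReferenceQueue<>();
        } else {
            cleanIfNeeded();
        }
        WeakReference<T> ref = new WeakReference<>(subclass, queue);
        refs.add(ref);
        return ref;
    }

    synchronized void remove(T subclass) {
        if (queue == null) return;
        cleanIfNeeded();
        refs.removeIf(r -> r.get() == subclass);
    }

    // A non-null poll means at least one reference was enqueued: drain the
    // queue, then do one full clean of all vacated references.
    private void cleanIfNeeded() {
        if (queue.poll() == null) return;
        while (queue.poll() != null) { /* drain the rest of the batch */ }
        refs.removeIf(r -> r.get() == null);
    }

    // Number of references currently tracked, vacated or not.
    synchronized int trackedReferences() {
        return refs.size();
    }
}
```

Because cleaning only runs on a later `add` or `remove`, vacated references linger until the next mutation, which matches the amortized behavior described in the commit message.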

headius added this to the JRuby 9.4.10.0 milestone Nov 25, 2024

headius force-pushed the subclasses_optz branch from facd83a to 8c01c95 Compare November 25, 2024 21:51

headius force-pushed the subclasses_optz branch from 8c01c95 to 0d39d52 Compare November 25, 2024 21:52

This was referenced Nov 25, 2024

Weak collection implementations have bugs and perf issues #8463

Open

Class#subclasses slows down with larger sets #8457

Closed

Avoid size calculation for subclass list

09bcd0d

Getting this accurate is not important, but most weak collections will attempt to get an accurate count at increased expense. This patch just saves the last list size and uses that for future list allocations.

headius force-pushed the subclasses_optz branch from 5223c84 to 680d2e9 Compare November 27, 2024 03:08

headius force-pushed the subclasses_optz branch from 680d2e9 to 6c6982c Compare November 27, 2024 04:18

headius added 2 commits November 27, 2024 11:57

Don't double-add subclass

3cf9cf9

The MetaClass constructor already does this.

Dead code

665ba78

Add Class#subclasses benchmark with concurrency

57d8940

headius force-pushed the subclasses_optz branch from 62c27a8 to 2976caf Compare November 27, 2024 20:47

headius mentioned this pull request Nov 27, 2024

Rework method cache invalidation for performance #8464

Open

headius merged commit 422bb1c into jruby:master Dec 3, 2024
94 of 95 checks passed

headius deleted the subclasses_optz branch December 3, 2024 20:51

headius mentioned this pull request Jan 27, 2025

Eliminate leak of non-concrete subclass references #8591

Merged
