Skip to content

Optimize Class#subclasses #8462

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Dec 3, 2024
Merged

Optimize Class#subclasses #8462

merged 7 commits into from
Dec 3, 2024

Conversation

headius
Copy link
Member

@headius headius commented Nov 25, 2024

There were several issues with the old implementation:

  • Iterating by creating a Set and an Iterator adds significant allocation overhead. Better to use Map#forEach.
  • ConcurrentWeakHashMap is a very old implementation that has several inefficiencies as well as no custom implementation of Map#forEach (it falls back on the default impl which creates a Set and Iterator). Instead we use a synchronized WeakHashMap.
  • The initial array was created with a default size of 4 elements. For many classes this will be insufficient, leading to excessive array allocation and copying as it grows. Instead we use the current class's "subclasses" size as an initial estimate.
  • When no subclasses appear to be present, bail out early with an empty array that does minimal allocation.

Performance is significantly better for zero, small, and large subclass lists.

1M * Object.subclasses, with 83 elements:

Before:

2.020000 0.040000 2.060000 ( 1.823539)
1.980000 0.030000 2.010000 ( 1.737808)
1.830000 0.010000 1.840000 ( 1.738772)
1.730000 0.020000 1.750000 ( 1.740675)
1.720000 0.000000 1.720000 ( 1.707344)

After:

1.340000 0.020000 1.360000 ( 1.034745)
0.930000 0.000000 0.930000 ( 0.918107)
0.920000 0.010000 0.930000 ( 0.914828)
0.920000 0.000000 0.920000 ( 0.922137)
0.920000 0.000000 0.920000 ( 0.915440)

10M * Numeric.subclasses, with 4 elements:

Before:

0.930000 0.030000 0.960000 ( 0.789997)
0.640000 0.010000 0.650000 ( 0.621404)
0.620000 0.010000 0.630000 ( 0.614404)
0.630000 0.000000 0.630000 ( 0.629492)
0.630000 0.010000 0.640000 ( 0.608538)

After:

0.720000 0.010000 0.730000 ( 0.559470)
0.460000 0.000000 0.460000 ( 0.454176)
0.510000 0.000000 0.510000 ( 0.429875)
0.430000 0.010000 0.440000 ( 0.434487)
0.430000 0.000000 0.430000 ( 0.427971)

Fixes #8457

@headius headius added this to the JRuby 9.4.10.0 milestone Nov 25, 2024
There were several issues with the old implementation:

* Iterating by creating a Set and an Iterator adds significant
  allocation overhead. Better to use Map#forEach.
* ConcurrentWeakHashMap is a very old implementation that has
  several inefficiencies as well as no custom implementation of
  Map#forEach (it falls back on the default impl which creates a
  Set and Iterator). Instead we use a synchronized WeakHashMap.
* The initial array was created with a default size of 4 elements.
  For many classes this will be insufficient, leading to excessive
  array allocation and copying as it grows. Instead we use the
  current class's "subclasses" size as an initial estimate.
* When no subclasses appear to be present, bail out early with
  an empty array that does minimal allocation.

Performance is significantly better for zero, small, and large
subclass lists.

1M * Object.subclasses, with 83 elements:

Before:

  2.020000   0.040000   2.060000 (  1.823539)
  1.980000   0.030000   2.010000 (  1.737808)
  1.830000   0.010000   1.840000 (  1.738772)
  1.730000   0.020000   1.750000 (  1.740675)
  1.720000   0.000000   1.720000 (  1.707344)

After:

  1.340000   0.020000   1.360000 (  1.034745)
  0.930000   0.000000   0.930000 (  0.918107)
  0.920000   0.010000   0.930000 (  0.914828)
  0.920000   0.000000   0.920000 (  0.922137)
  0.920000   0.000000   0.920000 (  0.915440)

10M * Numeric.subclasses, with 4 elements:

Before:

  0.930000   0.030000   0.960000 (  0.789997)
  0.640000   0.010000   0.650000 (  0.621404)
  0.620000   0.010000   0.630000 (  0.614404)
  0.630000   0.000000   0.630000 (  0.629492)
  0.630000   0.010000   0.640000 (  0.608538)

After:

  0.720000   0.010000   0.730000 (  0.559470)
  0.460000   0.000000   0.460000 (  0.454176)
  0.510000   0.000000   0.510000 (  0.429875)
  0.430000   0.010000   0.440000 (  0.434487)
  0.430000   0.000000   0.430000 (  0.427971)

Fixes jruby#8457
@headius
Copy link
Member Author

headius commented Nov 25, 2024

@mohamedhafez Could you give this branch a try and see how the performance looks for you?

@mohamedhafez
Copy link
Contributor

Sure, can I just cherry-pick this commit onto a branch based on 9.4.9.0 so I'm sure everything else is compatible with my app?

@headius
Copy link
Member Author

headius commented Nov 25, 2024

@mohamedhafez Yes, should apply just fine to 9.4.9.0.

@headius
Copy link
Member Author

headius commented Nov 26, 2024

Reporting back from @mohamedhafez showed that the hard locking here is a big bottleneck for a highly-concurrent app, so I'm looking at lighter-weight locking options.

This patch was an attempt to use a ReentrantReadWriteLock, but it has overhead of its own and made things worse: https://gist.github.com/headius/30578ea0e26c9566462e70d963b2cd29

I'll attempt a lock-free version.

@headius
Copy link
Member Author

headius commented Nov 26, 2024

A second attempt with a volatile field and atomic updates improves over the synchronized version but uses a lot of background CPU spinning (a better impl would park the thread but then we have to track waiters, etc).

In the same gist as above: https://gist.github.com/headius/30578ea0e26c9566462e70d963b2cd29

Getting this accurate is not important, but most weak collections
will attempt to get an accurate count at increased expense. This
patch just saves the last list size and uses that for future list
allocations.
@headius
Copy link
Member Author

headius commented Nov 27, 2024

A third attempt makes the subclasses field a read-only collection, and copy-on-write whenever it needs to be updated. This has very fast performance for Class#subclasses but very high cost for modifying the subclass list. It also is potentially not quite thread-safe, since WeakHashMap.get may mutate the list if there's a vacated reference.

This is similar to the CRuby subclasses list except for two
key features:

* We use weak references instead of pointer values; the GC vacates
  references for us.
* The linked list structure is immutable and concurrency-safe.

This impl is thread-safe and lock-free due to the immutable linked
list structure.

Adding a new class creates a new head node and atomically
reassigns it into the class. Removing a class finds that element
and vacates its reference. Replacing a class first removes the old
and then adds the new. Traversing is a matter of walking the chain
and omitting vacated references.

Periodically, the list must be rebuilt without dead references.
This is hardcoded currently to be when the list contains more than
25% vacated references.

Adding a class is an O(1) operation. Removal, replacement, and
traversal are amortized O(n).

This structure is also lighter-weight than either the original
ConcurrentWeakHashMap or any implementation of WeakHashMap
provided by the JDK, plus it has no lock overhead and very little
synchronization overhead.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
The MetaClass constructor already does this.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
@headius
Copy link
Member Author

headius commented Nov 27, 2024

I've pushed a new implementation of the subclasses collection that is just a linked list of weak references. The list is immutable so the only synchronization required is an atomic update of the head reference. Adding a class is O(1) and just attaches a new head. Removal, replacement, and traversal are O(n) linked list walks. Periodically (currently hardcoded to 25% evacuated references) the list gets rebuilt at the end of traversal, removing any dead links but preserving previous links with the same weak references; this allows the remove operation to simply clear weak references knowing that will be visible on any thread's copy of the list.

This is the fastest option so far for Object.subclasses but it is occasionally, strangely slower for the smaller Numeric.subclasses case. Benchmark results are below for 9.4.9.0, 9.4.10.0 (master), and CRuby 3.3 (benchmark will be committed soon):

9.4.9.0:

1 thread Numeric.subclasses
                          1.292k (±14.6%) i/s -      6.240k in   5.011785s
1 thread Object.subclasses
                         56.428  (± 3.5%) i/s -    285.000  in   5.056858s
5 thread Numeric.subclasses
                          1.906k (±18.1%) i/s -      9.265k in   5.000260s
5 thread Object.subclasses
                        157.154  (± 4.5%) i/s -    795.000  in   5.068326s
10 thread Numeric.subclasses
                          1.130k (± 6.6%) i/s -      5.700k in   5.064841s
10 thread Object.subclasses
                        170.303  (± 5.9%) i/s -    850.000  in   5.009226s
50 thread Numeric.subclasses
                        323.931  (± 4.3%) i/s -      1.632k in   5.047326s
50 thread Object.subclasses
                        139.149  (± 3.6%) i/s -    705.000  in   5.072644s

9.4.10.0

1 thread Numeric.subclasses
                          2.025k (±17.0%) i/s -      9.869k in   5.031959s
1 thread Object.subclasses
                        198.861 (±10.1%) i/s -    988.000 in   5.094656s
5 thread Numeric.subclasses
                          1.744k (±27.6%) i/s -      8.190k in   5.021846s
5 thread Object.subclasses
                        367.245 (±10.1%) i/s -      1.820k in   5.015953s
10 thread Numeric.subclasses
                        990.057 (±11.4%) i/s -      4.898k in   5.024722s
10 thread Object.subclasses
                        372.548 (±11.0%) i/s -      1.855k in   5.040239s
50 thread Numeric.subclasses
                        323.640 (± 5.3%) i/s -      1.632k in   5.059211s
50 thread Object.subclasses
                        216.725 (± 5.1%) i/s -      1.092k in   5.051484s

CRuby 3.3

1 thread Numeric.subclasses
                        999.470 (± 2.4%) i/s    (1.00 ms/i) -      5.047k in   5.052801s
1 thread Object.subclasses
                         86.706 (± 1.2%) i/s   (11.53 ms/i) -    440.000 in   5.076068s
5 thread Numeric.subclasses
                        796.605 (± 7.4%) i/s    (1.26 ms/i) -      3.956k in   5.011233s
5 thread Object.subclasses
                         84.310 (± 3.6%) i/s   (11.86 ms/i) -    424.000 in   5.035321s
10 thread Numeric.subclasses
                        656.311 (± 9.8%) i/s    (1.52 ms/i) -      3.237k in   5.012322s
10 thread Object.subclasses
                         83.197 (± 2.4%) i/s   (12.02 ms/i) -    416.000 in   5.002811s
50 thread Numeric.subclasses
                        363.659 (± 6.3%) i/s    (2.75 ms/i) -      1.833k in   5.067702s
50 thread Object.subclasses
                         72.960 (± 2.7%) i/s   (13.71 ms/i) -    371.000 in   5.089632s

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
@headius
Copy link
Member Author

headius commented Nov 27, 2024

I've pushed a refactoring that gets all benchmarks faster than 9.4.9.0, although sometimes just barely. The larger Object.subclasses cases are even faster:

9.4.9.0:

1 thread Numeric.subclasses
                          1.281k (±10.4%) i/s -      6.348k in   5.015677s
1 thread Object.subclasses
                         57.555  (± 1.7%) i/s -    290.000  in   5.042078s
5 thread Numeric.subclasses
                          1.866k (±12.8%) i/s -      9.243k in   5.026434s
5 thread Object.subclasses
                        156.215  (± 3.8%) i/s -    780.000  in   5.000381s
10 thread Numeric.subclasses
                          1.151k (±11.8%) i/s -      5.700k in   5.027341s
10 thread Object.subclasses
                        182.022  (±15.9%) i/s -    864.000  in   5.076040s
50 thread Numeric.subclasses
                        276.258  (±33.3%) i/s -      1.122k in   5.030777s
50 thread Object.subclasses
                        149.436  (±16.1%) i/s -    705.000  in   5.005946s

New impl:

1 thread Numeric.subclasses
                          1.871k (± 9.9%) i/s -      9.313k in   5.028326s
1 thread Object.subclasses
                        194.344 (± 2.6%) i/s -    988.000 in   5.087741s
5 thread Numeric.subclasses
                          1.966k (±12.9%) i/s -      9.750k in   5.044797s
5 thread Object.subclasses
                        475.950 (± 8.4%) i/s -      2.392k in   5.065777s
10 thread Numeric.subclasses
                          1.170k (±16.9%) i/s -      5.720k in   5.238812s
10 thread Object.subclasses
                        356.624 (±41.5%) i/s -      1.470k in   5.065979s
50 thread Numeric.subclasses
                        338.594 (±13.9%) i/s -      1.665k in   5.075536s
50 thread Object.subclasses
                        251.153 (±10.4%) i/s -      1.250k in   5.025052s

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
* Split large methods into smaller ones.
* Eliminate repeated null-checking.
* Always use estimate to create array/list, but add fast path to
  newArray that reuses zero-length array.
* Add estimates for "all subclasses" and "all descendants" used by
  the old form.
* Only update estimates for the target class, not for subclasses.

This refactored impl is consistently faster than 9.4.9.0:

9.4.9.0:

1 thread Numeric.subclasses
                          1.281k (±10.4%) i/s -      6.348k in   5.015677s
1 thread Object.subclasses
                         57.555  (± 1.7%) i/s -    290.000  in   5.042078s
5 thread Numeric.subclasses
                          1.866k (±12.8%) i/s -      9.243k in   5.026434s
5 thread Object.subclasses
                        156.215  (± 3.8%) i/s -    780.000  in   5.000381s
10 thread Numeric.subclasses
                          1.151k (±11.8%) i/s -      5.700k in   5.027341s
10 thread Object.subclasses
                        182.022  (±15.9%) i/s -    864.000  in   5.076040s
50 thread Numeric.subclasses
                        276.258  (±33.3%) i/s -      1.122k in   5.030777s
50 thread Object.subclasses
                        149.436  (±16.1%) i/s -    705.000  in   5.005946s

New impl:

1 thread Numeric.subclasses
                          1.871k (± 9.9%) i/s -      9.313k in   5.028326s
1 thread Object.subclasses
                        194.344 (± 2.6%) i/s -    988.000 in   5.087741s
5 thread Numeric.subclasses
                          1.966k (±12.9%) i/s -      9.750k in   5.044797s
5 thread Object.subclasses
                        475.950 (± 8.4%) i/s -      2.392k in   5.065777s
10 thread Numeric.subclasses
                          1.170k (±16.9%) i/s -      5.720k in   5.238812s
10 thread Object.subclasses
                        356.624 (±41.5%) i/s -      1.470k in   5.065979s
50 thread Numeric.subclasses
                        338.594 (±13.9%) i/s -      1.665k in   5.075536s
50 thread Object.subclasses
                        251.153 (±10.4%) i/s -      1.250k in   5.025052s
@headius
Copy link
Member Author

headius commented Nov 27, 2024

This is ready to merge any time. This is about the best I think we can do right now, all things considered.

We discussed on Matrix a bit about how Rails really should not be generating these lists over and over again, but that is a larger optimization to explore and implement. If they use this same logic on CRuby and JRuby, we should at least not be slower at it.

@headius headius merged commit 422bb1c into jruby:master Dec 3, 2024
94 of 95 checks passed
@headius headius deleted the subclasses_optz branch December 3, 2024 20:51
headius added a commit to headius/jruby that referenced this pull request Jan 25, 2025

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
My implementation of a new subclasses structure in jruby#8462
was a bit too "clever" leading to a potential leak of subclass
references. To avoid using a ReferenceQueue, I tied the "cleaning"
of the subclasses list to encountering a percentage of empty
subclass references during traversal. However if traversal only
walks concrete subclass references (via Class#subclasses),
empty non-concrete (singleton, included, prepended) subclass
references could accumulate. This would only be cleared out with
a full (not concrete-only) subclass walk, which only happened for
a few internal uses of this list.

The patch here reverts to a more reliable ReferenceQueue-based
approach:

* When the first subclass is added, a ReferenceQueue is also
  created for the subclass weak references.
* When adding or removing subsequent subclasses, a non-null queue
  poll indicates a full clean is necessary before proceeding.

This amortizes the cleaning of the list to only fire after
references have been evacuated and additional mutations of the list
are requested.

The changes here appear to have minimal impact on mutating the
subclass list, since references will tend to be evacuated in
batches. The heap occupancy of a fast singleton-creating loop
tends to be a very large sawtooth, climbing rapidly into 1GB heap
territory before collection and cleaning bring it back down, but
this too is not unlike the old weak map-based implementation.
headius added a commit to headius/jruby that referenced this pull request Jan 27, 2025
My implementation of a new subclasses structure in jruby#8462
was a bit too "clever" leading to a potential leak of subclass
references. To avoid using a ReferenceQueue, I tied the "cleaning"
of the subclasses list to encountering a percentage of empty
subclass references during traversal. However if traversal only
walks concrete subclass references (via Class#subclasses),
empty non-concrete (singleton, included, prepended) subclass
references could accumulate. This would only be cleared out with
a full (not concrete-only) subclass walk, which only happened for
a few internal uses of this list.

The patch here reverts to the Map-based approach previously
used with two changes:

* Instead of using a WeakHashMap-based implementation as found in
  our ConcurrentWeakHashMap, this uses a synchronized
  WeakIdentityHashMap, since RubyClass instances should only be
  compared by identity. This reduces the overhead of using the
  map.
* Concrete subclasses are tracked separately in a second map, to
  allow fast performance of the user-facing Class#subclasses.

Both map fields are lazily initialized:

* If the operation is for read and the field is null, they are
  initialized to Collections.EMPTY_MAP.
* If the operation is for write and the field is null or
  Collections.EMPTY_MAP, they are initialized to a new
  synchronized WeakIdentityHashMap.

Performance remains similar to the linked list-based implementation
and has memory and GC characteristics comparable to the pre-linked
list logic.
headius added a commit to headius/jruby that referenced this pull request Jan 27, 2025
My implementation of a new subclasses structure in jruby#8462
was a bit too "clever" leading to a potential leak of subclass
references. To avoid using a ReferenceQueue, I tied the "cleaning"
of the subclasses list to encountering a percentage of empty
subclass references during traversal. However if traversal only
walks concrete subclass references (via Class#subclasses),
empty non-concrete (singleton, included, prepended) subclass
references could accumulate. This would only be cleared out with
a full (not concrete-only) subclass walk, which only happened for
a few internal uses of this list.

The patch here reverts to the Map-based approach previously
used with two changes:

* Instead of a fully-concurrent or synchronized WeakHashMap, we
  use read/write locking to favor mostly-read operations like
  Class#subclasses and hierarchy-walking.
* Concrete subclasses are tracked separately in a second map, to
  allow fast performance of the user-facing Class#subclasses.

Both map fields are lazily initialized:

* If the operation is for read and the field is null, they are
  initialized to Collections.EMPTY_MAP.
* If the operation is for write and the field is null or
  Collections.EMPTY_MAP, they are initialized to a new
  read/write locking WeakHashMap.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Class#subclasses slows down with larger sets
2 participants