Eliminate leak of non-concrete subclass references #8591

Merged: headius merged 3 commits into jruby:master from subclasses_leak on Jan 27, 2025

headius commented Jan 27, 2025

My implementation of a new subclasses structure in #8462
was a bit too "clever", leading to a potential leak of subclass
references. To avoid using a ReferenceQueue, I tied the "cleaning"
of the subclasses list to encountering a percentage of empty
subclass references during traversal. However, if traversal only
walks concrete subclass references (via Class#subclasses),
empty non-concrete (singleton, included, prepended) subclass
references could accumulate. These would only be cleared out by
a full (not concrete-only) subclass walk, which only happened for
a few internal uses of this list.
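
To make the failure mode concrete, here is a toy model of it. This is not the code from #8462: `LeakyList`, `Entry`, and the 50% threshold are hypothetical names and values chosen only to show how a cleanup that is triggered by the entries a walk actually visits can miss entries the walk skips.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

// Toy model only: LeakyList, Entry and the 50% threshold are hypothetical,
// not JRuby's actual classes or tuning.
class LeakyList {
    static final class Entry {
        final WeakReference<Object> ref;
        final boolean concrete;

        Entry(Object target, boolean concrete) {
            this.ref = new WeakReference<>(target);
            this.concrete = concrete;
        }
    }

    final List<Entry> entries = new ArrayList<>();

    void add(Object target, boolean concrete) {
        entries.add(new Entry(target, concrete));
    }

    // Walks the entries and returns the live targets. Cleanup fires only when
    // at least half of the *visited* entries turned out to be empty.
    List<Object> walk(boolean concreteOnly) {
        List<Object> live = new ArrayList<>();
        int visited = 0;
        int empty = 0;
        for (Entry e : entries) {
            if (concreteOnly && !e.concrete) continue; // skipped, so never counted
            visited++;
            Object target = e.ref.get();
            if (target == null) empty++; else live.add(target);
        }
        if (visited > 0 && empty * 2 >= visited) {
            entries.removeIf(e -> e.ref.get() == null);
        }
        return live;
    }
}
```

In this model a concrete-only walk never sees the dead non-concrete entries, so the cleanup threshold is never reached no matter how many of them pile up; only a full walk removes them.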

The patch here reverts to the Map-based approach used
previously, with two changes:

  • Instead of a fully-concurrent or synchronized WeakHashMap, we
    use read/write locking to favor mostly-read operations like
    Class#subclasses and hierarchy-walking.
  • Concrete subclasses are tracked separately in a second map, to
    allow fast performance of the user-facing Class#subclasses.
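
A minimal sketch of that shape, assuming a `ReentrantReadWriteLock` guarding two `WeakHashMap` instances. `SubclassRegistry` and its method names are hypothetical and not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: SubclassRegistry and its methods are illustrative names.
class SubclassRegistry<K> {
    // Every subclass reference (concrete, singleton, included, prepended).
    private final Map<K, Boolean> all = new WeakHashMap<>();
    // Concrete subclasses only, so the user-facing query never filters.
    private final Map<K, Boolean> concrete = new WeakHashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void add(K subclass, boolean isConcrete) {
        lock.writeLock().lock();
        try {
            all.put(subclass, Boolean.TRUE);
            if (isConcrete) concrete.put(subclass, Boolean.TRUE);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Many readers may hold the read lock at once; writers are exclusive.
    // Caveat: WeakHashMap can expunge stale entries internally even on
    // read-style calls, so treat the read lock here as illustrative.
    List<K> concreteSubclasses() {
        lock.readLock().lock();
        try {
            return new ArrayList<>(concrete.keySet());
        } finally {
            lock.readLock().unlock();
        }
    }

    int totalCount() {
        lock.readLock().lock();
        try {
            return all.size();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Because the maps hold their keys weakly, collected subclasses drop out without any traversal-driven cleanup, which is the property the linked-list version lost.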

Both map fields are lazily initialized:

  • If the operation is for read and the field is null, it is
    initialized to Collections.EMPTY_MAP.
  • If the operation is for write and the field is null or
    Collections.EMPTY_MAP, it is initialized to a new
    read/write locking WeakHashMap.
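
The two rules can be sketched as follows. This is an assumption-laden illustration: `LazySubclassMap` is a hypothetical name, an `AtomicReference` stands in for whatever field-update mechanism the patch uses, and `Collections.synchronizedMap` stands in for the read/write locking map.

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the lazy-initialization rules described above.
class LazySubclassMap<K> {
    private final AtomicReference<Map<K, Boolean>> field = new AtomicReference<>();

    // Read path: a null field becomes the shared immutable empty map,
    // so readers never allocate.
    Map<K, Boolean> forRead() {
        Map<K, Boolean> map = field.get();
        if (map != null) return map;
        field.compareAndSet(null, Collections.<K, Boolean>emptyMap());
        return field.get(); // either our empty map or a writer's real map
    }

    // Write path: null or the empty placeholder is replaced by a real map.
    Map<K, Boolean> forWrite() {
        while (true) {
            Map<K, Boolean> map = field.get();
            if (map != null && map != Collections.EMPTY_MAP) return map;
            // Stand-in for the read/write locking WeakHashMap in the patch.
            Map<K, Boolean> fresh = Collections.synchronizedMap(new WeakHashMap<>());
            if (field.compareAndSet(map, fresh)) return fresh;
            // Lost a race with another thread; re-read and retry.
        }
    }
}
```

Classes that never gain a subclass only ever hold the shared empty map, so the cost of a real map is paid on the first write.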

headius added this to the JRuby 9.4.11.0 milestone Jan 27, 2025

headius commented Jan 27, 2025

Performance comparison with 9.4.10.0, which leaked but had a more efficient data structure for traversal (needed for Class#subclasses):

9.4.10.0:

1 thread Numeric.subclasses
                          2.638k (± 7.8%) i/s  (379.10 μs/i) -     13.167k in   5.028838s
1 thread Object.subclasses
                        226.221 (± 3.5%) i/s    (4.42 ms/i) -      1.134k in   5.020683s
1 thread Custom.subclasses with 1000 singletons
                          5.680k (± 3.2%) i/s  (176.04 μs/i) -     28.779k in   5.072108s
5 thread Numeric.subclasses
                          3.022k (± 8.1%) i/s  (330.87 μs/i) -     14.994k in   5.001531s
5 thread Object.subclasses
                        433.825 (±13.6%) i/s    (2.31 ms/i) -      2.115k in   4.998935s
5 thread Custom.subclasses with 1000 singletons
                          3.568k (± 7.5%) i/s  (280.24 μs/i) -     17.982k in   5.068942s
10 thread Numeric.subclasses
                          1.995k (± 6.3%) i/s  (501.32 μs/i) -     10.098k in   5.089584s
10 thread Object.subclasses
                        465.116 (±12.7%) i/s    (2.15 ms/i) -      2.279k in   5.004614s
10 thread Custom.subclasses with 1000 singletons
                          1.981k (± 5.2%) i/s  (504.75 μs/i) -     10.036k in   5.079595s
50 thread Numeric.subclasses
                        418.415 (± 6.2%) i/s    (2.39 ms/i) -      2.100k in   5.037772s
50 thread Object.subclasses
                        318.736 (± 8.5%) i/s    (3.14 ms/i) -      1.584k in   5.010015s
50 thread Custom.subclasses with 1000 singletons
                        422.564 (± 3.5%) i/s    (2.37 ms/i) -      2.142k in   5.076173s

New impl:

1 thread Numeric.subclasses
                          1.736k (± 5.6%) i/s  (575.96 μs/i) -      8.692k in   5.024334s
1 thread Object.subclasses
                         98.895 (± 3.0%) i/s   (10.11 ms/i) -    495.000 in   5.009933s
1 thread Custom.subclasses with 1000 singletons
                          4.784k (± 4.6%) i/s  (209.03 μs/i) -     23.922k in   5.011886s
5 thread Numeric.subclasses
                        497.323 (± 3.8%) i/s    (2.01 ms/i) -      2.499k in   5.032506s
5 thread Object.subclasses
                         76.209 (±10.5%) i/s   (13.12 ms/i) -    371.000 in   5.039466s
5 thread Custom.subclasses with 1000 singletons
                          2.996k (± 5.4%) i/s  (333.75 μs/i) -     14.950k in   5.005631s
10 thread Numeric.subclasses
                        412.690 (± 6.1%) i/s    (2.42 ms/i) -      2.100k in   5.106619s
10 thread Object.subclasses
                         73.359 (±10.9%) i/s   (13.63 ms/i) -    364.000 in   5.048339s
10 thread Custom.subclasses with 1000 singletons
                          1.832k (±10.7%) i/s  (545.74 μs/i) -      9.071k in   5.048197s
50 thread Numeric.subclasses
                        294.017 (±41.5%) i/s    (3.40 ms/i) -      1.218k in   5.308313s
50 thread Object.subclasses
                         66.175 (±24.2%) i/s   (15.11 ms/i) -    308.000 in   5.006706s
50 thread Custom.subclasses with 1000 singletons
                        421.576 (± 3.6%) i/s    (2.37 ms/i) -      2.142k in   5.087584s

Notice that performance is not as fast as the (broken) linked list implementation, but compare to CRuby and JRuby 9.4.8.0 (before any of these changes):

CRuby 3.4

1 thread Numeric.subclasses
                          1.020k (± 0.9%) i/s -      5.202k in   5.098416s
1 thread Object.subclasses
                         78.389 (± 1.3%) i/s -    392.000 in   5.001083s
1 thread Custom.subclasses with 1000 singletons
                         11.712 (± 0.0%) i/s -     59.000 in   5.037649s
5 thread Numeric.subclasses
                        902.478 (± 2.3%) i/s -      4.512k in   5.002271s
5 thread Object.subclasses
                         76.954 (± 1.3%) i/s -    385.000 in   5.003450s
5 thread Custom.subclasses with 1000 singletons
                         11.701 (± 0.0%) i/s -     59.000 in   5.042449s
10 thread Numeric.subclasses
                        766.111 (± 1.7%) i/s -      3.864k in   5.045062s
10 thread Object.subclasses
                         76.226 (± 1.3%) i/s -    385.000 in   5.051375s
10 thread Custom.subclasses with 1000 singletons
                         11.605 (± 0.0%) i/s -     59.000 in   5.084250s
50 thread Numeric.subclasses
                        399.096 (± 2.0%) i/s -      2.009k in   5.035707s
50 thread Object.subclasses
                         68.591 (± 1.5%) i/s -    348.000 in   5.075215s
50 thread Custom.subclasses with 1000 singletons
                         11.266 (± 8.9%) i/s -     57.000 in   5.168617s

JRuby is much faster than CRuby for Class#subclasses when the target class has many singletons. It is faster for Numeric.subclasses and Object.subclasses when not under contention. The read/write locking is not as efficient as we would like.

JRuby 9.4.8.0

1 thread Numeric.subclasses
                          1.272k (± 3.1%) i/s -      6.426k in   5.057134s
1 thread Object.subclasses
                         55.546  (± 3.6%) i/s -    280.000  in   5.045423s
1 thread Custom.subclasses with 1000 singletons
                         10.783  (± 0.0%) i/s -     54.000  in   5.011619s
5 thread Numeric.subclasses
                          2.447k (± 3.8%) i/s -     12.375k in   5.065684s
5 thread Object.subclasses
                        151.642  (± 2.0%) i/s -    768.000  in   5.067208s
5 thread Custom.subclasses with 1000 singletons
                         29.373  (±13.6%) i/s -    144.000  in   5.051597s
10 thread Numeric.subclasses
                          1.719k (±12.9%) i/s -      8.624k in   5.151087s
10 thread Object.subclasses
                         96.859  (±36.1%) i/s -    437.000  in   5.107834s
10 thread Custom.subclasses with 1000 singletons
                         26.941  (±18.6%) i/s -    129.000  in   5.052812s
50 thread Numeric.subclasses
                        377.859  (±10.6%) i/s -      1.880k in   5.060021s
50 thread Object.subclasses
                        141.944  (±15.5%) i/s -    714.000  in   5.225651s
50 thread Custom.subclasses with 1000 singletons
                         12.148  (±32.9%) i/s -     57.000  in   5.060756s

The new impl is again far faster than 9.4.8.0 for all singleton-heavy Class#subclasses and faster for non-concurrent access to Numeric.subclasses and Object.subclasses. The new impl is slower under high concurrency, again likely due to the cost of read/write locking.

This commit introduces specialized collection types adapted for
subclass traversal:

* SubclassList and SubclassArray, extending ArrayList and
  RubyArray respectively but also implementing BiConsumer, so they
  can be passed as the lambda for traversal.
* InvalidatorList, extending ArrayList and implementing BiConsumer
  and Consumer so it can be passed in place of the lambda versions.

This reduces allocation for Class#subclasses to just the eventual
Ruby Array and does similar for the internal (and rarely-used)
RubyClass#subclasses(boolean) logic. Allocation for invalidator
gathering is reduced to the invalidator list. All cases now avoid
multiple levels of lambda allocation due to recursion and use of
forEach internal iteration.
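
A rough sketch of the idea, using a hypothetical `CollectingList` rather than the actual SubclassList: because the list itself implements `BiConsumer`, it can be handed straight to `Map.forEach` with no lambda or wrapper object allocated per traversal.

```java
import java.util.ArrayList;
import java.util.function.BiConsumer;

// Hypothetical stand-in for the specialized collections described above.
// The list is both the result container and the traversal callback.
class CollectingList<K> extends ArrayList<K> implements BiConsumer<K, Object> {
    private static final long serialVersionUID = 1L;

    // Called once per map entry by Map.forEach; only the key is collected.
    @Override
    public void accept(K key, Object value) {
        add(key);
    }
}
```

With this in place, `map.forEach(list)` fills `list` directly, and the same instance can be passed down a recursive walk so that the only allocation is the list itself.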

This makes a few changes to improve concurrent traversal of the
subclasses collections.

* Replace ReentrantReadWriteLock with StampedLock.
* Avoid CAS operations once the fields have been fully initialized.
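
For readers unfamiliar with `StampedLock`, a minimal sketch of guarding a map traversal with it follows. `StampedRegistry` is a hypothetical name, and the sketch uses only plain read and write locks; whether the patch also uses optimistic reads is not stated here.

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.locks.StampedLock;
import java.util.function.Consumer;

// Hypothetical sketch: a map traversal guarded by StampedLock.
class StampedRegistry<K> {
    private final Map<K, Boolean> map = new WeakHashMap<>();
    private final StampedLock lock = new StampedLock();

    void add(K key) {
        long stamp = lock.writeLock();
        try {
            map.put(key, Boolean.TRUE);
        } finally {
            lock.unlockWrite(stamp);
        }
    }

    // StampedLock is not reentrant: the action must not call add()
    // on this registry, or the thread will deadlock against itself.
    void forEachKey(Consumer<? super K> action) {
        long stamp = lock.readLock();
        try {
            map.keySet().forEach(action);
        } finally {
            lock.unlockRead(stamp);
        }
    }
}
```

The trade-off against `ReentrantReadWriteLock` is that each acquisition returns a stamp that must be passed back on release, and no thread-ownership tracking is performed.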

headius commented Jan 27, 2025

With all changes up to this point, here are the new numbers. We are faster than CRuby on every benchmark except Numeric.subclasses at 5 and 10 threads. Performance is comparable to the old ConcurrentWeakHashMap-based implementation but is slower to traverse under contention. Performance is slower than the (broken) linked list implementation but not by a significant degree.

1 thread Numeric.subclasses
                          1.845k (± 9.2%) i/s  (542.05 μs/i) -      9.275k in   5.084890s
1 thread Object.subclasses
                        108.973 (± 1.8%) i/s    (9.18 ms/i) -    550.000 in   5.049141s
1 thread Custom.subclasses with 1000 singletons
                          5.164k (±10.3%) i/s  (193.65 μs/i) -     25.750k in   5.091201s
5 thread Numeric.subclasses
                        546.254 (±19.4%) i/s    (1.83 ms/i) -      2.597k in   5.084985s
5 thread Object.subclasses
                         88.478 (±28.3%) i/s   (11.30 ms/i) -    408.000 in   5.026991s
5 thread Custom.subclasses with 1000 singletons
                          3.483k (± 4.7%) i/s  (287.11 μs/i) -     17.450k in   5.022823s
10 thread Numeric.subclasses
                        474.393 (± 9.3%) i/s    (2.11 ms/i) -      2.392k in   5.084416s
10 thread Object.subclasses
                        110.091 (± 5.5%) i/s    (9.08 ms/i) -    552.000 in   5.028974s
10 thread Custom.subclasses with 1000 singletons
                          1.948k (± 7.8%) i/s  (513.24 μs/i) -      9.690k in   5.020453s
50 thread Numeric.subclasses
                        436.501 (± 2.7%) i/s    (2.29 ms/i) -      2.200k in   5.044202s
50 thread Object.subclasses
                        105.087 (± 6.7%) i/s    (9.52 ms/i) -    530.000 in   5.071900s
50 thread Custom.subclasses with 1000 singletons
                        442.992 (± 6.8%) i/s    (2.26 ms/i) -      2.226k in   5.043026s

headius commented Jan 27, 2025

It is worth pointing out that the implementation at this point also reduces allocation of lambda instances for traversal for all subclass walking, including cache invalidation, method invalidator gathering, and the usual subclass list aggregation operations. This will improve performance of several core runtime operations.

headius marked this pull request as ready for review January 27, 2025 08:00
headius merged commit 6a89fee into jruby:master Jan 27, 2025
95 checks passed
headius deleted the subclasses_leak branch January 27, 2025 23:28
Successfully merging this pull request may close these issues.

Memory leak from ActiveRecord_Relation after upgrading from JRuby 9.4.9.0 to 9.4.10.0