
CoW: Clear dead references every time we add a new one #55008

Merged
merged 2 commits into main from 54352 on Sep 15, 2023

Conversation

phofl
Member

@phofl phofl commented Sep 5, 2023

I'd like to try that one before going back to a set

@phofl phofl requested a review from WillAyd as a code owner September 5, 2023 12:37
@phofl phofl added this to the 2.1.1 milestone Sep 5, 2023
@wangwillian0
Contributor

wangwillian0 commented Sep 5, 2023

I wasn't sure whether it was better to comment on the past issue, the .copy() issue, or create a new one...

I think this may hurt performance more. Here is a quick benchmark:

import weakref, timeit

class Refs:
    def __init__(self, useList=True):
        self.useList = useList
        if self.useList:
            self._refs = []
        else:
            self._refs = weakref.WeakSet()

    def add_ref(self, val):
        if self.useList:
            self._refs = [ref for ref in self._refs if ref() is not None]
            self._refs.append(weakref.ref(val))
        else:
            self._refs.add(val)
        return len(self)
        
    def __len__(self):
        if self.useList:
            lastDead = 1 if len(self._refs) != 0 and self._refs[-1]() is None else 0
            return len(self._refs) - lastDead
        else:
            return len(self._refs)

for n in [0, 1, 10, 20, 100]:
    refList = Refs(useList=True)
    refSet = Refs(useList=False)

    dummies = [Refs() for _ in range(n)]
    for dummy in dummies:
        refList.add_ref(dummy)
        refSet.add_ref(dummy)

    def test(a):
        tmp = Refs()
        a.add_ref(tmp)


    t1 = timeit.timeit("test(refList)", globals=locals(), number=1_000_000)
    t2 = timeit.timeit("test(refSet)", globals=locals(), number=1_000_000)
    t3 = timeit.timeit("len(refList)", globals=locals(), number=1_000_000)
    t4 = timeit.timeit("len(refSet)", globals=locals(), number=1_000_000)

    print("Existing references: ", n)
    print("Insert times: List, Set, List / Set:", t1, t2, t1 / t2)
    print("len() times: List, Set, List / Set:", t3, t4, t3 / t4)
Existing references:  0
Insert times: List, Set, List / Set: 0.8709223259938881 0.8772897350136191 0.992741954264822
len() times: List, Set, List / Set: 0.21391283700359054 0.21754901600070298 0.9832857023949918
Existing references:  1
Insert times: List, Set, List / Set: 1.0056520479847677 0.8872274259920232 1.133477188061823
len() times: List, Set, List / Set: 0.21361352398525923 0.21544542300398462 0.9914971550883608
Existing references:  10
Insert times: List, Set, List / Set: 1.4736111829988658 0.9527864250121638 1.5466332688147324
len() times: List, Set, List / Set: 0.2140507549920585 0.21857467299560085 0.9793026431583252
Existing references:  20
Insert times: List, Set, List / Set: 1.7970401670027059 0.8618747400178108 2.08503635570706
len() times: List, Set, List / Set: 0.21597155602648854 0.2161130669992417 0.9993451993684692
Existing references:  100
Insert times: List, Set, List / Set: 5.558489482005825 0.8625980890064966 6.443892645771862
len() times: List, Set, List / Set: 0.21221240298473276 0.21446094301063567 0.9895153868376332

The list approach is now O(n) per insertion; these informal numbers show that it's much worse than WeakSet in every case.

Cleaning only every 2^16 insertions or so would help, but IMO it's not easy to choose the right value, and it would create corner cases where performance degrades much more than the predictable regression from using WeakSet().

Also, if you change your mind, I'd like to contribute code too, no matter the approach.

Update: there was a bug in the code; the list seems to be worse than WeakSet from n=1 onward, not n=15.

@lithomas1
Member

I'd like to try that one before going back to a set

Maybe we can use a regular set instead and use the finalize parameter of the weakref to delete it from the set?

Or is that basically just what WeakSet does?

@phofl
Member Author

phofl commented Sep 5, 2023

The Weakrefs aren’t hashable, so that won’t work

@wangwillian0
Contributor

wangwillian0 commented Sep 5, 2023

I think weakrefs are actually hashable:

import weakref
import timeit

class test:
    s = set()
    w = weakref.WeakSet()

x = test()
dummies = [test() for _ in range(10)]
for dummy in dummies:
    weakref.finalize(dummy, x.s.discard, weakref.ref(dummy))
    x.s.add(weakref.ref(dummy))
    x.w.add(dummy)

def test1():
    tmp = test()
    weakref.finalize(tmp, x.s.discard, weakref.ref(tmp))
    x.s.add(weakref.ref(tmp))

def test2():
    tmp = test()
    x.w.add(tmp)

print("set of weakref", timeit.timeit("test1()", globals=locals(), number=1_000_000))
print("WeakSet", timeit.timeit("test2()", globals=locals(), number=1_000_000))
set of weakref 1.3031023739895318
WeakSet 0.439116241002921

Not really formal benchmarks, but if you think about it, WeakSet would probably already be implemented as a plain set() of weakrefs with finalizers if that were faster.

Now, one thing that would need to change in order to use WeakSet instead of a list again is making Index (and maybe something else?) hashable.

@phofl
Member Author

phofl commented Sep 5, 2023

FWIW the benchmarks aren't really representative; it is very unlikely that any user has 100 different variables referencing DataFrames that point to the same data. I don't think there are many realistic cases where even 10 different objects are alive.

@wangwillian0
Contributor

Doesn't that only reinforce that WeakSet is better suited for this situation? It's faster in most cases (already ~10% faster at n=1) and doesn't require handcrafted solutions.

@phofl
Member Author

phofl commented Sep 5, 2023

The WeakSet slows us down when creating objects that are zero-copy right now; that was noticeable.

Can you show a pandas example where this is actually noticeable and not noise? Operations have quite some overhead. The list implementation is in the nanosecond range with 20 elements, which is very cheap. I am not convinced that this will have a noticeable impact.

I am happy to discuss other options if there is something where this would slow us down

@wangwillian0
Contributor

The WeakSet slows us down when creating objects that are zero-copy right now; that was noticeable.

Just to understand better, do you mean operations like DataFrame._from_arrays as in the original PR?

@phofl
Member Author

phofl commented Sep 5, 2023

yes

@wangwillian0
Contributor

wangwillian0 commented Sep 7, 2023

import numpy as np
import pandas as pd

class Indexing:
    def setup(self):
        N = 10**5
        self.index = pd.CategoricalIndex(range(N), range(N))
        self.series = pd.Series(range(N), index=self.index).sort_index()
        self.category = self.index[500]
        
        window = 60*60
        tstart = 12 * 60 * 60 
        tend = 18 * 60 * 60

        # +samples to self.index._references.referenced_blocks
        # +samples to self.series._mgr.blocks[0].refs.referenced_blocks
        samples = 500
        self.dummies = [ self.series[t:t+window] for t in np.random.choice(np.arange(tstart, tend), samples, replace=False) ]
...
       before           after         ratio
     [52cacb3d]       [ad697c32]
     <weakset-refs>       <list-refs>
+     1.46±0.06μs       14.4±0.5μs     9.88  categoricals.Indexing.time_shallow_copy
+     1.73±0.06μs       14.2±0.3μs     8.21  categoricals.Indexing.time_unique
+        49.7±3μs         64.0±3μs     1.29  categoricals.Indexing.time_reindex
+         103±6μs          131±4μs     1.27  categoricals.Indexing.time_intersection
import pandas as pd
import timeit

def test(n=1_000):
    v = []
    index = pd.core.indexes.base.Index(['x'])
    for _ in range(n):
        v.append(index.view())

print("n=1_000", timeit.timeit("test(1_000)", globals=locals(), number=1))
print("n=10_000", timeit.timeit("test(10_000)", globals=locals(), number=1))
print("n=20_000", timeit.timeit("test(20_000)", globals=locals(), number=1))
print("n=30_000", timeit.timeit("test(30_000)", globals=locals(), number=1))
[set]:
n=1_000 0.0018350909813307226
n=10_000 0.017116239992901683
n=20_000 0.03409070096677169
n=30_000 0.050424575980287045
[lists]:
n=1_000 0.014263276010751724
n=10_000 1.259376735019032
n=20_000 5.0642046090215445
n=30_000 11.611086241027806

I think the easiest way to hit the worst case unintentionally is when you have tons of slices. I'm still not sure whether these cases are common enough, though.

@wangwillian0
Contributor

wangwillian0 commented Sep 11, 2023

I think there are a few options here:

  1. The current commit. The worst case is when there are lots of referenced views: O(len(refs)) operations per insertion and per has_reference() call.
  2. Use _clear_dead_references as a weakref callback (a sketch of this appears after the list). The worst case is similar to the previous one but harder to hit: (len(refs) - k + 1) * k operations when k references become dead at once (say an array of k views is deleted). Insertion and has_reference become O(1). The runtime increases by 20% compared to option 1 on frame_ctor.FromArrays.time_frame_from_arrays_int because of more constructor code in __init__.
  3. Use a WeakSet instead of a list. All operations are O(1) on average. The runtime increases by 40% compared to option 1 on frame_ctor.FromArrays.time_frame_from_arrays_int (constructing a WeakSet() is about 40x slower than constructing a simple list). This solution also requires implementing __hash__ on core.indexes.base.Index (I did the tests with the id as the hash, which seems to be enough).
  4. Do some hybrid approach using "variable caches" and a WeakSet. From what I understand, a small increase in add and has_reference() is very acceptable, but the constructor time needs to decrease as much as possible. I think we can get an even faster approach than using a list if we make the ref structure private and expose some other methods, e.g. pushing the WeakSet constructor overhead to the first insertion instead of the BlockValuesRefs constructor.
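
For illustration, a minimal sketch of what option 2 could look like, with a weakref callback pruning the list as soon as a referent dies (hypothetical names; this is not the actual BlockValuesRefs code):

import weakref

class CallbackPrunedRefs:
    # Sketch of option 2: the weakref callback removes dead entries.
    def __init__(self):
        self.referenced_blocks = []

    def add_reference(self, blk):
        # When blk dies, the callback receives the (now dead) weakref and
        # removes it from the list, so the list only ever holds live refs.
        # Each death costs a linear scan, which is the (len(refs) - k + 1) * k
        # worst case mentioned above when k views die at once.
        ref = weakref.ref(blk, self.referenced_blocks.remove)
        self.referenced_blocks.append(ref)

    def has_reference(self):
        # No cleanup pass needed anymore; "> 1" assumes the owning block
        # itself is also tracked in the list.
        return len(self.referenced_blocks) > 1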

Let me know if I'm overthinking 😅
I like performance issues and this is my first time contributing, so I'm open to recommendations of other issues that might be more important :)

@WillAyd
Member

WillAyd commented Sep 11, 2023

If I am reading the benchmarks correctly the difference is 10us with 30,000 views? If so, I also agree that the theory here should take a backseat to what's likely to happen in practice.

@phofl
Member Author

phofl commented Sep 11, 2023

Yeah you are overthinking. I don't care much if someone creates 10k views of a single object. What I wanted initially is a real world pandas use-case where this matters, not a while loop where you call the same method over and over again.

@wangwillian0
Contributor

If I am reading the benchmarks correctly the difference is 10us with 30,000 views?

It's actually 10 seconds.

Yeah you are overthinking. I don't care much if someone creates 10k views of a single object. What I wanted initially is a real world pandas use-case where this matters, not a while loop where you call the same method over and over again.

Perfect.

@WillAyd
Member

WillAyd commented Sep 11, 2023

Ah OK. 10 seconds is definitely worse than 10 us, but I don't think we should hold this up for that. Of course performance can always be improved in follow-ups, but getting the right functionality first is most important.

I like the attention to detail @wangwillian0, and there are probably other places in the code base where we could take a critical look like that; I just don't think this particular case is the most urgent one to optimize.

@phofl
Member Author

phofl commented Sep 11, 2023

Just for completeness: I might be wrong here as well. We will switch if this shows up negatively in our benchmarks in any way.

@phofl phofl merged commit 7134f2c into pandas-dev:main Sep 15, 2023
42 checks passed
@phofl phofl deleted the 54352 branch September 15, 2023 09:10
@lumberbot-app

lumberbot-app bot commented Sep 15, 2023

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it.
git checkout 2.1.x
git pull
  2. Cherry-pick the first parent branch of this PR on top of the older branch:
git cherry-pick -x -m1 7134f2c14e614980bcf366f979a0f85aafacbde6
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
git commit -am 'Backport PR #55008: CoW: Clear dead references every time we add a new one'
  4. Push to a named branch:
git push YOURFORK 2.1.x:auto-backport-of-pr-55008-on-2.1.x
  5. Create a PR against branch 2.1.x; I would have named this PR:

"Backport PR #55008 on branch 2.1.x (CoW: Clear dead references every time we add a new one)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

lithomas1 pushed a commit that referenced this pull request Sep 20, 2023
Backport PR #55008 on branch 2.1.x (CoW: Clear dead references every time we add a new one) (#55220)

CoW: Clear dead references every time we add a new one (#55008)

(cherry picked from commit 7134f2c)
@lithomas1
Member

@phofl
I just happened to take a look at the benchmarks and noticed this PR seems to be causing several regressions (at a quick glance, in the copy method).

Can you take a look if you have the time?

I don't have a list but if you go to
https://asv-runner.github.io/asv-collection/pandas/#regressions?sort=3&dir=desc
and ctrl+f 7134f2c (the merged commit), you'll find all of them.

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 14, 2023

And in addition to our benchmarks, in the meantime we also got several reports (#55256, #55245, apache/arrow#38260), so there were actual use cases hitting this.

The last two of those linked issues are essentially cases with repeated column access in wide DataFrames (typically getting each column as a Series object). A small illustration of this:

import numpy as np
import pandas as pd

N = 10_000
df = pd.DataFrame(np.random.normal(size=(50, N)))

result = list(df.items())

This has become slower in 2.1.1 compared to 2.1.0:

In [3]: arr = np.random.normal(size=(50, N))

In [4]: %%timeit df = pd.DataFrame(arr)
   ...: result = list(df.items())

4.33 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- 2.1.0
1.13 s ± 235 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    <-- 2.1.1

(I need to recreate the dataframe in the timeit setup, otherwise the item_cache would get used from the second run, and when using that, there is no speed difference)

For further illustration, my understanding is that the reason we originally did this was to ensure the list of weakrefs gets cleaned up regularly, because if it keeps growing, it can start to consume quite some memory. For this example (and with accessing each column only once):

>>> res = list(df.items())
>>> referenced_blocks = df._mgr.blocks[0].refs.referenced_blocks
>>> len(referenced_blocks)
10001

>>> import sys
>>> (sys.getsizeof(referenced_blocks) + sys.getsizeof(referenced_blocks[0]) * len(referenced_blocks)) / 1024**2
0.8442459106445312

the weakrefs use almost one MB (which is of course not that much, but if this is done repeatedly without cleaning up, it can grow).

Something that also worsens the problem is that it's not enough for those column objects to go out of scope, because they are cached in the parent object:

>>> df._mgr.blocks[0].refs._clear_dead_references()
>>> len(df._mgr.blocks[0].refs.referenced_blocks)
10001
>>> del res
>>> df._mgr.blocks[0].refs._clear_dead_references()
>>> len(df._mgr.blocks[0].refs.referenced_blocks)
10001
>>> df._clear_item_cache()
>>> df._mgr.blocks[0].refs._clear_dead_references()
>>> len(df._mgr.blocks[0].refs.referenced_blocks)
1

So in the specific case here, even if the application that iterates over the columns doesn't keep a reference to those columns, the problem still occurs (the weakref list grows, and thus column access slows down when we try to clear dead refs every time).

@phofl
Member Author

phofl commented Oct 14, 2023

Yeah, I'll look into this today and see if exponential backoff helps with these problems (hopefully)

@jorisvandenbossche
Member

Thinking through the possible solutions (including the ones listed by @wangwillian0 above):

  • The main issue with the current solution is that calling _clear_dead_references on every insertion can become expensive. So one relatively small change would be to only call it every xx times (we can add a counter variable on the object, increase that counter in the insertion methods, and only call _clear_dead_references there when the counter is above some threshold). A rough guess is that a threshold of e.g. 1000 should be sufficient to reduce the performance regressions we see while keeping the memory increase at an acceptable level.
    This solution is maybe more of a temporary workaround, I feel, as it doesn't address the fact that has_reference() still calls _clear_dead_references every time, potentially unnecessarily.
  • Use weakref.WeakSet (as we did before PERF: Fix performance regression due to CoW ref tracking #51390). The main problems are:
    • This is more expensive for creating a new owning Block (not a viewing block), i.e. creating the BlockValuesRefs object, because creating the WeakSet is more expensive than creating a list of one weakref (weakref.WeakSet([blk]) vs [weakref.ref(blk)] gives around 600ns vs 60ns)
    • This is no longer possible with the latest implementation, because we now also track Index objects in addition to Blocks, and Index is not hashable (a weakref is hashable when the object it is referencing is hashable, and a set requires the items to be hashable). However, I don't think we can just make Index hashable, since this is a public object, and it is mutable (not the values, but eg the name attribute is mutable).
  • Use the weakref callback mechanism to clear the dead references (or at least, to keep track of when we need to do this). It was suggested above to let this callback actually remove the ref from the list, but another alternative is to keep track of the weakrefs that need to be removed and only remove them when needed (when adding a new one, or when calling has_reference).
    This is essentially what WeakSet does, in my understanding. WeakSet is actually a rather small and readable Python implementation (https://github.com/python/cpython/blob/596589104fe5a4d90cb145b2cc69b71cc9aa9f07/Lib/_weakrefset.py#L36-L203), so one option would be to start from this code and adapt it to our needs: we could trim it down to just the bits we need (essentially just construct, add, and check the length), and that way we could maybe also tweak it to work with the Index id to overcome the hashability constraint (I didn't think this through, though); a sketch of that idea follows below. When customizing this, we might also be able to cut down the construction time (just creating an empty set and adding one weakref (s = set(); s.add(weakref.ref(blk))) only takes around 160ns), reducing the performance issue we initially saw with using WeakSet.
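
To make that last idea concrete, here is a rough sketch of what such a trimmed-down, id-keyed structure could look like (hypothetical names; corner cases such as callback timing are not handled, so this is an illustration rather than a worked-out implementation):

import weakref

class IdKeyedRefs:
    # Stripped-down WeakSet-like container keyed on id(), so the tracked
    # objects (e.g. Index) do not themselves have to be hashable.
    def __init__(self, obj=None):
        self._refs = {}  # id(obj) -> weakref.ref(obj)
        if obj is not None:
            self.add(obj)

    def add(self, obj):
        key = id(obj)
        # The callback drops the entry as soon as the referent dies, so
        # __len__ stays O(1) and the dict cannot grow without bound.
        self._refs[key] = weakref.ref(obj, lambda _, key=key: self._refs.pop(key, None))

    def __len__(self):
        return len(self._refs)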

@wangwillian0 I didn't understand everything you wrote in the overview above at #55008 (comment), so if I am missing something, let me know


Without trying to code anything yet, my thought is that for 2.1.2, let's try to do the "quick" fix of only calling _clear_dead_references once every x times (whether it's with a simple linear counter or with an exponential backoff as Patrick just mentioned).
But longer term, I think we should investigate the idea of using a custom WeakSet implementation.
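
For concreteness, a minimal sketch of that "quick" fix with a simple linear counter (hypothetical names and an arbitrary threshold; the real BlockValuesRefs has more to it than shown here):

import weakref

class CountedRefs:
    CLEAR_EVERY = 1_000  # rough threshold, per the guess above

    def __init__(self):
        self.referenced_blocks = []
        self._insertions_since_clear = 0

    def _clear_dead_references(self, force=False):
        # Pay the O(n) scan only every CLEAR_EVERY insertions, or when a
        # caller needs an exact answer (force=True).
        if force or self._insertions_since_clear >= self.CLEAR_EVERY:
            self.referenced_blocks = [
                ref for ref in self.referenced_blocks if ref() is not None
            ]
            self._insertions_since_clear = 0

    def add_reference(self, blk):
        self._insertions_since_clear += 1
        self._clear_dead_references()
        self.referenced_blocks.append(weakref.ref(blk))

    def has_reference(self):
        # Still clears on every call, which is the caveat noted above.
        self._clear_dead_references(force=True)
        return len(self.referenced_blocks) > 1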

@wangwillian0
Contributor

wangwillian0 commented Oct 16, 2023

Hi, sorry for the delay

However, I don't think we can just make Index hashable, since this is a public object, and it is mutable (not the values, but eg the name attribute is mutable).

I mentioned the possibility of making Index hashable because I thought it was immutable. Should a new issue be created? Or maybe at least a rephrasing of the documentation?

Use the weakref callback mechanism to clear the dead references (or at least, to keep track of when we need to do this).

I created a PR with an alternative solution based on the WeakSet code at #55539:

  • Uses the weakref.ref callback to count the number of dead references present in the referenced_blocks list.
  • Takes 20% more time in the constructor because of the function declaration!
    • The function declaration could be moved into add_reference and add_index_reference, but that just moves the performance hit to those methods.
  • Everything is O(1) now, including has_reference (_clear_dead_references is amortized O(1)).

The 20% comes just from declaring an extra function in the constructor, so I don't think there are many elegant alternatives. A rough sketch of the idea follows below.
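
Roughly, the idea looks like this (a sketch only, with hypothetical names; not the actual #55539 diff):

import weakref

class DeadCountingRefs:
    def __init__(self):
        self.referenced_blocks = []
        self._dead = 0

        # The extra function created in the constructor; this is where the
        # ~20% constructor overhead mentioned above comes from.
        def on_death(_ref):
            self._dead += 1

        self._on_death = on_death

    def add_reference(self, blk):
        # Compact only when at least half of the list is dead, which
        # amortizes the cleanup to O(1) per insertion.
        if self._dead * 2 > len(self.referenced_blocks):
            self.referenced_blocks = [
                ref for ref in self.referenced_blocks if ref() is not None
            ]
            self._dead = 0
        self.referenced_blocks.append(weakref.ref(blk, self._on_death))

    def has_reference(self):
        # The live count is available without scanning the list.
        return len(self.referenced_blocks) - self._dead > 1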

@jorisvandenbossche
Member

I mentioned the possibility of making Index hashable because I thought it was immutable. Should a new issue be created? Or maybe at least a rephrasing of the documentation?

I am not entirely sure rephrasing will help there (without making it needlessly complex for most users, or maybe putting it in a note further down): the sequence itself is immutable, in the sense that you can't change values with setitem; it's only certain attributes that are settable.


Successfully merging this pull request may close these issues.

PERF: pd.DataFrame.copy() leaks