
CoW: Clear dead references every time we add a new one #55008

Merged
merged 2 commits into main from 54352 on Sep 15, 2023

Conversation

phofl
Member

@phofl phofl commented Sep 5, 2023

I'd like to try that one before going back to a set

@phofl phofl requested a review from WillAyd as a code owner September 5, 2023 12:37
@phofl phofl added this to the 2.1.1 milestone Sep 5, 2023
@wangwillian0
Contributor

wangwillian0 commented Sep 5, 2023

I wasn't sure whether it was better to comment on the past issue, the .copy() issue, or create a new one...

I think this may hurt performance more. Here is a quick benchmark:

import weakref, timeit

class Refs:
    def __init__(self, useList=True):
        self.useList = useList
        if self.useList:
            self._refs = []
        else:
            self._refs = weakref.WeakSet()

    def add_ref(self, val):
        if self.useList:
            self._refs = [ref for ref in self._refs if ref() is not None]
            self._refs.append(weakref.ref(val))
        else:
            self._refs.add(val)
        return len(self)
        
    def __len__(self):
        if self.useList:
            lastDead = 1 if len(self._refs) != 0 and self._refs[-1]() is None else 0
            return len(self._refs) - lastDead
        else:
            return len(self._refs)

for n in [0, 1, 10, 20, 100]:
    refList = Refs(useList=True)
    refSet = Refs(useList=False)

    dummies = [Refs() for _ in range(n)]
    for dummy in dummies:
        refList.add_ref(dummy)
        refSet.add_ref(dummy)

    def test(a):
        tmp = Refs()
        a.add_ref(tmp)


    t1 = timeit.timeit("test(refList)", globals=locals(), number=1_000_000)
    t2 = timeit.timeit("test(refSet)", globals=locals(), number=1_000_000)
    t3 = timeit.timeit("len(refList)", globals=locals(), number=1_000_000)
    t4 = timeit.timeit("len(refSet)", globals=locals(), number=1_000_000)

    print("Existing references: ", n)
    print("Insert times: List, Set, List / Set:", t1, t2, t1 / t2)
    print("len() times: List, Set, List / Set:", t3, t4, t3 / t4)
Existing references:  0
Insert times: List, Set, List / Set: 0.8709223259938881 0.8772897350136191 0.992741954264822
len() times: List, Set, List / Set: 0.21391283700359054 0.21754901600070298 0.9832857023949918
Existing references:  1
Insert times: List, Set, List / Set: 1.0056520479847677 0.8872274259920232 1.133477188061823
len() times: List, Set, List / Set: 0.21361352398525923 0.21544542300398462 0.9914971550883608
Existing references:  10
Insert times: List, Set, List / Set: 1.4736111829988658 0.9527864250121638 1.5466332688147324
len() times: List, Set, List / Set: 0.2140507549920585 0.21857467299560085 0.9793026431583252
Existing references:  20
Insert times: List, Set, List / Set: 1.7970401670027059 0.8618747400178108 2.08503635570706
len() times: List, Set, List / Set: 0.21597155602648854 0.2161130669992417 0.9993451993684692
Existing references:  100
Insert times: List, Set, List / Set: 5.558489482005825 0.8625980890064966 6.443892645771862
len() times: List, Set, List / Set: 0.21221240298473276 0.21446094301063567 0.9895153868376332

The list approach is now O(n) per insertion; these informal numbers show that it's much worse than WeakSet in every case.

Cleaning only every 2^16 insertions or so would help, but IMO it's not easy to choose the right value, and it would create corner cases where performance degrades much more than the predictable regression from using WeakSet().

Also, if you change your mind, I'd like to contribute code too, no matter the approach.

Update: there was a bug in the code; the list seems to be worse than WeakSet from n=1 onward, not n=15.

@lithomas1
Member

I'd like to try that one before going back to a set

Maybe we can use a regular set instead and use the finalize parameter of the weakref to delete it from the set?

Or is that basically just what WeakSet does?

@phofl
Member Author

phofl commented Sep 5, 2023

The Weakrefs aren’t hashable, so that won’t work

@wangwillian0
Contributor

wangwillian0 commented Sep 5, 2023

I think weakrefs are actually hashable:

import weakref
import timeit

class test:
    s = set()
    w = weakref.WeakSet()

x = test()
dummies = [test() for _ in range(10)]
for dummy in dummies:
    weakref.finalize(dummy, x.s.discard, weakref.ref(dummy))
    x.s.add(weakref.ref(dummy))
    x.w.add(dummy)

def test1():
    tmp = test()
    weakref.finalize(tmp, x.s.discard, weakref.ref(tmp))
    x.s.add(weakref.ref(tmp))

def test2():
    tmp = test()
    x.w.add(tmp)

print("set of weakref", timeit.timeit("test1()", globals=locals(), number=1_000_000))
print("WeakSet", timeit.timeit("test2()", globals=locals(), number=1_000_000))
set of weakref 1.3031023739895318
WeakSet 0.439116241002921

Not really formal benchmarks, but if you think about it, WeakSet would probably already be implemented as a plain set() of weakrefs with finalizers if that were faster.

Now, one thing that would need to change in order to use WeakSet instead of a list again is making Index (and maybe something else?) hashable.

@phofl
Member Author

phofl commented Sep 5, 2023

FWIW the benchmarks aren't really representative; it is very unlikely that any user has 100 different variables referencing DataFrames that point to the same data. I don't think there are many realistic cases where even 10 different objects are alive.

@wangwillian0
Contributor

Doesn't that only reinforce that WeakSet is better suited for this situation? It's faster in most cases (already ~10% faster at n=1) and doesn't require handcrafted solutions.

@phofl
Member Author

phofl commented Sep 5, 2023

The WeakSet slows us down when creating objects that are zero-copy right now; that was noticeable.

Can you show a pandas example where this is actually noticeable and not noise? Operations have quite some overhead. The list implementation is in the nanosecond range with 20 elements, which is very cheap. I am not convinced that this will have a noticeable impact.

I am happy to discuss other options if there is something where this would slow us down

@wangwillian0
Contributor

The WeakSet slows us down when creating objects that are zero-copy right now; that was noticeable.

Just to understand better, do you mean operations like DataFrame._from_arrays as in the original PR?

@phofl
Member Author

phofl commented Sep 5, 2023

yes

@wangwillian0
Contributor

wangwillian0 commented Sep 7, 2023

import numpy as np
import pandas as pd

class Indexing:
    def setup(self):
        N = 10**5
        self.index = pd.CategoricalIndex(range(N), range(N))
        self.series = pd.Series(range(N), index=self.index).sort_index()
        self.category = self.index[500]
        
        window = 60*60
        tstart = 12 * 60 * 60 
        tend = 18 * 60 * 60

        # +samples to self.index._references.referenced_blocks
        # +samples to self.series._mgr.blocks[0].refs.referenced_blocks
        samples = 500
        self.dummies = [ self.series[t:t+window] for t in np.random.choice(np.arange(tstart, tend), samples, replace=False) ]
...
       before           after         ratio
     [52cacb3d]       [ad697c32]
     <weakset-refs>       <list-refs>
+     1.46±0.06μs       14.4±0.5μs     9.88  categoricals.Indexing.time_shallow_copy
+     1.73±0.06μs       14.2±0.3μs     8.21  categoricals.Indexing.time_unique
+        49.7±3μs         64.0±3μs     1.29  categoricals.Indexing.time_reindex
+         103±6μs          131±4μs     1.27  categoricals.Indexing.time_intersection
import pandas as pd
import timeit

def test(n=1_000):
    v = []
    index = pd.core.indexes.base.Index(['x'])
    for _ in range(n):
        v.append(index.view())

print("n=1_000", timeit.timeit("test(1_000)", globals=locals(), number=1))
print("n=10_000", timeit.timeit("test(10_000)", globals=locals(), number=1))
print("n=20_000", timeit.timeit("test(20_000)", globals=locals(), number=1))
print("n=30_000", timeit.timeit("test(30_000)", globals=locals(), number=1))
[set]:
n=1_000 0.0018350909813307226
n=10_000 0.017116239992901683
n=20_000 0.03409070096677169
n=30_000 0.050424575980287045
[lists]:
n=1_000 0.014263276010751724
n=10_000 1.259376735019032
n=20_000 5.0642046090215445
n=30_000 11.611086241027806

I think the easiest way to hit the worst case unintentionally is when you have tons of slices. I'm still not sure whether these cases are common enough, though.

@wangwillian0
Contributor

wangwillian0 commented Sep 11, 2023

I think there are a few options here:

  1. The current commit. The worst case is when there are lots of referenced views: O(len(refs)) operations per insertion and per has_reference() call.
  2. Use _clear_dead_references as a weakref callback (a sketch of this appears after the list). The worst case is similar to the previous one but harder to hit: (len(refs) - k + 1) * k operations when k references become dead at once (say an array of k views is deleted). Insertion and has_reference become O(1). The runtime increases by 20% compared to option 1 on frame_ctor.FromArrays.time_frame_from_arrays_int because of more constructor code in __init__.
  3. Use a WeakSet instead of a list. All operations are O(1) on average. The runtime increases by 40% compared to option 1 on frame_ctor.FromArrays.time_frame_from_arrays_int (constructing a WeakSet() is about 40x slower than constructing a simple list). This solution also requires implementing __hash__ on core.indexes.base.Index (I did the tests with the id as the hash, which seems to be enough).
  4. Do some hybrid approach using "variable caches" and a WeakSet. From what I understand, a small increase in add and has_reference() is very acceptable, but the constructor time needs to decrease as much as possible. I think we can get an even faster approach than using a list if we make the ref structure private and expose some other methods, e.g. pushing the WeakSet constructor overhead to the first insertion instead of the BlockValuesRefs constructor.
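
For illustration, a minimal sketch of what option 2 could look like, with a weakref callback pruning the list as soon as a referent dies (hypothetical names; this is not the actual BlockValuesRefs code):

import weakref

class CallbackPrunedRefs:
    # Sketch of option 2: the weakref callback removes dead entries.
    def __init__(self):
        self.referenced_blocks = []

    def add_reference(self, blk):
        # When blk dies, the callback receives the (now dead) weakref and
        # removes it from the list, so the list only ever holds live refs.
        # Each death costs a linear scan, which is the (len(refs) - k + 1) * k
        # worst case mentioned above when k views die at once.
        ref = weakref.ref(blk, self.referenced_blocks.remove)
        self.referenced_blocks.append(ref)

    def has_reference(self):
        # No cleanup pass needed anymore; "> 1" assumes the owning block
        # itself is also tracked in the list.
        return len(self.referenced_blocks) > 1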

Let me know if I'm overthinking 😅
I like performance issues and this is my first time contributing, so I'm open to recommendations of other issues that might be more important :)

@WillAyd
Member

WillAyd commented Sep 11, 2023

If I am reading the benchmarks correctly the difference is 10us with 30,000 views? If so, I also agree that the theory here should take a backseat to what's likely to happen in practice.

@phofl
Member Author

phofl commented Sep 11, 2023

Yeah you are overthinking. I don't care much if someone creates 10k views of a single object. What I wanted initially is a real world pandas use-case where this matters, not a while loop where you call the same method over and over again.

@wangwillian0
Contributor

If I am reading the benchmarks correctly the difference is 10us with 30,000 views?

It's actually 10 seconds.

Yeah you are overthinking. I don't care much if someone creates 10k views of a single object. What I wanted initially is a real world pandas use-case where this matters, not a while loop where you call the same method over and over again.

Perfect.

@WillAyd
Member

WillAyd commented Sep 11, 2023

Ah OK. 10 seconds is definitely worse than 10 us, but I don't think we should hold this up for that. Of course performance can always be improved in follow-ups, but getting the right functionality first is most important.

I like the attention to detail @wangwillian0, and there are probably other places in the code base where we could take a critical look like that; I just don't think this particular case is the most urgent one to optimize.

@phofl
Member Author

phofl commented Sep 11, 2023

Just for completeness: I might be wrong here as well. We will switch if this shows up negatively in our benchmarks in any way.

@phofl phofl merged commit 7134f2c into pandas-dev:main Sep 15, 2023
42 checks passed
@phofl phofl deleted the 54352 branch September 15, 2023 09:10
@lumberbot-app

lumberbot-app bot commented Sep 15, 2023

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

  1. Check out the backport branch and update it.
git checkout 2.1.x
git pull
  2. Cherry-pick the first parent branch of this PR on top of the older branch:
git cherry-pick -x -m1 7134f2c14e614980bcf366f979a0f85aafacbde6
  3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
git commit -am 'Backport PR #55008: CoW: Clear dead references every time we add a new one'
  4. Push to a named branch:
git push YOURFORK 2.1.x:auto-backport-of-pr-55008-on-2.1.x
  5. Create a PR against branch 2.1.x; I would have named this PR:

"Backport PR #55008 on branch 2.1.x (CoW: Clear dead references every time we add a new one)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

lithomas1 pushed a commit that referenced this pull request Sep 20, 2023
Backport PR #55008 on branch 2.1.x (CoW: Clear dead references every time we add a new one) (#55220)

CoW: Clear dead references every time we add a new one (#55008)

(cherry picked from commit 7134f2c)
@lithomas1
Member

@phofl
I just happened to take a look at the benchmarks and noticed this PR seems to be causing several regressions (at a quick glance, in the copy method).

Can you take a look if you have the time?

I don't have a list but if you go to
https://asv-runner.github.io/asv-collection/pandas/#regressions?sort=3&dir=desc
and ctrl+f 7134f2c (the merged commit), you'll find all of them.

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 14, 2023

And in addition to our benchmarks, in the meantime we also got several reports (#55256, #55245, apache/arrow#38260), so there were actual use cases hitting this.

The last two of those linked issues are essentially cases with repeated column access in wide DataFrames (typically getting each column as a Series object). A small illustration of this:

import numpy as np
import pandas as pd

N = 10_000
df = pd.DataFrame(np.random.normal(size=(50, N)))

result = list(df.items())

This has become slower in 2.1.1 compared to 2.1.0:

In [3]: arr = np.random.normal(size=(50, N))

In [4]: %%timeit df = pd.DataFrame(arr)
   ...: result = list(df.items())

4.33 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- 2.1.0
1.13 s ± 235 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    <-- 2.1.1

(I need to recreate the dataframe in the timeit setup, otherwise the item_cache would get used from the second run, and when using that, there is no speed difference)

For further illustration, my understanding is that the reason we originally did this was to ensure the list of weakrefs gets cleaned up regularly, because if it keeps growing, it can start to consume quite some memory. For this example (and with accessing each column only once):

>>> res = list(df.items())
>>> referenced_blocks = df._mgr.blocks[0].refs.referenced_blocks
>>> len(referenced_blocks)
10001

>>> import sys
>>> (sys.getsizeof(referenced_blocks) + sys.getsizeof(referenced_blocks[0]) * len(referenced_blocks)) / 1024**2
0.8442459106445312

the weakrefs use almost one MB (which is of course not that much, but if this is done repeatedly without cleaning up, it can grow).

Something that also worsens the problem is that it's not enough for those column objects to go out of scope, because they are cached in the parent object:

>>> df._mgr.blocks[0].refs._clear_dead_references()
>>> len(df._mgr.blocks[0].refs.referenced_blocks)
10001
>>> del res
>>> df._mgr.blocks[0].refs._clear_dead_references()
>>> len(df._mgr.blocks[0].refs.referenced_blocks)
10001
>>> df._clear_item_cache()
>>> df._mgr.blocks[0].refs._clear_dead_references()
>>> len(df._mgr.blocks[0].refs.referenced_blocks)
1

So in the specific case here, even if the application that iterates over the columns doesn't keep a reference to those columns, the problem still occurs (the weakref list grows, and thus column access slows down when we try to clear dead refs every time).

@phofl
Member Author

phofl commented Oct 14, 2023

Yeah, I'll look into this today and see if exponential backoff helps with these problems (hopefully)

@jorisvandenbossche
Member

Thinking through the possible solutions (including the ones listed by @wangwillian0 above):

  • The main issue with the current solution is that calling _clear_dead_references on every insertion can become expensive. So one relatively small change would be to only call it every xx times (we can add a counter variable on the object, increase that counter in the insertion methods, and only call _clear_dead_references there when the counter is above some threshold). A rough guess is that a threshold of e.g. 1000 should be sufficient to reduce the performance regressions we see while keeping the memory increase at an acceptable level.
    This solution is maybe more of a temporary workaround, I feel, as it doesn't address the fact that has_reference() still calls _clear_dead_references every time, potentially unnecessarily.
  • Use weakref.WeakSet (as we did before PERF: Fix performance regression due to CoW ref tracking #51390). The main problems are:
    • This is more expensive for creating a new owning Block (not a viewing block), i.e. creating the BlockValuesRefs object, because creating the WeakSet is more expensive than creating a list of one weakref (weakref.WeakSet([blk]) vs [weakref.ref(blk)] gives around 600ns vs 60ns)
    • This is no longer possible with the latest implementation, because we now also track Index objects in addition to Blocks, and Index is not hashable (a weakref is hashable when the object it is referencing is hashable, and a set requires the items to be hashable). However, I don't think we can just make Index hashable, since this is a public object, and it is mutable (not the values, but eg the name attribute is mutable).
  • Use the weakref callback mechanism to clear the dead references (or at least, to keep track of when we need to do this). It was suggested above to let this callback actually remove the ref from the list, but another alternative is to keep track of the weakrefs that need to be removed and only remove them when needed (when adding a new one, or when calling has_reference).
    This is essentially what WeakSet does, in my understanding. WeakSet is actually a rather small and readable Python implementation (https://github.com/python/cpython/blob/596589104fe5a4d90cb145b2cc69b71cc9aa9f07/Lib/_weakrefset.py#L36-L203), so one option would be to start from this code and adapt it to our needs: we could trim it down to just the bits we need (essentially just construct, add, and check the length), and that way we could maybe also tweak it to work with the Index id to overcome the hashability constraint (I didn't think this through, though); a sketch of that idea follows below. When customizing this, we might also be able to cut down the construction time (just creating an empty set and adding one weakref (s = set(); s.add(weakref.ref(blk))) only takes around 160ns), reducing the performance issue we initially saw with using WeakSet.
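
To make that last idea concrete, here is a rough sketch of what such a trimmed-down, id-keyed structure could look like (hypothetical names; corner cases such as callback timing are not handled, so this is an illustration rather than a worked-out implementation):

import weakref

class IdKeyedRefs:
    # Stripped-down WeakSet-like container keyed on id(), so the tracked
    # objects (e.g. Index) do not themselves have to be hashable.
    def __init__(self, obj=None):
        self._refs = {}  # id(obj) -> weakref.ref(obj)
        if obj is not None:
            self.add(obj)

    def add(self, obj):
        key = id(obj)
        # The callback drops the entry as soon as the referent dies, so
        # __len__ stays O(1) and the dict cannot grow without bound.
        self._refs[key] = weakref.ref(obj, lambda _, key=key: self._refs.pop(key, None))

    def __len__(self):
        return len(self._refs)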

@wangwillian0 I didn't understand everything you wrote in the overview above at #55008 (comment), so if I am missing something, let me know


Without trying to code anything yet, my thought is that for 2.1.2, let's try to do the "quick" fix of only calling _clear_dead_references once every x times (whether it's with a simple linear counter or with an exponential backoff as Patrick just mentioned).
But longer term, I think we should investigate the idea of using a custom WeakSet implementation.
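
For concreteness, a minimal sketch of that "quick" fix with a simple linear counter (hypothetical names and an arbitrary threshold; the real BlockValuesRefs has more to it than shown here):

import weakref

class CountedRefs:
    CLEAR_EVERY = 1_000  # rough threshold, per the guess above

    def __init__(self):
        self.referenced_blocks = []
        self._insertions_since_clear = 0

    def _clear_dead_references(self, force=False):
        # Pay the O(n) scan only every CLEAR_EVERY insertions, or when a
        # caller needs an exact answer (force=True).
        if force or self._insertions_since_clear >= self.CLEAR_EVERY:
            self.referenced_blocks = [
                ref for ref in self.referenced_blocks if ref() is not None
            ]
            self._insertions_since_clear = 0

    def add_reference(self, blk):
        self._insertions_since_clear += 1
        self._clear_dead_references()
        self.referenced_blocks.append(weakref.ref(blk))

    def has_reference(self):
        # Still clears on every call, which is the caveat noted above.
        self._clear_dead_references(force=True)
        return len(self.referenced_blocks) > 1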

@wangwillian0
Contributor

wangwillian0 commented Oct 16, 2023

Hi, sorry for the delay

However, I don't think we can just make Index hashable, since this is a public object, and it is mutable (not the values, but eg the name attribute is mutable).

I mentioned the possibility of making Index hashable because I thought it was immutable. Should a new issue be created? Or maybe at least a rephrasing of the documentation?

Use the weakref callback mechanism to clear the dead references (or at least, to keep track of when we need to do this).

I created a PR with an alternative solution based on the WeakSet code at #55539:

  • Uses the weakref.ref callback to count the number of dead references present in the referenced_blocks list.
  • Takes 20% more time in the constructor because of the function declaration!
    • The function declaration could be moved into add_reference and add_index_reference, but that just moves the performance hit to those methods.
  • Everything is O(1) now, including has_reference (_clear_dead_references is amortized O(1)).

The 20% comes just from declaring an extra function in the constructor, so I don't think there are many elegant alternatives. A rough sketch of the idea follows below.
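
Roughly, the idea looks like this (a sketch only, with hypothetical names; not the actual #55539 diff):

import weakref

class DeadCountingRefs:
    def __init__(self):
        self.referenced_blocks = []
        self._dead = 0

        # The extra function created in the constructor; this is where the
        # ~20% constructor overhead mentioned above comes from.
        def on_death(_ref):
            self._dead += 1

        self._on_death = on_death

    def add_reference(self, blk):
        # Compact only when at least half of the list is dead, which
        # amortizes the cleanup to O(1) per insertion.
        if self._dead * 2 > len(self.referenced_blocks):
            self.referenced_blocks = [
                ref for ref in self.referenced_blocks if ref() is not None
            ]
            self._dead = 0
        self.referenced_blocks.append(weakref.ref(blk, self._on_death))

    def has_reference(self):
        # The live count is available without scanning the list.
        return len(self.referenced_blocks) - self._dead > 1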

@jorisvandenbossche
Member

I mentioned the possibility of making Index hashable because I thought it was immutable. Should a new issue be created? Or maybe at least a rephrasing of the documentation?

I am not entirely sure rephrasing will help there (without making it needlessly complex for most users, or maybe putting it in a note further down): the sequence itself is immutable, in the sense that you can't change values with setitem; it's only certain attributes that are settable.


Successfully merging this pull request may close these issues.

PERF: pd.DataFrame.copy() leaks