
CoW: __array__ not recognizing ea dtypes #51966

Merged: phofl merged 15 commits into pandas-dev:main from cow_array_eas on Apr 2, 2023

Conversation

phofl (Member) commented on Mar 14, 2023:

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

phofl added this to the 2.0 milestone on Mar 14, 2023
arr.flags.writeable = False
if arr is values and using_copy_on_write() and self._mgr.is_single_block:
    # Check if self._values coerced data
    if not is_1d_only_ea_dtype(self.dtypes.iloc[0]):
jorisvandenbossche (Member):

Is it possible to not have a first column? And do we really never have any block in the case of a DataFrame with zero columns?

jorisvandenbossche (Member):

Also, does is_1d_only_ea_dtype guarantee it will always have coerced the data? I suppose a general EA can give you a view when converting it to a numpy array (for .values it only needs a reshape to 2D, but that's a view)

jorisvandenbossche (Member):

For example:

In [65]: df = pd.DataFrame({'a': pd.array(['a', 'b'], dtype="string")})

In [66]: np.shares_memory(df.values, df['a'].array._ndarray)
Out[66]: True

phofl (Member Author):

Ah, thanks very much. I was too focused on integers, where we currently coerce to object, which causes a copy. With dtypes where converting to object does not cause a copy, this of course still shares memory.

Adjusted the check
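
For illustration, a rough sketch of the two cases mentioned above (assuming pandas 2.0 nullable dtypes; _data and _ndarray are private attributes, used here only to inspect memory sharing):

import numpy as np
import pandas as pd

# Nullable integers are coerced to object when converting to a NumPy array,
# which copies, so the result shares no memory with the original column.
df_int = pd.DataFrame({"a": pd.array([1, 2], dtype="Int64")})
print(np.shares_memory(np.asarray(df_int), df_int["a"].array._data))     # False

# The string dtype is backed by an object ndarray, so the conversion does not
# need to copy and the result still shares memory with the column.
df_str = pd.DataFrame({"a": pd.array(["a", "b"], dtype="string")})
print(np.shares_memory(np.asarray(df_str), df_str["a"].array._ndarray))  # True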

phofl (Member Author):

If we have zero columns, then we have an empty array, correct? We can only modify an empty array by enlarging it, which would cause a copy anyway? Am I missing something?

jorisvandenbossche (Member):

Yes, but I was worried that doing something like df_empty.to_numpy() could run into an IndexError (from self.dtypes.iloc[0] if self.dtypes has length 0).
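
For reference, a minimal illustration of that concern (a sketch, assuming a plain zero-column frame):

import pandas as pd

df_empty = pd.DataFrame(index=range(3))  # zero columns, so .dtypes has length 0
df_empty.dtypes.iloc[0]  # raises IndexError: single positional indexer is out-of-bounds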

phofl (Member Author):

Ah, I misunderstood you yesterday then. By empty you mean no columns, otherwise iloc would work, correct? In this case the is_single_block check fails and hence we don't get there. I'll add a test for this though, because we could easily cause a regression if we are not careful.
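
A sketch of the kind of regression test meant here (hypothetical, not the literal test added in the PR):

import numpy as np
import pandas as pd

def test_asarray_empty_dataframe_does_not_raise():
    # A frame with zero columns has no single block, so the dtype check
    # should never be reached and no IndexError should surface.
    with pd.option_context("mode.copy_on_write", True):
        df = pd.DataFrame(index=range(3))
        arr = np.asarray(df)
        assert arr.shape == (3, 0)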

jorisvandenbossche (Member):

> In this case the is_single_block check fails and hence we don't get there

Yes, but I was also wondering whether it would be possible to have a zero-column DataFrame with one block (so whether it's possible to have a block with an axis of size 0).

It seems to be theoretically possible, by constructing this manually:

In [82]: block = pd.core.internals.make_block(np.zeros((0, 10)), np.array([]))

In [86]: mgr = pd.core.internals.BlockManager([block], [pd.Index([]), pd.Index(range(10))])

In [88]: df = pd.DataFrame(mgr)

In [89]: df
Out[89]: 
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [90]: df._mgr.is_single_block
Out[90]: True

But I don't know if there is any way you could get that through actual pandas operations.

phofl (Member Author):

I don't think so. It would be a bug if you could do this, right?

if arr is values and using_copy_on_write() and self._mgr.is_single_block:
    # Check if self._values coerced data
    if not is_1d_only_ea_dtype(self.dtypes.iloc[0]) or not is_numeric_dtype(
        self.dtypes.iloc[0]
jorisvandenbossche (Member):

Why the or not is_numeric_dtype check?

Can you expand the comment a bit more to explain those checks? (from just reading the code, I find it hard to reason about, also with the "not .. or not ..")

jorisvandenbossche (Member):

Something else: could we also use astype_is_view here (like is done below)?

phofl (Member Author) commented on Mar 15, 2023:

We have 2 different cases here:

  • NumPy dtypes, which are caught by the is_1d_only_ea_dtype check.
  • EA dtypes: they get coerced to object by self._values above, so we only have to catch the cases where a coercion to object does not trigger a copy, e.g. all numeric dtypes were copied already.

astype_is_view would work when we get rid of the conversion to object in the middle. I'll clarify the comment though.
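
To make the intent of the "not .. or not .." condition easier to follow, an annotated restatement of the check above (the comments are explanatory additions, not the literal PR diff):

dtype = self.dtypes.iloc[0]
if not is_1d_only_ea_dtype(dtype) or not is_numeric_dtype(dtype):
    # Case 1: plain NumPy dtype -> np.asarray() returned a view of the block
    #         values, so the result must be marked read-only.
    # Case 2: non-numeric EA dtype (e.g. string) -> the coercion to object in
    #         self._values did not copy, so memory can still be shared.
    # Numeric EA dtypes fall through: coercing them to object already copied.
    arr = arr.view()
    arr.flags.writeable = False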

jorisvandenbossche (Member):

So for the EA dtypes, you rely on the assumption that numeric dtypes were cast to object. But in theory, someone could implement their own EA which is indicated to be "numeric" but does not do this conversion to object dtype.

Now, in general I think we are lacking a part of the story around having proper information about copies/views with generic EAs (that was also the case when doing the astype_is_view)

jorisvandenbossche (Member):

That's actually what our numeric Arrow dtypes do, but since those rely on pyarrow to convert themselves to numpy arrays, the results are already set to read-only in case they are a view:

In [91]: arr = pd.array([1, 2], dtype=pd.ArrowDtype(pa.int64()))

In [92]: np_arr = np.asarray(arr)

In [93]: np_arr
Out[93]: array([1, 2])

In [94]: np_arr[0] = 100
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [94], in <cell line: 1>()
----> 1 np_arr[0] = 100

ValueError: assignment destination is read-only

phofl (Member Author):

Fair point, there is actually a more elegant solution. We can check whether both steps can be done without copying via astype_is_view.
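
A rough sketch of what checking both steps via astype_is_view could look like (the import path, helper name and call sites are assumptions, not the exact code that ended up in the PR):

import numpy as np
from pandas._config import using_copy_on_write
from pandas.core.dtypes.astype import astype_is_view  # private pandas helper

def _maybe_freeze_result(df, values, arr):
    # Hypothetical helper: only mark the result read-only when both conversion
    # steps (column dtype -> values.dtype and values.dtype -> arr.dtype) can be
    # views, i.e. when np.asarray() cannot have copied the data.
    if (
        using_copy_on_write()
        and df._mgr.is_single_block
        and astype_is_view(df.dtypes.iloc[0], values.dtype)
        and astype_is_view(values.dtype, arr.dtype)
    ):
        arr = arr.view()
        arr.flags.writeable = False
    return arr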

phofl (Member Author):

Yeah, I ran into this as well (that they set themselves to read-only).

phofl (Member Author) commented on Mar 30, 2023:

@jorisvandenbossche this would be nice to get into 2.0

jorisvandenbossche (Member) left a comment:

Sorry for the slow reply here, looks good to me, I just found one more failure case ;)

# TODO(CoW) also properly handle extension dtypes
arr = arr.view()
arr.flags.writeable = False
if arr is values and using_copy_on_write() and self._mgr.is_single_block:
jorisvandenbossche (Member):

The arr is values check might prevent catching some EA cases, since it seems that _values can still return an EA (so the EA -> ndarray conversion only happens in arr = np.asarray(values, ...)).

Example (running with this PR):

In [1]: pd.options.mode.copy_on_write = True

In [2]: df = pd.DataFrame({"a": pd.date_range("2012-01-01", periods=3)})

In [3]: arr = np.asarray(df)

In [4]: arr.flags.writeable
Out[4]: True

In [5]: arr[0] = 0

In [6]: df
Out[6]: 
           a
0 1970-01-01
1 2012-01-02
2 2012-01-03

For Series you left out this check, so maybe that can be done here as well?

phofl (Member Author):

good point, thx


arr = np.asarray(df)
if using_copy_on_write:
    # TODO(CoW): This should be True
jorisvandenbossche (Member):

Not this one, because without specifying dtype="int64" we create an object dtype array?

phofl (Member Author):

Yes exactly, this triggers a copy and hence the array should be writeable?
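
An illustration of that case (assuming a nullable Int64 column, as the discussion suggests; behaviour with copy-on-write enabled in pandas 2.0):

import numpy as np
import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": pd.array([1, 2], dtype="Int64")})
arr = np.asarray(df)        # no dtype="int64" requested
print(arr.dtype)            # object: the nullable integers were boxed
print(arr.flags.writeable)  # True: the conversion copied, so nothing to protect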

phofl (Member Author) commented on Apr 2, 2023:

Merging to get into 2.0

phofl merged commit 09593b2 into pandas-dev:main on Apr 2, 2023
33 checks passed
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Apr 2, 2023
phofl deleted the cow_array_eas branch on April 2, 2023 at 14:11
phofl added a commit that referenced this pull request Apr 2, 2023
… dtypes) (#52358)

Backport PR #51966: CoW: __array__ not recognizing ea dtypes

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
topper-123 pushed a commit to topper-123/pandas that referenced this pull request Apr 6, 2023