
CoW: __array__ not recognizing ea dtypes #51966

Merged: phofl merged 15 commits into pandas-dev:main from cow_array_eas on Apr 2, 2023

Conversation

phofl (Member) commented on Mar 14, 2023:

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

phofl added this to the 2.0 milestone on Mar 14, 2023
arr.flags.writeable = False
if arr is values and using_copy_on_write() and self._mgr.is_single_block:
    # Check if self._values coerced data
    if not is_1d_only_ea_dtype(self.dtypes.iloc[0]):
jorisvandenbossche (Member):

Is it possible to not have a first column? And do we really never have any block in the case of a DataFrame with zero columns?

jorisvandenbossche (Member):

Also, does is_1d_only_ea_dtype guarantee it will always have coerced the data? I suppose a general EA can give you a view when converting it to a numpy array (for .values it only needs a reshape to 2D, but that's a view)

jorisvandenbossche (Member):

For example:

In [65]: df = pd.DataFrame({'a': pd.array(['a', 'b'], dtype="string")})

In [66]: np.shares_memory(df.values, df['a'].array._ndarray)
Out[66]: True

phofl (Member Author):

Ah, thanks very much. I was too focused on integers, where we currently coerce to object, which causes a copy. With dtypes where converting to object does not cause a copy, this of course still shares memory.

Adjusted the check
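
For illustration, a rough sketch of the two cases mentioned above (assuming pandas 2.0 nullable dtypes; _data and _ndarray are private attributes, used here only to inspect memory sharing):

import numpy as np
import pandas as pd

# Nullable integers are coerced to object when converting to a NumPy array,
# which copies, so the result shares no memory with the original column.
df_int = pd.DataFrame({"a": pd.array([1, 2], dtype="Int64")})
print(np.shares_memory(np.asarray(df_int), df_int["a"].array._data))     # False

# The string dtype is backed by an object ndarray, so the conversion does not
# need to copy and the result still shares memory with the column.
df_str = pd.DataFrame({"a": pd.array(["a", "b"], dtype="string")})
print(np.shares_memory(np.asarray(df_str), df_str["a"].array._ndarray))  # True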

phofl (Member Author):

If we have zero columns, then we have an empty array, correct? We can only modify an empty array by enlarging it, which would cause a copy anyway? Am I missing something?

jorisvandenbossche (Member):

Yes, but I was worried that doing something like df_empty.to_numpy() could run into an IndexError (from self.dtypes.iloc[0] if self.dtypes has length 0).
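
For reference, a minimal illustration of that concern (a sketch, assuming a plain zero-column frame):

import pandas as pd

df_empty = pd.DataFrame(index=range(3))  # zero columns, so .dtypes has length 0
df_empty.dtypes.iloc[0]  # raises IndexError: single positional indexer is out-of-bounds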

phofl (Member Author):

Ah, I misunderstood you yesterday then. By empty you mean no columns, otherwise iloc would work, correct? In this case the is_single_block check fails and hence we don't get there. I'll add a test for this though, because we could easily cause a regression if we are not careful.
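
A sketch of the kind of regression test meant here (hypothetical, not the literal test added in the PR):

import numpy as np
import pandas as pd

def test_asarray_empty_dataframe_does_not_raise():
    # A frame with zero columns has no single block, so the dtype check
    # should never be reached and no IndexError should surface.
    with pd.option_context("mode.copy_on_write", True):
        df = pd.DataFrame(index=range(3))
        arr = np.asarray(df)
        assert arr.shape == (3, 0)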

jorisvandenbossche (Member):

> In this case the is_single_block check fails and hence we don't get there

Yes, but I was also wondering whether it would be possible to have a zero-column DataFrame with one block (so whether it's possible to have a block with an axis of size 0).

It seems to be theoretically possible, by constructing this manually:

In [82]: block = pd.core.internals.make_block(np.zeros((0, 10)), np.array([]))

In [86]: mgr = pd.core.internals.BlockManager([block], [pd.Index([]), pd.Index(range(10))])

In [88]: df = pd.DataFrame(mgr)

In [89]: df
Out[89]: 
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [90]: df._mgr.is_single_block
Out[90]: True

But I don't know if there is any way you could get that through actual pandas operations.

phofl (Member Author):

I don't think so. It would be a bug if you could do this, right?

if arr is values and using_copy_on_write() and self._mgr.is_single_block:
    # Check if self._values coerced data
    if not is_1d_only_ea_dtype(self.dtypes.iloc[0]) or not is_numeric_dtype(
        self.dtypes.iloc[0]
jorisvandenbossche (Member):

Why the or not is_numeric_dtype check?

Can you expand the comment a bit more to explain those checks? (from just reading the code, I find it hard to reason about, also with the "not .. or not ..")

jorisvandenbossche (Member):

Something else: could we also use astype_is_view here (like is done below)?

phofl (Member Author) commented on Mar 15, 2023:

We have 2 different cases here:

  • NumPy dtypes, which are caught by the is_1d_only_ea_dtype check.
  • EA dtypes: they get coerced to object by self._values above, so we only have to catch the cases where a coercion to object does not trigger a copy, e.g. all numeric dtypes were copied already.

astype_is_view would work when we get rid of the conversion to object in the middle. I'll clarify the comment though.
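
To make the intent of the "not .. or not .." condition easier to follow, an annotated restatement of the check above (the comments are explanatory additions, not the literal PR diff):

dtype = self.dtypes.iloc[0]
if not is_1d_only_ea_dtype(dtype) or not is_numeric_dtype(dtype):
    # Case 1: plain NumPy dtype -> np.asarray() returned a view of the block
    #         values, so the result must be marked read-only.
    # Case 2: non-numeric EA dtype (e.g. string) -> the coercion to object in
    #         self._values did not copy, so memory can still be shared.
    # Numeric EA dtypes fall through: coercing them to object already copied.
    arr = arr.view()
    arr.flags.writeable = False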

jorisvandenbossche (Member):

So for the EA dtypes, you rely on the assumption that numeric dtypes were cast to object. But in theory, someone could implement their own EA which is indicated to be "numeric" but does not do this conversion to object dtype.

Now, in general I think we are lacking a part of the story around having proper information about copies/views with generic EAs (that was also the case when doing the astype_is_view)

jorisvandenbossche (Member):

That's actually what our numeric Arrow dtypes do, but since those rely on pyarrow to convert themselves to numpy arrays, the results are already set to read-only in case they are a view:

In [91]: arr = pd.array([1, 2], dtype=pd.ArrowDtype(pa.int64()))

In [92]: np_arr = np.asarray(arr)

In [93]: np_arr
Out[93]: array([1, 2])

In [94]: np_arr[0] = 100
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [94], in <cell line: 1>()
----> 1 np_arr[0] = 100

ValueError: assignment destination is read-only

phofl (Member Author):

Fair point, there is actually a more elegant solution. We can check whether both steps can be done without copying via astype_is_view.
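
A rough sketch of what checking both steps via astype_is_view could look like (the import path, helper name and call sites are assumptions, not the exact code that ended up in the PR):

import numpy as np
from pandas._config import using_copy_on_write
from pandas.core.dtypes.astype import astype_is_view  # private pandas helper

def _maybe_freeze_result(df, values, arr):
    # Hypothetical helper: only mark the result read-only when both conversion
    # steps (column dtype -> values.dtype and values.dtype -> arr.dtype) can be
    # views, i.e. when np.asarray() cannot have copied the data.
    if (
        using_copy_on_write()
        and df._mgr.is_single_block
        and astype_is_view(df.dtypes.iloc[0], values.dtype)
        and astype_is_view(values.dtype, arr.dtype)
    ):
        arr = arr.view()
        arr.flags.writeable = False
    return arr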

phofl (Member Author):

Yeah, I ran into this as well (that they set themselves to read-only).

phofl (Member Author) commented on Mar 30, 2023:

@jorisvandenbossche this would be nice to get into 2.0

jorisvandenbossche (Member) left a comment:

Sorry for the slow reply here, looks good to me, I just found one more failure case ;)

# TODO(CoW) also properly handle extension dtypes
arr = arr.view()
arr.flags.writeable = False
if arr is values and using_copy_on_write() and self._mgr.is_single_block:
jorisvandenbossche (Member):

The arr is values check might prevent catching some EA cases, since it seems that _values can still return an EA (so the EA -> ndarray conversion only happens in arr = np.asarray(values, ...)).

Example (running with this PR):

In [1]: pd.options.mode.copy_on_write = True

In [2]: df = pd.DataFrame({"a": pd.date_range("2012-01-01", periods=3)})

In [3]: arr = np.asarray(df)

In [4]: arr.flags.writeable
Out[4]: True

In [5]: arr[0] = 0

In [6]: df
Out[6]: 
           a
0 1970-01-01
1 2012-01-02
2 2012-01-03

For Series you left out this check, so maybe that can be done here as well?

phofl (Member Author):

good point, thx


arr = np.asarray(df)
if using_copy_on_write:
    # TODO(CoW): This should be True
jorisvandenbossche (Member):

Not this one, because without specifying dtype="int64" we create an object dtype array?

phofl (Member Author):

Yes exactly, this triggers a copy and hence the array should be writeable?
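
An illustration of that case (assuming a nullable Int64 column, as the discussion suggests; behaviour with copy-on-write enabled in pandas 2.0):

import numpy as np
import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"a": pd.array([1, 2], dtype="Int64")})
arr = np.asarray(df)        # no dtype="int64" requested
print(arr.dtype)            # object: the nullable integers were boxed
print(arr.flags.writeable)  # True: the conversion copied, so nothing to protect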

phofl (Member Author) commented on Apr 2, 2023:

Merging to get into 2.0

phofl merged commit 09593b2 into pandas-dev:main on Apr 2, 2023
33 checks passed
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Apr 2, 2023
phofl deleted the cow_array_eas branch on April 2, 2023 at 14:11
phofl added a commit that referenced this pull request Apr 2, 2023
… dtypes) (#52358)

Backport PR #51966: CoW: __array__ not recognizing ea dtypes

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
topper-123 pushed a commit to topper-123/pandas that referenced this pull request Apr 6, 2023