Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: Restore _constructor_from_mgr to pass manager object to constructor #54922

Merged

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Sep 1, 2023

See #53871 (comment) for context. This reverts the refactoring changes of the _constructor_.._from_mgr methods of #53871 to the state of the original PR #52132 that added those new methods.

This does keep the other changes of that PR (i.e. using _constructor_from_mgr in more places). @jbrockmendel mentioned he got some AttributeErrors related to _name not yet being defined, but at the moment the tests seem to pass anyway.

Closes #55120

@jorisvandenbossche jorisvandenbossche added this to the 2.1.1 milestone Sep 1, 2023
@jorisvandenbossche jorisvandenbossche added the Subclassing Subclassing pandas objects label Sep 1, 2023
@jbrockmendel
Copy link
Member

The motivation is that geopandas does something involving inference on object dtypes differently if it gets a Manager object? Can you elaborate on that?

@lithomas1 lithomas1 removed this from the 2.1.1 milestone Sep 19, 2023
@lithomas1
Copy link
Member

Bumping off the milestone, assuming the release happens tomorrow.

@jorisvandenbossche jorisvandenbossche added this to the 2.1.2 milestone Oct 13, 2023
@lithomas1 lithomas1 removed this from the 2.1.2 milestone Oct 24, 2023
@lithomas1
Copy link
Member

CI is failing, bumping off the release.

@jorisvandenbossche
Copy link
Member Author

@lithomas1 if you bump something (at least for regressions), can you please bump it to the next bug-fix milestone (although I assume in general for PRs that's not a problem, as long as you bump issues to the next milestone instead of removing the milestone. So I should have opened an issue for this one ;))

@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Oct 25, 2023
@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Oct 25, 2023

The motivation is that geopandas does something involving inference on object dtypes differently if it gets a Manager object? Can you elaborate on that?

@jbrockmendel yes, whenever we get a manager, we don't do any inference, while if we get object dtype array, we do some inference (and then potentially fallback to DataFrame instead of GeoDataFrame in case there are no geometries). See also my geopandas PR related to this: geopandas/geopandas#3046

But that's geopandas specific (and something we can handle by overriding the constructor from mgr), while this PR actually addresses a more general issue: with your change in #53871, creating a Series or DataFrame object with self._from_mgr and passing that to _constructor, the consequence of this is that we are now passing a non-fully initialized object to the subclass constructor. And there is no guarantee that the subclass constructor works with such partial objects.

Essentially, this is the same as #55607, where Series._from_mgr currently returns an non-fully initialized Series object (there is an attribute missing, so some methods don't work). Similarly for subclasses, when this is created with _from_mgr, it didn't go through the subclass' __init__ function, and so the subclass object might not be fully correctly be set up.

I think, when passing (subclassed) DataFrame/Series objects to the public constructor (__init__), one should always be able to assume that this are valid, fully-initialized objects.

I added a generic test that demonstrates this (test_constructor_with_metadata in frame/test_subclass.py). This test works on 2.0, but fails on 2.1.1.

@jorisvandenbossche
Copy link
Member Author

I am adding a bunch of filterwarnings here, while of course we should fix those cases properly. But I decided to go with the filtering here, because firstly those warnings don't yet exist on 2.1.x and so properly fixing them isn't need to backport. And secondly, we then also have a bit more time to figure out the best way we want to have subclasses deal with this (eg Brock's comment at https://github.com/geopandas/geopandas/pull/3046/files#r1371974580 might mean we want to tweak a bit the way that subclasses interact with this). I am also not sure whether we actually want to let all subclasses deal with this, instead of rather doing it as a breaking change in eg 3.0 (or let subclasses handle a mgr in their main constructor)

@jorisvandenbossche
Copy link
Member Author

To illustrate the issue with our own Series:

In [4]: s = pd.Series([1, 2, 3], name="test")

In [5]: pd.Series(pd.Series._from_mgr(s._mgr, s._mgr.axes))
...
AttributeError: 'Series' object has no attribute '_name'

So our Series __init__ doesn't work with partially-initialized Series object. We solve that on our side by adding s._name inside Series._constructor_from_mgr, but subclasses can also have this kind of attributes that have to be added to be able to work properly. And so those subclasses would need to override _from_mgr or _constructor_from_mgr to do that (and that's something we need to document, and decide how to move towards that, instead of already breaking that in 2.1.x)

@jbrockmendel
Copy link
Member

jbrockmendel commented Oct 25, 2023

no objection to getting this in for 2.1.x, can figure out a longer-term plan later

Longer-term, I think the "._from_mgr result is not fully-initialized" issue can be addressed by 1) pinning _name in Series._from_mgr, 2) telling subclasses not to use _from_mgr directly, and 3) replacing usages of self._from_mgr with hard-coded Series/DataFrame._from_mgr.

@jorisvandenbossche jorisvandenbossche merged commit f67d7d6 into pandas-dev:main Oct 26, 2023
36 of 39 checks passed
@lumberbot-app
Copy link

lumberbot-app bot commented Oct 26, 2023

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.1.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 f67d7d66a403c984f5f3d6aebb2a1e684d45711a
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #54922: REGR: Restore _constructor_from_mgr to pass manager object to constructor'
  1. Push to a named branch:
git push YOURFORK 2.1.x:auto-backport-of-pr-54922-on-2.1.x
  1. Create a PR against branch 2.1.x, I would have named this PR:

"Backport PR #54922 on branch 2.1.x (REGR: Restore _constructor_from_mgr to pass manager object to constructor)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@jorisvandenbossche
Copy link
Member Author

Manual backport -> #55704

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Regression Functionality that used to work in a prior pandas version Subclassing Subclassing pandas objects
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Infinite recursion on inheritance and overloading __getattr__ in v2.1.0
3 participants