Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: simplify _constructor/_from_mgr/_constructor_from_mgr/... #56681

Open
jbrockmendel opened this issue Dec 29, 2023 · 6 comments
Open

REF: simplify _constructor/_from_mgr/_constructor_from_mgr/... #56681

jbrockmendel opened this issue Dec 29, 2023 · 6 comments
Assignees
Labels
Needs Discussion Requires discussion from core team before further action Refactor Internal refactoring of code Subclassing Subclassing pandas objects

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Dec 29, 2023

ATM we have

  • _constructor
  • _constructor_sliced
  • _constructor_expanddim
  • _from_mgr
  • _sliced_from_mgr
  • _expanddim_from_mgr
  • _constructor_from_mgr
  • _constructor_sliced_from_mgr
  • _constructor_expanddim_from_mgr

I suspect that we can make do with fewer of these, or at least simplify them (e.g. make some @final, @classmethod, #51772, make downstream libraries never touch Managers, very slightly improve perf in the non-subclass cases). Some of these ideas are mutually exclusive:

  1. ATM _from_mgr is a classmethod, all the other _from_mgr methods are regular methods. The constructor_from_mgr methods rely on self._constructor*, so could not be made classmethods unless those constructors also became classmethods. Does anyone out there rely on _constructor not being a classmethod?

  2. _expanddim_from_mgr and _sliced_from_mgr are each only used once, in _constructor_expanddim_from_mgr and _constructor_sliced_from_mgr, respectively. Can we just inline these and remove the methods (maybe keep the less-verbose names)? If not, can we at least @final and @classmethod them?

  3. Can _from_mgr be made @final? If not can _constructor_from_mgr be inlined into it?

  4. The 3 listed expanddim methods are there just for Series.to_frame. We could get rid of all three and tell subclasses to just override to_frame. (Doing this would also make it easier to tell subclass authors to handle metadata propagation in to_frame xref _metadata items of subclassed pd.Series are not propagated into corresponding SubclassedDataFrame #32860)

  5. We can slightly improve perf for non-subclasses by changing if self._constructor is Series to if type(self) is Series at the cost of slightly hurting perf for subclasses.

  6. Most extreme: we can make it so that a) (most) subclasses don't need to override any from_mgr methods, b) subclasses __init__ don't need to know how to handle Managers, c) we don't need the constructorfrom_mgr methods and d) (most) subclass authors never touch Managers by writing e.g.

def _from_mgr(self, mgr, axes):
    df = DataFrame._existing_from_mgr_method(mgr, axes)
    if type(self) is DataFrame:
        return df
    return self._constructor(df)

i.e. assume that the subclass __init__ knows what to do when given a pandas object.

@jorisvandenbossche writes that this would inconvenience geopandas since it does inference when given a DataFrame but not when given a Manager. In cases like that geopandas would need to override _from_mgr (why "most" is needed in a) and d) above), but ATM they need to override _constructor_from_mgr, so that wouldn't be a net increase in Stuff They Need To Do.

This idea is roughly #53871, which was reverted by #54922.

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member Subclassing Subclassing pandas objects Refactor Internal refactoring of code Needs Discussion Requires discussion from core team before further action and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 29, 2023
@theOehrly
Copy link
Contributor

Just chiming in as a maintainer of a package that subclasses Pandas DataFrames and Series'. Apart from having many constructor functions and properties, at the moment it is also not very clear what a subclass can override, should override and must not override.

The documentation for subclassing (https://pandas.pydata.org/docs/development/extending.html#subclassing-pandas-data-structures) only mentions ._constructor, ._constructor_sliced and ._constructor_expanddim. I would argue that there's no way around reading the source code anyway, when subclassing. But from the comments there, it's still very difficult to figure out how everything works.

Understanding the concepts here is fairly difficult if one does not actively participate in core development. And this has become more difficult over the last releases with the addition of new constructor related methods. Therefore, I am in favour of simplifying this system. But additionally, it would then be great to have some more documentation on the purpose of each function or property. Even if it's just in the form of comments in the code.

@Deadlica
Copy link

take

@andrefred
Copy link

take

@Gustaf-Larsson
Copy link

take

@raeef96
Copy link

raeef96 commented Feb 22, 2024

take

@NoelMT NoelMT removed their assignment Mar 4, 2024
@miguelpgarcia
Copy link

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Discussion Requires discussion from core team before further action Refactor Internal refactoring of code Subclassing Subclassing pandas objects
Projects
None yet
Development

No branches or pull requests

8 participants