Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Infinite recursion on inheritance and overloading __getattr__ in v2.1.0 #55120

Closed
2 of 3 tasks
Tracked by #5
JoschD opened this issue Sep 13, 2023 · 2 comments · Fixed by #54922
Closed
2 of 3 tasks
Tracked by #5

BUG: Infinite recursion on inheritance and overloading __getattr__ in v2.1.0 #55120

JoschD opened this issue Sep 13, 2023 · 2 comments · Fixed by #54922
Labels
Regression Functionality that used to work in a prior pandas version Subclassing Subclassing pandas objects
Milestone

Comments

@JoschD
Copy link

JoschD commented Sep 13, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from contextlib import suppress
import pandas as pd

class TfsDataFrame(pd.DataFrame):
    """
    Extend DataFrame to add headers, which should be accessed in case 
    the name is not found in the DataFrame itself, but also 
    can be accessed manually by the `headers` attribute.
    """

    _metadata = ["headers"]

    def __init__(self, *args, **kwargs):
        self.headers = {}
        with suppress(IndexError, AttributeError):
            self.headers = args[0].headers
        self.headers = kwargs.pop("headers", self.headers)
        super().__init__(*args, **kwargs)

    def __getattr__(self, name: str) -> object:
        try:
            return super().__getattr__(name)
        except AttributeError:
            pass
        
        try:
            return self.headers[name]
        except KeyError:
            raise AttributeError(f"{name} is neither in the DataFrame nor in headers.")

    @property
    def _constructor(self):
        return TfsDataFrame


if __name__ == "__main__":
    df = TfsDataFrame(columns=["A", "B", "C"])
    print(df[["A", "B"]])

Issue Description

This is a reduced example of a DataFrame extension, that we have been using for years now and it worked fine until version 2.0.3 (including), but broke with 2.1.0.

From 2.1.0 the code above runs into an infinite recursion loop, as df[["A", "B"]] creates a new TfsDataFrame (due to the _constructor).
In its __init__ the args[0] is then already a TfsDataFrame object butargs[0].headers for some reason calls __getattr__() as does self.headers[name] in the __getattr__() function, causing the infinite recursion.

This was not the case in the past, as .headers should be finding the attribute headers "directly", without even going to __getattr__().

Interestingly,

df = TfsDataFrame(columns=["A", "B", "C"])
print(df.headers)

df = TfsDataFrame(columns=["A", "B", "C"], headers={"a":1})
print(df.headers["a"])

are indeed working as intended.

This is the most direct recursion loop I could find, but actually with 2.1.0 we have experienced extended recursion loops, which can create infinite runtime.

Expected Behavior

Should not end up in an infinite recursion.
As is the case e.g. in pandas version 2.0.3.

Thank you very much for your help and your effort maintaining pandas.
Its amazing set of features have made our work with measurement data much more comfortable and streamlined,
not to mention the highly increased readability of our code compared to anno BP (before pandas).

Installed Versions

INSTALLED VERSIONS

commit : ba1cccd
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.14.0-162.18.1.el9_1.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Tue Feb 28 03:54:43 EST 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0
numpy : 1.25.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 59.5.0
pip : 21.3.1
Cython : 0.29.35
pytest : 7.3.1
hypothesis : 6.75.8
sphinx : 6.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.2
sqlalchemy : None
tables : 3.8.0
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.3.1
pyqt5 : None

@JoschD JoschD added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 13, 2023
@jorisvandenbossche jorisvandenbossche added Regression Functionality that used to work in a prior pandas version Subclassing Subclassing pandas objects and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 14, 2023
@jorisvandenbossche jorisvandenbossche added this to the 2.1.2 milestone Oct 14, 2023
@jorisvandenbossche
Copy link
Member

@JoschD thanks a lot for the report! (and sorry for the slow response; I actually had looked at it earlier because I had a related PR (#54922) that might also fix this, but didn't get around to posting here)

So your analysis is mostly correct I think. What changed in pandas (#53871) is that we no longer pass a Manager object to the subclass constructor (which you would pass to super().__init__ to create the dataframe), but that we initialize the DataFrame object from the manager, and pass that to the constructor.
However, this DataFrame construction happens through a DataFrame._from_mgr method, which isn't overriden by the subclass, and thus the resulting subclass DataFrame object hasn't been properly initialized.

As a result of that, your TfsDataFrame.__init__ gets passed a TfsDataFrame instance that is not fully initialized, i.e. doesn't yet have the headers attribute. At that point, when trying to access that attribute with args[0].headers, it will go into __getattr__ with "headers" key because of the attribute not existing (for actual attributes, it doesn't call __getattr__). But then inside your __getattr__ implementation, you again call self.headers, which calls __getattr__ because the attribute doesn't exist, giving an infinite loop ...

I do have a PR that tries to fix this regression (#54922), but long term, you will also have to make some changes to your subclass implementation:

  • One option is to define headers = None (or {}) as an attribute on the class level, so that whenever an instance of TfsDataFrame is created, it always has this attribute already. That will avoid the recursion because of the attribute not existing.
    While in theory there shouldn't be a need for this (normally, it's totally fine to define instance attributes in __init__), I think it should probably also be harmless to do and probably the easiest workaround you can already implement on your side.
  • If you want to have custom logic for initializing a TfsDataFrame from a manager object, like ensuring it has custom attributes, you can implement _constructor_from_mgr in addition to _constructor, and set up the attribute there. This is the more general option allowing you to customize this step, but so for a simple attribute, the above option might be easier.

I tested that both options work with your example above, and they would look like:

class TfsDataFrame(pd.DataFrame):

    _metadata = ["headers"]
    headers = None    # <-- added line

    def __init__(self, *args, **kwargs):
    ....

or

class TfsDataFrame(pd.DataFrame):

    _metadata = ["headers"]

    ....

    @property
    def _constructor(self):
        return TfsDataFrame
        
    def _constructor_from_mgr(self, mgr, axes):
        obj = self._from_mgr(mgr, axes)
        obj.headers = {}
        return obj

@JoschD
Copy link
Author

JoschD commented Oct 30, 2023

Hello @jorisvandenbossche ,
thanks a lot for looking into this and especially for your extensive answer.
We have decided to go with the "proper" solution and overriding the _constructor_from_mgr function and it works like a charm, even in our actual repository!

I hope you have a great day, may all your issues have easy-to-implement solutions like this :)

Cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Regression Functionality that used to work in a prior pandas version Subclassing Subclassing pandas objects
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants