BUG: Infinite recursion on inheritance and overloading `getattr` in `v2.1.0` #55120

JoschD · 2023-09-13T09:14:05Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from contextlib import suppress
import pandas as pd

class TfsDataFrame(pd.DataFrame):
    """
    Extend DataFrame to add headers, which should be accessed in case 
    the name is not found in the DataFrame itself, but also 
    can be accessed manually by the `headers` attribute.
    """

    _metadata = ["headers"]

    def __init__(self, *args, **kwargs):
        self.headers = {}
        with suppress(IndexError, AttributeError):
            self.headers = args[0].headers
        self.headers = kwargs.pop("headers", self.headers)
        super().__init__(*args, **kwargs)

    def __getattr__(self, name: str) -> object:
        try:
            return super().__getattr__(name)
        except AttributeError:
            pass
        
        try:
            return self.headers[name]
        except KeyError:
            raise AttributeError(f"{name} is neither in the DataFrame nor in headers.")

    @property
    def _constructor(self):
        return TfsDataFrame


if __name__ == "__main__":
    df = TfsDataFrame(columns=["A", "B", "C"])
    print(df[["A", "B"]])

Issue Description

This is a reduced example of a DataFrame extension, that we have been using for years now and it worked fine until version 2.0.3 (including), but broke with 2.1.0.

From 2.1.0 the code above runs into an infinite recursion loop, as df[["A", "B"]] creates a new TfsDataFrame (due to the _constructor).
In its __init__ the args[0] is then already a TfsDataFrame object butargs[0].headers for some reason calls __getattr__() as does self.headers[name] in the __getattr__() function, causing the infinite recursion.

This was not the case in the past, as .headers should be finding the attribute headers "directly", without even going to __getattr__().

Interestingly,

df = TfsDataFrame(columns=["A", "B", "C"])
print(df.headers)

df = TfsDataFrame(columns=["A", "B", "C"], headers={"a":1})
print(df.headers["a"])

are indeed working as intended.

This is the most direct recursion loop I could find, but actually with 2.1.0 we have experienced extended recursion loops, which can create infinite runtime.

Expected Behavior

Should not end up in an infinite recursion.
As is the case e.g. in pandas version 2.0.3.

Thank you very much for your help and your effort maintaining pandas.
Its amazing set of features have made our work with measurement data much more comfortable and streamlined,
not to mention the highly increased readability of our code compared to anno BP (before pandas).

Installed Versions

INSTALLED VERSIONS

commit : ba1cccd
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.14.0-162.18.1.el9_1.x86_64
Version : #1 SMP PREEMPT_DYNAMIC Tue Feb 28 03:54:43 EST 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0
numpy : 1.25.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 59.5.0
pip : 21.3.1
Cython : 0.29.35
pytest : 7.3.1
hypothesis : 6.75.8
sphinx : 6.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.2
sqlalchemy : None
tables : 3.8.0
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.3.1
pyqt5 : None

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2023-10-25T15:53:13Z

@JoschD thanks a lot for the report! (and sorry for the slow response; I actually had looked at it earlier because I had a related PR (#54922) that might also fix this, but didn't get around to posting here)

So your analysis is mostly correct I think. What changed in pandas (#53871) is that we no longer pass a Manager object to the subclass constructor (which you would pass to super().__init__ to create the dataframe), but that we initialize the DataFrame object from the manager, and pass that to the constructor.
However, this DataFrame construction happens through a DataFrame._from_mgr method, which isn't overriden by the subclass, and thus the resulting subclass DataFrame object hasn't been properly initialized.

As a result of that, your TfsDataFrame.__init__ gets passed a TfsDataFrame instance that is not fully initialized, i.e. doesn't yet have the headers attribute. At that point, when trying to access that attribute with args[0].headers, it will go into __getattr__ with "headers" key because of the attribute not existing (for actual attributes, it doesn't call __getattr__). But then inside your __getattr__ implementation, you again call self.headers, which calls __getattr__ because the attribute doesn't exist, giving an infinite loop ...

I do have a PR that tries to fix this regression (#54922), but long term, you will also have to make some changes to your subclass implementation:

One option is to define headers = None (or {}) as an attribute on the class level, so that whenever an instance of TfsDataFrame is created, it always has this attribute already. That will avoid the recursion because of the attribute not existing.
While in theory there shouldn't be a need for this (normally, it's totally fine to define instance attributes in __init__), I think it should probably also be harmless to do and probably the easiest workaround you can already implement on your side.
If you want to have custom logic for initializing a TfsDataFrame from a manager object, like ensuring it has custom attributes, you can implement _constructor_from_mgr in addition to _constructor, and set up the attribute there. This is the more general option allowing you to customize this step, but so for a simple attribute, the above option might be easier.

I tested that both options work with your example above, and they would look like:

class TfsDataFrame(pd.DataFrame):

    _metadata = ["headers"]
    headers = None    # <-- added line

    def __init__(self, *args, **kwargs):
    ....

or

class TfsDataFrame(pd.DataFrame):

    _metadata = ["headers"]

    ....

    @property
    def _constructor(self):
        return TfsDataFrame
        
    def _constructor_from_mgr(self, mgr, axes):
        obj = self._from_mgr(mgr, axes)
        obj.headers = {}
        return obj

JoschD · 2023-10-30T14:17:26Z

Hello @jorisvandenbossche ,
thanks a lot for looking into this and especially for your extensive answer.
We have decided to go with the "proper" solution and overriding the _constructor_from_mgr function and it works like a charm, even in our actual repository!

I hope you have a great day, may all your issues have easy-to-implement solutions like this :)

Cheers!

JoschD added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 13, 2023

jorisvandenbossche mentioned this issue Sep 16, 2023

Potential recent bug reports to investigate jorisvandenbossche/pydata-amsterdam-pandas-sprint#5

Open

12 tasks

jorisvandenbossche added Regression Functionality that used to work in a prior pandas version Subclassing Subclassing pandas objects and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 14, 2023

jorisvandenbossche added this to the 2.1.2 milestone Oct 14, 2023

jorisvandenbossche mentioned this issue Oct 25, 2023

REGR: Restore _constructor_from_mgr to pass manager object to constructor #54922

Merged

jorisvandenbossche closed this as completed in #54922 Oct 26, 2023

ivirshup mentioned this issue Oct 30, 2023

DataFrameView in pandas 2.1.2 changed behavior, breaking AnnData in several ways scverse/anndata#1210

Closed

3 tasks

flying-sheep mentioned this issue Oct 30, 2023

BUG: pandas 2.1.2 changes how copy works #55763

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Infinite recursion on inheritance and overloading `getattr` in `v2.1.0` #55120

BUG: Infinite recursion on inheritance and overloading `getattr` in `v2.1.0` #55120

JoschD commented Sep 13, 2023

INSTALLED VERSIONS

jorisvandenbossche commented Oct 25, 2023

JoschD commented Oct 30, 2023

BUG: Infinite recursion on inheritance and overloading __getattr__ in v2.1.0 #55120

BUG: Infinite recursion on inheritance and overloading __getattr__ in v2.1.0 #55120

Comments

JoschD commented Sep 13, 2023

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

jorisvandenbossche commented Oct 25, 2023

JoschD commented Oct 30, 2023

BUG: Infinite recursion on inheritance and overloading `getattr` in `v2.1.0` #55120

BUG: Infinite recursion on inheritance and overloading `getattr` in `v2.1.0` #55120