Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrameView in pandas 2.1.2 changed behavior, breaking AnnData in several ways #1210

Closed
2 of 3 tasks
LucaMarconato opened this issue Oct 28, 2023 · 17 comments · Fixed by #1211
Closed
2 of 3 tasks

Comments

@LucaMarconato
Copy link
Member

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

I am getting an infinite recursion due most likely to this pandas bug: pandas-dev/pandas#52927. The bug, which appeared a few days ago (probably the latest pandas release), is triggered by a code that I use to populate old categories that gets dropped after data subsetting (to have a workaround on this: #890).

The latest main from pandas is supposed to fix the problem, but it looks like it doesn't. Maybe I should report the bug to pandas, I will try to reproduce it via pandas code only now.

Code:

##
from anndata import AnnData
import pandas as pd

def _inplace_fix_subset_categorical_obs(subset_adata: AnnData, original_adata: AnnData) -> None:
    """
    Fix categorical obs columns of subset_adata to match the categories of original_adata.

    Parameters
    ----------
    subset_adata
        The subset AnnData object
    original_adata
        The original AnnData object

    Notes
    -----
    See discussion here: https://github.com/scverse/anndata/issues/997
    """
    obs = subset_adata.obs
    for column in obs.columns:
        is_categorical = pd.api.types.is_categorical_dtype(obs[column])
        if is_categorical:
            c = obs[column].cat.set_categories(original_adata.obs[column].cat.categories)
            obs[column] = c

anndata = AnnData(obs=pd.DataFrame({"a": pd.Categorical(["a", "b", "c"])}))
subset = anndata[anndata.obs["a"] == "a"]
assert subset.obs['a'].cat.categories.tolist() == ["a"]
_inplace_fix_subset_categorical_obs(subset, anndata)

Traceback:

_inplace_fix_subset_categorical_obs(subset, anndata)
<ipython-input-2-91050d1e96e1>:21: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, pd.CategoricalDtype) instead
  is_categorical = pd.api.types.is_categorical_dtype(obs[column])
<ipython-input-2-91050d1e96e1>:24: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  obs[column] = c
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
...
...
...
...
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py:79: ImplicitModificationWarning: Trying to modify attribute `.obs` of view, initializing view as actual.
  container[idx] = value
Traceback (most recent call last):
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3526, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-7-191c4157c214>", line 1, in <module>
    _inplace_fix_subset_categorical_obs(subset, anndata)
  File "<ipython-input-2-91050d1e96e1>", line 24, in _inplace_fix_subset_categorical_obs
    obs[column] = c
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py", line 79, in __setitem__
    container[idx] = value
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py", line 79, in __setitem__
    container[idx] = value
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py", line 79, in __setitem__
    container[idx] = value
  [Previous line repeated 2961 more times]
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py", line 78, in __setitem__
    with view_update(*self._view_args) as container:
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/views.py", line 52, in view_update
    new = adata_view.copy()
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/anndata.py", line 1586, in copy
    return self._mutated_copy()
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/anndata.py", line 1527, in _mutated_copy
    return AnnData(**new)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/anndata.py", line 362, in __init__
    self._init_as_actual(
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/anndata.py", line 548, in _init_as_actual
    self._var = _gen_dataframe(
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/anndata/_core/anndata.py", line 186, in _gen_dataframe_df
    anno.columns = anno.columns.astype(str)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 1073, in astype
    result = Index(new_values, name=self.name, dtype=new_values.dtype, copy=False)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 482, in __new__
    name = maybe_extract_name(name, data, cls)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 7579, in maybe_extract_name
    if name is None and isinstance(obj, (Index, ABCSeries)):
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/pandas/core/dtypes/generic.py", line 44, in _instancecheck
    return _check(inst) and not isinstance(inst, type)
  File "/Users/macbook/miniconda3/envs/ome/lib/python3.10/site-packages/pandas/core/dtypes/generic.py", line 38, in _check
    return getattr(inst, attr, "_typ") in comp
RecursionError: maximum recursion depth exceeded while calling a Python object

Versions

-----
anndata             0.10.2
session_info        1.0.0
-----
asciitree           NA
cloudpickle         3.0.0
cython_runtime      NA
cytoolz             0.12.2
dask                2023.10.1
dateutil            2.8.2
exceptiongroup      1.1.3
fasteners           0.17.3
gmpy2               2.1.2
h5py                3.9.0
importlib_metadata  NA
jinja2              3.1.2
markupsafe          2.1.3
mpmath              1.3.0
msgpack             1.0.6
natsort             8.4.0
numcodecs           0.12.1
numpy               1.24.4
packaging           23.2
pandas              2.2.0.dev0+471.g984d75543
psutil              5.9.5
pyarrow             13.0.0
pytz                2023.3.post1
scipy               1.11.3
six                 1.16.0
sphinxcontrib       NA
sympy               1.12
tblib               2.0.0
tlz                 0.12.2
toolz               0.12.0
torch               2.1.0
torchgen            NA
tqdm                4.66.1
typing_extensions   NA
yaml                6.0.1
zarr                2.16.1
zipp                NA
zoneinfo            NA
-----
Python 3.10.13 | packaged by conda-forge | (main, Oct 26 2023, 18:09:17) [Clang 16.0.6 ]
macOS-13.4.1-arm64-arm-64bit
-----
Session information updated at 2023-10-28 17:20
@LucaMarconato
Copy link
Member Author

The bug seems to be in anndata, it is triggered by obs[column] = c. The type of obs[column] is a anndata._core.views.DataFrameView, I am trying to get around this, maybe avoiding assigning the column to the view can help fix the problem.

@LucaMarconato
Copy link
Member Author

LucaMarconato commented Oct 28, 2023

I managed to fix by changing obs = subset_adata.obs to obs = pd.DataFrame(subset_adata.obs), and adding a subset_adata.obs = obs right before the function end.

@LucaMarconato
Copy link
Member Author

Update: in spatialdata the bug appears also in other places.
For instance I have an AnnData object containing a categorical column in obs:

matched_table.obs
Out[14]: 
           region  instance_id  numerical_in_obs
0  values_circles            0          0.321869
1  values_circles            1          0.594300
2  values_circles            2          0.337911
3  values_circles            3          0.391619
4  values_circles            4          0.890274
5  values_circles            5          0.227158
6  values_circles            6          0.623187
7  values_circles            7          0.084015

such that

type(matched_table.obs)
Out[11]: anndata._core.views.DataFrameView

Now, the following operation leads to the same recursion exception:

matched_table[:, value_key_values].obs

but if I delete the categorical column, then the problem disappears.

In my case I'll patch the bug by making an explicit instantiation of matched_table.obs, but I think this bug should be fixed here in AnnData.

@LucaMarconato
Copy link
Member Author

Found another piece of code affected: table = table[table.obs[region_key].isin(coordinate_system)].copy(), I'll patch also this for the time being.

@flying-sheep
Copy link
Member

flying-sheep commented Oct 30, 2023

Well, DataFrameView is there for a reason. AnnData’s views are lightweight objects that represent slices of other AnnData objects. They’re copy on write, so setting an attribute on them like you do in your workaround makes them into non-views.

So I’m pretty sure that what you’re actually doing in your workaround isn’t just setting obs to a dataframe, instead it’s the same as doing adata = adata.copy(), only indirectly triggered by setting a field on the AnnData object. I’d recommend replacing your workaround code with that, seeing adata.obs = pd.DataFrame(adata.obs) is a confusing way to say adata = adata.copy() # make view into actual to work around https://github.com/scverse/anndata/issues/1210

A real fix would probably be to change DataFrameView to not trigger recursive behavior. Do you have an idea how that could be done?

@flying-sheep
Copy link
Member

I see, the bugfix for pandas-dev/pandas#52927 hasn’t made it into a release yet, so the best workaround is to just set a pandas dependency specifier pandas !=2.1.2.

@ivirshup
Copy link
Member

@flying-sheep, using the pandas nightly wheel does not solve this issue, so I don't think a new release of pandas will fix this.

So I think we still either need to do:

A real fix would probably be to change DataFrameView to not trigger recursive behavior. Do you have an idea how that could be done?

Or fix it upstream

@ivirshup ivirshup reopened this Oct 30, 2023
@ivirshup
Copy link
Member

ivirshup commented Oct 30, 2023

I also get this behavior for setting any column of a dataframe, not just categorical ones. E.g.:

import anndata as ad, pandas as pd, numpy as np

adata = ad.AnnData(
    obs=pd.DataFrame(
        {"b": [1, 2, 3]},
        index=list("abc")
    )
)
v = adata[[0], :]

v.obs["b"] = 3

Also triggers the recursion error.

As does:

v.obs.drop("b")

@ivirshup
Copy link
Member

ivirshup commented Oct 30, 2023

I think I've found the problem:

import anndata as ad, pandas as pd, numpy as np

adata = ad.AnnData(
    obs=pd.DataFrame(
        {"b": [1, 2, 3]},
        index=list("abc")
    )
)
v = adata[[0], :]

type(v.obs.copy())

in pandas 2.1.2:

anndata._core.views.DataFrameView

in pandas 2.1.1:

pandas.core.frame.DataFrame

It looks like pd.DataFrame._constructor_from_mgr is what changed.

This is a behavior change in a bug fix release of pandas, so possibly is a new pandas bug in and of itself.

It's unclear to me whether pandas-dev/pandas#52927 is relevant to this bug in anndata

@ivirshup
Copy link
Member

This looks relevant: pandas-dev/pandas#55120

@LucaMarconato
Copy link
Member Author

Thanks for the info. @flying-sheep I am quite sure I tried also adata = andata.copy() but it didn't work, that's why I am doing the conversion on the obs.

@flying-sheep
Copy link
Member

flying-sheep commented Oct 30, 2023

Very weird! If the class is still a view, it should try updating its parent AnnData object.

Well, I hope pandas reverts this and until their next major release we can come up with a good fix.

Adding the following to DataFrameView fixes all AnnData tests, and all scanpy tests except for scanpy/tests/test_pca.py::test_pca_warnings.

And that test failure is unrelated
    def test_pca_warnings(array_type, zero_center, pca_params):
        svd_solver, expected_warning = pca_params
        A = array_type(A_list).astype("float32")
        adata = AnnData(A)
    
        if expected_warning is not None:
            with pytest.warns(UserWarning, match=expected_warning):
                sc.pp.pca(adata, svd_solver=svd_solver, zero_center=zero_center)
        else:
            with warnings.catch_warnings(record=True) as record:
                sc.pp.pca(adata, svd_solver=svd_solver, zero_center=zero_center)
>           assert len(record) == 0
E           assert 1 == 0
E            +  where 1 = len([<warnings.WarningMessage object at 0x288fca350>])

scanpy/tests/test_pca.py:118: AssertionError
def copy(self, deep: bool = True) -> pd.DataFrame:
    """Create a non-view copy of the DataFrame."""
    return pd.DataFrame(super().copy(deep=deep))

A possible issue is that the tests seemed to run pretty slow, so maybe it breaks some optimization? Maybe I just used too many threads for my poor M1 CPU.

@ivirshup also noted that a .groupby(...) on a view reproduces the error, so we should add a test for that.

@ivirshup ivirshup changed the title Infinite recursion due to pandas(?) when operating on a categorical obs column Infinite recursion due to pandas(?) when modifying DataFrameView in pandas 2.1.2 Oct 30, 2023
@flying-sheep flying-sheep changed the title Infinite recursion due to pandas(?) when modifying DataFrameView in pandas 2.1.2 DataFrameView in pandas 2.1.2 changed behavior, breaking AnnData in several ways Nov 3, 2023
@flying-sheep
Copy link
Member

flying-sheep commented Nov 3, 2023

#1223 shows a different symptom of the same upstream behavior change, so I renamed this issue to catch them all.

@dsm-72

This comment was marked as off-topic.

@flying-sheep

This comment was marked as off-topic.

@ivirshup
Copy link
Member

ivirshup commented Nov 8, 2023

PR to fix, which is slated for the next pandas bug fix release: pandas-dev/pandas#55764. We can close if if/ when the PR to pandas merges

@ivirshup
Copy link
Member

ivirshup commented Nov 9, 2023

Merged!

@ivirshup ivirshup closed this as completed Nov 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants