Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FeatureUnion not working when aggregating data and pandas transform output selected #25730

Closed
laszlojakab opened this issue Feb 28, 2023 · 6 comments · Fixed by #25747
Closed

Comments

@laszlojakab
Copy link

Describe the bug

I would like to use pandas transform output and use a custom transformer in a feature union which aggregates data. When I'm using this combination I got an error. When I use default numpy output it works fine.

Steps/Code to Reproduce

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn import set_config
from sklearn.pipeline import make_union

index = pd.date_range(start="2020-01-01", end="2020-01-05", inclusive="left", freq="H")
data = pd.DataFrame(index=index, data=[10] * len(index), columns=["value"])
data["date"] = index.date


class MyTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X: pd.DataFrame, y: pd.Series | None = None, **kwargs):
        return self

    def transform(self, X: pd.DataFrame, y: pd.Series | None = None) -> pd.DataFrame:
        return X["value"].groupby(X["date"]).sum()


# This works.
set_config(transform_output="default")
print(make_union(MyTransformer()).fit_transform(data))

# This does not work.
set_config(transform_output="pandas")
print(make_union(MyTransformer()).fit_transform(data))

Expected Results

No error is thrown when using pandas transform output.

Actual Results

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 25
     23 # This does not work.
     24 set_config(transform_output="pandas")
---> 25 print(make_union(MyTransformer()).fit_transform(data))

File ~/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/sklearn/utils/_set_output.py:150, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
    143 if isinstance(data_to_wrap, tuple):
    144     # only wrap the first output for cross decomposition
    145     return (
    146         _wrap_data_with_container(method, data_to_wrap[0], X, self),
    147         *data_to_wrap[1:],
    148     )
--> 150 return _wrap_data_with_container(method, data_to_wrap, X, self)

File ~/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/sklearn/utils/_set_output.py:130, in _wrap_data_with_container(method, data_to_wrap, original_input, estimator)
    127     return data_to_wrap
    129 # dense_config == "pandas"
--> 130 return _wrap_in_pandas_container(
    131     data_to_wrap=data_to_wrap,
    132     index=getattr(original_input, "index", None),
    133     columns=estimator.get_feature_names_out,
    134 )

File ~/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/sklearn/utils/_set_output.py:59, in _wrap_in_pandas_container(data_to_wrap, columns, index)
     57         data_to_wrap.columns = columns
     58     if index is not None:
---> 59         data_to_wrap.index = index
     60     return data_to_wrap
     62 return pd.DataFrame(data_to_wrap, index=index, columns=columns)

File ~/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/pandas/core/generic.py:5588, in NDFrame.__setattr__(self, name, value)
   5586 try:
   5587     object.__getattribute__(self, name)
-> 5588     return object.__setattr__(self, name, value)
   5589 except AttributeError:
   5590     pass

File ~/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/pandas/_libs/properties.pyx:70, in pandas._libs.properties.AxisProperty.__set__()

File ~/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/pandas/core/generic.py:769, in NDFrame._set_axis(self, axis, labels)
    767 def _set_axis(self, axis: int, labels: Index) -> None:
    768     labels = ensure_index(labels)
--> 769     self._mgr.set_axis(axis, labels)
    770     self._clear_item_cache()

File ~/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/pandas/core/internals/managers.py:214, in BaseBlockManager.set_axis(self, axis, new_labels)
    212 def set_axis(self, axis: int, new_labels: Index) -> None:
    213     # Caller is responsible for ensuring we have an Index object.
--> 214     self._validate_set_axis(axis, new_labels)
    215     self.axes[axis] = new_labels

File ~/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/pandas/core/internals/base.py:69, in DataManager._validate_set_axis(self, axis, new_labels)
     66     pass
     68 elif new_len != old_len:
---> 69     raise ValueError(
     70         f"Length mismatch: Expected axis has {old_len} elements, new "
     71         f"values have {new_len} elements"
     72     )

ValueError: Length mismatch: Expected axis has 4 elements, new values have 96 elements

Versions

System:
    python: 3.10.6 (main, Aug 30 2022, 05:11:14) [Clang 13.0.0 (clang-1300.0.29.30)]
executable: /Users/macbookpro/.local/share/virtualenvs/3e_VBrf2/bin/python
   machine: macOS-11.3-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.2.1
          pip: 22.3.1
   setuptools: 67.3.2
        numpy: 1.23.5
        scipy: 1.10.1
       Cython: None
       pandas: 1.4.4
   matplotlib: 3.7.0
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/macbookpro/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
        version: 0.3.20
threading_layer: pthreads
   architecture: Haswell
    num_threads: 4

       user_api: openmp
   internal_api: openmp
         prefix: libomp
       filepath: /Users/macbookpro/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/sklearn/.dylibs/libomp.dylib
        version: None
    num_threads: 8

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/macbookpro/.local/share/virtualenvs/3e_VBrf2/lib/python3.10/site-packages/scipy/.dylibs/libopenblas.0.dylib
        version: 0.3.18
threading_layer: pthreads
   architecture: Haswell
    num_threads: 4
@laszlojakab laszlojakab added Bug Needs Triage Issue requires triage labels Feb 28, 2023
@thomasjpfan
Copy link
Member

As noted in the glossery, Scikit-learn transformers expects that transform's output have the same number of samples as the input. This exception is held in FeatureUnion when processing data and tries to make sure that the output index is the same as the input index. In principle, we can have a less restrictive requirement and only set the index if it is not defined.

To better understand your use case, how do you intend to use the FeatureUnion in the overall pipeline?

@thomasjpfan thomasjpfan added module:pipeline Needs Investigation Issue requires investigation and removed Needs Triage Issue requires triage labels Feb 28, 2023
@laszlojakab
Copy link
Author

laszlojakab commented Feb 28, 2023

Scikit-learn transformers expects that transform's output have the same number of samples as the input

I haven't known that. Good to know. What is the correct way to aggregate or drop rows in a pipeline? Isn't that supported?

To better understand your use case, how do you intend to use the FeatureUnion in the overall pipeline?

The actual use case: I have a time series (price) with hourly frequency. It is a single series with a datetime index. I have built a dataframe with pipeline and custom transformers (by also violating the rule to have same number of inputs and outputs) which aggregates the data (calculates daily mean, and some moving average of daily means) then I have transformed back to hourly frequency (using same values for all the hours of a day). So the dataframe has (date, price, mean, moving_avg) columns at that point with hourly frequency ("same number input/output" rule violated again). After that I have added the problematic FeatureUnion. One part of the union simply drops price and "collapses" the remaining part to daily data (as I said all the remaining columns has the same values on the same day). On the other part of the feature union I calculate a standard devition between price and moving_avg on daily basis. So I have the (date, mean, moving_avg) on the left side of the feature union and an std on the right side. Both have daily frequency. I would like to have a dataframe with (date, mean, moving_avg, std) at the end of the transformation.

@laszlojakab
Copy link
Author

As I see there is the same "problem" in ColumnTransfromer.

@laszlojakab
Copy link
Author

I have a look at how scikit-learn encapsulates output into a DataFrame and found this code block:

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_set_output.py#L55-L62

Is there any reason to set index here? If transformer returned a DataFrame this already has some kind of index. Why should we restore the original input index? What is the use case when a transformer changes the DataFrame's index and scikit-learn has to restore it automatically to the input index?

With index restoration it is also expected for transformers that index should not be changed (or if it is changed by transformer then scikit-learn restores the original one which could be a bit unintuitive). Is this an intended behaviour?

What is the design decision to not allow changing index and row count in data by transformers? In time series problems I think it is very common to aggregate raw data and modify original index.

@thomasjpfan
Copy link
Member

thomasjpfan commented Mar 2, 2023

What is the correct way to aggregate or drop rows in a pipeline? Isn't that supported?

Yea, changing the number of rows is not supported. There is a lengthy discussion about it in #3855

As for this specific issue, I think we can get away with not setting the index so your use case just works. I opened #25747 to propose this change.

@thomasjpfan thomasjpfan removed the Needs Investigation Issue requires investigation label Mar 2, 2023
@laszlojakab
Copy link
Author

@thomasjpfan Thank you very much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants