
ColumnTransformers don't honor set_config(transform_output="pandas") when multiprocessing with n_jobs>1 #25239

Closed
Susensio opened this issue Dec 27, 2022 · 11 comments · Fixed by #25363


Susensio commented Dec 27, 2022

Describe the bug

I'm trying to run a grid search with n_jobs=-1 while working with pandas output, and it fails despite set_config(transform_output="pandas").

I have to manually call .set_output(transform='pandas') on the ColumnTransformer for it to work.

Steps/Code to Reproduce

Preparation

import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer, make_column_selector

from sklearn import set_config
set_config(transform_output = "pandas")

# Toy dataframe
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
X, y = df.drop(columns='D'), df['D']>0

# Custom transformer that needs dataframe
class CustomTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        assert isinstance(X, pd.DataFrame), "Fit failed"
        self.cols = X.columns
        return self
    
    def transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame), "Transform failed"
        return X.copy()
    
    def get_feature_names_out(self):
        return self.cols


prepro = CustomTransformer()
model = LogisticRegression()

param_grid = {'logisticregression__C': np.logspace(3,-3, num=50)}

This WORKS (n_jobs=1):

drop = make_column_transformer(('drop', [0]), remainder='passthrough')
pipe = make_pipeline(drop, prepro,  model)

gs = GridSearchCV(pipe, param_grid, cv=10, n_jobs=1)
gs.fit(X,y)

This FAILS (n_jobs=-1):

drop = make_column_transformer(('drop', [0]), remainder='passthrough')
pipe = make_pipeline(drop, prepro,  model)

gs = GridSearchCV(pipe, param_grid, cv=10, n_jobs=-1)
gs.fit(X,y)

This WORKS again (n_jobs=-1 and forcing pandas output on the ColumnTransformer):

drop = make_column_transformer(('drop', [0]), remainder='passthrough').set_output(transform='pandas')
pipe = make_pipeline(drop, prepro,  model)

gs = GridSearchCV(pipe, param_grid, cv=10, n_jobs=-1)
gs.fit(X,y)

Actual Results

AssertionError: Fit failed

Versions

System:
    python: 3.11.1 (main, Dec  7 2022, 08:49:13) [GCC 12.2.0]
executable: /home/susensio/.local/share/venv/bin/python3
   machine: Linux-6.0.0-6-amd64-x86_64-with-glibc2.36

Python dependencies:
      sklearn: 1.2.0
          pip: 22.3.1
   setuptools: 65.6.3
        numpy: 1.24.1
        scipy: 1.9.3
       Cython: None
       pandas: 1.5.2
   matplotlib: 3.6.2
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/susensio/.local/share/venv/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 8

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/susensio/.local/share/venv/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/susensio/.local/share/venv/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8
Susensio added the Bug and Needs Triage labels on Dec 27, 2022

glemaitre commented Dec 27, 2022

I suspect this is related to #17634 and joblib/joblib#1071. I thought we had fixed that issue.

@glemaitre (Member)

So something weird is going on: the setting gets changed in the middle of processing. I tried 3 values of C with cv=10 and printed the config state and the type of X inside the transformer.

global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
/home/glemaitre/Documents/packages/scikit-learn/sklearn/model_selection/_validation.py:378: FitFailedWarning: 
14 fits failed out of a total of 30.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
14 fits failed with the following error:
Traceback (most recent call last):
  File "/home/glemaitre/Documents/packages/scikit-learn/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/glemaitre/Documents/packages/scikit-learn/sklearn/pipeline.py", line 417, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/home/glemaitre/Documents/packages/scikit-learn/sklearn/pipeline.py", line 374, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/home/glemaitre/miniconda3/envs/dev/lib/python3.10/site-packages/joblib/memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "/home/glemaitre/Documents/packages/scikit-learn/sklearn/pipeline.py", line 915, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/glemaitre/Documents/packages/scikit-learn/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/home/glemaitre/Documents/packages/scikit-learn/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/home/glemaitre/Documents/packages/scikit-learn/sklearn/base.py", line 862, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "<ipython-input-5-d996c9237971>", line 22, in fit
AssertionError: Fit failed

  warnings.warn(some_fits_failed_message, FitFailedWarning)
/home/glemaitre/Documents/packages/scikit-learn/sklearn/model_selection/_search.py:953: UserWarning: One or more of the test scores are non-finite: [0.51  nan  nan]
  warnings.warn(
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
Out[5]: 
GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('drop',
                                                                         'drop',
                                                                         [0])])),
                                       ('customtransformer',
                                        CustomTransformer()),
                                       ('logisticregression',
                                        LogisticRegression())]),
             n_jobs=-1,
             param_grid={'logisticregression__C': array([1.e+03, 1.e+00, 1.e-03])})
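For reference, instrumentation along these lines (a sketch with a hypothetical InstrumentedTransformer; the prints are simply added on top of the CustomTransformer from the report) produces the "global config / Type of X" lines above:

from sklearn import get_config

class InstrumentedTransformer(CustomTransformer):
    def fit(self, X, y=None):
        # Show which sklearn config and which input type each fit receives.
        print("global config:", get_config()["transform_output"])
        print("Type of X:", type(X))
        return super().fit(X, y)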

@glemaitre (Member)

Taking the minimal example from #17634, we can reproduce the issue: there need to be more tasks than n_jobs to trigger it, which was probably not the case in the non-regression test.

In [1]: from joblib import Parallel
   ...: from sklearn.utils.fixes import delayed
   ...: from sklearn import get_config, config_context
   ...: 
   ...: def get_working_memory():
   ...:     return get_config()['working_memory']
   ...: 
   ...: with config_context(working_memory=123):
   ...:     results = Parallel(n_jobs=2)(
   ...:         delayed(get_working_memory)() for _ in range(10)
   ...:     )
   ...: 
   ...: results
Out[1]: [123, 123, 123, 123, 1024, 1024, 1024, 1024, 1024, 1024]


glemaitre commented Dec 27, 2022

And the fun part is that things start to go sideways at task n_jobs * 2, which is related to Parallel's pre_dispatch: changing pre_dispatch changes the behaviour above.
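To illustrate (a sketch reusing the snippet above; the exact cut-off depends on scheduling, and the expected outputs are based on the observations in this thread): with the default pre_dispatch='2*n_jobs' only the first batch of tasks is created from the calling thread, whereas pre_dispatch='all' creates all of them up front.

from joblib import Parallel
from sklearn.utils.fixes import delayed
from sklearn import get_config, config_context

def get_working_memory():
    return get_config()["working_memory"]

with config_context(working_memory=123):
    # Default pre_dispatch='2*n_jobs': roughly the first 2 * n_jobs tasks
    # pick up the caller's config, later ones fall back to the default (1024).
    lazy = Parallel(n_jobs=2)(
        delayed(get_working_memory)() for _ in range(10)
    )
    # pre_dispatch='all': every task is created from the calling thread,
    # so the config propagates to all of them (at the cost of more memory).
    eager = Parallel(n_jobs=2, pre_dispatch="all")(
        delayed(get_working_memory)() for _ in range(10)
    )

print(lazy)   # e.g. [123, 123, 123, 123, 1024, 1024, ...]
print(eager)  # [123, 123, 123, 123, 123, 123, 123, 123, 123, 123]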

@Susensio (Author)

With pre_dispatch='all' it works correctly.

@glemaitre (Member)

Yep, but you might blow up your memory :)


glemaitre commented Dec 27, 2022

I assume that it could be linked to this comment:

https://github.com/joblib/joblib/blob/6836640abba55611d6e57f20338ea54b3e27f296/joblib/parallel.py#L1060-L1062

We do not associate the config with the jobs that are not dispatched by the main thread. So we should somehow make sure that the generator of delayed calls passed to Parallel is consumed by the main thread, which has the right config.
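For context, this is roughly how the config propagation from #18736 works (a simplified sketch of the delayed helper, not the exact source): the wrapper snapshots get_config() when the task tuple is created, so whichever thread consumes the task generator determines which config gets captured.

import functools
from sklearn import get_config, config_context

def delayed(function):
    # Capture the call's arguments together with the current sklearn config.
    @functools.wraps(function)
    def delayed_function(*args, **kwargs):
        return _FuncWrapper(function), args, kwargs
    return delayed_function

class _FuncWrapper:
    # Re-apply the captured config around the actual call in the worker.
    def __init__(self, function):
        self.function = function
        # get_config() is read here, i.e. in whichever thread builds the task;
        # past the pre_dispatch batch that is no longer the main thread.
        self.config = get_config()
        functools.update_wrapper(self, function)

    def __call__(self, *args, **kwargs):
        with config_context(**self.config):
            return self.function(*args, **kwargs)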

@Susensio (Author)

So something weird is going on: the setting gets changed in the middle of processing. I tried 3 values of C with cv=10 and printed the config state and the type of X inside the transformer.

How did you print from within the transformer? I am having trouble debugging while multiprocessing... I only get one print.

@glemaitre (Member)

How did you print from within the transformer? I am having trouble debugging while multiprocessing... I only get one print.

I assume you are hitting this issue: #22849
Notebooks and verbosity do not work well together for some reason. Using IPython here is the best option.


Susensio commented Dec 27, 2022

How did you print from within the transformer? I am having trouble debugging while multiprocessing... I only get one print.

I assume you are hitting this issue: #22849. Notebooks and verbosity do not work well together for some reason. Using IPython here is the best option.

That was it! Thanks.

@glemaitre (Member)

Digging into this with @fcharras, the issue is a side effect of the behaviour introduced in #18736: when the thread dispatching the work is the one that holds the global config, we are fine; when it is another thread, that thread picks up a copy of the default global config instead.
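A minimal illustration of that thread-local behaviour (a sketch, assuming the per-thread configuration introduced in #18736): a setting made in the main thread is not visible from a freshly spawned thread, which initialises its config from the library defaults.

import threading
from sklearn import set_config, get_config

set_config(transform_output="pandas")
print("main thread:", get_config()["transform_output"])  # 'pandas'

seen = {}

def read_config():
    # A new thread gets a fresh copy of the *default* global config,
    # so the transform_output set above is not visible here.
    seen["other"] = get_config()["transform_output"]

t = threading.Thread(target=read_config)
t.start()
t.join()
print("other thread:", seen["other"])  # 'default'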
