
ColumnTransformers don't honor set_config(transform_output="pandas") when multiprocessing with n_jobs>1 #25239

Closed
Susensio opened this issue Dec 27, 2022 · 11 comments · Fixed by #25363


Susensio commented Dec 27, 2022

Describe the bug

I'm trying to run a grid search with n_jobs=-1 while working with pandas output, and it fails despite set_config(transform_output="pandas").

I have to manually call .set_output(transform='pandas') on the ColumnTransformer for it to work.

Steps/Code to Reproduce

Preparation

import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer, make_column_selector

from sklearn import set_config
set_config(transform_output = "pandas")

# Toy dataframe
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
X, y = df.drop(columns='D'), df['D']>0

# Custom transformer that needs dataframe
class CustomTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        assert isinstance(X, pd.DataFrame), "Fit failed"
        self.cols = X.columns
        return self
    
    def transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame), "Transform failed"
        return X.copy()
    
    def get_feature_names_out(self):
        return self.cols


prepro = CustomTransformer()
model = LogisticRegression()

param_grid = {'logisticregression__C': np.logspace(3,-3, num=50)}

This WORKS (n_jobs=1):

drop = make_column_transformer(('drop', [0]), remainder='passthrough')
pipe = make_pipeline(drop, prepro,  model)

gs = GridSearchCV(pipe, param_grid, cv=10, n_jobs=1)
gs.fit(X,y)

This FAILS (n_jobs=-1):

drop = make_column_transformer(('drop', [0]), remainder='passthrough')
pipe = make_pipeline(drop, prepro,  model)

gs = GridSearchCV(pipe, param_grid, cv=10, n_jobs=-1)
gs.fit(X,y)

This WORKS again (n_jobs=-1 and forcing pandas output on the ColumnTransformer):

drop = make_column_transformer(('drop', [0]), remainder='passthrough').set_output(transform='pandas')
pipe = make_pipeline(drop, prepro,  model)

gs = GridSearchCV(pipe, param_grid, cv=10, n_jobs=-1)
gs.fit(X,y)

Actual Results

AssertionError: Fit failed

Versions

System:
    python: 3.11.1 (main, Dec  7 2022, 08:49:13) [GCC 12.2.0]
executable: /home/susensio/.local/share/venv/bin/python3
   machine: Linux-6.0.0-6-amd64-x86_64-with-glibc2.36

Python dependencies:
      sklearn: 1.2.0
          pip: 22.3.1
   setuptools: 65.6.3
        numpy: 1.24.1
        scipy: 1.9.3
       Cython: None
       pandas: 1.5.2
   matplotlib: 3.6.2
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/susensio/.local/share/venv/lib/python3.11/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 8

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/susensio/.local/share/venv/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/susensio/.local/share/venv/lib/python3.11/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8
Susensio added the Bug and Needs Triage labels on Dec 27, 2022

glemaitre commented Dec 27, 2022

I suspect this is related to #17634 and joblib/joblib#1071. I thought we had fixed that issue.

@glemaitre (Member)

So something weird is going on: the setting gets changed in the middle of processing. I tried 3 values of C with cv=10 and printed the config state and the type of X inside the transformer.

global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
global config: default
Type of X: <class 'numpy.ndarray'>
/home/glemaitre/Documents/packages/scikit-learn/sklearn/model_selection/_validation.py:378: FitFailedWarning: 
14 fits failed out of a total of 30.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
14 fits failed with the following error:
Traceback (most recent call last):
  File "/home/glemaitre/Documents/packages/scikit-learn/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/glemaitre/Documents/packages/scikit-learn/sklearn/pipeline.py", line 417, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/home/glemaitre/Documents/packages/scikit-learn/sklearn/pipeline.py", line 374, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/home/glemaitre/miniconda3/envs/dev/lib/python3.10/site-packages/joblib/memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "/home/glemaitre/Documents/packages/scikit-learn/sklearn/pipeline.py", line 915, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/glemaitre/Documents/packages/scikit-learn/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/home/glemaitre/Documents/packages/scikit-learn/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/home/glemaitre/Documents/packages/scikit-learn/sklearn/base.py", line 862, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "<ipython-input-5-d996c9237971>", line 22, in fit
AssertionError: Fit failed

  warnings.warn(some_fits_failed_message, FitFailedWarning)
/home/glemaitre/Documents/packages/scikit-learn/sklearn/model_selection/_search.py:953: UserWarning: One or more of the test scores are non-finite: [0.51  nan  nan]
  warnings.warn(
global config: pandas
Type of X: <class 'pandas.core.frame.DataFrame'>
Out[5]: 
GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('drop',
                                                                         'drop',
                                                                         [0])])),
                                       ('customtransformer',
                                        CustomTransformer()),
                                       ('logisticregression',
                                        LogisticRegression())]),
             n_jobs=-1,
             param_grid={'logisticregression__C': array([1.e+03, 1.e+00, 1.e-03])})
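For reference, instrumentation along these lines (a sketch with a hypothetical InstrumentedTransformer; the prints are simply added on top of the CustomTransformer from the report) produces the "global config / Type of X" lines above:

from sklearn import get_config

class InstrumentedTransformer(CustomTransformer):
    def fit(self, X, y=None):
        # Show which sklearn config and which input type each fit receives.
        print("global config:", get_config()["transform_output"])
        print("Type of X:", type(X))
        return super().fit(X, y)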

@glemaitre (Member)

Taking the minimal example from #17634, we can reproduce the issue: there need to be more tasks than n_jobs to trigger it, which was probably not the case in the non-regression test.

In [1]: from joblib import Parallel
   ...: from sklearn.utils.fixes import delayed
   ...: from sklearn import get_config, config_context
   ...: 
   ...: def get_working_memory():
   ...:     return get_config()['working_memory']
   ...: 
   ...: with config_context(working_memory=123):
   ...:     results = Parallel(n_jobs=2)(
   ...:         delayed(get_working_memory)() for _ in range(10)
   ...:     )
   ...: 
   ...: results
Out[1]: [123, 123, 123, 123, 1024, 1024, 1024, 1024, 1024, 1024]


glemaitre commented Dec 27, 2022

And the fun part is that things start to go sideways at task n_jobs * 2, which is related to Parallel's pre_dispatch: changing pre_dispatch changes the behaviour above.
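To illustrate (a sketch reusing the snippet above; the exact cut-off depends on scheduling, and the expected outputs are based on the observations in this thread): with the default pre_dispatch='2*n_jobs' only the first batch of tasks is created from the calling thread, whereas pre_dispatch='all' creates all of them up front.

from joblib import Parallel
from sklearn.utils.fixes import delayed
from sklearn import get_config, config_context

def get_working_memory():
    return get_config()["working_memory"]

with config_context(working_memory=123):
    # Default pre_dispatch='2*n_jobs': roughly the first 2 * n_jobs tasks
    # pick up the caller's config, later ones fall back to the default (1024).
    lazy = Parallel(n_jobs=2)(
        delayed(get_working_memory)() for _ in range(10)
    )
    # pre_dispatch='all': every task is created from the calling thread,
    # so the config propagates to all of them (at the cost of more memory).
    eager = Parallel(n_jobs=2, pre_dispatch="all")(
        delayed(get_working_memory)() for _ in range(10)
    )

print(lazy)   # e.g. [123, 123, 123, 123, 1024, 1024, ...]
print(eager)  # [123, 123, 123, 123, 123, 123, 123, 123, 123, 123]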

@Susensio (Author)

With pre_dispatch='all' it works correctly.

@glemaitre (Member)

Yep, but you might blow up your memory :)


glemaitre commented Dec 27, 2022

I assume that it could be linked to this comment:

https://github.com/joblib/joblib/blob/6836640abba55611d6e57f20338ea54b3e27f296/joblib/parallel.py#L1060-L1062

We do not associate the config with the jobs that are not dispatched by the main thread. So we should somehow make sure that the generator of delayed calls passed to Parallel is consumed by the main thread, which has the right config.
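For context, this is roughly how the config propagation from #18736 works (a simplified sketch of the delayed helper, not the exact source): the wrapper snapshots get_config() when the task tuple is created, so whichever thread consumes the task generator determines which config gets captured.

import functools
from sklearn import get_config, config_context

def delayed(function):
    # Capture the call's arguments together with the current sklearn config.
    @functools.wraps(function)
    def delayed_function(*args, **kwargs):
        return _FuncWrapper(function), args, kwargs
    return delayed_function

class _FuncWrapper:
    # Re-apply the captured config around the actual call in the worker.
    def __init__(self, function):
        self.function = function
        # get_config() is read here, i.e. in whichever thread builds the task;
        # past the pre_dispatch batch that is no longer the main thread.
        self.config = get_config()
        functools.update_wrapper(self, function)

    def __call__(self, *args, **kwargs):
        with config_context(**self.config):
            return self.function(*args, **kwargs)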

@Susensio (Author)

So something weird is going on: the setting gets changed in the middle of processing. I tried 3 values of C with cv=10 and printed the config state and the type of X inside the transformer.

How did you print from within the transformer? I am having trouble debugging while multiprocessing... I only get one print.

@glemaitre (Member)

How did you print from within the transformer? I am having trouble debugging while multiprocessing... I only get one print.

I assume you are hitting this issue: #22849
Notebooks and verbosity do not work well together for some reason. Using IPython here is the best option.


Susensio commented Dec 27, 2022

How did you print from within the transformer? I am having trouble debugging while multiprocessing... I only get one print.

I assume you are hitting this issue: #22849. Notebooks and verbosity do not work well together for some reason. Using IPython here is the best option.

That was it! Thanks.

@glemaitre (Member)

Digging into this with @fcharras, the issue is a side effect of the behaviour introduced in #18736: when the thread dispatching the work is the one that holds the global config, we are fine; when it is another thread, that thread picks up a copy of the default global config instead.
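A minimal illustration of that thread-local behaviour (a sketch, assuming the per-thread configuration introduced in #18736): a setting made in the main thread is not visible from a freshly spawned thread, which initialises its config from the library defaults.

import threading
from sklearn import set_config, get_config

set_config(transform_output="pandas")
print("main thread:", get_config()["transform_output"])  # 'pandas'

seen = {}

def read_config():
    # A new thread gets a fresh copy of the *default* global config,
    # so the transform_output set above is not visible here.
    seen["other"] = get_config()["transform_output"]

t = threading.Thread(target=read_config)
t.start()
t.join()
print("other thread:", seen["other"])  # 'default'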
