
Library specific global options are not passed to jobs #1071

Open
thomasjpfan opened this issue Jun 18, 2020 · 15 comments

Comments

@thomasjpfan
Contributor

thomasjpfan commented Jun 18, 2020

Libraries such as pandas and scikit-learn have global config options. These options are not passed to the jobs spawned by joblib (with a multiprocessing backend):

from joblib import Parallel, delayed
from sklearn import get_config, config_context

def get_working_memory():
    return get_config()['working_memory']

with config_context(working_memory=123):
    results = Parallel(n_jobs=2)(
        delayed(get_working_memory)() for _ in range(2)
    )

results
# [1024, 1024]  (the global default, not the expected [123, 123])

Related: #1064
Scikit-learn specific fix: scikit-learn/scikit-learn#17634

@lesteve
Member

lesteve commented Jun 19, 2020

Interesting one... I would (wild-)guess that this is related to changing a module variable in the main process, which is not picked up in the subprocesses?

@lesteve
Member

lesteve commented Jun 19, 2020

If I had to make an extra wild guess (feeling a bit adventurous this morning 😉), I would think the culprit is very likely cloudpickle.

@thomasjpfan
Contributor Author

I think it would not be possible for joblib to tell where the configuration is stored and how it should be passed into the spawned processes, unless the library somehow registers how to get and set its configuration.

For scikit-learn it could be something like:

import joblib
import sklearn

joblib.register_global_config(getter=sklearn.get_config,
                              setter=sklearn.set_config)

@thomasjpfan
Contributor Author

The less magical way to do it would be to require the getter and setter to be passed to the Parallel object. However, that would require the user to constantly pass them, which is not user friendly.

@tomMoral
Contributor

tomMoral commented Jun 23, 2020 via email

@tomMoral
Contributor

tomMoral commented Jun 23, 2020 via email

@tomMoral
Contributor

After discussing with @pierreglaser, we see 2 solutions for passing global config options to joblib workers. Both need an explicit call to some registration function, as described by @thomasjpfan. Then the question is how often these options are changed:

  • If the options can change often, we could pass them along with every function call and make sure they are set correctly in the worker. This is a low maintenance burden (it can be bundled in the BatchedCalls object, as is done for the parallel_backend options) but can have some impact on performance, as the settings are applied for each call. This performance impact is reduced by auto-batching.

  • If the options are mostly set once, we could pass them to the initializer of the Pool/Executor workers. The workers would need to be re-spawned whenever the options change, which can have some impact if that happens often, but it then does not affect the rest of the computation at all.

@thomasjpfan I would be interested to know your opinion on the different use cases that are expected for the global options.
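For concreteness, a minimal sketch of the second option using the standard library's ProcessPoolExecutor initializer (joblib/loky would do something analogous internally; the executor here is only illustrative):

from concurrent.futures import ProcessPoolExecutor

import sklearn
from sklearn import get_config

def _init_worker(config):
    # Runs once per worker process at spawn time.
    sklearn.set_config(**config)

def get_working_memory():
    return get_config()["working_memory"]

if __name__ == "__main__":
    with sklearn.config_context(working_memory=123):
        # Snapshot the parent's config and replay it in each freshly spawned worker.
        with ProcessPoolExecutor(
            max_workers=2,
            initializer=_init_worker,
            initargs=(get_config(),),
        ) as executor:
            futures = [executor.submit(get_working_memory) for _ in range(2)]
            print([f.result() for f in futures])  # [123, 123]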

@ogrisel
Contributor

ogrisel commented Jun 24, 2020

The respawn option would not work for dask, or at least would not be a "respawn" but rather a "reconfig" of all the workers.

The problem is that the dask workers can dynamically be resized without joblib being aware of it...

@tomMoral
Contributor

Is there no notion of an initializer for dask?

If not, this would mean that it is probably safer to use the first option.

@lesteve
Member

lesteve commented Jun 24, 2020

There are preload scripts that could potentially help: https://docs.dask.org/en/latest/setup/custom-startup.html#preload-scripts.

Also, Dask has a config mechanism (dask.config); I am wondering whether they have the same problem of config modified in the client not affecting the config in the workers, but maybe that is just not that common in Dask.

Just a random thought: is there not a chance that cloudpickle could tackle module variables? I seem to remember there is some magic for handling global variables in __main__, so maybe, just maybe, there is a small chance there.
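For reference, a hedged sketch of what such a preload script could look like for this use case (dask_setup is the documented preload hook; setting scikit-learn's config there is just one illustrative possibility):

# sklearn_preload.py -- passed to workers via `dask-worker ... --preload sklearn_preload.py`
import sklearn

def dask_setup(worker):
    # Runs in each worker process at startup (including workers added later),
    # so the library-specific global option is in place before any task runs.
    sklearn.set_config(working_memory=123)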

@thomasjpfan
Contributor Author

I would be interested to know your opinion on the different use cases that are expected for the global options.

For the moment, the use case for scikit-learn is making sure each job is limited by the working memory configured by the global flag working_memory. In the future, we may have an array_out global config flag to control the data structure output by transformers, which needs to be the same for each spawned job (scikit-learn/scikit-learn#16772).

As for the options, I think I would prefer option 2 if it can work with dask.

@amueller
Contributor

amueller commented Aug 5, 2020

One issue that came up with the solution of register_global_config is that pandas has global configs but doesn't depend on joblib, so we can't really ask them to register their config.

@chrisconlan

One issue that came up with the solution of register_global_config is that pandas has global configs but doesn't depend on joblib, so we can't really ask them to register their config.

Don't be so sure. Pandas is pretty friendly to popular dependent packages.
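(Even without pandas registering anything itself, a getter/setter pair over pandas's public option API could in principle be supplied by user code or by joblib; a hedged sketch, with the option keys chosen purely as examples:)

import pandas as pd

def get_pandas_options(keys=("display.max_rows", "mode.chained_assignment")):
    """Snapshot selected pandas global options in the parent process."""
    return {key: pd.get_option(key) for key in keys}

def set_pandas_options(options):
    """Re-apply a snapshot of pandas options inside a worker process."""
    for key, value in options.items():
        pd.set_option(key, value)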

@ogrisel
Contributor

ogrisel commented Jan 18, 2023

Note that the design of the scikit-learn specific workaround is evolving to correctly handle the propagation of thread-local configuration:

scikit-learn/scikit-learn#25363

@AKuederle

AKuederle commented Mar 21, 2023

I ran into the same issue for a custom global config object in my library.

My workaround for now is similar to the fix in scikit-learn, but I am only using a modified delayed function.
The only downside is that I capture the config when delayed is called, not when the worker is actually spawned. In my case, this doesn't matter, however.

For reference:

import functools
from typing import Callable, List, Tuple, TypeVar

import joblib


T = TypeVar("T")

# Each registered callback returns a (current_value, setter) pair when called.
_parallel_getter_setter: List[Callable[[], Tuple[T, Callable[[T], None]]]] = []

def delayed(func):
    """Wrap a function to be used in a parallel context."""
    # Capture the current config values in the parent process,
    # at the moment delayed() is called.
    _parallel_setter = []
    for g in _parallel_getter_setter:
        _parallel_setter.append(g())

    @functools.wraps(func)
    def inner(*args, **kwargs):
        # Re-apply the captured values in the worker before running the function.
        for value, setter in _parallel_setter:
            setter(value)
        return func(*args, **kwargs)

    return joblib.delayed(inner)

def register_global_parallel_callback(getter: Callable[[], Tuple[T, Callable[[T], None]]]):
    """Register a callback to be used in a parallel context."""
    _parallel_getter_setter.append(getter)

Still, it would be great if an official solution existed!
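A usage sketch of the workaround above (my_library and its get_config/set_config/config_context are purely illustrative stand-ins for a library-specific global config):

import joblib
import my_library  # hypothetical library exposing get_config(), set_config(config), config_context()

def _config_callback():
    # Return the current config together with the setter that restores it.
    return my_library.get_config(), my_library.set_config

register_global_parallel_callback(_config_callback)

with my_library.config_context(some_option=123):
    # delayed() (the wrapper above) captures the config here, in the parent,
    # and re-applies it inside each worker before calling the function.
    results = joblib.Parallel(n_jobs=2)(
        delayed(my_library.get_config)() for _ in range(2)
    )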
