
Library specific global options are not passed to jobs #1071

Open
thomasjpfan opened this issue Jun 18, 2020 · 15 comments

Comments

@thomasjpfan
Contributor

thomasjpfan commented Jun 18, 2020

Libraries such as pandas and scikit-learn have global config options. These options are not passed to the jobs spawned by joblib (with a multiprocessing backend):

from joblib import Parallel, delayed
from sklearn import get_config, config_context

def get_working_memory():
    return get_config()['working_memory']

with config_context(working_memory=123):
    results = Parallel(n_jobs=2)(
        delayed(get_working_memory)() for _ in range(2)
    )

results
# [1024, 1024]  (the global default, not the expected [123, 123])

Related: #1064
Scikit-learn specific fix: scikit-learn/scikit-learn#17634

@lesteve
Member

lesteve commented Jun 19, 2020

Interesting one... I would (wild-)guess that this is related to changing a module variable in the main process, which is not picked up in the subprocesses?

@lesteve
Member

lesteve commented Jun 19, 2020

If I had to make an extra wild guess (feeling a bit adventurous this morning 😉), I would think the culprit is very likely cloudpickle.

@thomasjpfan
Contributor Author

I think it would not be possible for joblib to tell where the configuration is stored and how it should be passed into the spawned processes, unless the library somehow registers how to get and set its configuration.

For scikit-learn it could be something like:

import joblib
import sklearn

joblib.register_global_config(getter=sklearn.get_config,
                              setter=sklearn.set_config)

@thomasjpfan
Contributor Author

The less magical way to do it would be to require the getter and setter to be passed to the Parallel object. However, that would require the user to constantly pass them, which is not user friendly.

@tomMoral
Contributor

tomMoral commented Jun 23, 2020 via email

@tomMoral
Contributor

tomMoral commented Jun 23, 2020 via email

@tomMoral
Contributor

After discussing with @pierreglaser, we see 2 solutions for passing global config options to joblib workers. Both need an explicit call to some registration function, as described by @thomasjpfan. Then the question is how often these options are changed:

  • If the options can change often, we could pass them along with every function call and make sure they are set correctly in the worker. This is a low maintenance burden (it can be bundled in the BatchedCalls object, as is done for the parallel_backend options) but can have some impact on performance, as the settings are applied for each call. This performance impact is reduced by auto-batching.

  • If the options are mostly set once, we could pass them to the initializer of the Pool/Executor workers. The workers would need to be re-spawned whenever the options change, which can have some impact if that happens often, but it then does not affect the rest of the computation at all.

@thomasjpfan I would be interested to know your opinion on the different use cases that are expected for the global options.
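For concreteness, a minimal sketch of the second option using the standard library's ProcessPoolExecutor initializer (joblib/loky would do something analogous internally; the executor here is only illustrative):

from concurrent.futures import ProcessPoolExecutor

import sklearn
from sklearn import get_config

def _init_worker(config):
    # Runs once per worker process at spawn time.
    sklearn.set_config(**config)

def get_working_memory():
    return get_config()["working_memory"]

if __name__ == "__main__":
    with sklearn.config_context(working_memory=123):
        # Snapshot the parent's config and replay it in each freshly spawned worker.
        with ProcessPoolExecutor(
            max_workers=2,
            initializer=_init_worker,
            initargs=(get_config(),),
        ) as executor:
            futures = [executor.submit(get_working_memory) for _ in range(2)]
            print([f.result() for f in futures])  # [123, 123]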

@ogrisel
Contributor

ogrisel commented Jun 24, 2020

The respawn option would not work for dask, or at least would not be a "respawn" but rather a "reconfig" of all the workers.

The problem is that the dask workers can dynamically be resized without joblib being aware of it...

@tomMoral
Contributor

Is there no notion of an initializer for dask?

If not, this would mean that it is probably safer to use the first option.

@lesteve
Member

lesteve commented Jun 24, 2020

There are preload scripts that could potentially help: https://docs.dask.org/en/latest/setup/custom-startup.html#preload-scripts.

Also, Dask has a config mechanism (dask.config); I am wondering whether they have the same problem of config modified in the client not affecting the config in the workers, but maybe that is just not that common in Dask.

Just a random thought: is there not a chance that cloudpickle could tackle module variables? I seem to remember there is some magic for handling global variables in __main__, so maybe, just maybe, there is a small chance there.
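For reference, a hedged sketch of what such a preload script could look like for this use case (dask_setup is the documented preload hook; setting scikit-learn's config there is just one illustrative possibility):

# sklearn_preload.py -- passed to workers via `dask-worker ... --preload sklearn_preload.py`
import sklearn

def dask_setup(worker):
    # Runs in each worker process at startup (including workers added later),
    # so the library-specific global option is in place before any task runs.
    sklearn.set_config(working_memory=123)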

@thomasjpfan
Contributor Author

I would be interested to know your opinion on the different use cases that are expected for the global options.

For the moment, the use case for scikit-learn is making sure each job is limited by the working memory configured by the global flag working_memory. In the future, we may have an array_out global config flag to control the data structure output by transformers, which needs to be the same for each spawned job (scikit-learn/scikit-learn#16772).

As for the options, I think I would prefer option 2 if it can work with dask.

@amueller
Contributor

amueller commented Aug 5, 2020

One issue that came up with the solution of register_global_config is that pandas has global configs but doesn't depend on joblib, so we can't really ask them to register their config.

@chrisconlan

One issue that came up with the solution of register_global_config is that pandas has global configs but doesn't depend on joblib, so we can't really ask them to register their config.

Don't be so sure. Pandas is pretty friendly to popular dependent packages.
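(Even without pandas registering anything itself, a getter/setter pair over pandas's public option API could in principle be supplied by user code or by joblib; a hedged sketch, with the option keys chosen purely as examples:)

import pandas as pd

def get_pandas_options(keys=("display.max_rows", "mode.chained_assignment")):
    """Snapshot selected pandas global options in the parent process."""
    return {key: pd.get_option(key) for key in keys}

def set_pandas_options(options):
    """Re-apply a snapshot of pandas options inside a worker process."""
    for key, value in options.items():
        pd.set_option(key, value)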

@ogrisel
Contributor

ogrisel commented Jan 18, 2023

Note that the design of the scikit-learn specific workaround is evolving to correctly handle the propagation of thread-local configuration:

scikit-learn/scikit-learn#25363

@AKuederle

AKuederle commented Mar 21, 2023

I ran into the same issue for a custom global config object in my library.

My workaround for now is similar to the fix in scikit-learn, but I am only using a modified delayed function.
The only downside is that I capture the config when delayed is called, not when the worker is actually spawned. In my case, this doesn't matter, however.

For reference:

import functools
from typing import Callable, List, Tuple, TypeVar

import joblib


T = TypeVar("T")

# Each registered callback returns a (current_value, setter) pair when called.
_parallel_getter_setter: List[Callable[[], Tuple[T, Callable[[T], None]]]] = []

def delayed(func):
    """Wrap a function to be used in a parallel context."""
    # Capture the current config values in the parent process,
    # at the moment delayed() is called.
    _parallel_setter = []
    for g in _parallel_getter_setter:
        _parallel_setter.append(g())

    @functools.wraps(func)
    def inner(*args, **kwargs):
        # Re-apply the captured values in the worker before running the function.
        for value, setter in _parallel_setter:
            setter(value)
        return func(*args, **kwargs)

    return joblib.delayed(inner)

def register_global_parallel_callback(getter: Callable[[], Tuple[T, Callable[[T], None]]]):
    """Register a callback to be used in a parallel context."""
    _parallel_getter_setter.append(getter)

Still, it would be great if an official solution existed!
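A usage sketch of the workaround above (my_library and its get_config/set_config/config_context are purely illustrative stand-ins for a library-specific global config):

import joblib
import my_library  # hypothetical library exposing get_config(), set_config(config), config_context()

def _config_callback():
    # Return the current config together with the setter that restores it.
    return my_library.get_config(), my_library.set_config

register_global_parallel_callback(_config_callback)

with my_library.config_context(some_option=123):
    # delayed() (the wrapper above) captures the config here, in the parent,
    # and re-applies it inside each worker before calling the function.
    results = joblib.Parallel(n_jobs=2)(
        delayed(my_library.get_config)() for _ in range(2)
    )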
