Library specific global options are not passed to jobs #1071
Interesting one... I would (wild-)guess that this is related to changing a module variable in the main process, which is not picked up in the subprocesses?
If I had to make an extra wild guess (feeling a bit adventurous this morning 😉), I would think the culprit is very likely cloudpickle.
I think it would not be possible for joblib to tell where the configuration is stored and how it should be passed into the spawned processes, unless the library somehow registers how to get and set its configuration. For scikit-learn it could be something like:

```python
import joblib

joblib.register_global_config(getter=sklearn.get_config,
                              setter=sklearn.set_config)
```
The less magical way to do it would be to require it to be passed to the …
> I think it would not be possible for joblib to tell "where the configuration is stored and should be passed into the spawned processes" unless the library somehow registers how to set and get the configuration from the library.

That is exactly the problem. `joblib` cannot introspect the state of a module (i.e. its globals) to pass it to the newly spawned interpreter without having indications on where to look.
> For scikit-learn it could be something like:
>
> ```python
> import joblib
>
> joblib.register_global_config(getter=sklearn.get_config,
>                               setter=sklearn.set_config)
> ```
Yes, we could have something that registers a hook in `joblib` to force the initialization of the module globals before running. The issue is: how to make sure this hook is run on all workers with the reusable executor and auto respawn.

This is something we have discussed a bit before, but as we have mainly considered keeping the executor alive, this would require implementing some kind of `run_on_all_workers` mechanism, which is not trivial.

But for configuration like this, a solution would be to rely on initializers and simply kill the existing executor if `register_global_config` is called (it should not be called often).
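The initializer idea can be sketched with `concurrent.futures`. Threads are used below only to keep the example self-contained; process pools accept the same `initializer=`/`initargs=` arguments, which is exactly where a registered setter could be invoked. The config names are toy stand-ins, not joblib API.

```python
# Sketch of the initializer approach: every worker runs a setup hook once,
# before executing any task.
import threading
from concurrent.futures import ThreadPoolExecutor

_worker_config = threading.local()

def _init_worker(captured_config):
    # Hook run once per worker at startup; with joblib this is where a
    # registered setter (e.g. sklearn.set_config) would be invoked.
    _worker_config.value = dict(captured_config)

def read_working_memory(_):
    return _worker_config.value["working_memory"]

parent_config = {"working_memory": 64}  # snapshot taken in the parent
with ThreadPoolExecutor(max_workers=2,
                        initializer=_init_worker,
                        initargs=(parent_config,)) as executor:
    results = list(executor.map(read_working_memory, range(4)))
print(results)  # [64, 64, 64, 64]
```

Killing and recreating the executor whenever the registered config changes would then re-run the initializer with a fresh snapshot.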
It looks like I misunderstood your proposition. What you are saying is to pass the arguments for each call, and not register them when they are set? It is probably easier to do something consistent this way, as handling global state is a pain.
Maybe @pierreglaser has a better idea of how this could be done with `pickle`.
I wonder if there is any mechanism in `pickle` to save state for modules.
After discussing with @pierreglaser, we can see two solutions to pass global config options to joblib workers. Both need an explicit call to some registration function, as described by @thomasjpfan. Then the question is how often these options are changed:
@thomasjpfan I would be interested to know your opinion on the different use cases that are expected for the global options.
The respawn option would not work for dask, or at least it would not be a "respawn" but rather a "reconfig" of all the workers. The problem is that the dask workers can be dynamically resized without joblib being aware of it...
Is there no notion of … ? So this would mean that it is probably safer to use the first option.
There are preload scripts that could potentially help: https://docs.dask.org/en/latest/setup/custom-startup.html#preload-scripts. Also, Dask has some config mechanism (…). Just a random thought: would there not be a chance that cloudpickle tackles module variables? I seem to remember there is some magic for tackling global variables in …
For the moment, the use case for scikit-learn is making sure each job is limited by the working memory configured by the global flag: … As for the options, I think I would prefer option 2 if it can work with dask.
One issue that came up with the solution of …
Don't be so sure. Pandas is pretty friendly to popular dependent packages.
Note that the design of the scikit-learn specific workaround is evolving to correctly handle the propagation of thread-local configuration:
I ran into the same issue for a custom global config object in my library. My workaround for now is similar to the fix in scikit-learn, but I am only using a modified `delayed`. For reference:

```python
import functools
from typing import Callable, List, Tuple, TypeVar

import joblib

T = TypeVar("T")

_parallel_getter_setter: List[Callable[[], Tuple[T, Callable[[T], None]]]] = []


def delayed(func):
    """Wrap a function to be used in a parallel context."""
    # Snapshot each registered config (value + setter) at dispatch time.
    _parallel_setter = []
    for g in _parallel_getter_setter:
        _parallel_setter.append(g())

    @functools.wraps(func)
    def inner(*args, **kwargs):
        # Restore the snapshots inside the worker before running the task.
        for value, setter in _parallel_setter:
            setter(value)
        return func(*args, **kwargs)

    return joblib.delayed(inner)


def register_global_parallel_callback(getter: Callable[[], Tuple[T, Callable[[T], None]]]):
    """Register a callback to be used in a parallel context."""
    _parallel_getter_setter.append(getter)
```

Still, it would be great if an official solution existed!
Libraries such as pandas and scikit-learn have global config options. These options are not passed to jobs spawned by joblib (with a multiprocessing backend):
Related: #1064
Scikit-learn specific fix: scikit-learn/scikit-learn#17634