
FIX get config from dispatcher thread in delayed by default #25242

Closed · wants to merge 16 commits

Conversation

glemaitre (Member) commented:

closes #25239

The delayed function wraps the function to execute and fetches the global configuration. However, this "global" dictionary is local to the specific thread that dispatches the job. Therefore, we can end up with the default "global" config instead of the configuration defined in the main thread.

The solution here is to add a thread parameter to delayed, defaulting to the main thread, so that the wrapper retrieves the configuration associated with that thread.
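
As a rough sketch of the mechanism (illustrative only, not the actual patch: _thread_config here stands in for the weakref registry this PR introduces, and the get_config() fallback stands in for the real default handling):

import functools
import threading
import weakref

from sklearn import config_context, get_config

# Hypothetical per-thread registry mirroring this PR's _thread_config: a
# WeakKeyDictionary, so a dead thread does not pin its config in memory.
_thread_config = weakref.WeakKeyDictionary()


def _delayed(function, thread=threading.main_thread()):
    # Read the configuration registered for `thread` (the main thread by
    # default) instead of whichever thread happens to dispatch the job.
    config = _thread_config.get(thread, get_config())

    def function_with_config(*args, **kwargs):
        # Replay the captured configuration around the actual call.
        with config_context(**config):
            return function(*args, **kwargs)

    @functools.wraps(function)
    def delayed_function(*args, **kwargs):
        # joblib.Parallel consumes (function, args, kwargs) triples.
        return function_with_config, args, kwargs

    return delayed_function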

Comment on lines 169 to 177
# check that we have 2 threads registered in the thread config dictionary
from sklearn._config import _thread_config

assert len(_thread_config) == 2

# delete the thread and check that the dictionary does not keep a reference
# to it, since we use a weakref dictionary
del thread
assert len(_thread_config) == 1
glemaitre (Member, author) commented:

The failure on ARM shows that we register more threads than expected because we run the tests in parallel. It is probably best to rely on the weakref dictionary doing its job when joblib kills the thread.
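
As a standalone illustration (not from the PR's test suite) of the weakref behavior being relied on here:

import threading
import weakref

registry = weakref.WeakKeyDictionary()
thread = threading.Thread(target=lambda: None)
registry[thread] = {"working_memory": 64}
assert len(registry) == 1

# Dropping the last strong reference lets the entry vanish; CPython's
# reference counting collects the Thread object immediately.
del thread
assert len(registry) == 0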

@thomasjpfan (Member) left a comment:

Thank you for the fix!

I'm trying to avoid adding more API so that we can introduce this as a bug fix in 1.2.1.

@thomasjpfan (Member) left a comment:

Overall, I think this solution works, and I do not foresee any issues with it.

_delayed(get_working_memory, thread=local_thread)() for _ in range(n_iter)
)

assert results == [get_working_memory()] * n_iter
A Member commented:

If the default working_memory on the main thread were to change, this test would fail:

sklearn.set_config(working_memory=140)
# the following fails
assert results == [get_working_memory()] * n_iter

A less fragile assertion would be to check that the local_thread uses the default global config:

from sklearn._config import _global_config_default
assert results == [_global_config_default["working_memory"]] * n_iter

A Member commented:

Reviewing this test, it seems that I have found a bug in the way the default value of the thread argument of delayed is defined. I am working on a PR against this PR.

@thomasjpfan added this to the 1.2.1 milestone on Jan 3, 2023
glemaitre and others added 2 commits January 4, 2023 11:52
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

return delayed_function
def _delayed(func, thread=threading.current_thread()):
@ogrisel (Member) commented on Jan 4, 2023:

Capturing the current thread here is problematic: a default argument is evaluated only once, when the function is defined, so this makes the behavior of scikit-learn dependent on which thread first imported scikit-learn, and scikit-learn's behavior is no longer thread-symmetric.
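
To make the failure mode concrete, a standalone illustration (not from the PR) of how an early-bound default captures the defining thread:

import threading

def whoami(thread=threading.current_thread()):
    # The default is evaluated once, when `def` runs (here: in the main
    # thread, at import time), not at each call.
    return thread.name

def worker():
    print(whoami())                         # "MainThread"
    print(threading.current_thread().name)  # the worker's own name

t = threading.Thread(target=worker)
t.start()
t.join()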

I tried changing this to:

def _delayed(func, thread=None):
    if thread is None:
        thread = threading.current_thread()

but this does not work either. Thread inspection happens too late when calling Parallel on a generator expression, which is the canonical way to use joblib: the generator, and therefore the _delayed calls inside it, are consumed from joblib's dispatcher thread rather than from the caller's thread. Instead, we should capture the state of the config of the thread that calls Parallel just before the call happens and ship it to all the dispatched tasks.

We just had a live pair-debugging / programming session with @glemaitre on Discord, and I think we came up with a better solution that is a bit more verbose but also much more explicit (and correct ;)). He will open a PR soon, and we will be able to have a more informed technical discussion there.

For the longer term, we could expose a hook in joblib to better handle this kind of configuration propagation, but having a stopgap fix in scikit-learn makes it possible to decouple scikit-learn from the version of joblib.
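
A rough sketch of that capture-and-ship idea (illustrative only, not the actual patch that followed): subclass joblib.Parallel so the configuration is read in __call__, which still runs in the calling thread, and replayed around every dispatched task:

import functools

import joblib
from sklearn import config_context, get_config

class Parallel(joblib.Parallel):
    def __call__(self, iterable):
        # __call__ runs in the calling thread, before joblib's dispatcher
        # thread starts consuming the generator: capture the config here.
        config = get_config()

        def with_config(function):
            @functools.wraps(function)
            def wrapped(*args, **kwargs):
                # Replay the captured config wherever the task executes.
                with config_context(**config):
                    return function(*args, **kwargs)
            return wrapped

        return super().__call__(
            (with_config(function), args, kwargs)
            for function, args, kwargs in iterable
        )

With such a subclass, Parallel(n_jobs=2)(delayed(func)(x) for x in items) would run every task under the caller's configuration, whichever thread or process executes it.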

glemaitre (Member, author) commented:

Refer to #25290 for the better solution.

A Contributor commented:

👍 I was afraid it would come to this kind of more verbose solution. Maybe when this is merged to enable the fix, a separate issue could be opened to discuss the ins and outs of the per-thread config? Unless the behavior that is enforced and supported is already clear, but that did not seem to be the case (the PR where the behavior was enabled does not discuss it much).

glemaitre (Member, author) commented:

It is true that we did not discuss this here, but we did during the pair-programming session.

@ogrisel agreed that we should keep the current behavior, where a thread cannot modify the config while other threads may be using it. It is a bit counter-intuitive given that threads share memory, but the side effects within scikit-learn could be quite bad. For instance, you could get random, non-reproducible errors because the outcome would depend on the config state at a particular moment.
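
For illustration, here is a standalone demo (not from the PR) of the isolation being preserved, where a config change in one thread is not visible from another:

import threading

from sklearn import get_config, set_config

set_config(assume_finite=True)  # affects the current (main) thread only

seen = []
worker = threading.Thread(
    target=lambda: seen.append(get_config()["assume_finite"])
)
worker.start()
worker.join()

print(get_config()["assume_finite"])  # True: the main thread's setting
print(seen)                           # [False]: the worker sees the defaults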

A Member commented:

I think we should discuss the two options at the next dev meeting:

  • option 1: revert sklearn.get/set_config to always use a single shared config without per-thread isolation, as was the case prior to 1.0.
  • option 2: make thread isolation work correctly, at the cost of a bit of verbosity, as implemented in FIX pass explicit configuration to delayed (#25290).

I am in favor of option 2, but I think it is worth discussing with others.

A Member commented:

Option 1 would resolve the issue for multi-threading, but I think the issue will remain for multiprocessing or loky.

I am okay with Option 2. Most of my concern is how third-party developers using joblib would need to update their code to use utils.fixes.delayed in order to work with scikit-learn's config.
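
To illustrate that point, a standalone demo (not from the PR, using plain joblib.delayed) of how a process-based backend loses the parent's configuration regardless of thread-local storage:

from joblib import Parallel, delayed
from sklearn import get_config, set_config

def assume_finite_flag():
    # Runs in a loky worker process, where sklearn is imported afresh
    # and therefore starts from the default configuration.
    return get_config()["assume_finite"]

if __name__ == "__main__":
    set_config(assume_finite=True)  # visible only in the parent process
    flags = Parallel(n_jobs=2, backend="loky")(
        delayed(assume_finite_flag)() for _ in range(2)
    )
    print(flags)  # expected: [False, False], the parent's setting is lost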

A Member commented:

From the developer call: joblib uses another thread to get jobs from a generator, which means Option 1 (a single shared configuration rather than a thread-local one) would resolve the current issue.

@ogrisel (Member) commented on Jan 18, 2023:

Closing in favor of #25363.

@ogrisel closed this on Jan 18, 2023
avm19 pushed a commit to avm19/scikit-learn that referenced this pull request Jan 27, 2023

Successfully merging this pull request may close these issues.

ColumnTransformers don't honor set_config(transform_output="pandas") when multiprocessing with n_jobs>1