FIX validate input array-like in KDTree and BallTree #18691

cmarmo · 2020-10-27T11:11:03Z

Reference Issues/PRs

Closes #14650
Supersedes and closes #17400

cmarmo · 2020-10-27T17:14:47Z

This is failing with a segfault that I can reproduce locally with Python 3.8 (not yet tested with Python 3.6).
The segfault goes away if I downgrade pytest to version 5.2.1; with 5.3.1 the segfault is back again. I need to investigate more. @ogrisel, @rth, @lesteve ... if you want to have a look... :)

ogrisel · 2020-10-28T09:04:43Z

> The segfault goes away if I downgrade pytest to version 5.2.1; with 5.3.1 the segfault is back again.

This is really weird. Ideally, we would isolate a minimal reproducer as a simple Python script that does not depend on pytest, to better understand what's going on.

cmarmo · 2020-10-28T11:45:26Z

The code was accessing properties of the input (such as its shape) before the validation step. I am quite surprised that some checks were passing before... :(

glemaitre · 2020-10-28T12:49:14Z

Most probably because the tests were passing an array rather than a list of lists.

glemaitre · 2020-10-28T13:06:40Z

Then I think we need an additional test that passes an array-like (e.g. a list of lists) and checks that it now works where it would have failed before. The sequence given is expected to fail.
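A minimal sketch of what such a test could look like, assuming a scikit-learn version that includes this fix (0.24 or later). The helper names `check_array_like_input` and `rejects_ragged_sequence` are hypothetical and are not the tests added in this PR; the ragged object array follows the "Add dtype in tests, following NEP 34" commit.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree


def check_array_like_input(tree_cls):
    """Hypothetical helper: a list of lists should behave like the ndarray."""
    rng = np.random.RandomState(0)
    X = rng.random_sample((20, 3))

    tree_from_array = tree_cls(X, leaf_size=5)
    tree_from_list = tree_cls(X.tolist(), leaf_size=5)  # array-like input

    dist_a, ind_a = tree_from_array.query(X[:5], k=3)
    dist_l, ind_l = tree_from_list.query(X[:5], k=3)

    # Both trees hold the same data, so the neighbors must match.
    np.testing.assert_allclose(dist_a, dist_l)
    np.testing.assert_array_equal(ind_a, ind_l)
    return True


def rejects_ragged_sequence(tree_cls):
    """Hypothetical helper: a ragged sequence should raise, not segfault."""
    # dtype=object is explicit because NumPy (NEP 34) no longer creates
    # ragged arrays implicitly.
    X = np.array([(1, 2, 3), (2, 5), (5, 5, 1, 2)], dtype=object)
    try:
        tree_cls(X)
    except ValueError:
        return True
    return False


for cls in (BallTree, KDTree):
    print(cls.__name__, check_array_like_input(cls), rejects_ragged_sequence(cls))
```

In a real test module these helpers would more naturally be pytest tests parametrized over the two tree classes, with `pytest.raises(ValueError)` for the ragged case.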

sklearn/neighbors/tests/test_ball_tree.py

sklearn/neighbors/_binary_tree.pxi

sklearn/neighbors/tests/test_ball_tree.py

glemaitre

LGTM

sklearn/neighbors/tests/test_ball_tree.py

ogrisel

I am worried about the performance overhead of doing a nested check_array when BallTree / KDTree are used inside a KNeighborsClassifier or similar.

In particular, check_array does a linear sequential pass over the data to detect NaNs and (+/-)inf values, and in this case that pass will be done twice. I don't know whether this is negligible, in particular on multicore machines with 16+ cores. For instance, on KMeans this kind of input validation check starts to be visible now that multicore scalability has been optimized.

But this problem probably affects other estimators too, so we might want to defer its resolution to a later PR. Just something for @jeremiedbb to keep in mind when we start to investigate the perf overhead of those input checks.
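For readers unfamiliar with the check being discussed, here is a small sketch of the finiteness validation that check_array performs by default. The helper `passes_validation` is hypothetical and only serves the illustration; it says nothing about how large the overhead is in practice.

```python
import numpy as np
from sklearn.utils import check_array


def passes_validation(X):
    """Hypothetical helper: True if check_array accepts X with default settings."""
    try:
        check_array(X)  # scans the data for NaN and +/-inf by default
    except ValueError:
        return False
    return True


X_clean = np.array([[0.0, 1.0], [2.0, 3.0]])
X_nan = np.array([[0.0, np.nan], [2.0, 3.0]])
X_inf = np.array([[0.0, 1.0], [np.inf, 3.0]])

for name, X in [("clean", X_clean), ("nan", X_nan), ("inf", X_inf)]:
    print(name, passes_validation(X))
```

Because every element has to be inspected, the cost of this scan grows linearly with the size of the input, which is why running it twice on the same data is the concern raised above.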

glemaitre · 2020-11-02T15:30:44Z

@ogrisel temporarily, we could run the check_array within a config_context(assume_finite=True) context manager.

ogrisel · 2020-11-02T15:35:40Z

> @ogrisel temporarily, we could run the check_array within a config_context(assume_finite=True) context manager.

Interesting idea, but this should be done in the scikit-learn code (e.g. KNeighborsClassifier or similar) that calls into BallTree / KDTree, because that code knows whether the checks have already been done.

cmarmo · 2020-11-03T08:02:50Z

> Interesting idea, but this should be done in the scikit-learn code (e.g. KNeighborsClassifier or similar) that calls into BallTree / KDTree, because that code knows whether the checks have already been done.

@ogrisel, do I understand correctly that the context manager should not be added here? It is not clear to me whether I have to change anything in this PR.

ogrisel · 2020-11-03T09:10:27Z

> @ogrisel, do I understand correctly that the context manager should not be added here? It is not clear to me whether I have to change anything in this PR.

The context manager should be used to wrap the calls to BallTree from other classes in scikit-learn that have already validated their input data, to avoid duplicating this check. Since this PR adds a new check, it would be great to do it as part of this PR.
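A rough sketch of the pattern described here, assuming the caller validates its input once and then skips the finiteness scan in the nested call. The functions `build_tree_validated_once` and `query_validated_once` are hypothetical stand-ins for the calling estimator code, not actual scikit-learn internals.

```python
import numpy as np
from sklearn import config_context
from sklearn.neighbors import BallTree
from sklearn.utils import check_array


def build_tree_validated_once(X, leaf_size=40):
    """Hypothetical caller: validate X once, then build the tree."""
    X = check_array(X)  # full validation, including the NaN / inf scan
    # The data is known to be finite here, so the check_array call inside
    # BallTree can skip its own finiteness scan.
    with config_context(assume_finite=True):
        tree = BallTree(X, leaf_size=leaf_size)
    return X, tree


def query_validated_once(tree, X_query, k=1):
    """Hypothetical caller: same idea for the query path."""
    X_query = check_array(X_query)
    with config_context(assume_finite=True):
        return tree.query(X_query, k=k)


X, tree = build_tree_validated_once([[0, 0], [1, 1], [5, 5]])
dist, ind = query_validated_once(tree, [[0.9, 0.9]], k=1)
print(dist, ind)
```

The setting is restored when the `with` block exits, so code outside the wrapped calls keeps the default validation. Invalid input is still rejected by the caller's own check_array before the tree is ever touched.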

ogrisel · 2020-11-03T09:11:14Z

@thomasjpfan did something similar in #18727 and the benchmarks show that it works as expected.

ogrisel · 2020-11-03T09:32:43Z

Looking again at this PR, the new input check is in the __init__ of BallTree. Building the tree is already very compute intensive, so duplicating the input data check there is probably not a significant performance regression.

However, we have a similar problem (already in master) when KNeighborsClassifier.predict calls BallTree.query or query_radius. In my opinion, this case is much more problematic from a performance point of view. But fixing those would extend the scope of the current PR too much.

Let's do this as part of a dedicated PR with proper benchmarking and such.

+1 for merging the current PR as it is in the meantime.

ogrisel

Thank you very much for the fix @cmarmo !

cmarmo · 2020-11-03T09:35:35Z

@ogrisel, let me check if the failure is related to this PR, then I will open an issue about the performance problem. Thanks for your review!

ogrisel · 2020-11-03T09:35:48Z

The test_poisson_zero_nodes failure is unrelated and already tracked in #18737. Let's merge.

cmarmo added 2 commits October 27, 2020 12:01

Raise error, add a test.

b7df9fc

Temporary what's new.

5117316

github-actions bot added the module:neighbors label Oct 27, 2020

Fix lint error. Final what's new.

cec423f

cmarmo added this to the 0.24 milestone Oct 27, 2020

Merge branch 'master' into tree-segfault

b2a0b3b

cmarmo changed the title ~~Kd tree, ball tree segfault fix~~ [WIP] Kd tree, ball tree segfault fix Oct 27, 2020

Add dtype in tests, following NEP 34.

8fbc490

cmarmo added 2 commits October 28, 2020 10:57

Merge branch 'master' into tree-segfault

9d772cd

Move validation before shape call.

bce45b5

cmarmo changed the title ~~[WIP] Kd tree, ball tree segfault fix~~ [MRG] Kd tree, ball tree segfault fix Oct 28, 2020