Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX validate input array-like in KDTree and BallTree #18691

Merged
merged 17 commits into from
Nov 3, 2020

Conversation

cmarmo
Copy link
Member

@cmarmo cmarmo commented Oct 27, 2020

Reference Issues/PRs

Closes #14650
Superseded and closes #17400

@cmarmo cmarmo added this to the 0.24 milestone Oct 27, 2020
@cmarmo cmarmo changed the title Kd tree, ball tree segfault fix [WIP] Kd tree, ball tree segfault fix Oct 27, 2020
@cmarmo
Copy link
Member Author

cmarmo commented Oct 27, 2020

This is failing with a segfault that I can reproduce locally with python 3.8 (not yet tested with python 3.6).
The segfault went away if I downgrade pytest to version 5.2.1, with 5.3.1 the segfault is back again. I need to investigate more, @ogrisel, @rth, @lesteve ... if you want to have a look... :)

@ogrisel
Copy link
Member

ogrisel commented Oct 28, 2020

The segfault went away if I downgrade pytest to version 5.2.1, with 5.3.1 the segfault is back again.

This is really weird. Ideally it would be great to isolate a minimal reproducer that does not depend on pytest in a simple python script to understand better what's going on.

@cmarmo cmarmo changed the title [WIP] Kd tree, ball tree segfault fix [MRG] Kd tree, ball tree segfault fix Oct 28, 2020
@cmarmo
Copy link
Member Author

cmarmo commented Oct 28, 2020

The code was asking for some invalid properties before the validation step, quite surprised that some checks were passing before... :(

@glemaitre
Copy link
Member

Most probably because we were passing an array but not a list of list in the tests.

@glemaitre
Copy link
Member

Then I think that we need an additional test where we can pass an array-like and check that it works while it would have failed before. The sequence given is expected to fail.

sklearn/neighbors/tests/test_ball_tree.py Outdated Show resolved Hide resolved
sklearn/neighbors/tests/test_ball_tree.py Show resolved Hide resolved
sklearn/neighbors/tests/test_ball_tree.py Show resolved Hide resolved
sklearn/neighbors/_binary_tree.pxi Outdated Show resolved Hide resolved
@glemaitre glemaitre changed the title [MRG] Kd tree, ball tree segfault fix FIX validate input array-like in KDTree and BallTree Oct 28, 2020
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

sklearn/neighbors/tests/test_ball_tree.py Outdated Show resolved Hide resolved
sklearn/neighbors/tests/test_ball_tree.py Outdated Show resolved Hide resolved
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am worried about the performance overhead of doing a nested check_array when BallTree / KDTree are used inside a KNearestNeigborsClassifier or similar.

In particular the check_array does a linear sequential pass over the data to detect NaNs and (+/-)inf values which will be done twice in this case. I don't know whether or not this is negligible or not in particular on multicore machines with 16+ cores. For instance on KMeans this kinds of input validation checks starts to be visible now that multicore scalability has been optimized.

But this problem is probably impacting other estimators so we might want to defer its resolution for a later PR. Just something for @jeremiedbb to keep in mind when we start to investigate the perf overhead of those input checks.

@glemaitre
Copy link
Member

@ogrisel temporarily, we could make the check_array within a context manager set_config(assume_finite=True).

@ogrisel
Copy link
Member

ogrisel commented Nov 2, 2020

@ogrisel temporarily, we could make the check_array within a context manager set_config(assume_finite=True).

Interesting idea, but this should be done in the scikit-learn code (e.g. KNearestNeigborsClassifier or similar) that calls into BallTree / KDTree because this code knows whether the checks have already been done or not.

@cmarmo
Copy link
Member Author

cmarmo commented Nov 3, 2020

Interesting idea, but this should be done in the scikit-learn code (e.g. KNearestNeigborsClassifier or similar) that calls into BallTree / KDTree because this code knows whether the checks have already been done or not.

@ogrisel, am I understanding correctly saying that the context manager should not be added here? It is not clear to me if I have to change anything in this PR.

@ogrisel
Copy link
Member

ogrisel commented Nov 3, 2020

@ogrisel, am I understanding correctly saying that the context manager should not be added here? It is not clear to me if I have to change anything in this PR.

The context manager should be used to wrap the calls to BallTree from other classes in scikit-learn that have already validated there input data to avoid duplicating this check. Since this PR adds a new check, it would be great to do it as part of this PR.

@ogrisel
Copy link
Member

ogrisel commented Nov 3, 2020

@thomasjpfan did something similar in #18727 and the benchmarks show that it works as expected.

@ogrisel
Copy link
Member

ogrisel commented Nov 3, 2020

Looking again at this PR, the new input check is in the init of the BallTree. This operation is very compute intensive, so duplicating the input data check is probably not a significant performance regression.

However we have a similar problem (already in master) when KNearestNeigbhorsClassifier.predict calls BallTree.query or query_radius. In my opinion, this case is much more problematic from a performance point of view. But fixing those will extend the scope of the current PR too much.

Let's do this as part of a dedicated PR with proper benchmarking and such.

+1 for merging the current PR as it is in the mean time.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the fix @cmarmo !

@cmarmo
Copy link
Member Author

cmarmo commented Nov 3, 2020

@ogrisel, let me check if the failure is related to this PR, then I will open an issue about the performance problem. Thanks for your review!

@ogrisel
Copy link
Member

ogrisel commented Nov 3, 2020

The test_poisson_zero_nodes failure is unrelated and already tracked in #18737. Let's merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KDTree / BallTree seg fault
4 participants