FIX validate input array-like in KDTree and BallTree #18691
This is failing with a segfault that I can reproduce locally with Python 3.8 (not yet tested with Python 3.6).
This is really weird. Ideally, we would isolate a minimal reproducer in a simple Python script that does not depend on pytest, to better understand what's going on.
The code was accessing some invalid properties before the validation step. I'm quite surprised that some checks were passing before... :(
Most probably because the tests were passing an array rather than a list of lists.
Then I think we need an additional test that passes an array-like and checks that it works, whereas it would have failed before. The sequence given is expected to fail.
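A minimal sketch of the kind of test suggested above, assuming a scikit-learn version that includes this fix. The helper `nearest_index` and the sample points are made up for illustration and are not the test added in the PR.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree


def nearest_index(tree_cls, data, point):
    """Build a tree from `data` and return the index of the neighbor closest to `point`."""
    tree = tree_cls(data)
    _, ind = tree.query([point], k=1)
    return int(ind[0, 0])


# The same points, once as a plain list of lists (array-like) and once as an ndarray.
points_as_lists = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
points_as_array = np.asarray(points_as_lists)

for tree_cls in (KDTree, BallTree):
    from_lists = nearest_index(tree_cls, points_as_lists, [0.9, 0.9])
    from_array = nearest_index(tree_cls, points_as_array, [0.9, 0.9])
    # Both input types should give the same answer: the point [1.0, 1.0] at index 1.
    assert from_lists == from_array == 1
```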
LGTM
I am worried about the performance overhead of doing a nested check_array when BallTree / KDTree are used inside a KNeighborsClassifier or similar.
In particular, check_array does a linear sequential pass over the data to detect NaNs and (+/-)inf values, which will be done twice in this case. I don't know whether this is negligible, in particular on multicore machines with 16+ cores. For instance, on KMeans this kind of input validation check starts to be visible now that multicore scalability has been optimized.
But this problem is probably impacting other estimators so we might want to defer its resolution for a later PR. Just something for @jeremiedbb to keep in mind when we start to investigate the perf overhead of those input checks.
@ogrisel temporarily, we could make the
Interesting idea, but this should be done in the scikit-learn code (e.g.
@ogrisel, am I understanding correctly that the context manager should not be added here? It is not clear to me whether I have to change anything in this PR.
The context manager should be used to wrap the calls to BallTree from other classes in scikit-learn that have already validated their input data, to avoid duplicating this check. Since this PR adds a new check, it would be great to do it as part of this PR.
@thomasjpfan did something similar in #18727 and the benchmarks show that it works as expected.
Looking again at this PR, the new input check is in the init of the […]. However, we have a similar problem (already in master) when […]. Let's do this as part of a dedicated PR with proper benchmarking and such. +1 for merging the current PR as it is in the meantime.
Thank you very much for the fix @cmarmo !
@ogrisel, let me check if the failure is related to this PR, then I will open an issue about the performance problem. Thanks for your review!
Reference Issues/PRs
Closes #14650
Supersedes and closes #17400