[MRG+1] Much faster prediction with isotonic regression #6206
Conversation
keep_data = np.ones((fitted_len,), dtype=bool)
# Aside from the 1st and last point, remove points whose y values
# are equal to both the point before and the point after it.
keep_data[1:fitted_len-1] = np.logical_or(
You can probably use keep_data[1:-1]
here (assuming np arrays behave like Python lists with negative indexing?)
Also in the slices below.
I don't think making this change optional makes much sense. |
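As an aside, the trimming idea in the quoted diff can be sketched in a few lines (illustrative names, not the actual scikit-learn code), including the `keep_data[1:-1]` negative indexing suggested above, which numpy arrays do support:

```python
import numpy as np

def trim_redundant_points(x, y):
    # Drop every interior point whose y value equals both neighbours:
    # such points lie on a flat segment, so linear interpolation through
    # the remaining points is unchanged.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    keep = np.ones(len(y), dtype=bool)
    # Negative indexing works on numpy arrays just like on Python lists.
    keep[1:-1] = np.logical_or(y[1:-1] != y[:-2], y[1:-1] != y[2:])
    return x[keep], y[keep]

xt, yt = trim_redundant_points([0, 1, 2, 3, 4, 5],
                               [0.0, 0.5, 0.5, 0.5, 0.5, 1.0])
# Only the endpoints of the flat run survive: xt is [0, 1, 4, 5]
```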
I agree, making this optional is weird. It looks like a great addition. New names might be |
Interesting idea re: fit_X and isotonic_y, but are we concerned at all about storage? The reason I made this change in the first place is that we were trying to calibrate with a large training set. |
To be quite frank, I can live with breaking user code which relies on internals. |
You will have to add a test to |
@sam-s it's not really internals. It's a documented attribute. Therefore it's part of the API. If it was a private attribute (starting with |
Made a bunch of changes; hopefully the automated checks won't throw any errors. When I created my test function I noticed that it was failing because ~20% of samples differed between the new and old methods, though the differences were all on the order of 1e-16. After a bit of investigation I found that this was because interpolation for prediction wasn't linear: it's a 1st-order spline function. I changed the interpolation to linear because it keeps predictions the same whether or not the 'unnecessary' points are removed, and because it makes more intuitive sense to me. I left fast_predict as an option for now simply because it's necessary for the test script to prove that this change works. I'm happy to make it the default option, to add the training data as private self._x & self._y... or whatever you all want. Or you could all merge these changes and make the faster prediction fit into your grander scheme for scikit-learn; I am, after all, new to contributing to this project. Thanks! |
Apparently sklearn only changed from linear to spline interpolation this month (a9ea55f). The pull request didn't say much about why the change was made. |
@@ -288,7 +288,7 @@ def _build_y(self, X, y, sample_weight):

         return order_inv

-    def fit(self, X, y, sample_weight=None):
+    def fit(self, X, y, sample_weight=None, fast_predict=False):
don't make it a fit param but an init param.
@agramfort : I fixed the white-space issues you pointed out. Your point about making the parameter part of init instead of fit is valid, but from the above conversation with @amueller it sounds like it's not even going to be an option. Still waiting for @amueller to opine on how to handle this (should we rename self.X_ and self.y_ to self.fit_X_ and self.isotonic_y_?) |
I think adding new private variables and deprecated properties for backward compatibility is the way to go. |
Done. I'm calling them |
you should also raise deprecation warnings when trying to access X_ and y_ attributes. You could do it with properties |
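The property-based deprecation suggested above could look roughly like this (a sketch using a plain `warnings.warn`; scikit-learn itself uses its own `@deprecated` decorator, and the class name here is hypothetical):

```python
import warnings

class Model:
    # Old public attributes X_ and y_ are kept as read-only properties
    # that warn on access and forward to private storage.
    def __init__(self, X, y):
        self._X_ = X
        self._y_ = y

    @property
    def X_(self):
        warnings.warn("Attribute X_ is deprecated in version 0.18 and "
                      "will be removed in version 0.20.", DeprecationWarning)
        return self._X_

    @property
    def y_(self):
        warnings.warn("Attribute y_ is deprecated in version 0.18 and "
                      "will be removed in version 0.20.", DeprecationWarning)
        return self._y_
```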
def test_fast_predict():
    # test that the faster prediction (https://github.com/scikit-learn/scikit-learn/pull/6206)
    # change doesn't affect out-of-sample predictions.
    np.random.seed(123)
Please do not seed the global numpy singleton, to avoid side effects. Instead do:
rng = np.random.RandomState(123)
Then replace occurrences of np.random.random(N) with rng.random_sample(N).
Also @jarfa, if you are proficient with git, please squash the commits of this PR. If you don't know how to do that, don't worry: the person who merges this PR will do it prior to merging. |
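For illustration, the seeding advice above amounts to the following minimal sketch:

```python
import numpy as np

# A local RandomState leaves the global numpy RNG untouched, so the test
# has no side effects on other tests running in the same process.
rng = np.random.RandomState(123)
a = rng.random_sample(5)

# Re-creating the same state reproduces the same draws.
rng2 = np.random.RandomState(123)
b = rng2.random_sample(5)
assert np.array_equal(a, b)
```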
@ogrisel thanks, that's good feedback. I'll make sure the checks pass, then I'll get to squashing some commits. |
@@ -234,6 +234,32 @@ def __init__(self, y_min=None, y_max=None, increasing=True,
         self.increasing = increasing
         self.out_of_bounds = out_of_bounds

+    @property
+    @deprecated("Attribute ``X_`` is deprecated.")
Please state the deprecation versions explicitly in the message:
"Attribute X_ is deprecated in version 0.18 and will be removed in version 0.20."
@ogrisel Integrated your changes and squashed them all into 1. |
can you update what's new? |
@agramfort, are you just asking for a run-down of the most recent changes? |
sorry I meant the doc file whats_new.rst that documents the new features, bug fixes and API changes. |
@agramfort 4/4 tests were passing before the latest commit, and as you can see 1/1 have now failed. I only changed the whats_new.rst file: how exactly did I mess that up? |
Your branch cannot be merged with master due to a conflict in the what's new file. You need to rebase and fix the conflict. |
@@ -252,7 +278,7 @@ def _build_f(self, X, y):
             # single y, constant prediction
             self.f_ = lambda x: y.repeat(x.shape)
         else:
-            self.f_ = interpolate.interp1d(X, y, kind='slinear',
+            self.f_ = interpolate.interp1d(X, y, kind='linear',
The change on interp1d from linear to slinear was intended to solve many issues with Isotonic Regression (#2507). I use isotonic regression in production code, and indeed as soon as I upgraded to scikit-learn 0.17.0 I saw a massive decrease in performance: slinear scales really badly with size and can easily be 1000 times slower. I wonder whether changing it back to linear will bring those issues back.
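For what it's worth, the two modes can be compared directly. This small sketch (not code from the PR) shows that `'linear'` and `'slinear'` agree numerically even though they take different code paths:

```python
import numpy as np
from scipy import interpolate

x = np.linspace(0.0, 1.0, 1000)
y = np.sort(np.random.RandomState(0).random_sample(1000))  # monotone y

# 'slinear' builds a first-order spline, while 'linear' does a direct
# piecewise-linear lookup. Mathematically they are the same interpolant,
# but the spline machinery is what scales badly with size.
f_linear = interpolate.interp1d(x, y, kind='linear')
f_slinear = interpolate.interp1d(x, y, kind='slinear')

x_new = np.linspace(0.0, 1.0, 5000)
assert np.allclose(f_linear(x_new), f_slinear(x_new))
```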
So you're +1 on this change?
No, I just confirmed that linear interpolation is indeed much faster. But I'm actually encouraging a further look into why the interpolation method was changed in the first place, because I'm pretty sure it was related to several isotonic regression bugs reported in version 0.15.0. As background: isotonic regression used linear interpolation up to 0.15.0, it was changed to slinear in 0.16.0, and this PR may now change it back to linear.
When I changed it back to linear I wasn't able to find a good answer as to why it was changed to spline interpolation.
I changed it to linear because
- This change made predictions stay the same whether or not I removed the 'unnecessary' points.
- It makes more intuitive sense to me.
The change to slinear was introduced in #4111 to deal with duplicate minimal values. However, we have since had _make_unique introduced by @amueller, which might have fixed that. I believe the non-regression test for the issue fixed in #4111 is supposed to be test_isotonic_regression_ties_min, but reading the code it seems to me that it is not testing what it's supposed to test. I suggest changing it to:
diff --git a/sklearn/tests/test_isotonic.py b/sklearn/tests/test_isotonic.py
index 5058ecd..c5c45ec 100644
--- a/sklearn/tests/test_isotonic.py
+++ b/sklearn/tests/test_isotonic.py
@@ -98,9 +98,9 @@ def test_isotonic_regression():
 def test_isotonic_regression_ties_min():
     # Setup examples with ties on minimum
-    x = [0, 1, 1, 2, 3, 4, 5]
-    y = [0, 1, 2, 3, 4, 5, 6]
-    y_true = [0, 1.5, 1.5, 3, 4, 5, 6]
+    x = [1, 1, 2, 3, 4, 5]
+    y = [1, 2, 3, 4, 5, 6]
+    y_true = [1.5, 1.5, 3, 4, 5, 6]
     # Check that we get identical results for fit/transform and fit_transform
     ir = IsotonicRegression()
I checked on my box and this test still passes with the new code in this PR (with recent scipy); travis will run the tests with old scipy just in case.
Also, the fact that other non-regression tests such as test_isotonic_regression_ties_secondary_ (which compare with the results of R) pass, and that the calibration curve examples yield the same output, makes me rather confident that this change will not reintroduce old bugs.
That makes sense @ogrisel; I couldn't find exactly the same dataset I was having trouble with back in version 0.15. I used this branch's implementation on my newer datasets and it seems to work fine. Indeed the _make_unique function seems to have fixed the problems: I tried removing it and got the same errors I was getting before. 👍
Thanks for checking @tiagozortea, this is valuable feedback.
I took care of the merge problems and the tests all passed. Aside from the ongoing question about spline vs. linear interpolation, what else needs to be done before merging into master? One issue - I'm leaving Friday (2/05) morning for a 2-week vacation without my laptop. I'll have email access but not be able to work on code - so after tomorrow, any additional changes will either need to be done by somebody else or will have to wait until Feb. 21st. |
# We're keeping self.X_ and self.y_ around for backwards compatibility,
# but they should be considered deprecated.
self._necessary_X_ = self._X_[keep_data]
self._necessary_y_ = self._y_[keep_data]
I think this block of code should be moved into the _build_y function and slightly refactored so that it does not rely on the self._y_ attribute, which will be removed once the deprecation period for the y_ attribute is over.
Or alternatively, you could change _build_y to return the solution to fit as local variables, so that fit can use them without accessing the deprecated attributes.
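A rough sketch of that alternative (hypothetical names and a stubbed solver; the real methods take more arguments):

```python
class IsoSketch:
    # _build_y returns the solution instead of storing it, so fit works
    # with local variables and never touches deprecated attributes.
    def _build_y(self, X, y):
        # ... isotonic solution would be computed here; identity stub ...
        return list(X), list(y)

    def fit(self, X, y):
        X_fit, y_fit = self._build_y(X, y)
        self._necessary_X_ = X_fit
        self._necessary_y_ = y_fit
        return self
```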
The deprecated attribute |
def test_fast_predict():
    # test that the faster prediction (https://github.com/scikit-learn/scikit-learn/pull/6206)
Please put the URL on its own comment line to avoid a very long line.
+1 for merge once my last batch of comments has been addressed including #6206 (comment). |
Thanks for your support @ogrisel, I made your suggested change. Again, after tonight (US Eastern time) I'll be too busy hiking around the Colombian coffee country to do more work on this for the next 2 weeks, so if we need any other changes I'd suggest that you or somebody else clone this branch and do whatever else is necessary. It's been great making this small contribution to the project, and I hope I'll be able to do more in the future. |
slow_model._build_y(training_X, training_Y, sample_weight=weights)
slow_model.X_min_ = np.min(slow_model._X_)
slow_model.X_max_ = np.max(slow_model._X_)
slow_model._build_f(slow_model._X_, slow_model._y_)
Actually, because of the new _build_y this no longer checks the "slow approach". Maybe add an optional argument named trim_duplicates=True to _build_y to make it possible to test it here, and add an inline comment in the body of _build_y explaining that trim_duplicates is only used to ease unit testing.
Thanks very much for your contribution @jarfa. I will address my last comment (#6206 (comment)) myself, and since @tiagozortea made additional checks in his own environment and gave his +1, I will merge afterwards. |
I merged #6286, closing this. Thanks again for the fix @jarfa and for the tests @tiagozortea. |
Would like to point us back at this thread. I am not a fan of fully deprecating access to these attributes; see my case A in comments above. It has many applications in the pre-processing and conditioning of data with known monotonic relationships, e.g., in credit scoring models. |
This change adds an optional parameter to IsotonicRegression.fit(), fast_predict. Setting it to True speeds up prediction by 3 orders of magnitude in my tests, doesn't have a meaningful effect on training time, and has no effect at all on the values that are predicted.

Unless a user cares about storing the fitted values of the training data, it's an unqualified improvement. However, in order to avoid breaking legacy code that depends on values like self.X_ and self.y_, I left the default value of fast_predict as False (in other words, no speedup).

This is my first contribution to sklearn, so please let me know if I need anything else to get this merged into master. Sample output from the examples/fast_isotonic.py script (also in this pull request) is below:
Training sample size: 100000
Prediction sample size: 100000
Training the old model took 0.03661 seconds.
Training the new model took 0.03391 seconds.
Predicting with the old model took 3.60884 seconds.
Predicting with the new model took 0.00439 seconds.
Maximum absolute difference between new and old predictions: 0.000000
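The mechanism behind the speedup can be sketched independently of the benchmark script (this is not fast_isotonic.py, and the names are illustrative): an isotonic fit is piecewise constant over long runs, so trimming interior points shrinks the interpolation table without changing predictions.

```python
import numpy as np
from scipy import interpolate

rng = np.random.RandomState(0)
x = np.arange(20000, dtype=float)
# A step-like monotone curve with long flat runs, similar in shape to an
# isotonic fit on noisy data.
y = np.floor(np.sort(rng.random_sample(20000)) * 50) / 50

# Keep only the endpoints of each flat run.
keep = np.ones(len(y), dtype=bool)
keep[1:-1] = np.logical_or(y[1:-1] != y[:-2], y[1:-1] != y[2:])

f_full = interpolate.interp1d(x, y, kind='linear')
f_trim = interpolate.interp1d(x[keep], y[keep], kind='linear')

x_new = rng.random_sample(1000) * (len(x) - 1)
# Far fewer knots, identical predictions.
assert keep.sum() < len(y) // 10
assert np.allclose(f_full(x_new), f_trim(x_new))
```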