
[MRG+1] Much faster prediction with isotonic regression #6206

Closed
jarfa wants to merge 2 commits into master from fast_isotonic

Conversation

@jarfa (Contributor) commented Jan 21, 2016

This change adds an optional parameter, fast_predict, to IsotonicRegression.fit(). Setting it to True speeds up prediction by three orders of magnitude in my tests, has no meaningful effect on training time, and has no effect at all on the predicted values.

Unless a user cares about storing the fitted values of the training data, it's an unqualified improvement. However, in order to avoid breaking legacy code that depends on values like self.X_ and self.y_, I left the default value of fast_predict at False (in other words, no speedup by default).

This is my first contribution to sklearn, so please let me know if I need anything else to get this merged into master. Sample output from the examples/fast_isotonic.py script (also in this pull request) is below:

Training sample size: 100000
Prediction sample size: 100000
Training the old model took 0.03661 seconds.
Training the new model took 0.03391 seconds.
Predicting with the old model took 3.60884 seconds.
Predicting with the new model took 0.00439 seconds.
Maximum absolute difference between new and old predictions: 0.000000
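
For illustration, a minimal usage sketch of the flag as proposed in this first revision of the PR (fast_predict is the proposed fit parameter from this PR, not an existing scikit-learn API; the data and sizes are arbitrary):

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.RandomState(0)
X = rng.uniform(size=100000)
y = X + rng.normal(scale=0.1, size=100000)

ir = IsotonicRegression(out_of_bounds='clip')
# fast_predict=True would drop redundant interior points after fitting,
# so that predict() interpolates over far fewer knots.
ir.fit(X, y, fast_predict=True)
pred = ir.predict(rng.uniform(size=100000))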

keep_data = np.ones((fitted_len,), dtype=bool)
# Aside from the 1st and last point, remove points whose y values
# are equal to both the point before and the point after it.
keep_data[1:fitted_len-1] = np.logical_or(
    np.not_equal(self._y_[1:fitted_len-1], self._y_[:fitted_len-2]),
    np.not_equal(self._y_[1:fitted_len-1], self._y_[2:]))
Review comment:

You can probably use keep_data[1:-1] here (assuming np arrays behave like Python lists with negative indexing?)

Also in the slices below.
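
A quick illustration (not from the PR) that NumPy slicing does accept negative indices just like Python lists:

import numpy as np

a = np.arange(6)
print(a[1:len(a) - 1])  # [1 2 3 4]
print(a[1:-1])          # [1 2 3 4] -- identical slice using a negative index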

@sam-s (Contributor) commented Jan 21, 2016

I don't think making this change optional makes much sense.
Fast predict should always be on, and if the user wants to preserve the training data, they should do it on their own.
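
A trivial illustration of that suggestion (the caller simply keeps its own references; this snippet is illustrative, not from the PR):

import numpy as np
from sklearn.isotonic import IsotonicRegression

X_train = np.sort(np.random.RandomState(0).uniform(size=1000))
y_train = np.sqrt(X_train)

ir = IsotonicRegression().fit(X_train, y_train)
# The user keeps X_train / y_train around instead of relying on ir.X_ / ir.y_.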

@amueller (Member):
I agree, making this optional is weird. It looks like a great addition.
We could either change X_ directly, which might break people's code (bad), or try a work-around.
We could add store_X=None, have it be True by default, deprecate that, and announce a change of the default value.
But as it only touches the attributes, I think it might be better to add a deprecated property.
Maybe renaming X_ and y_ makes more sense, with the old names kept as deprecated copies?

New names might be fit_X_ and isotonic_y_? I don't know.
We would need to store fit_X_ for two versions for backward compatibility, but everyone would get fast predictions.

@jarfa (Contributor, Author) commented Jan 21, 2016

Interesting idea re: fit_X_ and isotonic_y_, but are we concerned at all about storage? The reason I made this change in the first place is that we were trying to calibrate with a large training set.

@sam-s (Contributor) commented Jan 21, 2016

To be quite frank, I can live with breaking user code which relies on internals.

@TomDLT (Member) commented Jan 22, 2016

You will have to add a test to sklearn/tests/test_isotonic.py to be sure it does not change the predictions.

@amueller (Member):

@sam-s it's not really internals. It's a documented attribute, and therefore part of the API. If it were a private attribute (starting with _) that would be different.
We could also just make the new ones private, _X and _y.
I agree about storage, but I think we have to keep storing it for two versions until we can remove it.
The only way not to store it is to add another parameter, which we would then have to remove once the storage option itself is removed. So that means this will still be creating changes four versions from now.

@jarfa (Contributor, Author) commented Jan 25, 2016

Made a bunch of changes; hopefully the automated checks won't throw any errors. When I created my test function I noticed that it was failing because ~20% of samples differed between the new and old methods; however, the differences were all on the order of 1e-16. After a bit of investigation I found that this was because the interpolation used for prediction wasn't linear interpolation but a 1st-order spline.

I changed the interpolation to linear because:
a) it makes more sense if we're removing the points in the middle of a flat interval, and
b) the difference was negligible; again, it's on the order of 1e-16 for any point in my example.

I left fast_predict as an option for now simply because it's necessary for the test script to prove that this change works. I'm happy to make it the default behaviour, to store the training data as private self._x and self._y, or whatever you all want. Or you could merge these changes and make the faster prediction fit into your grander scheme for scikit-learn; I am, after all, new to contributing to this project.

Thanks!

@jarfa (Contributor, Author) commented Jan 25, 2016

Apparently sklearn only changed from linear to spline interpolation this month (a9ea55f). The pull request didn't say much about why the change was made.

@jarfa closed this on Jan 25, 2016
@jarfa reopened this on Jan 25, 2016
@@ -288,7 +288,7 @@ def _build_y(self, X, y, sample_weight):

         return order_inv

-    def fit(self, X, y, sample_weight=None):
+    def fit(self, X, y, sample_weight=None, fast_predict=False):
Review comment (Member):

don't make it a fit param but an init param.

@jarfa (Contributor, Author) commented Jan 25, 2016

@agramfort: I fixed the whitespace issues you pointed out. Your point about making the parameter part of __init__ instead of fit is valid, but from the above conversation with @amueller it sounds like it's not even going to be an option.

Still waiting for @amueller to opine on how to handle this (should we rename self.X_ and self.y_ to self.fit_X_ and self.isotonic_y_?)

@amueller (Member):
I think adding new private variables and deprecated properties for backward compatibility is the way to go.

@jarfa (Contributor, Author) commented Jan 25, 2016

Done. I'm calling them self._necessary_X_ and self._necessary_y_.

@agramfort (Member):
You should also raise deprecation warnings when trying to access the X_ and y_ attributes. You could do it with properties; see the sketch below.
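
A rough sketch of the property-based deprecation pattern being suggested here, assuming sklearn.utils.deprecated; the class and the message wording are illustrative, not the PR's exact code:

from sklearn.utils import deprecated


class Example(object):
    def __init__(self):
        self._X_ = None  # private storage that fit() would populate

    @property
    @deprecated("Attribute ``X_`` is deprecated in version 0.18 and will be "
                "removed in version 0.20.")
    def X_(self):
        # Reading the old public name emits a DeprecationWarning
        # and returns the private copy.
        return self._X_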

@ogrisel (Member) commented Jan 28, 2016

def test_fast_predict():
# test that the faster prediction (https://github.com/scikit-learn/scikit-learn/pull/6206)
# change doesn't affect out-of-sample predictions.
np.random.seed(123)
Review comment (Member):
Please do not seed the global numpy singleton, to avoid side effects. Instead do:

rng = np.random.RandomState(123)

Then replace occurrences of np.random.random with rng.random(N).
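
A small sketch of that pattern (random_sample is used here because it exists across NumPy versions; the rest is illustrative):

import numpy as np

# Local RNG: reproducible, without mutating the global numpy random state.
rng = np.random.RandomState(123)
x = rng.random_sample(100)              # instead of np.random.random(100)
noise = rng.normal(scale=0.1, size=100)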

@ogrisel (Member) commented Jan 28, 2016

Also, @jarfa, if you are proficient with git, please squash the commits of this PR. If you don't know how to do that, don't worry; the person who merges this PR will do it prior to merging.

@jarfa (Contributor, Author) commented Jan 29, 2016

@ogrisel thanks, that's good feedback. I'll make sure the checks pass, then I'll get to squashing some commits.

@@ -234,6 +234,32 @@ def __init__(self, y_min=None, y_max=None, increasing=True,
         self.increasing = increasing
         self.out_of_bounds = out_of_bounds

+    @property
+    @deprecated("Attribute ``X_`` is deprecated.")
Review comment (Member):
Please state the deprecation versions explicitly in the message:

"Attribute X_ is deprecated in version 0.18 and will be removed in version 0.20."

@jarfa (Contributor, Author) commented Jan 29, 2016

@ogrisel Integrated your changes and squashed them all into 1.

@agramfort (Member):
can you update what's new?

@jarfa (Contributor, Author) commented Jan 31, 2016

@agramfort, are you just asking for a run-down of the most recent changes?

  • Replaced the attributes self.X_ and self.y_ with properties that throw deprecation warnings ("... is deprecated in version 0.18 and will be removed in version 0.20."). They refer to the attributes self._X_ and self._y_.
  • The attributes that hold only the points necessary for prediction are called self._necessary_X_ and self._necessary_y_.
  • The tests no longer call np.random.seed(...); I now create a RandomState object instead.

@agramfort (Member):
Sorry, I meant the doc file whats_new.rst, which documents new features, bug fixes and API changes.

@jarfa (Contributor, Author) commented Jan 31, 2016

@agramfort, 4/4 tests were passing before the latest commit, and as you can see 1/1 has now failed.

I only changed the whats_new.rst file; how exactly did I mess that up?

@agramfort (Member):
Your branch cannot be merged into master due to a conflict in the what's new file. You need to rebase and fix the conflict.

@@ -252,7 +278,7 @@ def _build_f(self, X, y):
             # single y, constant prediction
             self.f_ = lambda x: y.repeat(x.shape)
         else:
-            self.f_ = interpolate.interp1d(X, y, kind='slinear',
+            self.f_ = interpolate.interp1d(X, y, kind='linear',

Review comment:

The change of interp1d from linear to slinear was intended to solve many issues with isotonic regression (#2507). I use IsotonicRegression in production code, and indeed as soon as I upgraded to scikit-learn 0.17.0 I saw a massive decrease in performance: slinear scales really badly with size and can easily get 1000 times slower. I wonder whether changing it back to linear won't bring those issues back.
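
For reference, the cost gap is easy to see with a rough benchmark like the one below (an assumed setup, not from this PR; the data and sizes are arbitrary):

import time
import numpy as np
from scipy import interpolate

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(size=100000))
y = np.cumsum(np.abs(rng.normal(size=100000)))  # monotone values, like isotonic output
xq = np.linspace(x[0], x[-1], 100000)           # query points inside the fitted range

for kind in ('slinear', 'linear'):
    t0 = time.time()
    f = interpolate.interp1d(x, y, kind=kind)   # built once at fit time
    f(xq)                                       # evaluated at predict time
    print('%s: %.3f s' % (kind, time.time() - t0))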

Review comment (Member):

So you're +1 on this change?

Review comment:

No, I just confirmed that linear interpolation is indeed much faster. But I'm actually encouraging a further look into why the interpolation method was changed in the first place, because I'm pretty sure it was related to several isotonic regression bugs reported in version 0.15.0. Just as background: IsotonicRegression used linear interpolation up to 0.15.0, it was changed to slinear in 0.16.0, and now it might be changed back to linear by this PR.

Review comment (Contributor, Author):

When I changed it back to linear I wasn't able to find a good answer as to why it was changed to spline interpolation.

I changed it to linear because

  1. This change made predictions stay the same whether or not I removed the 'unnecessary' points.
  2. It makes more intuitive sense to me.

Review comment (Member):

The change to slinear was introduced in #4111 to deal with duplicate minimal values.

However, we have since had _make_unique introduced by @amueller, which might have fixed that underlying issue.

I believe the non-regression test for the issue fixed in #4111 is supposed to be test_isotonic_regression_ties_min, but reading the code it seems to me that it is not testing what it's supposed to test. I suggest changing it to:

diff --git a/sklearn/tests/test_isotonic.py b/sklearn/tests/test_isotonic.py
index 5058ecd..c5c45ec 100644
--- a/sklearn/tests/test_isotonic.py
+++ b/sklearn/tests/test_isotonic.py
@@ -98,9 +98,9 @@ def test_isotonic_regression():

 def test_isotonic_regression_ties_min():
     # Setup examples with ties on minimum
-    x = [0, 1, 1, 2, 3, 4, 5]
-    y = [0, 1, 2, 3, 4, 5, 6]
-    y_true = [0, 1.5, 1.5, 3, 4, 5, 6]
+    x = [1, 1, 2, 3, 4, 5]
+    y = [1, 2, 3, 4, 5, 6]
+    y_true = [1.5, 1.5, 3, 4, 5, 6]

     # Check that we get identical results for fit/transform and fit_transform
     ir = IsotonicRegression()

I checked on my box and this test still passes with the new code in this PR (with recent scipy); Travis will run the tests with old scipy just in case.

Also, the fact that other non-regression tests such as test_isotonic_regression_ties_secondary_ (which compare against results from R) still pass, and that the calibration curve examples yield the same output, makes me rather confident that this change will not reintroduce old bugs.

Review comment:

That makes sense @ogrisel. I couldn't find exactly the same dataset I was having trouble with back in version 0.15, but I used this branch's implementation on my newer datasets and it seems to work fine. Indeed, the _make_unique function seems to have fixed the problems: when I tried removing it, I got the same errors I was getting before. 👍

Review comment (Member):

Thanks for checking @tiagozortea, this is valuable feedback.

@jarfa force-pushed the fast_isotonic branch 3 times, most recently from 7eff50e to b2cc21a on February 3, 2016 at 22:44
@jarfa (Contributor, Author) commented Feb 3, 2016

I took care of the merge problems and the tests all passed. Aside from the ongoing question about spline vs. linear interpolation, what else needs to be done before merging into master?

One issue: I'm leaving Friday (2/05) morning for a two-week vacation without my laptop. I'll have email access but won't be able to work on code, so after tomorrow any additional changes will either need to be done by somebody else or will have to wait until Feb. 21st.

# We're keeping self.X_ and self.y_ around for backwards compatibility,
# but they should be considered deprecated.
self._necessary_X_ = self._X_[keep_data]
self._necessary_y_ = self._y_[keep_data]
Review comment (Member):

I think this block of code should be moved into the _build_y function and slightly refactored so that it does not rely upon the self._y_ attribute that will be removed once the deprecation period for the y_ attribute is over.

Review comment (Member):

Alternatively, you could change _build_y to return the solution so that fit can use it as a local variable without accessing the deprecated attributes.

@ogrisel (Member) commented Feb 4, 2016

The deprecated attributes X_ and y_ should be removed from the list of documented attributes in the docstring, and the new attributes should be documented instead.



def test_fast_predict():
# test that the faster prediction (https://github.com/scikit-learn/scikit-learn/pull/6206)
Review comment (Member):

Please put the URL on its own comment line to avoid a very long line.

@ogrisel (Member) commented Feb 4, 2016

+1 for merge once my last batch of comments has been addressed including #6206 (comment).

@ogrisel changed the title from "Much faster prediction with isotonic regression" to "[MRG+1] Much faster prediction with isotonic regression" on Feb 4, 2016
@jarfa (Contributor, Author) commented Feb 4, 2016

Thanks for your support @ogrisel, I made your suggested change to test_isotonic_regression_ties_min() and put it in a separate commit.

Again, after tonight (US Eastern time) I'll be too busy hiking around the Colombian coffee country to do more work on this for the next 2 weeks. Therefore if we need any other changes I'd suggest that you or somebody else clone this branch and do whatever else is necessary.

It's been great making this small contribution to the project, I hope I'll be able to do more in the future.

slow_model._build_y(training_X, training_Y, sample_weight=weights)
slow_model.X_min_ = np.min(slow_model._X_)
slow_model.X_max_ = np.max(slow_model._X_)
slow_model._build_f(slow_model._X_, slow_model._y_)
Review comment (Member):

Actually, because of the new _build_y this no longer checks the "slow approach". Maybe add an optional argument named trim_duplicates=True to _build_y to make it possible to test it here, and add an inline comment in the body of _build_y explaining that trim_duplicates is only used to ease unit testing.
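
A free-standing sketch of what such a flag could look like (the function name and signature here are illustrative, not the PR's actual _build_y):

import numpy as np

def trim_flat_points(X, y, trim_duplicates=True):
    # trim_duplicates=True drops interior points whose y value equals both
    # neighbours; they do not change linear interpolation but slow it down.
    # trim_duplicates=False keeps every point, which eases unit-testing the
    # untrimmed ("slow") path.
    if not trim_duplicates or len(y) < 3:
        return X, y
    keep = np.ones(len(y), dtype=bool)
    keep[1:-1] = np.logical_or(np.not_equal(y[1:-1], y[:-2]),
                               np.not_equal(y[1:-1], y[2:]))
    return X[keep], y[keep]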

@ogrisel (Member) commented Feb 5, 2016

Thanks very much for your contribution @jarfa. I will address my last comment (#6206 (comment)) myself, and since @tiagozortea made additional checks in his own environment and gave his +1, I will merge afterwards.

@ogrisel (Member) commented Feb 5, 2016

I merged #6286, closing this. Thanks again for the fix @jarfa and for the tests @tiagozortea.

@ogrisel closed this on Feb 5, 2016
@mjbommar (Contributor):
I'd like to point us back at this thread re: fit/transform vs. fit_transform and why we should keep X and y: #4185 (comment).

I am not a fan of fully deprecating access to my case A in the comments above; it has many applications in the pre-processing and conditioning of data with known monotonic relationships, e.g., in credit scoring models.
