[MRG+1] Much faster prediction with isotonic regression #6206
Conversation
keep_data = np.ones((fitted_len,), dtype=bool)
# Aside from the 1st and last point, remove points whose y values
# are equal to both the point before and the point after it.
keep_data[1:fitted_len-1] = np.logical_or(
You can probably use keep_data[1:-1]
here (assuming np arrays behave like Python lists with negative indexing?)
Also in the slices below.
I don't think making this change optional makes much sense. |
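As an aside, the trimming idea in the quoted diff can be sketched in a few lines (illustrative names, not the actual scikit-learn code), including the `keep_data[1:-1]` negative indexing suggested above, which numpy arrays do support:

```python
import numpy as np

def trim_redundant_points(x, y):
    # Drop every interior point whose y value equals both neighbours:
    # such points lie on a flat segment, so linear interpolation through
    # the remaining points is unchanged.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    keep = np.ones(len(y), dtype=bool)
    # Negative indexing works on numpy arrays just like on Python lists.
    keep[1:-1] = np.logical_or(y[1:-1] != y[:-2], y[1:-1] != y[2:])
    return x[keep], y[keep]

xt, yt = trim_redundant_points([0, 1, 2, 3, 4, 5],
                               [0.0, 0.5, 0.5, 0.5, 0.5, 1.0])
# Only the endpoints of the flat run survive: xt is [0, 1, 4, 5]
```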
I agree, making this optional is weird. It looks like a great addition. New names might be |
Interesting idea re: fit_X and isotonic_y, but are we concerned at all about storage? The reason I made this change in the first place is that we were trying to calibrate with a large training set. |
To be quite frank, I can live with breaking user code which relies on internals. |
You will have to add a test to |
@sam-s it's not really internals. It's a documented attribute. Therefore it's part of the API. If it was a private attribute (starting with |
Made a bunch of changes; hopefully the automated checks won't throw any errors. When I created my test function I noticed that it was failing because ~20% of samples differed between the new and old methods, though the differences were all on the order of 1e-16. After a bit of investigation I found that this was because interpolation for prediction wasn't linear: it's a 1st-order spline function. I changed the interpolation to linear because it keeps predictions the same whether or not the 'unnecessary' points are removed, and because it makes more intuitive sense to me. I left fast_predict as an option for now simply because it's necessary for the test script to prove that this change works. I'm happy to make it the default option, to add the training data as private self._x & self._y... or whatever you all want. Or you could all merge these changes and make the faster prediction fit into your grander scheme for scikit-learn; I am, after all, new to contributing to this project. Thanks! |
Apparently sklearn only changed from linear to spline interpolation this month (a9ea55f). The pull request didn't say much about why the change was made. |
@@ -288,7 +288,7 @@ def _build_y(self, X, y, sample_weight):

         return order_inv

-    def fit(self, X, y, sample_weight=None):
+    def fit(self, X, y, sample_weight=None, fast_predict=False):
don't make it a fit param but an init param.
@agramfort : I fixed the white-space issues you pointed out. Your point about making the parameter part of init instead of fit is valid, but from the above conversation with @amueller it sounds like it's not even going to be an option. Still waiting for @amueller to opine on how to handle this (should we rename self.X_ and self.y_ to self.fit_X_ and self.isotonic_y_?) |
I think adding new private variables and deprecated properties for backward compatibility is the way to go. |
Done. I'm calling them |
you should also raise deprecation warnings when trying to access X_ and y_ attributes. You could do it with properties |
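The property-based deprecation suggested above could look roughly like this (a sketch using a plain `warnings.warn`; scikit-learn itself uses its own `@deprecated` decorator, and the class name here is hypothetical):

```python
import warnings

class Model:
    # Old public attributes X_ and y_ are kept as read-only properties
    # that warn on access and forward to private storage.
    def __init__(self, X, y):
        self._X_ = X
        self._y_ = y

    @property
    def X_(self):
        warnings.warn("Attribute X_ is deprecated in version 0.18 and "
                      "will be removed in version 0.20.", DeprecationWarning)
        return self._X_

    @property
    def y_(self):
        warnings.warn("Attribute y_ is deprecated in version 0.18 and "
                      "will be removed in version 0.20.", DeprecationWarning)
        return self._y_
```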
def test_fast_predict():
    # test that the faster prediction (https://github.com/scikit-learn/scikit-learn/pull/6206)
    # change doesn't affect out-of-sample predictions.
    np.random.seed(123)
Please do not seed the global numpy singleton, to avoid side effects. Instead do:
rng = np.random.RandomState(123)
Then replace occurrences of np.random.random(N) with rng.random_sample(N).
Also @jarfa, if you are proficient with git, please squash the commits of this PR. If you don't know how to do that, don't worry: the person who merges this PR will do it prior to merging. |
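For illustration, the seeding advice above amounts to the following minimal sketch:

```python
import numpy as np

# A local RandomState leaves the global numpy RNG untouched, so the test
# has no side effects on other tests running in the same process.
rng = np.random.RandomState(123)
a = rng.random_sample(5)

# Re-creating the same state reproduces the same draws.
rng2 = np.random.RandomState(123)
b = rng2.random_sample(5)
assert np.array_equal(a, b)
```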
@ogrisel thanks, that's good feedback. I'll make sure the checks pass, then I'll get to squashing some commits. |
@@ -234,6 +234,32 @@ def __init__(self, y_min=None, y_max=None, increasing=True,
         self.increasing = increasing
         self.out_of_bounds = out_of_bounds

+    @property
+    @deprecated("Attribute ``X_`` is deprecated.")
Please state the deprecation versions explicitly in the message:
"Attribute X_ is deprecated in version 0.18 and will be removed in version 0.20."
@ogrisel Integrated your changes and squashed them all into 1. |
can you update what's new? |
@agramfort, are you just asking for a run-down of the most recent changes? |
sorry I meant the doc file whats_new.rst that documents the new features, bug fixes and API changes. |
@agramfort 4/4 tests were passing before the latest commit, and as you can see 1/1 have now failed. I only changed the whats_new.rst file: how exactly did I mess that up? |
Your branch cannot be merged with master due to a conflict in the what's new file. You need to rebase and fix the conflict. |
@@ -252,7 +278,7 @@ def _build_f(self, X, y):
             # single y, constant prediction
             self.f_ = lambda x: y.repeat(x.shape)
         else:
-            self.f_ = interpolate.interp1d(X, y, kind='slinear',
+            self.f_ = interpolate.interp1d(X, y, kind='linear',
The change on interp1d from linear to slinear was intended to solve many issues with Isotonic Regression (#2507). I use isotonic regression in production code, and indeed as soon as I upgraded to scikit-learn 0.17.0 I saw a massive decrease in performance: slinear scales really badly with size and can easily be 1000 times slower. I wonder whether changing it back to linear will bring those issues back.
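For what it's worth, the two modes can be compared directly. This small sketch (not code from the PR) shows that `'linear'` and `'slinear'` agree numerically even though they take different code paths:

```python
import numpy as np
from scipy import interpolate

x = np.linspace(0.0, 1.0, 1000)
y = np.sort(np.random.RandomState(0).random_sample(1000))  # monotone y

# 'slinear' builds a first-order spline, while 'linear' does a direct
# piecewise-linear lookup. Mathematically they are the same interpolant,
# but the spline machinery is what scales badly with size.
f_linear = interpolate.interp1d(x, y, kind='linear')
f_slinear = interpolate.interp1d(x, y, kind='slinear')

x_new = np.linspace(0.0, 1.0, 5000)
assert np.allclose(f_linear(x_new), f_slinear(x_new))
```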
So you're +1 on this change?
No, I just confirmed that linear interpolation is indeed much faster. But I'm actually encouraging a further look into why the interpolation method was changed in the first place, because I'm pretty sure it was related to several isotonic regression bugs reported in version 0.15.0. As background: isotonic regression used linear interpolation up to 0.15.0, it was changed to slinear in 0.16.0, and this PR may now change it back to linear.
When I changed it back to linear I wasn't able to find a good answer as to why it was changed to spline interpolation.
I changed it to linear because
- This change made predictions stay the same whether or not I removed the 'unnecessary' points.
- It makes more intuitive sense to me.
The change to slinear was introduced in #4111 to deal with duplicate minimal values. However, we have since had _make_unique introduced by @amueller, which might have fixed that. I believe the non-regression test for the issue fixed in #4111 is supposed to be test_isotonic_regression_ties_min, but reading the code it seems to me that it is not testing what it's supposed to test. I suggest changing it to:
diff --git a/sklearn/tests/test_isotonic.py b/sklearn/tests/test_isotonic.py
index 5058ecd..c5c45ec 100644
--- a/sklearn/tests/test_isotonic.py
+++ b/sklearn/tests/test_isotonic.py
@@ -98,9 +98,9 @@ def test_isotonic_regression():
 def test_isotonic_regression_ties_min():
     # Setup examples with ties on minimum
-    x = [0, 1, 1, 2, 3, 4, 5]
-    y = [0, 1, 2, 3, 4, 5, 6]
-    y_true = [0, 1.5, 1.5, 3, 4, 5, 6]
+    x = [1, 1, 2, 3, 4, 5]
+    y = [1, 2, 3, 4, 5, 6]
+    y_true = [1.5, 1.5, 3, 4, 5, 6]
     # Check that we get identical results for fit/transform and fit_transform
     ir = IsotonicRegression()
I checked on my box and this test still passes with the new code in this PR (with recent scipy); travis will run the tests with old scipy just in case.
Also, the fact that other non-regression tests such as test_isotonic_regression_ties_secondary_ (which compare with the results of R) pass, and that the calibration curve examples yield the same output, makes me rather confident that this change will not reintroduce old bugs.
That makes sense @ogrisel; I couldn't find exactly the same dataset I was having trouble with back in version 0.15. I used this branch's implementation on my newer datasets and it seems to work fine. Indeed the _make_unique function seems to have fixed the problems: I tried removing it and got the same errors I was getting before. 👍
Thanks for checking @tiagozortea, this is valuable feedback.
I took care of the merge problems and the tests all passed. Aside from the ongoing question about spline vs. linear interpolation, what else needs to be done before merging into master? One issue - I'm leaving Friday (2/05) morning for a 2-week vacation without my laptop. I'll have email access but not be able to work on code - so after tomorrow, any additional changes will either need to be done by somebody else or will have to wait until Feb. 21st. |
# We're keeping self.X_ and self.y_ around for backwards compatibility,
# but they should be considered deprecated.
self._necessary_X_ = self._X_[keep_data]
self._necessary_y_ = self._y_[keep_data]
I think this block of code should be moved into the _build_y function and slightly refactored so that it does not rely on the self._y_ attribute, which will be removed once the deprecation period for the y_ attribute is over.
Or alternatively, you could change _build_y to return the solution to fit as local variables, so that fit can use them without accessing the deprecated attributes.
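A rough sketch of that alternative (hypothetical names and a stubbed solver; the real methods take more arguments):

```python
class IsoSketch:
    # _build_y returns the solution instead of storing it, so fit works
    # with local variables and never touches deprecated attributes.
    def _build_y(self, X, y):
        # ... isotonic solution would be computed here; identity stub ...
        return list(X), list(y)

    def fit(self, X, y):
        X_fit, y_fit = self._build_y(X, y)
        self._necessary_X_ = X_fit
        self._necessary_y_ = y_fit
        return self
```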
The deprecated attribute |
def test_fast_predict():
    # test that the faster prediction (https://github.com/scikit-learn/scikit-learn/pull/6206)
Please put the URL on its own comment line to avoid a very long line.
+1 for merge once my last batch of comments has been addressed including #6206 (comment). |
Thanks for your support @ogrisel, I made your suggested change. Again, after tonight (US Eastern time) I'll be too busy hiking around the Colombian coffee country to do more work on this for the next 2 weeks, so if we need any other changes I'd suggest that you or somebody else clone this branch and do whatever else is necessary. It's been great making this small contribution to the project, and I hope I'll be able to do more in the future. |
slow_model._build_y(training_X, training_Y, sample_weight=weights)
slow_model.X_min_ = np.min(slow_model._X_)
slow_model.X_max_ = np.max(slow_model._X_)
slow_model._build_f(slow_model._X_, slow_model._y_)
Actually, because of the new _build_y this no longer checks the "slow approach". Maybe add an optional argument named trim_duplicates=True to _build_y to make it possible to test it here, and add an inline comment in the body of _build_y explaining that trim_duplicates is only used to ease unit testing.
Thanks very much for your contribution @jarfa. I will address my last comment (#6206 (comment)) myself, and since @tiagozortea made additional checks in his own environment and gave his +1, I will merge afterwards. |
I merged #6286, closing this. Thanks again for the fix @jarfa and for the tests @tiagozortea. |
Would like to point us back at this thread. I am not a fan of fully deprecating access to these attributes; see my case A in comments above. It has many applications in the pre-processing and conditioning of data with known monotonic relationships, e.g., in credit scoring models. |
This change adds an optional parameter to IsotonicRegression.fit(), fast_predict. Setting it to True speeds up prediction by 3 orders of magnitude in my tests, doesn't have a meaningful effect on training time, and has no effect at all on the values that are predicted.

Unless a user cares about storing the fitted values of the training data, it's an unqualified improvement. However, in order to avoid breaking legacy code that depends on values like self.X_ and self.y_, I left the default value of fast_predict as False (in other words, no speedup).

This is my first contribution to sklearn, so please let me know if I need anything else to get this merged into master. Sample output from the examples/fast_isotonic.py script (also in this pull request) is below:
Training sample size: 100000
Prediction sample size: 100000
Training the old model took 0.03661 seconds.
Training the new model took 0.03391 seconds.
Predicting with the old model took 3.60884 seconds.
Predicting with the new model took 0.00439 seconds.
Maximum absolute difference between new and old predictions: 0.000000
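The mechanism behind the speedup can be sketched independently of the benchmark script (this is not fast_isotonic.py, and the names are illustrative): an isotonic fit is piecewise constant over long runs, so trimming interior points shrinks the interpolation table without changing predictions.

```python
import numpy as np
from scipy import interpolate

rng = np.random.RandomState(0)
x = np.arange(20000, dtype=float)
# A step-like monotone curve with long flat runs, similar in shape to an
# isotonic fit on noisy data.
y = np.floor(np.sort(rng.random_sample(20000)) * 50) / 50

# Keep only the endpoints of each flat run.
keep = np.ones(len(y), dtype=bool)
keep[1:-1] = np.logical_or(y[1:-1] != y[:-2], y[1:-1] != y[2:])

f_full = interpolate.interp1d(x, y, kind='linear')
f_trim = interpolate.interp1d(x[keep], y[keep], kind='linear')

x_new = rng.random_sample(1000) * (len(x) - 1)
# Far fewer knots, identical predictions.
assert keep.sum() < len(y) // 10
assert np.allclose(f_full(x_new), f_trim(x_new))
```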