
[MRG+1] Isotonic calibration #1176

Closed
wants to merge 8 commits into scikit-learn:master from agramfort:isotonic_calibration

Conversation

agramfort
Member

Calibration module with Platt and isotonic calibration.

A few issues:

  • nice calibration plots
  • add tests for metrics
  • add narrative doc
  • speed up the test execution as much as possible
  • fix multi-class support for the isotonic regression method

X_train, y_train = X[:n_samples], y[:n_samples]
X_train_oob, y_train_oob = X[:n_samples / 2], y[:n_samples / 2]
X_oob, y_oob = X[n_samples / 2:], y[n_samples / 2:]
X_test, y_test = X[n_samples:], y[n_samples:]
Member

You should use sklearn.utils.train_test_split to transform this block into 2 one-liners:

http://scikit-learn.org/dev/modules/generated/sklearn.cross_validation.train_test_split.html
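For illustration, a hedged sketch of what the suggested two one-liners could look like (using the modern sklearn.model_selection import path; in the codebase of that time the function lived in sklearn.cross_validation):

from sklearn.model_selection import train_test_split  # sklearn.cross_validation at the time

# First carve off the held-out test set, then split the training part
# into a fitting portion and a calibration ("oob") portion.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
X_train_oob, X_oob, y_train_oob, y_oob = train_test_split(X_train, y_train, test_size=0.5, random_state=0)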

@mblondel
Member

"Brier score" seems to be the accepted name so I'm inclined to keep it that way.

IIRC, in "Transforming Classifier Scores into Accurate Multiclass Probability Estimates" (KDD 2002), they study several methods and conclude that one-vs-rest is the most practical solution.

@agramfort
Member Author

"Brier score" seems to be the accepted name so I'm inclined to keep it
that way.

hum. I am sure it will pass the consistency brigade :)

IIRC, in "Transforming Classifier Scores into Accurate Multiclass
Probability Estimates" (KDD 2002), they study several methods and conclude
that one-vs-rest in the most practical solution.

ok so IsotonicCalibrator should use OneVsRestClassifier internally and
the proba are estimated for each pair and normalized to sum to 1
right?

@mblondel
Member

I guess IsotonicCalibrator could take the 2d output of decision_function as an input and produce multiclass probabilities. Please read the reference I gave, I don't remember it well.

@ogrisel
Member

ogrisel commented Sep 24, 2012

About the Brier score being not a sklearn consistent score (higher == worse in this case): I don't really know what would be best as the sklearn naming convention is indeed conflicting with this official name.

http://en.wikipedia.org/wiki/Brier_score

I would be +0 for keeping the brier_score name and emphasizing the fact that higher values mean less confident estimates in the docstring and the narrative doc.


def calibration_plot(y_true, y_prob, bins=5, verbose=0):
    """Compute true and predicted probabilities to be used
    for a calibration plot.
Member

PEP 257
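Aside, purely for context: a calibration plot of this kind bins the predicted probabilities and compares, per bin, the mean predicted probability with the observed fraction of positives. A minimal illustrative sketch, not the code from this PR:

import numpy as np

def calibration_plot_sketch(y_true, y_prob, bins=5):
    """Return (fraction of positives, mean predicted probability) per bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, bins + 1)
    # Assign each probability to a bin; the clip keeps prob == 1.0 in the last bin.
    ids = np.clip(np.digitize(y_prob, edges) - 1, 0, bins - 1)
    prob_true = np.array([y_true[ids == i].mean() for i in range(bins) if np.any(ids == i)])
    prob_pred = np.array([y_prob[ids == i].mean() for i in range(bins) if np.any(ids == i)])
    return prob_true, prob_pred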

@agramfort
Member Author

I guess IsotonicCalibrator could take the 2d output of decision_function as an input and produce multiclass probabilities. Please read the reference I gave, I don't remember it well.

indeed in their experimental results they use OvR

I think it's a good idea to fit an IR to each decision score.
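To make that concrete, a rough sketch of the per-class isotonic idea (illustrative only, not the API of this PR): fit one IsotonicRegression per column of the OvR decision scores, then renormalize so each row of probabilities sums to one.

import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.preprocessing import label_binarize

def fit_ovr_isotonic(df_train, y_train, classes):
    """Fit one isotonic regressor per class on the OvR decision scores (>2 classes)."""
    Y = label_binarize(y_train, classes=classes)  # shape (n_samples, n_classes)
    calibrators = []
    for k in range(len(classes)):
        ir = IsotonicRegression(out_of_bounds="clip")
        ir.fit(df_train[:, k], Y[:, k])
        calibrators.append(ir)
    return calibrators

def predict_ovr_isotonic(calibrators, df_test):
    """Calibrate each column, then normalize rows to sum to one."""
    proba = np.column_stack([ir.predict(df_test[:, k])
                             for k, ir in enumerate(calibrators)])
    proba /= proba.sum(axis=1, keepdims=True)
    return proba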

@agramfort
Member Author

About the Brier score being not a sklearn consistent score (higher == worse in this case): I don't really know what would be best as the sklearn naming convention is indeed conflicting with this official name.

http://en.wikipedia.org/wiki/Brier_score

I would be +0 for keeping the brier_score name and emphasizing the fact that higher values mean less confident estimates in the docstring and the narrative doc.

can we call it brier? or brier_error? as it's a mean squared error on the proba.
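For reference, the quantity under discussion is simply the mean squared error between the predicted probabilities and the 0/1 outcomes; a small illustrative snippet (names here are placeholders, not the final brier_score_loss API):

import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and binary outcomes.

    Lower is better: 0.0 for perfect probabilities, 0.25 for always predicting 0.5.
    """
    y_true = np.asarray(y_true, dtype=float)   # 0/1 labels
    y_prob = np.asarray(y_prob, dtype=float)   # predicted P(y == 1)
    return np.mean((y_prob - y_true) ** 2)

# Example: confident, mostly correct predictions give a small value.
print(brier_score([0, 1, 1, 0], [0.1, 0.9, 0.8, 0.3]))  # 0.0375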

@mblondel
Member

One solution would be to introduce decorators to label some functions as scores and others as losses. Of course, this is out of the scope of this PR... (This idea would also be useful to define whether a metric accepts predicted labels or predicted scores, c.f. the AUC issue)

Another solution is to introduce a function negative_brier_score, which is just the brier score multiplied by -1. But I think it's important to use the most commonly used names so -1 for this solution.

So adding a note to the documentation as @ogrisel suggested seems like a good temporary solution to me.

@mblondel mentioned this pull request Sep 29, 2012
@paolo-losi
Member

Some quick comments ...

  • +1 for the name brier_score
  • I would consider passing a cross-validation class to IsotonicCalibration.__init__().
    Using one fold (out-of-bag data) in the case of problems with a small number of samples could be insufficient.
  • I would also add log_likelihood_loss for probability estimation evaluation.
    See "The problem with the Brier score" by Stephen Jewson.
  • I would warn in the doc against using isotonic calibration with fewer than 1000 calibration samples since it tends
    to overfit (Platt's calibration is advisable in that case). I should be able to find a paper on the subject if anyone
    is interested.
  • +1 for default number of bins = sqrt(n_samples) for calibration_plot

@agramfort
Member Author

hi paolo,

thanks for this valuable feedback. I won't work on this in the next few days so if you want to improve on my PR please do so. I'll merge your commits into my PR.

@ogrisel
Member

ogrisel commented Sep 30, 2012

I would warn in the doc against using isotonic calibration with fewer than 1000 calibration samples since it tends to overfit (Platt's calibration is advisable in that case). I should be able to find a paper on the subject if anyone is interested.

+1

@mblondel
Member

mblondel commented Feb 1, 2013

Using one fold (out-of-bag data) in the case of problems with a small number of samples could be insufficient.

If you use more than one fold, how do you combine the results?

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y, X_oob, y_oob):
Member

The above is not consistent with our API. So, a CV object passed to the constructor is a good idea in any case.
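To make the suggestion concrete, a hypothetical sketch of what a cv-in-the-constructor API could look like (names, defaults, and internals are illustrative, not the final implementation):

from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.model_selection import check_cv  # sklearn.cross_validation in older versions

class CalibratedClassifierSketch(BaseEstimator, ClassifierMixin):
    def __init__(self, estimator=None, method="isotonic", cv=3):
        self.estimator = estimator
        self.method = method
        self.cv = cv

    def fit(self, X, y):
        # Derive the calibration folds internally instead of taking X_oob/y_oob in fit.
        cv = check_cv(self.cv, y, classifier=True)
        self.calibrated_estimators_ = []
        for train, calib in cv.split(X, y):
            est = clone(self.estimator).fit(X[train], y[train])
            # ...fit an isotonic/sigmoid calibrator on est's scores for X[calib]...
            self.calibrated_estimators_.append(est)
        return self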

@agramfort
Member Author

rebased on master + addressed some comments

I needed a coding break... ;)

If you feel like playing with it please do...

@ogrisel
Member

ogrisel commented Feb 26, 2013

You should maybe provide a default value for the estimator constructor param: a baseline such as MultinomialNB that is fast, has few hyperparameters and benefits from calibration:

======================================================================
ERROR: sklearn.tests.test_common.test_all_estimators
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/travis/build/scikit-learn/scikit-learn/sklearn/tests/test_common.py", line 68, in test_all_estimators
    estimator = Estimator()
TypeError: __init__() takes exactly 2 arguments (1 given)
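A minimal sketch of that fix, assuming a single estimator parameter (the concrete default below is just an example): give the parameter a None default and resolve it at fit time, so Estimator() in test_all_estimators can be instantiated without arguments.

from sklearn.naive_bayes import MultinomialNB

class IsotonicCalibrator(object):  # illustrative skeleton only
    def __init__(self, estimator=None):
        # A default lets `IsotonicCalibrator()` work in test_all_estimators.
        self.estimator = estimator

    def fit(self, X, y):
        # Resolve the default at fit time so get_params/set_params stay clean.
        estimator = self.estimator if self.estimator is not None else MultinomialNB()
        self.estimator_ = estimator.fit(X, y)
        return self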

@agramfort
Member Author

I've rebased and cleaned up the example. I'm open to discussion regarding API and multiclass handling. I don't know how to take care of the OOB data to estimate the probas.

@agramfort
Member Author

@mblondel any chance you can provide feedback on this? I'll get back to it tomorrow morning.

y : array-like, shape = [n_samples]
    Target values.

X_oob : array-like, shape = [n_samples, n_features]
Member

How about generating the calibration data with a cv object (parameter of the constructor)?

@coveralls

Coverage Status

Changes Unknown when pulling bd727f3 on agramfort:isotonic_calibration into scikit-learn:master.

@agramfort
Member Author

@ogrisel @mblondel have a look :)

y_prob : array, shape = [n_samples]
    Probabilities of the positive class.

bins: int
Member

space before ":" ;)

@jmetzen
Member

jmetzen commented Feb 18, 2015

Alright, I squashed it now into 5 generic commits (calibration module, brier score, tests, examples, narrative doc). I've also added @mblondel to the list of authors in calibration.py

@coveralls

Coverage Status

Coverage increased (+0.02%) to 95.06% when pulling 1a2a9ae on agramfort:isotonic_calibration into f20ff86 on scikit-learn:master.

@agramfort
Member Author

I pushed a couple of commits (cosmit + coverage improvement). The coverage could be slightly improved. Currently we have 95% coverage.

@ogrisel
Member

ogrisel commented Feb 19, 2015

Alright, I squashed it now into 5 generic commits (calibration module, brier score, tests, examples, narrative doc). I've also added @mblondel to the list of authors in calibration.py

Great, thank you very much!

@ogrisel
Member

ogrisel commented Feb 19, 2015

The coverage could be slightly improved. Currently we have 95% coverage.

Indeed there are a couple of exceptions that should be covered by additional assert_raises or assert_raise_message checks:

https://coveralls.io/builds/1947924/source?filename=sklearn%2Fcalibration.py

It should be easy to raise the coverage close to 99% on that file.
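For illustration, this is the kind of check meant here (the invalid input and the exact usage below are placeholders, not the tests that were actually added):

import numpy as np
from sklearn.utils.testing import assert_raises  # moved/removed in recent sklearn versions
from sklearn.calibration import CalibratedClassifierCV

X = np.array([[0.], [1.], [2.], [3.]])
y = np.array([0, 0, 1, 1])

# An unsupported calibration method should raise a ValueError at fit time.
clf = CalibratedClassifierCV(method="foo", cv=2)
assert_raises(ValueError, clf.fit, X, y)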

@coveralls

Coverage Status

Coverage increased (+0.03%) to 95.07% when pulling 838b06e on agramfort:isotonic_calibration into f20ff86 on scikit-learn:master.

@jmetzen
Member

jmetzen commented Feb 19, 2015

I've added some assert_raises checks; coverage of calibration.py should now be effectively 100%.

@ogrisel
Member

ogrisel commented Feb 20, 2015

Thanks. I think this is ok for merge. @agramfort @mblondel any more comments?

@agramfort
Member Author

+1 for merge on my side.

@mblondel
Member

Just to confirm: do we really want sigmoid_calibration to be a public function?

Other than that +1 as well!

@agramfort
Member Author

agramfort commented Feb 20, 2015 via email

@ogrisel
Member

ogrisel commented Feb 20, 2015

+1 for a private sigmoid_calibration as well. I will do the change and merge.
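For readers wondering what that helper does, here is a rough, illustrative approximation of Platt scaling (the actual sigmoid_calibration follows Platt's original fitting procedure; this sketch just fits a near-unregularized logistic regression to the decision scores):

import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid_calibration_sketch(df, y):
    """Fit a sigmoid p = 1 / (1 + exp(-(a * df + b))) to 1-d decision scores df
    against binary labels y -- an approximation of Platt scaling."""
    df = np.asarray(df, dtype=float).reshape(-1, 1)
    lr = LogisticRegression(C=1e6)  # weak regularization, close to a plain sigmoid fit
    lr.fit(df, y)
    a, b = lr.coef_.ravel()[0], lr.intercept_[0]
    return lambda scores: 1.0 / (1.0 + np.exp(-(a * np.asarray(scores) + b)))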

    return label_binarize(y_true, labels)[:, 0]


def brier_score_loss(y_true, y_prob, sample_weight=None, pos_label=None):
Member

Done (these tests are pretty nice; I didn't know they existed). I had to modify test_invariance_string_vs_numbers_labels() slightly such that pos_label is also set for THRESHOLDED_METRICS.

@ogrisel
Member

ogrisel commented Feb 20, 2015

Oops I had already merged, I will cherry pick this.

@ogrisel
Member

ogrisel commented Feb 20, 2015

Done! 🍻 Thank you very much everyone!

@ogrisel closed this Feb 20, 2015
@agramfort
Member Author

agramfort commented Feb 20, 2015 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 20, 2015 via email

@jmetzen
Member

jmetzen commented Feb 20, 2015

Thanks for merging! 🍻

@mblondel
Member

Congrats!

When was this effort started again? Two? Three years ago? :)

plt.tight_layout()

# Plot calibration curve for Gaussian Naive Bayes
plot_calibration_curve(GaussianNB(), "Naive Bayes", 1)
Member

This raises a warning as GaussianNB doesn't support sample_weights. If this is expected behavior, I think the warning should be caught.


Member

Do we really want examples to raise warnings? ;) I thought the use of a classifier that doesn't have sample_weights was intentional.

Member Author

is it me or is there no use of sample_weight in this example?

Member

I thought it was used internally in the calibration. But you are right, it shouldn't warn if it is not used. Will investigate!
