FIX Make decision tree pickles deterministic #27580
Changes from 3 commits
Changelog diff:

```diff
@@ -424,6 +424,9 @@ Changelog
   :pr:`13649` by :user:`Samuel Ronsin <samronsin>`, initiated by
   :user:`Patrick O'Reilly <pat-oreilly>`.

+- |Fix| Make decision tree pickles deterministic. :pr:`27580` by :user:`Loïc
+  Estève <lesteve>`.
+
 :mod:`sklearn.utils`
 ....................
```

Review comment on this hunk: Let's put this entry directly in 1.3.2 to conform with our security policy: https://github.com/scikit-learn/scikit-learn/security/policy
Test diff:

```diff
@@ -2601,3 +2601,16 @@ def test_sample_weight_non_uniform(make_data, Tree):
     tree_samples_removed.fit(X[1::2, :], y[1::2])

     assert_allclose(tree_samples_removed.predict(X), tree_with_sw.predict(X))
+
+
+def test_deterministic_pickle():
+    tree1 = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
+    tree1.fit(X, y)
+
+    tree2 = DecisionTreeClassifier(random_state=0).fit(X, y)
+    tree2.fit(X, y)
+
+    pickle1 = pickle.dumps(tree1)
+    pickle2 = pickle.dumps(tree2)
+
+    assert pickle1 == pickle2
```

Review thread on `test_deterministic_pickle`:

Reviewer: Could you add a comment about why we have to use two separate estimators with the same seed but different datasets? Naively, I'd have thought we could do something like using different seeds to provoke the bug/demonstrate that it no longer occurs. So a note for people from the future might be useful :D

lesteve: Oops, good catch: they were in the end fitted on the same datasets, but in a complicated way; I fixed this. To sum up, even if we fix […]

Reviewer: I'd still 👍 a comment saying something about "uninitialised memory would lead to the two pickles being different" or some such.

(lesteve marked this conversation as resolved.)
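The property the test above asserts can be illustrated outside scikit-learn with a toy, pure-Python sketch. `fit_model` here is a hypothetical stand-in for a fitted estimator, not anything from the library; the point is only the byte-for-byte pickle comparison pattern used in the PR's test:

```python
import pickle

# Hypothetical stand-in for a fitted estimator; any picklable object works.
def fit_model(data):
    # Fully deterministic "fit": the same input always yields the same state.
    return {"thresholds": sorted(data), "n_samples": len(data)}

data = [3.1, 1.4, 2.7, 0.5]

model1 = fit_model(data)
model2 = fit_model(data)

# The property the PR's test asserts: two models trained identically
# serialize to byte-for-byte identical pickles.
assert pickle.dumps(model1) == pickle.dumps(model2)
```

Comparing raw pickle bytes is a stricter check than comparing predictions: it catches any state that differs between the two fits, including state that never affects inference.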
Reviewer: …or pickle dumps.
Reviewer: I think we should consider this a security fix and rephrase it as "Do not leak data via non-initialized memory in decision tree pickle files and make the generation of those files deterministic." And we should backport and issue a quick scikit-learn 1.3.2, even if it's just for this fix.

Reviewer: Which security vulnerability does it fix? Pickle is still pickle.

Reviewer: A potential sensitive data leak from memory (e.g. a fraction of the training/testing set, the contents of the clipboard, or secret credentials, or whatever). I agree that not much data will typically leak this way, but still, it's ugly.
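The bug class under discussion (stale heap memory ending up in serialized output) can be simulated in pure Python. This is a hedged sketch, not scikit-learn's actual fix: random bytes stand in for a C-level buffer's uninitialized tail, and `CAPACITY`/`N_USED` are made-up names for illustration:

```python
import os
import pickle

CAPACITY = 16  # allocated size of the simulated buffer
N_USED = 4     # number of slots the "algorithm" actually writes

def make_buffer():
    # Simulate a C-style malloc'd array: the tail beyond N_USED holds
    # arbitrary garbage (random bytes stand in for stale heap memory).
    buf = bytearray(os.urandom(CAPACITY))
    for i in range(N_USED):
        buf[i] = i  # only the first N_USED slots are ever initialized
    return buf

# Buggy pattern: serialize the whole capacity, garbage tail included.
buggy1 = pickle.dumps(bytes(make_buffer()))
buggy2 = pickle.dumps(bytes(make_buffer()))

# Fixed pattern: serialize only the initialized prefix.
fixed1 = pickle.dumps(bytes(make_buffer()[:N_USED]))
fixed2 = pickle.dumps(bytes(make_buffer()[:N_USED]))

assert buggy1 != buggy2  # garbage tail makes output nondeterministic
assert fixed1 == fixed2  # excluding it restores byte-identical pickles
```

The buggy variant shows both problems the reviewers raise at once: the output is nondeterministic, and whatever happened to be in the uninitialized region leaks into the serialized file.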