
FIX Make decision tree pickles deterministic #27580

Merged Oct 17, 2023 (9 commits)

Changes from 3 commits
3 changes: 3 additions & 0 deletions doc/whats_new/v1.4.rst
@@ -424,6 +424,9 @@ Changelog
:pr:`13649` by :user:`Samuel Ronsin <samronsin>`, initiated by
:user:`Patrick O'Reilly <pat-oreilly>`.

- |Fix| Make decision tree pickles deterministic. :pr:`27580` by :user:`Loïc
Member:

Suggested change:
- - |Fix| Make decision tree pickles deterministic. :pr:`27580` by :user:`Loïc
+ - |Fix| Make decision tree pickle files deterministic. :pr:`27580` by :user:`Loïc

or pickle dumps.

Member:

I think we should consider this a security fix and rephrase as "Do not leak data via non-initialized memory in decision tree pickle files and make the generation of those files deterministic."

And we should backport and issue a quick scikit-learn 1.3.2 even if it's just for this fix.

Member:

Which security vulnerability does it fix? Pickle is still pickle.

ogrisel (Member), Oct 16, 2023:

A potential sensitive data leak from memory (e.g. a fraction of the training/testing set, the contents of the clipboard, or secret credentials or whatever). I agree that not much data will typically leak this way, but still, it's ugly.
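To illustrate the point, here is a minimal, POSIX-only sketch (not from the PR) of how malloc can hand back a buffer that still holds bytes from an earlier, freed allocation. The "secret" payload, the 64-byte size, and the offsets are made up for the demo; whether the bytes actually survive depends on the allocator.

# Sketch: malloc does not clear memory, so a new allocation can still
# contain data written into a previously freed block.
import ctypes

libc = ctypes.CDLL(None)  # POSIX only
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

p1 = libc.malloc(64)
# Write at an offset to avoid the allocator's free-list pointers.
ctypes.memmove(p1 + 32, b"secret", 6)
libc.free(p1)

p2 = libc.malloc(64)  # often reuses the freed block
print(ctypes.string_at(p2 + 32, 6))  # may still print b'secret'
libc.free(p2)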

Estève <lesteve>`.

Member:

Let's put this entry directly in 1.3.2 to conform with our security policy:

https://github.com/scikit-learn/scikit-learn/security/policy

:mod:`sklearn.utils`
....................

4 changes: 3 additions & 1 deletion sklearn/tree/_tree.pyx
@@ -908,11 +908,13 @@ cdef class Tree:
     safe_realloc(&self.nodes, capacity)
     safe_realloc(&self.value, capacity * self.value_stride)

-    # value memory is initialised to 0 to enable classifier argmax
     if capacity > self.capacity:
+        # value memory is initialised to 0 to enable classifier argmax
         memset(<void*>(self.value + self.capacity * self.value_stride), 0,
                (capacity - self.capacity) * self.value_stride *
                sizeof(float64_t))
+        # node memory is initialised to 0 to ensure deterministic pickle (padding in Node struct)
+        memset(<void*>(self.nodes + self.capacity), 0, (capacity - self.capacity) * sizeof(Node))

     # if capacity smaller than node_count, adjust the counter
     if capacity < self.node_count:
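For context on the "padding in Node struct" comment: a small sketch, assuming the Node field layout from sklearn/tree/_tree.pxd at the time of this PR. The structured dtype below is a hand-written stand-in for illustration, not scikit-learn's actual NODE_DTYPE.

import numpy as np

# Stand-in for the Cython Node struct layout: seven 8-byte fields
# followed by a single unsigned char.
node_dtype = np.dtype(
    {
        "names": [
            "left_child", "right_child", "feature", "threshold",
            "impurity", "n_node_samples", "weighted_n_node_samples",
            "missing_go_to_left",
        ],
        "formats": [
            np.intp, np.intp, np.intp, np.float64,
            np.float64, np.intp, np.float64, np.uint8,
        ],
    },
    align=True,
)

# On a typical 64-bit platform: 57 bytes of fields padded to 64, i.e.
# 7 padding bytes per node that are written to the pickle verbatim.
print(node_dtype.itemsize)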
13 changes: 13 additions & 0 deletions sklearn/tree/tests/test_tree.py
@@ -2601,3 +2601,16 @@ def test_sample_weight_non_uniform(make_data, Tree):
tree_samples_removed.fit(X[1::2, :], y[1::2])

assert_allclose(tree_samples_removed.predict(X), tree_with_sw.predict(X))


def test_deterministic_pickle():
Member:

Could you add a comment about why we have to use two separate estimators, with the same seed but different datasets?

Naively I'd have thought we could do something like using different seeds to provoke the bug/demonstrate that it no longer occurs. So a note for people from the future might be useful :D

lesteve (Member, Author), Oct 13, 2023:

Oops, good catch: they were in the end fitted on the same datasets, but in a complicated way. I fixed this.

To sum up, even if we fix random_state and fit on the same dataset, the pickles are different due to uninitialised memory.

Member:

I'd still 👍 a comment saying something about "uninitialised memory would lead to the two pickles being different" or some such

lesteve marked this conversation as resolved.
    # Non-regression test: fitting twice on the same data with the same
    # random_state must give byte-identical pickles; uninitialised node
    # memory used to make the two pickle strings differ.
    tree1 = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
    tree2 = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

    pickle1 = pickle.dumps(tree1)
    pickle2 = pickle.dumps(tree2)

    assert pickle1 == pickle2
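A quick way to check the behaviour from an interactive session; this sketch is not part of the PR and assumes a scikit-learn build that includes the fix.

import hashlib
import pickle

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Repeated fits with a fixed random_state should pickle to identical bytes.
digests = {
    hashlib.sha256(
        pickle.dumps(DecisionTreeClassifier(random_state=0).fit(X, y))
    ).hexdigest()
    for _ in range(5)
}
print(digests)  # a single digest once node memory is zero-initialised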