fix: slice feature_names properly in Explanation objects with square .values #3126
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## master #3126 +/- ##
==========================================
+ Coverage 57.53% 57.61% +0.07%
==========================================
Files 88 88
Lines 12511 12511
==========================================
+ Hits 7198 7208 +10
+ Misses 5313 5303 -10
I found it a little hard to review this PR at first look. I have a few questions to help my understanding:
Slicing
The fact that the dimension is inferred from the shape of the data seems a bit concerning in the first place: why is the syntax ambiguous? Would it be preferable to have unambiguous syntax throughout, so intentions don't need to be guessed by matching up the lengths of each dimension? For example, numpy array slicing such as
The Explanation class attributes
As the
Yes, I would say it's well-defined. It simply goes through all of its internal arrays and picks out the object in the first dimension (axis=0), for all the arrays.
It's not the syntax of slicing that is ambiguous, per se. It's the assignment of the feature names to a particular axis. I'll give an example: this is a perfectly valid invocation of a
import numpy as np, shap
e = shap.Explanation(values=np.arange(25).reshape(5,5), feature_names=list("abcde"))
print(e)
# .values =
# array([[ 0, 1, 2, 3, 4],
# [ 5, 6, 7, 8, 9],
# [10, 11, 12, 13, 14],
# [15, 16, 17, 18, 19],
# [20, 21, 22, 23, 24]])
Note that I only specified an array-like of shape
Now, what is
You would expect it to be an Explanation object containing the shap values of
But right now, the result is an Explanation with shap_values of
You could've avoided all of this, if you had done:
e = shap.Explanation(values=np.arange(25).reshape(5,5), feature_names=np.array(list("abcde")).reshape(1,-1))
print(e[0].feature_names)  # ['a' 'b' 'c' 'd' 'e']
I.e., make
I think that's out of scope here, but it can be a separate PR on its own. Definitely over time, the in-code documentation needs to be beefed up. That said, I think a tutorial on how the
If the explanation (heh) above is still not clear, I would highly encourage going through this example with a debugger, running through line by line. And seeing why
I want to take a moment to appreciate your impeccably well-written explanation, which is very highly appreciated!
heh, thanks for the Explanation Explanation
I added some suggestions, all of which are optional so approving. Great work!
Overview
Description of the changes proposed in this pull request:
Slice feature_names correctly in Explanation objects with "square" .values ("square" := the shap values array has the same number of feature columns as number of samples).
Explanation
First, some background. For 2D arrays, there is an ambiguity in how to assign the feature names to the slicer index.
E.g. if feature_names is a list of 5 elements, say [a,b,c,d,e], and the shap_values is a (5,5) array, it's ambiguous whether the axis=0 or axis=1 in shap_values refers to the "feature columns".
Previously, the code assigned axis=0 with higher priority. This causes problems like #2722 because when users index into the explanation object like explanation[0] expecting to get the first sample's shap values / data (e.g. for inputting into waterfall plot), Slicer would slice this [0] into axis=0, which it incorrectly assumes to be the .feature_names.
This means that the resulting Explanation object's .feature_names would be (a == feature_names[0]) instead of the full feature_names[:] array.
See the test code for an explicit example.
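To make the failure mode concrete without depending on shap itself, here is a minimal pure-Python sketch. The helper `slice_explanation` and its `names_axis` parameter are hypothetical names invented for this illustration; this is a simplified model of the behaviour described above, not the library's Slicer code.

```python
def slice_explanation(values, feature_names, index, names_axis):
    """Index into `values` along axis 0, as `explanation[index]` would.

    Hypothetical helper: if the feature names are believed to live on
    axis 0, they get indexed together with the values; otherwise they
    are passed through untouched.
    """
    sliced_values = values[index]
    if names_axis == 0:
        sliced_names = feature_names[index]
    else:
        sliced_names = feature_names
    return sliced_values, sliced_names


# A "square" case: 5 samples x 5 features, with 5 feature names.
values = [[5 * row + col for col in range(5)] for row in range(5)]
names = list("abcde")

# Names assumed to sit on axis 0 (the previous priority).
old = slice_explanation(values, names, 0, names_axis=0)
# Names assumed to sit on axis 1 (samples on axis 0, features on axis 1).
new = slice_explanation(values, names, 0, names_axis=1)

print(old)  # ([0, 1, 2, 3, 4], 'a')
print(new)  # ([0, 1, 2, 3, 4], ['a', 'b', 'c', 'd', 'e'])
```

In the first call the single name 'a' is paired with five values, which is the mismatch that a downstream plot would then trip over.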
This PR changes the behaviour so that, at least for 2+ dimensional square arrays, we always assume the feature column is on axis=1 instead of axis=0. I believe this to be a more reasonable assumption, since most of the time the 2D shap values arrays are assembled as (# samples, # features).
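The tie-break rule can be sketched roughly as follows. The function `infer_feature_axis` and the flag `prefer_axis_1_when_square` are hypothetical and exist only for this illustration; the actual change in the PR may be structured quite differently.

```python
def infer_feature_axis(values_shape, n_names, prefer_axis_1_when_square=True):
    """Guess which axis of `values_shape` a flat list of names describes.

    Hypothetical model: an axis is a candidate when its length equals the
    number of names. When more than one axis matches (the "square" case),
    either prefer axis 1 (new rule) or fall back to the first match,
    i.e. axis 0 (old rule).
    """
    matches = [axis for axis, size in enumerate(values_shape) if size == n_names]
    if not matches:
        return None  # the names do not fit any axis
    if len(matches) > 1 and prefer_axis_1_when_square and 1 in matches:
        return 1
    return matches[0]


print(infer_feature_axis((5, 5), 5))                                   # 1
print(infer_feature_axis((5, 5), 5, prefer_axis_1_when_square=False))  # 0
print(infer_feature_axis((3, 5), 5))                                   # 1
print(infer_feature_axis((5, 3), 5))                                   # 0
```

Only the square shape changes its answer between the two rules; when exactly one axis matches the number of names, both rules agree.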
For non-square arrays, there's no change, since there is no ambiguity which axis the feature_names array should refer to (obviously, it would be the axis with the same length as feature_names).
Checklist
CHANGELOG.md (if changes will affect users)