Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: slice feature_names properly in Explanation objects with square .values #3126

Merged
merged 7 commits into from
Aug 22, 2023

Conversation

thatlittleboy
Copy link
Collaborator

@thatlittleboy thatlittleboy commented Jul 22, 2023

Overview

Description of the changes proposed in this pull request:

  • Ensures that we slice feature_names correctly in Explanation objects with "square" .values ("square" := the shap values array has the same number of feature columns as number of samples).
  • Add a test to ensure we don't have regressions in the future.

Explanation

First, some background. For 2D arrays, there is an ambiguity in how to assign the feature names to the slicer index.
E.g. if feature_names is a list of 5 elements, say [a,b,c,d,e], and the shap_values is a (5,5) array, it's ambiguous
whether the axis=0 or axis=1 in shap_values refers to the "feature columns".

Previously, the code assigned axis=0 with higher priority. This causes problems like #2722 because when users index into the explanation object like explanation[0] expecting to get the first sample's shap values / data (e.g. for inputting into waterfall plot),
Slicer would slice this [0] into axis=0, which it incorrectly assumes to be the .feature_names.
This means that the resulting Explanation object's .feature_names would be (a == feature_names[0]) instead of the full feature_names[:] array.

See the test code for an explicit example.

This PR changes such that at least for 2+ dimensional square arrays, we always assume the feature column is on axis=1 instead of axis=0. I believe this to be a more reasonable assumption.
Since most of the time, the 2D shap values arrays are assembled as (# samples, # features).

For non-square arrays, there's no change, since there is no ambiguity which axis the feature_names array should refer to (obviously, it would be the axis with the same length as feature_names).

Checklist

@thatlittleboy thatlittleboy added the bug Indicates an unexpected problem or unintended behaviour label Jul 22, 2023
@thatlittleboy thatlittleboy added this to the 0.43.0 milestone Jul 22, 2023
@codecov
Copy link

codecov bot commented Jul 22, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.07% 🎉

Comparison is base (bbee378) 57.53% compared to head (03bf49c) 57.61%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3126      +/-   ##
==========================================
+ Coverage   57.53%   57.61%   +0.07%     
==========================================
  Files          88       88              
  Lines       12511    12511              
==========================================
+ Hits         7198     7208      +10     
+ Misses       5313     5303      -10     
Files Changed Coverage Δ
shap/_explanation.py 59.28% <100.00%> (+1.99%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@connortann
Copy link
Collaborator

connortann commented Jul 23, 2023

I found it a little hard to review this PR at first look. I have a few questions to help my understanding:

Slicing

First, some background. For 2D arrays, there is an ambiguity in how to assign the feature names to the slicer index.

The fact there the dimension is inferred from the shape of the data seems a bit concerning in the first place: why is the syntax ambiguous? Would it be preferable to have unambiguous syntax throughout, so intentions don't need to be guessed by matching up the lengths of each dimension?

For example, numpy array slicing such as array[i, j] is unambiguous: i slices along dimension 0, and j along dimension 1. What is the canonical behaviour of slicing an Explanation with a single integer, is that defined behaviour?

The Explanation class attributes

As the init method doesn't have a complete docstring, it's not immediately obvious to me what exactly the various attributes mean and what the allowed types and shapes are. I don't think it's clear from the docs how the class is used or invoked: for example, it's not listed as a return type of the various explainers. Would formalising the inputs of the class and the expected return values of __getitem__ be a good place to start?

@thatlittleboy
Copy link
Collaborator Author

thatlittleboy commented Jul 24, 2023

What is the canonical behaviour of slicing an Explanation with a single integer, is that defined behaviour?

Yes, I would say it's well-defined. It simply goes through all of its internal arrays and picks out the object in the first dimension (axis=0), for all the arrays.

why is the syntax ambiguous?

It's not the syntax of slicing that is ambiguous, per se. It's the assignment of the feature names to a particular axis. I'll give an example: this is a perfectly valid invocation of a shap.Explanation object:

import numpy as np, shap

e = shap.Explanation(values=np.arange(25).reshape(5,5), feature_names=list("abcde"))
print(e)

# .values =
# array([[ 0,  1,  2,  3,  4],
#        [ 5,  6,  7,  8,  9],
#        [10, 11, 12, 13, 14],
#        [15, 16, 17, 18, 19],
#        [20, 21, 22, 23, 24]])

Note that I only specified an array-like of shape (5,) for feature_names but a shap values array of (5,5). Which is fair, we usually only have 1 axis that corresponds to the feature names anyway.

Now, what is e[0]? As mentioned above, the behaviour itself is well-defined. It reaches in for the first axis and returns you a new Explanation object sliced appropriately.

You would expect it to be an Explanation object containing the shap values of [0, 1, 2, 3, 4] and a feature names array of [a, b, c, d, e]. And what would it take to achieve this? It means that the ['a','b','c','d','e'] array had to be assigned to the axis=1, and "broadcasted" along axis=0.

But right now, the result is an Explanation with shap_values of [0, 1, 2, 3, 4] and e[0].feature_names is 'a' (!!). Because Slicer is incorrectly treating [a, b, c, d, e] to be along axis=0, and consequently e[0].feature_names returns 'a'.

You could've avoided all of this, if you had done:

e = shap.Explanation(values=np.arange(25).reshape(5,5), feature_names=np.array(list("abcde")).reshape(1,-1))
print(e[0].feature_names)  # ['a' 'b' 'c' 'd' 'e']

I.e., make feature_names during instantiation the same ndims as your values, then it would be unambiguous which axis your ['a' 'b' 'c' 'd' 'e'] belongs to. I.e, axis=1.
But that is not ergonomic / intuitive to use. So here we are.


Would formalising the inputs of the class and the expected return values of getitem be a good place to start?

I think that's out of scope here, but it can be a separate PR on its own. Definitely over time, the in-code documentation needs to be beefed up.

That said, I think a tutorial on how the Explanation object works and slices is due.. It's actually a lot like numpy broadcasting, but it has some weird rules (such as this one w/ the feature_names). So it takes some getting used to.


If the explanation (heh) above is still not clear, I would highly encourage going through this example with a debugger, running through line by line. And seeing why 'a' pops out of e[0].feature_names instead of ['a' 'b' 'c' 'd' 'e'] as one would expect.

@connortann
Copy link
Collaborator

I want to take a moment to appreciate your impeccably well-written explanation, which is very highly appreciated!

heh, thanks for the Explanation Explanation

Copy link
Collaborator

@connortann connortann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some suggestions, all of which are optional so approving. Great work!

@thatlittleboy thatlittleboy force-pushed the issue-2722 branch 2 times, most recently from 765821e to 322ea23 Compare August 21, 2023 17:50
@thatlittleboy thatlittleboy merged commit ad454ec into master Aug 22, 2023
15 checks passed
@thatlittleboy thatlittleboy deleted the issue-2722 branch August 22, 2023 01:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behaviour
Projects
None yet
2 participants