fix: slice `feature_names` properly in Explanation objects with square `.values` #3126

thatlittleboy · 2023-07-22T18:02:51Z

Overview

Description of the changes proposed in this pull request:

Ensures that we slice feature_names correctly in Explanation objects with "square" .values ("square" := the shap values array has the same number of feature columns as number of samples).
Add a test to ensure we don't have regressions in the future.

Explanation

First, some background. For 2D arrays, there is an ambiguity in how to assign the feature names to the slicer index.
E.g. if feature_names is a list of 5 elements, say [a,b,c,d,e], and the shap_values is a (5,5) array, it's ambiguous
whether the axis=0 or axis=1 in shap_values refers to the "feature columns".

Previously, the code assigned axis=0 with higher priority. This causes problems like #2722 because when users index into the explanation object like explanation[0] expecting to get the first sample's shap values / data (e.g. for inputting into waterfall plot),
Slicer would slice this [0] into axis=0, which it incorrectly assumes to be the .feature_names.
This means that the resulting Explanation object's .feature_names would be (a == feature_names[0]) instead of the full feature_names[:] array.

See the test code for an explicit example.

This PR changes such that at least for 2+ dimensional square arrays, we always assume the feature column is on axis=1 instead of axis=0. I believe this to be a more reasonable assumption.
Since most of the time, the 2D shap values arrays are assembled as (# samples, # features).

For non-square arrays, there's no change, since there is no ambiguity which axis the feature_names array should refer to (obviously, it would be the axis with the same length as feature_names).

Checklist

Closes Symmetrical SHAP array shape fails to slice on feature names (e.g., shap_values[:,"feature_1"]) #2699, closes "string index out of range" for waterfall plot #2722, closes Feature Names not set correctly if Feature Set is too small #2394
All pre-commit checks pass.
Unit tests added (if fixing a bug or adding a new feature)
Added entry to CHANGELOG.md (if changes will affect users)

codecov · 2023-07-22T18:31:01Z

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.07% 🎉

Comparison is base (bbee378) 57.53% compared to head (03bf49c) 57.61%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3126      +/-   ##
==========================================
+ Coverage   57.53%   57.61%   +0.07%     
==========================================
  Files          88       88              
  Lines       12511    12511              
==========================================
+ Hits         7198     7208      +10     
+ Misses       5313     5303      -10

Files Changed	Coverage Δ
shap/_explanation.py	`59.28% <100.00%> (+1.99%)`	⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

connortann · 2023-07-23T19:08:44Z

I found it a little hard to review this PR at first look. I have a few questions to help my understanding:

Slicing

First, some background. For 2D arrays, there is an ambiguity in how to assign the feature names to the slicer index.

The fact there the dimension is inferred from the shape of the data seems a bit concerning in the first place: why is the syntax ambiguous? Would it be preferable to have unambiguous syntax throughout, so intentions don't need to be guessed by matching up the lengths of each dimension?

For example, numpy array slicing such as array[i, j] is unambiguous: i slices along dimension 0, and j along dimension 1. What is the canonical behaviour of slicing an Explanation with a single integer, is that defined behaviour?

The Explanation class attributes

As the init method doesn't have a complete docstring, it's not immediately obvious to me what exactly the various attributes mean and what the allowed types and shapes are. I don't think it's clear from the docs how the class is used or invoked: for example, it's not listed as a return type of the various explainers. Would formalising the inputs of the class and the expected return values of __getitem__ be a good place to start?

thatlittleboy · 2023-07-24T13:39:53Z

What is the canonical behaviour of slicing an Explanation with a single integer, is that defined behaviour?

Yes, I would say it's well-defined. It simply goes through all of its internal arrays and picks out the object in the first dimension (axis=0), for all the arrays.

why is the syntax ambiguous?

It's not the syntax of slicing that is ambiguous, per se. It's the assignment of the feature names to a particular axis. I'll give an example: this is a perfectly valid invocation of a shap.Explanation object:

import numpy as np, shap

e = shap.Explanation(values=np.arange(25).reshape(5,5), feature_names=list("abcde"))
print(e)

# .values =
# array([[ 0,  1,  2,  3,  4],
#        [ 5,  6,  7,  8,  9],
#        [10, 11, 12, 13, 14],
#        [15, 16, 17, 18, 19],
#        [20, 21, 22, 23, 24]])

Note that I only specified an array-like of shape (5,) for feature_names but a shap values array of (5,5). Which is fair, we usually only have 1 axis that corresponds to the feature names anyway.

Now, what is e[0]? As mentioned above, the behaviour itself is well-defined. It reaches in for the first axis and returns you a new Explanation object sliced appropriately.

You would expect it to be an Explanation object containing the shap values of [0, 1, 2, 3, 4] and a feature names array of [a, b, c, d, e]. And what would it take to achieve this? It means that the ['a','b','c','d','e'] array had to be assigned to the axis=1, and "broadcasted" along axis=0.

But right now, the result is an Explanation with shap_values of [0, 1, 2, 3, 4] and e[0].feature_names is 'a' (!!). Because Slicer is incorrectly treating [a, b, c, d, e] to be along axis=0, and consequently e[0].feature_names returns 'a'.

You could've avoided all of this, if you had done:

e = shap.Explanation(values=np.arange(25).reshape(5,5), feature_names=np.array(list("abcde")).reshape(1,-1))
print(e[0].feature_names)  # ['a' 'b' 'c' 'd' 'e']

I.e., make feature_names during instantiation the same ndims as your values, then it would be unambiguous which axis your ['a' 'b' 'c' 'd' 'e'] belongs to. I.e, axis=1.
But that is not ergonomic / intuitive to use. So here we are.

Would formalising the inputs of the class and the expected return values of getitem be a good place to start?

I think that's out of scope here, but it can be a separate PR on its own. Definitely over time, the in-code documentation needs to be beefed up.

That said, I think a tutorial on how the Explanation object works and slices is due.. It's actually a lot like numpy broadcasting, but it has some weird rules (such as this one w/ the feature_names). So it takes some getting used to.

If the explanation (heh) above is still not clear, I would highly encourage going through this example with a debugger, running through line by line. And seeing why 'a' pops out of e[0].feature_names instead of ['a' 'b' 'c' 'd' 'e'] as one would expect.

connortann · 2023-07-24T18:14:17Z

I want to take a moment to appreciate your impeccably well-written explanation, which is very highly appreciated!

heh, thanks for the Explanation Explanation

shap/_explanation.py

tests/test_explanation.py

connortann

I added some suggestions, all of which are optional so approving. Great work!

thatlittleboy added the bug Indicates an unexpected problem or unintended behaviour label Jul 22, 2023

thatlittleboy added this to the 0.43.0 milestone Jul 22, 2023

thatlittleboy requested a review from connortann July 22, 2023 18:05

thatlittleboy force-pushed the issue-2722 branch from d72808e to cefa143 Compare July 23, 2023 02:47

connortann reviewed Jul 24, 2023

View reviewed changes

shap/_explanation.py Show resolved Hide resolved

connortann reviewed Jul 24, 2023

View reviewed changes

tests/test_explanation.py Outdated Show resolved Hide resolved

connortann reviewed Jul 24, 2023

View reviewed changes

tests/test_explanation.py Outdated Show resolved Hide resolved

connortann approved these changes Jul 24, 2023

View reviewed changes

thatlittleboy force-pushed the issue-2722 branch from cefa143 to 9d27345 Compare July 29, 2023 15:02

thatlittleboy added 6 commits August 22, 2023 00:45

add test for the expected result

97691db

give axis=1 higher priority on ndim>=2 arrays

54fc273

add test for issue 2699 as well

11f78ef

modify test name and description

910526b

address review comment on test name

24d702e

improve comment clarity

5e72371

thatlittleboy force-pushed the issue-2722 branch 2 times, most recently from 765821e to 322ea23 Compare August 21, 2023 17:50

take code review suggestion

03bf49c

thatlittleboy force-pushed the issue-2722 branch from 322ea23 to 03bf49c Compare August 21, 2023 18:02

thatlittleboy merged commit ad454ec into master Aug 22, 2023
15 checks passed

thatlittleboy deleted the issue-2722 branch August 22, 2023 01:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: slice `feature_names` properly in Explanation objects with square `.values` #3126

fix: slice `feature_names` properly in Explanation objects with square `.values` #3126

thatlittleboy commented Jul 22, 2023 •

edited

codecov bot commented Jul 22, 2023 •

edited

connortann commented Jul 23, 2023 •

edited

thatlittleboy commented Jul 24, 2023 •

edited

connortann commented Jul 24, 2023

connortann left a comment

fix: slice feature_names properly in Explanation objects with square .values #3126

fix: slice feature_names properly in Explanation objects with square .values #3126

Conversation

thatlittleboy commented Jul 22, 2023 • edited

Overview

Explanation

Checklist

codecov bot commented Jul 22, 2023 • edited

Codecov Report

connortann commented Jul 23, 2023 • edited

Slicing

The Explanation class attributes

thatlittleboy commented Jul 24, 2023 • edited

connortann commented Jul 24, 2023

connortann left a comment

Choose a reason for hiding this comment

fix: slice `feature_names` properly in Explanation objects with square `.values` #3126

fix: slice `feature_names` properly in Explanation objects with square `.values` #3126

thatlittleboy commented Jul 22, 2023 •

edited

codecov bot commented Jul 22, 2023 •

edited

connortann commented Jul 23, 2023 •

edited

thatlittleboy commented Jul 24, 2023 •

edited