
Explainable boosting parameters #6335

Draft · wants to merge 27 commits into master

Conversation

@veneres commented Feb 21, 2024

Hi all,
First of all, thank you to the maintainers for keeping this project updated; this is one of the best gradient boosting libraries I have ever tried.
I am making this pull request because two years ago I forked LightGBM to create a proof of concept for this paper: https://dl.acm.org/doi/10.1145/3477495.3531840
Since I am now improving and expanding the experiments for a new interpretable LambdaMART, I decided to give back to this project by polishing the proof of concept and opening a pull request.

Issue addressed

This pull request is a starting point to address issue #3905 (also mentioned in #2302) by adding interpretable characteristics of the training process.

Parameters added

I added three parameters:

  • tree_interaction_constraints
  • max_tree_interactions
  • max_interactions

tree_interaction_constraints

The parameter tree_interaction_constraints, similar to interaction_constraints, limits the interactions between features. While interaction_constraints controls which features can appear in the same branch, tree_interaction_constraints controls which features can appear in the same tree.
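To make the difference concrete, here is a minimal sketch of how the two constraint styles would sit side by side in a params dict; note that tree_interaction_constraints is the parameter proposed in this PR and is not part of any released LightGBM version:

```python
import numpy as np
import lightgbm as lgb

# Toy data: 4 features, with an interaction between features 0 and 1.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = X[:, 0] * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=500)

params = {
    "objective": "regression",
    # Branch-level: features within a group may appear on the same path
    # from root to leaf; features from different groups may not.
    "interaction_constraints": [[0, 1], [2, 3]],
    # Tree-level (this PR, unreleased): features within a group may
    # appear in the same tree; features from different groups may not.
    "tree_interaction_constraints": [[0, 1], [2, 3]],
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)
```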

max_tree_interactions

The parameter max_tree_interactions greedily limits the number of distinct features that can appear in the same tree, e.g., if max_tree_interactions = 5, then after the fifth distinct feature has been used in a tree, no new features are considered for that tree, and further splits are chosen among those same 5 features by maximum gain.
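Conceptually, the greedy selection behaves like the following toy sketch (hypothetical helper, not the PR's actual C++ tree-learner code):

```python
max_tree_interactions = 5

def candidate_features(all_features, used_in_tree):
    """Features eligible for the next split of the current tree."""
    if len(used_in_tree) >= max_tree_interactions:
        # Cap reached: only features already split on in this tree stay
        # candidates; the best of them is still chosen by split gain.
        return set(used_in_tree)
    return set(all_features)
```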

max_interactions

The parameter max_interactions limits the number of interactions that can be used in the whole forest in a greedy fashion.
Every tree is associated with a set of features, and we can say that those features interact with each other.
If a tree uses a subset of features of another tree, we say that the two trees use the same set of feature interactions.
For example, if max_interactions = 5 and the first 5 trees use 5 disjoint sets of features, the sixth tree will be forced to use a subset of one of the feature sets used by the first 5 trees.
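The forest-level budget can be pictured with a similar toy sketch (again hypothetical names, not the PR's C++ code):

```python
max_interactions = 5
used_feature_sets = []  # one frozenset per distinct interaction set seen so far

def tree_is_allowed(tree_features):
    """Allow a tree if its features fit within an already-used interaction
    set, or if the budget of distinct sets is not yet exhausted."""
    fs = frozenset(tree_features)
    if any(fs <= s for s in used_feature_sets):
        return True  # reuses (a subset of) an existing interaction set
    return len(used_feature_sets) < max_interactions
```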

Tests

Other than having extensively tested the new parameters in the proofs of concept for my papers, I added three unit tests in python-package/lightgbm/engine.py, namely:

  • test_tree_interaction_constraints
  • test_max_tree_interactions
  • test_max_interactions

These functions are only a starting point for testing the new features added.

I have not tested the integration in R.

@veneres (Author) commented Feb 21, 2024

@microsoft-github-policy-service agree

@veneres veneres marked this pull request as draft February 21, 2024 14:24
@veneres (Author) commented Feb 21, 2024

I changed the status of this pull request to "draft" while waiting for the maintainers' opinions on the parameters added.

@shiyu1994 (Collaborator)

Thanks for your contribution. I believe this will be a very helpful feature! I'll look into this PR.

@jameslamb (Collaborator)

Some initial questions I have:

1. how should this be made consistent with all the other feature-selection parameters?

  • interaction_constraints (docs)
  • forced_splits_filename (docs)
  • feature_fraction (docs)
  • feature_fraction_bynode (docs)

There are already multiple ways to control which features are used, and I'm concerned that adding 3 (!) more parameters will be more confusing to users of the library than it is helpful. And that it might be very difficult to provide expected behavior in the face of all these combinations.

For example, consider the following mix:

  • forced_splits_filename specifying splits on 3 features
  • max_tree_interactions = 2

These can't both be satisfied, so what happens? A runtime error before boosting begins?

Or consider this:

  • feature_fraction < 1.0
  • tree_interaction_constraints for a small subset of features

What happens if a randomly-selected set of features violates all of the tree_interaction_constraints? Runtime error? Create an empty tree and then move on to the next round and sample features again? Keep re-sampling until a compliant set of features is found?

2. How should these work for multiclass classification?

In multiclass classification, LightGBM trains one tree per class. So for num_class=5, n_iterations=100, you'll get up to 500 trees ("up to" because early stopping might be triggered earlier).

I think it's not uncommon for different features to be more important for predicting one class than another. So should tree_interaction_constraints still literally apply to each tree? Or should it limit the number of features used across all trees within 1 iteration? And should max_interactions be per-class? Or across all trees in the model?

3. How should this work with dart boosting?

With dart, trees can be dropped after each iteration. See, for example, https://lightgbm.readthedocs.io/en/latest/Parameters.html#drop_rate.

Should max_interactions then be evaluated against the set of trees still in the model after dropout? Or should it somehow keep track of the distinct sets of feature interactions that have been used throughout training, even if some of those sets were only used in trees that no longer exist in the model?


I don't mean to be too negative... these are all solvable and interesting problems, and I'd be happy to help with them.

But I hope they illustrate why I'm concerned about adding another 3 feature-selection mechanisms to the library. Each new mechanism added has to be made consistent with all the others, and this project is already understaffed relative to its popularity and the size of its surface area.

And keep in mind that there are even more feature-selection requests in the backlog.

@veneres (Author) commented Feb 26, 2024

Hi @jameslamb,
I agree with you that adding 3 more parameters can be more confusing than helpful. I added them mainly because it was easier to control my experiments on explainable boosting.
If you still think it would be a good idea to integrate this type of parameter to constrain the learning phase, but you want to limit the production and maintenance overhead, I would suggest adding only one parameter, i.e., tree_interaction_constraints.
I say that because:

  1. its behavior is similar to interaction_constraints but applied at the tree level.
  2. it can be used to mimic the behavior of the proposed max_tree_interactions by setting all the possible combinations of n features as its value, e.g., max_tree_interactions = 2 is equivalent to tree_interaction_constraints = list(itertools.combinations(range(n_features), 2)) (see the sketch after this list).
  3. it can be used to mimic the behavior of the proposed max_interactions by stopping the learning after each iteration and changing the tree constraints after each learned tree.
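As a concrete sketch of the equivalence claimed in point 2 (assuming the parameter accepts a list of feature-index lists, like interaction_constraints does):

```python
import itertools

n_features = 10
# Every 2-feature subset is an allowed tree-level group, which caps each
# tree at 2 distinct features, mimicking max_tree_interactions = 2.
params = {
    "tree_interaction_constraints": [
        list(pair) for pair in itertools.combinations(range(n_features), 2)
    ],
}
```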

Thus, I would say that tree_interaction_constraints is the parameter to partially answer #3905 while not adding too many parameters to the learning phase.

So focusing now only on one parameter, I try to propose a solution to your questions:

1. how should this be made consistent with all the other feature-selection parameters?

I think tree_interaction_constraints should be consistent with interaction_constraints.

For example, consider the following mix:

  • forced_splits_filename specifying splits on 3 features
  • max_tree_interactions = 2

These can't both be satisfied, so what happens? A runtime error before boosting begins?

Or consider this:

  • feature_fraction < 1.0
  • tree_interaction_constraints for a small subset of features

What happens if a randomly-selected set of features violates all of the tree_interaction_constraints? Runtime error? Create an empty tree and then move on to the next round and sample features again? Keep re-sampling until a compliant set of features is found?

I suggest replicating the behavior of interaction_constraints; what happens if we set feature_fraction < 1.0 and we are not able to satisfy both restrictions?
The same also applies to forced_splits_filename; what happens when setting forced_splits_filename and interaction_constraints at the same time?
I did not have time to check how interaction_constraints interacts with all the other features of the code and to try them out, so I genuinely ask :)
In addition, when both interaction_constraints and tree_interaction_constraints are specified, the interactions allowed for a particular branch should be the intersection of the two sets, as I coded here (not tested).
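In other words, a minimal sketch of the intended rule (illustrative values, not the PR's C++ code):

```python
# Allowed features at the current branch under interaction_constraints,
# given the splits already made on the path from the root (example values).
branch_allowed = {0, 1, 4}
# Allowed features for the current tree under tree_interaction_constraints.
tree_allowed = {0, 1, 2, 3}

# When both parameters are set, a split may only use the intersection.
effective_allowed = branch_allowed & tree_allowed  # {0, 1}
```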

2. How should these work for multiclass classification?

In multiclass classification, LightGBM trains one tree per class. So for num_class=5, n_iterations=100, you'll get up to 500 trees ("up to" because early stopping might be triggered earlier).

I think it's not uncommon for different features to be more important for predicting one class than another. So should tree_interaction_constraints still literally apply to each tree? Or should it limit the number of features used across all trees within 1 iteration? And should max_interactions be per-class? Or across all trees in the model?

3. How should this work with dart boosting?

With dart, trees can be dropped after each iteration. See, for example, https://lightgbm.readthedocs.io/en/latest/Parameters.html#drop_rate.

Should max_interactions then be evaluated against the set of trees still in the model after dropout? Or should it somehow keep track of the distinct sets of feature interactions that have been used throughout training, even if some of those sets were only used in trees that no longer exist in the model?

Again, we should be consistent with the behavior of interaction_constraints for points 2 and 3.

I don't mean to be too negative... these are all solvable and interesting problems, and I'd be happy to help with them.

But I hope they illustrate why I'm concerned about adding another 3 feature-selection mechanisms to the library. Each new mechanism added has to be made consistent with all the others, and this project is already understaffed relative to its popularity and the size of its surface area.

I totally understand. I also do not want to insist on adding them to the main branch. I just opened the pull request because I thought it might be interesting to other users to have (some of) them.

And keep in mind that there are even more feature-selection requests in the backlog.

I would be glad to help if needed :)

@jameslamb (Collaborator)

Thanks for the response.

I genuinely ask

I don't know for sure what the answers to my questions for interaction_constraints are... I would need to investigate them too. I expected you to have opinions on those specific to tree-based constraints based on your research... if not, and "whatever LightGBM does for interaction_constraints" is the answer, then great, that helps.

I would suggest adding only one parameter, i.e., tree_interaction_constraints

If we do move forward with adding tree-level feature constraints like this, I definitely support adding only this one parameter instead of all 3, to limit the complexity. It seems to me that the max_* parameters could be added in separate, later contributions if it's decided that they're helpful.

@shiyu1994 I'll wait until you have time to look into this proposal (the linked paper and code samples here) and give a specific opinion on whether LightGBM should take on some subset of this.

@veneres (Author) commented Apr 16, 2024

I expected you to have opinions on those specific to tree-based constraints based on your research.

In my investigations, I focused on creating ensembles of trees using only 1 or 2 features per tree. Other parameters, such as the already mentioned feature_fraction, were too generic for that purpose, and I treated them as conflicting parameters that should not be used together with the more fine-grained interaction_constraints and the newly introduced tree_interaction_constraints.
However, they can be used together, and in my implementation of tree_interaction_constraints I tried to replicate the behavior of interaction_constraints, making it consistent with what was already implemented. Basically, the core of interaction_constraints is in col_sampler.hpp (https://github.com/microsoft/LightGBM/blob/master/src/treelearner/col_sampler.hpp), and there are various checks for conflicts with other parameters, e.g. https://github.com/microsoft/LightGBM/blob/master/src/treelearner/col_sampler.hpp#L114, where the code checks whether feature_fraction_bynode is greater than 1 and whether interaction_constraints is not empty.
In my implementation, I tried to make everything consistent with these checks:
https://github.com/microsoft/LightGBM/pull/6335/files#diff-bea1f4326adbf93f6463a726cd331ebf6ae86e89bac18e648e25b643ca1d3b51
However, I do not have a full overview of all possible conflicting parameters and how they interact with interaction_constraints.
I will check this and get back to you with a list of parameters that conflict with interaction_constraints and how each conflict is handled (or not).

@jameslamb (Collaborator)

Thanks very much for that!

Totally makes sense to me. Like I mentioned in #6335 (comment), I will defer to @shiyu1994 (or maybe @guolinke if interested) to move this forward if they want. They're much better qualified than me to decide on how this could fit into LightGBM.

If we do decide to go forward with it, I'll be happy to help with the testing, documentation, etc.
