
Explainable boosting parameters #6335

Draft · wants to merge 27 commits into master

Conversation

@veneres commented Feb 21, 2024

Hi all,
First of all, thank you to the maintainers for keeping this project updated; this is one of the best gradient boosting libraries I have ever tried.
I am making this pull request because two years ago I forked LightGBM to create a proof of concept for this paper: https://dl.acm.org/doi/10.1145/3477495.3531840
Since I am now improving and expanding the experiments for a new interpretable LambdaMART, I decided to give back to this project by polishing the proof of concept and opening a pull request.

Issue addressed

This pull request is a starting point to address issue #3905 (also mentioned in #2302) by adding interpretable characteristics of the training process.

Parameters added

I added three parameters:

  • tree_interaction_constraints
  • max_tree_interactions
  • max_interactions

tree_interaction_constraints

The parameter tree_interaction_constraints, similar to interaction_constraints, limits the interactions between features. While interaction_constraints controls which features can appear in the same branch, tree_interaction_constraints controls which features can appear in the same tree.
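To make the difference concrete, here is a minimal sketch of how the two constraint styles would sit side by side in a params dict; note that tree_interaction_constraints is the parameter proposed in this PR and is not part of any released LightGBM version:

```python
import numpy as np
import lightgbm as lgb

# Toy data: 4 features, with an interaction between features 0 and 1.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = X[:, 0] * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=500)

params = {
    "objective": "regression",
    # Branch-level: features within a group may appear on the same path
    # from root to leaf; features from different groups may not.
    "interaction_constraints": [[0, 1], [2, 3]],
    # Tree-level (this PR, unreleased): features within a group may
    # appear in the same tree; features from different groups may not.
    "tree_interaction_constraints": [[0, 1], [2, 3]],
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)
```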

max_tree_interactions

The parameter max_tree_interactions greedily limits the number of distinct features that can appear in the same tree, e.g., if max_tree_interactions = 5, then after the fifth distinct feature has been used in a tree, no new features are considered for that tree, and further splits are chosen among those same 5 features by maximum gain.
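Conceptually, the greedy selection behaves like the following toy sketch (hypothetical helper, not the PR's actual C++ tree-learner code):

```python
max_tree_interactions = 5

def candidate_features(all_features, used_in_tree):
    """Features eligible for the next split of the current tree."""
    if len(used_in_tree) >= max_tree_interactions:
        # Cap reached: only features already split on in this tree stay
        # candidates; the best of them is still chosen by split gain.
        return set(used_in_tree)
    return set(all_features)
```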

max_interactions

The parameter max_interactions limits the number of interactions that can be used in the whole forest in a greedy fashion.
Every tree is associated with a set of features, and we can say that those features interact with each other.
If a tree uses a subset of features of another tree, we say that the two trees use the same set of feature interactions.
For example, if max_interactions = 5 and the first 5 trees use 5 disjoint sets of features, the sixth tree will be forced to use a subset of one of the feature sets used by the first 5 trees.
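The forest-level budget can be pictured with a similar toy sketch (again hypothetical names, not the PR's C++ code):

```python
max_interactions = 5
used_feature_sets = []  # one frozenset per distinct interaction set seen so far

def tree_is_allowed(tree_features):
    """Allow a tree if its features fit within an already-used interaction
    set, or if the budget of distinct sets is not yet exhausted."""
    fs = frozenset(tree_features)
    if any(fs <= s for s in used_feature_sets):
        return True  # reuses (a subset of) an existing interaction set
    return len(used_feature_sets) < max_interactions
```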

Tests

Other than having extensively tested the new parameters in the proofs of concept for my papers, I added three unit tests in python-package/lightgbm/engine.py, namely:

  • test_tree_interaction_constraints
  • test_max_tree_interactions
  • test_max_interactions

These functions are only a starting point for testing the new features added.

I have not tested the integration in R.

@veneres (Author) commented Feb 21, 2024

@microsoft-github-policy-service agree

@veneres veneres marked this pull request as draft February 21, 2024 14:24
@veneres (Author) commented Feb 21, 2024

I changed the status of this pull request to "draft" while waiting for the maintainers' opinions on the parameters added.

@shiyu1994 (Collaborator)

Thanks for your contribution. I believe this will be a very helpful feature! I'll look into this PR.

@jameslamb (Collaborator)

Some initial questions I have:

1. how should this be made consistent with all the other feature-selection parameters?

  • interaction_constraints (docs)
  • forced_splits_filename (docs)
  • feature_fraction (docs)
  • feature_fraction_bynode (docs)

There are already multiple ways to control which features are used, and I'm concerned that adding 3 (!) more parameters will be more confusing to users of the library than it is helpful. And that it might be very difficult to provide expected behavior in the face of all these combinations.

For example, consider the following mix:

  • forced_splits_filename specifying splits on 3 features
  • max_tree_interactions = 2

These can't both be satisfied, so what happens? A runtime error before boosting begins?

Or consider this:

  • feature_fraction < 1.0
  • tree_interaction_constraints for a small subset of features

What happens if a randomly-selected set of features violates all of the tree_interaction_constraints? Runtime error? Create an empty tree and then move on to the next round and sample features again? Keep re-sampling until a compliant set of features is found?

2. How should these work for multiclass classification?

In multiclass classification, LightGBM trains one tree per class. So for num_class=5, n_iterations=100, you'll get up to 500 trees ("up to" because early stopping might be triggered earlier).

I think it's not uncommon for different features to be more important for predicting one class than another. So should tree_interaction_constraints still literally apply to each tree? Or should it limit the number of features used across all trees within 1 iteration? And should max_interactions be per-class? Or across all trees in the model?

3. How should this work with dart boosting?

With dart, trees can be dropped after each iteration. See, for example, https://lightgbm.readthedocs.io/en/latest/Parameters.html#drop_rate.

Should max_interactions then be evaluated against the set of trees still in the model after dropout? Or should it somehow keep track of the distinct sets of feature interactions that have been used throughout training, even if some of those sets were only used in trees that no longer exist in the model?


I don't mean to be too negative... these are all solvable and interesting problems, and I'd be happy to help with them.

But I hope they illustrate why I'm concerned about adding another 3 feature-selection mechanisms to the library. Each new mechanism added has to be made consistent with all the others, and this project is already understaffed relative to its popularity and the size of its surface area.

And keep in mind that there are even more feature-selection requests in the backlog.

@veneres (Author) commented Feb 26, 2024

Hi @jameslamb,
I agree with you that adding 3 more parameters can be more confusing than helpful. I added them mainly because it was easier to control my experiments on explainable boosting.
If you still think it would be a good idea to integrate this type of parameter to constrain the learning phase, but you want to limit the production and maintenance overhead, I would suggest adding only one parameter, i.e., tree_interaction_constraints.
I say that because:

  1. its behavior is similar to interaction_constraints but applied at the tree level.
  2. it can be used to mimic the behavior of the proposed max_tree_interactions by setting all the possible combinations of n features as its value, e.g., max_tree_interactions = 2 is equivalent to tree_interaction_constraints = list(itertools.combinations(range(n_features), 2)) (see the sketch after this list).
  3. it can be used to mimic the behavior of the proposed max_interactions by stopping the learning after each iteration and changing the tree constraints after each learned tree.
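As a concrete sketch of the equivalence claimed in point 2 (assuming the parameter accepts a list of feature-index lists, like interaction_constraints does):

```python
import itertools

n_features = 10
# Every 2-feature subset is an allowed tree-level group, which caps each
# tree at 2 distinct features, mimicking max_tree_interactions = 2.
params = {
    "tree_interaction_constraints": [
        list(pair) for pair in itertools.combinations(range(n_features), 2)
    ],
}
```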

Thus, I would say that tree_interaction_constraints is the parameter to partially answer #3905 while not adding too many parameters to the learning phase.

So focusing now only on one parameter, I try to propose a solution to your questions:

1. how should this be made consistent with all the other feature-selection parameters?

I think tree_interaction_constraints should be consistent with interaction_constraints.

For example, consider the following mix:

  • forced_splits_filename specifying splits on 3 features
  • max_tree_interactions = 2

These can't both be satisfied, so what happens? A runtime error before boosting begins?

Or consider this:

  • feature_fraction < 1.0
  • tree_interaction_constraints for a small subset of features

What happens if a randomly-selected set of features violates all of the tree_interaction_constraints? Runtime error? Create an empty tree and then move on to the next round and sample features again? Keep re-sampling until a compliant set of features is found?

I suggest replicating the behavior of interaction_constraints; what happens if we set feature_fraction < 1.0 and we are not able to satisfy both restrictions?
The same also applies to forced_splits_filename; what happens when setting forced_splits_filename and interaction_constraints at the same time?
I did not have time to check how interaction_constraints interacts with all the other features of the code and to try them out, so I genuinely ask :)
In addition, when both interaction_constraints and tree_interaction_constraints are specified, the interactions allowed for a particular branch should be the intersection of the two sets, as I coded here (not tested).
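In other words, a minimal sketch of the intended rule (illustrative values, not the PR's C++ code):

```python
# Allowed features at the current branch under interaction_constraints,
# given the splits already made on the path from the root (example values).
branch_allowed = {0, 1, 4}
# Allowed features for the current tree under tree_interaction_constraints.
tree_allowed = {0, 1, 2, 3}

# When both parameters are set, a split may only use the intersection.
effective_allowed = branch_allowed & tree_allowed  # {0, 1}
```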

2. How should these work for multiclass classification?

In multiclass classification, LightGBM trains one tree per class. So for num_class=5, n_iterations=100, you'll get up to 500 trees ("up to" because early stopping might be triggered earlier).

I think it's not uncommon for different features to be more important for predicting one class than another. So should tree_interaction_constraints still literally apply to each tree? Or should it limit the number of features used across all trees within 1 iteration? And should max_interactions be per-class? Or across all trees in the model?

3. How should this work with dart boosting?

With dart, trees can be dropped after each iteration. See, for example, https://lightgbm.readthedocs.io/en/latest/Parameters.html#drop_rate.

Should max_interactions then be evaluated against the set of trees still in the model after dropout? Or should it somehow keep track of the distinct sets of feature interactions that have been used throughout training, even if some of those sets were only used in trees that no longer exist in the model?

Again, we should be consistent with the behavior of interaction_constraints for points 2 and 3.

I don't mean to be too negative... these are all solvable and interesting problems, and I'd be happy to help with them.

But I hope they illustrate why I'm concerned about adding another 3 feature-selection mechanisms to the library. Each new mechanism added has to be made consistent with all the others, and this project is already understaffed relative to its popularity and the size of its surface area.

I totally understand. I also do not want to insist on adding them to the main branch. I just opened the pull request because I thought it might be interesting to other users to have (some of) them.

And keep in mind that there are even more feature-selection requests in the backlog.

I would be glad to help if needed :)

@jameslamb (Collaborator)

Thanks for the response.

I genuinely ask

I don't know for sure what the answers to my questions for interaction_constraints are... I would need to investigate them too. I expected you to have opinions on those specific to tree-based constraints based on your research... if not, and "whatever LightGBM does for interaction_constraints" is the answer, then great, that helps.

I would suggest adding only one parameter, i.e., tree_interaction_constraints

If we do move forward with adding tree-level feature constraints like this, I definitely support adding only this one parameter instead of all 3, to limit the complexity. It seems to me that the max_* parameters could be added in separate, later contributions if it's decided that they're helpful.

@shiyu1994 I'll wait until you have time to look into this proposal (the linked paper and code samples here) and give a specific opinion on whether LightGBM should take on some subset of this.

@veneres (Author) commented Apr 16, 2024

I expected you to have opinions on those specific to tree-based constraints based on your research.

In my investigations, I focused on creating ensembles of trees using only 1 or 2 features per tree. Other parameters, such as the already mentioned feature_fraction, were too generic for that purpose, and I treated them as conflicting parameters that should not be used together with the more fine-grained interaction_constraints and the newly introduced tree_interaction_constraints.
However, they can be used together, and in my implementation of tree_interaction_constraints I tried to replicate the behavior of interaction_constraints, making it consistent with what was already implemented. Basically, the core of interaction_constraints is in col_sampler.hpp (https://github.com/microsoft/LightGBM/blob/master/src/treelearner/col_sampler.hpp), and there are various checks for conflicts with other parameters, e.g. https://github.com/microsoft/LightGBM/blob/master/src/treelearner/col_sampler.hpp#L114, where the code checks whether feature_fraction_bynode is greater than 1 and whether interaction_constraints is not empty.
In my implementation, I tried to make everything consistent with these checks:
https://github.com/microsoft/LightGBM/pull/6335/files#diff-bea1f4326adbf93f6463a726cd331ebf6ae86e89bac18e648e25b643ca1d3b51
However, I do not have a full overview of all possible conflicting parameters and how they interact with interaction_constraints.
I will check this and get back to you with a list of parameters that conflict with interaction_constraints and how each conflict is handled (or not).

@jameslamb (Collaborator)

Thanks very much for that!

Totally makes sense to me. Like I mentioned in #6335 (comment), I will defer to @shiyu1994 (or maybe @guolinke if interested) to move this forward if they want. They're much better qualified than me to decide on how this could fit into LightGBM.

If we do decide to go forward with it, I'll be happy to help with the testing, documentation, etc.
