
Stratified Group KFold implementation #18649

Merged: 73 commits, Mar 20, 2021

Conversation

@marrodion (Contributor)

Reference Issues/PRs

Fixes #13621.

What does this implement/fix? Explain your changes.

Implementation of StratifiedGroupKFold based on PR #15239 and a Kaggle kernel.
The implementation considers the distribution of labels within groups, without the restriction that each group contain only one class.
For trivial cases:

  • If the number of groups equals the number of samples, the implementation follows StratifiedKFold as closely as possible
  • If each group has a roughly equal distribution of labels, the behavior is almost identical to GroupKFold

Issues with current implementation:

  • Due to the sorting of groups during iteration, shuffling only happens among groups with the same label distribution. Given that, I am not sure whether the shuffle option should be kept at all.
  • Because the algorithm looks at groups in sequence, it may produce suboptimal splits in terms of stratification (see sklearn/model_selection/tests/test_split.py:559, test_startified_group_kfold_approximate)

Any other comments?

Given the restrictions outlined above, I am hesitant about whether this should be included in scikit-learn; its usefulness seems limited. However, there seems to be constant interest in this feature, and I failed to design a better solution that is not brute-force. I would like to hear any thoughts/ideas on which algorithm might produce better results.

hermidalc and others added 30 commits, starting October 13, 2019 16:26, including:

  • Parameters are the same as for StratifiedKFold, to ensure similar behavior given n_groups == n_samples
  • The idea is to ensure similar behavior when groups are trivial (n_groups == n_samples)
  • Required to produce balanced-size folds when the distribution of y is more or less the same
@marrodion (Contributor, PR author)

Thank you so much for a thorough review, @jnothman!

The issues you've outlined seem to be fixed now.
I've also tried to switch from pure Python to NumPy wherever I could; I hope that helps with readability.

@agramfort (Member) left a comment:

can you also update https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html to share a visual intuition of what's going on?

thx a lot @marrodion

The implementation is designed to:

* Mimic the behavior of StratifiedKFold as much as possible for trivial
groups (e.g. when each group contain only one sample).
Review comment (Member):

Suggested change
groups (e.g. when each group contain only one sample).
groups (e.g. when each group contains only one sample).

@marrodion (Contributor, PR author)

@agramfort Thank you for your comment.

I've updated the visualization; however, the data in the example makes the current algorithm perform identically to GroupKFold. I'm not sure whether I should add an additional plot with something like the example from #13621 (comment). What would you suggest?

@agramfort (Member)

agramfort commented Mar 17, 2021 via email

@marrodion (Contributor, PR author)

I've added it as an example comparing StratifiedKFold, GroupKFold and StratifiedGroupKFold on uneven groups. That should emphasize the difference between GroupKFold and StratifiedGroupKFold, which is not visible in the visualization that covers all CV methods.

@agramfort (Member) left a comment:

thx @marrodion !

@jnothman (Member)

this isn't rendering right: images are in the wrong sections. https://134931-843222-gh.circle-artifacts.com/0/doc/modules/cross_validation.html

@jnothman (Member)

I think the change to the example affected the numbering of images exported and used in the user guide.

@jnothman (Member)

The image is actually a very helpful demonstration of how it differs. Thanks both!

@marrodion (Contributor, PR author)

this isn't rendering right: images are in the wrong sections.

Seems to be fixed now: https://134972-843222-gh.circle-artifacts.com/0/doc/modules/cross_validation.html

@jnothman (Member) left a comment:

This is really nice work. I love the tests. Well done.

- This split is suboptimal in the sense that it might produce imbalanced splits
  even if perfect stratification is possible. If the class distribution is
  relatively similar across groups, using :class:`GroupKFold` is better.

Review comment (Member):

Insert the visualisation here?

@marrodion (Contributor, PR author):

Done, thank you.
I wasn't sure if it was needed, since not every CV splitter has a visualization on this documentation page.

Reply (Member):

Perhaps not needed, but helpful. Happy for you to make the docs more consistent in another PR! ;)

# Check that stratified kfold preserves class ratios in individual splits
# Repeat with shuffling turned off and on
n_samples = 1000
X = np.ones(n_samples)
y = np.array([4] * int(0.10 * n_samples) +
             [0] * int(0.89 * n_samples) +
             [1] * int(0.01 * n_samples))
groups = np.arange(len(y))
Review comment (Member):

Suggested change
groups = np.arange(len(y))
groups = np.arange(len(y)) # ensure perfect stratification with StratifiedGroupKFold

# Check that stratified kfold gives the same indices regardless of labels
n_samples = 100
y = np.array([2] * int(0.10 * n_samples) +
             [0] * int(0.89 * n_samples) +
             [1] * int(0.01 * n_samples))
X = np.ones(len(y))
groups = np.arange(len(y))
Review comment (Member):

Suggested change
groups = np.arange(len(y))
groups = np.arange(len(y)) # ensure perfect stratification with StratifiedGroupKFold

# Check that KFold returns folds with balanced sizes (only when
# stratification is possible)
# Repeat with shuffling turned off and on
X = np.ones(17)
y = [0] * 3 + [1] * 14
groups = np.arange(len(y))
Review comment (Member):

Suggested change
groups = np.arange(len(y))
groups = np.arange(len(y)) # ensure perfect stratification with StratifiedGroupKFold

@jnothman jnothman merged commit 0892a98 into scikit-learn:main Mar 20, 2021
@jnothman (Member)

Thank you @marrodion @hermidalc! It's nice to finally have a solution for this case!!

@marrodion marrodion deleted the stratified-group-kfold branch March 20, 2021 18:43
@glemaitre mentioned this pull request on Apr 22, 2021
@jakubwasikowski

Wow, it's great to see my kernel could be helpful, and I'm happy that the method is finally in scikit-learn. Thanks @marrodion and @hermidalc for doing this 🍻

Development

Successfully merging this pull request may close these issues.

Stratified GroupKFold
7 participants