
Stratified Group KFold implementation #18649

Merged: 73 commits, Mar 20, 2021

Conversation

@marrodion (Contributor)

Reference Issues/PRs

Fixes #13621.

What does this implement/fix? Explain your changes.

Implementation of StratifiedGroupKFold based on PR #15239 and a Kaggle kernel.
The implementation considers the distribution of labels within groups, without the restriction that each group contain only one class.
For trivial cases:

  • If the number of groups equals the number of samples, the implementation follows StratifiedKFold as closely as possible
  • If each group has a roughly equal distribution of labels, the behavior is almost identical to GroupKFold

Issues with current implementation:

  • Due to the sorting of groups during iteration, shuffling only happens among groups with the same label distribution. Given that, I am not sure whether the shuffle option should be kept at all.
  • Because the algorithm looks at groups in sequence, it may produce suboptimal splits in terms of stratification (see sklearn/model_selection/tests/test_split.py:559, test_startified_group_kfold_approximate)

Any other comments?

Given the restrictions outlined above, I am hesitant about whether this should be included in scikit-learn; its usefulness seems limited. However, there seems to be constant interest in this feature, and I failed to design a better solution that is not brute-force. I would like to hear any thoughts/ideas on which algorithm might produce better results.

hermidalc and others added 30 commits, starting October 13, 2019 16:26, including:

  • Parameters are the same as for StratifiedKFold, to ensure similar behavior given n_groups == n_samples
  • The idea is to ensure similar behavior when groups are trivial (n_groups == n_samples)
  • Required to produce balanced-size folds when the distribution of y is more or less the same
@marrodion (Contributor, PR author)

Thank you so much for a thorough review, @jnothman!

The issues you've outlined seem to be fixed now.
I've also tried to switch from pure Python to NumPy wherever I could; I hope that helps with readability.

@agramfort (Member) left a comment:

can you also update https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html to share a visual intuition of what's going on?

thx a lot @marrodion

The implementation is designed to:

* Mimic the behavior of StratifiedKFold as much as possible for trivial
groups (e.g. when each group contain only one sample).
Review comment (Member):

Suggested change
groups (e.g. when each group contain only one sample).
groups (e.g. when each group contains only one sample).

@marrodion (Contributor, PR author)

@agramfort Thank you for your comment.

I've updated the visualization; however, the data in the example makes the current algorithm perform identically to GroupKFold. I'm not sure whether I should add an additional plot with something like the example from #13621 (comment). What would you suggest?

@agramfort (Member)

agramfort commented Mar 17, 2021 via email

@marrodion (Contributor, PR author)

I've added it as an example comparing StratifiedKFold, GroupKFold and StratifiedGroupKFold on uneven groups. That should emphasize the difference between GroupKFold and StratifiedGroupKFold, which is not visible in the visualization that covers all CV methods.

@agramfort (Member) left a comment:

thx @marrodion !

@jnothman (Member)

this isn't rendering right: images are in the wrong sections. https://134931-843222-gh.circle-artifacts.com/0/doc/modules/cross_validation.html

@jnothman (Member)

I think the change to the example affected the numbering of images exported and used in the user guide.

@jnothman (Member)

The image is actually a very helpful demonstration of how it differs. Thanks both!

@marrodion (Contributor, PR author)

this isn't rendering right: images are in the wrong sections.

Seems to be fixed now: https://134972-843222-gh.circle-artifacts.com/0/doc/modules/cross_validation.html

@jnothman (Member) left a comment:

This is really nice work. I love the tests. Well done.

- This split is suboptimal in the sense that it might produce imbalanced splits
  even if perfect stratification is possible. If the class distribution is
  relatively similar across groups, using :class:`GroupKFold` is better.

Review comment (Member):

Insert the visualisation here?

@marrodion (Contributor, PR author):

Done, thank you.
I wasn't sure if it was needed, since not every CV splitter has a visualization on this documentation page.

Reply (Member):

Perhaps not needed, but helpful. Happy for you to make the docs more consistent in another PR! ;)

# Check that stratified kfold preserves class ratios in individual splits
# Repeat with shuffling turned off and on
n_samples = 1000
X = np.ones(n_samples)
y = np.array([4] * int(0.10 * n_samples) +
             [0] * int(0.89 * n_samples) +
             [1] * int(0.01 * n_samples))
groups = np.arange(len(y))
Review comment (Member):

Suggested change
groups = np.arange(len(y))
groups = np.arange(len(y)) # ensure perfect stratification with StratifiedGroupKFold

# Check that stratified kfold gives the same indices regardless of labels
n_samples = 100
y = np.array([2] * int(0.10 * n_samples) +
             [0] * int(0.89 * n_samples) +
             [1] * int(0.01 * n_samples))
X = np.ones(len(y))
groups = np.arange(len(y))
Review comment (Member):

Suggested change
groups = np.arange(len(y))
groups = np.arange(len(y)) # ensure perfect stratification with StratifiedGroupKFold

# Check that KFold returns folds with balanced sizes (only when
# stratification is possible)
# Repeat with shuffling turned off and on
X = np.ones(17)
y = [0] * 3 + [1] * 14
groups = np.arange(len(y))
Review comment (Member):

Suggested change
groups = np.arange(len(y))
groups = np.arange(len(y)) # ensure perfect stratification with StratifiedGroupKFold

@jnothman jnothman merged commit 0892a98 into scikit-learn:main Mar 20, 2021
@jnothman (Member)

Thank you @marrodion @hermidalc! It's nice to finally have a solution for this case!!

@marrodion marrodion deleted the stratified-group-kfold branch March 20, 2021 18:43
@glemaitre mentioned this pull request on Apr 22, 2021
@jakubwasikowski

Wow, it's great to see my kernel could be helpful, and I'm happy that the method is finally in scikit-learn. Thanks @marrodion and @hermidalc for doing this 🍻

Development

Successfully merging this pull request may close these issues.

Stratified GroupKFold
7 participants