
API Set random_state=0 for TargetEncoder #25927

Conversation

@thomasjpfan (Member) commented Mar 21, 2023

Reference Issues/PRs

Follow-up to #25334.

What does this implement/fix? Explain your changes.

I think setting random_state=0 is better to prevent target leakage when there are multiple TargetEncoders. If random_state=None, the following example would combine target information from overlapping splits:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import TargetEncoder

ct = ColumnTransformer([
    ("smooth_1", TargetEncoder(smooth=1), ["a", "b"]),
    ("smooth_2", TargetEncoder(smooth=2), ["c", "d"]),
])

With random_state=0, the CV in each target encoder would use the same splits during fit_transform.
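A minimal sketch of the idea, assuming the encoder's internal cross-fitting behaves like scikit-learn's KFold with shuffle=True: two splitters seeded with the same integer produce identical folds, so the encoders' splits stay aligned.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)

# Two independent splitters with the same integer seed yield identical folds.
splits_1 = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))
splits_2 = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))
assert all(
    np.array_equal(test_1, test_2)
    for (_, test_1), (_, test_2) in zip(splits_1, splits_2)
)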

Any other comments?

CC @ogrisel

@thomasjpfan added this to the 1.3 milestone Mar 21, 2023
@betatim (Member) commented Mar 22, 2023

I think I understand why you'd want the same splits in multiple TargetEncoders. Taking that train of thought to its limit makes me wonder: do users actually need to be able to control any of the splitting? Does it even matter how you split, or is splitting k ways just as good as some more complex splitting method?

TargetEncoder would be the only(?) estimator in all of scikit-learn where the advice would be "don't provide your own random_state, you are likely to make a mistake if you do", compared to the other estimators, where the advice is generally to provide one. Trickier still, if you do provide a random state to TargetEncoder, you should probably provide the same one to all of them, as sketched below.
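For concreteness, a sketch of that pattern, reusing the hypothetical columns from the PR description with a shared integer seed passed to every encoder:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import TargetEncoder

# Pass the same integer seed to every encoder so their internal CV splits match.
seed = 0
ct = ColumnTransformer([
    ("smooth_1", TargetEncoder(smooth=1, random_state=seed), ["a", "b"]),
    ("smooth_2", TargetEncoder(smooth=2, random_state=seed), ["c", "d"]),
])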

I think this means we should warn people about these subtleties in the documentation (maybe a health warning in the docstring is enough?). But all of the above also makes me wonder if we can remove some of this to avoid a situation where you need to explain things to the user (aka "does a user even need to be able to provide this?").

@ogrisel (Member) commented Mar 22, 2023

> I think setting random_state=0 is better to prevent target leakage when there are multiple TargetEncoders. If random_state=None, the following example would combine target information from overlapping splits

Do you see any valid use case where the same feature would be encoded several times by different target encoder instances? If not, maybe we could keep the current random_state=None default and document this (unlikely) pitfall in the docstring of random_state.

@ogrisel (Member) commented Mar 22, 2023

Or better, document this in a new section at the bottom of the target encoder example in the gallery, along with the various pitfalls we identified during the review.

@thomasjpfan (Member, Author)

> do users actually need to be able to control any of the splitting? Does it even matter how you split, or is splitting k ways just as good as some more complex splitting method?

If there is any randomness in a method, I think giving the user control over it is worth it. It leads to reproducible pipelines and lets one experiment with "does the random state matter". I suspect the answer is "it matters sometimes, and it depends on your data".

> Trickier still, if you do provide a random state to TargetEncoder, you should probably provide the same one to all of them.

Yeah, it's tricky. Even passing the same RandomState object would mix target information across folds:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import TargetEncoder

rng = np.random.RandomState(42)

ct = ColumnTransformer([
    ("smooth_1", TargetEncoder(smooth=1, random_state=rng), ["a", "b"]),
    ("smooth_2", TargetEncoder(smooth=2, random_state=rng), ["c", "d"]),
])

The only way to prevent this is to disallow RandomState objects, but that feels a bit too far.
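A quick sketch of why sharing a RandomState instance goes wrong, again assuming KFold-like shuffling internally: the first consumer advances the generator's state, so the second draws a different shuffle and the folds almost surely no longer line up.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)
rng = np.random.RandomState(42)

# Both splitters share rng; the first .split() call consumes its state,
# so the second draws a different shuffle and the folds diverge.
splits_1 = list(KFold(n_splits=5, shuffle=True, random_state=rng).split(X))
splits_2 = list(KFold(n_splits=5, shuffle=True, random_state=rng).split(X))
print(all(
    np.array_equal(test_1, test_2)
    for (_, test_1), (_, test_2) in zip(splits_1, splits_2)
))  # almost surely False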

> Do you see any valid use case where the same feature would be encoded several times by different target encoder instances?

In my initial example, I was thinking of using multiple smoothing factors for different features.

This PR is trying to prevent that mistake by having a safer default, which aligns with "make ML easier". After writing this response, though, I think adding it to the pitfalls example is enough.

@ogrisel (Member) commented Mar 22, 2023

> This PR is trying to prevent that mistake by having a safer default, which aligns with "make ML easier". After writing this response, though, I think adding it to the pitfalls example is enough.

I agree.
