
API Set random_state=0 for TargetEncoder #25927

Conversation

@thomasjpfan (Member) commented Mar 21, 2023

Reference Issues/PRs

Follow-up to #25334.

What does this implement/fix? Explain your changes.

I think setting random_state=0 is better to prevent target leakage when there are multiple TargetEncoders. If random_state=None, the following example would combine target information from overlapping splits:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import TargetEncoder

ct = ColumnTransformer([
    ("smooth_1", TargetEncoder(smooth=1), ["a", "b"]),
    ("smooth_2", TargetEncoder(smooth=2), ["c", "d"]),
])

With random_state=0, the CV in each target encoder would use the same splits during fit_transform.
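A minimal sketch of the idea, assuming the encoder's internal cross-fitting behaves like scikit-learn's KFold with shuffle=True: two splitters seeded with the same integer produce identical folds, so the encoders' splits stay aligned.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)

# Two independent splitters with the same integer seed yield identical folds.
splits_1 = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))
splits_2 = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))
assert all(
    np.array_equal(test_1, test_2)
    for (_, test_1), (_, test_2) in zip(splits_1, splits_2)
)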

Any other comments?

CC @ogrisel

@thomasjpfan added this to the 1.3 milestone Mar 21, 2023
@betatim (Member) commented Mar 22, 2023

I think I understand why you'd want the same splits in multiple TargetEncoders. Taking that train of thought to its limit makes me wonder: do users actually need to be able to control any of the splitting? Does it even matter how you split, or is splitting k ways just as good as some more complex splitting method?

TargetEncoder would be the only(?) estimator in all of scikit-learn where the advice would be "don't provide your own random_state, you are likely to make a mistake if you do", compared to the other estimators, where the advice is generally to provide one. Trickier still, if you do provide a random state to TargetEncoder, you should probably provide the same one to all of them, as sketched below.
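For concreteness, a sketch of that pattern, reusing the hypothetical columns from the PR description with a shared integer seed passed to every encoder:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import TargetEncoder

# Pass the same integer seed to every encoder so their internal CV splits match.
seed = 0
ct = ColumnTransformer([
    ("smooth_1", TargetEncoder(smooth=1, random_state=seed), ["a", "b"]),
    ("smooth_2", TargetEncoder(smooth=2, random_state=seed), ["c", "d"]),
])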

I think this means we should warn people about these subtleties in the documentation (maybe a health warning in the docstring is enough?). But all of the above also makes me wonder if we can remove some of this to avoid a situation where you need to explain things to the user (aka "does a user even need to be able to provide this?").

@ogrisel (Member) commented Mar 22, 2023

> I think setting random_state=0 is better to prevent target leakage when there are multiple TargetEncoders. If random_state=None, the following example would combine target information from overlapping splits

Do you see any valid use case where the same feature would be encoded several times by different target encoder instances? If not, maybe we could keep the current random_state=None default and document this (unlikely) pitfall in the docstring of random_state.

@ogrisel (Member) commented Mar 22, 2023

Or better, document this in a new section at the bottom of the target encoder example in the gallery, along with the various pitfalls we identified during the review.

@thomasjpfan (Member, Author)

> do users actually need to be able to control any of the splitting? Does it even matter how you split, or is splitting k ways just as good as some more complex splitting method?

If there is any randomness in a method, I think giving the user control over it is worth it. It leads to reproducible pipelines and lets one experiment with "does the random state matter". I suspect the answer is "it matters sometimes, and it depends on your data".

> Trickier still, if you do provide a random state to TargetEncoder, you should probably provide the same one to all of them.

Yeah, it's tricky. Even passing the same RandomState object would mix target information across folds:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import TargetEncoder

rng = np.random.RandomState(42)

ct = ColumnTransformer([
    ("smooth_1", TargetEncoder(smooth=1, random_state=rng), ["a", "b"]),
    ("smooth_2", TargetEncoder(smooth=2, random_state=rng), ["c", "d"]),
])

The only way to prevent this is to disallow RandomState objects, but that feels a bit too far.
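A quick sketch of why sharing a RandomState instance goes wrong, again assuming KFold-like shuffling internally: the first consumer advances the generator's state, so the second draws a different shuffle and the folds almost surely no longer line up.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)
rng = np.random.RandomState(42)

# Both splitters share rng; the first .split() call consumes its state,
# so the second draws a different shuffle and the folds diverge.
splits_1 = list(KFold(n_splits=5, shuffle=True, random_state=rng).split(X))
splits_2 = list(KFold(n_splits=5, shuffle=True, random_state=rng).split(X))
print(all(
    np.array_equal(test_1, test_2)
    for (_, test_1), (_, test_2) in zip(splits_1, splits_2)
))  # almost surely False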

> Do you see any valid use case where the same feature would be encoded several times by different target encoder instances?

In my initial example, I was thinking of using multiple smoothing factors for different features.

This PR is trying to prevent that mistake by having a safer default, which aligns with "make ML easier". After writing this response, though, I think adding it to the pitfalls example is enough.

@ogrisel (Member) commented Mar 22, 2023

> This PR is trying to prevent that mistake by having a safer default, which aligns with "make ML easier". After writing this response, though, I think adding it to the pitfalls example is enough.

I agree.
