API Set random_state=0 for TargetEncoder #25927
Conversation
I think I understand why you'd want the same splits in multiple `TargetEncoder` instances.
I think this means we should add to the documentation (maybe a health warning in the docstring is enough?) to warn people about these subtleties. But all of the above also makes me wonder whether we can remove some of this to avoid the situation where you need to explain things to the user (aka "does a user even need to be able to provide this?").
Do you see any valid use case where the same feature would be encoded several times by different target encoder instances? If not, maybe we could keep the current
Or better, in a new section at the bottom of the target encoder example in the gallery, to document the various pitfalls we identified during the review.
If there is any randomness in a method, I think giving the user control is worth it. It leads to reproducible pipelines and allows one to experiment with "does random state matter". I suspect the answer is "it matters sometimes and it depends on your data".
Yeah, it's tricky. Even passing the same `RandomState` object would also mix target information across folds:

```python
rng = np.random.RandomState(42)
ct = ColumnTransformer([
    ("smooth_1", TargetEncoder(smooth=1, random_state=rng), ["a", "b"]),
    ("smooth_2", TargetEncoder(smooth=2, random_state=rng), ["c", "d"]),
])
```

The only way to prevent this is to disallow `RandomState` instances.
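The underlying mechanism can be sketched with plain NumPy (an illustration of seeding behavior only, not scikit-learn's actual cross-fitting code):

```python
import numpy as np

n_samples = 100

# A shared RandomState instance advances its internal state on every draw,
# so two encoders consuming the same instance would get *different*
# permutations, i.e. different CV fold assignments.
rng = np.random.RandomState(42)
perm_first = rng.permutation(n_samples)
perm_second = rng.permutation(n_samples)
print(np.array_equal(perm_first, perm_second))  # False

# An integer seed (e.g. random_state=0) re-seeds a fresh RandomState each
# time, so every encoder reproduces exactly the same permutation / splits.
perm_a = np.random.RandomState(0).permutation(n_samples)
perm_b = np.random.RandomState(0).permutation(n_samples)
print(np.array_equal(perm_a, perm_b))  # True
```

This is why an integer default keeps the folds aligned across encoders, while a shared `RandomState` instance does not.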
In my initial example, I was thinking of using multiple smoothing factors for different features. This PR tries to prevent the mistake by having a safer default, which aligns with "making ML easier". After writing this response, I think adding it to the pitfalls example is enough.
I agree.
Reference Issues/PRs
Follow-up to #25334
What does this implement/fix? Explain your changes.
I think setting the default to `random_state=0` is better to prevent target leakage when there are multiple `TargetEncoder`s. If `random_state=None`, then the following example would combine target information from overlapping splits. With `random_state=0`, the CV in each target encoder would use the same splits during `fit_transform`.

Any other comments?
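A minimal sketch of the split behavior in question, using scikit-learn's `KFold` directly rather than `TargetEncoder` itself (the fold counts and data are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20)

# With random_state=None, each shuffled KFold draws fresh randomness, so
# two separately constructed encoders would cross-fit on different splits,
# and their out-of-fold encodings could mix target information.
folds_none_1 = [tuple(test) for _, test in KFold(5, shuffle=True).split(X)]
folds_none_2 = [tuple(test) for _, test in KFold(5, shuffle=True).split(X)]

# With a fixed integer seed such as random_state=0, every encoder
# re-derives exactly the same splits, so the folds line up.
folds_zero_1 = [tuple(test) for _, test in
                KFold(5, shuffle=True, random_state=0).split(X)]
folds_zero_2 = [tuple(test) for _, test in
                KFold(5, shuffle=True, random_state=0).split(X)]
print(folds_zero_1 == folds_zero_2)  # True
```

The same-seed case is what `random_state=0` as a default buys: independently constructed encoders agree on their CV splits without the user having to coordinate them.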
CC @ogrisel