
ENH Add check for non binary variables in OneHotEncoder. #16585

Merged
merged 12 commits into scikit-learn:master from cmarmo:dropbinary_nomask on Mar 10, 2020

Conversation

cmarmo
Member

@cmarmo cmarmo commented Feb 28, 2020

Reference Issues/PRs

Fixes #16552
Closes #16554 (as superseded)

What does this implement/fix? Explain your changes.

Adds a check on the values of the drop_idx_ elements: None is the value used when nothing is dropped.

@cmarmo cmarmo changed the title Add check for non binary variables. [MRG] Add check for non binary variables. Feb 28, 2020
@cmarmo cmarmo changed the title [MRG] Add check for non binary variables. [MRG] Add check for non binary variables in OneHotEncoder. Feb 28, 2020
@amueller
Member

amueller commented Feb 28, 2020

Thanks for the PR! I haven't entirely followed the original issue so I might be missing something. Generally it would be good to have a regression test, i.e. a test that shows that, with your fix, the generated feature names are as expected.

Cheers,
A.
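
For concreteness, a regression test along these lines would do it (a sketch with hypothetical column names, using the get_feature_names API of the time and assuming the drop='if_binary' behavior this PR settles on): only the binary feature should have a category dropped from its generated names.

import numpy as np
from sklearn.preprocessing import OneHotEncoder


def test_if_binary_feature_names():
    # First column is binary, second has three categories.
    X = np.array([["a", "x"], ["b", "y"], ["b", "z"]], dtype=object)
    enc = OneHotEncoder(drop="if_binary").fit(X)
    names = enc.get_feature_names(["c0", "c1"])
    # Only the binary column loses a category in its feature names.
    assert list(names) == ["c0_b", "c1_x", "c1_y", "c1_z"]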

@glemaitre
Member

@amueller @jnothman
As I discussed in the issue, I think it would be best to use None as a sentinel when there is no index to drop. Basically, we could have:

np.array([None, None, 10, 1], dtype=object)

instead of

np.array([-1, -1, 10, 10], dtype=np.int_)

The reason is that -1 could be mistaken for negative indexing in Python.

So we should agree on a solution, and then we can go ahead.

ping @ogrisel @jeremiedbb @NicolasHug @thomasjpfan as well.
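
To make the contrast concrete, here is a small sketch of what the fitted attribute looks like with the object-array option (assuming the behavior this PR converges on, as released in scikit-learn 0.23):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# First column is binary, second has three categories.
X = np.array([["a", "x"], ["b", "y"], ["b", "z"]], dtype=object)
enc = OneHotEncoder(drop="if_binary").fit(X)

# The binary column stores the index of its dropped category; the
# non-binary column stores None because nothing is dropped from it.
print(enc.drop_idx_)        # [0 None]
print(enc.drop_idx_.dtype)  # object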

@cmarmo
Member Author

cmarmo commented Mar 2, 2020

The reason is that -1 could be mistaken for negative indexing in Python.

So we should agree on a solution, and then we can go ahead.

Indeed, see also #16593 (which I would rather label as a New Feature, but maybe I'm missing something...). If we want to use indices in get_feature_names then the None solution should be preferred. Could someone give me a green light on that, please? Thanks!

@cmarmo
Member Author

cmarmo commented Mar 2, 2020

As I would like to move forward with #15706, I have implemented drop_idx_ as an array of objects; as long as I cannot use a masked array, I honestly prefer this solution. Is some core dev available for review? Thanks!

@jeremiedbb
Member

I'm OK with this change, but since it will change the result of a public attribute, shouldn't it go through a deprecation cycle? Unless we consider it a bug :)

@thomasjpfan
Member

The convention of setting -1 in drop_idx_ has not been released yet so we can still change it. I am okay with using None.

@NicolasHug
Member

@rth @thomasjpfan this was discussed during #16245, right? Could you please remind us of the pros and cons of each approach?

@rth
Member

rth commented Mar 2, 2020

this was discussed during #16245, right? Could you please remind us of the pros and cons of each approach?

The sentinel value itself shouldn't matter, I think, other than for readability. -1 allows keeping the dtype as int. I also have a slight preference for None, as was done at some point in that PR.

@cmarmo
Member Author

cmarmo commented Mar 3, 2020

Hi @rth, I hope you don't mind that I've just tried the "Request for reviewers" button... :)
Thanks for your patience!

4 review comments on sklearn/preprocessing/_encoders.py (outdated, resolved)
cmarmo and others added 2 commits March 3, 2020 17:35
Co-Authored-By: Thomas J Fan <thomasjpfan@gmail.com>
@cmarmo
Member Author

cmarmo commented Mar 3, 2020

Thanks @thomasjpfan!

1 review comment on sklearn/preprocessing/_encoders.py (outdated, resolved)
@jnothman
Member

jnothman commented Mar 3, 2020 via email

@thomasjpfan
Member

thomasjpfan commented Mar 3, 2020

In general I think arrays of mixed type are a strange beast. They're not especially helpful as arrays, and should rather be a list. I don't really see the problem in using -1 as a sentinel, as long as it's well tested. But the consensus seems to be towards another option.

I can't really judge how many users will think -1 means negative indexing. I would want to go with the route that is less confusing for users. (If only we could run a poll.)

@cmarmo
Member Author

cmarmo commented Mar 3, 2020

@jnothman @thomasjpfan, I will not try to persuade you, I just want to add a clarification.
The reason I'd rather have the None solution is that -1 and negative indices have a specific meaning in Python, and I don't like the idea of "overloading" it in OneHotEncoder: one day this -1 may be useful (e.g. a "drop last" option... OK, this is a stupid example, but who knows?). Now I'll stop bothering you. Thanks for listening.

@jnothman
Member

jnothman commented Mar 4, 2020 via email

Member

@rth rth left a comment


I hope you don't mind that I've just tried the "Request for reviewers" button

Don't hesitate to use that button on reviewers @cmarmo :) The code LGTM.

In numerical computing it is not at all uncommon, when working with positive ints, to give negative numbers special meanings. For instance, see how leaves are indicated in decision trees.

Indeed, but I guess the issue here is that an array index is not necessarily a positive integer, and that we are introducing a meaning different from what a negative index commonly means in Python. Another possibility for the sentinel could have been np.iinfo(np.int32).max (== 2147483647). Anyway, any of these would likely be OK.
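
As a tiny illustration of the concern (not code from the PR): used as an index, -1 silently points at the last element rather than signalling "nothing dropped", whereas the alternative sentinel mentioned above is simply the int32 maximum.

import numpy as np

categories = np.array(["a", "b", "c"], dtype=object)
print(categories[-1])          # c, the last category, not an error
print(np.iinfo(np.int32).max)  # 2147483647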

5 review comments on sklearn/preprocessing/_encoders.py (outdated, resolved)
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@cmarmo
Member Author

cmarmo commented Mar 9, 2020

@rth, @thomasjpfan, after discussing with @glemaitre I've finally understood that self.drop_idx_ is set to None after fit when no dropping is requested, so I can use it in the checks... :)
The code is a bit different with respect to the version you approved; I have added a test for this particular situation.
Maybe you can find some time to check whether you are still OK with it? Thanks a lot for your patience.
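
As an illustration of that distinction (a sketch assuming the released behavior): with the default drop=None nothing is ever dropped and the whole attribute is None, which is what the added checks can rely on, while drop='if_binary' yields the object array shown earlier, with None entries only for the non-binary features.

from sklearn.preprocessing import OneHotEncoder

X = [["a", "x"], ["b", "y"], ["b", "z"]]

# Default drop=None: the attribute itself is None.
print(OneHotEncoder().fit(X).drop_idx_)  # None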

Member

@glemaitre glemaitre left a comment


LGTM apart from this small issue.

1 review comment on sklearn/preprocessing/tests/test_encoders.py (outdated, resolved)
@glemaitre
Member

Oh, and we will need an entry in what's new:

Please add an entry to the change log at doc/whats_new/v0.23.rst under bug fixes. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:.

@glemaitre
Member

No need for the what's new entry; the bug was only introduced in dev.

Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@cmarmo
Member Author

cmarmo commented Mar 10, 2020

Someone available for merging? :) Thanks a lot!

Member

@rth rth left a comment


Thanks @cmarmo !

@rth rth changed the title [MRG] Add check for non binary variables in OneHotEncoder. ENH Add check for non binary variables in OneHotEncoder. Mar 10, 2020
@rth rth merged commit f763c61 into scikit-learn:master Mar 10, 2020
ashutosh1919 pushed a commit to ashutosh1919/scikit-learn that referenced this pull request Mar 13, 2020
…n#16585)

Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@cmarmo cmarmo deleted the dropbinary_nomask branch May 5, 2020 10:44
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
…n#16585)

Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Development

Successfully merging this pull request may close these issues.

OneHotEncoder drop 'if_binary' drop one column from all categorical variables
8 participants