
ENH Add check for non binary variables in OneHotEncoder. #16585

Merged
merged 12 commits into scikit-learn:master from cmarmo:dropbinary_nomask on Mar 10, 2020

Conversation

cmarmo
Member

@cmarmo cmarmo commented Feb 28, 2020

Reference Issues/PRs

Fixes #16552
Closes #16554 (as superseded)

What does this implement/fix? Explain your changes.

Adds a check on the values of the drop_idx_ elements: None is the value used when nothing is dropped.

@cmarmo cmarmo changed the title Add check for non binary variables. [MRG] Add check for non binary variables. Feb 28, 2020
@cmarmo cmarmo changed the title [MRG] Add check for non binary variables. [MRG] Add check for non binary variables in OneHotEncoder. Feb 28, 2020
@amueller
Member

amueller commented Feb 28, 2020

Thanks for the PR! I haven't entirely followed the original issue so I might be missing something. Generally it would be good to have a regression test, i.e. a test that shows that, with your fix, the generated feature names are as expected.

Cheers,
A.
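
For concreteness, a regression test along these lines would do it (a sketch with hypothetical column names, using the get_feature_names API of the time and assuming the drop='if_binary' behavior this PR settles on): only the binary feature should have a category dropped from its generated names.

import numpy as np
from sklearn.preprocessing import OneHotEncoder


def test_if_binary_feature_names():
    # First column is binary, second has three categories.
    X = np.array([["a", "x"], ["b", "y"], ["b", "z"]], dtype=object)
    enc = OneHotEncoder(drop="if_binary").fit(X)
    names = enc.get_feature_names(["c0", "c1"])
    # Only the binary column loses a category in its feature names.
    assert list(names) == ["c0_b", "c1_x", "c1_y", "c1_z"]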

@glemaitre
Member

@amueller @jnothman
As I discussed in the issue, I think it would be best to use None as a sentinel when there is no index to drop. Basically, we could have:

np.array([None, None, 10, 1], dtype=object)

instead of

np.array([-1, -1, 10, 10], dtype=np.int_)

The reason is that -1 could be mistaken for negative indexing in Python.

So we should agree on a solution, and then we can go ahead.

ping @ogrisel @jeremiedbb @NicolasHug @thomasjpfan as well.
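
To make the contrast concrete, here is a small sketch of what the fitted attribute looks like with the object-array option (assuming the behavior this PR converges on, as released in scikit-learn 0.23):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# First column is binary, second has three categories.
X = np.array([["a", "x"], ["b", "y"], ["b", "z"]], dtype=object)
enc = OneHotEncoder(drop="if_binary").fit(X)

# The binary column stores the index of its dropped category; the
# non-binary column stores None because nothing is dropped from it.
print(enc.drop_idx_)        # [0 None]
print(enc.drop_idx_.dtype)  # object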

@cmarmo
Member Author

cmarmo commented Mar 2, 2020

The reason is that -1 could be mistaken for negative indexing in Python.

So we should agree on a solution, and then we can go ahead.

Indeed, see also #16593 (which I would rather label as a New Feature, but maybe I'm missing something...). If we want to use indices in get_feature_names then the None solution should be preferred. Could someone give me a green light on that, please? Thanks!

@cmarmo
Member Author

cmarmo commented Mar 2, 2020

As I would like to move forward with #15706, I have implemented drop_idx_ as an array of objects; as long as I cannot use a masked array, I honestly prefer this solution. Is some core dev available for review? Thanks!

@jeremiedbb
Member

I'm OK with this change, but since it will change the result of a public attribute, shouldn't it go through a deprecation cycle? Unless we consider it a bug :)

@thomasjpfan
Member

The convention of setting -1 in drop_idx_ has not been released yet so we can still change it. I am okay with using None.

@NicolasHug
Member

@rth @thomasjpfan this was discussed during #16245, right? Could you please remind us of the pros and cons of each approach?

@rth
Member

rth commented Mar 2, 2020

this was discussed during #16245, right? Could you please remind us of the pros and cons of each approach?

The sentinel value itself shouldn't matter, I think, other than for readability. -1 allows keeping the dtype as int. I also have a slight preference for None, as was done at some point in that PR.

@cmarmo
Member Author

cmarmo commented Mar 3, 2020

Hi @rth, I hope you don't mind that I've just tried the "Request for reviewers" button... :)
Thanks for your patience!

4 review comments on sklearn/preprocessing/_encoders.py (outdated, resolved)
cmarmo and others added 2 commits March 3, 2020 17:35
Co-Authored-By: Thomas J Fan <thomasjpfan@gmail.com>
@cmarmo
Member Author

cmarmo commented Mar 3, 2020

Thanks @thomasjpfan!

1 review comment on sklearn/preprocessing/_encoders.py (outdated, resolved)
@jnothman
Member

jnothman commented Mar 3, 2020 via email

@thomasjpfan
Member

thomasjpfan commented Mar 3, 2020

In general I think arrays of mixed type are a strange beast. They're not especially helpful as arrays, and should rather be a list. I don't really see the problem in using -1 as a sentinel, as long as it's well tested. But the consensus seems to be towards another option.

I can't really judge how many users will think -1 means negative indexing. I would want to go with the route that is less confusing for users. (If only we could run a poll.)

@cmarmo
Member Author

cmarmo commented Mar 3, 2020

@jnothman @thomasjpfan, I will not try to persuade you, I just want to add a clarification.
The reason I'd rather have the None solution is that -1 and negative indices have a specific meaning in Python, and I don't like the idea of "overloading" it in OneHotEncoder: one day this -1 may be useful (e.g. a "drop last" option... OK, this is a stupid example, but who knows?). Now I'll stop bothering you. Thanks for listening.

@jnothman
Member

jnothman commented Mar 4, 2020 via email

Member

@rth rth left a comment


I hope you don't mind that I've just tried the "Request for reviewers" button

Don't hesitate to use that button on reviewers @cmarmo :) The code LGTM.

In numerical computing it is not at all uncommon, when working with positive ints, to give negative numbers special meanings. For instance, see how leaves are indicated in decision trees.

Indeed, but I guess the issue here is that an array index is not necessarily a positive integer, and that we are introducing a meaning different from what a negative index commonly means in Python. Another possibility for the sentinel could have been np.iinfo(np.int32).max (== 2147483647). Anyway, any of these would likely be OK.
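
As a tiny illustration of the concern (not code from the PR): used as an index, -1 silently points at the last element rather than signalling "nothing dropped", whereas the alternative sentinel mentioned above is simply the int32 maximum.

import numpy as np

categories = np.array(["a", "b", "c"], dtype=object)
print(categories[-1])          # c, the last category, not an error
print(np.iinfo(np.int32).max)  # 2147483647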

5 review comments on sklearn/preprocessing/_encoders.py (outdated, resolved)
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@cmarmo
Member Author

cmarmo commented Mar 9, 2020

@rth, @thomasjpfan, after discussing with @glemaitre I've finally understood that self.drop_idx_ is set to None after fit when no dropping is requested, so I can use it in the checks... :)
The code is a bit different with respect to the version you approved; I have added a test for this particular situation.
Maybe you can find some time to check whether you are still OK with it? Thanks a lot for your patience.
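
As an illustration of that distinction (a sketch assuming the released behavior): with the default drop=None nothing is ever dropped and the whole attribute is None, which is what the added checks can rely on, while drop='if_binary' yields the object array shown earlier, with None entries only for the non-binary features.

from sklearn.preprocessing import OneHotEncoder

X = [["a", "x"], ["b", "y"], ["b", "z"]]

# Default drop=None: the attribute itself is None.
print(OneHotEncoder().fit(X).drop_idx_)  # None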

Member

@glemaitre glemaitre left a comment


LGTM apart from this small issue.

1 review comment on sklearn/preprocessing/tests/test_encoders.py (outdated, resolved)
@glemaitre
Member

Oh, and we will need an entry in what's new:

Please add an entry to the change log at doc/whats_new/v0.23.rst under bug fixes. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:.

@glemaitre
Member

No need for the what's new entry; the bug was only introduced in dev.

Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@cmarmo
Member Author

cmarmo commented Mar 10, 2020

Someone available for merging? :) Thanks a lot!

Member

@rth rth left a comment


Thanks @cmarmo !

@rth rth changed the title [MRG] Add check for non binary variables in OneHotEncoder. ENH Add check for non binary variables in OneHotEncoder. Mar 10, 2020
@rth rth merged commit f763c61 into scikit-learn:master Mar 10, 2020
ashutosh1919 pushed a commit to ashutosh1919/scikit-learn that referenced this pull request Mar 13, 2020
…n#16585)

Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@cmarmo cmarmo deleted the dropbinary_nomask branch May 5, 2020 10:44
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
…n#16585)

Co-authored-by: Thomas J Fan <thomasjpfan@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Development

Successfully merging this pull request may close these issues.

OneHotEncoder drop 'if_binary' drop one column from all categorical variables
8 participants