
FIX delete feature_names_in_ when refitting on a ndarray #21389

Merged
merged 9 commits into from
Oct 23, 2021

Conversation

jeremiedbb
Member

Fixes #21383
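A minimal, self-contained sketch of the behavior this PR introduces (hypothetical names; the real logic lives in scikit-learn's estimator validation helpers): when refitting, `feature_names_in_` is set from DataFrame column names, and must be deleted when the new `X` is a plain ndarray with no column names.

```python
# Hypothetical sketch of the fix: reset feature_names_in_ on refit.
class Estimator:
    def _check_feature_names(self, X, *, reset):
        if not reset:
            return
        names = getattr(X, "columns", None)  # DataFrames expose .columns
        if names is not None:
            self.feature_names_in_ = list(names)
        elif hasattr(self, "feature_names_in_"):
            # Refitting on an ndarray: stale names from a previous
            # DataFrame fit must be removed.
            del self.feature_names_in_

    def fit(self, X):
        self._check_feature_names(X, reset=True)
        return self

class FakeFrame:  # stand-in for a pandas DataFrame
    def __init__(self, columns):
        self.columns = columns

est = Estimator().fit(FakeFrame(["a", "b"]))
print(est.feature_names_in_)               # ['a', 'b']
est.fit([[1.0, 2.0]])                      # refit on a plain array
print(hasattr(est, "feature_names_in_"))   # False: attribute deleted
```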

@jeremiedbb jeremiedbb added this to the 1.0.1 milestone Oct 21, 2021
Member

@glemaitre glemaitre left a comment


LGTM. An entry in the changelog could be good.

@jeremiedbb
Member Author

Hmm, it's not that simple, since several estimators call validate_data twice; the second call is then on an ndarray, which now deletes the attribute... Maybe a good opportunity to remove these double validations? :)
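The double-validation pitfall described above can be sketched as follows (hypothetical names; the real code path goes through scikit-learn's internal validation): the first validation pass sees the DataFrame and records its column names, but a second pass with reset enabled sees the already-converted array and deletes them again.

```python
# Hypothetical sketch of why validating twice loses the feature names.
def validate_data(est, X, *, reset=True):
    names = getattr(X, "columns", None)
    if reset:
        if names is not None:
            est.feature_names_in_ = list(names)
        elif hasattr(est, "feature_names_in_"):
            del est.feature_names_in_
    # Validation converts DataFrames to a plain nested-list "array" here.
    return [list(row) for row in getattr(X, "values", X)]

class Frame:  # stand-in for a pandas DataFrame
    def __init__(self, columns, values):
        self.columns, self.values = columns, values

class TwiceValidatingEstimator:
    def fit(self, X):
        X = validate_data(self, X)  # first pass: names recorded
        X = validate_data(self, X)  # second pass: array input, names deleted!
        return self

est = TwiceValidatingEstimator().fit(Frame(["a", "b"], [[1, 2]]))
print(hasattr(est, "feature_names_in_"))  # False: names were lost
```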

@ogrisel
Member

ogrisel commented Oct 21, 2021

Maybe a good opportunity to remove these double validations? :)

Indeed, I don't see any easy alternative.

@ogrisel
Member

ogrisel commented Oct 21, 2021

Also, while we are at it: the ValueError: Estimator does not have a feature_names_in_ attribute after fitting with a dataframe message could be improved to include the estimator name (passed as name).
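The suggested error-message improvement could look like this sketch (function and parameter names are hypothetical, chosen for illustration only): include the estimator's class name so users can tell which step of a pipeline failed.

```python
# Hypothetical sketch of the improved error message with the estimator name.
def check_feature_names(estimator, fitted_with_dataframe):
    name = type(estimator).__name__
    if fitted_with_dataframe and not hasattr(estimator, "feature_names_in_"):
        raise ValueError(
            f"{name} does not have a feature_names_in_ attribute "
            "after fitting with a dataframe"
        )

class MyEstimator:
    pass

try:
    check_feature_names(MyEstimator(), fitted_with_dataframe=True)
except ValueError as exc:
    print(exc)  # message now starts with "MyEstimator ..."
```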

Member

@glemaitre glemaitre left a comment


I approved a second time :). This time the tests should pass.

order="C",
accept_large_sparse=False,
)
delattr(self, "classes_")
Member


Is this change really needed to fix the original problem? If so, it should probably be documented in the changelog.

If not needed, I would rather move it outside of this PR.

Member Author


It's necessary because we now delegate the validation to _partial_fit, which resets n_features based on the existence of classes_. It has no impact on the user, since the attribute is still set afterwards.

I added a comment
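The fit / _partial_fit interaction described above can be sketched as follows (hypothetical class and method bodies, mirroring the pattern used by the naive Bayes estimators): _partial_fit treats the absence of classes_ as "first call" and only then resets the input-dimension bookkeeping, so fit must delete classes_ to force a full reset when refitting.

```python
# Hypothetical sketch: why fit deletes classes_ before delegating.
class NaiveBayesLike:
    def fit(self, X, y):
        if hasattr(self, "classes_"):
            del self.classes_  # force _partial_fit to reset fitted state
        return self._partial_fit(X, y)

    def _partial_fit(self, X, y):
        first_call = not hasattr(self, "classes_")
        if first_call:
            # Reset input-dimension bookkeeping only on a first call.
            self.n_features_in_ = len(X[0])
            self.classes_ = sorted(set(y))
        return self

est = NaiveBayesLike().fit([[1, 2, 3]], [0])
est.fit([[1, 2]], [0])       # refit on data with a different width
print(est.n_features_in_)    # 2: state was fully reset
```

Without the delattr, the second fit would keep n_features_in_ == 3 and silently mix state from two different datasets.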

Member Author


To me, it's a design issue to have fit call partial_fit, but I don't want to fix that in this PR :)

@@ -902,4 +882,8 @@ def perplexity(self, X, sub_sampling=False):
score : float
Perplexity score.
"""
check_is_fitted(self)
X = self._check_non_neg_array(
X, reset_n_features=True, whom="LatentDirichletAllocation.perplexity"
Member


I don't think we should reset the number of features and their names when computing the perplexity of a dataset:

Suggested change
X, reset_n_features=True, whom="LatentDirichletAllocation.perplexity"
X, reset_n_features=False, whom="LatentDirichletAllocation.perplexity"

Member Author


It felt weird, but I did not want to change the existing behavior. Do you think I should change it anyway?

Member


I think so, maybe with a small non-regression test.
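The kind of non-regression test suggested here could look like the following sketch, using a stub in place of LatentDirichletAllocation so the snippet stays self-contained (names and the stub's behavior are hypothetical; the real test would fit the actual estimator on a DataFrame and then call perplexity): the point is that perplexity must not reset the feature bookkeeping recorded during fit.

```python
# Hypothetical non-regression test sketch: perplexity leaves fitted
# state (feature_names_in_, n_features_in_) untouched.
class StubLDA:
    def fit(self, X):
        self.feature_names_in_ = list(getattr(X, "columns", []))
        self.n_features_in_ = len(self.feature_names_in_)
        return self

    def perplexity(self, X):
        # With reset_n_features=False, validation checks X against the
        # fitted state instead of overwriting it.
        return 1.0

def test_perplexity_does_not_reset_feature_names():
    class Frame:  # stand-in for a pandas DataFrame
        columns = ["a", "b"]

    lda = StubLDA().fit(Frame())
    lda.perplexity([[1.0, 2.0]])
    assert lda.feature_names_in_ == ["a", "b"]
    assert lda.n_features_in_ == 2

test_perplexity_does_not_reset_feature_names()
```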

@ogrisel
Member

ogrisel commented Oct 23, 2021

I cannot spot why this test fails on only one CI job. It seems that all sources of non-determinism are fixed by setting random_state.

@ogrisel
Member

ogrisel commented Oct 23, 2021

Actually no, the random_state of the hyperparameter search meta-estimator is not. Let me fix it.
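The flakiness being fixed here can be illustrated with a stdlib-only sketch (a stand-in for scikit-learn's random_state parameter, using random.Random as the seeded source): a search that samples candidates without a fixed seed gives different results per run, while passing an explicit seed makes it deterministic.

```python
import random

# Hypothetical sketch of a randomized candidate search: unseeded runs
# differ, seeded runs are reproducible (mirroring random_state usage).
def random_search(candidates, n_iter, random_state=None):
    rng = random.Random(random_state)
    return [rng.choice(candidates) for _ in range(n_iter)]

a = random_search(["lbfgs", "sgd", "adam"], 5, random_state=0)
b = random_search(["lbfgs", "sgd", "adam"], 5, random_state=0)
print(a == b)  # True: deterministic once seeded
```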

@ogrisel
Member

ogrisel commented Oct 23, 2021

It works! @thomasjpfan @glemaitre any second review?

Member

@glemaitre glemaitre left a comment


LGTM

@glemaitre glemaitre merged commit a9bf7f3 into scikit-learn:main Oct 23, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 23, 2021
…n#21389)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
glemaitre pushed a commit that referenced this pull request Oct 25, 2021
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
…n#21389)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
mathijs02 pushed a commit to mathijs02/scikit-learn that referenced this pull request Dec 27, 2022
…n#21389)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Successfully merging this pull request may close these issues.

feature_names_in_ is not reset after calling fit() again with a NumPy array
4 participants