
FIX delete feature_names_in_ when refitting on a ndarray #21389

Merged
merged 9 commits into from
Oct 23, 2021

Conversation

jeremiedbb
Member

Fixes #21383
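A minimal, self-contained sketch of the behavior this PR introduces (hypothetical names; the real logic lives in scikit-learn's estimator validation helpers): when refitting, `feature_names_in_` is set from DataFrame column names, and must be deleted when the new `X` is a plain ndarray with no column names.

```python
# Hypothetical sketch of the fix: reset feature_names_in_ on refit.
class Estimator:
    def _check_feature_names(self, X, *, reset):
        if not reset:
            return
        names = getattr(X, "columns", None)  # DataFrames expose .columns
        if names is not None:
            self.feature_names_in_ = list(names)
        elif hasattr(self, "feature_names_in_"):
            # Refitting on an ndarray: stale names from a previous
            # DataFrame fit must be removed.
            del self.feature_names_in_

    def fit(self, X):
        self._check_feature_names(X, reset=True)
        return self

class FakeFrame:  # stand-in for a pandas DataFrame
    def __init__(self, columns):
        self.columns = columns

est = Estimator().fit(FakeFrame(["a", "b"]))
print(est.feature_names_in_)               # ['a', 'b']
est.fit([[1.0, 2.0]])                      # refit on a plain array
print(hasattr(est, "feature_names_in_"))   # False: attribute deleted
```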

@jeremiedbb jeremiedbb added this to the 1.0.1 milestone Oct 21, 2021
Member

@glemaitre glemaitre left a comment


LGTM. An entry in the changelog could be good.

@jeremiedbb
Member Author

Hmm, it's not that simple, since several estimators call validate_data twice; the second call is then on an ndarray, which now deletes the attribute... Maybe a good opportunity to remove these double validations? :)
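The double-validation pitfall described above can be sketched as follows (hypothetical names; the real code path goes through scikit-learn's internal validation): the first validation pass sees the DataFrame and records its column names, but a second pass with reset enabled sees the already-converted array and deletes them again.

```python
# Hypothetical sketch of why validating twice loses the feature names.
def validate_data(est, X, *, reset=True):
    names = getattr(X, "columns", None)
    if reset:
        if names is not None:
            est.feature_names_in_ = list(names)
        elif hasattr(est, "feature_names_in_"):
            del est.feature_names_in_
    # Validation converts DataFrames to a plain nested-list "array" here.
    return [list(row) for row in getattr(X, "values", X)]

class Frame:  # stand-in for a pandas DataFrame
    def __init__(self, columns, values):
        self.columns, self.values = columns, values

class TwiceValidatingEstimator:
    def fit(self, X):
        X = validate_data(self, X)  # first pass: names recorded
        X = validate_data(self, X)  # second pass: array input, names deleted!
        return self

est = TwiceValidatingEstimator().fit(Frame(["a", "b"], [[1, 2]]))
print(hasattr(est, "feature_names_in_"))  # False: names were lost
```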

@ogrisel
Member

ogrisel commented Oct 21, 2021

Maybe a good opportunity to remove these double validations? :)

Indeed, I don't see any easy alternative.

@ogrisel
Member

ogrisel commented Oct 21, 2021

Also, while we are at it: the ValueError: Estimator does not have a feature_names_in_ attribute after fitting with a dataframe message could be improved to include the estimator name (passed as name).
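The suggested error-message improvement could look like this sketch (function and parameter names are hypothetical, chosen for illustration only): include the estimator's class name so users can tell which step of a pipeline failed.

```python
# Hypothetical sketch of the improved error message with the estimator name.
def check_feature_names(estimator, fitted_with_dataframe):
    name = type(estimator).__name__
    if fitted_with_dataframe and not hasattr(estimator, "feature_names_in_"):
        raise ValueError(
            f"{name} does not have a feature_names_in_ attribute "
            "after fitting with a dataframe"
        )

class MyEstimator:
    pass

try:
    check_feature_names(MyEstimator(), fitted_with_dataframe=True)
except ValueError as exc:
    print(exc)  # message now starts with "MyEstimator ..."
```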

Member

@glemaitre glemaitre left a comment


I approved a second time :). This time the tests should pass.

order="C",
accept_large_sparse=False,
)
delattr(self, "classes_")
Member


Is this change really needed to fix the original problem? If so, it should probably be documented in the changelog.

If not needed, I would rather move it outside of this PR.

Member Author


It's necessary because we now delegate the validation to _partial_fit, which resets n_features based on the existence of classes_. It has no impact on the user, since the attribute is still set afterwards.

I added a comment
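The fit / _partial_fit interaction described above can be sketched as follows (hypothetical class and method bodies, mirroring the pattern used by the naive Bayes estimators): _partial_fit treats the absence of classes_ as "first call" and only then resets the input-dimension bookkeeping, so fit must delete classes_ to force a full reset when refitting.

```python
# Hypothetical sketch: why fit deletes classes_ before delegating.
class NaiveBayesLike:
    def fit(self, X, y):
        if hasattr(self, "classes_"):
            del self.classes_  # force _partial_fit to reset fitted state
        return self._partial_fit(X, y)

    def _partial_fit(self, X, y):
        first_call = not hasattr(self, "classes_")
        if first_call:
            # Reset input-dimension bookkeeping only on a first call.
            self.n_features_in_ = len(X[0])
            self.classes_ = sorted(set(y))
        return self

est = NaiveBayesLike().fit([[1, 2, 3]], [0])
est.fit([[1, 2]], [0])       # refit on data with a different width
print(est.n_features_in_)    # 2: state was fully reset
```

Without the delattr, the second fit would keep n_features_in_ == 3 and silently mix state from two different datasets.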

Member Author


To me, it's a design issue to have fit call partial_fit, but I don't want to fix that in this PR :)

@@ -902,4 +882,8 @@ def perplexity(self, X, sub_sampling=False):
score : float
Perplexity score.
"""
check_is_fitted(self)
X = self._check_non_neg_array(
X, reset_n_features=True, whom="LatentDirichletAllocation.perplexity"
Member


I don't think we should reset the number of features and their names when computing the perplexity of a dataset:

Suggested change
X, reset_n_features=True, whom="LatentDirichletAllocation.perplexity"
X, reset_n_features=False, whom="LatentDirichletAllocation.perplexity"

Member Author


It felt weird, but I did not want to change the existing behavior. Do you think I should change it anyway?

Member


I think so, maybe with a small non-regression test.
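The kind of non-regression test suggested here could look like the following sketch, using a stub in place of LatentDirichletAllocation so the snippet stays self-contained (names and the stub's behavior are hypothetical; the real test would fit the actual estimator on a DataFrame and then call perplexity): the point is that perplexity must not reset the feature bookkeeping recorded during fit.

```python
# Hypothetical non-regression test sketch: perplexity leaves fitted
# state (feature_names_in_, n_features_in_) untouched.
class StubLDA:
    def fit(self, X):
        self.feature_names_in_ = list(getattr(X, "columns", []))
        self.n_features_in_ = len(self.feature_names_in_)
        return self

    def perplexity(self, X):
        # With reset_n_features=False, validation checks X against the
        # fitted state instead of overwriting it.
        return 1.0

def test_perplexity_does_not_reset_feature_names():
    class Frame:  # stand-in for a pandas DataFrame
        columns = ["a", "b"]

    lda = StubLDA().fit(Frame())
    lda.perplexity([[1.0, 2.0]])
    assert lda.feature_names_in_ == ["a", "b"]
    assert lda.n_features_in_ == 2

test_perplexity_does_not_reset_feature_names()
```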

@ogrisel
Member

ogrisel commented Oct 23, 2021

I cannot spot why this test fails on only one CI job. It seems that all sources of non-determinism are fixed by setting random_state.

@ogrisel
Member

ogrisel commented Oct 23, 2021

Actually no, the random_state of the hyperparameter search meta-estimator is not. Let me fix it.
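The flakiness being fixed here can be illustrated with a stdlib-only sketch (a stand-in for scikit-learn's random_state parameter, using random.Random as the seeded source): a search that samples candidates without a fixed seed gives different results per run, while passing an explicit seed makes it deterministic.

```python
import random

# Hypothetical sketch of a randomized candidate search: unseeded runs
# differ, seeded runs are reproducible (mirroring random_state usage).
def random_search(candidates, n_iter, random_state=None):
    rng = random.Random(random_state)
    return [rng.choice(candidates) for _ in range(n_iter)]

a = random_search(["lbfgs", "sgd", "adam"], 5, random_state=0)
b = random_search(["lbfgs", "sgd", "adam"], 5, random_state=0)
print(a == b)  # True: deterministic once seeded
```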

@ogrisel
Member

ogrisel commented Oct 23, 2021

It works! @thomasjpfan @glemaitre any second review?

Member

@glemaitre glemaitre left a comment


LGTM

@glemaitre glemaitre merged commit a9bf7f3 into scikit-learn:main Oct 23, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 23, 2021
…n#21389)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
glemaitre pushed a commit that referenced this pull request Oct 25, 2021
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
…n#21389)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
mathijs02 pushed a commit to mathijs02/scikit-learn that referenced this pull request Dec 27, 2022
…n#21389)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Successfully merging this pull request may close these issues.

feature_names_in_ is not reset after calling fit() again with a NumPy array
4 participants