
Version 1.4.0

In Development

Changed models

The following estimators and functions, when fit with the same data and parameters, may produce different models from the previous version. This often occurs due to changes in the modelling logic (bug fixes or enhancements), or in random sampling procedures.

  • linear_model.LogisticRegression and linear_model.LogisticRegressionCV now have much better convergence for solvers "lbfgs" and "newton-cg". Both solvers can now reach much higher precision for the coefficients depending on the specified tol. Additionally, lbfgs can make better use of tol, i.e., stop sooner or reach higher precision. 26721 by Christian Lorentzen <lorentzenchr>.

    Note

    The lbfgs solver is the default, so this change might affect many models.

    This change also means that with this new version of scikit-learn, the resulting coefficients coef_ and intercept_ of your models will change for these two solvers (when fit on the same data again). The amount of change depends on the specified tol; smaller values yield more precise results.
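
A minimal sketch (not part of the release notes) of how tol now controls the attainable precision for these solvers; the dataset and tolerance values are arbitrary::

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, random_state=0)
    coarse = LogisticRegression(solver="lbfgs", tol=1e-1).fit(X, y)
    precise = LogisticRegression(solver="lbfgs", tol=1e-8).fit(X, y)

    # The gap between the two coefficient sets shrinks as tol decreases.
    print(abs(coarse.coef_ - precise.coef_).max())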

Changes impacting all modules

  • All estimators now recognize the column names from any dataframe that adopts the DataFrame Interchange Protocol. Dataframes that return a correct representation through np.asarray(df) are expected to work with our estimators and functions (see the sketch after this list). 26464 by Thomas Fan.
  • Fixed a bug in most estimators and functions where setting a parameter to a large integer would cause a TypeError. 26648 by Naoise Holohan <naoise-h>.
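
A hypothetical illustration of the first entry above, assuming polars is installed (any dataframe library implementing the DataFrame Interchange Protocol should behave the same)::

    import polars as pl
    from sklearn.preprocessing import StandardScaler

    df = pl.DataFrame({"height": [1.6, 1.7, 1.8], "weight": [55.0, 70.0, 80.0]})
    scaler = StandardScaler().fit(df)

    # Column names are picked up through the interchange protocol.
    print(scaler.feature_names_in_)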

Metadata Routing

The following models now support metadata routing in one or more of their methods. Refer to the Metadata Routing User Guide <metadata_routing> for more details.

  • LarsCV and LassoLarsCV now support metadata routing in their fit method and route metadata to the CV splitter. 27538 by Omar Salman <OmarManzoor>.
  • multiclass.OneVsRestClassifier, multiclass.OneVsOneClassifier and multiclass.OutputCodeClassifier now support metadata routing in their fit and partial_fit, and route metadata to the underlying estimator's fit and partial_fit. 27308 by Stefanie Senger <StefanieSenger>.
  • pipeline.Pipeline now supports metadata routing according to metadata routing user guide <metadata_routing>. 26789 by Adrin Jalali.
  • ~model_selection.cross_validate, ~model_selection.cross_val_score, and ~model_selection.cross_val_predict now support metadata routing. The metadata is routed to the estimator's fit, the scorer, and the CV splitter's split, and is accepted via the new params parameter (see the sketch after this list). fit_params is deprecated and will be removed in version 1.6. The groups parameter is no longer accepted as a separate argument when metadata routing is enabled and should be passed via the params parameter. 26896 by Adrin Jalali.
  • ~model_selection.GridSearchCV, ~model_selection.RandomizedSearchCV, ~model_selection.HalvingGridSearchCV, and ~model_selection.HalvingRandomSearchCV now support metadata routing in their fit and score, and route metadata to the underlying estimator's fit, the CV splitter, and the scorer. 27058 by Adrin Jalali.
  • ~compose.ColumnTransformer now supports metadata routing according to metadata routing user guide <metadata_routing>. 27005 by Adrin Jalali.
  • linear_model.LogisticRegressionCV now supports metadata routing. linear_model.LogisticRegressionCV.fit now accepts **params which are passed to the underlying splitter and scorer. linear_model.LogisticRegressionCV.score now accepts **score_params which are passed to the underlying scorer. 26525 by Omar Salman <OmarManzoor>.
  • linear_model.OrthogonalMatchingPursuitCV now supports metadata routing. Its fit now accepts **fit_params, which are passed to the underlying splitter. 27500 by Stefanie Senger <StefanieSenger>.
  • All meta-estimators for which metadata routing is not yet implemented now raise a NotImplementedError on get_metadata_routing and on fit if metadata routing is enabled and any metadata is passed to them. 27389 by Adrin Jalali.
  • ElasticNetCV, LassoCV, MultiTaskElasticNetCV and MultiTaskLassoCV now support metadata routing and route metadata to the CV splitter. 27478 by Omar Salman <OmarManzoor>.
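
A rough sketch of the new params argument and the request methods it relies on; the estimator, dataset and weights below are arbitrary choices, not part of the release notes::

    import numpy as np
    from sklearn import set_config
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_validate

    set_config(enable_metadata_routing=True)  # routing is opt-in

    X, y = make_classification(n_samples=100, random_state=0)
    sample_weight = np.ones(len(y))

    # Request the weights for fit and explicitly decline them for scoring.
    est = (
        LogisticRegression()
        .set_fit_request(sample_weight=True)
        .set_score_request(sample_weight=False)
    )
    cv_results = cross_validate(est, X, y, params={"sample_weight": sample_weight})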

Support for SciPy sparse arrays

Several estimators now support SciPy sparse arrays; a short usage sketch follows the lists below. The following functions and classes are impacted:

Functions:

  • cluster.compute_optics_graph in 27250 by Yao Xiao <Charlie-XIAO>;
  • cluster.kmeans_plusplus in 27179 by Nurseit Kamchyev <Bncer>;
  • decomposition.non_negative_factorization in 27100 by Isaac Virshup <ivirshup>;
  • feature_selection.f_regression in 27239 by Yaroslav Korobko <Tialo>;
  • feature_selection.r_regression in 27239 by Yaroslav Korobko <Tialo>;
  • manifold.trustworthiness in 27250 by Yao Xiao <Charlie-XIAO>;
  • metrics.pairwise_distances in 27250 by Yao Xiao <Charlie-XIAO>;
  • metrics.pairwise_distances_chunked in 27250 by Yao Xiao <Charlie-XIAO>;
  • metrics.pairwise.pairwise_kernels in 27250 by Yao Xiao <Charlie-XIAO>;
  • sklearn.utils.multiclass.type_of_target in 27274 by Yao Xiao <Charlie-XIAO>.

Classes:

  • cluster.HDBSCAN in 27250 by Yao Xiao <Charlie-XIAO>;
  • cluster.KMeans in 27179 by Nurseit Kamchyev <Bncer>;
  • cluster.MiniBatchKMeans in 27179 by Nurseit Kamchyev <Bncer>;
  • cluster.OPTICS in 27250 by Yao Xiao <Charlie-XIAO>;
  • decomposition.NMF in 27100 by Isaac Virshup <ivirshup>;
  • decomposition.MiniBatchNMF in 27100 by Isaac Virshup <ivirshup>;
  • feature_extraction.text.TfidfTransformer in 27219 by Yao Xiao <Charlie-XIAO>;
  • manifold.Isomap in 27250 by Yao Xiao <Charlie-XIAO>;
  • manifold.TSNE in 27250 by Yao Xiao <Charlie-XIAO>;
  • impute.SimpleImputer in 27277 by Yao Xiao <Charlie-XIAO>;
  • impute.IterativeImputer in 27277 by Yao Xiao <Charlie-XIAO>;
  • impute.KNNImputer in 27277 by Yao Xiao <Charlie-XIAO>;
  • kernel_approximation.PolynomialCountSketch in 27301 by Lohit SundaramahaLingam <lohitslohit>;
  • neural_network.BernoulliRBM in 27252 by Yao Xiao <Charlie-XIAO>;
  • preprocessing.PolynomialFeatures in 27166 by Mohit Joshi <work-mohit>.
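
A minimal sketch of what this adds, using cluster.KMeans as an arbitrary example from the list above::

    import numpy as np
    from scipy.sparse import csr_array  # a SciPy sparse *array*, not a sparse matrix
    from sklearn.cluster import KMeans

    X = csr_array(np.random.RandomState(0).rand(50, 5))
    labels = KMeans(n_clusters=3, n_init="auto", random_state=0).fit_predict(X)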

Changelog

sklearn.base

  • base.ClusterMixin.fit_predict and base.OutlierMixin.fit_predict now accept **kwargs which are passed to the fit method of the estimator. 26506 by Adrin Jalali.
  • base.TransformerMixin.fit_transform and base.OutlierMixin.fit_predict now raise a warning if transform / predict consume metadata but the inheriting class does not define a custom fit_transform / fit_predict, respectively. 26831 by Adrin Jalali.
  • base.clone now supports dict as input and creates a copy. 26786 by Adrin Jalali.
  • ~utils.metadata_routing.process_routing now has a different signature. The first two arguments (the object and the method) are positional-only, and all metadata are passed as keyword arguments. 26909 by Adrin Jalali.

sklearn.calibration

  • The internal objective and gradient of the sigmoid method of calibration.CalibratedClassifierCV have been replaced by the private loss module. 27185 by Omar Salman <OmarManzoor>.

sklearn.cluster

  • The kdtree and balltree values are now deprecated and renamed to kd_tree and ball_tree, respectively, for the algorithm parameter of cluster.HDBSCAN, ensuring consistency in the naming convention. The kdtree and balltree values will be removed in 1.6. 26744 by Shreesha Kumar Bhat <Shreesha3112>.
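
A minimal sketch using the new spelling (the data is arbitrary)::

    import numpy as np
    from sklearn.cluster import HDBSCAN

    X = np.random.RandomState(0).rand(60, 2)
    labels = HDBSCAN(algorithm="kd_tree").fit_predict(X)  # was algorithm="kdtree"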

sklearn.compose

  • ~compose.ColumnTransformer now replaces "passthrough" with a corresponding ~preprocessing.FunctionTransformer in the fitted transformers_ attribute. 27204 by Adrin Jalali.

sklearn.datasets

  • datasets.make_sparse_spd_matrix now uses a more memory-efficient sparse layout. It also accepts a new keyword sparse_format that allows specifying the output format of the sparse matrix. By default sparse_format=None, which returns a dense numpy ndarray as before. 27438 by Yao Xiao <Charlie-XIAO>.
  • All dataset fetchers now accept data_home as any object that implements the os.PathLike interface, for instance, pathlib.Path. 27468 by Yao Xiao <Charlie-XIAO>.
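
A small sketch of both entries above; the matrix size, fetcher and cache path are arbitrary examples (the fetcher downloads data on first use)::

    from pathlib import Path
    from sklearn.datasets import fetch_california_housing, make_sparse_spd_matrix

    # sparse_format="csr" returns a SciPy sparse matrix instead of a dense ndarray.
    S = make_sparse_spd_matrix(25, sparse_format="csr", random_state=0)

    # data_home now accepts any os.PathLike, e.g. a pathlib.Path.
    housing = fetch_california_housing(data_home=Path.home() / "scikit_learn_data")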

sklearn.decomposition

  • An "auto" option was added to the n_components parameter of decomposition.non_negative_factorization, decomposition.NMF and decomposition.MiniBatchNMF to automatically infer the number of components from W or H shapes when using a custom initialization. The default value of this parameter will change from None to auto in version 1.6. 26634 by Alexandre Landeau <AlexL> and Alexandre Vigny <avigny>.
  • decomposition.PCA now supports the Array API for the full and randomized solvers (with QR power iterations). See array_api for more details. 26315 and 27098 by Mateusz Sokół <mtsokol>, Olivier Grisel <ogrisel> and Edoardo Abati <EdAbati>.
  • Fixes a bug in decomposition.KernelPCA by forcing the output of the internal preprocessing.KernelCenterer to be a default array. When the arpack solver was used, it would expect an array with a dtype attribute. 27583 by Guillaume Lemaitre <glemaitre>.
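
A rough sketch of the "auto" option described in the first entry of this section; the matrix sizes and initialization are arbitrary::

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.RandomState(0)
    X = rng.rand(20, 10)
    W_init, H_init = rng.rand(20, 4), rng.rand(4, 10)

    # With init="custom", the number of components (here 4) is read off W/H.
    nmf = NMF(n_components="auto", init="custom", max_iter=500)
    W = nmf.fit_transform(X, W=W_init, H=H_init)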

sklearn.ensemble

  • ensemble.RandomForestClassifier and ensemble.RandomForestRegressor support missing values when the criterion is gini, entropy, or log_loss for classification, or squared_error, friedman_mse, or poisson for regression. 26391 by Thomas Fan.
  • ensemble.RandomForestClassifier, ensemble.RandomForestRegressor, ensemble.ExtraTreesClassifier and ensemble.ExtraTreesRegressor now support monotonic constraints, useful when features are supposed to have a positive/negative effect on the target (see the sketch after this list). Missing values in the train data and multi-output targets are not supported. 13649 by Samuel Ronsin <samronsin>, initiated by Patrick O'Reilly <pat-oreilly>.
  • ensemble.GradientBoostingClassifier is faster for binary and in particular for multiclass problems, thanks to the private loss function module. 26278 by Christian Lorentzen <lorentzenchr>.
  • Improves runtime and memory usage for ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor when trained on sparse data. 26957 by Thomas Fan.
  • In ensemble.AdaBoostClassifier, the algorithm argument SAMME.R was deprecated and will be removed in 1.6. 26830 by Stefanie Senger <StefanieSenger>.
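
A minimal sketch of the monotonic constraints mentioned above; the data and the constraint vector are illustrative only::

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.RandomState(0)
    X = rng.rand(200, 2)
    y = 5 * X[:, 0] + 0.1 * rng.rand(200)

    # 1: predictions must be non-decreasing in feature 0; 0: feature 1 is unconstrained.
    forest = RandomForestRegressor(monotonic_cst=[1, 0], random_state=0).fit(X, y)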

sklearn.inspection

  • inspection.DecisionBoundaryDisplay now accepts a parameter class_of_interest to select the class of interest when plotting the response provided by response_method="predict_proba" or response_method="decision_function". This makes it possible to plot the decision boundary for both binary and multiclass classifiers (see the sketch after this list). 27291 by Guillaume Lemaitre <glemaitre>.
  • inspection.DecisionBoundaryDisplay now raises an AttributeError instead of a ValueError when an estimator does not implement the requested response method. 27291 by Guillaume Lemaitre <glemaitre>.
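
A short sketch of class_of_interest (requires matplotlib; the dataset and classifier are arbitrary)::

    from sklearn.datasets import load_iris
    from sklearn.inspection import DecisionBoundaryDisplay
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    X = X[:, :2]  # keep two features so the boundary can be drawn
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # Plot the predicted probability of class 2 only.
    disp = DecisionBoundaryDisplay.from_estimator(
        clf, X, response_method="predict_proba", class_of_interest=2
    )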

sklearn.linear_model

  • linear_model.LogisticRegression and linear_model.LogisticRegressionCV now have much better convergence for solvers "lbfgs" and "newton-cg". Both solvers can now reach much higher precision for the coefficients depending on the specified tol. Additionally, lbfgs can make better use of tol, i.e., stop sooner or reach higher precision. This is accomplished by better scaling of the objective function, i.e., using average per sample losses instead of sum of per sample losses. 26721 by Christian Lorentzen <lorentzenchr>.

    Note

    This change also means that with this new version of scikit-learn, the resulting coefficients coef_ and intercept_ of your models will change for these two solvers (when fit on the same data again). The amount of change depends on the specified tol; smaller values yield more precise results.

  • linear_model.LogisticRegression and linear_model.LogisticRegressionCV with solver "newton-cg" can now be considerably faster for some data and parameter settings. This is accomplished by a better line search convergence check for negligible loss improvements that takes into account gradient information. 26721 by Christian Lorentzen <lorentzenchr>.
  • Solver "newton-cg" in linear_model.LogisticRegression and linear_model.LogisticRegressionCV uses a little less memory. The effect is proportional to the number of coefficients (n_features * n_classes). 27417 by Christian Lorentzen <lorentzenchr>.

sklearn.metrics

  • Computing pairwise distances via metrics.DistanceMetric for CSR × CSR, Dense × CSR, and CSR × Dense datasets is now 1.5x faster. 26765 by Meekail Zain <micky774>.
  • Computing distances via metrics.DistanceMetric for CSR × CSR, Dense × CSR, and CSR × Dense now uses ~50% less memory, and outputs distances in the same dtype as the provided data. 27006 by Meekail Zain <micky774>.
  • Improve the rendering of the plot obtained with the metrics.PrecisionRecallDisplay and metrics.RocCurveDisplay classes: the x- and y-axis limits are set to [0, 1] and the aspect ratio between both axes is set to 1 to get a square plot. 26366 by Mojdeh Rastgoo <mrastgoo>.
  • Added neg_root_mean_squared_log_error_scorer as a scorer. 26734 by Alejandro Martin Gil <101AlexMartin>.
  • sklearn.metrics.accuracy_score and sklearn.metrics.zero_one_loss now support Array API compatible inputs. 27137 by Edoardo Abati <EdAbati>.
  • Fixes a bug for metrics using zero_division=np.nan (e.g. ~metrics.precision_score) within a parallel loop (e.g. ~model_selection.cross_val_score) where the singleton for np.nan would be different in the sub-processes. 27573 by Guillaume Lemaitre <glemaitre>.
  • The squared parameter of metrics.mean_squared_error and metrics.mean_squared_log_error is deprecated and will be removed in 1.6. Use the new functions metrics.root_mean_squared_error and metrics.root_mean_squared_log_error instead. 26734 by Alejandro Martin Gil <101AlexMartin>.
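
A minimal sketch of the replacement functions from the last entry; the values are arbitrary::

    from sklearn.metrics import root_mean_squared_error, root_mean_squared_log_error

    y_true, y_pred = [3.0, 5.0, 2.5], [2.5, 5.0, 4.0]
    rmse = root_mean_squared_error(y_true, y_pred)   # was mean_squared_error(..., squared=False)
    rmsle = root_mean_squared_log_error(y_true, y_pred)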

sklearn.model_selection

  • sklearn.model_selection.train_test_split now supports Array API compatible inputs. 26855 by Tim Head.
  • model_selection.GridSearchCV, model_selection.RandomizedSearchCV, and model_selection.HalvingGridSearchCV now don't change the given object in the parameter grid if it's an estimator. 26786 by Adrin Jalali.

sklearn.neighbors

  • sklearn.neighbors.KNeighborsRegressor.predict and sklearn.neighbors.KNeighborsClassifier.predict_proba now efficiently support pairs of dense and sparse datasets. 27018 by Julien Jerphanion <jjerphan>.
  • neighbors.KNeighborsRegressor now accepts metrics.DistanceMetric objects directly via the metric keyword argument, allowing for the use of accelerated third-party metrics.DistanceMetric objects (see the sketch after this list). 26267 by Meekail Zain <micky774>.
  • The performance of neighbors.RadiusNeighborsClassifier.predict and of neighbors.RadiusNeighborsClassifier.predict_proba has been improved when radius is large and algorithm="brute" with non-Euclidean metrics. 26828 by Omar Salman <OmarManzoor>.
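
A rough sketch of passing a DistanceMetric instance directly, as described above; the metric and data are arbitrary::

    import numpy as np
    from sklearn.metrics import DistanceMetric
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.RandomState(0)
    X, y = rng.rand(50, 3), rng.rand(50)

    manhattan = DistanceMetric.get_metric("manhattan")
    knn = KNeighborsRegressor(n_neighbors=5, metric=manhattan).fit(X, y)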

sklearn.preprocessing

  • preprocessing.MinMaxScaler and preprocessing.MaxAbsScaler now support the Array API. Array API support is considered experimental and might evolve without being subject to our usual rolling deprecation cycle policy. See array_api for more details. 26243 by Tim Head and 27110 by Edoardo Abati <EdAbati>.
  • preprocessing.OrdinalEncoder avoids calculating missing indices twice to improve efficiency. 27017 by Xuefeng Xu <xuefeng-xu>.
  • Improves warnings in preprocessing.FunctionTransformer when func returns a pandas dataframe and the output is configured to be pandas. 26944 by Thomas Fan.
  • preprocessing.TargetEncoder now supports target_type 'multiclass'. 26674 by Lucy Liu <lucyleeow>.
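
A short sketch of the multiclass support from the last entry; the categories and labels are randomly generated::

    import numpy as np
    from sklearn.preprocessing import TargetEncoder

    rng = np.random.RandomState(0)
    X = rng.choice(["a", "b", "c"], size=(100, 1))
    y = rng.randint(0, 3, size=100)

    enc = TargetEncoder(target_type="multiclass")
    X_trans = enc.fit_transform(X, y)  # one encoded column per (feature, class) pair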

sklearn.tree

  • tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, tree.ExtraTreeClassifier and tree.ExtraTreeRegressor now support monotonic constraints, useful when features are supposed to have a positive/negative effect on the target. Missing values in the train data and multi-output targets are not supported. 13649 by Samuel Ronsin <samronsin>, initiated by Patrick O'Reilly <pat-oreilly>.

sklearn.utils

  • sklearn.utils.estimator_html_repr dynamically adapts diagram colors based on the browser's prefers-color-scheme, providing improved adaptability to dark mode environments. 26862 by Andrew Goh Yisheng <9y5>, Thomas Fan, Adrin Jalali.
  • ~utils.metadata_routing.MetadataRequest and ~utils.metadata_routing.MetadataRouter now have a consumes method which can be used to check whether a given set of parameters would be consumed. 26831 by Adrin Jalali.
  • sklearn.utils.check_array now accepts both sparse matrices and sparse arrays from the SciPy sparse module. The previous implementation would fail when copy=True because it called np.may_share_memory, which does not work with SciPy sparse arrays and does not return the correct result for SciPy sparse matrices. 27336 by Guillaume Lemaitre <glemaitre>.
  • sklearn.utils.extmath.log_logistic is deprecated and will be removed in 1.6. Use -np.logaddexp(0, -x) instead. 27544 by Christian Lorentzen <lorentzenchr>.

Code and Documentation Contributors

Thanks to everyone who has contributed to the maintenance and improvement of the project since version 1.3, including:

TODO: update at the time of the release.