Release highlights for 1.3 #26526
Merged: jeremiedbb merged 11 commits into scikit-learn:main from jeremiedbb:release-highlights-1.3 on Jun 29, 2023
Changes from 7 of the 11 commits:
- 46605b3 draft highlights (jeremiedbb)
- 340cc73 lint (jeremiedbb)
- 0d434a5 typo (jeremiedbb)
- 4fe4f1c add metadata routing to highlights (adrinjalali)
- cb08e4e link in what's new (jeremiedbb)
- 06ba928 address review comments (jeremiedbb)
- df5202d Merge remote-tracking branch 'upstream/main' into pr/jeremiedbb/26526 (jeremiedbb)
- 6efe3a2 add gamma deviance / infrequent OrdinalEncoder / ValidationCurveDisplay (jeremiedbb)
- 765f912 Merge remote-tracking branch 'upstream/main' into pr/jeremiedbb/26526 (jeremiedbb)
- 2970ff6 cln (jeremiedbb)
- a7b9264 Update examples/release_highlights/plot_release_highlights_1_3_0.py (jeremiedbb)
examples/release_highlights/plot_release_highlights_1_3_0.py (new file: 99 additions, 0 deletions)
# flake8: noqa
"""
=======================================
Release Highlights for scikit-learn 1.3
=======================================

.. currentmodule:: sklearn

We are pleased to announce the release of scikit-learn 1.3! Many bug fixes
and improvements were added, as well as some key new features. We detail
below a few of the major features of this release. **For an exhaustive list of
all the changes**, please refer to the :ref:`release notes <changes_1_3>`.

To install the latest version (with pip)::

    pip install --upgrade scikit-learn

or with conda::

    conda install -c conda-forge scikit-learn

"""

# %%
# Metadata Routing
# ----------------
# We are in the process of introducing a new way to route metadata such as
# ``sample_weight`` throughout the codebase, which would affect how
# meta-estimators such as :class:`pipeline.Pipeline` and
# :class:`model_selection.GridSearchCV` route metadata. While the
# infrastructure for this feature is already included in this release, the work
# is ongoing and not all meta-estimators support this new feature yet. You can
# read more about this feature in the :ref:`Metadata Routing User Guide
# <metadata_routing>`.
#
# Third-party developers can already start incorporating this into their
# meta-estimators. For more details, see the
# :ref:`metadata routing developer guide
# <sphx_glr_auto_examples_miscellaneous_plot_metadata_routing.py>`.

# %%
# HDBSCAN: hierarchical density-based clustering
# ----------------------------------------------
# Originally hosted in the scikit-learn-contrib repository, :class:`cluster.HDBSCAN`
# has been adopted into scikit-learn. It is missing a few features from the original
# implementation, which will be added in future releases.
# By performing :class:`cluster.DBSCAN` over varying epsilon values,
# :class:`cluster.HDBSCAN` finds clusters of varying densities, making it more
# robust to parameter selection than :class:`cluster.DBSCAN`. More details in the
# :ref:`User Guide <hdbscan>`.
import numpy as np
from sklearn.cluster import HDBSCAN
from sklearn.datasets import load_digits
from sklearn.metrics import v_measure_score

X, true_labels = load_digits(return_X_y=True)
print(f"number of digits: {len(np.unique(true_labels))}")

hdbscan = HDBSCAN(min_cluster_size=15).fit(X)
non_noisy_labels = hdbscan.labels_[hdbscan.labels_ != -1]
print(f"number of clusters found: {len(np.unique(non_noisy_labels))}")

print(v_measure_score(true_labels[hdbscan.labels_ != -1], non_noisy_labels))

# %%
# TargetEncoder: a new category encoding strategy
# -----------------------------------------------
# Well suited for categorical features with high cardinality,
# :class:`preprocessing.TargetEncoder` encodes the categories based on a shrunk
# estimate of the average target values for observations belonging to that category.
# More details in the :ref:`User Guide <target_encoder>`.
import numpy as np
from sklearn.preprocessing import TargetEncoder

X = np.array([["cat"] * 30 + ["dog"] * 20 + ["snake"] * 38], dtype=object).T
y = [90.3] * 30 + [20.4] * 20 + [21.2] * 38

enc = TargetEncoder(random_state=0)
X_trans = enc.fit_transform(X, y)

enc.encodings_

# %%
# Missing values support in decision trees
# ----------------------------------------
# The classes :class:`tree.DecisionTreeClassifier` and
# :class:`tree.DecisionTreeRegressor` now support missing values. For each potential
# threshold on the non-missing data, the splitter will evaluate the split with all the
# missing values going to the left node or the right node.
# More details in the :ref:`User Guide <tree_missing_value_support>`.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([0, 1, 6, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
tree.predict(X)
Review comment (on the HDBSCAN section):