[RELEASE] cudf v24.04 #15367

raydouglass · 2024-03-21T16:59:56Z

❄️ Code freeze for `branch-24.04` and v24.04 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-24.04 until release (merging of this PR).

What is the purpose of this PR?

- Update documentation
- Allow testing for the new release
- Enable a means to merge branch-24.04 into main for the release

Fix documentation builds with pandas 2 changes

Add `pandas-2.x` support in `cudf`

Fixes #14781. This PR makes changes to the Parquet writer to ensure that data to be compressed is properly aligned. Changes have also been made to the `EncPage` struct to make it easier to keep fields in that struct aligned, and also to reduce confusing re-use of fields. In particular, the `max_data_size` field can be any of a) the maximum possible size for the page data, b) the actual size of page data after encoding, c) the actual size of compressed page data. The latter two now have their own fields, `data_size` and `comp_data_size`. Authors: - Ed Seidl (https://github.com/etseidl) - Mike Wilson (https://github.com/hyperbolic2346) Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - Vukasin Milovanovic (https://github.com/vuule) URL: #14841
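For readers unfamiliar with buffer alignment, the general idea of rounding an offset up to an aligned boundary can be sketched as follows. This is a generic illustration in Python, not the Parquet writer's code; the helper name `align_up` and the 8-byte alignment in the usage lines are assumptions made for the example, since the description above does not state the alignment used.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round `offset` up to the next multiple of `alignment`.

    Hypothetical helper for illustration only. `alignment` must be a
    positive power of two so that the bit-mask trick below is valid.
    """
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    # Add (alignment - 1), then clear the low bits to land on a boundary.
    return (offset + alignment - 1) & ~(alignment - 1)


# Example: place three page buffers back to back, each starting on an
# (assumed) 8-byte boundary.
page_sizes = [13, 16, 5]
offsets = []
cursor = 0
for size in page_sizes:
    cursor = align_up(cursor, 8)
    offsets.append(cursor)
    cursor += size
```

With the sizes above, the second page starts at offset 16 rather than 13, so a consumer that requires aligned input never sees a misaligned start address.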

This PR fixes cudf's `__dask_tokenization__` definitions so that they will produce data that can be deterministically tokenized when a `MultiIndex` is present. I ran into this problem in dask-expr for an index with datetime data (a case reflected by the new test). Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Charles Blackmon-Luca (https://github.com/charlesbluca) URL: #14829

As noted what's public in https://pandas.pydata.org/docs/reference/index.html Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #14929

Discussed offline, replacing this legacy import style without aliasing Authors: - Matthew Roeschke (https://github.com/mroeschke) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Richard (Rick) Zamora (https://github.com/rjzamora) - Bradley Dice (https://github.com/bdice) URL: #14944

This PR updates the cudf.pandas docs to reflect cudf using pandas 2. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #14940

@davidwendt

This PR makes the codecov status always pass ✔️ so that it doesn't distract from actual CI failures in the commit CI summary. https://docs.codecov.com/docs/commit-status#informational cc: @davidwendt @mroeschke Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Matthew Roeschke (https://github.com/mroeschke) - Vyas Ramasubramani (https://github.com/vyasr) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #14952

A few cleanups in test files following #14916. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Charles Blackmon-Luca (https://github.com/charlesbluca) URL: #14941

Use `make_offsets_child_column` and `offsetalator_iterator` to build/access offsets instead of hardcoded types. This cleans up the code nicely by automatically handling offset overflow and computing the total number of matches. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Mike Wilson (https://github.com/hyperbolic2346) URL: #14745

This PR implements groupby in pylibcudf along with the minimal set of aggregation logic to support groupby. To limit its scope, this PR does not include other aggregation logic for e.g. non-groupby reductions and scans. Due to the large scale of what's already in this PR, I have also omitted the changes required to leverage pylibcudf in the current cudf Cython code from this PR. That will be done in a follow-up. This PR's diff is misleadingly large, a large chunk of it is adding documentation and function declarations that shouldn't impose too heavy a cognitive load in review. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Ashwin Srinath (https://github.com/shwina) - Matthew Roeschke (https://github.com/mroeschke) URL: #14945

Removes the functions deprecated in 24.02 in #14202. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) - Vyas Ramasubramani (https://github.com/vyasr) URL: #14848

Removes hardcoded size-type for offset variables and replaces them with offsetalator iterator. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Bradley Dice (https://github.com/bdice) URL: #14744

This PR updates spilling tests to limit the scope of all environment and cudf option modifications to the scope of the test to avoid interfering with other tests. This PR also temporarily skips a test that is not currently safe to run in an environment where other tests may already have modified related global state (the rmm default memory resource). Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) URL: #14958

Replace struct_column_wrapper with structs_column_wrapper in example given in documentation of `structs_column_wrapper` Authors: - Karthikeyan (https://github.com/karthikeyann) Approvers: - David Wendt (https://github.com/davidwendt) - Bradley Dice (https://github.com/bdice) URL: #14949

This is more explicit than the methods which may allow array objects where we don't want to Authors: - Matthew Roeschke (https://github.com/mroeschke) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #14943

This PR migrates the unary operations in cuDF Python to pylibcudf. Authors: - Vyas Ramasubramani (https://github.com/vyasr) - Bradley Dice (https://github.com/bdice) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #14850

This PR builds on #14945 to use pylibcudf's groupby in cudf's internals. It should not be merged until after that PR. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #14946

This change makes the pylibcudf API more convenient and a more faithful reproduction of the underlying libcudf APIs that offer overloaded signatures. In cases like binary ops where we were previously using runtime instance checks, this change also removes unnecessary runtime overhead if the calling code is Cython since in those cases the types at the call site are known at compile time. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #14969

`branch-24.04` was opened without updating the copyrights on a few files. This fixes those missing copyright updates, which keep getting updated by our pre-commit hooks for me locally. Authors: - Bradley Dice (https://github.com/bdice) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Ray Douglass (https://github.com/raydouglass)

Calling `SeriesGroupBy.aggregate` is currently directed to `GroupBy.agg` instead of `SeriesGroupBy.agg`. This means that `SeriesGroupBy.aggregate` currently produces a `DataFrame` in many cases that it *should* produce a `Series`. This PR corrects the underlying problem. Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Bradley Dice (https://github.com/bdice) - Matthew Roeschke (https://github.com/mroeschke) URL: #14971
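One common way this kind of misdirection arises in Python is a class-level alias that is bound in the base class and not re-declared in the subclass. The sketch below uses toy classes to show the pattern; it is an assumption about the general mechanism, not a copy of cudf's implementation, and the tuple return values merely stand in for the result types.

```python
class GroupBy:
    def agg(self, func):
        # Stand-in for an aggregation that returns a DataFrame.
        return ("DataFrame", func)

    # The alias is bound to GroupBy.agg when the class body executes.
    aggregate = agg


class BrokenSeriesGroupBy(GroupBy):
    def agg(self, func):
        # Stand-in for an aggregation that returns a Series.
        return ("Series", func)

    # No alias here: `aggregate` is inherited and still calls GroupBy.agg.


class FixedSeriesGroupBy(GroupBy):
    def agg(self, func):
        return ("Series", func)

    # Re-declaring the alias points it at the subclass implementation.
    aggregate = agg


broken_result = BrokenSeriesGroupBy().aggregate("sum")
fixed_result = FixedSeriesGroupBy().aggregate("sum")
```

Here `broken_result` carries the `"DataFrame"` tag even though `agg` was overridden, while `fixed_result` carries `"Series"`, which mirrors the symptom described above.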

Replaces hardcoded offset types as size-type with the offsetalator or int64 (for temporary vectors). Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) URL: #14888

This PR deprecates "H", "N", "T", "L", "U" and "S" as frequencies in all datetime APIs. This PR prepares `branch-24.04` for `pandas-2.2` support. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #14967
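For code that still passes the old aliases, a small translation step can ease the migration. The sketch below maps each deprecated alias to the replacement that pandas 2.2 introduced for it ("h", "min", "s", "ms", "us", "ns"); the helper name `modernize_freq` is hypothetical, and it is an assumption that cudf accepts the same lowercase spellings in every datetime API.

```python
import re

# Deprecated alias -> pandas 2.2 replacement.
_DEPRECATED_FREQS = {
    "H": "h",
    "T": "min",
    "S": "s",
    "L": "ms",
    "U": "us",
    "N": "ns",
}

# An optional integer multiple followed by an alphabetic unit, e.g. "5T".
_FREQ_PATTERN = re.compile(r"^(\d*)([A-Za-z]+)$")


def modernize_freq(freq: str) -> str:
    """Rewrite a deprecated frequency alias; leave anything else untouched."""
    match = _FREQ_PATTERN.match(freq)
    if match is None:
        return freq
    count, unit = match.groups()
    return count + _DEPRECATED_FREQS.get(unit, unit)
```

For example, `modernize_freq("5T")` yields `"5min"`, while aliases that were not deprecated, such as `"D"`, pass through unchanged.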

Adds offsetalator in place of hardcoded offset type arrays to the strings split functions:
- `nvtext::tokenize()`
- `nvtext::count_tokens()`
- `nvtext::character_tokenize()`
- `nvtext::ngrams_tokenize()`
- `nvtext::tokenize_with_vocabulary()`

Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) URL: #14783

…14976) This PR deprecates non-integer `periods` in `date_range` and `interval_range` to match pandas-2.2 deprecations. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #14976
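Callers that compute `periods` arithmetically can end up passing a float. A minimal sketch of normalizing the value before calling `date_range` or `interval_range` is shown below; the helper name `coerce_periods` and its strictness (rejecting fractional values outright) are assumptions for illustration, not behavior taken from cudf or pandas.

```python
def coerce_periods(periods):
    """Return `periods` as an int, rejecting values that are not whole numbers.

    Hypothetical helper: integer-valued floats such as 3.0 are converted,
    fractional floats and non-numeric values raise TypeError.
    """
    if isinstance(periods, bool):
        raise TypeError("periods must be an integer, not a bool")
    if isinstance(periods, int):
        return periods
    if isinstance(periods, float) and periods.is_integer():
        return int(periods)
    raise TypeError(f"periods must be an integer, got {periods!r}")


# Example: a duration divided by a step produces a float.
n_periods = coerce_periods(12 / 4)
```

The resulting `n_periods` is a plain `int` and can be passed as `periods=n_periods` without triggering the deprecation path.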

Contributes to #13921 Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #14972

Contributes to #13921 Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #14970

Forward-merge branch-24.02 to branch-24.04

This PR lists all notable breaking changes that will be happening in `cudf` as part of `pandas-2.0` upgrade. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #13535

Closes #15350. This PR changes the order of the callback `MemoryBuffer.onClosed` to happen after our `MemoryCleaner` finishes. This is done so that we can accurately, and safely, reflect the state of the memory resource (be it device or host). This PR is needed to address a bug found in spark-rapids here: NVIDIA/spark-rapids#10585. Authors: - Alessandro Bellina (https://github.com/abellina) Approvers: - Nghia Truong (https://github.com/ttnghia) - Gera Shegalov (https://github.com/gerashegalov) URL: #15351

…nked reader. (#15342) Fixes #15306. The core issue here was that, under certain conditions, the chunked reader could generate invalid page indices for list columns. This led to corruption in the decode kernels. The fix is fairly simple, but a decent amount of the delta in this PR is just name changes for clarity plus some more comments/docs. The fix also affected the number of chunks generated in some of the very (unrealistically) constrained tests. Authors: - https://github.com/nvdbaranec - Nghia Truong (https://github.com/ttnghia) Approvers: - Nghia Truong (https://github.com/ttnghia) - Vukasin Milovanovic (https://github.com/vuule) URL: #15342

This fixes an issue with how the `verify-copyright` hook handles multiple merge bases. Authors: - Kyle Edwards (https://github.com/KyleFromNVIDIA) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #15355

copy-pr-bot · 2024-03-21T17:00:01Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

review-notebook-app · 2024-03-21T17:00:05Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

Fixes a docs build error since `DLManagedTensor` cannot be resolved from the dlpack documentation.

conda dropped support for the `--force` flag to `conda env create`. This changes that flag name to `--yes`. See https://github.com/conda/conda/blob/main/CHANGELOG.md#2430-2024-03-12 and rapidsai/miniforge-cuda#63 for more info.

…'` (#15476)

## Description

When `dtype='category'` we seem to error:

```
File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cuml/preprocessing/LabelEncoder.py", line 218, in transform
2024-04-05T19:37:35.8255262Z E   y = cudf.Series('a', dtype="category")
2024-04-05T19:37:35.8257445Z E   ^^^^^^^^^^^^^^^^^
2024-04-05T19:37:35.8260865Z E File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
2024-04-05T19:37:35.8264174Z E   result = func(*args, **kwargs)
2024-04-05T19:37:35.8266324Z E   ^^^^^^^^^^^^^^^^^
2024-04-05T19:37:35.8270003Z E File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/core/series.py", line 648, in __init__
2024-04-05T19:37:35.8273382Z E   column = as_column(
2024-04-05T19:37:35.8275420Z E   ^^^^^^^^^^^^^^^^^
2024-04-05T19:37:35.8279989Z E File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/core/column/column.py", line 2022, in as_column
2024-04-05T19:37:35.8281584Z E   arbitrary = cudf.Scalar(arbitrary, dtype=dtype)
2024-04-05T19:37:35.8282461Z E   ^^^^^^^^^^^^^^^^^
2024-04-05T19:37:35.8283768Z E File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/core/scalar.py", line 57, in __call__
2024-04-05T19:37:35.8285137Z E   obj = super().__call__(value, dtype=dtype)
2024-04-05T19:37:35.8285959Z E   ^^^^^^^^^^^^^^^^^
2024-04-05T19:37:35.8287757Z E File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/core/scalar.py", line 128, in __init__
2024-04-05T19:37:35.8289232Z E   self._host_value, self._host_dtype = self._preprocess_host_value(
2024-04-05T19:37:35.8290183Z E   ^^^^^^^^^^^^^^^^^
2024-04-05T19:37:35.8291705Z E File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/core/scalar.py", line 222, in _preprocess_host_value
2024-04-05T19:37:35.8293212Z E   value = to_cudf_compatible_scalar(value, dtype=dtype)
2024-04-05T19:37:35.8294438Z E   ^^^^^^^^^^^^^^^^^
2024-04-05T19:37:35.8296026Z E File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/utils/dtypes.py", line 257, in to_cudf_compatible_scalar
2024-04-05T19:37:35.8297604Z E   if isinstance(val, str) and np.dtype(dtype).kind == "M":
2024-04-05T19:37:35.8298543Z E   ^^^^^^^^^^^^^^^^^
2024-04-05T19:37:35.8308752Z E TypeError: data type 'category' not understood
```

## Checklist

- [x] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.

vyasr and others added 30 commits January 30, 2024 19:13

Remove tests of now unsupported reductions

4f0563d

Address feedback

07e9872

Merge pull request #14937 from vyasr/fix/doc_errors

f281b90

Merge pull request #14916 from rapidsai/pandas_2.0_feature_branch

238a03f

Use more public pandas APIs (#14929)

767dde1

Update cudf.pandas FAQ. (#14940)

bbfe1c3

Update tests for pandas 2. (#14941)

2b0d987

Migrate unary operations to pylibcudf (#14850)

fc83eff

Implement joins in pylibcudf (#14972)

c7e3dc5

Implement scans and reductions in pylibcudf (#14970)

d29b8a8

Merge pull request #14985 from rapidsai/branch-24.02

53fa4f4

galipremsagar and others added 4 commits March 20, 2024 21:26

List all notable breaking changes (#13535)

bf58765

Update pre-commit-hooks to v0.0.3 (#15355)

23aad9e

raydouglass requested review from a team as code owners March 21, 2024 16:59

raydouglass requested review from shwina, charlesbluca, shrshi and PointKernel March 21, 2024 16:59

github-actions bot added the libcudf, cuDF (Python), CMake, conda, cuDF (Java), and ci labels Mar 21, 2024

galipremsagar approved these changes Mar 21, 2024

Ignore DLManagedTensor in the docs build (#15392)

e3cbf62

ttnghia approved these changes Mar 26, 2024

bdice and others added 3 commits March 27, 2024 14:39

Use conda env create --yes instead of --force (#15403)

35f818b

Update Changelog [skip ci]

94726ad

galipremsagar approved these changes Apr 10, 2024

raydouglass merged commit 27713f2 into main Apr 10, 2024
3 of 4 checks passed
