[RELEASE] cudf v24.04 #15367

Merged
merged 548 commits into from
Apr 10, 2024

@raydouglass raydouglass commented Mar 21, 2024

❄️ Code freeze for branch-24.04 and v24.04 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-24.04 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-24.04 into main for the release

vyasr and others added 30 commits January 30, 2024 19:13
Fix documentation builds with pandas 2 changes
)

Fixes #14781 

This PR makes changes to the Parquet writer to ensure that data to be compressed is properly aligned. Changes have also been made to the `EncPage` struct to make it easier to keep fields in that struct aligned, and also to reduce confusing re-use of fields. In particular, the `max_data_size` field can be any of a) the maximum possible size for the page data, b) the actual size of page data after encoding, c) the actual size of compressed page data. The latter two now have their own fields, `data_size` and `comp_data_size`.
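
To illustrate the two ideas in this description (aligned page buffers and one field per meaning), here is a minimal Python sketch. It is not the writer's code: the real `EncPage` is a C++/CUDA struct, and `PageSizes`, `align_up`, `layout_pages`, and the 8-byte `ALIGNMENT` are hypothetical names and values chosen only for illustration. Only the three field names come from the description above.

```python
from dataclasses import dataclass

ALIGNMENT = 8  # assumed alignment in bytes, for illustration only


def align_up(offset: int, alignment: int = ALIGNMENT) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


@dataclass
class PageSizes:
    """Hypothetical stand-in that keeps each size in its own field."""
    max_data_size: int       # upper bound reserved before encoding
    data_size: int = 0       # actual bytes produced by encoding
    comp_data_size: int = 0  # actual bytes after compression


def layout_pages(pages):
    """Return an aligned start offset for each page's reserved buffer."""
    offsets = []
    cursor = 0
    for page in pages:
        cursor = align_up(cursor)      # start every page on an aligned boundary
        offsets.append(cursor)
        cursor += page.max_data_size   # reserve the worst-case size
    return offsets


pages = [PageSizes(13), PageSizes(20), PageSizes(7)]
page_offsets = layout_pages(pages)
```

With separate fields, a reader of the struct never has to guess which of the three meanings `max_data_size` currently holds.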

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #14841
This PR fixes cudf's `__dask_tokenization__` definitions so that they will produce data that can be deterministically tokenized when a `MultiIndex` is present. I ran into this problem in dask-expr for an index with datetime data (a case reflected by the new test).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #14829
Discussed offline, replacing this legacy import style without aliasing

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - Bradley Dice (https://github.com/bdice)

URL: #14944
This PR updates the cudf.pandas docs to reflect cudf using pandas 2.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #14940
This PR makes the codecov status always pass ✔️ so that it doesn't distract from actual CI failures in the commit CI summary.

https://docs.codecov.com/docs/commit-status#informational

cc: @davidwendt @mroeschke

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #14952
A few cleanups in test files following #14916.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #14941
Use `make_offsets_child_column` and `offsetalator_iterator` to build/access offsets instead of hardcoded types.
This cleans up the code nicely by automatically handling offset overflow and computing the total number of matches.
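
As a rough analogy for what building offsets from per-row sizes involves, the pure-Python sketch below derives an offsets list, the total, and an offset width. It does not call libcudf: `make_offsets` is a hypothetical helper, and treating "handling overflow" as "pick a 64-bit width when the total exceeds the 32-bit range" is an assumption made for illustration.

```python
from itertools import accumulate

INT32_MAX = 2**31 - 1


def make_offsets(sizes):
    """Build offsets (len(sizes) + 1 entries) from per-row sizes.

    Returns the offsets, the assumed offset width, and the total,
    which is simply the last offset.
    """
    offsets = [0, *accumulate(sizes)]
    total = offsets[-1]
    # Assumption: switch to a wider type once 32 bits cannot hold the total.
    width = "int32" if total <= INT32_MAX else "int64"
    return offsets, width, total


small_offsets, small_width, small_total = make_offsets([3, 0, 2])
_, large_width, _ = make_offsets([2**31, 1])
```

Row `i` then spans `offsets[i]` to `offsets[i + 1]`, so code that reads offsets through one accessor does not need to know which width was chosen.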

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #14745
This PR implements groupby in pylibcudf along with the minimal set of aggregation logic to support groupby. To limit its scope, this PR does not include other aggregation logic for e.g. non-groupby reductions and scans. Due to the large scale of what's already in this PR, I have also omitted the changes required to leverage pylibcudf in the current cudf Cython code from this PR. That will be done in a follow-up. This PR's diff is misleadingly large: a large chunk of it adds documentation and function declarations that shouldn't impose too heavy a cognitive load in review.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #14945
Removes the functions deprecated in 24.02 in #14202.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #14848
Removes hardcoded size-type for offset variables and replaces them with offsetalator iterator.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #14744
This PR updates spilling tests to limit all environment and cudf option modifications to the scope of the test, to avoid interfering with other tests. This PR also temporarily skips a test that is not currently safe to run in an environment where other tests may already have modified related global state (the rmm default memory resource).
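
One way to keep an environment change scoped to a single test is a context manager that restores the previous value on exit. The sketch below is only an illustration of that pattern, not the code in this PR: `scoped_env` and the variable name `EXAMPLE_SPILL_FLAG` are hypothetical.

```python
import os
from contextlib import contextmanager


@contextmanager
def scoped_env(name, value):
    """Set an environment variable for the duration of a with-block."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the prior state even if the test body raised.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


# Hypothetical variable name, used only for this example.
with scoped_env("EXAMPLE_SPILL_FLAG", "on"):
    inside = os.environ["EXAMPLE_SPILL_FLAG"]
leaked = "EXAMPLE_SPILL_FLAG" in os.environ
```

In a pytest suite, the built-in `monkeypatch` fixture (`monkeypatch.setenv`) gives the same restore-on-teardown behavior without a custom helper.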

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #14958
Replace `struct_column_wrapper` with `structs_column_wrapper` in the example given in the documentation of `structs_column_wrapper`.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)

URL: #14949
This is more explicit than the methods, which may allow array objects where we don't want them.

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #14943
This PR migrates the unary operations in cuDF Python to pylibcudf.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #14850
This PR builds on #14945 to use pylibcudf's groupby in cudf's internals. It should not be merged until after that PR.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #14946
This change makes the pylibcudf API more convenient and a more faithful reproduction of the underlying libcudf APIs that offer overloaded signatures. In cases like binary ops where we were previously using runtime instance checks, this change also removes unnecessary runtime overhead if the calling code is Cython since in those cases the types at the call site are known at compile time.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #14969
`branch-24.04` was opened without updating the copyrights on a few files. This fixes those missing copyright updates, which keep getting updated by our pre-commit hooks for me locally.

Authors:
   - Bradley Dice (https://github.com/bdice)
   - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
   - GALI PREM SAGAR (https://github.com/galipremsagar)
   - Ray Douglass (https://github.com/raydouglass)
Calling `SeriesGroupBy.aggregate` is currently directed to `GroupBy.agg` instead of `SeriesGroupBy.agg`. This means that `SeriesGroupBy.aggregate` currently produces a `DataFrame` in many cases that it *should* produce a `Series`. This PR corrects the underlying problem.
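
A common way for this kind of misdirection to arise in Python is a class-level alias that is bound once in the base class. The toy classes below show the effect and one possible fix. They are simplified stand-ins, not cudf's classes, and whether this is the exact mechanism behind the bug is an assumption.

```python
class GroupBy:
    def agg(self, func):
        return ("DataFrame", func)

    # Bound to GroupBy.agg when the class body runs; subclasses that
    # override agg do not change what this name points to.
    aggregate = agg


class SeriesGroupBy(GroupBy):
    def agg(self, func):
        return ("Series", func)


class FixedSeriesGroupBy(GroupBy):
    def agg(self, func):
        return ("Series", func)

    # Re-alias in the subclass so both names reach the override.
    aggregate = agg


buggy = SeriesGroupBy().aggregate("sum")
fixed = FixedSeriesGroupBy().aggregate("sum")
```

Here `buggy` comes from the base-class method while `fixed` comes from the override, which mirrors the `DataFrame` versus `Series` difference described above.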

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #14971
Replaces hardcoded offset types as size-type with the offsetalator or int64 (for temporary vectors).

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #14888
This PR deprecates "H", "N", "T", "L", "U" and "S" as frequencies in all datetime APIs. This PR prepares `branch-24.04` for `pandas-2.2` support.
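
The sketch below shows what a deprecation shim for these aliases could look like. It is not cudf's implementation: `normalize_freq` is a hypothetical helper, the use of `FutureWarning` is an assumption, and the replacements follow the lowercase spellings that pandas 2.2 introduced. It handles only bare aliases, not multiples such as `"2H"`.

```python
import warnings

# Deprecated alias -> replacement spelling used by pandas 2.2.
_DEPRECATED_FREQS = {
    "H": "h",
    "T": "min",
    "S": "s",
    "L": "ms",
    "U": "us",
    "N": "ns",
}


def normalize_freq(freq: str) -> str:
    """Return the current spelling of freq, warning if it is deprecated."""
    replacement = _DEPRECATED_FREQS.get(freq)
    if replacement is None:
        return freq
    warnings.warn(
        f"{freq!r} is deprecated as a frequency; use {replacement!r} instead.",
        FutureWarning,
        stacklevel=2,
    )
    return replacement
```

Callers that already pass `"h"`, `"min"`, and so on go through unchanged and see no warning.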

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #14967
Adds offsetalator in place of hardcoded offset type arrays to the nvtext tokenize functions:
- `nvtext::tokenize()`
- `nvtext::count_tokens()`
- `nvtext::character_tokenize()`
- `nvtext::ngrams_tokenize()`
- `nvtext::tokenize_with_vocabulary()`

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #14783
…14976)

This PR deprecates non-integer `periods` in `date_range` and `interval_range` to match pandas-2.2 deprecations.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #14976
Contributes to #13921

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #14972
Contributes to #13921

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #14970
Forward-merge branch-24.02 to branch-24.04
galipremsagar and others added 4 commits March 20, 2024 21:26
This PR lists all notable breaking changes that will happen in `cudf` as part of the `pandas-2.0` upgrade.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #13535
Closes #15350. This PR changes the order of the callback `MemoryBuffer.onClosed` to happen after our `MemoryCleaner` finishes. This is done so that we can accurately, and safely, reflect the state of the memory resource (be it device or host). This PR is needed to address a bug found in spark-rapids here: NVIDIA/spark-rapids#10585.
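
The ordering matters because a callback that fires before cleanup observes memory that is about to be freed as still in use. The Python sketch below illustrates the "clean up first, then notify" order with hypothetical `MemoryTracker` and `Buffer` classes. It is an analogy only and does not reflect the Java `MemoryBuffer` or `MemoryCleaner` code.

```python
class MemoryTracker:
    """Counts bytes currently held by open buffers."""

    def __init__(self):
        self.bytes_in_use = 0


class Buffer:
    def __init__(self, size, tracker, on_closed=None):
        self._size = size
        self._tracker = tracker
        self._on_closed = on_closed
        self._closed = False
        tracker.bytes_in_use += size

    def close(self):
        if self._closed:
            return
        self._closed = True
        # Release the memory first ...
        self._tracker.bytes_in_use -= self._size
        # ... then notify, so the callback sees the post-release state.
        if self._on_closed is not None:
            self._on_closed()


tracker = MemoryTracker()
observed = []
buf = Buffer(1024, tracker, on_closed=lambda: observed.append(tracker.bytes_in_use))
buf.close()
```

With the two steps in `close` swapped, the callback would record 1024 bytes still in use even though the buffer is being released.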

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Gera Shegalov (https://github.com/gerashegalov)

URL: #15351
…nked reader. (#15342)

Fixes #15306

The core issue here was that, under certain conditions, the chunked reader could generate invalid page indices for list columns. This led to corruption in the decode kernels. The fix is fairly simple, but a decent amount of the delta in this PR is just name changes for clarity plus some more comments/docs.

This affected the number of chunks generated in some of the very (unrealistically) constrained tests.

Authors:
  - https://github.com/nvdbaranec
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #15342
This fixes an issue with how the `verify-copyright` hook handles multiple merge bases.

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15355

copy-pr-bot bot commented Mar 21, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf, cuDF (Python), CMake, conda, cuDF (Java), and ci labels Mar 21, 2024
Fixes a docs build error since `DLManagedTensor` cannot be resolved from
the dlpack documentation.
bdice and others added 3 commits March 27, 2024 14:39
conda dropped support for the `--force` flag to `conda env create`. This
changes that flag name to `--yes`.
See
https://github.com/conda/conda/blob/main/CHANGELOG.md#2430-2024-03-12
and rapidsai/miniforge-cuda#63 for more info.
…'` (#15476)

## Description
When `dtype='category'` we seem to error:
```

File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cuml/preprocessing/LabelEncoder.py", line 218, in transform
2024-04-05T19:37:35.8255262Z E                 y = cudf.Series('a', dtype="category")
2024-04-05T19:37:35.8257445Z E                 ^^^^^^^^^^^^^^^^^
2024-04-05T19:37:35.8260865Z E               File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
2024-04-05T19:37:35.8264174Z E                 result = func(*args, **kwargs)
2024-04-05T19:37:35.8266324Z E                 ^^^^^^^^^^^^^^^^^
2024-04-05T19:37:35.8270003Z E               File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/core/series.py", line 648, in __init__
2024-04-05T19:37:35.8273382Z E                 column = as_column(
2024-04-05T19:37:35.8275420Z E                 ^^^^^^^^^^^^^^^^^
2024-04-05T19:37:35.8279989Z E               File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/core/column/column.py", line 2022, in as_column
2024-04-05T19:37:35.8281584Z E                 arbitrary = cudf.Scalar(arbitrary, dtype=dtype)
2024-04-05T19:37:35.8282461Z E                 ^^^^^^^^^^^^^^^^^
2024-04-05T19:37:35.8283768Z E               File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/core/scalar.py", line 57, in __call__
2024-04-05T19:37:35.8285137Z E                 obj = super().__call__(value, dtype=dtype)
2024-04-05T19:37:35.8285959Z E                 ^^^^^^^^^^^^^^^^^
2024-04-05T19:37:35.8287757Z E               File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/core/scalar.py", line 128, in __init__
2024-04-05T19:37:35.8289232Z E                 self._host_value, self._host_dtype = self._preprocess_host_value(
2024-04-05T19:37:35.8290183Z E                 ^^^^^^^^^^^^^^^^^
2024-04-05T19:37:35.8291705Z E               File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/core/scalar.py", line 222, in _preprocess_host_value
2024-04-05T19:37:35.8293212Z E                 value = to_cudf_compatible_scalar(value, dtype=dtype)
2024-04-05T19:37:35.8294438Z E                 ^^^^^^^^^^^^^^^^^
2024-04-05T19:37:35.8296026Z E               File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/utils/dtypes.py", line 257, in to_cudf_compatible_scalar
2024-04-05T19:37:35.8297604Z E                 if isinstance(val, str) and np.dtype(dtype).kind == "M":
2024-04-05T19:37:35.8298543Z E                 ^^^^^^^^^^^^^^^^^
2024-04-05T19:37:35.8308752Z E             TypeError: data type 'category' not understood
```
## Checklist
- [x] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
@raydouglass raydouglass merged commit 27713f2 into main Apr 10, 2024
3 of 4 checks passed