[RELEASE] cudf v24.04 #15367
Merged
Fix documentation builds with pandas 2 changes
Add `pandas-2.x` support in `cudf`
Fixes #14781 This PR makes changes to the Parquet writer to ensure that data to be compressed is properly aligned. Changes have also been made to the `EncPage` struct to make it easier to keep fields in that struct aligned, and also to reduce confusing re-use of fields. In particular, the `max_data_size` field could previously hold any of a) the maximum possible size for the page data, b) the actual size of page data after encoding, c) the actual size of compressed page data. The latter two now have their own fields, `data_size` and `comp_data_size`. Authors: - Ed Seidl (https://github.com/etseidl) - Mike Wilson (https://github.com/hyperbolic2346) Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - Vukasin Milovanovic (https://github.com/vuule) URL: #14841
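The alignment idea and the split into separate size fields can be sketched in a few lines of Python. This is only an illustration, not the Parquet writer's code: `PageSizes`, `align_up`, `layout_pages`, and the 8-byte alignment are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


def align_up(size: int, alignment: int) -> int:
    """Round size up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (size + alignment - 1) & ~(alignment - 1)


@dataclass
class PageSizes:
    """Hypothetical stand-in: one field per meaning instead of one reused field."""

    max_data_size: int       # upper bound reserved for the page data
    data_size: int = 0       # actual size after encoding
    comp_data_size: int = 0  # actual size after compression


def layout_pages(max_sizes, alignment=8):
    """Return the aligned start offset of each page buffer and the total size."""
    offsets, cursor = [], 0
    for size in max_sizes:
        cursor = align_up(cursor, alignment)  # assumed alignment, for illustration
        offsets.append(cursor)
        cursor += size
    return offsets, cursor
```

With sizes `[10, 3, 17]` and an 8-byte alignment, the page buffers start at offsets 0, 16, and 24, so every buffer handed to the compressor begins on an aligned boundary.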
This PR fixes cudf's `__dask_tokenize__` definitions so that they produce data that can be deterministically tokenized when a `MultiIndex` is present. I ran into this problem in dask-expr for an index with datetime data (a case reflected by the new test). Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Charles Blackmon-Luca (https://github.com/charlesbluca) URL: #14829
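As a rough, stdlib-only sketch of what "deterministically tokenized" means here: dask calls an object's `__dask_tokenize__` method and hashes what it returns, so the method should return plain values that do not depend on object identity. `TinyMultiIndex` and `token_of` below are hypothetical stand-ins, not cudf or dask code.

```python
import hashlib
from datetime import datetime


class TinyMultiIndex:
    """Hypothetical two-level index, used only to illustrate the idea."""

    def __init__(self, names, levels):
        self.names = tuple(names)
        self.levels = [list(level) for level in levels]

    def __dask_tokenize__(self):
        # Return only plain, order-stable values. Datetimes are rendered as
        # ISO strings so the payload never depends on object identity.
        normalized = tuple(
            tuple(v.isoformat() if isinstance(v, datetime) else v for v in level)
            for level in self.levels
        )
        return (type(self).__name__, self.names, normalized)


def token_of(obj) -> str:
    """Hash the tokenize payload; a simplified stand-in for dask's hashing."""
    payload = repr(obj.__dask_tokenize__()).encode()
    return hashlib.sha256(payload).hexdigest()
```

Two separately constructed indexes with equal contents should then map to the same token, which is the property the new test checks for the datetime case.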
As noted what's public in https://pandas.pydata.org/docs/reference/index.html Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #14929
Discussed offline, replacing this legacy import style without aliasing Authors: - Matthew Roeschke (https://github.com/mroeschke) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Richard (Rick) Zamora (https://github.com/rjzamora) - Bradley Dice (https://github.com/bdice) URL: #14944
This PR updates the cudf.pandas docs to reflect cudf using pandas 2. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #14940
This PR makes the codecov status always pass ✔️ so that it doesn't distract from actual CI failures in the commit CI summary. https://docs.codecov.com/docs/commit-status#informational cc: @davidwendt @mroeschke Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Matthew Roeschke (https://github.com/mroeschke) - Vyas Ramasubramani (https://github.com/vyasr) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #14952
A few cleanups in test files following #14916. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Charles Blackmon-Luca (https://github.com/charlesbluca) URL: #14941
Use `make_offsets_child_column` and `offsetalator_iterator` to build/access offsets instead of hardcoded types. This cleans up the code nicely by automatically handling offset overflow and computing the total number of matches. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Mike Wilson (https://github.com/hyperbolic2346) URL: #14745
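The general idea of building offsets without hardcoding their width can be sketched in plain Python. This is an analogy only, not libcudf's `make_offsets_child_column` or its offsetalator; `make_offsets` is a hypothetical helper that picks a 32-bit or 64-bit element type from the running total.

```python
from array import array
from itertools import accumulate

INT32_MAX = 2**31 - 1  # largest value a 32-bit signed offset can hold


def make_offsets(sizes):
    """Build an offsets array (len(sizes) + 1 entries) from per-row sizes.

    Returns the offsets and the total, widening to 64-bit elements when the
    total would overflow a 32-bit offset.
    """
    offsets = list(accumulate(sizes, initial=0))
    total = offsets[-1]
    typecode = "i" if total <= INT32_MAX else "q"
    return array(typecode, offsets), total
```

Callers then read `offsets[i]` and `offsets[i + 1]` the same way regardless of which width was chosen, which is the convenience the offsetalator gives the C++ code.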
This PR implements groupby in pylibcudf along with the minimal set of aggregation logic to support groupby. To limit its scope, this PR does not include other aggregation logic, e.g. for non-groupby reductions and scans. Due to the large scale of what's already in this PR, I have also omitted the changes required to leverage pylibcudf in the current cudf Cython code. That will be done in a follow-up. This PR's diff is misleadingly large: much of it adds documentation and function declarations that shouldn't impose too heavy a cognitive load in review. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Ashwin Srinath (https://github.com/shwina) - Matthew Roeschke (https://github.com/mroeschke) URL: #14945
Removes the functions deprecated in 24.02 in #14202. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) - Vyas Ramasubramani (https://github.com/vyasr) URL: #14848
Removes hardcoded size-type for offset variables and replaces them with offsetalator iterator. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Bradley Dice (https://github.com/bdice) URL: #14744
This PR updates spilling tests to limit all environment and cudf option modifications to the scope of the test, to avoid interfering with other tests. This PR also temporarily skips a test that is not currently safe to run in an environment where other tests may already have modified related global state (the rmm default memory resource). Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) URL: #14958
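One common way to keep environment changes scoped to a single test is a context manager that restores the previous state on exit. The sketch below uses only the standard library; `scoped_env` is a hypothetical helper, not the code in this PR, and `CUDF_SPILL` appears purely as an example variable.

```python
import os
from contextlib import contextmanager


@contextmanager
def scoped_env(**overrides):
    """Temporarily set environment variables, restoring the old state on exit."""
    saved = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)  # variable did not exist before
            else:
                os.environ[name] = old      # put the previous value back


# Usage: the override is visible only inside the with-block.
with scoped_env(CUDF_SPILL="on"):
    pass  # run the code under test here
```

In a pytest suite the built-in `monkeypatch` fixture (`monkeypatch.setenv`) gives the same guarantee without a custom helper.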
Replace struct_column_wrapper with structs_column_wrapper in example given in documentation of `structs_column_wrapper` Authors: - Karthikeyan (https://github.com/karthikeyann) Approvers: - David Wendt (https://github.com/davidwendt) - Bradley Dice (https://github.com/bdice) URL: #14949
This is more explicit than the methods, which may allow array objects where we don't want them. Authors: - Matthew Roeschke (https://github.com/mroeschke) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #14943
This PR migrates the unary operations in cuDF Python to pylibcudf. Authors: - Vyas Ramasubramani (https://github.com/vyasr) - Bradley Dice (https://github.com/bdice) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #14850
This PR builds on #14945 to use pylibcudf's groupby in cudf's internals. It should not be merged until after that PR. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #14946
This change makes the pylibcudf API more convenient and a more faithful reproduction of the underlying libcudf APIs that offer overloaded signatures. In cases like binary ops where we were previously using runtime instance checks, this change also removes unnecessary runtime overhead if the calling code is Cython since in those cases the types at the call site are known at compile time. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #14969
`branch-24.04` was opened without updating the copyrights on a few files. This fixes those missing copyright updates, which keep getting updated by our pre-commit hooks for me locally. Authors: - Bradley Dice (https://github.com/bdice) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Ray Douglass (https://github.com/raydouglass)
Calling `SeriesGroupBy.aggregate` is currently directed to `GroupBy.agg` instead of `SeriesGroupBy.agg`. This means that `SeriesGroupBy.aggregate` currently produces a `DataFrame` in many cases that it *should* produce a `Series`. This PR corrects the underlying problem. Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Bradley Dice (https://github.com/bdice) - Matthew Roeschke (https://github.com/mroeschke) URL: #14971
Replaces hardcoded offset types as size-type with the offsetalator or int64 (for temporary vectors). Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) URL: #14888
This PR deprecates "H", "N", "T", "L", "U" and "S" as frequencies in all datetime APIs. This PR prepares `branch-24.04` for `pandas-2.2` support. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #14967
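For code that still passes the deprecated aliases, a small translation helper can ease the migration. The mapping below follows the pandas 2.2 replacements ("H" to "h", "T" to "min", "S" to "s", "L" to "ms", "U" to "us", "N" to "ns"); `modernize_freq` itself is a hypothetical helper, not part of cudf or pandas.

```python
import re

# Deprecated alias -> replacement, following the pandas 2.2 deprecations.
REPLACEMENTS = {"H": "h", "T": "min", "S": "s", "L": "ms", "U": "us", "N": "ns"}

_FREQ = re.compile(r"(\d*)([A-Za-z]+)")


def modernize_freq(freq: str) -> str:
    """Rewrite a simple frequency string such as '2H' to its new spelling."""
    match = _FREQ.fullmatch(freq)
    if match is None:
        return freq  # leave anything more complex untouched
    count, unit = match.groups()
    return count + REPLACEMENTS.get(unit, unit)
```

Aliases that were not deprecated, such as "D", pass through unchanged, so the helper can be applied to every frequency string before it reaches a datetime API.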
Adds offsetalator in place of hardcoded offset type arrays to the strings split functions: - `nvtext::tokenize()` - `nvtext::count_tokens()` - `nvtext::character_tokenize()` - `nvtext::ngrams_tokenize()` - `nvtext::tokenize_with_vocabulary()` Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) URL: #14783
…14976) This PR deprecates non-integer `periods` in `date_range` and `interval_range` to match pandas-2.2 deprecations. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #14976
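A deprecation like this is usually implemented as a validation step that still accepts the old input but warns about it. The following is a minimal sketch under that assumption; `validate_periods`, the `FutureWarning` category, and the truncation to `int` are illustrative choices, not a copy of cudf's implementation.

```python
import warnings
from numbers import Integral, Real


def validate_periods(periods):
    """Accept integer periods silently; warn on other real numbers."""
    if periods is None or isinstance(periods, Integral):
        return periods
    if isinstance(periods, Real):
        warnings.warn(
            "Non-integer 'periods' is deprecated; pass an int instead.",
            FutureWarning,  # assumed warning category
            stacklevel=2,
        )
        return int(periods)  # assumed behavior during the deprecation window
    raise TypeError(f"periods must be a number, got {periods!r}")
```

Callers that pass `periods=3.0` keep working during the deprecation period but see a warning that points at their own call site.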
Contributes to #13921 Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #14972
Contributes to #13921 Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #14970
Forward-merge branch-24.02 to branch-24.04
This PR lists all notable breaking changes that will be happening in `cudf` as part of `pandas-2.0` upgrade. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #13535
Closes #15350. This PR changes the order of the callback `MemoryBuffer.onClosed` to happen after our `MemoryCleaner` finishes. This is done so that we can accurately, and safely, reflect the state of the memory resource (be it device or host). This PR is needed to address a bug found in spark-rapids here: NVIDIA/spark-rapids#10585. Authors: - Alessandro Bellina (https://github.com/abellina) Approvers: - Nghia Truong (https://github.com/ttnghia) - Gera Shegalov (https://github.com/gerashegalov) URL: #15351
…nked reader. (#15342) Fixes #15306 The core issue here was that, under certain conditions, the chunked reader could generate invalid page indices for list columns. This led to corruption in the decode kernels. The fix is fairly simple, but there's a decent amount of delta in this PR that is just name changes for clarity and some more comments/docs. This affected the number of chunks generated in some of the very (unrealistically) constrained tests. Authors: - https://github.com/nvdbaranec - Nghia Truong (https://github.com/ttnghia) Approvers: - Nghia Truong (https://github.com/ttnghia) - Vukasin Milovanovic (https://github.com/vuule) URL: #15342
This fixes an issue with how the `verify-copyright` hook handles multiple merge bases. Authors: - Kyle Edwards (https://github.com/KyleFromNVIDIA) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #15355
github-actions bot added the libcudf, cuDF (Python), CMake, conda, cuDF (Java), and ci labels on Mar 21, 2024
galipremsagar approved these changes on Mar 21, 2024
Fixes a docs build error since `DLManagedTensor` cannot be resolved from the dlpack documentation.
ttnghia approved these changes on Mar 26, 2024
conda dropped support for the `--force` flag to `conda env create`. This changes that flag name to `--yes`. See https://github.com/conda/conda/blob/main/CHANGELOG.md#2430-2024-03-12 and rapidsai/miniforge-cuda#63 for more info.
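Scripts that must run against both older and newer conda releases could pick the flag from the installed version. The sketch below assumes the cutoff is conda 24.3.0, as the changelog link above suggests; `conda_env_create_args` is a hypothetical helper and is not taken from this PR.

```python
def conda_env_create_args(conda_version, env_file):
    """Build a `conda env create` command line with the right overwrite flag."""
    major, minor = (int(part) for part in conda_version.split(".")[:2])
    # Assumption: --force is gone from 24.3.0 onward, so use --yes there.
    overwrite_flag = "--yes" if (major, minor) >= (24, 3) else "--force"
    return ["conda", "env", "create", overwrite_flag, "--file", env_file]
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with the environment file path.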
…'` (#15476)

## Description
When `dtype='category'` we seem to error:
```
File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cuml/preprocessing/LabelEncoder.py", line 218, in transform
    y = cudf.Series('a', dtype="category")
    ^^^^^^^^^^^^^^^^^
File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/core/series.py", line 648, in __init__
    column = as_column(
    ^^^^^^^^^^^^^^^^^
File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/core/column/column.py", line 2022, in as_column
    arbitrary = cudf.Scalar(arbitrary, dtype=dtype)
    ^^^^^^^^^^^^^^^^^
File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/core/scalar.py", line 57, in __call__
    obj = super().__call__(value, dtype=dtype)
    ^^^^^^^^^^^^^^^^^
File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/core/scalar.py", line 128, in __init__
    self._host_value, self._host_dtype = self._preprocess_host_value(
    ^^^^^^^^^^^^^^^^^
File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/core/scalar.py", line 222, in _preprocess_host_value
    value = to_cudf_compatible_scalar(value, dtype=dtype)
    ^^^^^^^^^^^^^^^^^
File "/pyenv/versions/3.11.9/lib/python3.11/site-packages/cudf/utils/dtypes.py", line 257, in to_cudf_compatible_scalar
    if isinstance(val, str) and np.dtype(dtype).kind == "M":
    ^^^^^^^^^^^^^^^^^
TypeError: data type 'category' not understood
```
## Checklist
- [x] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
galipremsagar approved these changes on Apr 10, 2024
❄️ Code freeze for `branch-24.04` and v24.04 release

What does this mean?
Only critical/hotfix level issues should be merged into `branch-24.04` until release (merging of this PR).

What is the purpose of this PR?
`branch-24.04` into `main` for the release