
Add support for pandas-2.2 in cudf #15100

Merged on Feb 26, 2024 (35 commits). The diff below shows changes from 8 commits.

Commits:
5262a4e  Upgrade to pandas-2.2 (galipremsagar, Feb 21, 2024)
1e77124  Merge remote-tracking branch 'upstream/branch-24.04' into pandas_22_u… (galipremsagar, Feb 21, 2024)
32c1c0f  cleanup more (galipremsagar, Feb 21, 2024)
6daba82  Do more cleanup (galipremsagar, Feb 21, 2024)
00f1bbe  merge (galipremsagar, Feb 21, 2024)
f42676d  isort (galipremsagar, Feb 21, 2024)
a6c78ec  cleanup a bit more (galipremsagar, Feb 21, 2024)
ec80e9a  fix a flaky pytest (galipremsagar, Feb 21, 2024)
2c00c59  Merge remote-tracking branch 'upstream/branch-24.04' into pandas_22_u… (galipremsagar, Feb 21, 2024)
04309a6  Handle cython function or methods for cython>3.0 (galipremsagar, Feb 21, 2024)
d15323a  Merge branch 'branch-24.04' into pandas_22_upgrade (galipremsagar, Feb 21, 2024)
7fcc521  Update fast_slow_proxy.py (galipremsagar, Feb 21, 2024)
7da7100  Merge branch 'branch-24.04' into pandas_22_upgrade (galipremsagar, Feb 21, 2024)
82b0644  Merge remote-tracking branch 'upstream/branch-24.04' into pandas_22_u… (galipremsagar, Feb 22, 2024)
391b602  skip segfaulting tests (galipremsagar, Feb 22, 2024)
1b8a1e5  Update run.sh (galipremsagar, Feb 22, 2024)
ecd9221  add timeouts (galipremsagar, Feb 22, 2024)
fa72dd8  add pytest timeout (galipremsagar, Feb 22, 2024)
65c8652  Merge branch 'branch-24.04' into pandas_22_upgrade (galipremsagar, Feb 23, 2024)
07d57f6  Merge remote-tracking branch 'upstream/branch-24.04' into pandas_22_u… (galipremsagar, Feb 23, 2024)
adcd271  ignore (galipremsagar, Feb 23, 2024)
3efb4f8  fix script (galipremsagar, Feb 23, 2024)
cae3b0d  Update run-pandas-tests.sh (galipremsagar, Feb 23, 2024)
ffa1edc  Merge branch 'branch-24.04' into pandas_22_upgrade (galipremsagar, Feb 23, 2024)
d50ed04  Apply suggestions from code review (galipremsagar, Feb 23, 2024)
415de9b  Merge branch 'branch-24.04' into pandas_22_upgrade (galipremsagar, Feb 23, 2024)
ca9933d  test (galipremsagar, Feb 23, 2024)
340a4de  Merge branch 'pandas_22_upgrade' of https://github.com/galipremsagar/… (galipremsagar, Feb 23, 2024)
65bba60  Update run-pandas-tests.sh (galipremsagar, Feb 24, 2024)
a926ad1  Merge branch 'branch-24.04' into pandas_22_upgrade (galipremsagar, Feb 24, 2024)
328bc95  test (galipremsagar, Feb 26, 2024)
4f76e9f  Merge branch 'pandas_22_upgrade' of https://github.com/galipremsagar/… (galipremsagar, Feb 26, 2024)
62cf147  disable pandas-tests temporarily (galipremsagar, Feb 26, 2024)
fb8924b  Merge remote-tracking branch 'upstream/branch-24.04' into pandas_22_u… (galipremsagar, Feb 26, 2024)
8159ac2  upgrade to pandas-2.2.1 (galipremsagar, Feb 26, 2024)
2 changes: 1 addition & 1 deletion conda/environments/all_cuda-118_arch-x86_64.yaml
Original file line number Diff line number Diff line change
@@ -65,7 +65,7 @@ dependencies:
- nvcomp==3.0.5
- nvtx>=0.2.1
- packaging
- pandas>=2.0,<2.1.5dev0
Contributor:
Is this the right pinning, or should we go with <2.3.0dev0? Do we expect patch releases to break us?

Contributor (author):

I feel various fixes are going into 2.2.1 that may have an impact on us, but would defer to @mroeschke on that.

Contributor:

Yeah, I fixed various issues @galipremsagar found in pandas 2.2.1 (coming out next week). There are xfails for those, so the update will break the test suite. We should probably relax the pin in a follow-up PR once it's out.

- pandas>=2.0,<2.2.1dev0
- pandoc
- pip
- pre-commit
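For reference, the semantics of the pin being discussed can be checked with the `packaging` library. This is a standalone sketch, not part of the diff; it only illustrates which pandas releases the `>=2.0,<2.2.1dev0` range admits:

```python
from packaging.specifiers import SpecifierSet

# The pin used in this PR: everything from 2.0 up to, but excluding, 2.2.1.
spec = SpecifierSet(">=2.0,<2.2.1dev0")

print("2.2.0" in spec)  # 2.2.0 satisfies the range
print("2.2.1" in spec)  # 2.2.1 (with its fixes and xfail-breaking changes) does not
print("2.3.0" in spec)  # nor does 2.3.0
```

The `<2.2.1dev0` upper bound excludes 2.2.1 and all of its pre-releases, which matches the reviewer's concern that 2.2.1 would break the test suite until the xfails are updated.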
2 changes: 1 addition & 1 deletion conda/environments/all_cuda-122_arch-x86_64.yaml
@@ -63,7 +63,7 @@ dependencies:
- nvcomp==3.0.5
- nvtx>=0.2.1
- packaging
- pandas>=2.0,<2.1.5dev0
- pandas>=2.0,<2.2.1dev0
- pandoc
- pip
- pre-commit
2 changes: 1 addition & 1 deletion conda/recipes/cudf/meta.yaml
@@ -80,7 +80,7 @@ requirements:
- {{ pin_compatible('protobuf', min_pin='x.x', max_pin='x') }}
- python
- typing_extensions >=4.0.0
- pandas >=2.0,<2.1.5dev0
- pandas >=2.0,<2.2.1dev0
- cupy >=12.0.0
- numba >=0.57
- numpy >=1.21
2 changes: 1 addition & 1 deletion dependencies.yaml
@@ -497,7 +497,7 @@ dependencies:
packages:
- fsspec>=0.6.0
- *numpy
- pandas>=2.0,<2.1.5dev0
- pandas>=2.0,<2.2.1dev0
run_cudf:
common:
- output_types: [conda, requirements, pyproject]
1 change: 0 additions & 1 deletion python/cudf/cudf/core/_compat.py
@@ -9,7 +9,6 @@
PANDAS_GE_201 = PANDAS_VERSION >= version.parse("2.0.1")
PANDAS_GE_210 = PANDAS_VERSION >= version.parse("2.1.0")
PANDAS_GE_214 = PANDAS_VERSION >= version.parse("2.1.4")
PANDAS_GE_220 = PANDAS_VERSION >= version.parse("2.2.0")
PANDAS_LT_203 = PANDAS_VERSION < version.parse("2.0.3")
PANDAS_GE_220 = PANDAS_VERSION >= version.parse("2.2.0")
Contributor:

Can we sort these so we have all the EQ, then GE, then LT? It looks like you deleted a duplicate line, but that ended up breaking up the groupings.

Contributor (author):

Yup, I'll do that in a follow-up PR. There is potentially still some cleanup to be done.

PANDAS_LT_300 = PANDAS_VERSION < version.parse("3.0.0")
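The flags in `_compat.py` follow one simple pattern: compare the installed pandas version against parsed thresholds once, at import time, and gate behavior on the resulting module-level booleans. A self-contained sketch of that pattern (using a hard-coded version string as a stand-in for the real `pandas.__version__`):

```python
from packaging import version

# Stand-in for pandas.__version__; cudf reads this from the installed pandas.
PANDAS_VERSION = version.parse("2.2.1")

PANDAS_GE_210 = PANDAS_VERSION >= version.parse("2.1.0")
PANDAS_GE_220 = PANDAS_VERSION >= version.parse("2.2.0")
PANDAS_LT_300 = PANDAS_VERSION < version.parse("3.0.0")

# Call sites elsewhere branch on these booleans instead of re-parsing versions.
print(PANDAS_GE_220, PANDAS_LT_300)
```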
13 changes: 2 additions & 11 deletions python/cudf/cudf/core/column/datetime.py
@@ -23,7 +23,7 @@
ScalarLike,
)
from cudf.api.types import is_datetime64_dtype, is_scalar, is_timedelta64_dtype
from cudf.core._compat import PANDAS_GE_200, PANDAS_GE_220
from cudf.core._compat import PANDAS_GE_220
from cudf.core.buffer import Buffer, cuda_array_interface_wrapper
from cudf.core.column import ColumnBase, as_column, column, string
from cudf.core.column.timedelta import _unit_to_nanoseconds_conversion
@@ -324,17 +324,8 @@ def to_pandas(
# `copy=True` workaround until following issue is fixed:
# https://issues.apache.org/jira/browse/ARROW-9772

if PANDAS_GE_200:
host_values = self.to_arrow()
else:
# Pandas<2.0 supports only `datetime64[ns]`, hence the cast.
host_values = self.astype("datetime64[ns]").to_arrow()

# Pandas only supports `datetime64[ns]` dtype
# and conversion to this type is necessary to make
# arrow to pandas conversion happen for large values.
return pd.Series(
host_values,
self.to_arrow(),
Contributor:

Just noting that in the future we should pass pandas a numpy array instead of a pyarrow array, since pandas.Series treats pyarrow arrays as objects.

copy=True,
dtype=self.dtype,
index=index,
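The review note above is about how the `pandas.Series` constructor interprets its input: a numpy datetime array is adopted with its native dtype, whereas a pyarrow array is consumed as a generic iterable and lands as object dtype unless an explicit `dtype=` is given, which is why the diff keeps `dtype=self.dtype`. A small numpy-only illustration (plain pandas, not cudf):

```python
import numpy as np
import pandas as pd

# A numpy datetime64 array becomes a Series with its native dtype;
# no explicit dtype= is needed.
arr = np.array(["2000-01-01", "2000-01-02"], dtype="datetime64[ns]")
s = pd.Series(arr)

# A pyarrow Array, by contrast, is not recognized natively by the Series
# constructor and would fall back to object dtype without dtype=.
print(s.dtype)
```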
12 changes: 1 addition & 11 deletions python/cudf/cudf/core/column/timedelta.py
@@ -14,7 +14,6 @@
from cudf import _lib as libcudf
from cudf._typing import ColumnBinaryOperand, DatetimeLikeScalar, Dtype
from cudf.api.types import is_scalar, is_timedelta64_dtype
from cudf.core._compat import PANDAS_GE_200
from cudf.core.buffer import Buffer, acquire_spill_lock
from cudf.core.column import ColumnBase, column, string
from cudf.utils.dtypes import np_to_pa_dtype
@@ -153,20 +152,11 @@ def to_pandas(
# `copy=True` workaround until following issue is fixed:
# https://issues.apache.org/jira/browse/ARROW-9772

if PANDAS_GE_200:
host_values = self.to_arrow()
else:
# Pandas<2.0 supports only `timedelta64[ns]`, hence the cast.
host_values = self.astype("timedelta64[ns]").to_arrow()

# Pandas only supports `timedelta64[ns]` dtype
# and conversion to this type is necessary to make
# arrow to pandas conversion happen for large values.
if nullable:
raise NotImplementedError(f"{nullable=} is not implemented.")

return pd.Series(
host_values,
self.to_arrow(),
copy=True,
dtype=self.dtype,
index=index,
9 changes: 1 addition & 8 deletions python/cudf/cudf/core/dataframe.py
@@ -56,7 +56,7 @@
is_string_dtype,
)
from cudf.core import column, df_protocol, indexing_utils, reshape
from cudf.core._compat import PANDAS_GE_200, PANDAS_LT_300
from cudf.core._compat import PANDAS_LT_300
from cudf.core.abc import Serializable
from cudf.core.column import (
CategoricalColumn,
@@ -1338,13 +1338,6 @@ def __getitem__(self, arg):
mask = arg
if is_list_like(mask):
dtype = None
if len(mask) == 0 and not PANDAS_GE_200:
# An explicit dtype is needed to avoid pandas
# warnings from empty sets of columns. This
# shouldn't be needed in pandas 2.0, we don't
# need to specify a dtype when we know we're not
# trying to match any columns so the default is fine.
dtype = "float64"
mask = pd.Series(mask, dtype=dtype)
if mask.dtype == "bool":
return self._apply_boolean_mask(BooleanMask(mask, len(self)))
17 changes: 4 additions & 13 deletions python/cudf/cudf/core/index.py
@@ -39,7 +39,7 @@
is_signed_integer_dtype,
)
from cudf.core._base_index import BaseIndex
from cudf.core._compat import PANDAS_GE_200, PANDAS_LT_300
from cudf.core._compat import PANDAS_LT_300
from cudf.core.column import (
CategoricalColumn,
ColumnBase,
@@ -2098,23 +2098,14 @@ def to_pandas(self, *, nullable: bool = False) -> pd.DatetimeIndex:
if nullable:
raise NotImplementedError(f"{nullable=} is not implemented.")

if PANDAS_GE_200:
nanos = self._values
else:
# no need to convert to nanos with Pandas 2.x
if isinstance(self.dtype, pd.DatetimeTZDtype):
nanos = self._values.astype(
pd.DatetimeTZDtype("ns", self.dtype.tz)
)
else:
nanos = self._values.astype("datetime64[ns]")

freq = (
self._freq._maybe_as_fast_pandas_offset()
if self._freq is not None
else None
)
return pd.DatetimeIndex(nanos.to_pandas(), name=self.name, freq=freq)
return pd.DatetimeIndex(
self._values.to_pandas(), name=self.name, freq=freq
)

@_cudf_nvtx_annotate
def _get_dt_field(self, field):
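The `freq` handling retained above mirrors plain pandas, where a `DatetimeIndex` can carry a frequency that is passed through on reconstruction. A pandas-only sketch of that round-trip (the names here are illustrative, not cudf's):

```python
import pandas as pd

# Build an index that carries a frequency, then reconstruct it the way the
# diff does: from the data, a name, and an explicit freq.
idx = pd.date_range("2000-01-01", periods=3, freq="D")
rebuilt = pd.DatetimeIndex(idx, name="ts", freq=idx.freq)

print(rebuilt.freq, rebuilt.name)
```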
6 changes: 3 additions & 3 deletions python/cudf/cudf/tests/indexes/test_interval.py
@@ -5,9 +5,9 @@
import pytest

import cudf
from cudf.core._compat import PANDAS_GE_210, PANDAS_GE_220
from cudf.core._compat import PANDAS_GE_210
from cudf.core.index import IntervalIndex, interval_range
from cudf.testing._utils import assert_eq, expect_warning_if
from cudf.testing._utils import assert_eq


def test_interval_constructor_default_closed():
@@ -142,7 +142,7 @@ def test_interval_range_periods_basic_dtype(start_t, end_t, periods_t):
def test_interval_range_periods_warnings():
start_val, end_val, periods_val = 0, 4, 1.0

with expect_warning_if(PANDAS_GE_220):
with pytest.warns(FutureWarning):
pindex = pd.interval_range(
start=start_val, end=end_val, periods=periods_val, closed="left"
)
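The hunk above replaces cudf's `expect_warning_if` helper with a plain `pytest.warns`, since the warning is unconditional once pandas 2.2 is the minimum version. For context, the helper's behavior can be approximated without pytest; this is a sketch of the idea, not cudf's actual implementation:

```python
import contextlib
import warnings

@contextlib.contextmanager
def expect_warning_if(condition, category=FutureWarning):
    """Require a warning of `category` inside the block iff `condition` is true."""
    with warnings.catch_warnings(record=True) as record:
        warnings.simplefilter("always")
        yield
    raised = any(issubclass(w.category, category) for w in record)
    if condition:
        assert raised, f"expected a {category.__name__}"

# When the condition holds, the block must emit the warning:
with expect_warning_if(True):
    warnings.warn("deprecated soon", FutureWarning)
```

When the condition is false the helper simply lets the block run, which is what made it useful for version-gated warnings; with pandas 2.2 as the floor the condition is always true, so `pytest.warns` suffices.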
53 changes: 2 additions & 51 deletions python/cudf/cudf/tests/test_array_ufunc.py
@@ -10,7 +10,7 @@
import pytest

import cudf
from cudf.core._compat import PANDAS_GE_200, PANDAS_GE_210, PANDAS_LT_300
from cudf.core._compat import PANDAS_GE_210, PANDAS_LT_300
from cudf.testing._utils import (
assert_eq,
expect_warning_if,
@@ -183,10 +183,7 @@ def test_ufunc_series(request, ufunc, has_nulls, indexed):

request.applymarker(
pytest.mark.xfail(
condition=PANDAS_GE_200
and fname.startswith("bitwise")
and indexed
and has_nulls,
condition=fname.startswith("bitwise") and indexed and has_nulls,
reason="https://github.com/pandas-dev/pandas/issues/52500",
)
)
@@ -385,52 +382,6 @@ def test_ufunc_dataframe(request, ufunc, has_nulls, indexed):
reason=f"cupy has no support for '{fname}'",
)
)
request.applymarker(
pytest.mark.xfail(
condition=(
not PANDAS_GE_200
and indexed
in {
"add",
"arctan2",
"bitwise_and",
"bitwise_or",
"bitwise_xor",
"copysign",
"divide",
"divmod",
"float_power",
"floor_divide",
"fmax",
"fmin",
"fmod",
"heaviside",
"gcd",
"hypot",
"lcm",
"ldexp",
"left_shift",
"logaddexp",
"logaddexp2",
"logical_and",
"logical_or",
"logical_xor",
"maximum",
"minimum",
"multiply",
"nextafter",
"power",
"remainder",
"right_shift",
"subtract",
}
),
reason=(
"pandas<2.0 does not currently support misaligned "
"indexes in DataFrames"
),
)
)

N = 100
# Avoid zeros in either array to skip division by 0 errors. Also limit the
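The large xfail block deleted above existed because, per its reason string, pandas before 2.0 could not handle misaligned indexes in DataFrame ufuncs. With pandas 2.x as the floor, numpy ufuncs align their pandas operands instead, so the marker is dead code. A plain-pandas illustration of the behavior the deletion relies on (this assumes pandas >= 2.0):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({"x": [10, 20]}, index=[1, 2])

# pandas aligns the two indexes before applying the ufunc; positions present
# in only one operand become NaN, and only index 1 is shared here.
result = np.add(df1, df2)
print(result)
```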
41 changes: 2 additions & 39 deletions python/cudf/cudf/tests/test_binops.py
@@ -1726,24 +1726,7 @@ def test_datetime_dateoffset_binaryop(
reason="https://github.com/pandas-dev/pandas/issues/57448",
)
)
request.applymarker(
pytest.mark.xfail(
not PANDAS_GE_220
and dtype in {"datetime64[ms]", "datetime64[s]"}
and frequency in ("microseconds", "nanoseconds")
and n_periods != 0,
reason="https://github.com/pandas-dev/pandas/pull/55595",
)
)
request.applymarker(
pytest.mark.xfail(
not PANDAS_GE_220
and dtype == "datetime64[us]"
and frequency == "nanoseconds"
and n_periods != 0,
reason="https://github.com/pandas-dev/pandas/pull/55595",
)
)

date_col = [
"2000-01-01 00:00:00.012345678",
"2000-01-31 00:00:00.012345678",
@@ -1833,27 +1816,7 @@ def test_datetime_dateoffset_binaryop_multiple(request, date_col, kwargs, op):
"dtype",
["datetime64[ns]", "datetime64[us]", "datetime64[ms]", "datetime64[s]"],
)
def test_datetime_dateoffset_binaryop_reflected(
request, n_periods, frequency, dtype
):
request.applymarker(
pytest.mark.xfail(
not PANDAS_GE_220
and dtype in {"datetime64[ms]", "datetime64[s]"}
and frequency in ("microseconds", "nanoseconds")
and n_periods != 0,
reason="https://github.com/pandas-dev/pandas/pull/55595",
)
)
request.applymarker(
pytest.mark.xfail(
not PANDAS_GE_220
and dtype == "datetime64[us]"
and frequency == "nanoseconds"
and n_periods != 0,
reason="https://github.com/pandas-dev/pandas/pull/55595",
)
)
def test_datetime_dateoffset_binaryop_reflected(n_periods, frequency, dtype):
date_col = [
"2000-01-01 00:00:00.012345678",
"2000-01-31 00:00:00.012345678",
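The tests above exercise binary operations between datetime columns and `pd.DateOffset`, including the reflected form (`offset + series`). In plain pandas the semantics look like this, with month arithmetic clamping to the end of shorter months:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2000-01-31", "2000-02-29"]))
offset = pd.DateOffset(months=1)

shifted = dates + offset    # forward form
reflected = offset + dates  # reflected form gives the same result

# Jan 31 + 1 month clamps to Feb 29 (2000 is a leap year).
print(shifted.iloc[0])
```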
3 changes: 1 addition & 2 deletions python/cudf/cudf/tests/test_column_accessor.py
@@ -5,7 +5,6 @@
import pytest

import cudf
from cudf.core._compat import PANDAS_GE_200
from cudf.core.column_accessor import ColumnAccessor
from cudf.testing._utils import assert_eq

@@ -60,7 +59,7 @@ def test_to_pandas_simple(simple_data):
assert_eq(
ca.to_pandas_index(),
pd.DataFrame(simple_data).columns,
exact=not PANDAS_GE_200,
exact=False,
)

