Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect #55227

Merged
16 changes: 8 additions & 8 deletions pandas/core/interchange/from_dataframe.py
Original file line number Diff line number Diff line change
Expand Up @@ -266,10 +266,9 @@ def string_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]:

assert buffers["offsets"], "String buffers must contain offsets"
# Retrieve the data buffer containing the UTF-8 code units
data_buff, protocol_data_dtype = buffers["data"]
data_buff, _ = buffers["data"]
# We're going to reinterpret the buffer as uint8, so make sure we can do it safely
assert protocol_data_dtype[1] == 8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion can indeed be deleted, as we can assume bit width 8 if the column dtype is STRING or LARGE_STRING.

assert protocol_data_dtype[2] in (
assert col.dtype[2] in (
ArrowCTypes.STRING,
ArrowCTypes.LARGE_STRING,
) # format_str == utf-8
Expand Down Expand Up @@ -377,15 +376,16 @@ def datetime_column_to_ndarray(col: Column) -> tuple[np.ndarray | pd.Series, Any
"""
buffers = col.get_buffers()

_, _, format_str, _ = col.dtype
dbuf, dtype = buffers["data"]
_, col_bit_width, format_str, _ = col.dtype
dbuf, _ = buffers["data"]
# Consider dtype being `uint` to get number of units passed since the 01.01.1970

data = buffer_to_ndarray(
dbuf,
(
DtypeKind.UINT,
dtype[1],
getattr(ArrowCTypes, f"UINT{dtype[1]}"),
DtypeKind.INT,
col_bit_width,
getattr(ArrowCTypes, f"INT{col.dtype[1]}"),
MarcoGorelli marked this conversation as resolved.
Show resolved Hide resolved
Endianness.NATIVE,
),
offset=col.offset,
Expand Down
22 changes: 22 additions & 0 deletions pandas/tests/interchange/test_impl.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@
DtypeKind,
)
from pandas.core.interchange.from_dataframe import from_dataframe
from pandas.core.interchange.utils import ArrowCTypes


@pytest.fixture
Expand Down Expand Up @@ -326,3 +327,24 @@ def test_interchange_from_non_pandas_tz_aware():
dtype="datetime64[us, Asia/Kathmandu]",
)
tm.assert_frame_equal(expected, result)


def test_interchange_from_corrected_buffer_dtypes(monkeypatch) -> None:
# https://github.com/pandas-dev/pandas/issues/54781
df = pd.DataFrame({"a": ["foo", "bar"]}).__dataframe__()
interchange = df.__dataframe__()
column = interchange.get_column_by_name("a")
buffers = column.get_buffers()
Copy link
Member

@WillAyd WillAyd Oct 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker for this PR but I think these tests would be more impactful if we made the PandasBuffer implement the buffer protocol:

https://docs.cython.org/en/latest/src/userguide/buffer.html

That way we could inspect the bytes for tests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Started this in #55671

buffers_data = buffers["data"]
buffer_dtype = buffers_data[1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to ignore my possibly wrong commentary as I'm new to this, but I think the offset buffers don't have the proper bufsize here either

(Pdb) buffers["offsets"]
(PandasBuffer({'bufsize': 24, 'ptr': 94440192356160, 'device': 'CPU'}), (<DtypeKind.INT: 0>, 64, 'l', '='))

The standard StringType which inherits from BinaryType in arrow uses a 32 bit offset value, so I think that bufsize should only be 12, unless we are mapping to a LargeString intentionally

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for looking into this - looks like it comes from

offsets = np.zeros(shape=(len(values) + 1,), dtype=np.int64)

where the dtype's being set to int64. OK to discuss/address this separately?

buffer_dtype = (
DtypeKind.UINT,
8,
ArrowCTypes.UINT8,
buffer_dtype[3],
)
buffers["data"] = (buffers_data[0], buffer_dtype)
column.get_buffers = lambda: buffers
interchange.get_column_by_name = lambda _: column
monkeypatch.setattr(df, "__dataframe__", lambda allow_copy: interchange)
pd.api.interchange.from_dataframe(df)