BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect #55227

Merged
60 changes: 37 additions & 23 deletions pandas/core/interchange/from_dataframe.py
@@ -266,21 +266,29 @@ def string_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]:

assert buffers["offsets"], "String buffers must contain offsets"
# Retrieve the data buffer containing the UTF-8 code units
data_buff, protocol_data_dtype = buffers["data"]
# We're going to reinterpret the buffer as uint8, so make sure we can do it safely
assert protocol_data_dtype[1] == 8
Contributor:

This assertion can indeed be deleted, as we can assume bit width 8 if the column dtype is STRING or LARGE_STRING.

assert protocol_data_dtype[2] in (
ArrowCTypes.STRING,
ArrowCTypes.LARGE_STRING,
) # format_str == utf-8
Contributor:

This assertion is valid, but it should be on col.dtype[2] rather than protocol_data_dtype[2]

# Convert the buffers to NumPy arrays. In order to go from STRING to
# an equivalent ndarray, we claim that the buffer is uint8 (i.e., a byte array)
data_dtype = (
DtypeKind.UINT,
8,
ArrowCTypes.UINT8,
Endianness.NATIVE,
)
data_buff, data_dtype = buffers["data"]
Contributor:

We can simply ignore the data dtype here, as we know what it needs to be (we set it later).

Suggested change
data_buff, data_dtype = buffers["data"]
data_buff, _ = buffers["data"]


if (data_dtype[1] == 8) and (
Member:

Out of curiosity what is the 8 here supposed to represent?

Member Author:

hey - from https://data-apis.org/dataframe-protocol/latest/API.html, it's the number of bits:

    @property
    @abstractmethod
    def dtype(self) -> Dtype:
        """
        Dtype description as a tuple ``(kind, bit-width, format string, endianness)``.

        Bit-width : the number of bits as an integer
        Format string : data type description format string in Apache Arrow C
                        Data Interface format.
        Endianness : current only native endianness (``=``) is supported
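The four-element tuple described above can be inspected directly through the interchange protocol. A minimal sketch (the column name "a" is just an example, not part of the PR):

```python
import pandas as pd

# An interchange column's dtype is (kind, bit-width, format string, endianness).
df = pd.DataFrame({"a": ["foo", "bar"]})
col = df.__dataframe__().get_column_by_name("a")
kind, bit_width, fmt, endianness = col.dtype
print(kind, bit_width, fmt, endianness)
```

For a pandas string column the bit width is 8 and the endianness is native (`=`), which is the assumption the assertions in this diff relied on.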

Member (@WillAyd, Sep 22, 2023):

Are string dtypes supposed to have an 8 bit association? That is kind of confusing for variable length types, granted I know very little of how this interchange is supposed to work

Member Author:

I think the idea is that strings are meant to be UTF-8, and so each character can be represented with 8 bits

Member:

Hmm interesting. Well keep in mind that UTF-8 doesn't mean a character is 8 bits; it is still 1-4 bytes

In arrow-adbc I've seen this assigned the value of 0

https://github.com/apache/arrow-adbc/blob/0d8707a5ee2622ba959b069cd173bfe6ee2aaff3/c/driver/postgresql/statement.cc#L225

Member:

The UTF-8 array consists of the buffer with the actual string data plus an offsets array. The data buffer holds all the string bytes concatenated (so it is typically much longer than the logical length of the array) and can be seen as simply a buffer of bytes (a "bytearray"); in numpy / buffer-interface terms you can therefore view such an array as one with bit width 8.

In arrow-adbc I've seen this assigned the value of 0

That's something postgres specific, I think

data_dtype[2]
in (
ArrowCTypes.STRING,
ArrowCTypes.LARGE_STRING,
)
): # format_str == utf-8
# temporary workaround to keep backwards compatibility due to
# https://github.com/pandas-dev/pandas/issues/54781

# We're going to reinterpret the buffer as uint8, so make sure we can do it
# safely

# Convert the buffers to NumPy arrays. In order to go from STRING to
# an equivalent ndarray, we claim that the buffer is uint8 (i.e., a byte array)
data_dtype = (
DtypeKind.UINT,
8,
ArrowCTypes.UINT8,
Endianness.NATIVE,
)
Contributor:

This does not need to be in an if-block. We can simply disregard the data buffer dtype - for string columns, this will ALWAYS be as listed here. This was already the case in the previous code! Only the assertions were wrong.

# Specify zero offset as we don't want to chunk the string data
data = buffer_to_ndarray(data_buff, data_dtype, offset=0, length=data_buff.bufsize)
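The reinterpretation performed above — treating the string data buffer as a plain uint8 byte array and slicing it with the offsets — can be sketched in plain NumPy (illustrative values, not the pandas internals):

```python
import numpy as np

# Arrow-style variable-length strings: one flat buffer of UTF-8 bytes plus an
# offsets array; string i spans data[offsets[i]:offsets[i+1]].
data = np.frombuffer(b"foobar", dtype=np.uint8)  # the buffer viewed as uint8
offsets = np.array([0, 3, 6], dtype=np.int64)
strings = [
    data[offsets[i] : offsets[i + 1]].tobytes().decode("utf-8")
    for i in range(len(offsets) - 1)
]
# strings == ["foo", "bar"]
```

This is why the data buffer's advertised dtype can be disregarded for string columns: the bytes are always reinterpreted as uint8 regardless.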

@@ -378,16 +386,22 @@ def datetime_column_to_ndarray(col: Column) -> tuple[np.ndarray | pd.Series, Any
buffers = col.get_buffers()

_, _, format_str, _ = col.dtype
dbuf, dtype = buffers["data"]
# Consider dtype being `uint` to get number of units passed since the 01.01.1970
dbuf, data_dtype = buffers["data"]

if data_dtype[0] == DtypeKind.DATETIME:
# temporary workaround to keep backwards compatibility due to
# https://github.com/pandas-dev/pandas/issues/54781
# Consider dtype being `int` to get number of units passed since 1970-01-01
data_dtype = (
DtypeKind.INT,
data_dtype[1],
getattr(ArrowCTypes, f"INT{data_dtype[1]}"),
Endianness.NATIVE,
)

data = buffer_to_ndarray(
dbuf,
(
DtypeKind.UINT,
dtype[1],
getattr(ArrowCTypes, f"UINT{dtype[1]}"),
Endianness.NATIVE,
Contributor:

This code was fine, but needs to be DtypeKind.INT as you have found, and use col.dtype[1] rather than dtype[1].

),
data_dtype,
offset=col.offset,
length=col.size(),
)
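The switch from UINT to INT in the hunk above matters for pre-epoch timestamps, which are negative unit counts. A minimal NumPy sketch (not the pandas code path itself):

```python
import numpy as np

# A datetime64 buffer is just 64-bit integers counting units since 1970-01-01.
# Viewing it as signed int preserves pre-epoch values; an unsigned view wraps them.
ts = np.array(["1969-12-31", "1970-01-02"], dtype="datetime64[s]")
as_int = ts.view(np.int64)    # [-86400, 86400]
as_uint = ts.view(np.uint64)  # the negative value wraps to a huge positive one
```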
22 changes: 22 additions & 0 deletions pandas/tests/interchange/test_impl.py
@@ -14,6 +14,7 @@
DtypeKind,
)
from pandas.core.interchange.from_dataframe import from_dataframe
from pandas.core.interchange.utils import ArrowCTypes


@pytest.fixture
@@ -326,3 +327,24 @@ def test_interchange_from_non_pandas_tz_aware():
dtype="datetime64[us, Asia/Kathmandu]",
)
tm.assert_frame_equal(expected, result)


def test_interchange_from_corrected_buffer_dtypes(monkeypatch) -> None:
# https://github.com/pandas-dev/pandas/issues/54781
df = pd.DataFrame({"a": ["foo", "bar"]})
interchange = df.__dataframe__()
column = interchange.get_column_by_name("a")
buffers = column.get_buffers()
Member (@WillAyd, Oct 24, 2023):

Not a blocker for this PR but I think these tests would be more impactful if we made the PandasBuffer implement the buffer protocol:

https://docs.cython.org/en/latest/src/userguide/buffer.html

That way we could inspect the bytes for tests

Member:

Started this in #55671

buffers_data = buffers["data"]
buffer_dtype = buffers_data[1]
Member:

Feel free to ignore my possibly wrong commentary as I'm new to this, but I think the offset buffers don't have the proper bufsize here either

(Pdb) buffers["offsets"]
(PandasBuffer({'bufsize': 24, 'ptr': 94440192356160, 'device': 'CPU'}), (<DtypeKind.INT: 0>, 64, 'l', '='))

The standard StringType which inherits from BinaryType in arrow uses a 32 bit offset value, so I think that bufsize should only be 12, unless we are mapping to a LargeString intentionally

Member Author:

thanks for looking into this - looks like it comes from

offsets = np.zeros(shape=(len(values) + 1,), dtype=np.int64)

where the dtype's being set to int64. OK to discuss/address this separately?

buffer_dtype = (
DtypeKind.UINT,
8,
ArrowCTypes.UINT8,
buffer_dtype[3],
)
buffers["data"] = (buffers_data[0], buffer_dtype)
column.get_buffers = lambda: buffers
interchange.get_column_by_name = lambda _: column
monkeypatch.setattr(df, "__dataframe__", lambda allow_copy: interchange)
pd.api.interchange.from_dataframe(df)
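The offsets-width observation raised in the review above can be checked directly: Arrow's string type uses 32-bit offsets while large_string uses 64-bit ones, so for two strings (len(values) + 1 = 3 offsets) the buffer sizes differ. A NumPy sketch, not the interchange code itself:

```python
import numpy as np

# Three offsets bound two strings.
offsets32 = np.zeros(3, dtype=np.int32)  # Arrow "string": 4-byte offsets
offsets64 = np.zeros(3, dtype=np.int64)  # Arrow "large_string": 8-byte offsets
# offsets32.nbytes == 12, offsets64.nbytes == 24 (the bufsize seen in the pdb output)
```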