
BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect #55227

Merged

Conversation

MarcoGorelli
Member

@MarcoGorelli MarcoGorelli commented Sep 21, 2023

I haven't included a whatsnew note as this isn't user-facing

@MarcoGorelli MarcoGorelli force-pushed the use-buffer-dtype-in-from-dataframe branch from 8d31eb0 to 9876c64 Compare September 21, 2023 11:37
)
data_buff, data_dtype = buffers["data"]

if (data_dtype[1] == 8) and (
Member

Out of curiosity what is the 8 here supposed to represent?

Member Author

hey - from https://data-apis.org/dataframe-protocol/latest/API.html, it's the number of bits:

    @property
    @abstractmethod
    def dtype(self) -> Dtype:
        """
        Dtype description as a tuple ``(kind, bit-width, format string, endianness)``.

        Bit-width : the number of bits as an integer
        Format string : data type description format string in Apache Arrow C
                        Data Interface format.
        Endianness : current only native endianness (``=``) is supported
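
For a concrete view of that tuple, here's a small sketch using only public interchange calls (the printed values are indicative and may differ slightly across versions):

import pandas as pd

# Ask the interchange object for a column's dtype description:
# (kind, bit-width, Arrow C format string, endianness).
col = pd.DataFrame({"a": [1.5, 2.5]}).__dataframe__().get_column_by_name("a")
kind, bit_width, fmt, endianness = col.dtype
print(kind, bit_width, fmt, endianness)
# e.g. DtypeKind.FLOAT 64 g =   <- 64 is the bit-width being discussed here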

Member

@WillAyd WillAyd Sep 22, 2023

Are string dtypes supposed to have an 8-bit association? That is kind of confusing for variable-length types; granted, I know very little of how this interchange is supposed to work

Member Author

I think the idea is that strings are meant to be UTF-8, and so each character can be represented with 8 bits

Member

Hmm, interesting. Well, keep in mind that UTF-8 doesn't mean a character is 8 bits; a character is still 1-4 bytes

In arrow-adbc I've seen this assigned the value of 0

https://github.com/apache/arrow-adbc/blob/0d8707a5ee2622ba959b069cd173bfe6ee2aaff3/c/driver/postgresql/statement.cc#L225
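
As a quick plain-Python illustration of that point:

# Each UTF-8 code unit is 8 bits, but one character may span 1-4 of them.
for ch in ["a", "é", "€", "🐼"]:
    print(ch, len(ch.encode("utf-8")), "byte(s)")
# a 1 byte(s), é 2 byte(s), € 3 byte(s), 🐼 4 byte(s)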

Member

A UTF-8 array consists of the actual string data plus an offsets array. The buffer holding all the string data (which is typically much longer than the logical length of the array) can be seen as simply a buffer of bytes (a "bytearray"), so in numpy / buffer-interface terms you can view such an array as one with bit width 8

In arrow-adbc I've seen this assigned the value of 0

That's something Postgres-specific, I think
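
Here is a small numpy sketch of the layout being described (hypothetical buffers built by hand, not pandas' actual ones):

import numpy as np

# The UTF-8 data for ["foo", "bar"] lives in one flat byte buffer, viewable
# as uint8 (bit width 8); a separate offsets array marks the string boundaries.
strings = ["foo", "bar"]
data = np.frombuffer(b"".join(s.encode("utf-8") for s in strings), dtype=np.uint8)
offsets = np.array([0, 3, 6], dtype=np.int64)  # len(strings) + 1 entries
print(data)     # [102 111 111  98  97 114]  -> 6 bytes of string data
print(offsets)  # [0 3 6]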

# temporary workaround to keep backwards compatibility due to
# https://github.com/pandas-dev/pandas/issues/54781
# Consider dtype being `uint` to get number of units passed since the 01.01.1970
data_dtype = (
    DtypeKind.UINT,
Member

Shouldn't this be an INT? Timestamps are backed by 64-bit signed integers in Arrow

https://github.com/apache/arrow/blob/772a01c080ad57eb11e9323f5347472b769d45de/format/Schema.fbs#L264

Member

Yes, AFAIK that should be signed INT
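
For reference, a minimal numpy illustration of why a signed INT description is the natural fit:

import numpy as np

# A datetime64[ns] array is just signed 64-bit integers underneath
# (nanoseconds since the Unix epoch).
ts = np.array(["1970-01-01", "1970-01-02"], dtype="datetime64[ns]")
print(ts.view(np.int64))  # [0 86400000000000]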

@MarcoGorelli MarcoGorelli marked this pull request as ready for review October 3, 2023 11:54
@jorisvandenbossche
Member

The tricky thing will be to test this ... Given that we don't yet have implementations to test with (pandas itself doesn't yet return the correct dtype), should we add some very specific unit tests explicitly targeting the helper functions, where we can pass both types of dtypes? Although I assume that's still difficult, as those helpers take a Column object, and then we would have to mock a Column object just for those tests...
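
A rough, standalone sketch of what such a mocked Column could look like (illustrative only, not the test that was eventually written; the real interchange Column exposes more attributes than are stubbed here, and the dtype kind codes are written as plain ints per the spec):

from unittest.mock import MagicMock

import numpy as np

# Fake buffers backing the strings ["foo", "bar"].
data = np.frombuffer(b"foobar", dtype=np.uint8)
offsets = np.array([0, 3, 6], dtype=np.int64)

mock_col = MagicMock()
mock_col.dtype = (21, 8, "u", "=")  # (STRING kind, bit width, format, endianness)
mock_col.get_buffers.return_value = {
    "data": (
        MagicMock(ptr=data.ctypes.data, bufsize=data.nbytes),
        (1, 8, "C", "="),  # uint8 description of the raw UTF-8 data buffer
    ),
    "offsets": (
        MagicMock(ptr=offsets.ctypes.data, bufsize=offsets.nbytes),
        (0, 64, "l", "="),  # signed 64-bit offsets
    ),
    "validity": None,
}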

@MarcoGorelli
Member Author

good point, marking as draft til I get back to that

@MarcoGorelli MarcoGorelli marked this pull request as draft October 3, 2023 12:32
@MarcoGorelli
Member Author

I've tried this out with:

and now

import pandas as pd
import polars as pl

df = pl.DataFrame({'a': ['foo', 'bar']})
print(pd.api.interchange.from_dataframe(df))

runs fine (whereas using pandas main and that branch from polars would've raised

>       assert protocol_data_dtype[2] in (
            ArrowCTypes.STRING,
            ArrowCTypes.LARGE_STRING,
        )  # format_str == utf-8
E       AssertionError

)

@stinodego fancy taking a look?

@MarcoGorelli MarcoGorelli marked this pull request as ready for review October 9, 2023 05:53
Contributor

@stinodego stinodego left a comment

Looks to me like you're making things more complicated than they need to be! The fix should be really simple.

If my comments aren't clear, I can make a small PR to show what I was thinking.

Comment on lines 272 to 275
assert protocol_data_dtype[2] in (
    ArrowCTypes.STRING,
    ArrowCTypes.LARGE_STRING,
)  # format_str == utf-8
Contributor

This assertion is valid, but it should be on col.dtype[2] rather than protocol_data_dtype[2].
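
To make the distinction concrete, a small sketch (exact reprs depend on the pandas version; the data buffer dtype shown is the uint8 description this PR moves towards):

import pandas as pd

col = pd.DataFrame({"a": ["foo", "bar"]}).__dataframe__().get_column_by_name("a")
# The *column* dtype carries the string format string ("u" / "U"), which is
# what the assertion should check ...
print(col.dtype)                     # e.g. (<DtypeKind.STRING: 21>, 8, 'u', '=')
# ... while the *data buffer* dtype describes the raw UTF-8 code units.
print(col.get_buffers()["data"][1])  # the uint8 description once this PR is in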

@@ -266,21 +266,29 @@ def string_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]:

    assert buffers["offsets"], "String buffers must contain offsets"
    # Retrieve the data buffer containing the UTF-8 code units
    data_buff, protocol_data_dtype = buffers["data"]
    # We're going to reinterpret the buffer as uint8, so make sure we can do it safely
    assert protocol_data_dtype[1] == 8
Contributor

This assertion can indeed be deleted, as we can assume bit width 8 if the column dtype is STRING or LARGE_STRING.

ArrowCTypes.UINT8,
Endianness.NATIVE,
)
data_buff, data_dtype = buffers["data"]
Contributor

We can simply ignore the data dtype here, as we know what it needs to be (we set it later).

Suggested change
data_buff, data_dtype = buffers["data"]
data_buff, _ = buffers["data"]

Comment on lines 286 to 291
data_dtype = (
    DtypeKind.UINT,
    8,
    ArrowCTypes.UINT8,
    Endianness.NATIVE,
)
Contributor

This does not need to be in an if-block. We can simply disregard the data buffer dtype - for string columns, this will ALWAYS be as listed here. This was already the case in the previous code! Only the assertions were wrong.

Comment on lines 386 to 389
DtypeKind.UINT,
dtype[1],
getattr(ArrowCTypes, f"UINT{dtype[1]}"),
Endianness.NATIVE,
Contributor

This code was fine, but needs to be DtypeKind.INT as you have found, and use col.dtype[1] rather than dtype[1].

@MarcoGorelli
Member Author

thanks for your review 🙏 ! I've tried addressing your comments

data_dtype,
(
    DtypeKind.INT,
    col.dtype[1],
Contributor

We unpack col.dtype on line 381; it'll be slightly more efficient to get the bit width from there!

@stinodego
Contributor

thanks for your review 🙏 ! I've tried addressing your comments

Yeah I think this should do it!

Co-authored-by: Stijn de Gooijer <stijn@degooijer.io>
@mroeschke mroeschke added the Interchange Dataframe Interchange Protocol label Oct 12, 2023
@stinodego
Contributor

Any chance this can get merged and be part of the next release? 🙏

This is blocking improvements to the protocol across all dataframe libraries.

@MarcoGorelli MarcoGorelli added this to the 2.1.2 milestone Oct 22, 2023
@MarcoGorelli
Member Author

@jorisvandenbossche fancy taking a look please?

I'd suggest backporting this; it has no user-facing impact anyway

@lithomas1
Member

@MarcoGorelli

Should this block the release?

df = pd.DataFrame({"a": ["foo", "bar"]}).__dataframe__()
interchange = df.__dataframe__()
column = interchange.get_column_by_name("a")
buffers = column.get_buffers()
Member

@WillAyd WillAyd Oct 24, 2023

Not a blocker for this PR, but I think these tests would be more impactful if we made PandasBuffer implement the buffer protocol:

https://docs.cython.org/en/latest/src/userguide/buffer.html

That way we could inspect the bytes for tests
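
Until that lands, a sketch of how a test could already peek at the raw bytes via the (ptr, bufsize) pair the interchange Buffer exposes (illustrative, not necessarily how the eventual tests were written):

import ctypes

import pandas as pd

col = pd.DataFrame({"a": ["foo", "bar"]}).__dataframe__().get_column_by_name("a")
data_buf, data_dtype = col.get_buffers()["data"]
# Read bufsize bytes starting at the buffer's pointer.
raw = ctypes.string_at(data_buf.ptr, data_buf.bufsize)
print(raw)  # b'foobar' (the concatenated UTF-8 data)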

Member

Started this in #55671

column = interchange.get_column_by_name("a")
buffers = column.get_buffers()
buffers_data = buffers["data"]
buffer_dtype = buffers_data[1]
Member

Feel free to ignore my possibly wrong commentary as I'm new to this, but I think the offset buffers don't have the proper bufsize here either

(Pdb) buffers["offsets"]
(PandasBuffer({'bufsize': 24, 'ptr': 94440192356160, 'device': 'CPU'}), (<DtypeKind.INT: 0>, 64, 'l', '='))

The standard StringType, which inherits from BinaryType in Arrow, uses a 32-bit offset value, so I think that bufsize should only be 12, unless we are mapping to LargeString intentionally
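
For what it's worth, the arithmetic behind that bufsize:

# Two strings need len + 1 = 3 offsets.
n_offsets = 2 + 1
print(n_offsets * 8)  # 24 bytes with 64-bit offsets (what the pdb output shows)
print(n_offsets * 4)  # 12 bytes with 32-bit offsets (Arrow's regular StringType)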

Member Author

thanks for looking into this - looks like it comes from

offsets = np.zeros(shape=(len(values) + 1,), dtype=np.int64)

where the dtype's being set to int64. OK to discuss/address this separately?

@MarcoGorelli
Member Author

@MarcoGorelli

Should this block the release?

not at all! if it can't be backported to 2.1.2, no issue, definitely not a blocker

@lithomas1 lithomas1 modified the milestones: 2.1.2, 2.1.3 Oct 25, 2023
@MarcoGorelli
Member Author

merging then, thanks all!

@MarcoGorelli MarcoGorelli merged commit ed10a14 into pandas-dev:main Nov 7, 2023
43 checks passed
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Nov 7, 2023
mroeschke pushed a commit that referenced this pull request Nov 8, 2023
Backport PR #55227: BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect (#55863)

Backport PR #55227: BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect

Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>
Co-authored-by: MarcoGorelli <33491632+MarcoGorelli@users.noreply.github.com>
Labels
Interchange Dataframe Interchange Protocol
Development

Successfully merging this pull request may close these issues.

BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect
6 participants