
BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect #55227

Merged

Conversation

MarcoGorelli
Member

@MarcoGorelli MarcoGorelli commented Sep 21, 2023

I haven't included a whatsnew note as this isn't user-facing

@MarcoGorelli MarcoGorelli force-pushed the use-buffer-dtype-in-from-dataframe branch from 8d31eb0 to 9876c64 Compare September 21, 2023 11:37
)
data_buff, data_dtype = buffers["data"]

if (data_dtype[1] == 8) and (
Member

Out of curiosity what is the 8 here supposed to represent?

Member Author

hey - from https://data-apis.org/dataframe-protocol/latest/API.html, it's the number of bits:

    @property
    @abstractmethod
    def dtype(self) -> Dtype:
        """
        Dtype description as a tuple ``(kind, bit-width, format string, endianness)``.

        Bit-width : the number of bits as an integer
        Format string : data type description format string in Apache Arrow C
                        Data Interface format.
        Endianness : current only native endianness (``=``) is supported
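
For a concrete view of that tuple, here's a small sketch using only public interchange calls (the printed values are indicative and may differ slightly across versions):

import pandas as pd

# Ask the interchange object for a column's dtype description:
# (kind, bit-width, Arrow C format string, endianness).
col = pd.DataFrame({"a": [1.5, 2.5]}).__dataframe__().get_column_by_name("a")
kind, bit_width, fmt, endianness = col.dtype
print(kind, bit_width, fmt, endianness)
# e.g. DtypeKind.FLOAT 64 g =   <- 64 is the bit-width being discussed here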

Member

@WillAyd WillAyd Sep 22, 2023

Are string dtypes supposed to have an 8-bit association? That is kind of confusing for variable-length types; granted, I know very little of how this interchange is supposed to work

Member Author

I think the idea is that strings are meant to be UTF-8, and so each character can be represented with 8 bits

Member

Hmm, interesting. Well, keep in mind that UTF-8 doesn't mean a character is 8 bits; a character is still 1-4 bytes

In arrow-adbc I've seen this assigned the value of 0

https://github.com/apache/arrow-adbc/blob/0d8707a5ee2622ba959b069cd173bfe6ee2aaff3/c/driver/postgresql/statement.cc#L225
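
As a quick plain-Python illustration of that point:

# Each UTF-8 code unit is 8 bits, but one character may span 1-4 of them.
for ch in ["a", "é", "€", "🐼"]:
    print(ch, len(ch.encode("utf-8")), "byte(s)")
# a 1 byte(s), é 2 byte(s), € 3 byte(s), 🐼 4 byte(s)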

Member

A UTF-8 array consists of the actual string data plus an offsets array. The buffer holding all the string data (which is typically much longer than the logical length of the array) can be seen as simply a buffer of bytes (a "bytearray"), so in numpy / buffer-interface terms you can view such an array as one with bit width 8

In arrow-adbc I've seen this assigned the value of 0

That's something Postgres-specific, I think
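
Here is a small numpy sketch of the layout being described (hypothetical buffers built by hand, not pandas' actual ones):

import numpy as np

# The UTF-8 data for ["foo", "bar"] lives in one flat byte buffer, viewable
# as uint8 (bit width 8); a separate offsets array marks the string boundaries.
strings = ["foo", "bar"]
data = np.frombuffer(b"".join(s.encode("utf-8") for s in strings), dtype=np.uint8)
offsets = np.array([0, 3, 6], dtype=np.int64)  # len(strings) + 1 entries
print(data)     # [102 111 111  98  97 114]  -> 6 bytes of string data
print(offsets)  # [0 3 6]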

# temporary workaround to keep backwards compatibility due to
# https://github.com/pandas-dev/pandas/issues/54781
# Consider dtype being `uint` to get number of units passed since the 01.01.1970
data_dtype = (
    DtypeKind.UINT,
Member

Shouldn't this be an INT? Timestamps are backed by 64-bit signed integers in Arrow

https://github.com/apache/arrow/blob/772a01c080ad57eb11e9323f5347472b769d45de/format/Schema.fbs#L264

Member

Yes, AFAIK that should be signed INT
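
For reference, a minimal numpy illustration of why a signed INT description is the natural fit:

import numpy as np

# A datetime64[ns] array is just signed 64-bit integers underneath
# (nanoseconds since the Unix epoch).
ts = np.array(["1970-01-01", "1970-01-02"], dtype="datetime64[ns]")
print(ts.view(np.int64))  # [0 86400000000000]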

@MarcoGorelli MarcoGorelli marked this pull request as ready for review October 3, 2023 11:54
@jorisvandenbossche
Member

The tricky thing will be to test this ... Given that we don't yet have implementations to test with (pandas itself doesn't yet return the correct dtype), should we add some very specific unit tests explicitly targeting the helper functions, where we can pass both types of dtypes? Although I assume that's still difficult, as those helpers take a Column object, and then we would have to mock a Column object just for those tests...
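
A rough, standalone sketch of what such a mocked Column could look like (illustrative only, not the test that was eventually written; the real interchange Column exposes more attributes than are stubbed here, and the dtype kind codes are written as plain ints per the spec):

from unittest.mock import MagicMock

import numpy as np

# Fake buffers backing the strings ["foo", "bar"].
data = np.frombuffer(b"foobar", dtype=np.uint8)
offsets = np.array([0, 3, 6], dtype=np.int64)

mock_col = MagicMock()
mock_col.dtype = (21, 8, "u", "=")  # (STRING kind, bit width, format, endianness)
mock_col.get_buffers.return_value = {
    "data": (
        MagicMock(ptr=data.ctypes.data, bufsize=data.nbytes),
        (1, 8, "C", "="),  # uint8 description of the raw UTF-8 data buffer
    ),
    "offsets": (
        MagicMock(ptr=offsets.ctypes.data, bufsize=offsets.nbytes),
        (0, 64, "l", "="),  # signed 64-bit offsets
    ),
    "validity": None,
}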

@MarcoGorelli
Member Author

good point, marking as draft til I get back to that

@MarcoGorelli MarcoGorelli marked this pull request as draft October 3, 2023 12:32
@MarcoGorelli
Member Author

I've tried this out with:

and now

import pandas as pd
import polars as pl

df = pl.DataFrame({'a': ['foo', 'bar']})
print(pd.api.interchange.from_dataframe(df))

runs fine (whereas using pandas main and that branch from polars would've raised

>       assert protocol_data_dtype[2] in (
            ArrowCTypes.STRING,
            ArrowCTypes.LARGE_STRING,
        )  # format_str == utf-8
E       AssertionError

)

@stinodego fancy taking a look?

@MarcoGorelli MarcoGorelli marked this pull request as ready for review October 9, 2023 05:53
Contributor

@stinodego stinodego left a comment

Looks to me like you're making things more complicated than they need to be! The fix should be really simple.

If my comments aren't clear, I can make a small PR to show what I was thinking.

Comment on lines 272 to 275
assert protocol_data_dtype[2] in (
    ArrowCTypes.STRING,
    ArrowCTypes.LARGE_STRING,
)  # format_str == utf-8
Contributor

This assertion is valid, but it should be on col.dtype[2] rather than protocol_data_dtype[2].
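
To make the distinction concrete, a small sketch (exact reprs depend on the pandas version; the data buffer dtype shown is the uint8 description this PR moves towards):

import pandas as pd

col = pd.DataFrame({"a": ["foo", "bar"]}).__dataframe__().get_column_by_name("a")
# The *column* dtype carries the string format string ("u" / "U"), which is
# what the assertion should check ...
print(col.dtype)                     # e.g. (<DtypeKind.STRING: 21>, 8, 'u', '=')
# ... while the *data buffer* dtype describes the raw UTF-8 code units.
print(col.get_buffers()["data"][1])  # the uint8 description once this PR is in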

@@ -266,21 +266,29 @@ def string_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]:

    assert buffers["offsets"], "String buffers must contain offsets"
    # Retrieve the data buffer containing the UTF-8 code units
    data_buff, protocol_data_dtype = buffers["data"]
    # We're going to reinterpret the buffer as uint8, so make sure we can do it safely
    assert protocol_data_dtype[1] == 8
Contributor

This assertion can indeed be deleted, as we can assume bit width 8 if the column dtype is STRING or LARGE_STRING.

ArrowCTypes.UINT8,
Endianness.NATIVE,
)
data_buff, data_dtype = buffers["data"]
Contributor

We can simply ignore the data dtype here, as we know what it needs to be (we set it later).

Suggested change
data_buff, data_dtype = buffers["data"]
data_buff, _ = buffers["data"]

Comment on lines 286 to 291
data_dtype = (
    DtypeKind.UINT,
    8,
    ArrowCTypes.UINT8,
    Endianness.NATIVE,
)
Contributor

This does not need to be in an if-block. We can simply disregard the data buffer dtype - for string columns, this will ALWAYS be as listed here. This was already the case in the previous code! Only the assertions were wrong.

Comment on lines 386 to 389
DtypeKind.UINT,
dtype[1],
getattr(ArrowCTypes, f"UINT{dtype[1]}"),
Endianness.NATIVE,
Contributor

This code was fine, but needs to be DtypeKind.INT as you have found, and use col.dtype[1] rather than dtype[1].

@MarcoGorelli
Member Author

thanks for your review 🙏 ! I've tried addressing your comments

data_dtype,
(
    DtypeKind.INT,
    col.dtype[1],
Contributor

We unpack col.dtype on line 381; it'll be slightly more efficient to get the bit width from there!

@stinodego
Contributor

thanks for your review 🙏 ! I've tried addressing your comments

Yeah I think this should do it!

Co-authored-by: Stijn de Gooijer <stijn@degooijer.io>
@mroeschke mroeschke added the Interchange Dataframe Interchange Protocol label Oct 12, 2023
@stinodego
Contributor

Any chance this can get merged and be part of the next release? 🙏

This is blocking improvements to the protocol across all dataframe libraries.

@MarcoGorelli MarcoGorelli added this to the 2.1.2 milestone Oct 22, 2023
@MarcoGorelli
Member Author

@jorisvandenbossche fancy taking a look please?

I'd suggest backporting this; it has no user-facing impact anyway

@lithomas1
Member

@MarcoGorelli

Should this block the release?

df = pd.DataFrame({"a": ["foo", "bar"]}).__dataframe__()
interchange = df.__dataframe__()
column = interchange.get_column_by_name("a")
buffers = column.get_buffers()
Member

@WillAyd WillAyd Oct 24, 2023

Not a blocker for this PR, but I think these tests would be more impactful if we made PandasBuffer implement the buffer protocol:

https://docs.cython.org/en/latest/src/userguide/buffer.html

That way we could inspect the bytes for tests
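
Until that lands, a sketch of how a test could already peek at the raw bytes via the (ptr, bufsize) pair the interchange Buffer exposes (illustrative, not necessarily how the eventual tests were written):

import ctypes

import pandas as pd

col = pd.DataFrame({"a": ["foo", "bar"]}).__dataframe__().get_column_by_name("a")
data_buf, data_dtype = col.get_buffers()["data"]
# Read bufsize bytes starting at the buffer's pointer.
raw = ctypes.string_at(data_buf.ptr, data_buf.bufsize)
print(raw)  # b'foobar' (the concatenated UTF-8 data)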

Member

Started this in #55671

column = interchange.get_column_by_name("a")
buffers = column.get_buffers()
buffers_data = buffers["data"]
buffer_dtype = buffers_data[1]
Member

Feel free to ignore my possibly wrong commentary as I'm new to this, but I think the offset buffers don't have the proper bufsize here either

(Pdb) buffers["offsets"]
(PandasBuffer({'bufsize': 24, 'ptr': 94440192356160, 'device': 'CPU'}), (<DtypeKind.INT: 0>, 64, 'l', '='))

The standard StringType, which inherits from BinaryType in Arrow, uses a 32-bit offset value, so I think that bufsize should only be 12, unless we are mapping to LargeString intentionally
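
For what it's worth, the arithmetic behind that bufsize:

# Two strings need len + 1 = 3 offsets.
n_offsets = 2 + 1
print(n_offsets * 8)  # 24 bytes with 64-bit offsets (what the pdb output shows)
print(n_offsets * 4)  # 12 bytes with 32-bit offsets (Arrow's regular StringType)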

Member Author

thanks for looking into this - looks like it comes from

offsets = np.zeros(shape=(len(values) + 1,), dtype=np.int64)

where the dtype's being set to int64. OK to discuss/address this separately?

@MarcoGorelli
Member Author

@MarcoGorelli

Should this block the release?

not at all! if it can't be backported to 2.1.2, no issue, definitely not a blocker

@lithomas1 lithomas1 modified the milestones: 2.1.2, 2.1.3 Oct 25, 2023
@MarcoGorelli
Member Author

merging then, thanks all!

@MarcoGorelli MarcoGorelli merged commit ed10a14 into pandas-dev:main Nov 7, 2023
43 checks passed
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Nov 7, 2023
mroeschke pushed a commit that referenced this pull request Nov 8, 2023
Backport PR #55227: BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect (#55863)

Backport PR #55227: BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect

Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>
Co-authored-by: MarcoGorelli <33491632+MarcoGorelli@users.noreply.github.com>
Labels
Interchange Dataframe Interchange Protocol
Development

Successfully merging this pull request may close these issues.

BUG: Interchange object data buffer has the wrong dtype / from_dataframe incorrect
6 participants