Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-14875: [C++] C Data Interface: check imported buffer for non-null #14814

Merged
merged 2 commits into from Jan 17, 2023

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Dec 1, 2022

The C data interface may expose null data pointers for zero-sized buffers.
Make sure that all buffer pointers remain non-null internally.

Followup to GH-14805

@github-actions
Copy link

github-actions bot commented Dec 1, 2022

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of old issues on JIRA the title also supports:

ARROW-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@pitrou
Copy link
Member Author

pitrou commented Dec 1, 2022

@zeroshade @alippai @alamb @viirya @tustvold Would it be easy to try this against the Go and Rust Arrow implementations?

@alamb
Copy link
Contributor

alamb commented Dec 5, 2022

Sorry @pitrou I am not sure how to test this against Rust. Did you have anything specific have in mind beyond the existing integration CI check https://github.com/apache/arrow/actions/runs/3595507889/jobs/6055082178 ?

@pitrou
Copy link
Member Author

pitrou commented Dec 5, 2022

@alamb AFAIR there are Rust-PyArrow integration tests through the C Data Interface, no?

@wjones127
Copy link
Member

@pitrou I think we just need to add a test case to this file roundtripping an empty array, and then test that against PyArrow built off of this branch, correct?

If so, I can do that.

@pitrou
Copy link
Member Author

pitrou commented Dec 5, 2022

@wjones127 That would probably work, yes.

@wjones127
Copy link
Member

I added tests here: apache/arrow-rs#3276

And locally tested them against PyArrow built from this branch. All tests pass 👍

@zeroshade
Copy link
Member

@pitrou I'll build a pyarrow off this branch and test a roundtrip of an empty array between pyarrow and Go via C data. I'll comment again after I set it up.

@zeroshade
Copy link
Member

@pitrou when I do round trips of the empty arrays, after fixing the crash in Go, I run into a few issues with pyarrow upon importing the empty array from C:

The data types: pa.binary(), pa.large_binary(), pa.string(), pa.large_string() all throw an error when calling validate(full=True) after the _import_from_c that contained a null value data buffer:

Traceback (most recent call last):
  File "/home/zeroshade/Projects/GitHub/arrow/go/arrow/cdata/test/test_export_to_cgo.py", line 218, in test
    b.validate(full=True)
  File "pyarrow/array.pxi", line 1501, in pyarrow.lib.Array.validate
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Value data buffer is null

I also ran into other cases where it through an error in _import_from_c because of null values but I think those are my fault, I'm going to double check and try again. But I'm pretty sure those validate's shouldn't be failing in this case if we're saying it's acceptable for a 0-length array to contain a null value data buffer.

@pitrou
Copy link
Member Author

pitrou commented Dec 7, 2022

@zeroshade Nice catch, can you open an issue?

@zeroshade
Copy link
Member

@pitrou I've opened issue #14875 for that. Gonna open a separate issue to fix the Go crash and see if the other failures are Go's fault or not.

@zeroshade
Copy link
Member

I've opened #14877 to fix the Go side of these, though some of the python tests have to be commented out until #14875 is resolved.

The C data interface may expose null data pointers for zero-sized buffers.
Make sure that all buffer pointers remain non-null internally.

Followup to apacheGH-14805.
@pitrou pitrou force-pushed the c-data-interface-check-buffer-ptr branch from 9599051 to e67c687 Compare January 17, 2023 11:53
@pitrou
Copy link
Member Author

pitrou commented Jan 17, 2023

@zeroshade This draft PR now passes the uncommented Go tests (assuming you install the corresponding PyArrow version).

@pitrou pitrou changed the title DRAFT: [C++] C Data Interface: check imported buffer for non-null GH-14875: [C++] C Data Interface: check imported buffer for non-null Jan 17, 2023
@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #14875 has been automatically assigned in GitHub to PR creator.

@pitrou pitrou marked this pull request as ready for review January 17, 2023 15:21
@zeroshade
Copy link
Member

@pitrou if this passes the tests now, would it make sense to uncomment them in this PR? Or should we do that in a separate PR?

@pitrou
Copy link
Member Author

pitrou commented Jan 17, 2023

@zeroshade We can't uncomment them until a PyArrow wheel is released that incorporates the fix, AFAIU.

@pitrou pitrou requested a review from lidavidm January 17, 2023 15:49
@pitrou
Copy link
Member Author

pitrou commented Jan 17, 2023

@paleolimbot FYI

@pitrou pitrou merged commit 7319250 into apache:master Jan 17, 2023
@pitrou pitrou deleted the c-data-interface-check-buffer-ptr branch January 17, 2023 17:53
raulcd pushed a commit that referenced this pull request Jan 18, 2023
…14814)

The C data interface may expose null data pointers for zero-sized buffers.
Make sure that all buffer pointers remain non-null internally.

Followup to GH-14805

* Closes: #14875

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@jorisvandenbossche jorisvandenbossche added this to the 11.0.0 milestone Jan 18, 2023
@ursabot
Copy link

ursabot commented Jan 19, 2023

Benchmark runs are scheduled for baseline = 627caf3 and contender = 7319250. 7319250 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.51% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.74% ⬆️0.06%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 73192505 ec2-t3-xlarge-us-east-2
[Failed] 73192505 test-mac-arm
[Finished] 73192505 ursa-i9-9960x
[Finished] 73192505 ursa-thinkcentre-m75q
[Finished] 627caf3f ec2-t3-xlarge-us-east-2
[Failed] 627caf3f test-mac-arm
[Finished] 627caf3f ursa-i9-9960x
[Finished] 627caf3f ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Python][C++] C Data Interface incorrect validate failures
7 participants