Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support new Arrow PyCapsule Interface for Python FFI #5067

Closed
kylebarron opened this issue Nov 12, 2023 · 5 comments · Fixed by #5070
Closed

Support new Arrow PyCapsule Interface for Python FFI #5067

kylebarron opened this issue Nov 12, 2023 · 5 comments · Fixed by #5070
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@kylebarron
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

There's existing support for interop with Python via the C data interface in arrow::pyarrow.

As of pyarrow v14, there's a new specification for Python-based interop, called the Arrow PyCapsule Interface. In particular, this allows interfacing with any arrow-based Python library, instead of only pyarrow. It also ensures memory doesn't leak if the capsule isn't used.

Describe the solution you'd like
Updates to the Python interop to support the new capsule-based approach.

Describe alternatives you've considered

Additional context

@kylebarron kylebarron added the enhancement Any new improvement worthy of a entry in the changelog label Nov 12, 2023
@kylebarron
Copy link
Contributor Author

I took a quick stab at this but got stuck. In particular, I tried:

use arrow::array::ArrayData;
use arrow::ffi;
use pyo3::exceptions::{PyTypeError, PyValueError};
use pyo3::prelude::*;
use pyo3::types::{PyCapsule, PyTuple};
use pyo3::{PyAny, PyResult};

pub fn pyobj_to_array(ob: &'_ PyAny) -> PyResult<ArrayData> {
    if ob.hasattr("__arrow_c_array__")? {
        let tuple = ob.getattr("__arrow_c_array__")?.call0()?;

        if !tuple.is_instance_of::<PyTuple>() {
            return Err(PyTypeError::new_err(
                "Expected __arrow_c_array__ to return a tuple.",
            ));
        }

        let schema_capsule = tuple.get_item(0)?;
        if !schema_capsule.is_instance_of::<PyCapsule>() {
            return Err(PyTypeError::new_err(
                "Expected __arrow_c_array__ first element to be PyCapsule.",
            ));
        }
        let schema_capsule: &PyCapsule = PyTryInto::try_into(schema_capsule)?;
        let schema_capsule_name = schema_capsule.name()?;
        if schema_capsule_name.is_none() {
            return Err(PyValueError::new_err(
                "Expected PyCapsule to have name set.",
            ));
        }
        let schema_capsule_name = schema_capsule_name.unwrap().to_str()?;
        if schema_capsule_name != "arrow_schema" {
            return Err(PyValueError::new_err(
                "Expected name 'arrow_schema' in PyCapsule.",
            ));
        }

        let array_capsule = tuple.get_item(1)?;
        if !array_capsule.is_instance_of::<PyCapsule>() {
            return Err(PyTypeError::new_err(
                "Expected __arrow_c_array__ second element to be PyCapsule.",
            ));
        }
        let array_capsule: &PyCapsule = PyTryInto::try_into(array_capsule)?;
        let array_capsule_name = array_capsule.name()?;
        if array_capsule_name.is_none() {
            return Err(PyValueError::new_err(
                "Expected PyCapsule to have name set.",
            ));
        }
        let array_capsule_name = array_capsule_name.unwrap().to_str()?;
        if array_capsule_name != "arrow_array" {
            return Err(PyValueError::new_err(
                "Expected name 'arrow_array' in PyCapsule.",
            ));
        }

        let arr = unsafe {
            ffi::from_ffi(
                *array_capsule.reference::<ffi::FFI_ArrowArray>(),
                schema_capsule.reference::<ffi::FFI_ArrowSchema>(),
            )
            .unwrap()
        };
        return Ok(arr);
    }

    Err(PyValueError::new_err(
        "Expected an object with dunder __arrow_c_array__",
    ))
}

but this doesn't work because of

error[E0507]: cannot move out of a shared reference
  --> src/ffi2.rs:60:17
   |
60 |                 *array_capsule.reference::<ffi::FFI_ArrowArray>(),
   |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ move occurs because value has type `arrow::ffi::FFI_ArrowArray`, which does not implement the `Copy` trait

This has a different setup because instead of provisioning FFI structs ourselves for python to export into, python exports its own ffi structs inside the capsule. Maybe I'm missing something in the pyo3 api to get an owned pointer, but I assume that's not possible.

I figure here that the arrow-rs ffi code would need to be changed to allow a reference to a FFI_ArrowArray struct, just like it does with the schema? It would then just have to check that released on the struct is not true before importing..?

@tustvold
Copy link
Contributor

tustvold commented Nov 12, 2023

The arrow-rs FFI interfaces expect to take ownership of the provided data, as per the C data specification. That is when the last reference within arrow-rs created data structures is freed it should invoke release.

If this is what the PyCapsule is providing, then it should just be a case of std::ptr::read to obtain an owned value. However, if this is not how ownership in PyCapsule works, in particular if the ownership of the PyCapsule itself needs to be transfered and not the underlying C data array, then a new FFI mechanism would be required.

@kylebarron
Copy link
Contributor Author

std::ptr::read to obtain an owned value

Awesome! Sorry, I'm still learning low-level FFI stuff in Rust and didn't know

With that I got it to work!

#[pyfunction]
pub fn read_array(ob: &'_ PyAny) -> PyResult<()> {
    let arr = pyobj_to_array(ob)?;
    println!("{:?}", arr);
    Ok(())
}

pub fn pyobj_to_array(ob: &'_ PyAny) -> PyResult<ArrayData> {
    if ob.hasattr("__arrow_c_array__")? {
        let tuple = ob.getattr("__arrow_c_array__")?.call0()?;

        if !tuple.is_instance_of::<PyTuple>() {
            return Err(PyTypeError::new_err(
                "Expected __arrow_c_array__ to return a tuple.",
            ));
        }

        let schema_capsule = tuple.get_item(0)?;
        if !schema_capsule.is_instance_of::<PyCapsule>() {
            return Err(PyTypeError::new_err(
                "Expected __arrow_c_array__ first element to be PyCapsule.",
            ));
        }
        let schema_capsule: &PyCapsule = PyTryInto::try_into(schema_capsule)?;
        let schema_capsule_name = schema_capsule.name()?;
        if schema_capsule_name.is_none() {
            return Err(PyValueError::new_err(
                "Expected PyCapsule to have name set.",
            ));
        }
        let schema_capsule_name = schema_capsule_name.unwrap().to_str()?;
        if schema_capsule_name != "arrow_schema" {
            return Err(PyValueError::new_err(
                "Expected name 'arrow_schema' in PyCapsule.",
            ));
        }

        let array_capsule = tuple.get_item(1)?;
        if !array_capsule.is_instance_of::<PyCapsule>() {
            return Err(PyTypeError::new_err(
                "Expected __arrow_c_array__ second element to be PyCapsule.",
            ));
        }
        let array_capsule: &PyCapsule = PyTryInto::try_into(array_capsule)?;
        let array_capsule_name = array_capsule.name()?;
        if array_capsule_name.is_none() {
            return Err(PyValueError::new_err(
                "Expected PyCapsule to have name set.",
            ));
        }
        let array_capsule_name = array_capsule_name.unwrap().to_str()?;
        if array_capsule_name != "arrow_array" {
            return Err(PyValueError::new_err(
                "Expected name 'arrow_array' in PyCapsule.",
            ));
        }
        let array_ptr = array_capsule.pointer();

        let array_ptr = array_ptr as *const ffi::FFI_ArrowArray;
        let owned_array_ptr = unsafe { std::ptr::read(array_ptr) };

        let arr = unsafe {
            ffi::from_ffi(
                owned_array_ptr,
                schema_capsule.reference::<ffi::FFI_ArrowSchema>(),
            )
            .unwrap()
        };
        return Ok(arr);
    }

    Err(PyValueError::new_err(
        "Expected an object with dunder __arrow_c_array__",
    ))
}

Then in Python:

>>> import pyarrow as pa
>>> arr = pa.array([1, 2, 3, 4])
>>> geoarrow.rust.rust.read_array(arr)
ArrayData { data_type: Int64, len: 4, offset: 0, buffers: [Buffer { data: Bytes { ptr: 0x596540300c0, len: 32, data: [1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0] }, ptr: 0x596540300c0, length: 32 }], child_data: [], nulls: None }

I'm still not 100% understanding the ownership model of the pycapsule, but at least the proof of concept works

@kylebarron
Copy link
Contributor Author

Well, I take that back, if I call read_array twice in a row on the same pyarrow array, it segfaults.

>>> geoarrow.rust.rust.read_array(arr)
ArrayData { data_type: Int64, len: 4, offset: 0, buffers: [Buffer { data: Bytes { ptr: 0x596540300c0, len: 32, data: [1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0] }, ptr: 0x596540300c0, length: 32 }], child_data: [], nulls: None }
>>> geoarrow.rust.rust.read_array(arr)
[1]    9141 segmentation fault  python

So clearly I'm doing something wrong with the ownership model

@tustvold tustvold added the arrow Changes to the arrow crate label Jan 5, 2024
@tustvold
Copy link
Contributor

tustvold commented Jan 5, 2024

label_issue.py automatically added labels {'arrow'} from #5070

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants