
Add an ExtensionType to DataType enum #4472

Open
wjones127 opened this issue Jun 30, 2023 · 16 comments
Labels
arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of an entry in the changelog)

Comments

@wjones127
Member

wjones127 commented Jun 30, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Extension types are annotated in field metadata. This works well with record batches, but when exporting/importing an array over the C data interface, the extension type metadata is lost.

The C++ implementation solves this by having an ExtensionType class and always exporting that metadata over the C data interface here:

https://github.com/apache/arrow/blob/b9aec9ad2b655817b8925462e4e2dd6973807e23/cpp/src/arrow/c/bridge.cc#L243-L252

Describe the solution you'd like

I'd propose adding a new enum variant to DataType:

struct ExtensionType {
   name: String,
   metadata: String,
   storage_type: Box<DataType>,
}

enum DataType {
    ...
    ExtensionType(ExtensionType)
}

Then make sure the C data interface implementation handles exporting and importing this type.

Describe alternatives you've considered

We could add an extension type registry like C++ has, but that seems heavier than we really need.

Additional context

https://arrow.apache.org/docs/format/CDataInterface.html#extension-arrays

Previous discussions:

@wjones127 added the arrow and enhancement labels on Jun 30, 2023
@tustvold
Contributor

tustvold commented Jun 30, 2023

but when exporting/importing an array over the C data interface, the extension type metadata is lost

I might be missing something here, but why would it be lost? Schema metadata should roundtrip over the C data interface.

I'd propose adding a new enum variant to DataType:

My major objection to this approach is that it undermines the transparency of extension types. I feel quite strongly that only codepaths explicitly concerned with extension types should need to concern themselves with them; for example, the take or arithmetic kernels should not need to know about extension types. However, adding a DataType::Extension would instead force kernels to "see through" DataType::Extension when downcasting or performing operations such as extracting decimal precision. I would much prefer an approach that does not require this, by instead propagating extension metadata out-of-band.

@wjones127
Member Author

I might be missing something here, but why would it be lost? Schema metadata should roundtrip over the C data interface.

This works well for RecordBatch, but not for an individual array transported independently of any batch. Basically, arrays themselves have no way to be tagged as extension arrays, since they don't carry a Field where that metadata could be stored; they are only extension arrays in the context of a batch.

I feel quite strongly that only codepaths explicitly concerned with extension types should need to concern themselves with them; for example, the take or arithmetic kernels should not need to know about extension types.

I definitely agree, and don't want to make these operations more complex than they ought to be.

If we can think of another place to put this information, I'm open to that.

(A bit of a tangent, but...) In my ideal world, there would be a logical type enum and a physical type enum. Physical types would be the current DataType. Then logical types would be things like String (just one, regardless of offset size and encoding) and then a generic ExtensionType variant. Sort of like what Sasha was talking about a long time ago: https://lists.apache.org/thread/357z4587dczho4x1257ttf0b4o9302co

@wjones127
Member Author

FWIW my workaround for now is to just wrap it in a record batch and unwrap on the other side. But it would be nice to find a way to not have to do that.

https://github.com/lancedb/lance/blob/3e1ed67acf7d1de336b6d211647d9581c64c3bed/python/src/arrow.rs#L53-L67

@tustvold
Contributor

tustvold commented Jul 1, 2023

If we can think of another place to put this information, I'm open to that.

I personally think Field is the correct location for such metadata; imo DataType should only contain the information kernels need to interpret the physical array data. Concerns about nullability, specialized kernels for extension types, etc. should, I think, be handled at a higher level. Whilst it isn't a true logical vs physical separation, I think that is a helpful way to conceptualize it.

But it would be nice to find a way to not have to do that.

This feels like a limitation of the way the python conversion is implemented

impl ToPyArrow for ArrayData {
    fn to_pyarrow(&self, py: Python) -> PyResult<PyObject> {
        let array = FFI_ArrowArray::new(self);
        // The FFI schema is derived from the DataType alone, so metadata on
        // a surrounding Field (extension info, nullability) never makes it
        // into the exported schema.
        let schema = FFI_ArrowSchema::try_from(self.data_type()).map_err(to_py_err)?;

        let module = py.import("pyarrow")?;
        let class = module.getattr("Array")?;
        let array = class.call_method1(
            "_import_from_c",
            (
                addr_of!(array) as Py_uintptr_t,
                addr_of!(schema) as Py_uintptr_t,
            ),
        )?;
        Ok(array.to_object(py))
    }
}

In particular the schema is inferred from the array's data type. If instead there were a way to provide a Field then this would allow propagating not only extension metadata, but also nullability, dictionary ordering, etc... It would also potentially allow performing the schema conversion once and using the result for multiple arrays.

@wjones127
Member Author

I personally think Field is the correct location for such metadata; imo DataType should only contain the information kernels need to interpret the physical array data. Concerns about nullability, specialized kernels for extension types, etc. should, I think, be handled at a higher level.

I think that could be a decent approach, although I'm still trying to understand what that would look like. It sounds like the arrow-rs type system is closed but can be wrapped in a higher-level type system (whereas the C++ kernels are extension-aware).

So it sounds like the place to add extension types is when building extensions in DataFusion. Eventually, I think it would be nice to have an example in arrow-datafusion showing how to add support for a simple extension type (such as UUID) in the engine. Basically, the end result would be showing that

SELECT gen_random_uuid()

outputs

+--------------------------------------+
| gen_random_uuid()                    |
+--------------------------------------+
| eeccb8c5-9943-b2bb-bb5e-222f4e14b687 |
+--------------------------------------+

This would show that you can add functions that output extension types (gen_random_uuid()) and that you can control how those extension types are displayed.

@yukkit
Contributor

yukkit commented Aug 30, 2023

I encountered the same trouble when trying to add a custom type. Although an extension type can be marked through the metadata of a Field, that metadata is lost during array-level processing, for example in a DataFusion UDF.

@kylebarron
Contributor

kylebarron commented Oct 5, 2023

I'm taking a stab at migrating my geoarrow-rs crate (which implements the GeoArrow extension array spec) from arrow2 to arrow-rs, and wanted to add that I'm feeling the pain of the missing DataType::Extension variant in arrow-rs.

In particular, a geospatial algorithm would have to return a Field with every operation, because the physical layout of a LineStringArray is exactly the same as that of a MultiPointArray (and PolygonArray/MultiLineStringArray). Maybe this is nitpicking, but I've liked the level of abstraction of having the extension metadata on the DataType, because the operations on the array are separate from a named column in a table.

Edit: If I'm understanding correctly, it's also impossible to implement

impl TryFrom<&dyn Array> for GeometryArray

like I could in arrow2, because dyn Array never has any extension type information, so I wouldn't be able to know what type of geometries the array is holding...

@alamb
Contributor

alamb commented Oct 25, 2023

@yukkit is contemplating User Defined Types in DataFusion, and the arrow extension type mechanism is the obvious implementation I think -- see apache/datafusion#7923

I personally think Field is the correct location for such metadata,

@tustvold are you proposing something like the following?

enum DataType {
    // ...
    List(FieldRef),
    /// Extension type, with potentially embedded metadata in the field reference
    Extension(FieldRef),
}

This proposal runs afoul of how DataType::List works today (where the field name is mostly irrelevant ("item")), but I don't really have any better ideas.

I think this structure would allow @kylebarron to implement

impl TryFrom<&dyn Array> for GeometryArray {
    fn try_from(arr: &dyn Array) -> Result<Self> {
        match arr.data_type() {
            DataType::Extension(field) if is_geo_type(field.metadata()) => {
                // ... do the conversion ...
            }
            dt => Err(format!("Unsupported datatype: {dt}")),
        }
    }
}

@tustvold
Contributor

tustvold commented Oct 25, 2023

are you proposing something like the following?

No, I'm proposing not making changes to DataType and instead using the Field metadata that already exists. This way we avoid conflating physical and logical type information. This is the same mechanism we use in IOx to encode the logical notion of tag vs field columns.

@alamb
Contributor

alamb commented Oct 25, 2023

No, I'm proposing not making changes to DataType and instead using the Field metadata that already exists. This way we avoid conflating physical and logical type information.

So how would we implement @kylebarron 's use case? Perhaps via a RecordBatch (with a single column)?

@kylebarron
Contributor

My interpretation of this is that it's a "zero-sum" architecture decision, in the sense that if you don't want to conflate logical and physical types in the DataType enum, then there's intentionally no way to implement TryFrom on &dyn Array; instead it's only possible to implement it on (&dyn Array, &FieldRef), as in the sketch below.

@tustvold
Contributor

tustvold commented Oct 25, 2023

You would need whatever performs the kernel selection to have the Field, most likely via the Schema; I'm not sure you necessarily need this information simultaneously with the Array?

For example, the DF PhysicalExpr could have already extracted the necessary metadata at plan time (although I think it has access to the schema anyway).

@wjones127
Member Author

@tustvold do you think RecordBatch is something that end-users should be seeing? Or do you imagine this should be a hidden implementation detail in all cases?

If it's the latter, I think I can understand the position to keep extension types separate. But if it's the former, it's hard to see how we can provide a decent UX without bringing the extension type into the array itself. For example, if we return a RecordBatch with a UUID column, a user might reasonably be surprised that the column prints as raw bytes and not as a hyphenated UUID string.

@tustvold
Contributor

RecordBatch/Schema is but one way that users might choose to expose logical type information; they might also define their own array abstractions that wrap arrow arrays, or their own schema abstraction at plan time, etc.

As @kylebarron rightly states, it's a zero-sum API challenge: either all of arrow must become aware of extension types, or they are confined to the areas that actually care. It seems odd to me to optimize the design here for things that are not present in the specification at the expense of everything else; further, it seems unfortunate to optimize for one particular way of encoding logical type information.

I don't have all the answers here; I don't know what a general-purpose logical type abstraction looks like, if such a thing even exists, but it does seem that the core library shouldn't be opinionated in this regard.

@kylebarron
Contributor

This feels like a limitation of the way the python conversion is implemented

For Python conversion specifically, this might be solved with the new PyCapsule interface (ref #5067), because the ArrowSchema FFI struct is generated by pyarrow itself and so doesn't have to be inferred from array.type. (I haven't verified how it works with extension arrays yet.)

@alamb
Contributor

alamb commented Nov 14, 2023

FYI @yukkit has created a PR showing how LogicalTypes might work in DataFusion: apache/datafusion#8143. It is a pretty neat idea.
