Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API inconsistency of ListBuilder make it hard to use as nested builder #5098

Closed
viirya opened this issue Nov 19, 2023 · 6 comments · Fixed by #5109
Closed

API inconsistency of ListBuilder make it hard to use as nested builder #5098

viirya opened this issue Nov 19, 2023 · 6 comments · Fixed by #5109
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@viirya
Copy link
Member

viirya commented Nov 19, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I'm working on something to build various array builders (could be nested) based on different data types. The purpose is to have builders used to append values from values of rows.

For primitive and struct types, they are working well.

But ListBuilder (I think MapBuilder has the same issue) is an exception. Its API design (it has a type parameter T: ArrayBuilder) is different from other builders. This inconsistency makes harder to put ListBuilder as a same piece like other builders.

For example, this crate has an API make_builder which returns Box<dyn ArrayBuilder>. Currently it doesn't handle DataType::List . Ideally, for ListBuilder we should be able to do:

DataType::List(field) => {
    let mut builder = make_builder(field.data_type(), capacity);
    Box::new(ListBuilder::with_capacity(builder, capacity))
}

But we cannot do it because ListBuilder requires type parameter of its value builder. Thought we can write a macro to do into_box_any().downcast::<$builder_type> on the returned Box<dyn ArrayBuilder> from make_builder to get concrete owned value of value builder. And that is what I did locally. But it only works for primitive builders.

For nested list types, we need to have nested type parameter of the nested ListBuilder type. I'm not sure how we can dynamically create a deeply nested type parameter in runtime?

Since ListBuilder has additional type parameter, it means a macro that can work on all other builder types can not work on it. Because it requires a second macro to handle different element builder type cases.

Describe the solution you'd like

I am wondering if we can remove the type parameter from ListBuilder and make its value builder as Box<dyn ArrayBuilder> as StructBuilder.

Describe alternatives you've considered

Additional context

@viirya viirya added enhancement Any new improvement worthy of a entry in the changelog bug and removed enhancement Any new improvement worthy of a entry in the changelog labels Nov 19, 2023
@tustvold
Copy link
Contributor

tustvold commented Nov 19, 2023

I'm not sure of the history here, but the current design does allow for some advantages when the types are known ahead of time

https://docs.rs/arrow-array/latest/arrow_array/builder/index.html#custom-builders

Perhaps we could allow creating a ListBuilder with a boxed builder if one so wished, i.e. Box<dyn ArrayBuilder> as the generic type. This might be as simple as deriving ArrayBuilder for Box<dyn ArrayBuilder>

@tustvold tustvold added enhancement Any new improvement worthy of a entry in the changelog and removed bug labels Nov 19, 2023
@viirya
Copy link
Member Author

viirya commented Nov 19, 2023

I'm not sure of the history here, but the current design does allow for some advantages when the types are known ahead of time

If we know the row column types in compile time, it is okay as shown in the example i32_list: ListBuilder<Int32Builder>. But we don't know the types in compile time, that being said we have arbitrary column types of rows and ListBuilder could be deeply nested. I'm not sure how to get the deeply nested type parameter in runtime, i.e., ListBuilder<ListBuilder<ListBuilder<...>>>

Currently I can use macro to create ListBuilder<T> for T as primitive and struct types because they don't have type parameters.

So any situation that needs to have ListBuilder<T> as parameter, it is impossible (I am not sure, at least I don't find a way) to have it in arbitrarily nested structure, because of the generic type parameter T.

@tustvold
Copy link
Contributor

What did you think of deriving ArrayBuilder for Box<dyn ArrayBuilder>? In general though building arbitrary nested data structures is easier bottom up, e.g. using the array constructors, as opposed to top-down using builders

@viirya
Copy link
Member Author

viirya commented Nov 19, 2023

I think it could make constructing ListBuilder for various element type more easier, at least I don't need such block:

DataType::List(field) => {
    match field.data_type() {
        DataType::Boolean => {
            ...
            Box::new(ListBuilder::new(*builder))
        }
        DataType::Int8 => {
            ...
        }
        ...
}

For nested ListBuilder, we don't need to downcast nested ListBuilder (we are blocked by this as we cannot figure out nested generic type dynamically).

@viirya
Copy link
Member Author

viirya commented Nov 19, 2023

BTW, I can work on this in spare time during my travel in next week to see if it works. (if you don't have time to work on this).

@tustvold
Copy link
Contributor

tustvold commented Jan 5, 2024

label_issue.py automatically added labels {'arrow'} from #5109

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants