Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: cast (Large)List to FixedSizeList #5081

Merged
merged 5 commits into from Nov 17, 2023

Conversation

wjones127
Copy link
Member

Which issue does this PR close?

Closes #4728

Rationale for this change

Adds cast from list to FixedSizeList. I intend to use this in DataFusion, where I'd like to allow users to write array literals in SQL (which default to List) and have the planner cast them to FixedSizeList when appropriate.

What changes are included in this PR?

cast_with_options() now supports List to FixedSizeList.

Are there any user-facing changes?

This adds a new feature. A suite of tests are added. Let me know if you think there are important cases missing.

@wjones127 wjones127 added arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog labels Nov 16, 2023
@tustvold
Copy link
Contributor

Perhaps we could provide these on the arrays themselves and then use this within the cast kernel, much like we do for the reverse transformation - https://docs.rs/arrow-array/latest/arrow_array/array/struct.GenericListArray.html#impl-From%3CFixedSizeListArray%3E-for-GenericListArray%3COffsetSize%3E

@wjones127
Copy link
Member Author

@tustvold I'm actually not sure how to satisfy the test_can_cast_types test. Right now can_cast_types() can validate the child types can be cast, but can't validate the lengths are of appropriate size.

One option is to change the implementation so that if safe is true, then we truncate or fill nulls into the arrays to coerce them into the correct size. That brings issues though if the output field is not nullable.

Are there other cases of types where can_cast_types() isn't a strong guarantee that the cast() function won't error?

@wjones127
Copy link
Member Author

Perhaps we could provide these on the arrays themselves and then use this within the cast kernel, much like we do for the reverse transformation

I can look at this. The reason I didn't put it there was I didn't think I could bring the cast kernel into that file, since we want to cast the child values. But maybe that's okay? Or I can refactor somehow to do the child cast in cast.rs? I'll see what I can do.

@tustvold
Copy link
Contributor

tustvold commented Nov 16, 2023

Are there other cases of types where can_cast_types() isn't a strong guarantee that the cast() function won't error?

The string parsing ones

I would expect when safe is true to fill the FixedSizeListArray with a null value for the errored index. This is consistent with how it behaves for other casts. The nullability concern is valid, and is why DataFusion has both CastExpr and TryCastExpr.

Tbh I'm not a fan of the safe option, but it is important for spark compatibility

Or I can refactor somehow to do the child cast in cast.rs? I'll see what I can do.

Aah forgot about that, perhaps not worth it then

@wjones127
Copy link
Member Author

Tbh I'm not a fan of the safe option

Ha I always get confused as to whether "safe" means "I will silently create nulls" or "I will raise an error if the cast cannot be done". PyArrow has a somewhat opposite definition of safe.

Comment on lines 285 to 288
/// * List to FixedSizeList: the underlying data type is cast. If the list size is different
/// than the FixedSizeList size and safe casting is requested, then lists are
/// truncated or filled will some value to achieve the requested size. If the output
/// field is nullable, the fill value is null, otherwise it is the first value.
Copy link
Member Author

@wjones127 wjones127 Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this behavior, but I'm not sure a better way in the current framework. Personally, I don't think I ever care to use safe=True here. But I'd also like it to not be too shocking to those that do use it.

Copy link
Contributor

@tustvold tustvold Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We allow nulls in non-nullable children provided they are masked by the parents null mask.

I'm not sure about truncating, I'll think on it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I hadn't thought about having the list itself just making anything improperly sized be null. That seems like it would be reasonable behavior. We can ignore truncation, I don't think we want that.

@tustvold
Copy link
Contributor

I plan to have a play with this today to see if I can't come up with something a little simpler

@tustvold
Copy link
Contributor

I've pushed a PR that uses MutableArrayData to avoid doing multiple passes of the array, I think it should be faster and I find it a bit easier to follow. PTAL and let me know what you think

}
mutable.extend_nulls(size as _);
nulls.as_mut().unwrap().set_bit(idx, false);
last_pos = end_pos
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this works, but I found the logic of last_pos confusing. I'd like to add a comment about how it pads

When we detect a list doesn't start at the correct point
we use mutable.extend() to pad the previous list. last_pos
tracks the ending position of the last incorrectly sized list.

However, I'm still confused on where the truncation of too long values happens. Where is that?

Copy link
Contributor

@tustvold tustvold Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't it replaces it with null or errors, why do you want it to truncate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mean truncate logically. I mean that if a section of values has list_size + 1 elements, won't that shift all the subsequent values to the right by 1? I would think to solve this we need to make sure truncate / shift values over to the left as needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a section of values has list_size + 1 elements, we would add nulls, and then set last_pos to the end of that slice. I will add a commit explicitly testing this and some more comments

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I see how this works now. I misunderstood how MutableArrayData works, but re-reading the docs this makes sense now.

arrow-cast/src/cast.rs Outdated Show resolved Hide resolved
@wjones127
Copy link
Member Author

I can't approve my own PR, but this LGTM. Thanks for your help @tustvold. MutableArrayData is quite a nice abstraction!

@tustvold
Copy link
Contributor

MutableArrayData is quite a nice abstraction!

Until you look under the hood at least 😅, it is probably not how I would build such a construct today, but it is useful for sure.

@wjones127 wjones127 merged commit dc75a28 into apache:master Nov 17, 2023
26 checks passed
@wjones127 wjones127 deleted the cast-list-fsl branch November 17, 2023 18:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cast FixedSizeList from List
2 participants