[Java][C] sliced RecordBatch offset info is lost when imported from c-data #41682

hellishfire · 2024-05-16T02:50:57Z

Describe the bug, including details regarding any error messages, version, and platform.

Reproduced on latest arrow release (16.0)

When importing a sliced RecordBatch from c to java

On c side:

auto sliced_record_batch = original_record_batch->Slice(/*offset=*/8, /*length=*/2);
arrow::ExportRecordBatch(sliced_record_batch, arrow_array_ptr);

On java side:

ArrowArray arrowArray = ArrowArray.allocateNew(allocator);
Data.importIntoVectorSchemaRoot(allocator, arrowArray, vectorSchemaRoot, null);

The imported vectorSchemaRoot maintains the correct length(which is 2), but the offset info (which is 8) is not respected, hence the content of the imported vectorSchemaRoot points to the first 2 rows of the original_record_batch, while the desired content is sliced_record_batch.

I'm not familiar with arrow code, but it seems that the offset info is actually present in org.apache.arrow.c.ArrowArray.Snapshot, but org.apache.arrow.c.ArrayImporter ignores the offset in org.apache.arrow.c.ArrayImporter.doImport(ArrowArray.Snapshot)

Component(s)

Java

The text was updated successfully, but these errors were encountered:

lidavidm · 2024-05-16T02:56:07Z

Java doesn't really have a concept of slicing, so either this needs to copy the data (at least for things that can't easily be sliced, like the validity buffer) or it should error on import.

hellishfire · 2024-05-16T03:07:39Z

Java doesn't really have a concept of slicing, so either this needs to copy the data (at least for things that can't easily be sliced, like the validity buffer) or it should error on import.

I took a quick look at the slice() method of VectorSchemaRoot, which slices the data buffer, and slice or copy the validity buffer

lidavidm · 2024-05-16T03:14:17Z

Yes, but that's different in intent than the C++ version (which just tracks an offset). Here it means we can't do a 0 copy import.

hellishfire · 2024-05-16T03:21:21Z

Yes, but that's different in intent than the C++ version (which just tracks an offset). Here it means we can't do a 0 copy import.

I understand that, but IMO data correctness comes before performance, so a copy of validity buffer is tolerable in this case?

lidavidm · 2024-05-16T03:31:44Z

CC @vibhatha

We also have to traverse recursively since child arrays can have their own offset

vibhatha · 2024-05-16T04:05:54Z

I will try to reproduce this and evaluate the approach suggested by @lidavidm. I will probably need some time to try this out.

hellishfire added the Type: bug label May 16, 2024

github-actions bot added the Component: Java label May 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Java][C] sliced RecordBatch offset info is lost when imported from c-data #41682

[Java][C] sliced RecordBatch offset info is lost when imported from c-data #41682

hellishfire commented May 16, 2024 •

edited

lidavidm commented May 16, 2024

hellishfire commented May 16, 2024 •

edited

lidavidm commented May 16, 2024

hellishfire commented May 16, 2024

lidavidm commented May 16, 2024

vibhatha commented May 16, 2024

[Java][C] sliced RecordBatch offset info is lost when imported from c-data #41682

[Java][C] sliced RecordBatch offset info is lost when imported from c-data #41682

Comments

hellishfire commented May 16, 2024 • edited

Describe the bug, including details regarding any error messages, version, and platform.

Component(s)

lidavidm commented May 16, 2024

hellishfire commented May 16, 2024 • edited

lidavidm commented May 16, 2024

hellishfire commented May 16, 2024

lidavidm commented May 16, 2024

vibhatha commented May 16, 2024

hellishfire commented May 16, 2024 •

edited

hellishfire commented May 16, 2024 •

edited