Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add write_bytes for GenericBinaryBuilder #6652

Merged
merged 7 commits into from
Oct 30, 2024

Conversation

tisonkun
Copy link
Member

Similar to impl Write for GenericStringBuilder.

We found this helps in reducing one allocation when append bytes in many calls (compared with gather those bytes into a Vec first).

Not sure if there is some trait suitable for this method but just add it first.

cc @alamb @waynexia

The doc tests cover the basic usage.

Verified

This commit was signed with the committer’s verified signature.
tisonkun tison
Signed-off-by: tison <wander4096@gmail.com>
@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 30, 2024

Verified

This commit was signed with the committer’s verified signature.
tisonkun tison
Signed-off-by: tison <wander4096@gmail.com>
@tisonkun
Copy link
Member Author

Try impl io::Write interface. No sure if a good fit.

Verified

This commit was signed with the committer’s verified signature.
tisonkun tison
Signed-off-by: tison <wander4096@gmail.com>
@tisonkun tisonkun force-pushed the GenericBinaryBuilder-write_bytes branch from 1f5c637 to 30449a9 Compare October 30, 2024 13:55
tests

Verified

This commit was signed with the committer’s verified signature.
tisonkun tison
Signed-off-by: tison <wander4096@gmail.com>
fmt

Verified

This commit was signed with the committer’s verified signature.
tisonkun tison
Signed-off-by: tison <wander4096@gmail.com>

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Co-authored-by: Andy Lok <andylokandy@hotmail.com>
@tisonkun
Copy link
Member Author

@andylokandy Thanks for your review! Addressed.

@waynexia
Copy link
Member

From the API level: this looks kind of misleading to me. The proposed Write::write or write_bytes makes BinaryBuilder closer to an opaque byte container, shadowing the concept of item and offset. I.e., array built from this API would contain only one element. Plus we don't have an API to "cut" the bytes buffer, but need to call builder.append_value(""); which looks a bit hack.

How about considering something like Vec::from_raw_parts, which accepts different buffers obviously and hence avoids the potential misunderstand?

@andylokandy
Copy link
Contributor

andylokandy commented Oct 30, 2024

I'm the member of databend, which heavily adopts arrow-rs and I'm sure it is a very common pattern to construct an item in a StringArrayBuilder/BinaryArrayBuilder by several parts of &str or &[u8].

For example, once this PR is accepted, to add a space padding before each item of BinaryArray and generate a new BinaryArray, we can do the following:

for x in array.iter() {
    write!(&mut builder, b" ").unwrap();
    write!(&mut builder, x).unwrap();
    builder.append_value("");
}

Moreover, StringBuilder has already supported this pattern. We use it in databend smoothly and we hope BinaryBuilder can have the same ability.

@waynexia
Copy link
Member

I'm the member of databend, which heavily adopts arrow-rs and I'm sure it is a very common pattern to construct an item in a StringArrayBuilder/BinaryArrayBuilder by several parts of &str or &[u8].

For example, to add a space padding before each item of BinaryArray and generate a new BinaryArray, we can do the following:

for x in array.iter() {
    write!(&mut builder, b" ").unwrap();
    write!(&mut builder, x).unwrap();
    builder.append_value("");
}

Moreover, StringBuilder has already supported this pattern. We use it in databend smoothly and we hope BinaryBuilder can have the same ability.

Thanks for your insight, this case makes sense to me 🙌 BTW, do you think it's necessary to encapsulate this pattern builder.append_value(""); into a dedicated method?

@andylokandy
Copy link
Contributor

andylokandy commented Oct 30, 2024

I'll suggest builder.commit_row();. What do you think?

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me and is inline with what we support for StringBuilder already. I do think we shouldn't make the change to MutableBuffer though

arrow-buffer/src/buffer/mutable.rs Outdated Show resolved Hide resolved

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
@tustvold tustvold merged commit 5dd2b47 into apache:master Oct 30, 2024
26 checks passed
@tisonkun tisonkun deleted the GenericBinaryBuilder-write_bytes branch October 31, 2024 00:24
@alamb
Copy link
Contributor

alamb commented Oct 31, 2024

Thank you @tisonkun and @tustvold

adbmal pushed a commit to adbmal/arrow-rs that referenced this pull request Nov 2, 2024
* feat: add write_bytes for GenericBinaryBuilder

Signed-off-by: tison <wander4096@gmail.com>

* one more improve

Signed-off-by: tison <wander4096@gmail.com>

* try io::Write trait

Signed-off-by: tison <wander4096@gmail.com>

* tests

Signed-off-by: tison <wander4096@gmail.com>

* fmt

Signed-off-by: tison <wander4096@gmail.com>

* Apply suggestions from code review

Co-authored-by: Andy Lok <andylokandy@hotmail.com>

* Update mutable.rs

---------

Signed-off-by: tison <wander4096@gmail.com>
Co-authored-by: Andy Lok <andylokandy@hotmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants