perf: Use a compression API that is designed for this use case (#11699) #14194
Conversation
Nice find. :)
Curious what the rest is, as I didn't see any difference on my machine initially. What is your hardware?
@@ -95,13 +95,19 @@ pub fn compress(
    )),
    #[cfg(feature = "zstd")]
    CompressionOptions::Zstd(level) => {
Can we add a setting to the writer? E.g. the streaming engine uses the bulk compression, and the default writer uses the old style?
It's not clear to me if that'd work: the Polars-level compression API (the compress() function) is not designed for streaming, it has no state...
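To illustrate the distinction being discussed, here is a minimal Python sketch using the stdlib zlib module as a stand-in for zstd (the bulk-vs-streaming split is analogous; this is not the Polars code). A stateless one-shot API produces independent compressed blobs, while a streaming compressor carries state across calls:

```python
import zlib

# One-shot "bulk" compression: no state survives between calls, so each
# call produces a self-contained compressed blob.
frame_a = zlib.compress(b"chunk one", 3)
frame_b = zlib.compress(b"chunk two", 3)

# Streaming compression: a compressor object carries state across calls
# and must be flushed once at the end to finish the stream.
c = zlib.compressobj(3)
stream = c.compress(b"chunk one") + c.compress(b"chunk two") + c.flush()

# Bulk blobs decompress independently; the stream decompresses as one unit.
assert zlib.decompress(frame_a) == b"chunk one"
assert zlib.decompress(frame_b) == b"chunk two"
assert zlib.decompress(stream) == b"chunk onechunk two"
```

Because the one-shot path has no carry-over state, each call pays setup cost but needs no coordination between chunks, which is why it fits the per-chunk compress() entry point.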
I don't follow. I understand that this new snippet is faster in the streaming engine, but slower in the default (single batch) case.
So I assume we can set a flag on the writer and use that to branch here.
Ah, I misunderstood what you meant. I could do that, or maybe use a size-based heuristic. In that case I would initially log the size of the data, which might also be informative for the batch-size hypothesis.
sink_parquet() ends up compressing chunks of sizes 1K to 27K. write_parquet() mostly writes chunks of sizes 130K to 680K.
So there's a clear divide here which plausibly explains the performance differences: accumulating lots of small operations adds overhead in ways that are both easy to observe (this PR) and difficult to observe (e.g. worse branch prediction).
For the remaining #11699 slowness, investigating batch sizes further seems like a good idea. For this PR I'll see if a size-based heuristic speeds things up.
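A size-based heuristic like the one proposed could be sketched as follows (Python with stdlib zlib standing in for zstd; the 64K cutoff and the compress_chunk name are illustrative inventions, not from the PR):

```python
import zlib

# Illustrative threshold: observed chunk sizes were 1K-27K for sink_parquet()
# and 130K-680K for write_parquet(), so any cutoff between those ranges would
# separate the two paths. 64K is a made-up value, not from the PR.
BULK_THRESHOLD = 64 * 1024

def compress_chunk(data: bytes, level: int = 3) -> bytes:
    """Choose a compression path by chunk size (zlib stands in for zstd)."""
    if len(data) <= BULK_THRESHOLD:
        # Small chunk: a single one-shot call avoids building up and
        # tearing down streaming state for just a few kilobytes.
        return zlib.compress(data, level)
    # Large chunk: keep a streaming-style object, as the pre-PR code did.
    c = zlib.compressobj(level)
    return c.compress(data) + c.flush()

small = b"a" * 1024
large = b"b" * (256 * 1024)
assert zlib.decompress(compress_chunk(small)) == small
assert zlib.decompress(compress_chunk(large)) == large
```

Both branches produce output that decompresses identically, so the heuristic only affects speed, not correctness.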
After doing more benchmarking runs, I think the difference in COLLECT+WRITE is just noise: runtime varies quite a lot in both main and this branch, with no discernible pattern about which is slower or faster. So I think this PR should be OK as is.
Alright. Thanks. It is a huge improvement.
I'm running on an i7-12700K, with 64GB RAM.
For the rest, the fact that compression internal state initialization showed up at all makes me wonder whether something somewhere is operating on too-small batches.
@ritchie46 back to you, I think the presumed performance difference in COLLECT+WRITE is just noise (see above).
My impression is that #11699 is due to multiple different performance issues. This PR fixes one of them, but does not fix the whole performance difference. See below for numbers.
Why this change
Perf showed the following:
The zstd docs suggest the Encoder has a 128KB internal buffer, and apparently some amount of time was needed just to clear it out (presumably over and over again). The solution was to switch to the bulk API, which doesn't initialize as much state on each call. (And in practice the original compression wasn't taking advantage of the theoretical ability to stream.)
(A similar change could in theory be made for decompression, but the API for estimating an upper bound on decompressed data is experimental, so it's unclear how much memory to allocate in the output buffer.)
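The cost described above, paying state setup on every call versus using a one-shot bulk call, can be demonstrated with a small stdlib-only micro-benchmark (Python zlib as a hedged analogy for zstd; constructing a fresh compressobj per chunk loosely mirrors a fresh Encoder per compress() call):

```python
import time
import zlib

# Many small chunks, similar in spirit to what the streaming sink produces.
chunks = [bytes([i % 256]) * 2048 for i in range(2000)]

# Path 1: construct fresh streaming state per chunk, loosely mirroring a
# new zstd Encoder (with its ~128KB buffer) being set up on every call.
t0 = time.perf_counter()
via_objects = []
for d in chunks:
    c = zlib.compressobj(3)
    via_objects.append(c.compress(d) + c.flush())
t_objects = time.perf_counter() - t0

# Path 2: one-shot calls, analogous to the bulk API.
t0 = time.perf_counter()
via_bulk = [zlib.compress(d, 3) for d in chunks]
t_bulk = time.perf_counter() - t0

# Both paths round-trip identically; only the timing differs.
assert all(zlib.decompress(a) == d for a, d in zip(via_objects, chunks))
assert all(zlib.decompress(b) == d for b, d in zip(via_bulk, chunks))
print(f"per-chunk objects: {t_objects:.4f}s, one-shot: {t_bulk:.4f}s")
```

The absolute timings depend on the machine and library, so no particular ratio is claimed here; the point is that per-call state setup is paid once per chunk on the first path and avoided on the second.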
Measuring performance
Test script:
Benchmarking is done with make build-opt.
Initial values:
After zstd tweak:
COLLECT+WRITE might be somewhat slower as a result of this.