perf: Use a compression API that is designed for this use case (#11699) #14194
Conversation
Nice find. :)
Curious what the rest is, as I didn't see any difference on my machine initially. What is your hardware?
@@ -95,13 +95,19 @@ pub fn compress(
    )),
    #[cfg(feature = "zstd")]
    CompressionOptions::Zstd(level) => {
Can we add a setting to the writer? E.g. the streaming engine uses the bulk compression, and the default writer uses the old style?
It's not clear to me if that'd work: the Polars-level compression API (the compress() function) is not designed for streaming, it has no state...
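To illustrate the distinction being discussed, here is a minimal Python sketch using the stdlib zlib module as a stand-in for zstd (the bulk-vs-streaming split is analogous; this is not the Polars code). A stateless one-shot API produces independent compressed blobs, while a streaming compressor carries state across calls:

```python
import zlib

# One-shot "bulk" compression: no state survives between calls, so each
# call produces a self-contained compressed blob.
frame_a = zlib.compress(b"chunk one", 3)
frame_b = zlib.compress(b"chunk two", 3)

# Streaming compression: a compressor object carries state across calls
# and must be flushed once at the end to finish the stream.
c = zlib.compressobj(3)
stream = c.compress(b"chunk one") + c.compress(b"chunk two") + c.flush()

# Bulk blobs decompress independently; the stream decompresses as one unit.
assert zlib.decompress(frame_a) == b"chunk one"
assert zlib.decompress(frame_b) == b"chunk two"
assert zlib.decompress(stream) == b"chunk onechunk two"
```

Because the one-shot path has no carry-over state, each call pays setup cost but needs no coordination between chunks, which is why it fits the per-chunk compress() entry point.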
I don't follow. I understand that this new snippet is faster in the streaming engine, but slower in the default (single batch) case.
So I assume we can set a flag on the writer and use that to branch here.
Ah, I misunderstood what you meant. I could do that, or maybe use a size-based heuristic. In that case I would initially log the size of the data, which might also be informative for the batch-size hypothesis.
sink_parquet() ends up compressing chunks of sizes 1K to 27K. write_parquet() mostly writes chunks of sizes 130K to 680K.
So there's a clear divide here which plausibly explains the performance differences: accumulating lots of small operations adds overhead in ways that are both easy to observe (this PR) and difficult to observe (e.g. worse branch prediction).
For the remaining #11699 slowness, investigating batch sizes further seems like a good idea. For this PR I'll see if a size-based heuristic speeds things up.
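A size-based heuristic like the one proposed could be sketched as follows (Python with stdlib zlib standing in for zstd; the 64K cutoff and the compress_chunk name are illustrative inventions, not from the PR):

```python
import zlib

# Illustrative threshold: observed chunk sizes were 1K-27K for sink_parquet()
# and 130K-680K for write_parquet(), so any cutoff between those ranges would
# separate the two paths. 64K is a made-up value, not from the PR.
BULK_THRESHOLD = 64 * 1024

def compress_chunk(data: bytes, level: int = 3) -> bytes:
    """Choose a compression path by chunk size (zlib stands in for zstd)."""
    if len(data) <= BULK_THRESHOLD:
        # Small chunk: a single one-shot call avoids building up and
        # tearing down streaming state for just a few kilobytes.
        return zlib.compress(data, level)
    # Large chunk: keep a streaming-style object, as the pre-PR code did.
    c = zlib.compressobj(level)
    return c.compress(data) + c.flush()

small = b"a" * 1024
large = b"b" * (256 * 1024)
assert zlib.decompress(compress_chunk(small)) == small
assert zlib.decompress(compress_chunk(large)) == large
```

Both branches produce output that decompresses identically, so the heuristic only affects speed, not correctness.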
After doing more benchmarking runs, I think the difference in COLLECT+WRITE is just noise: runtime varies quite a lot in both main and this branch, with no discernible pattern about which is slower or faster. So I think this PR should be OK as is.
Alright. Thanks. It is a huge improvement.
I'm running on an i7-12700K, with 64GB RAM.
For the rest, the fact that compression internal state initialization showed up at all makes me wonder whether something somewhere is operating on too-small batches.
@ritchie46 back to you, I think the presumed performance difference in COLLECT+WRITE is just noise (see above).
My impression is that #11699 is due to multiple different performance issues. This PR fixes one of them, but does not fix the whole performance difference. See below for numbers.
Why this change
Perf showed the following:
The zstd docs suggest the Encoder has a 128KB internal buffer, and apparently some amount of time was needed just to clear it out (presumably over and over again). The solution was to switch to the bulk API, which doesn't initialize as much state on each call. (And in practice the original compression wasn't taking advantage of the theoretical ability to stream.)
(A similar change could in theory be made for decompression, but the API for estimating an upper bound on decompressed data is experimental, so it's unclear how much memory to allocate in the output buffer.)
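The cost described above, paying state setup on every call versus using a one-shot bulk call, can be demonstrated with a small stdlib-only micro-benchmark (Python zlib as a hedged analogy for zstd; constructing a fresh compressobj per chunk loosely mirrors a fresh Encoder per compress() call):

```python
import time
import zlib

# Many small chunks, similar in spirit to what the streaming sink produces.
chunks = [bytes([i % 256]) * 2048 for i in range(2000)]

# Path 1: construct fresh streaming state per chunk, loosely mirroring a
# new zstd Encoder (with its ~128KB buffer) being set up on every call.
t0 = time.perf_counter()
via_objects = []
for d in chunks:
    c = zlib.compressobj(3)
    via_objects.append(c.compress(d) + c.flush())
t_objects = time.perf_counter() - t0

# Path 2: one-shot calls, analogous to the bulk API.
t0 = time.perf_counter()
via_bulk = [zlib.compress(d, 3) for d in chunks]
t_bulk = time.perf_counter() - t0

# Both paths round-trip identically; only the timing differs.
assert all(zlib.decompress(a) == d for a, d in zip(via_objects, chunks))
assert all(zlib.decompress(b) == d for b, d in zip(via_bulk, chunks))
print(f"per-chunk objects: {t_objects:.4f}s, one-shot: {t_bulk:.4f}s")
```

The absolute timings depend on the machine and library, so no particular ratio is claimed here; the point is that per-call state setup is paid once per chunk on the first path and avoided on the second.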
Measuring performance
Test script:
Benchmarking is done with make build-opt.
Initial values:
After zstd tweak:
COLLECT+WRITE might be somewhat slower as a result of this.