
perf: Use a compression API that is designed for this use case (#11699) #14194

Merged
merged 4 commits into pola-rs:main on Feb 3, 2024

Conversation

@itamarst (Contributor) commented Feb 1, 2024

My impression is that #11699 is due to several distinct performance issues. This PR fixes one of them, but does not close the whole performance gap. See below for numbers.

Why this change

Perf showed the following:

-   51.46%     0.11%  python  [.] polars_parquet::parquet::compression::compress
   - 51.35% polars_parquet::parquet::compression::compress
      - 28.61% ZSTD_compressStream
         - ZSTD_compressStream2
            - 28.16% ZSTD_CCtx_init_compressStream2
               - 27.79% ZSTD_compressBegin_internal
                  - 27.75% ZSTD_resetCCtx_internal
                     - 27.42% ZSTD_reset_matchState
                          27.37% __memset_avx2_unaligned_erms

The zstd docs suggest the Encoder has a 128KB internal buffer, and apparently a significant amount of time was spent just clearing it out, presumably over and over again. The solution was to switch to the bulk API, which doesn't initialize as much state on each call. (In practice the original compression code wasn't taking advantage of the theoretical ability to stream anyway.)
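As a rough illustration (a minimal sketch, not the exact Polars change), here is how the two zstd crate APIs differ. The streaming Encoder rebuilds its compression context on every call, while the bulk API does a one-shot compression:

use std::io::Write;

// Old style: a fresh streaming Encoder per compress() call. Each Encoder
// allocates and zeroes its internal context (including the ~128KB buffer),
// which is what showed up as __memset_avx2_unaligned_erms in the profile.
fn compress_streaming(data: &[u8], level: i32) -> std::io::Result<Vec<u8>> {
    let mut encoder = zstd::stream::Encoder::new(Vec::new(), level)?;
    encoder.write_all(data)?;
    encoder.finish()
}

// New style: one-shot bulk compression, no streaming state to reset.
fn compress_bulk(data: &[u8], level: i32) -> std::io::Result<Vec<u8>> {
    zstd::bulk::compress(data, level)
}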

(A similar change could in principle be made for decompression, but zstd's API for estimating an upper bound on the decompressed size is experimental, so it's unclear how much memory to allocate for the output buffer.)
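A minimal sketch of why that matters: the zstd crate's bulk decompression entry point needs a capacity up front, so the caller must already know the decompressed size from somewhere else (the parameter here is an assumed input for illustration):

// zstd::bulk::decompress needs an upper bound on the output size; without
// a stable ZSTD_decompressBound-style API, the caller must already know it
// (e.g. from a Parquet page header).
fn decompress_bulk(data: &[u8], max_decompressed_size: usize) -> std::io::Result<Vec<u8>> {
    zstd::bulk::decompress(data, max_decompressed_size)
}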

Measuring performance

Test script:

from datetime import datetime

import polars as pl

start_datetime = datetime(2020, 1, 1)
end_datetime = datetime(2024, 1, 1)
N_ids = 10
interval = "1m"
df = (
    pl.LazyFrame(
        {
            "time": pl.datetime_range(
                start_datetime, end_datetime, interval=interval, eager=True
            )
        }
    )
    .join(
        pl.LazyFrame(
            {"id": [f"id{i:05d}" for i in range(N_ids)], "value": list(range(N_ids))}
        ),
        how="cross",
    )
    .select("time", "id", "value")
)

from time import time

start = time()
df.sink_parquet("/tmp/out.parquet")
print("SINK", time() - start)

start = time()
df.collect().write_parquet("/tmp/out2.parquet")
print("COLLECT+WRITE", time() - start)

Benchmarking is done with make build-opt.

Initial values:

SINK 3.4813475608825684
COLLECT+WRITE 0.958869218826294

After zstd tweak:

SINK 2.9080893993377686
COLLECT+WRITE 0.9803664684295654

COLLECT+WRITE might be somewhat slower as a result of this.

@ritchie46 (Member) left a comment

Nice find. :)

Curious what the rest is, as I didn't see any difference on my machine initially. What is your hardware?

@@ -95,13 +95,19 @@ pub fn compress(
)),
#[cfg(feature = "zstd")]
CompressionOptions::Zstd(level) => {
@ritchie46 (Member):

Can we add a setting to the writer? E.g. the streaming engine uses the bulk compression and the default writer uses the old style.

@itamarst (Contributor, Author):

It's not clear to me that that would work; the Polars-level compression API (compress()) is not designed for streaming, as it has no state.

@ritchie46 (Member):

I don't follow. I understand that this new snippet is faster in the streaming engine, but slower in the default (single batch) case.

So I assume we can set a flag on the writer and use that to branch here.

@itamarst (Contributor, Author):

Ah, I misunderstood what you meant. I could do that, or maybe use a size-based heuristic. In that case I would first log the sizes of the data being compressed, which might also be informative for the batch-size hypothesis.

@itamarst (Contributor, Author):

sink_parquet() ends up compressing chunks of 1K to 27K.

write_parquet() mostly writes chunks of 130K to 680K.

So there's a clear divide here that plausibly explains the performance difference: lots of small operations add overhead in ways that are both easy to observe (this PR) and difficult to observe (e.g. worse branch prediction).

For the remaining #11699 slowness, investigating batch sizes further seems like a good idea. For this PR I'll see if a size-based heuristic, sketched below, speeds things up.
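A hypothetical version of that heuristic (the threshold here is an assumption for illustration, not what was ultimately merged) might look like:

use std::io::Write;

// Assumed cutoff: below this, the Encoder's per-call context setup
// dominates, so use the bulk one-shot API instead.
const BULK_THRESHOLD: usize = 128 * 1024;

fn compress_zstd(data: &[u8], level: i32) -> std::io::Result<Vec<u8>> {
    if data.len() < BULK_THRESHOLD {
        // Small page: skip the streaming context entirely.
        zstd::bulk::compress(data, level)
    } else {
        // Large page: the context setup cost is amortized.
        let mut encoder = zstd::stream::Encoder::new(Vec::new(), level)?;
        encoder.write_all(data)?;
        encoder.finish()
    }
}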

@itamarst (Contributor, Author):

After doing more benchmarking runs, I think the difference in COLLECT+WRITE is just noise: runtime varies quite a lot on both main and this branch, with no discernible pattern as to which is faster. So I think this PR should be OK as is.

@ritchie46 (Member):

Alright. Thanks. It is a huge improvement.

@itamarst (Contributor, Author) commented Feb 1, 2024

I'm running on an i7-12700K with 64GB RAM.

@itamarst (Contributor, Author) commented Feb 1, 2024

For the rest, the fact that compression internal-state initialization showed up at all makes me wonder if something somewhere is operating on too-small batches.

@ritchie46 (Member) commented:

> For the rest, the fact that compression internal-state initialization showed up at all makes me wonder if something somewhere is operating on too-small batches.

There are ENV vars that influence the batch size. Those can be used to test that hypothesis. 👀

@itamarst (Contributor, Author) commented Feb 2, 2024

> There are ENV vars that influence the batch size. Those can be used to test that hypothesis. 👀

POLARS_STREAMING_CHUNK_SIZE doesn't seem to have any influence on sink_parquet(), at least for this particular example. Is there another env variable?

@itamarst (Contributor, Author) commented Feb 2, 2024

@ritchie46 back to you; I think the presumed performance difference in COLLECT+WRITE was just noise.

@ritchie46 merged commit 76bc336 into pola-rs:main on Feb 3, 2024. 17 checks passed.
Labels: performance (Performance issues or improvements), python (Related to Python Polars), rust (Related to Rust Polars)