Reproducible example

The actual issue is in the Rust code.

Issue description

Certain streaming operations result in small chunks; for example, the script in #11699 (comment) produces 2.5KB chunks. The overhead of writing Parquet with chunks that small makes writes very slow: roughly 4× slower than they would be with larger chunks of e.g. 4MB.
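A minimal sketch of the kind of comparison involved, assuming a made-up pipeline (the actual reproduction script is in #11699 (comment) and is not reproduced here; any pipeline whose streaming collect() leaves the result in many small chunks should show the same effect):

```python
import time

import polars as pl

# Hypothetical pipeline, not the script from #11699.
lf = pl.DataFrame({"x": range(10_000_000)}).lazy().with_columns(
    (pl.col("x") * 2).alias("y")
)

for streaming in (False, True):
    df = lf.collect(streaming=streaming)
    start = time.time()
    df.write_parquet(f"/tmp/out_streaming_{streaming}.parquet")
    elapsed = time.time() - start
    # n_chunks() shows how fragmented the result is; the streaming result
    # typically arrives in many more (and much smaller) chunks.
    print(f"streaming={streaming} chunks={df.n_chunks()} write={elapsed:.2f}s")
```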
Expected behavior
collect().write_parquet() and collect(streaming=True).write_parquet() should take a similar amount of time to write the Parquet file.
I am going to try to fix this. The originally proposed fix merged chunks into larger ~4MB batches inside OrderedSink, which @ritchie46 didn't like, so instead the merging will be done in write_parquet(). A rough illustration of the idea follows below.
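A Python-level sketch of the merging idea, with hypothetical names (TARGET_BATCH_SIZE, merged_batches); the real fix lives in the Rust write_parquet path, and this is only an approximation of the strategy, not the actual implementation:

```python
import polars as pl

TARGET_BATCH_SIZE = 4 * 1024 * 1024  # ~4MB, the chunk size suggested above


def merged_batches(df: pl.DataFrame):
    """Yield contiguous slices of roughly TARGET_BATCH_SIZE bytes each."""
    if df.height == 0:
        yield df
        return
    # Estimate how many rows fit in ~4MB from the frame's average row size.
    bytes_per_row = max(1, df.estimated_size() // df.height)
    rows_per_batch = max(1, TARGET_BATCH_SIZE // bytes_per_row)
    for offset in range(0, df.height, rows_per_batch):
        # slice() is cheap; rechunk() makes the slice contiguous so the
        # writer sees one large chunk instead of many tiny ones.
        yield df.slice(offset, rows_per_batch).rechunk()
```

The point of coalescing before writing is that the Parquet writer's per-chunk overhead is then amortized over ~4MB of data rather than paid once per 2.5KB chunk.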
itamarst pushed a commit to itamarst/polars that referenced this issue on Feb 14, 2024:
This is a follow-up to #11699, covering the case that was originally in #14346 but was removed from that PR before it was merged.
Git commit

126ccc1b65e60f10663c490da83e60f78eec5541