
GH-36765: [Python][Dataset] Change default of pre_buffer to True for reading Parquet files #37854

Merged

Conversation


@jorisvandenbossche jorisvandenbossche commented Sep 25, 2023

Rationale for this change

Based on the benchmarks in the issue, enabling pre_buffer can give a significant speed-up on filesystems like S3 while causing no noticeable slowdown on local filesystems. Simply enabling it by default therefore seems the best choice.

The option was already enabled by default in the pyarrow.parquet.read_table interface; this PR aligns the defaults when using pyarrow.dataset directly.

@jorisvandenbossche jorisvandenbossche changed the title [Python][Dataset] Change default of pre_buffer to True for reading Parquet files GH-36765: [Python][Dataset] Change default of pre_buffer to True for reading Parquet files Sep 25, 2023
@apache apache deleted a comment from github-actions bot Sep 25, 2023
@github-actions

⚠️ GitHub issue #36765 has been automatically assigned in GitHub to PR creator.

@jorisvandenbossche (Member, Author)

Questions:

  • Do I change the default in C++ as well?
  • Should we also change the default cache_options to LazyDefaults when pre_buffer is enabled?


kou commented Sep 25, 2023

  • Do I change the default in C++ as well?

Based on the #28218 results, I think so.


mapleFU commented Sep 26, 2023

When the scanner has multiple scan threads and a prefetch depth (i.e. a prefetched fragment count), could this cause huge memory consumption?


mapleFU commented Sep 26, 2023

Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_groups,
                                            const std::vector<int>& column_indices,
                                            std::unique_ptr<RecordBatchReader>* out) {
  RETURN_NOT_OK(BoundsCheck(row_groups, column_indices));

  if (reader_properties_.pre_buffer()) {
    // PARQUET-1698/PARQUET-1820: pre-buffer row groups/column chunks if enabled
    BEGIN_PARQUET_CATCH_EXCEPTIONS
    reader_->PreBuffer(row_groups, column_indices, reader_properties_.io_context(),
                       reader_properties_.cache_options());
    END_PARQUET_CATCH_EXCEPTIONS
  }

Here, PreBuffer will buffer the requested row groups, and that memory is not released until the read is finished. This is different from buffered-stream mode (which can actually decrease memory usage, lol).

Even when the policy is lazy, the reader might not get faster if a row group is large enough, and the memory is still not released before the read finishes. So I wonder if this is OK.
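To make the concern concrete: pre-buffered row groups stay resident until their read completes, so the high-water mark scales with both row-group size and the number of in-flight fragments. A back-of-the-envelope sketch (all numbers hypothetical):

```python
MIB = 1 << 20

def prebuffer_high_watermark(row_group_bytes: int, inflight_fragments: int) -> int:
    """Rough upper bound on memory held by pre-buffered raw Parquet data:
    each in-flight fragment keeps its buffered row group alive until its
    read finishes."""
    return row_group_bytes * inflight_fragments

# e.g. 512 MiB row groups with a fragment readahead of 4:
print(prebuffer_high_watermark(512 * MIB, 4) // MIB)  # 2048 (MiB)
```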

@jorisvandenbossche (Member, Author)

@mapleFU could you post those questions and remarks on the issue? (That might reach a wider audience who can answer them, as many people have already commented there.)


mapleFU commented Sep 26, 2023

Done. I think enabling this is OK; maybe we should document how to handle high memory usage in the docs.

@@ -666,7 +666,7 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
         Disabled by default.
     buffer_size : int, default 8192
         Size of buffered stream, if enabled. Default is 8KB.
-    pre_buffer : bool, default False
+    pre_buffer : bool, default True
         If enabled, pre-buffer the raw Parquet data instead of issuing one
         read per column chunk. This can improve performance on high-latency
         filesystems.
Member

Possibly we should improve the docstring to also mention that you should disable this if you are concerned with memory usage over throughput? (Also, possibly make it clear that "high-latency filesystems" is likely to mean object stores like S3, GCS, etc.)

Member Author

Yep, will do

@github-actions github-actions bot added awaiting merge Awaiting merge awaiting review Awaiting review and removed awaiting committer review Awaiting committer review labels Oct 5, 2023
@jorisvandenbossche (Member, Author)

@lidavidm do you have any thought on the question in #37854 (comment) ?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review awaiting merge Awaiting merge labels Oct 5, 2023

lidavidm commented Oct 5, 2023

Oops, I missed that.

  • C++: let's be consistent.
  • I think LazyDefaults is a good default. I'm not sure it'll make a big difference.
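The behavioural difference between the two cache policies can be modelled in a few lines of Python (a simplified sketch, not Arrow's actual `ReadRangeCache` implementation):

```python
class EagerRangeCache:
    """Models CacheOptions::Defaults(): every cached range is read
    immediately when cache() is called."""
    def __init__(self, read_fn):
        self.read_fn = read_fn
        self.data = {}

    def cache(self, ranges):
        for r in ranges:
            self.data[r] = self.read_fn(r)  # I/O happens up front

    def read(self, r):
        return self.data[r]


class LazyRangeCache:
    """Models CacheOptions::LazyDefaults(): ranges are only recorded at
    cache() time and fetched on first access."""
    def __init__(self, read_fn):
        self.read_fn = read_fn
        self.pending = set()
        self.data = {}

    def cache(self, ranges):
        self.pending.update(ranges)  # no I/O yet

    def read(self, r):
        if r not in self.data:
            self.pending.discard(r)
            self.data[r] = self.read_fn(r)  # I/O deferred to first use
        return self.data[r]
```

With the lazy policy, ranges that are never accessed (e.g. row groups pruned by a filter) are never fetched, which is why it pairs naturally with pre-buffering.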

@github-actions github-actions bot added Component: C++ awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 5, 2023
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Oct 5, 2023
@jorisvandenbossche jorisvandenbossche merged commit d7017dd into apache:main Oct 6, 2023
19 of 32 checks passed
@jorisvandenbossche jorisvandenbossche removed the awaiting merge Awaiting merge label Oct 6, 2023
@jorisvandenbossche jorisvandenbossche deleted the gh-36765-parquet-pre-buffer branch October 6, 2023 07:26
ArrowReaderProperties arrow_properties = default_arrow_reader_properties();
arrow_properties.set_pre_buffer(true);
arrow_properties.set_pre_buffer(false);
arrow_properties.set_cache_options(::arrow::io::CacheOptions::Defaults())
Member

This might cause the compile to fail; I submitted a fix: #38069

@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Oct 6, 2023
@jorisvandenbossche
Copy link
Member Author

Oh, sorry for completely forgetting to check all the failed builds .. 🤦

jorisvandenbossche pushed a commit that referenced this pull request Oct 6, 2023
….cc` compile (#38069)

### Rationale for this change

The compile failure was introduced in a previous patch (#37854); this patch fixes it.

### What changes are included in this PR?

A one-line fix.

### Are these changes tested?

no

### Are there any user-facing changes?

no

* Closes: #38068

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche
Copy link
Member Author

So this PR introduced a failure in the "AMD64 Ubuntu 22.04 C++ ASAN UBSAN" build (https://github.com/apache/arrow/actions/runs/6430392691/job/17462667620?pr=38069#logs), related to the LazyCache coalesced reads. See details below.

I assume this is an existing bug, given this PR only changed a default for an option a user could already set before as well. But changing the default of course makes it more visible.

A potential short-term option is to change only pre_buffer and keep the current non-lazy default cache_options (if that fixes it). Or revert the PR entirely until this is resolved (I don't have time today to look into it in more detail).

2023-10-06T10:40:14.0622194Z Running: /arrow/testing/data/parquet/fuzzing/clusterfuzz-testcase-minimized-parquet-arrow-fuzz-5640198106120192
2023-10-06T10:40:14.0651320Z /arrow/cpp/src/arrow/io/interfaces.cc:457:  Check failed: (left.offset + left.length) <= (right.offset) Some read ranges overlap
2023-10-06T10:40:14.0661169Z /build/cpp/debug/parquet-arrow-fuzz(backtrace+0x5b)[0x55893309d6bb]
2023-10-06T10:40:14.0678721Z /usr/local/lib/libarrow.so.1400(_ZN5arrow4util7CerrLog14PrintBackTraceEv+0x1a5)[0x7fd67d9f5405]
2023-10-06T10:40:14.0694280Z /usr/local/lib/libarrow.so.1400(_ZN5arrow4util7CerrLogD2Ev+0x1f7)[0x7fd67d9f5177]
2023-10-06T10:40:14.0708313Z /usr/local/lib/libarrow.so.1400(_ZN5arrow4util7CerrLogD0Ev+0x61)[0x7fd67d9f5251]
2023-10-06T10:40:14.0722939Z /usr/local/lib/libarrow.so.1400(_ZN5arrow4util8ArrowLogD1Ev+0x1d0)[0x7fd67d9f4d80]
2023-10-06T10:40:14.0733586Z /usr/local/lib/libarrow.so.1400(+0xb13f151)[0x7fd67d3cc151]
2023-10-06T10:40:14.0746700Z /usr/local/lib/libarrow.so.1400(_ZN5arrow2io8internal18CoalesceReadRangesESt6vectorINS0_9ReadRangeESaIS3_EEll+0x4c1)[0x7fd67d3cac81]
2023-10-06T10:40:14.0762388Z /usr/local/lib/libarrow.so.1400(_ZN5arrow2io8internal14ReadRangeCache4Impl5CacheESt6vectorINS0_9ReadRangeESaIS5_EE+0x456)[0x7fd67d2c3be6]
2023-10-06T10:40:14.0775666Z /usr/local/lib/libarrow.so.1400(_ZN5arrow2io8internal14ReadRangeCache8LazyImpl5CacheESt6vectorINS0_9ReadRangeESaIS5_EE+0x24a)[0x7fd67d2c1cca]
2023-10-06T10:40:14.0790164Z /usr/local/lib/libarrow.so.1400(_ZN5arrow2io8internal14ReadRangeCache5CacheESt6vectorINS0_9ReadRangeESaIS4_EE+0x2a2)[0x7fd67d2bfec2]
2023-10-06T10:40:14.0795950Z /usr/local/lib/libparquet.so.1400(_ZN7parquet14SerializedFile9PreBufferERKSt6vectorIiSaIiEES5_RKN5arrow2io9IOContextERKNS7_12CacheOptionsE+0x1696)[0x7fd69120ef96]
2023-10-06T10:40:14.0801581Z /usr/local/lib/libparquet.so.1400(_ZN7parquet17ParquetFileReader9PreBufferERKSt6vectorIiSaIiEES5_RKN5arrow2io9IOContextERKNS7_12CacheOptionsE+0x360)[0x7fd69120d7c0]
2023-10-06T10:40:14.0808329Z /usr/local/lib/libparquet.so.1400(+0x15435e5)[0x7fd6904885e5]
2023-10-06T10:40:14.0808759Z /usr/local/lib/libparquet.so.1400(+0x1542728)[0x7fd690487728]
2023-10-06T10:40:14.0815343Z /usr/local/lib/libparquet.so.1400(+0x1542c7c)[0x7fd690487c7c]
2023-10-06T10:40:14.0816050Z /usr/local/lib/libparquet.so.1400(_ZN7parquet5arrow8internal10FuzzReaderESt10unique_ptrINS0_10FileReaderESt14default_deleteIS3_EE+0x3e2)[0x7fd69046cdf2]
2023-10-06T10:40:14.0822733Z ==14349== ERROR: libFuzzer: deadly signal
2023-10-06T10:40:14.0823311Z /usr/local/lib/libparquet.so.1400(_ZN7parquet5arrow8internal10FuzzReaderEPKhl+0x1130)[0x7fd69046e950]
2023-10-06T10:40:14.0824114Z /build/cpp/debug/parquet-arrow-fuzz(+0x118e98)[0x558933121e98]
2023-10-06T10:40:14.0825448Z /build/cpp/debug/parquet-arrow-fuzz(+0x3f354)[0x558933048354]
2023-10-06T10:40:14.0826059Z /build/cpp/debug/parquet-arrow-fuzz(+0x290d0)[0x5589330320d0]
2023-10-06T10:40:14.0826543Z /build/cpp/debug/parquet-arrow-fuzz(+0x2ee27)[0x558933037e27]
2023-10-06T10:40:14.0827941Z /build/cpp/debug/parquet-arrow-fuzz(+0x58c43)[0x558933061c43]
2023-10-06T10:40:14.0828405Z /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fd6713bfd90]
2023-10-06T10:40:14.0828882Z /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fd6713bfe40]
2023-10-06T10:40:14.0829351Z /build/cpp/debug/parquet-arrow-fuzz(+0x23995)[0x55893302c995]
2023-10-06T10:40:15.2094786Z     #0 0x5589330eeab1 in __sanitizer_print_stack_trace (/build/cpp/debug/parquet-arrow-fuzz+0xe5ab1) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2096115Z     #1 0x558933061348 in fuzzer::PrintStackTrace() (/build/cpp/debug/parquet-arrow-fuzz+0x58348) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2097546Z     #2 0x558933046dc3 in fuzzer::Fuzzer::CrashCallback() (/build/cpp/debug/parquet-arrow-fuzz+0x3ddc3) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2098544Z     #3 0x7fd6713d851f  (/lib/x86_64-linux-gnu/libc.so.6+0x4251f) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2099481Z     #4 0x7fd67142ca7b in pthread_kill (/lib/x86_64-linux-gnu/libc.so.6+0x96a7b) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2101878Z     #5 0x7fd6713d8475 in gsignal (/lib/x86_64-linux-gnu/libc.so.6+0x42475) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2102783Z     #6 0x7fd6713be7f2 in abort (/lib/x86_64-linux-gnu/libc.so.6+0x287f2) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2103486Z     #7 0x7fd67d9f5193 in arrow::util::CerrLog::~CerrLog() /arrow/cpp/src/arrow/util/logging.cc:72:7
2023-10-06T10:40:15.2104144Z     #8 0x7fd67d9f5250 in arrow::util::CerrLog::~CerrLog() /arrow/cpp/src/arrow/util/logging.cc:66:22
2023-10-06T10:40:15.2104793Z     #9 0x7fd67d9f4d7f in arrow::util::ArrowLog::~ArrowLog() /arrow/cpp/src/arrow/util/logging.cc:250:5
2023-10-06T10:40:15.2105719Z     #10 0x7fd67d3cc150 in arrow::io::internal::(anonymous namespace)::ReadRangeCombiner::Coalesce(std::vector<arrow::io::ReadRange, std::allocator<arrow::io::ReadRange> >) /arrow/cpp/src/arrow/io/interfaces.cc:457:7
2023-10-06T10:40:15.2106830Z     #11 0x7fd67d3cac80 in arrow::io::internal::CoalesceReadRanges(std::vector<arrow::io::ReadRange, std::allocator<arrow::io::ReadRange> >, long, long) /arrow/cpp/src/arrow/io/interfaces.cc:518:19
2023-10-06T10:40:15.2107880Z     #12 0x7fd67d2c3be5 in arrow::io::internal::ReadRangeCache::Impl::Cache(std::vector<arrow::io::ReadRange, std::allocator<arrow::io::ReadRange> >) /arrow/cpp/src/arrow/io/caching.cc:177:14
2023-10-06T10:40:15.2108897Z     #13 0x7fd67d2c1cc9 in arrow::io::internal::ReadRangeCache::LazyImpl::Cache(std::vector<arrow::io::ReadRange, std::allocator<arrow::io::ReadRange> >) /arrow/cpp/src/arrow/io/caching.cc:288:34
2023-10-06T10:40:15.2109909Z     #14 0x7fd67d2bfec1 in arrow::io::internal::ReadRangeCache::Cache(std::vector<arrow::io::ReadRange, std::allocator<arrow::io::ReadRange> >) /arrow/cpp/src/arrow/io/caching.cc:320:17
2023-10-06T10:40:15.2111039Z     #15 0x7fd69120ef95 in parquet::SerializedFile::PreBuffer(std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::io::IOContext const&, arrow::io::CacheOptions const&) /arrow/cpp/src/parquet/file_reader.cc:368:5
2023-10-06T10:40:15.2112348Z     #16 0x7fd69120d7bf in parquet::ParquetFileReader::PreBuffer(std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::io::IOContext const&, arrow::io::CacheOptions const&) /arrow/cpp/src/parquet/file_reader.cc:862:9
2023-10-06T10:40:15.2113660Z     #17 0x7fd6904885e4 in parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroups(std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*) /arrow/cpp/src/parquet/arrow/reader.cc:1224:23
2023-10-06T10:40:15.2114817Z     #18 0x7fd690487727 in parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroup(int, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*) /arrow/cpp/src/parquet/arrow/reader.cc:321:12
2023-10-06T10:40:15.2115872Z     #19 0x7fd690487c7b in parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroup(int, std::shared_ptr<arrow::Table>*) /arrow/cpp/src/parquet/arrow/reader.cc:325:12
2023-10-06T10:40:15.2116737Z     #20 0x7fd69046cdf1 in parquet::arrow::internal::FuzzReader(std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >) /arrow/cpp/src/parquet/arrow/reader.cc:1374:37
2023-10-06T10:40:15.2117736Z     #21 0x7fd69046e94f in parquet::arrow::internal::FuzzReader(unsigned char const*, long) /arrow/cpp/src/parquet/arrow/reader.cc:1399:11
2023-10-06T10:40:15.2118358Z     #22 0x558933121e97 in LLVMFuzzerTestOneInput /arrow/cpp/src/parquet/arrow/fuzz.cc:22:17
2023-10-06T10:40:15.2119357Z     #23 0x558933048353 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) (/build/cpp/debug/parquet-arrow-fuzz+0x3f353) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2120490Z     #24 0x5589330320cf in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) (/build/cpp/debug/parquet-arrow-fuzz+0x290cf) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2121762Z     #25 0x558933037e26 in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) (/build/cpp/debug/parquet-arrow-fuzz+0x2ee26) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2122720Z     #26 0x558933061c42 in main (/build/cpp/debug/parquet-arrow-fuzz+0x58c42) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2123509Z     #27 0x7fd6713bfd8f  (/lib/x86_64-linux-gnu/libc.so.6+0x29d8f) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2124294Z     #28 0x7fd6713bfe3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e3f) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2125121Z     #29 0x55893302c994 in _start (/build/cpp/debug/parquet-arrow-fuzz+0x23994) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2125489Z 
2023-10-06T10:40:15.2126159Z NOTE: libFuzzer has rudimentary signal handlers.
2023-10-06T10:40:15.2127161Z       Combine libFuzzer with AddressSanitizer or similar for better crash reports.
2023-10-06T10:40:15.2127655Z SUMMARY: libFuzzer: deadly signal
2023-10-06T10:40:16.9350640Z 77
2023-10-06T10:40:17.0185097Z Error: `docker-compose --file /home/runner/work/arrow/arrow/docker-compose.yml run --rm ubuntu-cpp-sanitizer` exited with a non-zero exit code 77, see the process log above.
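For reference, the failed invariant `(left.offset + left.length) <= (right.offset)` can be illustrated in plain Python: coalescing sorts the requested read ranges and requires them to be disjoint, which a corrupt (fuzzed) file with overlapping column-chunk offsets violates. A simplified sketch, not Arrow's actual implementation:

```python
def coalesce_read_ranges(ranges, hole_size_limit=8192):
    """ranges: list of (offset, length) tuples. Merges ranges separated by
    holes of at most hole_size_limit bytes, after checking that the sorted
    ranges do not overlap (the check that fires in the fuzz run above)."""
    ranges = sorted(ranges)
    for (off1, len1), (off2, _len2) in zip(ranges, ranges[1:]):
        if off1 + len1 > off2:
            raise ValueError("Some read ranges overlap")
    coalesced = []
    for off, length in ranges:
        if coalesced:
            prev_off, prev_len = coalesced[-1]
            if off - (prev_off + prev_len) <= hole_size_limit:
                coalesced[-1] = (prev_off, off + length - prev_off)
                continue
        coalesced.append((off, length))
    return coalesced

# Two nearby ranges get merged; overlapping ranges (as produced by a
# corrupt file) raise an error instead of crashing the process:
print(coalesce_read_ranges([(0, 10), (100, 10)]))  # [(0, 110)]
```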

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit d7017dd.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.


kou commented Oct 7, 2023

It seems that this also broke R Windows CI:

https://github.com/apache/arrow/actions/runs/6428740951/job/17457269209

-- Failure ('test-parquet.R:39:3'): simple int column roundtrip ----------------
file.exists(pq_tmp_file) is not FALSE

`actual`:   TRUE 
`expected`: FALSE
-- Error ('test-parquet.R:294:3'): write_parquet() handles version argument ----
<purrr_error_indexed/rlang_error/error/condition>
Error in `map(.x, .f, ..., .progress = .progress)`: i In index: 2.
Caused by error:
! IOError: Failed to open local file 'C:/Users/runneradmin/AppData/Local/Temp/RtmpKQDBMp/working_dir/RtmpqeT41T/filebec68cc3983'. Detail: [Windows error 1224] The requested operation cannot be performed on a file with a user-mapped section open.

Backtrace:
     x
  1. +-purrr::walk(...) at test-parquet.R:294:2
  2. | \-purrr::map(.x, .f, ..., .progress = .progress)
  3. |   \-purrr:::map_("list", .x, .f, ..., .progress = .progress)
  4. |     +-purrr:::with_indexed_errors(...)
  5. |     | \-base::withCallingHandlers(...)
  6. |     +-purrr:::call_with_cleanup(...)
  7. |     \-arrow (local) .f(.x[[i]], ...)
  8. |       \-arrow::write_parquet(df, tf, version = .x) at test-parquet.R:295:4
  9. |         \-arrow:::make_output_stream(sink)
 10. |           \-FileOutputStream$create(x)
 11. |             \-arrow:::io___FileOutputStream__Open(clean_path_abs(path))
 12. \-base::.handleSimpleError(...)
 13.   \-purrr (local) h(simpleError(msg, call))
 14.     \-cli::cli_abort(...)
 15.       \-rlang::abort(...)

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…e for reading Parquet files (apache#37854)

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…r_test.cc` compile (apache#38069)

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…e for reading Parquet files (apache#37854)

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…r_test.cc` compile (apache#38069)

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…e for reading Parquet files (apache#37854)

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…r_test.cc` compile (apache#38069)

Successfully merging this pull request may close these issues.

[Python][Dataset][Parquet] Enable Pre-Buffering by default for Parquet s3 datasets
4 participants