
GH-36765: [Python][Dataset] Change default of pre_buffer to True for reading Parquet files #37854

Merged

Conversation


@jorisvandenbossche jorisvandenbossche commented Sep 25, 2023

Rationale for this change

Based on the benchmarks in the issue, enabling pre_buffer can give a significant speed-up on filesystems like S3 while causing no noticeable slowdown on local filesystems. Simply enabling it by default therefore seems the best choice.

The option was already enabled by default in the pyarrow.parquet.read_table interface; this PR aligns the defaults when using pyarrow.dataset directly.

@jorisvandenbossche jorisvandenbossche changed the title [Python][Dataset] Change default of pre_buffer to True for reading Parquet files GH-36765: [Python][Dataset] Change default of pre_buffer to True for reading Parquet files Sep 25, 2023
@apache apache deleted a comment from github-actions bot Sep 25, 2023
@github-actions

⚠️ GitHub issue #36765 has been automatically assigned in GitHub to PR creator.

@jorisvandenbossche (Member, Author)

Questions:

  • Do I change the default in C++ as well?
  • Should we also change the default cache_options to LazyDefaults when pre_buffer is enabled?


kou commented Sep 25, 2023

  • Do I change the default in C++ as well?

Based on the #28218 results, I think so.


mapleFU commented Sep 26, 2023

When the scanner has multiple scan threads and a prefetch depth (i.e. a prefetched fragment count), could this cause huge memory consumption?


mapleFU commented Sep 26, 2023

Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_groups,
                                            const std::vector<int>& column_indices,
                                            std::unique_ptr<RecordBatchReader>* out) {
  RETURN_NOT_OK(BoundsCheck(row_groups, column_indices));

  if (reader_properties_.pre_buffer()) {
    // PARQUET-1698/PARQUET-1820: pre-buffer row groups/column chunks if enabled
    BEGIN_PARQUET_CATCH_EXCEPTIONS
    reader_->PreBuffer(row_groups, column_indices, reader_properties_.io_context(),
                       reader_properties_.cache_options());
    END_PARQUET_CATCH_EXCEPTIONS
  }

Here, PreBuffer will buffer the requested row groups, and that memory is not released until the read is finished. This is different from buffered-stream mode (which can actually decrease memory usage, lol).

Even when the policy is lazy, the reader might not get faster if a row group is large enough, and the memory is still not released before the read finishes. So I wonder if this is OK.
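To make the concern concrete: pre-buffered row groups stay resident until their read completes, so the high-water mark scales with both row-group size and the number of in-flight fragments. A back-of-the-envelope sketch (all numbers hypothetical):

```python
MIB = 1 << 20

def prebuffer_high_watermark(row_group_bytes: int, inflight_fragments: int) -> int:
    """Rough upper bound on memory held by pre-buffered raw Parquet data:
    each in-flight fragment keeps its buffered row group alive until its
    read finishes."""
    return row_group_bytes * inflight_fragments

# e.g. 512 MiB row groups with a fragment readahead of 4:
print(prebuffer_high_watermark(512 * MIB, 4) // MIB)  # 2048 (MiB)
```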

@jorisvandenbossche (Member, Author)

@mapleFU could you post those questions and remarks on the issue? (That might reach a wider audience who can answer them, as many people have already commented there.)


mapleFU commented Sep 26, 2023

Done. I think enabling this is OK; maybe we should document how to handle high memory usage in the docs.

@@ -666,7 +666,7 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
         Disabled by default.
     buffer_size : int, default 8192
         Size of buffered stream, if enabled. Default is 8KB.
-    pre_buffer : bool, default False
+    pre_buffer : bool, default True
         If enabled, pre-buffer the raw Parquet data instead of issuing one
         read per column chunk. This can improve performance on high-latency
         filesystems.
Member

Possibly we should improve the docstring to also mention that you should disable this if you are concerned with memory usage over throughput? (Also, possibly make it clear that "high-latency filesystems" is likely to mean object stores like S3, GCS, etc.)

Member Author

Yep, will do

@github-actions github-actions bot added awaiting merge Awaiting merge awaiting review Awaiting review and removed awaiting committer review Awaiting committer review labels Oct 5, 2023
@jorisvandenbossche (Member, Author)

@lidavidm do you have any thought on the question in #37854 (comment) ?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review awaiting merge Awaiting merge labels Oct 5, 2023

lidavidm commented Oct 5, 2023

Oops, I missed that.

  • C++: let's be consistent.
  • I think LazyDefaults is a good default. I'm not sure it'll make a big difference.
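The behavioural difference between the two cache policies can be modelled in a few lines of Python (a simplified sketch, not Arrow's actual `ReadRangeCache` implementation):

```python
class EagerRangeCache:
    """Models CacheOptions::Defaults(): every cached range is read
    immediately when cache() is called."""
    def __init__(self, read_fn):
        self.read_fn = read_fn
        self.data = {}

    def cache(self, ranges):
        for r in ranges:
            self.data[r] = self.read_fn(r)  # I/O happens up front

    def read(self, r):
        return self.data[r]


class LazyRangeCache:
    """Models CacheOptions::LazyDefaults(): ranges are only recorded at
    cache() time and fetched on first access."""
    def __init__(self, read_fn):
        self.read_fn = read_fn
        self.pending = set()
        self.data = {}

    def cache(self, ranges):
        self.pending.update(ranges)  # no I/O yet

    def read(self, r):
        if r not in self.data:
            self.pending.discard(r)
            self.data[r] = self.read_fn(r)  # I/O deferred to first use
        return self.data[r]
```

With the lazy policy, ranges that are never accessed (e.g. row groups pruned by a filter) are never fetched, which is why it pairs naturally with pre-buffering.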

@github-actions github-actions bot added Component: C++ awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 5, 2023
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Oct 5, 2023
@jorisvandenbossche jorisvandenbossche merged commit d7017dd into apache:main Oct 6, 2023
19 of 32 checks passed
@jorisvandenbossche jorisvandenbossche removed the awaiting merge Awaiting merge label Oct 6, 2023
@jorisvandenbossche jorisvandenbossche deleted the gh-36765-parquet-pre-buffer branch October 6, 2023 07:26
ArrowReaderProperties arrow_properties = default_arrow_reader_properties();
arrow_properties.set_pre_buffer(true);
arrow_properties.set_pre_buffer(false);
arrow_properties.set_cache_options(::arrow::io::CacheOptions::Defaults())
Member

This might cause the compile to fail; I submitted a fix: #38069

@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Oct 6, 2023
@jorisvandenbossche
Copy link
Member Author

Oh, sorry for completely forgetting to check all the failed builds .. 🤦

jorisvandenbossche pushed a commit that referenced this pull request Oct 6, 2023
….cc` compile (#38069)

### Rationale for this change

The compile failure was introduced in a previous patch (#37854); this patch fixes it.

### What changes are included in this PR?

A one-line fix.

### Are these changes tested?

no

### Are there any user-facing changes?

no

* Closes: #38068

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche
Copy link
Member Author

So this PR introduced a failure in the "AMD64 Ubuntu 22.04 C++ ASAN UBSAN" build (https://github.com/apache/arrow/actions/runs/6430392691/job/17462667620?pr=38069#logs), related to the LazyCache coalesced reads. See details below.

I assume this is an existing bug, given this PR only changed a default for an option a user could already set before as well. But changing the default of course makes it more visible.

A potential short-term option is to change only pre_buffer and keep the current non-lazy default cache_options (if that fixes it). Or revert the PR entirely until this is resolved (I don't have time today to look into it in more detail).

2023-10-06T10:40:14.0622194Z Running: /arrow/testing/data/parquet/fuzzing/clusterfuzz-testcase-minimized-parquet-arrow-fuzz-5640198106120192
2023-10-06T10:40:14.0651320Z /arrow/cpp/src/arrow/io/interfaces.cc:457:  Check failed: (left.offset + left.length) <= (right.offset) Some read ranges overlap
2023-10-06T10:40:14.0661169Z /build/cpp/debug/parquet-arrow-fuzz(backtrace+0x5b)[0x55893309d6bb]
2023-10-06T10:40:14.0678721Z /usr/local/lib/libarrow.so.1400(_ZN5arrow4util7CerrLog14PrintBackTraceEv+0x1a5)[0x7fd67d9f5405]
2023-10-06T10:40:14.0694280Z /usr/local/lib/libarrow.so.1400(_ZN5arrow4util7CerrLogD2Ev+0x1f7)[0x7fd67d9f5177]
2023-10-06T10:40:14.0708313Z /usr/local/lib/libarrow.so.1400(_ZN5arrow4util7CerrLogD0Ev+0x61)[0x7fd67d9f5251]
2023-10-06T10:40:14.0722939Z /usr/local/lib/libarrow.so.1400(_ZN5arrow4util8ArrowLogD1Ev+0x1d0)[0x7fd67d9f4d80]
2023-10-06T10:40:14.0733586Z /usr/local/lib/libarrow.so.1400(+0xb13f151)[0x7fd67d3cc151]
2023-10-06T10:40:14.0746700Z /usr/local/lib/libarrow.so.1400(_ZN5arrow2io8internal18CoalesceReadRangesESt6vectorINS0_9ReadRangeESaIS3_EEll+0x4c1)[0x7fd67d3cac81]
2023-10-06T10:40:14.0762388Z /usr/local/lib/libarrow.so.1400(_ZN5arrow2io8internal14ReadRangeCache4Impl5CacheESt6vectorINS0_9ReadRangeESaIS5_EE+0x456)[0x7fd67d2c3be6]
2023-10-06T10:40:14.0775666Z /usr/local/lib/libarrow.so.1400(_ZN5arrow2io8internal14ReadRangeCache8LazyImpl5CacheESt6vectorINS0_9ReadRangeESaIS5_EE+0x24a)[0x7fd67d2c1cca]
2023-10-06T10:40:14.0790164Z /usr/local/lib/libarrow.so.1400(_ZN5arrow2io8internal14ReadRangeCache5CacheESt6vectorINS0_9ReadRangeESaIS4_EE+0x2a2)[0x7fd67d2bfec2]
2023-10-06T10:40:14.0795950Z /usr/local/lib/libparquet.so.1400(_ZN7parquet14SerializedFile9PreBufferERKSt6vectorIiSaIiEES5_RKN5arrow2io9IOContextERKNS7_12CacheOptionsE+0x1696)[0x7fd69120ef96]
2023-10-06T10:40:14.0801581Z /usr/local/lib/libparquet.so.1400(_ZN7parquet17ParquetFileReader9PreBufferERKSt6vectorIiSaIiEES5_RKN5arrow2io9IOContextERKNS7_12CacheOptionsE+0x360)[0x7fd69120d7c0]
2023-10-06T10:40:14.0808329Z /usr/local/lib/libparquet.so.1400(+0x15435e5)[0x7fd6904885e5]
2023-10-06T10:40:14.0808759Z /usr/local/lib/libparquet.so.1400(+0x1542728)[0x7fd690487728]
2023-10-06T10:40:14.0815343Z /usr/local/lib/libparquet.so.1400(+0x1542c7c)[0x7fd690487c7c]
2023-10-06T10:40:14.0816050Z /usr/local/lib/libparquet.so.1400(_ZN7parquet5arrow8internal10FuzzReaderESt10unique_ptrINS0_10FileReaderESt14default_deleteIS3_EE+0x3e2)[0x7fd69046cdf2]
2023-10-06T10:40:14.0822733Z ==14349== ERROR: libFuzzer: deadly signal
2023-10-06T10:40:14.0823311Z /usr/local/lib/libparquet.so.1400(_ZN7parquet5arrow8internal10FuzzReaderEPKhl+0x1130)[0x7fd69046e950]
2023-10-06T10:40:14.0824114Z /build/cpp/debug/parquet-arrow-fuzz(+0x118e98)[0x558933121e98]
2023-10-06T10:40:14.0825448Z /build/cpp/debug/parquet-arrow-fuzz(+0x3f354)[0x558933048354]
2023-10-06T10:40:14.0826059Z /build/cpp/debug/parquet-arrow-fuzz(+0x290d0)[0x5589330320d0]
2023-10-06T10:40:14.0826543Z /build/cpp/debug/parquet-arrow-fuzz(+0x2ee27)[0x558933037e27]
2023-10-06T10:40:14.0827941Z /build/cpp/debug/parquet-arrow-fuzz(+0x58c43)[0x558933061c43]
2023-10-06T10:40:14.0828405Z /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fd6713bfd90]
2023-10-06T10:40:14.0828882Z /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fd6713bfe40]
2023-10-06T10:40:14.0829351Z /build/cpp/debug/parquet-arrow-fuzz(+0x23995)[0x55893302c995]
2023-10-06T10:40:15.2094786Z     #0 0x5589330eeab1 in __sanitizer_print_stack_trace (/build/cpp/debug/parquet-arrow-fuzz+0xe5ab1) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2096115Z     #1 0x558933061348 in fuzzer::PrintStackTrace() (/build/cpp/debug/parquet-arrow-fuzz+0x58348) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2097546Z     #2 0x558933046dc3 in fuzzer::Fuzzer::CrashCallback() (/build/cpp/debug/parquet-arrow-fuzz+0x3ddc3) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2098544Z     #3 0x7fd6713d851f  (/lib/x86_64-linux-gnu/libc.so.6+0x4251f) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2099481Z     #4 0x7fd67142ca7b in pthread_kill (/lib/x86_64-linux-gnu/libc.so.6+0x96a7b) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2101878Z     #5 0x7fd6713d8475 in gsignal (/lib/x86_64-linux-gnu/libc.so.6+0x42475) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2102783Z     #6 0x7fd6713be7f2 in abort (/lib/x86_64-linux-gnu/libc.so.6+0x287f2) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2103486Z     #7 0x7fd67d9f5193 in arrow::util::CerrLog::~CerrLog() /arrow/cpp/src/arrow/util/logging.cc:72:7
2023-10-06T10:40:15.2104144Z     #8 0x7fd67d9f5250 in arrow::util::CerrLog::~CerrLog() /arrow/cpp/src/arrow/util/logging.cc:66:22
2023-10-06T10:40:15.2104793Z     #9 0x7fd67d9f4d7f in arrow::util::ArrowLog::~ArrowLog() /arrow/cpp/src/arrow/util/logging.cc:250:5
2023-10-06T10:40:15.2105719Z     #10 0x7fd67d3cc150 in arrow::io::internal::(anonymous namespace)::ReadRangeCombiner::Coalesce(std::vector<arrow::io::ReadRange, std::allocator<arrow::io::ReadRange> >) /arrow/cpp/src/arrow/io/interfaces.cc:457:7
2023-10-06T10:40:15.2106830Z     #11 0x7fd67d3cac80 in arrow::io::internal::CoalesceReadRanges(std::vector<arrow::io::ReadRange, std::allocator<arrow::io::ReadRange> >, long, long) /arrow/cpp/src/arrow/io/interfaces.cc:518:19
2023-10-06T10:40:15.2107880Z     #12 0x7fd67d2c3be5 in arrow::io::internal::ReadRangeCache::Impl::Cache(std::vector<arrow::io::ReadRange, std::allocator<arrow::io::ReadRange> >) /arrow/cpp/src/arrow/io/caching.cc:177:14
2023-10-06T10:40:15.2108897Z     #13 0x7fd67d2c1cc9 in arrow::io::internal::ReadRangeCache::LazyImpl::Cache(std::vector<arrow::io::ReadRange, std::allocator<arrow::io::ReadRange> >) /arrow/cpp/src/arrow/io/caching.cc:288:34
2023-10-06T10:40:15.2109909Z     #14 0x7fd67d2bfec1 in arrow::io::internal::ReadRangeCache::Cache(std::vector<arrow::io::ReadRange, std::allocator<arrow::io::ReadRange> >) /arrow/cpp/src/arrow/io/caching.cc:320:17
2023-10-06T10:40:15.2111039Z     #15 0x7fd69120ef95 in parquet::SerializedFile::PreBuffer(std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::io::IOContext const&, arrow::io::CacheOptions const&) /arrow/cpp/src/parquet/file_reader.cc:368:5
2023-10-06T10:40:15.2112348Z     #16 0x7fd69120d7bf in parquet::ParquetFileReader::PreBuffer(std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, arrow::io::IOContext const&, arrow::io::CacheOptions const&) /arrow/cpp/src/parquet/file_reader.cc:862:9
2023-10-06T10:40:15.2113660Z     #17 0x7fd6904885e4 in parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroups(std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*) /arrow/cpp/src/parquet/arrow/reader.cc:1224:23
2023-10-06T10:40:15.2114817Z     #18 0x7fd690487727 in parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroup(int, std::vector<int, std::allocator<int> > const&, std::shared_ptr<arrow::Table>*) /arrow/cpp/src/parquet/arrow/reader.cc:321:12
2023-10-06T10:40:15.2115872Z     #19 0x7fd690487c7b in parquet::arrow::(anonymous namespace)::FileReaderImpl::ReadRowGroup(int, std::shared_ptr<arrow::Table>*) /arrow/cpp/src/parquet/arrow/reader.cc:325:12
2023-10-06T10:40:15.2116737Z     #20 0x7fd69046cdf1 in parquet::arrow::internal::FuzzReader(std::unique_ptr<parquet::arrow::FileReader, std::default_delete<parquet::arrow::FileReader> >) /arrow/cpp/src/parquet/arrow/reader.cc:1374:37
2023-10-06T10:40:15.2117736Z     #21 0x7fd69046e94f in parquet::arrow::internal::FuzzReader(unsigned char const*, long) /arrow/cpp/src/parquet/arrow/reader.cc:1399:11
2023-10-06T10:40:15.2118358Z     #22 0x558933121e97 in LLVMFuzzerTestOneInput /arrow/cpp/src/parquet/arrow/fuzz.cc:22:17
2023-10-06T10:40:15.2119357Z     #23 0x558933048353 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) (/build/cpp/debug/parquet-arrow-fuzz+0x3f353) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2120490Z     #24 0x5589330320cf in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) (/build/cpp/debug/parquet-arrow-fuzz+0x290cf) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2121762Z     #25 0x558933037e26 in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) (/build/cpp/debug/parquet-arrow-fuzz+0x2ee26) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2122720Z     #26 0x558933061c42 in main (/build/cpp/debug/parquet-arrow-fuzz+0x58c42) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2123509Z     #27 0x7fd6713bfd8f  (/lib/x86_64-linux-gnu/libc.so.6+0x29d8f) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2124294Z     #28 0x7fd6713bfe3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e3f) (BuildId: 229b7dc509053fe4df5e29e8629911f0c3bc66dd)
2023-10-06T10:40:15.2125121Z     #29 0x55893302c994 in _start (/build/cpp/debug/parquet-arrow-fuzz+0x23994) (BuildId: 8286aad552d39ef7fd5d08d745adab7f6b613e22)
2023-10-06T10:40:15.2125489Z 
2023-10-06T10:40:15.2126159Z NOTE: libFuzzer has rudimentary signal handlers.
2023-10-06T10:40:15.2127161Z       Combine libFuzzer with AddressSanitizer or similar for better crash reports.
2023-10-06T10:40:15.2127655Z SUMMARY: libFuzzer: deadly signal
2023-10-06T10:40:16.9350640Z 77
2023-10-06T10:40:17.0185097Z Error: `docker-compose --file /home/runner/work/arrow/arrow/docker-compose.yml run --rm ubuntu-cpp-sanitizer` exited with a non-zero exit code 77, see the process log above.
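For reference, the failed invariant `(left.offset + left.length) <= (right.offset)` can be illustrated in plain Python: coalescing sorts the requested read ranges and requires them to be disjoint, which a corrupt (fuzzed) file with overlapping column-chunk offsets violates. A simplified sketch, not Arrow's actual implementation:

```python
def coalesce_read_ranges(ranges, hole_size_limit=8192):
    """ranges: list of (offset, length) tuples. Merges ranges separated by
    holes of at most hole_size_limit bytes, after checking that the sorted
    ranges do not overlap (the check that fires in the fuzz run above)."""
    ranges = sorted(ranges)
    for (off1, len1), (off2, _len2) in zip(ranges, ranges[1:]):
        if off1 + len1 > off2:
            raise ValueError("Some read ranges overlap")
    coalesced = []
    for off, length in ranges:
        if coalesced:
            prev_off, prev_len = coalesced[-1]
            if off - (prev_off + prev_len) <= hole_size_limit:
                coalesced[-1] = (prev_off, off + length - prev_off)
                continue
        coalesced.append((off, length))
    return coalesced

# Two nearby ranges get merged; overlapping ranges (as produced by a
# corrupt file) raise an error instead of crashing the process:
print(coalesce_read_ranges([(0, 10), (100, 10)]))  # [(0, 110)]
```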

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit d7017dd.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.


kou commented Oct 7, 2023

It seems that this also broke R Windows CI:

https://github.com/apache/arrow/actions/runs/6428740951/job/17457269209

-- Failure ('test-parquet.R:39:3'): simple int column roundtrip ----------------
file.exists(pq_tmp_file) is not FALSE

`actual`:   TRUE 
`expected`: FALSE
-- Error ('test-parquet.R:294:3'): write_parquet() handles version argument ----
<purrr_error_indexed/rlang_error/error/condition>
Error in `map(.x, .f, ..., .progress = .progress)`: i In index: 2.
Caused by error:
! IOError: Failed to open local file 'C:/Users/runneradmin/AppData/Local/Temp/RtmpKQDBMp/working_dir/RtmpqeT41T/filebec68cc3983'. Detail: [Windows error 1224] The requested operation cannot be performed on a file with a user-mapped section open.

Backtrace:
     x
  1. +-purrr::walk(...) at test-parquet.R:294:2
  2. | \-purrr::map(.x, .f, ..., .progress = .progress)
  3. |   \-purrr:::map_("list", .x, .f, ..., .progress = .progress)
  4. |     +-purrr:::with_indexed_errors(...)
  5. |     | \-base::withCallingHandlers(...)
  6. |     +-purrr:::call_with_cleanup(...)
  7. |     \-arrow (local) .f(.x[[i]], ...)
  8. |       \-arrow::write_parquet(df, tf, version = .x) at test-parquet.R:295:4
  9. |         \-arrow:::make_output_stream(sink)
 10. |           \-FileOutputStream$create(x)
 11. |             \-arrow:::io___FileOutputStream__Open(clean_path_abs(path))
 12. \-base::.handleSimpleError(...)
 13.   \-purrr (local) h(simpleError(msg, call))
 14.     \-cli::cli_abort(...)
 15.       \-rlang::abort(...)

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…e for reading Parquet files (apache#37854)

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…r_test.cc` compile (apache#38069)

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…e for reading Parquet files (apache#37854)

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…r_test.cc` compile (apache#38069)

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…e for reading Parquet files (apache#37854)

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…r_test.cc` compile (apache#38069)

Successfully merging this pull request may close these issues.

[Python][Dataset][Parquet] Enable Pre-Buffering by default for Parquet s3 datasets
4 participants