[C++][Acero] Random hangs when joining tables with ExecutePlan #39582

Closed
stenlarsson opened this issue Jan 12, 2024 · 13 comments · Fixed by #40007
Labels: Component: Ruby, Critical Fix, Type: bug

@stenlarsson
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

We have had problems for a long time with a specific batch job that combines data from different sources. Something in the data is causing the issue, but I haven't been able to figure out exactly what. I have created a test case where I tried my best to minimise and anonymise the data: https://github.com/stenlarsson/arrow-test

Sometimes it hangs after a random number of iterations:

$ ruby hang.rb
0
1
2

Sometimes it crashes:

$ ruby hang.rb
0
SEGV received in BUS handler
[1]    74331 abort      ruby hang.rb

I'm running macOS / Ruby 3.2.2 / Arrow 14.0.2 on my computer, but have also reproduced the error with Linux / Ruby 3.0.6 / Arrow 11.0.0. It doesn't seem to happen with Arrow 10.0.1.
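
Because the failure is intermittent (sometimes a hang, sometimes a crash), it can help to run the repro many times under a timeout and count the outcomes. The following is only a minimal sketch of such a harness, not part of the linked test repository. The helper names `classify_run` and `tally` are hypothetical, and it assumes the repro script exits on its own after a bounded number of iterations, so that a timeout really means a hang.

```python
import subprocess


def classify_run(cmd, timeout_s):
    """Run cmd once and classify the outcome as 'ok', 'crash', or 'hang'."""
    try:
        # subprocess.run kills the child process if the timeout expires.
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang"
    # A non-zero (or negative, i.e. signal) return code is counted as a crash.
    return "ok" if result.returncode == 0 else "crash"


def tally(cmd, runs, timeout_s):
    """Repeat the command and count how often each outcome occurs."""
    counts = {"ok": 0, "crash": 0, "hang": 0}
    for _ in range(runs):
        counts[classify_run(cmd, timeout_s)] += 1
    return counts
```

For example, `tally(["ruby", "hang.rb"], runs=20, timeout_s=60)` would give a rough failure rate to compare across Arrow versions; the run count and timeout here are arbitrary choices.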

Component(s)

Ruby

@raulcd raulcd changed the title Random hangs when joining tables with ExecutePlan in Ruby [Ruby] Random hangs when joining tables with ExecutePlan in Ruby Jan 15, 2024
@stenlarsson
Contributor Author

I initially thought this was a Ruby problem, but now I managed to reproduce the problem with Python 3.11.6 / Arrow 15.0.0 as well. It doesn't crash when running it on macOS, but maybe I'm just lucky. It crashes randomly when running Ubuntu inside a Lima VM:

$ ~/venv/bin/python hang.py
0
1
2
3
4
5
Segmentation fault (core dumped)

I pushed hang.py to https://github.com/stenlarsson/arrow-test.

@stenlarsson
Contributor Author

stenlarsson commented Jan 24, 2024

I also tried to compile a debug version of Arrow. Not sure if I built it correctly, but when running it the following assertion fails:

$ ruby hang.rb
0
/home/stenlarsson.linux/arrow/cpp/src/arrow/compute/util.cc:38:  Check failed: top_ <= buffer_size_
/home/stenlarsson.linux/arrow/cpp/src/arrow/compute/util.cc:38:  Check failed: top_ <= buffer_size_
/home/stenlarsson.linux/arrow/cpp/src/arrow/compute/util.cc:38:  Check failed: top_ <= buffer_size_
/home/stenlarsson.linux/arrow/cpp/src/arrow/compute/util.cc:38:  Check failed: top_ <= buffer_size_
/usr/local/lib/libarrow.so.1600(_ZN5arrow4util7CerrLog14PrintBackTraceEv+0x34)[0xffff9d325794]
/usr/local/lib/libarrow.so.1600(_ZN5arrow4util7CerrLogD1Ev+0x60)[0xffff9d325704]
/usr/local/lib/libarrow.so.1600(_ZN5arrow4util7CerrLogD0Ev+0x14)[0xffff9d325728]
/usr/local/lib/libarrow.so.1600(_ZN5arrow4util8ArrowLogD1Ev+0x50)[0xffff9d325578]
/usr/local/lib/libarrow.so.1600(_ZN5arrow4util15TempVectorStack5allocEjPPhPi+0xf4)[0xffff9d91e060]
/usr/local/lib/libarrow.so.1600(_ZN5arrow4util16TempVectorHolderIjEC1EPNS0_15TempVectorStackEj+0x58)[0xffff9d6c54b4]
/usr/local/lib/libarrow.so.1600(_ZN5arrow7compute9Hashing3215HashMultiColumnERKSt6vectorINS0_14KeyColumnArrayESaIS3_EEPNS0_12LightContextEPj+0xa4)[0xffff9d6c0ae0]
/usr/local/lib/libarrow.so.1600(_ZN5arrow7compute9Hashing329HashBatchERKNS0_9ExecBatchEPjRSt6vectorINS0_14KeyColumnArrayESaIS7_EElPNS_4util15TempVectorStackEll+0x130)[0xffff9d6c12cc]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero18SwissTableWithKeys4HashEPNS1_5InputEPjl+0x110)[0xffffaf681410]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero18JoinProbeProcessor11OnNextBatchElRKNS_7compute9ExecBatchEPNS_4util15TempVectorStackEPSt6vectorINS2_14KeyColumnArrayESaISA_EE+0x228)[0xffffaf687080]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero9SwissJoin16ProbeSingleBatchEmNS_7compute9ExecBatchE+0x300)[0xffffaf68f2a8]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero12HashJoinNode16OnProbeSideBatchEmNS_7compute9ExecBatchE+0x25c)[0xffffaf601800]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero12HashJoinNode13InputReceivedEPNS0_8ExecNodeENS_7compute9ExecBatchE+0x1b0)[0xffffaf6022c8]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero12HashJoinNode19OutputBatchCallbackENS_7compute9ExecBatchE+0x80)[0xffffaf603820]
/usr/local/lib/libarrow_acero.so.1600(_ZZN5arrow5acero12HashJoinNode4InitEvENKUllNS_7compute9ExecBatchEE4_clElS3_+0x5c)[0xffffaf602bd0]
/usr/local/lib/libarrow_acero.so.1600(_ZSt13__invoke_implIN5arrow6StatusERZNS0_5acero12HashJoinNode4InitEvEUllNS0_7compute9ExecBatchEE_JlS5_EET_St14__invoke_otherOT0_DpOT1_+0x80)[0xffffaf619114]
/usr/local/lib/libarrow_acero.so.1600(_ZSt10__invoke_rIN5arrow6StatusERZNS0_5acero12HashJoinNode4InitEvEUllNS0_7compute9ExecBatchEE_JlS5_EENSt9enable_ifIX16is_invocable_r_vIT_T0_DpT1_EES9_E4typeEOSA_DpOSB_+0x70)[0xffffaf61481c]
/usr/local/lib/libarrow_acero.so.1600(_ZNSt17_Function_handlerIFN5arrow6StatusElNS0_7compute9ExecBatchEEZNS0_5acero12HashJoinNode4InitEvEUllS3_E_E9_M_invokeERKSt9_Any_dataOlOS3_+0x6c)[0xffffaf60e9c0]
/usr/local/lib/libarrow_acero.so.1600(_ZNKSt8functionIFN5arrow6StatusElNS0_7compute9ExecBatchEEEclElS3_+0xa8)[0xffffaf5cf6f0]
/usr/local/lib/libarrow_acero.so.1600(+0x506ddc)[0xffffaf686ddc]
/usr/local/lib/libarrow_acero.so.1600(+0x50bfe0)[0xffffaf68bfe0]
/usr/local/lib/libarrow_acero.so.1600(+0x50b0c8)[0xffffaf68b0c8]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero18JoinProbeProcessor11OnNextBatchElRKNS_7compute9ExecBatchEPNS_4util15TempVectorStackEPSt6vectorINS2_14KeyColumnArrayESaISA_EE+0x8e8)[0xffffaf687740]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero9SwissJoin16ProbeSingleBatchEmNS_7compute9ExecBatchE+0x300)[0xffffaf68f2a8]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero12HashJoinNode16OnProbeSideBatchEmNS_7compute9ExecBatchE+0x25c)[0xffffaf601800]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero12HashJoinNode13InputReceivedEPNS0_8ExecNodeENS_7compute9ExecBatchE+0x1b0)[0xffffaf6022c8]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero12HashJoinNode19OutputBatchCallbackENS_7compute9ExecBatchE+0x80)[0xffffaf603820]
/usr/local/lib/libarrow_acero.so.1600(_ZZN5arrow5acero12HashJoinNode4InitEvENKUllNS_7compute9ExecBatchEE4_clElS3_+0x5c)[0xffffaf602bd0]
/usr/local/lib/libarrow_acero.so.1600(_ZSt13__invoke_implIN5arrow6StatusERZNS0_5acero12HashJoinNode4InitEvEUllNS0_7compute9ExecBatchEE_JlS5_EET_St14__invoke_otherOT0_DpOT1_+0x80)[0xffffaf619114]
/usr/local/lib/libarrow_acero.so.1600(_ZSt10__invoke_rIN5arrow6StatusERZNS0_5acero12HashJoinNode4InitEvEUllNS0_7compute9ExecBatchEE_JlS5_EENSt9enable_ifIX16is_invocable_r_vIT_T0_DpT1_EES9_E4typeEOSA_DpOSB_+0x70)[0xffffaf61481c]
/usr/local/lib/libarrow_acero.so.1600(_ZNSt17_Function_handlerIFN5arrow6StatusElNS0_7compute9ExecBatchEEZNS0_5acero12HashJoinNode4InitEvEUllS3_E_E9_M_invokeERKSt9_Any_dataOlOS3_+0x6c)[0xffffaf60e9c0]
/usr/local/lib/libarrow_acero.so.1600(_ZNKSt8functionIFN5arrow6StatusElNS0_7compute9ExecBatchEEEclElS3_+0xa8)[0xffffaf5cf6f0]
/usr/local/lib/libarrow_acero.so.1600(+0x506ddc)[0xffffaf686ddc]
/usr/local/lib/libarrow_acero.so.1600(+0x50bfe0)[0xffffaf68bfe0]
/usr/local/lib/libarrow_acero.so.1600(+0x50b0c8)[0xffffaf68b0c8]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero18JoinProbeProcessor11OnNextBatchElRKNS_7compute9ExecBatchEPNS_4util15TempVectorStackEPSt6vectorINS2_14KeyColumnArrayESaISA_EE+0x8e8)[0xffffaf687740]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero9SwissJoin16ProbeSingleBatchEmNS_7compute9ExecBatchE+0x300)[0xffffaf68f2a8]
/usr/local/lib/libarrow_acero.so.1600(_ZZN5arrow5acero12HashJoinNode4InitEvENKUlmlE_clEml+0x94)[0xffffaf602d54]
/usr/local/lib/libarrow_acero.so.1600(_ZSt13__invoke_implIN5arrow6StatusERZNS0_5acero12HashJoinNode4InitEvEUlmlE6_JmlEET_St14__invoke_otherOT0_DpOT1_+0x74)[0xffffaf6193dc]
/usr/local/lib/libarrow_acero.so.1600(_ZSt10__invoke_rIN5arrow6StatusERZNS0_5acero12HashJoinNode4InitEvEUlmlE6_JmlEENSt9enable_ifIX16is_invocable_r_vIT_T0_DpT1_EES7_E4typeEOS8_DpOS9_+0x70)[0xffffaf614c00]
/usr/local/lib/libarrow_acero.so.1600(_ZNSt17_Function_handlerIFN5arrow6StatusEmlEZNS0_5acero12HashJoinNode4InitEvEUlmlE6_E9_M_invokeERKSt9_Any_dataOmOl+0x6c)[0xffffaf60ecc4]
/usr/local/lib/libarrow_acero.so.1600(_ZNKSt8functionIFN5arrow6StatusEmlEEclEml+0xa8)[0xffffaf6a5068]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero17TaskSchedulerImpl11ExecuteTaskEmilPb+0x84)[0xffffaf6a264c]
/usr/local/lib/libarrow_acero.so.1600(+0x523220)[0xffffaf6a3220]
/usr/local/lib/libarrow_acero.so.1600(+0x52421c)[0xffffaf6a421c]
/usr/local/lib/libarrow_acero.so.1600(+0x52407c)[0xffffaf6a407c]
/usr/local/lib/libarrow_acero.so.1600(+0x523ee4)[0xffffaf6a3ee4]
/usr/local/lib/libarrow_acero.so.1600(_ZNKSt8functionIFN5arrow6StatusEmEEclEm+0x94)[0xffffaf5d033c]
/usr/local/lib/libarrow_acero.so.1600(+0x4b8a14)[0xffffaf638a14]
/usr/local/lib/libarrow_acero.so.1600(+0x4ba570)[0xffffaf63a570]
/usr/local/lib/libarrow_acero.so.1600(+0x4b9fb8)[0xffffaf639fb8]
/usr/local/lib/libarrow_acero.so.1600(+0x4b976c)[0xffffaf63976c]
/usr/local/lib/libarrow_acero.so.1600(_ZNKSt8functionIFN5arrow6StatusEvEEclEv+0x7c)[0xffffaf4d3468]
/usr/local/lib/libarrow_acero.so.1600(_ZNK5arrow6detail14ContinueFutureclIRSt8functionIFNS_6StatusEvEEJES4_NS_6FutureINS_8internal5EmptyEEEEENSt9enable_ifIXaaaantsrSt7is_voidIT1_E5valuentsrNS0_9is_futureISE_EE5valueoontsrT2_8is_emptysrSt7is_sameISE_S4_E5valueEvE4typeESI_OT_DpOT0_+0x4c)[0xffffaf63e0f0]
/usr/local/lib/libarrow_acero.so.1600(_ZSt13__invoke_implIvRN5arrow6detail14ContinueFutureEJRNS0_6FutureINS0_8internal5EmptyEEERSt8functionIFNS0_6StatusEvEEEET_St14__invoke_otherOT0_DpOT1_+0x74)[0xffffaf63dff0]
/usr/local/lib/libarrow_acero.so.1600(_ZSt8__invokeIRN5arrow6detail14ContinueFutureEJRNS0_6FutureINS0_8internal5EmptyEEERSt8functionIFNS0_6StatusEvEEEENSt15__invoke_resultIT_JDpT0_EE4typeEOSF_DpOSG_+0x50)[0xffffaf63df08]
/usr/local/lib/libarrow_acero.so.1600(_ZNSt5_BindIFN5arrow6detail14ContinueFutureENS0_6FutureINS0_8internal5EmptyEEESt8functionIFNS0_6StatusEvEEEE6__callIvJEJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE+0x80)[0xffffaf63de04]
/usr/local/lib/libarrow_acero.so.1600(_ZNSt5_BindIFN5arrow6detail14ContinueFutureENS0_6FutureINS0_8internal5EmptyEEESt8functionIFNS0_6StatusEvEEEEclIJEvEET0_DpOT_+0x40)[0xffffaf63dd48]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow8internal6FnOnceIFvvEE6FnImplISt5_BindIFNS_6detail14ContinueFutureENS_6FutureINS0_5EmptyEEESt8functionIFNS_6StatusEvEEEEE6invokeEv+0x1c)[0xffffaf63dcfc]
/usr/local/lib/libarrow.so.1600(_ZNO5arrow8internal6FnOnceIFvvEEclEv+0x54)[0xffff9d352db8]
/usr/local/lib/libarrow.so.1600(+0x1aabcb8)[0xffff9d34bcb8]
/usr/local/lib/libarrow.so.1600(+0x1aacd94)[0xffff9d34cd94]
/usr/local/lib/libarrow.so.1600(+0x1ab1404)[0xffff9d351404]
/usr/local/lib/libarrow.so.1600(+0x1ab13bc)[0xffff9d3513bc]
/usr/local/lib/libarrow.so.1600(+0x1ab1358)[0xffff9d351358]
/usr/local/lib/libarrow.so.1600(+0x1ab132c)[0xffff9d35132c]
/usr/local/lib/libarrow.so.1600(+0x1ab130c)[0xffff9d35130c]
/lib/aarch64-linux-gnu/libstdc++.so.6(+0xdb1cc)[0xffffaefeb1cc]
/usr/local/lib/libarrow.so.1600(_ZN5arrow4util7CerrLog14PrintBackTraceEv+0x34)[0xffff9d325794]
/usr/local/lib/libarrow.so.1600(_ZN5arrow4util7CerrLogD1Ev+0x60)[0xffff9d325704]
/usr/local/lib/libarrow.so.1600(_ZN5arrow4util7CerrLogD0Ev+0x14)[0xffff9d325728]
/usr/local/lib/libarrow.so.1600(_ZN5arrow4util8ArrowLogD1Ev+0x50)[0xffff9d325578]
/usr/local/lib/libarrow.so.1600(_ZN5arrow4util15TempVectorStack5allocEjPPhPi+0xf4)[0xffff9d91e060]
/usr/local/lib/libarrow.so.1600(_ZN5arrow4util16TempVectorHolderIjEC1EPNS0_15TempVectorStackEj+0x58)[0xffff9d6c54b4]
/usr/local/lib/libarrow.so.1600(_ZN5arrow7compute9Hashing3215HashMultiColumnERKSt6vectorINS0_14KeyColumnArrayESaIS3_EEPNS0_12LightContextEPj+0xa4)[0xffff9d6c0ae0]
/usr/local/lib/libarrow.so.1600(_ZN5arrow7compute9Hashing329HashBatchERKNS0_9ExecBatchEPjRSt6vectorINS0_14KeyColumnArrayESaIS7_EElPNS_4util15TempVectorStackEll+0x130)[0xffff9d6c12cc]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero18SwissTableWithKeys4HashEPNS1_5InputEPjl+0x110)[0xffffaf681410]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero18JoinProbeProcessor11OnNextBatchElRKNS_7compute9ExecBatchEPNS_4util15TempVectorStackEPSt6vectorINS2_14KeyColumnArrayESaISA_EE+0x228)[0xffffaf687080]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero9SwissJoin16ProbeSingleBatchEmNS_7compute9ExecBatchE+0x300)[0xffffaf68f2a8]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero12HashJoinNode16OnProbeSideBatchEmNS_7compute9ExecBatchE+0x25c)[0xffffaf601800]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero12HashJoinNode13InputReceivedEPNS0_8ExecNodeENS_7compute9ExecBatchE+0x1b0)[0xffffaf6022c8]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero12HashJoinNode19OutputBatchCallbackENS_7compute9ExecBatchE+0x80)[0xffffaf603820]
/usr/local/lib/libarrow_acero.so.1600(_ZZN5arrow5acero12HashJoinNode4InitEvENKUllNS_7compute9ExecBatchEE4_clElS3_+0x5c)[0xffffaf602bd0]
/usr/local/lib/libarrow_acero.so.1600(_ZSt13__invoke_implIN5arrow6StatusERZNS0_5acero12HashJoinNode4InitEvEUllNS0_7compute9ExecBatchEE_JlS5_EET_St14__invoke_otherOT0_DpOT1_+0x80)[0xffffaf619114]
/usr/local/lib/libarrow_acero.so.1600(_ZSt10__invoke_rIN5arrow6StatusERZNS0_5acero12HashJoinNode4InitEvEUllNS0_7compute9ExecBatchEE_JlS5_EENSt9enable_ifIX16is_invocable_r_vIT_T0_DpT1_EES9_E4typeEOSA_DpOSB_+0x70)[0xffffaf61481c]
/usr/local/lib/libarrow_acero.so.1600(_ZNSt17_Function_handlerIFN5arrow6StatusElNS0_7compute9ExecBatchEEZNS0_5acero12HashJoinNode4InitEvEUllS3_E_E9_M_invokeERKSt9_Any_dataOlOS3_+0x6c)[0xffffaf60e9c0]
/usr/local/lib/libarrow_acero.so.1600(_ZNKSt8functionIFN5arrow6StatusElNS0_7compute9ExecBatchEEEclElS3_+0xa8)[0xffffaf5cf6f0]
/usr/local/lib/libarrow_acero.so.1600(+0x506ddc)[0xffffaf686ddc]
/usr/local/lib/libarrow_acero.so.1600(+0x50bfe0)[0xffffaf68bfe0]
/usr/local/lib/libarrow_acero.so.1600(+0x50b0c8)[0xffffaf68b0c8]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero18JoinProbeProcessor11OnNextBatchElRKNS_7compute9ExecBatchEPNS_4util15TempVectorStackEPSt6vectorINS2_14KeyColumnArrayESaISA_EE+0x8e8)[0xffffaf687740]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero9SwissJoin16ProbeSingleBatchEmNS_7compute9ExecBatchE+0x300)[0xffffaf68f2a8]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero12HashJoinNode16OnProbeSideBatchEmNS_7compute9ExecBatchE+0x25c)[0xffffaf601800]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero12HashJoinNode13InputReceivedEPNS0_8ExecNodeENS_7compute9ExecBatchE+0x1b0)[0xffffaf6022c8]
/usr/local/lib/libarrow_acero.so.1600(_ZN5arrow5acero12HashJoinNode19OutputBatchCallbackENS_7compute9ExecBatchE+0x80)[0xffffaf603820]
/lib/aarch64-linux-gnu/libc.so.6(+0x837d0)[0xffffb52b37d0]
/lib/aarch64-linux-gnu/libc.so.6(+0xef54c)[0xffffb531f54c]
Aborted (core dumped)
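
In the backtrace above, `JoinProbeProcessor::OnNextBatch` appears three times on the same thread, re-entered through `HashJoinNode::OutputBatchCallback`, and the failing `TempVectorStack::alloc` sits in the innermost call. One plausible reading is that each nested probe holds scratch space on the same per-thread stack while it calls downstream, so the total grows with the number of chained joins. The toy model below illustrates only that idea. It is not Arrow's implementation, and `ToyTempStack`, `probe`, `max_depth`, and the 24 KiB per-level figure are hypothetical.

```python
class ToyTempStack:
    """Toy bump allocator with a fixed capacity (NOT Arrow's TempVectorStack)."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.top = 0

    def alloc(self, num_bytes):
        new_top = self.top + num_bytes
        # Same spirit as the failed debug check `top_ <= buffer_size_`.
        if new_top > self.buffer_size:
            raise OverflowError(f"top {new_top} > buffer_size {self.buffer_size}")
        self.top = new_top

    def release(self, num_bytes):
        self.top -= num_bytes


def probe(stack, depth, bytes_per_level):
    """Simulate `depth` chained probes, each holding scratch space while calling downstream."""
    if depth == 0:
        return
    stack.alloc(bytes_per_level)
    try:
        probe(stack, depth - 1, bytes_per_level)
    finally:
        # Scratch space is only returned once the downstream call has finished.
        stack.release(bytes_per_level)


def max_depth(buffer_size, bytes_per_level):
    """Return the deepest nesting that fits into a stack of buffer_size bytes."""
    stack = ToyTempStack(buffer_size)
    depth = 0
    while True:
        try:
            probe(stack, depth + 1, bytes_per_level)
        except OverflowError:
            return depth
        depth += 1
```

With the made-up figure of 24 KiB per level, `max_depth(64 * 1024, 24 * 1024)` gives 2 and `max_depth(256 * 1024, 24 * 1024)` gives 10. In this model a larger stack only raises the limit; it does not remove it.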

@kou
Member

kou commented Jan 24, 2024

Oh, sorry. I missed this. I'll try this.

@kou
Member

kou commented Jan 27, 2024

Hmm. I couldn't reproduce this in my environment...
It seems that the buffer size isn't large enough for this case in your environment. The following isn't a real fix, but does it resolve this case in your environment?

diff --git a/cpp/src/arrow/compute/row/grouper.cc b/cpp/src/arrow/compute/row/grouper.cc
index 5e23eda16f..bdf2f52572 100644
--- a/cpp/src/arrow/compute/row/grouper.cc
+++ b/cpp/src/arrow/compute/row/grouper.cc
@@ -533,7 +533,7 @@ struct GrouperFastImpl : public Grouper {
     auto impl = std::make_unique<GrouperFastImpl>();
     impl->ctx_ = ctx;
 
-    RETURN_NOT_OK(impl->temp_stack_.Init(ctx->memory_pool(), 64 * minibatch_size_max_));
+    RETURN_NOT_OK(impl->temp_stack_.Init(ctx->memory_pool(), 256 * minibatch_size_max_));
     impl->encode_ctx_.hardware_flags =
         arrow::internal::CpuInfo::GetInstance()->hardware_flags();
     impl->encode_ctx_.stack = &impl->temp_stack_;

@stenlarsson
Contributor Author

Thanks for looking into this. Your change has no effect; however, this seems to help:

diff --git a/cpp/src/arrow/acero/query_context.cc b/cpp/src/arrow/acero/query_context.cc
index 9f838508f..f5558f6fc 100644
--- a/cpp/src/arrow/acero/query_context.cc
+++ b/cpp/src/arrow/acero/query_context.cc
@@ -53,7 +53,7 @@ size_t QueryContext::max_concurrency() const { return thread_indexer_.Capacity()
 Result<util::TempVectorStack*> QueryContext::GetTempStack(size_t thread_index) {
   if (!tld_[thread_index].is_init) {
     RETURN_NOT_OK(tld_[thread_index].stack.Init(
-        memory_pool(), 8 * util::MiniBatch::kMiniBatchLength * sizeof(uint64_t)));
+        memory_pool(), 256 * util::MiniBatch::kMiniBatchLength * sizeof(uint64_t)));
     tld_[thread_index].is_init = true;
   }
   return &tld_[thread_index].stack;

@rejeep

rejeep commented Feb 8, 2024

Hey! Any updates on this? We are still on Arrow 10 because of this bug. Also, this is not Ruby-specific so perhaps remove the Ruby component label and update the issue title? Thanks!

@kou kou changed the title [Ruby] Random hangs when joining tables with ExecutePlan in Ruby [C++][Acero] Random hangs when joining tables with ExecutePlan Feb 8, 2024
@kou
Member

kou commented Feb 8, 2024

Oh, sorry. I missed comments again...
I can't reproduce this on my local machine but could you open a pull request with the change? Let's discuss the approach on the PR.

@stenlarsson
Contributor Author

@kou I can open a PR, but how do I know if 256 is a good value? Since I don't understand what is happening, maybe there is a situation where 256 is not enough either?

I used the value 256 since that was what you used in your patch, but I see now that I should have used 32 to get the same size (four times larger).
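
The sizes under discussion can be worked out with a few lines of arithmetic. This is a rough sketch: it assumes `util::MiniBatch::kMiniBatchLength` is 1024 and `sizeof(uint64_t)` is 8, neither of which is stated in this thread, and `temp_stack_bytes` is a hypothetical helper.

```python
MINI_BATCH_LENGTH = 1024  # assumption: value of util::MiniBatch::kMiniBatchLength
SIZEOF_UINT64 = 8         # assumption: sizeof(uint64_t) in bytes


def temp_stack_bytes(multiplier):
    """Stack size in bytes for `multiplier * kMiniBatchLength * sizeof(uint64_t)`."""
    return multiplier * MINI_BATCH_LENGTH * SIZEOF_UINT64


# 8 is the original multiplier, 32 is four times larger, 256 is what the patch used.
for multiplier in (8, 32, 256):
    size = temp_stack_bytes(multiplier)
    factor = multiplier // 8
    print(f"{multiplier:>3}: {size} bytes ({size // 1024} KiB), {factor}x the original")
```

Under those assumptions the original stack is 64 KiB, a multiplier of 32 makes it 256 KiB (the same fourfold increase as the `grouper.cc` patch, which went from 64 to 256), and a multiplier of 256 makes it 2 MiB, which is 32 times the original.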

@kou
Member

kou commented Feb 8, 2024

Let's discuss that on the PR too. :-)

stenlarsson added a commit to stenlarsson/arrow that referenced this issue Feb 8, 2024
Certain Acero execution plans can cause an overflow of the TempVectorStack initialized by the QueryContext, and increasing the size of the stack fixes the problem. I don't know exactly what causes the overflow, so I haven't written a test for it.

Fixes apache#39582.
@stenlarsson
Contributor Author

Ok, PR created: #40007

pitrou pushed a commit to stenlarsson/arrow that referenced this issue Feb 26, 2024
@pitrou pitrou added this to the 15.0.1 milestone Feb 26, 2024
pitrou added a commit that referenced this issue Feb 26, 2024
We have had problems for a long time with a specific batch job that combines data from different sources. There is something in the data causing an Acero execution plan to hang or crash at random. The problem has been reproduced since Arrow 11.0.0, originally in Ruby, but it has also been reproduced in Python. There is unfortunately no test case that reliably reproduces the issue in a release build.

However, in a debug build we can see that the batch job causes an overflow on the temp stack in arrow/cpp/src/arrow/compute/util.cc:38. Increasing the size of the stack created in the Acero QueryContext works around the issue, but a real fix should be investigated separately.

**This PR contains a "Critical Fix".**
* Closes: #39582

Lead-authored-by: Sten Larsson <sten@burtcorp.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou modified the milestones: 15.0.1, 16.0.0 Feb 26, 2024
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
@stenlarsson
Contributor Author

The milestone for this issue is 15.0.1, but the changes in the corresponding PR don't seem to be included in the 15.0.1 release? 🤔

@pitrou
Member

pitrou commented Mar 11, 2024

> The milestone for this issue is 15.0.1, but the changes in the corresponding PR don't seem to be included in the 15.0.1 release? 🤔

Ping @raulcd

@raulcd
Member

raulcd commented Mar 13, 2024

This was merged after the code freeze and tagged as 15.0.1 when merged. It did not make it into 15.0.1. I am adding it to 15.0.2.

@raulcd raulcd modified the milestones: 15.0.1, 15.0.2 Mar 13, 2024
raulcd pushed a commit that referenced this issue Mar 13, 2024
@amoeba amoeba added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Mar 21, 2024