[BUG] test_hash_reduction_collect_set_on_nested_array_type failed in a distributed environment #10133

Closed
sameerz opened this issue Dec 31, 2023 · 9 comments
Labels: bug (Something isn't working), cudf_dependency (An issue or PR with this label depends on a new feature in cudf)

sameerz (Collaborator) commented Dec 31, 2023

Describe the bug

test_hash_reduction_collect_set_on_nested_array_type failed in a distributed environment with the error "Type conversion is not allowed..."

[2023-12-30T22:21:16.115Z] E                   Caused by: java.lang.AssertionError: Type conversion is not allowed from LIST(STRUCT(INT8,INT16,INT32,INT64,FLOAT32,FLOAT64,STRING,BOOL8,TIMESTAMP_DAYS,TIMESTAMP_MICROSECONDS,INT8)) to ArrayType(ArrayType(StructType(StructField(child0,ByteType,true),StructField(child1,ShortType,true),StructField(child2,IntegerType,true),StructField(child3,LongType,true),StructField(child4,FloatType,true),StructField(child5,DoubleType,true),StructField(child6,StringType,true),StructField(child7,BooleanType,true),StructField(child8,DateType,true),StructField(child9,TimestampType,true),StructField(child10,NullType,true)),true),false) expected LIST(LIST(STRUCT(INT8,INT16,INT32,INT64,FLOAT32,FLOAT64,STRING,BOOL8,TIMESTAMP_DAYS,TIMESTAMP_MICROSECONDS,INT8)))
Detailed output
[2023-12-30T22:21:16.114Z] _ test_hash_reduction_collect_set_on_nested_array_type[[('a', RepeatSeq(Long)), ('b', RepeatSeq(Array(Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3', Long],['child4', Float],['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null]))))]] _
[2023-12-30T22:21:16.114Z] 
[2023-12-30T22:21:16.114Z] data_gen = [('a', RepeatSeq(Long)), ('b', RepeatSeq(Array(Struct(['child0', Byte],['child1', Short],['child2', Integer],['child3'...['child5', Double],['child6', String],['child7', Boolean],['child8', Date],['child9', Timestamp],['child10', Null]))))]
[2023-12-30T22:21:16.114Z] 
[2023-12-30T22:21:16.114Z]     @ignore_order(local=True, arrays=["collect_set"])
[2023-12-30T22:21:16.114Z]     @allow_non_gpu("ProjectExec", *non_utc_allow)
[2023-12-30T22:21:16.114Z]     @pytest.mark.parametrize('data_gen', _gen_data_for_collect_set_op_nested, ids=idfn)
[2023-12-30T22:21:16.114Z]     def test_hash_reduction_collect_set_on_nested_array_type(data_gen):
[2023-12-30T22:21:16.114Z]         conf = copy_and_update(_float_conf, {
[2023-12-30T22:21:16.114Z]             "spark.rapids.sql.castFloatToString.enabled": "true",
[2023-12-30T22:21:16.114Z]         })
[2023-12-30T22:21:16.114Z]     
[2023-12-30T22:21:16.114Z]         def do_it(spark):
[2023-12-30T22:21:16.114Z]             return gen_df(spark, data_gen, length=100)\
[2023-12-30T22:21:16.114Z]                 .agg(f.collect_set('b').alias("collect_set"))
[2023-12-30T22:21:16.114Z]     
[2023-12-30T22:21:16.114Z] >       assert_gpu_and_cpu_are_equal_collect(do_it, conf=conf)
[2023-12-30T22:21:16.114Z] 
[2023-12-30T22:21:16.114Z] ../../src/main/python/hash_aggregate_test.py:734: 
[2023-12-30T22:21:16.114Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2023-12-30T22:21:16.114Z] ../../src/main/python/asserts.py:595: in assert_gpu_and_cpu_are_equal_collect
[2023-12-30T22:21:16.114Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)
[2023-12-30T22:21:16.114Z] ../../src/main/python/asserts.py:503: in _assert_gpu_and_cpu_are_equal
[2023-12-30T22:21:16.114Z]     from_gpu = run_on_gpu()
[2023-12-30T22:21:16.114Z] ../../src/main/python/asserts.py:496: in run_on_gpu
[2023-12-30T22:21:16.114Z]     from_gpu = with_gpu_session(bring_back, conf=conf)
[2023-12-30T22:21:16.114Z] ../../src/main/python/spark_session.py:164: in with_gpu_session
[2023-12-30T22:21:16.114Z]     return with_spark_session(func, conf=copy)
[2023-12-30T22:21:16.114Z] /opt/miniconda3/lib/python3.8/contextlib.py:75: in inner
[2023-12-30T22:21:16.114Z]     return func(*args, **kwds)
[2023-12-30T22:21:16.114Z] ../../src/main/python/spark_session.py:131: in with_spark_session
[2023-12-30T22:21:16.114Z]     ret = func(_spark)
[2023-12-30T22:21:16.114Z] ../../src/main/python/asserts.py:205: in <lambda>
[2023-12-30T22:21:16.114Z]     bring_back = lambda spark: limit_func(spark).collect()
[2023-12-30T22:21:16.114Z] /var/lib/jenkins/spark/spark-3.3.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/dataframe.py:817: in collect
[2023-12-30T22:21:16.114Z]     sock_info = self._jdf.collectToPython()
[2023-12-30T22:21:16.114Z] /var/lib/jenkins/spark/spark-3.3.0-bin-hadoop3.2/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py:1321: in __call__
[2023-12-30T22:21:16.114Z]     return_value = get_return_value(
[2023-12-30T22:21:16.114Z] /var/lib/jenkins/spark/spark-3.3.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py:190: in deco
[2023-12-30T22:21:16.114Z]     return f(*a, **kw)
[2023-12-30T22:21:16.114Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2023-12-30T22:21:16.114Z] 
[2023-12-30T22:21:16.114Z] answer = 'xro1818544'
[2023-12-30T22:21:16.114Z] gateway_client = 
[2023-12-30T22:21:16.114Z] target_id = 'o1818543', name = 'collectToPython'
[2023-12-30T22:21:16.114Z] 
[2023-12-30T22:21:16.114Z]     def get_return_value(answer, gateway_client, target_id=None, name=None):
[2023-12-30T22:21:16.114Z]         """Converts an answer received from the Java gateway into a Python object.
[2023-12-30T22:21:16.114Z]     
[2023-12-30T22:21:16.114Z]         For example, string representation of integers are converted to Python
[2023-12-30T22:21:16.114Z]         integer, string representation of objects are converted to JavaObject
[2023-12-30T22:21:16.114Z]         instances, etc.
[2023-12-30T22:21:16.114Z]     
[2023-12-30T22:21:16.114Z]         :param answer: the string returned by the Java gateway
[2023-12-30T22:21:16.114Z]         :param gateway_client: the gateway client used to communicate with the Java
[2023-12-30T22:21:16.114Z]             Gateway. Only necessary if the answer is a reference (e.g., object,
[2023-12-30T22:21:16.114Z]             list, map)
[2023-12-30T22:21:16.114Z]         :param target_id: the name of the object from which the answer comes from
[2023-12-30T22:21:16.114Z]             (e.g., *object1* in `object1.hello()`). Optional.
[2023-12-30T22:21:16.114Z]         :param name: the name of the member from which the answer comes from
[2023-12-30T22:21:16.114Z]             (e.g., *hello* in `object1.hello()`). Optional.
[2023-12-30T22:21:16.114Z]         """
[2023-12-30T22:21:16.114Z]         if is_error(answer)[0]:
[2023-12-30T22:21:16.114Z]             if len(answer) > 1:
[2023-12-30T22:21:16.114Z]                 type = answer[1]
[2023-12-30T22:21:16.114Z]                 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
[2023-12-30T22:21:16.114Z]                 if answer[1] == REFERENCE_TYPE:
[2023-12-30T22:21:16.114Z] >                   raise Py4JJavaError(
[2023-12-30T22:21:16.114Z]                         "An error occurred while calling {0}{1}{2}.\n".
[2023-12-30T22:21:16.114Z]                         format(target_id, ".", name), value)
[2023-12-30T22:21:16.114Z] E                   py4j.protocol.Py4JJavaError: An error occurred while calling o1818543.collectToPython.
[2023-12-30T22:21:16.114Z] E                   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 25 in stage 25094.0 failed 1 times, most recent failure: Lost task 25.0 in stage 25094.0 (TID 786354) (10.136.6.4 executor 2): java.lang.AssertionError: Type conversion is not allowed from LIST(STRUCT(INT8,INT16,INT32,INT64,FLOAT32,FLOAT64,STRING,BOOL8,TIMESTAMP_DAYS,TIMESTAMP_MICROSECONDS,INT8)) to ArrayType(ArrayType(StructType(StructField(child0,ByteType,true),StructField(child1,ShortType,true),StructField(child2,IntegerType,true),StructField(child3,LongType,true),StructField(child4,FloatType,true),StructField(child5,DoubleType,true),StructField(child6,StringType,true),StructField(child7,BooleanType,true),StructField(child8,DateType,true),StructField(child9,TimestampType,true),StructField(child10,NullType,true)),true),false) expected LIST(LIST(STRUCT(INT8,INT16,INT32,INT64,FLOAT32,FLOAT64,STRING,BOOL8,TIMESTAMP_DAYS,TIMESTAMP_MICROSECONDS,INT8)))
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:710)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.AggHelper.$anonfun$performReduction$3(GpuAggregateExec.scala:363)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.AggHelper.$anonfun$performReduction$2(GpuAggregateExec.scala:361)
[2023-12-30T22:21:16.114Z] E                   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[2023-12-30T22:21:16.114Z] E                   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[2023-12-30T22:21:16.114Z] E                   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.AggHelper.$anonfun$performReduction$1(GpuAggregateExec.scala:357)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.AggHelper.performReduction(GpuAggregateExec.scala:355)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.AggHelper.aggregate(GpuAggregateExec.scala:294)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$4(GpuAggregateExec.scala:311)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$3(GpuAggregateExec.scala:309)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$2(GpuAggregateExec.scala:308)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:477)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:613)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
[2023-12-30T22:21:16.114Z] E                   	at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
[2023-12-30T22:21:16.114Z] E                   	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.aggregateInputBatches(GpuAggregateExec.scala:795)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.$anonfun$next$2(GpuAggregateExec.scala:752)
[2023-12-30T22:21:16.114Z] E                   	at scala.Option.getOrElse(Option.scala:189)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:749)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:711)
[2023-12-30T22:21:16.114Z] E                   	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.$anonfun$next$6(GpuAggregateExec.scala:2042)
[2023-12-30T22:21:16.114Z] E                   	at scala.Option.map(Option.scala:230)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:2042)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:1906)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:333)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:355)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2(RapidsShuffleInternalManagerBase.scala:285)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2$adapted(RapidsShuffleInternalManagerBase.scala:278)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1(RapidsShuffleInternalManagerBase.scala:278)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1$adapted(RapidsShuffleInternalManagerBase.scala:277)
[2023-12-30T22:21:16.114Z] E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.write(RapidsShuffleInternalManagerBase.scala:277)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.scheduler.Task.run(Task.scala:136)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
[2023-12-30T22:21:16.114Z] E                   	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[2023-12-30T22:21:16.114Z] E                   	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[2023-12-30T22:21:16.114Z] E                   	at java.base/java.lang.Thread.run(Thread.java:833)
[2023-12-30T22:21:16.114Z] E                   	Suppressed: com.nvidia.spark.rapids.jni.GpuRetryOOM: injected RetryOOM
[2023-12-30T22:21:16.114Z] E                   		at ai.rapids.cudf.ColumnView.reduce(Native Method)
[2023-12-30T22:21:16.114Z] E                   		at ai.rapids.cudf.ColumnView.reduce(ColumnView.java:1583)
[2023-12-30T22:21:16.114Z] E                   		at org.apache.spark.sql.rapids.aggregate.CudfCollectSet.$anonfun$reductionAggregate$7(aggregateFunctions.scala:140)
[2023-12-30T22:21:16.114Z] E                   		... 47 more
[2023-12-30T22:21:16.114Z] E                   
[2023-12-30T22:21:16.114Z] E                   Driver stacktrace:
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
[2023-12-30T22:21:16.114Z] E                   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[2023-12-30T22:21:16.114Z] E                   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[2023-12-30T22:21:16.114Z] E                   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
[2023-12-30T22:21:16.114Z] E                   	at scala.Option.foreach(Option.scala:407)
[2023-12-30T22:21:16.114Z] E                   	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2293)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.rdd.RDD.collect(RDD.scala:1020)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:424)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3688)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3858)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:510)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3856)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3856)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3685)
[2023-12-30T22:21:16.115Z] E                   	at jdk.internal.reflect.GeneratedMethodAccessor100.invoke(Unknown Source)
[2023-12-30T22:21:16.115Z] E                   	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2023-12-30T22:21:16.115Z] E                   	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
[2023-12-30T22:21:16.115Z] E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
[2023-12-30T22:21:16.115Z] E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
[2023-12-30T22:21:16.115Z] E                   	at py4j.Gateway.invoke(Gateway.java:282)
[2023-12-30T22:21:16.115Z] E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
[2023-12-30T22:21:16.115Z] E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
[2023-12-30T22:21:16.115Z] E                   	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
[2023-12-30T22:21:16.115Z] E                   	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
[2023-12-30T22:21:16.115Z] E                   	at java.base/java.lang.Thread.run(Thread.java:833)
[2023-12-30T22:21:16.115Z] E                   Caused by: java.lang.AssertionError: Type conversion is not allowed from LIST(STRUCT(INT8,INT16,INT32,INT64,FLOAT32,FLOAT64,STRING,BOOL8,TIMESTAMP_DAYS,TIMESTAMP_MICROSECONDS,INT8)) to ArrayType(ArrayType(StructType(StructField(child0,ByteType,true),StructField(child1,ShortType,true),StructField(child2,IntegerType,true),StructField(child3,LongType,true),StructField(child4,FloatType,true),StructField(child5,DoubleType,true),StructField(child6,StringType,true),StructField(child7,BooleanType,true),StructField(child8,DateType,true),StructField(child9,TimestampType,true),StructField(child10,NullType,true)),true),false) expected LIST(LIST(STRUCT(INT8,INT16,INT32,INT64,FLOAT32,FLOAT64,STRING,BOOL8,TIMESTAMP_DAYS,TIMESTAMP_MICROSECONDS,INT8)))
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:710)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.AggHelper.$anonfun$performReduction$3(GpuAggregateExec.scala:363)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.AggHelper.$anonfun$performReduction$2(GpuAggregateExec.scala:361)
[2023-12-30T22:21:16.115Z] E                   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[2023-12-30T22:21:16.115Z] E                   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[2023-12-30T22:21:16.115Z] E                   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.AggHelper.$anonfun$performReduction$1(GpuAggregateExec.scala:357)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.AggHelper.performReduction(GpuAggregateExec.scala:355)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.AggHelper.aggregate(GpuAggregateExec.scala:294)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$4(GpuAggregateExec.scala:311)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$3(GpuAggregateExec.scala:309)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$2(GpuAggregateExec.scala:308)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:477)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:613)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
[2023-12-30T22:21:16.115Z] E                   	at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
[2023-12-30T22:21:16.115Z] E                   	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.aggregateInputBatches(GpuAggregateExec.scala:795)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.$anonfun$next$2(GpuAggregateExec.scala:752)
[2023-12-30T22:21:16.115Z] E                   	at scala.Option.getOrElse(Option.scala:189)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:749)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:711)
[2023-12-30T22:21:16.115Z] E                   	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.$anonfun$next$6(GpuAggregateExec.scala:2042)
[2023-12-30T22:21:16.115Z] E                   	at scala.Option.map(Option.scala:230)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:2042)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:1906)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:333)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:355)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2(RapidsShuffleInternalManagerBase.scala:285)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2$adapted(RapidsShuffleInternalManagerBase.scala:278)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1(RapidsShuffleInternalManagerBase.scala:278)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1$adapted(RapidsShuffleInternalManagerBase.scala:277)
[2023-12-30T22:21:16.115Z] E                   	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.write(RapidsShuffleInternalManagerBase.scala:277)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.scheduler.Task.run(Task.scala:136)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
[2023-12-30T22:21:16.115Z] E                   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
[2023-12-30T22:21:16.115Z] E                   	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[2023-12-30T22:21:16.115Z] E                   	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[2023-12-30T22:21:16.115Z] E                   	... 1 more
[2023-12-30T22:21:16.115Z] E                   	Suppressed: com.nvidia.spark.rapids.jni.GpuRetryOOM: injected RetryOOM
[2023-12-30T22:21:16.115Z] E                   		at ai.rapids.cudf.ColumnView.reduce(Native Method)
[2023-12-30T22:21:16.115Z] E                   		at ai.rapids.cudf.ColumnView.reduce(ColumnView.java:1583)
[2023-12-30T22:21:16.115Z] E                   		at org.apache.spark.sql.rapids.aggregate.CudfCollectSet.$anonfun$reductionAggregate$7(aggregateFunctions.scala:140)
[2023-12-30T22:21:16.115Z] E                   		... 47 more
[2023-12-30T22:21:16.115Z] 
[2023-12-30T22:21:16.115Z] /var/lib/jenkins/spark/spark-3.3.0-bin-hadoop3.2/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py:326: Py4JJavaError
[2023-12-30T22:21:16.115Z] ----------------------------- Captured stdout call -----------------------------
[2023-12-30T22:21:16.115Z] ### CPU RUN ###
[2023-12-30T22:21:16.115Z] ### GPU RUN ###

Steps/Code to reproduce bug
Run integration tests in a distributed environment

Expected behavior
Tests pass

Environment details (please complete the following information)

  • Environment location: YARN
  • Spark configuration settings related to the issue

Additional context

sameerz added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels Dec 31, 2023
mattahrens removed the ? - Needs Triage (Need team to review and classify) label Jan 2, 2024
tgravescs (Collaborator) commented

It looks like this happens when a batch has one row with an empty List inside a List, where the datatype is supposed to be List[List[Something]]. It reproduces very often for me on a single-node standalone cluster with 1 worker. My box has 64 cores.

It's doing a reduction collect_set in this test. The error above, I believe, occurs when the inner type "Something" is a complex type like a struct. If it is something like an INT32 you get a slightly different error:

java.lang.IllegalArgumentException: ArrayType(IntegerType,true) is not supported for GPU processing yet.
        at com.nvidia.spark.rapids.GpuColumnVector.getNonNestedRapidsType(GpuColumnVector.java:429)

So the data type going into the reduction is List[List[INT32]], the inner list is empty, and after the reduction we get back a LIST[INT32], which doesn't match the expected type.

I'm still trying to narrow down exactly where this is happening or where it should be handled.
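
To make that concrete, here is a condensed sketch of the failing shape (hypothetical, distilled from the fuller pyspark repro posted below; assumes a SparkSession with the RAPIDS plugin enabled):

import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, IntegerType

spark = SparkSession.builder.getOrCreate()

# The column type is array<array<int>>, but the only row is null, so the
# reduction never sees a real inner list. The cudf collect_set reduction then
# returns a LIST(INT32) column instead of the expected LIST(LIST(INT32)).
df = spark.createDataFrame([None], ArrayType(ArrayType(IntegerType())))
df.agg(f.collect_set('value')).show()   # CPU: []; GPU: type-mismatch error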

tgravescs (Collaborator) commented

Looking at java/src/test/java/ai/rapids/cudf/ReductionTest.java in cudf, it doesn't look like it has any tests for nested complex columns.

tgravescs (Collaborator) commented Jan 9, 2024

Ok, so this happens when you have a column type like List[List[INT32]] and the data comes in as List[null].

Smaller manual reproduction steps with pyspark against a standalone cluster with 1 worker that has 64 cores and 1 GPU:

CPU:

>>> spark.conf.set("spark.rapids.sql.enabled", "false")
>>> my_x = [None]
>>> my_df = spark.createDataFrame(my_x, ArrayType(ArrayType(IntegerType())))
>>> my_df.agg(f.collect_set('value')).show()
+------------------+
|collect_set(value)|
+------------------+
|                []|
+------------------+

Another CPU case with some valid data (the GPU also fails on this one):

>>> my_x = [None, [(1,2,3)]]
>>> my_df = spark.createDataFrame(my_x, ArrayType(ArrayType(IntegerType())))
>>> my_df.printSchema()
root
 |-- value: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: integer (containsNull = true)

>>> my_df.show()
+-----------+
|      value|
+-----------+
|       null|
|[[1, 2, 3]]|
+-----------+

>>> my_df.agg(f.collect_set('value')).show()
+------------------+
|collect_set(value)|
+------------------+
|     [[[1, 2, 3]]]|
+------------------+

GPU:

>>> spark.conf.set("spark.rapids.sql.enabled", "true")
>>> my_x = [None]
>>> my_df = spark.createDataFrame(my_x, ArrayType(ArrayType(IntegerType())))
>>> my_df.agg(f.collect_set('value')).show()
24/01/09 13:30:37 WARN TaskSetManager: Lost task 0.0 in stage 37.0 (TID 715) (10.28.9.218 executor 0): java.lang.IllegalArgumentException: Type mismatch at table 63column 0 expected LIST but found INT32

A valid case without the array being null:

>>> my_x = [[[1, 100], [2]], [[3, 2]]]
>>> my_df = spark.createDataFrame(my_x, ArrayType(ArrayType(IntegerType())))
>>> my_df.agg(f.collect_set('value')).show()
+--------------------+
|  collect_set(value)|
+--------------------+
|[[[1, 100], [2]],...|
+--------------------+
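
For regression coverage, the null-row case could also be pinned down as a targeted integration test. A minimal sketch, assuming the spark-rapids python test helpers visible in the tracebacks above (asserts.assert_gpu_and_cpu_are_equal_collect); the test name is hypothetical, and a real test would also want @ignore_order(local=True, arrays=["collect_set"]) as hash_aggregate_test.py uses, since set ordering is not deterministic:

import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, IntegerType
from asserts import assert_gpu_and_cpu_are_equal_collect

def test_collect_set_on_null_nested_array():
    def do_it(spark):
        # One null row in an array<array<int>> column is enough to trigger
        # the type mismatch in the GPU reduction.
        return spark.createDataFrame([None, [[1, 2, 3]]],
                                     ArrayType(ArrayType(IntegerType()))) \
                    .agg(f.collect_set('value'))

    assert_gpu_and_cpu_are_equal_collect(do_it)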

tgravescs (Collaborator) commented

Note: for the integration tests, it's again easy to reproduce in standalone mode with 1 worker that has 64 cores and 1 GPU:

PYSP_TEST_spark_master=spark://myStandaloneMaster:7077 TEST_PARALLEL=0 ./integration_tests/run_pyspark_from_build.sh --test_oom_injection_mode=never -s -k test_hash_reduction_collect_set_on_nested_array_type

tgravescs (Collaborator) commented

Note the exception stack traces are different between the two examples I gave above:

Manual pyspark repro:

: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 50.0 failed 4 times, most recent failure: Lost task 0.3 in stage 50.0 (TID 980) (10.28.9.218 executor 0): java.lang.IllegalArgumentException: Type mismatch at table 31column 0 expected LIST but found INT32
        at ai.rapids.cudf.JCudfSerialization.checkCompatibleTypes(JCudfSerialization.java:999)
        at ai.rapids.cudf.JCudfSerialization.checkCompatibleTypes(JCudfSerialization.java:1010)
        at ai.rapids.cudf.JCudfSerialization.checkCompatibleTypes(JCudfSerialization.java:1010)
        at ai.rapids.cudf.JCudfSerialization.checkCompatibleTypes(JCudfSerialization.java:989)
        at ai.rapids.cudf.JCudfSerialization.providersFrom(JCudfSerialization.java:954)
        at ai.rapids.cudf.JCudfSerialization.concatToHostBuffer(JCudfSerialization.java:1820)
        at ai.rapids.cudf.JCudfSerialization.concatToHostBuffer(JCudfSerialization.java:1846)
        at com.nvidia.spark.rapids.HostShuffleCoalesceIterator.$anonfun$concatenateTablesInHost$3(GpuShuffleCoalesceExec.scala:123)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:56)
        at com.nvidia.spark.rapids.HostShuffleCoalesceIterator.$anonfun$concatenateTablesInHost$1(GpuShuffleCoalesceExec.scala:117)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
        at com.nvidia.spark.rapids.HostShuffleCoalesceIterator.concatenateTablesInHost(GpuShuffleCoalesceExec.scala:110)
        at com.nvidia.spark.rapids.HostShuffleCoalesceIterator.next(GpuShuffleCoalesceExec.scala:179)
        at com.nvidia.spark.rapids.HostShuffleCoalesceIterator.next(GpuShuffleCoalesceExec.scala:84)
        at com.nvidia.spark.rapids.GpuShuffleCoalesceIterator.$anonfun$next$2(GpuShuffleCoalesceExec.scala:218)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
        at com.nvidia.spark.rapids.GpuShuffleCoalesceIterator.$anonfun$next$1(GpuShuffleCoalesceExec.scala:214)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
        at com.nvidia.spark.rapids.GpuShuffleCoalesceIterator.next(GpuShuffleCoalesceExec.scala:213)
        at com.nvidia.spark.rapids.GpuShuffleCoalesceIterator.next(GpuShuffleCoalesceExec.scala:199)
        at com.nvidia.spark.rapids.AbstractProjectSplitIterator.next(basicPhysicalOperators.scala:247)
        at com.nvidia.spark.rapids.AbstractProjectSplitIterator.next(basicPhysicalOperators.scala:227)
        at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
        at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at com.nvidia.spark.rapids.GpuMergeAggregateIterator.$anonfun$next$2(GpuAggregateExec.scala:806)
        at scala.Option.getOrElse(Option.scala:189)
        at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:804)
        at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:766)
        at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
        at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.$anonfun$next$11(GpuAggregateExec.scala:2105)
        at scala.Option.map(Option.scala:230)
        at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:2105)
        at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:1969)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:290)
        at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:287)

Integration test test_hash_reduction_collect_set_on_nested_array_type failure:

E                   Caused by: java.lang.IllegalArgumentException: ArrayType(IntegerType,true) is not supported for GPU processing yet.
E                       at com.nvidia.spark.rapids.GpuColumnVector.getNonNestedRapidsType(GpuColumnVector.java:429)
E                       at com.nvidia.spark.rapids.GpuColumnVector.typeConversionAllowed(GpuColumnVector.java:570)
E                       at com.nvidia.spark.rapids.GpuColumnVector.typeConversionAllowed(GpuColumnVector.java:599)
E                       at com.nvidia.spark.rapids.GpuColumnVector.typeConversionAllowed(GpuColumnVector.java:599)
E                       at com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:717)
E                       at com.nvidia.spark.rapids.AggHelper.$anonfun$performReduction$8(GpuAggregateExec.scala:395)
E                       at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
E                       at com.nvidia.spark.rapids.AggHelper.$anonfun$performReduction$5(GpuAggregateExec.scala:376)
E                       at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
E                       at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
E                       at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
E                       at com.nvidia.spark.rapids.AggHelper.$anonfun$performReduction$1(GpuAggregateExec.scala:367)
E                       at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
E                       at com.nvidia.spark.rapids.AggHelper.performReduction(GpuAggregateExec.scala:361)
E                       at com.nvidia.spark.rapids.AggHelper.aggregate(GpuAggregateExec.scala:300)
E                       at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$4(GpuAggregateExec.scala:317)
E                       at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
E                       at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$3(GpuAggregateExec.scala:315)
E                       at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
E                       at com.nvidia.spark.rapids.AggHelper.$anonfun$aggregateWithoutCombine$2(GpuAggregateExec.scala:314)
E                       at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:477)
E                       at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:613)
E                       at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
E                       at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
E                       at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
E                       at com.nvidia.spark.rapids.GpuMergeAggregateIterator.aggregateInputBatches(GpuAggregateExec.scala:858)
E                       at com.nvidia.spark.rapids.GpuMergeAggregateIterator.$anonfun$next$2(GpuAggregateExec.scala:808)
E                       at scala.Option.getOrElse(Option.scala:189)
E                       at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:804)
E                       at com.nvidia.spark.rapids.GpuMergeAggregateIterator.next(GpuAggregateExec.scala:766)
E                       at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
E                       at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.$anonfun$next$11(GpuAggregateExec.scala:2105)
E                       at scala.Option.map(Option.scala:230)
E                       at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:2105)
E                       at com.nvidia.spark.rapids.DynamicGpuPartialSortAggregateIterator.next(GpuAggregateExec.scala:1969)
E                       at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:333)
E                       at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:355)
E                       at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
E                       at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
E                       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
E                       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)

jlowe (Member) commented Jan 29, 2024

Filed rapidsai/cudf#14924.

ttnghia (Collaborator) commented Mar 6, 2024

I can reproduce the bug using the example in #10133 (comment), and have verified that it is fixed by rapidsai/cudf#15243.

ttnghia (Collaborator) commented Mar 15, 2024

This should be closed by rapidsai/cudf#15243.
However, I don't know whether any temporary workaround was added for this issue that now needs to be reverted.

jlowe (Member) commented Mar 15, 2024

No workaround/disable was added for this. Verified that the recent EGX nightly tests, which always failed with this, are now passing.

jlowe closed this as completed Mar 15, 2024