[SPARK-47690][SQL] Enable hash aggregation support for all collations (StringType) #46640

uros-db · 2024-05-17T13:04:39Z

What changes were proposed in this pull request?

Enable collation support for hash aggregation on StringType, for aggregates where aggregate expressions don't include a non-binary collation expression. Note: support for complex types will be added separately.

The logical plan is rewritten during analysis to replace non-binary collated strings with CollationKey.
CollationKey is a unary expression that transforms StringType to BinaryType.
Collation keys allow correct and efficient string comparison under specific collation rules.
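
To illustrate why a binary collation key makes hash-based grouping possible, here is a minimal sketch in plain Scala. It is not the Spark implementation: it assumes the JDK's java.text.Collator at PRIMARY strength as a stand-in for a case-insensitive collation, and the names CollationKeySketch, collationKey and countByCollation are hypothetical.

```scala
import java.text.Collator
import java.util.Locale

object CollationKeySketch {
  // JDK collator at PRIMARY strength: case (and accent) differences are ignored.
  // Assumption: this only approximates a case-insensitive Spark collation.
  private val collator: Collator = {
    val c = Collator.getInstance(Locale.ENGLISH)
    c.setStrength(Collator.PRIMARY)
    c
  }

  // Stand-in for the CollationKey expression: string in, binary key out.
  // Seq[Byte] is used instead of Array[Byte] so equals/hashCode compare contents.
  def collationKey(s: String): Seq[Byte] =
    collator.getCollationKey(s).toByteArray.toSeq

  // Hash-based grouping on the binary key, counting rows per group.
  def countByCollation(values: Seq[String]): Map[Seq[Byte], Int] =
    values.groupBy(collationKey).map { case (key, rows) => key -> rows.size }
}
```

With this sketch, strings that are equal under the collation (for example "aaa" and "AAA") map to the same byte sequence, so an ordinary hash map can group them without calling a collation-aware comparator for every pair of rows.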

Why are the changes needed?

Improve GROUP BY performance for collated strings.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

E2e SQL tests for RewriteGroupByCollation in CollationSuite
Various queries with GROUP BY in existing TPCDS collation test suite

Was this patch authored or co-authored using generative AI tooling?

No.

dbatomic · 2024-05-31T12:37:02Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteGroupByCollation.scala

+    // This rewrite rule is used to enable hash aggregation on collated string columns. However,
+    // hash aggregation is currently only supported for grouping aggregations - this means that no
+    // string type can be found in the aggregate expressions, so we avoid rewrite in this case.
+    !aggregate.aggregateExpressions.exists(e => e.dataType.isInstanceOf[StringType])


I think that you should check if hash aggregation is supported at all, regardless of StringType.
If we are going to end up doing merge agg there is no need to insert collation_key.

If I just use supportsHashAggregate here, the check may report that the aggregate does not support hash aggregation before the rewrite, even though it would support it after the rewrite. As a result, the rewrite rule would never actually execute.

However, we perform this check before doing the plan rewrite, so the point of this check is to verify that the current Aggregate is only a grouping aggregate with respect to StringType (i.e. StringType is not found in aggregateExpressions).

Any ideas on how to make this better?

Maybe you can call supportsHashAggregate by just passing agg keys and empty seq for group by?
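
The ordering problem discussed in this thread can be sketched with a toy model. Everything below is an assumption made for illustration: the types, the hashable rule, and the two-argument supportsHashAggregate are simplified stand-ins and do not reproduce Spark's actual check. The sketch reads the suggestion as "check only the aggregate side and pass an empty sequence for the grouping side".

```scala
object HashAggGuardSketch {
  sealed trait DataType
  case object IntType extends DataType
  case object BinaryType extends DataType
  final case class StringType(collation: String) extends DataType {
    def isBinaryCollation: Boolean = collation == "UTF8_BINARY"
  }

  // Toy assumption: every type is hashable except non-binary collated strings.
  def hashable(t: DataType): Boolean = t match {
    case s: StringType => s.isBinaryCollation
    case _             => true
  }

  // Simplified stand-in for a hash aggregation support check.
  def supportsHashAggregate(
      bufferTypes: Seq[DataType],
      groupingTypes: Seq[DataType]): Boolean =
    (bufferTypes ++ groupingTypes).forall(hashable)

  // Toy rewrite: non-binary collated grouping keys become binary collation keys.
  def rewriteGrouping(groupingTypes: Seq[DataType]): Seq[DataType] =
    groupingTypes.map {
      case s: StringType if !s.isBinaryCollation => BinaryType
      case other                                 => other
    }

  // Guard evaluated before the rewrite: ignore the grouping side entirely,
  // because the rewrite is what will make the grouping side hashable.
  def shouldRewrite(bufferTypes: Seq[DataType]): Boolean =
    supportsHashAggregate(bufferTypes, Seq.empty)
}
```

In this model, checking the full aggregate before the rewrite fails for a collated grouping key, while the guard that passes an empty grouping sequence still allows the rewrite, after which the full check succeeds.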

dbatomic · 2024-05-31T12:48:11Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala

@@ -169,7 +169,11 @@ object ExprUtils extends QueryErrorsBase {
        a.failAnalysis(
          errorClass = "MISSING_GROUP_BY",
          messageParameters = Map.empty)
-      case e: Attribute if !a.groupingExpressions.exists(_.semanticEquals(e)) =>
+      case e: Attribute if !a.groupingExpressions.exists {


Do you need this? AFAIK you are doing CollationKey insertion only when there is no string in the aggregate expressions?

correct, this is not needed - we won't need to "match" aggregate and grouping expressions with collationKey, since we only support collationKey injection for "pure" grouping aggregation on collated string columns

dbatomic · 2024-05-31T12:48:54Z

sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala

+    val newPlan1 = RewriteGroupByCollation(logicalPlan1)
+    val newNewPlan1 = RewriteGroupByCollation(newPlan1)
+    assert(newPlan1 == newNewPlan1)
+    // get the query execution result


this comment is not super useful :)
we should be adding more detailed comments for many things, but checkAnswer is pretty self descriptive :)

dbatomic · 2024-05-31T12:50:12Z

sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala

+      assert(collectFirst(queryPlan2) { case _: HashAggregateExec => () }.nonEmpty)
+      assert(collectFirst(queryPlan2) { case _: SortAggregateExec => () }.isEmpty)
+      // check that CollationKey is injected into the Aggregate logical plan
+      assert(collectFirst(queryPlan1) { case s: HashAggregateExec =>


It would be cleaner to explicitly check whether head is an instance of CollationKey, instead of relying on the return type.

agreed, changing
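
One reading of this suggestion, sketched with hypothetical toy classes rather than Spark's own expression types: a check on the data type alone cannot tell a rewritten key apart from a grouping column that was binary to begin with, whereas a check on the node itself can.

```scala
object GroupingKeyCheckSketch {
  sealed trait DataType
  case object BinaryType extends DataType
  case object StringType extends DataType

  sealed trait Expression { def dataType: DataType }
  final case class Column(name: String, dataType: DataType) extends Expression

  // Hypothetical stand-in for CollationKey: wraps a child and yields a binary value.
  final case class CollationKey(child: Expression) extends Expression {
    val dataType: DataType = BinaryType
  }

  // Type-based check: also true for a plain binary grouping column.
  def looksRewrittenByType(grouping: Seq[Expression]): Boolean =
    grouping.headOption.exists(_.dataType == BinaryType)

  // Explicit check: true only when the first grouping expression is a CollationKey.
  def isRewritten(grouping: Seq[Expression]): Boolean =
    grouping.headOption.exists(_.isInstanceOf[CollationKey])
}
```

In this sketch a plain binary column passes the type-based check but not the explicit one, which is why the explicit instance check is the stricter assertion for a test.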

dbatomic · 2024-05-31T13:10:13Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteGroupByCollation.scala

+import org.apache.spark.sql.catalyst.util.CollationFactory
+import org.apache.spark.sql.types.StringType
+
+object RewriteGroupByCollation extends Rule[LogicalPlan] {


Please add a class header explaining why we are doing this and what is being done.

Initial commit

7d819ac

github-actions bot added the SQL label May 17, 2024

uros-db added 2 commits May 27, 2024 08:26

Update tests

b23be79

Remove tests

bbc3690

github-actions bot added the STRUCTURED STREAMING label May 27, 2024

uros-db added 8 commits May 27, 2024 13:18

Merge branch 'master' into hash-agg-str

19a7c5f

Small fixes

cab6d3a

Fixes

9015871

Limit rule

6d84ee0

Merge branch 'apache:master' into hash-agg-str

7e363ff

Merge branch 'master' into hash-agg-str

2ed4558

Merge branch 'apache:master' into hash-agg-str

4d0c7b6

Fix rule

a21989d

uros-db changed the title ~~[WIP][SQL] Enable hash aggregation support for all collations (StringType)~~ [SPARK-47690][SQL] Enable hash aggregation support for all collations (StringType) May 30, 2024

dbatomic reviewed May 31, 2024

dbatomic approved these changes May 31, 2024
