Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix DistinctCount for timestamps with time zone #10043

Merged
merged 2 commits into from
Apr 11, 2024

Conversation

joroKr21
Copy link
Contributor

Which issue does this PR close?

Closes #10042

Rationale for this change

Fixing a regression.

What changes are included in this PR?

Preserve the original data type in the aggregation state.

Are these changes tested?

Tested in SLT.

Are there any user-facing changes?

Internal changes only.

Preserve the original data type in the aggregation state
dt @ Decimal256(_, _) => Box::new(
PrimitiveDistinctCountAccumulator::<Decimal256Type>::new()
.with_data_type(dt.clone()),
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the test for decimal too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, added 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect the test that shows the change is needed. If this change is removed, it failed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is clear for me the change for timestamp with timezone, but I'm not sure about the decimal one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can show that for Decimal128 (not sure why it doesn't trigger for Decimal256). But it's better to be consistent, losing the data type might lead to unexpected consequences.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @joroKr21 and @jayzhan211

Timestamp(Second, _) => {
Box::new(PrimitiveDistinctCountAccumulator::<TimestampSecondType>::new())
}
dt @ Timestamp(Microsecond, _) => Box::new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should defensively always set with_data_type to PrimitiveDistinctCountAccumulator 🤔

I tried to come up with other DataType that doesn't have a 1:1 mapping to its ArrowPrimitiveType and I don't think there is one, but always supplying the data_type might be a more defensive pattern

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering the same. Let me know what you prefer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we do it as a follow on PR? I would like to merge this one asap as it fixes a regression

Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @joroKr21

@alamb alamb merged commit e24b058 into apache:main Apr 11, 2024
24 checks passed
@joroKr21 joroKr21 deleted the bugfix/count-distinct-ts branch April 11, 2024 12:37
joroKr21 added a commit to coralogix/arrow-datafusion that referenced this pull request Apr 11, 2024
* Fix DistinctCount for timestamps with time zone

Preserve the original data type in the aggregation state

* Add tests for decimal count distinct
joroKr21 added a commit to coralogix/arrow-datafusion that referenced this pull request Apr 11, 2024
* Fix DistinctCount for timestamps with time zone (apache#10043)

* Fix DistinctCount for timestamps with time zone

Preserve the original data type in the aggregation state

* Add tests for decimal count distinct

* Fix with_new_children for EmptyExec
alamb pushed a commit to alamb/datafusion that referenced this pull request Apr 16, 2024
* Fix DistinctCount for timestamps with time zone

Preserve the original data type in the aggregation state

* Add tests for decimal count distinct
alamb added a commit that referenced this pull request Apr 17, 2024
* Fix DistinctCount for timestamps with time zone

Preserve the original data type in the aggregation state

* Add tests for decimal count distinct

Co-authored-by: Georgi Krastev <georgi.krastev@coralogix.com>
appletreeisyellow pushed a commit to influxdata/arrow-datafusion that referenced this pull request Apr 24, 2024
* Fix DistinctCount for timestamps with time zone

Preserve the original data type in the aggregation state

* Add tests for decimal count distinct
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions sqllogictest
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression: group by count distinct doesn't work for timestamps with time zone
4 participants