
Use cuco::static_set in the hash-based groupby #14813

Merged
27 commits merged from cuco-set-groupby into rapidsai:branch-24.04 on Feb 29, 2024

Conversation

@PointKernel (Member) commented Jan 19, 2024

Description

Depends on #14849

Contributes to #12261

This PR migrates the hash-based groupby to use the new cuco::static_set data structure. It doesn't change any existing libcudf behavior, but it uncovers the fact that cudf Python's value_counts doesn't guarantee output ordering, which makes this PR a breaking change.
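For context, here is a minimal sketch of the cuco::static_set host-side bulk API that this migration builds on, mirroring the public cuco examples. It is illustrative only: the actual groupby code stores row indices and plugs in libcudf's row hasher and row equality comparator rather than raw integer keys.

```cpp
#include <cuco/static_set.cuh>

#include <thrust/device_vector.h>
#include <thrust/sequence.h>

#include <cstddef>

int main()
{
  std::size_t constexpr num_keys = 10'000;
  thrust::device_vector<int> keys(num_keys);
  thrust::sequence(keys.begin(), keys.end());

  // Allocate roughly 2x the number of keys to keep the load factor low.
  cuco::static_set<int> set{num_keys * 2, cuco::empty_key{-1}};

  // Bulk-insert all keys; duplicates are dropped, so the set ends up holding
  // one slot per distinct key, which is what a hash-based groupby needs.
  set.insert(keys.begin(), keys.end());

  // Copy the distinct keys back out.
  thrust::device_vector<int> unique_keys(set.size());
  set.retrieve_all(unique_keys.begin());

  return 0;
}
```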

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@PointKernel PointKernel added the 2 - In Progress, libcudf, Performance, tech debt, improvement, and non-breaking labels on Jan 19, 2024
@PointKernel PointKernel self-assigned this Jan 23, 2024
PointKernel added a commit to NVIDIA/cuCollections that referenced this pull request Jan 23, 2024
…428)

This PR fixes a bug in the OA (open addressing) base class where the erased key sentinel value should have been initialized to the empty key sentinel if not explicitly specified.

Tests are updated to exercise this issue.

Needed by rapidsai/cudf#14813
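As a side note, here is a tiny hedged sketch of the behavior that fix restores, assuming the cuco host API with erase support (i.e. that the cuco::erased_key constructor overload and the erased_key_sentinel()/empty_key_sentinel() accessors are available); this is not code from either PR:

```cpp
#include <cuco/static_set.cuh>

#include <cassert>

int main()
{
  // When no erased-key sentinel is supplied, it should default to the
  // empty-key sentinel instead of being left uninitialized (the bug fixed
  // in the referenced cuCollections PR).
  cuco::static_set<int> set{1'000, cuco::empty_key{-1}};
  assert(set.erased_key_sentinel() == set.empty_key_sentinel());

  // An explicit erased-key sentinel can still be provided when erase() is needed.
  cuco::static_set<int> erasable_set{1'000, cuco::empty_key{-1}, cuco::erased_key{-2}};
  assert(erasable_set.erased_key_sentinel() == -2);

  return 0;
}
```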
@PointKernel (Member, Author) commented Feb 16, 2024

Question for reviewers: is it worth doing runtime dispatch based on whether nested types are involved?

Benchmark results are shown below. TL;DR:

  • The hash table size is about half of what it was before
  • The new set-based groupby brings 0 to 30% speedups for flat types
  • The improvement for nested types is noticeable but not as large as for flat types

Based on our past tuning experience, e.g.:

```cpp
// Distinguish probing scheme CG sizes between nested and flat types for better performance
auto const probing_scheme = [&]() {
  if constexpr (HasNested) {
    return cuco::linear_probing<4, Hasher>{d_hasher};
  } else {
    return cuco::linear_probing<1, Hasher>{d_hasher};
  }
}();
```
Using a larger CG size (like 2 or 4) for nested types can bring significant speedups compared to the current CGSize == 1 used for all types.
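To make the question concrete, here is a hypothetical sketch of how a runtime nested/flat dispatch could wrap compile-time CG-size selection like the snippet above. All names, the int key type, and the hasher are illustrative placeholders, not the code in this PR:

```cpp
#include <cuco/static_set.cuh>

#include <thrust/functional.h>

#include <cuda/atomic>

#include <cstddef>

// Build a set whose probing-scheme CG size is fixed at compile time.
template <int CGSize>
auto make_groupby_set(std::size_t capacity)
{
  using hasher  = cuco::default_hash_function<int>;
  using probing = cuco::linear_probing<CGSize, hasher>;
  return cuco::static_set<int,
                          cuco::extent<std::size_t>,
                          cuda::thread_scope_device,
                          thrust::equal_to<int>,
                          probing>{capacity, cuco::empty_key{-1}};
}

// Dispatch on a runtime flag describing the key columns.
void groupby_dispatch(std::size_t capacity, bool has_nested_keys)
{
  if (has_nested_keys) {
    auto set = make_groupby_set<4>(capacity);  // more threads per (expensive) nested-key probe
    // ... run the groupby with `set` ...
  } else {
    auto set = make_groupby_set<1>(capacity);  // scalar probing for cheap flat-key comparisons
    // ... run the groupby with `set` ...
  }
}
```

Each branch instantiates a different set type, so the downstream groupby kernels would be compiled once per CG size; that duplication is part of the nontrivial change discussed later in this thread.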

groupby_max

[0] Quadro RTX 8000

| T | num_rows | null_probability | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|----------|------------------|----------|-----------|----------|-----------|------|-------|--------|
| I32 | 2^12 | 0 | 83.292 us | 7.14% | 83.962 us | 7.56% | 0.670 us | 0.80% | PASS |
| I32 | 2^18 | 0 | 123.161 us | 21.91% | 103.673 us | 3.47% | -19.489 us | -15.82% | FAIL |
| I32 | 2^24 | 0 | 2.702 ms | 5.78% | 1.886 ms | 4.61% | -815.711 us | -30.19% | FAIL |
| I32 | 2^12 | 0.1 | 106.195 us | 4.47% | 105.438 us | 5.46% | -0.757 us | -0.71% | PASS |
| I32 | 2^18 | 0.1 | 144.568 us | 4.03% | 130.310 us | 4.62% | -14.258 us | -9.86% | FAIL |
| I32 | 2^24 | 0.1 | 2.713 ms | 2.97% | 1.956 ms | 3.43% | -756.833 us | -27.90% | FAIL |
| I32 | 2^12 | 0.9 | 106.164 us | 5.38% | 105.230 us | 5.81% | -0.933 us | -0.88% | PASS |
| I32 | 2^18 | 0.9 | 134.932 us | 3.71% | 126.974 us | 4.86% | -7.958 us | -5.90% | FAIL |
| I32 | 2^24 | 0.9 | 2.396 ms | 0.49% | 1.700 ms | 0.94% | -695.351 us | -29.03% | FAIL |
| I64 | 2^12 | 0 | 79.217 us | 4.74% | 81.464 us | 5.33% | 2.247 us | 2.84% | PASS |
| I64 | 2^18 | 0 | 124.065 us | 3.35% | 110.290 us | 3.71% | -13.776 us | -11.10% | FAIL |
| I64 | 2^24 | 0 | 2.921 ms | 5.13% | 2.080 ms | 3.67% | -841.509 us | -28.81% | FAIL |
| I64 | 2^12 | 0.1 | 107.544 us | 5.30% | 106.271 us | 5.33% | -1.273 us | -1.18% | PASS |
| I64 | 2^18 | 0.1 | 158.161 us | 12.67% | 138.162 us | 3.74% | -19.999 us | -12.64% | FAIL |
| I64 | 2^24 | 0.1 | 2.935 ms | 3.03% | 2.149 ms | 2.33% | -786.224 us | -26.79% | FAIL |
| I64 | 2^12 | 0.9 | 106.698 us | 5.45% | 105.586 us | 5.56% | -1.112 us | -1.04% | PASS |
| I64 | 2^18 | 0.9 | 141.798 us | 3.56% | 131.287 us | 5.12% | -10.512 us | -7.41% | FAIL |
| I64 | 2^24 | 0.9 | 2.528 ms | 0.41% | 1.843 ms | 0.67% | -684.869 us | -27.09% | FAIL |
| F32 | 2^12 | 0 | 87.432 us | 4.34% | 87.644 us | 5.17% | 0.212 us | 0.24% | PASS |
| F32 | 2^18 | 0 | 139.932 us | 5.17% | 120.252 us | 3.90% | -19.680 us | -14.06% | FAIL |
| F32 | 2^24 | 0 | 3.034 ms | 8.05% | 2.159 ms | 8.12% | -875.115 us | -28.85% | FAIL |
| F32 | 2^12 | 0.1 | 118.647 us | 4.18% | 117.394 us | 4.97% | -1.254 us | -1.06% | PASS |
| F32 | 2^18 | 0.1 | 167.872 us | 5.11% | 145.176 us | 3.74% | -22.696 us | -13.52% | FAIL |
| F32 | 2^24 | 0.1 | 2.997 ms | 5.81% | 2.172 ms | 6.50% | -825.041 us | -27.53% | FAIL |
| F32 | 2^12 | 0.9 | 121.641 us | 6.73% | 111.989 us | 5.16% | -9.652 us | -7.93% | FAIL |
| F32 | 2^18 | 0.9 | 143.159 us | 4.11% | 131.017 us | 3.83% | -12.142 us | -8.48% | FAIL |
| F32 | 2^24 | 0.9 | 2.489 ms | 0.43% | 1.787 ms | 0.84% | -702.675 us | -28.23% | FAIL |
| F64 | 2^12 | 0 | 87.555 us | 4.14% | 88.347 us | 4.57% | 0.792 us | 0.90% | PASS |
| F64 | 2^18 | 0 | 156.124 us | 17.01% | 127.645 us | 4.27% | -28.480 us | -18.24% | FAIL |
| F64 | 2^24 | 0 | 3.300 ms | 8.27% | 2.337 ms | 8.59% | -963.100 us | -29.19% | FAIL |
| F64 | 2^12 | 0.1 | 119.458 us | 4.38% | 117.437 us | 5.24% | -2.021 us | -1.69% | PASS |
| F64 | 2^18 | 0.1 | 173.348 us | 4.67% | 152.983 us | 4.03% | -20.365 us | -11.75% | FAIL |
| F64 | 2^24 | 0.1 | 3.249 ms | 6.93% | 2.349 ms | 5.91% | -900.036 us | -27.70% | FAIL |
| F64 | 2^12 | 0.9 | 112.170 us | 4.52% | 113.566 us | 5.73% | 1.396 us | 1.24% | PASS |
| F64 | 2^18 | 0.9 | 148.631 us | 3.42% | 135.485 us | 3.78% | -13.146 us | -8.84% | FAIL |
| F64 | 2^24 | 0.9 | 2.612 ms | 0.49% | 1.906 ms | 0.64% | -705.332 us | -27.01% | FAIL |

groupby_struct_keys

[0] Quadro RTX 8000

| NumRows | Depth | Nulls | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---------|-------|-------|----------|-----------|----------|-----------|------|-------|--------|
| 2^10 | 0 | 0 | 71.830 us | 4.79% | 68.804 us | 6.57% | -3.026 us | -4.21% | PASS |
| 2^16 | 0 | 0 | 78.070 us | 5.15% | 81.142 us | 5.82% | 3.072 us | 3.94% | PASS |
| 2^20 | 0 | 0 | 267.730 us | 8.76% | 228.207 us | 12.14% | -39.523 us | -14.76% | FAIL |
| 2^10 | 1 | 0 | 192.670 us | 4.20% | 195.295 us | 4.21% | 2.625 us | 1.36% | PASS |
| 2^16 | 1 | 0 | 199.809 us | 4.59% | 199.813 us | 4.16% | 0.004 us | 0.00% | PASS |
| 2^20 | 1 | 0 | 403.536 us | 4.08% | 400.537 us | 7.91% | -2.999 us | -0.74% | PASS |
| 2^10 | 8 | 0 | 631.825 us | 2.70% | 628.752 us | 3.03% | -3.073 us | -0.49% | PASS |
| 2^16 | 8 | 0 | 651.027 us | 2.83% | 645.891 us | 2.99% | -5.136 us | -0.79% | PASS |
| 2^20 | 8 | 0 | 1.049 ms | 2.41% | 990.536 us | 2.78% | -58.560 us | -5.58% | FAIL |
| 2^10 | 0 | 1 | 127.173 us | 5.66% | 127.897 us | 3.85% | 0.724 us | 0.57% | PASS |
| 2^16 | 0 | 1 | 128.940 us | 4.33% | 132.152 us | 4.29% | 3.212 us | 2.49% | PASS |
| 2^20 | 0 | 1 | 304.425 us | 5.95% | 268.729 us | 7.05% | -35.696 us | -11.73% | FAIL |
| 2^10 | 1 | 1 | 193.054 us | 4.19% | 195.841 us | 4.19% | 2.787 us | 1.44% | PASS |
| 2^16 | 1 | 1 | 199.877 us | 4.14% | 201.166 us | 4.07% | 1.289 us | 0.64% | PASS |
| 2^20 | 1 | 1 | 429.069 us | 5.37% | 431.467 us | 7.65% | 2.398 us | 0.56% | PASS |
| 2^10 | 8 | 1 | 627.959 us | 2.97% | 625.391 us | 2.84% | -2.568 us | -0.41% | PASS |
| 2^16 | 8 | 1 | 654.510 us | 3.21% | 648.749 us | 2.94% | -5.761 us | -0.88% | PASS |
| 2^20 | 8 | 1 | 1.055 ms | 2.13% | 1.006 ms | 2.87% | -49.153 us | -4.66% | FAIL |

@PointKernel PointKernel added the 3 - Ready for Review label and removed the 2 - In Progress label on Feb 16, 2024
@PointKernel PointKernel marked this pull request as ready for review February 16, 2024 22:21
@PointKernel PointKernel requested a review from a team as a code owner February 16, 2024 22:21
@PointKernel PointKernel requested a review from a team as a code owner February 22, 2024 20:38
@github-actions github-actions bot added the cuDF (Python) label on Feb 22, 2024
@vyasr vyasr removed the tech debt label Feb 23, 2024
@PointKernel (Member, Author) commented Feb 27, 2024

@vyasr Thanks for the review. Have you reviewed both the C++ and Python changes, or do I need another C++/Python approval to merge the PR?

@vyasr (Contributor) commented Feb 27, 2024

I reviewed both. I think the main question on the Python side would be whether we need to update the docs in https://github.com/rapidsai/cudf/blob/branch-24.04/docs/cudf/source/user_guide/pandas-comparison.md#result-ordering to also mention value_counts. @shwina WDYT? It's not immediately obvious that this would be implemented as a groupby under the hood.

@shwina (Contributor) commented Feb 28, 2024

I think we should do that, yes (this PR then becomes a breaking change).

Comment on lines +62 to +63
```cpp
using probing_scheme_type = cuco::linear_probing<
  1,  ///< Number of threads used to handle each input key
```
A contributor commented:

@PointKernel You mentioned:

> Based on our past tuning experience, e.g.:
> cudf/cpp/src/search/contains_table.cu
> using a larger CG size (like 2 or 4) for nested types can bring significant speedups compared to the current CGSize == 1 for all types.

Do we need to adopt that here? Do we need an issue or TODO describing that optimization?

PointKernel (Member, Author) commented:

> Do we need to adopt that here?

Probably not. That requires nontrivial changes to the current code and the benefit is unclear, i.e., no users have actually complained about groupby performance with nested types. I'm inclined to hold off on looking into it until we have the bandwidth or someone raises an issue about it. Does that sound good to you?

@bdice (Contributor) commented Feb 29, 2024

I’d like some kind of note in the code or an issue to make sure that we are aware of this optimization potential for the future. Otherwise, no action needed in terms of implementation.

PointKernel (Member, Author) commented:

Makes sense. Done in 56a2229.

@vyasr (Contributor) commented Feb 28, 2024

Also CC @mroeschke @galipremsagar: we may have to introduce an extra sort in pandas compatibility mode, due to the ordering change in this PR, for cudf.pandas to behave as expected. No need to make any changes yet, just something to keep on your radars.

@PointKernel PointKernel added the breaking label and removed the non-breaking label on Feb 29, 2024
@PointKernel (Member, Author) commented:

@vyasr @shwina I've updated the doc as suggested and marked this work as a breaking change.

Thanks for your comments.

@PointKernel (Member, Author) commented:

/merge

@rapids-bot rapids-bot bot merged commit 200fc0b into rapidsai:branch-24.04 Feb 29, 2024
76 checks passed
@PointKernel PointKernel deleted the cuco-set-groupby branch February 29, 2024 21:25