Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Refactor hash-based algorithms with new cuco data structures #12261

Open
PointKernel opened this issue Nov 29, 2022 · 0 comments
Open

[FEA] Refactor hash-based algorithms with new cuco data structures #12261

PointKernel opened this issue Nov 29, 2022 · 0 comments
Labels
2 - In Progress Currently a work in progress feature request New feature or request improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code.

Comments

@PointKernel
Copy link
Member

PointKernel commented Nov 29, 2022

Is your feature request related to a problem? Please describe.
cuco::static_map and cuco::static_multimap are used to perform hash-based operations in libcudf. Depending on NVIDIA/cuCollections#110, a lot of existing use cases could be replaced with cuco::static_set or cuco:static_multiset since the payload part in the hash map is not used.

Moreover, some libcudf implementations are still using concurrent_unordered_map or unordered_multiset and they should be refactored with cuco::static_set/multiset as well.

Describe the solution you'd like
Replace existing implementations with new data structures like cuco::static_set, cuco::static_multiset or cuco::static_map.

There should also be benchmarks showing the performance changes after replacement for most works listed below.
Note: under "Related APIs", the ref:: prefix refers to the cuco device-side API.

Part 1: Updates ready to be addressed in libcudf

Algorithm Desired Data Structure Related APIs PRs Notes
distinct_count cuco::static_set insert, insert_if #13343
orc dictionary encoding (issue #10495) cuco::static_map (legacy) ref::insert, ref::find #13580 Similar to parquet dictionary encoding but the current implementation is still using a custom dictionary
byte_pair_encoding cuco::static_map insert_async, ref::find #13807 uses heterogeneous lookup
json_tree cuco::static_set insert_if_async, ref::find #13928
search/contains_table cuco::static_set insert_async, insert_if_async, contains_async, contains_if_async #14064 Needs migration from cudf::detail::unordered_multiset
search/contains_column cuco::static_set insert_async, insert_if_async, contains_async #14238 Needs migration from cudf::detail::unordered_multiset. unordered_multiset can be removed once this is done.
hash-based groupby (issue #10401) cuco::static_set ref::insert_and_find, retrieve_all #14813 Needs migration from concurrent_unordered_map
legacy json reader cuco::static_map insert_async, ref::find #15813 Needs migration from concurrent_unordered_map. Looks like a rational use of hash map. To be replaced by cuco::static_map Deleted in #15813
distinct cuco::static_set #15285 Needs migration from cuco::static_map (legacy), and then in Part 2, further migration to cuco::reduction_map

Part 2: Updates depending on upstream cuco changes

Algorithm Desired Data Structure Related APIs PRs Notes
semi/anti joins cuco::static_set ? could be ready today for migration, but might need cuco::static_multiset updates first TBD see #9973. currently using cuco::static_multimap (legacy)
joins cuco::static_multiset count, count_outer, retrieve, retrieve_outer insert_async, insert_if_async Needs migration from cuco::static_multimap (legacy) Related issues: #11176, #13116
mixed joins cuco::static_multiset see "joins" TBD Needs migration from cuco::static_multimap (legacy)
parquet dictionary encoding cuco::static_map (experimental) probably unblocked? ref::insert, ref::find Needs migration from cuco::static_map (legacy)
orc dictionary encoding cuco::static_map (experimental) probably unblocked? ref::insert, ref::find Needs migration from cuco::static_map (legacy)
distinct cuco::static_reduction_map TBD see #13157. Reduction map not yet implemented in cuco, could just have cuco::static_map specialization with a new API "insert_or_apply"
@PointKernel PointKernel added feature request New feature or request Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. tech debt improvement Improvement / enhancement to an existing function labels Nov 29, 2022
@PointKernel PointKernel added this to Needs prioritizing in Feature Planning via automation Nov 29, 2022
@GregoryKimball GregoryKimball added 0 - Blocked Cannot progress due to external reasons and removed Needs Triage Need team to review and classify labels Dec 1, 2022
@PointKernel PointKernel removed the 0 - Blocked Cannot progress due to external reasons label May 12, 2023
rapids-bot bot pushed a commit that referenced this issue May 15, 2023
This PR improves `distinct_count` with the new `cuco::static_set` data structure.

Related to #12261

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Divye Gala (https://github.com/divyegala)

URL: #13343
@GregoryKimball GregoryKimball added the 2 - In Progress Currently a work in progress label Aug 10, 2023
rapids-bot bot pushed a commit that referenced this issue Aug 28, 2023
In JSON tree algorithms of JSON reader, cuco static_map is used as a set. This PR replaces it with static_set.
No tests are changed. No significant runtime changes.
Addresses part of  #12261

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #13928
rapids-bot bot pushed a commit that referenced this issue Sep 26, 2023
Contributes to #12261

This PR refactors `contains_table` to use the new `cuco::static_set` data structure. It also adds a `contains_table` benchmark to track the performance before and after this work.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #14064
PointKernel added a commit to NVIDIA/cuCollections that referenced this issue Sep 26, 2023
Depends on #349 

This PR adds an example demonstrating how to create multiple subsets
with one single storage. It includes necessary changes and cleanups that
will unblock orc/parquet dictionary encoding
(rapidsai/cudf#12261) to use the new map/set
data structures.

---------

Co-authored-by: Daniel Juenger <2955913+sleeepyjack@users.noreply.github.com>
rapids-bot bot pushed a commit that referenced this issue Oct 4, 2023
Part of ##12261

This PR simplifies the `contains_column` implementation by invoking `contains_table` and gets rid of the use of the cudf `unordered_multiset`. It also removes the `unordered_multiset` header file from libcudf.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #14238
@vyasr vyasr removed the tech debt label Feb 23, 2024
rapids-bot bot pushed a commit that referenced this issue Feb 29, 2024
Depends on #14849

Contributes to #12261

This PR migrates hash groupby to use the new `cuco::static_set` data structure. It doesn't change any existing libcudf behavior but uncovers the fact that the cudf python `value_counts` doesn't guarantee output orders thus the PR becomes a breaking change.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #14813
rapids-bot bot pushed a commit that referenced this issue May 23, 2024
This completes the final two steps and closes #15537. Also addresses one step of #12261.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - David Wendt (https://github.com/davidwendt)
  - Shruti Shivakumar (https://github.com/shrshi)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #15813
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress feature request New feature or request improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code.
Projects
Status: In Progress
Status: Story Issue
Feature Planning
Needs prioritizing
Development

No branches or pull requests

3 participants