Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Optimizing the multi column groupby #45667

Merged

Conversation

wingkitlee0
Copy link
Contributor

@wingkitlee0 wingkitlee0 commented Jun 1, 2024

This PR improves the "get group boundaries" function in groupby.map_groups(). This is not a bottleneck so the overall runtime will likely have minimal improvement, but the function itself is optimized and the code is cleaner.

Changes:

  • Instead of custom class implementation (_MultiColumnSortedKey, which I wrote a few months ago...), a numpy record array is constructed.
  • Changed from "np.searchsorted + while-loop" to np.where

In addition, I have cleaned some indexing and added docstrings and tests.

Why are these changes needed?

The pure numpy implementation is about 8x faster. The result below is for 100K elements. (similar results for 10M elements)

------------------------------------------------------------------------------------------------ benchmark: 2 tests -----------------------------------------------------------------------------------------------
Name (time in ms)                                    Min                 Max                Mean            StdDev              Median               IQR            Outliers      OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark[get_key_boundaries]               21.1100 (1.0)       25.5830 (1.0)       22.2432 (1.0)      1.0147 (1.0)       22.1691 (1.0)      0.8943 (1.0)           8;4  44.9575 (1.0)          30           1
test_benchmark[get_key_boundaries_original]     175.2830 (8.30)     178.6610 (6.98)     176.4963 (7.93)     1.4369 (1.42)     175.8079 (7.93)     2.5007 (2.80)          2;0   5.6658 (0.13)          6           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The custom _MultiColumnSortedKey class was introduced as a work-around. The numpy recarray should be preferred as it allows vectorization.

Time complexity / Speed up

Most of the speed up is coming from the use of recarray. This gives about 7x speed up.

A little more speed up was achieved when using np.where, although the time complexity changed from the current O( k log n) to O(n) with the new implementation (np.where), where k is the number of groups and n is the length of the array. This is likely due to the vectorization (less branching).

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Sorry, something went wrong.

@wingkitlee0 wingkitlee0 force-pushed the optimizing_multiple_group_key branch from fc8c604 to 2323159 Compare June 1, 2024 17:52
@wingkitlee0 wingkitlee0 force-pushed the optimizing_multiple_group_key branch 6 times, most recently from 1245fc0 to 8e7f055 Compare June 20, 2024 00:04
@wingkitlee0 wingkitlee0 marked this pull request as ready for review June 20, 2024 00:06
@anyscalesam anyscalesam added P2 Important issue, but not time-critical data Ray Data-related issues labels Sep 5, 2024
@wingkitlee0 wingkitlee0 force-pushed the optimizing_multiple_group_key branch from 8e7f055 to 2df45c7 Compare September 16, 2024 15:53
@wingkitlee0 wingkitlee0 force-pushed the optimizing_multiple_group_key branch 2 times, most recently from e1f328a to b6076d6 Compare September 29, 2024 01:54
@wingkitlee0
Copy link
Contributor Author

Hey @bveeramani , just wanna put this PR under your radar :)

@bveeramani
Copy link
Member

Hey @wingkitlee0, thanks for the ping! I'll let our oncall know (he's the one who's responsible for reviewing external PRs)

@alexeykudinkin alexeykudinkin self-assigned this Oct 28, 2024
@wingkitlee0 wingkitlee0 force-pushed the optimizing_multiple_group_key branch 3 times, most recently from 8adfcfa to 3ddc1d1 Compare October 31, 2024 02:46
@wingkitlee0
Copy link
Contributor Author

not sure how to solve Read the Docs build failed!. Any idea?

@wingkitlee0 wingkitlee0 force-pushed the optimizing_multiple_group_key branch from 7d2def3 to 2e58973 Compare November 2, 2024 05:15
@wingkitlee0
Copy link
Contributor Author

Hi @alexeykudinkin this PR is ready

@wingkitlee0 wingkitlee0 force-pushed the optimizing_multiple_group_key branch from 1ff7f5e to cbd1e5d Compare November 23, 2024 20:16
@wingkitlee0 wingkitlee0 force-pushed the optimizing_multiple_group_key branch from ac47c15 to 6b0d5cf Compare December 14, 2024 17:41
@wingkitlee0 wingkitlee0 requested a review from a team as a code owner December 14, 2024 17:41
@wingkitlee0
Copy link
Contributor Author

@alexeykudinkin the PR is ready for review again; added nan/None tests

@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Jan 7, 2025
@wingkitlee0
Copy link
Contributor Author

bump

Copy link
Contributor

@alexeykudinkin alexeykudinkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, minor comments

@wingkitlee0 wingkitlee0 force-pushed the optimizing_multiple_group_key branch 2 times, most recently from 6d8099d to 8a41bd8 Compare January 11, 2025 02:10
@wingkitlee0 wingkitlee0 force-pushed the optimizing_multiple_group_key branch from 8a41bd8 to f4442c5 Compare January 14, 2025 23:59

Verified

This commit was signed with the committer’s verified signature.
MarcoPolo Marco Munizaga
Changed from a custom class implementation to a purely numpy implementation

Fixed the None cases

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
@wingkitlee0 wingkitlee0 force-pushed the optimizing_multiple_group_key branch from f4442c5 to 7f370c0 Compare January 15, 2025 01:22
@richardliaw richardliaw merged commit 40cc175 into ray-project:master Jan 16, 2025
5 checks passed
srinathk10 pushed a commit that referenced this pull request Feb 2, 2025

Verified

This commit was signed with the committer’s verified signature.
MarcoPolo Marco Munizaga
anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025
Signed-off-by: Puyuan Yao <williamyao034@gmail.com>
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data Ray Data-related issues go add ONLY when ready to merge, run all tests P2 Important issue, but not time-critical
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants