[Data] Optimizing the multi column groupby #45667

wingkitlee0 · 2024-06-01T17:43:04Z

This PR improves the "get group boundaries" function used by groupby.map_groups(). The function is not a bottleneck, so the overall runtime will likely improve only minimally, but the function itself is faster and the code is cleaner.

Changes:

Replaced the custom class implementation (_MultiColumnSortedKey, which I wrote a few months ago...) with a numpy record array.
Replaced "np.searchsorted + while-loop" with np.where.

In addition, I have cleaned some indexing and added docstrings and tests.
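For readers following along, the two changes above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `get_key_boundaries_sketch`, the field names `k0`, `k1`, ..., and the convention of returning the end offset of each group are all assumptions made for the example.

```python
import numpy as np


def get_key_boundaries_sketch(columns):
    """Return the end offset of each run of equal keys.

    `columns` is a list of equal-length 1-D arrays, already sorted by key.
    Hypothetical helper for illustration only; not the function in this PR.
    """
    n = len(columns[0])
    if n == 0:
        return np.array([], dtype=np.int64)
    # Pack the key columns into one record array so each row compares as a unit.
    names = [f"k{i}" for i in range(len(columns))]
    keys = np.rec.fromarrays(columns, names=names)
    # True wherever a row differs from the row before it.
    changed = np.asarray(keys[1:] != keys[:-1])
    # A change at index i means a group ends at offset i + 1; the last group ends at n.
    return np.append(np.where(changed)[0] + 1, n)


cols = [np.array([1, 1, 1, 2, 2]), np.array(["a", "a", "b", "b", "b"])]
boundaries = get_key_boundaries_sketch(cols)
```

The comparison runs over whole rows at once, so no Python-level loop or per-row key object is needed.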

Why are these changes needed?

The pure numpy implementation is about 8x faster. The results below are for 100K elements (results for 10M elements are similar).

------------------------------------------------------------------------------------------------ benchmark: 2 tests -----------------------------------------------------------------------------------------------
Name (time in ms)                                    Min                 Max                Mean            StdDev              Median               IQR            Outliers      OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark[get_key_boundaries]               21.1100 (1.0)       25.5830 (1.0)       22.2432 (1.0)      1.0147 (1.0)       22.1691 (1.0)      0.8943 (1.0)           8;4  44.9575 (1.0)          30           1
test_benchmark[get_key_boundaries_original]     175.2830 (8.30)     178.6610 (6.98)     176.4963 (7.93)     1.4369 (1.42)     175.8079 (7.93)     2.5007 (2.80)          2;0   5.6658 (0.13)          6           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The custom _MultiColumnSortedKey class was introduced as a work-around. The numpy recarray should be preferred as it allows vectorization.

Time complexity / Speed up

Most of the speedup comes from the use of the recarray, which gives about a 7x speedup.

Using np.where gave a little more speedup, even though the time complexity changed from the current O(k log n) to O(n) with the new implementation, where k is the number of groups and n is the length of the array. This is likely due to vectorization (less branching).
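To make the complexity trade-off concrete, here is a rough single-column sketch of the two strategies. Both function names are hypothetical and neither is the code from this PR; the first does one binary search per group (O(k log n)), the second does one vectorized pass over adjacent pairs (O(n)).

```python
import numpy as np


def boundaries_searchsorted(keys):
    """One binary search per group: O(k log n) for k groups."""
    n = len(keys)
    boundaries = []
    start = 0
    while start < n:
        # Index just past the last element equal to the current group's key.
        end = int(np.searchsorted(keys, keys[start], side="right"))
        boundaries.append(end)
        start = end
    return np.array(boundaries, dtype=np.int64)


def boundaries_where(keys):
    """One vectorized pass over adjacent pairs: O(n)."""
    n = len(keys)
    if n == 0:
        return np.array([], dtype=np.int64)
    # A change at index i means a group ends at offset i + 1; the last group ends at n.
    return np.append(np.where(keys[1:] != keys[:-1])[0] + 1, n)


sorted_keys = np.array([1, 1, 2, 5, 5, 5])
loop_result = boundaries_searchsorted(sorted_keys)
where_result = boundaries_where(sorted_keys)
```

Both return the same end offsets on sorted input. The loop version runs a Python-level iteration per group, which is one plausible reason the asymptotically slower np.where version can still win in practice.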

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

wingkitlee0 · 2024-10-26T00:55:01Z

Hey @bveeramani, just wanna put this PR on your radar :)

bveeramani · 2024-10-28T19:28:46Z

Hey @wingkitlee0, thanks for the ping! I'll let our oncall know (he's the one who's responsible for reviewing external PRs)

wingkitlee0 · 2024-11-01T01:56:18Z

Not sure how to solve the "Read the Docs build failed!" error. Any idea?

wingkitlee0 · 2024-11-16T00:26:43Z

Hi @alexeykudinkin this PR is ready

wingkitlee0 · 2024-12-18T23:00:11Z

@alexeykudinkin the PR is ready for review again; added nan/None tests

wingkitlee0 · 2025-01-09T23:11:45Z

bump

alexeykudinkin

LGTM, minor comments

Changed from a custom class implementation to a purely numpy implementation
Fixed the None cases
Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>

Signed-off-by: Puyuan Yao <williamyao034@gmail.com>

wingkitlee0 force-pushed the optimizing_multiple_group_key branch from fc8c604 to 2323159 Compare June 1, 2024 17:52

wingkitlee0 force-pushed the optimizing_multiple_group_key branch 6 times, most recently from 1245fc0 to 8e7f055 Compare June 20, 2024 00:04

wingkitlee0 marked this pull request as ready for review June 20, 2024 00:06

wingkitlee0 requested review from ericl, scv119, c21, amogkam, scottjlee, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners June 20, 2024 00:06

anyscalesam added P2 data labels Sep 5, 2024

wingkitlee0 force-pushed the optimizing_multiple_group_key branch from 8e7f055 to 2df45c7 Compare September 16, 2024 15:53

wingkitlee0 force-pushed the optimizing_multiple_group_key branch 2 times, most recently from e1f328a to b6076d6 Compare September 29, 2024 01:54

alexeykudinkin self-assigned this Oct 28, 2024

alexeykudinkin reviewed Oct 28, 2024

wingkitlee0 force-pushed the optimizing_multiple_group_key branch 3 times, most recently from 8adfcfa to 3ddc1d1 Compare October 31, 2024 02:46

wingkitlee0 requested a review from alexeykudinkin November 1, 2024 01:56

wingkitlee0 force-pushed the optimizing_multiple_group_key branch from 7d2def3 to 2e58973 Compare November 2, 2024 05:15

alexeykudinkin reviewed Nov 19, 2024

wingkitlee0 requested a review from srinathk10 as a code owner November 23, 2024 19:17

wingkitlee0 force-pushed the optimizing_multiple_group_key branch from 1ff7f5e to cbd1e5d Compare November 23, 2024 20:16

wingkitlee0 requested a review from alexeykudinkin November 23, 2024 21:03

alexeykudinkin reviewed Nov 25, 2024

alexeykudinkin reviewed Nov 25, 2024

wingkitlee0 force-pushed the optimizing_multiple_group_key branch from ac47c15 to 6b0d5cf Compare December 14, 2024 17:41

wingkitlee0 requested a review from a team as a code owner December 14, 2024 17:41

wingkitlee0 requested a review from alexeykudinkin December 16, 2024 22:37

alexeykudinkin added the go label Jan 7, 2025

alexeykudinkin approved these changes Jan 9, 2025

wingkitlee0 force-pushed the optimizing_multiple_group_key branch 2 times, most recently from 6d8099d to 8a41bd8 Compare January 11, 2025 02:10

alexeykudinkin approved these changes Jan 14, 2025

wingkitlee0 force-pushed the optimizing_multiple_group_key branch from 8a41bd8 to f4442c5 Compare January 14, 2025 23:59

wingkitlee0 force-pushed the optimizing_multiple_group_key branch from f4442c5 to 7f370c0 Compare January 15, 2025 01:22

richardliaw merged commit 40cc175 into ray-project:master Jan 16, 2025
5 checks passed

anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025

[Data] Optimizing the multi column groupby (ray-project#45667)

ea7585f

Signed-off-by: Puyuan Yao <williamyao034@gmail.com>

park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025

[Data] Optimizing the multi column groupby (ray-project#45667)

18e7eb7
