[Data] Optimizing the multi column groupby #45667
Hey @bveeramani, just wanna put this PR on your radar :)
Hey @wingkitlee0, thanks for the ping! I'll let our oncall know (he's the one who's responsible for reviewing external PRs)
not sure how to solve
Hi @alexeykudinkin, this PR is ready
@alexeykudinkin the PR is ready for review again; added nan/None tests
bump
LGTM, minor comments
Changed from a custom class implementation to a purely numpy implementation
Fixed the None cases
Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
Signed-off-by: Puyuan Yao <williamyao034@gmail.com>
This PR improves the "get group boundaries" function in `groupby.map_groups()`. This function is not a bottleneck, so the overall runtime will likely improve only minimally, but the function itself is faster and the code is cleaner.
Changes:
- Instead of the custom class (`_MultiColumnSortedKey`, which I wrote a few months ago...), a numpy record array is constructed.
- Group boundaries are found with `np.where`.
In addition, I have cleaned up some indexing and added docstrings and tests.
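To make the two changes concrete, here is a minimal sketch of the idea, not the code in this PR. The function names (`to_record_array`, `boundaries_where`, `boundaries_bisect`) and the sample columns are hypothetical, and the `bisect` version is only a stand-in for a binary-search approach that does one search per group. It assumes the key columns are already sorted by the group keys.

```python
from bisect import bisect_right

import numpy as np


def to_record_array(columns):
    """Pack equal-length key columns (name -> 1-D array) into one record array."""
    return np.rec.fromarrays(list(columns.values()), names=list(columns.keys()))


def boundaries_where(keys):
    """Vectorized O(n) scan over a sorted record array.

    Returns the start index of every group plus the final end index.
    """
    n = len(keys)
    if n == 0:
        return np.array([0])
    # A row starts a new group if any key field differs from the previous row.
    changed = np.zeros(n - 1, dtype=bool)
    for name in keys.dtype.names:
        changed |= keys[name][1:] != keys[name][:-1]
    return np.concatenate(([0], np.where(changed)[0] + 1, [n]))


def boundaries_bisect(rows):
    """Stand-in for a binary-search approach: one bisect per group, O(k log n)."""
    bounds, start = [0], 0
    while start < len(rows):
        start = bisect_right(rows, rows[start], lo=start)
        bounds.append(start)
    return bounds


# Two key columns, already sorted by (a, b).
columns = {
    "a": np.array([1, 1, 1, 2, 2, 3]),
    "b": np.array(["x", "x", "y", "y", "y", "z"]),
}
keys = to_record_array(columns)
rows = list(zip(*(col.tolist() for col in columns.values())))

print(boundaries_where(keys).tolist())  # [0, 2, 3, 5, 6]
print(boundaries_bisect(rows))          # [0, 2, 3, 5, 6]
```

Each pair of consecutive boundaries delimits one group, so rows `bounds[i]:bounds[i + 1]` share the same key. The `np.where` version touches every row once but does so in vectorized numpy operations, while the bisect version does fewer comparisons at the cost of a Python-level loop per group.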
Why are these changes needed?
The pure `numpy` implementation is about 8x faster. This was measured with 100K elements; results for 10M elements are similar.
The custom `_MultiColumnSortedKey` class was introduced as a work-around. The numpy recarray should be preferred as it allows vectorization.
Time complexity / Speed up
Most of the speedup comes from the use of the recarray, which gives about a 7x speedup.
A little more speedup was achieved by using `np.where`, even though the time complexity changes from the current O(k log n) to O(n) with the new implementation, where k is the number of groups and n is the length of the array. The gain despite the higher asymptotic cost is likely due to vectorization (less branching).
Related issue number
Checks
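- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.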