Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix mean computation for the geometric distribution in the data generator #15282

Merged

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Mar 12, 2024

Description

Since we moved random data generation to the GPU, geometric distribution has been approximated by half-normal distribution. However, the mean computation wasn't updated, causing a ~20% higher mean that the actual generated values.
Another issue that exasperated the problem is the implicit conversion to ints in the random generator. This effectively lowered the mean of generated values by 0.5.

Together, these lead to list columns having the last row with more than 20% of the total column data. Huge single row caused low performance in many benchmarks. For example, Parquet files end up with a few huge pages and load imbalance in decode.

This PR fixes the mean computation to reflex the actual distribution, and rounds the random values when converting to ints. The result is a correct distribution of the number of elements in each randomly generated list.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule self-assigned this Mar 12, 2024
@vuule vuule added bug Something isn't working tests Unit testing for project non-breaking Non-breaking change labels Mar 12, 2024
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 12, 2024
@vuule
Copy link
Contributor Author

vuule commented Mar 12, 2024

CC @shrshi

@vuule vuule marked this pull request as ready for review March 13, 2024 00:49
@vuule vuule requested a review from a team as a code owner March 13, 2024 00:49
@vuule
Copy link
Contributor Author

vuule commented Mar 13, 2024

Parquet reader benchmarks with only list columns are faster by 50%, while the ones with mixed types are 20-30X faster.
As expected, there's no impact on performance measurements with other types. The file size has not changed significantly either.

// In the current implementation, the geometric distribution is
// approximated by absolute value of a uniform distribution
auto const gauss_std_dev = geometric_as_gauss_std_dev(dist.lower_bound, dist.upper_bound);
auto const half_gauss_mean = gauss_std_dev * sqrt(2. / M_PI);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vuule
Copy link
Contributor Author

vuule commented Mar 13, 2024

@PointKernel made some additional changes; requested your review so I don't sneak them past you :)

Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one nit, otherwise looks great to me! Thank you :)

cpp/benchmarks/common/random_distribution_factory.cuh Outdated Show resolved Hide resolved
cpp/benchmarks/common/generate_input.cu Show resolved Hide resolved
@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Mar 14, 2024
@vuule
Copy link
Contributor Author

vuule commented Mar 14, 2024

/merge

@rapids-bot rapids-bot bot merged commit 95ce0bb into rapidsai:branch-24.04 Mar 14, 2024
75 checks passed
@vuule vuule deleted the bug-geometric-distribution-mean branch March 14, 2024 21:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change tests Unit testing for project
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

4 participants