
GH-36892: [C++] Fix performance regressions in FieldPath::Get #37032

Merged
merged 5 commits into apache:main on Aug 8, 2023

Conversation

benibus
Collaborator

@benibus benibus commented Aug 6, 2023

Rationale for this change

#35197 appears to have introduced significant performance regressions in FieldPath::Get - indicated here (https://conbench.ursa.dev/compare/runs/9cf73ac83f0a44179e6538b2c1c7babd...3d76cb5ffb8849bf8c3ea9b32d08b3b7/), in a benchmark that uses a wide (10K-column) dataframe.

What changes are included in this PR?

  • Adds basic benchmarks for FieldPath::Get across various input types, as they didn't previously exist
  • Addresses several performance issues. These came in the form of extremely high upfront costs for the RecordBatch and ArrayData overloads specifically
  • Some minor refactoring of NestedSelector

Are these changes tested?

Yes (covered by existing tests)

Are there any user-facing changes?

No

@github-actions

github-actions bot commented Aug 6, 2023

⚠️ GitHub issue #36892 has been automatically assigned in GitHub to PR creator.

@benibus
Collaborator Author

benibus commented Aug 6, 2023

I ran the new benchmarks at four different points in the commit history and got some interesting results (all numbers are approximate best-case).

Immediately before PR-35197

This was right after I added initial support for tables/chunked arrays. Fortunately, those implementations probably didn't see any use before they were replaced.

FieldPathGetFromWideArray               4231818 ns      4231770 ns          164 items_per_second=2.36308M/s
FieldPathGetFromWideArrayData           1103845 ns      1103830 ns          626 items_per_second=9.05937M/s
FieldPathGetFromWideBatch               4251668 ns      4251632 ns          165 items_per_second=2.35204M/s
FieldPathGetFromWideTable            1.3847e+10 ns   1.3847e+10 ns            1 items_per_second=722.19/s
FieldPathGetFromWideChunkedArray/2   2.2871e+11 ns   2.2870e+11 ns            1 items_per_second=43.7258/s
FieldPathGetFromWideChunkedArray/8   8.8894e+10 ns   8.8892e+10 ns            1 items_per_second=112.496/s
FieldPathGetFromWideChunkedArray/32  4.9146e+10 ns   4.9145e+10 ns            1 items_per_second=203.48/s
FieldPathGetFromWideChunkedArray/100 3.8052e+10 ns   3.8051e+10 ns            1 items_per_second=262.802/s

Immediately after PR-35197

Major regressions for array data and record batches.

FieldPathGetFromWideArray               4006585 ns      4006560 ns          174 items_per_second=2.49591M/s
FieldPathGetFromWideArrayData        3190691414 ns   3190582067 ns            1 items_per_second=3.13422k/s
FieldPathGetFromWideBatch            4362726666 ns   4362483392 ns            1 items_per_second=2.29227k/s
FieldPathGetFromWideTable               3250066 ns      3249963 ns          216 items_per_second=3.07696M/s
FieldPathGetFromWideChunkedArray/2    298990880 ns    298973212 ns            2 items_per_second=33.4478k/s
FieldPathGetFromWideChunkedArray/8     46501931 ns     46500714 ns           12 items_per_second=215.05k/s
FieldPathGetFromWideChunkedArray/32    20925483 ns     20924825 ns           31 items_per_second=477.901k/s
FieldPathGetFromWideChunkedArray/100   12508686 ns     12508268 ns           54 items_per_second=799.471k/s

Immediately before this PR

At some point between these commits, performance improved significantly for everything besides array data and record batches. I'm not sure why.

FieldPathGetFromWideArray                628388 ns       628356 ns         1108 items_per_second=15.9145M/s
FieldPathGetFromWideArrayData        1659049537 ns   1659040097 ns            1 items_per_second=6.02758k/s
FieldPathGetFromWideBatch            2465863699 ns   2465831041 ns            1 items_per_second=4.05543k/s
FieldPathGetFromWideTable                373124 ns       373118 ns         1876 items_per_second=26.8012M/s
FieldPathGetFromWideChunkedArray/2     25216611 ns     25215747 ns           22 items_per_second=396.578k/s
FieldPathGetFromWideChunkedArray/8      6216094 ns      6215197 ns           96 items_per_second=1.60896M/s
FieldPathGetFromWideChunkedArray/32     2292445 ns      2292410 ns          299 items_per_second=4.36222M/s
FieldPathGetFromWideChunkedArray/100    1109909 ns      1109902 ns          628 items_per_second=9.0098M/s

Immediately after this PR

FieldPathGetFromWideArray                619903 ns       619883 ns         1123 items_per_second=16.1321M/s
FieldPathGetFromWideArrayData            431889 ns       431878 ns         1622 items_per_second=23.1547M/s
FieldPathGetFromWideBatch                575262 ns       575236 ns         1216 items_per_second=17.3842M/s
FieldPathGetFromWideTable                371890 ns       371888 ns         1882 items_per_second=26.8898M/s
FieldPathGetFromWideChunkedArray/2     23430915 ns     23429532 ns           24 items_per_second=426.812k/s
FieldPathGetFromWideChunkedArray/8      5858929 ns      5858709 ns          100 items_per_second=1.70686M/s
FieldPathGetFromWideChunkedArray/32     2176156 ns      2176124 ns          317 items_per_second=4.59533M/s
FieldPathGetFromWideChunkedArray/100    1066176 ns      1066129 ns          647 items_per_second=9.37973M/s

@benibus benibus marked this pull request as ready for review August 6, 2023 23:22
@raulcd
Member

raulcd commented Aug 8, 2023

@ursabot please benchmark

@ursabot

ursabot commented Aug 8, 2023

Benchmark runs are scheduled for commit c7c3059. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@pitrou
Member

pitrou commented Aug 8, 2023

Out of curiosity I ran these benchmarks against 12.0.1:

12.0.1

-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
FieldPathGetFromWideArray                474829 ns       474364 ns         1487 items_per_second=21.0809M/s
FieldPathGetFromWideArrayData            135436 ns       135304 ns         5174 items_per_second=73.9079M/s
FieldPathGetFromWideBatch                452217 ns       451773 ns         1553 items_per_second=22.135M/s
FieldPathGetFromWideTable            3261966454 ns   3259233772 ns            1 items_per_second=3.06821k/s
FieldPathGetFromWideChunkedArray/2   9.0042e+10 ns   8.9960e+10 ns            1 items_per_second=111.16/s
FieldPathGetFromWideChunkedArray/8   2.3128e+10 ns   2.3124e+10 ns            1 items_per_second=432.451/s
FieldPathGetFromWideChunkedArray/32  1.1759e+10 ns   1.1757e+10 ns            1 items_per_second=850.574/s
FieldPathGetFromWideChunkedArray/100 8947862056 ns   8946355791 ns            1 items_per_second=1.11777k/s

After this PR

-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
FieldPathGetFromWideArray                626352 ns       626257 ns         1109 items_per_second=15.9679M/s
FieldPathGetFromWideArrayData            397684 ns       397611 ns         1760 items_per_second=25.1502M/s
FieldPathGetFromWideBatch                703693 ns       703582 ns          988 items_per_second=14.213M/s
FieldPathGetFromWideTable                394278 ns       394211 ns         1762 items_per_second=25.3671M/s
FieldPathGetFromWideChunkedArray/2     35398792 ns     35372736 ns           17 items_per_second=282.704k/s
FieldPathGetFromWideChunkedArray/8      7108294 ns      7101924 ns           84 items_per_second=1.40807M/s
FieldPathGetFromWideChunkedArray/32     2459666 ns      2458740 ns          280 items_per_second=4.06712M/s
FieldPathGetFromWideChunkedArray/100    1361175 ns      1360980 ns          513 items_per_second=7.34765M/s

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 8, 2023
@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit c7c3059.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

@pitrou
Member

pitrou commented Aug 8, 2023

It seems that wide_dataframe<use_legacy_dataset=false> went back to its pre-regression runtime of ~0.49 seconds.

Original regression:
https://conbench.ursa.dev/compare/runs/9cf73ac83f0a44179e6538b2c1c7babd...3d76cb5ffb8849bf8c3ea9b32d08b3b7/
Result on this PR:
https://conbench.ursa.dev/runs/84a588b7c37a487fa4cd60b6505cae40/

@pitrou
Member

pitrou commented Aug 8, 2023

@github-actions crossbow submit -g cpp

@github-actions

github-actions bot commented Aug 8, 2023

Revision: 0b75284

Submitted crossbow builds: ursacomputing/crossbow @ actions-c076bca0da

Task Status
test-alpine-linux-cpp Github Actions
test-build-cpp-fuzz Github Actions
test-conda-cpp Github Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp Github Actions
test-debian-11-cpp-amd64 Github Actions
test-debian-11-cpp-i386 Github Actions
test-fedora-35-cpp Github Actions
test-ubuntu-20.04-cpp Github Actions
test-ubuntu-20.04-cpp-bundled Github Actions
test-ubuntu-20.04-cpp-minimal-with-formats Github Actions
test-ubuntu-20.04-cpp-thread-sanitizer Github Actions
test-ubuntu-22.04-cpp Github Actions
test-ubuntu-22.04-cpp-20 Github Actions

@pitrou
Member

pitrou commented Aug 8, 2023

I pushed a variation so that Table inputs are benchmarked with multiple chunk sizes.
The results are interesting: the performance degradation with increasing numbers of chunks is much less apparent for Table inputs than for ChunkedArray inputs:

-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
FieldPathGetFromWideArray                701692 ns       701592 ns          974 items_per_second=14.2533M/s num_columns=10k
FieldPathGetFromWideArrayData            402517 ns       402454 ns         1749 items_per_second=24.8476M/s num_columns=10k
FieldPathGetFromWideBatch                691269 ns       691164 ns          996 items_per_second=14.4683M/s num_columns=10k
FieldPathGetFromWideChunkedArray/2     34847633 ns     34822898 ns           17 items_per_second=287.167k/s num_chunks=50 num_columns=10k
FieldPathGetFromWideChunkedArray/10     5530621 ns      5525915 ns          102 items_per_second=1.80966M/s num_chunks=10 num_columns=10k
FieldPathGetFromWideChunkedArray/25     2403874 ns      2403105 ns          285 items_per_second=4.16128M/s num_chunks=4 num_columns=10k
FieldPathGetFromWideChunkedArray/100    1318747 ns      1318526 ns          529 items_per_second=7.58423M/s num_chunks=1 num_columns=10k
FieldPathGetFromWideTable/2              511357 ns       511244 ns         1347 items_per_second=19.5601M/s num_chunks=50 num_columns=10k
FieldPathGetFromWideTable/10             390110 ns       390025 ns         1791 items_per_second=25.6394M/s num_chunks=10 num_columns=10k
FieldPathGetFromWideTable/25             387858 ns       387783 ns         1802 items_per_second=25.7876M/s num_chunks=4 num_columns=10k
FieldPathGetFromWideTable/100            387341 ns       387250 ns         1807 items_per_second=25.8231M/s num_chunks=1 num_columns=10k

@benibus
Collaborator Author

benibus commented Aug 8, 2023

That's probably because selecting a top-level column from a table is just a basic index into a vector, whereas for chunked arrays, you need to process all of the chunks individually to construct a new ChunkedArray that represents the child field.

Right now, the benchmarks only test top-level selections - but you'd likely see a similar degradation at higher depth levels.

@pitrou pitrou merged commit 59f30f0 into apache:main Aug 8, 2023
33 of 34 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Aug 8, 2023
@pitrou
Member

pitrou commented Aug 8, 2023

@raulcd You can cherry-pick this now.

raulcd pushed a commit that referenced this pull request Aug 9, 2023
(commit message repeats the PR description above)
Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@raulcd
Member

raulcd commented Aug 9, 2023

Thanks! I've cherry-picked it, and I'm re-running some jobs on the maintenance branch to validate its status before creating the new RC.

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 59f30f0.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
(commit message repeats the PR description above)

Successfully merging this pull request may close these issues.

[C++] Potential regression on FieldRef/FieldPath non-flattening Get methods
4 participants