
GH-36892: [C++] Fix performance regressions in FieldPath::Get #37032

Merged
merged 5 commits into apache:main on Aug 8, 2023

Conversation

benibus
Collaborator

@benibus benibus commented Aug 6, 2023

Rationale for this change

#35197 appears to have introduced significant performance regressions in FieldPath::Get - indicated here (https://conbench.ursa.dev/compare/runs/9cf73ac83f0a44179e6538b2c1c7babd...3d76cb5ffb8849bf8c3ea9b32d08b3b7/), in a benchmark that uses a wide (10K-column) dataframe.

What changes are included in this PR?

  • Adds basic benchmarks for FieldPath::Get across various input types, as they didn't previously exist
  • Addresses several performance issues. These came in the form of extremely high upfront costs for the RecordBatch and ArrayData overloads specifically
  • Some minor refactoring of NestedSelector

Are these changes tested?

Yes (covered by existing tests)

Are there any user-facing changes?

No

@github-actions

github-actions bot commented Aug 6, 2023

⚠️ GitHub issue #36892 has been automatically assigned in GitHub to PR creator.

@benibus
Collaborator Author

benibus commented Aug 6, 2023

I ran the new benchmarks at four different points in the commit history and got some interesting results (all numbers are approximate best-case).

Immediately before PR-35197

This was right after I added initial support for tables/chunked arrays. Fortunately, those implementations probably didn't see any use before they were replaced.

FieldPathGetFromWideArray               4231818 ns      4231770 ns          164 items_per_second=2.36308M/s
FieldPathGetFromWideArrayData           1103845 ns      1103830 ns          626 items_per_second=9.05937M/s
FieldPathGetFromWideBatch               4251668 ns      4251632 ns          165 items_per_second=2.35204M/s
FieldPathGetFromWideTable            1.3847e+10 ns   1.3847e+10 ns            1 items_per_second=722.19/s
FieldPathGetFromWideChunkedArray/2   2.2871e+11 ns   2.2870e+11 ns            1 items_per_second=43.7258/s
FieldPathGetFromWideChunkedArray/8   8.8894e+10 ns   8.8892e+10 ns            1 items_per_second=112.496/s
FieldPathGetFromWideChunkedArray/32  4.9146e+10 ns   4.9145e+10 ns            1 items_per_second=203.48/s
FieldPathGetFromWideChunkedArray/100 3.8052e+10 ns   3.8051e+10 ns            1 items_per_second=262.802/s

Immediately after PR-35197

Major regressions for array data and record batches.

FieldPathGetFromWideArray               4006585 ns      4006560 ns          174 items_per_second=2.49591M/s
FieldPathGetFromWideArrayData        3190691414 ns   3190582067 ns            1 items_per_second=3.13422k/s
FieldPathGetFromWideBatch            4362726666 ns   4362483392 ns            1 items_per_second=2.29227k/s
FieldPathGetFromWideTable               3250066 ns      3249963 ns          216 items_per_second=3.07696M/s
FieldPathGetFromWideChunkedArray/2    298990880 ns    298973212 ns            2 items_per_second=33.4478k/s
FieldPathGetFromWideChunkedArray/8     46501931 ns     46500714 ns           12 items_per_second=215.05k/s
FieldPathGetFromWideChunkedArray/32    20925483 ns     20924825 ns           31 items_per_second=477.901k/s
FieldPathGetFromWideChunkedArray/100   12508686 ns     12508268 ns           54 items_per_second=799.471k/s

Immediately before this PR

At some point between these commits, performance improved significantly for everything besides array data and record batches. I'm not sure why.

FieldPathGetFromWideArray                628388 ns       628356 ns         1108 items_per_second=15.9145M/s
FieldPathGetFromWideArrayData        1659049537 ns   1659040097 ns            1 items_per_second=6.02758k/s
FieldPathGetFromWideBatch            2465863699 ns   2465831041 ns            1 items_per_second=4.05543k/s
FieldPathGetFromWideTable                373124 ns       373118 ns         1876 items_per_second=26.8012M/s
FieldPathGetFromWideChunkedArray/2     25216611 ns     25215747 ns           22 items_per_second=396.578k/s
FieldPathGetFromWideChunkedArray/8      6216094 ns      6215197 ns           96 items_per_second=1.60896M/s
FieldPathGetFromWideChunkedArray/32     2292445 ns      2292410 ns          299 items_per_second=4.36222M/s
FieldPathGetFromWideChunkedArray/100    1109909 ns      1109902 ns          628 items_per_second=9.0098M/s

Immediately after this PR

FieldPathGetFromWideArray                619903 ns       619883 ns         1123 items_per_second=16.1321M/s
FieldPathGetFromWideArrayData            431889 ns       431878 ns         1622 items_per_second=23.1547M/s
FieldPathGetFromWideBatch                575262 ns       575236 ns         1216 items_per_second=17.3842M/s
FieldPathGetFromWideTable                371890 ns       371888 ns         1882 items_per_second=26.8898M/s
FieldPathGetFromWideChunkedArray/2     23430915 ns     23429532 ns           24 items_per_second=426.812k/s
FieldPathGetFromWideChunkedArray/8      5858929 ns      5858709 ns          100 items_per_second=1.70686M/s
FieldPathGetFromWideChunkedArray/32     2176156 ns      2176124 ns          317 items_per_second=4.59533M/s
FieldPathGetFromWideChunkedArray/100    1066176 ns      1066129 ns          647 items_per_second=9.37973M/s

@benibus benibus marked this pull request as ready for review August 6, 2023 23:22
@raulcd
Member

raulcd commented Aug 8, 2023

@ursabot please benchmark

@ursabot

ursabot commented Aug 8, 2023

Benchmark runs are scheduled for commit c7c3059. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@pitrou
Member

pitrou commented Aug 8, 2023

Out of curiosity I ran these benchmarks against 12.0.1:

12.0.1

-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
FieldPathGetFromWideArray                474829 ns       474364 ns         1487 items_per_second=21.0809M/s
FieldPathGetFromWideArrayData            135436 ns       135304 ns         5174 items_per_second=73.9079M/s
FieldPathGetFromWideBatch                452217 ns       451773 ns         1553 items_per_second=22.135M/s
FieldPathGetFromWideTable            3261966454 ns   3259233772 ns            1 items_per_second=3.06821k/s
FieldPathGetFromWideChunkedArray/2   9.0042e+10 ns   8.9960e+10 ns            1 items_per_second=111.16/s
FieldPathGetFromWideChunkedArray/8   2.3128e+10 ns   2.3124e+10 ns            1 items_per_second=432.451/s
FieldPathGetFromWideChunkedArray/32  1.1759e+10 ns   1.1757e+10 ns            1 items_per_second=850.574/s
FieldPathGetFromWideChunkedArray/100 8947862056 ns   8946355791 ns            1 items_per_second=1.11777k/s

After this PR

-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
FieldPathGetFromWideArray                626352 ns       626257 ns         1109 items_per_second=15.9679M/s
FieldPathGetFromWideArrayData            397684 ns       397611 ns         1760 items_per_second=25.1502M/s
FieldPathGetFromWideBatch                703693 ns       703582 ns          988 items_per_second=14.213M/s
FieldPathGetFromWideTable                394278 ns       394211 ns         1762 items_per_second=25.3671M/s
FieldPathGetFromWideChunkedArray/2     35398792 ns     35372736 ns           17 items_per_second=282.704k/s
FieldPathGetFromWideChunkedArray/8      7108294 ns      7101924 ns           84 items_per_second=1.40807M/s
FieldPathGetFromWideChunkedArray/32     2459666 ns      2458740 ns          280 items_per_second=4.06712M/s
FieldPathGetFromWideChunkedArray/100    1361175 ns      1360980 ns          513 items_per_second=7.34765M/s

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 8, 2023
@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit c7c3059.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

@pitrou
Member

pitrou commented Aug 8, 2023

It seems that wide_dataframe<use_legacy_dataset=false> went back to its pre-regression runtime of ~0.49 seconds.

Original regression:
https://conbench.ursa.dev/compare/runs/9cf73ac83f0a44179e6538b2c1c7babd...3d76cb5ffb8849bf8c3ea9b32d08b3b7/
Result on this PR:
https://conbench.ursa.dev/runs/84a588b7c37a487fa4cd60b6505cae40/

@pitrou
Member

pitrou commented Aug 8, 2023

@github-actions crossbow submit -g cpp

@github-actions

github-actions bot commented Aug 8, 2023

Revision: 0b75284

Submitted crossbow builds: ursacomputing/crossbow @ actions-c076bca0da

Task Status
test-alpine-linux-cpp Github Actions
test-build-cpp-fuzz Github Actions
test-conda-cpp Github Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp Github Actions
test-debian-11-cpp-amd64 Github Actions
test-debian-11-cpp-i386 Github Actions
test-fedora-35-cpp Github Actions
test-ubuntu-20.04-cpp Github Actions
test-ubuntu-20.04-cpp-bundled Github Actions
test-ubuntu-20.04-cpp-minimal-with-formats Github Actions
test-ubuntu-20.04-cpp-thread-sanitizer Github Actions
test-ubuntu-22.04-cpp Github Actions
test-ubuntu-22.04-cpp-20 Github Actions

@pitrou
Member

pitrou commented Aug 8, 2023

I pushed a variation so that Table inputs are benchmarked with multiple chunk sizes.
The results are interesting: the performance degradation with increasing numbers of chunks is much less apparent for Table inputs than for ChunkedArray inputs:

-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
FieldPathGetFromWideArray                701692 ns       701592 ns          974 items_per_second=14.2533M/s num_columns=10k
FieldPathGetFromWideArrayData            402517 ns       402454 ns         1749 items_per_second=24.8476M/s num_columns=10k
FieldPathGetFromWideBatch                691269 ns       691164 ns          996 items_per_second=14.4683M/s num_columns=10k
FieldPathGetFromWideChunkedArray/2     34847633 ns     34822898 ns           17 items_per_second=287.167k/s num_chunks=50 num_columns=10k
FieldPathGetFromWideChunkedArray/10     5530621 ns      5525915 ns          102 items_per_second=1.80966M/s num_chunks=10 num_columns=10k
FieldPathGetFromWideChunkedArray/25     2403874 ns      2403105 ns          285 items_per_second=4.16128M/s num_chunks=4 num_columns=10k
FieldPathGetFromWideChunkedArray/100    1318747 ns      1318526 ns          529 items_per_second=7.58423M/s num_chunks=1 num_columns=10k
FieldPathGetFromWideTable/2              511357 ns       511244 ns         1347 items_per_second=19.5601M/s num_chunks=50 num_columns=10k
FieldPathGetFromWideTable/10             390110 ns       390025 ns         1791 items_per_second=25.6394M/s num_chunks=10 num_columns=10k
FieldPathGetFromWideTable/25             387858 ns       387783 ns         1802 items_per_second=25.7876M/s num_chunks=4 num_columns=10k
FieldPathGetFromWideTable/100            387341 ns       387250 ns         1807 items_per_second=25.8231M/s num_chunks=1 num_columns=10k

@benibus
Collaborator Author

benibus commented Aug 8, 2023

That's probably because selecting a top-level column from a table is just a basic index into a vector, whereas for chunked arrays, you need to process all of the chunks individually to construct a new ChunkedArray that represents the child field.

Right now, the benchmarks only test top-level selections - but you'd likely see a similar degradation at higher depth levels.

@pitrou pitrou merged commit 59f30f0 into apache:main Aug 8, 2023
33 of 34 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Aug 8, 2023
@pitrou
Member

pitrou commented Aug 8, 2023

@raulcd You can cherry-pick this now.

raulcd pushed a commit that referenced this pull request Aug 9, 2023
(commit message repeats the PR description above)
Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@raulcd
Member

raulcd commented Aug 9, 2023

Thanks! I've cherry-picked it, and I'm re-running some jobs on the maintenance branch to validate its status before creating the new RC.

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 59f30f0.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
(commit message repeats the PR description above)

Successfully merging this pull request may close these issues.

[C++] Potential regression on FieldRef/FieldPath non-flattening Get methods
4 participants