Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Potential regression on FieldRef/FieldPath non-flattening Get methods #36892

Closed
pitrou opened this issue Jul 26, 2023 · 4 comments · Fixed by #37032
Closed

[C++] Potential regression on FieldRef/FieldPath non-flattening Get methods #36892

pitrou opened this issue Jul 26, 2023 · 4 comments · Fixed by #37032

Comments

@pitrou
Copy link
Member

pitrou commented Jul 26, 2023

Describe the bug, including details regarding any error messages, version, and platform.

I did not pay attention initially, but it seems #35197 introduced a large regression on the wide-dataframe benchmark.

See benchmark results here:
https://conbench.ursa.dev/compare/runs/9cf73ac83f0a44179e6538b2c1c7babd...3d76cb5ffb8849bf8c3ea9b32d08b3b7/

Note the benchmark is creating a dataframe with 10000 columns.

Component(s)

C++

@pitrou
Copy link
Member Author

pitrou commented Jul 26, 2023

@benibus Would you have time to take a look at this?
It would be nice to add benchmarks for the non-flattening methods. They will also help measuring the slowdown before/after that PR.

@raulcd
Copy link
Member

raulcd commented Jul 28, 2023

@benibus @westonpace I was planning on waiting a couple days to create the next RC (Tuesday?) to see if this was able to make it to 13.0.0. If this sounds totally unreasonable let me know and I can create the new RC sooner as discussed on Zulip we might want to ship 13.0.0 with this known issue and apply the fix targeting 14.0.0.

@raulcd raulcd added the Priority: Blocker Marks a blocker for the release label Jul 28, 2023
@raulcd raulcd added this to the 13.0.0 milestone Jul 28, 2023
@raulcd
Copy link
Member

raulcd commented Jul 31, 2023

@benibus any news here? I plan to create the new RC tomorrow unless there's possibility for this to be solved soon.

@raulcd raulcd modified the milestones: 13.0.0, 14.0.0 Aug 1, 2023
@raulcd raulcd removed the Priority: Blocker Marks a blocker for the release label Aug 1, 2023
@benibus
Copy link
Collaborator

benibus commented Aug 1, 2023

Should have a fix for this by the end of the week, provided my diagnosis is actually complete.

At the very least, the PR introduced substantial overhead for calling FieldPath::Get on record batches specifically. It seems this is due to a few instances where usages of RecordBatch::column_data() were replaced with columns(), which is far less trivial (it loads every column atomically on each invocation).

That being said, the two aren't exactly interchangeable in the implementation so there's some restructuring in order.

@jorisvandenbossche jorisvandenbossche modified the milestones: 14.0.0, 13.0.0 Aug 8, 2023
pitrou added a commit that referenced this issue Aug 8, 2023
### Rationale for this change

#35197 appears to have introduced significant performance regressions in `FieldPath::Get` - indicated [here](https://conbench.ursa.dev/compare/runs/9cf73ac83f0a44179e6538b2c1c7babd...3d76cb5ffb8849bf8c3ea9b32d08b3b7/), in a benchmark that uses a wide (10K column) dataframe.

### What changes are included in this PR?

- Adds basic benchmarks for `FieldPath::Get` across various input types, as they didn't previously exist
- Addresses several performance issues. These came in the form of extremely high upfront costs for the `RecordBatch` and `ArrayData` overloads specifically
- Some minor refactoring of `NestedSelector`

### Are these changes tested?

Yes (covered by existing tests)

### Are there any user-facing changes?

No

* Closes: #36892

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou modified the milestones: 13.0.0, 14.0.0 Aug 8, 2023
@assignUser assignUser modified the milestones: 14.0.0, 13.0.0 Aug 8, 2023
raulcd pushed a commit that referenced this issue Aug 9, 2023
### Rationale for this change

#35197 appears to have introduced significant performance regressions in `FieldPath::Get` - indicated [here](https://conbench.ursa.dev/compare/runs/9cf73ac83f0a44179e6538b2c1c7babd...3d76cb5ffb8849bf8c3ea9b32d08b3b7/), in a benchmark that uses a wide (10K column) dataframe.

### What changes are included in this PR?

- Adds basic benchmarks for `FieldPath::Get` across various input types, as they didn't previously exist
- Addresses several performance issues. These came in the form of extremely high upfront costs for the `RecordBatch` and `ArrayData` overloads specifically
- Some minor refactoring of `NestedSelector`

### Are these changes tested?

Yes (covered by existing tests)

### Are there any user-facing changes?

No

* Closes: #36892

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…apache#37032)

### Rationale for this change

apache#35197 appears to have introduced significant performance regressions in `FieldPath::Get` - indicated [here](https://conbench.ursa.dev/compare/runs/9cf73ac83f0a44179e6538b2c1c7babd...3d76cb5ffb8849bf8c3ea9b32d08b3b7/), in a benchmark that uses a wide (10K column) dataframe.

### What changes are included in this PR?

- Adds basic benchmarks for `FieldPath::Get` across various input types, as they didn't previously exist
- Addresses several performance issues. These came in the form of extremely high upfront costs for the `RecordBatch` and `ArrayData` overloads specifically
- Some minor refactoring of `NestedSelector`

### Are these changes tested?

Yes (covered by existing tests)

### Are there any user-facing changes?

No

* Closes: apache#36892

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants