Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41834: [R] Better error handling in dplyr code #41576

Merged
merged 12 commits into from
May 29, 2024

Conversation

nealrichardson
Copy link
Member

@nealrichardson nealrichardson commented May 7, 2024

I started out trying to make it so that arrow_eval() could just raise its errors, rather than catch them and have every caller inspect and re-raise. I ended up pulling on this further and ended up refactoring most of the error handling in the dplyr code paths. Summary of changes, from the bottom up:

  • We have two wrappers that raise classed errors: arrow_not_supported() (which previously existed but just called stop()) and validation_error(). They raise arrow_not_supported and validation_error, respectively. Function bindings now raise one or the other, never just stop/abort.
  • arrow_eval() modifies the errors raised by function bindings, inserting the expression as the call attribute of the error, which lets rlang handle the printing cleaner, and catching any non-classed errors and re-raising them as arrow_not_supported or validation_error, as appropriate.
  • New try_arrow_dplyr() wrapper around everything inside (most*) dplyr verb implementations, which only calls abandon_ship() on arrow_not_supported errors, and lets all other errors just raise. For datasets, it just adds an additional note to the error message advising you that you can call collect(). So errors generally bubble up, and each of these wrappers adds some context to the message.
  • I also removed the developer vignette writing_bindings.Rmd, per this comment.

The ultimate results of all of this:

  • We now don't tell people to collect() (or, if on in-memory data, just do it) in cases where it would also fail in regular dplyr because the input is invalid.
  • Nicer error printing across the board, using rlang/cli for formatting, and cleaner calls and tracebacks. No more Error: Error : messages.
  • Adds the ability to provide helpful suggestions in error messages in bindings, for cases where there is an alternative available other than just collect(). In fact, if there are suggestions with the ">" (arrow) bullet, we don't just add "Call collect()", we say "Or, call collect()".
  • For us, it should be easier to work with arrow_eval() and the dplyr verbs in general. There's less bookkeeping you have to do to catch and rethrow errors, and it's consistent across the various parts of the evaluation (i.e. the same thing works inside the dplyr verbs as in the bindings).

Some concrete examples:

  1. Invalid input in a binding. Retry with dplyr won't help, so don't automatically do it (if Table) or suggest it (if Dataset).
# Before: 
mtcars |> 
  arrow_table() |> 
  transmute(case_when())
#> Warning: Expression case_when() not supported in Arrow; pulling data into R
#> Error:
#> ℹ In argument: `case_when()`.
#> Caused by error in `case_when()`:
#> ! At least one condition must be supplied.

# After:
mtcars |>
  arrow_table() |>
  transmute(case_when())
#> Error in `case_when()`:
#> ! No cases provided
  1. Dealing with unsupported features outside of the bindings. This example is something that is checked in summarize() but not caught inside arrow_eval() because it's not about the expressions.
# Before:
mtcars |> 
  InMemoryDataset$create() |> 
  group_by(cyl) |> 
  summarize(mean(hp), .groups = "rowwise")
#> Error: Error : .groups = "rowwise" not supported in Arrow
#> Call collect() first to pull data into R.

# After:
mtcars |>
  InMemoryDataset$create() |> 
  group_by(cyl) |> 
  summarize(mean(hp), .groups = "rowwise")
#> Error in `summarise.arrow_dplyr_query()`:
#> ! .groups = "rowwise" not supported in Arrow
#> → Call collect() first to pull data into R.
  1. When there are ways to solve the issue other than calling collect(), we give the user options:
# After:
mtcars |>
  InMemoryDataset$create() |> 
  transmute(date = as.Date(mpg, tryFormats = c("%Y-%m-%d", "%Y/%m/%d")))
#> Error in `as.Date()`:
#> ! `as.Date()` with multiple `tryFormats` not supported in Arrow
#> → Consider using the lubridate specialised parsing functions `ymd()`, `ymd()`, etc.
#> → Or, call collect() first to pull data into R.

Copy link

github-actions bot commented May 7, 2024

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@nealrichardson nealrichardson changed the title WIP [R] Better error handling in dplyr code GH-41834: [R] Better error handling in dplyr code May 26, 2024
Copy link

⚠️ GitHub issue #41834 has been automatically assigned in GitHub to PR creator.

abort("`...` argument to `across()` is deprecated in dplyr and not supported in Arrow")
arrow_not_supported(
"`...` argument to `across()` is deprecated in dplyr and",
body = c(">" = "Convert your call into a function or formula including the arguments"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL about this making arrows!

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels May 27, 2024
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! I'm really excited to see more helpful warnings.

I went through some of the code pretty thoroughly, but mostly skimmed the dplyr-{verb}.R files since those are all(?) indentation changes, yeah?

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels May 27, 2024
names(sorts)[i] <- format_expr(exprs[[i]])
if (inherits(sorts[[i]], "try-error")) {
msg <- paste("Expression", names(sorts)[i], "not supported in Arrow")
return(abandon_ship(call, .data, msg))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example of "not just an indentation change": in the new code, we don't have to evaluate, catch the error, and re-raise in abandon_ship, we just let arrow_eval() raise, and try_arrow_dplyr() handles the abandon_ship.

!is.null(results[[new_var]])) {
# We need some wrapping to handle literal values
if (length(results[[new_var]]) != 1) {
arrow_not_supported("Recycling values of length != 1", call = exprs[[i]])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's another not-just-indentation change: for validations/errors outside of arrow_eval, we just raise arrow_not_supported or validation_error like in the function bindings.

@nealrichardson
Copy link
Member Author

I went through some of the code pretty thoroughly, but mostly skimmed the dplyr-{verb}.R files since those are all(?) indentation changes, yeah?

Mostly. I just went back and commented on the PR in a couple places that show some of the non-indentation changes.

@nealrichardson
Copy link
Member Author

I should probably go and add some sentences to the writing_bindings.Rmd vignette, at least to x-ref to the man page I added for the new errors.

@thisisnic
Copy link
Member

I should probably go and add some sentences to the writing_bindings.Rmd vignette, at least to x-ref to the man page I added for the new errors.

Do we definitely still want/need that article? I am a big +1 to removing redundant docs/code, and given that it's buried in the developer docs and it's not likely there'll be a ton of new Acero functions, we could, like, just delete it?

Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always a fan of UX changes like this, and I love the usage of to suggest a concrete action. Out of curiosity, is this something we're emulating from somewhere else or something you came up with on this PR @nealrichardson?

@nealrichardson
Copy link
Member Author

Do we definitely still want/need that article? I am a big +1 to removing redundant docs/code, and given that it's buried in the developer docs and it's not likely there'll be a ton of new Acero functions, we could, like, just delete it?

I'm cool with deleting it. You're right that it's from another era in the package's development. And if someone is going to add more bindings, there's hundreds of examples to copy now.

Always a fan of UX changes like this, and I love the usage of to suggest a concrete action. Out of curiosity, is this something we're emulating from somewhere else or something you came up with on this PR @nealrichardson?

I guess I came up with it. Looking at the options in cli (https://cli.r-lib.org/reference/cli_bullets.html), I wanted to reserve the i information ones for clarifying details, the others didn't seem appropriate, and, well, arrow just seemed like a logical choice for this package :)

@nealrichardson nealrichardson merged commit 774ee0f into apache:main May 29, 2024
11 of 12 checks passed
@nealrichardson nealrichardson deleted the better-errors branch May 29, 2024 15:37
nealrichardson added a commit that referenced this pull request May 30, 2024
### Rationale for this change

Missed this in #41576

### Are these changes tested?

We should make sure.

### Are there any user-facing changes?

No.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants