
GH-33526: [R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow #33614

Merged
merged 36 commits into from Jan 17, 2023

Conversation

thisisnic
Member

@thisisnic thisisnic commented Jan 11, 2023

This PR implements a wrapper around open_dataset() specifically for value-delimited files. It takes the parameters from open_dataset() and appends the parameters of read_csv_arrow() which are compatible with open_dataset(). This should make it easier for users to switch between the two, e.g.:

library(arrow)
library(dplyr)

# Set up directory for examples
tf <- tempfile()
dir.create(tf)
on.exit(unlink(tf))
df <- data.frame(x = c("1", "2", "NULL"))

file_path <- file.path(tf, "file1.txt")
write.table(df, file_path, sep = ",", row.names = FALSE)

read_csv_arrow(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1)
#> # A tibble: 3 × 1
#>       y
#>   <int>
#> 1     1
#> 2     2
#> 3    NA

open_csv_dataset(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1) %>% collect()
#> # A tibble: 3 × 1
#>       y
#>   <int>
#> 1     1
#> 2     2
#> 3    NA

This PR also hooks up the "na" (readr-style) parameter to "null_values" (i.e. the CsvConvertOptions parameter).

In the process of making this PR, I also refactored CsvFileFormat$create(). Unfortunately, many changes needed to be made at once, which has considerably increased the size/complexity of this PR.

@github-actions

⚠️ GitHub issue #33526 has been automatically assigned in GitHub to PR creator.

@@ -271,26 +264,74 @@ check_ambiguous_options <- function(passed_opts, opts1, opts2) {
}
}

check_schema <- function(schema, column_names) {
Member Author

This function is all existing code which I just extracted out into its own function

"'."
))
}
CsvFileFormat$create <- function(...) {
Contributor

whereby:

  • AFAIK this isn't intended to be used much by users directly,
  • but it is exported, so at least some power users probably will anyway,
  • and I appreciate the flattening here,
  • but whenever more parameters get shoved through ... it gets harder for users to figure out what they can specify there (and specifically here you'll have to dig through more levels of source to figure it out),

suggested:

  • could we document where to look in ?FileFormat a bit more?
  • could that page specify that CsvFileFormat exists if this is going to @rdname there?

Member Author

Good points made; the refactoring I did here is because that bit of the code was overly complex, so in the name of usability, I think it'll make more sense to write actual independent docs for this function with more explicit information about its usage.

#'
#' @inheritParams open_dataset
#' @inheritParams read_csv_arrow
#'
Member

It might be helpful to include an example here that shows why you would want this function, i.e. what it simplifies. (A reprex of that would also be nice to include on the PR description to demonstrate why this needs to exist.)

Other things worth considering:

  • Working this function into a vignette
  • xrefs from the open_dataset docs (@seealso; an inline reference in the "format" argument; etc.) and the read_delim_arrow docs.

expect_identical(dim(ds), c(20L, 7L))

# delim
ds_tsv <- open_csv_dataset(tsv_dir, delim = "\t", partitioning = "part")
Member

This kinda begs the question: should there also be open_tsv_dataset(), open_delim_dataset(), etc.?

Member Author

Yeah, those make sense; added now
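For illustration of the shape this family of wrappers could take, here is a hypothetical base-R sketch (the `_sketch` names and the string return value are stand-ins, not the actual arrow implementation): each thin wrapper simply fixes `delim` on a shared delimited-file worker.

```r
# Hypothetical sketch only: each wrapper fixes `delim` on a shared
# open_delim_dataset()-style worker; the real arrow functions take many
# more arguments and return a Dataset, not a string.
open_delim_dataset_sketch <- function(sources, delim = ",") {
  paste("opening", sources, "with delim", delim)  # stand-in for the real work
}
open_csv_dataset_sketch <- function(sources) open_delim_dataset_sketch(sources, delim = ",")
open_tsv_dataset_sketch <- function(sources) open_delim_dataset_sketch(sources, delim = "\t")
```

The appeal of this shape is that the CSV/TSV variants stay one line each and cannot drift apart in behaviour.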

Member

@paleolimbot paleolimbot left a comment

Just some high-level comments (happy to re-review as this progresses).

Huge improvement in the interface! The reprex I used was:

# pr_fetch(33614)
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

ds_dir <- tempfile()
write_dataset(mtcars, ds_dir, format = "csv")

# Cool! Way easier!
open_csv_dataset(ds_dir, delim = ",")
#> FileSystemDataset with 1 csv file
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> am: int64
#> gear: int64
#> carb: int64

# Do we need the `as_data_frame` arg?
open_csv_dataset(ds_dir, as_data_frame = TRUE)
#> Error in `open_csv_dataset()`:
#> ! `as_data_frame = TRUE` not supported for datasets
#> ℹ instead try calling `dplyr::collect()` or `as.data.frame()` on your dataset

#> Backtrace:
#>     ▆
#>  1. └─arrow::open_csv_dataset(ds_dir, as_data_frame = TRUE)
#>  2.   └─rlang::abort(...) at r/R/dataset.R:279:4

# A bit confusing if we specify parse_options?
open_csv_dataset(ds_dir, delim = ",", parse_options = CsvParseOptions$create(delimiter = "\t"))
#> Error in `open_csv_dataset()`:
#> ! The following option is supported in "read_delim_arrow" functions but not supported here: "parse_options"

#> Backtrace:
#>     ▆
#>  1. └─arrow::open_csv_dataset(ds_dir, delim = ",", parse_options = CsvParseOptions$create(delimiter = "\t"))
#>  2.   └─rlang::abort("The following option is supported in \"read_delim_arrow\" functions but not supported here: \"parse_options\"") at r/R/dataset.R:275:4

Created on 2023-01-13 with reprex v2.0.2

It looks like there are a few arguments to open_csv_dataset() that don't work, so maybe they should be omitted?

Just as a matter of curiosity, could the "who calls who" be flipped here? Like:

open_csv_dataset <- function(sources, all = the, csv = specific, options = NULL) {
  # all the csv-specific things
}

open_dataset <- function(..., format = "parquet") {
  switch(format,
    csv = open_csv_dataset(...),
    # other dataset formats
    abort("`format` must be one of ...")
  )
}

The call to open_dataset() at the end of open_csv_dataset() seems rather complicated to me and I wonder if flipping the direction could help?

@thisisnic
Member Author

thisisnic commented Jan 16, 2023

It looks like there are a few arguments to open_csv_dataset() that don't work, so maybe they should be omitted?

This was a deliberate design decision so that folks can easily swap read_csv_arrow() and open_csv_dataset() calls without changing parameters, and so they get more informative error messages than unused argument when they hit options that aren't compatible with both. I'm not too attached to it though - it doesn't feel like great practice either, and the unsupported arguments aren't the most widely used.

(removed now)

@thisisnic thisisnic marked this pull request as ready for review January 16, 2023 20:32
r/R/csv.R (resolved)
@thisisnic
Member Author

The call to open_dataset() at the end of open_csv_dataset() seems rather complicated to me and I wonder if flipping the direction could help?

I'd rather save this for a follow-up PR as I'm not entirely sure how that'd work.

Member

@paleolimbot paleolimbot left a comment

This will be a huge improvement for working with CSV datasets! I love the concept and the cleaning up of existing code you've done along the way.

If I'm reading this correctly, the change won't affect open_dataset() but may affect read_csv_arrow()? I don't know what our existing test coverage was of this behaviour in read_csv_arrow()...if it didn't exist before, this PR should probably add it (particularly if you had envisioned getting this in to the 11.0.0 release).

I think the documentation for some of the parameters might need to be updated (i.e., did passing a list() work for the various options parameters work before? What happens if you pass in something like convert_options AND tidyverse-style options?).

It would be great to ship this with 11.0.0 so that the new functions could get tested! That said, it is very specifically a feature and not a bug fix and we've hit feature freeze. I'm not worried this will substantiatively break existing usage.

Comment on lines +323 to +326
mc <- match.call()
mc$delim <- ","
mc[[1]] <- get("open_delim_dataset", envir = asNamespace("arrow"))
eval.parent(mc)
Member

This feels too magical to me...is there a reason that

open_csv_dataset <- function(...) {
  open_delim_dataset(..., delim = ",")
}

wouldn't work? (That would also avoid repeating the default argument values)

Member Author

Just following the pattern from read_delim_arrow(); happy to switch it

Member Author

Actually, no, I don't want the open_csv_dataset() function to take ellipses as its arguments. Part of the point of this wrapper is to have explicitly documented parameters.

Member

Point taken...let's leave it for now (and be very careful when/if any of those default values get changed so that they all stay in sync).

Member

FTR this is how base::read.csv does it, and I don't remember the exact reason now but I recall that it was a good one.

Member

Hmm...I think I was remembering my initial read of Advanced R that cited a subtle bug in this approach and suggested an alternative ( http://adv-r.had.co.nz/Expressions.html#capturing-call ); however, I imagine this is a minor consideration.
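For readers unfamiliar with the pattern being debated, here is a minimal base-R sketch of the match.call() forwarding technique (the `f_csv`/`f_delim` toys are illustrative, not the arrow functions): the wrapper captures its own call, fixes an argument, retargets the call at the worker, and evaluates it in the caller's frame.

```r
# Toy illustration of the match.call() forwarding pattern: f_csv() captures
# its own call, adds delim = ",", replaces the called function with f_delim,
# and evaluates the rewritten call in the parent frame.
f_delim <- function(x, delim) paste0(x, delim)

f_csv <- function(x) {
  mc <- match.call()        # f_csv(x = ...)
  mc$delim <- ","           # add the fixed argument
  mc[[1]] <- f_delim        # retarget the call at the worker
  eval.parent(mc)
}

f_csv("a")  # equivalent to f_delim(x = "a", delim = ",")
#> [1] "a,"
```

The upside is that defaults live in one place; the downside, as noted above, is that call rewriting is harder to read than a plain `f_delim(x, delim = ",")` wrapper.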

#' * `as_data_frame` (instead, convert to data frame after dataset creation)
#' * `parse_options`
#'
#' @examplesIf arrow_with_dataset()
Member

The other example block you added has a skip for Windows...do you need that here, too? (Or can the skip for windows on the other one be removed?)

Member Author

@thisisnic thisisnic Jan 17, 2023

I honestly have no idea; FileFormat has a skip for Windows, so I guessed it might be needed for CsvFileFormat, though it might not be. Removed them to see what happens.

Comment on lines 540 to 542
expect_error(
expect_equal(ds$time, "16-01-2023")
)
Member

What is the expected error message?

Member Author

The expected error message is "object (actual) not equal to expected (expected)." but I didn't think I should be testing that given it's the message from testthat? Comment above this line refers to the issue ticket I opened, but I can add something else to make it more explicit perhaps?

Member

expect_error() should always have an expected error because it might hide a completely unrelated error (e.g., expect_error(this_is_completely_invalid_code)). My preferred way of dealing with this is to add the failing reprex to the GitHub Issue and remove the test here...some of our existing tests do skip("GH-XXXX timestamp parser not implemented") to handle that.


# skip_empty_rows
tf <- tempfile()
writeLines('"x"\n"y"\nNA\nNA\n"NULL"\n\n\n', tf)
Member

I like this better than the write.table() you do below/above...it's easier to see what the csv file that's getting parsed is.

Member Author

When it's one column I agree, though it's trickier when it's multiple columns. In order to make it more readable/obvious, I've swapped out write.table for write.csv as we use that more in this file, and moved things around so that the data frame is defined next to where it's written.

Member

You can also use newlines in your string literals:

csv_with_single_quotes <- "x,y
'x_col1','y_col1'
'x_col2','y_col2'"

cat(csv_with_single_quotes)
#> x,y
#> 'x_col1','y_col1'
#> 'x_col2','y_col2'

I like that because I don't need to run any code to figure out what's getting parsed as a reviewer/reader of that code.

Member

I see that we do write.csv() almost everywhere in test-csv.R for this, so happy to defer this to later/if ever.
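As a base-R aside on why inline literals are convenient in tests (no arrow required), textConnection() lets a fixture be parsed right where it is defined, so the reader never has to run code to see the input:

```r
# The CSV fixture is visible in the source; textConnection() parses it
# without writing a temporary file.
csv_text <- "x,y
1,a
2,b"

df <- read.csv(textConnection(csv_text))
df
#>   x y
#> 1 1 a
#> 2 2 b
```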

opt_names <- names(args)

# extract names of parse_options, read_options, and convert_options
if ("parse_options" %in% names(args) && is.list(args[["parse_options"]])) {
Member

What happens if args[["parse_options"]] is not a list? (Probably should be in the comments since I can't tell by reading this what it should or should not be)

Member Author

args[["parse_options"]] can either be a list of options, or a CsvParseOptions object. This helper function is only designed to validate the former; the latter is more advanced/off-label usage and it's up to users to safely pass the correct arguments through here (you could argue that this shouldn't be the case, but doing it this way is consistent with how we've done this in other bits of the codebase).

I thought about documenting this, but I don't know where it would go - I don't think inside this function?

Member

You could redocument #' @param parse_options in open_delim_dataset()? (Or add to the existing #' @param parse_options if the behaviour is the same for both open_delim_dataset() and read_delim_arrow())

Member

If it's off-label usage then you probably should error for it...with the new function we have the opportunity to protect users from off-label usage (which was very very easy to do with the existing open_dataset()). Passing parse_options = CsvParseOptions$create(...) was the first thing I thought to do and so I imagine somebody else might think to do it, too.

Member Author

@thisisnic thisisnic Jan 17, 2023

parse_options = CsvParseOptions$create(...) is the way that users have been doing things thus far as passing in lists is new - this additional way of doing things only got merged in in the past few weeks. I'm not sure what the right call is here; I don't want to error on passing in a CsvParseOptions object and break folks' existing code but do want to encourage passing in options as a list so it actually hits our validation code.

Member

I see...let's consider that later, then, but make it clear in this function that those are the two options for future us trying to read this code.

Member Author

I added more info here, and more explicit acknowledgement of this to CsvFileFormat$create()
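A hedged sketch of the dual interface described in this thread — plain lists get validated, pre-built options objects pass through untouched. The option names and the class check below are illustrative stand-ins, not arrow's actual code:

```r
# Illustrative only: a plain named list is checked against known option
# names, while an object carrying a "CsvParseOptions" class attribute
# bypasses validation, mirroring the "off-label" path discussed above.
check_parse_options <- function(parse_options,
                                known = c("delimiter", "quote_char")) {
  if (is.list(parse_options) && !inherits(parse_options, "CsvParseOptions")) {
    bad <- setdiff(names(parse_options), known)
    if (length(bad) > 0) {
      stop("Unsupported parse option(s): ", paste(bad, collapse = ", "))
    }
  }
  parse_options
}

check_parse_options(list(delimiter = "\t"))   # passes validation
# check_parse_options(list(nope = 1))         # errors: unsupported option
```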

r/R/dataset.R Outdated
#' # Set up directory for examples
#' tf <- tempfile()
#' dir.create(tf)
#' on.exit(unlink(tf))
Member

I don't know if on.exit() works the way you think it will in an example...my guess is that it won't work if there are any on.exit()s in any other examples. I would just put unlink(tf) at the bottom of your example.

Member Author

@thisisnic thisisnic Jan 17, 2023

Done! We do this in loads of examples, we should probably update those too.

Member

It might work...on.exit() when the current environment is the global environment is a little hairy (on.exit(..., add = TRUE) would be more correct, but some other on.exit() without add = TRUE might cause the cleanup code you added not to run, if my understanding of on.exit() is correct).
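The on.exit() subtlety can be demonstrated in a few lines of standalone base R (toy functions, unrelated to the arrow examples themselves):

```r
# Without add = TRUE, a second on.exit() call replaces the first handler;
# with add = TRUE it appends. This is why cleanup registered earlier can
# silently disappear.
calls <- character()

f <- function() {
  on.exit(calls <<- c(calls, "first"))
  on.exit(calls <<- c(calls, "second"))         # replaces "first"
  invisible(NULL)
}
g <- function() {
  on.exit(calls <<- c(calls, "a"))
  on.exit(calls <<- c(calls, "b"), add = TRUE)  # appends to "a"
  invisible(NULL)
}

f(); g()
# calls is now c("second", "a", "b") - "first" never ran
```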

r/R/dataset.R Outdated
#' @seealso [open_dataset()]
#' @export
open_delim_dataset <- function(sources,
schema = NULL,
Member

Indent to line up with function(?

Member Author

Re-indented so all args align

@thisisnic
Member Author

For clarity, this won't affect the behaviour of open_dataset() or read_csv_arrow(). Passing in lists worked before, but I've expanded the documentation to explicitly reference the fact that this works here.

Member

@paleolimbot paleolimbot left a comment

These changes look great!

You may want to rebase and devtools::document()...when I pulled this locally and ran devtools::document() I saw a few changes.

@thisisnic thisisnic merged commit 444dcb6 into apache:master Jan 17, 2023
raulcd pushed a commit that referenced this pull request Jan 18, 2023
…more closely matching read_csv_arrow (#33614)

This PR implements a wrapper around `open_dataset()` specifically for value-delimited files. It takes the parameters from `open_dataset()` and appends the parameters of `read_csv_arrow()` which are compatible with `open_dataset()`. This should make it easier for users to switch between the two, e.g.:

``` r
library(arrow)
library(dplyr)

# Set up directory for examples
tf <- tempfile()
dir.create(tf)
on.exit(unlink(tf))
df <- data.frame(x = c("1", "2", "NULL"))

file_path <- file.path(tf, "file1.txt")
write.table(df, file_path, sep = ",", row.names = FALSE)

read_csv_arrow(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1)
#> # A tibble: 3 × 1
#>       y
#>   <int>
#> 1     1
#> 2     2
#> 3    NA

open_csv_dataset(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1) %>% collect()
#> # A tibble: 3 × 1
#>       y
#>   <int>
#> 1     1
#> 2     2
#> 3    NA
```

This PR also hooks up the "na" (readr-style) parameter to "null_values" (i.e. the CsvConvertOptions parameter).

In the process of making this PR, I also refactored `CsvFileFormat$create()`. Unfortunately, many changes needed to be made at once, which has considerably increased the size/complexity of this PR.

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
@jorisvandenbossche jorisvandenbossche added this to the 11.0.0 milestone Jan 18, 2023
@ursabot

ursabot commented Jan 19, 2023

Benchmark runs are scheduled for baseline = 04ffb1f and contender = 444dcb6. 444dcb6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.19% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 444dcb67 ec2-t3-xlarge-us-east-2
[Finished] 444dcb67 test-mac-arm
[Finished] 444dcb67 ursa-i9-9960x
[Finished] 444dcb67 ursa-thinkcentre-m75q
[Finished] 04ffb1f7 ec2-t3-xlarge-us-east-2
[Failed] 04ffb1f7 test-mac-arm
[Finished] 04ffb1f7 ursa-i9-9960x
[Finished] 04ffb1f7 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
