GH-33526: [R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow #33614
Conversation
Force-pushed fc4cc30 to 4684c6f
@@ -271,26 +264,74 @@ check_ambiguous_options <- function(passed_opts, opts1, opts2) {
  }
}

check_schema <- function(schema, column_names) {
This function is all existing code which I just extracted out into its own function
      "'."
    ))
}

CsvFileFormat$create <- function(...) {
whereby:

- AFAIK this isn't intended to be used much by users directly,
- but it is exported, so at least some power users probably will anyway,
- and I appreciate the flattening here,
- but whenever more parameters get shoved through `...` it gets harder for users to figure out what they can specify there (and specifically here you'll have to dig through more levels of source to figure it out),

suggested:

- could we document where to look in `?FileFormat` a bit more?
- could that page specify that `CsvFileFormat` exists if this is going to `@rdname` there?
Good points; the refactoring I did here is because that bit of the code was overly complex, so in the name of usability I think it'll make more sense to write actual independent docs for this function with more explicit information about its usage.
#'
#' @inheritParams open_dataset
#' @inheritParams read_csv_arrow
#'
It might be helpful to include an example here that shows why you would want this function, i.e. what it simplifies. (A reprex of that would also be nice to include on the PR description to demonstrate why this needs to exist.)
Other things worth considering:

- Working this function into a vignette
- xrefs from the `open_dataset` docs (`@seealso`; an inline reference in the "format" argument; etc.) and the `read_delim_arrow` docs.
r/tests/testthat/test-dataset-csv.R (Outdated)
expect_identical(dim(ds), c(20L, 7L))

# delim
ds_tsv <- open_csv_dataset(tsv_dir, delim = "\t", partitioning = "part")
This kinda begs the question: should there also be `open_tsv_dataset()`, `open_delim_dataset()`, etc.?
Yeah, those make sense; added now
Just some high-level comments (happy to re-review as this progresses).
Huge improvement in the interface! The reprex I used was:
# pr_fetch(33614)
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
ds_dir <- tempfile()
write_dataset(mtcars, ds_dir, format = "csv")
# Cool! Way easier!
open_csv_dataset(ds_dir, delim = ",")
#> FileSystemDataset with 1 csv file
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> am: int64
#> gear: int64
#> carb: int64
# Do we need the `as_data_frame` arg?
open_csv_dataset(ds_dir, as_data_frame = TRUE)
#> Error in `open_csv_dataset()`:
#> ! `as_data_frame = TRUE` not supported for datasets
#> ℹ instead try calling `dplyr::collect()` or `as.data.frame()` on your dataset
#> Backtrace:
#> ▆
#> 1. └─arrow::open_csv_dataset(ds_dir, as_data_frame = TRUE)
#> 2. └─rlang::abort(...) at r/R/dataset.R:279:4
# A bit confusing if we specify parse_options?
open_csv_dataset(ds_dir, delim = ",", parse_options = CsvParseOptions$create(delimiter = "\t"))
#> Error in `open_csv_dataset()`:
#> ! The following option is supported in "read_delim_arrow" functions but not supported here: "parse_options"
#> Backtrace:
#> ▆
#> 1. └─arrow::open_csv_dataset(ds_dir, delim = ",", parse_options = CsvParseOptions$create(delimiter = "\t"))
#> 2. └─rlang::abort("The following option is supported in \"read_delim_arrow\" functions but not supported here: \"parse_options\"") at r/R/dataset.R:275:4
Created on 2023-01-13 with reprex v2.0.2
It looks like there are a few arguments to `open_csv_dataset()` that don't work, so maybe they should be omitted?
Just as a matter of curiosity, could the "who calls who" be flipped here? Like:
open_csv_dataset <- function(sources, all = the, csv = specific, options = NULL) {
  # all the csv-specific things
}

open_dataset <- function(..., format = "parquet") {
  switch(format,
    csv = open_csv_dataset(...),
    # other dataset formats
    abort("`format` must be one of ...")
  )
}
The call to `open_dataset()` at the end of `open_csv_dataset()` seems rather complicated to me and I wonder if flipping the direction could help?
This was a deliberate design decision whereby folks can easily just swap (removed now)
Force-pushed e019ea6 to fdc2658
I'd rather save this for a follow-up PR as I'm not entirely sure how that'd work.
This will be a huge improvement for working with CSV datasets! I love the concept and the cleaning up of existing code you've done along the way.
If I'm reading this correctly, the change won't affect `open_dataset()` but may affect `read_csv_arrow()`? I don't know what our existing test coverage was of this behaviour in `read_csv_arrow()`...if it didn't exist before, this PR should probably add it (particularly if you had envisioned getting this in to the 11.0.0 release).
I think the documentation for some of the parameters might need to be updated (i.e., did passing a `list()` work for the various options parameters before? What happens if you pass in something like `convert_options` AND tidyverse-style options?).
It would be great to ship this with 11.0.0 so that the new functions could get tested! That said, it is very specifically a feature and not a bug fix and we've hit feature freeze. I'm not worried this will substantively break existing usage.
mc <- match.call()
mc$delim <- ","
mc[[1]] <- get("open_delim_dataset", envir = asNamespace("arrow"))
eval.parent(mc)
This feels too magical to me...is there a reason that
open_csv_dataset <- function(...) {
open_delim_dataset(..., delim = ",")
}
wouldn't work? (That would also avoid repeating the default argument values)
Just following the pattern from `read_delim_arrow()`; happy to switch it
Actually, no, I don't want the `open_csv_dataset()` function to take ellipses as its arguments. Part of the point of this wrapper is to have explicitly documented parameters.
Point taken...let's leave it for now (and be very careful when/if any of those default values get changed so that they all stay in sync).
FTR this is how base::read.csv does it, and I don't remember the exact reason now but I recall that it was a good one.
Hmm...I think I was remembering my initial read of Advanced R that cited a subtle bug in this approach and suggested an alternative ( http://adv-r.had.co.nz/Expressions.html#capturing-call ); however, I imagine this is a minor consideration.
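To make the trade-off concrete, here's a minimal self-contained sketch of the two delegation patterns discussed above, using hypothetical stand-in functions (`open_delim`, `open_csv`, `open_csv2` — not the real arrow API):

```r
# Hypothetical stand-in for open_delim_dataset(), just for illustration
open_delim <- function(sources, delim = "\t", col_names = TRUE) {
  list(sources = sources, delim = delim, col_names = col_names)
}

# Pattern 1: call rewriting via match.call(), as in the diff above (and
# base::read.csv()). The wrapper keeps explicit, documentable formals,
# at the cost of some non-standard evaluation.
open_csv <- function(sources, col_names = TRUE) {
  mc <- match.call()
  mc$delim <- ","
  mc[[1]] <- open_delim
  eval.parent(mc)
}

# Pattern 2: plain forwarding through `...`. Simpler and avoids repeating
# default argument values, but the wrapper's signature no longer documents
# what can be passed.
open_csv2 <- function(...) {
  open_delim(..., delim = ",")
}

open_csv("file.txt")$delim   # ","
open_csv2("file.txt")$delim  # ","
```

Both calls end up in `open_delim()` with `delim = ","`; the difference is purely in what the wrapper's formals advertise to users and to roxygen.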
#' * `as_data_frame` (instead, convert to data frame after dataset creation)
#' * `parse_options`
#'
#' @examplesIf arrow_with_dataset()
The other example block you added has a skip for Windows...do you need that here, too? (Or can the skip for Windows on the other one be removed?)
I honestly have no idea; FileFormat has a skip for Windows, so I guessed it might be needed for CsvFileFormat, though it might not be. Removed them to see what happens.
r/tests/testthat/test-dataset-csv.R (Outdated)
expect_error(
  expect_equal(ds$time, "16-01-2023")
)
What is the expected error message?
The expected error message is "`object` (`actual`) not equal to `expected` (`expected`)." but I didn't think I should be testing that given it's the message from testthat? The comment above this line refers to the issue ticket I opened, but I can add something else to make it more explicit perhaps?
`expect_error()` should always have an expected error message because otherwise it might hide a completely unrelated error (e.g., `expect_error(this_is_completely_invalid_code)`). My preferred way of dealing with this is to add the failing reprex to the GitHub Issue and remove the test here...some of our existing tests do `skip("GH-XXXX timestamp parser not implemented")` to handle that.
# skip_empty_rows
tf <- tempfile()
writeLines('"x"\n"y"\nNA\nNA\n"NULL"\n\n\n', tf)
I like this better than the `write.table()` you do below/above...it's easier to see what the CSV file that's getting parsed looks like.
When it's one column I agree, though it's trickier when it's multiple columns. In order to make it more readable/obvious, I've swapped out `write.table` for `write.csv` as we use that more in this file, and moved things around so that the data frame is defined next to where it's written.
You can also use newlines in your string literals:
csv_with_single_quotes <- "x,y
'x_col1','y_col1'
'x_col2','y_col2'"
cat(csv_with_single_quotes)
#> x,y
#> 'x_col1','y_col1'
#> 'x_col2','y_col2'
I like that because I don't need to run any code to figure out what's getting parsed as a reviewer/reader of that code.
I see that we do `write.csv()` almost everywhere in test-csv.R for this, so happy to defer this to later/if ever.
opt_names <- names(args)

# extract names of parse_options, read_options, and convert_options
if ("parse_options" %in% names(args) && is.list(args[["parse_options"]])) {
What happens if `args[["parse_options"]]` is not a list? (Probably should be in the comments since I can't tell by reading this what it should or should not be)
`args[["parse_options"]]` can either be a list of options, or a `CsvParseOptions` object. This helper function is only designed to validate the former; the latter is more advanced/off-label usage and it's up to users to safely pass the correct arguments through here (you could argue that this shouldn't be the case, but doing it this way is consistent with how we've done this in other bits of the codebase).

I thought about documenting this, but I don't know where it would go - I don't think inside this function?
You could redocument `#' @param parse_options` in `open_delim_dataset()`? (Or add to the existing `#' @param parse_options` if the behaviour is the same for both `open_delim_dataset()` and `read_delim_arrow()`)
If it's off-label usage then you probably should error for it...with the new function we have the opportunity to protect users from off-label usage (which was very very easy to do with the existing `open_dataset()`). Passing `parse_options = CsvParseOptions$create(...)` was the first thing I thought to do and so I imagine somebody else might think to do it, too.
`parse_options = CsvParseOptions$create(...)` is the way that users have been doing things thus far, as passing in lists is new - this additional way of doing things only got merged in in the past few weeks. I'm not sure what the right call is here; I don't want to error on passing in a `CsvParseOptions` object and break folks' existing code, but do want to encourage passing in options as a list so it actually hits our validation code.
I see...let's consider that later, then, but make it clear in this function that those are the two options for future us trying to read this code.
I added more info here, and more explicit acknowledgement of this to CsvFileFormat$create()
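The resulting split could be summarised by a sketch along these lines (hypothetical helper and option names, not the actual arrow internals): list-style options are validated against a known set of names, while a pre-built options object is passed through untouched.

```r
# Hypothetical sketch of the validation split described above; the real
# arrow internals differ. Only list-style parse options are validated;
# a non-list value (e.g. a CsvParseOptions object) is off-label usage
# and is passed through as-is.
check_parse_opts <- function(parse_options,
                             allowed = c("delimiter", "quote", "escape_backslash")) {
  if (!is.list(parse_options)) {
    # assume a pre-built options object; the caller is responsible for it
    return(parse_options)
  }
  unknown <- setdiff(names(parse_options), allowed)
  if (length(unknown) > 0) {
    stop("Unsupported parse option(s): ", paste(unknown, collapse = ", "))
  }
  parse_options
}

check_parse_opts(list(delimiter = ";"))  # validated and returned
# check_parse_opts(list(bogus = 1))      # errors: Unsupported parse option(s): bogus
```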
r/R/dataset.R (Outdated)
#' # Set up directory for examples
#' tf <- tempfile()
#' dir.create(tf)
#' on.exit(unlink(tf))
I don't know if `on.exit()` works the way you think it will in an example...my guess is that it won't work if there are any `on.exit()`s in any other examples. I would just put `unlink(tf)` at the bottom of your example.
Done! We do this in loads of examples; we should probably update those too.
It might work...`on.exit()` when the current environment is the global environment is a little hairy (`on.exit(..., add = TRUE)` would be more correct, but some other `on.exit()` without `add = TRUE` might cause the cleanup code you added not to run, if my understanding of `on.exit()` is correct).
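A quick base-R illustration of that hairiness (self-contained, not arrow-specific): a second `on.exit()` call without `add = TRUE` silently replaces the first handler rather than adding to it.

```r
log <- character()

f <- function() {
  on.exit(log <<- c(log, "first"))
  on.exit(log <<- c(log, "second"))  # without add = TRUE this REPLACES the first handler
}
f()
log
#> [1] "second"

log <- character()

g <- function() {
  on.exit(log <<- c(log, "first"))
  on.exit(log <<- c(log, "second"), add = TRUE)  # both handlers run, in order
}
g()
log
#> [1] "first" "second"
```

So an example that registers cleanup via `on.exit()` can have that cleanup silently discarded if later example code registers its own handler in the same environment.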
r/R/dataset.R (Outdated)
#' @seealso [open_dataset()]
#' @export
open_delim_dataset <- function(sources,
  schema = NULL,
Indent to line up to `function(`?
Re-indented so all args align
For clarity, this won't affect the behaviour of
These changes look great!

You may want to rebase and `devtools::document()`...when I pulled this locally and ran `devtools::document()` I saw a few changes.
GH-33526: [R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow (#33614)

This PR implements a wrapper around `open_dataset()` specifically for value-delimited files. It takes the parameters from `open_dataset()` and appends the parameters of `read_csv_arrow()` which are compatible with `open_dataset()`. This should make it easier for users to switch between the two, e.g.:

``` r
library(arrow)
library(dplyr)

# Set up directory for examples
tf <- tempfile()
dir.create(tf)
on.exit(unlink(tf))

df <- data.frame(x = c("1", "2", "NULL"))
file_path <- file.path(tf, "file1.txt")
write.table(df, file_path, sep = ",", row.names = FALSE)

read_csv_arrow(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1)
#> # A tibble: 3 × 1
#>       y
#>   <int>
#> 1     1
#> 2     2
#> 3    NA

open_csv_dataset(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1) %>%
  collect()
#> # A tibble: 3 × 1
#>       y
#>   <int>
#> 1     1
#> 2     2
#> 3    NA
```

This PR also hooks up the "na" (readr-style) parameter to "null_values" (i.e. the CSVConvertOptions parameter).

In the process of making this PR, I also refactored `CsvFileFormat$create()`. Unfortunately, many changes needed to be made at once, which has considerably increased the size/complexity of this PR.

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Benchmark runs are scheduled for baseline = 04ffb1f and contender = 444dcb6. 444dcb6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.