
GH-33526: [R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow #33614

Merged
merged 36 commits into from Jan 17, 2023

Conversation

thisisnic
Member

@thisisnic thisisnic commented Jan 11, 2023

This PR implements a wrapper around open_dataset() specifically for value-delimited files. It takes the parameters from open_dataset() and appends the parameters of read_csv_arrow() which are compatible with open_dataset(). This should make it easier for users to switch between the two, e.g.:

library(arrow)
library(dplyr)

# Set up directory for examples
tf <- tempfile()
dir.create(tf)
on.exit(unlink(tf))
df <- data.frame(x = c("1", "2", "NULL"))

file_path <- file.path(tf, "file1.txt")
write.table(df, file_path, sep = ",", row.names = FALSE)

read_csv_arrow(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1)
#> # A tibble: 3 × 1
#>       y
#>   <int>
#> 1     1
#> 2     2
#> 3    NA

open_csv_dataset(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1) %>% collect()
#> # A tibble: 3 × 1
#>       y
#>   <int>
#> 1     1
#> 2     2
#> 3    NA

This PR also hooks up the "na" (readr-style) parameter to "null_values" (i.e. the CsvConvertOptions parameter).

In the process of making this PR, I also refactored CsvFileFormat$create(). Unfortunately, many changes needed to be made at once, which has considerably increased the size/complexity of this PR.

@github-actions

⚠️ GitHub issue #33526 has been automatically assigned in GitHub to PR creator.

@@ -271,26 +264,74 @@ check_ambiguous_options <- function(passed_opts, opts1, opts2) {
}
}

check_schema <- function(schema, column_names) {
Member Author

This function is all existing code which I just extracted out into its own function

"'."
))
}
CsvFileFormat$create <- function(...) {
Contributor

whereby:

  • AFAIK this isn't intended to be used much by users directly,
  • but it is exported, so at least some power users probably will anyway,
  • and I appreciate the flattening here,
  • but whenever more parameters get shoved through ... it gets harder for users to figure out what they can specify there (and specifically here you'll have to dig through more levels of source to figure it out),

suggested:

  • could we document where to look in ?FileFormat a bit more?
  • could that page specify that CsvFileFormat exists if this is going to @rdname there?

Member Author

Good points made; the refactoring I did here is because that bit of the code was overly complex, so in the name of usability, I think it'll make more sense to write actual independent docs for this function with more explicit information about its usage.

#'
#' @inheritParams open_dataset
#' @inheritParams read_csv_arrow
#'
Member

It might be helpful to include an example here that shows why you would want this function, i.e. what it simplifies. (A reprex of that would also be nice to include on the PR description to demonstrate why this needs to exist.)

Other things worth considering:

  • Working this function into a vignette
  • xrefs from the open_dataset docs (@seealso; an inline reference in the "format" argument; etc.) and the read_delim_arrow docs.

expect_identical(dim(ds), c(20L, 7L))

# delim
ds_tsv <- open_csv_dataset(tsv_dir, delim = "\t", partitioning = "part")
Member

This kinda begs the question: should there also be open_tsv_dataset(), open_delim_dataset(), etc.?

Member Author

Yeah, those make sense; added now
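For illustration of the shape this family of wrappers could take, here is a hypothetical base-R sketch (the `_sketch` names and the string return value are stand-ins, not the actual arrow implementation): each thin wrapper simply fixes `delim` on a shared delimited-file worker.

```r
# Hypothetical sketch only: each wrapper fixes `delim` on a shared
# open_delim_dataset()-style worker; the real arrow functions take many
# more arguments and return a Dataset, not a string.
open_delim_dataset_sketch <- function(sources, delim = ",") {
  paste("opening", sources, "with delim", delim)  # stand-in for the real work
}
open_csv_dataset_sketch <- function(sources) open_delim_dataset_sketch(sources, delim = ",")
open_tsv_dataset_sketch <- function(sources) open_delim_dataset_sketch(sources, delim = "\t")
```

The appeal of this shape is that the CSV/TSV variants stay one line each and cannot drift apart in behaviour.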

Member

@paleolimbot paleolimbot left a comment

Just some high-level comments (happy to re-review as this progresses).

Huge improvement in the interface! The reprex I used was:

# pr_fetch(33614)
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

ds_dir <- tempfile()
write_dataset(mtcars, ds_dir, format = "csv")

# Cool! Way easier!
open_csv_dataset(ds_dir, delim = ",")
#> FileSystemDataset with 1 csv file
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> am: int64
#> gear: int64
#> carb: int64

# Do we need the `as_data_frame` arg?
open_csv_dataset(ds_dir, as_data_frame = TRUE)
#> Error in `open_csv_dataset()`:
#> ! `as_data_frame = TRUE` not supported for datasets
#> ℹ instead try calling `dplyr::collect()` or `as.data.frame()` on your dataset

#> Backtrace:
#>     ▆
#>  1. └─arrow::open_csv_dataset(ds_dir, as_data_frame = TRUE)
#>  2.   └─rlang::abort(...) at r/R/dataset.R:279:4

# A bit confusing if we specify parse_options?
open_csv_dataset(ds_dir, delim = ",", parse_options = CsvParseOptions$create(delimiter = "\t"))
#> Error in `open_csv_dataset()`:
#> ! The following option is supported in "read_delim_arrow" functions but not supported here: "parse_options"

#> Backtrace:
#>     ▆
#>  1. └─arrow::open_csv_dataset(ds_dir, delim = ",", parse_options = CsvParseOptions$create(delimiter = "\t"))
#>  2.   └─rlang::abort("The following option is supported in \"read_delim_arrow\" functions but not supported here: \"parse_options\"") at r/R/dataset.R:275:4

Created on 2023-01-13 with reprex v2.0.2

It looks like there are a few arguments to open_csv_dataset() that don't work, so maybe they should be omitted?

Just as a matter of curiosity, could the "who calls who" be flipped here? Like:

open_csv_dataset <- function(sources, all = the, csv = specific, options = NULL) {
  # all the csv-specific things
}

open_dataset <- function(..., format = "parquet") {
  switch(format,
    csv = open_csv_dataset(...),
    # other dataset formats
    abort("`format` must be one of ...")
  )
}

The call to open_dataset() at the end of open_csv_dataset() seems rather complicated to me and I wonder if flipping the direction could help?

@thisisnic
Member Author

thisisnic commented Jan 16, 2023

It looks like there are a few arguments to open_csv_dataset() that don't work, so maybe they should be omitted?

This was a deliberate design decision so that folks can easily swap read_csv_arrow() and open_csv_dataset() calls without changing parameters, and so they get more informative error messages than unused argument when they hit options that aren't compatible with both. I'm not too attached to it though - it doesn't feel like great practice either, and the unsupported arguments aren't the most widely used.

(removed now)

@thisisnic thisisnic marked this pull request as ready for review January 16, 2023 20:32
r/R/csv.R (resolved)
@thisisnic
Member Author

The call to open_dataset() at the end of open_csv_dataset() seems rather complicated to me and I wonder if flipping the direction could help?

I'd rather save this for a follow-up PR as I'm not entirely sure how that'd work.

Member

@paleolimbot paleolimbot left a comment

This will be a huge improvement for working with CSV datasets! I love the concept and the cleaning up of existing code you've done along the way.

If I'm reading this correctly, the change won't affect open_dataset() but may affect read_csv_arrow()? I don't know what our existing test coverage was of this behaviour in read_csv_arrow()...if it didn't exist before, this PR should probably add it (particularly if you had envisioned getting this in to the 11.0.0 release).

I think the documentation for some of the parameters might need to be updated (i.e., did passing a list() work for the various options parameters work before? What happens if you pass in something like convert_options AND tidyverse-style options?).

It would be great to ship this with 11.0.0 so that the new functions could get tested! That said, it is very specifically a feature and not a bug fix and we've hit feature freeze. I'm not worried this will substantiatively break existing usage.

Comment on lines +323 to +326
mc <- match.call()
mc$delim <- ","
mc[[1]] <- get("open_delim_dataset", envir = asNamespace("arrow"))
eval.parent(mc)
Member

This feels too magical to me...is there a reason that

open_csv_dataset <- function(...) {
  open_delim_dataset(..., delim = ",")
}

wouldn't work? (That would also avoid repeating the default argument values)

Member Author

Just following the pattern from read_delim_arrow(); happy to switch it

Member Author

Actually, no, I don't want the open_csv_dataset() function to take ellipses as its arguments. Part of the point of this wrapper is to have explicitly documented parameters.

Member

Point taken...let's leave it for now (and be very careful when/if any of those default values get changed so that they all stay in sync).

Member

FTR this is how base::read.csv does it, and I don't remember the exact reason now but I recall that it was a good one.

Member

Hmm...I think I was remembering my initial read of Advanced R that cited a subtle bug in this approach and suggested an alternative ( http://adv-r.had.co.nz/Expressions.html#capturing-call ); however, I imagine this is a minor consideration.
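For readers unfamiliar with the pattern being debated, here is a minimal base-R sketch of the match.call() forwarding technique (the `f_csv`/`f_delim` toys are illustrative, not the arrow functions): the wrapper captures its own call, fixes an argument, retargets the call at the worker, and evaluates it in the caller's frame.

```r
# Toy illustration of the match.call() forwarding pattern: f_csv() captures
# its own call, adds delim = ",", replaces the called function with f_delim,
# and evaluates the rewritten call in the parent frame.
f_delim <- function(x, delim) paste0(x, delim)

f_csv <- function(x) {
  mc <- match.call()        # f_csv(x = ...)
  mc$delim <- ","           # add the fixed argument
  mc[[1]] <- f_delim        # retarget the call at the worker
  eval.parent(mc)
}

f_csv("a")  # equivalent to f_delim(x = "a", delim = ",")
#> [1] "a,"
```

The upside is that defaults live in one place; the downside, as noted above, is that call rewriting is harder to read than a plain `f_delim(x, delim = ",")` wrapper.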

#' * `as_data_frame` (instead, convert to data frame after dataset creation)
#' * `parse_options`
#'
#' @examplesIf arrow_with_dataset()
Member

The other example block you added has a skip for Windows...do you need that here, too? (Or can the skip for windows on the other one be removed?)

Member Author

@thisisnic thisisnic Jan 17, 2023

I honestly have no idea; FileFormat has a skip for Windows, so I guessed it might be needed for CsvFileFormat, though it might not be. Removed them to see what happens.

Comment on lines 540 to 542
expect_error(
expect_equal(ds$time, "16-01-2023")
)
Member

What is the expected error message?

Member Author

The expected error message is "object (actual) not equal to expected (expected)." but I didn't think I should be testing that given it's the message from testthat? Comment above this line refers to the issue ticket I opened, but I can add something else to make it more explicit perhaps?

Member

expect_error() should always have an expected error because it might hide a completely unrelated error (e.g., expect_error(this_is_completely_invalid_code)). My preferred way of dealing with this is to add the failing reprex to the GitHub Issue and remove the test here...some of our existing tests do skip("GH-XXXX timestamp parser not implemented") to handle that.


# skip_empty_rows
tf <- tempfile()
writeLines('"x"\n"y"\nNA\nNA\n"NULL"\n\n\n', tf)
Member

I like this better than the write.table() you do below/above...it's easier to see what the csv file that's getting parsed is.

Member Author

When it's one column I agree, though it's trickier when it's multiple columns. In order to make it more readable/obvious, I've swapped out write.table for write.csv as we use that more in this file, and moved things around so that the data frame is defined next to where it's written.

Member

You can also use newlines in your string literals:

csv_with_single_quotes <- "x,y
'x_col1','y_col1'
'x_col2','y_col2'"

cat(csv_with_single_quotes)
#> x,y
#> 'x_col1','y_col1'
#> 'x_col2','y_col2'

I like that because I don't need to run any code to figure out what's getting parsed as a reviewer/reader of that code.

Member

I see that we do write.csv() almost everywhere in test-csv.R for this, so happy to defer this to later/if ever.
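As a base-R aside on why inline literals are convenient in tests (no arrow required), textConnection() lets a fixture be parsed right where it is defined, so the reader never has to run code to see the input:

```r
# The CSV fixture is visible in the source; textConnection() parses it
# without writing a temporary file.
csv_text <- "x,y
1,a
2,b"

df <- read.csv(textConnection(csv_text))
df
#>   x y
#> 1 1 a
#> 2 2 b
```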

opt_names <- names(args)

# extract names of parse_options, read_options, and convert_options
if ("parse_options" %in% names(args) && is.list(args[["parse_options"]])) {
Member

What happens if args[["parse_options"]] is not a list? (Probably should be in the comments since I can't tell by reading this what it should or should not be)

Member Author

args[["parse_options"]] can either be a list of options, or a CsvParseOptions object. This helper function is only designed to validate the former; the latter is more advanced/off-label usage and it's up to users to safely pass the correct arguments through here (you could argue that this shouldn't be the case, but doing it this way is consistent with how we've done this in other bits of the codebase).

I thought about documenting this, but I don't know where it would go - I don't think inside this function?

Member

You could redocument #' @param parse_options in open_delim_dataset()? (Or add to the existing #' @param parse_options if the behaviour is the same for both open_delim_dataset() and read_delim_arrow())

Member

If it's off-label usage then you probably should error for it...with the new function we have the opportunity to protect users from off-label usage (which was very very easy to do with the existing open_dataset()). Passing parse_options = CsvParseOptions$create(...) was the first thing I thought to do and so I imagine somebody else might think to do it, too.

Member Author

@thisisnic thisisnic Jan 17, 2023

parse_options = CsvParseOptions$create(...) is the way that users have been doing things thus far as passing in lists is new - this additional way of doing things only got merged in in the past few weeks. I'm not sure what the right call is here; I don't want to error on passing in a CsvParseOptions object and break folks' existing code but do want to encourage passing in options as a list so it actually hits our validation code.

Member

I see...let's consider that later, then, but make it clear in this function that those are the two options for future us trying to read this code.

Member Author

I added more info here, and more explicit acknowledgement of this to CsvFileFormat$create()
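A hedged sketch of the dual interface described in this thread — plain lists get validated, pre-built options objects pass through untouched. The option names and the class check below are illustrative stand-ins, not arrow's actual code:

```r
# Illustrative only: a plain named list is checked against known option
# names, while an object carrying a "CsvParseOptions" class attribute
# bypasses validation, mirroring the "off-label" path discussed above.
check_parse_options <- function(parse_options,
                                known = c("delimiter", "quote_char")) {
  if (is.list(parse_options) && !inherits(parse_options, "CsvParseOptions")) {
    bad <- setdiff(names(parse_options), known)
    if (length(bad) > 0) {
      stop("Unsupported parse option(s): ", paste(bad, collapse = ", "))
    }
  }
  parse_options
}

check_parse_options(list(delimiter = "\t"))   # passes validation
# check_parse_options(list(nope = 1))         # errors: unsupported option
```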

r/R/dataset.R Outdated
#' # Set up directory for examples
#' tf <- tempfile()
#' dir.create(tf)
#' on.exit(unlink(tf))
Member

I don't know if on.exit() works the way you think it will in an example...my guess is that it won't work if there are any on.exit()s in any other examples. I would just put unlink(tf) at the bottom of your example.

Member Author

@thisisnic thisisnic Jan 17, 2023

Done! We do this in loads of examples, we should probably update those too.

Member

It might work...on.exit() when the current environment is the global environment is a little hairy (on.exit(..., add = TRUE) would be more correct, but some other on.exit() without add = TRUE might cause the cleanup code you added not to run, if my understanding of on.exit() is correct).
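The on.exit() subtlety can be demonstrated in a few lines of standalone base R (toy functions, unrelated to the arrow examples themselves):

```r
# Without add = TRUE, a second on.exit() call replaces the first handler;
# with add = TRUE it appends. This is why cleanup registered earlier can
# silently disappear.
calls <- character()

f <- function() {
  on.exit(calls <<- c(calls, "first"))
  on.exit(calls <<- c(calls, "second"))         # replaces "first"
  invisible(NULL)
}
g <- function() {
  on.exit(calls <<- c(calls, "a"))
  on.exit(calls <<- c(calls, "b"), add = TRUE)  # appends to "a"
  invisible(NULL)
}

f(); g()
# calls is now c("second", "a", "b") - "first" never ran
```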

r/R/dataset.R Outdated
#' @seealso [open_dataset()]
#' @export
open_delim_dataset <- function(sources,
schema = NULL,
Member

Indent to line up with function(?

Member Author

Re-indented so all args align

@thisisnic
Member Author

For clarity, this won't affect the behaviour of open_dataset() or read_csv_arrow(). Passing in lists worked before, but I've expanded the documentation to explicitly reference the fact that this works here.

Member

@paleolimbot paleolimbot left a comment

These changes look great!

You may want to rebase and devtools::document()...when I pulled this locally and ran devtools::document() I saw a few changes.

@thisisnic thisisnic merged commit 444dcb6 into apache:master Jan 17, 2023
raulcd pushed a commit that referenced this pull request Jan 18, 2023
…more closely matching read_csv_arrow (#33614)

This PR implements a wrapper around `open_dataset()` specifically for value-delimited files. It takes the parameters from `open_dataset()` and appends the parameters of `read_csv_arrow()` which are compatible with `open_dataset()`. This should make it easier for users to switch between the two, e.g.:

``` r
library(arrow)
library(dplyr)

# Set up directory for examples
tf <- tempfile()
dir.create(tf)
on.exit(unlink(tf))
df <- data.frame(x = c("1", "2", "NULL"))

file_path <- file.path(tf, "file1.txt")
write.table(df, file_path, sep = ",", row.names = FALSE)

read_csv_arrow(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1)
#> # A tibble: 3 × 1
#>       y
#>   <int>
#> 1     1
#> 2     2
#> 3    NA

open_csv_dataset(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1) %>% collect()
#> # A tibble: 3 × 1
#>       y
#>   <int>
#> 1     1
#> 2     2
#> 3    NA
```

This PR also hooks up the "na" (readr-style) parameter to "null_values" (i.e. the CsvConvertOptions parameter).

In the process of making this PR, I also refactored `CsvFileFormat$create()`. Unfortunately, many changes needed to be made at once, which has considerably increased the size/complexity of this PR.

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
@jorisvandenbossche jorisvandenbossche added this to the 11.0.0 milestone Jan 18, 2023
@ursabot

ursabot commented Jan 19, 2023

Benchmark runs are scheduled for baseline = 04ffb1f and contender = 444dcb6. 444dcb6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.19% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 444dcb67 ec2-t3-xlarge-us-east-2
[Finished] 444dcb67 test-mac-arm
[Finished] 444dcb67 ursa-i9-9960x
[Finished] 444dcb67 ursa-thinkcentre-m75q
[Finished] 04ffb1f7 ec2-t3-xlarge-us-east-2
[Failed] 04ffb1f7 test-mac-arm
[Finished] 04ffb1f7 ursa-i9-9960x
[Finished] 04ffb1f7 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
