[R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow #33526

asfimport · 2022-11-17T21:01:40Z

To make the transition between the different CSV reading functions as smooth as possible, we could introduce a version of open_dataset specifically for reading CSVs, with a signature more closely matching that of read_csv_arrow. It would simply pass its arguments through to open_dataset (via the ellipsis), but it would make it simpler to have a docs page showing these options explicitly, and would thus be clearer for users.

Reporter: Nicola Crane / @thisisnic

Note: This issue was originally created as ARROW-18358. Please see the migration documentation for further details.

…more closely matching read_csv_arrow (#33614)

This PR implements a wrapper around `open_dataset()` specifically for value-delimited files. It takes the parameters from `open_dataset()` and appends the parameters of `read_csv_arrow()` which are compatible with `open_dataset()`. This should make it easier for users to switch between the two, e.g.:

``` r
library(arrow)
library(dplyr)

# Set up directory for examples
tf <- tempfile()
dir.create(tf)
on.exit(unlink(tf))
df <- data.frame(x = c("1", "2", "NULL"))

file_path <- file.path(tf, "file1.txt")
write.table(df, file_path, sep = ",", row.names = FALSE)

read_csv_arrow(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1)
#> # A tibble: 3 × 1
#>       y
#>   <int>
#> 1     1
#> 2     2
#> 3    NA

open_csv_dataset(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1) %>%
  collect()
#> # A tibble: 3 × 1
#>       y
#>   <int>
#> 1     1
#> 2     2
#> 3    NA
```

This PR also hooks up the "na" (readr-style) parameter to "null_values" (i.e. CSVConvertOptions parameter).

In the process of making this PR, I also refactored `CsvFileFormat$create()`. Unfortunately, many changes needed to be made at once, which has considerably increased the size/complexity of this PR.

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>

thisisnic · 2023-01-17T18:55:10Z

Issue resolved by pull request 33614
#33614

asfimport mentioned this issue Jan 11, 2023

[R] Datasets API interface improvements #33520

Open

13 tasks

thisisnic mentioned this issue Jan 11, 2023

GH-33526: [R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow #33614

Merged

github-actions bot assigned thisisnic Jan 11, 2023

thisisnic added this to the 11.0.0 milestone Jan 17, 2023

thisisnic closed this as completed Jan 17, 2023

raulcd added the "Priority: Blocker" label (marks a blocker for the release) Jan 18, 2023

thisisnic mentioned this issue May 30, 2023

[R][C++][Dataset] open_dataset and open_csv_dataset do not use skip argument #35756

Open
