[R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow #33526

Closed
Tracked by #33520
asfimport opened this issue Nov 17, 2022 · 1 comment
Labels: Component: R · Priority: Blocker · Type: task

Comments

@asfimport

To make the transition between the different CSV reading functions as smooth as possible, we could introduce a version of `open_dataset()` specifically for reading CSVs, with a signature more closely matching that of `read_csv_arrow()`. It would simply pass its arguments through to `open_dataset()` (via the ellipsis), but it would make it simpler to have a docs page that shows these options explicitly, and so be clearer for users.
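
A minimal sketch of the pass-through idea, for illustration only. The name `open_dataset_csv_sketch()` is hypothetical and is not the function that was eventually merged; the only thing it does is fix `format = "csv"` and forward everything else unchanged.

``` r
library(arrow)
library(dplyr)

# Hypothetical wrapper: fixes the format and forwards all other
# arguments to open_dataset() through the ellipsis.
open_dataset_csv_sketch <- function(sources, ...) {
  open_dataset(sources, format = "csv", ...)
}

# Small throwaway file to try it on
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), csv_path, row.names = FALSE)

ds <- open_dataset_csv_sketch(csv_path)
result <- collect(ds)
```

A real version would presumably list the CSV options as named arguments rather than hiding them in `...`, since making them visible in the docs is the point of the proposal.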

Reporter: Nicola Crane / @thisisnic

Note: This issue was originally created as ARROW-18358. Please see the migration documentation for further details.

thisisnic added a commit that referenced this issue Jan 17, 2023
…more closely matching read_csv_arrow (#33614)

This PR implements a wrapper around `open_dataset()` specifically for value-delimited files. It takes the parameters from `open_dataset()` and appends the parameters of `read_csv_arrow()` which are compatible with `open_dataset()`. This should make it easier for users to switch between the two, e.g.:

``` r
library(arrow)
library(dplyr)

# Set up directory for examples
tf <- tempfile()
dir.create(tf)
on.exit(unlink(tf, recursive = TRUE))
df <- data.frame(x = c("1", "2", "NULL"))

file_path <- file.path(tf, "file1.txt")
write.table(df, file_path, sep = ",", row.names = FALSE)

read_csv_arrow(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1)
#> # A tibble: 3 × 1
#>       y
#>   <int>
#> 1     1
#> 2     2
#> 3    NA

open_csv_dataset(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1) %>% collect()
#> # A tibble: 3 × 1
#>       y
#>   <int>
#> 1     1
#> 2     2
#> 3    NA
```

This PR also maps the readr-style `na` parameter to `null_values` (the corresponding CSVConvertOptions parameter).
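
As a rough illustration of what that mapping means, the sketch below reads the same file twice with `read_csv_arrow()`: once with the readr-style `na` argument and once with `null_values` set explicitly through `CsvConvertOptions$create()`. It assumes the two spellings are intended to behave the same way for a numeric column; the file and variable names are made up for the example.

``` r
library(arrow)

# Throwaway file with "NULL" used as a missing-value marker
csv_path <- tempfile(fileext = ".csv")
writeLines(c("x", "1", "2", "NULL"), csv_path)

# readr-style spelling
via_na <- read_csv_arrow(csv_path, na = c("", "NA", "NULL"))

# Arrow-style spelling of the same request
via_null_values <- read_csv_arrow(
  csv_path,
  convert_options = CsvConvertOptions$create(
    null_values = c("", "NA", "NULL")
  )
)
```

In both cases the third value of `x` should come back as `NA`, with the column inferred as integer.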

In the process of making this PR, I also refactored `CsvFileFormat$create()`. Unfortunately, many changes needed to be made at once, which has considerably increased the size and complexity of this PR.

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
@thisisnic thisisnic added this to the 11.0.0 milestone Jan 17, 2023
@thisisnic
Member

Issue resolved by pull request 33614
#33614

@raulcd raulcd added the Priority: Blocker Marks a blocker for the release label Jan 18, 2023
raulcd pushed a commit that referenced this issue Jan 18, 2023
…more closely matching read_csv_arrow (#33614)
