To make the transition between the different CSV reading functions as smooth as possible, we could introduce a version of `open_dataset()` specifically for reading CSVs, with a signature more closely matching that of `read_csv_arrow()`. It would just pass its arguments through to `open_dataset()` (via the ellipsis), but it would make it simpler to have a docs page showing these options explicitly, and so be clearer for users.
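A minimal sketch of the pass-through idea described above. The name `open_csv_dataset_sketch` is hypothetical and is not the function proposed for the package; the sketch only assumes that `open_dataset()` accepts `format = "csv"` and forwards any extra arguments to the CSV reader.

```r
library(arrow)

# Hypothetical wrapper (illustration only): fixes format = "csv" and
# forwards every other argument unchanged through the ellipsis.
open_csv_dataset_sketch <- function(sources, ...) {
  open_dataset(sources, format = "csv", ...)
}

# Write a small CSV file to read back
tf <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), tf, row.names = FALSE)

# Open it as a dataset and pull the rows into memory
ds <- open_csv_dataset_sketch(tf)
result <- dplyr::collect(ds)
```

A real wrapper would list the supported CSV arguments explicitly in its signature, which is what makes a dedicated docs page possible.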
…more closely matching read_csv_arrow (#33614)
This PR implements a wrapper around `open_dataset()` specifically for value-delimited files. It takes the parameters from `open_dataset()` and appends the parameters of `read_csv_arrow()` which are compatible with `open_dataset()`. This should make it easier for users to switch between the two, e.g.:
``` r
library(arrow)
library(dplyr)
# Set up directory for examples
tf <- tempfile()
dir.create(tf)
on.exit(unlink(tf, recursive = TRUE))
df <- data.frame(x = c("1", "2", "NULL"))
file_path <- file.path(tf, "file1.txt")
write.table(df, file_path, sep = ",", row.names = FALSE)
read_csv_arrow(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1)
#> # A tibble: 3 × 1
#> y
#> <int>
#> 1 1
#> 2 2
#> 3 NA
open_csv_dataset(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1) %>% collect()
#> # A tibble: 3 × 1
#> y
#> <int>
#> 1 1
#> 2 2
#> 3 NA
```
This PR also hooks up the "na" (readr-style) parameter to "null_values" (i.e. CSVConvertOptions parameter).
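As a rough illustration of what that mapping amounts to, the sketch below reads the same file twice with `read_csv_arrow()`: once with the readr-style `na` argument and once with explicit convert options. It assumes the installed arrow version exposes `CsvConvertOptions$create()` with `null_values` and `strings_can_be_null` arguments; the file contents are made up for the example.

```r
library(arrow)

# Small example file with a "NULL" marker in an otherwise numeric column
tf <- tempfile(fileext = ".csv")
writeLines(c("x", "1", "2", "NULL"), tf)

# readr-style: list the strings that should be treated as missing
via_na <- read_csv_arrow(tf, na = c("", "NA", "NULL"))

# Arrow-style: pass the same strings as null_values in the convert options
opts <- CsvConvertOptions$create(
  null_values = c("", "NA", "NULL"),
  strings_can_be_null = TRUE
)
via_options <- read_csv_arrow(tf, convert_options = opts)
```

Both calls should treat the `NULL` entry as missing, so the two results can be compared directly.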
In the process of making this PR, I also refactored `CsvFileFormat$create()`. Unfortunately, many changes needed to be made at once, which has considerably increased the size and complexity of this PR.
Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
Reporter: Nicola Crane / @thisisnic
Note: This issue was originally created as ARROW-18358. Please see the migration documentation for further details.