[R][C++][Dataset] open_dataset and open_csv_dataset do not use skip argument #35756

karldw · 2023-05-25T02:08:08Z

Describe the bug, including details regarding any error messages, version, and platform.

I noticed that a CSV with header rows can be successfully read with read_csv_arrow(..., skip=N), but not with open_csv_dataset unless a schema is provided. An example is below.

I expected open_csv_dataset to use the skip argument the same way read_csv_arrow does, skipping a fixed number of rows from every CSV, but it seems to not be skipping any. I think this is likely a bug -- maybe in the schema parsing?

library(arrow)

lines <- c(
  "This line should be skipped",
  "This one too, even though it has a comma",
  "X,Y,Z",
  "1,2,3",
  "4,5,6"
)
tmp_dir <- file.path(tempdir(), "arrow_test")
dir.create(tmp_dir, recursive=TRUE)
tmp_csv <- file.path(tmp_dir, "test.csv")
writeLines(lines, tmp_csv)

# This works as expected, skipping the first two lines and reading headers from the third
read_csv_arrow(tmp_csv, skip=2L)

open_csv_dataset(tmp_dir, skip=2L)
#> ! Invalid: Error creating dataset. Could not read schema from 
#> '/tmp/RtmpPRG1yT/arrow_test/test.csv'. Is this a 'csv' file?: 
#> Could not open CSV input source '/tmp/RtmpPRG1yT/arrow_test/test.csv': 
#> Invalid: CSV parse error: Row #2: Expected 1 columns, 
#> got 2: This one too, even though it has a comma

# These generate the same error:
open_csv_dataset(tmp_dir, skip=3L)
open_dataset(tmp_dir, format="csv", skip=2L)


schem <- schema(
  field(name="X", type=int32()),
  field(name="Y", type=int32()),
  field(name="Z", type=int32())
)

# Works when schema is supplied (now also need to skip header row)
open_csv_dataset(tmp_dir, skip=3L, schema=schem) |> dplyr::collect()

Version info:

R 4.2.2 on Linux
Arrow 12.0.0
(edited to add version info)

Component(s)

R


thisisnic · 2023-05-30T13:54:56Z

Can confirm that this can be reproduced on dev. My best guess is that this was introduced in the refactor done as part of #33526.

thisisnic · 2023-06-01T18:18:43Z

I've been looking into this more closely, and the CSVReadOptions object is being created correctly (i.e. contains the correct skip_rows value and no column names), and now I'm wondering if something else is happening here.

We haven't caught this before, I think, because we don't tend to see skip being used on its own without col_names or a schema being supplied.

I noticed this comment in the definition of ReadOptions:

arrow/cpp/src/arrow/csv/options.h

Line 158 in 3299d12

/// If empty, fall back on autogenerate_column_names.

However, I can't find the code that actually implements the fallback to autogenerate_column_names. @westonpace - am I looking in the wrong place, or is that functionality missing?

westonpace · 2023-06-09T20:19:58Z

I had an R environment set up today, so I debugged this for a bit. @thisisnic, I had pointed you offline at

arrow/cpp/src/arrow/csv/reader.cc

Line 602 in 8b2ab4d

if (read_options_.autogenerate_column_names) {

as the point where we handle autogenerated column names.

However, I had forgotten that we also (rather embarrassingly :) have a duplicated copy of this logic in the datasets module here:

arrow/cpp/src/arrow/dataset/file_csv.cc

Line 179 in 8b2ab4d

if (read_options.autogenerate_column_names) {

The logic in the dataset module is slightly different than the logic in the reader module. It is (omitting some stuff):

  int32_t max_num_rows = read_options.skip_rows + 1;
  csv::BlockParser parser(pool, parse_options, /*num_cols=*/-1, /*first_row=*/1,
                          max_num_rows);

  RETURN_NOT_OK(parser.Parse(std::string_view{first_block}, &parsed_size));

  if (read_options.autogenerate_column_names) {
    column_names.reserve(parser.num_cols());

So we give the skipped rows to the parser (this is different than the reader.cc logic where we skip the rows outside the parser and then only give the first non-skipped row to the parser).

When we are not auto-generating column names then I think we kind of get away with it because we call parser.VisitLastRow which only really depends on the contents of the last row.

On the other hand, if the user is asking to autogenerate the column names we use parser.num_cols(). This calculation is based on the first row the parser sees!
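To make the difference between the two paths concrete, here is a minimal standalone sketch. It does not use Arrow at all: SplitLines, SplitFields, NumColsFromFirstParsedRow and NumColsAfterSkipping are hypothetical helpers written only for illustration, with a naive comma split and no quoting support. It assumes, as described above, that the column count is taken from the first row the parser is handed.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Split a block of text into lines on '\n' (illustration only).
std::vector<std::string_view> SplitLines(std::string_view block) {
  std::vector<std::string_view> lines;
  std::size_t start = 0;
  while (start < block.size()) {
    std::size_t nl = block.find('\n', start);
    if (nl == std::string_view::npos) {
      lines.push_back(block.substr(start));
      break;
    }
    lines.push_back(block.substr(start, nl - start));
    start = nl + 1;
  }
  return lines;
}

// Split one line on commas. No quoting or escaping is handled.
std::vector<std::string> SplitFields(std::string_view line) {
  std::vector<std::string> fields;
  std::size_t start = 0;
  while (true) {
    std::size_t comma = line.find(',', start);
    if (comma == std::string_view::npos) {
      fields.emplace_back(line.substr(start));
      return fields;
    }
    fields.emplace_back(line.substr(start, comma - start));
    start = comma + 1;
  }
}

// Dataset-style path (as described above): the skipped rows are handed to
// the parser too, so the column count comes from the very first line.
std::size_t NumColsFromFirstParsedRow(std::string_view block) {
  std::vector<std::string_view> lines = SplitLines(block);
  return lines.empty() ? 0 : SplitFields(lines.front()).size();
}

// Reader-style path: drop skip_rows lines first, then count the columns of
// the first line that was not skipped.
std::size_t NumColsAfterSkipping(std::string_view block, std::size_t skip_rows) {
  std::vector<std::string_view> lines = SplitLines(block);
  if (skip_rows >= lines.size()) return 0;
  return SplitFields(lines[skip_rows]).size();
}
```

On the five-line file from the original report, NumColsFromFirstParsedRow gives 1 while NumColsAfterSkipping with skip_rows = 2 gives 3. That is consistent with the "Expected 1 columns, got 2" message in the report, where the second skipped line happens to contain a comma.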

I'm attaching a very clumsy sketch of a fix (which I verified works in the OP's reprex) that just copy/pastes code from the reader so we can handle skipping in the exact same way. This sketch is also missing unit tests.

I think a good long-term fix would be to eliminate this duplicate path entirely. This could be done by adding an "inspect" method (or PeekMetadata or GetColumnNames or something) to the CSV reader. Then the datasets API could use that instead of re-inventing the wheel.
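As a rough illustration of what such a shared entry point could look like, here is a standalone sketch that does not use Arrow. PeekOptions, NthLine, SplitFields and PeekColumnNames are all hypothetical names invented for this example, not existing Arrow APIs; the only details borrowed from Arrow are the two option fields discussed in this thread and the "f0", "f1", ... pattern used for autogenerated column names. The point is that skipping and name resolution live in one place, so a reader and a dataset layer calling it could not drift apart.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical options struct; it only mirrors the two ReadOptions fields
// discussed in this thread.
struct PeekOptions {
  std::size_t skip_rows = 0;
  bool autogenerate_column_names = false;
};

// Return line `n` (0-based) of `block`, or an empty view if there is none.
std::string_view NthLine(std::string_view block, std::size_t n) {
  std::size_t start = 0;
  for (std::size_t i = 0; i < n; ++i) {
    std::size_t nl = block.find('\n', start);
    if (nl == std::string_view::npos) return {};
    start = nl + 1;
  }
  std::size_t end = block.find('\n', start);
  if (end == std::string_view::npos) return block.substr(start);
  return block.substr(start, end - start);
}

// Split one line on commas. No quoting or escaping is handled.
std::vector<std::string> SplitFields(std::string_view line) {
  std::vector<std::string> fields;
  std::size_t start = 0;
  while (true) {
    std::size_t comma = line.find(',', start);
    if (comma == std::string_view::npos) {
      fields.emplace_back(line.substr(start));
      return fields;
    }
    fields.emplace_back(line.substr(start, comma - start));
    start = comma + 1;
  }
}

// Single place that applies skip_rows and then resolves column names.
// Both a reader and a dataset layer could call this instead of each
// re-implementing the skipping logic.
std::vector<std::string> PeekColumnNames(std::string_view block,
                                         const PeekOptions& options) {
  // Skip first, so the skipped rows never influence the column count.
  std::string_view first_row = NthLine(block, options.skip_rows);
  if (first_row.empty()) return {};

  std::vector<std::string> fields = SplitFields(first_row);
  if (!options.autogenerate_column_names) {
    return fields;  // the first non-skipped row is the header
  }

  // Autogenerate one name per column of the first non-skipped row.
  std::vector<std::string> names;
  names.reserve(fields.size());
  for (std::size_t i = 0; i < fields.size(); ++i) {
    names.push_back("f" + std::to_string(i));
  }
  return names;
}
```

With the file from the original report, skip_rows = 2 yields the names X, Y, Z from the header row, and skip_rows = 3 with autogenerate_column_names = true yields f0, f1, f2. A real implementation would have to go through the CSV parser so that quoting, custom delimiters and block boundaries are handled; this sketch ignores all of those.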

westonpace · 2023-06-09T20:21:41Z

Forgot the patch
csv-patch.txt

karldw added the Type: bug label May 25, 2023

github-actions bot added the Component: R label May 25, 2023

thisisnic added the Priority: Critical label May 30, 2023

thisisnic self-assigned this May 31, 2023

thisisnic mentioned this issue Jun 6, 2023

GH-35756: [R] open_dataset and open_csv_dataset do not use skip argument #35852

Closed

thisisnic added the Component: C++ label Jun 8, 2023

thisisnic changed the title ~~[R] open_dataset and open_csv_dataset do not use skip argument~~ [R][C++] open_dataset and open_csv_dataset do not use skip argument Jun 8, 2023

westonpace changed the title ~~[R][C++] open_dataset and open_csv_dataset do not use skip argument~~ [R][C++][Dataset] open_dataset and open_csv_dataset do not use skip argument Jun 9, 2023
