New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R][C++][Dataset] open_dataset and open_csv_dataset do not use skip argument #35756
Comments
Can confirm that this can be reproduced on dev. My best guess is that this was introduced in the refactor done as part of #33526. |
I've been looking into this more closely, and the CSVReadOptions object is being created correctly (i.e. contains the correct We haven't caught this before, I think, because we don't tend to see I noticed this comment in the definition of ReadOptions: arrow/cpp/src/arrow/csv/options.h Line 158 in 3299d12
However, I can't find the code that implements actually falling back on |
I had an R environment setup today and so I debugged this for a bit. @thisisnic , I had, offline, pointed you at arrow/cpp/src/arrow/csv/reader.cc Line 602 in 8b2ab4d
However, I had forgotten that we also (rather embarassingly :) have a duplicated copy of this logic in the datasets module here: arrow/cpp/src/arrow/dataset/file_csv.cc Line 179 in 8b2ab4d
The logic in the dataset module is slightly different than the logic in the reader module. It is (omitting some stuff):
So we give the skipped rows to the parser (this is different than the reader.cc logic where we skip the rows outside the parser and then only give the first non-skipped row to the parser). When we are not auto-generating column names then I think we kind of get away with it because we call On the other hand, if the user is asking to autogenerate the column names we use I'm attaching a very clumsy sketch of a fix (which I verified works in OPs reprex) that just copy/pastes code from the reader so we can handle skipping in the exact same way. This sketch is also missing unit tests. I think a good long-term fix would be to eliminate this duplicate path entirely. This could be done by adding an "inspect" method (or PeekMetadata or GetColumnNames or something) to the CSV reader. Then the datasets API could use that instead of re-inventing the wheel. |
Forgot the patch |
Describe the bug, including details regarding any error messages, version, and platform.
I noticed that a CSV with header rows can be successfully read with
read_csv_arrow(..., skip=N)
, but not withopen_csv_dataset
unless a schema is provided. An example is below.I expected
open_csv_dataset
to use the skip argument the same wayread_csv_arrow
does, skipping a fixed number of rows from every CSV, but it seems to not be skipping any. I think this is likely a bug -- maybe in the schema parsing?Version info:
R 4.2.2 on Linux
Arrow 12.0.0
(edited to add version info)
Component(s)
R
The text was updated successfully, but these errors were encountered: