Add an experimental csv module exposing a streaming csv parser #3743

Draft · wants to merge 4 commits into master
Conversation

@oleiade (Member) commented May 15, 2024

What?

This PR is a cleaned-up version of the CSV streaming parser we hacked together during Crococon. It aims to address #2976.

import { open } from 'k6/experimental/fs';
import csv from 'k6/experimental/csv';

// `await` is not yet supported in the init context, hence the async IIFE.
let file;
(async function () {
	file = await open('data.csv');
})();

export default async function () {
	// Instantiate a streaming parser over the already-open file handle.
	let parser = new csv.Parser(file, {
		delimiter: ',',
		skipFirstLine: true,
		fromLine: 3,
		toLine: 13,
	});

	// Iterate over the records until the parser reports it is done.
	while (true) {
		const { done, value } = await parser.next();
		if (done) {
			break;
		}

		console.log(value);
	}
}

This is a preliminary version whose goal is to gather input and iterate on the design collaboratively.

What's not there yet

  • I haven't written tests yet, as I'd like to make sure any big design changes are settled before that.
  • The initial design described in Add a streaming-based CSV parser to k6 #2976 included two concepts I haven't included here yet, as I'm not sure what the best API or most performance-oriented solution would be (ideas welcome 🤝):
    • The ability to describe a strategy for selecting which lines should be parsed and which should be ignored (say, spreading a file's lines evenly across all your VUs, for instance).
    • The ability to instruct the parser to cycle through the file: once it reaches the end, it restarts from the top. My main question is whether we would want a dedicated method/API for that, given that it would probably already be possible with the existing APIs (csv.Parser.next returns an iterator-like object with a done property, seeking through the file is possible, and re-instantiating the parser once the end is reached is an option).

Why?

Using CSV files in k6 tests is a very common pattern, and until recently, doing it efficiently could prove tricky. One common issue encountered by users is that JS tends to be rather slow when performing parsing operations. Hence, we leverage the fs module constructs and the asynchronous APIs introduced in Goja over the last year to implement a high-performance, Go-based streaming CSV parser.
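For reference, the idea is that the Go side reads records one at a time from an io.Reader rather than loading the whole file into memory. A simplified sketch of that shape, assuming it builds on encoding/csv (this is illustrative, not the actual module code):

package csv

import (
	"encoding/csv"
	"io"
)

// parser streams records from the underlying reader one at a time.
type parser struct {
	reader *csv.Reader
}

func newParser(r io.Reader, delimiter rune) *parser {
	cr := csv.NewReader(r)
	cr.Comma = delimiter
	return &parser{reader: cr}
}

// next returns the next record, and reports done=true once EOF is reached.
func (p *parser) next() ([]string, bool, error) {
	record, err := p.reader.Read()
	if err == io.EOF {
		return nil, true, nil
	}
	if err != nil {
		return nil, false, err
	}
	return record, false, nil
}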

Checklist

  • I have performed a self-review of my code.
  • I have added tests for my changes.
  • I have run the linter locally (make lint) and all checks pass.
  • I have run tests locally (make tests) and all tests pass.
  • I have commented on my code, particularly in hard-to-understand areas.

Related PR(s)/Issue(s)

#2976

@oleiade oleiade self-assigned this May 15, 2024
@joanlopez (Contributor)

One common issue encountered by users is that JS tends to be rather slow when performing parsing operations.

Take this as just a simple idea rather than something that's really a requirement for this pull request to move forward, but considering that you explicitly mentioned it, it would be nice to have a small benchmark for comparison.
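A rough sketch of what such a Go-side micro-benchmark could look like; this only measures encoding/csv (which the Go layer is assumed to build on), and the names here are illustrative rather than the module's actual code. A comparison against a pure-JS parser would have to live in a k6 script instead:

package csv_test

import (
	"bytes"
	"encoding/csv"
	"io"
	"strings"
	"testing"
)

func BenchmarkStreamingParse(b *testing.B) {
	// Build a ~10k-line CSV payload once, outside the timed loop.
	var sb strings.Builder
	for i := 0; i < 10000; i++ {
		sb.WriteString("firstname,lastname,email,password\n")
	}
	payload := []byte(sb.String())

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		r := csv.NewReader(bytes.NewReader(payload))
		for {
			_, err := r.Read()
			if err == io.EOF {
				break
			}
			if err != nil {
				b.Fatal(err)
			}
		}
	}
}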

"go.k6.io/k6/js/modules"
)

// TODO: Should we have an option to cycle through the file (restart from start when EOF is reached, or done is true)?
Contributor

My personal opinion: if it doesn't pollute the code/API too much, then I'd go for it, as in theory it doesn't sound very complex, while it might be very handy for certain use cases.

)

// TODO: Should we have an option to cycle through the file (restart from start when EOF is reached, or done is true)?
// TODO: Should we have an option to skip empty lines (lines with no fields?)
Contributor

My first reaction is: why not do so by default? But I guess we could take a look at similar parser libraries/tools and get inspiration from what looks to be more common there.

PS: Do you mean lines with at least one value missing, or lines that are completely empty? 🤔

Member Author

I think I raised this question because I was reviewing the config of papaparse's parse method, which has a skipEmptyLines option: https://www.papaparse.com/docs

And I wondered whether we should have the same option 👍🏻 I've made a very quick and shallow pass over Node and Deno, and I don't see them having such an option (but I might have missed it).

PS: Do you mean lines with at least one value missing, or lines that are completely empty? 🤔

I meant specifically empty lines. The default behavior of Go's CSV parser is to not make assumptions about the number of fields per record, but there's an option to enforce it. Maybe it would be worth adding an option to the module's parser too, letting users tell us how many fields per record to expect, and erroring if we find a different count in any of the records?
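For illustration, this is the encoding/csv knob being referred to: setting FieldsPerRecord to a positive value makes the reader error out on any record with a different field count, which is what a corresponding module option could surface to users (the k6-side option name is not decided here):

package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	r := csv.NewReader(strings.NewReader("a,b,c\n1,2\n"))
	r.FieldsPerRecord = 3 // expect exactly 3 fields in every record

	_, err := r.ReadAll()
	fmt.Println(err) // record on line 2: wrong number of fields
}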

Member Author

Regarding skipEmptyLines, it turns out that skipping them is the default behavior of the Go CSV parser, and it looks like it can't be overridden, so that sorts it out 👍🏻
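A quick check of that default behaviour, for the record: encoding/csv silently drops blank lines and exposes no option to keep them:

package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	r := csv.NewReader(strings.NewReader("a,b\n\n\nc,d\n"))
	records, _ := r.ReadAll()
	fmt.Println(records) // [[a b] [c d]] — the blank lines are skipped
}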


// TODO: Should we have an option to cycle through the file (restart from start when EOF is reached, or done is true)?
// TODO: Should we have an option to skip empty lines (lines with no fields?)
// TODO: Should we have an option to skip lines based on a specific predicate func(linenum, fields): bool ?
Contributor

Again, just my opinion: I'd say that'd be a nice-to-have, but not critical for a first iteration, so I'd keep that idea for a second one.

Member Author

I talked with Pawel; he agreed with you and mentioned that our current scope is good enough for 90% of the use cases, and that we could add it in the future if we still want to 👍🏻

// implementation details, but keep it public so that we can access it
// from other modules that would want to leverage its implementation of
// io.Reader and io.Seeker.
Impl file `js:"-"`
Contributor

Nit: I think it's generally discouraged to expose, as part of the public surface of the package (fs in this case), an attribute whose type is private.

Member Author

I agree, this is an abstraction leak. The reason I started with it is that I found the Read/ExportedRead approach we had initially taken somewhat inelegant (at least it tickled my sense of "nice" code).

But in hindsight it's probably the best compromise indeed.

The Impl approach, which is explicitly not exposed to users but to other k6 libs and modules, started from the rationale that the underlying file struct already implements ReadSeeker, so I might as well leverage that. But because I couldn't find a way to retrieve the underlying file implementation struct using ExportTo, or by doing it manually, I went with a public Impl instead.

I'll give another shot at our initial approach and try to make it look okay 👍🏻

Contributor

Would it be feasible to define it as an io.ReadSeeker? 🤔

If possible, even if that requires a couple of tricks internally (because of stat()), maybe that's a decent way to expose it for other k6 libs and modules.

Member Author

Do you mean something like this?

type File struct {
	// ...

	ReadSeeker io.ReadSeeker `js:"-"`

	// ...
}

If so, it's something I considered, and I think it could work (would have to double-check) 🙇‍♂️

@joanlopez (Contributor) commented May 23, 2024

Something like that, yeah. I think it won't work as-is, because of stat(), as I said, but perhaps we can find some workaround that only lives internally/privately in the package/implementation, with the benefit of keeping the public surface clean.

Member Author

Let me see what I can do 🙇‍♂️

Member Author

Alright, I've pushed a commit that exposes the behavior as an interface instead. I like it much more, indeed.

I've created a Stater interface, made file implement it, and incorporated it into a ReadSeekStater interface, which I use to expose the underlying behavior instead of file itself. Let me know what you think 👍🏻
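For readers following along, a rough sketch of the shape described here; the exact method set (in particular Stat's signature and the FileInfo type) is an assumption based on this comment, not the final fs module API:

package fs

import "io"

// FileInfo mirrors the metadata the fs module already returns from stat().
type FileInfo struct {
	Name string
	Size int64
}

// Stater exposes the file's metadata lookup.
type Stater interface {
	Stat() (*FileInfo, error)
}

// ReadSeekStater is what the csv module (and others) consume instead of the
// private file type itself.
type ReadSeekStater interface {
	io.Reader
	io.Seeker
	Stater
}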

Contributor

It looks much cleaner now, at the very least! 👌🏻

@joanlopez (Contributor) left a comment

Thanks for giving form to what we started during the Crococon 💟

I left multiple comments as some form of initial feedback, but generally speaking I think this approach is more than okay, and from my side I'd suggest moving forward (with tests and all that) 🚀

I'm not sure how far we are from being able to publish this as an experimental module, but I guess the feedback and usage we collect from users during its experimental stage is what will help us answer some of the open questions you left, and actually confirm whether the current approach is good enough or not.

Collaborator

I would expect to use the parser mostly within two scenarios:

  1. Getting one line per iteration
  2. Getting one line per VU

The current example, iterating over the whole batch during a single iteration, doesn't sound representative of the most common use case.

Am I missing something?

Contributor

I think these are reasonable use cases, right! Indeed, I think I mentioned at least one of them during the ideation phase at Crococon.

Member Author

You're both right 🙂

If I summarize:

  • "As a user, I want to create a parser, and every executed iteration gets the next line, and advances the line "head" to the next one for the next iteration (regardless of the VU executing the iteration)"
  • "As a user, I want to create a parser, and when the VU code calls the next method, the next line_number % vu_number is returned to it"

Do you agree with these statements? 🙂

I think both call for rather different designs, and off the top of my head, it might not be easy to support both efficiently. I'm going to look into it and see what I come up with 🙇‍♂️

Contributor

Yeah! I guess we could leave the second one for the future, if we manage to let different VUs use different parsers, and leave the work to the test author. I know it's not ideal, and for certain situations where the number of VUs varies over time it might be more difficult, though.

In general I'd say: let's get an MVP and ship it! And, IMHO, that MVP should have at least one mechanism that doesn't imply having a new parser for each iteration, but one shared across iterations/VUs somehow (the one we think has the better balance between popularity and feasibility). Once we have that, I'd vote for merging it, then iterating with more strategies.

@codebien (Collaborator) commented May 23, 2024

Per-vu:

const vuID = exec.vu.idInTest; // exec comes from 'k6/execution'

let parser = new csv.Parser(file, {
  fromLine: vuID, // potentially with the modulo
  toLine: vuID, // potentially with the modulo
})

const {done, value} = await parser.next();

export default async function() {
  console.log(value)
}

Per-iteration:

let parser = new csv.Parser(file)

export default async function() {
  const {done, value} = await parser.next();
  console.log(value)
}

These should already work; we just have to add unit tests for them.

Contributor

Yeah, for the "per-VU" case, I guess there could be multiple strategies:

  • Enabling an option to skip rows, like only processing the rows that are a multiple of...
  • Suggesting that the user, if they know the number of rows in advance, split the readers accordingly.

Member Author

Re per-VU: it's interesting. I think we had completely different assumptions/understandings regarding the per-VU strategy 🤓 From your example, it looks like you wish to select a chunk of the whole file based on a VU number, whereas I had in mind that, with 16 VUs, VU 1 would receive lines 1, 17, 33, etc.

Contributor

Re per-VU: it's interesting. I think we had completely different assumptions/understandings regarding the per-VU strategy 🤓 From your example, it looks like you wish to select a chunk of the whole file based on a VU number, whereas I had in mind that, with 16 VUs, VU 1 would receive lines 1, 17, 33, etc.

Aren't these the two strategies I mentioned above? 🤔 Or did I miss any?

Member Author

Indeed. I think we posted our comments at the same time, not seeing each other's 🤓

[Screenshot: CleanShot 2024-05-23 at 11 31 54]

Contributor

Glad to see we're aligned, then! Also on when to write comments, hahaha :)

})();

export default async function() {
let parser = new csv.Parser(file, {
Member Author

For context, @joanlopez and @codebien, I just found out that currently, because we don't have support for await in the init context, the only way (with the current design) to instantiate the parser from the init context is the following:

let file;
let parser;
(async function () {
	file = await open('data.csv');
	parser = new csv.Parser(file, {
		delimiter: ',',
		skipFirstLine: true,
		fromLine: 3,
		toLine: 13,
	})
})();

Which works, but adds a workaround to the workaround :-/

An alternative I would consider is having the parser support receiving a file path and opening the file under the hood (which might involve exposing the currently private openImpl function from the fs module to other modules such as this one).

What do you think? 🦖

Collaborator

Which works, but adds a workaround to the workaround :-/

It doesn't sound like a problem within the scope of this pull request. The problem is async in the init context, and at some point we'll have to resolve it (we are already on it with ESM).

Member Author

You're probably right indeed 👍🏻 🙇🏻

@oleiade (Member Author) commented May 27, 2024

Posting here a summary of the use cases we discussed privately, and that we'd like the module to tackle:

  1. As a user, I want to read a CSV file containing 1000 credentials, and have each credential processed by a single iteration.
  • No credential should be processed more than once,
  • unless the parser is explicitly instructed to restart from the beginning; in that scenario, the same credential can be processed multiple times.
  • If the option is not set, and the user calls parser.next() after all credentials are consumed, they keep getting a { done: true, value: undefined } response.
  2. As a user, I want to read a CSV file containing 1000 credentials, and have each subset of those credentials reserved for processing by a single VU.
  • The subset of credentials could be, for instance, a chunk: credentials 0-100 go to VU 1, credentials 101-200 go to VU 2, etc.
  • The subset of credentials could be every Nth credential: 0, 10, 20, 30, etc. go to VU 1; 1, 11, 21, 31, etc. go to VU 2; and so on.
  • This is possible with the existing SharedArray approach, but it needs a faster way of processing the rows.
  3. As a user, I want each iteration to stream through my CSV file, and have the ability to act upon each returned record.
  • The user has the ability to skip a record, or to stop the iteration, based on the content of the record or the line number.
  • This assumes that each iteration needs the whole content of the file to perform its test.
