Add an experimental csv module exposing a streaming csv parser #3743

Draft · wants to merge 4 commits into master
Conversation

@oleiade (Member) commented May 15, 2024

What?

This PR is a cleaned-up version of the CSV streaming parser we hacked together during Crococon. It aims to address #2976.

import { open } from 'k6/experimental/fs';
import csv from 'k6/experimental/csv';

// `await` is not yet supported in the init context, hence the async IIFE.
let file;
(async function () {
	file = await open('data.csv');
})();

export default async function () {
	// Instantiate a streaming parser over the already-open file handle.
	let parser = new csv.Parser(file, {
		delimiter: ',',
		skipFirstLine: true,
		fromLine: 3,
		toLine: 13,
	});

	// Iterate over the records until the parser reports it is done.
	while (true) {
		const { done, value } = await parser.next();
		if (done) {
			break;
		}

		console.log(value);
	}
}

This is a preliminary version whose goal is to gather input and iterate on the design collaboratively.

What's not there yet

  • I haven't written tests yet, as I'd like to make sure any big design changes are settled before that.
  • The initial design described in Add a streaming-based CSV parser to k6 #2976 included two concepts I haven't included here yet, as I'm not sure what the best API or most performance-oriented solution would be (ideas welcome 🤝):
    • The ability to describe a strategy for selecting which lines should be parsed and which should be ignored (say, spreading a file's lines evenly across all your VUs, for instance).
    • The ability to instruct the parser to cycle through the file: once it reaches the end, it restarts from the top. My main question is whether we would want a dedicated method/API for that, given that it would probably already be possible with the existing APIs (csv.Parser.next returns an iterator-like object with a done property, seeking through the file is possible, and re-instantiating the parser once the end is reached is an option).

Why?

Using CSV files in k6 tests is a very common pattern, and until recently, doing it efficiently could prove tricky. One common issue encountered by users is that JS tends to be rather slow when performing parsing operations. Hence, we leverage the fs module constructs and the asynchronous APIs introduced in Goja over the last year to implement a high-performance, Go-based streaming CSV parser.
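For reference, the idea is that the Go side reads records one at a time from an io.Reader rather than loading the whole file into memory. A simplified sketch of that shape, assuming it builds on encoding/csv (this is illustrative, not the actual module code):

package csv

import (
	"encoding/csv"
	"io"
)

// parser streams records from the underlying reader one at a time.
type parser struct {
	reader *csv.Reader
}

func newParser(r io.Reader, delimiter rune) *parser {
	cr := csv.NewReader(r)
	cr.Comma = delimiter
	return &parser{reader: cr}
}

// next returns the next record, and reports done=true once EOF is reached.
func (p *parser) next() ([]string, bool, error) {
	record, err := p.reader.Read()
	if err == io.EOF {
		return nil, true, nil
	}
	if err != nil {
		return nil, false, err
	}
	return record, false, nil
}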

Checklist

  • I have performed a self-review of my code.
  • I have added tests for my changes.
  • I have run the linter locally (make lint) and all checks pass.
  • I have run tests locally (make tests) and all tests pass.
  • I have commented on my code, particularly in hard-to-understand areas.

Related PR(s)/Issue(s)

#2976

@oleiade oleiade self-assigned this May 15, 2024
@joanlopez (Contributor)

One common issue encountered by users is that JS tends to be rather slow when performing parsing operations.

Take this as just a simple idea rather than something that's really a requirement for this pull request to move forward, but considering that you explicitly mentioned it, it would be nice to have a small benchmark for comparison.
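A rough sketch of what such a Go-side micro-benchmark could look like; this only measures encoding/csv (which the Go layer is assumed to build on), and the names here are illustrative rather than the module's actual code. A comparison against a pure-JS parser would have to live in a k6 script instead:

package csv_test

import (
	"bytes"
	"encoding/csv"
	"io"
	"strings"
	"testing"
)

func BenchmarkStreamingParse(b *testing.B) {
	// Build a ~10k-line CSV payload once, outside the timed loop.
	var sb strings.Builder
	for i := 0; i < 10000; i++ {
		sb.WriteString("firstname,lastname,email,password\n")
	}
	payload := []byte(sb.String())

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		r := csv.NewReader(bytes.NewReader(payload))
		for {
			_, err := r.Read()
			if err == io.EOF {
				break
			}
			if err != nil {
				b.Fatal(err)
			}
		}
	}
}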

"go.k6.io/k6/js/modules"
)

// TODO: Should we have an option to cycle through the file (restart from start when EOF is reached, or done is true)?
Contributor

My personal opinion: if it doesn't pollute the code/API too much, then I'd go for it, as in theory it doesn't sound very complex, while it might be very handy for certain use cases.

)

// TODO: Should we have an option to cycle through the file (restart from start when EOF is reached, or done is true)?
// TODO: Should we have an option to skip empty lines (lines with no fields?)
Contributor

My first reaction is: why not do so by default? But I guess we could take a look at similar parser libraries/tools and get inspiration from what looks to be more common there.

PS: Do you mean lines with at least one value missing, or lines that are completely empty? 🤔

Member Author

I think I raised this question because I was reviewing the config of papaparse's parse method, which has a skipEmptyLines option: https://www.papaparse.com/docs

And I wondered whether we should have the same option 👍🏻 I've made a very quick and shallow pass over Node and Deno, and I don't see them having such an option (but I might have missed it).

PS: Do you mean lines with at least one value missing, or lines that are completely empty? 🤔

I meant specifically empty lines. The default behavior of Go's CSV parser is to not make assumptions about the number of fields per record, but there's an option to enforce it. Maybe it would be worth adding an option to the module's parser too, letting users tell us how many fields per record to expect, and erroring if we find a different count in any of the records?
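For illustration, this is the encoding/csv knob being referred to: setting FieldsPerRecord to a positive value makes the reader error out on any record with a different field count, which is what a corresponding module option could surface to users (the k6-side option name is not decided here):

package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	r := csv.NewReader(strings.NewReader("a,b,c\n1,2\n"))
	r.FieldsPerRecord = 3 // expect exactly 3 fields in every record

	_, err := r.ReadAll()
	fmt.Println(err) // record on line 2: wrong number of fields
}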

Member Author

Regarding skipEmptyLines, it turns out that skipping them is the default behavior of the Go CSV parser, and it looks like it can't be overridden, so that sorts it out 👍🏻
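A quick check of that default behaviour, for the record: encoding/csv silently drops blank lines and exposes no option to keep them:

package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	r := csv.NewReader(strings.NewReader("a,b\n\n\nc,d\n"))
	records, _ := r.ReadAll()
	fmt.Println(records) // [[a b] [c d]] — the blank lines are skipped
}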


// TODO: Should we have an option to cycle through the file (restart from start when EOF is reached, or done is true)?
// TODO: Should we have an option to skip empty lines (lines with no fields?)
// TODO: Should we have an option to skip lines based on a specific predicate func(linenum, fields): bool ?
Contributor

Again, just my opinion: I'd say that'd be a nice-to-have, but not critical for a first iteration, so I'd keep that idea for a second one.

Member Author

I talked with Pawel; he agreed with you and mentioned that our current scope is good enough for 90% of the use cases, and that we could add it in the future if we still want to 👍🏻

// implementation details, but keep it public so that we can access it
// from other modules that would want to leverage its implementation of
// io.Reader and io.Seeker.
Impl file `js:"-"`
Contributor

Nit: I think it's generally discouraged to expose, as part of the public surface of the package (fs in this case), an attribute whose type is private.

Member Author

I agree, this is an abstraction leak. The reason I started with it is that I found the Read/ExportedRead approach we had initially taken somewhat inelegant (at least it tickled my sense of "nice" code).

But in hindsight it's probably the best compromise indeed.

The Impl approach, which is explicitly not exposed to users but to other k6 libs and modules, started from the rationale that the underlying file struct already implements ReadSeeker, so I might as well leverage that. But because I couldn't find a way to retrieve the underlying file implementation struct using ExportTo, or by doing it manually, I went with a public Impl instead.

I'll give another shot at our initial approach and try to make it look okay 👍🏻

Contributor

Would it be feasible to define it as an io.ReadSeeker? 🤔

If possible, even if that requires a couple of tricks internally (because of stat()), maybe that's a decent way to expose it for other k6 libs and modules.

Member Author

Do you mean something like this?

type File struct {
	// ...

	ReadSeeker io.ReadSeeker `js:"-"`

	// ...
}

If so, it's something I considered, and I think it could work (would have to double-check) 🙇‍♂️

@joanlopez (Contributor) commented May 23, 2024

Something like that, yeah. I think it won't work as-is, because of stat(), as I said, but perhaps we can find some workaround that only lives internally/privately in the package/implementation, with the benefit of keeping the public surface clean.

Member Author

Let me see what I can do 🙇‍♂️

Member Author

Alright, I've pushed a commit that exposes the behavior as an interface instead. I like it much more, indeed.

I've created a Stater interface, made file implement it, and incorporated it into a ReadSeekStater interface, which I use to expose the underlying behavior instead of file itself. Let me know what you think 👍🏻
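For readers following along, a rough sketch of the shape described here; the exact method set (in particular Stat's signature and the FileInfo type) is an assumption based on this comment, not the final fs module API:

package fs

import "io"

// FileInfo mirrors the metadata the fs module already returns from stat().
type FileInfo struct {
	Name string
	Size int64
}

// Stater exposes the file's metadata lookup.
type Stater interface {
	Stat() (*FileInfo, error)
}

// ReadSeekStater is what the csv module (and others) consume instead of the
// private file type itself.
type ReadSeekStater interface {
	io.Reader
	io.Seeker
	Stater
}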

Contributor

It looks much cleaner now, at the very least! 👌🏻

@joanlopez (Contributor) left a comment

Thanks for giving form to what we started during the Crococon 💟

I left multiple comments as some form of initial feedback, but generally speaking I think this approach is more than okay, and from my side I'd suggest moving forward (with tests and all that) 🚀

I'm not sure how far we are from being able to publish this as an experimental module, but I guess the feedback and usage we collect from users during its experimental stage is what will help us answer some of the open questions you left, and actually confirm whether the current approach is good enough or not.

Collaborator

I would expect to use the parser mostly within two scenarios:

  1. Getting one line per iteration
  2. Getting one line per VU

The current example, iterating over the whole batch during a single iteration, doesn't sound representative of the most common use case.

Am I missing something?

Contributor

I think these are reasonable use cases, right! Indeed, I think I mentioned at least one of them during the ideation phase at Crococon.

Member Author

You're both right 🙂

If I summarize:

  • "As a user, I want to create a parser, and every executed iteration gets the next line, and advances the line "head" to the next one for the next iteration (regardless of the VU executing the iteration)"
  • "As a user, I want to create a parser, and when the VU code calls the next method, the next line_number % vu_number is returned to it"

Do you agree with these statements? 🙂

I think both call for rather different designs, and off the top of my head, it might not be easy to support both efficiently. I'm going to look into it and see what I come up with 🙇‍♂️

Contributor

Yeah! I guess we could leave the second one for the future, if we manage to let different VUs use different parsers, and leave the work to the test author. I know it's not ideal, and for certain situations where the number of VUs varies over time it might be more difficult, though.

In general I'd say: let's get an MVP and ship it! And, IMHO, that MVP should have at least one mechanism that doesn't imply having a new parser for each iteration, but one shared across iterations/VUs somehow (the one we think has the better balance between popularity and feasibility). Once we have that, I'd vote for merging it, then iterating with more strategies.

@codebien (Collaborator) commented May 23, 2024

Per-vu:

const vuID = exec.vu.idInTest; // exec comes from 'k6/execution'

let parser = new csv.Parser(file, {
  fromLine: vuID, // potentially with the modulo
  toLine: vuID, // potentially with the modulo
})

const {done, value} = await parser.next();

export default async function() {
  console.log(value)
}

Per-iteration:

let parser = new csv.Parser(file)

export default async function() {
  const {done, value} = await parser.next();
  console.log(value)
}

These should already work; we just have to add unit tests for them.

Contributor

Yeah, for the "per-VU" case, I guess there could be multiple strategies:

  • Enabling an option to skip rows, like only processing the rows that are a multiple of...
  • Suggesting that the user, if they know the number of rows in advance, split the readers accordingly.

Member Author

Re per-VU: it's interesting. I think we had completely different assumptions/understandings regarding the per-VU strategy 🤓 From your example, it looks like you wish to select a chunk of the whole file based on a VU number, whereas I had in mind that, with 16 VUs, VU 1 would receive lines 1, 17, 33, etc.

Contributor

Re per-VU: it's interesting. I think we had completely different assumptions/understandings regarding the per-VU strategy 🤓 From your example, it looks like you wish to select a chunk of the whole file based on a VU number, whereas I had in mind that, with 16 VUs, VU 1 would receive lines 1, 17, 33, etc.

Aren't these the two strategies I mentioned above? 🤔 Or did I miss any?

Member Author

Indeed. I think we posted our comments at the same time, not seeing each other's 🤓

[Screenshot: CleanShot 2024-05-23 at 11 31 54]

Contributor

Glad to see we're aligned, then! Also on when to write comments, hahaha :)

})();

export default async function() {
let parser = new csv.Parser(file, {
Member Author

For context, @joanlopez and @codebien, I just found out that currently, because we don't have support for await in the init context, the only way (with the current design) to instantiate the parser from the init context is the following:

let file;
let parser;
(async function () {
	file = await open('data.csv');
	parser = new csv.Parser(file, {
		delimiter: ',',
		skipFirstLine: true,
		fromLine: 3,
		toLine: 13,
	})
})();

Which works, but adds a workaround to the workaround :-/

An alternative I would consider is having the parser support receiving a file path and opening the file under the hood (which might involve exposing the currently private openImpl function from the fs module to other modules such as this one).

What do you think? 🦖

Collaborator

Which works, but adds a workaround to the workaround :-/

It doesn't sound like a problem within the scope of this pull request. The problem is async in the init context, and at some point we'll have to resolve it (we are already on it with ESM).

Member Author

You're probably right indeed 👍🏻 🙇🏻

@oleiade (Member Author) commented May 27, 2024

Posting here a summary of the use cases we discussed privately, and that we'd like the module to tackle:

  1. As a user, I want to read a CSV file containing 1000 credentials, and have each credential processed by a single iteration.
  • No credential should be processed more than once,
  • unless the parser is explicitly instructed to restart from the beginning; in that scenario, the same credential can be processed multiple times.
  • If the option is not set, and the user calls parser.next() after all credentials are consumed, they keep getting a { done: true, value: undefined } response.
  2. As a user, I want to read a CSV file containing 1000 credentials, and have each subset of those credentials reserved for processing by a single VU.
  • The subset of credentials could be, for instance, a chunk: credentials 0-100 go to VU 1, credentials 101-200 go to VU 2, etc.
  • The subset of credentials could be every Nth credential: 0, 10, 20, 30, etc. go to VU 1; 1, 11, 21, 31, etc. go to VU 2; and so on.
  • This is possible with the existing SharedArray approach, but it needs a faster way of processing the rows.
  3. As a user, I want each iteration to stream through my CSV file, and have the ability to act upon each returned record.
  • The user has the ability to skip a record, or to stop the iteration, based on the content of the record or the line number.
  • This assumes that each iteration needs the whole content of the file to perform its test.
