Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On Error: Continue to Next Line #136

Closed
ak--47 opened this issue May 24, 2023 · 5 comments
Closed

On Error: Continue to Next Line #136

ak--47 opened this issue May 24, 2023 · 5 comments
Assignees

Comments

@ak--47
Copy link
Contributor

ak--47 commented May 24, 2023

hello @uhop ... me again 😇

i understand that stream-json expects to be working with validated json or jsonl, and i've also read many of your comments on why it is not the job of this library to validate json... just to parse it

my use cases involves parsing jsonl of unknown origin, so i don't know ahead of time if it's valid.

i found the the checkErrors option, which does throw when a line in the source file is bad json, but i'm wondering if there's a straightforward way, when catching the error to tell the parser to "skip the bad record and continue to the next line \n" ... something like:

const stream = fs.createReadStream(filePath);
const parsedStream = stream.pipe(parser({ checkErrors: true }));
parsedStream.on('error', (e) => {
	// we have a bad line of jsonl
	// can we "skip" this line and continue to the next one?
});

(i may not have a correct mental model of how this library packs values... but i'm trying to avoid blowing up on an entire file just because there's one "bad line")

when exploring the jsonl parser code, i realize i can swallow the errors and continue by calling the callback with null in the catch clause:

  _checked_processBuffer(callback) {
    const lines = this._buffer.split('\n');
    this._rest += lines[0];
    if (lines.length > 1) {
      try {
        this._rest && this.push({key: this._counter++, value: JSON.parse(this._rest, this._reviver)});
        this._rest = lines.pop();
        for (let i = 1; i < lines.length; ++i) {
          lines[i] && this.push({key: this._counter++, value: JSON.parse(lines[i], this._reviver)});
        }
      } catch (cbErr) {
        this._buffer = '';
        callback(null) //callback(cbErr); 
        return;
      }
    }
    this._buffer = '';
    callback(null);
  }

but this feels messy...

any pointers are appreciate. thanks again!

ak--47 added a commit to ak--47/mixpanel-import that referenced this issue May 24, 2023
forking stream-json so that if input files are bad we do not blow up.

fork is here: https://github.com/ak--47/stream-json

OG comment is here: uhop/stream-json#136
@uhop uhop self-assigned this May 25, 2023
@uhop
Copy link
Owner

uhop commented May 25, 2023

First of all: using the ‘error’ event is a non-starter because it can be fired only once. To wit:

https://nodejs.org/api/stream.html#event-error

Restarting a generic JSON parser after a syntax error has proved to be impossible. Trust me — I tried.

Potentially we can restart a JSONL parser by ignoring a line and returning some specified value, which indicates an error. That’s probably the best we can do.

@ak--47
Copy link
Contributor Author

ak--47 commented May 25, 2023

@uhop thank you for the kind and thoughtful responses.

for now, i have my own fork which just swallows errors and continues parsing. it seems to work for what i need...

i'll see if i can make progress on something that could actually work via an option pasted to the jsonlParser ... {skipParseErrors: true }

@uhop
Copy link
Owner

uhop commented May 25, 2023

If all you need is the JSONL parser that ignores errors — that can be arranged. Make a good PR, or wait until I have time to add this feature using a temporary solution meanwhile.

@ak--47
Copy link
Contributor Author

ak--47 commented May 30, 2023

@uhop here's the first real implementation i tried:

./jsonl/Parser.js

_skipErrors_processBuffer(callback) {
	const lines = this._buffer.split('\n');
	this._rest += lines[0];
	if (lines.length > 1) {
		try {
			this._rest && this.push({
				key: this._counter++,
				value: JSON.parse(this._rest, this._reviver)
			});
		} catch (e) {
			//first line is bad json, skip it
		}
		this._rest = lines.pop();
		loopBuffer: for (let i = 1; i < lines.length; ++i) {
			try {
				lines[i] && this.push({
					key: this._counter++,
					value: JSON.parse(lines[i], this._reviver)
				});
			} catch (e) {
				//bad json, skip it
				continue loopBuffer;
			}
		}
	}
	this._buffer = '';
	callback(null);
}

and:

class JsonlParser extends Utf8Stream {
	static make(options) {
		return new JsonlParser(options);
	}
	constructor(options) {
		super(Object.assign({}, options, { readableObjectMode: true }));
		this._rest = '';
		this._counter = 0;
		this._reviver = options && options.reviver;
		if (options && options.checkErrors) {
			this._processBuffer = this._checked_processBuffer;
			this._flush = this._checked_flush;
		}

		//new
		if (options && options.skipErrors) {
			this._processBuffer = this._skipErrors_processBuffer;
		}
	}

so basically skipErrors would be mutually exclusive with checkErrors ...

haven't written any tests or docs, but curious to get your impression before i go further.

@uhop uhop closed this as completed in 2854b16 May 30, 2023
@ak--47
Copy link
Contributor Author

ak--47 commented Aug 18, 2023

#140 small PR for also forwarding the value, in addition to the error.

thanks @uhop for all you do!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants