Better parsing of header #27

Merged
merged 7 commits into from Sep 5, 2023

MrCurtis
Contributor

This ensures that the text from nested headers is not split on commas that are enclosed in quotes.

In particular, it ensures that this line

##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">

from the example in section 1.1 of the specs can be parsed.

Note that section 1.2 lists the comma as one of the special characters that should always be 'represented with the capitalized percent encoding' when not used for their specific meaning. However, I assume this applies only to commas that are not enclosed in quotes.
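
As a rough illustration of the intended behaviour (not the code in this PR, which uses a regex), a hand-written splitter using only the standard library might look like the sketch below. The name `split_outside_quotes` is hypothetical, and the sketch assumes quotes are never escaped inside a quoted value.

```rust
/// Splits `s` on commas that are not enclosed in double quotes.
/// Hypothetical helper for illustration only; escaped quotes are not handled.
fn split_outside_quotes(s: &str) -> Vec<&str> {
    let mut fields = Vec::new();
    let mut in_quotes = false;
    let mut start = 0;
    for (i, c) in s.char_indices() {
        match c {
            // Every double quote toggles whether we are inside a quoted block.
            '"' => in_quotes = !in_quotes,
            // Only a comma outside quotes ends the current field.
            ',' if !in_quotes => {
                fields.push(&s[start..i]);
                start = i + 1; // a comma is one byte in UTF-8
            }
            _ => {}
        }
    }
    // Whatever follows the last separator is the final field.
    fields.push(&s[start..]);
    fields
}
```

Applied to the contents of the INFO line above (the text between `<` and `>`), this should keep `Description="dbSNP membership, build 129"` as a single field rather than splitting it at the inner comma.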

vcf/src/parse.rs Outdated
// Repeatedly match either non-comma/non-quote characters or blocks of text enclosed in
// quotes, until we can't, in which case we're either at a non-quote-enclosed comma or the
// end of the string.
let re = Regex::new(r#"(?:[^,"]+|(?:"[^"]*"))+"#).unwrap();
Contributor Author

This regex gets compiled every time the function gets called, which doesn't seem very efficient. How do I place it outside the function call so that it is only compiled once?

Use something like lazy_static!:

https://stackoverflow.com/a/35169402

Or see rust-lang/regex#709 (comment), which uses only the Rust standard library. It seems there's something in the works in the regex crate itself to compile regexes statically, which would be nice.

Contributor Author

Using lazy_static! seems to have done the job. Cheers.
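
For reference, here is a std-only sketch of the compile-once pattern discussed in this thread, assuming Rust 1.70 or later for `std::sync::OnceLock`. It is not the code merged here, which uses lazy_static!. To stay dependency-free, a hypothetical `Pattern` type stands in for `regex::Regex`, and the `COMPILE_COUNT` counter and `header_pattern` function are likewise made up for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Counts how many times the pattern is built, to show it happens only once.
static COMPILE_COUNT: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for an expensive-to-build value such as regex::Regex.
struct Pattern {
    source: String,
}

impl Pattern {
    fn compile(source: &str) -> Pattern {
        COMPILE_COUNT.fetch_add(1, Ordering::SeqCst);
        Pattern { source: source.to_string() }
    }
}

// Returns the shared pattern, building it on the first call only.
fn header_pattern() -> &'static Pattern {
    static PATTERN: OnceLock<Pattern> = OnceLock::new();
    PATTERN.get_or_init(|| Pattern::compile(r#"(?:[^,"]+|(?:"[^"]*"))+"#))
}
```

Every call to `header_pattern()` returns a reference to the same value, so the parse function can call it freely without paying the construction cost again.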

This avoids recompiling the regex on every call to parse.
@MrCurtis MrCurtis merged commit d8b9cdd into main Sep 5, 2023
1 check passed