-
Notifications
You must be signed in to change notification settings - Fork 415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize case_fold_and_combine_ranges? #79
Comments
Yeah, I saw that. I'm not terribly concerned with parsing/compiling performance. I'll take a pass over it in my parser rewrite. (Which I'm doing as part of #29.) |
I think I have at least this optimization implemented in the parser rewrite: https://github.com/rust-lang/regex/blob/new-parser/regex_syntax/src/lib.rs#L148 |
Nice! I feel it could be more clever still, but this seems to be a nice compromise of optimizing some (most common?) cases while keeping it relatively simple. Have you measured it before/after and against case-sensitive for large classes like |
Not yet. Once I finish the rewrite, I'll add some benchmarks and maybe do some profiling to see if I can pick off low hanging fruit. But yeah, I feel like the entire way classes are constructed (particularly canoncalizing them) could be more clever. The code is pretty short/clear now though. :) |
This commit introduces a new `regex-syntax` crate that provides a regular expression parser and an abstract syntax for regular expressions. As part of this effort, the parser has been rewritten and has grown a substantial number of tests. The `regex` crate itself hasn't changed too much. I opted for the smallest possible delta to get it working with the new regex AST. In most cases, this simplified code because it no longer has to deal with unwieldy flags. (Instead, flag information is baked into the AST.) Here is a list of public facing non-breaking changes: * A new `regex-syntax` crate with a parser, regex AST and lots of tests. This closes #29 and fixes #84. * A new flag, `x`, has been added. This allows one to write regexes with insignificant whitespace and comments. * Repetition operators can now be directly applied to zero-width matches. e.g., `\b+` was previously not allowed but now works. Note that one could always write `(\b)+` previously. This change is mostly about lifting an arbitrary restriction. And a list of breaking changes: * A new `Regex::with_size_limit` constructor function, that allows one to tweak the limit on the size of a compiled regex. This fixes #67. The new method isn't a breaking change, but regexes that exceed the size limit (set to 10MB by default) will no longer compile. To fix, simply call `Regex::with_size_limit` with a bigger limit. * Capture group names cannot start with a number. This is a breaking change because regexes that previously compiled (e.g., `(?P<1a>.)`) will now return an error. This fixes #69. * The `regex::Error` type has been changed to reflect the better error reporting in the `regex-syntax` crate, and a new error for limiting regexes to a certain size. This is a breaking change. Most folks just call `unwrap()` on `Regex::new`, so I expect this to have minimal impact. Closes #29, #67, #69, #79, #84. [breaking-change]
Weird that github didn't close this. Fixed by #87. |
I haven’t timed it, but the
case_fold_and_combine_ranges
function introduced in #78 is probably slow for large sets of character ranges like\p{Lu}
. (That is, slow to parse/compile a regex.) It can probably be improved to not consider individualchar
s when we can determine that none in a given range (or subrange) is affected by case folding.The text was updated successfully, but these errors were encountered: