Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize case_fold_and_combine_ranges? #79

Closed
SimonSapin opened this issue Apr 19, 2015 · 5 comments
Closed

Optimize case_fold_and_combine_ranges? #79

SimonSapin opened this issue Apr 19, 2015 · 5 comments

Comments

@SimonSapin
Copy link
Contributor

I haven’t timed it, but the case_fold_and_combine_ranges function introduced in #78 is probably slow for large sets of character ranges like \p{Lu}. (That is, slow to parse/compile a regex.) It can probably be improved to not consider individual chars when we can determine that none in a given range (or subrange) is affected by case folding.

@BurntSushi
Copy link
Member

Yeah, I saw that. I'm not terribly concerned with parsing/compiling performance. I'll take a pass over it in my parser rewrite. (Which I'm doing as part of #29.)

@BurntSushi
Copy link
Member

I think I have at least this optimization implemented in the parser rewrite: https://github.com/rust-lang/regex/blob/new-parser/regex_syntax/src/lib.rs#L148

@SimonSapin
Copy link
Contributor Author

Nice! I feel it could be more clever still, but this seems to be a nice compromise of optimizing some (most common?) cases while keeping it relatively simple. Have you measured it before/after and against case-sensitive for large classes like \p{Lu}?

@BurntSushi
Copy link
Member

Not yet. Once I finish the rewrite, I'll add some benchmarks and maybe do some profiling to see if I can pick off low hanging fruit.

But yeah, I feel like the entire way classes are constructed (particularly canoncalizing them) could be more clever. The code is pretty short/clear now though. :)

BurntSushi added a commit that referenced this issue May 25, 2015
This commit introduces a new `regex-syntax` crate that provides a
regular expression parser and an abstract syntax for regular
expressions. As part of this effort, the parser has been rewritten and
has grown a substantial number of tests.

The `regex` crate itself hasn't changed too much. I opted for the
smallest possible delta to get it working with the new regex AST.
In most cases, this simplified code because it no longer has to deal
with unwieldy flags. (Instead, flag information is baked into the AST.)

Here is a list of public facing non-breaking changes:

* A new `regex-syntax` crate with a parser, regex AST and lots of tests.
  This closes #29 and fixes #84.
* A new flag, `x`, has been added. This allows one to write regexes with
  insignificant whitespace and comments.
* Repetition operators can now be directly applied to zero-width
  matches. e.g., `\b+` was previously not allowed but now works.
  Note that one could always write `(\b)+` previously. This change
  is mostly about lifting an arbitrary restriction.

And a list of breaking changes:

* A new `Regex::with_size_limit` constructor function, that allows one
  to tweak the limit on the size of a compiled regex. This fixes #67.
  The new method isn't a breaking change, but regexes that exceed the
  size limit (set to 10MB by default) will no longer compile. To fix,
  simply call `Regex::with_size_limit` with a bigger limit.
* Capture group names cannot start with a number. This is a breaking
  change because regexes that previously compiled (e.g., `(?P<1a>.)`)
  will now return an error. This fixes #69.
* The `regex::Error` type has been changed to reflect the better error
  reporting in the `regex-syntax` crate, and a new error for limiting
  regexes to a certain size. This is a breaking change. Most folks just
  call `unwrap()` on `Regex::new`, so I expect this to have minimal
  impact.

Closes #29, #67, #69, #79, #84.

[breaking-change]
@BurntSushi
Copy link
Member

Weird that github didn't close this. Fixed by #87.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants