Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite parser as part of new regex-syntax crate. #87

Merged
merged 4 commits into from May 27, 2015
Merged

Conversation

BurntSushi
Copy link
Member

This commit introduces a new regex-syntax crate that provides a
regular expression parser and an abstract syntax for regular
expressions. As part of this effort, the parser has been rewritten and
has grown a substantial number of tests.

The regex crate itself hasn't changed too much. I opted for the
smallest possible delta to get it working with the new regex AST.
In most cases, this simplified code because it no longer has to deal
with unwieldy flags. (Instead, flag information is baked into the AST.)

Here is a list of public facing non-breaking changes:

  • A new regex-syntax crate with a parser, regex AST and lots of tests.
    This closes [feature request] Expose libregex's parsing/compiling internals #29 and fixes Attempting to parse {2} causes panic #84.
  • A new flag, x, has been added. This allows one to write regexes with
    insignificant whitespace and comments.
  • Repetition operators can now be directly applied to zero-width
    matches. e.g., \b+ was previously not allowed but now works.
    Note that one could always write (\b)+ previously. This change
    is mostly about lifting an arbitrary restriction.

And a list of breaking changes:

  • A new Regex::with_size_limit constructor function, that allows one
    to tweak the limit on the size of a compiled regex. This fixes impose a limit on the size of a compiled regex program #67.
    The new method isn't a breaking change, but regexes that exceed the
    size limit (set to 10MB by default) will no longer compile. To fix,
    simply call Regex::with_size_limit with a bigger limit.
  • Capture group names cannot start with a number. This is a breaking
    change because regexes that previously compiled (e.g., (?P<1a>.))
    will now return an error. This fixes Underscore in replacement string treated as backspace #69.
  • The regex::Error type has been changed to reflect the better error
    reporting in the regex-syntax crate, and a new error for limiting
    regexes to a certain size. This is a breaking change. Most folks just
    call unwrap() on Regex::new, so I expect this to have minimal
    impact.

Closes #29, #67, #69, #79, #84.

[breaking-change]

This commit introduces a new `regex-syntax` crate that provides a
regular expression parser and an abstract syntax for regular
expressions. As part of this effort, the parser has been rewritten and
has grown a substantial number of tests.

The `regex` crate itself hasn't changed too much. I opted for the
smallest possible delta to get it working with the new regex AST.
In most cases, this simplified code because it no longer has to deal
with unwieldy flags. (Instead, flag information is baked into the AST.)

Here is a list of public facing non-breaking changes:

* A new `regex-syntax` crate with a parser, regex AST and lots of tests.
  This closes #29 and fixes #84.
* A new flag, `x`, has been added. This allows one to write regexes with
  insignificant whitespace and comments.
* Repetition operators can now be directly applied to zero-width
  matches. e.g., `\b+` was previously not allowed but now works.
  Note that one could always write `(\b)+` previously. This change
  is mostly about lifting an arbitrary restriction.

And a list of breaking changes:

* A new `Regex::with_size_limit` constructor function, that allows one
  to tweak the limit on the size of a compiled regex. This fixes #67.
  The new method isn't a breaking change, but regexes that exceed the
  size limit (set to 10MB by default) will no longer compile. To fix,
  simply call `Regex::with_size_limit` with a bigger limit.
* Capture group names cannot start with a number. This is a breaking
  change because regexes that previously compiled (e.g., `(?P<1a>.)`)
  will now return an error. This fixes #69.
* The `regex::Error` type has been changed to reflect the better error
  reporting in the `regex-syntax` crate, and a new error for limiting
  regexes to a certain size. This is a breaking change. Most folks just
  call `unwrap()` on `Regex::new`, so I expect this to have minimal
  impact.

Closes #29, #67, #69, #79, #84.

[breaking-change]
@rust-highfive
Copy link

r? @huonw

(rust_highfive has picked a reviewer for you, use r? to override)

@BurntSushi
Copy link
Member Author

cc @alexcrichton @huonw @andrew-d

I think the thing I am most concerned about here is the definition of the Error type. It is a publicly exposed enum that I could envision changing in the future. (For example, if the regex-syntax crate grew to accept a superset of the syntax supported by regex, then there would need to be an additional error for rejecting unsupported syntax.) My instinct is that I should not expose the error as an enum and instead provide methods. (e.g., fn too_big(&self) -> Option<usize> and fn syntax_error(&self) -> Option<&regex_syntax::Error>.)

Also, part of the enum exposes the regex_syntax::Error type, which could also evolve. I think this means that if there's a breaking change in regex_syntax::{Error, ErrorKind}, then the major version number of regex would also need to be bumped. Not sure how to deal with this one.

@@ -21,8 +21,16 @@ path = "regex_macros/benches/bench_dynamic.rs"
test = false
bench = true

[dependencies.regex-syntax]
path = "regex_syntax"
version = "*"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer to write out this section like so nowadays:

[dependencies]
regex-syntax = { path = "regex_syntax", version = "*" }

Also, perhaps this could have a more strict version requirement than *? And finally, perhaps the folder could be called regex-syntax to match the crate name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And finally, perhaps the folder could be called regex-syntax to match the crate name?

Happy to oblige. I originally had that, but it looks very weird next to regex_macros, which I think is stuck with an underscore? :-(

👍 to everything else!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah unfortunately we can't switch underscore-to-dash after the fact just yet :(

I would like to enable this kind of feature on crates.io though as I suspect it comes up more often than not!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, hypens it is!

@alexcrichton
Copy link
Member

This looks awesome, thanks so much @BurntSushi!

And a list of breaking changes:

These all seem minor enough that I would be fine releasing this version as a minor update to this library (not a major version change).

I think the thing I am most concerned about here is the definition of the Error type. It is a publicly exposed enum that I could envision changing in the future.

I personally like having an enum for errors like this, and I think that they should be considered "defacto extensible" in the sense that although it's a breaking change to add a variant, I would not consider it one. You could add a #[doc(hidden)] variant named something like __DoNotMatchMeIfYouWishToNotBreak which may also deter exhaustive matches.

Also, part of the enum exposes the regex_syntax::Error type, which could also evolve. I think this means that if there's a breaking change in regex_syntax::{Error, ErrorKind}, then the major version number of regex would also need to be bumped. Not sure how to deal with this one.

Hm, this is interesting! I would believe that there's definitely a good deal of API leakage between crates in terms of a major version bump of your deps may require a major version bump for you, but I think with a dependency like 0.1.0 on regex-syntax that this crate will be safe. That basically means that regex 0.1.x will use the 0.1.x version of syntax, and it can evolve to a more updated syntax later if necessary.

[dev-dependencies]
rand = "0.3"

[features]
pattern = []

[profile.bench]
opt-level = 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that 3 is the default, so perhaps this could just be specified to set lto = true

The big change here is the addition of a non-public variant in the
error enums. This will hint to users that one shouldn't exhaustively
match the enums in case new variants are added.
@BurntSushi
Copy link
Member Author

I think I've addressed everything. Just followed all of your suggestions. :-) I can merge once Travis gives the OK!

@alexcrichton
Copy link
Member

Sounds good to me!

BurntSushi added a commit that referenced this pull request May 27, 2015
Rewrite parser as part of new regex-syntax crate.
@BurntSushi BurntSushi merged commit 3ba38c4 into master May 27, 2015
@frewsxcv
Copy link
Member

Welp, time to AFL :)

@BurntSushi
Copy link
Member Author

Haha. Do your worst. This time, quickcheck has my back. :P

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants