
Support for the bytes API? #84

Open
Alphare opened this issue Nov 30, 2021 · 3 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@Alphare

Alphare commented Nov 30, 2021

Hi,

I'm looking into fancy-regex to maybe add it to Mercurial's hgignore system instead of regex to allow for non-linear regex as a fallback. Right now we completely disable the Rust fast-path if we come across an unsupported pattern, which is kind of a bummer.

We're using the bytes API from the regex crate and it looks like it's missing from fancy-regex. Is it something that can be added?

Thanks

@robinst
Contributor

robinst commented Dec 17, 2021

Hey!

I haven't looked into how the bytes API works in regex at all, but given that we either delegate to regex or use our own VM which already internally prefers working with bytes, it should be possible. The biggest hurdle would probably be creating and maintaining the parallel API. I think regex does it via macros but yeah.

So it's probably not a trivial change, and I don't have any plans to work on it at the moment. I'll label this "help wanted" and see if someone wants to pick it up.

As a compromise for your use case, would it be possible to delegate to fancy-regex only when your input and patterns are well-formed UTF-8 (which is probably the vast majority of cases)?

@robinst robinst added enhancement New feature or request help wanted Extra attention is needed labels Dec 17, 2021
@Alphare
Author

Alphare commented Dec 17, 2021

Thanks for getting back to me!

I understand this is a decent amount of work. I think falling back to fancy-regex only if regex itself fails (so reversing the fallback), and only doing the UTF-8 conversion in that case, should be a decent workaround that wouldn't impact performance for the "common case".
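
Roughly, the reversed fallback could look like the sketch below. To keep it dependency-free, the two engines are stood in for by boxed closures: `BytesFn` plays the role of a compiled `regex::bytes::Regex` and `StrFn` the role of a compiled `fancy_regex::Regex`. The `Matcher` type and `with_fallback` are hypothetical names made up for illustration, not part of either crate.

```rust
use std::str::{self, Utf8Error};

// Stand-ins (assumptions) for the two compiled engines.
// BytesFn ~ regex::bytes::Regex, which matches on raw bytes.
// StrFn   ~ fancy_regex::Regex, which only accepts &str.
pub type BytesFn = Box<dyn Fn(&[u8]) -> bool>;
pub type StrFn = Box<dyn Fn(&str) -> bool>;

/// Records which engine ended up handling a pattern.
pub enum Matcher {
    Linear(BytesFn),
    Backtracking(StrFn),
}

impl Matcher {
    /// Reversed fallback: keep the linear engine if it compiled the pattern,
    /// and only build the backtracking one (lazily) when it did not.
    pub fn with_fallback(
        linear: Result<BytesFn, String>,
        backtracking: impl FnOnce() -> StrFn,
    ) -> Self {
        match linear {
            Ok(m) => Matcher::Linear(m),
            Err(_) => Matcher::Backtracking(backtracking()),
        }
    }

    /// The common case never touches UTF-8 validation; only the fallback
    /// path pays for `str::from_utf8`.
    /// Note: the real `fancy_regex::Regex::is_match` itself returns a
    /// `Result`, which this sketch leaves out.
    pub fn is_match(&self, path: &[u8]) -> Result<bool, Utf8Error> {
        match self {
            Matcher::Linear(m) => Ok(m(path)),
            Matcher::Backtracking(m) => Ok(m(str::from_utf8(path)?)),
        }
    }
}
```

With this shape, paths that are not valid UTF-8 still match fine against patterns the linear engine accepted; only patterns that needed the fallback can produce a `Utf8Error`.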

I'll keep you updated on that experiment once I get some time to try it out.

@Alphare
Author

Alphare commented Dec 17, 2021

I just tried it (something unrelated took some time to install 😅), and there are a couple of issues with the workaround. I'll document them here so they're easier to follow:

  • Not being able to pass in bytes means a UTF-8 conversion/check for every single call (can be hundreds of thousands or millions per invocation in large repositories)
  • Either changing the signature from path -> bool to path -> Result<bool, Utf8Error> or silently eating those errors when they happen
  • fancy-regex does not have a way of specifying (for example in RegexBuilder) to not try regex first, which slows us down because we've already tried it and it's doing the work twice
  • The performance seems to fall off a cliff, but that probably has to do with the patterns themselves, which can most likely be turned into linear regexes with some effort. Another idea I had was to identify patterns that aren't linear and split the linear parts from the backtracking ones, feeding only the latter to fancy-regex so that it builds a smaller VM. But that's probably a lot of work.
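
To make the first two points concrete, here is a minimal sketch of the two signature options. The `matcher` closure is a stand-in (an assumption) for a `&str`-only engine such as fancy-regex, and both function names are hypothetical.

```rust
use std::str::{self, Utf8Error};

/// Option 1: change the signature and surface the error to the caller.
/// `str::from_utf8` validates the whole path, so this costs O(len) per call.
pub fn is_ignored_strict(
    path: &[u8],
    matcher: impl Fn(&str) -> bool,
) -> Result<bool, Utf8Error> {
    let text = str::from_utf8(path)?;
    Ok(matcher(text))
}

/// Option 2: keep `path -> bool` and silently treat undecodable paths
/// as "not matched". The validation cost is the same as in option 1.
pub fn is_ignored_lenient(path: &[u8], matcher: impl Fn(&str) -> bool) -> bool {
    str::from_utf8(path).map(|text| matcher(text)).unwrap_or(false)
}
```

Either way the validation runs on every call, which is where the per-path overhead comes from.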

Thanks for listening ;)
