Support for the `bytes` API? #84

Alphare · 2021-11-30T11:25:15Z

Hi,

I'm looking into fancy-regex to maybe add it to Mercurial's hgignore system instead of regex to allow for non-linear regex as a fallback. Right now we completely disable the Rust fast-path if we come across an unsupported pattern, which is kind of a bummer.

We're using the bytes API from the regex crate and it looks like it's missing from fancy-regex. Is it something that can be added?

Thanks

robinst · 2021-12-17T03:20:20Z

Hey!

I haven't looked into how the bytes API works in regex at all, but given that we either delegate to regex or use our own VM which already internally prefers working with bytes, it should be possible. The biggest hurdle would probably be creating and maintaining the parallel API. I think regex does it via macros but yeah.

So it's probably not a trivial change, and I don't have any plans to work on it at the moment. I'll label this "help wanted" and see if someone wants to pick it up.

As a compromise for your solution, would it be possible to delegate to fancy-regex just in case your input and patterns are well-formed UTF-8 (which is probably the vast majority of cases)?

Alphare · 2021-12-17T09:52:25Z

Thanks for getting back to me!

I understand this is a decent amount of work. I think falling back to fancy-regex if regex itself fails (so reversing the fallback), and only doing the UTF-8 conversion in that case, should be a decent workaround that wouldn't impact performance for the "common case".

I'll keep you updated on that experiment once I get some time to try it out.
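
For readers following along, the reversed fallback described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `PathMatcher` and its variants are hypothetical names, and the two engines are abstracted as boxed closures so the sketch needs only the standard library. In practice the `Bytes` closure would presumably wrap a `regex::bytes::Regex`, and the `Text` closure a `fancy_regex::Regex` (whose `is_match` returns a `Result`, so the closure would also have to decide how to handle that error).

```rust
use std::str::{self, Utf8Error};

/// Hypothetical wrapper: one matcher per compiled ignore pattern.
enum PathMatcher {
    /// Fast path: the pattern compiled with the linear engine and
    /// matches raw bytes directly, with no UTF-8 check.
    Bytes(Box<dyn Fn(&[u8]) -> bool>),
    /// Fallback: the pattern needed a backtracking engine that only
    /// accepts `&str`, so the path is validated as UTF-8 on each call.
    Text(Box<dyn Fn(&str) -> bool>),
}

impl PathMatcher {
    /// Returns `Err` only on the fallback path, when the input is not
    /// valid UTF-8. The common (bytes) case never pays for the check.
    fn is_match(&self, path: &[u8]) -> Result<bool, Utf8Error> {
        match self {
            PathMatcher::Bytes(m) => Ok(m(path)),
            PathMatcher::Text(m) => str::from_utf8(path).map(|s| m(s)),
        }
    }
}
```

A caller that prefers the plain `path -> bool` signature could apply `unwrap_or(false)` to the result, at the cost of silently treating non-UTF-8 paths as non-matching.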

Alphare · 2021-12-17T13:19:13Z

I just tried it (something unrelated took some time to install 😅), and there are a couple of issues with the workaround, I'll document those so it's easier to follow:

- Not being able to pass in bytes means a UTF-8 conversion/check for every single call (can be hundreds of thousands or millions per invocation in large repositories)
- Either changing the signature from `path -> bool` to `path -> Result<bool, UTF8Error>` or silently eating those errors when they happen
- fancy-regex does not have a way of specifying (for example in `RegexBuilder`) to not try regex first, which slows us down because we've already tried it and it's doing the work twice
- The performance seems to fall off a cliff, but that's probably to do with the patterns themselves, which can most likely be turned into linear regexes with some effort. Another idea I had was to identify patterns that aren't linear and split the linear parts from the backtracking ones, feeding only the latter to fancy-regex so that it builds a smaller VM. But that's probably a lot of work.

Thanks for listening ;)
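
One coarse way to approximate the splitting idea from the last point is to partition whole patterns rather than parts of a pattern: try each one with the linear engine and keep only the rejects for the backtracking engine. The sketch below assumes the ignore rules are available as individual pattern strings. `partition_patterns` and `compile_linear` are hypothetical names; `compile_linear` stands in for whatever compiles a pattern with the linear engine (for example a call to `regex::bytes::Regex::new`).

```rust
/// Splits `patterns` into those the linear engine accepts (returned
/// already compiled) and those it rejects (returned as source text,
/// to be handed to a backtracking engine separately).
fn partition_patterns<'a, T, E>(
    patterns: &[&'a str],
    compile_linear: impl Fn(&str) -> Result<T, E>,
) -> (Vec<T>, Vec<&'a str>) {
    let mut linear = Vec::new();
    let mut backtracking = Vec::new();
    for &pat in patterns {
        match compile_linear(pat) {
            // The linear engine handled it: keep the compiled form.
            Ok(compiled) => linear.push(compiled),
            // Unsupported syntax (or any other error): defer to fallback.
            Err(_) => backtracking.push(pat),
        }
    }
    (linear, backtracking)
}
```

Because each pattern is compiled by the linear engine exactly once, this also sidesteps the "work done twice" problem, provided the rejected patterns can then be given to the fallback engine directly.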

robinst added the "enhancement" and "help wanted" labels · Dec 17, 2021

CeleritasCelery mentioned this issue Jan 14, 2023

Regex Library CeleritasCelery/rune#19

Open

ZJaume mentioned this issue Sep 14, 2023

Is serialization supported? #114

Closed
