Skip to content
This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Sub-match extraction #14

Closed
robopeter opened this issue May 21, 2021 · 6 comments
Closed

Sub-match extraction #14

robopeter opened this issue May 21, 2021 · 6 comments

Comments

@robopeter
Copy link

First of all, thank you for all your work on this crate (and others!)!

This crate does not support sub-match extraction, which can be achieved with the regex crate's "captures" API. This may be added in the future, but is unlikely.

I just wanted to throw an issue in here to express an interest in having support for sub-match extraction. I understand this is probably unlikely (as stated) but perhaps others might chime in on this ticket if there's broader interest in this feature.

In the meantime, do you have suggestions / recomendations for byte-oriented no-std compatible regex processing supporting sub-match extraction?

@BurntSushi
Copy link
Owner

Thanks for the issue! So my response to this is kinda complex. Part of it is because the design space and use cases are both really quite large, so it's hard to interpret what it is you actually want. But to get the simple question out of the way first:

In the meantime, do you have suggestions / recomendations for byte-oriented no-std compatible regex processing supporting sub-match extraction?

The only thing I'm aware of in this space is maybe Ragel, but definitely re2c. Neither of those things are Rust, but I believe they can be made to work in a no_std context assuming you're okay with C code. (Or finding a way to convince them to emit Rust code.)

OK, so here's the thing. When I wrote that text in the README, I was thinking about submatch extraction for fully compiled DFAs. That's what re2c provides AIUI. But it cannot actually be done with a DFA. It isn't powerful enough. Instead, you need a computationally more expressive structured called tagged automata. I personally don't plan on implementing that in this crate.

What I'm working on right now is moving the various regex engines from the regex crate into this crate. That includes the NFA engines which support submatch extraction. While they are slower to execute, they might still fit your use case. But there are two issues. First, while it's possible to make the NFA no_std compatible, it requires a fair bit of work to do, and I don't plan on doing that initially... Just maybe eventually. (Now we're talking multiple years into the future.) The second issue depends on what you mean by "byte oriented." If you mean, "works on a &[u8] instead of a &str," then the NFA would satisfy that requirement. But if you mean, "able to be fed a byte at a time," then I think things get a bit trickier there. In particular, there is now an API design problem: how do you expose submatch boundaries in such an API? (This is a sincere question. I would love it if you had concrete ideas. See rust-lang/regex#425 also for related discussion on that point.)

Popping up a level, I would be genuinely curious to hear more about your use case if you could share it.

@robopeter
Copy link
Author

First of all the easy one, I meant "works on &[u8] instead of a &str, not a "streaming" implementation. Sorry for the confusion.

Sadly I can't (yet) get into all the use case details, but I can say my use case involves code running on a variety of embedded devices (hence no_std) looking for runtime-supplied regexes (which could be pre-compiled / serialized on a more powerful machine, like this crate does) in large-relative-to-the-embedded-device binaries. And really, while we're dreaming, I'd like bit-level matching (think: I might only care about every other bit in a byte) and sub-match extraction of bits that may not be byte aligned. Current thought is to just handle these dreams with pre/post processing of the regex and results (which assumes capture groups are an option), but I digress... :-) It's not hard to imagine that small embedded systems tend to care a lot more about data density and thus they tend to try to cram a lot more into every byte. This being said, I expect my regexes to match sequences that are on the order of dozens of bytes overall.

I've only briefly looked at Ragel and re2c, so maybe they do have a solution. However, I'm going to have many many different regexes over time and want to run on many different architectures. Given this, compiling each regex to architecture specific assembly doesn't seem like the best approach.

I certainly appreciate all your work, and it sounds like your likely headed towards implementing what I need (except the dreaming parts 😄) with a no_std NFA engine. I'll keep searching for the right approach for me in the short term and keep an eye on progress here (I'd love to help, but I fear this likely over my head). Thanks again!

@BurntSushi
Copy link
Owner

Hyperscan might be another one, but it is quite large.

@BurntSushi
Copy link
Owner

In the meantime, do you have suggestions / recomendations for byte-oriented no-std compatible regex processing supporting sub-match extraction?

Out of curiosity, when you say "no-std" here, do you also mean "no-alloc"? The regex crate will eventually support no_std, but it will require the alloc crate.

@robopeter
Copy link
Author

If I'm in dream land then no alloc would be ideal (let me give it a block of memory to work with, fail if it exhausts that memory). In reality, I'd probably be ok with an alloc dependant crate or I'd hack in something like heapless.

@BurntSushi
Copy link
Owner

In the meantime, do you have suggestions / recomendations for byte-oriented no-std compatible regex processing supporting sub-match extraction?

The regex and regex-automata crates now support this. You still need alloc, but there's no practical way around that without building a bespoke regex engine (or perhaps using re2c).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants