Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex on text streams #14015

Closed
suhr opened this issue May 7, 2014 · 14 comments
Closed

Regex on text streams #14015

suhr opened this issue May 7, 2014 · 14 comments

Comments

@suhr
Copy link

suhr commented May 7, 2014

Regex library defines methods for find/replace on strings, but what about text streams?

@suhr suhr changed the title Regex on text streams. Regex on text streams May 7, 2014
@huonw huonw added the A-libs label May 7, 2014
@pongad
Copy link
Contributor

pongad commented May 14, 2014

I'm not sure if this is worth the effort. In both cases, the regex library will have to implement a stream interface. Could you give a few cases where you think this might be useful?

@suhr
Copy link
Author

suhr commented May 14, 2014

Obvious one: find/replace in big files or stdin.

@pongad
Copy link
Contributor

pongad commented May 14, 2014

I'm not sure why I didn't think of that...
I shall have to defer the decision to the experts.

@BurntSushi
Copy link
Member

This actually shouldn't be as hard as one might think. The current implementation only requires a one character lookahead, which should be fine with working on a stream. It specifically never copies the search text, so I think most of the implementation is already amenable to using a stream.

What would the API look like?

@utkarshkukreti
Copy link
Contributor

@BurntSushi doesn't the VM runner require access to arbitrary byte index too here: https://github.com/mozilla/rust/blob/e87e18064ece0d7eddb269b8211fb8fdf5efaa91/src/libregex/vm.rs#L366-L390 ?

@BurntSushi
Copy link
Member

@utkarshkukreti Hmm, yeah, that piece would have to be re-worked. I believe it's primarily used for skipping across the search text when there is a literal prefix in the regex. I think we could find a way to do that without using raw byte indices. It was just much more convenient to do it that way. (This skipping requires no backtracking. If the prefix is never found, then the regex can't match.)

@huonw
Copy link
Member

huonw commented Jun 13, 2014

I thought of a possible issue with this: changing to a plain Iterator<char> or Reader means that it would longer be possible to return/handle &str slices into the original text; captures etc. would need to be allocating and building Strings.

@BurntSushi
Copy link
Member

Yuck. So all the return types would have to change to MaybeOwned?

@huonw
Copy link
Member

huonw commented Jun 13, 2014

Worse than that: String (i.e. always allocating). There's no way to go from an Iterator<char> to the underlying &str... In general, the chars aren't in memory (iter::Repeat::new('a')) or aren't UTF8 encoded (Vec<char>), that is, there might not be an underlying &str at all.

@BurntSushi
Copy link
Member

That is indeed pretty grim. I don't think it's acceptable for match|find|captures to always allocate---that will kill performance. I also think adding methods for streams is pretty lousy. That leaves us with... this gross thing:

trait Regexable<'a, T: Iterator<char>> {
    fn scannable(&self) -> MaybeIterator<'a, T>;
}
enum MaybeIterator<'a, T> {
    Iter(T),
    Slice(&'a str),
}

Then provide impls for the right types. Then the methods could return MaybeOwned, using a slice whenever the input uses a slice.

But yeah, this is going to be harder than I thought. In the iterator's case, we need a way to accrue the matched text. The VM is fundamentally built on the notion of returning indices into an existing string, and the notion simply doesn't work with streams. So that would have to be gutted.

This is a bigger change I thought. Damn.

@BurntSushi
Copy link
Member

I just took a look at Go's regexp package. They have some functions that work on streams, but they are a bit crippled. Firstly, there is only the equivalent of match|find|captures and it only returns the byte indices in the stream. Secondly, there's no "All" variant (the equivalent of find_iter and captures_iter).

The most interesting bit there is that the regexp functions on readers finds precisely one match and returns byte indices into that stream. In the Go world, if I wanted to use that and actually retrieve the text, I'd write my own "middleware" reader to save text as its being read. Then I could use that after the first match to grab the matched text. Then reset the "middleware" reader and start searching again.

The problem with that approach: what if there's never a match? You'd end up piling the whole stream into memory (likely defeating the purpose of using a stream).

Hmm. Am I missing anything? Are there any other implementations of regex searching on streams?

@BurntSushi
Copy link
Member

No regex stream matching in Perl 5, Python, .NET, Java. The Internet is full of hacks for when you know the max size of the text to match.

I did find this, which claims to have streaming regexes: https://github.com/openresty/sregex

@frewsxcv
Copy link
Member

In-tree regex was removed in #21458 . This should probably be moved to https://github.com/rust-lang/regex

@rust-highfive
Copy link
Collaborator

This issue has been moved to the regex repo: rust-lang/regex#25

bors added a commit to rust-lang-ci/rust that referenced this issue Feb 13, 2023
Don't include lifetime or label apostrophe when renaming

Closes rust-lang#13907
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants