Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: support searching an Iterator<u8> #386

Closed
Wilfred opened this issue Jul 16, 2017 · 6 comments
Closed

Feature request: support searching an Iterator<u8> #386

Wilfred opened this issue Jul 16, 2017 · 6 comments

Comments

@Wilfred
Copy link

Wilfred commented Jul 16, 2017

Following up on this reddit discussion, it would be very useful for the Remacs project if this crate supported searching non-contiguous blocks of memory. Currently, it's not possible to search a gap buffer with this crate. The reddit discussion suggests it would be useful for spidermonkey too.

Would you be willing to add this feature?

@BurntSushi
Copy link
Member

Would you willing to add this feature?

Nothing has changed since that reddit discussion. It's not really a matter of will, but a matter of resources. A feature like this needs significant implementation work and API design work. I do have vague plans to work on it, but it is not something you should expect to happen any time soon. (The right scale of measurement here is probably many months or years.)

Someone else could champion this effort, but said person would need to get their hands very dirty.

In the past, I have implemented a somewhat naive byte based finite state machine using similar concepts in the fst-regex crate. I doubt you could use it as is, but it might serve as an example of how to hack something together.

You might consider investigating what the Xi editor does. They are using fancy-regex IIRC, which is built on top of the regex crate.

Finally, you could find other means of achieving your goals. For example, by sacrificing multiline matching and searching a line at a time. (I am aware of the downsides of such a path and you have probably also considered it, but i figured it was still worth mentioning.)

@Wilfred
Copy link
Author

Wilfred commented Jul 26, 2017

Thanks, those are very useful references.

My understanding of fancy-regex is that it gives you all the exotic features of PCRE, whereas rust-lang/regex is a RE2 style regex implementation with strong performance guarantees.

Looking at the elisp regex syntax I don't believe there are any features that aren't present in this regex library. There's some extra syntax-based features (but we could look up the syntax characters and pass them through), and 'back references' (but here that just refers to replacing a matched group). That's all I'm aware of.

Elisp regexps have the useful property that . does not match newlines (you have to use a literal newline), so it's easy to detect patterns that are multiline.

Regarding line-at-a-time matching, given a gap buffer:

struct GapBuffer {
    text: Vec<u8>,
    gap_start: ptrdiff_t,
    gap_end: ptrdiff_t,
}

it would be possible to search three strings (Rust-ish pseudocode):

  • text[0..last_newline_before_gap_start]
  • text[last_newline_before_gap..gap_start] + text[gap_end..first_newline_after_gap]
  • text[first_newline_after_gap..]

In the event that a user gave us a multiline regex, we'd have to copy the data so the buffer is contiguous.

Have I understood you correctly regarding line-at-a-time?

@BurntSushi
Copy link
Member

Have I understood you correctly regarding line-at-a-time?

I believe so. As one more additional reference that might help, ripgrep forces line-at-a-time matching by forcefully removing anything in a regex's AST that could possibly match a new line: https://github.com/BurntSushi/ripgrep/blob/master/grep/src/nonl.rs --- You probably don't want to do that specific task, but that routine might serve as a good starting point for detecting whether a regex could match across multiple lines or not.

My understanding of fancy-regex is that it gives you all the exotic features of PCRE, whereas rust-lang/regex is a RE2 style regex implementation with strong performance guarantees.

Right. I didn't mean to focus on the particular features of the regex engine so much as say that Xi is using this regex engine in a text editor, so there might be lessons in Xi that you could learn from. @raphlinus is the guy to talk to about that!

@Wilfred
Copy link
Author

Wilfred commented Aug 6, 2017

Thanks again for your help and all those references. I've written up an RFC for using Rust regex crates in Remacs if you're interested.

How would you like to proceed with this GitHub issue? Should we leave it open, or would you prefer it closed as Iterator<u8> support is some way off?

@BurntSushi
Copy link
Member

@Wilfred Ah awesome, I'll be sure to check out that RFC!

I think we can leave this issue open. I wish I could give you a better estimate of when I think I'll get to this, but I have a lot on my plate! I have many places I want to take the regex crate. :-)

@BurntSushi
Copy link
Member

@Wilfred Even though this was filed first, I'm going to close this in favor of #425, which contains a bit more information from me and probably captures the more general version of this task.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants