Streaming Re #59

Open
Drup opened this issue May 12, 2015 · 5 comments

@Drup
Collaborator

Drup commented May 12, 2015

These are notes from a discussion with @vouillon on how to make re support streaming.

  • We should move pos and last out of the info record and pass them around explicitly, in particular in loop. Important: check spilling in the loop function.
  • Partial matching would introduce an abstract type partial containing:
    • an Re.state
    • a buffer of some sort
    • the current position in the buffer
  • We would expose two functions:
    • A function that adds new content to the buffer.
    • A function that takes a partial and resumes matching. This would be implemented by calling the loop function to consume more input and then the Re_automaton.status function to report the result.

It should also be possible to say "The streaming is finished, you can match eol/eos/stop".

There are delicate questions of content copying when initializing and refilling the buffer. In particular, copying the matched string to initialize the buffer is clearly not acceptable.
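To make the shape of the proposal concrete, here is a minimal sketch using only the OCaml standard library. Everything in it is an assumption made for illustration: the `automaton` record stands in for Re's compiled automaton, and `partial`, `start`, `feed`, `finish`, `resume` and the `status` constructors are hypothetical names, not Re's API. It performs an anchored, longest-prefix match and keeps `pos` and `last` next to the buffer.

```ocaml
(* Hypothetical stand-in for a compiled automaton; not Re's representation. *)
type 'st automaton = {
  initial : 'st;
  step : 'st -> char -> 'st option;  (* None: no transition, the match stops *)
  accepting : 'st -> bool;
}

(* Match n: the longest match ends at offset n in the stream. *)
type status = Failed | Running | Match of int

type 'st partial = {
  auto : 'st automaton;
  mutable state : 'st option;  (* None once the automaton is stuck *)
  buf : Buffer.t;              (* all content fed so far *)
  mutable pos : int;           (* next unread offset in buf *)
  mutable last : int;          (* end of the last accepted prefix, -1 if none *)
  mutable finished : bool;     (* set once the stream is declared finished *)
}

let start auto =
  { auto; state = Some auto.initial; buf = Buffer.create 64; pos = 0;
    last = (if auto.accepting auto.initial then 0 else -1);
    finished = false }

(* Add new content to the buffer. *)
let feed p chunk = Buffer.add_string p.buf chunk

(* Declare that no more input will arrive. *)
let finish p = p.finished <- true

let result p = if p.last >= 0 then Match p.last else Failed

(* Consume whatever is buffered, then report the status. *)
let rec resume p =
  match p.state with
  | None -> result p
  | Some st ->
    if p.pos >= Buffer.length p.buf then
      (if p.finished then result p else Running)
    else begin
      (match p.auto.step st (Buffer.nth p.buf p.pos) with
       | None -> p.state <- None
       | Some st' ->
         p.pos <- p.pos + 1;
         p.state <- Some st';
         if p.auto.accepting st' then p.last <- p.pos);
      resume p
    end

(* Toy automaton for the pattern ab+ *)
let ab_plus = {
  initial = 0;
  step = (fun st c ->
    match st, c with
    | 0, 'a' -> Some 1
    | (1 | 2), 'b' -> Some 2
    | _ -> None);
  accepting = (fun st -> st = 2);
}

let () =
  let p = start ab_plus in
  feed p "ab";
  assert (resume p = Running);   (* "ab" matches but could still be extended *)
  feed p "bbc";
  assert (resume p = Match 4)    (* 'c' ends the match after "abbb" *)
```

Note that this sketch copies every chunk into a `Buffer.t`, so it sidesteps the copying question rather than answering it.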

@eras

eras commented May 12, 2015

Preferably the interface would be something that works for the following scenario:

  • Regular expression (l+) and input chunk "hello" -> one substring
  • Resume matching with new data: "hell" -> zero substrings, but the partial match will be matched later on.
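The scenario above can be sketched with a hand-written scanner for the fixed pattern `l+`. This is only a toy to pin down the expected behaviour; `scanner`, `create`, `feed` and `finish` are hypothetical names, and nothing here uses or reflects Re's API.

```ocaml
(* Hypothetical scanner for the fixed pattern l+. *)
type scanner = { pending : Buffer.t }  (* run of 'l' that may still grow *)

let create () = { pending = Buffer.create 16 }

(* Feed one chunk and return the substrings completed inside it.
   A run of 'l' that reaches the end of the chunk stays pending. *)
let feed sc chunk =
  let completed = ref [] in
  let flush () =
    if Buffer.length sc.pending > 0 then begin
      completed := Buffer.contents sc.pending :: !completed;
      Buffer.clear sc.pending
    end
  in
  String.iter
    (fun c -> if c = 'l' then Buffer.add_char sc.pending c else flush ())
    chunk;
  List.rev !completed

(* Signal end of stream: a pending run, if any, becomes a match. *)
let finish sc =
  if Buffer.length sc.pending = 0 then None
  else begin
    let s = Buffer.contents sc.pending in
    Buffer.clear sc.pending;
    Some s
  end

let () =
  let sc = create () in
  assert (feed sc "hello" = ["ll"]);  (* one substring *)
  assert (feed sc "hell" = []);       (* zero substrings, "ll" is pending *)
  assert (feed sc "lo" = ["lll"])     (* the pending run completes later *)
```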

Bonus: If the system doesn't store the contents of the substring somewhere (perhaps just by partial matches referring to each fragment they are composed of), then there should be a way for the user of the library to do so. For instance, for very long matches the client could choose to forget parts of them or put them in storage other than memory. Or is this too rare a requirement? For 99.9% of cases the matches are going to be short.

@Drup
Collaborator Author

Drup commented May 12, 2015

> Resume matching with new data: "hell" -> zero substrings, but the partial match will be matched later on.

That would require a specific API for partial matches, not just the current API slightly augmented.

I don't understand the bonus.

@eras

eras commented May 12, 2015

I meant "Bonus" as in a feature that is probably not often useful, but use cases could be found.

For example: I could write a pattern that optimistically matches a certain kind of network traffic from an unframed network capture. The matches could possibly be of unbounded length, if the input stream is infinite. I would still be able to find the substrings (which may span multiple units of processing) in a capture file, even if I cannot hold the whole capture in memory.
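One way to picture this is a matcher that never stores match contents and only reports where each piece of a match sits, leaving storage to the caller. The sketch below is purely illustrative: the `fragment` and `event` types and the `feed`/`finish` functions are hypothetical, and a character predicate stands in for a real pattern.

```ocaml
(* Locates part of a match inside one chunk, without copying it. *)
type fragment = { chunk : int; off : int; len : int }

type event =
  | Fragment of fragment  (* part of a match; more may follow *)
  | End_of_match          (* the match opened by earlier fragments is done *)

type tracker = {
  mutable chunk_no : int;   (* index of the next chunk to be fed *)
  mutable in_match : bool;  (* a match is open and not yet ended *)
}

let create () = { chunk_no = 0; in_match = false }

(* Scan one chunk, calling [emit] for each event. Nothing is retained. *)
let feed t ~is_match_char ~emit chunk =
  let n = String.length chunk in
  let start = ref (-1) in  (* start of the current run in this chunk *)
  for i = 0 to n - 1 do
    if is_match_char chunk.[i] then begin
      if !start < 0 then start := i
    end else begin
      if !start >= 0 then begin
        emit (Fragment { chunk = t.chunk_no; off = !start; len = i - !start });
        start := -1;
        t.in_match <- true
      end;
      if t.in_match then begin
        emit End_of_match;
        t.in_match <- false
      end
    end
  done;
  (* A run touching the end of the chunk stays open across the boundary. *)
  if !start >= 0 then begin
    emit (Fragment { chunk = t.chunk_no; off = !start; len = n - !start });
    t.in_match <- true
  end;
  t.chunk_no <- t.chunk_no + 1

(* End of stream closes any match that is still open. *)
let finish t ~emit =
  if t.in_match then begin
    emit End_of_match;
    t.in_match <- false
  end
```

The caller decides what to do with each fragment: copy the bytes out, write them to disk, or drop them, which is the manual control over buffering discussed here.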

@Drup
Collaborator Author

Drup commented May 12, 2015

To summarize: you want manual control over the internal buffer.

@rgrinberg
Member

> It should also be possible to say "The streaming is finished, you can match eol/eos/stop".

Do you also need to mark the beginning of a stream? So that bol,bos,start match as well.

How will group captures that span chunks work? Or will it be possible at all?
