Introduce split_inclusive to API #681

Open
cdmistman opened this issue May 19, 2020 · 8 comments

Comments

@cdmistman

This issue has been brought up before in #285 and #330 but I think it might be worth revisiting.

I think it might be useful to introduce a full_split method on Regex. It would behave similarly to the current split method, but would also return the substrings that match the regex. On every iteration, the iterator would return an enum: either a Delim (a match) or a Text (a non-match).
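To make the proposed behavior concrete, here is a minimal sketch using only the standard library. The names `Piece` and `full_split` are hypothetical and are not part of the regex crate. The sketch assumes the match positions are already available as byte ranges in ascending, non-overlapping order (with the regex crate they could be gathered from `re.find_iter(text).map(|m| (m.start(), m.end()))`), and it assumes empty Text pieces are skipped, which the proposal does not specify.

```rust
/// One piece of the input, named after the proposal: `Delim` is a regex
/// match, `Text` is the non-matching text between matches.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Piece<'t> {
    Text(&'t str),
    Delim(&'t str),
}

/// Hypothetical helper, not part of the regex crate. `matches` holds the
/// byte ranges of the matches in ascending, non-overlapping order.
pub fn full_split<'t>(text: &'t str, matches: &[(usize, usize)]) -> Vec<Piece<'t>> {
    let mut pieces = Vec::new();
    let mut last = 0; // byte offset where the previous match ended
    for &(start, end) in matches {
        // Assumption: empty gaps between adjacent matches are skipped.
        if start > last {
            pieces.push(Piece::Text(&text[last..start]));
        }
        pieces.push(Piece::Delim(&text[start..end]));
        last = end;
    }
    // Whatever follows the final match is a trailing Text piece.
    if last < text.len() {
        pieces.push(Piece::Text(&text[last..]));
    }
    pieces
}
```

For a tokenizer, the `Delim` pieces would be the operators or separators and the `Text` pieces the operands, so the caller can match on the enum instead of re-checking each substring.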

This could have a few helpful applications. In #330 the author suggested they were using it in some kind of calculator. Personally, I would use this for tokenizing. The same issue included a suggested workaround, but I think it would be helpful to include this in the crate officially.

I've based the names on OCaml's own regex API (seen here).

@BurntSushi
Member

I'm possibly open to this, but I don't think I have the bandwidth to oversee this at the moment. I'm really trying to focus on internal improvements right now.

@cdmistman
Author

That makes sense. I've started writing a PR for this, but I'm still familiarizing myself with the internals. If anybody has any suggestions, I'm open to input. I'm thinking of using a Split to iterate over the Delims, with an internal Option that temporarily stores a Text for the following next() call if there is a jump of more than 1 char.
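One way to read that idea is sketched below with the standard library only. It is not the code from the PR: `Piece` and `FullSplit` are hypothetical names, the iterator is driven by match byte ranges rather than by the crate's Split type, and this sketch chooses to stash the Delim in the Option while returning the preceding Text first.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Piece<'t> {
    Text(&'t str),
    Delim(&'t str),
}

/// Hypothetical lazy iterator. `I` yields match byte ranges in ascending,
/// non-overlapping order.
pub struct FullSplit<'t, I> {
    text: &'t str,
    matches: I,
    last: usize,                // end of the previous match
    pending: Option<Piece<'t>>, // piece held back for the following call
    done: bool,
}

impl<'t, I: Iterator<Item = (usize, usize)>> FullSplit<'t, I> {
    pub fn new(text: &'t str, matches: I) -> Self {
        FullSplit { text, matches, last: 0, pending: None, done: false }
    }
}

impl<'t, I: Iterator<Item = (usize, usize)>> Iterator for FullSplit<'t, I> {
    type Item = Piece<'t>;

    fn next(&mut self) -> Option<Piece<'t>> {
        // Hand out a piece stashed by the previous call, if any.
        if let Some(piece) = self.pending.take() {
            return Some(piece);
        }
        if self.done {
            return None;
        }
        match self.matches.next() {
            Some((start, end)) => {
                let delim = Piece::Delim(&self.text[start..end]);
                let gap = &self.text[self.last..start];
                self.last = end;
                if gap.is_empty() {
                    Some(delim)
                } else {
                    // Return the text before the match now, the match next.
                    self.pending = Some(delim);
                    Some(Piece::Text(gap))
                }
            }
            None => {
                // No more matches: emit the trailing text once, if any.
                self.done = true;
                let tail = &self.text[self.last..];
                if tail.is_empty() { None } else { Some(Piece::Text(tail)) }
            }
        }
    }
}
```

Holding at most one pending piece keeps the iterator allocation-free, since every call to next() consumes at most one match from the underlying iterator.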

@kyclark

kyclark commented Mar 16, 2021

I'd like to leave a use case. Given a string of English, I'd like to split the text into words and the bits in between the words which can be spaces and punctuation. It's easier to define what looks like a "word" than the other, so in Python I can use a regex split on the thing I actually want to keep and put capturing parens so that it is included in the results:

>>> import re
>>> splitter = re.compile("([a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?)")
>>> splitter.split('He said, "I\'d like to eat cake!"')
['', 'He', ' ', 'said', ', "', "I'd", ' ', 'like', ' ', 'to', ' ', 'eat', ' ', 'cake', '!"']

After splitting, I can, for instance, modify the "word" parts however I like and then reconstitute the original string by joining all the parts on the empty string.

Any chance this might move forward?

@RReverser
Contributor

RReverser commented Mar 16, 2021

Starting from Rust 1.51, there is a stable split_inclusive method on strings that can work with arbitrary patterns. At least on nightly, you should be able to use the pattern feature of the regex crate and pass Regex instances into str.split_inclusive() calls.
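As a small illustration of what the standard-library method does, the sketch below uses a plain char pattern, which works on stable Rust. The function name `split_keeping_commas` is made up for this example; passing a Regex instead would need the nightly-only pattern feature mentioned above and is not shown.

```rust
/// Splits `text` after every comma, keeping each comma attached to the
/// piece that precedes it. Uses only `str::split_inclusive` from std.
pub fn split_keeping_commas(text: &str) -> Vec<&str> {
    text.split_inclusive(',').collect()
}
```

Calling `split_keeping_commas("a,b,,c")` yields the pieces `"a,"`, `"b,"`, `","` and `"c"`. The delimiter stays glued to the end of the preceding piece rather than being returned as a separate item, so this differs from the Delim/Text enum proposed at the top of the thread, although concatenating the pieces still reproduces the input.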

@BurntSushi
Member

> Any chance this might move forward?

As I said above, right now my focus is on internals. I don't have the bandwidth to mentor this. With that said, adding such an API might not require much mentorship. It's possible that if someone submits a PR, I might be able to get it merged if doing this is as "simple" as I think it is.

> Given a string of English, I'd like to split the text into words and the bits in between the words which can be spaces and punctuation.

FWIW, I believe your use case would probably be better solved by Unicode word segmentation. The unicode-segmentation crate has exactly what you want, I think. The bstr crate also has a words_with_breaks method that works on &[u8]. It is even implemented with a regex! Although, it does not use the regex crate.

@BurntSushi changed the title from "Introduce full_split to API" to "Introduce split_inclusive to API" on Mar 17, 2021
@kyclark

kyclark commented Mar 17, 2021

Well, I'm so glad I asked. The unicode-segmentation crate was exactly what I needed, so thanks!

@BurntSushi
Member

@shner-elmo I think if you want to work on it then go ahead! You'll probably want to implement the core logic in the regex-automata::meta module, and define it as a method on meta::Regex: https://docs.rs/regex-automata/latest/regex_automata/meta/struct.Regex.html

I don't have this issue paged into context at the moment, so I'm not sure if there are any gotchas to look out for.

@shner-elmo

@BurntSushi Thanks for the encouragement! I created a pull request and I'm looking forward to hearing your thoughts about it.
