Introduce split_inclusive to API #681

cdmistman · 2020-05-19T21:14:41Z

This issue has been brought up before in #285 and #330 but I think it might be worth revisiting.

I think it might be useful to introduce a full_split method on a Regex. This would behave similarly to the current split method, but would also return the values that match the regex. The iterator would return an enum on every iteration, either a Delim (match) or a Text (non-match).
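A minimal sketch of what the proposal describes, to make the Delim/Text idea concrete. The names `Piece` and `full_split` are hypothetical and not part of the regex crate, and a plain `fn(char) -> bool` predicate passed to std's `str::match_indices` stands in for a compiled regex so that the example needs no external crates. Dropping empty Text pieces between adjacent matches is an assumption of this sketch, not something the proposal specifies.

```rust
// Hypothetical names: `Piece` and `full_split` are illustrative only and are
// not part of the regex crate's API.
#[derive(Debug, PartialEq)]
enum Piece<'a> {
    /// A stretch of the haystack that did not match.
    Text(&'a str),
    /// A stretch of the haystack that matched the delimiter.
    Delim(&'a str),
}

// `is_delim` stands in for a regex. std's `match_indices` accepts a char
// predicate as its pattern, which is enough to show the shape of the output.
fn full_split(haystack: &str, is_delim: fn(char) -> bool) -> Vec<Piece<'_>> {
    let mut pieces = Vec::new();
    let mut last = 0; // byte offset just past the previous match
    for (start, m) in haystack.match_indices(is_delim) {
        // Assumption: empty gaps between adjacent matches are skipped.
        if start > last {
            pieces.push(Piece::Text(&haystack[last..start]));
        }
        pieces.push(Piece::Delim(m));
        last = start + m.len();
    }
    // Whatever follows the final match is trailing text.
    if last < haystack.len() {
        pieces.push(Piece::Text(&haystack[last..]));
    }
    pieces
}
```

With this sketch, `full_split("1+22*3", |c| c == '+' || c == '*')` yields `Text("1")`, `Delim("+")`, `Text("22")`, `Delim("*")`, `Text("3")`, and concatenating the pieces in order gives back the original string.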

This could have a few helpful applications. In #330 the author suggested they were using it in some kind of calculator. Personally, I would use this for tokenizing. In the same issue, there was a suggested fix, but I think it might be helpful to include it into this crate officially.

I've based the names on OCaml's own regex api (seen here)


BurntSushi · 2020-05-20T00:00:11Z

I'm possibly open to this, but I don't think I have the bandwidth to oversee this at the moment. I'm really trying to focus on internal improvements right now.

cdmistman · 2020-05-20T01:53:08Z

That makes sense. I've started writing a PR for this but I'm still familiarizing myself with the internals. If anybody has any suggestions, I'm open to input. I'm thinking of using a Split to iterate over the Delims, with an internal Option to temporarily store a Text for the next call to next() if there is a jump larger than 1 char.
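One possible reading of that idea, sketched as a lazy iterator with a one-slot buffer. This is an assumption-laden illustration, not the code from the PR: `FullSplit` and `Piece` are hypothetical names, std's `MatchIndices` over a `fn(char) -> bool` predicate stands in for the regex crate's own iterators, and here it is the Delim that gets parked in the Option while the preceding Text is returned first.

```rust
use std::str::MatchIndices;

// Hypothetical names throughout; none of these exist in the regex crate.
#[derive(Debug, PartialEq)]
enum Piece<'a> {
    Text(&'a str),
    Delim(&'a str),
}

struct FullSplit<'a> {
    haystack: &'a str,
    // Stand-in for a regex match iterator.
    matches: MatchIndices<'a, fn(char) -> bool>,
    // Byte offset just past the previous match.
    last: usize,
    // One-slot buffer: holds the piece to hand out on the following call.
    pending: Option<Piece<'a>>,
    done: bool,
}

impl<'a> FullSplit<'a> {
    fn new(haystack: &'a str, is_delim: fn(char) -> bool) -> Self {
        FullSplit {
            haystack,
            matches: haystack.match_indices(is_delim),
            last: 0,
            pending: None,
            done: false,
        }
    }
}

impl<'a> Iterator for FullSplit<'a> {
    type Item = Piece<'a>;

    fn next(&mut self) -> Option<Piece<'a>> {
        // Drain the buffered piece before advancing the match iterator.
        if let Some(piece) = self.pending.take() {
            return Some(piece);
        }
        if self.done {
            return None;
        }
        let haystack: &'a str = self.haystack;
        match self.matches.next() {
            Some((start, m)) => {
                let gap = &haystack[self.last..start];
                self.last = start + m.len();
                if gap.is_empty() {
                    Some(Piece::Delim(m))
                } else {
                    // Text sits before this match: return it now and park
                    // the match itself for the next call.
                    self.pending = Some(Piece::Delim(m));
                    Some(Piece::Text(gap))
                }
            }
            None => {
                self.done = true;
                let tail = &haystack[self.last..];
                if tail.is_empty() {
                    None
                } else {
                    Some(Piece::Text(tail))
                }
            }
        }
    }
}
```

Collecting `FullSplit::new("a, b", |c| c == ',' || c == ' ')` gives `Text("a")`, `Delim(",")`, `Delim(" ")`, `Text("b")`. The buffer is only needed because a single match can produce two items.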

kyclark · 2021-03-16T22:56:59Z

I'd like to leave a use case. Given a string of English, I'd like to split the text into words and the bits in between the words which can be spaces and punctuation. It's easier to define what looks like a "word" than the other, so in Python I can use a regex split on the thing I actually want to keep and put capturing parens so that it is included in the results:

>>> import re
>>> splitter = re.compile("([a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?)")
>>> splitter.split('He said, "I\'d like to eat cake!"')
['', 'He', ' ', 'said', ', "', "I'd", ' ', 'like', ' ', 'to', ' ', 'eat', ' ', 'cake', '!"']

After splitting, I can, for instance, modify the "word" parts however I like and then reconstitute the original string by joining all the parts on the empty string.
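For comparison, a rough std-only Rust sketch of the same split, modify, and rejoin workflow. It is a simplification and not equivalent to the Python regex above: it assumes a "word" character is any alphabetic character or an apostrophe, so apostrophes at the edge of a word are kept with the word. The function names are made up for illustration.

```rust
// Simplifying assumption: a "word" character is alphabetic or an apostrophe.
fn is_word_char(c: char) -> bool {
    c.is_alphabetic() || c == '\''
}

// Splits `text` into consecutive runs, each tagged with whether it is a
// word (true) or the material between words (false). The runs cover all of
// `text`, so joining them in order reproduces the input.
fn words_and_gaps(text: &str) -> Vec<(bool, &str)> {
    let mut runs = Vec::new();
    let mut start = 0;
    let mut current: Option<bool> = None;
    for (i, c) in text.char_indices() {
        let class = is_word_char(c);
        match current {
            // The class changed: close the previous run and start a new one.
            Some(prev) if prev != class => {
                runs.push((prev, &text[start..i]));
                start = i;
                current = Some(class);
            }
            // First character of the input.
            None => current = Some(class),
            // Same class as before: the current run keeps growing.
            _ => {}
        }
    }
    if let Some(prev) = current {
        runs.push((prev, &text[start..]));
    }
    runs
}

// Example transformation: upper-case the words, leave the gaps untouched,
// then join everything on the empty string.
fn shout_words(text: &str) -> String {
    words_and_gaps(text)
        .into_iter()
        .map(|(is_word, s)| if is_word { s.to_uppercase() } else { s.to_string() })
        .collect()
}
```

Applied to `He said, "I'd like to eat cake!"`, `shout_words` returns `HE SAID, "I'D LIKE TO EAT CAKE!"`, with spaces and punctuation unchanged.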

Any chance this might move forward?

RReverser · 2021-03-16T23:00:48Z

Starting from Rust 1.51, there is a stable split_inclusive method on strings that can work with arbitrary patterns. At least on nightly, you should be able to use the pattern feature of the regex crate and pass Regex instances into str.split_inclusive() calls.
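A small illustration of the std method on stable Rust, using a char predicate as the pattern instead of a Regex (passing a Regex needs the nightly-only setup described above). The helper name `split_keep_terminators` is made up for this example.

```rust
// Splits after each sentence terminator, keeping the terminator attached to
// the piece that precedes it. Uses only std's `str::split_inclusive`.
fn split_keep_terminators(text: &str) -> Vec<&str> {
    text.split_inclusive(|c: char| c == '.' || c == '!' || c == '?')
        .collect()
}
```

`split_keep_terminators("Hi! How are you? Fine.")` returns `["Hi!", " How are you?", " Fine."]`. Note that each match stays glued to the end of the preceding piece; it is not returned as a separate item, which is different from the Delim/Text enum proposed at the top of this issue.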

BurntSushi · 2021-03-17T13:58:27Z

> Any chance this might move forward?

As I said above, right now my focus is on internals. I don't have the bandwidth to mentor this. With that said, adding such an API might not require much mentorship. It's possible that if someone submits a PR, I might be able to get it merged if doing this is as "simple" as I think it is.

> Given a string of English, I'd like to split the text into words and the bits in between the words which can be spaces and punctuation.

FWIW, I believe your use case would probably be better solved by Unicode word segmentation. The unicode-segmentation crate has exactly what you want I think. The bstr crate also has a words_with_breaks method that works on &[u8]. It is even implemented with a regex! Although, it does not use the regex crate.

kyclark · 2021-03-17T16:01:55Z

Well, I'm so glad I asked. The unicode-segmentation crate was exactly what I needed, so thanks!

BurntSushi · 2023-10-04T21:08:15Z

@shner-elmo I think if you want to work on it then go ahead! You'll probably want to implement the core logic in the regex-automata::meta module, and define it as a method on meta::Regex: https://docs.rs/regex-automata/latest/regex_automata/meta/struct.Regex.html

I don't have this issue paged into context at the moment, so I'm not sure if there are any gotchas to look out for.

shner-elmo · 2023-10-06T00:31:49Z

@BurntSushi Thanks for the encouragement! I created a pull request and I'm looking forward to hearing your thoughts about it.

BurntSushi added the enhancement label May 20, 2020

BurntSushi changed the title ~~Introduce full_split to API~~ Introduce split_inclusive to API Mar 17, 2021

archer884 mentioned this issue Oct 28, 2022

Add split_inclusive #917

Closed

shner-elmo mentioned this issue Oct 6, 2023

Add split_inclusive() to API #1096

Open
