Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add split_inclusive() to API #1096

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Conversation

shner-elmo
Copy link

As mentioned in issues: #285, #330, and #681, there is a strong demand for a split function that includes the delimiter (the regex match) inside the output, i.e.:

let text = "2+2";
let re = Regex::new(r"([^\d]+)").unwrap();

let split: Vec<_> = re
    .split(text)
    .map(|span| &text[span])
    .collect();
assert_eq!(split, ["2", "2"]);


let split_inclusive: Vec<_> = re
    .split_inclusive(text)
    .map(|span| &text[span])
    .collect();
assert_eq!(split_inclusive, ["2", "+", "2"]);

This pull request does exactly this.

The implementation is very similar to split(), except that the iterator of inclusive_split() will return two elements for each regex match,
the first element is the start of the string up until the match, the second element is the start of the match to its end.

But of course, we can't return two elements at once (unless we want to return two-element tuples and then flatten it), so we will return the first, and the second will be saved in the field span_to_yield to be returned on the next call.


I implemented the function and added some tests, please let me know what you think about the approach, if everything is good to go I will go ahead and add the documentation and examples.

Note on the tests:
Some of the test cases were copied from the GH issue, they're all tested in Python to make sure that the output of split_inclusive() is the same as Python's re.split().
I created a gist that has the same tests but in Python: https://gist.github.com/shner-elmo/7cd7c383fa882ab8cda743ec7a689b24

cheers

@BurntSushi
Copy link
Member

For CI rustfmt, your lines are waaaaaaaaay too long. I'd suggest just running cargo fmt --all from the root of the repo.

For the test failures, you'll want to use [^0-9] instead, since \d is Unicode aware and some of the crate features can be disabled such that \d is unavailable.

@BurntSushi
Copy link
Member

So looking at this, the behavior does not seem to match what str::split_inclusive does. If we're going to call this function split_inclusive, then we absolutely positively must match what std's str::split_inclusive does. Doing anything with the same name is likely to lead to complete and utter confusion.

What you've implemented here is not really a split "inclusive" (the "inclusive" in std's str::split_inclusive refers to including the terminator with each match), but rather, a split that I guess alternates match and non-match, whenever possible. But for such an iterator, just returning a Span like done in this PR doesn't really make any sense because the caller can't tell whether a Span corresponds to a match or not. As #681 requests, it should be an enum indicating whether the element is a match or a non-match. And the match should probably contain the actual Match value. The non-match would just be a Span.

The confusing part here is that #681 asks for this enum based approach but calls it split_inclusive, yet that name is used in std for an iterator that behaves completely differently. They are two distinct things. I think this PR wound up being an amalgam of both.

So what to do? I think a split_inclusive that matches the behavior of str::split_inclusive seems a little odd to me. I can't remember ever wanting something like that for a regex. But the behavior requested in the OP of #681 still does seem useful to me. I'm not sure what it should be called... Maybe split_all? Or split_with_matches? Or split_covers? Dunno.

@shner-elmo
Copy link
Author

@BurntSushi Hey again, I agree that it doesn't make much sense to replicate the behavior of str::split_inclusive, the goal is to make regex_automata::meta::Regex::split work with capture groups (my understanding is that they are being completely ignored right now).

So I think we should name it something like split_with_capture or split_with_matches, which describes its behavior better (namely the difference with split).

As for the return type, I would keep it the same as split and splitn for consistency, as long that we are not losing information. To know if a Span is a match or not, we can simply iterate with a boolean toggle:

let arr = ...; // regex-automata/src/meta/regex.rs#L3716
for (pattern, text, _) in arr {
    println!("Pattern: {:?}, Text: {:?}", pattern, text);
    
    for (i, sp) in Regex::new(pattern).unwrap().split_inclusive(text).enumerate() {
        let is_match = i % 2 != 0;
        println!("Is match: {:?}, Token: {:?}", is_match, &text[sp]);
    }
    println!();
}
Pattern: "(x)", Text: "1x2"
Is match: false, Token: "1"
Is match: true, Token: "x"
Is match: false, Token: "2"

Pattern: "([^0-9]+)", Text: "1-2"
Is match: false, Token: "1"
Is match: true, Token: "-"
Is match: false, Token: "2"

Pattern: "([\\r\\n\\t\\f\\v ]+)", Text: "this  is a \n\ntest"
Is match: false, Token: "this"
Is match: true, Token: "  "
Is match: false, Token: "is"
Is match: true, Token: " "
Is match: false, Token: "a"
Is match: true, Token: " \n\n"
Is match: false, Token: "test"
...

I think it would be better if we add a new method to the struct SplitInclusive that will iterate on self with a bool toggle as I did above, and return something like (bool, Span) or (bool, Span, &str), what do you think about this approach (smth like SplitInclusive::iter_is_match())?

@BurntSushi
Copy link
Member

I don't know what captures have to do this. There shouldn't be anything about captures related to splitting.

Your boolean toggle via i % 2 == 0 looks wrong to me. What happens if there are adjacent matches?

You're right that we could add a method on the iterator type itself to add a boolean that correctly indicates a match or not, but I do not like that idea at all. I don't like booleans in general. I'd much rather an enum with one variant corresponding to a Match. More to the point, a Match doesn't just indicate whether a match occurred, but which pattern matched. That's lost if you just stick to a boolean. This new enum type can have a span() method on it to provide easy access to the underlying Span in either case.

@shner-elmo
Copy link
Author

Im not familiar with adjacent matches in RegEx. Would you be able to provide a test case or example?

@BurntSushi
Copy link
Member

BurntSushi commented Oct 6, 2023

@shner-elmo a on the haystack aa.

Which raises the question of whether it should emit [Match(0..1), Match(1..2)] (adjacent matches), or whether it should emit an empty span with a non-match between them, e.g., [Match(0..1), NonMatch(1..1), Match(1..2)]. It's actually not obvious what the right behavior is to me. Then of course it is possible for a regex to produce an empty match. For example, the empty string matches at every position. So the empty regex on the haystack aa would produce [Match(0..0), Match(1..1), Match(2..2)]. In that case, there definitely wouldn't be any non-matches between them.

My sense is that this is the right behavior:

  • The iterator yields spans that, when concatenated, cover the entire haystack.
  • Only non-empty spans of unmatched haystack are reported.
  • Matching spans may be empty and they may be adjacent.

@shner-elmo
Copy link
Author

@BurntSushi Thanks, I'll be away for the weekend, but I'll read it properly when I'm back. Have a nice day.

@shner-elmo
Copy link
Author

Hey @BurntSushi, sorry for the late reply.

imo the behavior should be very simple: the current behavior of Regex::split but it also includes the spans between the matches (always, even if it's empty).

I think it should be up to the user to filter the empty non-matching spans.

It's actually not obvious what the right behavior is to me.

For me the default behavior is what other libraries are already doing, to not confuse its users. This is how Python handles it:

>>> import re
>>> re.split(r'a', 'aaa')
['', '', '', '']
>>> re.split(r'(a)', 'aaa')
['', 'a', '', 'a', '', 'a', '']

Are there any major libraries that handle it like you described?

@BurntSushi
Copy link
Member

This is a PR for something called split_inclusive, not split. The existing split routine already behaves like your first Python example. For example, this program:

fn main() {
    let re = regex::Regex::new(r"a").unwrap();
    let splits = re.split("aaa").collect::<Vec<_>>();
    println!("{splits:?}");
}

has this output:

["", "", "", ""]

Your second example utilizes capturing groups and is a result of this documented behavior in Python's re package:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

I find Python's behavior with capturing groups here to be quite baffling personally. And I don't really know what it has to do with this PR or even API. Capturing groups haven't been discussed at all to this point.

imo the behavior should be very simple:

"should be very simple" isn't going to help us decide anything here.

the current behavior of Regex::split but it also includes the spans between the matches (always, even if it's empty).

This doesn't make any sense to me.

I think we're at an impasse here. I'd suggest we close this PR and go back to the drawing board collecting specific use cases.

@LCrossman
Copy link

I came here looking for split_inclusive behaviour with an example of DNA sequence, however, I solved it by joining the capturing group from captures_iter onto the resultant string from split and therefore split_inclusive wasn't necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants