Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to exclude test cases #26

Merged
merged 2 commits into from
Sep 3, 2020
Merged

Add option to exclude test cases #26

merged 2 commits into from
Sep 3, 2020

Conversation

allanlw
Copy link

@allanlw allanlw commented Aug 22, 2020

This adds a new option --file-negative, which contains a list of negative test cases. The resulting regex will strictly not matching any of these test cases. This fixes #16.


To support negation, a second DFA is built of the negative cases, and then subtracted from the positive case DFA, using the standard DFA combination algorithm. To limit the number of nodes generated, combinations of nodes in the two DFAs are visited in depth-first order. Nodes that only occur in the negative match DFA are not visited.

Because the repetition feature can produce grapheme transitions in the DFA that are variable length, code is added to calculate the overlap of two grapheme ranges.

The generated graphs can contain 'dead ends' so some code is added to remove those. Some bug fixes for corner cases that were previously not hit were needed in the recreate_graph function were also necessary. Also find_next_state was written to use the new grapheme overlapping function, to prevent sometimes creating multiple conflicting edges out of a node.

As part of this, a bug was fixed that previously caused blank lines the input to not be considered in the final regex, because the "initial" state could never be considered an accept state.

I got rid of final_state_indices and moved that information into the node label. I also added descriptive labels to nodes to aid debugging.

Adds appropriate tests. All pass. Ran through cargo fmt and cargo clippy.

I haven't written much rust before so please let me know if there are any issues.

@allanlw
Copy link
Author

allanlw commented Aug 22, 2020

Note: I took a look at the code coverage report, and it seems like the coverage primarily decreased because of code like this:

  | 573 | + | self.result.alphabet = self
  | 574 | + | .result
  | 575 | + | .graph
  | 576 | + | .edge_indices()
  | 577 | + | .map(|e| self.result.graph[e].clone())
  | 578 | + | .collect();

For some reason the code coverage report only considers lines 573 and 577 (with the closure) to be hit, not the other ones.

I don't think this change meaningfully reduces code coverage.

Repository owner deleted a comment from codecov-commenter Aug 22, 2020
@pemistahl
Copy link
Owner

Hi Allan, thank you very much for this pull request. This is awesome. :) I nearly thought that the algorithm would be to complicated for someone to comprehend. I haven't thought about how to add negative test cases myself so far, so I'm curious to see how you have accomplished that.

I will carefully examine your changes in the next days and will then give you more detailed feedback.

Can you perhaps extend the command-line interface so that negative test cases can alternatively be directly passed to the command-line, instead of being forced to put them in a file first? I think this will be more convenient for users with short lists of test cases.

Again, thanks a lot. 👍

@allanlw
Copy link
Author

allanlw commented Aug 22, 2020

I haven't thought about how to add negative test cases myself so far, so I'm curious to see how you have accomplished that.

Unfortunately I don't have a better reference than this stack overflow answer off hand, but the algorithm is essentially as listed, implemented in a recursive-descent fashion. The two DFAs are combined into a new DFA that contains the Cartesian product of the two input states, and accepts for any state the "positive" test case DFA would accept for, unless that is also an accept state for the "negative" test case DFA.

I will carefully examine your changes in the next days and will then give you more detailed feedback.

Sure thing. Happy to respond to feedback, etc.

Can you perhaps extend the command-line interface so that negative test cases can alternatively be directly passed to the command-line, instead of being forced to put them in a file first? I think this will be more convenient for users with short lists of test cases.

Actually I'm not sure about how to do this, because the current command line interface takes positive test cases as a vector. For negative test cases, either there would need to be some in-band signaling (e.g. prefix negative test cases with ! or something, but then there needs to be a way to escape this, and I guess also breaks backwards compat) or there would need to be a switch that interprets all command line arguments following that as negative or something. I'm not sure structopt supports that.

Exposing --negative-file was the easiest way for me to get the code exposed to the CLI. I really don't have a strong opinion on what the UX should be.

Another option is to have a flag for --prefixed or something, and then all test cases must be prefixed with + or - (regardless of where they came from).

Thoughts?

@allanlw
Copy link
Author

allanlw commented Aug 25, 2020

Pushed a new version. It resolves the test case issue noted before, by merging graphemes that have repetition in the DFA during a separate pass in the minimization step, instead of attempting to do it during construction.

You'll note that two other test cases had to be changed, but that in both cases they accept on equivalent strings.

I also added two new property tests to check "not matching other strings" with more restrictive alphabets to hopefully catch similar regressions.

Repository owner deleted a comment from codecov-commenter Aug 25, 2020
@pemistahl pemistahl changed the base branch from master to v1.2.0-wip August 25, 2020 17:09
src/main.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
@pemistahl
Copy link
Owner

Actually I'm not sure about how to do this, because the current command line interface takes positive test cases as a vector. For negative test cases, either there would need to be some in-band signaling (e.g. prefix negative test cases with ! or something, but then there needs to be a way to escape this, and I guess also breaks backwards compat) or there would need to be a switch that interprets all command line arguments following that as negative or something. I'm not sure structopt supports that.

This is pretty easy, actually. I would do it like this:

#[structopt(
    name = "exclusions",
    value_name = "INPUT",
    long,
    conflicts_with = "exclusions-file",
    help = "One or more test cases separated by blank space"
)]
exclusions: Vec<String>,

#[structopt(
    name = "exclusions-file",
    value_name = "FILE",
    long,
    parse(from_os_str),
    help = "Reads test cases to be excluded on separate lines from a file"
)]
exclusions_file_path: Option<PathBuf>,

What do you think?

@allanlw
Copy link
Author

allanlw commented Aug 27, 2020

Okay, you were correct. My reading of the structopt documentation implied to me that you couldn't have a flag that took Option<Vec<String>> (e.g. many args on the command line), but it seems that you can.

I pushed a new revision with the requested changes. This means you can now run:

cargo run -- 'a' 'b' 'c' --exclusions 'c' 'd' 'e' 'f'

which will output ^[ab]$ (I erroneously thought this would output ^[abdef]$)

I also removed the short opt setting and renamed the option as requested.

@pemistahl
Copy link
Owner

You did a very good job with building the DFA combination algorithm but I hope you don't mind the following question.

By looking at the current functionality of your algorithm, instead of creating two DFAs and subtracting them from each other, wouldn't it be much easier to subtract the set of negative test cases from the set of positive test cases first and then to apply the original algorithm that creates only one DFA? I think all of the new integration tests you added would return the same results when doing set subtraction instead of DFA subtraction. This would simplify the algorithm a lot.

Am I correct or have I missed anything? I'd like to very much merge your PR but first I want to know what its benefit is compared to set subtraction. If your algorithm lays the ground for features that are not yet implemented, can you please tell me what fields of application it could have beyond the exclusion of test cases?

Thanks a lot!

Repository owner deleted a comment from codecov-commenter Aug 28, 2020
@allanlw
Copy link
Author

allanlw commented Aug 28, 2020

You're right -- let me come up with some examples. Primarily they concern conversion features, although I just added a proptest right now for conversion features and subtraction and found a new bug, so let me solve that first.

@allanlw
Copy link
Author

allanlw commented Aug 30, 2020

So -- there's a problem with chracter classes and how to solve it is not obvious to me. I found this issue by adding a new property test that tries random features with exclusions and verifies the exclusions are still valid.

Consider this input:

./target/debug/grex --non-digits --spaces '#' --exclusions ' '

This produces ^\D$, which does match ' '.

The issue is that the dfa produced for '#' is \D, but for ' ', \S is produced. Of course, \D and \S overlap, but right now the subtraction code does not take character classes into account. The 'correct' thing to do is probably to compare character classes in the input DFA and the output DFA for matching edges, and generate edges with new character classes in the case where they overlap.

It seems that the rust regular expressions support character class intersections and subtractions, but I think this is a non-standard feature (not available in, say, PCRE).

Right now also, I'm not entirely sure how to test if a Grapheme instance is representing a character class or not or if it is a character class which one it is.

Any suggestions on how to solve this?

Repository owner deleted a comment from codecov-commenter Aug 30, 2020
@pemistahl
Copy link
Owner

Hi Allan, I do not have much time at the moment to dive into the code. I'd like to suggest the following:

I'm going to merge your changes now into the feature branch for the next version 1.2.0 because I think you did some marvelous work here regarding the DFA algorithm. Even though it does not yet assort well with character classes and probably some other features, I'm sure it is a solid foundation for various new functionality. As soon as I find the time, I will continue to work on the current state.

Are you ok with that? :)

@pemistahl pemistahl merged commit 9b9b8ec into pemistahl:v1.2.0-wip Sep 3, 2020
@pemistahl pemistahl added this to the grex 1.2.0 milestone Sep 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow to provide test cases that must not be matched
2 participants