Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

End of buffer indistinguishable from EOF #2

Closed
hxtk opened this issue Feb 2, 2018 · 3 comments
Closed

End of buffer indistinguishable from EOF #2

hxtk opened this issue Feb 2, 2018 · 3 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@hxtk
Copy link
Owner

hxtk commented Feb 2, 2018

When we search for the second occurrence of the delimiter, we only check until the end of the buffer. If the buffer does not contain the entire file, it is possible that there is a match that we simply cannot discern.

In fact, it is possible that a match begins inside our buffer but continues past it, for delimiters that match more than one character. e.g.,:

delim = "AAA"
input  = "hello, worldAAAgoodbye world"
buffer = "hello, worldA"

// returns: "hello, worldA"
// expected: "hello, world"

I think I can fix this, but I am not sure how to write a test case. Ideally I could coerce an arbitrary, small buffer size, but frankly I don't know how to do that in a self-contained test.

@hxtk hxtk added bug Something isn't working help wanted Extra attention is needed labels Feb 2, 2018
hxtk added a commit that referenced this issue Feb 2, 2018
@hxtk
Copy link
Owner Author

hxtk commented Feb 2, 2018

Note this test case breaks the build.

I'm not actually sure how to handle this case, because with, e.g. /a+b/ as our delimiter, we might have to read arbitrarily far (bound above by the total filesize) to determine we have reached the end of a leading delimiter, e.g., aaaaaaaa[...]cab should return aaaaaaaa[...]c while aaaaaaaa[...]bcab should return c.

The only solution I can see is an arbitrary cutoff to the delimiter size.

Note the implementation found in JDK 1.7:

http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7-b147/java/util/Scanner.java#Scanner.getCompleteTokenInBuffer%28java.util.regex.Pattern%29

hxtk added a commit that referenced this issue Feb 2, 2018
@hxtk
Copy link
Owner Author

hxtk commented Feb 2, 2018

As of the above commit, I can match any delimiter totally contained within a single buffer.

Currently, when we fail to find a prefixed delimiter in a buffer, we assume there is no prefixed delimiter, but note the test case below.

/// This test will fail if we cannot detect partial matches of the delimiter
/// when skipping over prefixed delimiters. Because the buffer size is 4, it
/// will read "aaaa", which is not in the language of /a+b/, however the
/// automaton is not in a dead state either: reading a "b" would put us in
/// an accepting state, thus we must read more input to know if the regex is
/// satisfied. Reading an additional character will result in "aaaab", which
/// is a valid delimiter in this language and should therefore be skipped.
#[test]
fn buffer_ends_within_start_delim() {
    let string: &[u8] = b"aaaabfoo";
    let mut br = BufReader::with_capacity(4, string);
    let mut test = Scanner::new(&mut br);
    test.set_delim(Regex::new(r"a+b").unwrap());

    assert_eq!(test.next(), Some(String::from("foo")));
}

Further, we currently assume that if no delimiters are found in a given buffer, it is safe to consume that buffer. This is not the case if the delimiter begins in that buffer but ends in a subsequent buffer, as in the test case below.

/// This test will fail if we do not solve the above problem in a way that
/// preserves the tail of the original buffer, because in this test case the
/// terminating delimiter begins within the first buffer size and
/// ends within the second.
#[test]
fn buffer_ends_within_end_delim() {
    let string: &[u8] = b"foo  bar";
    let mut br = BufReader::with_capacity(4, string);
    let mut test = Scanner::new(&mut br);
    test.set_delim_str("  ");

    assert_eq!(test.next(), Some(String::from("foo")));
    // currently this produces "foo  bar" because
    // we fail to detect the delimiter in two separate buffers.
}

@hxtk
Copy link
Owner Author

hxtk commented Feb 3, 2018

This issue has migrated from its original scope. We are now able to read beyond the length of the buffer.

The current problem is actually two separate issues: #3 #4. I am closing this issue and the relevant discussions can move to their respective new issues.

@hxtk hxtk closed this as completed Feb 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

1 participant