Carriage return handling on Windows #416

BatmanAoD · 2017-03-21T17:17:43Z

On Windows, I noticed two surprising things about carriage returns (\r):

They aren't treated as word-boundaries (i.e. they won't cause the \b anchor to match). This means that rg -w foo fails to find foo at the end of a line in a file with Windows line-endings.
They aren't included in the end-of-line anchor $, which makes using the anchor more difficult.

I realize that this could be by design, since RipGrep may be using some standard definition of \b that explicitly does not include \r. That seems unintuitive, though, and I don't think I've seen that behavior in other regex engines running on Windows (e.g. in Python on Windows).

The text was updated successfully, but these errors were encountered:

BurntSushi · 2017-03-21T17:24:43Z

They aren't treated as word-boundaries (i.e. they won't cause the \b anchor to match). This means that rg -w foo fails to find foo at the end of a line in a file with Windows line-endings.

This doesn't seem right. Could you please provide a minimal reproducible example? e.g., This works fine for me:

$ echo -e 'foo\n' | rg -w foo
1:foo
$ echo -e 'foo\r\n' | rg -w foo
1:foo

They aren't included in the end-of-line anchor $, which makes using the anchor more difficult.

I think GNU grep handles this by stripping \r characters and dealing with the fallout. This seems hard, or at best, very annoying.

The semantics of the regex engine aren't going to change, sorry.

BatmanAoD · 2017-03-21T18:32:54Z

Hm, you're right, I'm having trouble reproducing the \b issue with test files; I'll have to see if I can find out why I'm having issues with the file I was searching when I discovered the issue.

Regarding $, though, isn't the anchor essentially useless on Windows if \r causes a mismatch? echo -e "foo\r\n" | <greptool> 'foo$' seems to me like it should always match, regardless of the system. (I've checked that on Debian, GNU grep and ack both handle this case appropriately.)

Why not just let the anchor match when a character other than \r is followed by \n or \r\n? (I think you're right that this approach is not the one taken by grep, so I realize that there may be a good reason not to take this approach.)

I'm guessing that this behavior is due to the current behavior of rust-lang/regex ? If so, I can open an issue there, which seems like a more appropriate place (even though you're the primary author/contributor on that project as well).

BurntSushi · 2017-03-21T18:39:14Z

The issue already exists: rust-lang/regex#244

Why not just let the anchor match when a character other than \r is followed by \n or \r\n?

If you know how to implement this efficiently in a DFA, then please teach me. :-)

BurntSushi · 2017-03-21T20:44:01Z

Just to be clear, I think the state of affairs is, indeed, unfortunate. It's just hard to fix. I can't say that it will never get fixed, but I wouldn't expect something any time soon.

In the mean time, there are not-ideal workarounds. e.g., echo -e 'foo\r\n' | rg 'foo\r?$' works if you're in a bind.

BatmanAoD · 2017-03-21T20:51:12Z

I was thinking about that--how opposed would you be to introducing an option (possibly via config-file) to automatically replace $ in search patterns with \r?$, possibly only under certain conditions (e.g. only when running on Windows, or when the last two characters in the first candidate file are \r\n)? I think this would only cause unintended side-effects when paired with a flag (such as -o) that changes the output text depending on whether or not the \r is included in the match.

BurntSushi · 2017-03-21T21:12:36Z

ripgrep doesn't have a config file. I'd rather folks work around it for now. I'm not strongly against your heuristic, but we'd need to be careful that it doesn't introduce any additional weirdness.

BatmanAoD · 2017-03-22T17:29:11Z

Makes sense. Thanks for considering this.

BatmanAoD · 2017-03-23T08:06:26Z

Ah, I've figured out my issue with word-boundaries--for some reason I was expecting \b to zero-length match between ( and \r. I found that rg -w '<pattern>\(' wasn't finding anything whereas I did find what I was looking for without the -w flag. But of course left-paren isn't a word character.

/shame

BurntSushi · 2017-03-23T10:44:08Z

@BatmanAoD No worries, word boundaries are hard. See also #389.

BatmanAoD · 2017-03-31T06:12:04Z

I'm looking into modifying the regex crate to handle \r and \r\n as newlines, but I've discovered that in the current implementation, ^$ actually matches the newline at the end of a file, probably because the end-of-line flag is set whenever the iterator reaches the end of the file, regardless of the actual final bytes in the file--and of course \n causes the beginning-of-line flag to be set. (At least, I think that's why it's happening--I haven't actually done anything to verify that this explanation is correct.)

Would you consider this a bug in regex? If so, I believe it can probably be fixed by tweaking the end-of-file flag-setting logic. (An off-the-cuff suggestion: perhaps the start-of-line flag needs to be prohibited at end-of-file, regardless of what the last byte is. I'm not sure what the value is in having ^ match when there's nothing left to read in the file.)

BatmanAoD · 2017-03-31T06:57:31Z

Somewhat related: I'm not sure I understand exactly what the "reverse" matching logic is for or how it works. Is it just what's described in rust-lang/regex#190? Does the engine construct a DFA designed (guaranteed?) to find the same matches when applied backwards that the "normal" DFA would find when applied forwards?

If so, then with regard to this particular issue, it might actually make my proposed fix infeasible (or at least more complicated than I'd like), unless I'm misunderstanding something. When running the DFA forward, $ should anchor immediately before the current character-index if the character is \n and the previous character isn't \r, but in reverse, $ should anchor after \n unless the next character to be observed (i.e. the one right before \n in the input file) isn't \r. Based on the precedent set by the word-boundary logic, the way to handle this multi-character anchor dependency is via a special flag in StateFlags. But the actual logic to assign the flags for the typical case is in exec_byte, which (as far as I can tell) is intentionally designed not to have any special behavior for the reverse match. Moreover, there's not really any way that I can see to look ahead in the search, which I think would be necessary in the reverse case.

Tangentially: I haven't taken even a cursory look at the code to answer this for myself, but how is ripgrep processing input before feeding it to the regex crate? I've previously assumed that the reason ack can't support multi-line matching is because it reads a single line at a time and applies the search pattern to each line individually, and ripgrep has the same restriction--but if ripgrep were doing the line-splitting, I'd have expected you to suggest modifying that logic somehow to resolve the $ issue (or possibly to explain why that wouldn't work--though maybe that's what you meant when you mentioned that unilaterally stripping \r from input seems "hard, or at best, very annoying").

BurntSushi · 2017-03-31T09:26:20Z

It sounds like you are understanding why this is hard to fix in the regex engine. I'd prefer if we discussed the regex engine on the regex repo.

Tangentially: I haven't taken even a cursory look at the code to answer this for myself, but how is ripgrep processing input before feeding it to the regex crate?

http://blog.burntsushi.net/ripgrep/#mechanics

BurntSushi · 2017-04-08T00:21:58Z

I did test that this doesn't match Perl's behavior. And for a grep tool like ripgrep, it is simply wrong, because finding empty lines is sometimes useful, and a false positive at the end of every file is not helpful. That doesn't mean that the behavior of this crate needs to change, since ripgrep could simply work around the issue, but you did direct me this way when I mentioned it on the ripgrep issue tracker... 😕 Either way, I am strongly of the opinion that ripgrep's behavior should match grep's for this case.

Please motivate this with use cases and examples. Reasoning about this in the abstract will probably never convince me that anything needs to be changed. (This also seems orthogonal to \r\n handling, so it should probably get a new issue, please.)

BurntSushi · 2017-04-08T00:30:35Z

@BatmanAoD I created #441

BurntSushi · 2018-06-25T20:27:09Z

@BatmanAoD I think I see how this might be fixed for most use cases, at least in ripgrep. It won't hit everything, but:

Take your idea of replacing $ with (?:\r?$) (except done at the HIR level). Semantically, this should never turn something that was a match into a non-match, but could turn non-matches into a match (which is, of course, the intended effect).
When matches are reported, if the match ends with a \r, then trim it.

I think I might be able to roll this into libripgrep. In particular, I am also going to try and take a crack at #389, and the solutions have a similarish feel.

BatmanAoD · 2018-06-25T23:30:48Z

Note, of course, that in order to be entirely correct, \r should only be trimmed if it was matched due to $ rather than with ., a literal \r, or something else along those lines. I'm not sure how difficult that part is, or whether it really matters.

BurntSushi · 2018-06-25T23:52:45Z

@BatmanAoD In line oriented mode, a . can never match a \n, and this is enforced. Similarly, you cannot search for a literal \n either (try rg '\n'), so we should probably do the same for \r which seems fine to me. (This will be an opt-in in a new crate I'm building as part of libripgrep.)

Getting this right in multiline mode (again, will be a new thing part of libripgrep, it's already implemented) is trickier since matching \n (or \r) is explicitly allowed. But I think we'll either need to choose to sacrifice correctness or raise an error if this special CRLF handling and multi line mode are enabled at the same time.

BatmanAoD · 2018-06-28T21:27:55Z

Is libripgrep being developed in a separate repository? Perhaps I could try implementing that solution?

BurntSushi · 2018-06-28T22:28:35Z

@BatmanAoD That's probably not a good idea. As far as my open source work goes, I basically don't like collaborating when code is in a nascent state. There is just waaaaay too much context in my brain. It would take days to unload, and everything could switch at a moment's notice as I'm developing. I think this blog packages my thoughts in a neat little bow: http://habitatchronicles.com/2004/04/you-cant-tell-people-anything/

With that said, I do occasionally push my progress to the ag/libripgrep branch: https://github.com/BurntSushi/ripgrep/tree/ag/libripgrep --- Beware though, I sometimes force push!

BurntSushi · 2018-06-28T22:57:31Z

And I have made sure to incorporate CRLF stripping as well:

ripgrep/grep-regex/src/strip.rs

Lines 8 to 10 in 4d6fba4

    
           /// A special value that will prevent a regex from matching either a `\r` 
        
           /// or a `\n`. 
        
           CRLF,

BatmanAoD · 2018-06-29T20:32:41Z

I believe I understand!

This commit updates the CHANGELOG to reflect all the work done to make libripgrep a reality. * Closes #162 (libripgrep) * Closes #176 (multiline search) * Closes #188 (opt-in PCRE2 support) * Closes #244 (JSON output) * Closes #416 (Windows CRLF support) * Closes #917 (trim prefix whitespace) * Closes #993 (add --null-data flag) * Closes #997 (--passthru works with --replace) * Fixes #2 (memory maps and context handling work) * Fixes #200 (ripgrep stops when pipe is closed) * Fixes #389 (more intuitive `-w/--word-regexp`) * Fixes #643 (detection of stdin on Windows is better) * Fixes #441, Fixes #690, Fixes #980 (empty matching lines are weird) * Fixes #764 (coalesce color escapes) * Fixes #922 (memory maps failing is no big deal) * Fixes #937 (color escapes no longer used for empty matches) * Fixes #940 (--passthru does not impact exit status) * Fixes #1013 (show runtime CPU features in --version output)

sergeevabc · 2020-02-27T11:34:52Z

Being a Windows user with ripgrep 11.0.2 on board, I’m a bit confused about this CRLF thing.

printf "1931. Arrowsmith" | rg "(\d+)..(.*)" -r "$2 $1"
Arrowsmith 1931               <--- as expected, but printf is not a part of Windows 

echo 1931. Arrowsmith | rg "(\d+)..(.*)" -r "$2 $1"
 1931smith                    <--- weird misplacement

echo 1931. Arrowsmith | rg "(\d+)..(.*)" -r "$2 $1" --crlf
Arrowsmith  1931              <--- extra space in between

BurntSushi · 2020-02-27T12:30:19Z

I cannot reproduce on Linux:

$ printf "1931. Arrowsmith" | rg '(\d+)..(.*)' -r '$2 $1'
Arrowsmith 1931

$ echo 1931. Arrowsmith | rg '(\d+)..(.*)' -r '$2 $1'
Arrowsmith 1931

$ echo 1931. Arrowsmith | rg '(\d+)..(.*)' -r '$2 $1' --crlf
Arrowsmith 1931

OK, so I'm assuming that echo on Windows probably emits a \r\n as a new line, so let's try that:

$ printf "1931. Arrowsmith\r\n" | rg '(\d+)..(.*)' -r '$2 $1'
 1931smith

It's quite likely here that (.*) is matching the \r so the replacement actually winds up being Arrowsmith\r 1931. We can confirm this by looking at the hex output:

$ printf "1931. Arrowsmith\r\n" | rg '(\d+)..(.*)' -r '$2 $1' | xxd
00000000: 4172 726f 7773 6d69 7468 0d20 3139 3331  Arrowsmith. 1931
00000010: 0a

You can see the 0d in there, which corresponds to \r. The strange output is actually what one expects:

$ echo 'Arrowsmith\r 1931'
 1931smith

Because \r will move the cursor position back to the beginning of the line. Subsequent printing then overwrites characters that were already printed. e.g.,

$ echo 'Arrowsmith\r 1931X'
 1931Xmith

$ echo 'Arrowsmith\r 1931XX'
 1931XXith

$ echo 'Arrowsmith\r 1931XXX'
 1931XXXth

Adding the --crlf flag seemingly gets this right:

$ printf "1931. Arrowsmith\r\n" | rg '(\d+)..(.*)' -r '$2 $1' --crlf
Arrowsmith 1931

Confirming with xxd that there is a single space:

$ printf "1931. Arrowsmith\r\n" | rg '(\d+)..(.*)' -r '$2 $1' --crlf | xxd
00000000: 4172 726f 7773 6d69 7468 2031 3933 310d  Arrowsmith 1931.
00000010: 0a

Notice that there is only one 0x20 byte there. So I can't reproduce your issue, at least on Linux, and in theory, the above should be equivalent to what you're doing.

sergeevabc · 2020-02-27T12:56:53Z

@BurntSushi, I guess, Linux echo is different from Windows echo, that’s whyyou’re partly unable to reproduce the issue. That’s unfortunate for me, because the third example shows --crlf produces unexpected extra space in the middle. Also let me provide the output of xxd.

echo 1931. Arrowsmith | rg "(\d+)..(.*)" -r "$2 $1" --crlf | xxd
00000000: 4172 726f 7773 6d69 7468 2020 3139 3331  Arrowsmith  1931
00000010: 0d0a                                     ..
                                                              ^ that weird extra space

BurntSushi · 2020-02-27T13:04:05Z

It will have to wait until I have a chance to debug this on Windows. Or else someone else should feel free to debug it.

BurntSushi · 2020-02-27T13:07:38Z

Also not sure why you're reporting bugs on closed tickets. This makes it impossible to track. I created #1500 for you. Please just do that next time.

BurntSushi added the question An issue that is lacking clarity on one or more points. label Mar 21, 2017

BatmanAoD pushed a commit to BatmanAoD/ripgrep that referenced this issue Mar 22, 2017

Added test for issue BurntSushi#416

a1c83df

This was referenced Apr 7, 2017

In multi-line mode, ^$ unconditionally matches last line in text rust-lang/regex#355

Closed

$ doesn't match CRLF rust-lang/regex#244

Closed

BurntSushi mentioned this issue Apr 8, 2017

^$ matches a phantom line at end of file #441

Closed

BurntSushi added the icebox A feature that is recognized as possibly desirable, but is unlikely to implemented any time soon. label May 8, 2017

BurntSushi mentioned this issue Oct 22, 2017

regex end-of-line search is broken microsoft/vscode#36309

Closed

BurntSushi mentioned this issue Mar 26, 2018

Carriage return at end of results in Vim quickfix list #869

Closed

BatmanAoD mentioned this issue May 28, 2018

Windows eol anchor option #929

Closed

BurntSushi added this to the libripgrep milestone Aug 16, 2018

BurntSushi mentioned this issue Aug 19, 2018

libripgrep: PCRE2 support, multiline search, JSON output and more #1017

Merged

BurntSushi closed this as completed in #1017 Aug 20, 2018

BurntSushi mentioned this issue Feb 27, 2020

--replace not working correctly with carriage returns on Windows #1500

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Carriage return handling on Windows #416

Carriage return handling on Windows #416

BatmanAoD commented Mar 21, 2017

BurntSushi commented Mar 21, 2017

BatmanAoD commented Mar 21, 2017

BurntSushi commented Mar 21, 2017

BurntSushi commented Mar 21, 2017

BatmanAoD commented Mar 21, 2017

BurntSushi commented Mar 21, 2017

BatmanAoD commented Mar 22, 2017

BatmanAoD commented Mar 23, 2017

BurntSushi commented Mar 23, 2017

BatmanAoD commented Mar 31, 2017 •

edited

BatmanAoD commented Mar 31, 2017

BurntSushi commented Mar 31, 2017

BurntSushi commented Apr 8, 2017

BurntSushi commented Apr 8, 2017

BurntSushi commented Jun 25, 2018

BatmanAoD commented Jun 25, 2018

BurntSushi commented Jun 25, 2018 •

edited

BatmanAoD commented Jun 28, 2018

BurntSushi commented Jun 28, 2018

BurntSushi commented Jun 28, 2018

BatmanAoD commented Jun 29, 2018

sergeevabc commented Feb 27, 2020

BurntSushi commented Feb 27, 2020

sergeevabc commented Feb 27, 2020

BurntSushi commented Feb 27, 2020

BurntSushi commented Feb 27, 2020

Carriage return handling on Windows #416

Carriage return handling on Windows #416

Comments

BatmanAoD commented Mar 21, 2017

BurntSushi commented Mar 21, 2017

BatmanAoD commented Mar 21, 2017

BurntSushi commented Mar 21, 2017

BurntSushi commented Mar 21, 2017

BatmanAoD commented Mar 21, 2017

BurntSushi commented Mar 21, 2017

BatmanAoD commented Mar 22, 2017

BatmanAoD commented Mar 23, 2017

BurntSushi commented Mar 23, 2017

BatmanAoD commented Mar 31, 2017 • edited

BatmanAoD commented Mar 31, 2017

BurntSushi commented Mar 31, 2017

BurntSushi commented Apr 8, 2017

BurntSushi commented Apr 8, 2017

BurntSushi commented Jun 25, 2018

BatmanAoD commented Jun 25, 2018

BurntSushi commented Jun 25, 2018 • edited

BatmanAoD commented Jun 28, 2018

BurntSushi commented Jun 28, 2018

BurntSushi commented Jun 28, 2018

BatmanAoD commented Jun 29, 2018

sergeevabc commented Feb 27, 2020

BurntSushi commented Feb 27, 2020

sergeevabc commented Feb 27, 2020

BurntSushi commented Feb 27, 2020

BurntSushi commented Feb 27, 2020

BatmanAoD commented Mar 31, 2017 •

edited

BurntSushi commented Jun 25, 2018 •

edited