fix(lib): advance the lookahead end byte by 4 when there's an invalid code point #3305

amaanq · 2024-04-17T19:31:38Z

This is the real fix for the issue described in PR #2931

The problem is the heredoc_start node, given the right conditions, is being reused, which is not what we want to happen, because it's an external token and the edit that takes place affects this token. You can observe the bug here - https://github.com/amaanq/tree-sitter-bug/tree/master/bug6-php-external-reused, and if you patch the tree-sitter dependency to use a local path with the change in this PR, you'll observe that it's fixed. What really helped to visualize and understand the issue was printing every node's contents inside the nowdoc node + dumping the parser logging; I had also printed out every invocation of advance + what token got advanced and where, and noticed that the edited portion of the heredoc start was not consumed at all whatsoever, which led me to spotting one log message "reuse_node symbol:heredoc_start", which did not sound right at all.

/cc @maxbrunsfeld

maxbrunsfeld · 2024-04-17T21:09:38Z

@amaanq I don't think we want to avoid reusing external tokens' nodes completely. There's a lot of logic around when we can reuse them:

the external scanner state needs to match
all of the text that was read by the scanner needs to be un-edited
if the scanner called get_column, then the previous content on that line needs to be un-edited

Can you tell why it is that we're getting the wrong parse results here? Why exactly can't the heredoc_start node be reused in this case?

amaanq · 2024-04-17T21:41:11Z

Yeah I tried to see if there was a better condition to check for but couldn't really come up with one, I'll try to summarize the issue

Given this php corpus test:

<?php
$读写;

<<<漢漢字
  Heredoc detected.
漢漢字;

<<<'中文'
  Nowdoc detected..
中文;

We perform an edit that changes the heredoc start of the nowdoc (中文) which messes with the utf8 codepoint, thus destroying the nowdoc. When we un-edit this, the heredoc_start should be reparsed, but it is not being reparsed when we pass in the 2nd tree (tree2) to parse. If we pass in tree1 or None, it parses fine because the heredoc is intact or is parsed from scratch. But given that we un-edit tree2 to restore the heredoc start portion, it should be reparsed yet it's not actually being reparsed. I'm not sure what else we could do here to fix this

maxbrunsfeld · 2024-04-17T22:26:08Z

Huh, I would think that if that node's text changed, it would be marked as edited and not reused. Maybe there's an edge case with edits that occur within code points?

cli/src/tests/parser_test.rs

amaanq · 2024-04-28T18:59:40Z

Hey @maxbrunsfeld, taking my second attempt at really finding the real issue here, after some debugging I think I noticed the problem.

I narrowed down the bug to one edit that inserts text in the middle of the code point, which destroys the utf-8 validity. Now, the php scanner will advance and scan a heredoc word if the next code point is valid (satisfied walnum or _ or 0x80). This breakage of the code point fails that, so the scanner only scans the first utf-8 chinese character, and ignores the rest of the garbage. Now keep in mind the start/end bytes of this code point shrunk from what it should have been normally, and then when we undo this edit the changed point is one after the new byte span of the heredoc_start node. I think this is an edge case in ts_subtree_edit instead then, where the child (in this case, heredoc_start) is deemed to be unaffected or reshaped by the edit, when in fact it could be, because the edit occurs one point after it, re-validating the utf-8 codepoint. So I think a better and more correct fix could be to check for !ts_subtree_has_external_tokens(child) in ts_subtree_edit where the edit's old/new end are made the same as the start.

This seems to work a lot better in that it does allow external tokens to be reused oftentime, including that one test where we want to ensure performance isn't destroyed, but I still had to update one test. I could still be very much wrong in my approach and thought process, but I think after rethinking it for a bit this change makes a lot more sense, albeit it's probably not perfect.

maxbrunsfeld · 2024-04-30T15:52:41Z

@amaanq i still think the problem here is specific to invalid Unicode code points in the source, and isn’t really specific to external tokens per se. In particular, I worry that we’re not extending a node’s lookahead bytes far enough, if the lexer stops on an invalid code point. We probably want to record that the node’s lookahead depends on the content up to 4 bytes past the start of the invalid code point, since characters can take up to 4 bytes.

I still haven’t debugged this locally, but that’s my sense, from the way you’ve described the bug after your detailed investigation. Let me know what you think

maxbrunsfeld · 2024-04-30T15:53:34Z

Really appreciate your investigation into this

amaanq · 2024-04-30T19:16:06Z

@amaanq i still think the problem here is specific to invalid Unicode code points in the source, and isn’t really specific to external tokens per se. In particular, I worry that we’re not extending a node’s lookahead bytes far enough, if the lexer stops on an invalid code point. We probably want to record that the node’s lookahead depends on the content up to 4 bytes past the start of the invalid code point, since characters can take up to 4 bytes.

I still haven’t debugged this locally, but that’s my sense, from the way you’ve described the bug after your detailed investigation. Let me know what you think

I thought about something like this too initially - what I thought could be a problem is that if -1 (invalid utf-8) is returned from U8_NEXT, then we set the lookahead size to 1 and increment the current lookahead end byte by just 1. Could this be wrong in the case where it was valid until the 2nd, 3rd, or 4th byte, and we should actually be incrementing it by whatever byte it errored out on? Not sure I quite understand this exactly, hence why I sorta discarded this thought

maxbrunsfeld · 2024-04-30T23:01:49Z

Yes, exactly. I think the problem is something like that. The relevant code is around here:

tree-sitter/lib/src/lexer.c

Lines 363 to 369 in 946acfd

    
           // In order to determine that a byte sequence is invalid UTF8 or UTF16, 
        
           // the character decoding algorithm may have looked at the following byte. 
        
           // Therefore, the next byte *after* the current (invalid) character 
        
           // affects the interpretation of the current character. 
        
           if (self->data.lookahead == TS_DECODE_ERROR) { 
        
             current_lookahead_end_byte++; 
        
           }

Maybe the 1 byte increment that we're doing is not enough in the general case.

If we change that code to say

if (self->data.lookahead == TS_DECODE_ERROR) { 
   current_lookahead_end_byte += 4;  // the maximum number of bytes read to identify an invalid code point
 }

... does that fix this test?

amaanq · 2024-04-30T23:35:02Z

🤦 that works! I thought I had tried that, but maybe I actually didn't. Thank you!!

… code point This helps in the case where an edit was made in the middle of a code point, but bytes 1-3 are valid, thus we could advance by at most 4 bytes

amaanq · 2024-05-01T00:18:01Z

Alright, updated it with the test I used to debug this, passes with the fix, but failed before

amaanq · 2024-05-01T00:55:32Z

Gonna merge this a little presumptuously, given that your suggested fix works 🤘 😁

maxbrunsfeld · 2024-05-01T02:00:41Z

Awesome!

amaanq requested a review from maxbrunsfeld April 17, 2024 19:34

amaanq force-pushed the reuse-bug branch from 416e9b5 to e87582e Compare April 17, 2024 19:39

maxbrunsfeld reviewed Apr 17, 2024

View reviewed changes

cli/src/tests/parser_test.rs Outdated Show resolved Hide resolved

amaanq force-pushed the reuse-bug branch from e87582e to f5dba01 Compare April 28, 2024 18:59

amaanq changed the title ~~fix: leaf nodes produced by an external scanner should not be reused~~ fix: do not assume children containing external tokens are not reshaped by an edit Apr 28, 2024

amaanq force-pushed the reuse-bug branch from f5dba01 to 48d92c7 Compare April 28, 2024 19:02

amaanq requested a review from maxbrunsfeld April 30, 2024 00:22

amaanq force-pushed the reuse-bug branch from 48d92c7 to c5eb820 Compare April 30, 2024 23:48

amaanq changed the title ~~fix: do not assume children containing external tokens are not reshaped by an edit~~ fix(lib): advance the lookahead end byte by 4 when there's an invalid code point Apr 30, 2024

fix(lib): advance the lookahead end byte by 4 when there's an invalid…

da1e1e0

… code point This helps in the case where an edit was made in the middle of a code point, but bytes 1-3 are valid, thus we could advance by at most 4 bytes

amaanq force-pushed the reuse-bug branch from c5eb820 to da1e1e0 Compare May 1, 2024 00:17

amaanq merged commit 4c08325 into tree-sitter:master May 1, 2024
11 of 12 checks passed

github-actions bot removed the request for review from maxbrunsfeld May 1, 2024 00:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(lib): advance the lookahead end byte by 4 when there's an invalid code point #3305

fix(lib): advance the lookahead end byte by 4 when there's an invalid code point #3305

amaanq commented Apr 17, 2024

maxbrunsfeld commented Apr 17, 2024

amaanq commented Apr 17, 2024

maxbrunsfeld commented Apr 17, 2024

amaanq commented Apr 28, 2024 •

edited

maxbrunsfeld commented Apr 30, 2024

maxbrunsfeld commented Apr 30, 2024

amaanq commented Apr 30, 2024

maxbrunsfeld commented Apr 30, 2024 •

edited

amaanq commented Apr 30, 2024

amaanq commented May 1, 2024

amaanq commented May 1, 2024

maxbrunsfeld commented May 1, 2024

fix(lib): advance the lookahead end byte by 4 when there's an invalid code point #3305

fix(lib): advance the lookahead end byte by 4 when there's an invalid code point #3305

Conversation

amaanq commented Apr 17, 2024

maxbrunsfeld commented Apr 17, 2024

amaanq commented Apr 17, 2024

maxbrunsfeld commented Apr 17, 2024

amaanq commented Apr 28, 2024 • edited

maxbrunsfeld commented Apr 30, 2024

maxbrunsfeld commented Apr 30, 2024

amaanq commented Apr 30, 2024

maxbrunsfeld commented Apr 30, 2024 • edited

amaanq commented Apr 30, 2024

amaanq commented May 1, 2024

amaanq commented May 1, 2024

maxbrunsfeld commented May 1, 2024

amaanq commented Apr 28, 2024 •

edited

maxbrunsfeld commented Apr 30, 2024 •

edited