Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(lib): advance the lookahead end byte by 4 when there's an invalid code point #3305

Merged
merged 1 commit into from
May 1, 2024

Conversation

amaanq
Copy link
Member

@amaanq amaanq commented Apr 17, 2024

This is the real fix for the issue described in PR #2931

The problem is the heredoc_start node, given the right conditions, is being reused, which is not what we want to happen, because it's an external token and the edit that takes place affects this token. You can observe the bug here - https://github.com/amaanq/tree-sitter-bug/tree/master/bug6-php-external-reused, and if you patch the tree-sitter dependency to use a local path with the change in this PR, you'll observe that it's fixed. What really helped to visualize and understand the issue was printing every node's contents inside the nowdoc node + dumping the parser logging; I had also printed out every invocation of advance + what token got advanced and where, and noticed that the edited portion of the heredoc start was not consumed at all whatsoever, which led me to spotting one log message "reuse_node symbol:heredoc_start", which did not sound right at all.

/cc @maxbrunsfeld

@maxbrunsfeld
Copy link
Contributor

@amaanq I don't think we want to avoid reusing external tokens' nodes completely. There's a lot of logic around when we can reuse them:

  • the external scanner state needs to match
  • all of the text that was read by the scanner needs to be un-edited
  • if the scanner called get_column, then the previous content on that line needs to be un-edited

Can you tell why it is that we're getting the wrong parse results here? Why exactly can't the heredoc_start node be reused in this case?

@amaanq
Copy link
Member Author

amaanq commented Apr 17, 2024

Yeah I tried to see if there was a better condition to check for but couldn't really come up with one, I'll try to summarize the issue

Given this php corpus test:

<?php
$读写;

<<<漢漢字
  Heredoc detected.
漢漢字;

<<<'中文'
  Nowdoc detected..
中文;

We perform an edit that changes the heredoc start of the nowdoc (中文) which messes with the utf8 codepoint, thus destroying the nowdoc. When we un-edit this, the heredoc_start should be reparsed, but it is not being reparsed when we pass in the 2nd tree (tree2) to parse. If we pass in tree1 or None, it parses fine because the heredoc is intact or is parsed from scratch. But given that we un-edit tree2 to restore the heredoc start portion, it should be reparsed yet it's not actually being reparsed. I'm not sure what else we could do here to fix this

@maxbrunsfeld
Copy link
Contributor

Huh, I would think that if that node's text changed, it would be marked as edited and not reused. Maybe there's an edge case with edits that occur within code points?

@amaanq
Copy link
Member Author

amaanq commented Apr 28, 2024

Hey @maxbrunsfeld, taking my second attempt at really finding the real issue here, after some debugging I think I noticed the problem.

I narrowed down the bug to one edit that inserts text in the middle of the code point, which destroys the utf-8 validity. Now, the php scanner will advance and scan a heredoc word if the next code point is valid (satisfied walnum or _ or 0x80). This breakage of the code point fails that, so the scanner only scans the first utf-8 chinese character, and ignores the rest of the garbage. Now keep in mind the start/end bytes of this code point shrunk from what it should have been normally, and then when we undo this edit the changed point is one after the new byte span of the heredoc_start node. I think this is an edge case in ts_subtree_edit instead then, where the child (in this case, heredoc_start) is deemed to be unaffected or reshaped by the edit, when in fact it could be, because the edit occurs one point after it, re-validating the utf-8 codepoint. So I think a better and more correct fix could be to check for !ts_subtree_has_external_tokens(child) in ts_subtree_edit where the edit's old/new end are made the same as the start.

This seems to work a lot better in that it does allow external tokens to be reused oftentime, including that one test where we want to ensure performance isn't destroyed, but I still had to update one test. I could still be very much wrong in my approach and thought process, but I think after rethinking it for a bit this change makes a lot more sense, albeit it's probably not perfect.

@amaanq amaanq changed the title fix: leaf nodes produced by an external scanner should not be reused fix: do not assume children containing external tokens are not reshaped by an edit Apr 28, 2024
@maxbrunsfeld
Copy link
Contributor

@amaanq i still think the problem here is specific to invalid Unicode code points in the source, and isn’t really specific to external tokens per se. In particular, I worry that we’re not extending a node’s lookahead bytes far enough, if the lexer stops on an invalid code point. We probably want to record that the node’s lookahead depends on the content up to 4 bytes past the start of the invalid code point, since characters can take up to 4 bytes.

I still haven’t debugged this locally, but that’s my sense, from the way you’ve described the bug after your detailed investigation. Let me know what you think

@maxbrunsfeld
Copy link
Contributor

Really appreciate your investigation into this

@amaanq
Copy link
Member Author

amaanq commented Apr 30, 2024

@amaanq i still think the problem here is specific to invalid Unicode code points in the source, and isn’t really specific to external tokens per se. In particular, I worry that we’re not extending a node’s lookahead bytes far enough, if the lexer stops on an invalid code point. We probably want to record that the node’s lookahead depends on the content up to 4 bytes past the start of the invalid code point, since characters can take up to 4 bytes.

I still haven’t debugged this locally, but that’s my sense, from the way you’ve described the bug after your detailed investigation. Let me know what you think

I thought about something like this too initially - what I thought could be a problem is that if -1 (invalid utf-8) is returned from U8_NEXT, then we set the lookahead size to 1 and increment the current lookahead end byte by just 1. Could this be wrong in the case where it was valid until the 2nd, 3rd, or 4th byte, and we should actually be incrementing it by whatever byte it errored out on? Not sure I quite understand this exactly, hence why I sorta discarded this thought

@maxbrunsfeld
Copy link
Contributor

maxbrunsfeld commented Apr 30, 2024

Yes, exactly. I think the problem is something like that. The relevant code is around here:

tree-sitter/lib/src/lexer.c

Lines 363 to 369 in 946acfd

// In order to determine that a byte sequence is invalid UTF8 or UTF16,
// the character decoding algorithm may have looked at the following byte.
// Therefore, the next byte *after* the current (invalid) character
// affects the interpretation of the current character.
if (self->data.lookahead == TS_DECODE_ERROR) {
current_lookahead_end_byte++;
}

Maybe the 1 byte increment that we're doing is not enough in the general case.

If we change that code to say

if (self->data.lookahead == TS_DECODE_ERROR) { 
   current_lookahead_end_byte += 4;  // the maximum number of bytes read to identify an invalid code point
 } 

... does that fix this test?

@amaanq
Copy link
Member Author

amaanq commented Apr 30, 2024

🤦 that works! I thought I had tried that, but maybe I actually didn't. Thank you!!

@amaanq amaanq changed the title fix: do not assume children containing external tokens are not reshaped by an edit fix(lib): advance the lookahead end byte by 4 when there's an invalid code point Apr 30, 2024
… code point

This helps in the case where an edit was made in the middle of a code
point, but bytes 1-3 are valid, thus we could advance by at most 4 bytes
@amaanq
Copy link
Member Author

amaanq commented May 1, 2024

Alright, updated it with the test I used to debug this, passes with the fix, but failed before

@amaanq
Copy link
Member Author

amaanq commented May 1, 2024

Gonna merge this a little presumptuously, given that your suggested fix works 🤘 😁

@amaanq amaanq merged commit 4c08325 into tree-sitter:master May 1, 2024
11 of 12 checks passed
@github-actions github-actions bot removed the request for review from maxbrunsfeld May 1, 2024 00:55
@maxbrunsfeld
Copy link
Contributor

Awesome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants