-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(lib): advance the lookahead end byte by 4 when there's an invalid code point #3305
Conversation
@amaanq I don't think we want to avoid reusing external tokens' nodes completely. There's a lot of logic around when we can reuse them:
Can you tell why it is that we're getting the wrong parse results here? Why exactly can't the |
Yeah I tried to see if there was a better condition to check for but couldn't really come up with one, I'll try to summarize the issue Given this php corpus test: <?php
$读写;
<<<漢漢字
Heredoc detected.
漢漢字;
<<<'中文'
Nowdoc detected..
中文; We perform an edit that changes the heredoc start of the nowdoc (中文) which messes with the utf8 codepoint, thus destroying the nowdoc. When we un-edit this, the heredoc_start should be reparsed, but it is not being reparsed when we pass in the 2nd tree (tree2) to |
Huh, I would think that if that node's text changed, it would be marked as edited and not reused. Maybe there's an edge case with edits that occur within code points? |
Hey @maxbrunsfeld, taking my second attempt at really finding the real issue here, after some debugging I think I noticed the problem. I narrowed down the bug to one edit that inserts text in the middle of the code point, which destroys the utf-8 validity. Now, the php scanner will advance and scan a heredoc word if the next code point is valid (satisfied This seems to work a lot better in that it does allow external tokens to be reused oftentime, including that one test where we want to ensure performance isn't destroyed, but I still had to update one test. I could still be very much wrong in my approach and thought process, but I think after rethinking it for a bit this change makes a lot more sense, albeit it's probably not perfect. |
@amaanq i still think the problem here is specific to invalid Unicode code points in the source, and isn’t really specific to external tokens per se. In particular, I worry that we’re not extending a node’s lookahead bytes far enough, if the lexer stops on an invalid code point. We probably want to record that the node’s lookahead depends on the content up to 4 bytes past the start of the invalid code point, since characters can take up to 4 bytes. I still haven’t debugged this locally, but that’s my sense, from the way you’ve described the bug after your detailed investigation. Let me know what you think |
Really appreciate your investigation into this |
I thought about something like this too initially - what I thought could be a problem is that if -1 (invalid utf-8) is returned from U8_NEXT, then we set the lookahead size to 1 and increment the current lookahead end byte by just 1. Could this be wrong in the case where it was valid until the 2nd, 3rd, or 4th byte, and we should actually be incrementing it by whatever byte it errored out on? Not sure I quite understand this exactly, hence why I sorta discarded this thought |
Yes, exactly. I think the problem is something like that. The relevant code is around here: Lines 363 to 369 in 946acfd
Maybe the 1 byte increment that we're doing is not enough in the general case. If we change that code to say if (self->data.lookahead == TS_DECODE_ERROR) {
current_lookahead_end_byte += 4; // the maximum number of bytes read to identify an invalid code point
} ... does that fix this test? |
🤦 that works! I thought I had tried that, but maybe I actually didn't. Thank you!! |
… code point This helps in the case where an edit was made in the middle of a code point, but bytes 1-3 are valid, thus we could advance by at most 4 bytes
Alright, updated it with the test I used to debug this, passes with the fix, but failed before |
Gonna merge this a little presumptuously, given that your suggested fix works 🤘 😁 |
Awesome! |
This is the real fix for the issue described in PR #2931
The problem is the
heredoc_start
node, given the right conditions, is being reused, which is not what we want to happen, because it's an external token and the edit that takes place affects this token. You can observe the bug here - https://github.com/amaanq/tree-sitter-bug/tree/master/bug6-php-external-reused, and if you patch the tree-sitter dependency to use a local path with the change in this PR, you'll observe that it's fixed. What really helped to visualize and understand the issue was printing every node's contents inside the nowdoc node + dumping the parser logging; I had also printed out every invocation of advance + what token got advanced and where, and noticed that the edited portion of the heredoc start was not consumed at all whatsoever, which led me to spotting one log message "reuse_node symbol:heredoc_start", which did not sound right at all./cc @maxbrunsfeld