Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[c-ares] fix spin loop bug when c-ares gives up on a socket that still has data left in its read buffer #34185

Merged
merged 19 commits into from
Aug 30, 2023

Conversation

apolcyn
Copy link
Contributor

@apolcyn apolcyn commented Aug 28, 2023

If we get a readable event on an fd and both the following happens:

  • c-ares does not read all bytes off the fd

  • c-ares removes the fd from the set ARES_GETSOCK_READABLE

... then we have a busy loop here, where we'd keep asking c-ares to process an fd that it no longer cares about.

This is indirectly related to a change in this code one month ago: #33942 - before that change, c-ares would close the socket when it called handle_error and so IsFdStillReadableLocked would start returning false, causing us to get away with this loop. Now, because IsFdStillReadableLocked will keep returning true (because of our overridden close API), we'll loop forever.

The test illustrates one concrete example of how this bug can be hit.

Note that the EE version of this code already gets this right.

Related: internal issue b/297538255

@apolcyn apolcyn requested a review from yijiem August 28, 2023 18:15
@apolcyn apolcyn added lang/core release notes: yes Indicates if PR needs to be in release notes labels Aug 28, 2023
@apolcyn apolcyn changed the title [c-ares] fix hypothetical spin loop bug [c-ares] fix spin loop bug when c-ares gives up on a socket that still has data left in its read buffer Aug 29, 2023
Copy link
Member

@yijiem yijiem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a hack! Thanks Alex for adding the test to reproduce the issue!

// 4) Because the first two bytes were zero, c-ares attempts to malloc a
// zero-length buffer:
// https://github.com/c-ares/c-ares/blob/6360e96b5cf8e5980c887ce58ef727e53d77243a/src/lib/ares_process.c#L428.
// 5) Because malloc(0) returns NULL, c-ares invokes handle_error and stops
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small nit: maybe say c-ares' default_malloc(0) returns NULL instead? Since it seems like some systems may return a valid pointer on malloc(0): https://github.com/c-ares/c-ares/blob/main/src/lib/ares_library_init.c#L38-L42

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, done

@apolcyn apolcyn merged commit a35f282 into grpc:master Aug 30, 2023
66 of 69 checks passed
@copybara-service copybara-service bot added the imported Specifies if the PR has been imported to the internal repository label Aug 30, 2023
apolcyn added a commit that referenced this pull request Sep 18, 2023
Now that we have #33942 (and another
follow-up [fix](#34185)), I think the
issue from #25289 is likely fixed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bloat/none imported Specifies if the PR has been imported to the internal repository lang/c++ lang/core per-call-memory/neutral per-channel-memory/neutral release notes: yes Indicates if PR needs to be in release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants