
Fix race condition in connection termination #772

Merged

Conversation

howardjohn
Contributor


See hyperium/hyper#3652.

What I have found is that if the final reference to a stream is dropped
after `maybe_close_connection_if_no_streams` runs, but before
`inner.poll()` completes, the connection can dangle forever without any
forward progress. No streams/references are alive, but the connection is
not complete and never wakes up again. This seems like a classic TOCTOU
(time-of-check to time-of-use) race condition.

In this fix, I check again at the end of `poll`; if this state is
detected, the task is woken again.

With the test in hyperium/hyper#3655, on my machine, it fails about 5% of the time:
```
1876 runs so far, 100 failures (94.94% pass rate). 95.197349ms avg, 1.097347435s max, 5.398457ms min
```

With that PR, this test is 100% reliable:
```
64010 runs so far, 0 failures (100.00% pass rate). 44.484057ms avg, 121.454709ms max, 1.872657ms min
```

Note: we also have reproduced this using `h2` directly outside of `hyper`, which is what gives me
confidence this issue lies in `h2` and not `hyper`.
howardjohn added a commit to howardjohn/ztunnel that referenced this pull request May 1, 2024
Pulls in hyperium/h2#772. We might want to wait
on this, not sure how fast it will go
istio-testing pushed a commit to istio/ztunnel that referenced this pull request May 1, 2024
Pulls in hyperium/h2#772. We might want to wait
on this, not sure how fast it will go
istio-testing pushed a commit to istio-testing/ztunnel that referenced this pull request May 1, 2024
Pulls in hyperium/h2#772. We might want to wait
on this, not sure how fast it will go
Member

@seanmonstar seanmonstar left a comment

Excellent detective work, thanks!

@seanmonstar seanmonstar merged commit be12983 into hyperium:master May 2, 2024
6 checks passed
istio-testing added a commit to istio/ztunnel that referenced this pull request May 2, 2024
Pulls in hyperium/h2#772. We might want to wait
on this, not sure how fast it will go

Co-authored-by: John Howard <john.howard@solo.io>