HTTP/2: Hyper communication is stuck if too many requests are spawned. #3338

kundu-subhajit · 2023-10-10T17:36:38Z

Version
0.14.27

Platform
Linux ubuntu 4.15.0-213-generic # 224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Short Description of the issue
Client network communications using Rust hyper library continues to stuck if too many requests are spawned.

Description
I see this bug has been earlier reported in #2419 and also has another manifestation in #3041.
Both are marked fixed. #3041 also says the fix is tested with 100 requests on 50 streams.

Generally speaking, HTTP2 Servers allows 250 parallel streams. However, client can create unlimited streams. The RFC has the following recommendation: "It is recommended that this value be no smaller than 100, so as to not unnecessarily limit parallelism."

We have a RUST client implementation, using hyper 0.14.27, along with a GoLang Server. Client requests are restricted to 400 at any point of time. Go Client runs smoothly. hyper client gets stuck every time. The bug is easily reproducible. Both RUST client and Go Server implementation can be found in the link below.

Where is the sample code that hits the issue?
Sample Code Link: https://github.com/kundu-subhajit/hyper-stuck
It has a GoLang Server implementation, Rust Client implementation using Hyper-0.14.27 & a GoLang Client implementation.
The build instructions are very straightforward

Any known workaround with the sample code?
We do acknowledge that this same issue isn't reproducible with semaphore count below 250 in the sample code shared.

Why we cant use smaller number of parallel streams as workaround?
In our local setup we have 100 as max number of HTTP/2 streams at server. But even if we try with semaphore count of 90, network communications through hyper are stuck. Trying to isolate that we arrived here and at the sample code.

What we have done differently in the sample code compared to previous reporting?
Nothing much. We have done only two minor things differently here:

We have run the Go Lang server and Rust client on different machines. With at-the-most 400 requests spawned, client gets stuck every time. We have reproduced this issue in two different client machines, both Ubuntu Linux and both the machines hit the issue every time.
We have enabled trace logging to capture logs.

CC: @jeromegn, @seanmonstar, @jfourie1
Please help us fix this bug at your earliest. Any help with this is highly appreciated. I thank you in anticipation. Thanks.

seanmonstar · 2023-10-10T19:52:29Z

Hey there!

It looks like the linked repo is missing the server code.

I'm also curious, does it work if you don't use a waitgroup/semaphore? Just relying on hyper to automatically not go over the concurrent streams?

kundu-subhajit · 2023-10-11T14:14:33Z

Firstly, Appreciate your prompt response.
Secondly, My bad. Thanks for pointing out.
Repositories are now updated, checked out, compiled and tested for successful reproduction of the issue.

kundu-subhajit · 2023-10-11T14:21:13Z

I removed code parts using waitgroup in my personal space and ran the code. It still gets stuck. We still hit the issue. Semaphore is needed to keep check on number of parallel attempts at a time. So kept it.

If you want I can update the repo with waitgroup removed. Code with waitgroup is tested though, for successful runs.

seanmonstar · 2023-10-11T20:46:54Z

Ok, I was just trying to figure out how small we can make the test case.

Do you have the trace logs? Could you paste them in a comment here inside <details></details>?

kundu-subhajit · 2023-10-12T07:45:45Z

For 2-3 seconds of runtime, its pretty big log.
I unable to share the way you asked. Most log files were little more than 25 MB and drag&drop has limit of 25MB.

After couple of runs I arrived at a log file within that limit:
hyper-stuck.log.2023-10-12.log

I will still suggest that please run it from the sample code so that you can have bigger runtime to capture any log that is printed while things are stuck.

izderadicka · 2023-11-18T13:03:39Z

Problem was with hyper-openssl, which is probably bit outdated. With hyper-rustls works as expected (and is actually faster by app. %50%, number of clients does not play a role - I guess it already enough parallel on one client).

Can check here https://github.com/izderadicka/hyper-stuck

izderadicka · 2023-11-18T14:19:50Z

I was too too quick with conclusions sorry. Previous example worked because it uses HTTP/1.1.
Just switching to HTTP/2.0 causes same problem as reported, so it's not TLS implementation probably.

It demonstrates like this:

with several hundreds concurrent tasks app. >= 300
few initial requests - between 4 - 8 are processed immediately
then it stucks, no further requests are received by server
for less then 300 concurrent, say 250 - few initial requests are done quickly - 5-8 similar to above. Then it pauses for few seconds, then it catches up and finishes all remaining

Updated version which demonstrates problem will be on https://github.com/izderadicka/hyper-stuck

izderadicka · 2023-11-18T16:37:34Z

I've done one more test - against Rust hyper server (added to repo) instead and now client works fine, without problems.
So it must be something between Go server and Rust client on HTTP/2.0 protocol. I know very little of Go, however it looks like server is simple - all defaults, nothing suspicious there.

Also one interesting finding - client against Rust server on HTTP/2.0 is about 4 times slower, then on HTTP/1.1 - against same server, only switched protocols, this is bit contra-intuitive. Go client on the other hand is little bit faster against Rust server, that it was against Go server.

I wanted to try hyper 1.0 in client, but do not know yet how to setup TLS - 1.0 is still very fresh, hyper-rustls is on 0.14 and did not have time to dig deeper how to manually create appropriate connector.

izderadicka · 2023-11-18T17:03:02Z

And one more final finding - from trace log - the log from stuck run against Go server have a lot of stream capacity is 0 messages at the end of log, when communication got stuck , I do not see anything like this in the log of successful run against Rust server. Here is excerpt from the log:

2023-11-18T16:49:56.726566Z TRACE Connection{peer=Client}:poll: h2::proto::streams::prioritize: schedule_pending_open
2023-11-18T16:49:56.726571Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(515) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: is_pending_reset=false
2023-11-18T16:49:56.726575Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(515) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: data frame sz=235335 eos=true window=0 available=0 requested=235335 buffered=235335
2023-11-18T16:49:56.726580Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(515) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: stream capacity is 0
2023-11-18T16:49:56.726585Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(517) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: is_pending_reset=false
2023-11-18T16:49:56.726590Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(517) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: data frame sz=238437 eos=true window=0 available=0 requested=238437 buffered=238437
2023-11-18T16:49:56.726594Z TRACE Connection{peer=Client}:poll:pop_frame:popped{stream.id=StreamId(517) stream.state=State { inner: HalfClosedLocal(AwaitingHeaders) }}: h2::proto::streams::prioritize: stream capacity is 0
2023-11-18T16:49:56.726599Z TRACE Connection{peer=Client}:poll:FramedWrite::flush: h2::codec::framed_write: flushing buffer
2023-11-18T16:49:56.726603Z TRACE Connection{peer=Client}:poll: tokio::task::waker: op="waker.clone" task.id=5
2023-11-18T16:49:56.726607Z TRACE Connection{peer=Client}:poll: tokio::task::waker: op="waker.drop" task.id=5

izderadicka · 2023-11-19T19:42:23Z

I've done bit of additional exploration:

happens also on hyper v1 (updated demo repo to hyper 1.0.1)
on server side following Go server influences the problem:
- MaxConcurrentStreams - if increased to 1000 then client is not stuck completely - it's just hangs up after few requests, for 10 secs or so than catches up, but overall is quite slow, still see stream capacity is 0 in log
- MaxUploadBufferPerConnection is set up to app. 32M then problem completely disappears, no stream capacity is 0 in log
as already explained above problem is best demonstrated when there is single http2 client, which spawn app. 300 requests in separate tokio tasks. If the are multiple clients (say 10), problems appears too, but much later, after some thousands of requests
go client does not have this problem - at least the one provided there, it should have about 600 spawned requests in parallel in go function.

seanmonstar · 2023-11-20T21:10:14Z

Excellent work investigating! So, is it related to flow control? Is there connection window available, but not stream? It's fine to investigate with hyper v0.14, it's most likely something in the h2 dependency, which is the same in both cases.

izderadicka · 2023-11-21T10:07:23Z

Yes it's same in 0.14 and 1.0 - so I assume it could be h2.
I' m not expert in HTTP/2.0, I have only general knowledge.
I think it might be related to flow control - what seems to be happening, is that frames stop to be send after stream capacity is indicated as 0. Connections are still open - If I end server client quickly ends by errors on connections - connection closed.

Not sure what you mean exactly by "Is there connection window available, but not stream?" - Is there any way how to check this?

seanmonstar · 2023-11-21T23:05:05Z

I'm combing through the logs. It doesn't seem like there's an easy way to see the connection window. But as I scroll through the logs, I do eventually see mention of "buffered" for a stream being really high, and things like conn=512 or even smaller. And then there's a trickle of tiny WINDOW_UPDATE frames from the peer.

It does seem like the hyper client is getting starved via flow control. It's quite interesting that a Rust server makes the client happy. But I wonder what the Go server does differently...

Can you change your program to print the status code of each response? I notice that the Go server will try to read the full request body, but if there's some sort of error, it will return a 400 response. I wonder if any of those happen, and if because of it, some window capacity isn't released for that stream in particular. Or if the buffered data is no longer allowed to keep being sent.

izderadicka · 2023-11-22T19:38:49Z

Hi,
responses have 200 status, what is weird, that with provided params it gets stuck quite quickly - only few responses get back and then it freezes, this is output with added printout of status:

 cargo run --bin hyper-stuck
    Finished dev [unoptimized + debuginfo] target(s) in 0.07s
     Running `target/debug/hyper-stuck`
Iteration i=0
Iteration i=100
Iteration i=200
Iteration i=300
Response status 200 OK version HTTP/2.0
Response status 200 OK version HTTP/2.0
Response status 200 OK version HTTP/2.0
Response status 200 OK version HTTP/2.0

So I do start all 300 requests (each in separate task), I got only 4 back, then it's stuck. I have attached complete tracing log, some interesting notes:

At the time of stuck (~18:34:50) a lot of "stream capacity is 0" as already noted before, and streams (grepped by id - see below), usually ends by this message

At the time of stuck alot of Goways like this one:

2023-11-22T18:34:50.543203Z DEBUG runtime.spawn{kind=task task.name= task.id=30 loc.file="src/main.rs" loc.line=154 loc.col=9}:runtime.spawn{kind=task task.name= task.id=193 loc.file="/home/ivan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-util-0.1.1/src/rt/tokio.rs" loc.line=35 loc.col=9}:Connection{peer=Client}:poll:FramedWrite::buffer{frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }}: h2::codec::framed_write: send frame=GoAway { error_code: NO_ERROR, last_stream_id: StreamId(0) }

It's interesting to look at few first streams by id (I think odd ids are from client)

$ grep -i "StreamId(7)" hyper-stuck.log.2023-11-22  | wc -l
44
$ grep -i "StreamId(9)" hyper-stuck.log.2023-11-22  | wc -l
683

So for streams up to 7 (lower ids not shown, they have exactly same number of messages) it went well I think - it corresponds probably to responses I got back, however from stream 9 up it gets messed up - it looks similar for following streams up to very last streams, which behaves differently:

$ grep -i "StreamId(253)" hyper-stuck.log.2023-11-22  | wc -l
181
$ grep -i "StreamId(255)" hyper-stuck.log.2023-11-22  | wc -l
174
$ grep -i "StreamId(257)" hyper-stuck.log.2023-11-22  | wc -l
2
$ grep -i "StreamId(259)" hyper-stuck.log.2023-11-22  | wc -l
0

I guess by comparing individual streams - especially say 7 with 9 some insight might be gained. The only think I noticed that stream 7 did not have any WINDOW_UPDATE frames received, but problematic stream 7 already has quite few of them - 27.

log.zip

dswij · 2023-12-30T11:13:43Z

Did some investigating, it seems like all the streams are starved on connection capacity and couldn't proceed to the next state. Happens because the high Fut limit means the streams are fighting for the capacity (and some have reserved capacities).

See https://github.com/dswij/h2/blob/9a788d9fe2f10ce21d1e177646a093f55fd073d1/src/proto/streams/prioritize.rs#L750-L760

Log: https://gist.github.com/dswij/9cd727705018f41a7e627b995584deed

Although popping pending_capacity makes the requests progressed, but hangs again after a while (probably because of wrong implementation?). So, might need to comment pending_capacity.pop to correctly replicate the hang (and see OOPS).

If we change the fut limit (i.e. HTEST_FUT_LIMIT in the repro code) to 100, matching the concurrent stream limit on the Go server, the client works perfectly (with no OOPS).

seanmonstar · 2023-12-30T14:12:49Z

@dswij thanks so much for digging! I really want to get this fixed, but haven't had time between hyper 1.0, upgrading reqwest, and end of year needs. Would you want to look into this further, and recommend a fix for hyper?

dswij · 2023-12-31T09:07:35Z

Sure, I'll have a go at it

seanmonstar · 2024-01-05T16:29:20Z

As an update, @dwij has a proposed change that appears to fix a reported case in this issue. If you'd like to take a look, that'd be a huge help! hyperium/h2#730

This PR changes the the assign-capacity queue to prioritize streams that are send-ready. This is necessary to prevent a lockout when streams aren't able to proceed while waiting for connection capacity, but there is none. Closes hyperium/hyper#3338

seanmonstar · 2024-01-08T21:45:00Z

The fix has been published in h2 v0.4.1!

This PR changes the the assign-capacity queue to prioritize streams that are send-ready. This is necessary to prevent a lockout when streams aren't able to proceed while waiting for connection capacity, but there is none. Closes hyperium/hyper#3338

This PR changes the the assign-capacity queue to prioritize streams that are send-ready. This is necessary to prevent a lockout when streams aren't able to proceed while waiting for connection capacity, but there is none. Closes hyperium/hyper#3338 Co-authored-by: dswij <dharmasw@outlook.com>

kundu-subhajit added the C-bug Category: bug. Something is wrong. This is bad! label Oct 10, 2023

This comment was marked as spam.

Sign in to view

This comment was marked as resolved.

Sign in to view

dswij mentioned this issue Jan 1, 2024

fix: streams awaiting capacity lockout hyperium/h2#730

Merged

seanmonstar closed this as completed in hyperium/h2#730 Jan 8, 2024

LukasKalbertodt mentioned this issue Mar 4, 2024

Tobira seems to freeze randomly, not responding to HTTP requests at all anymore elan-ev/tobira#1129

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HTTP/2: Hyper communication is stuck if too many requests are spawned. #3338

HTTP/2: Hyper communication is stuck if too many requests are spawned. #3338

kundu-subhajit commented Oct 10, 2023 •

edited

seanmonstar commented Oct 10, 2023

kundu-subhajit commented Oct 11, 2023 •

edited

kundu-subhajit commented Oct 11, 2023

seanmonstar commented Oct 11, 2023

kundu-subhajit commented Oct 12, 2023

This comment was marked as spam.

This comment was marked as resolved.

This comment was marked as resolved.

izderadicka commented Nov 18, 2023

izderadicka commented Nov 18, 2023 •

edited

izderadicka commented Nov 18, 2023 •

edited

izderadicka commented Nov 18, 2023

izderadicka commented Nov 19, 2023

seanmonstar commented Nov 20, 2023

izderadicka commented Nov 21, 2023

seanmonstar commented Nov 21, 2023

izderadicka commented Nov 22, 2023

dswij commented Dec 30, 2023 •

edited

seanmonstar commented Dec 30, 2023

dswij commented Dec 31, 2023

seanmonstar commented Jan 5, 2024

seanmonstar commented Jan 8, 2024

HTTP/2: Hyper communication is stuck if too many requests are spawned. #3338

HTTP/2: Hyper communication is stuck if too many requests are spawned. #3338

Comments

kundu-subhajit commented Oct 10, 2023 • edited

seanmonstar commented Oct 10, 2023

kundu-subhajit commented Oct 11, 2023 • edited

kundu-subhajit commented Oct 11, 2023

seanmonstar commented Oct 11, 2023

kundu-subhajit commented Oct 12, 2023

This comment was marked as spam.

This comment was marked as resolved.

This comment was marked as resolved.

izderadicka commented Nov 18, 2023

izderadicka commented Nov 18, 2023 • edited

izderadicka commented Nov 18, 2023 • edited

izderadicka commented Nov 18, 2023

izderadicka commented Nov 19, 2023

seanmonstar commented Nov 20, 2023

izderadicka commented Nov 21, 2023

seanmonstar commented Nov 21, 2023

izderadicka commented Nov 22, 2023

dswij commented Dec 30, 2023 • edited

seanmonstar commented Dec 30, 2023

dswij commented Dec 31, 2023

seanmonstar commented Jan 5, 2024

seanmonstar commented Jan 8, 2024

kundu-subhajit commented Oct 10, 2023 •

edited

kundu-subhajit commented Oct 11, 2023 •

edited

izderadicka commented Nov 18, 2023 •

edited

izderadicka commented Nov 18, 2023 •

edited

dswij commented Dec 30, 2023 •

edited