CancelWrite after Close should be a no-op #4404

Closed
marten-seemann opened this issue Apr 2, 2024 · 18 comments · Fixed by #4419

@marten-seemann
Member

@marten-seemann Hey, any update on this? We are adding support for QUIC to Prysm and are running into this issue. It does appear this is also related to #4139, but that appears to still be an RFC.

Currently we ensure that all open streams are eventually reset so that they can be appropriately cleaned up. However, with libp2p's Stream API and the need to support other multiplexers (yamux, mplex), resetting QUIC streams the same way unfortunately leads to data loss: the remote peer is unable to read the transmitted data when we initiate a reset.

We could fix this by adding a special sleep for QUIC streams so that data is reliably sent out before we reset them, but we would prefer a cleaner, more graceful solution.

Originally posted by @nisdas in #3291 (comment)

@marten-seemann
Member Author

We are adding support for QUIC to Prysm and are running into this issue

@nisdas Very excited to hear that! Please feel free to reach out with any problems you might run into, happy to help!

Currently we ensure that all open streams are eventually reset so that they can be appropriately cleaned up. However, with libp2p's Stream API and the need to support other multiplexers (yamux, mplex), resetting QUIC streams the same way unfortunately leads to data loss: the remote peer is unable to read the transmitted data when we initiate a reset.

I'm not sure I understand what the problem is. In general, the flow is the following:

  1. You open (OpenStream) or accept (AcceptStream) a new stream (or the libp2p equivalents, NewStream and the callback passed to NewStreamHandler).
  2. You read and write data on that stream.
  3. Once you've finished writing data, you close the write side of the stream (normal termination). Or, if an error occurred during writing and you decide to abort transmission on the stream, you reset it (abrupt termination).
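A minimal Go sketch of steps 2 and 3 (the `SendStream` interface and `fakeStream` test double here are hypothetical stand-ins for quic-go's stream API, not the real types):

```go
package main

import (
	"errors"
	"fmt"
)

// SendStream is the write side of a QUIC stream, mirroring the method
// names quic-go uses (local interface for illustration).
type SendStream interface {
	Write([]byte) (int, error)
	Close() error            // normal termination: FIN, data retransmitted until acked
	CancelWrite(code uint64) // abrupt termination: RESET_STREAM, pending data dropped
}

// respond writes a response and then terminates the stream: Close on
// success, CancelWrite (reset) if writing fails partway through.
func respond(str SendStream, resp []byte) error {
	if _, err := str.Write(resp); err != nil {
		str.CancelWrite(1) // 1 is an arbitrary application error code
		return fmt.Errorf("aborting stream: %w", err)
	}
	return str.Close()
}

// fakeStream records which termination path was taken (test double).
type fakeStream struct {
	failWrites bool
	closed     bool
	reset      bool
}

func (f *fakeStream) Write(p []byte) (int, error) {
	if f.failWrites {
		return 0, errors.New("write failed")
	}
	return len(p), nil
}
func (f *fakeStream) Close() error            { f.closed = true; return nil }
func (f *fakeStream) CancelWrite(code uint64) { f.reset = true }
```

The key point is that exactly one of the two termination calls is made per code path.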

The distinction between normal and abrupt termination is important here: If you close the write side of the stream, all data will be delivered reliably, i.e. quic-go will retransmit stream data until it is acknowledged by the peer (modulo a race with connection termination, which is what this issue is about).

If you reset a stream, you do that because something went wrong and you don't want to send the entire response (or because the peer asked you to do so via CancelRead). In that case, you don't care about any of the data being delivered, so 1. the sender will not retransmit any data and 2. the receiver will immediately surface the reset error, and discard any data received.
In some cases you want to reliably deliver a small prefix of the stream (for example, a routing identifier), which is what my IETF draft on Partial Delivery (https://datatracker.ietf.org/doc/draft-ietf-quic-reliable-stream-reset/) is trying to achieve. I'd caution against using it to deliver the entire stream contents reliably.

In order to not leak streams, you therefore need to make sure that every code path either calls Close or CancelWrite at some point (or kills the entire connection).
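One way to guarantee this on every code path is a small helper that terminates the stream on return (a sketch against a hypothetical local interface, not quic-go's actual types):

```go
package main

import "errors"

var errAborted = errors.New("request failed")

// terminator is the pair of termination calls every stream must
// eventually see (hypothetical local interface).
type terminator interface {
	Close() error
	CancelWrite(code uint64)
}

// useStream runs fn and guarantees that every return path terminates
// the stream: CancelWrite on error, Close otherwise.
func useStream(str terminator, fn func() error) (err error) {
	defer func() {
		if err != nil {
			str.CancelWrite(1) // arbitrary application error code
			return
		}
		err = str.Close()
	}()
	return fn()
}

// termRecorder records which call terminated the stream (test double).
type termRecorder struct{ closed, reset bool }

func (t *termRecorder) Close() error            { t.closed = true; return nil }
func (t *termRecorder) CancelWrite(code uint64) { t.reset = true }
```

This pattern avoids leaking streams without sprinkling termination calls through every branch.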

Does that make sense?

@nisdas

nisdas commented Apr 3, 2024

@marten-seemann OK, thank you for the explanation of the termination flow for QUIC. I now have a good idea of why we are running into issues with resetting streams on QUIC connections. The following is the flow for our stream handler:

  1. We accept a stream from a remote peer for a particular protocol.
  2. We decode the remote peer's request and send an appropriate response for it.
  3. We then close the stream for writing.
  4. We then reset the stream at the end so that in the event there are issues with closing the stream, we can forcefully close it.

For yamux, 4) is a no-op if 3) was successful. However, QUIC appears to have very different semantics, where abrupt termination will in fact cause the remote peer to drop the data received. We should only abruptly terminate QUIC streams in the event that we have issues closing them; otherwise, in the happy case, we simply do not reset them. Thanks for all your help, feel free to close the issue.
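Sketched in Go, the handler flow above (with the unconditional reset in step 4) looks roughly like this; the `stream` interface and method names are illustrative stand-ins, not libp2p's exact API:

```go
package main

// stream stands in for a libp2p stream (hypothetical interface).
type stream interface {
	Write([]byte) (int, error)
	CloseWrite() error // step 3: close the write side
	Reset() error      // step 4: abrupt termination
}

// handle mirrors steps 2-4: respond, close for writing, then reset
// unconditionally. Over yamux the final Reset is a no-op after a
// successful CloseWrite; over QUIC (at the time of this issue) it sent
// a RESET_STREAM, so the peer could discard the transmitted response.
func handle(s stream, response []byte) {
	_, _ = s.Write(response) // step 2 (error handling elided for brevity)
	_ = s.CloseWrite()       // step 3
	_ = s.Reset()            // step 4: only safe if reset-after-close is a no-op
}

// recorder logs the order of termination calls (test double).
type recorder struct{ calls []string }

func (r *recorder) Write(p []byte) (int, error) { return len(p), nil }
func (r *recorder) CloseWrite() error           { r.calls = append(r.calls, "close"); return nil }
func (r *recorder) Reset() error                { r.calls = append(r.calls, "reset"); return nil }
```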

@marten-seemann
Member Author

in the event there are issues with closing the stream

What issues would that be? At least for quic-go, calling Close is guaranteed to close the stream (unless it was reset previously, see below).

For yamux, 4) is a no-op if 3) was successful. However, QUIC appears to have very different semantics, where abrupt termination will in fact cause the remote peer to drop the data received.

I see where the misunderstanding lies. The logic I described above only applies to the first call that terminates a stream (i.e. Close or CancelWrite). Once you've called Close on the stream, calling CancelWrite is a no-op. Similarly, once you've called CancelWrite, calling Close is a no-op.
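A toy model of these "first terminating call wins" semantics (illustration only, not quic-go's actual implementation):

```go
package main

// sendStream is a toy model of the termination logic described above:
// only the first of Close/CancelWrite takes effect.
type sendStream struct {
	terminated bool // set by whichever of Close/CancelWrite runs first
	wasReset   bool // true if CancelWrite won
}

func (s *sendStream) Close() error {
	if s.terminated {
		return nil // no-op: CancelWrite already terminated the stream
	}
	s.terminated = true
	return nil
}

func (s *sendStream) CancelWrite(code uint64) {
	if s.terminated {
		return // no-op: Close already terminated the stream
	}
	s.terminated = true
	s.wasReset = true
}
```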

@marten-seemann
Member Author

marten-seemann commented Apr 3, 2024

Actually, this is how it's supposed to work. But it doesn't! This is pretty bad.
CancelWrite is currently NOT a no-op after Close; it only works the other way around (Close after CancelWrite is a no-op).

Fix incoming.

marten-seemann changed the title from "Question about QUIC Stream Resets" to "CancelWrite after Close should be a no-op" on Apr 3, 2024
@nisdas

nisdas commented Apr 3, 2024

OK, great, thanks for clarifying @marten-seemann. This would explain why it got triggered for us.

@marten-seemann
Member Author

Interestingly, there's a failing test on #4408. Apparently, a few years ago, when most of the stream state machine was written, we thought that resetting after closing was a feature:

quic-go/send_stream_test.go, lines 893 to 902 at 183d42a:

```go
It("queues a RESET_STREAM frame, even if the stream was already closed", func() {
	mockSender.EXPECT().onHasStreamData(streamID)
	mockSender.EXPECT().queueControlFrame(gomock.Any()).Do(func(f wire.Frame) {
		Expect(f).To(BeAssignableToTypeOf(&wire.ResetStreamFrame{}))
	})
	mockSender.EXPECT().onStreamCompleted(gomock.Any())
	Expect(str.Close()).To(Succeed())
	// don't EXPECT any calls to queueControlFrame
	str.CancelWrite(123)
})
```

I still stand by the conclusion of this issue (reset after close should be a no-op), but it's interesting to see that this was not just an oversight, but a conscious design decision back then.

@marten-seemann
Member Author

I think I understand why we made this decision back then. Take a look at the send stream states from RFC 9000:
[Figure: send stream state machine from RFC 9000]

It explicitly contains the transition from Data Sent to Reset Sent. While it makes sense to make CancelWrite after Close a no-op (it's basically a misuse of the API, you might argue), a sender might receive a STOP_SENDING from the peer. In that case, it does make sense to send a RESET_STREAM and stop retransmitting the stream data.

@nisdas

nisdas commented Apr 4, 2024

Thanks for the update. Would the fact that this (RESET after CLOSE) is part of the specification block #4408 from merging right now?

@marten-seemann
Member Author

No, it doesn’t block us from merging #4408. Just because the spec allows this state transition, doesn’t mean that we need to expose an API for that.

What I described in #4404 (comment) is an optimization building on top of #4408.

Release-wise, I’m planning to cut a patch release for #4408 (maybe or maybe not including this optimization) in the next few days. Does that work for you?

@sukunrt
Collaborator

sukunrt commented Apr 4, 2024

@marten-seemann Can you elaborate on

While it makes sense to make CancelWrite after Close a no-op (it's basically a misuse of the API, you might argue)

How is it a misuse of the API? I would argue that the current behaviour is correct.

If the data has been sent but not yet acknowledged, it is buffered in case it was lost in transit. By calling Reset I want to free all that memory. If the data in transit is not lost and is delivered correctly, that's great. If it isn't, I don't want to retransmit.

@marten-seemann
Member Author

@sukunrt Ok, let me try to explain: We’re only looking at the send direction here. Assume you received a request for a resource, and you started generating the response. Now two things can happen:

  1. You successfully send the entire response, and you call Close. STREAM frames will get retransmitted if lost.
  2. An error occurs halfway through. You somehow need to tell the client “the thing I already sent is incomplete and I won’t be able to send the full thing”. That’s what a reset is for. So you call CancelWrite, which triggers the sending of a RESET_STREAM frame. The RESET_STREAM frame is retransmitted if lost, but none of the STREAM frames are. Upon receiving the RESET_STREAM, the receiver immediately tells the application about this error.

Now what is the meaning of calling CancelWrite after Close? It’s nonsensical! (It’s quite useful though if it’s a no-op.)


I assume you’re asking now “What if the receiver wants to stop receiving data?”. It would do so by calling CancelRead, which would send out a STOP_SENDING frame. Assuming the stream sender has already called Close (i.e. the STREAM frame with the FIN has already been sent), these two things are valid responses to the STOP_SENDING frame:

  1. Stop retransmitting STREAM frames, generate a RESET_STREAM frame (and reliably deliver this one).
  2. Ignore it and continue retransmitting STREAM frames.

Currently we’re doing (1). This is the most efficient way to implement things, for obvious reasons. With #4408, we’d (temporarily) do (2), until we implement the optimization in #4404 (comment), which will bring us back to (1).


Does this make sense?

@nisdas

nisdas commented Apr 4, 2024

Perfect @marten-seemann , that works great for us

@sukunrt
Collaborator

sukunrt commented Apr 4, 2024

That does explain things better, thank you.
As I understand it, Reset is only for cases where the Write fails. It makes sense for a Reset to be a no-op after Close in those cases.

I have one question: how do I ask quic-go to discard queued write data?

  1. Set a deadline of 1 minute
  2. write a request on a stream. Close the stream.
  3. wait for a response

The peer, however, is not responsive. Is there no way to drop the queued data?

@marten-seemann
Member Author

I'm beginning to wonder if the suggested API change is the right thing to do. While I agree that in many cases, calling CancelWrite after Close doesn't make a lot of sense, it is a valid state transition according to the RFC, and there might be use cases where it does make sense.

For users, it is easy to make CancelWrite after Close a no-op by wrapping the quic.Stream and adding a single tracking (atomic) bool. On the other hand, if we adopt #4408, there's no way to restore the current behavior.
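Such a wrapper might look like this (a sketch; `Stream` is a local stand-in for the quic-go stream type, and `recordingStream` is a hypothetical test double):

```go
package main

import "sync/atomic"

// Stream is the subset of quic-go's send-stream API used here (local
// interface standing in for the real type).
type Stream interface {
	Close() error
	CancelWrite(code uint64)
}

// noResetAfterClose wraps a Stream with a single atomic bool so that
// CancelWrite after Close becomes a no-op, as described above.
type noResetAfterClose struct {
	Stream
	closed atomic.Bool
}

func (s *noResetAfterClose) Close() error {
	s.closed.Store(true)
	return s.Stream.Close()
}

func (s *noResetAfterClose) CancelWrite(code uint64) {
	if s.closed.Load() {
		return // stream already closed cleanly; swallow the reset
	}
	s.Stream.CancelWrite(code)
}

// recordingStream counts termination calls (test double).
type recordingStream struct {
	closes  int
	cancels int
}

func (r *recordingStream) Close() error            { r.closes++; return nil }
func (r *recordingStream) CancelWrite(code uint64) { r.cancels++ }
```

Because the wrapper embeds the underlying stream, all other methods pass through unchanged; only the two termination calls are intercepted.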

@marten-seemann
Member Author

#4419 fixes the documentation for the SendStream interface, making it clear that CancelWrite after Close is NOT a no-op, as I suggested in #4404 (comment).

@sukunrt
Collaborator

sukunrt commented Apr 10, 2024

I think this is the right thing to do. As you've explained, users can wrap the stream and make CancelWrite a no-op. But users can also just not call CancelWrite in the first place.

@MarcoPolo
Collaborator

The fact that the RFC explicitly mentions this transition is enough reason for me to think the original behavior here is correct. I need to think about what this means for libp2p, but we would likely want similar semantics, since they handle cases like the one @sukunrt brought up here.

@marten-seemann
Member Author

Thank you to everyone who participated in this discussion! This was very enlightening, and we got to consider multiple different options for the API, and fixed the current documentation.
