Close cannot be called in the middle of a frame read by Read #355

Closed
spikecurtis opened this issue Nov 8, 2022 · 20 comments

@spikecurtis

Problem: If I have an application that waits for data over the websocket, how do I cleanly shut down the websocket? Here, cleanly means:

  1. stop waiting for new data
  2. perform the closing handshake
  3. verify that these things happened with no errors

Our first attempt at this called (*Conn).Read(ctx) in a loop on one goroutine and, when it was time to shut down, called (*Conn).Close() from a different goroutine.

However, this would often result in Close() returning errors while reading the response frame. Investigating, I found that calling Close() concurrently with Read() from different goroutines exposed a race condition: Read() would consume the frame header, and then Close() would read the frame payload as if it were a header and return an error.
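
For reference, here is a minimal sketch of that first approach (the variable names and the handle callback are illustrative, not from our actual code; error handling is abbreviated):

import (
	"context"

	"nhooyr.io/websocket"
)

// readLoopThenClose sketches the first attempt: read in a loop on one
// goroutine, then close from a different goroutine when it's time to shut down.
func readLoopThenClose(ctx context.Context, c *websocket.Conn, shutdown <-chan struct{}, handle func([]byte)) error {
	go func() {
		for {
			_, msg, err := c.Read(ctx) // blocks until the next message arrives
			if err != nil {
				return // this is also where errors caused by the concurrent Close surface
			}
			handle(msg) // hypothetical application handler
		}
	}()

	<-shutdown
	// Called from a different goroutine than the Read loop above, which is
	// what triggers the behaviour described in this issue.
	return c.Close(websocket.StatusNormalClosure, "shutting down")
}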

Would you consider this a bug? That is to say, are you meant to be able to call Read() and Close() from different goroutines concurrently?

My next attempt cancels the Context passed to Read(), and then calls Close() from the same goroutine after Read() returns. There are a couple of problems with this approach. First, canceling the Read() context seems to trigger an opClose with status PolicyViolation, whereas ideally I'd be sending a normal status on disconnect. Second, Close() returns a couple of different error messages that seem to depend on races in the library, including "already wrote close" and "WebSocket closed", so it's really hard to verify that the close handshake completed correctly.

@nhooyr
Owner

nhooyr commented Mar 5, 2023

@kylecarbs reached out to me on email about this and I responded there. I think you just need to go over the docs again. Can you point to the code where you think the race condition is?

As far as I can tell, everything is properly synchronized and there are no race conditions. I've tested the library extensively under the race detector.

@nhooyr nhooyr closed this as completed Mar 5, 2023
@spikecurtis
Author

The race condition works like this:

One goroutine calls (*Conn).Read(), which has two steps:

  1. Get a Reader().
  2. Call ReadAll() on the reader returned from 1.

The way the code is structured, when you call Reader(), it locks the read mutex, reads the websocket frame header, then unlocks the mutex and returns the reader. So after step 1 but before step 2, the next data on the socket is the payload (not a header), and the mutex is unlocked.
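
To make those two steps concrete, here is a rough sketch of what (*Conn).Read does in terms of the public API (a simplification for illustration, not the library's exact code):

import (
	"context"
	"io"

	"nhooyr.io/websocket"
)

// read sketches (*Conn).Read: step 1 obtains a reader (which locks the read
// mutex, reads the frame header, then unlocks); step 2 reads the payload.
func read(ctx context.Context, c *websocket.Conn) (websocket.MessageType, []byte, error) {
	typ, r, err := c.Reader(ctx) // step 1
	if err != nil {
		return 0, nil, err
	}
	b, err := io.ReadAll(r) // step 2: the payload bytes are consumed here
	return typ, b, err
}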

A second goroutine calls (*Conn).Close() on the same connection. This also has two steps:

A. Write the close header to the socket.
B. Attempt to read the response using (*Conn).waitCloseHandshake().

Both step 2 from Read() and step B from Close() try to lock the read mutex and read from the socket; this is the race. If step 2 wins and locks the mutex, we are OK, because step 2 expects to read the message payload, and that's what's there after step 1 completes. But if step B wins, we are in trouble, because step B is attempting to read a header while the next data on the socket is not a header; it's payload. If this happens, we get a data corruption error.

The Go race detector won't help with this sort of problem because it only looks for unsynchronized memory access. That's not the problem here: mutexes ensure only one goroutine is reading at a time. The problem is that Close() assumes that when it acquires the read mutex, the next data it will find is a header, but Read() unlocks the mutex at a point where that is not true.

@nhooyr
Owner

nhooyr commented Mar 6, 2023

The Go race detector won't help with this sort of problem because it only looks for unsynchronized memory access. That's not the problem here: mutexes ensure only one goroutine is reading at a time. The problem is that Close() assumes that when it acquires the read mutex, the next data it will find is a header, but Read() unlocks the mutex at a point where that is not true.

As far as I can tell, Close does not expect that and so there is no race. Can you point to the exact line you think is making this assumption?

The code is at

func (c *Conn) waitCloseHandshake() error {

I do understand what you're getting at but you wouldn't get a data corruption error. You'll just get a websocket.CloseError from the Read and all further reads.

Once waitCloseHandshake has the read mutex locked, it doesn't let go until the corresponding close frame is received. It discards all further read payload bytes until it gets the close frame. That's what this loop here is doing:

for {

After all, once you've called Close, you shouldn't care what else is left to be read on the connection.

@spikecurtis
Author

waitCloseHandshake() calls readLoop(), which reads from the buffer and expects to get a frame header.

h, err := c.readFrameHeader(ctx)

Then it starts checking the data it read for reserved bits

if h.rsv1 && c.readRSV1Illegal(h) || h.rsv2 || h.rsv3 {

And checking the opcode:

switch h.opcode {

If the next data in the buffer is not a header, then these protocol errors get returned to Close(), when in fact there was never a protocol error, just a failure to decode the stream correctly.

@spikecurtis
Author

It sounds like you think that calling Close() and Read() from different goroutines should work, and you would consider getting a protocol error returned to Close() in that case a bug? If that's the case I don't mind putting together a minimal example that demonstrates the problem.

@nhooyr
Owner

nhooyr commented Mar 7, 2023

Ah fair enough I see the problem. Yes this is definitely a bug in need of fixing. Thank you for identifying it and sorry for dismissing you earlier.

@nhooyr nhooyr reopened this Mar 7, 2023
@nhooyr nhooyr changed the title How to cleanly disconnect? Close fails sometimes called concurrently with Read Mar 7, 2023
@nhooyr nhooyr added the bug label Mar 7, 2023
@nhooyr nhooyr added this to the v1.8.8 milestone Mar 7, 2023
@nhooyr nhooyr changed the title Close fails sometimes called concurrently with Read Close cannot be called in the middle of a frame read by Read Mar 8, 2023
@nhooyr
Owner

nhooyr commented Mar 8, 2023

I do feel the need to defend why I was confused: this isn't a race. You can produce it from a single goroutine by simply calling Reader, reading a single byte of a frame, and then calling Close(). That's supposed to be completely legal (e.g. if you read malformed payload bytes and want to close the connection immediately), but it produces the same issue, with Close trying to parse the remaining payload bytes as a WebSocket frame.
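
A rough sketch of that single-goroutine reproduction (assuming c is a *websocket.Conn whose peer has already sent a data message longer than one byte; the names are illustrative):

import (
	"context"

	"nhooyr.io/websocket"
)

// readOneByteThenClose reproduces the issue from a single goroutine:
// read only part of a frame's payload, then call Close.
func readOneByteThenClose(ctx context.Context, c *websocket.Conn) error {
	_, r, err := c.Reader(ctx) // consumes the frame header
	if err != nil {
		return err
	}
	buf := make([]byte, 1)
	if _, err := r.Read(buf); err != nil { // consume a single payload byte
		return err
	}
	// The rest of the payload is still unread. Before the fix, Close would
	// try to parse those payload bytes as the close-handshake response frame
	// and report a protocol error.
	return c.Close(websocket.StatusNormalClosure, "")
}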

@marten-seemann

We've recently switched to this library as well in go-libp2p. Thank you for your work on this, @nhooyr!

We're currently running into the same error: Close randomly blocks for 5 seconds when we close the connection. This can significantly affect the performance of our library, as we assume that Close calls return relatively quickly and don't have to be called concurrently (at most behind a small semaphore when we're closing many connections at the same time).

Is there any way we can help get this fixed soon, or do you have a timeline for a fix?

@spikecurtis
Author

I don't think Close() blocking is the same error as this, which is when Close() returns a protocol error even when the other side is following the protocol just fine.

@marten-seemann Close() does wait for the response frame. Is it possible that your application has some head-of-line blocking reading an earlier frame and/or the peer takes some time to send the response?

@nhooyr
Owner

nhooyr commented Mar 17, 2023

@marten-seemann

I think your issue is unrelated, as @spikecurtis noted. If you were encountering this bug, you wouldn't see Close blocking; it would error out quickly. If it's taking 5s, then either the connection is broken somehow, the peer hasn't yet read the close frame because you sent a really large message first and that message is taking longer than 5s to read, or you're reading a really large frame from the peer that is taking longer than 5s, so you never get to the close frame.

All of which are really only possible on a slow connection. If you're able to reproduce on a local setup, please provide it in a new issue and I can debug.

@marten-seemann

marten-seemann commented Mar 17, 2023

@nhooyr I managed to condense it down to a minimal test case that reliably reproduces the bug. There are no read or write calls at all, just the server closing the connection right after accepting it.

import (
	"context"
	"net/http"
	"testing"
	"time"

	ws "nhooyr.io/websocket"
)

func TestWebsocketImmediateClose(t *testing.T) {
	closed := make(chan struct{})
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		defer close(closed)
		c, err := ws.Accept(w, r, &ws.AcceptOptions{InsecureSkipVerify: true})
		if err != nil {
			return // The upgrader writes a response for us.
		}

		start := time.Now()
		c.Close(ws.StatusUnsupportedData, "bye")
		if took := time.Since(start); took > 2*time.Second {
			t.Fatalf("closing took too long: %s", took)
		}
	})
	server := &http.Server{Handler: mux, Addr: "localhost:8080"}
	done := make(chan struct{})
	go func() {
		defer close(done)
		server.ListenAndServe()
	}()

	time.Sleep(25 * time.Millisecond) // give the server some time to boot

	if _, _, err := ws.Dial(context.Background(), "ws://localhost:8080", nil); err != nil {
		t.Fatal(err)
	}
	<-closed // wait until the HTTP handler has returned

	// test shutdown
	server.Close()
	<-done
}

I'm not very familiar with the code base (nor with the WebSocket protocol itself), but here's where the Close call gets stuck:

h, err := c.readLoop(ctx)

Maybe I'm thinking in the wrong direction here, but this is surprising to me. Why do we need to read anything when closing? I would've expected us to just write a Close frame and then return.

EDIT: feel free to move to a new issue if you think this is a different bug.

@nhooyr
Owner

nhooyr commented Mar 17, 2023

The WebSocket RFC mandates it as part of the close handshake. Though I'm now thinking I'll remove it in a future version.

Anyway, it's not a bug here: you're not reading from or closing the dialed connection, so the close handshake never completes.

@nhooyr
Owner

nhooyr commented Mar 17, 2023

See docs on CloseRead for example.
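
For instance, the dialing side of the test above could be changed along these lines so the close handshake completes (a sketch reusing the test's ws alias and t; the CloseRead call is the relevant part):

c, _, err := ws.Dial(context.Background(), "ws://localhost:8080", nil)
if err != nil {
	t.Fatal(err)
}
// CloseRead starts a goroutine that reads and discards incoming messages,
// so the server's close frame is seen and acknowledged. The returned
// context is cancelled once the connection is closed.
ctx := c.CloseRead(context.Background())
<-ctx.Done()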

@marten-seemann

I’m not sure I understand. The server needs to be able to close the connection without hanging, no matter what the client does. Otherwise, this is an easy attack vector.

As a user, after calling Close, I don't really care whether the client acknowledges it or not. All I care about is that I can move on with other work (and don't have to wait for 5s).

@nhooyr
Owner

nhooyr commented Mar 17, 2023

Can you clarify how this is an attack vector/performance issue for you? It's a 5s wait and obeys the set read limit.

I can't change this behaviour in v1 as it's possible others rely on Close returning successfully if the close handshake was successful.

But you can fork the library and comment out this line:

closeHandshakeErr := c.waitCloseHandshake()

@nhooyr
Owner

nhooyr commented Mar 17, 2023

And please stop responding here, open a new issue.

@marten-seemann

Sorry for spamming this issue; I opened #384 and will reply there.

@nhooyr
Owner

nhooyr commented Oct 19, 2023

Probably the cause of #391

@nhooyr
Owner

nhooyr commented Oct 19, 2023

Fixed by 6cec2ca

@nhooyr nhooyr closed this as completed Oct 19, 2023
@nhooyr
Owner

nhooyr commented Oct 19, 2023

Thanks for reporting @spikecurtis
