Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ping sometimes causes Close to time out #298

Closed
univerio opened this issue Apr 16, 2021 · 9 comments
Closed

Ping sometimes causes Close to time out #298

univerio opened this issue Apr 16, 2021 · 9 comments

Comments

@univerio
Copy link
Contributor

Minimal reproduction:

package websocket_test

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"nhooyr.io/websocket"
	"nhooyr.io/websocket/internal/test/wstest"
)

func TestWeirdTimeout(t *testing.T) {
	sConn, cConn := wstest.Pipe(nil, nil)
	sConn.CloseRead(context.Background())
	cConn.CloseRead(context.Background())
	go func() {
		for range time.Tick(10 * time.Millisecond) {
			if err := sConn.Ping(context.Background()); err != nil {
				return
			}
		}
	}()

	time.Sleep(30 * time.Millisecond)
	assert.NoError(t, sConn.Close(websocket.StatusNormalClosure, ""))
}

This seems to non-deterministically fail with the documented 5s timeout:

$ go test -run TestWeirdTimeout -count 100 -failfast -v
=== RUN   TestWeirdTimeout
--- PASS: TestWeirdTimeout (0.03s)
=== RUN   TestWeirdTimeout
--- PASS: TestWeirdTimeout (0.03s)
=== RUN   TestWeirdTimeout
--- PASS: TestWeirdTimeout (0.03s)
=== RUN   TestWeirdTimeout
    foo_test.go:26:
        	Error Trace:	foo_test.go:26
        	Error:      	Received unexpected error:
        	            	failed to close WebSocket: failed to write control frame opClose: failed to write frame: context deadline exceeded
        	Test:       	TestWeirdTimeout
--- FAIL: TestWeirdTimeout (5.03s)
FAIL
exit status 1
FAIL	nhooyr.io/websocket	5.309s

Is it incorrect to call Close() concurrently with Ping()?

@nhooyr
Copy link
Owner

nhooyr commented Aug 6, 2021

Hi @univerio

So I'm not 100% sure but I believe this is expected behaviour as the Ping being every 10 milliseconds is effectively hogging the write lock.

Instead of a ticker, try using a timer that is reset after every ping/pong. I'd bet it happens far less often if never as there would be a guaranteed 10ms interval between pings/pongs then.

I'll be sure to test prior to the next release myself.

And apologies for the delayed response!

@nhooyr nhooyr added this to the v1.8.8 milestone Aug 6, 2021
@univerio
Copy link
Contributor Author

univerio commented Aug 6, 2021

I made it 10ms to make it fail faster, but it happens with 1s tick and 3s closure delay as well.

This stops happening if you take out a mutex around sConn.Ping() and sConn.Close() so I feel like there's some sort of race condition going on.

However, you're right that it does happen less often when using a Timer, but I suspect that that's due to the Ping() and Close() not lining up at the same instant.

@nhooyr
Copy link
Owner

nhooyr commented Aug 6, 2021

Ah yea if that's the case then this does sound like a bug. Can you try the dev branch? I did make some changes that aren't on master.

@univerio
Copy link
Contributor Author

Sorry for the late response here. I tested the dev branch and could not reproduce using the test case above, so maybe it's fixed!

@nhooyr
Copy link
Owner

nhooyr commented Mar 5, 2023

Will add as a test case to the dev branch so if it happens again I'll investigate.

@timo-klarshift
Copy link

Is this fixed? Will there be a new version besides having this in dev branch?

@bhallionOhbibi
Copy link

bhallionOhbibi commented May 25, 2023

Hi !
I'm having the same issue with close and ping.

First, I want to say thanks for this library which comes quite handy 😄

The client is somehow blocked in the writeFrame method when he tries to write the opPing in response to the server's opPing while trying to read the close handshake from the server (which was sent automatically by the server during its reading routine).
In my case, the client only read from the server during the close handshake which prevent him from handling the server's ping correctly.

I managed to get 100% reproduction with this "quite simple" test:
(Sorry I didn't access the internal test tools but It does the testing job anyway)

func TestPingCloseIssue(t *testing.T) {
	ctx := context.Background()

	// Server section
	// =====================
	router := gin.New()
	endpoint := "/ping-close-websocket-test"
	router.GET(endpoint, func(gCtx *gin.Context) {
		fmt.Printf("Server: endpoint: START.\n")
		defer fmt.Printf("Server: endpoint: END.\n")

		// Accept conn
		sConn, err := websocket.Accept(gCtx.Writer, gCtx.Request, nil)
		if err != nil {
			fmt.Printf("Server: failed to accept ws connexion: %s\n", err)
			return
		}

		// Start ping routine
		// Ping once with a 2 seconds timeout.
		go func() {
			pingCtx, pingCtxCancel := context.WithTimeout(gCtx.Request.Context(), time.Second*2)
			defer pingCtxCancel()
			fmt.Printf("Server: ping\n")
			err := sConn.Ping(pingCtx)
			if err != nil {
				if status := websocket.CloseStatus(err); status != -1 {
					fmt.Printf("Server: ping: websocket closed with status: %d\n", status)
					return
				}
				fmt.Printf("Server: ping err: %s\n", err)
				return
			}
			fmt.Printf("Server: ping success.\n")
		}()

		// Start reading
		// Read once with a 2 seconds timeout.
		readCtx, readCtxCancel := context.WithTimeout(gCtx.Request.Context(), time.Second*2)
		defer readCtxCancel()
		_, _, err = sConn.Read(readCtx)
		if err != nil {
			if status := websocket.CloseStatus(err); status != -1 {
				fmt.Printf("Server: read: websocket closed with status: %d\n", status)
				return
			}
			fmt.Printf("Server: read err: %s\n", err)
			return
		}
		fmt.Printf("Server: read success.\n")
	})
	// Start server
	testserver := startTestServerOnPort(router, 42428)
	defer testserver.Close()

	// Client section
	// =====================
	fmt.Println("Client: Start dial")
	url := fmt.Sprintf("%s%s", testserver.URL, endpoint)
	// Dial the server and close the websocket after 100ms.
	cConn, _, err := websocket.Dial(ctx, url, nil)
	if err != nil {
		t.Fatalf("Client: failed to dial ws: %s", err)
		return
	}
	time.Sleep(time.Millisecond * 100)
	fmt.Printf("Client: closing (url=%s)\n", url)
	err = cConn.Close(websocket.StatusGoingAway, "dial end (url="+url+")")
	if err != nil {
		t.Fatalf("Client: failed to close ws: %s", err)
		return
	}
}

Which outputs:

Client: Start dial
Server: endpoint: START.
Server: ping
Client: closing (url=http://127.0.0.1:42428/ping-close-websocket-test)
Server: ping: websocket closed with status: 1001
Server: read: websocket closed with status: 1001
Server: endpoint: END.
--- FAIL: TestPingCloseIssue (5.10s)
    websocketPing_test.go:104: Client: failed to close ws: failed to close WebSocket: failed to handle control frame opPing: failed to write control frame opPong: context deadline exceeded
FAIL
exit status 1
FAIL    test.go   5.122s

Reproduction steps:

  • Client: start dial
  • Server: Accept conn
  • Server: sends a ping
  • Server: reads
  • Client: send close
  • Client: wait for close handshake.
  • Server: receive close
  • Server: respond with close for "handshake"
  • Client: still waiting for handshake, but only got a opPing
  • Client: timeout after 5s saying he failed to handle opPing and write opPong.

I'm not quite sure if this is an issue with the library or the way I use them. But this is definetly confusing and it would add complexity to implement around this obscure behaviour.

@bhallionOhbibi
Copy link

bhallionOhbibi commented May 25, 2023

I found a simple fix for this but it does not compply with the RFC (section-5.5.1) which specify:

The application MUST NOT send any more data frames after sending a Close frame.

Therefore I can assume that writeFrame is correct by preventing any writing once the opClose frame has been sent.
I think a correct fix for this would be to discard the opPing frame received after a opClose frame has been sent. This means, not responding to the opPing with a opPong, therefore not trying to write (since we are not allowed to) to avoid locking and continue reading incoming frames to be able to catch the opClose from the peer.

I created a PR to fix this issue: #394

nhooyr pushed a commit to bhallionOhbibi/websocket that referenced this issue Oct 19, 2023
nhooyr pushed a commit that referenced this issue Oct 19, 2023
nhooyr pushed a commit to bhallionOhbibi/websocket that referenced this issue Oct 19, 2023
nhooyr pushed a commit that referenced this issue Oct 19, 2023
nhooyr pushed a commit that referenced this issue Oct 19, 2023
nhooyr added a commit that referenced this issue Oct 19, 2023
Closes #298
Closes #394

The close frame was being received from the peer before we were able to reset our
write timeout and so we thought the write kept failing but it never was...

Thanks @univerio and @bhallionOhbibi
@nhooyr
Copy link
Owner

nhooyr commented Oct 19, 2023

Fixed in dev, see 28c6709

Thanks again @univerio and @bhallionOhbibi

@nhooyr nhooyr closed this as completed Oct 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants