Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Outbound calls timing out with context deadline exceeded when TCP connection is been drop #21346

Closed
julienstroheker opened this issue Aug 10, 2023 · 24 comments · Fixed by #21468
Assignees
Labels
Azure.Core bug This issue requires a change to an existing behavior in the product in order to be resolved. Client This issue points to a problem in the data-plane of the library.

Comments

@julienstroheker
Copy link

Bug Report

Opening this for visibility in case other folks are having this issue !

TLDR : golang/go#59690

  • What happened?

When using HTTP/2, which is defaulted everywhere now, if the keep-alive connection is been dropped somehow, the sdk clients cannot recover causing some timeout context deadline exceeded until the connection been closed by the host.

  • What did you expect or want to happen ?

When using a HTTP/2 server, CosmosDB for example, the SDK transport should PING the server to identify is the connection is not stale.

  • How can we reproduce it?

I'm able to reproduce this using CosmosDB sample code such as :

package main

import (
	"context"
	"crypto/tls"
	"log"
	"net"
	"net/http"
	"time"

	logger "github.com/Azure/azure-sdk-for-go/sdk/azcore/log"
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
	"github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos"
)

const (
	cosmosDbEndpoint = "https://julienstestcosmos.documents.azure.com"
	cosmosDbKey      = "REDACTED"
	dbName           = "julienstest"
	containerName    = "Subscriptions"
	pk               = "c88b61d7-85aa-415f-8eda-4e73e5e811d5"
	id               = "c88b61d7-85aa-415f-8eda-4e73e5e811d5"
)

func main() {
	ctx := context.Background()
	opt := &azcosmos.ClientOptions{ClientOptions: policy.ClientOptions{
		// we call cosmosDB more frequent and CosmosDB's latency is much less than ARM
		// so we use difference retry policy than ARM
		Retry: policy.RetryOptions{
			// Create/Update/Delete option will get ItemAlreadyExists error in second retry if first one already arrive server.
			StatusCodes: []int{
				// azcore defaults, see RetryOptions doc
				http.StatusRequestTimeout,      // 408
				http.StatusTooManyRequests,     // 429
				http.StatusInternalServerError, // 500
				http.StatusBadGateway,          // 502
				http.StatusServiceUnavailable,  // 503
				http.StatusGatewayTimeout,      // 504
				// adding 403 to work around random 403 from cosmosdb
				// see https://github.com/Azure/azure-sdk-for-go/issues/19785
				http.StatusForbidden, // 403
			},
			MaxRetries:    3,
			TryTimeout:    5 * time.Second, // The maximum latency in server side in slow EUAP regions are 2-3 seconds today
			RetryDelay:    1 * time.Second,
			MaxRetryDelay: 2 * time.Second, // Delay increases exponentially, delay from 1 second, 2 time will be 2 seconds
		},
	}}
	logger.SetListener(func(cls logger.Event, msg string) {
		log.Printf("%s: %s\n", cls, msg)
	})

	creds, err := azcosmos.NewKeyCredential(cosmosDbKey)
	if err != nil {
		log.Fatalf("failed to create credential: %v", err)
	}
	client, err := azcosmos.NewClientWithKey(cosmosDbEndpoint, creds, opt)
	if err != nil {
		log.Fatalf("failed to create NewClientWithKey: %v", err)
	}
	container, err := client.NewContainer(dbName, containerName)
	if err != nil {
		log.Fatalf("failed to create NewContainer: %v", err)
	}
	itemResponse, err := container.ReadItem(ctx, azcosmos.NewPartitionKeyString(id), id, nil)
	if err != nil {
		log.Fatalf("failed to ReadItem: %v", err)
	}
	// log.Print(itemResponse.ActivityID)

	// Set up an infinite loop to place requests from
	for {
		_, err := container.ReadItem(ctx, azcosmos.NewPartitionKeyString(id), id, nil)
		if err != nil {
			log.Printf("Timed out with error: %v", err)
		} else {
			log.Printf("Obtained response with activity ID: %s", itemResponse.ActivityID)
		}
		time.Sleep(5 * time.Second)
	}
}

Then run go run main.go

Check for the port been used to talk to cosmos netstat -anp | grep main

Drop the connection sudo iptables -I INPUT -p tcp --dport 54420 -j DROP where 54420 is the port returned by netstat`

This is what you should see

2023/08/10 04:48:48 Obtained response with activity ID: ffbe9497-7be3-4595-a17b-4d46667e0ab5
2023/08/10 04:48:50 Obtained response with activity ID: ffbe9497-7be3-4595-a17b-4d46667e0ab5
2023/08/10 04:49:17 Timed out with error: Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": context deadline exceeded
2023/08/10 04:49:45 Timed out with error: Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": context deadline exceeded
2023/08/10 04:50:12 Timed out with error: Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": context deadline exceeded
2023/08/10 04:50:39 Timed out with error: Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": context deadline exceeded
2023/08/10 04:51:06 Timed out with error: Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": context deadline exceeded

The connection should recover at some point (based on the host that will kill the keep-alive connection at some point, might take up to 15minutes)

After setting the HTTP/2 transport properly as following :

	http2Transport, err := http2.ConfigureTransports(tr2)
	if err == nil {
		log.Println("HTTP/2 transport configured")
		// ReadIdleTimeout is the timeout after which a health check using ping
		// frame will be carried out if no frame is received on the connection.
		// Note that a ping response will is considered a received frame, so if
		// there is no other traffic on the connection, the health check will
		// be performed every ReadIdleTimeout interval.
		// If zero, no health check is performed.
		http2Transport.ReadIdleTimeout = time.Second * 10
		// PingTimeout is the timeout after which the connection will be closed
		// if a response to Ping is not received.
		// Defaults to 15s.
		http2Transport.PingTimeout = time.Second * 2
		// No error so we can use the new transport
		tr = tr2
	} else {
		log.Printf("HTTP/2 transport configuration failed: %v", err)
	}

Then running it with export GODEBUG=http2debug=2 & go run main.go

```bash
2023/08/10 05:10:39 Retry: =====> Try=1
2023/08/10 05:10:39 Request: ==> OUTGOING REQUEST (Try=1)
   GET https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5
   Authorization: REDACTED
   User-Agent: azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)
   X-Ms-Cosmos-Sdk-Supportedcapabilities: REDACTED
   X-Ms-Date: REDACTED
   X-Ms-Documentdb-Partitionkey: REDACTED
   X-Ms-Version: REDACTED

2023/08/10 05:10:39 http2: Transport encoding header ":authority" = "julienstestcosmos.documents.azure.com"
2023/08/10 05:10:39 http2: Transport encoding header ":method" = "GET"
2023/08/10 05:10:39 http2: Transport encoding header ":path" = "/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5"
2023/08/10 05:10:39 http2: Transport encoding header ":scheme" = "https"
2023/08/10 05:10:39 http2: Transport encoding header "x-ms-date" = "Thu, 10 Aug 2023 05:10:39 GMT"
2023/08/10 05:10:39 http2: Transport encoding header "x-ms-version" = "2020-11-05"
2023/08/10 05:10:39 http2: Transport encoding header "x-ms-cosmos-sdk-supportedcapabilities" = "1"
2023/08/10 05:10:39 http2: Transport encoding header "user-agent" = "azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)"
2023/08/10 05:10:39 http2: Transport encoding header "x-ms-documentdb-partitionkey" = "[\"c88b61d7-85aa-415f-8eda-4e73e5e811d5\"]"
2023/08/10 05:10:39 http2: Transport encoding header "authorization" = "type%3Dmaster%26ver%3D1.0%26sig%3D6kKN4RGlxyMfV3Q%2FA6avRQSxTzW9j3ggfGLf5Yzpmz0%3D"
2023/08/10 05:10:39 http2: Transport encoding header "accept-encoding" = "gzip"
2023/08/10 05:10:39 http2: Framer 0xc00019d420: wrote HEADERS flags=END_STREAM|END_HEADERS stream=13 len=100
2023/08/10 05:10:44 RoundTrip failure: context deadline exceeded
2023/08/10 05:10:44 Response: ==> REQUEST/RESPONSE (Try=1/5.002407971s, OpTime=5.002449671s) -- REQUEST ERROR
   GET https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5
   Authorization: REDACTED
   User-Agent: azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)
   X-Ms-Cosmos-Sdk-Supportedcapabilities: REDACTED
   X-Ms-Date: REDACTED
   X-Ms-Documentdb-Partitionkey: REDACTED
   X-Ms-Version: REDACTED
   --------------------------------------------------------------------------------
   ERROR:
Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": context deadline exceeded
github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.(*logPolicy).Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/runtime/policy_logging.go:140
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*sharedKeyCredPolicy).Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/shared_key_credential.go:120
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.(*retryPolicy).Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/runtime/policy_retry.go:134
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*headerPolicies).Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/cosmos_headers_policy.go:51
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.telemetryPolicy.Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/runtime/policy_telemetry.go:66
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.includeResponsePolicy()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/runtime/policy_include_response.go:19
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.PolicyFunc.Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:181
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.Pipeline.Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/pipeline.go:96
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*Client).executeAndEnsureSuccessResponse()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/cosmos_client.go:428
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*Client).sendGetRequest()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/cosmos_client.go:310
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*ContainerClient).ReadItem()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/cosmos_container.go:348
main.main()
	/home/juliens/go/src/batman/h2/main.go:116
runtime.main()
	/snap/go/current/src/runtime/proc.go:250
runtime.goexit()
	/snap/go/current/src/runtime/asm_amd64.s:1598

2023/08/10 05:10:44 Retry: error Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": context deadline exceeded
2023/08/10 05:10:44 Retry: End Try #1, Delay=973.05341ms
2023/08/10 05:10:44 http2: Transport sending health check
2023/08/10 05:10:44 http2: Framer 0xc00019d420: wrote RST_STREAM stream=13 len=4 ErrCode=CANCEL
2023/08/10 05:10:44 http2: Framer 0xc00019d420: wrote PING len=8 ping="+\x1a\x9c\xf3\xfc\xae{W"
2023/08/10 05:10:45 Retry: =====> Try=2
2023/08/10 05:10:45 Request: ==> OUTGOING REQUEST (Try=2)
   GET https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5
   Authorization: REDACTED
   User-Agent: azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)
   X-Ms-Cosmos-Sdk-Supportedcapabilities: REDACTED
   X-Ms-Date: REDACTED
   X-Ms-Documentdb-Partitionkey: REDACTED
   X-Ms-Version: REDACTED

2023/08/10 05:10:45 http2: Transport encoding header ":authority" = "julienstestcosmos.documents.azure.com"
2023/08/10 05:10:45 http2: Transport encoding header ":method" = "GET"
2023/08/10 05:10:45 http2: Transport encoding header ":path" = "/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5"
2023/08/10 05:10:45 http2: Transport encoding header ":scheme" = "https"
2023/08/10 05:10:45 http2: Transport encoding header "x-ms-date" = "Thu, 10 Aug 2023 05:10:39 GMT"
2023/08/10 05:10:45 http2: Transport encoding header "authorization" = "type%3Dmaster%26ver%3D1.0%26sig%3D6kKN4RGlxyMfV3Q%2FA6avRQSxTzW9j3ggfGLf5Yzpmz0%3D"
2023/08/10 05:10:45 http2: Transport encoding header "x-ms-version" = "2020-11-05"
2023/08/10 05:10:45 http2: Transport encoding header "x-ms-cosmos-sdk-supportedcapabilities" = "1"
2023/08/10 05:10:45 http2: Transport encoding header "user-agent" = "azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)"
2023/08/10 05:10:45 http2: Transport encoding header "x-ms-documentdb-partitionkey" = "[\"c88b61d7-85aa-415f-8eda-4e73e5e811d5\"]"
2023/08/10 05:10:45 http2: Transport encoding header "accept-encoding" = "gzip"
2023/08/10 05:10:45 http2: Framer 0xc00019d420: wrote HEADERS flags=END_STREAM|END_HEADERS stream=15 len=11
2023/08/10 05:10:46 http2: Transport health check failure: context deadline exceeded
2023/08/10 05:10:46 http2: Framer 0xc00019d420: wrote RST_STREAM stream=15 len=4 ErrCode=CANCEL
2023/08/10 05:10:46 RoundTrip failure: http2: client connection lost
2023/08/10 05:10:46 http2: Transport readFrame error on conn 0xc000004000: (*net.OpError) read tcp 10.0.0.4:35104->40.78.226.1:443: use of closed network connection
2023/08/10 05:10:46 Response: ==> REQUEST/RESPONSE (Try=2/1.02727298s, OpTime=7.003358017s) -- REQUEST ERROR
   GET https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5
   Authorization: REDACTED
   User-Agent: azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)
   X-Ms-Cosmos-Sdk-Supportedcapabilities: REDACTED
   X-Ms-Date: REDACTED
   X-Ms-Documentdb-Partitionkey: REDACTED
   X-Ms-Version: REDACTED
   --------------------------------------------------------------------------------
   ERROR:
Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": http2: client connection lost
github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.(*logPolicy).Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/runtime/policy_logging.go:140
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*sharedKeyCredPolicy).Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/shared_key_credential.go:120
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.(*retryPolicy).Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/runtime/policy_retry.go:134
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*headerPolicies).Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/cosmos_headers_policy.go:51
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.telemetryPolicy.Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/runtime/policy_telemetry.go:66
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.includeResponsePolicy()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/runtime/policy_include_response.go:19
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.PolicyFunc.Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:181
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/request.go:84
github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.Pipeline.Do()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/azcore@v1.7.0/internal/exported/pipeline.go:96
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*Client).executeAndEnsureSuccessResponse()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/cosmos_client.go:428
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*Client).sendGetRequest()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/cosmos_client.go:310
github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos.(*ContainerClient).ReadItem()
	/home/juliens/go/pkg/mod/github.com/!azure/azure-sdk-for-go/sdk/data/azcosmos@v0.3.5/cosmos_container.go:348
main.main()
	/home/juliens/go/src/batman/h2/main.go:116
runtime.main()
	/snap/go/current/src/runtime/proc.go:250
runtime.goexit()
	/snap/go/current/src/runtime/asm_amd64.s:1598

2023/08/10 05:10:46 Retry: error Get "https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5": http2: client connection lost
2023/08/10 05:10:46 Retry: End Try #2, Delay=2s
2023/08/10 05:10:48 Retry: =====> Try=3
2023/08/10 05:10:48 Request: ==> OUTGOING REQUEST (Try=3)
   GET https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5
   Authorization: REDACTED
   User-Agent: azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)
   X-Ms-Cosmos-Sdk-Supportedcapabilities: REDACTED
   X-Ms-Date: REDACTED
   X-Ms-Documentdb-Partitionkey: REDACTED
   X-Ms-Version: REDACTED

2023/08/10 05:10:48 http2: Transport failed to get client conn for julienstestcosmos.documents.azure.com:443: http2: no cached connection was available
2023/08/10 05:10:48 http2: Transport failed to get client conn for julienstestcosmos.documents.azure.com:443: http2: no cached connection was available
2023/08/10 05:10:48 http2: Transport creating client conn 0xc000004a80 to 40.78.226.1:443
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: wrote SETTINGS len=18, settings: ENABLE_PUSH=0, INITIAL_WINDOW_SIZE=4194304, MAX_HEADER_LIST_SIZE=10485760
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: wrote WINDOW_UPDATE len=4 (conn) incr=1073741824
2023/08/10 05:10:48 http2: Transport encoding header ":authority" = "julienstestcosmos.documents.azure.com"
2023/08/10 05:10:48 http2: Transport encoding header ":method" = "GET"
2023/08/10 05:10:48 http2: Transport encoding header ":path" = "/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5"
2023/08/10 05:10:48 http2: Transport encoding header ":scheme" = "https"
2023/08/10 05:10:48 http2: Transport encoding header "x-ms-cosmos-sdk-supportedcapabilities" = "1"
2023/08/10 05:10:48 http2: Transport encoding header "user-agent" = "azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)"
2023/08/10 05:10:48 http2: Transport encoding header "x-ms-documentdb-partitionkey" = "[\"c88b61d7-85aa-415f-8eda-4e73e5e811d5\"]"
2023/08/10 05:10:48 http2: Transport encoding header "authorization" = "type%3Dmaster%26ver%3D1.0%26sig%3D6kKN4RGlxyMfV3Q%2FA6avRQSxTzW9j3ggfGLf5Yzpmz0%3D"
2023/08/10 05:10:48 http2: Transport encoding header "x-ms-date" = "Thu, 10 Aug 2023 05:10:39 GMT"
2023/08/10 05:10:48 http2: Transport encoding header "x-ms-version" = "2020-11-05"
2023/08/10 05:10:48 http2: Transport encoding header "accept-encoding" = "gzip"
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: wrote HEADERS flags=END_STREAM|END_HEADERS stream=1 len=328
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: read SETTINGS len=18, settings: MAX_CONCURRENT_STREAMS=100, INITIAL_WINDOW_SIZE=98304, MAX_HEADER_LIST_SIZE=102400
2023/08/10 05:10:48 http2: Transport received SETTINGS len=18, settings: MAX_CONCURRENT_STREAMS=100, INITIAL_WINDOW_SIZE=98304, MAX_HEADER_LIST_SIZE=102400
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: wrote SETTINGS flags=ACK len=0
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: read WINDOW_UPDATE len=4 (conn) incr=65537
2023/08/10 05:10:48 http2: Transport received WINDOW_UPDATE len=4 (conn) incr=65537
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: read SETTINGS flags=ACK len=0
2023/08/10 05:10:48 http2: Transport received SETTINGS flags=ACK len=0
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: read HEADERS flags=END_HEADERS stream=1 len=1179
2023/08/10 05:10:48 http2: decoded hpack field header field ":status" = "200"
2023/08/10 05:10:48 http2: decoded hpack field header field "date" = "Thu, 10 Aug 2023 05:10:48 GMT"
2023/08/10 05:10:48 http2: decoded hpack field header field "content-length" = "354"
2023/08/10 05:10:48 http2: decoded hpack field header field "content-type" = "application/json"
2023/08/10 05:10:48 http2: decoded hpack field header field "server" = "Compute"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-gatewayversion" = "2.0.0"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-activity-id" = "b41d7a50-310e-429d-9239-0391396271ee"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-last-state-change-utc" = "Sat, 05 Aug 2023 10:33:26.920 GMT"
2023/08/10 05:10:48 http2: decoded hpack field header field "etag" = "\"0d02a39c-0000-0100-0000-60cc12330000\""
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-resource-quota" = "documentSize=51200;documentsSize=52428800;documentsCount=-1;collectionSize=52428800;"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-resource-usage" = "documentSize=1472;documentsSize=1128749;documentsCount=1439661;collectionSize=1508347;"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-schemaversion" = "1.16"
2023/08/10 05:10:48 http2: decoded hpack field header field "lsn" = "31645147"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-request-charge" = "1"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-alt-content-path" = "dbs/julienstest/colls/Subscriptions"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-content-path" = "39cKANngRhg="
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-quorum-acked-lsn" = "31645147"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-current-write-quorum" = "3"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-current-replica-set-size" = "4"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-documentdb-partitionkeyrangeid" = "0"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-xp-role" = "1"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-global-committed-lsn" = "31645147"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-number-of-read-regions" = "0"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-item-lsn" = "13"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-transport-request-id" = "108109"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-cosmos-llsn" = "31645147"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-cosmos-quorum-acked-llsn" = "31645147"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-cosmos-item-llsn" = "13"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-session-token" = "0:-1#31645147"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-request-duration-ms" = "0.332"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-serviceversion" = " version=2.14.0.0"
2023/08/10 05:10:48 http2: decoded hpack field header field "x-ms-cosmos-physical-partition-id" = "86290bdc-505e-40b0-895a-ad50cec73589"
2023/08/10 05:10:48 http2: Transport received HEADERS flags=END_HEADERS stream=1 len=1179
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: read DATA stream=1 len=354 data="{\"_attachments\":\"attachments\\/\",\"_etag\":\"\\\"0d02a39c-0000-0100-0000-60cc12330000\\\"\",\"_rid\":\"39cKANngRhgBAAAAAAAAAA==\",\"_self\":\"dbs\\/39cKAA==\\/colls\\/39cKANngRhg=\\/docs\\/39cKANngRhgBAAAAAAAAAA==\\/\",\"id\":\"c88b61d7-85aa-415f-8eda-4e73e5e811d5\",\"subscription\":{" (98 bytes omitted)
2023/08/10 05:10:48 http2: Transport received DATA stream=1 len=354 data="{\"_attachments\":\"attachments\\/\",\"_etag\":\"\\\"0d02a39c-0000-0100-0000-60cc12330000\\\"\",\"_rid\":\"39cKANngRhgBAAAAAAAAAA==\",\"_self\":\"dbs\\/39cKAA==\\/colls\\/39cKANngRhg=\\/docs\\/39cKANngRhgBAAAAAAAAAA==\\/\",\"id\":\"c88b61d7-85aa-415f-8eda-4e73e5e811d5\",\"subscription\":{" (98 bytes omitted)
2023/08/10 05:10:48 http2: Framer 0xc00019d6c0: read DATA flags=END_STREAM stream=1 len=0 data=""
2023/08/10 05:10:48 http2: Transport received DATA flags=END_STREAM stream=1 len=0 data=""
2023/08/10 05:10:48 Response: ==> REQUEST/RESPONSE (Try=3/19.033188ms, OpTime=9.023936561s) -- RESPONSE RECEIVED
   GET https://julienstestcosmos.documents.azure.com/dbs/julienstest/colls/Subscriptions/docs/c88b61d7-85aa-415f-8eda-4e73e5e811d5
   Authorization: REDACTED
   User-Agent: azsdk-go-azcosmos/v0.3.5 (go1.20.7; linux)
   X-Ms-Cosmos-Sdk-Supportedcapabilities: REDACTED
   X-Ms-Date: REDACTED
   X-Ms-Documentdb-Partitionkey: REDACTED
   X-Ms-Version: REDACTED
   --------------------------------------------------------------------------------
   RESPONSE Status: 200 OK
   Content-Length: 354
   Content-Type: application/json
   Date: Thu, 10 Aug 2023 05:10:48 GMT
   Etag: "0d02a39c-0000-0100-0000-60cc12330000"
   Lsn: REDACTED
   Server: Compute
   X-Ms-Activity-Id: REDACTED
   X-Ms-Alt-Content-Path: REDACTED
   X-Ms-Content-Path: REDACTED
   X-Ms-Cosmos-Item-Llsn: REDACTED
   X-Ms-Cosmos-Llsn: REDACTED
   X-Ms-Cosmos-Physical-Partition-Id: REDACTED
   X-Ms-Cosmos-Quorum-Acked-Llsn: REDACTED
   X-Ms-Current-Replica-Set-Size: REDACTED
   X-Ms-Current-Write-Quorum: REDACTED
   X-Ms-Documentdb-Partitionkeyrangeid: REDACTED
   X-Ms-Gatewayversion: REDACTED
   X-Ms-Global-Committed-Lsn: REDACTED
   X-Ms-Item-Lsn: REDACTED
   X-Ms-Last-State-Change-Utc: REDACTED
   X-Ms-Number-Of-Read-Regions: REDACTED
   X-Ms-Quorum-Acked-Lsn: REDACTED
   X-Ms-Request-Charge: REDACTED
   X-Ms-Request-Duration-Ms: REDACTED
   X-Ms-Resource-Quota: REDACTED
   X-Ms-Resource-Usage: REDACTED
   X-Ms-Schemaversion: REDACTED
   X-Ms-Serviceversion: REDACTED
   X-Ms-Session-Token: REDACTED
   X-Ms-Transport-Request-Id: REDACTED
   X-Ms-Xp-Role: REDACTED

2023/08/10 05:10:48 Retry: response 200
2023/08/10 05:10:48 Retry: exit due to non-retriable status code
2023/08/10 05:10:48 Obtained response with activity ID: 0a6c444d-f0a5-4987-b101-d7ae8c8f49ba

I'm able to see the connection been recover by the transport along with the runtime retrier policy :)

At least for cosmos db client, should these setting be enabled by default ?

I did not test this for ARM RPs, is ARM supporting HTTP/2 ? Or maybe doing a better job to close connection ?

  • Anything we should know about your environment.

Tested this on a vanilla Azure VM, but same issue is happening code hosted on AKS.

@github-actions github-actions bot added the needs-triage This is a new issue that needs to be triaged to the appropriate team. label Aug 10, 2023
@chlowell chlowell added needs-team-attention This issue needs attention from Azure service team or SDK team and removed needs-triage This is a new issue that needs to be triaged to the appropriate team. labels Aug 11, 2023
@ealsur
Copy link
Member

ealsur commented Aug 11, 2023

@chlowell This is not Cosmos exclusive, Cosmos does not customize the HTTP layer, it is common across all SDKs through azcore.

The point of this report is that there is a problem in Go that in order to address it, the core HTTP pipeline needs to have its settings adjusted.

@chlowell chlowell assigned jhendrixMSFT and chlowell and unassigned ealsur Aug 14, 2023
@chlowell
Copy link
Contributor

Thanks @ealsur, you're quite right, I read this too superficially when triaging it

@sgmiller
Copy link

sgmiller commented Aug 25, 2023

We're encountering this bug as well causing unavailability of the client for 15-20 minute periods. In our case for Azure Key Vault.

@RickWinter RickWinter added bug This issue requires a change to an existing behavior in the product in order to be resolved. Client This issue points to a problem in the data-plane of the library. and removed needs-team-attention This issue needs attention from Azure service team or SDK team labels Aug 30, 2023
@jhendrixMSFT
Copy link
Member

We're investigating the issue and will update as we have more info.

@jhendrixMSFT
Copy link
Member

While we investigate, you should be able to work around the issue by providing your own HTTP client with the health check enabled. Here's what this would look like for a Key Vault client.

// copied from azcore's default transport policy
customTransport := &http.Transport{
	Proxy: http.ProxyFromEnvironment,
	DialContext: (&net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
	}).DialContext,
	ForceAttemptHTTP2:     true,
	MaxIdleConns:          100,
	IdleConnTimeout:       90 * time.Second,
	TLSHandshakeTimeout:   10 * time.Second,
	ExpectContinueTimeout: 1 * time.Second,
	TLSClientConfig: &tls.Config{
		MinVersion:    tls.VersionTLS12,
		Renegotiation: tls.RenegotiateFreelyAsClient,
	},
}
if http2Transport, err := http2.ConfigureTransports(customTransport); err == nil {
	// if the connection has been idle for 10 seconds, send a ping frame for a health check
	http2Transport.ReadIdleTimeout = 10 * time.Second
	// if there's no response to the ping within 2 seconds, close the connection
	http2Transport.PingTimeout = 2 * time.Second
}

client, err := azkeys.NewClient("<your vault>", cred, &azkeys.ClientOptions{
	ClientOptions: azcore.ClientOptions{
		Transport: &http.Client{
			Transport: customTransport,
		},
	},
})

@jhendrixMSFT
Copy link
Member

Note that the workaround might make things worse if running in an environment where ICMP is blocked (e.g. firewall).

@jhendrixMSFT
Copy link
Member

And to be clear, the root cause is a bug in the Go standard library.

golang/go#59690

What we're evaluating here is a workaround until the above has been fixed.

@jim-minter
Copy link
Member

Note that the workaround might make things worse if running in an environment where ICMP is blocked (e.g. firewall).

Why would that be the case? Note: an HTTP/2 ping frame != an ICMP ping.

@jhendrixMSFT
Copy link
Member

I admit this is outside my area of expertise. You have a link (RFC etc) where I can learn more about this?

@jim-minter
Copy link
Member

https://httpwg.org/specs/rfc7540.html#PING

@jhendrixMSFT
Copy link
Member

Thanks this is helpful.

There's a conversation on the PR where concern was raised about the server not supporting HTTP/2 pings. Is that a real possibility, or would a spec compliant HTTP/2 implementation be required to support ping frames?

@serbrech
Copy link
Member

serbrech commented Sep 26, 2023

I suppose a badly implemented http/2 server could not implement ping, but it should really not be in production hosting an azure service then? :D

http2 spec defines 10 frames : https://httpwg.org/specs/rfc7540.html#rfc.section.11.2

I hope servers that claim supporting http/2 support the core 10 frames. There is nothing stating that they are optional

@julienstroheker
Copy link
Author

julienstroheker commented Sep 28, 2023

No harm making this optional. Even if most of Azure endpoint should support this.

On my side I was able to quickly validate if a server does support http2 ping using this python script:

import socket
import ssl
import h2.connection
import h2.events
import certifi
import sys


SERVER_NAME = "julienstestcosmos.documents.azure.com"
SERVER_PORT = 443

socket.setdefaulttimeout(15)
ctx = ssl.create_default_context(cafile=certifi.where())
ctx.set_alpn_protocols(['h2'])

s = socket.create_connection((SERVER_NAME, SERVER_PORT))
s = ctx.wrap_socket(s, server_hostname=SERVER_NAME)

c = h2.connection.H2Connection()
c.initiate_connection()
c.ping(b'ffffffff')
s.sendall(c.data_to_send())

response_stream_ended = False
while not response_stream_ended:
    data = s.recv(65536 * 1024)
    if not data:
            break
    events = c.receive_data(data)
    for event in events:
        if isinstance(event, h2.events.PingAckReceived):
            print(event)
            response_stream_ended = True
            break
$juliens > python h2ping.py
<PingAckReceived ping_data:6666666666666666>

@ShreyasRmsft
Copy link

@julienstroheker @jim-minter @sgmiller is there a particular time of day you see the timeouts occur more frequently than others? In our case, for Cosmos DB in particular, we always see connections being dropped or going bad around 23:00 UTC.

@jhendrixMSFT
Copy link
Member

If we want to make it configurable, we have a few options.

  • if it's off by default, seems like it ends up being a knob that everybody will want to turn on
    • in addition, if/when we decide to turn it on by default, we have this vestigial configuration knob (changing the default to true is problematic as the zero-value is false)
  • we could make it on by default, with an option to disable it if there's a problem
  • or we could make it opt-in via a build tag, but that's not very discoverable
    • the idea being here that we could eventually remove the build tag and make it on by default once we gain confidence

Conceptually, it seems like this is something we'd always want to enable, and I have a hard time believing that we have Azure hosts that don't support HTTP/2 ping.

@jhendrixMSFT
Copy link
Member

Or perhaps we expose options for ReadIdleTimeout and PingTimeout values instead of a boolean, allowing for fine-grained customization.

@julienstroheker
Copy link
Author

@julienstroheker @jim-minter @sgmiller is there a particular time of day you see the timeouts occur more frequently than others? In our case, for Cosmos DB in particular, we always see connections being dropped or going bad around 23:00 UTC.

this is random for us and happening more often on busier region

@julienstroheker
Copy link
Author

@jhendrixMSFT don't have a strong opinion. Maybe you can expose it as optional for now, gain confidence then default it ?

I also feel that having a wiki page might help folks to understand to usage of this. The page can also contains clients (endpoints) that we know so far do support HTTP2 ping (cosmos, event grid, arm, etc...)

@serbrech
Copy link
Member

serbrech commented Oct 1, 2023

I don't think any http server that support http2 would not have support ping frames. It's part of the spec.

@sgmiller
Copy link

sgmiller commented Oct 1, 2023 via email

@julienstroheker
Copy link
Author

I don't think any http server that support http2 would not have support ping frames. It's part of the spec.

You'll be surprised the number of servers I saw in the past not supporting ping frames lol.

But all Azure Endpoints should respect the RFC. This should be defaulted, if there is an issue with it, the teams owning the DP should fix it, not the client/caller.

@jhendrixMSFT
Copy link
Member

Fix will be including in November's release of azcore@v1.9.0

@julienstroheker
Copy link
Author

julienstroheker commented Nov 22, 2023

@jhendrixMSFT just want to mention that we are seeing issue by using HTTP/2 ping with eventgrid client. We reverted back our transport to use http/1 for now. Might not be 100% related but the revert mitigated our issues for now...

@jhendrixMSFT
Copy link
Member

@julienstroheker thanks for the update. Do you have any additional details you can share about how this manifests?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Azure.Core bug This issue requires a change to an existing behavior in the product in order to be resolved. Client This issue points to a problem in the data-plane of the library.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants