
Proactive connection management #33983

Merged


@jeet1995 jeet1995 commented Mar 13, 2023

Features

Minimum connection pool size per endpoint

Using a system config, the application developer can specify the minimum number of connections to open for each endpoint, for all physical partitions across all containers whose connections are opened through the openConnectionsAndInitCaches flow, either via CosmosClient / CosmosAsyncClient or CosmosContainer / CosmosAsyncContainer. This helps alleviate the cold-start latency associated with establishing connections on demand when the SDK makes requests.

Setting the minimum connection pool size per endpoint through a system config

// initialize the min. no. of connections required for all endpoints
int minConnectionPoolSizePerEndpoint = 1;
System.setProperty("COSMOS.MIN_CONNECTION_POOL_SIZE_PER_ENDPOINT", String.valueOf(minConnectionPoolSizePerEndpoint));

Aggressive connection establishment time duration

In the openConnectionsAndInitCaches flow for CosmosClient / CosmosAsyncClient, a duration can be specified within which connections are established aggressively in a blocking manner. Once this duration elapses, the remaining connections are established defensively in a non-blocking manner. This improves on the previous flow, which opened connections only in a blocking manner and therefore prevented the SDK from making requests until all connections were established.

Setting the aggressive proactive connection establishment duration through the public API

// containers to which connections are to be proactively opened
CosmosContainerIdentity containerIdentity1 = new CosmosContainerIdentity("sample_db_id", "sample_container_id_1");
CosmosContainerIdentity containerIdentity2 = new CosmosContainerIdentity("sample_db_id", "sample_container_id_2");

// no. of regions to which connections are to be proactively opened
int proactiveConnectionRegionsCount = 1;

// duration for which connections are to be aggressively opened in a blocking manner
// beyond this duration connections will be defensively opened in a non-blocking manner
Duration aggressiveWarmupDuration = Duration.ofSeconds(1);

// building the client along with opening connections
CosmosAsyncClient clientWithOpenConnections = new CosmosClientBuilder()
          .endpoint("<account URL goes here>")
          .key("<account key goes here>")
          .endpointDiscoveryEnabled(true)
          .preferredRegions(Arrays.asList("sample_region_1", "sample_region_2"))
          .openConnectionsAndInitCaches(new CosmosContainerProactiveInitConfigBuilder(Arrays.asList(containerIdentity1, containerIdentity2))
                .setProactiveConnectionRegionsCount(proactiveConnectionRegionsCount)
                .setAggressiveWarmupDuration(aggressiveWarmupDuration)
                .build())
          .directMode()
          .buildAsyncClient();
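Once the aggressive warmup duration elapses, the remaining connection-open work is not abandoned; it continues in the background through the defensive, non-blocking flow, so the client can start serving requests before warmup has fully completed.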

Proactive connection management during connection close and reset

This PR proactively re-establishes connections that are reset by the server, gracefully closed, or closed because the channel was unhealthy or idle.

Design considerations

  • Fairness: Open connection tasks are scheduled so that tasks for a given endpoint are spaced out as much as possible.
  • Retries: A connection attempt could fail, and such failures could be transient; therefore, unsuccessful connection attempts are retried with a backoff.
  • Concurrency control: Since the SDK could open connections in the background, it is important to keep this concurrency as low as possible (1 by default). This could be too low for some applications, so the COSMOS.DEFENSIVE_WARMUP_CONCURRENCY system config can be used to modify the defensive concurrency, as shown in the snippet below.
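
A minimal sketch of overriding the defensive warmup concurrency through the system config named above (the value 2 is purely illustrative):

// raise the concurrency used for background (defensive) connection warmup from its default of 1
int defensiveWarmupConcurrency = 2;
System.setProperty("COSMOS.DEFENSIVE_WARMUP_CONCURRENCY", String.valueOf(defensiveWarmupConcurrency));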

Benchmarking results

Insights

  • Cold start latency is reduced for workloads with proactive connection management.
  • The difference in cold start latency between workloads with and without proactive connection management is smaller at lower concurrencies.
  • Cold start latencies are higher when the aggressive connection establishment duration is shorter.

Benchmark charts (attached as images) cover the following configurations:

  • Reads (Provisioned throughput: 1 million RUs, Concurrency: 512, Documents to read: 10 thousand, Operations count: 1 million)
  • Reads (Provisioned throughput: 1 million RUs, Concurrency: 128, Documents to read: 10 thousand, Operations count: 1 million)
  • Reads (Provisioned throughput: 1 million RUs, Concurrency: 512, Documents to read: 10 thousand, Operations count: 1 million, proactive connection management with aggressive connection establishment duration)
  • Reads (Provisioned throughput: 1 million RUs, Concurrency: 128, Documents to read: 10 thousand, Operations count: 1 million, proactive connection management with aggressive connection establishment duration)
  • Reads (Provisioned throughput: 20 thousand RUs, Concurrency: 10, Documents to read: 1 thousand, Operations count: 100 thousand)
  • Writes (Provisioned throughput: 1 million RUs, Concurrency: 512, Operations count: 1 million)
  • Writes (Provisioned throughput: 1 million RUs, Concurrency: 128, Operations count: 1 million)
  • Writes (Provisioned throughput: 1 million RUs, Concurrency: 512, Operations count: 1 million, proactive connection management with aggressive connection establishment duration)
  • Writes (Provisioned throughput: 1 million RUs, Concurrency: 128, Operations count: 1 million, proactive connection management with aggressive connection establishment duration)
  • Writes (Provisioned throughput: 20 thousand RUs, Concurrency: 10, Operations count: 100 thousand)

@ghost ghost added the Cosmos label Mar 13, 2023
@jeet1995 jeet1995 changed the title Added logic to proactively open connections to replicas when connecti… Added logic to proactively open connections to replicas when connections are closed or reset Mar 13, 2023
@jeet1995 jeet1995 changed the title Added logic to proactively open connections to replicas when connections are closed or reset No review: Added logic to proactively open connections to replicas when connections are closed or reset Mar 13, 2023
@jeet1995 jeet1995 changed the title No review: Added logic to proactively open connections to replicas when connections are closed or reset No review: Proactively open connections to replicas when connections are closed or reset and warm up connections in the background Mar 15, 2023
…oactiveConnectionManagementForBrokenConnections

# Conflicts:
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/IOpenConnectionsHandler.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/GatewayAddressCache.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/HttpTransportClient.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/RntbdTransportClient.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/SharedTransportClient.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/TransportClient.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdOpenConnectionsHandler.java
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/RntbdServiceEndpoint.java
#	sdk/cosmos/azure-cosmos/src/test/java/com/azure/cosmos/implementation/SessionNotAvailableRetryTest.java
#	sdk/cosmos/azure-cosmos/src/test/java/com/azure/cosmos/implementation/directconnectivity/RntbdTransportClientTest.java
@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12
Member

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12 xinlian12 force-pushed the ProactiveConnectionManagementForBrokenConnections branch from 9dd8a97 to 52f8b03 on April 30, 2023 16:19
@xinlian12
Member

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@xinlian12 xinlian12 left a comment


LGTM, thanks for the great work :)

Member

@FabianMeiswinkel FabianMeiswinkel left a comment


LGTM - thanks

jeet1995 and others added 7 commits April 30, 2023 15:01
…ttps://github.com/jeet1995/azure-sdk-for-java into ProactiveConnectionManagementForBrokenConnections

# Conflicts:
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/ProactiveOpenConnectionsProcessor.java
…ttps://github.com/jeet1995/azure-sdk-for-java into ProactiveConnectionManagementForBrokenConnections

# Conflicts:
#	sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/rntbd/ProactiveOpenConnectionsProcessor.java
@jeet1995
Member Author

jeet1995 commented May 1, 2023

/azp run java - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jeet1995
Member Author

jeet1995 commented May 1, 2023

Re-ran the following tests locally.

  • ReadMyWritesConsistencyTest - all tests pass locally
  • CosmosItemTest - all tests pass locally
  • FaultInjectionServerErrorRuleTests - investigate cases where the write-region server is gone and yet the fault injection rule is applied for a query operation.

@jeet1995
Member Author

jeet1995 commented May 1, 2023

/check-enforcer override

@jeet1995 jeet1995 merged commit 176e4e3 into Azure:main May 1, 2023
70 of 73 checks passed
@jeet1995
Member Author

jeet1995 commented May 2, 2023

fixes #33082
fixes #33079
fixes #33080
fixes #33087
