
Add connection timeout to the cluster client #834

Merged: 7 commits merged from the connection-timeout branch into redis-rs:main on Dec 11, 2023

Conversation

@nihohit (Contributor) commented May 2, 2023

This allows users to define a connection timeout for the async cluster, which will cause the refresh_slots action to time out if it takes too long to connect to a node.
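
For illustration, a minimal sketch of how a caller might opt into the new option, assuming a locally reachable cluster node at redis://127.0.0.1:6379/, the cluster-async feature, and tokio as the runtime; the timeout value is arbitrary:

use redis::cluster::ClusterClientBuilder;
use std::time::Duration;

#[tokio::main]
async fn main() -> redis::RedisResult<()> {
    // The connection timeout bounds each attempt to connect to a node,
    // including the connections opened while refreshing slots.
    let client = ClusterClientBuilder::new(vec!["redis://127.0.0.1:6379/"])
        .connection_timeout(Duration::from_secs(1))
        .build()?;
    let mut conn = client.get_async_connection().await?;
    let pong: String = redis::cmd("PING").query_async(&mut conn).await?;
    println!("{pong}");
    Ok(())
}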

@nihohit (Contributor Author) commented May 2, 2023

Note: This is a proposed alternative to #833

@jaymell (Contributor) commented May 3, 2023

I think this looks good -- simple and non-breaking!

@nihohit (Contributor Author) commented May 3, 2023

3rd option: #835

@nihohit marked this pull request as ready for review on May 4, 2023 16:14
@nihohit (Contributor Author) commented May 4, 2023

@jaymell right, moved to ready for review. Also added support for the sync cluster, for consistency - I didn't want to have a configuration in ClusterParams that applies to only one flavor of the client.
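
For the sync cluster client, a minimal sketch under the same assumptions (node address and timeout value are illustrative); the builder option is shared, only the connection type differs:

use redis::cluster::ClusterClientBuilder;
use std::time::Duration;

fn main() -> redis::RedisResult<()> {
    // Same builder option as the async client; get_connection() returns the
    // blocking cluster connection.
    let client = ClusterClientBuilder::new(vec!["redis://127.0.0.1:6379/"])
        .connection_timeout(Duration::from_secs(1))
        .build()?;
    let mut conn = client.get_connection()?;
    let pong: String = redis::cmd("PING").query(&mut conn)?;
    println!("{pong}");
    Ok(())
}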

@nihohit changed the title from "Add connection timeout to the async cluster client" to "Add connection timeout to the cluster client" on May 4, 2023
@nolik commented May 18, 2023

> @jaymell right, moved to ready for review. Also added support for the sync cluster, for consistency - I didn't want to have a configuration in ClusterParams that applies to only one flavor of the client.

Hey, mates. I'm using ConnectionManager and really waiting for a connection timeout configuration for ConnectionManager / MultiplexedConnection.

@jaymell @nihohit if I'm able to help speed up delivery of this timeout feature, just let me know.

@nolik commented Jun 9, 2023

@jaymell Hey, mate. Would you be able to review it?
It seems to be passing all tests.

@nolik commented Jun 19, 2023

@nihohit seems you have conflicts after the latest merges.

@nihohit (Contributor Author) commented Jun 20, 2023

rebased

@nolik commented Jun 24, 2023

@jaymell any chance to review this PR?

@jaymell (Contributor) commented Jun 25, 2023

> @jaymell any chance to review this PR?

Yep, will get to it. Thanks for your patience.

@nihohit (Contributor Author) commented Jun 25, 2023

@nolik you're welcome to join the discussion in #866 on what would be the best way to implement this.

@nihohit (Contributor Author) commented Jul 9, 2023

rebased over #875

@nihohit force-pushed the connection-timeout branch 3 times, most recently from 37a7c74 to 5a8d2ad, on July 11, 2023 06:13
@jaymell (Contributor) commented Jul 21, 2023

I've been trying to benchmark the response-timeout changes locally and haven't found a significant difference, though I don't particularly trust my results. Any lingering doubts about the performance impacts of those changes?

@nihohit (Contributor Author) commented Jul 21, 2023

I've received roughly the same results in repeated benchmarks. I tried moving things around, but can't find either the reason or a solution.

@nihohit (Contributor Author) commented Jul 24, 2023

Fixed breaking changes in Client. I don't think we can avoid breaking the Connect/aio::Connect traits.

@kamulos (Contributor) commented Nov 30, 2023

I tried to use this branch in my service, and it is broken for me. If I trigger a timeout I get a panic in get_random_connection (https://github.com/redis-rs/redis-rs/blob/main/redis/src/cluster_async/mod.rs#L1292) and the connection stays broken after that. I have to reestablish a completely new connection to continue making Redis requests.

How to Reproduce

I am using a one-node Redis cluster for testing. There I use the command DEBUG SLEEP 5 to simulate a situation that would ideally result in a timeout on the redis-rs side.

The code I use:
use redis::{cluster::ClusterClientBuilder, cluster_async::ClusterConnection, Value};
use std::{
    sync::atomic::{AtomicU64, Ordering},
    time::Duration,
};

static REQUEST_COUNTER: AtomicU64 = AtomicU64::new(0);

#[tokio::main]
async fn main() {
    let connection = ClusterClientBuilder::new(vec!["redis://:password@redis-0"])
        .connection_timeout(Duration::from_millis(400))
        .response_timeout(Duration::from_millis(400))
        .build()
        .unwrap()
        .get_async_connection()
        .await
        .unwrap();

    loop {
        // Spawn a new request every 250 ms so several requests overlap while
        // DEBUG SLEEP on the node triggers timeouts.
        let connection = connection.clone();
        tokio::spawn(do_request(connection));

        tokio::time::sleep(Duration::from_millis(250)).await;
    }
}

async fn do_request(mut connection: ClusterConnection) {
    redis::cmd("role")
        .query_async::<_, Value>(&mut connection)
        .await
        .unwrap();
    let counter = REQUEST_COUNTER.fetch_add(1, Ordering::SeqCst);
    println!("request {counter}");
}

@nihohit (Contributor Author) commented Dec 4, 2023

@kamulos can you please write a self-contained test to demonstrate the issue? It's unclear how/when DEBUG SLEEP 5 comes into action here.
I also rebased this branch; please check if the issue persists. This test doesn't crash locally:

#[test]
fn test_response_timeout_reuse() {
    let cluster = TestClusterContext::new(3, 0);
    block_on_all(async move {
        let mut connection = cluster.async_connection().await;
        let mut cmd = redis::Cmd::new();
        cmd.arg("BLPOP").arg("foo").arg(0); // 0 timeout blocks indefinitely
        let result = connection.req_packed_command(&cmd).await;
        assert!(result.is_err());
        assert!(result.unwrap_err().is_timeout());

        loop {
            let result: RedisResult<Value> = redis::cmd("GET")
                .arg("foo")
                .query_async(&mut connection.clone())
                .await;
            // REQUEST_COUNTER as defined in the reproduction code above.
            let counter = REQUEST_COUNTER.fetch_add(1, Ordering::SeqCst);
            println!("request {counter} {}", result.is_ok());
        }
    });
}

@jaymell (Contributor) commented Dec 8, 2023

Let's get it in!

@nihohit (Contributor Author) commented Dec 8, 2023

@kamulos I'd like your input before merging this. Can you still reproduce the issue?
@jaymell I think I'll wait for a reply until Wednesday the 13th, and if we don't hear anything more, I'll merge this.

@kamulos (Contributor) commented Dec 8, 2023

@nihohit sorry for being a bit quiet. I'll be able to test it later today.

@kamulos (Contributor) commented Dec 11, 2023

@nihohit I finally got to testing it again. I am not entirely sure what is happening, but my educated guess would be something along these lines:

In my test I have a Redis node where I execute the command DEBUG SLEEP 5. This is difficult for redis-rs to handle because not only does the request time out, but so does any attempt to reconnect to the cluster. First a timeout in refresh_connections occurs, and after that a panic in get_random_connection (redis-rs/redis/src/cluster_async/mod.rs:1301:10): "Connections is empty".

I don't believe this pull request is at fault, because the panic in get_random_connection is obviously a preexisting condition. There is, however, a very bad interaction here:

When the panic occurs, it is ultimately caught by tokio and other tasks are able to continue running. However, the ClusterConnection is broken at that point and does not recover until the whole program is restarted. This is quite critical in my use case.

If I were to speculate about why the connection is broken, I think it is just not panic-safe and we somehow end up in an inconsistent state. What I observe is that after the panic, the first few requests fail with the error "Unable to receive command". Maybe those requests were already in flight during the panic? Later requests fail with the error "Unable to send command".

For some reason the mpsc channel is broken, but the stream future (in ClusterConnection::new) does not seem to have terminated.

In my application I solved the timeouts by wrapping the ClusterConnection and putting a timeout around req_packed_command (see the sketch below). This seems to work fine and is able to recover in this case.

So in summary, I think there is a critical flaw in the cluster_async module that needs to be solved, but this pull request probably can't do anything about it.
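
For reference, a minimal sketch of the wrapper workaround described above, assuming tokio and the cluster-async feature; the type name and the error mapping are illustrative and not part of redis-rs:

use redis::{aio::ConnectionLike, cluster_async::ClusterConnection, Cmd, ErrorKind, RedisError, Value};
use std::time::Duration;

// Hypothetical wrapper around the async cluster connection.
struct TimedClusterConnection {
    inner: ClusterConnection,
    request_timeout: Duration,
}

impl TimedClusterConnection {
    async fn req_packed_command(&mut self, cmd: &Cmd) -> Result<Value, RedisError> {
        // Bound the whole request so a stalled node surfaces as an error
        // instead of hanging the caller.
        tokio::time::timeout(self.request_timeout, self.inner.req_packed_command(cmd))
            .await
            .map_err(|_| RedisError::from((ErrorKind::IoError, "request timed out")))?
    }
}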

@nihohit (Contributor Author) commented Dec 11, 2023

Yes, I assume that the timeouts caused connections to be removed from the connections map, which eventually causes the panic. I believe #968 will help there, but as you mentioned, it's not caused by this change.
Thanks for the detective work!

@nihohit merged commit a82252d into redis-rs:main on Dec 11, 2023
@shachlanAmazon deleted the connection-timeout branch on December 14, 2023 08:43