
Add connection timeout to the cluster client #834

Merged: 7 commits merged from the connection-timeout branch into redis-rs:main on Dec 11, 2023

Conversation

@nihohit (Contributor) commented May 2, 2023

This allows users to define a connection timeout for the async cluster, which will cause the refresh_slots action to time out if it takes too long to connect to a node.
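
For illustration, a minimal sketch of how a caller might opt into the new option, assuming a locally reachable cluster node at redis://127.0.0.1:6379/, the cluster-async feature, and tokio as the runtime; the timeout value is arbitrary:

use redis::cluster::ClusterClientBuilder;
use std::time::Duration;

#[tokio::main]
async fn main() -> redis::RedisResult<()> {
    // The connection timeout bounds each attempt to connect to a node,
    // including the connections opened while refreshing slots.
    let client = ClusterClientBuilder::new(vec!["redis://127.0.0.1:6379/"])
        .connection_timeout(Duration::from_secs(1))
        .build()?;
    let mut conn = client.get_async_connection().await?;
    let pong: String = redis::cmd("PING").query_async(&mut conn).await?;
    println!("{pong}");
    Ok(())
}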

@nihohit (Contributor Author) commented May 2, 2023

Note: This is a proposed alternative to #833

@jaymell (Contributor) commented May 3, 2023

I think this looks good -- simple and non-breaking!

@nihohit (Contributor Author) commented May 3, 2023

3rd option: #835

@nihohit marked this pull request as ready for review on May 4, 2023 16:14
@nihohit (Contributor Author) commented May 4, 2023

@jaymell right, moved to ready for review. Also added support for the sync cluster, for consistency - I didn't want to have a configuration in ClusterParams that applies to only one flavor of the client.
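
For the sync cluster client, a minimal sketch under the same assumptions (node address and timeout value are illustrative); the builder option is shared, only the connection type differs:

use redis::cluster::ClusterClientBuilder;
use std::time::Duration;

fn main() -> redis::RedisResult<()> {
    // Same builder option as the async client; get_connection() returns the
    // blocking cluster connection.
    let client = ClusterClientBuilder::new(vec!["redis://127.0.0.1:6379/"])
        .connection_timeout(Duration::from_secs(1))
        .build()?;
    let mut conn = client.get_connection()?;
    let pong: String = redis::cmd("PING").query(&mut conn)?;
    println!("{pong}");
    Ok(())
}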

@nihohit changed the title from "Add connection timeout to the async cluster client" to "Add connection timeout to the cluster client" on May 4, 2023
@nolik commented May 18, 2023

> @jaymell right, moved to ready for review. Also added support for the sync cluster, for consistency - I didn't want to have a configuration in ClusterParams that applies to only one flavor of the client.

Hey, mates. I'm using ConnectionManager and really waiting for a connection timeout configuration for ConnectionManager / MultiplexedConnection.

@jaymell @nihohit if I'm able to help speed up delivery of this timeout feature, just let me know.

@nolik commented Jun 9, 2023

@jaymell Hey, mate. Would you be able to review it?
It seems to be passing all tests.

@nolik commented Jun 19, 2023

@nihohit seems you have conflicts after the latest merges.

@nihohit (Contributor Author) commented Jun 20, 2023

rebased

@nolik commented Jun 24, 2023

@jaymell any chance to review this PR?

@jaymell (Contributor) commented Jun 25, 2023

> @jaymell any chance to review this PR?

Yep, will get to it. Thanks for your patience.

@nihohit (Contributor Author) commented Jun 25, 2023

@nolik you're welcome to join the discussion in #866 on what would be the best way to implement this.

@nihohit (Contributor Author) commented Jul 9, 2023

rebased over #875

@nihohit force-pushed the connection-timeout branch 3 times, most recently from 37a7c74 to 5a8d2ad, on July 11, 2023 06:13
@jaymell (Contributor) commented Jul 21, 2023

I've been trying to benchmark the response-timeout changes locally and haven't found a significant difference, though I don't particularly trust my results. Any lingering doubts about the performance impacts of those changes?

@nihohit (Contributor Author) commented Jul 21, 2023

I've received roughly the same results in repeated benchmarks. I tried moving things around, but can't find either the reason or a solution.

@nihohit (Contributor Author) commented Jul 24, 2023

Fixed breaking changes in Client. I don't think we can avoid breaking the Connect/aio::Connect traits.

@kamulos (Contributor) commented Nov 30, 2023

I tried to use this branch in my service, and it is broken for me. If I trigger a timeout I get a panic in get_random_connection (https://github.com/redis-rs/redis-rs/blob/main/redis/src/cluster_async/mod.rs#L1292) and the connection stays broken after that. I have to reestablish a completely new connection to continue making Redis requests.

How to Reproduce

I am using a one-node Redis cluster for testing. There I use the command DEBUG SLEEP 5 to simulate a situation that would ideally result in a timeout on the redis-rs side.

The code I use:
use redis::{cluster::ClusterClientBuilder, cluster_async::ClusterConnection, Value};
use std::{
    sync::atomic::{AtomicU64, Ordering},
    time::Duration,
};

static REQUEST_COUNTER: AtomicU64 = AtomicU64::new(0);

#[tokio::main]
async fn main() {
    let connection = ClusterClientBuilder::new(vec!["redis://:password@redis-0"])
        .connection_timeout(Duration::from_millis(400))
        .response_timeout(Duration::from_millis(400))
        .build()
        .unwrap()
        .get_async_connection()
        .await
        .unwrap();

    loop {
        // Spawn a new request every 250 ms so several requests overlap while
        // DEBUG SLEEP on the node triggers timeouts.
        let connection = connection.clone();
        tokio::spawn(do_request(connection));

        tokio::time::sleep(Duration::from_millis(250)).await;
    }
}

async fn do_request(mut connection: ClusterConnection) {
    redis::cmd("role")
        .query_async::<_, Value>(&mut connection)
        .await
        .unwrap();
    let counter = REQUEST_COUNTER.fetch_add(1, Ordering::SeqCst);
    println!("request {counter}");
}

@nihohit (Contributor Author) commented Dec 4, 2023

@kamulos can you please write a self-contained test to demonstrate the issue? It's unclear how/when DEBUG SLEEP 5 comes into action here.
I also rebased this branch; please check if the issue persists. This test doesn't crash locally:

#[test]
fn test_response_timeout_reuse() {
    let cluster = TestClusterContext::new(3, 0);
    block_on_all(async move {
        let mut connection = cluster.async_connection().await;
        let mut cmd = redis::Cmd::new();
        cmd.arg("BLPOP").arg("foo").arg(0); // 0 timeout blocks indefinitely
        let result = connection.req_packed_command(&cmd).await;
        assert!(result.is_err());
        assert!(result.unwrap_err().is_timeout());

        loop {
            let result: RedisResult<Value> = redis::cmd("GET")
                .arg("foo")
                .query_async(&mut connection.clone())
                .await;
            // REQUEST_COUNTER as defined in the reproduction code above.
            let counter = REQUEST_COUNTER.fetch_add(1, Ordering::SeqCst);
            println!("request {counter} {}", result.is_ok());
        }
    });
}

@jaymell (Contributor) commented Dec 8, 2023

Let's get it in!

@nihohit (Contributor Author) commented Dec 8, 2023

@kamulos I'd like your input before merging this. Can you still reproduce the issue?
@jaymell I think I'll wait for a reply until Wednesday the 13th, and if we don't hear anything more, I'll merge this.

@kamulos (Contributor) commented Dec 8, 2023

@nihohit sorry for being a bit quiet. I'll be able to test it later today.

@kamulos (Contributor) commented Dec 11, 2023

@nihohit I finally got to testing it again. I am not entirely sure what is happening, but my educated guess would be something along these lines:

In my test I have a Redis node where I execute the command DEBUG SLEEP 5. This is difficult for redis-rs to handle because not only does the request time out, but so does any attempt to reconnect to the cluster. First a timeout in refresh_connections occurs, and after that a panic in get_random_connection (redis-rs/redis/src/cluster_async/mod.rs:1301:10): "Connections is empty".

I don't believe this pull request is at fault, because the panic in get_random_connection is obviously a preexisting condition. There is, however, a very bad interaction here:

When the panic occurs, it is ultimately caught by tokio and other tasks are able to continue running. However, the ClusterConnection is broken at that point and does not recover until the whole program is restarted. This is quite critical in my use case.

If I were to speculate about why the connection is broken, I think it is just not panic-safe and we somehow end up in an inconsistent state. What I observe is that after the panic, the first few requests fail with the error "Unable to receive command". Maybe those requests were already in flight during the panic? Later requests fail with the error "Unable to send command".

For some reason the mpsc channel is broken, but the stream future (in ClusterConnection::new) does not seem to have terminated.

In my application I solved the timeouts by wrapping the ClusterConnection and putting a timeout around req_packed_command (see the sketch below). This seems to work fine and is able to recover in this case.

So in summary, I think there is a critical flaw in the cluster_async module that needs to be solved, but this pull request probably can't do anything about it.
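
For reference, a minimal sketch of the wrapper workaround described above, assuming tokio and the cluster-async feature; the type name and the error mapping are illustrative and not part of redis-rs:

use redis::{aio::ConnectionLike, cluster_async::ClusterConnection, Cmd, ErrorKind, RedisError, Value};
use std::time::Duration;

// Hypothetical wrapper around the async cluster connection.
struct TimedClusterConnection {
    inner: ClusterConnection,
    request_timeout: Duration,
}

impl TimedClusterConnection {
    async fn req_packed_command(&mut self, cmd: &Cmd) -> Result<Value, RedisError> {
        // Bound the whole request so a stalled node surfaces as an error
        // instead of hanging the caller.
        tokio::time::timeout(self.request_timeout, self.inner.req_packed_command(cmd))
            .await
            .map_err(|_| RedisError::from((ErrorKind::IoError, "request timed out")))?
    }
}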

@nihohit (Contributor Author) commented Dec 11, 2023

Yes, I assume that the timeouts caused connections to be removed from the connections map, which eventually causes the panic. I believe #968 will help there, but as you mentioned, it's not caused by this change.
Thanks for the detective work!

@nihohit merged commit a82252d into redis-rs:main on Dec 11, 2023
@shachlanAmazon deleted the connection-timeout branch on December 14, 2023 08:43