Bug in retry-handling in the PubSub-client #3203

sondrfos · 2024-04-10T08:41:11Z

Version:
redis-py: 5.0.3
redis: 7.2.4

Platform:
Python 3.12.2 running on Debian bookworm

Description:
First off I want to say thanks for maintaining this library :)

So to the issue at hand:
It seems like the retry-functionality in the PubSub-client (both async and sync) is not working correctly.
To simulate a crash of redis I am closing and restarting redis while our system is running. Even though our connection is set up with an infinite amount of retry, the PubSub-client always crashes and never recovers.

Having dug though the source code I have found the pathways causing the crash.

redis-py/redis/asyncio/client.py

Lines 885 to 896 in 1784b37

    
               async def _execute(self, conn, command, *args, **kwargs): 
        
                   """ 
        
                   Connect manually upon disconnection. If the Redis server is down, 
        
                   this will fail and raise a ConnectionError as desired. 
        
                   After reconnection, the ``on_connect`` callback should have been 
        
                   called by the # connection to resubscribe us to any channels and 
        
                   patterns we were previously listening to 
        
                   """ 
        
                   return await conn.retry.call_with_retry( 
        
                       lambda: command(*args, **kwargs), 
        
                       lambda error: self._disconnect_raise_connect(conn, error), 
        
                   )

The _execute-function runs periodicially, and whenever redis is closed the command fails.
This causes the _disconnect_raise_connect-function to be called:

redis-py/redis/asyncio/client.py

Lines 870 to 883 in 1784b37

    
               async def _disconnect_raise_connect(self, conn, error): 
        
                   """ 
        
                   Close the connection and raise an exception 
        
                   if retry_on_error is not set or the error is not one 
        
                   of the specified error types. Otherwise, try to 
        
                   reconnect 
        
                   """ 
        
                   await conn.disconnect() 
        
                   if ( 
        
                       conn.retry_on_error is None 
        
                       or isinstance(error, tuple(conn.retry_on_error)) is False 
        
                   ): 
        
                       raise error 
        
                   await conn.connect()

_disconnect_raise_connect closes the connection, conn.retry_on_error is True making the function skip raising error.
This is followed by an attempt at reconnecting.

redis-py/redis/asyncio/connection.py

Lines 269 to 284 in 1784b37

    
           async def connect(self): 
        
               """Connects to the Redis server if not already connected""" 
        
               if self.is_connected: 
        
                   return 
        
               try: 
        
                   await self.retry.call_with_retry( 
        
                       lambda: self._connect(), lambda error: self.disconnect() 
        
                   ) 
        
               except asyncio.CancelledError: 
        
                   raise  # in 3.7 and earlier, this is an Exception, not BaseException 
        
               except (socket.timeout, asyncio.TimeoutError): 
        
                   raise TimeoutError("Timeout connecting to server") 
        
               except OSError as e: 
        
                   raise ConnectionError(self._error_message(e)) 
        
               except Exception as exc: 
        
                   raise ConnectionError(exc) from exc

Since we are not currently connected, the first if-statement is False. A call with retry is made to the _connect-function.

redis-py/redis/asyncio/connection.py

Lines 678 to 700 in 1784b37

    
           async def _connect(self): 
        
               """Create a TCP socket connection""" 
        
               async with async_timeout(self.socket_connect_timeout): 
        
                   reader, writer = await asyncio.open_connection( 
        
                       **self._connection_arguments() 
        
                   ) 
        
               self._reader = reader 
        
               self._writer = writer 
        
               sock = writer.transport.get_extra_info("socket") 
        
               if sock: 
        
                   sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1) 
        
                   try: 
        
                       # TCP_KEEPALIVE 
        
                       if self.socket_keepalive: 
        
                           sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1) 
        
                           for k, v in self.socket_keepalive_options.items(): 
        
                               sock.setsockopt(socket.SOL_TCP, k, v) 
        
                   except (OSError, TypeError): 
        
                       # `socket_keepalive_options` might contain invalid options 
        
                       # causing an error. Do not leave the connection open. 
        
                       writer.close() 
        
                       raise

Inside the _connect-function an attempt is made to connect to redis. Since redis is unavailable an OSError is created. OSError is not in the list of supported_errors causing it to bypass the error handling in the call_with_retry-function and is propagated to the connect-function. Here it is converted to a ConnectionError and is propagated all the way to the top unhandled.

Is this intended behaviour?
Is this something that could be handled in a different way?

We have temporarily fixed this by adding OSError to the list of supported_errors, but this feels like quite a hacky solution.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug in retry-handling in the PubSub-client #3203

Bug in retry-handling in the PubSub-client #3203

sondrfos commented Apr 10, 2024 •

edited

Bug in retry-handling in the PubSub-client #3203

Bug in retry-handling in the PubSub-client #3203

Comments

sondrfos commented Apr 10, 2024 • edited

sondrfos commented Apr 10, 2024 •

edited