
waitUntilCondition method throws KubernetesClientException after approximately one hour with sufficiently long timeout #5379

Closed
ywacko opened this issue Aug 9, 2023 · 6 comments · Fixed by #5473

ywacko commented Aug 9, 2023

Describe the bug

I encountered an issue while using the waitUntilCondition method from the Fabric8 Kubernetes Client library.

code:
client.pods()
    .inNamespace(namespace)
    .withName(pod.getMetadata().getName())
    .waitUntilCondition(
        o -> o.getStatus().getPhase().equals(PodConstant.POD_FAILED_PHASE)
            || o.getStatus().getPhase().equals(PodConstant.POD_SUCCEEDED_PHASE),
        timeout, TimeUnit.SECONDS);

However, I observed that when the method is used with a sufficiently long timeout value (approximately one hour or more), it throws a KubernetesClientException after roughly an hour, even though the condition has not been met and the timeout has not yet elapsed. This behavior is unexpected and prevents the method from working as intended.

My Kubernetes version is v1.19.

Fabric8 Kubernetes Client version

5.12.4

Steps to reproduce

Set up a scenario where a Pod's status will only change after a long period of time (more than an hour).
Use the waitUntilCondition method with a timeout value of approximately one hour or more.
Observe that the method throws a KubernetesClientException after roughly one hour, before the timeout has elapsed and before the condition has been met.

Expected behavior

The waitUntilCondition method should wait for the specified condition to be met within the provided timeout period. If the condition is met, the method should not throw an exception. If the condition is not met within the timeout period, then the method can throw an exception indicating that the condition was not met.

Runtime

Kubernetes (vanilla)

Kubernetes API Server version

other (please specify in additional context)

Environment

Linux

Fabric8 Kubernetes Client Logs

io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.waitUntilConditionWithRetries(BaseOperation.java:1149)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.waitUntilCondition(BaseOperation.java:1115)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.waitUntilCondition(BaseOperation.java:84)
    at com.pinlandata.project.scheduler.executor.util.KubeOperator.waitPodFinished(KubeOperator.java:74)
    at com.pinlandata.project.scheduler.executor.util.KubeOperator$$FastClassBySpringCGLIB$$8ec6f2dd.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:779)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:750)
    at org.springframework.retry.annotation.AnnotationAwareRetryOperationsInterceptor.invoke(AnnotationAwareRetryOperationsInterceptor.java:166)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:750)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:692)
    at com.pinlandata.project.scheduler.executor.util.KubeOperator$$EnhancerBySpringCGLIB$$d9f7fe0.waitPodFinished(<generated>)
    at com.pinlandata.project.scheduler.executor.util.PodRunner.runningPod(PodRunner.java:91)
    at com.pinlandata.project.scheduler.executor.util.PodRunner.run(PodRunner.java:53)
    at com.pinlandata.project.scheduler.executor.service.impl.algo.model.ModelInitService.execute(ModelInitService.java:79)
    at com.pinlandata.project.scheduler.transporter.service.impl.NormalTaskConsumeService.consume(NormalTaskConsumeService.java:51)
    at com.pinlandata.project.scheduler.transporter.mq.impl.NormalTaskListener.receive(NormalTaskListener.java:57)
    at sun.reflect.GeneratedMethodAccessor323.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:171)
    at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:120)
    at org.springframework.amqp.rabbit.listener.adapter.HandlerAdapter.invoke(HandlerAdapter.java:53)
    at org.springframework.amqp.rabbit.listener.adapter.MessagingMessageListenerAdapter.invokeHandler(MessagingMessageListenerAdapter.java:222)
    at org.springframework.amqp.rabbit.listener.adapter.MessagingMessageListenerAdapter.invokeHandlerAndProcessResult(MessagingMessageListenerAdapter.java:150)
    at org.springframework.amqp.rabbit.listener.adapter.MessagingMessageListenerAdapter.onMessage(MessagingMessageListenerAdapter.java:135)
    at org.springframework.amqp.rabbit.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:1632)
    at org.springframework.amqp.rabbit.listener.AbstractMessageListenerContainer.actualInvokeListener(AbstractMessageListenerContainer.java:1551)
    at org.springframework.amqp.rabbit.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:1539)
    at org.springframework.amqp.rabbit.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:1530)
    at org.springframework.amqp.rabbit.listener.AbstractMessageListenerContainer.executeListener(AbstractMessageListenerContainer.java:1474)
    at org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer.doReceiveAndExecute(SimpleMessageListenerContainer.java:967)
    at org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer.receiveAndExecute(SimpleMessageListenerContainer.java:913)
    at org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer.access$1600(SimpleMessageListenerContainer.java:83)
    at org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer$AsyncMessageProcessingConsumer.mainLoop(SimpleMessageListenerContainer.java:1288)
    at org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer$AsyncMessageProcessingConsumer.run(SimpleMessageListenerContainer.java:1194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.fabric8.kubernetes.client.dsl.base.WaitForConditionWatcher$WatchException: Watcher closed
    at io.fabric8.kubernetes.client.dsl.base.WaitForConditionWatcher.onClose(WaitForConditionWatcher.java:62)
    at io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:258)
    at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:322)
    at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:273)
    at okhttp3.internal.ws.RealWebSocket$1.onResponse(RealWebSocket.java:209)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:174)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    ... 3 common frames omitted
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 664650743 (664670130)
    ... 11 common frames omitted

Additional context

No response

@shawkins (Contributor) commented:

Later versions use informers, not just watches, which are maintained indefinitely.

shawkins closed this as not planned on Aug 10, 2023
@manusa (Member) commented Sep 18, 2023

To clarify, in v5.x waitUntilCondition is based on HTTP watches, which eventually hit the "too old resource version" state and need to be restarted.

In v6.x waitUntilCondition is based on informers, which self-heal and automatically reconnect for this kind of event.

Upgrading to v6.x should deal with your problem.
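
For illustration, here is a minimal sketch of the same wait against a 6.x client; the call itself is unchanged, only the underlying mechanism moves from plain watches to informers. The namespace, pod name and two-hour timeout are placeholder values, and the hard-coded "Succeeded"/"Failed" phases stand in for the PodConstant fields used above:

import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class WaitForPodCompletion {
  public static void main(String[] args) {
    // KubernetesClientBuilder is the 6.x entry point
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      Pod finished = client.pods()
          .inNamespace("my-namespace")   // placeholder namespace
          .withName("my-pod")            // placeholder pod name
          .waitUntilCondition(
              p -> p != null && p.getStatus() != null
                  && ("Succeeded".equals(p.getStatus().getPhase())
                      || "Failed".equals(p.getStatus().getPhase())),
              2, TimeUnit.HOURS);        // a timeout well beyond one hour
      System.out.println("Pod finished in phase " + finished.getStatus().getPhase());
    }
  }
}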

@apiwoni commented Sep 19, 2023

I'm seeing a similar issue with waitUntilCondition in v6.4.1 after 1 hour, which coincides with access token expiration. This could be a result of my expiring cache closing and evicting the Kubernetes client after its access token expires. waitUntilCondition does not return after the Kubernetes client is closed, which is another issue.

I'm going to remove the async Kubernetes client close on access token expiration to see what happens. From what I understand, a WebSocket does not require re-authentication once opened, so access token expiration should not be an issue, right?
However, can you confirm that a "too old resource version" error does not close the client but only restarts the watcher with a new resource version?
Also, how are connection issues expected to be handled during waitUntilCondition, apart from the issue I found with it not returning when the Kubernetes client is closed externally?

@manusa (Member) commented Sep 22, 2023

However, can you confirm that a "too old resource version" error does not close the client but only restarts the watcher with a new resource version?

In v6, since it's using informers, everything is handled internally. So yes, the informer is restarted from the new resource version (no events should be missed, except for intermediate events that leave the resource in the same state).

public void onClose(WatcherException exception) {
  // this close was triggered by an exception,
  // not the user, it is expected that the watch retry will handle this
  watchStopped();
  if (exception.isHttpGone()) {
    if (log.isDebugEnabled()) {
      log.debug("Watch restarting due to http gone for {}", Reflector.this);
    }
    // start a whole new list/watch cycle
    reconnect();
  } else {
    onException("watch", exception);
  }
}

CompletableFuture<Void> theFuture = processList(nextKeys, null).thenCompose(result -> {
  store.retainAll(nextKeys);
  final String latestResourceVersion = result.getMetadata().getResourceVersion();
  lastSyncResourceVersion = latestResourceVersion;
  log.debug("Listing items ({}) for {} at v{}", nextKeys.size(), this, latestResourceVersion);
  return startWatcher(latestResourceVersion);
}).thenAccept(w -> {

Also, how are connection issues expected to be handled during waitUntilCondition, apart from the issue I found with it not returning when the Kubernetes client is closed externally?

Unless it's a known status error type, the client will try to reconnect using the retry/backoff settings shown below (defaulting to 10 retries). If the retries are exhausted, the informer will eventually stop.

kubernetes-client/README.md

Lines 104 to 105 in 9b86adc

| `kubernetes.request.retry.backoffLimit` / `KUBERNETES_REQUEST_RETRY_BACKOFFLIMIT` | Number of retry attempts (-1 for infinite) | `10` |
| `kubernetes.request.retry.backoffInterval` / `KUBERNETES_REQUEST_RETRY_BACKOFFINTERVAL` | Retry initial backoff interval in ms | `100` |
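
For completeness, the same retry settings can also be set programmatically when building the client. A minimal sketch, assuming the ConfigBuilder setters that correspond to the two properties above:

import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.ConfigBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

// Build a client with explicit retry/backoff settings (-1 retries means retry indefinitely)
Config config = new ConfigBuilder()
    .withRequestRetryBackoffLimit(10)      // kubernetes.request.retry.backoffLimit
    .withRequestRetryBackoffInterval(100)  // kubernetes.request.retry.backoffInterval, in ms
    .build();
KubernetesClient client = new KubernetesClientBuilder().withConfig(config).build();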

If stopped due to a problem, then the Informer completes exceptionally and propagates a RuntimeException, so your application will receive the exception.

You can further customize this behavior at the Informer level by leveraging the ExceptionHandler:

/**
 * Sets the {@link ExceptionHandler} for this informer. For example, exceptionHandler((b, t) -> true) will
 * keep retrying no matter what the exception is.
 * <p>
 * May only be called prior to the informer starting
 *
 * @param handler
 */
SharedIndexInformer<T> exceptionHandler(ExceptionHandler handler);
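
For illustration, a minimal usage sketch of that handler on a runnable informer; the namespace and resync period are placeholders, and (isStarted, t) -> true simply keeps retrying on every failure:

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

// Informer that never gives up, regardless of the exception
SharedIndexInformer<Pod> informer = client.pods()
    .inNamespace("my-namespace")               // placeholder namespace
    .runnableInformer(60_000L)                 // resync period in ms (placeholder)
    .exceptionHandler((isStarted, t) -> true); // retry no matter what the exception is
informer.run();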

And by subscribing to the stopped() future:

/**
 * Return a {@link CompletionStage} that will allow notification of the informer stopping. This will be completed after
 * event processing has stopped.
 * <p>
 * If {@link #stop()} is called, the CompletionStage will complete normally.
 * <p>
 * If an exception occurs that terminates the informer, then it will be exceptionally completed with that exception
 * - typically a {@link WatcherException}
 */
CompletionStage<Void> stopped();
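
Continuing the informer sketch above, a small example of subscribing to that future to observe abnormal termination:

// Get notified when the informer above stops, normally or exceptionally
informer.stopped().whenComplete((ignored, t) -> {
  if (t != null) {
    // Terminated by a problem, typically a WatcherException
    System.err.println("Informer stopped unexpectedly: " + t.getMessage());
  } else {
    // stop() was called
    System.out.println("Informer stopped normally");
  }
});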

Check the CustomResourceInformerExample.

@apiwoni commented Sep 22, 2023

@manusa Thank you for your explanation. I have run additional tests and have been able to confirm that when the access token expires, my HTTP token-refresh interceptor is triggered.
However, as I mentioned, when KubernetesClient.close() is invoked externally on a separate thread (after the access token expires) while waitUntilCondition is running without a timeout, waitUntilCondition hangs indefinitely, even when the condition is satisfied. It seems that the future is never completed and no further attempts to reconnect are made, because the closure is treated as isForcedClosed in AbstractWatchManager, hence no reconnect. This is a bug: the future should be completed and waitUntilCondition should throw an exception.

@shawkins (Contributor) commented:

This is a bug: the future should be completed and waitUntilCondition should throw an exception.

This is related to #5327; the future that is waiting on the condition is aware of the informer failing to start properly, but it is not aware of the informer shutting down early. This issue could be reopened, or a new one created for that specifically.

shawkins added a commit to shawkins/kubernetes-client that referenced this issue Sep 23, 2023
shawkins added a commit to shawkins/kubernetes-client that referenced this issue Sep 24, 2023
manusa reopened this Sep 25, 2023
manusa added the bug label Sep 25, 2023
manusa added this to the 6.9.0 milestone Sep 25, 2023
manusa pushed a commit that referenced this issue Sep 25, 2023