
waitUntilCondition method throws KubernetesClientException after approximately one hour with sufficiently long timeout #5379

Closed
ywacko opened this issue Aug 9, 2023 · 6 comments · Fixed by #5473

ywacko commented Aug 9, 2023

Describe the bug

I encountered an issue while using the waitUntilCondition method from the Fabric8 Kubernetes Client library.

code:
client.pods()
    .inNamespace(namespace)
    .withName(pod.getMetadata().getName())
    .waitUntilCondition(
        o -> o.getStatus().getPhase().equals(PodConstant.POD_FAILED_PHASE)
            || o.getStatus().getPhase().equals(PodConstant.POD_SUCCEEDED_PHASE),
        timeout, TimeUnit.SECONDS);

However, I observed that when the method is used with a sufficiently long timeout value (approximately one hour or more), it throws a KubernetesClientException after roughly an hour, even though the condition has not been met and the timeout has not yet elapsed. This behavior is unexpected and prevents the method from working as intended.

My Kubernetes version is v1.19.

Fabric8 Kubernetes Client version

5.12.4

Steps to reproduce

Set up a scenario where a Pod's status will only change after a long period of time (more than an hour).
Use the waitUntilCondition method with a timeout value of approximately one hour or more.
Observe that the method throws a KubernetesClientException after roughly one hour, before the timeout has elapsed and before the condition has been met.

Expected behavior

The waitUntilCondition method should wait for the specified condition to be met within the provided timeout period. If the condition is met, the method should not throw an exception. If the condition is not met within the timeout period, then the method can throw an exception indicating that the condition was not met.

Runtime

Kubernetes (vanilla)

Kubernetes API Server version

other (please specify in additional context)

Environment

Linux

Fabric8 Kubernetes Client Logs

io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.waitUntilConditionWithRetries(BaseOperation.java:1149)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.waitUntilCondition(BaseOperation.java:1115)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.waitUntilCondition(BaseOperation.java:84)
    at com.pinlandata.project.scheduler.executor.util.KubeOperator.waitPodFinished(KubeOperator.java:74)
    at com.pinlandata.project.scheduler.executor.util.KubeOperator$$FastClassBySpringCGLIB$$8ec6f2dd.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:779)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:750)
    at org.springframework.retry.annotation.AnnotationAwareRetryOperationsInterceptor.invoke(AnnotationAwareRetryOperationsInterceptor.java:166)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:750)
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:692)
    at com.pinlandata.project.scheduler.executor.util.KubeOperator$$EnhancerBySpringCGLIB$$d9f7fe0.waitPodFinished(<generated>)
    at com.pinlandata.project.scheduler.executor.util.PodRunner.runningPod(PodRunner.java:91)
    at com.pinlandata.project.scheduler.executor.util.PodRunner.run(PodRunner.java:53)
    at com.pinlandata.project.scheduler.executor.service.impl.algo.model.ModelInitService.execute(ModelInitService.java:79)
    at com.pinlandata.project.scheduler.transporter.service.impl.NormalTaskConsumeService.consume(NormalTaskConsumeService.java:51)
    at com.pinlandata.project.scheduler.transporter.mq.impl.NormalTaskListener.receive(NormalTaskListener.java:57)
    at sun.reflect.GeneratedMethodAccessor323.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:171)
    at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:120)
    at org.springframework.amqp.rabbit.listener.adapter.HandlerAdapter.invoke(HandlerAdapter.java:53)
    at org.springframework.amqp.rabbit.listener.adapter.MessagingMessageListenerAdapter.invokeHandler(MessagingMessageListenerAdapter.java:222)
    at org.springframework.amqp.rabbit.listener.adapter.MessagingMessageListenerAdapter.invokeHandlerAndProcessResult(MessagingMessageListenerAdapter.java:150)
    at org.springframework.amqp.rabbit.listener.adapter.MessagingMessageListenerAdapter.onMessage(MessagingMessageListenerAdapter.java:135)
    at org.springframework.amqp.rabbit.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:1632)
    at org.springframework.amqp.rabbit.listener.AbstractMessageListenerContainer.actualInvokeListener(AbstractMessageListenerContainer.java:1551)
    at org.springframework.amqp.rabbit.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:1539)
    at org.springframework.amqp.rabbit.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:1530)
    at org.springframework.amqp.rabbit.listener.AbstractMessageListenerContainer.executeListener(AbstractMessageListenerContainer.java:1474)
    at org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer.doReceiveAndExecute(SimpleMessageListenerContainer.java:967)
    at org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer.receiveAndExecute(SimpleMessageListenerContainer.java:913)
    at org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer.access$1600(SimpleMessageListenerContainer.java:83)
    at org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer$AsyncMessageProcessingConsumer.mainLoop(SimpleMessageListenerContainer.java:1288)
    at org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer$AsyncMessageProcessingConsumer.run(SimpleMessageListenerContainer.java:1194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.fabric8.kubernetes.client.dsl.base.WaitForConditionWatcher$WatchException: Watcher closed
    at io.fabric8.kubernetes.client.dsl.base.WaitForConditionWatcher.onClose(WaitForConditionWatcher.java:62)
    at io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:258)
    at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:322)
    at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:273)
    at okhttp3.internal.ws.RealWebSocket$1.onResponse(RealWebSocket.java:209)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:174)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    ... 3 common frames omitted
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 664650743 (664670130)
    ... 11 common frames omitted

Additional context

No response

@shawkins (Contributor) commented:

Later versions use informers, not just watches, which are maintained indefinitely.

shawkins closed this as not planned on Aug 10, 2023
@manusa (Member) commented Sep 18, 2023

To clarify, in v5.x waitUntilCondition is based on HTTP watches, which eventually hit the "too old resource version" state and need to be restarted.

In v6.x waitUntilCondition is based on informers, which self-heal and automatically reconnect for this kind of event.

Upgrading to v6.x should deal with your problem.
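
For illustration, here is a minimal sketch of the same wait against a 6.x client; the call itself is unchanged, only the underlying mechanism moves from plain watches to informers. The namespace, pod name and two-hour timeout are placeholder values, and the hard-coded "Succeeded"/"Failed" phases stand in for the PodConstant fields used above:

import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class WaitForPodCompletion {
  public static void main(String[] args) {
    // KubernetesClientBuilder is the 6.x entry point
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      Pod finished = client.pods()
          .inNamespace("my-namespace")   // placeholder namespace
          .withName("my-pod")            // placeholder pod name
          .waitUntilCondition(
              p -> p != null && p.getStatus() != null
                  && ("Succeeded".equals(p.getStatus().getPhase())
                      || "Failed".equals(p.getStatus().getPhase())),
              2, TimeUnit.HOURS);        // a timeout well beyond one hour
      System.out.println("Pod finished in phase " + finished.getStatus().getPhase());
    }
  }
}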

@apiwoni commented Sep 19, 2023

I'm seeing a similar issue with waitUntilCondition in v6.4.1 after 1 hour, which coincides with access token expiration. This could be a result of my expiring cache closing and evicting the Kubernetes client after its access token expires. waitUntilCondition does not return after the Kubernetes client is closed, which is another issue.

I'm going to remove the async Kubernetes client close on access token expiration to see what happens. From what I understand, a WebSocket does not require re-authentication once opened, so access token expiration should not be an issue, right?
However, can you confirm that a "too old resource version" error does not close the client but only restarts the watcher with a new resource version?
Also, how are connection issues expected to be handled during waitUntilCondition, apart from the issue I found with it not returning when the Kubernetes client is closed externally?

@manusa (Member) commented Sep 22, 2023

However, can you confirm that a "too old resource version" error does not close the client but only restarts the watcher with a new resource version?

In v6, since it's using informers, everything is handled internally. So yes, the informer is restarted from the new resource version (no events should be missed, except for intermediate events that leave the resource in the same state).

public void onClose(WatcherException exception) {
  // this close was triggered by an exception,
  // not the user, it is expected that the watch retry will handle this
  watchStopped();
  if (exception.isHttpGone()) {
    if (log.isDebugEnabled()) {
      log.debug("Watch restarting due to http gone for {}", Reflector.this);
    }
    // start a whole new list/watch cycle
    reconnect();
  } else {
    onException("watch", exception);
  }
}

CompletableFuture<Void> theFuture = processList(nextKeys, null).thenCompose(result -> {
  store.retainAll(nextKeys);
  final String latestResourceVersion = result.getMetadata().getResourceVersion();
  lastSyncResourceVersion = latestResourceVersion;
  log.debug("Listing items ({}) for {} at v{}", nextKeys.size(), this, latestResourceVersion);
  return startWatcher(latestResourceVersion);
}).thenAccept(w -> {

Also, how are connection issues expected to be handled during waitUntilCondition, apart from the issue I found with it not returning when the Kubernetes client is closed externally?

Unless it's a known status error type, the client will try to reconnect using the retry/backoff settings shown below (defaulting to 10 retries). If the retries are exhausted, the informer will eventually stop.

kubernetes-client/README.md

Lines 104 to 105 in 9b86adc

| `kubernetes.request.retry.backoffLimit` / `KUBERNETES_REQUEST_RETRY_BACKOFFLIMIT` | Number of retry attempts (-1 for infinite) | `10` |
| `kubernetes.request.retry.backoffInterval` / `KUBERNETES_REQUEST_RETRY_BACKOFFINTERVAL` | Retry initial backoff interval in ms | `100` |
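
For completeness, the same retry settings can also be set programmatically when building the client. A minimal sketch, assuming the ConfigBuilder setters that correspond to the two properties above:

import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.ConfigBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

// Build a client with explicit retry/backoff settings (-1 retries means retry indefinitely)
Config config = new ConfigBuilder()
    .withRequestRetryBackoffLimit(10)      // kubernetes.request.retry.backoffLimit
    .withRequestRetryBackoffInterval(100)  // kubernetes.request.retry.backoffInterval, in ms
    .build();
KubernetesClient client = new KubernetesClientBuilder().withConfig(config).build();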

If stopped due to a problem, then the Informer completes exceptionally and propagates a RuntimeException, so your application will receive the exception.

You can further customize this behavior at the Informer level by leveraging the ExceptionHandler:

/**
 * Sets the {@link ExceptionHandler} for this informer. For example, exceptionHandler((b, t) -> true) will
 * keep retrying no matter what the exception is.
 * <p>
 * May only be called prior to the informer starting
 *
 * @param handler
 */
SharedIndexInformer<T> exceptionHandler(ExceptionHandler handler);
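
For illustration, a minimal usage sketch of that handler on a runnable informer; the namespace and resync period are placeholders, and (isStarted, t) -> true simply keeps retrying on every failure:

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

// Informer that never gives up, regardless of the exception
SharedIndexInformer<Pod> informer = client.pods()
    .inNamespace("my-namespace")               // placeholder namespace
    .runnableInformer(60_000L)                 // resync period in ms (placeholder)
    .exceptionHandler((isStarted, t) -> true); // retry no matter what the exception is
informer.run();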

And by subscribing to the stopped() future:

/**
 * Return a {@link CompletionStage} that will allow notification of the informer stopping. This will be completed after
 * event processing has stopped.
 * <p>
 * If {@link #stop()} is called, the CompletionStage will complete normally.
 * <p>
 * If an exception occurs that terminates the informer, then it will be exceptionally completed with that exception
 * - typically a {@link WatcherException}
 */
CompletionStage<Void> stopped();
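
Continuing the informer sketch above, a small example of subscribing to that future to observe abnormal termination:

// Get notified when the informer above stops, normally or exceptionally
informer.stopped().whenComplete((ignored, t) -> {
  if (t != null) {
    // Terminated by a problem, typically a WatcherException
    System.err.println("Informer stopped unexpectedly: " + t.getMessage());
  } else {
    // stop() was called
    System.out.println("Informer stopped normally");
  }
});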

Check the CustomResourceInformerExample.

@apiwoni commented Sep 22, 2023

@manusa Thank you for your explanation. I have run additional tests and have been able to confirm that when the access token expires, my HTTP token-refresh interceptor is triggered.
However, as I mentioned, when KubernetesClient.close() is invoked externally on a separate thread (after the access token expires) while waitUntilCondition is running without a timeout, waitUntilCondition hangs indefinitely, even when the condition is satisfied. It seems that the future is never completed and no further attempts to reconnect are made, because the closure is treated as isForcedClosed in AbstractWatchManager, hence no reconnect. This is a bug: the future should be completed and waitUntilCondition should throw an exception.

@shawkins (Contributor) commented:

This is a bug: the future should be completed and waitUntilCondition should throw an exception.

This is related to #5327; the future that is waiting on the condition is aware of the informer failing to start properly, but it is not aware of the informer shutting down early. This issue could be reopened, or a new one created for that specifically.

shawkins added a commit to shawkins/kubernetes-client that referenced this issue Sep 23, 2023
shawkins added a commit to shawkins/kubernetes-client that referenced this issue Sep 24, 2023
manusa reopened this Sep 25, 2023
manusa added the bug label Sep 25, 2023
manusa added this to the 6.9.0 milestone Sep 25, 2023
manusa pushed a commit that referenced this issue Sep 25, 2023