
Wait logic doesn't wait for all deployment and daemonset pods to be updated/ready #8660

Closed
seaneagan opened this issue Aug 26, 2020 · 22 comments

@seaneagan

I'm hitting a couple of issues with the helm install/upgrade/rollback wait logic:

  1. maxUnavailable is subtracted from the number of pods that need to be ready when waiting for deployments and daemonsets. It seems to me that maxUnavailable is only about how to accomplish the update, and should not dictate when the update is considered complete. If there is only 1 replica and the default maxUnavailable of 1 is in place, then it will wait for 1-1 == 0 pods to be ready, which is not ideal.

  2. status.observedGeneration is not accounted for, so this can lead to race conditions where the wait logic sees the pods from the old generation of the deployment or daemonset rather than the new one.

It looks like 1. came out of #5219, and an issue similar to 2. is mentioned there as well, though it doesn't specifically mention observedGeneration (cc @thomastaylor312).

Since the wait logic does not reliably wait for all pods to be updated/ready, it seems to be of limited use and gives a false impression that the release is good to go. Would there be any interest in removing the maxUnavailable subtraction and/or adding a status.observedGeneration check?
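For illustration, here is a minimal, hypothetical Deployment for point 1, with maxUnavailable written out explicitly so the arithmetic is visible; the name and image are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: single-replica-app           # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: single-replica-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1              # made explicit here to keep the arithmetic visible
  template:
    metadata:
      labels:
        app: single-replica-app
    spec:
      containers:
        - name: app
          image: example.org/app:1.0   # placeholder image
# The wait logic described above expects readyReplicas >= replicas - maxUnavailable,
# i.e. 1 - 1 = 0, so `helm install --wait` can return before any pod is ready.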

@thomastaylor312
Contributor

If there is only 1 replica and the default maxUnavailable of 1 is in place, then it will wait for 1-1 == 0 pods to be ready, which is not ideal

This is definitely a known problem for those deployments that only have 1 replica. However, I think the logic as it stands is still useful for a large chunk of deployments (that often have more than 1 replica). However, I would be completely open to having a case that checks for only 1 replica and then sets that to wait for the 1 pod as this really helps new users as well.

status.observedGeneration is not accounted for, so this can lead to race conditions where the wait logic sees the pods from the old generation of the deployment or daemonset rather than the new one.

Do you have something that replicates this? As far as I remember, the functions we use to fetch the new replicasets/pods get the latest generation

@seaneagan
Author

This is definitely a known problem for those deployments that only have 1 replica. However, I think the logic as it stands is still useful for a large chunk of deployments (that often have more than 1 replica).

It is definitely useful; I'm just wondering if it would be more useful to ignore maxUnavailable altogether, as I think maxUnavailable is about the desired upgrade strategy rather than the desired end state, which from my perspective is, in most cases, for all pods to be ready. In some cases there will be a desire to allow some newly updated pods (related to observedGeneration below) to not be ready; however, I think that is less common and would be best handled by additional configuration, e.g. an annotation on the deployment which specifies the number/percentage of pods that should be waited for.

However, I would be completely open to having a case that checks for only 1 replica and then sets that to wait for the 1 pod as this really helps new users as well.

Barring the above, this would still be a very welcome improvement for me.

Do you have something that replicates this? As far as I remember, the functions we use to fetch the new replicasets/pods get the latest generation

It's a race condition, so it will be difficult to replicate reliably. But the issue is that the wait logic doesn't account for the observedGeneration of the deployment / daemonset / statefulset itself. So even though it gets the latest replicaset / pods, those can be from the old generation of the deployment, which may not even have been seen yet by e.g. the controller-manager, which reconciles deployments and sets observedGeneration (and other status fields) to signal its progress.
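To make the race concrete, here is an illustrative (not captured) snapshot of a Deployment right after the upgrade has submitted the new spec, before the controller-manager has observed it; the name and values are hypothetical:

# Hypothetical snapshot, immediately after the upgrade patches the Deployment spec.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app          # placeholder
  generation: 5              # bumped by the spec change from the upgrade
status:
  observedGeneration: 4      # the controller has not reconciled generation 5 yet
  replicas: 3
  updatedReplicas: 3         # still describes the previous rollout
  readyReplicas: 3
  availableReplicas: 3
# Until status.observedGeneration catches up to metadata.generation, the status fields
# (and the "latest" replicaset/pods derived from them) can still describe the old
# generation, so a readiness check that ignores observedGeneration can pass spuriously.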

I'd be happy to help make these adjustments. See also #8661.

@thomastaylor312
Contributor

So I think that the maxUnavailable logic follows the intent of the rollout strategy. The idea is that we say we are ok once we have enough new pods ready, minus the allowed maxUnavailable. However, even if we wanted to change it, we would need to consider whether it would be a breaking change, in which case we wouldn't be able to do it.

But all of this may not matter if we choose to go the route expressed in #8661. However, I do think the change of checking for a single replica is a simple feature that we would be more than willing to accept right now. Do you think it would be better just to focus on what is discussed in #8661?

@seaneagan
Author

I can work on an "at least 1 replica ready" special case and on adding observedGeneration checks, as I think those will go a long way toward improving the utility in the short term and bridging the gap to a larger overhaul like #8661.

seaneagan added a commit to seaneagan/helm that referenced this issue Aug 28, 2020
The wait logic for Deployments and Daemonsets subtracts their maxUnavailable
value from the amount of replicas they should have ready. This can lead to
waiting for 0 pods to be ready. This change ensures at least 1 replica is ready,
unless there are 0 total desired replicas.

Fixes helm#8660.
seaneagan added a commit to seaneagan/helm that referenced this issue Aug 28, 2020
The wait logic for Deployments and Daemonsets subtracts their
maxUnavailable value from the amount of replicas they should
have ready. This can lead to waiting for 0 pods to be ready.
This change ensures at least 1 replica is ready, unless there
are 0 desired replicas.

Fixes helm#8660.

Signed-off-by: Sean Eagan <seaneagan1@gmail.com>
@jon-depop

@seaneagan is there a PR for the observedGeneration race condition? We're also observing this issue and have had to implement out-of-band logic to account for it.

@github-actions

This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

@github-actions github-actions bot added the Stale label Jan 19, 2021
@jcamu

jcamu commented Jan 29, 2021

Hello @thomastaylor312

I have a few questions about this; maybe there is some misunderstanding on my side.

The wait option waits until the pods are ready, according to the rule Status.ReadyReplicas >= expectedReady.
But how can we separate the behaviour during the deployment from the expected final state?
For instance, with a strategy like the one below:

replicas: 2
strategy:
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 1
  type: RollingUpdate

We accept having a pod unavailable during deployment, as the pods will be restarted one after another (for a zero-downtime scenario), but we would like the deployment to be considered finished and OK only when all pods are up and ready.
If the two are not restarted correctly, the deployment should not be considered OK.

For now, we use 'kubectl rollout status', as it will wait until all pods are ready, and if they are not, the deployment will be rolled back.
How can we say everything is up and the helm upgrade is fine?

Is this behaviour possible, or planned?
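For the strategy above, an illustrative mid-rollout status (hypothetical values) shows the gap between the two checks:

# Hypothetical status while the first of the two pods is being replaced.
status:
  observedGeneration: 7      # assume the controller is up to date
  replicas: 2
  updatedReplicas: 1         # the second pod has not been replaced yet
  readyReplicas: 1
  availableReplicas: 1
# helm --wait, as discussed in this thread: readyReplicas (1) >= replicas - maxUnavailable (2 - 1) -> done
# kubectl rollout status: keeps waiting until updatedReplicas and availableReplicas reach 2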

@github-actions github-actions bot removed the Stale label Jan 30, 2021

@nmische

nmische commented Oct 20, 2021

I am running into this issue with Daemonsets. I see there is a PR open to address this, but it seems stalled. I am curious whether any other work is being done around this issue?

@cbf123

cbf123 commented Apr 1, 2022

This issue is definitely still open, and is causing issues for FluxCD.

cbf123 added a commit to cbf123/helm that referenced this issue Apr 4, 2022
The wait logic for Deployments and Daemonsets subtracts their
maxUnavailable value from the amount of replicas they should
have ready. This can lead to waiting for 0 pods to be ready.
This change ensures at least 1 replica is ready, unless there
are 0 desired replicas.

Fixes helm#8660.

This is based on helm#8671 which
was never merged.

Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
starlingx-github pushed a commit to starlingx/config that referenced this issue Apr 19, 2022
In this commit we added the following enhancements to
the FluxCD functionality:

1. Dynamically change the host in the default helm repository with the
system controller network address.

2. We will see this issue
fluxcd/helm-controller#81
on AIO-SX if there are issues during chart install.
Basically, the status of helmrelease ends up with ready but
the pods are not actually ready/running.
This is due to helm upstream issues
helm/helm#3173,
helm/helm#5814,
helm/helm#8660.
To solve this we need to check if the pods of the
applied helm charts are ready/running using
the kubernetes python client after the helmrelease is
in a ready state.

3. Check for the 'failed' state of the helmreleases and
update the app accordingly

4. Move the Timeout counter before starting the fluxcd
operations to prevent some infinite loops

Test Plan:
PASS: Deployed a SX with the 'cluster_host_subnet' changed from the
default one and checked if the helm repositories were different
as expected
PASS: Apply nginx fluxcd app 1.1-24 and verified that the app status
is 'applied' when all the pods are in running state
PASS: Apply vault fluxcd app 1.0-27 and verified that the app status
is 'applied'  when all the pods are in running state
PASS: Platform Upgrade from latest release to current release

Task: 44912
Story: 2009138
Change-Id: I207b5b55a4b504a1c8ecdb239036a3d122294a0d
Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>
@mattfarina mattfarina reopened this May 5, 2022
@mattfarina mattfarina added this to the 3.9.1 milestone May 5, 2022
@mattfarina mattfarina added the bug Categorizes issue or PR as related to a bug. label May 5, 2022
@jonnylangefeld

Just ran into this too

cbf123 added a commit to cbf123/helm that referenced this issue Jun 27, 2022
cbf123 added a commit to cbf123/helm that referenced this issue Jun 28, 2022
The wait logic for Deployments and Daemonsets subtracts their
maxUnavailable value from the amount of replicas they should
have ready. This can lead to waiting for 0 pods to be ready.
This change ensures at least 1 replica is ready, unless there
are 0 desired replicas.

Fixes helm#8660.

This is based on helm#8671 by
Sean Eagan which was never merged.

Co-authored-by: Sean Eagan <seaneagan1@gmail.com>
Signed-off-by: Sean Eagan <seaneagan1@gmail.com>
Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
@mattfarina mattfarina modified the milestones: 3.9.1, 3.9.2 Jul 13, 2022
@mattfarina
Collaborator

To provide an update to #10831 and why it has not been merged...

Helm currently looks at the number of replicas and the declared max unavailable. It follows what's declared in the configuration. There are situations where this can be configured so that nothing is available. For example, something declares 3 replicas and 3 max unavailable. In that situation it's declared that it's ok to have everything offline.

This is where Helm is following the declared intent.

What wait doesn't do is take Pod Disruption Budgets into account. Kubernetes will take those into account, but wait in Helm won't.

The current PR makes a change so that Helm no longer follows the declared intent when that intent allows 0 replicas to be available. Since this breaks from the declared intent, we are not currently merging it.

If you are experiencing this issue, how do your replicas relate to the max unavailable and are you using PDBs?
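As a concrete (hypothetical) illustration of that distinction: in the sketch below, the rollout strategy declares that all three replicas may be unavailable, while the PDB asks for at least two to stay up; the wait logic follows only the former.

# Hypothetical manifests (pod template trimmed for brevity).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 3      # declared intent: everything may be offline during the update
  # template: ...
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app-pdb
spec:
  minAvailable: 2            # enforced for evictions by Kubernetes, but not by `helm ... --wait`
  selector:
    matchLabels:
      app: example-app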

@felipecrs

My deployment has replicas: 1, maxUnavailable: 1 and my PDB has maxUnavailable: 1 as well. Still, kubectl get deployment does not mark the deployment as ready until 1 replica is up.


@github-actions github-actions bot added the Stale label Jan 31, 2023
@joejulian joejulian removed the Stale label Jan 31, 2023
@GarrettGeorge

I think the logic of using maxUnavailable from updateStrategy doesn't really make sense in the context of Helm determining whether the upgrade is complete or not. I have a DaemonSet of privileged pods using a hostPort and thus can't utilize the new maxSurge capability of DaemonSets in 1.25.

Using maxUnavailable: 1 is not an option for upgrades due to the time it takes to upgrade.

The only option I have is to tune maxUnavailable; I would like to use 100%, but really any percentage should work, because this is only the maximum unavailability in the context of the upgrade.

In a sense, I can understand Helm's perspective. But if there were a middle ground, I would say that the behavior should be different when the --atomic flag is set. If there's any error during the upgrade, the intention of the --atomic flag is to roll back to the last revision, but this can't be caught because Helm deems the upgrade complete.

@joejulian
Contributor

I think it's fair to suggest that the addition of maxSurge be taken into account when determining readiness. I also think that's a different topic worthy of its own issue.

cbf123 added a commit to cbf123/helm that referenced this issue Mar 8, 2023

@ohads-MSFT

ohads-MSFT commented Aug 27, 2023

To provide an update to #10831 and why it has not been merged...

Helm currently looks at the number of replicas and the declared max unavailable. It follows what's declared in the configuration. There are situations where this can be configured so that nothing is available. For example, something declares 3 replicas and 3 max unavailable. In that situation it's declared that it's ok to have everything offline.

What about the observedGeneration race condition mentioned above? I didn't see it mentioned in the PR's description, nor in the changed code.

I just had my DaemonSet helm deployment succeed even though the first node (out of 5) never completed the pod upgrade (due to crash loop backoff). We didn't specify maxUnavailable, so I gather it should be the default of 1. We also didn't specify maxSurge, which should then be 0. So basically k8s only ever attempted to update one node, and never succeeded in doing so. The only explanation I can think of is that helm looked at the 4 old ready pods and decided it satisfied its condition of desired count - max unavailable = 5 - 1 = 4...
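An illustrative DaemonSet status for that scenario (numbers are hypothetical, matching the 5-node description above):

# Hypothetical status while the single updated pod is crash-looping.
status:
  desiredNumberScheduled: 5
  currentNumberScheduled: 5
  updatedNumberScheduled: 1   # only the crash-looping pod runs the new template
  numberReady: 4              # the four untouched old pods
  numberAvailable: 4
  numberUnavailable: 1
# Under the condition described above, desired - maxUnavailable = 5 - 1 = 4, and the
# four old ready pods already satisfy it, so the release can be reported as successful
# even though the rollout never progressed past the first node.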

Also, I completely agree with:

I think the logic of using maxUnavailable from updateStrategy doesn't really make sense in the context of Helm determining whether the upgrade is complete or not

If changing it is a breaking change, there could always be a new flag like --atomic-full, or really any other annotation or flag we can set to enforce this behavior.

Sounds like an option like this would improve the race condition (at least in the default case), since the first thing k8s would do is delete 1 pod, and at that point the condition is already unsatisfied (because you have n - 1 pods ready).

@bcoughlan

Still happens with 3.13.0, due to the observedGeneration race condition mentioned above.

@felipecrs

I'm just trying to understand: is there a chance #10920 could have fixed this issue?

felipecrs pushed a commit to felipecrs/helm that referenced this issue Feb 16, 2024
The wait logic for Deployments and Daemonsets subtracts their
maxUnavailable value from the amount of replicas they should
have ready. This can lead to waiting for 0 pods to be ready.
This change ensures at least 1 replica is ready, unless there
are 0 desired replicas.

Fixes helm#8660.

This is based on helm#8671 by
Sean Eagan which was never merged.

Co-authored-by: Sean Eagan <seaneagan1@gmail.com>
Signed-off-by: Sean Eagan <seaneagan1@gmail.com>
Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
Signed-off-by: Felipe Santos <felipecassiors@gmail.com>
@felipecrs

felipecrs commented Feb 23, 2024

At least for my workload #10920 (helm 3.14) has apparently fixed this issue.

@felipecrs

Just to correct myself, #10920 does not fix this issue. I created a reproduction repo:

https://github.com/felipecrs/felipecrs-reproduce-helm-issue-8660

Code_EE0XGfVV39.mp4

However, as stated previously, this issue will not be fixed in Helm; it should be fixed in the Helm charts. For example, make sure replicas minus maxUnavailable is greater than 0. That is, if your chart defaults to 1 replica, maxUnavailable should default to 0, or use some percentage, like 25% or 50%.
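A chart-side sketch of that advice (the value names are illustrative of typical chart defaults, not any particular chart):

# values.yaml (illustrative defaults)
replicaCount: 1
strategy:
  rollingUpdate:
    maxSurge: 1          # bring the new pod up before the old one goes away
    maxUnavailable: 0    # keeps replicas - maxUnavailable above zero for a single replica

# Rendered into the Deployment, the wait logic then requires
# readyReplicas >= 1 - 0 = 1, i.e. at least one ready pod.
# A percentage such as 25% also resolves to 0 for a single replica,
# since maxUnavailable is rounded down.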
