New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Verify generation in readiness checks #10920
Conversation
Signed-off-by: James Oden <james.oden.dev@gmail.com>
@joejulian Anything I can do on my end to help get this one triaged? Thanks for the help! |
@mattfarina @bacongobbler @yxxhero Anything I can do on my end to help get this one triaged? Thanks for the help! |
@zonggen @joejulian @yxxhero Anything I can do on my end to help get this one triaged? Thanks for the help! |
@muang0 need more comments for review. Thanks very much. |
@yxxhero thanks for the response! Do you mind clarifying what you mean? Do you need me to provide more clarification on this PR in a comment? A concise summary: During resource readiness checks, helm doesn't utilize the k8s generation/observedGeneration fields. This creates an opening for a race condition between helm and the associated k8s controller. In this case, a resource is incorrectly marked as ready by helm (before the associated k8s controller picks up the changes to the spec from helm). If we add generational checks to helm readiness validation, then helm would notice the generation & observedGeneration are out of sync, and avoid this race condition entirely. What we are observing as a result: While using the wait flag, helm doesn't wait for the resources to be created but rather immediately marks resource creation as complete and finishes the install. k8s api-conventions description of observedGeneration & generation: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md?plain=1#L476-L479 |
Hi @yxxhero any action I should take from my end? Thanks for the help! |
@muang0 In this weekend. I will review this. |
Hi @yxxhero, hope you're having a good week! So sorry to bother, just bumping this so it doesn't fall between the cracks :) It's a small change at its core, most of the lines are unit tests. |
@yxxhero I hate to bother you 😅 this PR is really a small change at its core, shouldn't take long to review/triage. You'd be helping us out a ton by helping push this one along 😄 |
@muang0 those changes are valuable. but The code changes a lot, I need more time to review. Thanks very much. |
Gentle bump 😀 |
This PR is what exactly I was looking for. I was playing with Kind cluster using Helm and noticed And ended up this PR eventually. So I tested on my local env:
Works like a charm, great job @muang0! 👏 Any possibility to make review process a bit faster to keep time window short to get this merge to main branch? Kind ping @yxxhero @mattfarina @bacongobbler 🤗 |
Kind ping :) |
Kind ping, plz fam, it's not a complex change 😅 |
We've run into this issue too. Would really appreciate it if this got reviewed please and thanks. |
@dnwe Thank you greatly for your contribution!! 🎆 Your statefulset generational readiness check is valuable work 😄 @yxxhero @mattfarina @bacongobbler I think this type of generational check would be valuable for other resources as well. In this PR I've applied generational readiness checks for all applicable resources. If you think this is something helm can benefit from, and willing to commit to reviewing these changes, I'm happy resolve my merge conflicts. Just let me know! @rmendal-Harness @Dentrax This will likely be the last activity on this PR from me unless maintainers are interested in this PR. Thank you for the shoutouts 💯 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. The test coverage is great, and I was able to make the tests fail. This seems to me to be the correct way to check for readiness. I left one small nit inline about a comment.
I'm still digging into the logic on the IsRead, specifically the removal of the variable declaration at the top. I don't quite understand why it was there in the first place, so I'm not certain removing it is the right move. Will keep looking
Also, it looks like this could use a rebase. I can bring this to the next community meeting if you want to do that and address the comment.
pkg/kube/ready.go
Outdated
@@ -272,6 +284,16 @@ func (c *ReadyChecker) volumeReady(v *corev1.PersistentVolumeClaim) bool { | |||
} | |||
|
|||
func (c *ReadyChecker) deploymentReady(rs *appsv1.ReplicaSet, dep *appsv1.Deployment) bool { | |||
// Verify the generation observed by the replicaSet controller matches the spec generation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a duplicate comment of the one below, and doesn't apply to this line of code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, good catch!
Thanks for the support @sudermanjr! Happy to do what I can to help get this one through the review process 😄 |
Signed-off-by: James Oden <james.oden.dev@gmail.com>
Signed-off-by: James Oden <james.oden.dev@gmail.com>
ae97899
to
d008340
Compare
more readable debug log fmt (i64) Co-authored-by: Bjorn Svensson <bjorn.a.svensson@est.tech> Signed-off-by: muang0 <joden1995@gmail.com>
Been🕵️♀️ for a while, lot going on in my 🌎 last 🗓🗓🗓🗓🗓🗓; starting to 🌥⛅️ more lately |
We are also ran into this issue. When can this be merged? Thanks |
I'm also needing this, required for our pipelines to fail on botched helm deployments. Thanks! |
Signed-off-by: Joe Julian <me@joejulian.name>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only the one change requested then this is good to go. I tested it out on my own CI pipeline and it worked perfectly.
pkg/kube/ready.go
Outdated
var ( | ||
// This defaults to true, otherwise we get to a point where | ||
// things will always return false unless one of the objects | ||
// that manages pods has been hit | ||
ok = true | ||
err error | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I, too, would rather do this in a separate PR (the reason I didn't remove them in the first place). Let's keep this PR to the single subject.
... and, of course, CI needs to pass. |
@muang0 can I get write permissions to your repository so that we can revert this final change and get the bug merged in? Thanks |
Thanks for reaching out. I'm planning my next haircut then I'd love to talk :) |
Still waiting on the planned haircut. I've got a full head of hair, BUT I'm used to shaving it mysefl so this one's no biggie. Let's see if the !downers have any suggestions. |
@stefanmcshane added you as contributor on the project-> thanks for raising this concern. Looks like just a couple minor things are needed here. |
Signed-off-by: Stefan McShane <stefanmcshane@users.noreply.github.com>
c5bc836
to
1a3e9a9
Compare
@joejulian Can you re-review with the changes corrected? |
Could this change get merged into 3.13.x? Its a bug and fixes an active issue folks are having. If there is something else blocking the PR from being merged, could a helm contributor highlight what that might be? There are a few folks - myself included - who want the PR merged and are happy to roll up our sleeves to do the work to see that it does. |
Hi @joejulian @mattfarina can we get this merged? |
CI, tests, and building failed after helm#10920 was merged. This change fixes the issues that were introduced. Signed-off-by: Matt Farina <matt.farina@suse.com>
Signed-off-by: James Oden james.oden.dev@gmail.com
What this PR does / why we need it: This patches a bug where helm preemptively marks resources as ready. This is due to a race condition between the helm readiness check and the associated k8s controller updating the resource. I've added generational validation to readiness checks for all applicable resources.
Special notes for your reviewer:
If applicable:
closes #10859
Associated k8s issue
Thanks for the help!