Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify generation in readiness checks #10920

Merged
merged 12 commits into from Jan 9, 2024

Conversation

muang0
Copy link
Contributor

@muang0 muang0 commented May 2, 2022

Signed-off-by: James Oden james.oden.dev@gmail.com

What this PR does / why we need it: This patches a bug where helm preemptively marks resources as ready. This is due to a race condition between the helm readiness check and the associated k8s controller updating the resource. I've added generational validation to readiness checks for all applicable resources.

Special notes for your reviewer:

If applicable:

  • this PR contains documentation
  • this PR contains unit tests
  • this PR has been tested for backwards compatibility

closes #10859
Associated k8s issue
Thanks for the help!

Signed-off-by: James Oden <james.oden.dev@gmail.com>
@pull-request-size pull-request-size bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 2, 2022
@muang0
Copy link
Contributor Author

muang0 commented May 13, 2022

@joejulian Anything I can do on my end to help get this one triaged? Thanks for the help!

@muang0
Copy link
Contributor Author

muang0 commented May 18, 2022

@mattfarina @bacongobbler @yxxhero Anything I can do on my end to help get this one triaged? Thanks for the help!

@muang0
Copy link
Contributor Author

muang0 commented May 31, 2022

@zonggen @joejulian @yxxhero Anything I can do on my end to help get this one triaged? Thanks for the help!

@yxxhero
Copy link
Member

yxxhero commented May 31, 2022

@muang0 need more comments for review. Thanks very much.

@muang0
Copy link
Contributor Author

muang0 commented May 31, 2022

@muang0 need more comments for review. Thanks very much.

@yxxhero thanks for the response! Do you mind clarifying what you mean? Do you need me to provide more clarification on this PR in a comment?

A concise summary: During resource readiness checks, helm doesn't utilize the k8s generation/observedGeneration fields. This creates an opening for a race condition between helm and the associated k8s controller. In this case, a resource is incorrectly marked as ready by helm (before the associated k8s controller picks up the changes to the spec from helm). If we add generational checks to helm readiness validation, then helm would notice the generation & observedGeneration are out of sync, and avoid this race condition entirely.

What we are observing as a result: While using the wait flag, helm doesn't wait for the resources to be created but rather immediately marks resource creation as complete and finishes the install.

k8s api-conventions description of observedGeneration & generation: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md?plain=1#L476-L479

@muang0
Copy link
Contributor Author

muang0 commented Jun 3, 2022

Hi @yxxhero any action I should take from my end? Thanks for the help!

@yxxhero
Copy link
Member

yxxhero commented Jun 3, 2022

@muang0 In this weekend. I will review this.

@muang0
Copy link
Contributor Author

muang0 commented Jun 8, 2022

Hi @yxxhero, hope you're having a good week! So sorry to bother, just bumping this so it doesn't fall between the cracks :) It's a small change at its core, most of the lines are unit tests.

@muang0
Copy link
Contributor Author

muang0 commented Jun 15, 2022

@yxxhero I hate to bother you 😅 this PR is really a small change at its core, shouldn't take long to review/triage. You'd be helping us out a ton by helping push this one along 😄

@yxxhero
Copy link
Member

yxxhero commented Jun 15, 2022

@muang0 those changes are valuable. but The code changes a lot, I need more time to review. Thanks very much.

@muang0
Copy link
Contributor Author

muang0 commented Jun 28, 2022

Gentle bump 😀

@Dentrax
Copy link

Dentrax commented Jul 6, 2022

This PR is what exactly I was looking for. I was playing with Kind cluster using Helm and noticed --wait does not as expected, so found the following comment: #10061 (comment)

And ended up this PR eventually. So I tested on my local env:

$ gh pr checkout 10920
$ make build

$ cat <<EOF | kind create cluster --name helm-test --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker
EOF

# The next two commands does not wait
$ helm upgrade fluent-bit fluent/fluent-bit --install --wait
$ helm upgrade --wait --reuse-values --set flush=3 fluent-bit fluent/fluent-bit

# Now test with this PR
$ ./bin/helm upgrade --wait --reuse-values --set flush=1 fluent-bit fluent/fluent-bit

Works like a charm, great job @muang0! 👏

Any possibility to make review process a bit faster to keep time window short to get this merge to main branch?

Kind ping @yxxhero @mattfarina @bacongobbler 🤗

@muang0
Copy link
Contributor Author

muang0 commented Jul 14, 2022

Kind ping :)

@muang0
Copy link
Contributor Author

muang0 commented Jul 28, 2022

Kind ping, plz fam, it's not a complex change 😅

@rmendal-Harness
Copy link

We've run into this issue too. Would really appreciate it if this got reviewed please and thanks.

@muang0
Copy link
Contributor Author

muang0 commented Aug 23, 2022

@dnwe Thank you greatly for your contribution!! 🎆 Your statefulset generational readiness check is valuable work 😄
7c74f1d

@yxxhero @mattfarina @bacongobbler I think this type of generational check would be valuable for other resources as well. In this PR I've applied generational readiness checks for all applicable resources. If you think this is something helm can benefit from, and willing to commit to reviewing these changes, I'm happy resolve my merge conflicts. Just let me know!

@rmendal-Harness @Dentrax This will likely be the last activity on this PR from me unless maintainers are interested in this PR. Thank you for the shoutouts 💯

Copy link

@sudermanjr sudermanjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. The test coverage is great, and I was able to make the tests fail. This seems to me to be the correct way to check for readiness. I left one small nit inline about a comment.

I'm still digging into the logic on the IsRead, specifically the removal of the variable declaration at the top. I don't quite understand why it was there in the first place, so I'm not certain removing it is the right move. Will keep looking

Also, it looks like this could use a rebase. I can bring this to the next community meeting if you want to do that and address the comment.

@@ -272,6 +284,16 @@ func (c *ReadyChecker) volumeReady(v *corev1.PersistentVolumeClaim) bool {
}

func (c *ReadyChecker) deploymentReady(rs *appsv1.ReplicaSet, dep *appsv1.Deployment) bool {
// Verify the generation observed by the replicaSet controller matches the spec generation

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a duplicate comment of the one below, and doesn't apply to this line of code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, good catch!

@muang0
Copy link
Contributor Author

muang0 commented Sep 1, 2022

Thanks for the support @sudermanjr! Happy to do what I can to help get this one through the review process 😄

James Oden added 2 commits August 31, 2022 22:01
Signed-off-by: James Oden <james.oden.dev@gmail.com>
Signed-off-by: James Oden <james.oden.dev@gmail.com>
more readable debug log fmt (i64)

Co-authored-by: Bjorn Svensson <bjorn.a.svensson@est.tech>
Signed-off-by: muang0 <joden1995@gmail.com>
@muang0
Copy link
Contributor Author

muang0 commented Feb 26, 2023

Been🕵️‍♀️ for a while, lot going on in my 🌎 last 🗓🗓🗓🗓🗓🗓; starting to 🌥⛅️ more lately
❤️👀👀the_activity_on_this_🧵

@RemusBizu
Copy link

We are also ran into this issue. When can this be merged? Thanks

@mbarcia
Copy link

mbarcia commented Apr 27, 2023

I'm also needing this, required for our pipelines to fail on botched helm deployments. Thanks!

@joejulian joejulian modified the milestones: 3.12.0, 3.13.0 May 5, 2023
Signed-off-by: Joe Julian <me@joejulian.name>
Copy link
Contributor

@joejulian joejulian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the one change requested then this is good to go. I tested it out on my own CI pipeline and it worked perfectly.

Comment on lines 92 to 98
var (
// This defaults to true, otherwise we get to a point where
// things will always return false unless one of the objects
// that manages pods has been hit
ok = true
err error
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I, too, would rather do this in a separate PR (the reason I didn't remove them in the first place). Let's keep this PR to the single subject.

@joejulian joejulian self-assigned this May 11, 2023
@joejulian
Copy link
Contributor

... and, of course, CI needs to pass.

@stefanmcshane
Copy link
Contributor

@muang0 can I get write permissions to your repository so that we can revert this final change and get the bug merged in? Thanks

@muang0
Copy link
Contributor Author

muang0 commented Jun 27, 2023

Thanks for reaching out. I'm planning my next haircut then I'd love to talk :)

@muang0
Copy link
Contributor Author

muang0 commented Jul 10, 2023

Still waiting on the planned haircut. I've got a full head of hair, BUT I'm used to shaving it mysefl so this one's no biggie. Let's see if the !downers have any suggestions.

@muang0
Copy link
Contributor Author

muang0 commented Jul 10, 2023

@stefanmcshane added you as contributor on the project-> thanks for raising this concern. Looks like just a couple minor things are needed here.

Signed-off-by: Stefan McShane <stefanmcshane@users.noreply.github.com>
@stefanmcshane
Copy link
Contributor

@joejulian Can you re-review with the changes corrected?

@joejulian joejulian added the Has One Approval This PR has one approval. It still needs a second approval to be merged. label Aug 16, 2023
@mattfarina mattfarina modified the milestones: 3.13.0, 3.14.0 Sep 25, 2023
@jose-fully-ported
Copy link

Could this change get merged into 3.13.x? Its a bug and fixes an active issue folks are having.

If there is something else blocking the PR from being merged, could a helm contributor highlight what that might be? There are a few folks - myself included - who want the PR merged and are happy to roll up our sleeves to do the work to see that it does.

@stefanmcshane
Copy link
Contributor

Hi @joejulian @mattfarina can we get this merged?

@mattfarina mattfarina merged commit b299359 into helm:main Jan 9, 2024
2 checks passed
mattfarina added a commit to mattfarina/helm that referenced this pull request Jan 9, 2024
CI, tests, and building failed after helm#10920 was merged. This change
fixes the issues that were introduced.

Signed-off-by: Matt Farina <matt.farina@suse.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Has One Approval This PR has one approval. It still needs a second approval to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug- race condition due to unused resource generation metadata/spec