Job controller keeps logging panics #121392
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig apps
Do you have any idea which variable is nil here?
Oops, sorry, it's the deletionTimestamp.
Is it actually a panic though? It seems that the logger is rendering a nil value as a `<panic: ...>` string rather than actually crashing. I think we could use /priority important-longterm
Possibly this is just the logger indeed. Still, it is important imo because it makes looking for real panics in logs hard. This was my use case when trying to understand another issue.
The logger sees a value that implements `fmt.Stringer` and calls its `String` method, which panics on the nil pointer; the panic is recovered and rendered as the `<panic: ...>` placeholder. We've gone back and forth on whether the logging call should try to call the `String` method at all in such cases.
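A minimal sketch of the recover-and-substitute behavior described above (my reconstruction for illustration, not klog's actual code): the logger calls `String`, recovers from any panic, and emits a `<panic: ...>` marker as the value, matching the log line quoted in the issue body.

```go
package main

import "fmt"

// badStringer panics in String, like metav1.Time's String method does
// when called through a nil pointer (here simulated with a nil *int).
type badStringer struct{ t *int }

func (b badStringer) String() string { return fmt.Sprint(*b.t) }

// safeString mimics (in spirit) what the structured logger does:
// call String, recover from any panic, and substitute a marker value
// instead of crashing the whole process.
func safeString(s fmt.Stringer) (out string) {
	defer func() {
		if r := recover(); r != nil {
			out = fmt.Sprintf("<panic: %v>", r)
		}
	}()
	return s.String()
}

func main() {
	// The process survives; only the log value shows the panic marker.
	fmt.Println(safeString(badStringer{}))
}
```

This explains why the e2e logs contain the string "panic" even though no controller actually crashed.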
I can work on this 😊
@kaisoz: how do you want to solve this? Besides some potential klog improvement, fixing https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#Time to not crash for a nil pointer would be good. https://github.com/kubernetes/apimachinery/blob/v0.28.3/pkg/apis/meta/v1/generated.pb.go#L4762 already does this.
Hi @pohly! Sorry I didn't answer yesterday. I like the idea of doing the check at the type level, and we already have a precedent.
When logging an object's field that implements `fmt.Stringer`, the `String` method gets called even when the underlying pointer is nil, because a typed nil pointer still satisfies the interface.
@mimowo the changes in klog are merged, but not yet in a tagged release.
We usually do one release for each Kubernetes release, if there are changes. |
However, in this case the next release this could go into is 1.30, right? This would be unfortunate because we wouldn't be able to fix 1.29 with the klog changes. If this is the case, I think we may need to do the quick `if` check in `FilterActivePods` instead.
cc @dims, is there a chance to do another release of klog, so that we can use it in a patch release of k/k 1.29?
For 1.29, https://github.com/kubernetes/kubernetes/pull/121554/files is a better fix than rushing a new klog release. But is this critical enough to ask the release team for an exception?
It only affects developers who bump the verbosity up to >= 4.
So we can keep 1.29 as is, or do the quick fix in `FilterActivePods`. I would suggest doing something about it, because it makes our log artifacts less useful for detecting real panics: a simple grep does not work anymore. @alculquicondor do you think it makes sense to patch 1.29?
Oh, yes, probably it is not critical enough; I keep forgetting about the >= 4 level.
It's definitely not critical. It's not even a real panic... it's just a log line that says "panic".
@alculquicondor so better to wait for the klog release then?
I would wait, given that the cherry-pick is not justified.
What happened?
The Job controller, and possibly other controllers, keep logging panics from this line in `FilterActivePods` (kubernetes/pkg/controller/controller_utils.go, line 955 at commit 349b856).
This happens during e2e tests, and I think it happens on production as well.
Here is an example from a successful build:
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/121302/pull-kubernetes-e2e-kind/1715287183352401920/artifacts/kind-control-plane/containers/kube-controller-manager-kind-control-plane_kube-system_kube-controller-manager-ec8b5cc2095bc6b1bdbfe61f132b3d493dea09ab0808935b59e10dcc5ffe1082.log
2023-10-20T09:10:28.016158573Z stderr F I1020 09:10:28.016037 1 controller_utils.go:955] "Ignoring inactive pod" pod="ttlafterfinished-3394/rand-non-local-ghcvx" phase="Failed" deletionTime="<panic: runtime error: invalid memory address or nil pointer dereference>"
What did you expect to happen?
No panics during e2e tests from this line in `FilterActivePods`.

How can we reproduce it (as minimally and precisely as possible)?
Run e2e or integration tests for the job controller.
Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)