Job controller keeps logging panics #121392
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig apps
Do you have any idea which variable is nil here?
Oops, sorry, it's the deletionTimestamp.
Is it actually a panic though? It seems that the logger is rendering a nil value as a `<panic: ...>` string rather than actually crashing. I think we could use /priority important-longterm
Possibly this is just the logger indeed. Still, it is important imo because it makes looking for real panics in logs hard. This was my use case when trying to understand another issue.
The logger sees a value that implements `fmt.Stringer` and calls its `String` method, which panics on the nil pointer; the panic is recovered and rendered as the `<panic: ...>` placeholder. We've gone back and forth on whether the logging call should try to call the `String` method at all in such cases.
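A minimal sketch of the recover-and-substitute behavior described above (my reconstruction for illustration, not klog's actual code): the logger calls `String`, recovers from any panic, and emits a `<panic: ...>` marker as the value, matching the log line quoted in the issue body.

```go
package main

import "fmt"

// badStringer panics in String, like metav1.Time's String method does
// when called through a nil pointer (here simulated with a nil *int).
type badStringer struct{ t *int }

func (b badStringer) String() string { return fmt.Sprint(*b.t) }

// safeString mimics (in spirit) what the structured logger does:
// call String, recover from any panic, and substitute a marker value
// instead of crashing the whole process.
func safeString(s fmt.Stringer) (out string) {
	defer func() {
		if r := recover(); r != nil {
			out = fmt.Sprintf("<panic: %v>", r)
		}
	}()
	return s.String()
}

func main() {
	// The process survives; only the log value shows the panic marker.
	fmt.Println(safeString(badStringer{}))
}
```

This explains why the e2e logs contain the string "panic" even though no controller actually crashed.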
I can work on this 😊
@kaisoz: how do you want to solve this? Besides some potential klog improvement, fixing https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#Time to not crash for a nil pointer would be good. https://github.com/kubernetes/apimachinery/blob/v0.28.3/pkg/apis/meta/v1/generated.pb.go#L4762 already does this.
Hi @pohly! Sorry I didn't answer yesterday. I like the idea of doing the check at the type level, and we already have a precedent.
When logging an object's field that implements `fmt.Stringer`, the `String` method gets called even when the underlying pointer is nil, because a typed nil pointer still satisfies the interface.
@mimowo the changes in klog are merged, but not yet in a tagged release.
We usually do one release for each Kubernetes release, if there are changes. |
However, in this case the next release this could go into is 1.30, right? This would be unfortunate because we wouldn't be able to fix 1.29 with the klog changes. If this is the case, I think we may need to do the quick `if` check in `FilterActivePods` instead.
cc @dims, is there a chance to do another release of klog, so that we can use it in a patch release of k/k 1.29?
For 1.29, https://github.com/kubernetes/kubernetes/pull/121554/files is a better fix than rushing a new klog release. But is this critical enough to ask the release team for an exception?
It only affects developers who bump the verbosity up to >= 4.
So we can keep 1.29 as is, or do the quick fix in `FilterActivePods`. I would suggest doing something about it, because it makes our log artifacts less useful for detecting real panics: a simple grep does not work anymore. @alculquicondor do you think it makes sense to patch 1.29?
Oh, yes, probably it is not critical enough; I keep forgetting about the >= 4 level.
It's definitely not critical. It's not even a real panic... it's just a log line that says "panic".
@alculquicondor so better to wait for the klog release then?
I would wait, given that the cherry-pick is not justified.
What happened?
The Job controller, and possibly other controllers, keep logging panics from this line in `FilterActivePods` (kubernetes/pkg/controller/controller_utils.go, line 955 at commit 349b856).
This happens during e2e tests, and I think it happens on production as well.
Here is an example from a successful build:
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/121302/pull-kubernetes-e2e-kind/1715287183352401920/artifacts/kind-control-plane/containers/kube-controller-manager-kind-control-plane_kube-system_kube-controller-manager-ec8b5cc2095bc6b1bdbfe61f132b3d493dea09ab0808935b59e10dcc5ffe1082.log
2023-10-20T09:10:28.016158573Z stderr F I1020 09:10:28.016037 1 controller_utils.go:955] "Ignoring inactive pod" pod="ttlafterfinished-3394/rand-non-local-ghcvx" phase="Failed" deletionTime="<panic: runtime error: invalid memory address or nil pointer dereference>"
What did you expect to happen?
No panics during e2e tests from this line in `FilterActivePods`.

How can we reproduce it (as minimally and precisely as possible)?
Run e2e or integration tests for the job controller.
Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)