Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ktesting: tone down warning about leaked test goroutine #401

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

pohly
Copy link

@pohly pohly commented Mar 8, 2024

What this PR does / why we need it:

Not terminating immediately is fairly normal: many controllers log "terminating" messages while they shut down, which often is right after test completion, if that is when the test cancels the context and then doesn't wait for goroutines (which is often not possible).

Therefore the warning now only gets printed if that happens after more than 10 seconds since test completion, and then the warning is only printed as an info message, not an error.

Release note:

ktesting: the warning about leaked goroutine is only shown after 10 or more seconds and is no longer an error message.

/cc @aojea @soltysh @kerthcet

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 8, 2024
@k8s-ci-robot
Copy link

This issue is currently awaiting triage.

If klog contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 8, 2024
@pohly
Copy link
Author

pohly commented Mar 11, 2024

Let's fix the linter issues first, see #402

logger.WithCallDepth(1).Error(nil, "WARNING: test kept at least one goroutine running after test completion", "callstack", string(dbg.Stacks(false)))
l.shared.goroutineWarningDone = true
duration := time.Since(l.shared.stopTime)
if duration > stopGracePeriod {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems will bury the real goroutine leakage if the duration is less than stopGracePeriod, do I understand right here?

Another point I can't get here is if the test completes, why will it wait for another 10s?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the "leaked" goroutine stops after less than stopGracePeriod, then it wasn't really a leak. It was "just" a test that didn't block and wait for goroutines at the end. This is very common.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally misunderstood the meaning here. I'm not quite understand how it works but I still have a silly question like: Will we wait and detect the goroutine leakage after test completes? Or why the duration from stop will be greater than stopGracePeriod.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't wait. That is something that uber/goleak does, but not this code. This code is merely reporting on leaked goroutines if and when it detects them, which is when they keep logging.

The stopGracePeriod is the grace period that goroutines have before their log output triggers this warning. I can rename it if you have a better suggestion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recalled goleak has maxRetries, then can't we just configure the goleak? It's configurable now. Both the retry times and sleep duration. I think somehow the same as stopGracePeriod. Will this lead to other problems?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep the warning for goroutines that are more likely to be real leaks while not printing it for goroutines which terminate quickly after test completion.

Definitely agree.

We cannot depend on goleak here

Yes for klog... I mean for callers, like test framework. I suggested in this way is just because the goroutine leakage detection duration is mostly decided by goleak, the maxRetries(default to 20), maxSleep(default to 100ms) specifically, the duration ~= maxRetries * (maxSleep + filterStacks), have no idea whether 10s fits here. However, the feature is reverted... uber-go/goleak#116, so no way to configure now.

As a user of klog, happy with either way to avoid confusion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely agree.

So LGTM?

I mean for callers, like test framework.

Leak detection with goleak only works when tests run sequentially or in TestMain. I'm not sure how useful it is in an E2E test suite, because the actual code under test runs elsewhere. For integration and unit tests, individual test owners have to integrate goleak. Some apiserver-based tests have it.

have no idea whether 10s fits here

So this is about how klog will interact with goleak. I don't have a strong opinion here. I picked 10 seconds because it should be long enough to no produce "false positives" for the "I am shutting down after test completion" goroutines and short enough to give feedback for others. Because it's not causing a test failure, it can be lower than the goleak timeouts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these 10s added to the test duration? 10s sounds like a lot to me

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they are not added. At the time that this gets called, the test has already completed.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 21, 2024
@pohly
Copy link
Author

pohly commented Jun 10, 2024

/assign @dims

For approval. A new release for Kubernetes 1.31 also makes sense.

@pohly
Copy link
Author

pohly commented Jun 10, 2024

Hmm, we want a few more things for a new release, like #404

@pohly
Copy link
Author

pohly commented Jun 10, 2024

/retest

Does that work for GitHub actions? The linter wasn't happy, for some reason.

@pohly
Copy link
Author

pohly commented Jun 10, 2024

Linter failures fixed in main.

@@ -250,8 +266,12 @@ func (l tlogger) fallbackLogger() logr.Logger {
logger = logger.WithCallDepth(l.shared.callDepth + 1)

if !l.shared.goroutineWarningDone {
logger.WithCallDepth(1).Error(nil, "WARNING: test kept at least one goroutine running after test completion", "callstack", string(dbg.Stacks(false)))
l.shared.goroutineWarningDone = true
duration := time.Since(l.shared.stopTime)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this access need to be locked?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Good catch, thanks.

/lgtm cancel

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrrm, no. fallbackLogger gets called while the mutex is locked. This wasn't clear here, so I've added a comment explaining that (https://github.com/kubernetes/klog/compare/83145e470293dc67bede46353012cededd97e031..ed62027e604261af3e9c42c7a50092a0db074f19).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea: Okay for another LGTM?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 11, 2024
@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kerthcet
Once this PR has been reviewed and has the lgtm label, please ask for approval from dims. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 11, 2024
@pohly
Copy link
Author

pohly commented Jun 11, 2024

/hold

For second approval.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 11, 2024
Not terminating immediately is fairly normal: many controllers
log "terminating" messages while they shut down, which often is
right after test completion, if that is when the test cancels the
context and then doesn't wait for goroutines (which is often
not possible).

Therefore the warning now only gets printed if that happens after more than 10
seconds since test completion, and then the warning is only printed as an info
message, not an error.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants