Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Add 100/1000s buckets for prometheus workqueue histograms #2638

Merged
merged 1 commit into from Jan 11, 2024

Conversation

seankhliao
Copy link
Contributor

@seankhliao seankhliao commented Jan 5, 2024

Controllers making many external requests in large clusters may have normal operating latency on the order of ~100s. Add buckets that cover the range, allowing metric backends to interpolate percentile estimates properly.

For #2625

@k8s-ci-robot k8s-ci-robot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 5, 2024
@k8s-ci-robot
Copy link
Contributor

Welcome @seankhliao!

It looks like this is your first PR to kubernetes-sigs/controller-runtime 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/controller-runtime has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @seankhliao. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 5, 2024
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jan 5, 2024
Controllers making many external requests in large clusters
may have normal operating latency on the order of ~100s.
Add buckets that cover the range, allowing metric backends
to interpolate percentile estimates properly.
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jan 5, 2024
@troy0820
Copy link
Member

troy0820 commented Jan 5, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 5, 2024
@troy0820
Copy link
Member

troy0820 commented Jan 5, 2024

/retitle ✨ Add 100/1000s buckets for prometheus workqueue histograms

@k8s-ci-robot k8s-ci-robot changed the title ✨ pkg/metrics: add 100/1000s buckets for workqueue histograms ✨ Add 100/1000s buckets for prometheus workqueue histograms Jan 5, 2024
@troy0820
Copy link
Member

troy0820 commented Jan 5, 2024

It can't merge without approval but I want the maintainers to be able to take a look at this to get consensus around this solution.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 5, 2024
@alvaroaleman
Copy link
Member

Controllers making many external requests in large clusters may have normal operating latency on the order of ~100s

Not really, no. This would be indicative of doing long-running and/or blocking operations in your Reconcile which you shouldn't do, use RequeueAfter instead.

@seankhliao
Copy link
Contributor Author

Or you have a poorly written vendor controller, or we just have a lot of work and we're starved of parallelism. Either way, I'd like to be able to observe how poorly it's performing beyond just 10s.

@vincepri
Copy link
Member

vincepri commented Jan 8, 2024

Not opposed to this, @alvaroaleman ? Seems a small improvement?

@sbueringer
Copy link
Member

sbueringer commented Jan 8, 2024

In the past I used controller_runtime_reconcile_time_seconds for this purpose. Exactly because the workqueue metrics are capped at 10 seconds.

ReconcileTime = prometheus.NewHistogramVec(prometheus.HistogramOpts{
Name: "controller_runtime_reconcile_time_seconds",
Help: "Length of time per reconciliation per controller",
Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, 60},
}, []string{"controller"})

(Might be helpful retroactively in any case)

In general no objection to adding additional buckets. I have no idea what the ideal buckets are to be honest.

For what it's worth upstream still only has buckets until 10s: https://github.com/kubernetes/component-base/blob/master/metrics/prometheus/workqueue/metrics.go#L55-L69

@sbueringer
Copy link
Member

sbueringer commented Jan 8, 2024

For folks that are not familiar with metrics it can look pretty confusing that the metrics are capped at 10s (as it looks like the durations are never slower than that)

@troy0820
Copy link
Member

troy0820 commented Jan 8, 2024

In general no objection to adding additional buckets. I have no idea what the ideal buckets are to be honest.

I have no idea what is ideal as well. Initially in the issue, I wasn't trying to encourage more buckets as that may or may not impact performance, however the need for an upper bound outside of 10s seemed to be necessary.

For what it's worth upstream still only has buckets until 10s: https://github.com/kubernetes/component-base/blob/master/metrics/prometheus/workqueue/metrics.go#L55-L69

However, however, to stay consistent with upstream, would this be beneficial to add and not use the other metrics as you proposed?

@sbueringer
Copy link
Member

sbueringer commented Jan 9, 2024

However, however, to stay consistent with upstream, would this be beneficial to add and not use the other metrics as you proposed?

The controller_runtime_reconcile_time_seconds metric only ~ matches one of the workqueue metrics. So it definitely makes sense for folks to also use the work queue metrics. They are also not 100% the same data, just (I think) one of them is very close.

I think performance-wise it's probably fine. I guess it's just 20% more data for the two histogram metrics. As the number of workqueues in a controller is usually not that high that it should become a problem I think it's okay.

(Not 100% sure if I understood "add and not use the other metrics as you proposed?" correctly)

@seankhliao
Copy link
Contributor Author

If we're concerned about data volumes, I think dropping the first 2 or 3 buckets might be ok?
Most code won't have a discernable different between 1ms - 1us

@sbueringer
Copy link
Member

I discussed it with Alvaro. We're fine with the change for now.

A larger issue we have is that all the metric configuration is done in init functions. Ideally we would instead configure metrics during startup (~ manager.Start()). This would make it possible for folks to configure metrics depending on their needs instead of having to live with the hard-coded configuration we have. We'll try to get this improved in the upcoming release cycle. At this point we also might re-evaluate the "default configuration" of the metrics.

/lgtm
/approve
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 11, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 11, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer, seankhliao

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 11, 2024
@sbueringer sbueringer added this to the v0.17.x milestone Jan 11, 2024
@k8s-ci-robot k8s-ci-robot merged commit 0a90173 into kubernetes-sigs:main Jan 11, 2024
15 checks passed
@seankhliao seankhliao deleted the higher-end-buckets branch January 11, 2024 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants