higher buckets for workqueue histograms #2625

Closed · seankhliao opened this issue Dec 18, 2023 · 7 comments
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
kind/support: Categorizes issue or PR as a support question.


@seankhliao (Contributor)

The workqueue subsystem exposes several metrics, including the workqueue_queue_duration_seconds and workqueue_work_duration_seconds histograms.
These currently use exponential buckets going from 10ns to 10s.

// Current definitions: ExponentialBuckets(10e-9, 10, 10) produces ten buckets
// from 10ns to 10s (10ns, 100ns, 1µs, 10µs, 100µs, 1ms, 10ms, 100ms, 1s, 10s),
// plus the implicit +Inf bucket.
latency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: WorkQueueSubsystem,
	Name:      QueueLatencyKey,
	Help:      "How long in seconds an item stays in workqueue before being requested",
	Buckets:   prometheus.ExponentialBuckets(10e-9, 10, 10),
}, []string{"name"})

workDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: WorkQueueSubsystem,
	Name:      WorkDurationKey,
	Help:      "How long in seconds processing an item from workqueue takes.",
	Buckets:   prometheus.ExponentialBuckets(10e-9, 10, 10),
}, []string{"name"})

I think these should go up a further two steps to include 100s and 1000s buckets (and maybe drop the lowest two buckets if we're concerned about the total count), especially for workqueue_queue_duration_seconds, since a stuck or poorly implemented controller can have quite high latency in processing reconciliation requests. A sketch of what that could look like follows.
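
A minimal sketch of what the extended definition might look like (the exact bucket values here are illustrative, not taken from any merged change); it drops the two lowest buckets and extends the top end by two steps:

// Hypothetical extended definition (illustrative only): drop the 10ns and
// 100ns buckets and add 100s and 1000s at the top.
// ExponentialBuckets(1e-6, 10, 10) yields:
// 1µs, 10µs, 100µs, 1ms, 10ms, 100ms, 1s, 10s, 100s, 1000s (plus +Inf).
latency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: WorkQueueSubsystem,
	Name:      QueueLatencyKey,
	Help:      "How long in seconds an item stays in workqueue before being requested",
	Buckets:   prometheus.ExponentialBuckets(1e-6, 10, 10),
}, []string{"name"})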

@troy0820 (Member)

/kind support

@k8s-ci-robot k8s-ci-robot added the kind/support Categorizes issue or PR as a support question. label Dec 18, 2023
@troy0820 (Member) commented Jan 5, 2024

/kind feature

Would increasing the bucket range be beneficial if that limit is never reached? I'm not certain why the current sizes were picked, but 1000s seems a bit long and arbitrary to define for something that may never hit that limit. I'm interested in understanding why these upper bounds would be helpful beyond the need to capture high latency in reconciliation requests.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 5, 2024
@seankhliao (Contributor, Author)

You'd want the last defined bucket to be past the upper bound of what can be considered "reasonable" values, so you can estimate where the top end of the observed values lies (monitoring systems estimate percentiles by interpolating between bucket bounds). The +Inf bucket captures everything, but isn't useful for interpolation since its upper bound is infinity.

We regularly see "normal" queue latency on the order of 1-2 minutes for our larger clusters with controllers that need to make many external calls (not a great situation to be in), so placing the upper bound at the next step up (1000s) seems reasonable (and hopefully nothing normally operates at 1000s latency).

Aside: if Prometheus ever stabilizes its native histograms (true exponential histograms), we wouldn't need to choose these bucket bounds so carefully.
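
To make the interpolation point concrete, here is a small self-contained Go sketch (hypothetical bucket data, not from controller-runtime) showing how a histogram_quantile-style estimate gets clamped to the highest finite bucket bound once the requested quantile falls into the +Inf bucket:

package main

import "fmt"

// estimateQuantile mimics how Prometheus' histogram_quantile works: find the
// bucket containing the target rank and interpolate linearly between its
// bounds. When the target falls into the +Inf bucket, the estimate is clamped
// to the highest finite upper bound, so anything beyond it is invisible.
func estimateQuantile(q float64, bounds []float64, cumCounts []float64) float64 {
	total := cumCounts[len(cumCounts)-1]
	rank := q * total
	for i, c := range cumCounts {
		if c < rank {
			continue
		}
		if i == len(bounds) { // landed in the +Inf bucket
			return bounds[len(bounds)-1]
		}
		lower, prev := 0.0, 0.0
		if i > 0 {
			lower, prev = bounds[i-1], cumCounts[i-1]
		}
		return lower + (bounds[i]-lower)*(rank-prev)/(c-prev)
	}
	return bounds[len(bounds)-1]
}

func main() {
	// Finite upper bounds ending at 10s; the last cumulative count is +Inf.
	bounds := []float64{1, 10}
	// 50 observations <= 1s, 80 <= 10s, 100 total: 20 observations were
	// slower than 10s (e.g. a stuck controller taking minutes per item).
	cumCounts := []float64{50, 80, 100}

	// p75 falls inside the 1s..10s bucket and interpolates to 8.5s.
	fmt.Println(estimateQuantile(0.75, bounds, cumCounts))
	// p99 falls into the +Inf bucket, so the estimate is stuck at 10s even
	// though the real latencies may have been hundreds of seconds.
	fmt.Println(estimateQuantile(0.99, bounds, cumCounts))
}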

@troy0820 (Member) commented Jan 5, 2024

I don't believe this will have any implications that would break anything from before, so I feel this is a reasonable request.

@sbueringer WDYT?

With this proposal to increase the buckets, pending buy-in from the maintainers, do you want to do the PR? @seankhliao

seankhliao added a commit to seankhliao/controller-runtime that referenced this issue Jan 5, 2024
Controllers making many external requests in large clusters
may have normal operating latency on the order of ~100s.
Add buckets that cover the range, allowing metric backends
to interpolate percentile estimates properly.

Fixes kubernetes-sigs#2625
@seankhliao (Contributor, Author)

sure, I've sent #2638

@troy0820 (Member)

/close

Resolved #2638

@k8s-ci-robot (Contributor)

@troy0820: Closing this issue.

In response to this:

/close

Resolved #2638

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
