Update request rates, errors, durations dashboard #1302
I don't fully understand what this means, but Grafana did it when I copied the JSON from the dashboard.
`sum` (called "Total" in the Grafana UI) isn't meaningful for Total HTTP Request Errors per second: it adds up the per-second error rate at every point on the graph, which isn't a number of errors. This is why it comes up with odd statistics like support having had a total of 0.4 errors over the last 6 hours. What that probably means is that for one rate interval, support had an error rate of 0.4 errors / second. Best to be consistent and show mean (average) everywhere, I think. I've also added "max", because I find this useful to sort by.
Previously, the expression for average request duration was incorrect (or at least confusing). http_request_duration_seconds was a Prometheus summary, giving quantiles (e.g. the median, or the 95th percentile) for individual apps and individual pods. Averaging these quantiles by job doesn't entirely make sense: if a job has multiple pods, those pods will have different medians, and avg() is going to take the mean across pods of the median request durations, which is almost certainly not what we want. Prometheus histograms track the number of observations and the sum of the observed values, which allows you to calculate the average by dividing the sum by the count, which is what we're doing here.
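The sum-divided-by-count idiom looks something like this (a sketch; the exact label selectors on the dashboard may differ):

```promql
# Mean request duration per job over the rate interval:
# total observed seconds divided by total number of observations.
  sum by (job) (rate(http_request_duration_seconds_sum{namespace="$namespace"}[$__rate_interval]))
/
  sum by (job) (rate(http_request_duration_seconds_count{namespace="$namespace"}[$__rate_interval]))
```

Unlike averaging pre-computed quantiles, this is a true mean: both numerator and denominator are summed across pods before the division.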
http_request_duration_seconds was a Prometheus summary. We switched to Prometheus histograms in alphagov/govuk_app_config#318. The histogram metric is called http_request_duration_seconds_bucket (note the extra `_bucket`). Now that most of the apps don't use the http_request_duration_seconds metric, it's missing most of the app labels, making it the wrong place to query to get the list of apps. Instead, the http_requests_total metric has everything we need. I've also removed the "quantile" variable, as we're no longer using the summary metric which includes quantile as a label.
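As a sketch, the app variable can be populated from the new metric using Grafana's `label_values` helper for the Prometheus datasource (the namespace selector here is an assumption about how the dashboard is parameterised):

```promql
# Variable query for the "app" dashboard variable:
# list every value of the "job" label on http_requests_total.
label_values(http_requests_total{namespace="$namespace"}, job)
```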
The old Request duration per App/Controller/Action panel was very noisy, and possibly inaccurate. Without an app / quantile specified (i.e. by default) it would try to show 2860 series (one series for every combination of app, controller, action _and_ quantile). Even with an app and quantile specified, some apps (e.g. whitehall) have a large number of controllers and actions, so the graph would still be unusably noisy. And even if you managed to find an interesting series, the query itself is still the average (mean) across pods of a number of summaries (medians, 95th percentiles etc.), which isn't a very meaningful statistic. I think it would be more useful to use the histograms (which we have now, thanks to alphagov/govuk_app_config#318) to show a heatmap. Heatmaps show the change in distribution over time, so they give us a visual indication both of how many requests are happening, and how many fall into each request duration bucket. In this case, by default we'll bundle all the requests to all the apps together, and allow segmenting by app using the variable on the dashboard. It is possible (and useful) to segment further (down to the controller / action), but we're probably trying to do too much with a single dashboard by trying that.
The expressions on this dashboard use the `=~` operator to match a regular expression of job. By default, Grafana constructs a string like this: (asset-manager|bouncer|collections|collections-publisher|contacts-admin|content-publisher|content-store-postgresql-branch|content-tagger|draft-collections|draft-content-store-postgresql-branch|draft-frontend|draft-government-frontend|draft-router-api|draft-static|email-alert-api|feedback|finder-frontend|frontend|government-frontend|info-frontend|link-checker-api|manuals-publisher|maslow|publisher|router-api|search-admin|search-api|service-manual-publisher|short-url-manager|smartanswers|specialist-publisher|static|support|whitehall-admin) which works fine, but it's a bit verbose (and maybe not that performant). Since we know we're always using the app variable in `=~` conditions, we can simplify it by using `.*`.
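Concretely, this means having the variable's "All" option expand to a catch-all pattern instead of the full alternation, along these lines (a sketch; Grafana's "Custom all value" variable option is the usual way to do this):

```promql
# With the variable's "All" value set to .* , the selector becomes:
sum by (job) (rate(http_requests_total{namespace="$namespace", job=~".*"}[$__rate_interval]))

# ...instead of job=~"(asset-manager|bouncer|...)" with every app spelled out.
# Both match all jobs; the catch-all is shorter and simpler for the regex engine.
```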
I have other thoughts about this dashboard, but I didn't want to go too far down the rabbit hole. Jotting a few of them down here:
Cool, thanks for fixing this broken dashboard! Been on my to-do list for a while!
```json
"expr": "label_replace(avg without (pod, instance) (http_request_duration_seconds{namespace=\"${namespace}\", job=~\"${app}\", quantile=~\"${quantile}\"}), \"a\", \"$1\", \"job\", \"(.*)\")",
"interval": "",
"legendFormat": "{{job}} {{controller}} {{action}} {{quantile}}",
"expr": "sum by (le) (increase(http_request_duration_seconds_bucket{namespace=\"${namespace}\", job=~\"${app}\"}[$__rate_interval]))",
```
In this one, `increase` actually gives us something other than what we want. This is because `increase` gives the difference in value between the first and last non-null samples of its argument (range vector), whereas we need the per-second rate. In other words, we expect our output to be independent of the chosen sliding window (rate interval), but we were erroneously multiplying each output point by the number of seconds in the rate interval.

You can see this e.g. if you plot just one bucket (say `le="+Inf"`, i.e. cumulative total request count) for some particular job and compare the values to the request rate graph.
```diff
- "expr": "sum by (le) (increase(http_request_duration_seconds_bucket{namespace=\"${namespace}\", job=~\"${app}\"}[$__rate_interval]))",
+ "expr": "sum by (le) (rate(http_request_duration_seconds_bucket{namespace=\"${namespace}\", job=~\"${app}\"}[$__rate_interval]))",
```
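For a fixed window the two functions differ only by a constant factor, which is why the heatmap keeps the same shape either way (a sketch; the 5-minute window is just an example):

```promql
# increase() is (approximately) rate() times the window length in seconds:
#   increase(v[5m])  ≈  rate(v[5m]) * 300
# So for a fixed $__rate_interval, only the scale of the bucket values
# changes between these two queries, not the relative shape:
sum by (le) (rate(http_request_duration_seconds_bucket[5m])) * 300
```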
Oh wait, I think I've got my wires crossed here — our output points should be durations not counts of requests.
Or have I? Ugh brain no worky.
Yeah no I think I was thinking ok in the first post. We're plotting a histogram so our y-axis is a vector of request counts by duration bucket, right? Do check my working though! 😅
I think I could argue it both ways... It's these numbers we're talking about (sloppily underlined in red) I think:
- `increase` gives "the absolute number of requests in this bit of the heatmap"
- `rate` gives "the number of requests per second in this bit of the heatmap"

Because it's a heatmap, rather than a line graph, the size of the bucket is observable in the visualisation, so `increase` still makes sense in context, I think. That said, I think it's highly debatable which is more intuitive... 🤔

I think the argument for using `increase` is strongest for the buckets with a small number of samples:

i.e. (with increase) "there were 4 requests between 03:02 and 03:03 which had a duration of between 5 and 10 seconds"

vs. (with rate) "there were 0.067 requests per second between 03:02 and 03:03 which had a duration of between 5 and 10 seconds".

I think the argument for using `rate` is that it's consistent with other visualisations on the dashboard.
I'm honestly not sure which is better. Interested in your thoughts?
One other thing to note: even with `increase`, the values are not always integers, because `$__rate_interval` is not always an even multiple of the sample interval. So you do get confusing datapoints like "there were 2.67 requests between 03:02 and 03:03".
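A sketch of where the fractional values come from (the numbers here are illustrative): `increase` extrapolates the growth it actually observed out to the full length of the window, so when the samples inside the window span only part of it, the result picks up a non-integer scaling factor.

```promql
# With a 15s scrape interval, a 1m window might contain samples spanning
# only ~45s. If the counter grew by 2 over that span, increase() roughly
# extrapolates: 2 * (60 / 45) ≈ 2.67 "requests" for the window.
increase(http_requests_total{job="support"}[1m])
```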
Ah yeah that's a fair point; I think if we're labelling it clearly like in your last screenshot then it's fine.
These are equivalent, because the constant factor of `$__rate_interval` cancels out, but `rate` is a bit more idiomatic, and it's shorter to type.

Co-authored-by: Chris Banks <chris.banks@digital.cabinet-office.gov.uk>
This is to make it clearer that the values in the buckets are the total number of requests in that bucket and timespan, not the request rate (requests per second).
It looks like Grafana 10 has a few extra defaults. Makes sense to include them in the config we have here, or they'll cause confusion in future diffs.
There are a few changes here, which I've tried to describe in individual commits to make it easier to review.
The request rates, errors, durations dashboard has numerous issues. The most immediately serious one is that the metric it's using to work out which apps exist no longer includes most apps (because of the switch to histograms in alphagov/govuk_app_config#318), but there were also a few issues stemming from the confusing nature of prometheus summaries.
I've also taken the opportunity to switch the request durations visualisation to use a heatmap, which I think is a more useful representation of the information.
You can play with the new version of the dashboard here: https://grafana.eks.production.govuk.digital/d/cdeb0607-02e2-4c51-8b7a-398a1c1e2936/richard-towersat-app-request-rates-errors-durations