Update request rates, errors, durations dashboard #1302

Merged: 9 commits into main on Oct 19, 2023

Conversation

@richardTowers (Contributor) commented Oct 4, 2023

There are a few changes here, which I've tried to describe in individual commits to make it easier to review.

The request rates, errors, durations dashboard has numerous issues. The most immediately serious one is that the metric it's using to work out which apps exist no longer includes most apps (because of the switch to histograms in alphagov/govuk_app_config#318), but there are also a few issues stemming from the confusing nature of Prometheus summaries.

I've also taken the opportunity to switch the request durations visualisation to use a heatmap, which I think is a more useful representation of the information.

[Before / after screenshots (note: different time span)]

You can play with the new version of the dashboard here: https://grafana.eks.production.govuk.digital/d/cdeb0607-02e2-4c51-8b7a-398a1c1e2936/richard-towersat-app-request-rates-errors-durations

I don't fully understand what this means, but Grafana did it when I
copied the JSON from the dashboard.

sum (called "Total" in the Grafana UI) isn't meaningful for Total HTTP
Request Errors per second: it's the per-second error rate summed across
the points on the graph, i.e. the average rate multiplied by the number
of points rather than by the number of seconds, so it isn't a count of
errors. This is why it comes up with odd statistics like support having
had a total of 0.4 errors over the last 6 hours. What that probably means
is that for one rate interval, support had an error rate of 0.4 errors /
second.
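
For illustration, a sketch (the metric name and `code` label here are
assumptions, not taken from this dashboard's JSON): the panel plots a
per-second rate, and Grafana's "Total" simply sums those plotted values.
A real count of errors over the visible window would need `increase`
over the whole range instead:

    # What the panel plots: a per-second error rate
    sum by (job) (rate(http_requests_total{code=~"5..", job=~"${app}"}[$__rate_interval]))

    # A true "total errors over the window" (Grafana's $__range)
    sum by (job) (increase(http_requests_total{code=~"5..", job=~"${app}"}[$__range]))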

Best to be consistent and show mean (average) everywhere, I think. I've
also added "max", because I find this useful to sort by.

Previously, the expression for average request duration was incorrect
(or at least confusing).

http_request_duration_seconds was a Prometheus summary, giving quantiles
(e.g. the median, or the 95th percentile) for individual apps and
individual pods. Averaging these quantiles by job doesn't entirely make
sense - if a job has multiple pods, those pods will have different
medians, and avg() is going to take the mean across pods of the median
request durations. That's almost certainly not what we want.

Prometheus histograms track the number of observations and the sum of
the observed values, which allows you to calculate the average by
dividing the sum by the count. Which is what we're doing here.
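
As a sketch (label filters assumed to match the rest of this dashboard),
the average-duration expression becomes:

    # Mean request duration: rate of summed durations / rate of counts
    sum by (job) (rate(http_request_duration_seconds_sum{namespace="${namespace}", job=~"${app}"}[$__rate_interval]))
      /
    sum by (job) (rate(http_request_duration_seconds_count{namespace="${namespace}", job=~"${app}"}[$__rate_interval]))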
http_request_duration_seconds was a Prometheus summary. We switched to
Prometheus histograms in alphagov/govuk_app_config#318.

The histogram metric is called http_request_duration_seconds_bucket
(note the extra `_bucket`). Now that most of the apps don't use the
http_request_duration_seconds metric, it's missing most of the app
labels, making it the wrong place to query to get the list of apps.

Instead, the http_requests_total metric has everything we need.

I've also removed the "quantile" variable, as we're no longer using the
summary metric which includes quantile as a label.
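
For illustration, the app variable's query might now look something like
this (a sketch; the exact query in the dashboard JSON may differ):

    label_values(http_requests_total{namespace="${namespace}"}, job)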
The old Request duration per App/Controller/Action panel was very noisy,
and possibly inaccurate.

Without an app / quantile specified (i.e. by default) it would try to
show 2860 series (one series for every combination of app, controller,
action _and_ quantile).

Even with an app and quantile specified, some apps (e.g. whitehall) have
a large number of controllers and actions, so the graph would still be
unusably noisy.

And even if you managed to find an interesting series, the query itself
is still the average (mean) across pods of a number of summaries
(medians, 95th percentiles etc.), which isn't a very meaningful
statistic.

I think it would be more useful to use the histograms (which we have
now, thanks to alphagov/govuk_app_config#318) to
show a heatmap.

Heatmaps show the change in distribution over time, so they give us a
visual indication both of how many requests are happening, and how many
fall into each request duration bucket. In this case, by default we'll
bundle all the requests to all the apps together, and allow segmenting
by app using the variable on the dashboard. It is possible (and useful)
to segment further (down to the controller / action), but that would
probably be trying to do too much with a single dashboard.
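
Concretely, the heatmap query takes this shape (this is the expression
that appears in the panel JSON discussed in the review below):

    sum by (le) (increase(http_request_duration_seconds_bucket{namespace="${namespace}", job=~"${app}"}[$__rate_interval]))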
The expressions on this dashboard use the `=~` operator to match a
regular expression against the job label. By default, Grafana constructs
a string like this:

    (asset-manager|bouncer|collections|collections-publisher|contacts-admin|content-publisher|content-store-postgresql-branch|content-tagger|draft-collections|draft-content-store-postgresql-branch|draft-frontend|draft-government-frontend|draft-router-api|draft-static|email-alert-api|feedback|finder-frontend|frontend|government-frontend|info-frontend|link-checker-api|manuals-publisher|maslow|publisher|router-api|search-admin|search-api|service-manual-publisher|short-url-manager|smartanswers|specialist-publisher|static|support|whitehall-admin)

which works fine, but it's a bit verbose (and maybe not that
performant). Since we know we're always using the app variable in `=~`
conditions, we can simplify it by using `.*`.
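
As a sketch: with the app variable's "Custom all value" option set to
`.*`, a selector like

    sum by (job) (rate(http_requests_total{job=~"${app}"}[$__rate_interval]))

expands to job=~".*" when "All" is selected, instead of the long
alternation above. ("Custom all value" is a standard Grafana variable
option; whether this dashboard uses it is an assumption here.)
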
@richardTowers force-pushed the update-rate-error-duration-dashboard branch from 6aed88d to 63bb070 on October 4, 2023 16:46
@richardTowers changed the title from "WIP - update request rates, errors, durations dashboard" to "Update request rates, errors, durations dashboard" on Oct 4, 2023
@richardTowers marked this pull request as ready for review on October 4, 2023 16:59
@richardTowers (Contributor, Author) commented:

I have other thoughts about this dashboard, but I didn't want to go too far down the rabbit hole.

Jotting a few of them down here:

  • It's probably confusing bundling up request counts / durations across all the publishing apps, frontends and APIs - is it really sensible to have one panel (or even one dashboard) trying to cover all of these things? It would be cool if we could add a label to the metrics which categorised which kind of thing something was (e.g. whitehall is a publishing app, publishing-api is an internal API, frontend is a frontend) to help group these things onto more focused dashboards
  • Full width graphs means we're not using vertical space very effectively. My beautiful heat map is below the fold on my screen 😭. I think it would look better as a 2 x 2 - requests and errors on the top row, average durations and heatmap on the bottom.
  • HTTP Errors is not very useful without a filter on, because it's absolutely dominated by 404s (most of which are totally legit responses, not errors at all)
  • It would be nice to link out to other dashboards where you could drill down into more detail. E.g. the heatmaps are much more useful if you drill down to a single controller

@sengi (Contributor) left a comment:

Cool, thanks for fixing this broken dashboard! Been on my to-do list for a while!

Review threads on charts/monitoring-config/dashboards/app-requests.json (outdated, resolved):

"expr": "label_replace(avg without (pod, instance) (http_request_duration_seconds{namespace=\"${namespace}\", job=~\"${app}\", quantile=~\"${quantile}\"}), \"a\", \"$1\", \"job\", \"(.*)\")",
"interval": "",
"legendFormat": "{{job}} {{controller}} {{action}} {{quantile}}",
"expr": "sum by (le) (increase(http_request_duration_seconds_bucket{namespace=\"${namespace}\", job=~\"${app}\"}[$__rate_interval]))",
@sengi (Contributor) commented Oct 11, 2023:

In this one, increase actually gives us something other than what we want. This is because increase gives the difference in value between the first and last non-null samples of its argument (range vector) whereas we need the per-second rate. In other words, we expect our output to be independent of the chosen sliding window (rate interval) but we were erroneously multiplying each output point by the number of seconds in the rate interval.

You can see this e.g. if you plot just one bucket (say le="+Inf" i.e. cumulative total request count) for some particular job and compare the values to the request rate graph.
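
A sketch of that check, reusing names from elsewhere in this PR (the
+Inf bucket counts every observed request, so its rate should track the
overall request rate):

    # Requests per second derived from the histogram's +Inf bucket
    sum(rate(http_request_duration_seconds_bucket{le="+Inf", job=~"${app}"}[$__rate_interval]))

    # Requests per second from the counter; should roughly match
    sum(rate(http_requests_total{job=~"${app}"}[$__rate_interval]))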

Suggested change (current, then suggested):

    "expr": "sum by (le) (increase(http_request_duration_seconds_bucket{namespace=\"${namespace}\", job=~\"${app}\"}[$__rate_interval]))",
    "expr": "sum by (le) (rate(http_request_duration_seconds_bucket{namespace=\"${namespace}\", job=~\"${app}\"}[$__rate_interval]))",

@sengi (Contributor) commented Oct 11, 2023:

Oh wait, I think I've got my wires crossed here — our output points should be durations not counts of requests.

@sengi (Contributor) commented:

Or have I? Ugh brain no worky.

@sengi (Contributor) commented Oct 11, 2023:

Yeah no I think I was thinking ok in the first post. We're plotting a histogram so our y-axis is a vector of request counts by duration bucket, right? Do check my working though! 😅

@richardTowers (Contributor, Author) commented:

I think I could argue it both ways... It's these numbers we're talking about (sloppily underlined in red) I think:

[screenshot]
  • increase gives "the absolute number of requests in this bit of the heatmap"
  • rate gives "the number of requests per second in this bit of the heatmap"

Because it's a heatmap, rather than a line graph, the size of the bucket is observable in the visualisation, so increase still makes sense in context, I think. That said, I think it's highly debatable which is more intuitive... 🤔

I think the argument for using increase is strongest for the buckets with a small number of samples:

[screenshot]

i.e. (with increase)

there were 4 requests between 03:02 and 03:03 which had a duration of between 5 and 10 seconds.

vs. (with rate)

there were 0.067 requests per second between 03:02 and 03:03 which had a duration of between 5 and 10 seconds.

I think the argument for using rate is that it's consistent with other visualisations on the dashboard.

I'm honestly not sure which is better. Interested in your thoughts?

@richardTowers (Contributor, Author) commented:

One other thing to note: even with increase, the values are not always integers, because $__rate_interval is not always an exact multiple of the sample interval. So you do get confusing datapoints like "there were 2.67 requests between 03:02 and 03:03".

@richardTowers (Contributor, Author) commented:

Screenshot of it with rate, with units added (which do help clear it up a bit):

[screenshot]

@richardTowers (Contributor, Author) commented:

Or with increase, but with the value in the tooltip labelled so it's clearer what it means:

[screenshot]

@sengi (Contributor) commented:

Ah yeah that's a fair point; I think if we're labelling it clearly like in your last screenshot then it's fine.

richardTowers and others added 3 commits October 18, 2023 08:23
These are equivalent, because the constant factor of `$__rate_interval` cancels out, but `rate` is a bit more idiomatic, and it's shorter to type.

Co-authored-by: Chris Banks <chris.banks@digital.cabinet-office.gov.uk>
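
(For reference: in Prometheus, increase is effectively syntactic sugar
for rate multiplied by the number of seconds in the window, which is why
the factor is constant. E.g. these two should return the same values:)

    increase(http_request_duration_seconds_bucket{job=~"${app}"}[5m])
    rate(http_request_duration_seconds_bucket{job=~"${app}"}[5m]) * 300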
This is to make it clearer that the values in the buckets are the
total number of requests in that bucket and timespan, not the request
rate / requests per second.

It looks like Grafana 10 has a few extra defaults. Makes sense to
include them in the config we have here, or they'll cause confusion in
future diffs.
@richardTowers merged commit 7b09fb3 into main on Oct 19, 2023
3 checks passed
@richardTowers deleted the update-rate-error-duration-dashboard branch on October 19, 2023 08:32