Update request rates, errors, durations dashboard #1302

Merged: 9 commits into main on Oct 19, 2023

Conversation

@richardTowers (Contributor) commented Oct 4, 2023

There are a few changes here, which I've tried to describe in individual commits to make it easier to review.

The request rates, errors, durations dashboard has numerous issues. The most immediately serious one is that the metric it's using to work out which apps exist no longer includes most apps (because of the switch to histograms in alphagov/govuk_app_config#318), but there are also a few issues stemming from the confusing nature of Prometheus summaries.

I've also taken the opportunity to switch the request durations visualisation to use a heatmap, which I think is a more useful representation of the information.

[Before / after screenshots (note: different time span)]

You can play with the new version of the dashboard here: https://grafana.eks.production.govuk.digital/d/cdeb0607-02e2-4c51-8b7a-398a1c1e2936/richard-towersat-app-request-rates-errors-durations

I don't fully understand what this means, but Grafana did it when I
copied the JSON from the dashboard.

sum (called "Total" in the Grafana UI) isn't meaningful for Total HTTP
Request Errors per second: it's the per-second error rate summed across
the points on the graph, i.e. the average rate multiplied by the number
of points rather than by the number of seconds, so it isn't a count of
errors. This is why it comes up with odd statistics like support having
had a total of 0.4 errors over the last 6 hours. What that probably means
is that for one rate interval, support had an error rate of 0.4 errors /
second.
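
For illustration, a sketch (the metric name and `code` label here are
assumptions, not taken from this dashboard's JSON): the panel plots a
per-second rate, and Grafana's "Total" simply sums those plotted values.
A real count of errors over the visible window would need `increase`
over the whole range instead:

    # What the panel plots: a per-second error rate
    sum by (job) (rate(http_requests_total{code=~"5..", job=~"${app}"}[$__rate_interval]))

    # A true "total errors over the window" (Grafana's $__range)
    sum by (job) (increase(http_requests_total{code=~"5..", job=~"${app}"}[$__range]))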

Best to be consistent and show mean (average) everywhere, I think. I've
also added "max", because I find this useful to sort by.

Previously, the expression for average request duration was incorrect
(or at least confusing).

http_request_duration_seconds was a Prometheus summary, giving quantiles
(e.g. the median, or the 95th percentile) for individual apps and
individual pods. Averaging these quantiles by job doesn't entirely make
sense - if a job has multiple pods, those pods will have different
medians, and avg() is going to take the mean across pods of the median
request durations. That's almost certainly not what we want.

Prometheus histograms track the number of observations and the sum of
the observed values, which allows you to calculate the average by
dividing the sum by the count. Which is what we're doing here.
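
As a sketch (label filters assumed to match the rest of this dashboard),
the average-duration expression becomes:

    # Mean request duration: rate of summed durations / rate of counts
    sum by (job) (rate(http_request_duration_seconds_sum{namespace="${namespace}", job=~"${app}"}[$__rate_interval]))
      /
    sum by (job) (rate(http_request_duration_seconds_count{namespace="${namespace}", job=~"${app}"}[$__rate_interval]))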
http_request_duration_seconds was a Prometheus summary. We switched to
Prometheus histograms in alphagov/govuk_app_config#318.

The histogram metric is called http_request_duration_seconds_bucket
(note the extra `_bucket`). Now that most of the apps don't use the
http_request_duration_seconds metric, it's missing most of the app
labels, making it the wrong place to query to get the list of apps.

Instead, the http_requests_total metric has everything we need.

I've also removed the "quantile" variable, as we're no longer using the
summary metric which includes quantile as a label.
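
For illustration, the app variable's query might now look something like
this (a sketch; the exact query in the dashboard JSON may differ):

    label_values(http_requests_total{namespace="${namespace}"}, job)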
The old Request duration per App/Controller/Action panel was very noisy,
and possibly inaccurate.

Without an app / quantile specified (i.e. by default) it would try to
show 2860 series (one series for every combination of app, controller,
action _and_ quantile).

Even with an app and quantile specified, some apps (e.g. whitehall) have
a large number of controllers and actions, so the graph would still be
unusably noisy.

And even if you managed to find an interesting series, the query itself
is still the average (mean) across pods of a number of summaries
(medians, 95th percentiles etc.), which isn't a very meaningful
statistic.

I think it would be more useful to use the histograms (which we have
now, thanks to alphagov/govuk_app_config#318) to
show a heatmap.

Heatmaps show the change in distribution over time, so they give us a
visual indication both of how many requests are happening, and how many
fall into each request duration bucket. In this case, by default we'll
bundle all the requests to all the apps together, and allow segmenting
by app using the variable on the dashboard. It is possible (and useful)
to segment further (down to the controller / action), but that would
probably be trying to do too much with a single dashboard.
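
Concretely, the heatmap query takes this shape (this is the expression
that appears in the panel JSON discussed in the review below):

    sum by (le) (increase(http_request_duration_seconds_bucket{namespace="${namespace}", job=~"${app}"}[$__rate_interval]))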
The expressions on this dashboard use the `=~` operator to match a
regular expression against the job label. By default, Grafana constructs
a string like this:

    (asset-manager|bouncer|collections|collections-publisher|contacts-admin|content-publisher|content-store-postgresql-branch|content-tagger|draft-collections|draft-content-store-postgresql-branch|draft-frontend|draft-government-frontend|draft-router-api|draft-static|email-alert-api|feedback|finder-frontend|frontend|government-frontend|info-frontend|link-checker-api|manuals-publisher|maslow|publisher|router-api|search-admin|search-api|service-manual-publisher|short-url-manager|smartanswers|specialist-publisher|static|support|whitehall-admin)

which works fine, but it's a bit verbose (and maybe not that
performant). Since we know we're always using the app variable in `=~`
conditions, we can simplify it by using `.*`.
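
As a sketch: with the app variable's "Custom all value" option set to
`.*`, a selector like

    sum by (job) (rate(http_requests_total{job=~"${app}"}[$__rate_interval]))

expands to job=~".*" when "All" is selected, instead of the long
alternation above. ("Custom all value" is a standard Grafana variable
option; whether this dashboard uses it is an assumption here.)
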
@richardTowers force-pushed the update-rate-error-duration-dashboard branch from 6aed88d to 63bb070 on October 4, 2023 16:46
@richardTowers changed the title from "WIP - update request rates, errors, durations dashboard" to "Update request rates, errors, durations dashboard" on Oct 4, 2023
@richardTowers marked this pull request as ready for review on October 4, 2023 16:59
@richardTowers (Contributor, Author) commented:

I have other thoughts about this dashboard, but I didn't want to go too far down the rabbit hole.

Jotting a few of them down here:

  • It's probably confusing bundling up request counts / durations across all the publishing apps, frontends and APIs - is it really sensible to have one panel (or even one dashboard) trying to cover all of these things? It would be cool if we could add a label to the metrics which categorised which kind of thing something was (e.g. whitehall is a publishing app, publishing-api is an internal API, frontend is a frontend) to help group these things onto more focused dashboards
  • Full width graphs means we're not using vertical space very effectively. My beautiful heat map is below the fold on my screen 😭. I think it would look better as a 2 x 2 - requests and errors on the top row, average durations and heatmap on the bottom.
  • HTTP Errors is not very useful without a filter on, because it's absolutely dominated by 404s (most of which are totally legit responses, not errors at all)
  • It would be nice to link out to other dashboards where you could drill down into more detail. E.g. the heatmaps are much more useful if you drill down to a single controller

@sengi (Contributor) left a comment:

Cool, thanks for fixing this broken dashboard! Been on my to-do list for a while!

Review threads on charts/monitoring-config/dashboards/app-requests.json (outdated, resolved):

"expr": "label_replace(avg without (pod, instance) (http_request_duration_seconds{namespace=\"${namespace}\", job=~\"${app}\", quantile=~\"${quantile}\"}), \"a\", \"$1\", \"job\", \"(.*)\")",
"interval": "",
"legendFormat": "{{job}} {{controller}} {{action}} {{quantile}}",
"expr": "sum by (le) (increase(http_request_duration_seconds_bucket{namespace=\"${namespace}\", job=~\"${app}\"}[$__rate_interval]))",
@sengi (Contributor) commented Oct 11, 2023:

In this one, increase actually gives us something other than what we want. This is because increase gives the difference in value between the first and last non-null samples of its argument (range vector) whereas we need the per-second rate. In other words, we expect our output to be independent of the chosen sliding window (rate interval) but we were erroneously multiplying each output point by the number of seconds in the rate interval.

You can see this e.g. if you plot just one bucket (say le="+Inf" i.e. cumulative total request count) for some particular job and compare the values to the request rate graph.
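
A sketch of that check, reusing names from elsewhere in this PR (the
+Inf bucket counts every observed request, so its rate should track the
overall request rate):

    # Requests per second derived from the histogram's +Inf bucket
    sum(rate(http_request_duration_seconds_bucket{le="+Inf", job=~"${app}"}[$__rate_interval]))

    # Requests per second from the counter; should roughly match
    sum(rate(http_requests_total{job=~"${app}"}[$__rate_interval]))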

Suggested change (current, then suggested):

    "expr": "sum by (le) (increase(http_request_duration_seconds_bucket{namespace=\"${namespace}\", job=~\"${app}\"}[$__rate_interval]))",
    "expr": "sum by (le) (rate(http_request_duration_seconds_bucket{namespace=\"${namespace}\", job=~\"${app}\"}[$__rate_interval]))",

@sengi (Contributor) commented Oct 11, 2023:

Oh wait, I think I've got my wires crossed here — our output points should be durations not counts of requests.

@sengi (Contributor) commented:

Or have I? Ugh brain no worky.

@sengi (Contributor) commented Oct 11, 2023:

Yeah no I think I was thinking ok in the first post. We're plotting a histogram so our y-axis is a vector of request counts by duration bucket, right? Do check my working though! 😅

@richardTowers (Contributor, Author) commented:

I think I could argue it both ways... It's these numbers we're talking about (sloppily underlined in red) I think:

[screenshot]
  • increase gives "the absolute number of requests in this bit of the heatmap"
  • rate gives "the number of requests per second in this bit of the heatmap"

Because it's a heatmap, rather than a line graph, the size of the bucket is observable in the visualisation, so increase still makes sense in context, I think. That said, I think it's highly debatable which is more intuitive... 🤔

I think the argument for using increase is strongest for the buckets with a small number of samples:

[screenshot]

i.e. (with increase)

there were 4 requests between 03:02 and 03:03 which had a duration of between 5 and 10 seconds.

vs. (with rate)

there were 0.067 requests per second between 03:02 and 03:03 which had a duration of between 5 and 10 seconds.

I think the argument for using rate is that it's consistent with other visualisations on the dashboard.

I'm honestly not sure which is better. Interested in your thoughts?

@richardTowers (Contributor, Author) commented:

One other thing to note: even with increase, the values are not always integers, because $__rate_interval is not always an exact multiple of the sample interval. So you do get confusing datapoints like "there were 2.67 requests between 03:02 and 03:03".

@richardTowers (Contributor, Author) commented:

Screenshot of it with rate, with units added (which do help clear it up a bit):

[screenshot]

@richardTowers (Contributor, Author) commented:

Or with increase, but with the value in the tooltip labelled so it's clearer what it means:

[screenshot]

@sengi (Contributor) commented:

Ah yeah that's a fair point; I think if we're labelling it clearly like in your last screenshot then it's fine.

richardTowers and others added 3 commits October 18, 2023 08:23
These are equivalent, because the constant factor of `$__rate_interval` cancels out, but `rate` is a bit more idiomatic, and it's shorter to type.

Co-authored-by: Chris Banks <chris.banks@digital.cabinet-office.gov.uk>
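
(For reference: in Prometheus, increase is effectively syntactic sugar
for rate multiplied by the number of seconds in the window, which is why
the factor is constant. E.g. these two should return the same values:)

    increase(http_request_duration_seconds_bucket{job=~"${app}"}[5m])
    rate(http_request_duration_seconds_bucket{job=~"${app}"}[5m]) * 300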
This is to make it clearer that the values in the buckets are the
total number of requests in that bucket and timespan, not the request
rate / requests per second.

It looks like Grafana 10 has a few extra defaults. Makes sense to
include them in the config we have here, or they'll cause confusion in
future diffs.
@richardTowers merged commit 7b09fb3 into main on Oct 19, 2023
3 checks passed
@richardTowers deleted the update-rate-error-duration-dashboard branch on October 19, 2023 08:32