Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default to Prometheus histograms, not summaries #318

Merged
merged 3 commits into from Sep 14, 2023

Conversation

richardTowers
Copy link
Contributor

Prometheus histograms / summaries are complex and hard to wrap one's head around. The difference between them is also quite subtle and confusing to the uninitiated. I'm sure I won't do a good job of explaining the difference in this commit message, so I'll just link to the docs[0].

The key difference for us is that summaries can't be aggregates, but histograms can.

For example, if I'm looking at http_request_duration_seconds, we might want to know "what's the 95%ile request time for this controller action". We can answer this question for a particular application instance (i.e. a specific pod) with both histograms and summaries. If we want to know the 95%ile request time aggregated across all instances / pods however, we can only do that with histograms. This is because in summaries, quantiles are computed ahead of time by the prometheus client, so they can only see information for one particular app instance. Histograms on the other hand, defer the calculation of quantiles to query time, which means they can be aggregated (but are less precise).

prometheus_exporter defaults to Summary[1], but in our case I think it makes more sense to default to Histogram. There may be some apps where we prefer Summary, so I've allowed it to be passed in as a configuration option.

From the summary metrics we have at the moment, we can see that some controller actions take significantly longer than the 10 seconds which prometheus_exporter uses as it's default max bucket. Therefore, I've added a few more buckets so we can see the distribution between 10 and 50 seconds.

[0] - https://prometheus.io/docs/practices/histograms/#quantiles
[1] - https://github.com/discourse/prometheus_exporter#histogram-mode

Prometheus histograms / summaries are complex and hard to wrap one's
head around. The difference between them is also quite subtle and
confusing to the uninitiated.  I'm sure I won't do a good job of
explaining the difference in this commit message, so I'll just link to
the docs[0].

The key difference for us is that summaries can't be aggregates, but
histograms can.

For example, if I'm looking at http_request_duration_seconds, we might
want to know "what's the 95%ile request time for this controller
action". We can answer this question for a particular application
instance (i.e. a specific pod) with both histograms and summaries. If we
want to know the 95%ile request time aggregated across all instances /
pods however, we can only do that with histograms. This is because in
summaries, quantiles are computed ahead of time by the prometheus
client, so they can only see information for one particular app
instance. Histograms on the other hand, defer the calculation of
quantiles to query time, which means they can be aggregated (but are
less precise).

prometheus_exporter defaults to Summary[1], but in our case I think it
makes more sense to default to Histogram. There may be some apps where
we prefer Summary, so I've allowed it to be passed in as a configuration
option.

From the summary metrics we have at the moment, we can see that some
controller actions take significantly longer than the 10 seconds which
prometheus_exporter uses as it's default max bucket. Therefore, I've
added a few more buckets so we can see the distribution between 10 and
50 seconds.

[0] - https://prometheus.io/docs/practices/histograms/#quantiles
[1] - https://github.com/discourse/prometheus_exporter#histogram-mode
@richardTowers
Copy link
Contributor Author

(Draft while I test that this has the intended behvaiour in one of our app)

Copy link
Contributor

@sengi sengi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@sengi
Copy link
Contributor

sengi commented Sep 13, 2023

AFAICT Summary solves a pretty rare/obscure use case so it's frustrating that the Prometheus docs give it so much salience. Seems to cause a lot of confusion even among library authors 😕

There's a reason B�rgmon never grew an equivalent of Summary (as far as I know) despite about a decade of use at massive scale on a wide range of different services/applications.

I'd love to know what the original use case was that gave rise to Summary. Maybe someone had an inherently global metric (say, coming from an exporter, with no need to aggregate after scrape) and it was really important for them to have super accurate, arbitrary (i.e. not chosen in advance) quantiles?

This is needed because I want to use a class from
prometheus_exporter/metric as the default for a method parameter, which
means I need to require the prometheus code before I use it.

Previously these requires happened conditionally, so if
`should_configure` was false, the prometheus code wouldn't be loaded at
all. From a quick look at the prometheus exporter code, I don't think
the requires have any side effects, so it should be fine to load them
even when we're not configuring prometheus.
@richardTowers richardTowers marked this pull request as ready for review September 14, 2023 09:22
@richardTowers richardTowers merged commit eec4ec5 into main Sep 14, 2023
5 checks passed
@richardTowers richardTowers deleted the switch-default-aggregation branch September 14, 2023 10:15
@sengi
Copy link
Contributor

sengi commented Sep 14, 2023

Hopefully not an actual problem, but just beware that moving the imports could have side effects (especially given that particular library's history!) Just something to look out for as this rolls out.

richardTowers added a commit to alphagov/govuk-helm-charts that referenced this pull request Oct 4, 2023
http_request_duration_seconds was a prometheus summary. We switched to
prometheus histograms in alphagov/govuk_app_config#318

The histogram metric is called http_request_duration_seconds_bucket
(note the extra `_bucket`). Now that most of the apps don't use the
http_request_duration_seconds metric, it's missing most of the app
labels, making it the wrong place to query to get the list of apps.

Instead, the http_requests_total metric has everything we need.

I've also removed the "quantile" variable, as we're no longer using the
summary metric which includes quantile as a label.
richardTowers added a commit to alphagov/govuk-helm-charts that referenced this pull request Oct 4, 2023
The old Request duration per App/Controller/Action panel was very noisy,
and possibly inaccurate.

Without an app / quantile specified (i.e. by default) it would try to
show 2860 series (one series for every combination of app, controller,
action _and_ quantile).

Even with an app and quantile specified, some apps (e.g. whitehall) have
a large number of controllers and actions, so the graph would still be
unusably noisy.

And even if you managed to find an interesting series, the query itself
is still the average (mean) across pods of a number of summaries
(medians, 95th percentiles etc.), which isn't a very meaningful
statistic.

I think it would be more useful to use the histograms (which we have
now, thanks to alphagov/govuk_app_config#318) to
show a heatmap.

Heatmaps show the change in distribution over time, so they give us a
visual indication both of how many requests are happening, and how many
fall into each request duration bucket. In this case, by default we'll
bundle all the requests to all the apps together, and allow segmenting
by app using the variable on the dashboard. It is possible (and useful)
to segment further (down to the controller / action), but we're probably
trying to do too much with a single dashboard by trying that.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants