Switch to heatmap for request durations
The old "Request duration per App/Controller/Action" panel was very
noisy and possibly inaccurate.

Without an app or quantile selected (i.e. by default), it would try to
show 2860 series (one for every combination of app, controller, action
_and_ quantile).

Even with an app and quantile specified, some apps (e.g. whitehall) have
a large number of controllers and actions, so the graph would still be
unusably noisy.

And even if you managed to find an interesting series, the query itself
is still the mean across pods of summary quantiles (medians, 95th
percentiles, etc.), which isn't a very meaningful statistic: the average
of several pods' medians is not, in general, the median of all their
requests combined.
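To illustrate why averaging quantiles across pods can mislead, here is a minimal Python sketch. The durations and the two pods (`pod_a`, `pod_b`) are invented for the example. It compares the mean of the per-pod medians, which is roughly what the old panel plotted, with the median of all requests combined.

```python
from statistics import mean, median

# Hypothetical request durations (seconds) observed by two pods.
pod_a = [0.1, 0.1, 0.1, 0.1, 0.1]  # a busy pod serving fast requests
pod_b = [2.0]                      # a quiet pod that served one slow request

# Roughly what the old panel showed: the mean of each pod's median.
mean_of_medians = mean([median(pod_a), median(pod_b)])

# The median actually experienced across all requests.
true_median = median(pod_a + pod_b)

print(f"mean of per-pod medians: {mean_of_medians:.2f}s")
print(f"median of all requests:  {true_median:.2f}s")
```

Here the mean of the medians comes out at about 1.05s, even though five of the six requests took 0.1s and the combined median is 0.1s.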

I think it would be more useful to use the histograms (which we now
have, thanks to alphagov/govuk_app_config#318) to show a heatmap.

Heatmaps show how the distribution changes over time, so they give us a
visual indication of both how many requests are happening and how many
fall into each request duration bucket. By default we'll bundle the
requests to all apps together, and allow segmenting by app using the
dashboard's `app` variable. It is possible (and useful) to segment
further (down to the controller / action), but that is probably trying
to do too much with a single dashboard.
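A rough Python sketch of what the new query feeds the heatmap for a single time step. The pod names, `le` bounds and counts are all invented. The sketch assumes that Grafana's Prometheus `heatmap` format turns the cumulative `le` buckets into per-bucket counts before drawing them. Unlike quantiles, bucket counts can be added across pods without losing meaning, which is what `sum by (le)` does.

```python
# Hypothetical bucket increases for one interval, per pod. Keys are the
# `le` upper bounds in seconds; values count requests that took at most
# that long. Prometheus buckets are cumulative, so +Inf counts every request.
pod_increases = {
    "pod-a": {0.1: 40.0, 0.5: 55.0, 1.0: 58.0, float("inf"): 60.0},
    "pod-b": {0.1: 5.0, 0.5: 12.0, 1.0: 18.0, float("inf"): 20.0},
}


def sum_by_le(per_pod):
    """Mimic `sum by (le)`: add matching buckets across pods."""
    totals = {}
    for buckets in per_pod.values():
        for le, count in buckets.items():
            totals[le] = totals.get(le, 0.0) + count
    return totals


def de_accumulate(cumulative):
    """Turn cumulative `le` buckets into (lower, upper, count) heatmap cells."""
    cells = []
    lower, previous = 0.0, 0.0
    for le in sorted(cumulative):
        cells.append((lower, le, cumulative[le] - previous))
        lower, previous = le, cumulative[le]
    return cells


totals = sum_by_le(pod_increases)
cells = de_accumulate(totals)
for lower, upper, count in cells:
    print(f"({lower:g}, {upper:g}] s: {count:g} requests")
```

Each tuple is one cell in a column of the heatmap. The cells add up to the `+Inf` bucket, which is the total number of requests in the interval. This is how the one panel shows both request volume and the duration distribution.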
richardTowers committed Oct 4, 2023
1 parent c6276ff commit 3be0113
1 changed file with 37 additions and 64 deletions: charts/monitoring-config/dashboards/app-requests.json
@@ -456,105 +456,78 @@
"type": "prometheus",
"uid": "prometheus"
},
"description": "Controller Action times sorted by application and sortable by maximum time taken",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "Request Time (s)",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 0,
"gradientMode": "hue",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineInterpolation": "smooth",
"lineStyle": {
"fill": "solid"
},
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "always",
"spanNulls": true,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
},
"unit": "none"
}
},
"overrides": []
},
"gridPos": {
"h": 13,
"h": 9,
"w": 24,
"x": 0,
"y": 20
},
"id": 5,
"id": 4,
"options": {
"calculate": false,
"cellGap": 1,
"color": {
"exponent": 0.5,
"fill": "dark-orange",
"mode": "scheme",
"reverse": false,
"scale": "exponential",
"scheme": "Viridis",
"steps": 64
},
"exemplars": {
"color": "rgba(255,0,255,0.7)"
},
"filterValues": {
"le": 1e-9
},
"legend": {
"calcs": [
"max"
],
"displayMode": "table",
"placement": "right",
"showLegend": true,
"width": 300
"show": true
},
"rowsFrame": {
"layout": "auto"
},
"tooltip": {
"mode": "single",
"sort": "none"
"show": true,
"yHistogram": false
},
"yAxis": {
"axisPlacement": "left",
"reverse": false
}
},
"pluginVersion": "9.5.5",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "prometheus"
},
"editorMode": "code",
"exemplar": true,
"expr": "label_replace(avg without (pod, instance) (http_request_duration_seconds{namespace=\"${namespace}\", job=~\"${app}\", quantile=~\"${quantile}\"}), \"a\", \"$1\", \"job\", \"(.*)\")",
"interval": "",
"legendFormat": "{{job}} {{controller}} {{action}} {{quantile}}",
"expr": "sum by (le) (increase(http_request_duration_seconds_bucket{namespace=\"${namespace}\", job=~\"${app}\"}[$__rate_interval]))",
"format": "heatmap",
"legendFormat": "__auto",
"range": true,
"refId": "A",
"sort": "current",
"sortDesc": false
"refId": "A"
}
],
"title": "Request duration per App/Controller/Action",
"transformations": [],
"type": "timeseries"
"title": "Distribution of Request Durations",
"type": "heatmap"
}
],
"refresh": "1m",