Add TempoUserConfigurableOverridesReloadFailing alert #2784

yvrhdn · 2023-08-11T15:24:26Z

What this PR does:
Add a metric that counts failed overrides reloads, and an alert on it.

Which issue(s) this PR fixes:
~~Fixes #~~

Checklist

~~Tests updated~~
~~Documentation added~~
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

joe-elliott · 2023-08-11T16:10:19Z

modules/overrides/user_configurable_overrides.go

@@ -24,6 +25,12 @@ import (
 	"github.com/grafana/tempo/tempodb/backend"
 )

+var metricUserConfigurableOverridesReloadFailed = promauto.NewCounter(prometheus.CounterOpts{
+	Namespace: "tempo",


I'm surprised we don't publish this already.

Yeah... I added metrics on the client that queries the backend, it must have been an oversight not to add one specifically for failed reloads.
We can already alert on the log lines but this will be cleaner.

joe-elliott · 2023-08-11T16:11:09Z

modules/overrides/user_configurable_overrides.go

+	if err != nil {
+		// Don't block start up if loading user-configurable overrides failed. We can fall back to runtime overrides.
+		metricUserConfigurableOverridesReloadFailed.Inc()
+		level.Error(o.logger).Log("msg", "failed to load user-configurable config during start-up, will fall back to runtime overrides", "err", err)


We purposefully fail startup if the overrides are not loadable. Loading the defaults and applying them to all tenants could be dangerous.

wdyt?

I'm a bit torn here tbh.

Loading defaults would mean we use the overrides from the runtime configmap. In the user-configurable overrides you can enable and tweak features like the generic forwarders and the metrics-generators. So the most likely impact will be that the forwarder and metrics-generator for a tenant would be disabled again until we can reload these overrides.
The user-configurable overrides do not store operational limits btw (like max_live_traces), so these would still be respected.

Alternatively, if the component cannot start up we might break ingestion for the entire cluster, since this also runs on the distributors. During an extended backend outage we would not be able to restart any component using user-configurable overrides, since they would fail to get past this check.

Alternative idea: we can make it configurable. Default to crashing when loading fails but add a flag to skip verification.
-> In case of emergency, we can disable this and still load the app.

But I also feel conflicted about this:

- If we would want to disable this check during an emergency, why even bother crashing for it?
- It makes the behaviour explicit, which is good, but it also causes more work.

As discussed offline, we believe that failing to start if the user-configurable overrides do not load correctly is the correct path:

- It mimics the behavior of loading the file-based overrides.
- Its failure modes seem less dire.
- It is more "future proof": adding critical limits to user-configurable overrides is better handled.

I have reverted this change. We discussed this directly for a bit, but basically: to keep Tempo simple to operate we'd favour an explicit, in-your-face crash over a log line that warns you about something being broken. If the overrides don't work, you might be running without overrides for a while, which can also be disastrous as we add more responsibility to this module.
The disadvantage is that this might block ingest into a Tempo cluster in very specific situations. For those scenarios we can consider adding a 'break glass' functionality that bypasses all startup checks.

Oh, I didn't see your comment when I wrote this 😅

Add TempoUserConfigurableOverridesReloadFailing alert

1e00e42

yvrhdn requested review from joe-elliott, annanay25, mdisibio, mapno, zalegrala, electron0zero, ie-pham and stoewer as code owners August 11, 2023 15:24

Update CHANGELOG.md

49eee02

joe-elliott reviewed Aug 11, 2023


Fail startup on user-configurable overrides error

a8b4711

yvrhdn requested a review from joe-elliott August 15, 2023 16:54

joe-elliott approved these changes Aug 15, 2023


yvrhdn merged commit 12bdeff into grafana:main Aug 15, 2023

yvrhdn deleted the kvrhdn/user-configurable-overrides-alerts branch August 15, 2023 18:57

knylander-grafana mentioned this pull request Aug 18, 2023

[DOC] Tempo 2.2.1 release notes #2811

Closed
