-
Notifications
You must be signed in to change notification settings - Fork 530
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add TempoUserConfigurableOverridesReloadFailing alert #2784
Add TempoUserConfigurableOverridesReloadFailing alert #2784
Conversation
@@ -24,6 +25,12 @@ import ( | |||
"github.com/grafana/tempo/tempodb/backend" | |||
) | |||
|
|||
var metricUserConfigurableOverridesReloadFailed = promauto.NewCounter(prometheus.CounterOpts{ | |||
Namespace: "tempo", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm surprised we don't publish this already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... I added metrics on the client that queries the backend, it must have been an oversight not to add one specifically for failed reloads.
We can already alert on the log lines but this will be cleaner.
if err != nil { | ||
// Don't block start up if loading user-configurable overrides failed. We can fall back to runtime overrides. | ||
metricUserConfigurableOverridesReloadFailed.Inc() | ||
level.Error(o.logger).Log("msg", "failed to load user-configurable config during start-up, will fall back to runtime overrides", "err", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we purposefully fail startup if the overrides are not loadable. loading the defaults and applying to all tenants could be dangerous.
wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit torn here tbh.
Loading defaults would mean we use the overrides from the runtime configmap. In the user-configurable overrides you can enable and tweak features like the generic forwarders and the metrics-generators. So the most likely impact will be that the forwarder and metrics-generator for a tenant would be disabled again until we can reload these overrides.
The user-configurable overrides does not store operational limits btw (like max_live_traces), so these would still be respected.
Alternatively, if the component can not start up we might break ingestion for the entire cluster since this also runs on the distributors. This would mean that during an extended backend outage we can not restart any component using user-configurable overrides since they would fail to get past this check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternative idea: we can make it configurable. Default to crashing when loading fails but add a flag to skip verification.
-> In case of emergency, we can disable this and still load the app.
But I also feel conflicted about this:
- if we will want to disable this check during an emergency, why even bother crashing for it?
- it makes it explicit which is good but also causes more work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline we believe that failing to start if the user configurable overrides do not get loaded correctly is the correct path.
- It mimics the behavior of loading the file based overrides
- Its failure modes seem less dire.
- It is more "future proof". Adding critical limits to user configurable overrides is better handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reverted this change. We discussed this directly for a bit but basically: to keep Tempo simple to operate we'd favour an explicit in your face crash over a log line that warns you about something being broken. If the overrides don't work you might be running without overrides for a while which can also be disastrous as we add more responsibility to this module.
Disadvantage is that might block ingest into a Tempo cluster in very specific situations, for those scenarios we can consider adding a 'break glass' functionality that bypasses all startup checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I didn't see your comment when I wrote this 😅
What this PR does:
Add metric to count failed overrides reloads and alert on it.
Which issue(s) this PR fixes:
Fixes #Checklist
Tests updatedDocumentation addedCHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]