Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TempoUserConfigurableOverridesReloadFailing alert #2784

Merged
merged 3 commits into from
Aug 15, 2023

Conversation

yvrhdn
Copy link
Member

@yvrhdn yvrhdn commented Aug 11, 2023

What this PR does:
Add metric to count failed overrides reloads and alert on it.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Sorry, something went wrong.

@@ -24,6 +25,12 @@ import (
"github.com/grafana/tempo/tempodb/backend"
)

var metricUserConfigurableOverridesReloadFailed = promauto.NewCounter(prometheus.CounterOpts{
Namespace: "tempo",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm surprised we don't publish this already

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... I added metrics on the client that queries the backend, it must have been an oversight not to add one specifically for failed reloads.
We can already alert on the log lines but this will be cleaner.

if err != nil {
// Don't block start up if loading user-configurable overrides failed. We can fall back to runtime overrides.
metricUserConfigurableOverridesReloadFailed.Inc()
level.Error(o.logger).Log("msg", "failed to load user-configurable config during start-up, will fall back to runtime overrides", "err", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we purposefully fail startup if the overrides are not loadable. loading the defaults and applying to all tenants could be dangerous.

wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit torn here tbh.

Loading defaults would mean we use the overrides from the runtime configmap. In the user-configurable overrides you can enable and tweak features like the generic forwarders and the metrics-generators. So the most likely impact will be that the forwarder and metrics-generator for a tenant would be disabled again until we can reload these overrides.
The user-configurable overrides does not store operational limits btw (like max_live_traces), so these would still be respected.

Alternatively, if the component can not start up we might break ingestion for the entire cluster since this also runs on the distributors. This would mean that during an extended backend outage we can not restart any component using user-configurable overrides since they would fail to get past this check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative idea: we can make it configurable. Default to crashing when loading fails but add a flag to skip verification.
-> In case of emergency, we can disable this and still load the app.

But I also feel conflicted about this:

  • if we will want to disable this check during an emergency, why even bother crashing for it?
  • it makes it explicit which is good but also causes more work

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline we believe that failing to start if the user configurable overrides do not get loaded correctly is the correct path.

  1. It mimics the behavior of loading the file based overrides
  2. Its failure modes seem less dire.
  3. It is more "future proof". Adding critical limits to user configurable overrides is better handled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reverted this change. We discussed this directly for a bit but basically: to keep Tempo simple to operate we'd favour an explicit in your face crash over a log line that warns you about something being broken. If the overrides don't work you might be running without overrides for a while which can also be disastrous as we add more responsibility to this module.
Disadvantage is that might block ingest into a Tempo cluster in very specific situations, for those scenarios we can consider adding a 'break glass' functionality that bypasses all startup checks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't see your comment when I wrote this 😅

@yvrhdn yvrhdn requested a review from joe-elliott August 15, 2023 16:54
@yvrhdn yvrhdn merged commit 12bdeff into grafana:main Aug 15, 2023
@yvrhdn yvrhdn deleted the kvrhdn/user-configurable-overrides-alerts branch August 15, 2023 18:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants