Have a way to mute alert until it's resolved to receive a resolved notification once it's fixed #3825
Comments
Hi! 👋 It sounds to me like you want a silence to expire when it is no longer silencing any active alerts. I think there are a couple of problems that we would need to solve to add such a feature. For example:
@grobinson-grafana seems so. For issues that you outlined:
Do you think that introduces new troubles?
@grobinson-grafana for 1) I can try implementing it myself, but I'm not sure if I can manage 2) or if it's even feasible.
Hi! 👋 Do you have time to evaluate some embedded k/v stores? That would be a fantastic contribution as we have discussed durable storage for Alertmanager in the past but haven't decided what to use. For example, I know that Grafana Loki uses bbolt, but it would be nice to see a comparison of some other embedded databases. You could even include sqlite3.

Alertmanager has avoided being dependent on other processes as it needs to operate even when these are unavailable, so that means no MySQL, PostgreSQL, memcache, redis, etc.

Second, it is not uncommon for users to have Alertmanager installations with 10,000s of alerts, so it would be nice to see some performance comparisons of different databases. I expect the workload to be write-heavy as reads will only happen at startup time.
@grobinson-grafana so I looked a bit into how it's done for silences. Apparently it's all serialised into some binary format and stored on disk as a single file. Do you think it makes sense to do it the same way for alerts here as well, or would it be better to do it via a proper db? Basically we only need to read from it once, to load all the alerts when Alertmanager starts, and to write to it whenever an alert is created or updated. (One issue I see with that approach is that if there are 10k inserts creating alerts, then each insert has to overwrite the whole file, which is not nice.)
Yes! That's the issue! :) It works for silences because silences are not created very often and you don't tend to have very many of them. But alerts are very different, and Alertmanager can be receiving 1000s of alerts per minute (i.e. the |
@grobinson-grafana okay, from my point of view, sqlite3 doesn't make a lot of sense here as it adds another layer of complexity by having to deal with a db schema, so I don't think it's the best approach. Among other k/v databases, apart from the bbolt that you suggested, one cool option I found is https://github.com/dgraph-io/badger. It has quite a big community (more GitHub stars than bbolt), is used by a lot of projects, and seems to be maintained. I haven't used either of these in my projects, so I can mostly judge by library popularity and whether it's maintained, and both look fine on that front. What do you think?
Just as a quick note: there's kthxbye, which automatically extends silences (prefixed by "ACK!" in the default configuration) before they expire, as long as their alerts are still firing.
Let's say I have an outage on one of the servers I'm monitoring and it's inaccessible, but I don't know how long it's going to take to fix, so I'm muting it for a really long time.
With this approach, I won't receive any resolved notifications, so to check if the alert is fixed I need to go to my alerts list to see if it's still firing, and given that I've muted it for a long time I also need to remove the mute to know if it's firing again.
What would be nice to have:
Pretty sure this would have a lot of cases that'll make it difficult, like if a mute has a lot of active alerts, but still would be really awesome to have.
Do you guys think it's manageable?
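For illustration only, the requested behaviour could be sketched in Go roughly like this: a mute stays in place while any matching alert is firing, and is lifted (at which point a resolved notification could be sent) once none are. The `Alert` and `Mute` types, the equality-only matchers, and `liftIfResolved` are all hypothetical and are not Alertmanager's API.

```go
package main

import "fmt"

// Alert is a hypothetical, simplified alert (not Alertmanager's type).
type Alert struct {
	Labels map[string]string
	Firing bool
}

// Mute is a hypothetical silence with equality-only label matchers.
type Mute struct {
	Matchers map[string]string
	Active   bool
}

// matches reports whether every matcher equals the alert's label value.
func (m Mute) matches(a Alert) bool {
	for k, v := range m.Matchers {
		if a.Labels[k] != v {
			return false
		}
	}
	return true
}

// liftIfResolved deactivates the mute when no firing alert matches it.
// It returns true only on the pass where the mute is lifted, which is
// where a resolved notification would be triggered.
func liftIfResolved(m *Mute, alerts []Alert) bool {
	if !m.Active {
		return false
	}
	for _, a := range alerts {
		if a.Firing && m.matches(a) {
			return false // still muting something that is firing
		}
	}
	m.Active = false
	return true
}

func main() {
	mute := &Mute{Matchers: map[string]string{"instance": "server-1"}, Active: true}
	alerts := []Alert{
		{Labels: map[string]string{"instance": "server-1"}, Firing: true},
	}

	fmt.Println("lifted while firing:", liftIfResolved(mute, alerts))

	alerts[0].Firing = false // the outage is fixed
	fmt.Println("lifted after resolve:", liftIfResolved(mute, alerts))
}
```

Note that this sketch would also lift a mute that has not matched any alert yet, and it does not say what should happen when a mute covers many alerts that resolve at different times. Those are the kind of cases that make the feature harder than it looks.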