Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure matcher values are present when parsing matchers from strings #2968

Merged
merged 2 commits into from Sep 30, 2022

Conversation

gotjosh
Copy link
Member

@gotjosh gotjosh commented Jun 28, 2022

Fixes #2936

Signed-off-by: gotjosh josue.abreu@gmail.com

Copy link
Member

@roidelapluie roidelapluie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we fail? Or just it should match the empty value?

@gotjosh
Copy link
Member Author

gotjosh commented Jun 28, 2022

Should we fail? Or just it should match the empty value?

I pondered a bit on that as that was my initial approach, but it felt like allowing it might have an unwanted side effect (and we fail on a lot of cases) - I have no good reference points and couldn't find any in the code.

That being said, I'm happy to rely on your expertise and experience for this one - do you think we should fall back to ""?

@roidelapluie
Copy link
Member

Empty labels are valid in Prometheus and alertmanager.

cc @beorn7 WDYT here?

@gotjosh
Copy link
Member Author

gotjosh commented Jun 28, 2022

Empty labels are valid in Prometheus and alertmanager.

Sorry, I could have been clearer that's what I meant with my comment e.g. you made a mistake while typing and silenced an alert with empty value label.

@beorn7
Copy link
Member

beorn7 commented Jun 28, 2022

I think it should be possible to create a silence for an empty label, which is equivalent to an non-existing label, or in other words, it gives you a straightforward way of silencing all alerts that do not have that label (with a non-empty value).

If we forbid foo="", it becomes kind of tedious to formulate that, e.g. foo!~".+".

Maybe we could allow foo="" and error on foo=? But this feels a bit inconsistent.

@roidelapluie
Copy link
Member

yeah, "" will be dropped by shell for most users anyway. I suggest we consider it as an empty value.

@simonpasquier
Copy link
Member

simonpasquier commented Jun 28, 2022

Should we fail? Or just it should match the empty value?

not feeling strongly on this one but as the parser is already quite liberal (example), I would ok with foo= being equivalent to foo="".

@gotjosh
Copy link
Member Author

gotjosh commented Jun 28, 2022

Maybe we could allow foo="" and error on foo=? But this feels a bit inconsistent.

I thought that's what I was doing with this PR, but after a close second look at the problem (not just the solution), I realise what's currently going on.

If you look closely at the command executed as part of #2936 it's:

./amtool --alertmanager.url=http://127.0.0.1:9093 alert add foo=

Which I believe it's a mistake. It doesn't include the alertname as the first argument - I think this should have resulted in an error so that the user is aware.

Alertname must be present

Now, for this case, I propose the following: Allow the first argument to be used as a whole string if it doesn't contain the model.AlertNameLabel as part of the string or we're unable to parse the label throw an error.

Case #1: ✅

./amtool --alertmanager.url=http://127.0.0.1:9093 alertname=AlwaysFiring
# results in alertname=AlwaysFiring

Case #2: ✅

./amtool --alertmanager.url=http://127.0.0.1:9093 AlwaysFiring
# results in alertname=AlwaysFiring

Case #3: ❌

./amtool --alertmanager.url=http://127.0.0.1:9093 alertname=
error: unable to determine alertname

Case #4: ❌

./amtool --alertmanager.url=http://127.0.0.1:9093 foo=
error: unable to determine alertname

Case #5: ❌

./amtool --alertmanager.url=http://127.0.0.1:9093 foo=AlwaysFiring alertname=NeverFiring
error: unable to determine alertname

Empty value labels

Now, this brings us to the second question... The consensus I take from this discussion seems to be that: foo= is the same as foo="". I think we're able to keep that behaviour for labels that are not the alertname, see below.

Case #1: ✅

./amtool --alertmanager.url=http://127.0.0.1:9093 AlwaysFiring foo=

Case #2: ✅

./amtool --alertmanager.url=http://127.0.0.1:9093 AlwaysFiring foo=""

Josh's opinion

Personally, I would prefer to error on foo= but as @simonpasquier pointed out - the parser is already quite flexible with this so I'm happy to accommodate. I'll have to look at a few more places where we use ParseMatchers to make sure I'm not breaking something but overall sounds doable.

If we agree with the above, I'm happy to take it for another spin.

@roidelapluie
Copy link
Member

Why would alertname need to be present? I can be relabeled away by Prometheus alert_relabel_config ..

@gotjosh
Copy link
Member Author

gotjosh commented Jul 1, 2022

After Julien's comment, I took another holistic look and these are my findings:

  1. alertname as a label is not required by the API (I thought it was - my bad 🤦)
  2. Empty labels don't make it to the alertmanager, they're removed by the add alerts API endpoint
  3. Silences allow empty matchers meaning foo="" is a valid matcher as long as there is at least one non-empty matcher. See here

To top the above, it seems like users have also reported confusion on different behaviour between silence matchers vs alert matchers with #2610. This is because the "filters" params on that endpoints treats it like any other matcher, however, silences when set have additional validation on them.

I'm struggling to come up with suggestions on how to best proceed forward that a) don't incur a breaking change and b) solve the problem for both adding alerts and creating silences - any context to share here?

Concretely, I'm struggling with:

  • I see no reason why the CLI should allow empty labels for adding alerts if they're going to end up being removed anyways. However, I see the appeal for silences.
  • If alertname is not required why do we have special handling for it in the CLI? IMO this is better suited for a CLI flag that when specified it helps you create the alertname e.g. ./amtool add alert -n AlwaysFiring foo=bar. As opposed to inferring that an error in parsing the first label means you're trying to set the alertname directly.

tl;dr: Does it feel like we're at a point where a silence matcher has different semantics from a label matcher even if they have the same implementation (even if theoretically they're to match just key/value pairs)? I can see the appeal of separating the matching logic and documenting/validating the semantics of them concretely so that the CLIs, APIs and UI benefit from a consistent implementation.

We don't have to turn this into a long discussion async, I'll add this to the WG so that we can discuss it.

@beorn7
Copy link
Member

beorn7 commented Jul 7, 2022

WRT the amtool behavior: I agree that the special handling of the alertname feels weird. I guess that has historically evolved.

WRT at least one matcher that doesn't match the empty string: I'm not sure what the actual reason for this is, but I assume it's an efficiency concern. Any matcher matching the empty string essentially requires the equivalent of a "full table scan". If there is at least one matcher not matching the empty string, at least that "full table scan" is limited to whatever that one "proper" matcher returns rather than all alerts. To further speculate, I assume that the filter parameter on the API endpoint is more liberal because it's a user initiated action. If it is too expensive, well, then it times out eventually, but that's it. However, silences have to be checked continuously. If that's too expensive, the AM is permanently overloaded. (Again, this is all educated guessing. My knowledge of the AM internals is limited.) I do think that the preview should implement the same restrictions as the silence itself (cf. #2610). (The reason for the different behavior can be explained by my speculation above: One-time user initiated action vs. permanently running queries. Still, of course, we should do the same evaluation for the former in this case.)

WRT allowing empty labels for adding alerts: While an empty label is the same as a non-existing label, it is still valid and has well defined semantics throughout the Prometheus ecosystem. It would be really confusing if the only place where you ban this kind of redundancy would be the API for adding alerts.

In summary, I don't think that the matchers have different semantics for silences vs. alerts. I would even say that what we have is all rooted in the intention to keep the semantics the same. The one thing that sticks out (at least one matcher not matching the empty string) isn't really semantics of the matchers themselves but a higher level query requirement (which exists in the same way in Prometheus itself).

@gotjosh
Copy link
Member Author

gotjosh commented Sep 30, 2022

After recently experiencing a panic in one of our largest environments, I realised this is not worth blocking on some sort of theoretical consistency across silences and alerts APIs.

The general comments have a tendency on foo= being equal to foo="" and that's what this PR does. Effectively avoiding panic and providing the user with a consistent error message.

As an example, in the case of the reported issue:

 ~> ./amtool --alertmanager.url=http://127.0.0.1:9093 alert add foo=
amtool: error: [POST /alerts][400] postAlertsBadRequest  at least one label pair required

instead of a panic.

I think we should move forward and merge this as this problem has already been reported several times on multiple fronts.

Fixes #2936

Signed-off-by: gotjosh <josue.abreu@gmail.com>
Signed-off-by: gotjosh <josue.abreu@gmail.com>
Copy link
Member

@roidelapluie roidelapluie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panic when using amtool
4 participants