Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expires notify log sooner when possible #2982

Merged
merged 1 commit into from Oct 19, 2022

Conversation

roidelapluie
Copy link
Member

@roidelapluie roidelapluie commented Jul 5, 2022

It seems useless to keep the notifications in the nflog for longer than
twice the repeat interval. This should help reduce memory usage of
clustered alertmanagers.

  • needs tests

Signed-off-by: Julien Pivotto roidelapluie@o11y.eu

@roidelapluie
Copy link
Member Author

Attempt to fix #2961

@roidelapluie roidelapluie requested a review from gotjosh July 5, 2022 10:56
@roidelapluie roidelapluie changed the title [RFC] Expires notify log sooner when possible Expires notify log sooner when possible Sep 15, 2022
@roidelapluie
Copy link
Member Author

@gotjosh We have ran this and noticed a 30% memory decrease with alertmanager with low repeat interval. Any chances to get this merged?

notify/notify.go Outdated
@@ -785,7 +785,12 @@ func (n SetNotifiesStage) Exec(ctx context.Context, l log.Logger, alerts ...*typ
return ctx, nil, errors.New("resolved alerts missing")
}

return ctx, alerts, n.nflog.Log(n.recv, gkey, firing, resolved)
var expiry time.Duration
if n, ok := RepeatInterval(ctx); ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The repeat interval should always be present in the context? At least *DedupStage.Exec() returns an error if this is the case. I suggest that we do the same here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

It seems useless to keep the notifications in the nflog for longer than
twice the repeat interval. This should help reduce memory usage of
clustered alertmanagers.

Signed-off-by: Julien Pivotto <roidelapluie@o11y.eu>
Copy link
Member

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. I have a question on how the data retention interacts with repeat interval.

@@ -415,6 +415,11 @@ func (l *Log) Log(r *pb.Receiver, gkey string, firingAlerts, resolvedAlerts []ui
}
}

expiresAt := now.Add(l.retention)
if expiry > 0 && l.retention > expiry {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't it make sense to use the default retention only if expiry > 0?
If repeat_interval is 10d (because why not), it would make sense to keep the log for longer than the default 120h retention. We could also get rid of this check:

if r.RouteOpts.RepeatInterval > *retention {
level.Warn(configLogger).Log(
"msg",
"repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected.",
"repeat_interval",
r.RouteOpts.RepeatInterval,
"retention",
*retention,
"route",
r.Key(),
)
}
})

@roidelapluie
Copy link
Member Author

This would be change of behaviour, let's merge this while we think on the further change

@roidelapluie roidelapluie merged commit 21ca295 into prometheus:main Oct 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants