
rate() function breaks histogram bucket monotonicity #13671

Closed

vpranckaitis opened this issue Feb 29, 2024 · 6 comments

Comments

@vpranckaitis (Contributor)

What did you do?

(Some context: I was asked about a PromQL info warning, "input to histogram_quantile needed to be fixed for monotonicity", for a particular query. While investigating, I didn't find anything wrong with the data. The culprit turned out to be the rate() function.)

For some time series, the rate() function breaks histogram bucket monotonicity. When such a result is used in histogram_quantile(), this leads to a PromQL info message, which points to the documentation. If you dig deeper, there's a rather alarming mention of invalid data:

> The latter is evidence for an actual issue with the input data and is therefore flagged with an informational annotation reading input to histogram_quantile needed to be fixed for monotonicity. If you encounter this annotation, you should find and remove the source of the invalid data.
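This annotation typically appears with the standard pattern of computing quantiles over rated classic-histogram buckets; a representative query (illustrative, not quoted from the original report) looks like:

```
histogram_quantile(
  0.9,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```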

What did you expect to see?

It seems quite common to use histogram_quantile() together with rate(). Ideally, rate() would not break histogram bucket monotonicity, though that is probably not an easy change. Otherwise, the PromQL info message or the documentation may need some adjustment, mentioning the possibility that the data is fine and the message is a false positive caused by the rate() function.

What did you see instead? Under which circumstances?

See this commit for a small test which demonstrates how a higher histogram bucket can end up with a lower rate() value.
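The linked test itself isn't reproduced in this thread, but a hedged sketch in the style of Prometheus's promqltest scripts illustrates the shape of the problem (series names and values are made up, and the comments reflect my reading of the pre-fix extrapolation code, not a verified output):

```
# The raw data is perfectly monotone: le="+Inf" >= le="1" at every sample.
load 1m
  requests_bucket{le="1"}    2+2x10
  requests_bucket{le="+Inf"} 100+2x10

# Both buckets increase by exactly 20 over the sampled 10 minutes. With a
# 20m window, the first sample sits far from the window start, so rate()
# extrapolates backwards. Before the fix in #13725, the "extrapolate at
# most to the counter's zero point" cap could kick in for the near-zero
# le="1" series and grant it MORE backward extrapolation than the usual
# half-sample-interval that le="+Inf" receives, which can yield
# rate(le="1") > rate(le="+Inf"), i.e. broken bucket monotonicity.
eval instant at 10m rate(requests_bucket[20m])
```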

System information

No response

Prometheus version

No response

Prometheus configuration file

No response

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

No response

beorn7 (Member) commented Mar 6, 2024

Thanks for raising this.

This must have to do with the different extrapolation lengths for the two buckets. I currently fail to understand how that would ever give the le=1 bucket a higher weight, but I'll investigate.

beorn7 (Member) commented Mar 7, 2024

OK, that was really subtle. I found an (IMHO) unintended behavior in the rate extrapolation; see #13725 for the fix and a detailed explanation.
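For readers who skip the PR: rate() estimates the increase between the first and last sample in the window and scales it to the full range; schematically (my summary of the logic in promql/functions.go, not a quote from the fix):

```math
\mathrm{rate} = \frac{v_\mathrm{last} - v_\mathrm{first}}{T_\mathrm{sampled}} \cdot \frac{T_\mathrm{extrapolated}}{T_\mathrm{range}}
```

where T_extrapolated extends T_sampled toward each window boundary, normally by at most half an average scrape interval, and for counters by at most the time at which the counter would have been zero. As I understand the fix, the subtlety was that the zero-point cap could effectively grant more extrapolation than the half-interval rule, and it does so more readily for series whose values sit close to zero, i.e. lower buckets.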

beorn7 closed this as completed Mar 7, 2024
@federicopires

Sorry to jump into a closed issue. We updated to Prometheus v2.51 recently (which includes this fix) because we were running into a very similar problem, but the issue still persists for us.

beorn7 (Member) commented Mar 26, 2024

In that case, I assume the problem is in your data and not in Prometheus. Please check your data. If you find that correct data leads to a false warning, please open a new issue with enough evidence that we can reproduce the issue. Or, ideally, write a test that demonstrates the erroneous behavior. (The commit linked above was a formidable example of that.)

@federicopires

> In that case, I assume the problem is in your data and not in Prometheus. Please check your data. If you find that correct data leads to a false warning, please open a new issue with enough evidence that we can reproduce the issue. Or, ideally, write a test that demonstrates the erroneous behavior. (The commit linked above was a formidable example of that.)

Thanks @beorn7. We are getting these with remote write batches; could this be related to occasionally getting out-of-order datapoints? The metrics causing the warnings come from https://github.com/nginxinc/nginx-prometheus-exporter, scraped by a Prometheus agent.

Anyway, we'll see if we can figure out the problem.

beorn7 (Member) commented Mar 26, 2024

I'm not a remote-write expert, but I think it suffers from histogram buckets not arriving at the same time, which might run into this kind of data incorrectness. You could try to postpone your evaluation time a bit (e.g. with offset 2m or something) to see if the problem goes away.
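Applied to the query shape from earlier in the thread, that workaround would look roughly like this (metric name and durations are illustrative):

```
histogram_quantile(
  0.9,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m] offset 2m))
)
```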
