
rate() function breaks histogram bucket monotonicity #13671

Closed

vpranckaitis opened this issue Feb 29, 2024 · 6 comments

Comments

@vpranckaitis (Contributor)

What did you do?

(Some context: I was asked about a PromQL info warning, "input to histogram_quantile needed to be fixed for monotonicity", for a particular query. While investigating, I didn't find anything wrong with the data. The culprit turned out to be the rate() function.)

For some time series, the rate() function breaks histogram bucket monotonicity. When such a result is used in histogram_quantile(), this leads to a PromQL info message, which points to the documentation. If you dig deeper, there's a rather alarming mention of invalid data:

> The latter is evidence for an actual issue with the input data and is therefore flagged with an informational annotation reading input to histogram_quantile needed to be fixed for monotonicity. If you encounter this annotation, you should find and remove the source of the invalid data.
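This annotation typically appears with the standard pattern of computing quantiles over rated classic-histogram buckets; a representative query (illustrative, not quoted from the original report) looks like:

```
histogram_quantile(
  0.9,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```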

What did you expect to see?

It seems quite common to use histogram_quantile() together with rate(). Ideally, rate() would not break histogram bucket monotonicity, though that is probably not an easy change. Otherwise, the PromQL info message or the documentation may need some adjustment, mentioning the possibility that the data is fine and the message is a false positive caused by the rate() function.

What did you see instead? Under which circumstances?

See this commit for a small test which demonstrates how a higher histogram bucket can end up with a lower rate() value.
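The linked test itself isn't reproduced in this thread, but a hedged sketch in the style of Prometheus's promqltest scripts illustrates the shape of the problem (series names and values are made up, and the comments reflect my reading of the pre-fix extrapolation code, not a verified output):

```
# The raw data is perfectly monotone: le="+Inf" >= le="1" at every sample.
load 1m
  requests_bucket{le="1"}    2+2x10
  requests_bucket{le="+Inf"} 100+2x10

# Both buckets increase by exactly 20 over the sampled 10 minutes. With a
# 20m window, the first sample sits far from the window start, so rate()
# extrapolates backwards. Before the fix in #13725, the "extrapolate at
# most to the counter's zero point" cap could kick in for the near-zero
# le="1" series and grant it MORE backward extrapolation than the usual
# half-sample-interval that le="+Inf" receives, which can yield
# rate(le="1") > rate(le="+Inf"), i.e. broken bucket monotonicity.
eval instant at 10m rate(requests_bucket[20m])
```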

System information

No response

Prometheus version

No response

Prometheus configuration file

No response

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

No response

beorn7 (Member) commented Mar 6, 2024

Thanks for raising this.

This must have to do with the different extrapolation lengths for the two buckets. I currently fail to understand how that would ever give the le=1 bucket a higher weight, but I'll investigate.

beorn7 (Member) commented Mar 7, 2024

OK, that was really subtle. I found an (IMHO) unintended behavior in the rate extrapolation; see #13725 for the fix and a detailed explanation.
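For readers who skip the PR: rate() estimates the increase between the first and last sample in the window and scales it to the full range; schematically (my summary of the logic in promql/functions.go, not a quote from the fix):

```math
\mathrm{rate} = \frac{v_\mathrm{last} - v_\mathrm{first}}{T_\mathrm{sampled}} \cdot \frac{T_\mathrm{extrapolated}}{T_\mathrm{range}}
```

where T_extrapolated extends T_sampled toward each window boundary, normally by at most half an average scrape interval, and for counters by at most the time at which the counter would have been zero. As I understand the fix, the subtlety was that the zero-point cap could effectively grant more extrapolation than the half-interval rule, and it does so more readily for series whose values sit close to zero, i.e. lower buckets.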

beorn7 closed this as completed Mar 7, 2024
@federicopires

Sorry to jump into a closed issue. We updated to Prometheus v2.51 recently (which includes this fix) because we were running into a very similar problem, but the issue still persists for us.

beorn7 (Member) commented Mar 26, 2024

In that case, I assume the problem is in your data and not in Prometheus. Please check your data. If you find that correct data leads to a false warning, please open a new issue with enough evidence that we can reproduce the issue. Or, ideally, write a test that demonstrates the erroneous behavior. (The commit linked above was a formidable example of that.)

@federicopires

> In that case, I assume the problem is in your data and not in Prometheus. Please check your data. If you find that correct data leads to a false warning, please open a new issue with enough evidence that we can reproduce the issue. Or, ideally, write a test that demonstrates the erroneous behavior. (The commit linked above was a formidable example of that.)

Thanks @beorn7. We are getting these with remote write batches; could this be related to occasionally getting out-of-order datapoints? The metrics causing the warnings come from https://github.com/nginxinc/nginx-prometheus-exporter, scraped by a Prometheus agent.

Anyway, we'll see if we can figure out the problem.

beorn7 (Member) commented Mar 26, 2024

I'm not a remote-write expert, but I think it suffers from histogram buckets not arriving at the same time, which might run into this kind of data incorrectness. You could try to postpone your evaluation time a bit (e.g. with offset 2m or something) to see if the problem goes away.
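Applied to the query shape from earlier in the thread, that workaround would look roughly like this (metric name and durations are illustrative):

```
histogram_quantile(
  0.9,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m] offset 2m))
)
```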
