
promql: Fix limiting of extrapolation to negative values #13725

Merged · 1 commit merged into main from beorn7/promql2 on Mar 7, 2024
Conversation

@beorn7 (Member) commented Mar 7, 2024

This is a bit tough to explain, but I'll try:

`rate` & friends have a sophisticated extrapolation algorithm. Usually, we extrapolate the result to the total interval specified in the range selector. However, if the first sample within the range is too far away from the beginning of the interval, or if the last sample within the range is too far away from the end of the interval, we assume the series has just started half a sampling interval before the first sample or ended half a sampling interval after the last sample, respectively, and shorten the extrapolation interval correspondingly. We calculate the sampling interval by looking at the average time between samples within the range, and we define "too far away" as "more than 110% of that sampling interval".
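For illustration only, here is a minimal sketch of that window-selection rule (the function and variable names are made up for this example and simplified relative to the actual code):

```go
package main

import "fmt"

// extrapolationWindow returns how far (in seconds) to extrapolate beyond the
// first (or last) sample on one side of the range, given the gap between that
// sample and the range boundary and the average time between samples.
func extrapolationWindow(gapToBoundary, avgSampleInterval float64) float64 {
	threshold := avgSampleInterval * 1.1
	if gapToBoundary < threshold {
		// The sample is close enough to the boundary: extrapolate all the
		// way to the boundary of the range selector.
		return gapToBoundary
	}
	// The sample is "too far away": assume the series starts (or ends)
	// half a sampling interval beyond it.
	return avgSampleInterval / 2
}

func main() {
	// Samples every 10s on average; first sample 4s after the range start,
	// last sample 15s before the range end.
	fmt.Println(extrapolationWindow(4, 10))  // 4 -> extrapolate to the range start
	fmt.Println(extrapolationWindow(15, 10)) // 5 -> only half a sampling interval
}
```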

However, if this algorithm leads to an extrapolated starting value that is negative, we limit the start of the extrapolation interval to the point where the extrapolated starting value is zero.

At least that was the intention.

What we actually implemented is the following: If extrapolating all the way to the beginning of the total interval would lead to an extrapolated negative value, we would only extrapolate to the zero point as above, even if the algorithm above would have selected a starting point that is just half a sampling interval before the first sample and that starting point would not have an extrapolated negative value. In other words: what was meant as a _limitation_ of the extrapolation interval yielded a _longer_ extrapolation interval in this case.

There is an exception to the case just described: If the increase of the extrapolation interval is more than 110% of the sampling interval, we suddenly drop back to extrapolating by only half a sampling interval.

This behavior can be nicely seen in the testcounter_zero_cutoff test, where the rate goes up all the way to 0.7 and then jumps back to 0.6.

This commit changes the behavior to what was (presumably) intended from the beginning: The extension of the extrapolation interval is only limited if actually needed to prevent extrapolation to negative values, but the "limitation" never leads to _more_ extrapolation anymore.
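As a rough sketch of the before/after behavior (again with made-up names, not the literal diff), the change amounts to applying the zero cutoff after the 110% check instead of before it, so that it can only ever shorten the window:

```go
package main

import "fmt"

// startWindowOld models the behavior described above as actually implemented:
// the zero cutoff replaces the gap *before* the 110% check, so it can yield a
// longer extrapolation window than the half-interval fallback.
func startWindowOld(durationToStart, avgInterval, durationToZero float64) float64 {
	if durationToZero < durationToStart {
		durationToStart = durationToZero
	}
	if durationToStart < avgInterval*1.1 {
		return durationToStart
	}
	return avgInterval / 2
}

// startWindowFixed models the intended behavior: the 110% rule picks the
// window first, and the zero cutoff can only ever shorten it.
func startWindowFixed(durationToStart, avgInterval, durationToZero float64) float64 {
	if durationToStart >= avgInterval*1.1 {
		durationToStart = avgInterval / 2
	}
	if durationToZero < durationToStart {
		durationToStart = durationToZero
	}
	return durationToStart
}

func main() {
	// Samples every 10s on average, first sample 60s after the range start,
	// extrapolated zero point 8s before the first sample.
	fmt.Println(startWindowOld(60, 10, 8))   // 8 -> longer than the intended 5
	fmt.Println(startWindowFixed(60, 10, 8)) // 5 -> half a sampling interval
	// With a zero point 12s before the first sample, both return 5: that is
	// the "exception" described above.
}
```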

The difference is subtle, and probably it never bothered anyone. However, if you calculate the rate of a classic histogram, the old behavior might create non-monotonic histograms as a result (because of the jumps you can see nicely in the old version of the testcounter_zero_cutoff test). With this fix, that doesn't happen anymore.

@beorn7 (Member, Author) commented Mar 7, 2024

@zenador here is the promised fix.

@beorn7 (Member, Author) commented Mar 7, 2024

This should fix #13671.

@juliusv (Member) commented Mar 7, 2024

I had to read that multiple times before I think I understood it. So here's an illustration of the issue, I hope it's correct:

(image: incorrect-rate-extrapolation)

@juliusv (Member) commented Mar 7, 2024

👍

@beorn7 (Member, Author) commented Mar 7, 2024

I had scribbled a graph on paper, but it was not legible enough to put a scan here. 😁

@juliusv your graph is almost correct. But the zero point is actually far enough from the first sample (more than `sampleInterval * 1.1`) that it would fall back to `sampleInterval / 2`. So the problem only happens if the zero point is closer than `sampleInterval * 1.1` but farther than `sampleInterval / 2`.
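To put hypothetical numbers on that window: with samples arriving every 10s on average (and the first sample far enough from the start of the range), both versions agree if the extrapolated zero point lies less than 5s (half a sampling interval) or more than 11s (110% of the sampling interval) before the first sample; only in the band between 5s and 11s did the old code extrapolate further back than intended.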

@beorn7 beorn7 merged commit b0c0961 into main Mar 7, 2024
40 checks passed
@beorn7 beorn7 deleted the beorn7/promql2 branch March 7, 2024 11:36
@juliusv (Member) commented Mar 7, 2024

Ah, thanks for clarifying! Indeed so complicated to think about!

krajorama added a commit to grafana/mimir that referenced this pull request Mar 12, 2024
Includes fix to rate calculation
prometheus/prometheus#13725
And some native histograms
prometheus/prometheus#13681

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
krajorama added a commit to grafana/mimir that referenced this pull request Mar 12, 2024
* Vendor update mimir-prometheus at 2b3d9c010123

Includes fix to rate calculation
prometheus/prometheus#13725
And some native histograms
prometheus/prometheus#13681

* Copy private const strings from Alertmanager.

Note: these are again public on main in alertmanager.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Itaykal pushed a commit to Itaykal/mimir that referenced this pull request Mar 13, 2024
* Vendor update mimir-prometheus at 2b3d9c010123

Includes fix to rate calculation
prometheus/prometheus#13725
And some native histograms
prometheus/prometheus#13681

* Copy private const strings from Alertmanager.

Note: these are again public on main in alertmanager.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
charleskorn added a commit to grafana/mimir that referenced this pull request Apr 2, 2024
charleskorn added a commit to grafana/mimir that referenced this pull request Apr 9, 2024
* Use forked Prometheus module

* Use promql.QueryEngine rather than *promql.Engine, add config option to select engine

* Add parallel query path with streaming engine.

* Add IntelliJ debugging configuration for streaming querier.

* Add Grafana datasource for streaming querier

* Use correct query-scheduler port and enable debugging

* Set up load test.

* Fix p99 latency panel

* Commit settings used in previous test.

* Enable chunks streaming.

* Disable load generator.

* Create separate load testing query paths to ensure meta-monitoring queries don't interfere with load test results.

* Add initial k6-based load test script

* Add script to run load tests against both engines.

* Add todo

* Extract friendly name to hostname logic to separate script

* Rename k6 output file

* Check both engines produce the same result.

* Summarise results after each test run.

* Enable streaming chunks from store-gateways to queriers.

* Refactor test script, print max and average CPU and memory after test

* Reorder variables to match output order

* Don't include query name in k6 output file name.

* Add initial script to generate data.

* Rename script

* Add script to generate data generation config.

* Make it easier to test different queries.

* Rename PromQL engine CLI flag.

* Add Thanos' engine.

* Add Thanos engine to benchmarking setup.

* Set default test duration to 5 minutes.

* Rework data generation setup to work around limitation in prometheus-toolbox.

* Load test data faster.

* Record test config for first test

* Record test config for second test

* Record test config for third test

* Disable gRPC compression for ingester client

* Record test config for re-run of second test

* Record test config for re-run of first test

* Record test config for re-run of third test

* Log URL used for test.

* Record test config for fourth test

* Record test config for fifth test

* Record test config for 10k series test

* Record test config for 1k series test

* Record test config for 50k series test

* Fix configuration issue that does not affect outcome of testing.

This does not affect the outcome of testing because all tests use
queriers directly (ie. do not use query-frontends or query-schedulers)

* Record test config for seventh test

* Move production code to main Mimir code tree

* Copy test files into repo

* Add license and provenance comments

* Add a readme.

* Update package path in scripts and remove unnecessary testing script

* Add more notes to code

* Explain benefit of streaming engine.

* Mention benchmarking tools in readme.

* Remove references to Thanos' engine.

* Undo unnecessary go.mod changes

* Remove benchmarking tool for now

We can add something more robust in the future if we need it.

* Remove docker-compose changes

* Reset whitespace changes

* Remove readme

* Vendor in branch of mimir-prometheus with changes from prometheus/prometheus#13713

* Fix issues introduced with recent Prometheus changes and fix linting issues

* Detect and reject configuration not supported by streaming engine

* Remove TODOs tracked elsewhere

* Bring in latest changes to `QueryEngine` interface

* Remove more TODOs tracked elsewhere

* Change signature of `Next()` to return an error when the stream is exhausted

* Ignore warnings when comparing results from Prometheus and streaming engines.

* Correct naming

* Rename `Operator` to clarify that it produces instant vectors.

* Rename test file

* Reorganise test file

* Bring in latest benchmark changes from upstream.

* Run tests and benchmarks with isolation disabled, to mirror what Mimir itself does

* Fix script permissions

* Simplify test setup and don't bother testing 2000 series case to speed up tests.

* Use assertion helpers in benchmark

* Improve engine benchmark comparison script: fix issue with package name containing "streaming", clearly label results

* Remove unnecessary files

* Add note to engine benchmark comparison script

* Make test names clearer

* Ensure both engines produce the same result before benchmarking

* Add tests for unsupported expressions and improve error messages returned when unsupported expression used.

* Fix linting warnings

* Add notes for invalid cases that should be caught by the PromQL parser.

* Implement `Query.Statement()`

* Give variable a better name

* Remove Pool interface

* Remove unnecessary types

* Update comments

* Remove unused lookback delta for range vector selector

* Extract selector logic to a shared type

* Add more TODOs

* Use Selector for range vector selectors.

* Compute values used on every call to Next() once in instant and range vector selectors.

This gives a 0-7% latency improvement for queries with many series at
small absolute cost to queries with few series.

Memory consumption is slightly higher (tens of bytes), but not
concerning.

* Remove unnecessary script and update comment

* Validate query time range

* Enforce that range query expression produces a vector or scalar result.

* Add CLI flag to experimental features list in docs

* Add changelog entry

* Fix linting warnings

* Move common PromQL engine configuration to its own package

* Don't bother with 10 series benchmarks.

They're not very interesting, and the benchmarks take too long to run.

* Sort imports

* Add missing license header.

* Use bucketed pool based on zeropool.

This greatly improves latency and allocations, with latency
improvements between 0 and 15% on our benchmarks.

* Fix import sorting

* Set upper limits for sizes of pooled slices.

* Set size for SeriesBatch pool

* Add integration test.

* Move RingBuffer out of util package and pool slices used.

This improves latency of rate-related benchmarks by between 0 and 11%.

* Add tests for ring buffer type.

* Return RingBuffer point slices to pool when no longer required.

* Fix linting

* Remove TODOs about pooling group slices and maps.

Pooling these slices did not have a substantial performance impact.

* Pool groups in aggregation operator.

This reduces latency of aggregation operations by 0-2% in our
benchmarks.

* Remove TODOs about pooling label builders.

Pooling these had a negative performance impact.

* Remove more TODOs, clarify interface expectations

* Release memory sooner after query execution

* Fix typo in comment

* Remove unnecessary context cancellation checks from operators.

Context cancellation is handled by the underlying queryable, at least
for the PromQL expressions we currently support.

* Fix linting warnings

* Address remaining TODOs in Selector

* Fix integration test flakiness

* Don't panic when encountering a native histogram

* Fix issue where bucketed pool could return a slice with capacity less than the requested size.

* Remove redundant checks

* Add support for @ in instant and range vector selectors.

This is used extensively by Prometheus' test scripts, so it's easier
to just implement support for this rather than try to work around it
or disable it.

* Run Prometheus' test cases against streaming engine.

There are still some failing test cases, but these appear to be
genuine issues that need to be investigated.

* Fix failing staleness test: don't emit empty series for instant query

* Upgrade mimir-prometheus

* Add test case for stale markers in a range query.

* Add test cases for @ modifier with range queries

* Check for errors while reading test script

* Fix whitespace

* Remove duplicate timestamp computation in Selector and operators

* Update `rate` calculation to match behaviour in prometheus/prometheus#13725

* Add an overview of the engine.

* Fix linting issues and clarify docs.

* Add test case for scenario where input to aggregation returns no results or no points, and remove stale TODOs

* Implement `Query.String()`, and remove stale TODOs.

* Clarify BucketedPool contract, and add test case for case where min size is greater than 1.

* Expand test coverage to include scenarios not covered by upstream tests or those uniquely likely to cause issues in the streaming engine.

* Elaborate on comment

* Expand example in docs and fix typo

* Remove unnecessary sorting and use non-experimental slices package in comparison tests.

* Add missing comments.

* Clean up temporary files in benchmark comparison script, and improve formatting of output

* Add note explaining purpose of script.

* Address PR feedback

* Replace use of `DurationMilliseconds` with `d.Milliseconds()`

* Rename `Series()` to `SeriesMetadata()`

* Include names of acceptable values in CLI flag description

* Remove TODOs

* Add explanation for RangeVectorSelectorWithTransformation

* Move to top-level `streamingpromql` package

* Move common query engine config back to original location (revert e3933a2)