
discuss: stateful series tracking staleness #31016

Closed
Tracked by #30705
sh0rez opened this issue Feb 2, 2024 · 6 comments
Labels
discussion needed Community discussion needed

Comments

@sh0rez
Contributor

sh0rez commented Feb 2, 2024

Component(s)

deltatocumulative (wip), interval (wip), others?

Describe the issue you're reporting

Stateful components keep state about telemetry signals (like metric streams) in memory.
WIP processors like deltatocumulative and interval need to maintain a (variable-size) set of samples per tracked series.
As series may come and go, tracking them indefinitely leads directly to unbounded memory growth.

Systems like Prometheus solve this using "staleness": series that have not received fresh samples for a given time interval are considered "stale" and are subsequently removed from tracking, freeing the memory they occupy.

Given that several stateful metrics processors overlap in needing to track streams and expire that tracking, I think there is an opportunity to generalize this behavior, e.g. behind a stream-map interface like the following:

// Map tracks a value of type T per stream.
type Map[T any] interface {
	// Load returns the value tracked for the given stream, if any.
	Load(Stream) (T, bool)
	// Store sets the value tracked for the given stream.
	Store(Stream, T)
	// Forget stops tracking the given stream, freeing its memory.
	Forget(Stream)
}
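
A plain, non-expiring implementation could just wrap a Go map (a sketch, not part of the proposal; it assumes Stream is a comparable identity type, e.g. a hash over resource, scope and attributes):

// HashMap is a hypothetical, non-expiring Map backed by a plain Go map.
type HashMap[T any] map[Stream]T

func (m HashMap[T]) Load(id Stream) (T, bool) { v, ok := m[id]; return v, ok }
func (m HashMap[T]) Store(id Stream, v T)     { m[id] = v }
func (m HashMap[T]) Forget(id Stream)         { delete(m, id) }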

A staleness implementation might look as follows:

// PriorityQueue orders keys K by their priority P (here: last-seen timestamps).
type PriorityQueue[K, P comparable] interface {
	// Update inserts the key or updates its priority.
	Update(id K, prio P)
	// Peek returns the lowest (oldest) priority without removing it.
	Peek() P
	// Pop removes and returns the key with the lowest priority.
	Pop() K
	// Len reports the number of tracked keys.
	Len() int
}

// Staleness wraps a Map and expires entries that have not been stored to
// for longer than Max.
type Staleness[T any] struct {
	Max time.Duration

	Map[T]
	pq PriorityQueue[Stream, time.Time]
}

// Store records the current time as the stream's last-seen timestamp and
// delegates to the wrapped Map.
func (s Staleness[T]) Store(id Stream, v T) {
	s.pq.Update(id, time.Now())
	s.Map.Store(id, v)
}

// Purge evicts all streams whose last sample is older than Max.
func (s Staleness[T]) Purge() {
	for s.pq.Len() > 0 {
		ts := s.pq.Peek()
		if time.Since(ts) < s.Max {
			break
		}
		id := s.pq.Pop()
		s.Map.Forget(id)
	}
}
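
For illustration, a stateful processor might wire this up roughly as below (a hypothetical sketch: NewStaleness, HashMap and the sample channel are assumptions, not part of the proposal):

// runProcessor is a hypothetical consume loop. NewStaleness is an assumed
// constructor that wires a Map and a PriorityQueue together; locking for
// concurrent use is omitted here.
func runProcessor(samples <-chan struct {
	ID    Stream
	Value int64
}) {
	track := NewStaleness[int64](5*time.Minute, HashMap[int64]{})

	purge := time.NewTicker(time.Minute)
	defer purge.Stop()

	for {
		select {
		case s := <-samples:
			// storing a sample also refreshes the stream's last-seen time
			track.Store(s.ID, s.Value)
		case <-purge.C:
			// drop every stream whose last sample is older than Max
			track.Purge()
		}
	}
}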

/cc @RichieSams @djaglowski

@sh0rez sh0rez added the needs triage New item requiring triage label Feb 2, 2024
@sh0rez sh0rez changed the title discuss: stateful staleness discuss: stateful series tracking staleness Feb 2, 2024
@djaglowski
Member

Thanks for opening this @sh0rez.

Would you be willing to propose a corresponding config as well? I assume the user should have control over the staleness timeout. Is there anything else?

A related concern I have is regarding how the user can manage cardinality. Should we also have the ability to set a max number of streams, and flush the oldest when we would exceed the max? I'm asking here because these are both directly related to managing the amount of data retention, so we might want to unify these concerns in a single package. I'm curious your thoughts on this.
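
The same priority-queue machinery could probably back a stream-count cap as well, e.g. (a hypothetical sketch, not an agreed design):

// Limit is a hypothetical Map wrapper that caps the number of tracked
// streams, evicting the least recently stored one when the cap is hit.
type Limit[T any] struct {
	Max int

	Map[T]
	pq PriorityQueue[Stream, time.Time]
}

func (l Limit[T]) Store(id Stream, v T) {
	if _, ok := l.Map.Load(id); !ok && l.pq.Len() >= l.Max {
		// at capacity and this is a new stream: forget the oldest one
		l.Map.Forget(l.pq.Pop())
	}
	l.pq.Update(id, time.Now())
	l.Map.Store(id, v)
}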

@sh0rez
Contributor Author

sh0rez commented Feb 2, 2024

I think we can lean on Prometheus' experience for this, e.g. this talk: https://promcon.io/2017-munich/slides/staleness-in-prometheus-2-0.pdf

tldr:

  • a fixed interval has drawbacks, e.g. "target overlap", where a target was re-deployed but the old one is not yet stale
  • prom has "staleness markers" (special NaNs, see the sketch below) it inserts when:
    • the target goes away (no longer in service discovery)
    • the series is no longer in the scrape
    • the scrape fails
  • it also does some trickery to avoid false positives: instead of inserting markers right away when the conditions are met, it sleeps until after the same scrape and tries then. As the TSDB is append-only, this only succeeds if the conditions held true during that time.
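
For reference, a staleness marker is just a float sample carrying a reserved NaN bit pattern, roughly like this (a sketch; the exact constant mirrors, to my knowledge, Prometheus' model/value package and should be treated as an assumption):

// staleNaN is the bit pattern used for staleness markers
// (a signaling NaN, distinct from the regular math.NaN()).
const staleNaN uint64 = 0x7ff0000000000002

// StaleMarker returns the float64 form of the staleness marker.
func StaleMarker() float64 { return math.Float64frombits(staleNaN) }

// IsStaleMarker reports whether a sample value is a staleness marker.
// A plain v != v check would match ordinary NaNs too, so the bits are compared.
func IsStaleMarker(v float64) bool { return math.Float64bits(v) == staleNaN }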

This of course builds heavily on Prometheus' data model assumptions, which differ from OTel's.
Applying this to OTel leads me to some questions:

  • can we authoritatively detect a target going away?
  • can we authoritatively detect a series going away while the target remains?
  • can we do that interval trickery? does the single-writer principle help?

Most importantly, do we even want that? E.g. a sporadic delta producer might be considered stale all the time. What are the use cases we need to enable? Prom-like monitoring + alerting? Low-connectivity IoT?

@crobert-1 crobert-1 added the discussion needed Community discussion needed label Feb 2, 2024
@djaglowski
Member

Thanks for the detailed thoughts on this @sh0rez. At a high level I like the idea of not reinventing the wheel, but I don't have clear answers to your questions, so I'd want to hear other people's thoughts as well. Perhaps some folks with more OTel && Prometheus experience can chime in.

@RichieSams
Contributor

Apologies for the delayed response; I had some family health issues last week.

IMO a fixed interval gets us 98% of the benefits and is very simple to implement and understand. While it does have the "overlap" issue, I don't think that's a big problem in practice: the "old" counter will no longer be modified, so any useful operation like rate() will do the "right" thing and users won't know the difference, i.e. the rate() of the old series will drop to zero while the rate() of the new series starts up.

@RichieSams
Contributor

RichieSams commented Feb 6, 2024

@sh0rez @djaglowski I created a WIP PR implementing the above behaviour: ce07908

djaglowski pushed a commit that referenced this issue Feb 27, 2024
… staleness (#31089)

**Description:**
It's a glorified wrapper over a Map type, which allows values to be
expired based on a pre-supplied interval.

**Link to tracking Issue:**

#31016

**Testing:**
I added some basic tests of the PriorityQueue implementation as well as
the expiry behaviour of Staleness

**Documentation:**

All the new structs are documented
jpkrohling pushed a commit that referenced this issue Mar 5, 2024
**Description:** Removes stale series from tracking (and thus frees
their memory) using staleness logic from
#31089

**Link to tracking Issue:**
#30705,
#31016

**Testing:** `TestExpiry`
**Documentation:** README updated
@crobert-1 crobert-1 removed the needs triage New item requiring triage label Mar 5, 2024
@sh0rez
Contributor Author

sh0rez commented Mar 28, 2024

implementation and re-usable components are merged, closing

@sh0rez sh0rez closed this as completed Mar 28, 2024