Add logID to fetched leaves metric. #3077

Merged: 2 commits into google:master on Sep 8, 2023

Conversation

@phbnf (Contributor) commented Aug 30, 2023

This allows the number of fetched leaves to be broken down by log ID.
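
As a rough illustration only (not the actual diff in this PR, and using prometheus/client_golang directly rather than Trillian's own monitoring abstraction; the metric, label, and helper names are assumed), the change amounts to turning a plain counter into a counter vector keyed by a log-ID label:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Before this kind of change, the metric would be a plain counter, i.e. a
// single timeseries per server process:
//
//	var fetchedLeaves = prometheus.NewCounter(prometheus.CounterOpts{
//		Name: "fetched_leaves",
//		Help: "Number of leaves fetched.",
//	})

// After: a counter vector keyed by a log-ID label, giving one timeseries per
// log served by the Trillian instance.
var fetchedLeaves = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "fetched_leaves",
	Help: "Number of leaves fetched, broken down by log ID.",
}, []string{"logid"})

// recordFetch is a hypothetical helper showing how callers attach the label.
func recordFetch(logID string, count int) {
	fetchedLeaves.WithLabelValues(logID).Add(float64(count))
}
```

The cardinality of such a label is bounded by the number of logs the Trillian instance serves, which is the point discussed in the cardinality thread below.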

@phbnf phbnf requested a review from a team as a code owner August 30, 2023 09:06
@phbnf phbnf requested a review from mhutchinson August 30, 2023 09:06
@mhutchinson (Contributor) left a comment

I'd like to see @AlCutter approve this too.

I think this is a good change, but I worry that it could be a breaking change. I'm not familiar enough with all of the monitoring frameworks, but it wouldn't surprise me to find that aggregation rules get broken when this label appears.

At a minimum this should be noted in the CHANGELOG. I'd also consider whether this should be a new counter in addition to the old one; if we go that route, we can deprecate the old counter and remove it in a release or two.
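
For illustration, one way the parallel-metric suggestion could look, sketched with prometheus/client_golang and hypothetical metric names (this is only a sketch of the idea, not the change made in this PR):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Old, label-less counter, kept temporarily so existing dashboards and alerts
// keep working; it would be flagged as deprecated in the CHANGELOG and dropped
// in a later release.
var fetchedLeavesTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "fetched_leaves",
	Help: "Number of leaves fetched (deprecated, see fetched_leaves_by_log).",
})

// New counter carrying the log-ID label.
var fetchedLeavesByLog = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "fetched_leaves_by_log",
	Help: "Number of leaves fetched, broken down by log ID.",
}, []string{"logid"})

// recordFetch increments both counters during the deprecation window, so
// existing consumers see no change until the old name is removed.
func recordFetch(logID string, count int) {
	fetchedLeavesTotal.Add(float64(count))
	fetchedLeavesByLog.WithLabelValues(logID).Add(float64(count))
}
```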

@AlCutter (Member)

Also not an expert, but I suspect it'll be fine for things like Prometheus (AFAIU, new scrapes of that metric will simply carry the extra label, so there should be no existing aggregations based on it). I'd be a lot more worried about removing a label, for example.

Definitely agree it should be in the CHANGELOG though, and if we want to err on the side of caution, then Martin's suggestion of a parallel metric is worth considering.

Might also be worth asking someone who runs Prometheus in prod - I'm pretty sure Let's Encrypt uses it for monitoring their logs.

@JAORMX (Collaborator) commented Aug 30, 2023

@AlCutter the main thing to worry about is the cardinality of the metric. Do we expect many different logID values, such that the cardinality of the metric could become too large? If the Trillian instance doesn't have that many logIDs, it will be fine.

@AlCutter (Member)

@JAORMX Yeah, good point - that crossed my mind too, but I believe this should only ever be a small number of possible values, as it's the set of log IDs being managed by the Trillian instance.

@JAORMX (Collaborator) commented Aug 30, 2023

If that's the case, I think this should be fine.

@phbnf (Contributor, Author) commented Sep 4, 2023

Thanks for the thoughtful review! I've added an entry to the CHANGELOG.

@phbnf (Contributor, Author) commented Sep 4, 2023

I think it should be fine as well. There are already some metrics exported with the log ID on the logsigner, and the cardinality should be the same as it is here. There are also some metrics exported with the logid on the logserver side, but they only apply to pre-ordered logs, so they're not really relevant to this conversation.

Let me do a bit of investigation into the impact on Prometheus.

@phbnf phbnf reopened this Sep 5, 2023
@phbnf phbnf force-pushed the master branch 3 times, most recently from e04386d to 6d429e7 on September 6, 2023 09:49
@phbnf (Contributor, Author) commented Sep 6, 2023

Going through the documentation and resurrecting old Borgmon memories, I think it depends on how the metric is aggregated. Before this PR, each logserver exports a single timeseries. After this PR, there will be one timeseries per log_id. If Prometheus is configured to aggregate the counter over a specific set of labels (dc, for instance), then this PR won't have an impact. If it's configured to aggregate the counter with a without clause (without instance and job, for instance), then it will output multiple timeseries.

It shouldn't be too much of an issue for graphs; it might even be a desirable change, and if not, the aggregation config can always be changed.

This could be more problematic for alerts though [1]. Before this PR, if an alerting rule alerts on fetched_leaves > 3 without aggregating it, then fetched_leaves = 4 will result in an alert, even if 2 leaves are fetched from log_1 and 2 other leaves from log_2. After this PR, fetched_leaves{log_id=1} = 2, fetched_leaves{log_id=2} = 2 will not result in an alert, unless the alerting condition is sum(fetched_leaves) > 3.

This is all very hypothetical: it depends on whether someone has set up an alert based on this metric, and on how that alert is set up.
In practice, I'd be surprised if someone had set up an alert on having too many reads. There could be an alert on too few reads though, and this PR would make such an alert easier to trigger, which could then be fixed if it becomes too noisy.

Given that this change is documented in the CHANGELOG, that it's quite unlikely to break something silently, and that it adds more visibility into Trillian, I think proceeding with it is a good tradeoff.

[1] https://prometheus.io/docs/prometheus/latest/querying/operators/#:~:text=Between%20an%20instant%20vector%20and%20a%20scalar%2C%20these%20operators%20are%20applied%20to%20the%20value%20of%20every%20data%20sample%20in%20the%20vector%2C
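
A minimal, self-contained sketch of the behaviour described above (not Trillian code; the metric and label names are illustrative): the per-log series still sum to the old single-series total, so aggregations and alerts that wrap the metric in sum() keep seeing the same value, while unaggregated comparisons now apply per log.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Hypothetical stand-in for the counter touched by this PR.
	fetchedLeaves := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "fetched_leaves",
		Help: "Number of leaves fetched, broken down by log ID.",
	}, []string{"logid"})

	reg := prometheus.NewRegistry()
	reg.MustRegister(fetchedLeaves)

	// Two leaves fetched from each of two logs: 2 + 2 = 4 in total.
	fetchedLeaves.WithLabelValues("log_1").Add(2)
	fetchedLeaves.WithLabelValues("log_2").Add(2)

	mfs, err := reg.Gather()
	if err != nil {
		panic(err)
	}
	var total float64
	for _, mf := range mfs {
		// Each element is one per-log timeseries with value 2; an alert on the
		// raw series would now compare 2 against its threshold, while
		// sum(fetched_leaves) still sees 4, matching the pre-PR behaviour.
		for _, m := range mf.GetMetric() {
			total += m.GetCounter().GetValue()
		}
	}
	fmt.Println("sum over logid:", total) // prints 4
}
```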

@phbnf phbnf merged commit 1912058 into google:master Sep 8, 2023
8 checks passed