Add logID to fetched leaves metric. #3077

Merged: 2 commits into google:master on Sep 8, 2023

Conversation

@phbnf (Contributor) commented Aug 30, 2023

This allows the number of fetched leaves to be broken down by log ID.
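
As a rough illustration only (not the actual diff in this PR, and using prometheus/client_golang directly rather than Trillian's own monitoring abstraction; the metric, label, and helper names are assumed), the change amounts to turning a plain counter into a counter vector keyed by a log-ID label:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Before this kind of change, the metric would be a plain counter, i.e. a
// single timeseries per server process:
//
//	var fetchedLeaves = prometheus.NewCounter(prometheus.CounterOpts{
//		Name: "fetched_leaves",
//		Help: "Number of leaves fetched.",
//	})

// After: a counter vector keyed by a log-ID label, giving one timeseries per
// log served by the Trillian instance.
var fetchedLeaves = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "fetched_leaves",
	Help: "Number of leaves fetched, broken down by log ID.",
}, []string{"logid"})

// recordFetch is a hypothetical helper showing how callers attach the label.
func recordFetch(logID string, count int) {
	fetchedLeaves.WithLabelValues(logID).Add(float64(count))
}
```

The cardinality of such a label is bounded by the number of logs the Trillian instance serves, which is the point discussed in the cardinality thread below.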

@phbnf phbnf requested a review from a team as a code owner August 30, 2023 09:06
@phbnf phbnf requested a review from mhutchinson August 30, 2023 09:06
@mhutchinson (Contributor) left a comment

I'd like to see @AlCutter approve this too.

I think this is a good change, but I worry that it could be a breaking change. I'm not familiar enough with all of the monitoring frameworks, but it wouldn't surprise me to find that aggregation rules get broken when this label appears.

At a minimum this should be noted in the CHANGELOG. I'd also consider whether this should be a new counter in addition to the old one; if we go that route, we can deprecate the old counter and remove it in a release or two.
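
For illustration, one way the parallel-metric suggestion could look, sketched with prometheus/client_golang and hypothetical metric names (this is only a sketch of the idea, not the change made in this PR):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Old, label-less counter, kept temporarily so existing dashboards and alerts
// keep working; it would be flagged as deprecated in the CHANGELOG and dropped
// in a later release.
var fetchedLeavesTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "fetched_leaves",
	Help: "Number of leaves fetched (deprecated, see fetched_leaves_by_log).",
})

// New counter carrying the log-ID label.
var fetchedLeavesByLog = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "fetched_leaves_by_log",
	Help: "Number of leaves fetched, broken down by log ID.",
}, []string{"logid"})

// recordFetch increments both counters during the deprecation window, so
// existing consumers see no change until the old name is removed.
func recordFetch(logID string, count int) {
	fetchedLeavesTotal.Add(float64(count))
	fetchedLeavesByLog.WithLabelValues(logID).Add(float64(count))
}
```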

@AlCutter (Member)

Also not an expert, but I suspect it'll be fine for things like Prometheus (AFAIU, new scrapes of that metric will simply carry the extra label, so there should be no existing aggregations based on it). I'd be a lot more worried about removing a label, for example.

Definitely agree it should be in the CHANGELOG though, and if we want to err on the side of caution, then Martin's suggestion of a parallel metric is worth considering.

Might also be worth asking someone who runs Prometheus in prod - I'm pretty sure Let's Encrypt uses it for monitoring their logs.

@JAORMX (Collaborator) commented Aug 30, 2023

@AlCutter the main thing to worry about is the cardinality of the metric. Do we expect many different logID values, such that the cardinality of the metric could become too large? If the Trillian instance doesn't have that many logIDs, it will be fine.

@AlCutter (Member)

@JAORMX Yeah, good point - that crossed my mind too, but I believe this should only ever be a small number of possible values, as it's the set of log IDs being managed by the Trillian instance.

@JAORMX (Collaborator) commented Aug 30, 2023

If that's the case, I think this should be fine.

@phbnf (Contributor, Author) commented Sep 4, 2023

Thanks for the thoughtful review! I've added an entry to the CHANGELOG.

@phbnf (Contributor, Author) commented Sep 4, 2023

I think it should be fine as well. There are already some metrics exported with the log ID on the logsigner, and the cardinality should be the same as it is here. There are also some metrics exported with the logid on the logserver side, but they only apply to pre-ordered logs, so they're not really relevant to this conversation.

Let me do a bit of investigation into the impact on Prometheus.

@phbnf phbnf reopened this Sep 5, 2023
@phbnf phbnf force-pushed the master branch 3 times, most recently from e04386d to 6d429e7 on September 6, 2023 09:49
@phbnf (Contributor, Author) commented Sep 6, 2023

Going through the documentation and resurrecting old Borgmon memories, I think it depends on how the metric is aggregated. Before this PR, each logserver exports a single timeseries. After this PR, there will be one timeseries per log_id. If Prometheus is configured to aggregate the counter over a specific set of labels (dc, for instance), then this PR won't have an impact. If it's configured to aggregate the counter with a without clause (without instance and job, for instance), then it will output multiple timeseries.

It shouldn't be too much of an issue for graphs; it might even be a desirable change, and if not, the aggregation config can always be changed.

This could be more problematic for alerts though [1]. Before this PR, if an alerting rule alerts on fetched_leaves > 3 without aggregating it, then fetched_leaves = 4 will result in an alert, even if 2 leaves are fetched from log_1 and 2 other leaves from log_2. After this PR, fetched_leaves{log_id=1} = 2, fetched_leaves{log_id=2} = 2 will not result in an alert, unless the alerting condition is sum(fetched_leaves) > 3.

This is all very hypothetical: it depends on whether someone has set up an alert based on this metric, and on how that alert is set up.
In practice, I'd be surprised if someone had set up an alert on having too many reads. There could be an alert on too few reads though, and this PR would make such an alert easier to trigger, which could then be fixed if it becomes too noisy.

Given that this change is documented in the CHANGELOG, that it's quite unlikely to break something silently, and that it adds more visibility into Trillian, I think proceeding with it is a good tradeoff.

[1] https://prometheus.io/docs/prometheus/latest/querying/operators/#:~:text=Between%20an%20instant%20vector%20and%20a%20scalar%2C%20these%20operators%20are%20applied%20to%20the%20value%20of%20every%20data%20sample%20in%20the%20vector%2C
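
A minimal, self-contained sketch of the behaviour described above (not Trillian code; the metric and label names are illustrative): the per-log series still sum to the old single-series total, so aggregations and alerts that wrap the metric in sum() keep seeing the same value, while unaggregated comparisons now apply per log.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Hypothetical stand-in for the counter touched by this PR.
	fetchedLeaves := prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "fetched_leaves",
		Help: "Number of leaves fetched, broken down by log ID.",
	}, []string{"logid"})

	reg := prometheus.NewRegistry()
	reg.MustRegister(fetchedLeaves)

	// Two leaves fetched from each of two logs: 2 + 2 = 4 in total.
	fetchedLeaves.WithLabelValues("log_1").Add(2)
	fetchedLeaves.WithLabelValues("log_2").Add(2)

	mfs, err := reg.Gather()
	if err != nil {
		panic(err)
	}
	var total float64
	for _, mf := range mfs {
		// Each element is one per-log timeseries with value 2; an alert on the
		// raw series would now compare 2 against its threshold, while
		// sum(fetched_leaves) still sees 4, matching the pre-PR behaviour.
		for _, m := range mf.GetMetric() {
			total += m.GetCounter().GetValue()
		}
	}
	fmt.Println("sum over logid:", total) // prints 4
}
```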

@phbnf phbnf merged commit 1912058 into google:master Sep 8, 2023
8 checks passed