-
Notifications
You must be signed in to change notification settings - Fork 373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add logID to fetched leaves metric. #3077
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see @AlCutter approve this too.
I think this is a good change, but I worry that it could be a breaking change. I'm not familiar enough with all of the monitoring frameworks, but it wouldn't surprise me to find that aggregation rules get broken when this label appears.
At a minimum this should be noted in the CHANGELOG. I'd also consider whether this should be a new counter in addition to this old one. If we go this route, we can deprecate the old counter and remove it in a release or two.
Also not an expert, but I suspect it'll be fine for things like prometheus (AFAIU new timeseries scrapes for that metric will just have the extra label, so there should be no aggregations which could be based on it). I'd be a lot more worried about removing a label, for example. Definitely agree it should be in the CHANGELOG though, and if we want to err on the side of caution then Martin's suggestion of a parallel metric is worth considering. Might also be worth asking someone who runs prom in prod - I'm pretty sure LetsEncrypt use it for monitoring their logs. |
@AlCutter the main thing one would worry about is the cardinality of the metric. Do we expect there to be many different |
@JAORMX Yeah, good point - that crossed my mind too, but I believe this should only every be a small number of possible values as it's the set of log IDs being managed by the Trillian instance. |
If that's the case, I think this should be fine |
Thanks for the mindful review! I've added an entry to the changelog. |
I think it should be fine as well. There are already some metrics that are exported with the log on the logsigner, and the cardinality should be the same it is here. There are already some metrics exported with the logid on the logserver side, but they only work for pre-ordered logs so it's not really relevant for that conversation. Let me do a bit of investigation for the impact on Prometheus. |
e04386d
to
6d429e7
Compare
Going through documentation and resurrecting old Borgmon memories, I think it depends on how the metric are aggregated. Before this PR, each logserver would export a single timeseries. After this PR, there will be one timeseries per log_id. If prometheus is configured to aggregate the counter over a specific set of labels (dc for instance), then this PR won't have an impact. If it's configured to aggregate the counter with a without clause (instance, job for instance), then it will output multiple timeseries. It should't be too much of an issue for graphs, it might even be a desirable change, and if not, the aggregation config can always be changed. This could be more problematic for alerts though [1]. Before this PR, if an alerting rule alerts on This is all very hypothetical, and it depends on whether someone has set up an alert based on this metric and how this alert is setup. Given this change is documented in the CHANGELOG, that it's quite unlikely it breaks something silently, and adds more visibility into Trillian, I think it's a good tradeoff to proceed with it. |
This allows to break down the number of leaves fetched by log id.