balancer/rls: Add cache metrics #7495
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #7495 +/- ##
==========================================
+ Coverage 81.60% 81.68% +0.07%
==========================================
Files 357 359 +2
Lines 27294 27532 +238
==========================================
+ Hits 22274 22490 +216
- Misses 3816 3822 +6
- Partials 1204 1220 +16
4983e8b to 2eb99db (Compare)
2eb99db to 4d008b3 (Compare)
LGTM, modulo minor comments.
cacheSizeMetric.Record(dc.metricsRecorder, 0, grpcTarget, "", dc.uuid)
cacheEntriesMetric.Record(dc.metricsRecorder, 0, grpcTarget, "", dc.uuid)
Is this really required? The grpc.lb.rls.server_target is empty here. So, will this measurement even be useful?
I think so. This gives a measurement at the beginning that the cache is currently 0 (vs. unset and not showing up at all). This is a gauge, so it only reflects the most recent state of the system, but I think it makes sense to record at construction time that the cache is empty.
Discussed this offline with Doug and Yash. At first we thought to leave it, but then they raised the same point you brought up: this gauge as written will live for the lifetime of the binary with an empty target. So this is actually a valid correctness issue. Thanks for catching this.
I brought this up in the Observability thread alongside another one of my concerns, but this case is the main one that is seen as a correctness issue. If the RLS server target changes over the lifetime of the balancer, it's up to the exporter to handle the same uuid gauge appearing with a different RLS server target. Yash mentioned dashboards can group by uuid and then see the RLS server target change over time, so this is working as intended.
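For illustration, a rough sketch of the two recordings an exporter would see if the construction-time Record call were kept (the metric handle and label order follow the snippet quoted above; the surrounding variable names are hypothetical):

// At construction, before the RLS server target is known:
cacheSizeMetric.Record(dc.metricsRecorder, 0, grpcTarget, "", dc.uuid)
// Later, on a cache update, once the RLS server target is known:
cacheSizeMetric.Record(dc.metricsRecorder, size, grpcTarget, rlsServerTarget, dc.uuid)
// A dashboard grouping only by uuid sees one gauge whose server-target label
// changes over time; grouping by server target instead leaves an extra series
// with an empty target that is never updated again for the lifetime of the binary.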
func (r *NoopMetricsRecorder) RecordInt64Count(_ *estats.Int64CountHandle, _ int64, _ ...string) {}
func (r *NoopMetricsRecorder) RecordFloat64Count(_ *estats.Float64CountHandle, _ float64, _ ...string) {
}
It's crazy that gofmt allows { and } to be on the same line for some of the methods and not for others. I've noticed this when making changes too. You could get rid of the _ parameter names for all the methods though, since none of the parameters are used.
Ah I see. Will switch to one line and remove the _ parameter names.
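A sketch of what the no-op recorder methods could look like after that change, with each body on one line and the blank parameter names dropped (illustrative only):

func (r *NoopMetricsRecorder) RecordInt64Count(*estats.Int64CountHandle, int64, ...string)       {}
func (r *NoopMetricsRecorder) RecordFloat64Count(*estats.Float64CountHandle, float64, ...string) {}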
After some discussion, we decided to just emit gauge metrics inline using synchronous gauges, with the cache mutex held. OpenTelemetry just copies this data over to another store for exporters to use, so this shouldn't introduce a performance hit or lock contention, especially since cache accesses already create lock contention on these operations even without metrics.
As suggested in #7484 (comment).
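A minimal sketch of what emitting the gauges inline with the cache mutex held could look like; the helper name and fields (updateGaugesLocked, dc.currentSize, dc.entries) are hypothetical and not the actual grpc-go implementation:

// updateGaugesLocked records the current cache gauges. The caller must hold dc.mu.
// Since these are synchronous gauges, the Record calls just hand the values to the
// metrics recorder, so little extra work happens while the lock is held.
func (dc *dataCache) updateGaugesLocked() {
	cacheSizeMetric.Record(dc.metricsRecorder, dc.currentSize, dc.grpcTarget, dc.rlsServerTarget, dc.uuid)
	cacheEntriesMetric.Record(dc.metricsRecorder, int64(len(dc.entries)), dc.grpcTarget, dc.rlsServerTarget, dc.uuid)
}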
RELEASE NOTES: