balancer/rls: Add cache metrics #7495

Merged (3 commits) on Aug 14, 2024

Contributor

@zasweq zasweq commented Aug 9, 2024

After some discussion, we decided to emit the gauge metrics inline, using synchronous gauges recorded while the cache mutex is held. OpenTelemetry just copies this data into another store for exporters to read, so this shouldn't add a noticeable performance hit or extra lock contention, especially since cache accesses already contend on that mutex even without metrics.

As suggested in #7484 (comment).
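
A minimal sketch of the inline approach described above, assuming a hypothetical `gaugeRecorder` interface, a toy `cache` type, and placeholder metric names (none of these are the grpc-go or OpenTelemetry API). Each mutation updates the cache and records both gauges before releasing the mutex, so the recorded values always match the cache state:

```go
package main

import (
	"fmt"
	"sync"
)

// gaugeRecorder is a hypothetical stand-in for a metrics recorder.
// It is NOT the grpc-go or OpenTelemetry API.
type gaugeRecorder interface {
	RecordGauge(name string, value int64, labels ...string)
}

// lastValueRecorder keeps only the most recent value per gauge name,
// mimicking a store that an exporter would later read from.
type lastValueRecorder struct {
	mu   sync.Mutex
	last map[string]int64
}

func newLastValueRecorder() *lastValueRecorder {
	return &lastValueRecorder{last: make(map[string]int64)}
}

func (r *lastValueRecorder) RecordGauge(name string, value int64, _ ...string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.last[name] = value
}

func (r *lastValueRecorder) value(name string) int64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.last[name]
}

// cache is a toy byte cache guarded by a single mutex.
type cache struct {
	mu       sync.Mutex
	entries  map[string][]byte
	size     int64 // total bytes stored
	recorder gaugeRecorder
	labels   []string // placeholder label values (e.g. target, uuid)
}

func newCache(r gaugeRecorder, labels ...string) *cache {
	return &cache{entries: make(map[string][]byte), recorder: r, labels: labels}
}

func (c *cache) add(key string, val []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	// Account for an existing entry being overwritten (len(nil) is 0).
	c.size += int64(len(val)) - int64(len(c.entries[key]))
	c.entries[key] = val
	c.recordLocked()
}

func (c *cache) remove(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.size -= int64(len(c.entries[key]))
	delete(c.entries, key)
	c.recordLocked()
}

// recordLocked emits both gauges inline. The caller must hold c.mu.
func (c *cache) recordLocked() {
	c.recorder.RecordGauge("cache_size", c.size, c.labels...)
	c.recorder.RecordGauge("cache_entries", int64(len(c.entries)), c.labels...)
}

func main() {
	rec := newLastValueRecorder()
	c := newCache(rec, "rls.example.com", "example-uuid")
	c.add("a", []byte("hello"))
	c.add("b", []byte("hi"))
	c.remove("a")
	fmt.Println(rec.value("cache_size"), rec.value("cache_entries")) // prints "2 1"
}
```

The trade-off in this sketch is that the recorder call runs inside the critical section; that is acceptable only if recording is cheap, which is the assumption stated in the description.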

RELEASE NOTES:

  • balancer/rls: Add cache metrics

@zasweq zasweq requested review from easwars and dfawley August 9, 2024 00:28
@zasweq zasweq added the Type: Feature New features or improvements in behavior label Aug 9, 2024
@zasweq zasweq added this to the 1.66 Release milestone Aug 9, 2024
codecov bot commented Aug 9, 2024

Codecov Report

Attention: Patch coverage is 87.09677% with 4 lines in your changes missing coverage. Please review.

Project coverage is 81.68%. Comparing base (7b9e012) to head (7b17b83).
Report is 9 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| internal/testutils/stats/test_metrics_recorder.go | 55.55% | 2 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7495      +/-   ##
==========================================
+ Coverage   81.60%   81.68%   +0.07%     
==========================================
  Files         357      359       +2     
  Lines       27294    27532     +238     
==========================================
+ Hits        22274    22490     +216     
- Misses       3816     3822       +6     
- Partials     1204     1220      +16     
| Files | Coverage Δ |
|---|---|
| balancer/rls/balancer.go | 85.81% <100.00%> (+0.04%) ⬆️ |
| balancer/rls/cache.go | 89.07% <100.00%> (+1.11%) ⬆️ |
| internal/testutils/stats/test_metrics_recorder.go | 72.35% <55.55%> (-0.27%) ⬇️ |

... and 43 files with indirect coverage changes

@zasweq zasweq force-pushed the rls-cache-metrics branch from 4983e8b to 2eb99db Compare August 9, 2024 00:39
@zasweq zasweq force-pushed the rls-cache-metrics branch from 2eb99db to 4d008b3 Compare August 9, 2024 01:13
@easwars easwars assigned zasweq and unassigned easwars and dfawley Aug 12, 2024
@zasweq zasweq assigned easwars and dfawley and unassigned zasweq Aug 12, 2024
Contributor

@easwars easwars left a comment

LGTM, modulo minor comments.

Comment on lines +193 to +194
cacheSizeMetric.Record(dc.metricsRecorder, 0, grpcTarget, "", dc.uuid)
cacheEntriesMetric.Record(dc.metricsRecorder, 0, grpcTarget, "", dc.uuid)
Contributor

Is this really required? The grpc.lb.rls.server_target is empty here. So, will this measurement even be useful?

Contributor Author

I think so. This records a measurement at the start showing that the cache is currently at 0 (vs. unset and not showing up). This is a gauge, so it only reflects the most recent state of the system, but I think it makes sense to report at construction time that the cache is empty.

Contributor Author

Discussed this offline with Doug and Yash. At first we thought we'd leave it, but then they raised the same point you brought up: as written, this gauge will live for the lifetime of the binary with an empty target. So this is actually a valid correctness issue. Thanks for catching this.
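
To illustrate the concern, here is a rough model, assuming a backend that keeps one gauge value per distinct label set (the `labelSet` and `seriesStore` types are hypothetical, not grpc-go or OpenTelemetry types). Recording 0 with an empty target at construction creates a series that later recordings with the real target never overwrite:

```go
package main

import "fmt"

// labelSet identifies one gauge time series. Hypothetical model only.
type labelSet struct {
	target string
	uuid   string
}

// seriesStore models a backend that keeps the last value per label set.
type seriesStore map[labelSet]int64

func (s seriesStore) record(value int64, target, uuid string) {
	s[labelSet{target: target, uuid: uuid}] = value
}

func main() {
	s := seriesStore{}
	// At construction time the RLS server target is not known yet.
	s.record(0, "", "example-uuid")
	// Later recordings carry the real target and land in a different series.
	s.record(3, "rls.example.com", "example-uuid")

	// Two series exist; the empty-target one is stuck at 0.
	fmt.Println(len(s), s[labelSet{target: "", uuid: "example-uuid"}]) // prints "2 0"
}
```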

Contributor Author

I brought this up in the Observability thread alongside another one of my concerns, but this case is the main one that is seen as a correctness issue. If the RLS server target changes over the lifetime of the balancer, it's up to the exporter to handle the same-uuid gauge appearing with a different RLS server target. Yash mentioned dashboards can group on uuid and then see the RLS server target change over time, so this is working as intended (WAI).
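
As a rough illustration of the grouping idea, assuming samples arrive in time order and using a hypothetical `sample` type and `targetsByUUID` helper (not part of any real exporter or dashboard), a consumer could group on uuid and list the targets each balancer reported over time:

```go
package main

import "fmt"

// sample is a hypothetical exported gauge data point.
type sample struct {
	uuid    string
	target  string
	entries int64
}

// targetsByUUID groups time-ordered samples by uuid and returns, for each
// uuid, the sequence of targets with consecutive repeats collapsed.
func targetsByUUID(samples []sample) map[string][]string {
	out := make(map[string][]string)
	for _, s := range samples {
		seen := out[s.uuid]
		if len(seen) == 0 || seen[len(seen)-1] != s.target {
			out[s.uuid] = append(seen, s.target)
		}
	}
	return out
}

func main() {
	samples := []sample{
		{"uuid-1", "rls-a.example.com", 2},
		{"uuid-1", "rls-a.example.com", 5},
		{"uuid-1", "rls-b.example.com", 1},
	}
	fmt.Println(targetsByUUID(samples)["uuid-1"]) // prints "[rls-a.example.com rls-b.example.com]"
}
```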


func (r *NoopMetricsRecorder) RecordInt64Count(_ *estats.Int64CountHandle, _ int64, _ ...string) {}

func (r *NoopMetricsRecorder) RecordFloat64Count(_ *estats.Float64CountHandle, _ float64, _ ...string) {
Contributor

gofmt is crazy that it allows { and } to be on the same line for some of the methods and not for other methods. I've noticed this when making changes too.

You could get rid of the _ parameter names for all the methods though, since none of the parameters are used.
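
A small sketch of what that could look like, using hypothetical stand-in types (`int64CountHandle`, `float64CountHandle`, `noopRecorder`) rather than the real `estats` handles. Go allows a parameter list to give types only, as long as no parameter in that list is named:

```go
package main

import "fmt"

// Hypothetical stand-ins for the handle types in the snippet above.
type int64CountHandle struct{}
type float64CountHandle struct{}

// noopRecorder discards every measurement.
type noopRecorder struct{}

// Parameters are declared by type only, so no `_` placeholders are needed.
func (noopRecorder) RecordInt64Count(*int64CountHandle, int64, ...string) {}

func (noopRecorder) RecordFloat64Count(*float64CountHandle, float64, ...string) {}

func main() {
	var r noopRecorder
	r.RecordInt64Count(nil, 1, "label")
	r.RecordFloat64Count(nil, 1.5)
	fmt.Println("ok") // prints "ok"
}
```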

Contributor Author

Ah I see. Will switch to one line, and remove _ parameter name.

@easwars easwars assigned zasweq and unassigned easwars Aug 14, 2024
@zasweq zasweq merged commit 9706bf8 into grpc:master Aug 14, 2024
13 checks passed
infovivek2020 pushed a commit to infovivek2020/grpc-go that referenced this pull request Aug 18, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 11, 2025