-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Include OPA min compat version in telemetry report #6475
Include OPA min compat version in telemetry report #6475
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! I've posted some comments inline 🙃 👇
plugins/plugins.go
Outdated
@@ -148,6 +150,9 @@ const ( | |||
DefaultTriggerMode TriggerMode = "periodic" | |||
) | |||
|
|||
// default interval between OPA report uploads | |||
const defaultUploadIntervalSec = int64(3600) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to make this a time.Duration
instead?
Edit: we just cast it to that later... My $0.02 is that it's best to introduce units, where applicable, as early on as possible.
@@ -497,6 +528,12 @@ func (m *Manager) Init(ctx context.Context) error { | |||
}) | |||
|
|||
if err != nil { | |||
if m.stop != nil { | |||
done := make(chan struct{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious to know - why does m.stop
have type chan chan struct{}
, rather than just chan struct{}
. I have never seen that pattern before. Isn't instantiating and passing a whole channel more expensive than an empty struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's a trampoline type thing...
case done := <-m.stop:
cancel()
ticker.Stop()
done <- struct{}{}
return
}
I guess that even though writes to unbuffered channels are blocking, apparently we want to wait until ticker.Stop()
has also completed before we proceed... which raises the question: why do we want erroring out of New()
to block until ticker.Stop()
has completed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right we do this to make sure we've cleaned up everything.
why do we want erroring out of New() to block until ticker.Stop() has completed?
Not sure where we are doing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure where we are doing this?
This pattern is used:
if m.stop != nil {
done := make(chan struct{})
m.stop <- done
<-done
}
It blocks until ticker.Stop()
has completed unless I am mistaken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do this when we initialize and stop the manger, not when we create a new manager instance.
ef6895c
to
d5f964c
Compare
d5f964c
to
38deb3f
Compare
plugins/plugins.go
Outdated
ticker.Stop() | ||
|
||
if opaReportNotify { | ||
m.reporter.RegisterGatherer("min_compatible_version", func(_ context.Context) (any, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The registration should happen outside if this loop. In that loop, we'll invoke all registered gatherer functions and aggregate their results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved registration outside the loop and after we create the reporter instance.
In that loop, we'll invoke all registered gatherer functions and aggregate their results.
In this case we only have one gatherer function and we're not adding any public methods on the manager to add gatherers so not sure what you mean here. In the loop we call send on the reporter which will invoke the gatherer functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case we only have one gatherer function and we're not adding any public methods on the manager to add gatherers so not sure what you mean here.
Yeah, that's correct right now. I think ideally, we'd open this up, so wrapping code (😉 EOPA) can register extra telemetry bits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to include a method on the manager to allow callers to register extra telemetry gatherers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In OPA 1.0, I believe we're also gonna need to add a telemetry entry for highest compatible version, as we're gonna be removing features. This implementation looks to allow for this being added at a later time 👍. We should add a feature request for it, though, to not forget about it.
Just a few questions/comments.
return nil, fmt.Errorf("gather telemetry error for key %s: %w", key, err) | ||
} | ||
} | ||
r.gatherersMtx.Unlock() | ||
|
||
resp, err := r.client.WithJSON(r.body).Do(rCtx, "POST", "/v1/version") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, r.body
isn't protected by a lock (unless one exists on the caller-side of this function), so we could have race conditions.
Would it perhaps be preferred to detach the body map from r
and instead have a function-local body created fresh every time we send a report? We could register gatherers for id
and version
in New()
just like we do for heap_usage_bytes
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could have gatherers. But r.body
being internal I don't see a race for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on under what conditions SendReport
is called. If it's ever only called sequentially, we don't have a problem; if it can be called concurrently, then e.g. a slow (extremely) HTTP call, or if something locks up during gatherer registration, could have another go-rutine come in and modify r.body
while in transit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case only one routine calls SendReport
.
@@ -148,6 +150,9 @@ const ( | |||
DefaultTriggerMode TriggerMode = "periodic" | |||
) | |||
|
|||
// default interval between OPA report uploads | |||
var defaultUploadIntervalSec = int64(3600) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be a const
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe just make it a time.Duration
, to not require the site of usage to remember to do the seconds multiplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally had it as a const
but it was easier to use it as a var
to help with unit testing.
opaReportNotify = false | ||
} | ||
|
||
newInterval := mr.Int63n(defaultUploadIntervalSec) + defaultUploadIntervalSec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't reuse uploadDuration
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not, and uploadDuration
is just intended as the initial delay, consider renaming it to reflect that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we want some jitter. Renamed to defaultBaseUploadIntervalSec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of the jitter? Is adding defaultBaseUploadIntervalSec
to itself to create newInterval
part of creating the jitter? Would it make sense to have two separate default values that can be set independently instead?
It looks like we're just sending in defaultBaseUploadIntervalSec
as argument in the call to here, so wouldn't it make sense to either drop the argument, or only use the argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of the jitter?
The purpose of the delay is to randomize the uploads a bit. But in this case since the uploads happen only if there is a policy change we can probably skip that and reuse uploadDuration
as you suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But worst case is all deployed OPAs upload at the same time loading the telemetry server. So I'll keep some jitter and reuse uploadDuration
.
plugins/plugins.go
Outdated
if m.compiler.Required != nil { | ||
minimumCompatibleVersion, _ = m.compiler.Required.MinimumCompatibleVersion() | ||
} | ||
return minimumCompatibleVersion, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minimumCompatibleVersion
might be the empty string here. Will we then send an empty min_compatible_version
entry; and if so, is that the desired behavior?
If not, consider returning nil
if we don't have a version to report, and update RegisterGatherer()
to not add (and scrub existing entries for the key) if it gets nil
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minimumCompatibleVersion might be the empty string here. Will we then send an empty min_compatible_version entry
It should be fine. We send these reports once an hour so if it becomes an issue we can add it in. Also the consumer of the report can filter this out.
plugins/plugins.go
Outdated
|
||
_, err := m.reporter.SendReport(ctx) | ||
if err != nil { | ||
m.logger.WithFields(map[string]interface{}{"err": err}).Debug("Unable to send OPA report.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't OPA
kind of assumed? Should we just say telemetry
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated message to "Unable to send OPA telemetry report
@@ -184,6 +190,102 @@ func TestPluginStatusUpdateOnStartAndStop(t *testing.T) { | |||
m.Stop(context.Background()) | |||
} | |||
|
|||
func TestManagerWithOPATelemetryUpdateLoop(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might have missed it, but should we not also have a test to assert that the version actually is reported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should. Updated the test.
cd83bb1
to
f4a3f82
Compare
opaReportNotify = false | ||
_, err := m.reporter.SendReport(ctx) | ||
if err != nil { | ||
m.logger.WithFields(map[string]interface{}{"err": err}).Debug("Unable to send OPA telemetry report.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit]
m.logger.WithFields(map[string]interface{}{"err": err}).Debug("Unable to send OPA telemetry report.") | |
m.logger.WithError(err).Debug("Unable to send OPA telemetry report.") |
This yields the same log line if we also set the logrus global variable logrus.ErrorKey = "err"
somewhere in the code.
minimumCompatibleVersion, _ = m.compiler.Required.MinimumCompatibleVersion() | ||
} | ||
return minimumCompatibleVersion, nil | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
// WithTelemetryGatherers allows registration of telemetry gatherers which enable injection of additional data in the | ||
// telemetry report | ||
func WithTelemetryGatherers(gs map[string]report.Gatherer) func(*Manager) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay.
This commit extends the telemetry report to include the minimum compatible version of policies loaded into OPA. This information can be helpful to get visibility into era of Rego being adopted in the wild. Fixes: open-policy-agent#6361 Co-authored-by: Stephan Renatus <stephan@styra.com> Signed-off-by: Ashutosh Narkar <anarkar4387@gmail.com>
f4a3f82
to
0d804b8
Compare
This commit extends the telemetry report to include the
minimum compatible version of policies loaded into OPA.
This information can be helpful to get visibility into
era of Rego being adopted in the wild.
Fixes: #6361
Co-authored-by: Stephan Renatus stephan@styra.com
Signed-off-by: Ashutosh Narkar anarkar4387@gmail.com