Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include OPA min compat version in telemetry report #6475

Merged

Conversation

ashutosh-narkar
Copy link
Member

@ashutosh-narkar ashutosh-narkar commented Dec 13, 2023

This commit extends the telemetry report to include the
minimum compatible version of policies loaded into OPA.
This information can be helpful to get visibility into
era of Rego being adopted in the wild.

Fixes: #6361

Co-authored-by: Stephan Renatus stephan@styra.com
Signed-off-by: Ashutosh Narkar anarkar4387@gmail.com

Copy link
Contributor

@srenatus srenatus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! I've posted some comments inline 🙃 👇

internal/report/report.go Show resolved Hide resolved
internal/report/report.go Outdated Show resolved Hide resolved
@@ -148,6 +150,9 @@ const (
DefaultTriggerMode TriggerMode = "periodic"
)

// default interval between OPA report uploads
const defaultUploadIntervalSec = int64(3600)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to make this a time.Duration instead?

Edit: we just cast it to that later... My $0.02 is that it's best to introduce units, where applicable, as early on as possible.

@@ -497,6 +528,12 @@ func (m *Manager) Init(ctx context.Context) error {
})

if err != nil {
if m.stop != nil {
done := make(chan struct{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious to know - why does m.stop have type chan chan struct{}, rather than just chan struct{}. I have never seen that pattern before. Isn't instantiating and passing a whole channel more expensive than an empty struct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's a trampoline type thing...

		case done := <-m.stop:
			cancel()
			ticker.Stop()
			done <- struct{}{}
			return
		}

I guess that even though writes to unbuffered channels are blocking, apparently we want to wait until ticker.Stop() has also completed before we proceed... which raises the question: why do we want erroring out of New() to block until ticker.Stop() has completed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right we do this to make sure we've cleaned up everything.

why do we want erroring out of New() to block until ticker.Stop() has completed?

Not sure where we are doing this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure where we are doing this?

This pattern is used:

	if m.stop != nil {
		done := make(chan struct{})
		m.stop <- done
		<-done
	}

It blocks until ticker.Stop() has completed unless I am mistaken.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do this when we initialize and stop the manger, not when we create a new manager instance.

@ashutosh-narkar ashutosh-narkar changed the title WIP: Include OPA min compat version in telemetry report Include OPA min compat version in telemetry report Dec 13, 2023
@ashutosh-narkar ashutosh-narkar marked this pull request as ready for review December 13, 2023 22:34
ticker.Stop()

if opaReportNotify {
m.reporter.RegisterGatherer("min_compatible_version", func(_ context.Context) (any, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The registration should happen outside if this loop. In that loop, we'll invoke all registered gatherer functions and aggregate their results.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved registration outside the loop and after we create the reporter instance.

In that loop, we'll invoke all registered gatherer functions and aggregate their results.

In this case we only have one gatherer function and we're not adding any public methods on the manager to add gatherers so not sure what you mean here. In the loop we call send on the reporter which will invoke the gatherer functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we only have one gatherer function and we're not adding any public methods on the manager to add gatherers so not sure what you mean here.

Yeah, that's correct right now. I think ideally, we'd open this up, so wrapping code (😉 EOPA) can register extra telemetry bits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to include a method on the manager to allow callers to register extra telemetry gatherers.

Copy link
Contributor

@johanfylling johanfylling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In OPA 1.0, I believe we're also gonna need to add a telemetry entry for highest compatible version, as we're gonna be removing features. This implementation looks to allow for this being added at a later time 👍. We should add a feature request for it, though, to not forget about it.

Just a few questions/comments.

internal/report/report.go Show resolved Hide resolved
return nil, fmt.Errorf("gather telemetry error for key %s: %w", key, err)
}
}
r.gatherersMtx.Unlock()

resp, err := r.client.WithJSON(r.body).Do(rCtx, "POST", "/v1/version")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, r.body isn't protected by a lock (unless one exists on the caller-side of this function), so we could have race conditions.

Would it perhaps be preferred to detach the body map from r and instead have a function-local body created fresh every time we send a report? We could register gatherers for id and version in New() just like we do for heap_usage_bytes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have gatherers. But r.body being internal I don't see a race for it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on under what conditions SendReport is called. If it's ever only called sequentially, we don't have a problem; if it can be called concurrently, then e.g. a slow (extremely) HTTP call, or if something locks up during gatherer registration, could have another go-rutine come in and modify r.body while in transit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case only one routine calls SendReport.

@@ -148,6 +150,9 @@ const (
DefaultTriggerMode TriggerMode = "periodic"
)

// default interval between OPA report uploads
var defaultUploadIntervalSec = int64(3600)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be a const?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe just make it a time.Duration, to not require the site of usage to remember to do the seconds multiplication.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had it as a const but it was easier to use it as a var to help with unit testing.

opaReportNotify = false
}

newInterval := mr.Int63n(defaultUploadIntervalSec) + defaultUploadIntervalSec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't reuse uploadDuration here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not, and uploadDuration is just intended as the initial delay, consider renaming it to reflect that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we want some jitter. Renamed to defaultBaseUploadIntervalSec

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of the jitter? Is adding defaultBaseUploadIntervalSec to itself to create newInterval part of creating the jitter? Would it make sense to have two separate default values that can be set independently instead?

It looks like we're just sending in defaultBaseUploadIntervalSec as argument in the call to here, so wouldn't it make sense to either drop the argument, or only use the argument?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of the jitter?

The purpose of the delay is to randomize the uploads a bit. But in this case since the uploads happen only if there is a policy change we can probably skip that and reuse uploadDuration as you suggested.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But worst case is all deployed OPAs upload at the same time loading the telemetry server. So I'll keep some jitter and reuse uploadDuration.

plugins/plugins.go Outdated Show resolved Hide resolved
if m.compiler.Required != nil {
minimumCompatibleVersion, _ = m.compiler.Required.MinimumCompatibleVersion()
}
return minimumCompatibleVersion, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minimumCompatibleVersion might be the empty string here. Will we then send an empty min_compatible_version entry; and if so, is that the desired behavior?

If not, consider returning nil if we don't have a version to report, and update RegisterGatherer() to not add (and scrub existing entries for the key) if it gets nil.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minimumCompatibleVersion might be the empty string here. Will we then send an empty min_compatible_version entry

It should be fine. We send these reports once an hour so if it becomes an issue we can add it in. Also the consumer of the report can filter this out.


_, err := m.reporter.SendReport(ctx)
if err != nil {
m.logger.WithFields(map[string]interface{}{"err": err}).Debug("Unable to send OPA report.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't OPA kind of assumed? Should we just say telemetry instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated message to "Unable to send OPA telemetry report

@@ -184,6 +190,102 @@ func TestPluginStatusUpdateOnStartAndStop(t *testing.T) {
m.Stop(context.Background())
}

func TestManagerWithOPATelemetryUpdateLoop(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have missed it, but should we not also have a test to assert that the version actually is reported?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should. Updated the test.

@ashutosh-narkar ashutosh-narkar force-pushed the min-ver-telemetry branch 2 times, most recently from cd83bb1 to f4a3f82 Compare December 15, 2023 17:13
opaReportNotify = false
_, err := m.reporter.SendReport(ctx)
if err != nil {
m.logger.WithFields(map[string]interface{}{"err": err}).Debug("Unable to send OPA telemetry report.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit]

Suggested change
m.logger.WithFields(map[string]interface{}{"err": err}).Debug("Unable to send OPA telemetry report.")
m.logger.WithError(err).Debug("Unable to send OPA telemetry report.")

This yields the same log line if we also set the logrus global variable logrus.ErrorKey = "err" somewhere in the code.

srenatus
srenatus previously approved these changes Dec 18, 2023
minimumCompatibleVersion, _ = m.compiler.Required.MinimumCompatibleVersion()
}
return minimumCompatibleVersion, nil
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


// WithTelemetryGatherers allows registration of telemetry gatherers which enable injection of additional data in the
// telemetry report
func WithTelemetryGatherers(gs map[string]report.Gatherer) func(*Manager) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay.

This commit extends the telemetry report to include the
minimum compatible version of policies loaded into OPA.
This information can be helpful to get visibility into
era of Rego being adopted in the wild.

Fixes: open-policy-agent#6361

Co-authored-by: Stephan Renatus <stephan@styra.com>
Signed-off-by: Ashutosh Narkar <anarkar4387@gmail.com>
@ashutosh-narkar ashutosh-narkar merged commit 414df4e into open-policy-agent:main Dec 18, 2023
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Include minimum compatible version check to telemetry reports
4 participants