Handle GCS bucket record deletion #2030
Conversation
On advice from @michaelkedar on google#2030
Detect, and plumb through to the worker, the deletion from Datastore of Bugs that are no longer present in GCS.

Related changes (google#2023, google#2029) to the NVD CVE generation cause CVEs that are no longer being generated (due to heuristic changes like the one made in google#1707) to remain in GCS. This PR addresses cases like this, and the need identified in google#1467, by adding a deletion phase to the importing of new/updated records. The functionality is flag-protected: it won't go live in Production until a new `--delete` flag is included in the execution of the importer.

Incidental changes:
- use the GCS bucket `directory_path` to efficiently filter the blobs returned when listing bucket contents
- make blob retrieval resilient to the blob generation changing between blob listing and blob retrieval (this can happen if `combine-to-osv` happens to have run in between these two points in time)
- fix a behavior inconsistency where schema validation was not being performed when `ignore_last_import_time` is in effect (addressing a head-scratching TODO from @michaelkedar)
- tidy up the existing tests, making them more readable and debuggable
- add a slow manual test against live data in staging to validate real-world behavior and run time (this adds ~13 minutes to an import run on just the CVE GCS bucket)
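The deletion phase can be sketched as a flag-protected set difference between the Bug IDs known to Datastore and the record IDs still present in GCS. This is a minimal illustration only; `compute_stale_ids`, `datastore_ids` and `gcs_ids` are hypothetical names, not the importer's actual API.

```python
def compute_stale_ids(datastore_ids, gcs_ids, delete_enabled=False):
  """Return the Bug IDs present in Datastore but no longer in GCS.

  Hypothetical sketch: deletion is flag-protected, so with
  delete_enabled=False (the default) nothing is marked for deletion.
  """
  if not delete_enabled:
    return set()
  return set(datastore_ids) - set(gcs_ids)


stale = compute_stale_ids(
    ['CVE-2024-0001', 'CVE-2024-0002', 'CVE-2024-0003'],
    ['CVE-2024-0001', 'CVE-2024-0003'],
    delete_enabled=True)
print(sorted(stale))  # ['CVE-2024-0002']
```

The real implementation additionally lists blobs with the bucket's `directory_path` as a prefix and guards the result with a safety check before deleting anything.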
On advice from @michaelkedar on google#2030
Use a list comprehension instead of a filter + lambda

For style guide compliance.
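For illustration (this is not the PR's actual code), the style-guide-preferred rewrite looks like:

```python
blobs = ['CVE-2024-0001.json', 'README.md', 'CVE-2024-0002.json']

# Before: filter + a lambda, which is discouraged by the Google Python
# style guide when a comprehension reads just as well.
json_blobs = list(filter(lambda name: name.endswith('.json'), blobs))

# After: an equivalent list comprehension.
json_blobs = [name for name in blobs if name.endswith('.json')]

print(json_blobs)  # ['CVE-2024-0001.json', 'CVE-2024-0002.json']
```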
Force-pushed from 4588c0f to e79c4b9
I was mistaken about scoping, based on https://stackoverflow.com/questions/45100271/scope-of-variable-within-with-statement
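The scoping point from that Stack Overflow answer: Python's `with` statement does not introduce a new scope, so names bound inside the block (including the `as` target) remain visible after it exits. A small self-contained demonstration:

```python
import contextlib


@contextlib.contextmanager
def managed():
  # Stand-in context manager for demonstration purposes.
  yield 'resource'


with managed() as res:
  inside = res.upper()

# Both names survive the block; only the context manager's cleanup ran.
print(res, inside)  # resource RESOURCE
```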
GCS blobs *may* contain more than one OSV record, so don't blindly return the ID of the first one only.
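A sketch of the fix's shape, assuming a blob holds either a single JSON object or a JSON array of objects each carrying an `id` field (the helper name `record_ids` is illustrative, not the PR's actual function):

```python
import json


def record_ids(blob_data):
  """Yield every OSV record ID in a blob, not just the first one."""
  parsed = json.loads(blob_data)
  # Normalize: a blob may contain one record or a list of them.
  records = parsed if isinstance(parsed, list) else [parsed]
  for record in records:
    yield record['id']


multi = json.dumps([{'id': 'CVE-2024-0001'}, {'id': 'CVE-2024-0002'}])
print(list(record_ids(multi)))  # ['CVE-2024-0001', 'CVE-2024-0002']
```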
The exception is only potentially raised once, not on each iteration.
Fix math error :-(
- refactor out inline function from `_process_deletions_bucket`
- switch to using executor.submit() rather than map() to follow documented examples and reduce reader cognitive burden
- fix bug in safety check and add explicit test for it
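The `submit()` pattern referred to above follows the documented `concurrent.futures` examples. A minimal sketch, with `check_blob` standing in for the real per-blob work:

```python
import concurrent.futures


def check_blob(name):
  # Stand-in for per-blob work; the real code talks to GCS/Datastore.
  return name, name.endswith('.json')


names = ['a.json', 'b.txt', 'c.json']

# submit() + as_completed() instead of executor.map(): each unit of work
# gets its own future, and results are handled as they finish.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
  futures = {executor.submit(check_blob, n): n for n in names}
  results = {}
  for future in concurrent.futures.as_completed(futures):
    name, is_json = future.result()
    results[name] = is_json

print(results['a.json'], results['b.txt'])  # True False
```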
Also add some additional instrumentation, so the human running them has a better idea of what the heck is going on.
Because this takes a long time to run, the human running it has little insight into what is going on.
Some of the logging shenanigans in the test were superfluous
this generally looks good, thanks!
how's the performance of this running on the staging CVE bucket? if this ends up being slow and impacts our freshness SLO, we may need to consider moving this to e.g. a separate "cleanup" cron job instead of sticking this in the importer.
As expected, this crashes into the safety guardrail, so override it.
(Why did the linter not pick up on that?)
Yeah, I think based on my testing so far (while not running on the same sized GKE node as it would in production), running it as part of the regular importing run isn't looking tenable. I think we'll need a separate, say daily, cron job that runs it in deletion mode. Good thing I've got all of that behind a flag :-)
Because deletion is going to be slow to run, it will need to be run separately from regular importing, so as not to cause the overall runtime to violate our freshness SLO.
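As a purely illustrative sketch (the schedule and paths here are hypothetical, only the `--delete` flag comes from this PR), the separate deletion run could look like a daily crontab entry:

```
# Hypothetical: run the importer in deletion mode once a day, separately
# from the regular import runs, to protect the freshness SLO.
0 3 * * * /usr/bin/python3 /opt/osv/importer.py --delete
```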
This runs the code introduced in google#2030 that takes many hours to complete and isn't practical to run inline with regular imports. Addresses google#1467
46f1c03 decoupled deletion from updates; this introduces a new test for testing in deletion mode.
LGTM with some minor nits.
…into delete_stale_bugs
LGTM