Add parallel exporter #2167

Merged
merged 5 commits into from May 9, 2024


another-rex
Contributor

The exporter currently takes too long to complete (around 1 hour and 10 minutes), and will only take longer as the OSV database expands.

This change parallelizes the exporter by ecosystem, separating the per-ecosystem export portion of the script from the selection of which ecosystems to export. This reduces the total run time to roughly that of the slowest ecosystem export, about 13 minutes.
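
The idea can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the ecosystem names, `export_ecosystem`, and `export_all` are hypothetical stand-ins, and a thread pool is assumed as the concurrency mechanism.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical ecosystem names, for illustration only.
ECOSYSTEMS = ['PyPI', 'npm', 'Go', 'Maven']


def export_ecosystem(ecosystem: str) -> str:
    """Placeholder for the real per-ecosystem export; returns the name."""
    time.sleep(0.01)  # Simulates the export work.
    return ecosystem


def export_all(ecosystems, max_workers=6):
    """Runs one export per ecosystem, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order. With enough workers, the
        # wall time approaches that of the slowest single export.
        return list(pool.map(export_ecosystem, ecosystems))


if __name__ == '__main__':
    print(export_all(ECOSYSTEMS))
```

With enough workers for every ecosystem to run at once, the total time is dominated by the slowest export rather than the sum of all of them.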

Contributor

@andrewpollock left a comment

LGTM, a few minor things.
Curious as to why 7?

parser.add_argument(
    '--processes',
    help='Maximum number of parallel exports',
    default=DEFAULT_EXPORT_PROCESSES)
Contributor

It might be tidier to default to os.cpu_count() here. That way this just does the right/intended thing if the number of CPUs is increased in the Kubernetes job spec.

Contributor Author

Good point. Updated to use os.cpu_count() by default, falling back to DEFAULT_EXPORT_PROCESSES if the CPU count can't be determined.
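
A minimal sketch of that fallback, assuming a helper named `default_process_count` (hypothetical, not necessarily how the PR structures it) and an assumed constant value of 6:

```python
import os

# Assumed value for illustration; the real constant lives in the exporter.
DEFAULT_EXPORT_PROCESSES = 6


def default_process_count(fallback: int = DEFAULT_EXPORT_PROCESSES) -> int:
    """Returns the CPU count, or the fallback if it cannot be determined."""
    # os.cpu_count() returns None when the count is undetermined,
    # so `or` substitutes the fallback in that case.
    return os.cpu_count() or fallback
```

The result could then be passed as the `default=` of the `--processes` flag.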

@@ -139,19 +140,13 @@ def main():
      '--bucket',
      help='Bucket name to export to',
      default=DEFAULT_EXPORT_BUCKET)
  parser.add_argument('--ecosystem', required=True, help='Ecosystem to upload')
Contributor

If I'm understanding this correctly, the intended use of this flag is either an ecosystem name, or the special string "list", which retains the old behaviour? Please call out this "list" value in the flag help text.

Contributor Author

The "list" value uploads the ecosystem.txt ecosystem list. I've called this out in the help message now!
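
For illustration, a flag definition along these lines might look like the sketch below. The help wording and the `build_parser` / `is_list_mode` helpers are hypothetical; the actual text in the PR may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description='Exporter (illustrative sketch)')
    parser.add_argument(
        '--ecosystem',
        required=True,
        # Hypothetical wording; not copied from the PR.
        help='Ecosystem to upload, or the special value "list" to upload '
        'the ecosystem.txt ecosystem list')
    return parser


def is_list_mode(argv) -> bool:
    """True when the ecosystem list was requested instead of one ecosystem."""
    args = build_parser().parse_args(argv)
    return args.ecosystem == 'list'
```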

@another-rex
Contributor Author

Ended up at 7 because that's the maximum number of cores a pod can get on a node with 8 cores (some CPU is used up by the metrics and logging driver). It doesn't matter much now, so I'll switch this to 6 to get a more round number.

Contributor

@andrewpollock left a comment

Awesome performance improvement

Collaborator

@oliverchang left a comment

awesome stuff!

docker/exporter/export_runner.py (outdated review thread, resolved)
@another-rex merged commit 8f6e7f8 into google:master on May 9, 2024
11 checks passed