From 1ef0c2252ff423cc1660688ed51c473daec82ae9 Mon Sep 17 00:00:00 2001 From: Albin Severinson Date: Fri, 27 Oct 2023 11:33:46 +0100 Subject: [PATCH] Testsuite improvements (#3065) * Sync out testsuite changes (#19) * Update simulator * Replace Output with C * Typo * Restore pkg proto * Restore files * Fixing simulator changes (#6) * Fixing simulator changes * Changed to less than or equal Co-authored-by: Mustafa Ilyas * Simulator Changes (#9) * Add config and dependency injection to scheduler metrics (#2892) * Replace metrics singleton with an injection pattern. * fix * add configuration structures to metrics * add configuration * rename elements * Make Pulsar ReceiverQueueSize Configurable (#2895) * wip * wip * set receiverQueueSize to 100 * remove old PulsarReceiverQueueSize * revert * subscription in api --------- Co-authored-by: Chris Martin * Add poll_interval (#2805) * Add poll_interval * Add poll_interval * Added poll_interval * update by running tox -e docs --------- Co-authored-by: Kevin Hannon Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com> * Separate python script for armada v1 and v2 system diagrams (#2758) * Separate python script for Armada V1 system diagram * removed generate.py so it can be replaced with two separate files for Armada V1 and Armada V2 * Python script to generate Armada V2 system diagram * generate_v1.py Update #1 * generate_v1.py Update Number:2 * generate.py runs generate_v1.py as well as generate_v2.py and it is consistent with our instructions as 'docs/design/diagrams/relationships' * generate_v1.py Update No:3 * Armada V1 and Armada V2 diagrams * updated relationships_diagram.md to include armada v1 and v2 diagrams --------- Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com> * Add config to use autoupdater on tagged branches (#2905) * #2904 add autoupdate config * #2904 add label config and other options * docs: create README.md for plugins directory (#2897) * Create README.md for plugins directory * Update README.md * Update plugins/README.md Co-authored-by: Kevin Hannon * Update README.md --------- Co-authored-by: Kevin Hannon Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com> * Enables airflow operator level retry. (#2894) * Update docker stuff for latest airflow 2.7.0 * Use AirflowException instead of AirflowFailException to allow for retries * Remove codecov workflows (#2902) * Upgrade Pulsar Client to v0.11 (#2896) * update * update pulsar client * Fix bug causing server spinning * Abstract out the retry until success logic for testing (#2901) * Respond to review --------- Co-authored-by: Chris Martin Co-authored-by: Daniel Rastelli * Sync quickstart/index.md with gh-pages/quickstart.md (#2891) * Log Call Site (#2909) * allow logger to report caller * allow logger to report caller * lint --------- Co-authored-by: Chris Martin * Add cleaner test output for mage with os/exec.Command (#2907) * feat: Update Semver from version 6.3.0 to 6.3.1 (#2686) Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com> * fix: upgrade @typescript-eslint/parser from 5.52.0 to 5.61.0 (#2743) Snyk has created this PR to upgrade @typescript-eslint/parser from 5.52.0 to 5.61.0.
See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr Co-authored-by: snyk-bot Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com> Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com> * fix: upgrade @types/react from 16.14.32 to 16.14.43 (#2747) Snyk has created this PR to upgrade @types/react from 16.14.32 to 16.14.43. See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr Co-authored-by: snyk-bot Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com> Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com> * Bump github.com/go-openapi/jsonreference from 0.20.0 to 0.20.2 (#2316) Bumps [github.com/go-openapi/jsonreference](https://github.com/go-openapi/jsonreference) from 0.20.0 to 0.20.2. - [Release notes](https://github.com/go-openapi/jsonreference/releases) - [Commits](https://github.com/go-openapi/jsonreference/compare/v0.20.0...v0.20.2) --- updated-dependencies: - dependency-name: github.com/go-openapi/jsonreference dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com> Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com> * Order leased jobs by serial (#2912) This will ensure the job leased first gets sent to the cluster first. Currently we just order by postgres default sorting - which often picks the most recently leased - causing the first leased jobs to get stuck. This only occurs when scheduling is faster than leasing * Bump webpack from 5.75.0 to 5.77.0 in /internal/lookout/ui (#2302) Bumps [webpack](https://github.com/webpack/webpack) from 5.75.0 to 5.77.0. - [Release notes](https://github.com/webpack/webpack/releases) - [Commits](https://github.com/webpack/webpack/compare/v5.75.0...v5.77.0) --- updated-dependencies: - dependency-name: webpack dependency-type: indirect ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com> Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com> * Bump word-wrap from 1.2.3 to 1.2.5 in /internal/lookout/ui (#2806) Bumps [word-wrap](https://github.com/jonschlinkert/word-wrap) from 1.2.3 to 1.2.5. - [Release notes](https://github.com/jonschlinkert/word-wrap/releases) - [Commits](https://github.com/jonschlinkert/word-wrap/compare/1.2.3...1.2.5) --- updated-dependencies: - dependency-name: word-wrap dependency-type: indirect ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com> Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com> * resolve flaky (#2914) Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com> * fix: upgrade @typescript-eslint/eslint-plugin from 5.52.0 to 5.61.0 (#2744) Snyk has created this PR to upgrade @typescript-eslint/eslint-plugin from 5.52.0 to 5.61.0.
See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr Co-authored-by: snyk-bot Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com> Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com> * fix: upgrade react-router-dom from 6.9.0 to 6.14.1 (#2746) Snyk has created this PR to upgrade react-router-dom from 6.9.0 to 6.14.1. See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr Co-authored-by: snyk-bot Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com> Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com> * Bump semver from 6.3.0 to 6.3.1 in /internal/lookout/ui (#2661) Bumps [semver](https://github.com/npm/node-semver) from 6.3.0 to 6.3.1. - [Release notes](https://github.com/npm/node-semver/releases) - [Changelog](https://github.com/npm/node-semver/blob/v6.3.1/CHANGELOG.md) - [Commits](https://github.com/npm/node-semver/compare/v6.3.0...v6.3.1) --- updated-dependencies: - dependency-name: semver dependency-type: indirect ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com> Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com> * Run CodeQL once daily on a schedule (#2918) * Helm chart update: executor (#2917) * Helm chart update: executor At the moment the helm chart for the executor doesn't include priorityClass even though one is created in the chart. This means that the executor deployment is unable to set the priorityClass. * Patch/dependencies (#2923) * Bump github.com/go-openapi/strfmt from 0.21.3 to 0.21.7 Bumps [github.com/go-openapi/strfmt](https://github.com/go-openapi/strfmt) from 0.21.3 to 0.21.7. - [Release notes](https://github.com/go-openapi/strfmt/releases) - [Commits](https://github.com/go-openapi/strfmt/compare/v0.21.3...v0.21.7) --- updated-dependencies: - dependency-name: github.com/go-openapi/strfmt dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] * Bump github.com/go-openapi/runtime from 0.24.2 to 0.26.0 Bumps [github.com/go-openapi/runtime](https://github.com/go-openapi/runtime) from 0.24.2 to 0.26.0. - [Release notes](https://github.com/go-openapi/runtime/releases) - [Commits](https://github.com/go-openapi/runtime/compare/v0.24.2...v0.26.0) --- updated-dependencies: - dependency-name: github.com/go-openapi/runtime dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] * Bump github.com/goreleaser/nfpm/v2 from 2.25.1 to 2.29.0 Bumps [github.com/goreleaser/nfpm/v2](https://github.com/goreleaser/nfpm) from 2.25.1 to 2.29.0. - [Release notes](https://github.com/goreleaser/nfpm/releases) - [Changelog](https://github.com/goreleaser/nfpm/blob/main/.goreleaser.yml) - [Commits](https://github.com/goreleaser/nfpm/compare/v2.25.1...v2.29.0) --- updated-dependencies: - dependency-name: github.com/goreleaser/nfpm/v2 dependency-type: indirect ... 
Signed-off-by: dependabot[bot] * Bump github.com/go-playground/validator/v10 from 10.11.1 to 10.14.1 Bumps [github.com/go-playground/validator/v10](https://github.com/go-playground/validator) from 10.11.1 to 10.14.1. - [Release notes](https://github.com/go-playground/validator/releases) - [Commits](https://github.com/go-playground/validator/compare/v10.11.1...v10.14.1) --- updated-dependencies: - dependency-name: github.com/go-playground/validator/v10 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] * Bump Grpc.Net.Client in /client/DotNet/ArmadaProject.Io.Client Bumps [Grpc.Net.Client](https://github.com/grpc/grpc-dotnet) from 2.47.0 to 2.52.0. - [Release notes](https://github.com/grpc/grpc-dotnet/releases) - [Changelog](https://github.com/grpc/grpc-dotnet/blob/master/doc/release_process.md) - [Commits](https://github.com/grpc/grpc-dotnet/compare/v2.47.0...v2.52.0) --- updated-dependencies: - dependency-name: Grpc.Net.Client dependency-type: direct:production ... Signed-off-by: dependabot[bot] * fix: upgrade @mui/material from 5.10.17 to 5.13.6 Snyk has created this PR to upgrade @mui/material from 5.10.17 to 5.13.6. See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr * fix: upgrade prettier from 2.7.1 to 2.8.8 Snyk has created this PR to upgrade prettier from 2.7.1 to 2.8.8. See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr * fix: upgrade @mui/icons-material from 5.10.16 to 5.14.3 Snyk has created this PR to upgrade @mui/icons-material from 5.10.16 to 5.14.3. See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr * fix: upgrade eslint-plugin-import from 2.26.0 to 2.28.0 Snyk has created this PR to upgrade eslint-plugin-import from 2.26.0 to 2.28.0. See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr * fix: upgrade eslint-config-prettier from 8.5.0 to 8.10.0 Snyk has created this PR to upgrade eslint-config-prettier from 8.5.0 to 8.10.0. 
See this package in npm: See this project in Snyk: https://app.snyk.io/org/dave-gantenbein/project/5064983e-fa14-4803-8fc2-cfd6f1fa81b6?utm_source=github&utm_medium=referral&page=upgrade-pr * Trying to update klog * go mod fix --------- Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: snyk-bot Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com> * Fix bug causing GetJobSetEvents to get stuck (#2903) * Add error message of final job run to JobFailedMessage When we hit the maximum retry limit, the JobFailedMessage just says something along the lines of "Job has been retried too many times, giving up". Now we include the final run error in that message - to make it easier to work out the cause of retries * Fix bug causing GetJobSetEvents to get stuck GetJobSetEvents only increments its fromId variable on sending new messages. However, now not all redis events produce api events that will be sent downstream. The issue here is that if we get 500 redis events in a row that don't produce api events, then the fromId never gets updated - meaning the watcher gets stuck here. To fix this, ReadEvents now returns a lastMessageId. So if there are no messages to process, the fromId should be updated using the lastMessageId * Formatting * Bump @adobe/css-tools from 4.0.1 to 4.3.1 in /internal/lookout/ui (#2931) Bumps [@adobe/css-tools](https://github.com/adobe/css-tools) from 4.0.1 to 4.3.1. - [Changelog](https://github.com/adobe/css-tools/blob/main/History.md) - [Commits](https://github.com/adobe/css-tools/commits) --- updated-dependencies: - dependency-name: "@adobe/css-tools" dependency-type: indirect ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Improved etcd protection (#2925) * Initial commit * Delete unused code * Export metrics collection delay metrics * Add mutex to InMemoryJobRepository * Add tests * Lint * Update internal/executor/configuration/types.go * Lint --------- Co-authored-by: JamesMurkin * Stop executor requesting more jobs when it still has leased jobs (#2932) * Stop executor requesting more jobs when it still has leased jobs Currently we "queue" jobs to be submitted on the executor - which sit in the leased state until they are submitted to kubernetes. However this causes 2 issues with our current setup: - It prevents back-pressure from working well on the scheduler side, as it sees all these "Leased" jobs as active and so just keeps scheduling more - In the case we are slowing submission due to etcd going over its limit.
We "queue" lots of jobs, and as soon as etcd goes under its limit we hit it with potentially thousands of jobs This flow needs further work and thought - however for now this is the minimal fix to prevent bad behaviour Signed-off-by: JamesMurkin * WIP Signed-off-by: JamesMurkin * Fix scheduler side tests Signed-off-by: JamesMurkin * Implement number of requested jobs on executor side Signed-off-by: JamesMurkin * Remove unused config Signed-off-by: JamesMurkin * Fixing panic on startup when etcd health monitor not registered Signed-off-by: JamesMurkin * Enhance logging Signed-off-by: JamesMurkin * Set more sensible default for maxLeasedJobs Signed-off-by: JamesMurkin --------- Signed-off-by: JamesMurkin * Fix race in etcd protections (#2937) * Initial commit * Fix MultiHealthMonitor race * Fix etcd health metric naming conflict (#2939) * Fix metric naming conflict * Fix metric names * Fix metrix prefix * Fix label * Bump golang.org/x/sync from 0.1.0 to 0.3.0 (#2946) Bumps [golang.org/x/sync](https://github.com/golang/sync) from 0.1.0 to 0.3.0. - [Commits](https://github.com/golang/sync/compare/v0.1.0...v0.3.0) --- updated-dependencies: - dependency-name: golang.org/x/sync dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Add more scheduler metrics (#2906) * Add jobs considered and refactor to counters * Add fair share metrics * Add reset for gauge metrics * format * cycle imports * modify cycle return struct * verbose logging --------- Co-authored-by: Albin Severinson * Update config.yaml (#2953) * Remove gang job cardinality submit check. Add placeholder for min gang size * Add msumner91 and mustafai to magic list of trusted people (#2956) * Add msumner91 to magic list of trusted people * Update .mergify.yml * Airflow: always set credentials from args in channel ctor (#2952) In the GrpcChannelArguments constructor, always set the credentials_callback_args member from what is given. Add a test to verify serialization round-tripping is complete, and a __eq__ implementation for GrpcChannelArguments. Signed-off-by: Rich Scott * Removed Makefile from repo (#2915) Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com> * Add per-queue scheduling rate-limiting (#2938) * Initial commit * Add rate limiters * go mod tidy * Updates * Add tests * Update default config * Update default scheduler config * Whitespace * Cleanup * Docstring improvements * Remove limiter nil checks * Add Cardinality() function on gctx * Fix test * Fix test * Add note about signed commits to Contributor documentation (#2960) * Add note about signed commits to Contributor documentation Signed-off-by: Aviral Singh * Add note about signed commits to Contributor documentation --------- Signed-off-by: Aviral Singh * ArmadaContext that includes a logger (#2934) * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * compilation! * rename package * more compilation * rename to Context * embed * compilation * compilation * fix test * remove old ctxloggers * revert design doc * revert developer doc * formatting * wip * tests * don't gen * don't gen * merged master --------- Co-authored-by: Chris Martin Co-authored-by: Albin Severinson * Bump armada airflow operator to version 0.5.4 (#2961) * Bump armada airflow operator to version 0.5.4 Signed-off-by: Rich Scott * Regenerate Airflow Operator Markdown doc. 
Signed-off-by: Rich Scott * Fix regenerated Airflow doc error. Signed-off-by: Rich Scott * Pin versions of all modules, especially around docs generation. Signed-off-by: Rich Scott * Regenerate Airflow docs using Python 3.10 Signed-off-by: Rich Scott --------- Signed-off-by: Rich Scott * Simulator Changes Made a number of changes to the simulator and simulator tests, most notably: - Fixed implementation of minSubmitTime setting for workload specifications - Added tests for SchedulingConfigsFromPattern, ClusterSpecsFromPattern, WorkloadFromPattern - Added sample workloads, clusters and scheduling configs - Added tests which simulate per-pool and per-executorGroup scheduling - Implemented further metrics for use in simulator tests, such as a cluster's aggregate resources, number of preemptions and schedules for a given test run - Added optimisation to speed up simulator, whereby the scheduler skips the current schedule event if no eventSequences have been received since the previous schedule. * Simplified TestClusterSpecsFromPattern and TestWorkloadFromPattern tests * Removed unused test * Fixed malformed yaml * Improved metrics for simulations. Improved simulator tests with errorgroups. * Removed all simulator test data except basic data necessary for testing * Implementing CLI Signed-off-by: dependabot[bot] Signed-off-by: JamesMurkin Signed-off-by: Rich Scott Signed-off-by: Aviral Singh Co-authored-by: Daniel Rastelli Co-authored-by: Chris Martin Co-authored-by: Chris Martin Co-authored-by: Sarthak Negi <122533767+sarthaksarthak9@users.noreply.github.com> Co-authored-by: Kevin Hannon Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com> Co-authored-by: Pradeep Kurapati <113408145+Pradeep-Kurapati@users.noreply.github.com> Co-authored-by: Dave Gantenbein Co-authored-by: Shivang Shandilya <101946115+ShivangShandilya@users.noreply.github.com> Co-authored-by: Kevin Hannon Co-authored-by: Clif Houck Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com> Co-authored-by: Kanu Mike Chibundu Co-authored-by: snyk-bot Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: JamesMurkin Co-authored-by: owenthomas17 Co-authored-by: Albin Severinson Co-authored-by: Mark Sumner Co-authored-by: Rich Scott Co-authored-by: MeenuyD <116630390+MeenuyD@users.noreply.github.com> Co-authored-by: Aviral Singh Co-authored-by: Mustafa Ilyas * Adding verbose flag to simulator CLI, changing logging context in simulator * Improved simulator CLI output, removed redundant features, implemented parallel simulations by addressing mutability of structures inputted into the simulator * Removed unknown logging library * Changing threadSafeLogger Info call to Print. 
Adding separation back between simulation results * Implemented stochastic runtime for jobs using a shifted exponential distribution (#13) * Implemented stochastic runtime for jobs using a shifted exponential distribution * Implemented min submit time from dependency completion (#14) Co-authored-by: Mustafa Ilyas * Fixed tests * Fixed implementation of shifted exponential distribution * Using FP unrounded parameters to sample from distribution * Modified stochastic runtime definition * Adding logging to simulator Co-authored-by: Mustafa Ilyas Signed-off-by: dependabot[bot] Signed-off-by: JamesMurkin Signed-off-by: Rich Scott Signed-off-by: Aviral Singh Co-authored-by: Albin Severinson Co-authored-by: Mustafa Ilyas Co-authored-by: Mustafa Ilyas Co-authored-by: Daniel Rastelli Co-authored-by: Chris Martin Co-authored-by: Chris Martin Co-authored-by: Sarthak Negi <122533767+sarthaksarthak9@users.noreply.github.com> Co-authored-by: Kevin Hannon Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com> Co-authored-by: Pradeep Kurapati <113408145+Pradeep-Kurapati@users.noreply.github.com> Co-authored-by: Dave Gantenbein Co-authored-by: Shivang Shandilya <101946115+ShivangShandilya@users.noreply.github.com> Co-authored-by: Kevin Hannon Co-authored-by: Clif Houck Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com> Co-authored-by: Kanu Mike Chibundu Co-authored-by: snyk-bot Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: JamesMurkin Co-authored-by: owenthomas17 Co-authored-by: Albin Severinson Co-authored-by: Mark Sumner Co-authored-by: Rich Scott Co-authored-by: MeenuyD <116630390+MeenuyD@users.noreply.github.com> Co-authored-by: Aviral Singh * Add missing brace * Lint * Lint * Lint * Cleanup * Testsuite improvements * Lint * Tidying --------- Signed-off-by: dependabot[bot] Signed-off-by: JamesMurkin Signed-off-by: Rich Scott Signed-off-by: Aviral Singh Co-authored-by: Albin Severinson Co-authored-by: Albin Severinson Co-authored-by: Mustafa Ilyas Co-authored-by: Mustafa Ilyas Co-authored-by: Daniel Rastelli Co-authored-by: Chris Martin Co-authored-by: Chris Martin Co-authored-by: Sarthak Negi <122533767+sarthaksarthak9@users.noreply.github.com> Co-authored-by: Kevin Hannon Co-authored-by: Adam McArthur <46480158+Sharpz7@users.noreply.github.com> Co-authored-by: Pradeep Kurapati <113408145+Pradeep-Kurapati@users.noreply.github.com> Co-authored-by: Dave Gantenbein Co-authored-by: Shivang Shandilya <101946115+ShivangShandilya@users.noreply.github.com> Co-authored-by: Kevin Hannon Co-authored-by: Clif Houck Co-authored-by: Mohamed Abdelfatah <39927413+Mo-Fatah@users.noreply.github.com> Co-authored-by: Kanu Mike Chibundu Co-authored-by: snyk-bot Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: JamesMurkin Co-authored-by: owenthomas17 Co-authored-by: Mark Sumner Co-authored-by: Rich Scott Co-authored-by: MeenuyD <116630390+MeenuyD@users.noreply.github.com> Co-authored-by: Aviral Singh --- cmd/simulator/cmd/root.go | 150 +++ cmd/simulator/main.go | 13 + cmd/testsuite/cmd/test.go | 2 +- internal/scheduler/scheduler.go | 5 +- internal/scheduler/simulator/events.go | 53 + internal/scheduler/simulator/metrics.go | 106 ++ internal/scheduler/simulator/runner.go | 220 ++++ internal/scheduler/simulator/simulator.go | 549 ++++----- internal/scheduler/simulator/simulator.pb.go | 1047 +++++++++++++---- internal/scheduler/simulator/simulator.proto | 55 +- 
.../scheduler/simulator/simulator_test.go | 337 +++--- internal/scheduler/simulator/test_utils.go | 315 +++++ .../testdata/clusters/cpu_1_1_100.yaml | 12 + .../testdata/clusters/cpu_1_3_100.yaml | 26 + .../testdata/clusters/tinyCluster.yaml | 23 + .../testdata/clusters/tinyClusterAlt.yaml | 21 + .../configs/basicSchedulingConfig.yaml | 29 + .../configs/defaultSchedulingConfig.yaml | 38 + .../simulator/testdata/diva-plat.yaml | 30 - .../testdata/workloads/basicWorkload.yaml | 15 + .../workloads/small_big/non-preemptible.yaml | 30 + .../workloads/small_big/only-big.yaml | 17 + .../workloads/small_big/only-small.yaml | 16 + .../workloads/small_big/preemptible.yaml | 30 + magefiles/proto.go | 1 + 25 files changed, 2326 insertions(+), 814 deletions(-) create mode 100644 cmd/simulator/cmd/root.go create mode 100644 cmd/simulator/main.go create mode 100644 internal/scheduler/simulator/events.go create mode 100644 internal/scheduler/simulator/metrics.go create mode 100644 internal/scheduler/simulator/runner.go create mode 100644 internal/scheduler/simulator/test_utils.go create mode 100644 internal/scheduler/simulator/testdata/clusters/cpu_1_1_100.yaml create mode 100644 internal/scheduler/simulator/testdata/clusters/cpu_1_3_100.yaml create mode 100644 internal/scheduler/simulator/testdata/clusters/tinyCluster.yaml create mode 100644 internal/scheduler/simulator/testdata/clusters/tinyClusterAlt.yaml create mode 100644 internal/scheduler/simulator/testdata/configs/basicSchedulingConfig.yaml create mode 100644 internal/scheduler/simulator/testdata/configs/defaultSchedulingConfig.yaml delete mode 100644 internal/scheduler/simulator/testdata/diva-plat.yaml create mode 100644 internal/scheduler/simulator/testdata/workloads/basicWorkload.yaml create mode 100644 internal/scheduler/simulator/testdata/workloads/small_big/non-preemptible.yaml create mode 100644 internal/scheduler/simulator/testdata/workloads/small_big/only-big.yaml create mode 100644 internal/scheduler/simulator/testdata/workloads/small_big/only-small.yaml create mode 100644 internal/scheduler/simulator/testdata/workloads/small_big/preemptible.yaml diff --git a/cmd/simulator/cmd/root.go b/cmd/simulator/cmd/root.go new file mode 100644 index 00000000000..342deeaee2e --- /dev/null +++ b/cmd/simulator/cmd/root.go @@ -0,0 +1,150 @@ +package cmd + +import ( + "github.com/spf13/cobra" + "golang.org/x/exp/maps" + + "github.com/armadaproject/armada/internal/common/armadacontext" + "github.com/armadaproject/armada/internal/common/util" + "github.com/armadaproject/armada/internal/scheduler/simulator" + "github.com/armadaproject/armada/pkg/armadaevents" +) + +func RootCmd() *cobra.Command { + cmd := &cobra.Command{ + Use: "Simulate", + Short: "Simulate running jobs on Armada.", + RunE: runSimulations, + } + // cmd.Flags().BoolP("verbose", "v", false, "Log detailed output to console.") + cmd.Flags().String("clusters", "", "Glob pattern specifying cluster configurations to simulate.") + cmd.Flags().String("workloads", "", "Glob pattern specifying workloads to simulate.") + cmd.Flags().String("configs", "", "Glob pattern specifying scheduler configurations to simulate.") + cmd.Flags().Bool("showSchedulerLogs", false, "Show scheduler logs.") + cmd.Flags().Int("logInterval", 0, "Log summary statistics every this many events. Disabled if 0.") + return cmd +} + +func runSimulations(cmd *cobra.Command, args []string) error { + // Get command-line arguments. 
+ clusterPattern, err := cmd.Flags().GetString("clusters") + if err != nil { + return err + } + workloadPattern, err := cmd.Flags().GetString("workloads") + if err != nil { + return err + } + configPattern, err := cmd.Flags().GetString("configs") + if err != nil { + return err + } + showSchedulerLogs, err := cmd.Flags().GetBool("showSchedulerLogs") + if err != nil { + return err + } + logInterval, err := cmd.Flags().GetInt("logInterval") + if err != nil { + return err + } + + // Load test specs. and config. + clusterSpecs, err := simulator.ClusterSpecsFromPattern(clusterPattern) + if err != nil { + return err + } + workloadSpecs, err := simulator.WorkloadsFromPattern(workloadPattern) + if err != nil { + return err + } + schedulingConfigsByFilePath, err := simulator.SchedulingConfigsByFilePathFromPattern(configPattern) + if err != nil { + return err + } + + ctx := armadacontext.Background() + ctx.Info("Armada simulator") + ctx.Infof("ClusterSpecs: %v", util.Map(clusterSpecs, func(clusperSpec *simulator.ClusterSpec) string { return clusperSpec.Name })) + ctx.Infof("WorkloadSpecs: %v", util.Map(workloadSpecs, func(workloadSpec *simulator.WorkloadSpec) string { return workloadSpec.Name })) + ctx.Infof("SchedulingConfigs: %v", maps.Keys(schedulingConfigsByFilePath)) + + // Setup a simulator for each combination of (clusterSpec, workloadSpec, schedulingConfig). + simulators := make([]*simulator.Simulator, 0) + metricsCollectors := make([]*simulator.MetricsCollector, 0) + eventSequenceChannels := make([]<-chan *armadaevents.EventSequence, 0) + schedulingConfigPaths := make([]string, 0) + for _, clusterSpec := range clusterSpecs { + for _, workloadSpec := range workloadSpecs { + for schedulingConfigPath, schedulingConfig := range schedulingConfigsByFilePath { + if s, err := simulator.NewSimulator(clusterSpec, workloadSpec, schedulingConfig); err != nil { + return err + } else { + if !showSchedulerLogs { + s.SuppressSchedulerLogs = true + } else { + ctx.Info("Showing scheduler logs") + } + simulators = append(simulators, s) + mc := simulator.NewMetricsCollector(s.Output()) + mc.LogSummaryInterval = logInterval + metricsCollectors = append(metricsCollectors, mc) + eventSequenceChannels = append(eventSequenceChannels, s.Output()) + schedulingConfigPaths = append(schedulingConfigPaths, schedulingConfigPath) + } + } + } + } + + // Run simulators. + g, ctx := armadacontext.ErrGroup(ctx) + for _, s := range simulators { + s := s + g.Go(func() error { + return s.Run(ctx) + }) + } + + // Log events to stdout. + for _, c := range eventSequenceChannels { + c := c + g.Go(func() error { + for { + select { + case <-ctx.Done(): + return ctx.Err() + case eventSequence, ok := <-c: + if !ok { + return nil + } + ctx.Debug(*eventSequence.Events[0].Created, simulator.EventSequenceSummary(eventSequence)) + } + } + }) + } + + // Run metric collectors. + for _, mc := range metricsCollectors { + mc := mc + g.Go(func() error { + return mc.Run(ctx) + }) + } + + // Wait for simulations to complete. + if err := g.Wait(); err != nil { + return err + } + + // Log overall statistics. 
+ for i, mc := range metricsCollectors { + s := simulators[i] + schedulingConfigPath := schedulingConfigPaths[i] + ctx.Infof("Simulation result") + ctx.Infof("ClusterSpec: %s", s.ClusterSpec.Name) + ctx.Infof("WorkloadSpec: %s", s.WorkloadSpec.Name) + ctx.Infof("SchedulingConfig: %s", schedulingConfigPath) + ctx.Info(mc.String()) + } + + return nil +} diff --git a/cmd/simulator/main.go b/cmd/simulator/main.go new file mode 100644 index 00000000000..96b119c9d79 --- /dev/null +++ b/cmd/simulator/main.go @@ -0,0 +1,13 @@ +package main + +import ( + "os" + + "github.com/armadaproject/armada/cmd/simulator/cmd" +) + +func main() { + if err := cmd.RootCmd().Execute(); err != nil { + os.Exit(1) + } +} diff --git a/cmd/testsuite/cmd/test.go b/cmd/testsuite/cmd/test.go index ccf8ecdf9fd..d6c931e94ad 100644 --- a/cmd/testsuite/cmd/test.go +++ b/cmd/testsuite/cmd/test.go @@ -79,7 +79,7 @@ func testCmdRunE(app *testsuite.App) func(cmd *cobra.Command, args []string) err app.Params.PrometheusPushGatewayJobName = prometheusPushgatewayJobName // Create a context that is cancelled on SIGINT/SIGTERM. - // Ensures test jobs are cancelled on ctrl-C. + // Ensures test jobs are cancelled on ctrl-c. ctx, cancel := context.WithCancel(context.Background()) defer cancel() stopSignal := make(chan os.Signal, 1) diff --git a/internal/scheduler/scheduler.go b/internal/scheduler/scheduler.go index 58ba50b6f1b..5b63bd78811 100644 --- a/internal/scheduler/scheduler.go +++ b/internal/scheduler/scheduler.go @@ -17,7 +17,6 @@ import ( "github.com/armadaproject/armada/internal/common/logging" "github.com/armadaproject/armada/internal/common/stringinterner" "github.com/armadaproject/armada/internal/scheduler/database" - "github.com/armadaproject/armada/internal/scheduler/interfaces" "github.com/armadaproject/armada/internal/scheduler/jobdb" "github.com/armadaproject/armada/internal/scheduler/kubernetesobjects/affinity" "github.com/armadaproject/armada/internal/scheduler/schedulerobjects" @@ -407,7 +406,7 @@ func EventsFromSchedulerResult(result *SchedulerResult, time time.Time) ([]*arma if err != nil { return nil, err } - eventSequences, err = AppendEventSequencesFromUnschedulableJobs(eventSequences, result.FailedJobs, time) + eventSequences, err = AppendEventSequencesFromUnschedulableJobs(eventSequences, FailedJobsFromSchedulerResult[*jobdb.Job](result), time) if err != nil { return nil, err } @@ -510,7 +509,7 @@ func AppendEventSequencesFromScheduledJobs(eventSequences []*armadaevents.EventS return eventSequences, nil } -func AppendEventSequencesFromUnschedulableJobs(eventSequences []*armadaevents.EventSequence, jobs []interfaces.LegacySchedulerJob, time time.Time) ([]*armadaevents.EventSequence, error) { +func AppendEventSequencesFromUnschedulableJobs(eventSequences []*armadaevents.EventSequence, jobs []*jobdb.Job, time time.Time) ([]*armadaevents.EventSequence, error) { for _, job := range jobs { jobId, err := armadaevents.ProtoUuidFromUlidString(job.GetId()) if err != nil { diff --git a/internal/scheduler/simulator/events.go b/internal/scheduler/simulator/events.go new file mode 100644 index 00000000000..36071574b9c --- /dev/null +++ b/internal/scheduler/simulator/events.go @@ -0,0 +1,53 @@ +package simulator + +import "time" + +// Event is a simulator-internal event. +type Event struct { + // Time at which the event was submitted. + time time.Time + // Each event is assigned a sequence number. + // Events with equal time are ordered by their sequence number. 
+ sequenceNumber int + // Either armadaevents.EventSequence or scheduleEvent. + eventSequenceOrScheduleEvent any + // Maintained by the heap.Interface methods. + index int +} + +// scheduleEvent is an event indicating the scheduler should be run. +type scheduleEvent struct{} + +type EventLog []Event + +func (el EventLog) Len() int { return len(el) } + +func (el EventLog) Less(i, j int) bool { + if el[i].time == el[j].time { + return el[i].sequenceNumber < el[j].sequenceNumber + } + return el[j].time.After(el[i].time) +} + +func (el EventLog) Swap(i, j int) { + el[i], el[j] = el[j], el[i] + el[i].index = i + el[j].index = j +} + +func (el *EventLog) Push(x any) { + n := len(*el) + item := x.(Event) + item.index = n + *el = append(*el, item) +} + +func (el *EventLog) Pop() any { + old := *el + n := len(old) + item := old[n-1] + old[n-1] = Event{} // avoid memory leak + item.index = -1 // for safety + *el = old[0 : n-1] + return item +} diff --git a/internal/scheduler/simulator/metrics.go b/internal/scheduler/simulator/metrics.go new file mode 100644 index 00000000000..d4447868211 --- /dev/null +++ b/internal/scheduler/simulator/metrics.go @@ -0,0 +1,106 @@ +package simulator + +import ( + "fmt" + "strings" + "time" + + "golang.org/x/exp/maps" + "golang.org/x/exp/slices" + + "github.com/armadaproject/armada/internal/common/armadacontext" + "github.com/armadaproject/armada/pkg/armadaevents" +) + +type MetricsCollector struct { + c <-chan *armadaevents.EventSequence + OverallMetrics MetricsVector + MetricsByQueue map[string]MetricsVector + // If non-zero, log a summary every this many events. + LogSummaryInterval int +} + +type MetricsVector struct { + TimeOfMostRecentJobSucceededEvent time.Duration + NumEvents int + NumSubmitEvents int + NumLeasedEvents int + NumPreemptedEvents int + NumJobSucceededEvents int +} + +func NewMetricsCollector(c <-chan *armadaevents.EventSequence) *MetricsCollector { + return &MetricsCollector{ + c: c, + MetricsByQueue: make(map[string]MetricsVector), + } +} + +func (mc *MetricsCollector) String() string { + var sb strings.Builder + sb.WriteString("{") + sb.WriteString(fmt.Sprintf("Overall metrics: %s, Per-queue metrics: {", mc.OverallMetrics)) + i := 0 + queues := maps.Keys(mc.MetricsByQueue) + slices.Sort(queues) + for _, queue := range queues { + metrics := mc.MetricsByQueue[queue] + sb.WriteString(fmt.Sprintf("%s: %s", queue, metrics)) + i++ + if i != len(mc.MetricsByQueue) { + sb.WriteString(", ") + } + } + sb.WriteString("}}") + return sb.String() +} + +func (m MetricsVector) String() string { + return fmt.Sprintf( + "{FractionLeasedSucceeded: %f, TimeOfMostRecentJobSucceededEvent: %s, NumEvents: %d, NumPreemptedEvents: %d, NumLeasedEvents: %d, NumJobSucceededEvents: %d}", + float64(m.NumJobSucceededEvents)/float64(m.NumLeasedEvents), m.TimeOfMostRecentJobSucceededEvent, m.NumEvents, m.NumPreemptedEvents, m.NumLeasedEvents, m.NumJobSucceededEvents, + ) +} + +func (mc *MetricsCollector) Run(ctx *armadacontext.Context) error { + for { + select { + case <-ctx.Done(): + return ctx.Err() + case eventSequence, ok := <-mc.c: + if !ok { + return nil + } + mc.addEventSequence(eventSequence) + if mc.LogSummaryInterval != 0 && mc.OverallMetrics.NumEvents%mc.LogSummaryInterval == 0 { + ctx.Info(mc.String()) + } + } + } +} + +func (mc *MetricsCollector) addEventSequence(eventSequence *armadaevents.EventSequence) { + queue := eventSequence.Queue + mc.OverallMetrics.NumEvents += 1 + perQueueMetrics := mc.MetricsByQueue[queue] + perQueueMetrics.NumEvents += 1 + for _, event 
:= range eventSequence.Events { + switch event.GetEvent().(type) { + case *armadaevents.EventSequence_Event_SubmitJob: + mc.OverallMetrics.NumSubmitEvents += 1 + perQueueMetrics.NumSubmitEvents += 1 + case *armadaevents.EventSequence_Event_JobRunLeased: + mc.OverallMetrics.NumLeasedEvents += 1 + perQueueMetrics.NumLeasedEvents += 1 + case *armadaevents.EventSequence_Event_JobRunPreempted: + mc.OverallMetrics.NumPreemptedEvents += 1 + perQueueMetrics.NumPreemptedEvents += 1 + case *armadaevents.EventSequence_Event_JobSucceeded: + mc.OverallMetrics.TimeOfMostRecentJobSucceededEvent = event.Created.Sub(time.Time{}) + perQueueMetrics.TimeOfMostRecentJobSucceededEvent = event.Created.Sub(time.Time{}) + mc.OverallMetrics.NumJobSucceededEvents += 1 + perQueueMetrics.NumJobSucceededEvents += 1 + } + } + mc.MetricsByQueue[queue] = perQueueMetrics +} diff --git a/internal/scheduler/simulator/runner.go b/internal/scheduler/simulator/runner.go new file mode 100644 index 00000000000..a2265cdf789 --- /dev/null +++ b/internal/scheduler/simulator/runner.go @@ -0,0 +1,220 @@ +package simulator + +import ( + "fmt" + "path/filepath" + "strings" + + "github.com/mattn/go-zglob" + "github.com/pkg/errors" + "github.com/renstrom/shortuuid" + "github.com/spf13/viper" + + "github.com/armadaproject/armada/internal/armada/configuration" + "github.com/armadaproject/armada/internal/common/armadacontext" + commonconfig "github.com/armadaproject/armada/internal/common/config" +) + +func Simulate(ctx *armadacontext.Context, clusterSpecsPattern, workloadSpecsPattern, schedulingConfigsPattern string) error { + clusterSpecs, err := ClusterSpecsFromPattern(clusterSpecsPattern) + if err != nil { + return err + } + workloadSpecs, err := WorkloadsFromPattern(workloadSpecsPattern) + if err != nil { + return err + } + schedulingConfigs, err := SchedulingConfigsFromPattern(schedulingConfigsPattern) + if err != nil { + return err + } + g, ctx := armadacontext.ErrGroup(ctx) + for _, clusterSpec := range clusterSpecs { + for _, workloadSpec := range workloadSpecs { + for _, schedulingConfig := range schedulingConfigs { + s, err := NewSimulator(clusterSpec, workloadSpec, schedulingConfig) + if err != nil { + return err + } + g.Go(func() error { + return s.Run(ctx) + }) + } + } + } + return g.Wait() +} + +func SchedulingConfigsByFilePathFromPattern(pattern string) (map[string]configuration.SchedulingConfig, error) { + filePaths, err := zglob.Glob(pattern) + if err != nil { + return nil, errors.WithStack(err) + } + filePathConfigMap := make(map[string]configuration.SchedulingConfig) + for _, path := range filePaths { + config, err := SchedulingConfigsFromFilePaths(filePaths) + if err != nil { + return nil, err + } + filePathConfigMap[path] = config[0] + } + return filePathConfigMap, nil +} + +func SchedulingConfigsFromPattern(pattern string) ([]configuration.SchedulingConfig, error) { + filePaths, err := zglob.Glob(pattern) + if err != nil { + return nil, errors.WithStack(err) + } + return SchedulingConfigsFromFilePaths(filePaths) +} + +func SchedulingConfigsFromFilePaths(filePaths []string) ([]configuration.SchedulingConfig, error) { + rv := make([]configuration.SchedulingConfig, len(filePaths)) + for i, filePath := range filePaths { + config, err := SchedulingConfigFromFilePath(filePath) + if err != nil { + return nil, err + } + rv[i] = config + } + return rv, nil +} + +func SchedulingConfigFromFilePath(filePath string) (configuration.SchedulingConfig, error) { + config := configuration.SchedulingConfig{} + v := 
viper.NewWithOptions(viper.KeyDelimiter("::")) + v.SetConfigFile(filePath) + if err := v.ReadInConfig(); err != nil { + err = errors.WithMessagef(err, "failed to read in SchedulingConfig %s", filePath) + return config, errors.WithStack(err) + } + if err := v.Unmarshal(&config, commonconfig.CustomHooks...); err != nil { + err = errors.WithMessagef(err, "failed to unmarshal SchedulingConfig %s", filePath) + return config, errors.WithStack(err) + } + return config, nil +} + +func ClusterSpecsFromPattern(pattern string) ([]*ClusterSpec, error) { + filePaths, err := zglob.Glob(pattern) + if err != nil { + return nil, errors.WithStack(err) + } + return ClusterSpecsFromFilePaths(filePaths) +} + +func WorkloadsFromPattern(pattern string) ([]*WorkloadSpec, error) { + filePaths, err := zglob.Glob(pattern) + if err != nil { + return nil, errors.WithStack(err) + } + return WorkloadSpecsFromFilePaths(filePaths) +} + +func ClusterSpecsFromFilePaths(filePaths []string) ([]*ClusterSpec, error) { + rv := make([]*ClusterSpec, len(filePaths)) + for i, filePath := range filePaths { + clusterSpec, err := ClusterSpecFromFilePath(filePath) + if err != nil { + return nil, err + } + rv[i] = clusterSpec + } + return rv, nil +} + +func WorkloadSpecsFromFilePaths(filePaths []string) ([]*WorkloadSpec, error) { + rv := make([]*WorkloadSpec, len(filePaths)) + for i, filePath := range filePaths { + workloadSpec, err := WorkloadSpecFromFilePath(filePath) + if err != nil { + return nil, err + } + rv[i] = workloadSpec + } + return rv, nil +} + +func ClusterSpecFromFilePath(filePath string) (*ClusterSpec, error) { + rv := &ClusterSpec{} + v := viper.NewWithOptions(viper.KeyDelimiter("::")) + v.SetConfigFile(filePath) + if err := v.ReadInConfig(); err != nil { + err = errors.WithMessagef(err, "failed to read in ClusterSpec %s", filePath) + return nil, errors.WithStack(err) + } + if err := v.Unmarshal(rv, commonconfig.CustomHooks...); err != nil { + err = errors.WithMessagef(err, "failed to unmarshal ClusterSpec %s", filePath) + return nil, errors.WithStack(err) + } + + // If no test name is provided, set it to be the filename. + if rv.Name == "" { + fileName := filepath.Base(filePath) + fileName = strings.TrimSuffix(fileName, filepath.Ext(fileName)) + rv.Name = fileName + } + initialiseClusterSpec(rv) + + return rv, nil +} + +func WorkloadSpecFromFilePath(filePath string) (*WorkloadSpec, error) { + rv := &WorkloadSpec{} + v := viper.NewWithOptions(viper.KeyDelimiter("::")) + v.SetConfigFile(filePath) + if err := v.ReadInConfig(); err != nil { + err = errors.WithMessagef(err, "failed to read in WorkloadSpec %s", filePath) + return nil, errors.WithStack(err) + } + if err := v.Unmarshal(rv, commonconfig.CustomHooks...); err != nil { + err = errors.WithMessagef(err, "failed to unmarshal WorkloadSpec %s", filePath) + return nil, errors.WithStack(err) + } + + // If no test name is provided, set it to be the filename. + if rv.Name == "" { + fileName := filepath.Base(filePath) + fileName = strings.TrimSuffix(fileName, filepath.Ext(fileName)) + rv.Name = fileName + } + + // Generate random ids for any job templates without an explicitly set id. + for _, queue := range rv.Queues { + for j, jobTemplate := range queue.JobTemplates { + if jobTemplate.Id == "" { + jobTemplate.Id = shortuuid.New() + } + queue.JobTemplates[j] = jobTemplate + } + } + initialiseWorkloadSpec(rv) + + return rv, nil +} + +func initialiseClusterSpec(clusterSpec *ClusterSpec) { + // Assign names to executors with none specified. 
+ for _, pool := range clusterSpec.Pools { + for i, executorGroup := range pool.ClusterGroups { + for j, executor := range executorGroup.Clusters { + if executor.Name == "" { + executor.Name = fmt.Sprintf("%s-%d-%d", pool.Name, i, j) + } + } + } + } +} + +func initialiseWorkloadSpec(workloadSpec *WorkloadSpec) { + // Assign names to jobTemplates with none specified. + for _, queue := range workloadSpec.Queues { + for i, jobTemplate := range queue.JobTemplates { + if jobTemplate.Id == "" { + jobTemplate.Id = fmt.Sprintf("%s-%d", queue.Name, i) + } + jobTemplate.Queue = queue.Name + } + } +} diff --git a/internal/scheduler/simulator/simulator.go b/internal/scheduler/simulator/simulator.go index 1c282e8c303..50926404ae6 100644 --- a/internal/scheduler/simulator/simulator.go +++ b/internal/scheduler/simulator/simulator.go @@ -1,29 +1,23 @@ package simulator import ( - "bytes" "container/heap" "fmt" - "os" - "path/filepath" - "strings" + "io" + "math/rand" "time" - "github.com/caarlos0/log" - "github.com/mattn/go-zglob" + "github.com/gogo/protobuf/proto" "github.com/oklog/ulid" "github.com/pkg/errors" - "github.com/renstrom/shortuuid" - "github.com/spf13/viper" + "github.com/sirupsen/logrus" "golang.org/x/exp/maps" "golang.org/x/exp/slices" "golang.org/x/time/rate" v1 "k8s.io/api/core/v1" - "k8s.io/apimachinery/pkg/util/yaml" "github.com/armadaproject/armada/internal/armada/configuration" "github.com/armadaproject/armada/internal/common/armadacontext" - commonconfig "github.com/armadaproject/armada/internal/common/config" armadaslices "github.com/armadaproject/armada/internal/common/slices" "github.com/armadaproject/armada/internal/common/util" "github.com/armadaproject/armada/internal/scheduler" @@ -32,20 +26,28 @@ import ( "github.com/armadaproject/armada/internal/scheduler/fairness" "github.com/armadaproject/armada/internal/scheduler/jobdb" "github.com/armadaproject/armada/internal/scheduler/nodedb" - "github.com/armadaproject/armada/internal/scheduler/schedulerobjects" + schedulerobjects "github.com/armadaproject/armada/internal/scheduler/schedulerobjects" "github.com/armadaproject/armada/internal/scheduleringester" "github.com/armadaproject/armada/pkg/armadaevents" ) +var nullLogger = &logrus.Logger{ + Out: io.Discard, + Formatter: new(logrus.TextFormatter), + Hooks: make(logrus.LevelHooks), + Level: logrus.PanicLevel, +} + // Simulator captures the parameters and state of the Armada simulator. type Simulator struct { - testCase *TestCase + ClusterSpec *ClusterSpec + WorkloadSpec *WorkloadSpec schedulingConfig configuration.SchedulingConfig // Map from jobId to the jobTemplate from which the job was created. jobTemplateByJobId map[string]*JobTemplate // Map from job template ids to slices of templates depending on those ids. jobTemplatesByDependencyIds map[string]map[string]*JobTemplate - // Map from job template id to jobTemplate for templates for which all jobs have yet to succeed. + // Map from job template id to jobTemplate for templates for which all jobs have not yet succeeded. activeJobTemplatesById map[string]*JobTemplate // The JobDb stores all jobs that have yet to terminate. jobDb *jobdb.JobDb @@ -60,54 +62,164 @@ type Simulator struct { allocationByPoolAndQueueAndPriorityClass map[string]map[string]schedulerobjects.QuantityByTAndResourceType[string] // Total resources across all executorGroups for each pool. totalResourcesByPool map[string]schedulerobjects.ResourceList + // Indicates whether a job has been submitted or terminated since the last scheduling round. 
+ shouldSchedule bool // Current simulated time. time time.Time // Sequence number of the next event to be published. sequenceNumber int - // Events stored in a priority queue ordered by submit time. + // Events stored in a priority queue ordered first by timestamp and second by sequence number. eventLog EventLog - // Simulated events are emitted on this channel in order. - c chan *armadaevents.EventSequence - + // Simulated events are emitted on these output channels. + // Create a channel by calling s.Output() before running the simulator. + outputs []chan *armadaevents.EventSequence // Global job scheduling rate-limiter. limiter *rate.Limiter // Per-queue job scheduling rate-limiters. limiterByQueue map[string]*rate.Limiter + // Used to generate random numbers from a chosen seed. + rand *rand.Rand + // If true, scheduler logs are omitted. + // This since the logs are very verbose when scheduling large numbers of jobs. + SuppressSchedulerLogs bool } -func NewSimulator(testCase *TestCase, schedulingConfig configuration.SchedulingConfig) (*Simulator, error) { - initialiseTestCase(testCase) - if err := validateTestCase(testCase); err != nil { +func NewSimulator(clusterSpec *ClusterSpec, workloadSpec *WorkloadSpec, schedulingConfig configuration.SchedulingConfig) (*Simulator, error) { + // TODO: Move clone to caller? + // Copy specs to avoid concurrent mutation. + clusterSpec = proto.Clone(clusterSpec).(*ClusterSpec) + workloadSpec = proto.Clone(workloadSpec).(*WorkloadSpec) + initialiseClusterSpec(clusterSpec) + initialiseWorkloadSpec(workloadSpec) + if err := validateClusterSpec(clusterSpec); err != nil { + return nil, err + } + if err := validateWorkloadSpec(workloadSpec); err != nil { + return nil, err + } + s := &Simulator{ + ClusterSpec: clusterSpec, + WorkloadSpec: workloadSpec, + schedulingConfig: schedulingConfig, + jobTemplateByJobId: make(map[string]*JobTemplate), + jobTemplatesByDependencyIds: make(map[string]map[string]*JobTemplate), + activeJobTemplatesById: make(map[string]*JobTemplate), + jobDb: jobdb.NewJobDb(), + nodeDbByPoolAndExecutorGroup: make(map[string][]*nodedb.NodeDb), + poolByNodeId: make(map[string]string), + nodeDbByExecutorName: make(map[string]*nodedb.NodeDb), + allocationByPoolAndQueueAndPriorityClass: make(map[string]map[string]schedulerobjects.QuantityByTAndResourceType[string]), + totalResourcesByPool: make(map[string]schedulerobjects.ResourceList), + limiter: rate.NewLimiter( + rate.Limit(schedulingConfig.MaximumSchedulingRate), + schedulingConfig.MaximumSchedulingBurst, + ), + limiterByQueue: make(map[string]*rate.Limiter), + rand: rand.New(rand.NewSource(workloadSpec.RandomSeed)), + } + s.limiter.SetBurstAt(s.time, schedulingConfig.MaximumSchedulingBurst) + if err := s.setupClusters(); err != nil { return nil, err } + if err := s.bootstrapWorkload(); err != nil { + return nil, err + } + return s, nil +} + +// Run runs the scheduler until all jobs have finished successfully. +func (s *Simulator) Run(ctx *armadacontext.Context) error { + defer func() { + for _, c := range s.outputs { + close(c) + } + }() + // Bootstrap the simulator by pushing an event that triggers a scheduler run. + s.pushScheduleEvent(s.time) + // Then run the scheduler until all jobs have completed. 
+ for s.eventLog.Len() > 0 { + select { + case <-ctx.Done(): + return ctx.Err() + default: + event := heap.Pop(&s.eventLog).(Event) + if err := s.handleSimulatorEvent(ctx, event); err != nil { + return err + } + } + } + return nil +} + +// Output returns a channel on which all simulated events are sent. +// This function must be called before *Simulator.Run. +func (s *Simulator) Output() <-chan *armadaevents.EventSequence { + c := make(chan *armadaevents.EventSequence, 128) + s.outputs = append(s.outputs, c) + return c +} + +func validateClusterSpec(clusterSpec *ClusterSpec) error { + poolNames := util.Map(clusterSpec.Pools, func(pool *Pool) string { return pool.Name }) + if !slices.Equal(poolNames, armadaslices.Unique(poolNames)) { + return errors.Errorf("duplicate pool name: %v", poolNames) + } + + executorNames := make([]string, 0) + for _, pool := range clusterSpec.Pools { + for _, executorGroup := range pool.ClusterGroups { + for _, executor := range executorGroup.Clusters { + executorNames = append(executorNames, executor.Name) + } + } + } + if !slices.Equal(executorNames, armadaslices.Unique(executorNames)) { + return errors.Errorf("duplicate executor name: %v", executorNames) + } + return nil +} + +func validateWorkloadSpec(workloadSpec *WorkloadSpec) error { + queueNames := util.Map(workloadSpec.Queues, func(queue *Queue) string { return queue.Name }) + if !slices.Equal(queueNames, armadaslices.Unique(queueNames)) { + return errors.Errorf("duplicate queue name: %v", queueNames) + } + jobTemplateIdSlices := util.Map(workloadSpec.Queues, func(queue *Queue) []string { + return util.Map(queue.JobTemplates, func(template *JobTemplate) string { return template.Id }) + }) + jobTemplateIds := make([]string, 0) + for _, singleQueueTemplateIds := range jobTemplateIdSlices { + jobTemplateIds = append(jobTemplateIds, singleQueueTemplateIds...) + } + if !slices.Equal(jobTemplateIds, armadaslices.Unique(jobTemplateIds)) { + return errors.Errorf("duplicate job template ids: %v", jobTemplateIds) + } - // Setup nodes. 
- nodeDbByPoolAndExecutorGroup := make(map[string][]*nodedb.NodeDb) - totalResourcesByPool := make(map[string]schedulerobjects.ResourceList) - poolByNodeId := make(map[string]string) - // executorGroupByExecutor := make(map[string]string) - nodeDbByExecutorName := make(map[string]*nodedb.NodeDb) - for _, pool := range testCase.Pools { + return nil +} + +func (s *Simulator) setupClusters() error { + for _, pool := range s.ClusterSpec.Pools { totalResourcesForPool := schedulerobjects.ResourceList{} - for executorGroupIndex, executorGroup := range pool.ExecutorGroups { + for executorGroupIndex, executorGroup := range pool.ClusterGroups { nodeDb, err := nodedb.NewNodeDb( - schedulingConfig.Preemption.PriorityClasses, - schedulingConfig.MaxExtraNodesToConsider, - schedulingConfig.IndexedResources, - schedulingConfig.IndexedTaints, - schedulingConfig.IndexedNodeLabels, + s.schedulingConfig.Preemption.PriorityClasses, + s.schedulingConfig.MaxExtraNodesToConsider, + s.schedulingConfig.IndexedResources, + s.schedulingConfig.IndexedTaints, + s.schedulingConfig.IndexedNodeLabels, ) if err != nil { - return nil, err + return err } - for executorIndex, executor := range executorGroup.Executors { + for executorIndex, executor := range executorGroup.Clusters { executorName := fmt.Sprintf("%s-%d-%d", pool.Name, executorGroupIndex, executorIndex) - nodeDbByExecutorName[executorName] = nodeDb + s.nodeDbByExecutorName[executorName] = nodeDb for nodeTemplateIndex, nodeTemplate := range executor.NodeTemplates { for i := 0; i < int(nodeTemplate.Number); i++ { nodeId := fmt.Sprintf("%s-%d-%d-%d-%d", pool.Name, executorGroupIndex, executorIndex, nodeTemplateIndex, i) allocatableByPriorityAndResource := make(map[int32]schedulerobjects.ResourceList) - for _, priorityClass := range schedulingConfig.Preemption.PriorityClasses { + for _, priorityClass := range s.schedulingConfig.Preemption.PriorityClasses { allocatableByPriorityAndResource[priorityClass.Priority] = nodeTemplate.TotalResources.DeepCopy() } node := &schedulerobjects.Node{ @@ -122,47 +234,31 @@ func NewSimulator(testCase *TestCase, schedulingConfig configuration.SchedulingC txn := nodeDb.Txn(true) if err := nodeDb.CreateAndInsertWithApiJobsWithTxn(txn, nil, node); err != nil { txn.Abort() - return nil, err + return err } txn.Commit() - poolByNodeId[nodeId] = pool.Name + s.poolByNodeId[nodeId] = pool.Name } } } - nodeDbByPoolAndExecutorGroup[pool.Name] = append(nodeDbByPoolAndExecutorGroup[pool.Name], nodeDb) + s.nodeDbByPoolAndExecutorGroup[pool.Name] = append(s.nodeDbByPoolAndExecutorGroup[pool.Name], nodeDb) totalResourcesForPool.Add(nodeDb.TotalResources()) } - totalResourcesByPool[pool.Name] = totalResourcesForPool - } - s := &Simulator{ - testCase: testCase, - schedulingConfig: schedulingConfig, - jobTemplateByJobId: make(map[string]*JobTemplate), - jobTemplatesByDependencyIds: make(map[string]map[string]*JobTemplate), - activeJobTemplatesById: make(map[string]*JobTemplate), - jobDb: jobdb.NewJobDb(), - poolByNodeId: poolByNodeId, - nodeDbByPoolAndExecutorGroup: nodeDbByPoolAndExecutorGroup, - nodeDbByExecutorName: nodeDbByExecutorName, - allocationByPoolAndQueueAndPriorityClass: make(map[string]map[string]schedulerobjects.QuantityByTAndResourceType[string]), - totalResourcesByPool: totalResourcesByPool, - c: make(chan *armadaevents.EventSequence), - limiter: rate.NewLimiter( - rate.Limit(schedulingConfig.MaximumSchedulingRate), - schedulingConfig.MaximumSchedulingBurst, - ), - limiterByQueue: make(map[string]*rate.Limiter), + 
s.totalResourcesByPool[pool.Name] = totalResourcesForPool } + return nil +} +func (s *Simulator) bootstrapWorkload() error { // Mark all jobTemplates as active. - for _, queue := range testCase.Queues { + for _, queue := range s.WorkloadSpec.Queues { for _, jobTemplate := range queue.JobTemplates { s.activeJobTemplatesById[jobTemplate.Id] = jobTemplate } } // Publish submitJob messages for all jobTemplates without dependencies. - for _, queue := range testCase.Queues { + for _, queue := range s.WorkloadSpec.Queues { for _, jobTemplate := range queue.JobTemplates { if len(jobTemplate.Dependencies) > 0 { continue @@ -179,7 +275,7 @@ func NewSimulator(testCase *TestCase, schedulingConfig configuration.SchedulingC eventSequence.Events = append( eventSequence.Events, &armadaevents.EventSequence_Event{ - Created: pointer(maxTime(s.time, jobTemplate.MinSubmitTime)), + Created: pointer(s.time.Add(jobTemplate.EarliestSubmitTime)), Event: &armadaevents.EventSequence_Event_SubmitJob{ SubmitJob: submitJobFromJobTemplate(jobId, jobTemplate), }, @@ -194,12 +290,12 @@ func NewSimulator(testCase *TestCase, schedulingConfig configuration.SchedulingC } // Setup the jobTemplate dependency map. - for _, queue := range testCase.Queues { + for _, queue := range s.WorkloadSpec.Queues { for _, jobTemplate := range queue.JobTemplates { for _, dependencyJobTemplateId := range jobTemplate.Dependencies { dependencyJobTemplate, ok := s.activeJobTemplatesById[dependencyJobTemplateId] if !ok { - return nil, errors.Errorf( + return errors.Errorf( "jobTemplate %s depends on jobTemplate %s, which does not exist", jobTemplate.Id, dependencyJobTemplate.Id, ) @@ -213,64 +309,9 @@ func NewSimulator(testCase *TestCase, schedulingConfig configuration.SchedulingC } } } - - // Publish scheduleEvent. - s.pushScheduleEvent(s.time.Add(10 * time.Second)) - return s, nil -} - -func (s *Simulator) C() <-chan *armadaevents.EventSequence { - return s.c -} - -func validateTestCase(testCase *TestCase) error { - poolNames := util.Map(testCase.Pools, func(pool *Pool) string { return pool.Name }) - if !slices.Equal(poolNames, armadaslices.Unique(poolNames)) { - return errors.Errorf("duplicate pool name: %v", poolNames) - } - - executorNames := make([]string, 0) - for _, pool := range testCase.Pools { - for _, executorGroup := range pool.ExecutorGroups { - for _, executor := range executorGroup.Executors { - executorNames = append(executorNames, executor.Name) - } - } - } - if !slices.Equal(executorNames, armadaslices.Unique(executorNames)) { - return errors.Errorf("duplicate executor name: %v", executorNames) - } - - queueNames := util.Map(testCase.Queues, func(queue Queue) string { return queue.Name }) - if !slices.Equal(queueNames, armadaslices.Unique(queueNames)) { - return errors.Errorf("duplicate queue name: %v", queueNames) - } return nil } -func initialiseTestCase(testCase *TestCase) { - // Assign names to executors with none specified. - for _, pool := range testCase.Pools { - for i, executorGroup := range pool.ExecutorGroups { - for j, executor := range executorGroup.Executors { - if executor.Name == "" { - executor.Name = fmt.Sprintf("%s-%d-%d", pool.Name, i, j) - } - } - } - } - - // Assign names to jobTemplates with none specified. 
- for _, queue := range testCase.Queues { - for i, jobTemplate := range queue.JobTemplates { - if jobTemplate.Id == "" { - jobTemplate.Id = fmt.Sprintf("%s-%d", queue.Name, i) - } - jobTemplate.Queue = queue.Name - } - } -} - func submitJobFromJobTemplate(jobId ulid.ULID, jobTemplate *JobTemplate) *armadaevents.SubmitJob { return &armadaevents.SubmitJob{ JobId: armadaevents.ProtoUuidFromUlid(jobId), @@ -305,6 +346,7 @@ func (s *Simulator) pushEventSequence(eventSequence *armadaevents.EventSequence) heap.Push( &s.eventLog, Event{ + // We assume that all events in the sequence have the same Created time. time: *eventSequence.Events[0].Created, sequenceNumber: s.sequenceNumber, eventSequenceOrScheduleEvent: eventSequence, @@ -325,87 +367,36 @@ func (s *Simulator) pushScheduleEvent(time time.Time) { s.sequenceNumber++ } -type EventLog []Event - -type Event struct { - // Time at which the event was submitted. - time time.Time - // Each event is assigned a sequence number. - // Events with equal time are ordered by their sequence number. - sequenceNumber int - // One of armadaevents.EventSequence or scheduleEvent.. - eventSequenceOrScheduleEvent any - // Maintained by the heap.Interface methods. - index int -} - -func (el EventLog) Len() int { return len(el) } - -func (el EventLog) Less(i, j int) bool { - if el[i].time == el[j].time { - return el[i].sequenceNumber < el[j].sequenceNumber - } - return el[j].time.After(el[i].time) -} - -func (el EventLog) Swap(i, j int) { - el[i], el[j] = el[j], el[i] - el[i].index = i - el[j].index = j -} - -func (el *EventLog) Push(x any) { - n := len(*el) - item := x.(Event) - item.index = n - *el = append(*el, item) -} - -func (el *EventLog) Pop() any { - old := *el - n := len(old) - item := old[n-1] - old[n-1] = Event{} // avoid memory leak - item.index = -1 // for safety - *el = old[0 : n-1] - return item -} - -// scheduleEvent is an event indicating the scheduler should be run. -type scheduleEvent struct{} - -func (s *Simulator) Run() error { - defer close(s.c) - for s.eventLog.Len() > 0 { - event := heap.Pop(&s.eventLog).(Event) - if err := s.handleSimulatorEvent(event); err != nil { - return err - } - } - return nil -} - -func (s *Simulator) handleSimulatorEvent(event Event) error { +func (s *Simulator) handleSimulatorEvent(ctx *armadacontext.Context, event Event) error { s.time = event.time switch e := event.eventSequenceOrScheduleEvent.(type) { case *armadaevents.EventSequence: - if err := s.handleEventSequence(e); err != nil { + if err := s.handleEventSequence(ctx, e); err != nil { return err } case scheduleEvent: - if err := s.handleScheduleEvent(); err != nil { + if err := s.handleScheduleEvent(ctx); err != nil { return err } } return nil } -func (s *Simulator) handleScheduleEvent() error { +func (s *Simulator) handleScheduleEvent(ctx *armadacontext.Context) error { + // Schedule the next run of the scheduler, unless there are no more active jobTemplates. + // TODO: Make timeout configurable. 
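The ordering the simulator relies on when popping events (earliest simulated time first, ties broken by insertion sequence number, as in EventLog.Less above) can be reproduced with container/heap alone. The following is a small self-contained sketch of that ordering; the type and field names are illustrative rather than the patch's own.

package main

import (
	"container/heap"
	"fmt"
	"time"
)

type event struct {
	time time.Time
	seq  int
}

type eventLog []event

func (el eventLog) Len() int      { return len(el) }
func (el eventLog) Swap(i, j int) { el[i], el[j] = el[j], el[i] }
func (el eventLog) Less(i, j int) bool {
	// Earlier events first; equal timestamps fall back to insertion order.
	if el[i].time.Equal(el[j].time) {
		return el[i].seq < el[j].seq
	}
	return el[i].time.Before(el[j].time)
}
func (el *eventLog) Push(x any) { *el = append(*el, x.(event)) }
func (el *eventLog) Pop() any {
	old := *el
	n := len(old)
	item := old[n-1]
	*el = old[:n-1]
	return item
}

func main() {
	t0 := time.Now()
	el := &eventLog{}
	heap.Init(el)
	heap.Push(el, event{time: t0.Add(10 * time.Second), seq: 0})
	heap.Push(el, event{time: t0, seq: 1})
	heap.Push(el, event{time: t0, seq: 2})
	for el.Len() > 0 {
		e := heap.Pop(el).(event)
		fmt.Println(e.time.Sub(t0), e.seq) // prints 0s 1, then 0s 2, then 10s 0
	}
}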
+ if len(s.activeJobTemplatesById) > 0 { + s.pushScheduleEvent(s.time.Add(10 * time.Second)) + } + if !s.shouldSchedule { + return nil + } + var eventSequences []*armadaevents.EventSequence txn := s.jobDb.WriteTxn() defer txn.Abort() - for _, pool := range s.testCase.Pools { - for i := range pool.ExecutorGroups { + for _, pool := range s.ClusterSpec.Pools { + for i := range pool.ClusterGroups { nodeDb := s.nodeDbByPoolAndExecutorGroup[pool.Name][i] if err := nodeDb.Reset(); err != nil { return err @@ -427,14 +418,16 @@ func (s *Simulator) handleScheduleEvent() error { s.limiter, totalResources, ) + sctx.Started = s.time - for _, queue := range s.testCase.Queues { + for _, queue := range s.WorkloadSpec.Queues { limiter, ok := s.limiterByQueue[queue.Name] if !ok { limiter = rate.NewLimiter( rate.Limit(s.schedulingConfig.MaximumPerQueueSchedulingRate), s.schedulingConfig.MaximumPerQueueSchedulingBurst, ) + limiter.SetBurstAt(s.time, s.schedulingConfig.MaximumPerQueueSchedulingBurst) s.limiterByQueue[queue.Name] = limiter } err := sctx.AddQueueSchedulingContext( @@ -450,7 +443,7 @@ func (s *Simulator) handleScheduleEvent() error { constraints := schedulerconstraints.SchedulingConstraintsFromSchedulingConfig( pool.Name, totalResources, - // Minimum job size not not used for simulation; use taints/tolerations instead. + // Minimum job size not used for simulation; use taints/tolerations instead. schedulerobjects.ResourceList{}, s.schedulingConfig, ) @@ -470,8 +463,14 @@ func (s *Simulator) handleScheduleEvent() error { if s.schedulingConfig.EnableNewPreemptionStrategy { sch.EnableNewPreemptionStrategy() } - ctx := armadacontext.Background() - result, err := sch.Schedule(ctx) + schedulerCtx := ctx + if s.SuppressSchedulerLogs { + schedulerCtx = &armadacontext.Context{ + Context: ctx.Context, + FieldLogger: nullLogger, + } + } + result, err := sch.Schedule(schedulerCtx) if err != nil { return err } @@ -480,6 +479,7 @@ func (s *Simulator) handleScheduleEvent() error { // Sort jobs to ensure deterministic event ordering. preemptedJobs := scheduler.PreemptedJobsFromSchedulerResult[*jobdb.Job](result) scheduledJobs := scheduler.ScheduledJobsFromSchedulerResult[*jobdb.Job](result) + failedJobs := scheduler.FailedJobsFromSchedulerResult[*jobdb.Job](result) less := func(a, b *jobdb.Job) bool { if a.Queue() < b.Queue() { return true @@ -495,6 +495,7 @@ func (s *Simulator) handleScheduleEvent() error { } slices.SortFunc(preemptedJobs, less) slices.SortFunc(scheduledJobs, less) + slices.SortFunc(failedJobs, less) for i, job := range preemptedJobs { if run := job.LatestRun(); run != nil { job = job.WithUpdatedRun(run.WithFailed(true)) @@ -514,12 +515,21 @@ func (s *Simulator) handleScheduleEvent() error { scheduledJobs[i] = job.WithQueued(false).WithNewRun(node.Executor, node.Id, node.Name) } } + for i, job := range failedJobs { + if run := job.LatestRun(); run != nil { + job = job.WithUpdatedRun(run.WithFailed(true)) + } + failedJobs[i] = job.WithQueued(false).WithFailed(true) + } if err := s.jobDb.Upsert(txn, preemptedJobs); err != nil { return err } if err := s.jobDb.Upsert(txn, scheduledJobs); err != nil { return err } + if err := s.jobDb.Upsert(txn, failedJobs); err != nil { + return err + } // Update allocation. 
s.allocationByPoolAndQueueAndPriorityClass[pool.Name] = sctx.AllocatedByQueueAndPriority() @@ -534,10 +544,16 @@ func (s *Simulator) handleScheduleEvent() error { if err != nil { return err } - eventSequences, err = scheduler.AppendEventSequencesFromUnschedulableJobs(eventSequences, result.FailedJobs, s.time) + eventSequences, err = scheduler.AppendEventSequencesFromUnschedulableJobs(eventSequences, failedJobs, s.time) if err != nil { return err } + + // If nothing changed, we're in steady state and can safely skip scheduling until something external has changed. + // Do this only if a non-zero amount of time has passed. + if !s.time.Equal(time.Time{}) && len(result.ScheduledJobs) == 0 && len(result.PreemptedJobs) == 0 && len(result.FailedJobs) == 0 { + s.shouldSchedule = false + } } } txn.Commit() @@ -546,17 +562,11 @@ func (s *Simulator) handleScheduleEvent() error { for _, eventSequence := range eventSequences { s.pushEventSequence(eventSequence) } - - // Schedule the next run of the scheduler, unless there are no more active jobTemplates. - // TODO: Make timeout configurable. - if len(s.activeJobTemplatesById) > 0 { - s.pushScheduleEvent(s.time.Add(10 * time.Second)) - } return nil } // TODO: Write events to disk unless they should be discarded. -func (s *Simulator) handleEventSequence(es *armadaevents.EventSequence) error { +func (s *Simulator) handleEventSequence(ctx *armadacontext.Context, es *armadaevents.EventSequence) error { txn := s.jobDb.WriteTxn() defer txn.Abort() eventsToPublish := make([]*armadaevents.EventSequence_Event, 0, len(es.Events)) @@ -565,12 +575,15 @@ func (s *Simulator) handleEventSequence(es *armadaevents.EventSequence) error { var err error = nil switch eventType := event.GetEvent().(type) { case *armadaevents.EventSequence_Event_SubmitJob: + s.shouldSchedule = true ok, err = s.handleSubmitJob(txn, event.GetSubmitJob(), *event.Created, es) case *armadaevents.EventSequence_Event_JobRunLeased: ok, err = s.handleJobRunLeased(txn, event.GetJobRunLeased()) case *armadaevents.EventSequence_Event_JobSucceeded: + s.shouldSchedule = true ok, err = s.handleJobSucceeded(txn, event.GetJobSucceeded()) case *armadaevents.EventSequence_Event_JobRunPreempted: + s.shouldSchedule = true ok, err = s.handleJobRunPreempted(txn, event.GetJobRunPreempted()) case *armadaevents.EventSequence_Event_ReprioritisedJob, *armadaevents.EventSequence_Event_JobDuplicateDetected, @@ -589,10 +602,10 @@ func (s *Simulator) handleEventSequence(es *armadaevents.EventSequence) error { *armadaevents.EventSequence_Event_CancelJob, *armadaevents.EventSequence_Event_CancelJobSet: // These events can be safely ignored. - log.Debugf("Ignoring event type %T", event) + ctx.Debugf("Ignoring event type %T", event) default: // This is an event type we haven't consider; log a warning. - log.Warnf("Ignoring unknown event type %T", eventType) + return errors.Errorf("received unknown event type %T", eventType) } if err != nil { return err @@ -604,7 +617,9 @@ func (s *Simulator) handleEventSequence(es *armadaevents.EventSequence) error { txn.Commit() es.Events = eventsToPublish if len(es.Events) > 0 { - s.c <- es + for _, c := range s.outputs { + c <- es + } } return nil } @@ -636,12 +651,13 @@ func (s *Simulator) handleSubmitJob(txn *jobdb.Txn, e *armadaevents.SubmitJob, t func (s *Simulator) handleJobRunLeased(txn *jobdb.Txn, e *armadaevents.JobRunLeased) (bool, error) { jobId := armadaevents.UlidFromProtoUuid(e.JobId).String() job := s.jobDb.GetById(txn, jobId) - // TODO: Randomise runtime. 
jobTemplate := s.jobTemplateByJobId[jobId] if jobTemplate == nil { return false, errors.Errorf("no jobTemplate associated with job %s", jobId) } - jobSuccessTime := s.time.Add(time.Duration(jobTemplate.RuntimeMean) * time.Second) + jobSuccessTime := s.time + jobSuccessTime = jobSuccessTime.Add(s.generateRandomShiftedExponentialDuration(s.ClusterSpec.PendingDelayDistribution)) + jobSuccessTime = jobSuccessTime.Add(s.generateRandomShiftedExponentialDuration(jobTemplate.RuntimeDistribution)) s.pushEventSequence( &armadaevents.EventSequence{ Queue: job.Queue(), @@ -661,6 +677,18 @@ func (s *Simulator) handleJobRunLeased(txn *jobdb.Txn, e *armadaevents.JobRunLea return true, nil } +func (s *Simulator) generateRandomShiftedExponentialDuration(rv ShiftedExponential) time.Duration { + return generateRandomShiftedExponentialDuration(s.rand, rv) +} + +func generateRandomShiftedExponentialDuration(r *rand.Rand, rv ShiftedExponential) time.Duration { + if rv.TailMean == 0 { + return rv.Minimum + } else { + return rv.Minimum + time.Duration(r.ExpFloat64()*float64(rv.TailMean)) + } +} + func (s *Simulator) handleJobSucceeded(txn *jobdb.Txn, e *armadaevents.JobSucceeded) (bool, error) { jobId := armadaevents.UlidFromProtoUuid(e.JobId).String() job := s.jobDb.GetById(txn, jobId) @@ -675,7 +703,10 @@ func (s *Simulator) handleJobSucceeded(txn *jobdb.Txn, e *armadaevents.JobSuccee // Subtract the allocation of this job from the queue allocation. run := job.LatestRun() pool := s.poolByNodeId[run.NodeId()] - s.allocationByPoolAndQueueAndPriorityClass[pool][job.Queue()].SubV1ResourceList(job.GetPriorityClassName(), job.GetResourceRequirements().Requests) + s.allocationByPoolAndQueueAndPriorityClass[pool][job.Queue()].SubV1ResourceList( + job.GetPriorityClassName(), + job.GetResourceRequirements().Requests, + ) // Unbind the job from the node on which it was scheduled. if err := s.unbindRunningJob(job); err != nil { @@ -704,7 +735,8 @@ func (s *Simulator) handleJobSucceeded(txn *jobdb.Txn, e *armadaevents.JobSuccee eventSequence.Events = append( eventSequence.Events, &armadaevents.EventSequence_Event{ - Created: pointer(maxTime(s.time, dependentJobTemplate.MinSubmitTime)), + // EarliestSubmitTimeFromDependencyCompletion must be positive + Created: pointer(maxTime(time.Time{}.Add(dependentJobTemplate.EarliestSubmitTime), s.time.Add(dependentJobTemplate.EarliestSubmitTimeFromDependencyCompletion))), Event: &armadaevents.EventSequence_Event_SubmitJob{ SubmitJob: submitJobFromJobTemplate(jobId, dependentJobTemplate), }, @@ -759,13 +791,14 @@ func (s *Simulator) handleJobRunPreempted(txn *jobdb.Txn, e *armadaevents.JobRun // Submit a retry for this job. 
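Job completion times are now drawn from the shifted-exponential distributions above: a pending delay taken from the ClusterSpec plus a runtime taken from the JobTemplate. A runnable sketch of that sampling is shown below; the parameter values and seed are illustrative, and the local ShiftedExponential struct simply mirrors the Minimum/TailMean fields used by generateRandomShiftedExponentialDuration.

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// ShiftedExponential mirrors the simulator's distribution parameters:
// a fixed minimum plus an exponentially distributed tail with the given mean.
type ShiftedExponential struct {
	Minimum  time.Duration
	TailMean time.Duration
}

func sample(r *rand.Rand, rv ShiftedExponential) time.Duration {
	if rv.TailMean == 0 {
		return rv.Minimum // deterministic when no tail is configured
	}
	return rv.Minimum + time.Duration(r.ExpFloat64()*float64(rv.TailMean))
}

func main() {
	// Fixed seed, so repeated runs draw the same durations (cf. the workload's random_seed).
	r := rand.New(rand.NewSource(42))
	runtime := ShiftedExponential{Minimum: 30 * time.Second, TailMean: 10 * time.Second}
	d := sample(r, runtime)
	fmt.Println(d) // always >= 30s; the long-run mean is Minimum + TailMean = 40s
}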
jobTemplate := s.jobTemplateByJobId[job.GetId()] retryJobId := util.ULID() + resubmitTime := s.time.Add(s.generateRandomShiftedExponentialDuration(s.ClusterSpec.WorkflowManagerDelayDistribution)) s.pushEventSequence( &armadaevents.EventSequence{ Queue: job.Queue(), JobSetName: job.Jobset(), Events: []*armadaevents.EventSequence_Event{ { - Created: &s.time, + Created: &resubmitTime, Event: &armadaevents.EventSequence_Event_SubmitJob{ SubmitJob: submitJobFromJobTemplate(retryJobId, jobTemplate), }, @@ -777,110 +810,6 @@ func (s *Simulator) handleJobRunPreempted(txn *jobdb.Txn, e *armadaevents.JobRun return true, nil } -// func (a *App) TestPattern(ctx *context.Context, pattern string) (*TestSuiteReport, error) { -// testSpecs, err := TestSpecsFromPattern(pattern) -// if err != nil { -// return nil, err -// } -// return a.RunTests(ctx, testSpecs) -// } - -func SchedulingConfigsFromPattern(pattern string) ([]configuration.SchedulingConfig, error) { - filePaths, err := zglob.Glob(pattern) - if err != nil { - return nil, errors.WithStack(err) - } - return SchedulingConfigsFromFilePaths(filePaths) -} - -func SchedulingConfigsFromFilePaths(filePaths []string) ([]configuration.SchedulingConfig, error) { - rv := make([]configuration.SchedulingConfig, len(filePaths)) - for i, filePath := range filePaths { - config, err := SchedulingConfigFromFilePath(filePath) - if err != nil { - return nil, err - } - rv[i] = config - } - return rv, nil -} - -func SchedulingConfigFromFilePath(filePath string) (configuration.SchedulingConfig, error) { - config := configuration.SchedulingConfig{} - v := viper.New() - v.SetConfigFile(filePath) - if err := v.ReadInConfig(); err != nil { - return config, errors.WithStack(err) - } - if err := v.Unmarshal(&config, commonconfig.CustomHooks...); err != nil { - return config, errors.WithStack(err) - } - return config, nil -} - -func TestCasesFromPattern(pattern string) ([]*TestCase, error) { - filePaths, err := zglob.Glob(pattern) - if err != nil { - return nil, errors.WithStack(err) - } - return TestCasesFromFilePaths(filePaths) -} - -func TestCasesFromFilePaths(filePaths []string) ([]*TestCase, error) { - rv := make([]*TestCase, len(filePaths)) - for i, filePath := range filePaths { - testCase, err := TestCaseFromFilePath(filePath) - if err != nil { - return nil, err - } - rv[i] = testCase - } - return rv, nil -} - -func TestCaseFromFilePath(filePath string) (*TestCase, error) { - yamlBytes, err := os.ReadFile(filePath) - if err != nil { - return nil, errors.WithStack(err) - } - if len(yamlBytes) == 0 { - return nil, errors.Errorf("%s does not exist or is empty", filePath) - } - testCase, err := TestCaseFromBytes(yamlBytes) - if err != nil { - return nil, err - } - - // If no test name is provided, set it to be the filename. - if testCase.Name == "" { - fileName := filepath.Base(filePath) - fileName = strings.TrimSuffix(fileName, filepath.Ext(fileName)) - testCase.Name = fileName - } - - // Generate random ids for any job templates without an explicitly set id. - for i, queue := range testCase.Queues { - for j, jobTemplate := range queue.JobTemplates { - if jobTemplate.Id == "" { - jobTemplate.Id = shortuuid.New() - } - queue.JobTemplates[j] = jobTemplate - } - testCase.Queues[i] = queue - } - - return testCase, nil -} - -// TestCaseFromBytes unmarshalls bytes into a TestCase. 
-func TestCaseFromBytes(yamlBytes []byte) (*TestCase, error) { - var testCase TestCase - if err := yaml.NewYAMLOrJSONDecoder(bytes.NewReader(yamlBytes), 128).Decode(&testCase); err != nil { - return nil, errors.WithStack(err) - } - return &testCase, nil -} - func maxTime(a, b time.Time) time.Time { if a.Before(b) { return b diff --git a/internal/scheduler/simulator/simulator.pb.go b/internal/scheduler/simulator/simulator.pb.go index fdf830cd63c..0013d40f455 100644 --- a/internal/scheduler/simulator/simulator.pb.go +++ b/internal/scheduler/simulator/simulator.pb.go @@ -7,6 +7,7 @@ import ( encoding_binary "encoding/binary" fmt "fmt" schedulerobjects "github.com/armadaproject/armada/internal/scheduler/schedulerobjects" + _ "github.com/armadaproject/armada/pkg/armadaevents" _ "github.com/gogo/protobuf/gogoproto" proto "github.com/gogo/protobuf/proto" _ "github.com/gogo/protobuf/types" @@ -30,29 +31,25 @@ var _ = time.Kitchen // proto package needs to be updated. const _ = proto.GoGoProtoPackageIsVersion3 // please upgrade the proto package -// TODO: -// Runtime family. -// Workflow manager delay. -// Job pending delay. -type TestCase struct { - Name string `protobuf:"bytes,1,opt,name=name,proto3" json:"name,omitempty"` - RandomSeed int64 `protobuf:"varint,2,opt,name=random_seed,json=randomSeed,proto3" json:"randomSeed,omitempty"` - Pools []*Pool `protobuf:"bytes,3,rep,name=pools,proto3" json:"pools,omitempty"` - Queues []Queue `protobuf:"bytes,4,rep,name=queues,proto3" json:"queues"` -} - -func (m *TestCase) Reset() { *m = TestCase{} } -func (m *TestCase) String() string { return proto.CompactTextString(m) } -func (*TestCase) ProtoMessage() {} -func (*TestCase) Descriptor() ([]byte, []int) { +type ClusterSpec struct { + Name string `protobuf:"bytes,1,opt,name=name,proto3" json:"name,omitempty"` + Pools []*Pool `protobuf:"bytes,2,rep,name=pools,proto3" json:"pools,omitempty"` + WorkflowManagerDelayDistribution ShiftedExponential `protobuf:"bytes,3,opt,name=workflow_manager_delay_distribution,json=workflowManagerDelayDistribution,proto3" json:"workflowManagerDelayDistribution"` + PendingDelayDistribution ShiftedExponential `protobuf:"bytes,4,opt,name=pending_delay_distribution,json=pendingDelayDistribution,proto3" json:"pendingDelayDistribution"` +} + +func (m *ClusterSpec) Reset() { *m = ClusterSpec{} } +func (m *ClusterSpec) String() string { return proto.CompactTextString(m) } +func (*ClusterSpec) ProtoMessage() {} +func (*ClusterSpec) Descriptor() ([]byte, []int) { return fileDescriptor_63baccdfe9127510, []int{0} } -func (m *TestCase) XXX_Unmarshal(b []byte) error { +func (m *ClusterSpec) XXX_Unmarshal(b []byte) error { return m.Unmarshal(b) } -func (m *TestCase) XXX_Marshal(b []byte, deterministic bool) ([]byte, error) { +func (m *ClusterSpec) XXX_Marshal(b []byte, deterministic bool) ([]byte, error) { if deterministic { - return xxx_messageInfo_TestCase.Marshal(b, m, deterministic) + return xxx_messageInfo_ClusterSpec.Marshal(b, m, deterministic) } else { b = b[:cap(b)] n, err := m.MarshalToSizedBuffer(b) @@ -62,40 +59,100 @@ func (m *TestCase) XXX_Marshal(b []byte, deterministic bool) ([]byte, error) { return b[:n], nil } } -func (m *TestCase) XXX_Merge(src proto.Message) { - xxx_messageInfo_TestCase.Merge(m, src) +func (m *ClusterSpec) XXX_Merge(src proto.Message) { + xxx_messageInfo_ClusterSpec.Merge(m, src) } -func (m *TestCase) XXX_Size() int { +func (m *ClusterSpec) XXX_Size() int { return m.Size() } -func (m *TestCase) XXX_DiscardUnknown() { - 
xxx_messageInfo_TestCase.DiscardUnknown(m) +func (m *ClusterSpec) XXX_DiscardUnknown() { + xxx_messageInfo_ClusterSpec.DiscardUnknown(m) } -var xxx_messageInfo_TestCase proto.InternalMessageInfo +var xxx_messageInfo_ClusterSpec proto.InternalMessageInfo -func (m *TestCase) GetName() string { +func (m *ClusterSpec) GetName() string { if m != nil { return m.Name } return "" } -func (m *TestCase) GetRandomSeed() int64 { +func (m *ClusterSpec) GetPools() []*Pool { if m != nil { - return m.RandomSeed + return m.Pools } - return 0 + return nil } -func (m *TestCase) GetPools() []*Pool { +func (m *ClusterSpec) GetWorkflowManagerDelayDistribution() ShiftedExponential { if m != nil { - return m.Pools + return m.WorkflowManagerDelayDistribution } - return nil + return ShiftedExponential{} +} + +func (m *ClusterSpec) GetPendingDelayDistribution() ShiftedExponential { + if m != nil { + return m.PendingDelayDistribution + } + return ShiftedExponential{} +} + +type WorkloadSpec struct { + Name string `protobuf:"bytes,1,opt,name=name,proto3" json:"name,omitempty"` + RandomSeed int64 `protobuf:"varint,2,opt,name=random_seed,json=randomSeed,proto3" json:"randomSeed,omitempty"` + Queues []*Queue `protobuf:"bytes,3,rep,name=queues,proto3" json:"queues,omitempty"` +} + +func (m *WorkloadSpec) Reset() { *m = WorkloadSpec{} } +func (m *WorkloadSpec) String() string { return proto.CompactTextString(m) } +func (*WorkloadSpec) ProtoMessage() {} +func (*WorkloadSpec) Descriptor() ([]byte, []int) { + return fileDescriptor_63baccdfe9127510, []int{1} +} +func (m *WorkloadSpec) XXX_Unmarshal(b []byte) error { + return m.Unmarshal(b) +} +func (m *WorkloadSpec) XXX_Marshal(b []byte, deterministic bool) ([]byte, error) { + if deterministic { + return xxx_messageInfo_WorkloadSpec.Marshal(b, m, deterministic) + } else { + b = b[:cap(b)] + n, err := m.MarshalToSizedBuffer(b) + if err != nil { + return nil, err + } + return b[:n], nil + } +} +func (m *WorkloadSpec) XXX_Merge(src proto.Message) { + xxx_messageInfo_WorkloadSpec.Merge(m, src) +} +func (m *WorkloadSpec) XXX_Size() int { + return m.Size() +} +func (m *WorkloadSpec) XXX_DiscardUnknown() { + xxx_messageInfo_WorkloadSpec.DiscardUnknown(m) +} + +var xxx_messageInfo_WorkloadSpec proto.InternalMessageInfo + +func (m *WorkloadSpec) GetName() string { + if m != nil { + return m.Name + } + return "" } -func (m *TestCase) GetQueues() []Queue { +func (m *WorkloadSpec) GetRandomSeed() int64 { + if m != nil { + return m.RandomSeed + } + return 0 +} + +func (m *WorkloadSpec) GetQueues() []*Queue { if m != nil { return m.Queues } @@ -103,15 +160,15 @@ func (m *TestCase) GetQueues() []Queue { } type Pool struct { - Name string `protobuf:"bytes,1,opt,name=name,proto3" json:"name,omitempty"` - ExecutorGroups []*ExecutorGroup `protobuf:"bytes,2,rep,name=executor_groups,json=executorGroups,proto3" json:"executorGroups,omitempty"` + Name string `protobuf:"bytes,1,opt,name=name,proto3" json:"name,omitempty"` + ClusterGroups []*ClusterGroup `protobuf:"bytes,2,rep,name=cluster_groups,json=clusterGroups,proto3" json:"clusterGroups,omitempty"` } func (m *Pool) Reset() { *m = Pool{} } func (m *Pool) String() string { return proto.CompactTextString(m) } func (*Pool) ProtoMessage() {} func (*Pool) Descriptor() ([]byte, []int) { - return fileDescriptor_63baccdfe9127510, []int{1} + return fileDescriptor_63baccdfe9127510, []int{2} } func (m *Pool) XXX_Unmarshal(b []byte) error { return m.Unmarshal(b) @@ -147,29 +204,29 @@ func (m *Pool) GetName() string { return "" } -func (m *Pool) 
GetExecutorGroups() []*ExecutorGroup { +func (m *Pool) GetClusterGroups() []*ClusterGroup { if m != nil { - return m.ExecutorGroups + return m.ClusterGroups } return nil } -type ExecutorGroup struct { - Executors []*Executor `protobuf:"bytes,1,rep,name=executors,proto3" json:"executors,omitempty"` +type ClusterGroup struct { + Clusters []*Cluster `protobuf:"bytes,1,rep,name=clusters,proto3" json:"clusters,omitempty"` } -func (m *ExecutorGroup) Reset() { *m = ExecutorGroup{} } -func (m *ExecutorGroup) String() string { return proto.CompactTextString(m) } -func (*ExecutorGroup) ProtoMessage() {} -func (*ExecutorGroup) Descriptor() ([]byte, []int) { - return fileDescriptor_63baccdfe9127510, []int{2} +func (m *ClusterGroup) Reset() { *m = ClusterGroup{} } +func (m *ClusterGroup) String() string { return proto.CompactTextString(m) } +func (*ClusterGroup) ProtoMessage() {} +func (*ClusterGroup) Descriptor() ([]byte, []int) { + return fileDescriptor_63baccdfe9127510, []int{3} } -func (m *ExecutorGroup) XXX_Unmarshal(b []byte) error { +func (m *ClusterGroup) XXX_Unmarshal(b []byte) error { return m.Unmarshal(b) } -func (m *ExecutorGroup) XXX_Marshal(b []byte, deterministic bool) ([]byte, error) { +func (m *ClusterGroup) XXX_Marshal(b []byte, deterministic bool) ([]byte, error) { if deterministic { - return xxx_messageInfo_ExecutorGroup.Marshal(b, m, deterministic) + return xxx_messageInfo_ClusterGroup.Marshal(b, m, deterministic) } else { b = b[:cap(b)] n, err := m.MarshalToSizedBuffer(b) @@ -179,42 +236,42 @@ func (m *ExecutorGroup) XXX_Marshal(b []byte, deterministic bool) ([]byte, error return b[:n], nil } } -func (m *ExecutorGroup) XXX_Merge(src proto.Message) { - xxx_messageInfo_ExecutorGroup.Merge(m, src) +func (m *ClusterGroup) XXX_Merge(src proto.Message) { + xxx_messageInfo_ClusterGroup.Merge(m, src) } -func (m *ExecutorGroup) XXX_Size() int { +func (m *ClusterGroup) XXX_Size() int { return m.Size() } -func (m *ExecutorGroup) XXX_DiscardUnknown() { - xxx_messageInfo_ExecutorGroup.DiscardUnknown(m) +func (m *ClusterGroup) XXX_DiscardUnknown() { + xxx_messageInfo_ClusterGroup.DiscardUnknown(m) } -var xxx_messageInfo_ExecutorGroup proto.InternalMessageInfo +var xxx_messageInfo_ClusterGroup proto.InternalMessageInfo -func (m *ExecutorGroup) GetExecutors() []*Executor { +func (m *ClusterGroup) GetClusters() []*Cluster { if m != nil { - return m.Executors + return m.Clusters } return nil } -type Executor struct { +type Cluster struct { Name string `protobuf:"bytes,1,opt,name=name,proto3" json:"name,omitempty"` NodeTemplates []*NodeTemplate `protobuf:"bytes,2,rep,name=node_templates,json=nodeTemplates,proto3" json:"nodeTemplates,omitempty"` } -func (m *Executor) Reset() { *m = Executor{} } -func (m *Executor) String() string { return proto.CompactTextString(m) } -func (*Executor) ProtoMessage() {} -func (*Executor) Descriptor() ([]byte, []int) { - return fileDescriptor_63baccdfe9127510, []int{3} +func (m *Cluster) Reset() { *m = Cluster{} } +func (m *Cluster) String() string { return proto.CompactTextString(m) } +func (*Cluster) ProtoMessage() {} +func (*Cluster) Descriptor() ([]byte, []int) { + return fileDescriptor_63baccdfe9127510, []int{4} } -func (m *Executor) XXX_Unmarshal(b []byte) error { +func (m *Cluster) XXX_Unmarshal(b []byte) error { return m.Unmarshal(b) } -func (m *Executor) XXX_Marshal(b []byte, deterministic bool) ([]byte, error) { +func (m *Cluster) XXX_Marshal(b []byte, deterministic bool) ([]byte, error) { if deterministic { - return xxx_messageInfo_Executor.Marshal(b, m, 
deterministic) + return xxx_messageInfo_Cluster.Marshal(b, m, deterministic) } else { b = b[:cap(b)] n, err := m.MarshalToSizedBuffer(b) @@ -224,26 +281,26 @@ func (m *Executor) XXX_Marshal(b []byte, deterministic bool) ([]byte, error) { return b[:n], nil } } -func (m *Executor) XXX_Merge(src proto.Message) { - xxx_messageInfo_Executor.Merge(m, src) +func (m *Cluster) XXX_Merge(src proto.Message) { + xxx_messageInfo_Cluster.Merge(m, src) } -func (m *Executor) XXX_Size() int { +func (m *Cluster) XXX_Size() int { return m.Size() } -func (m *Executor) XXX_DiscardUnknown() { - xxx_messageInfo_Executor.DiscardUnknown(m) +func (m *Cluster) XXX_DiscardUnknown() { + xxx_messageInfo_Cluster.DiscardUnknown(m) } -var xxx_messageInfo_Executor proto.InternalMessageInfo +var xxx_messageInfo_Cluster proto.InternalMessageInfo -func (m *Executor) GetName() string { +func (m *Cluster) GetName() string { if m != nil { return m.Name } return "" } -func (m *Executor) GetNodeTemplates() []*NodeTemplate { +func (m *Cluster) GetNodeTemplates() []*NodeTemplate { if m != nil { return m.NodeTemplates } @@ -261,7 +318,7 @@ func (m *NodeTemplate) Reset() { *m = NodeTemplate{} } func (m *NodeTemplate) String() string { return proto.CompactTextString(m) } func (*NodeTemplate) ProtoMessage() {} func (*NodeTemplate) Descriptor() ([]byte, []int) { - return fileDescriptor_63baccdfe9127510, []int{4} + return fileDescriptor_63baccdfe9127510, []int{5} } func (m *NodeTemplate) XXX_Unmarshal(b []byte) error { return m.Unmarshal(b) @@ -328,7 +385,7 @@ func (m *Queue) Reset() { *m = Queue{} } func (m *Queue) String() string { return proto.CompactTextString(m) } func (*Queue) ProtoMessage() {} func (*Queue) Descriptor() ([]byte, []int) { - return fileDescriptor_63baccdfe9127510, []int{5} + return fileDescriptor_63baccdfe9127510, []int{6} } func (m *Queue) XXX_Unmarshal(b []byte) error { return m.Unmarshal(b) @@ -387,27 +444,36 @@ type JobTemplate struct { // Queue to which this template belongs. Populated automatically. Queue string `protobuf:"bytes,3,opt,name=queue,proto3" json:"queue,omitempty"` // Unique id for this template. An id is generated if empty. - Id string `protobuf:"bytes,4,opt,name=id,proto3" json:"id,omitempty"` - JobSet string `protobuf:"bytes,5,opt,name=job_set,json=jobSet,proto3" json:"jobSet,omitempty"` - QueuePriority uint32 `protobuf:"varint,6,opt,name=queue_priority,json=queuePriority,proto3" json:"queuePriority,omitempty"` - PriorityClassName string `protobuf:"bytes,7,opt,name=priority_class_name,json=priorityClassName,proto3" json:"priorityClassName,omitempty"` - Requirements schedulerobjects.PodRequirements `protobuf:"bytes,8,opt,name=requirements,proto3" json:"requirements"` + Id string `protobuf:"bytes,4,opt,name=id,proto3" json:"id,omitempty"` + JobSet string `protobuf:"bytes,5,opt,name=job_set,json=jobSet,proto3" json:"jobSet,omitempty"` + QueuePriority uint32 `protobuf:"varint,6,opt,name=queue_priority,json=queuePriority,proto3" json:"queuePriority,omitempty"` + PriorityClassName string `protobuf:"bytes,7,opt,name=priority_class_name,json=priorityClassName,proto3" json:"priorityClassName,omitempty"` + // Scheduling requirements for the pod embedded in the job. + Requirements schedulerobjects.PodRequirements `protobuf:"bytes,8,opt,name=requirements,proto3" json:"requirements"` // List of template ids that must be completed before this template is submitted. 
Dependencies []string `protobuf:"bytes,9,rep,name=dependencies,proto3" json:"dependencies,omitempty"` - // Minimum time from which jobs are created from this template. - MinSubmitTime time.Time `protobuf:"bytes,10,opt,name=min_submit_time,json=minSubmitTime,proto3,stdtime" json:"minSubmitTime"` - // Job runtime mean in seconds. - RuntimeMean int64 `protobuf:"varint,11,opt,name=runtime_mean,json=runtimeMean,proto3" json:"runtimeMean,omitempty"` - // Job runtime variance in seconds squared. - // If zero, runtime is deterministic. - RuntimeVariance int64 `protobuf:"varint,12,opt,name=runtime_variance,json=runtimeVariance,proto3" json:"runtimeVariance,omitempty"` + // Earliest time at which jobs from this template are submitted. + // Measured from the start of the simulation. + EarliestSubmitTime time.Duration `protobuf:"bytes,10,opt,name=earliest_submit_time,json=earliestSubmitTime,proto3,stdduration" json:"earliestSubmitTime"` + // Earliest time job can be submitted from when all its dependencies have completed. + // This option is meant to model thinking or processing time, where some fixed amount of time + // needs to be spent between dependencies completing and the next batch of jobs being ready to submit. + EarliestSubmitTimeFromDependencyCompletion time.Duration `protobuf:"bytes,11,opt,name=earliest_submit_time_from_dependency_completion,json=earliestSubmitTimeFromDependencyCompletion,proto3,stdduration" json:"earliestSubmitTimeFromDependencyCompletion"` + // Job runtimes are assumed to follow a shifted exponential distribution + // i.e., to be a fixed constant (runtime_minimum) plus a random amount of time + // drawn from an exponential distribution with known mean (runtime_tail_mean). + // + // The shifted-exponential distribution strikes a good balance between simplicity and accuracy; + // see https://bora.uib.no/bora-xmlui/bitstream/handle/11250/3014726/drthesis_2022_severinson.pdf?sequence=2 + // for a discussion on the topic. 
+ RuntimeDistribution ShiftedExponential `protobuf:"bytes,12,opt,name=runtime_distribution,json=runtimeDistribution,proto3" json:"runtimeDistribution"` } func (m *JobTemplate) Reset() { *m = JobTemplate{} } func (m *JobTemplate) String() string { return proto.CompactTextString(m) } func (*JobTemplate) ProtoMessage() {} func (*JobTemplate) Descriptor() ([]byte, []int) { - return fileDescriptor_63baccdfe9127510, []int{6} + return fileDescriptor_63baccdfe9127510, []int{7} } func (m *JobTemplate) XXX_Unmarshal(b []byte) error { return m.Unmarshal(b) @@ -499,36 +565,90 @@ func (m *JobTemplate) GetDependencies() []string { return nil } -func (m *JobTemplate) GetMinSubmitTime() time.Time { +func (m *JobTemplate) GetEarliestSubmitTime() time.Duration { + if m != nil { + return m.EarliestSubmitTime + } + return 0 +} + +func (m *JobTemplate) GetEarliestSubmitTimeFromDependencyCompletion() time.Duration { + if m != nil { + return m.EarliestSubmitTimeFromDependencyCompletion + } + return 0 +} + +func (m *JobTemplate) GetRuntimeDistribution() ShiftedExponential { if m != nil { - return m.MinSubmitTime + return m.RuntimeDistribution + } + return ShiftedExponential{} +} + +type ShiftedExponential struct { + Minimum time.Duration `protobuf:"bytes,1,opt,name=minimum,proto3,stdduration" json:"minimum"` + TailMean time.Duration `protobuf:"bytes,2,opt,name=tail_mean,json=tailMean,proto3,stdduration" json:"tailMean"` +} + +func (m *ShiftedExponential) Reset() { *m = ShiftedExponential{} } +func (m *ShiftedExponential) String() string { return proto.CompactTextString(m) } +func (*ShiftedExponential) ProtoMessage() {} +func (*ShiftedExponential) Descriptor() ([]byte, []int) { + return fileDescriptor_63baccdfe9127510, []int{8} +} +func (m *ShiftedExponential) XXX_Unmarshal(b []byte) error { + return m.Unmarshal(b) +} +func (m *ShiftedExponential) XXX_Marshal(b []byte, deterministic bool) ([]byte, error) { + if deterministic { + return xxx_messageInfo_ShiftedExponential.Marshal(b, m, deterministic) + } else { + b = b[:cap(b)] + n, err := m.MarshalToSizedBuffer(b) + if err != nil { + return nil, err + } + return b[:n], nil } - return time.Time{} +} +func (m *ShiftedExponential) XXX_Merge(src proto.Message) { + xxx_messageInfo_ShiftedExponential.Merge(m, src) +} +func (m *ShiftedExponential) XXX_Size() int { + return m.Size() +} +func (m *ShiftedExponential) XXX_DiscardUnknown() { + xxx_messageInfo_ShiftedExponential.DiscardUnknown(m) } -func (m *JobTemplate) GetRuntimeMean() int64 { +var xxx_messageInfo_ShiftedExponential proto.InternalMessageInfo + +func (m *ShiftedExponential) GetMinimum() time.Duration { if m != nil { - return m.RuntimeMean + return m.Minimum } return 0 } -func (m *JobTemplate) GetRuntimeVariance() int64 { +func (m *ShiftedExponential) GetTailMean() time.Duration { if m != nil { - return m.RuntimeVariance + return m.TailMean } return 0 } func init() { - proto.RegisterType((*TestCase)(nil), "simulator.TestCase") + proto.RegisterType((*ClusterSpec)(nil), "simulator.ClusterSpec") + proto.RegisterType((*WorkloadSpec)(nil), "simulator.WorkloadSpec") proto.RegisterType((*Pool)(nil), "simulator.Pool") - proto.RegisterType((*ExecutorGroup)(nil), "simulator.ExecutorGroup") - proto.RegisterType((*Executor)(nil), "simulator.Executor") + proto.RegisterType((*ClusterGroup)(nil), "simulator.ClusterGroup") + proto.RegisterType((*Cluster)(nil), "simulator.Cluster") proto.RegisterType((*NodeTemplate)(nil), "simulator.NodeTemplate") proto.RegisterMapType((map[string]string)(nil), 
"simulator.NodeTemplate.LabelsEntry") proto.RegisterType((*Queue)(nil), "simulator.Queue") proto.RegisterType((*JobTemplate)(nil), "simulator.JobTemplate") + proto.RegisterType((*ShiftedExponential)(nil), "simulator.ShiftedExponential") } func init() { @@ -536,75 +656,87 @@ func init() { } var fileDescriptor_63baccdfe9127510 = []byte{ - // 1025 bytes of a gzipped FileDescriptorProto - 0x1f, 0x8b, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02, 0xff, 0x94, 0x56, 0x4f, 0x6f, 0xdb, 0x36, - 0x14, 0x8f, 0xe2, 0xc6, 0x8d, 0xe9, 0x3f, 0x49, 0x99, 0x2c, 0x51, 0xdc, 0xcd, 0xf2, 0x5c, 0x60, - 0xf0, 0x80, 0x54, 0x46, 0xbb, 0x4b, 0x16, 0x14, 0x03, 0xa6, 0xa2, 0xd8, 0x10, 0x74, 0x5d, 0xea, - 0x04, 0x1d, 0xb0, 0x60, 0x10, 0x68, 0xe9, 0xd5, 0x61, 0x22, 0x89, 0xaa, 0x48, 0x65, 0xcb, 0xa7, - 0x58, 0x4f, 0x3b, 0xed, 0x73, 0xec, 0x33, 0xf4, 0xd8, 0xe3, 0x4e, 0xda, 0x90, 0xdc, 0xf4, 0x29, - 0x06, 0x51, 0x54, 0x4c, 0x27, 0xdd, 0x90, 0x9d, 0x2c, 0xfe, 0xfe, 0x3c, 0x3e, 0x3e, 0xbf, 0x27, - 0x0a, 0x6d, 0xd3, 0x48, 0x40, 0x12, 0x91, 0x60, 0xc4, 0xbd, 0x63, 0xf0, 0xd3, 0x00, 0x92, 0x11, - 0xa7, 0x61, 0x1a, 0x10, 0xc1, 0xb4, 0x27, 0x3b, 0x4e, 0x98, 0x60, 0xb8, 0x71, 0x05, 0x74, 0xad, - 0x29, 0x63, 0xd3, 0x00, 0x46, 0x92, 0x98, 0xa4, 0xaf, 0x47, 0x82, 0x86, 0xc0, 0x05, 0x09, 0xe3, - 0x52, 0xdb, 0x1d, 0x9c, 0xee, 0x70, 0x9b, 0xb2, 0x11, 0x89, 0xe9, 0xc8, 0x63, 0x09, 0x8c, 0xce, - 0x1e, 0x8d, 0xa6, 0x10, 0x41, 0x42, 0x04, 0xf8, 0x4a, 0xf3, 0x70, 0x4a, 0xc5, 0x71, 0x3a, 0xb1, - 0x3d, 0x16, 0x8e, 0xa6, 0x6c, 0xca, 0x66, 0xd1, 0x8a, 0x95, 0x5c, 0xc8, 0x27, 0x25, 0xdf, 0xfd, - 0x50, 0xb2, 0xd5, 0x13, 0x9b, 0x9c, 0x80, 0x27, 0xf8, 0x0d, 0xa0, 0xf4, 0x0e, 0x2e, 0x0d, 0xb4, - 0x7c, 0x08, 0x5c, 0x3c, 0x25, 0x1c, 0xf0, 0x67, 0xe8, 0x4e, 0x44, 0x42, 0x30, 0x8d, 0xbe, 0x31, - 0x6c, 0x38, 0x38, 0xcf, 0xac, 0x4e, 0xb1, 0xde, 0x66, 0x21, 0x15, 0x10, 0xc6, 0xe2, 0x7c, 0x2c, - 0x79, 0xfc, 0x25, 0x6a, 0x26, 0x24, 0xf2, 0x59, 0xe8, 0x72, 0x00, 0xdf, 0x5c, 0xec, 0x1b, 0xc3, - 0x9a, 0x63, 0xe6, 0x99, 0xb5, 0x5e, 0xc2, 0x07, 0x00, 0xbe, 0x66, 0x42, 0x33, 0x14, 0xef, 0xa2, - 0xa5, 0x98, 0xb1, 0x80, 0x9b, 0xb5, 0x7e, 0x6d, 0xd8, 0x7c, 0xbc, 0x62, 0xcf, 0x6a, 0xb9, 0xcf, - 0x58, 0xe0, 0xac, 0xe5, 0x99, 0xb5, 0x22, 0x15, 0x5a, 0x80, 0xd2, 0x82, 0x77, 0x50, 0xfd, 0x4d, - 0x0a, 0x29, 0x70, 0xf3, 0x8e, 0x34, 0xaf, 0x6a, 0xe6, 0x97, 0x05, 0xe1, 0x74, 0xde, 0x65, 0xd6, - 0x42, 0x9e, 0x59, 0x4a, 0x37, 0x56, 0xbf, 0x83, 0x5f, 0x0d, 0x74, 0xa7, 0x08, 0x7f, 0xeb, 0x13, - 0xba, 0x68, 0x05, 0x7e, 0x01, 0x2f, 0x15, 0x2c, 0x71, 0xa7, 0x09, 0x4b, 0x63, 0x6e, 0x2e, 0xca, - 0x3d, 0x4d, 0x6d, 0xcf, 0x67, 0x4a, 0xf1, 0x4d, 0x21, 0x70, 0x3e, 0xce, 0x33, 0xcb, 0x04, 0x1d, - 0xd2, 0x8f, 0xd0, 0x99, 0x67, 0x06, 0x47, 0xa8, 0x3d, 0x67, 0xc7, 0x7b, 0xa8, 0x51, 0x49, 0xb8, - 0x69, 0xc8, 0xbd, 0xd6, 0x3e, 0xb0, 0x97, 0xb3, 0x99, 0x67, 0xd6, 0xda, 0x95, 0x52, 0xdb, 0x61, - 0x66, 0x2f, 0x8e, 0xbb, 0x5c, 0x19, 0x6e, 0x7d, 0xe4, 0x23, 0xd4, 0x89, 0x98, 0x0f, 0x6e, 0x01, - 0x06, 0x44, 0x40, 0x75, 0xe2, 0x4d, 0x2d, 0x8b, 0x17, 0xcc, 0x87, 0x43, 0xc5, 0x3b, 0xf7, 0xf3, - 0xcc, 0xda, 0x8c, 0x34, 0x44, 0xcf, 0xa6, 0x3d, 0x47, 0x0c, 0x7e, 0xab, 0xa1, 0x96, 0x6e, 0xc6, - 0xdb, 0xa8, 0x1e, 0xa5, 0xe1, 0x04, 0x12, 0x99, 0x57, 0xcd, 0x59, 0xcf, 0x33, 0x6b, 0xb5, 0x44, - 0xb4, 0x28, 0x4a, 0x83, 0xbf, 0x46, 0x75, 0x41, 0x68, 0x24, 0xaa, 0x9c, 0xb6, 0xec, 0x72, 0x8a, - 0x6c, 0x12, 0x53, 0xbb, 0x98, 0x22, 0xfb, 0xec, 0x91, 0x7d, 0x58, 0x28, 0x66, 0x2d, 0x50, 0x1a, - 0xc6, 0xea, 0x17, 0xbf, 0x44, 0xf5, 0x80, 0x4c, 0xe0, 0xaa, 0xf3, 0x1e, 0xfc, 0xcb, 0xb1, 0xec, - 0xe7, 0x52, 0xf5, 0x2c, 0x12, 
0xc9, 0x79, 0x99, 0x55, 0x69, 0xd3, 0xb3, 0x2a, 0x91, 0xa2, 0x49, - 0x04, 0x13, 0x24, 0x70, 0x13, 0xe0, 0x2c, 0x4d, 0x3c, 0xd9, 0x98, 0xc6, 0xb0, 0xf9, 0xb8, 0x67, - 0xdf, 0x98, 0xb6, 0xb1, 0x92, 0x3c, 0xa7, 0x5c, 0x38, 0x1b, 0x2a, 0xc7, 0x8e, 0xb4, 0x57, 0x14, - 0x1f, 0x5f, 0x5b, 0x77, 0x09, 0x6a, 0x6a, 0xd9, 0xe0, 0x07, 0xa8, 0x76, 0x0a, 0xe7, 0xea, 0x8f, - 0xbc, 0x97, 0x67, 0x56, 0xfb, 0x14, 0xce, 0xb5, 0xbc, 0x0a, 0x16, 0x7f, 0x8e, 0x96, 0xce, 0x48, - 0x90, 0x82, 0x9c, 0xca, 0x46, 0x39, 0x4f, 0x12, 0xd0, 0xe7, 0x49, 0x02, 0xbb, 0x8b, 0x3b, 0xc6, - 0xe0, 0x0f, 0x03, 0x2d, 0xc9, 0xd9, 0xb9, 0x75, 0x9f, 0x6c, 0xa3, 0xfa, 0xcf, 0x40, 0xa7, 0xc7, - 0x42, 0xee, 0x60, 0x94, 0x35, 0x2a, 0x11, 0xbd, 0x46, 0x25, 0x82, 0x7f, 0x40, 0xed, 0x13, 0x36, - 0xd1, 0x9a, 0xaa, 0xac, 0xfe, 0x86, 0x56, 0xfd, 0x3d, 0x36, 0xb9, 0xea, 0xa9, 0x6e, 0x9e, 0x59, - 0x1b, 0x27, 0x33, 0x40, 0x2f, 0x7b, 0x4b, 0xc7, 0x07, 0xbf, 0xd7, 0x51, 0x53, 0x73, 0xfe, 0xcf, - 0x86, 0xda, 0x43, 0x8a, 0x3b, 0x48, 0x3d, 0x0f, 0x38, 0x7f, 0x9d, 0x06, 0xea, 0x35, 0xd6, 0xcb, - 0x33, 0xab, 0x7b, 0x9d, 0xd3, 0x22, 0xdc, 0xf0, 0x15, 0x15, 0x97, 0xaf, 0x19, 0xb3, 0x36, 0xab, - 0xb8, 0x04, 0xf4, 0x8a, 0x4b, 0x00, 0xf7, 0xd1, 0x22, 0xf5, 0x65, 0x93, 0x34, 0x9c, 0xd5, 0x3c, - 0xb3, 0x5a, 0x54, 0x7f, 0x4f, 0x2e, 0x52, 0x1f, 0x3f, 0x44, 0x77, 0x8b, 0x7a, 0x71, 0x10, 0xe6, - 0x92, 0x94, 0xc9, 0x73, 0x9c, 0xb0, 0xc9, 0x01, 0xcc, 0x95, 0xb7, 0x44, 0xb0, 0x83, 0x3a, 0x32, - 0xb2, 0x1b, 0x27, 0x94, 0x25, 0x54, 0x9c, 0x9b, 0xf5, 0xbe, 0x31, 0x6c, 0x97, 0xb3, 0x29, 0x99, - 0x7d, 0x45, 0xe8, 0xb3, 0x39, 0x47, 0xe0, 0xef, 0xd1, 0x5a, 0xe5, 0x76, 0xbd, 0x80, 0x70, 0xee, - 0xca, 0x3e, 0xb8, 0x2b, 0xb7, 0xb7, 0xf2, 0xcc, 0xba, 0x5f, 0xd1, 0x4f, 0x0b, 0xf6, 0xc5, 0x7c, - 0x53, 0xdc, 0xbb, 0x41, 0xe2, 0x23, 0xd4, 0x4a, 0xe0, 0x4d, 0x4a, 0x13, 0x08, 0xa1, 0x98, 0xd9, - 0x65, 0x39, 0x14, 0x9f, 0xde, 0x1c, 0x8a, 0x7d, 0xe6, 0x8f, 0x35, 0xa1, 0xb3, 0xae, 0xe6, 0x62, - 0xce, 0x3e, 0x9e, 0x5b, 0xe1, 0xaf, 0x50, 0xcb, 0x87, 0x18, 0x22, 0x1f, 0x22, 0x8f, 0x02, 0x37, - 0x1b, 0xfd, 0xda, 0xb0, 0x51, 0xf6, 0x8d, 0x8e, 0xeb, 0x7d, 0xa3, 0xe3, 0xf8, 0x27, 0xb4, 0x12, - 0xd2, 0xc8, 0xe5, 0xe9, 0x24, 0xa4, 0xc2, 0x2d, 0x6e, 0x67, 0x13, 0xc9, 0xfc, 0xba, 0x76, 0x79, - 0x75, 0xdb, 0xd5, 0x65, 0x6b, 0x1f, 0x56, 0x57, 0xb7, 0xb3, 0xa5, 0x12, 0x6b, 0x87, 0x34, 0x3a, - 0x90, 0xce, 0x82, 0x7b, 0xfb, 0x97, 0x65, 0x8c, 0xe7, 0x21, 0xfc, 0x04, 0xb5, 0x92, 0x34, 0x2a, - 0xc2, 0xba, 0x21, 0x90, 0xc8, 0x6c, 0xca, 0xa6, 0xda, 0xca, 0x33, 0xeb, 0x23, 0x85, 0x7f, 0x07, - 0x24, 0xd2, 0xb2, 0x6b, 0x6a, 0x30, 0xfe, 0x16, 0xad, 0x56, 0xee, 0x33, 0x92, 0x50, 0x12, 0x79, - 0x60, 0xb6, 0x64, 0x84, 0x4f, 0xf2, 0xcc, 0xda, 0x52, 0xdc, 0x2b, 0x45, 0x69, 0x51, 0x56, 0xae, - 0x51, 0xce, 0xab, 0x77, 0x17, 0x3d, 0xe3, 0xfd, 0x45, 0xcf, 0xf8, 0xfb, 0xa2, 0x67, 0xbc, 0xbd, - 0xec, 0x2d, 0xbc, 0xbf, 0xec, 0x2d, 0xfc, 0x79, 0xd9, 0x5b, 0xf8, 0xf1, 0x89, 0xf6, 0x71, 0x41, - 0x92, 0x90, 0xf8, 0x24, 0x4e, 0x58, 0xf1, 0x7f, 0xa8, 0xd5, 0xe8, 0xbf, 0x3e, 0x7d, 0x26, 0x75, - 0x59, 0x9d, 0x2f, 0xfe, 0x09, 0x00, 0x00, 0xff, 0xff, 0x90, 0x45, 0x44, 0xb8, 0x21, 0x09, 0x00, - 0x00, -} - -func (m *TestCase) Marshal() (dAtA []byte, err error) { + // 1222 bytes of a gzipped FileDescriptorProto + 0x1f, 0x8b, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02, 0xff, 0x94, 0x56, 0x41, 0x6f, 0x13, 0x47, + 0x1b, 0xce, 0xc6, 0xc4, 0xc1, 0x63, 0x27, 0xc0, 0x24, 0x0a, 0x26, 0x08, 0xaf, 0x3f, 0x23, 0x7d, + 0x72, 0xab, 0xb0, 0x16, 0x54, 0xaa, 0x28, 0xaa, 0x90, 0xba, 0x04, 0x5a, 0x21, 0xa0, 0xe0, 0x20, + 0x90, 0xca, 0x61, 0x35, 0xde, 0x7d, 
0xe3, 0x4c, 0xbc, 0xbb, 0xb3, 0xcc, 0xce, 0x86, 0xfa, 0xd0, + 0x1f, 0x50, 0x55, 0xaa, 0x7a, 0x42, 0xfd, 0x11, 0xbd, 0xf4, 0xc2, 0x6f, 0x40, 0x3d, 0x71, 0xec, + 0x69, 0x5b, 0xc1, 0xcd, 0xbf, 0xa2, 0xda, 0x99, 0x59, 0x67, 0x8c, 0x03, 0x24, 0x27, 0x7b, 0x9e, + 0xe7, 0x7d, 0xde, 0x79, 0xe6, 0x9d, 0x77, 0x66, 0x07, 0x6d, 0xd1, 0x58, 0x00, 0x8f, 0x49, 0xd8, + 0x4b, 0xfd, 0x3d, 0x08, 0xb2, 0x10, 0x78, 0x2f, 0xa5, 0x51, 0x16, 0x12, 0xc1, 0x8c, 0x7f, 0x4e, + 0xc2, 0x99, 0x60, 0xb8, 0x36, 0x05, 0x36, 0x5b, 0x43, 0xc6, 0x86, 0x21, 0xf4, 0x24, 0x31, 0xc8, + 0x76, 0x7b, 0x41, 0xc6, 0x89, 0xa0, 0x2c, 0x56, 0xa1, 0x9b, 0x9d, 0xd1, 0xf5, 0xd4, 0xa1, 0xac, + 0x47, 0x12, 0xda, 0xf3, 0x19, 0x87, 0xde, 0xc1, 0xd5, 0xde, 0x10, 0x62, 0xe0, 0x44, 0x40, 0xa0, + 0x63, 0xae, 0x0c, 0xa9, 0xd8, 0xcb, 0x06, 0x8e, 0xcf, 0xa2, 0xde, 0x90, 0x0d, 0xd9, 0x61, 0xb2, + 0x62, 0x24, 0x07, 0xf2, 0x9f, 0x0e, 0xbf, 0x71, 0x94, 0xd7, 0xf2, 0x1f, 0x1b, 0xec, 0x83, 0x2f, + 0xd2, 0x39, 0x40, 0x6b, 0x2f, 0x25, 0xa3, 0x61, 0x8f, 0xf0, 0x88, 0x04, 0x04, 0x0e, 0x20, 0x16, + 0x69, 0x4f, 0xfd, 0x28, 0xba, 0xf3, 0x73, 0x05, 0xd5, 0x6f, 0x85, 0x59, 0x2a, 0x80, 0xef, 0x24, + 0xe0, 0xe3, 0xff, 0xa3, 0x53, 0x31, 0x89, 0xa0, 0x69, 0xb5, 0xad, 0x6e, 0xcd, 0xc5, 0x93, 0xdc, + 0x5e, 0x2d, 0xc6, 0x5b, 0x2c, 0xa2, 0x02, 0xa2, 0x44, 0x8c, 0xfb, 0x92, 0xc7, 0x37, 0xd0, 0x52, + 0xc2, 0x58, 0x98, 0x36, 0x17, 0xdb, 0x95, 0x6e, 0xfd, 0xda, 0x19, 0xe7, 0xb0, 0x62, 0x0f, 0x19, + 0x0b, 0xdd, 0xb5, 0x49, 0x6e, 0x9f, 0x91, 0x11, 0x86, 0x54, 0x49, 0xf0, 0x4b, 0x0b, 0x5d, 0x7e, + 0xc1, 0xf8, 0x68, 0x37, 0x64, 0x2f, 0xbc, 0x88, 0xc4, 0x64, 0x08, 0xdc, 0x0b, 0x20, 0x24, 0x63, + 0x2f, 0xa0, 0xa9, 0xe0, 0x74, 0x90, 0x15, 0xf5, 0x6c, 0x56, 0xda, 0x56, 0xb7, 0x7e, 0xed, 0x92, + 0x91, 0x7a, 0x67, 0x8f, 0xee, 0x0a, 0x08, 0x6e, 0xff, 0x98, 0xb0, 0x18, 0x62, 0x41, 0x49, 0xe8, + 0x76, 0x5f, 0xe7, 0xf6, 0xc2, 0x24, 0xb7, 0xdb, 0x65, 0xc6, 0xfb, 0x2a, 0xe1, 0x76, 0x91, 0x6f, + 0xdb, 0x48, 0xd7, 0xff, 0x64, 0x04, 0xfe, 0x09, 0x6d, 0x26, 0x10, 0x07, 0x34, 0x1e, 0x1e, 0x65, + 0xe7, 0xd4, 0x71, 0xec, 0xb4, 0xb5, 0x9d, 0xa6, 0x4e, 0x34, 0x6f, 0xe3, 0x83, 0x4c, 0xe7, 0x4f, + 0x0b, 0x35, 0x9e, 0x32, 0x3e, 0x0a, 0x19, 0x09, 0x4e, 0xb4, 0x19, 0x5f, 0xa1, 0x3a, 0x27, 0x71, + 0xc0, 0x22, 0x2f, 0x05, 0x08, 0x9a, 0x8b, 0x6d, 0xab, 0x5b, 0x71, 0x9b, 0x93, 0xdc, 0x5e, 0x57, + 0xf0, 0x0e, 0x40, 0x60, 0x88, 0xd0, 0x21, 0x8a, 0x6f, 0xa2, 0xea, 0xf3, 0x0c, 0x32, 0x48, 0x9b, + 0x15, 0xb9, 0x91, 0x67, 0x8d, 0xe5, 0x3d, 0x2a, 0x08, 0x77, 0x7d, 0x92, 0xdb, 0x67, 0x55, 0x8c, + 0x91, 0x43, 0xab, 0x3a, 0xbf, 0x58, 0xe8, 0x54, 0xb1, 0xe1, 0xc7, 0xf6, 0xfa, 0x0c, 0xad, 0xfa, + 0xaa, 0xdf, 0xbc, 0x21, 0x67, 0x59, 0x52, 0x76, 0xd0, 0x79, 0x63, 0x62, 0xdd, 0x90, 0xdf, 0x16, + 0xbc, 0x7b, 0x71, 0x92, 0xdb, 0xe7, 0x7d, 0x03, 0x31, 0x6d, 0xac, 0xcc, 0x10, 0x9d, 0x27, 0xa8, + 0x61, 0x6a, 0xf1, 0x1d, 0x74, 0x5a, 0x07, 0xa4, 0x4d, 0x4b, 0x4e, 0x83, 0xe7, 0xa7, 0x71, 0x37, + 0x26, 0xb9, 0x8d, 0xcb, 0x38, 0x23, 0xf9, 0x54, 0xdb, 0xf9, 0xd5, 0x42, 0xcb, 0x3a, 0xfa, 0x24, + 0x0b, 0x8d, 0x59, 0x00, 0x5e, 0x01, 0x86, 0x44, 0xc0, 0x51, 0x0b, 0x7d, 0xc0, 0x02, 0x78, 0xac, + 0x79, 0xb5, 0xd0, 0xd8, 0x40, 0x66, 0x16, 0x3a, 0x43, 0x74, 0x5e, 0x56, 0x50, 0xc3, 0x14, 0xe3, + 0x2d, 0x54, 0x8d, 0xb3, 0x68, 0x00, 0x5c, 0xfa, 0xaa, 0xa8, 0x5d, 0x53, 0x88, 0xb9, 0x6b, 0x0a, + 0xc1, 0xdf, 0xa0, 0xaa, 0x20, 0x34, 0x16, 0xa5, 0xa7, 0x0b, 0x8e, 0xba, 0xb4, 0x1c, 0x92, 0x50, + 0xa7, 0xb8, 0xb4, 0x9c, 0x83, 0xab, 0xce, 0xe3, 0x22, 0xc2, 0x5d, 0xd5, 0x0d, 0xad, 0x05, 0x7d, + 0xfd, 0x8b, 0x1f, 0xa1, 0x6a, 0x48, 0x06, 0x10, 0x96, 0x8d, 
0x73, 0xf9, 0x03, 0xcb, 0x72, 0xee, + 0xc9, 0xa8, 0xdb, 0xb1, 0xe0, 0x63, 0xe5, 0x4a, 0xc9, 0x4c, 0x57, 0x0a, 0xc1, 0x1e, 0x3a, 0x23, + 0x98, 0x20, 0xa1, 0xc7, 0x21, 0x65, 0x19, 0xf7, 0x21, 0xd5, 0x67, 0xae, 0xe5, 0xcc, 0x5d, 0x6e, + 0x7d, 0x1d, 0x72, 0x8f, 0xa6, 0xc2, 0xdd, 0xd0, 0x1e, 0x57, 0xa5, 0xbc, 0xa4, 0xd2, 0xfe, 0x7b, + 0xe3, 0x4d, 0x82, 0xea, 0x86, 0x1b, 0x7c, 0x19, 0x55, 0x46, 0x30, 0xd6, 0x1b, 0x79, 0x6e, 0x92, + 0xdb, 0x2b, 0x23, 0x18, 0x1b, 0xbe, 0x0a, 0x16, 0x7f, 0x86, 0x96, 0x0e, 0x48, 0x98, 0x81, 0x3c, + 0x55, 0x35, 0x75, 0xaf, 0x49, 0xc0, 0xbc, 0xd7, 0x24, 0x70, 0x63, 0xf1, 0xba, 0xd5, 0x79, 0x65, + 0xa1, 0x25, 0x79, 0x6e, 0x8e, 0xdd, 0x27, 0x5b, 0xa8, 0xfa, 0x02, 0xe8, 0x70, 0x4f, 0xc8, 0x19, + 0x2c, 0x55, 0x23, 0x85, 0x98, 0x35, 0x52, 0x08, 0x7e, 0x8a, 0x56, 0xf6, 0xd9, 0xc0, 0x68, 0x2a, + 0x55, 0xfd, 0x0d, 0xa3, 0xfa, 0x77, 0xd9, 0x60, 0xda, 0x53, 0x9b, 0x93, 0xdc, 0xde, 0xd8, 0x3f, + 0x04, 0xcc, 0xb2, 0x37, 0x4c, 0xbc, 0xf3, 0xd7, 0x32, 0xaa, 0x1b, 0xca, 0x13, 0x36, 0xd4, 0x5d, + 0xa4, 0xb9, 0x9d, 0xcc, 0xf7, 0x21, 0x4d, 0x77, 0xb3, 0x50, 0x5f, 0x43, 0xad, 0x49, 0x6e, 0x6f, + 0xbe, 0xcf, 0x19, 0x19, 0xe6, 0x74, 0x45, 0xc5, 0xe5, 0xe5, 0x22, 0xef, 0x7f, 0x5d, 0x71, 0x09, + 0x98, 0x15, 0x97, 0x00, 0x6e, 0xa3, 0x45, 0x1a, 0xc8, 0x26, 0xa9, 0xb9, 0x67, 0x27, 0xb9, 0xdd, + 0xa0, 0xe6, 0x3d, 0xb7, 0x48, 0x03, 0x7c, 0x05, 0x2d, 0x17, 0xf5, 0x4a, 0x41, 0x34, 0x97, 0x64, + 0x98, 0x5c, 0xc7, 0x3e, 0x1b, 0xec, 0xc0, 0x4c, 0x79, 0x15, 0x82, 0x5d, 0xb4, 0x2a, 0x33, 0x7b, + 0x09, 0xa7, 0x8c, 0x53, 0x31, 0x6e, 0x56, 0xdb, 0x56, 0x77, 0x45, 0x9d, 0x4d, 0xc9, 0x3c, 0xd4, + 0x84, 0x79, 0x36, 0x67, 0x08, 0xfc, 0x3d, 0x5a, 0x2b, 0xd5, 0x9e, 0x1f, 0x92, 0x34, 0xf5, 0x64, + 0x1f, 0x2c, 0xcb, 0xe9, 0xed, 0x49, 0x6e, 0x5f, 0x2c, 0xe9, 0x5b, 0x05, 0xfb, 0x60, 0xb6, 0x29, + 0xce, 0xcd, 0x91, 0xf8, 0x19, 0x6a, 0x70, 0x78, 0x9e, 0x51, 0x0e, 0x51, 0xf1, 0xe5, 0x6e, 0x9e, + 0x96, 0x87, 0xe2, 0x7f, 0xf3, 0x87, 0xe2, 0x21, 0x0b, 0xfa, 0x46, 0xa0, 0xbb, 0xae, 0xcf, 0xc5, + 0x8c, 0xbc, 0x3f, 0x33, 0xc2, 0x37, 0x51, 0x23, 0x80, 0xe2, 0x93, 0x04, 0xb1, 0x4f, 0x21, 0x6d, + 0xd6, 0xda, 0x95, 0x6e, 0x4d, 0xf5, 0x8d, 0x89, 0x9b, 0x7d, 0x63, 0xe2, 0x78, 0x84, 0xd6, 0x81, + 0xf0, 0x90, 0x42, 0x2a, 0xbc, 0x34, 0x1b, 0x44, 0x54, 0x78, 0x82, 0x46, 0xd0, 0x44, 0xd2, 0xe4, + 0x05, 0x47, 0xbd, 0x96, 0x9c, 0xf2, 0x81, 0xe3, 0x6c, 0xeb, 0xd7, 0x92, 0xdb, 0xd2, 0xe6, 0x70, + 0x29, 0xdf, 0x91, 0xea, 0xc7, 0x34, 0x82, 0xdf, 0xff, 0xb1, 0xad, 0xfe, 0x11, 0x38, 0x7e, 0x65, + 0xa1, 0xde, 0x51, 0xb3, 0x79, 0xbb, 0x9c, 0x45, 0xde, 0xd4, 0xd7, 0xd8, 0xf3, 0x59, 0x94, 0x84, + 0x20, 0x3f, 0xdb, 0xf5, 0x4f, 0x19, 0xf9, 0x52, 0x1b, 0xf9, 0x7c, 0x7e, 0xc2, 0x3b, 0x9c, 0x45, + 0xdb, 0xd3, 0xac, 0xb7, 0xa6, 0x49, 0xa5, 0xc1, 0x13, 0xc4, 0xe3, 0x08, 0xad, 0xf3, 0x2c, 0x96, + 0x56, 0x67, 0xde, 0x14, 0x8d, 0xe3, 0xbc, 0x29, 0x2e, 0x6a, 0x83, 0x6b, 0x3a, 0xc5, 0xcc, 0x73, + 0xe2, 0x28, 0xb0, 0xf3, 0x87, 0x85, 0xf0, 0x7c, 0x22, 0xfc, 0x1d, 0x5a, 0x8e, 0x68, 0x4c, 0xa3, + 0x2c, 0x92, 0x87, 0xfa, 0xa3, 0x55, 0x59, 0xd3, 0x93, 0x96, 0x0a, 0xb9, 0xe4, 0x72, 0x80, 0xef, + 0xa1, 0x9a, 0x20, 0x34, 0xf4, 0x22, 0x20, 0xb1, 0x3c, 0xe8, 0x1f, 0xcd, 0x55, 0xf6, 0xe1, 0xe9, + 0x42, 0x73, 0x1f, 0x88, 0xaa, 0xdf, 0x74, 0xe4, 0x3e, 0x79, 0xfd, 0xb6, 0x65, 0xbd, 0x79, 0xdb, + 0xb2, 0xfe, 0x7d, 0xdb, 0xb2, 0x7e, 0x7b, 0xd7, 0x5a, 0x78, 0xf3, 0xae, 0xb5, 0xf0, 0xf7, 0xbb, + 0xd6, 0xc2, 0x0f, 0x5f, 0x1b, 0x0f, 0x65, 0xf5, 0x88, 0x4d, 0x38, 0x2b, 0x9a, 0x5d, 0x8f, 0x7a, + 0x1f, 0x7b, 0xc5, 0x0f, 0xaa, 0xd2, 0xca, 0x17, 0xff, 0x05, 0x00, 0x00, 0xff, 0xff, 
0xa0, 0x95, + 0x14, 0xc7, 0xec, 0x0b, 0x00, 0x00, +} + +func (m *ClusterSpec) Marshal() (dAtA []byte, err error) { size := m.Size() dAtA = make([]byte, size) n, err := m.MarshalToSizedBuffer(dAtA[:size]) @@ -614,20 +746,40 @@ func (m *TestCase) Marshal() (dAtA []byte, err error) { return dAtA[:n], nil } -func (m *TestCase) MarshalTo(dAtA []byte) (int, error) { +func (m *ClusterSpec) MarshalTo(dAtA []byte) (int, error) { size := m.Size() return m.MarshalToSizedBuffer(dAtA[:size]) } -func (m *TestCase) MarshalToSizedBuffer(dAtA []byte) (int, error) { +func (m *ClusterSpec) MarshalToSizedBuffer(dAtA []byte) (int, error) { i := len(dAtA) _ = i var l int _ = l - if len(m.Queues) > 0 { - for iNdEx := len(m.Queues) - 1; iNdEx >= 0; iNdEx-- { + { + size, err := m.PendingDelayDistribution.MarshalToSizedBuffer(dAtA[:i]) + if err != nil { + return 0, err + } + i -= size + i = encodeVarintSimulator(dAtA, i, uint64(size)) + } + i-- + dAtA[i] = 0x22 + { + size, err := m.WorkflowManagerDelayDistribution.MarshalToSizedBuffer(dAtA[:i]) + if err != nil { + return 0, err + } + i -= size + i = encodeVarintSimulator(dAtA, i, uint64(size)) + } + i-- + dAtA[i] = 0x1a + if len(m.Pools) > 0 { + for iNdEx := len(m.Pools) - 1; iNdEx >= 0; iNdEx-- { { - size, err := m.Queues[iNdEx].MarshalToSizedBuffer(dAtA[:i]) + size, err := m.Pools[iNdEx].MarshalToSizedBuffer(dAtA[:i]) if err != nil { return 0, err } @@ -635,13 +787,43 @@ func (m *TestCase) MarshalToSizedBuffer(dAtA []byte) (int, error) { i = encodeVarintSimulator(dAtA, i, uint64(size)) } i-- - dAtA[i] = 0x22 + dAtA[i] = 0x12 } } - if len(m.Pools) > 0 { - for iNdEx := len(m.Pools) - 1; iNdEx >= 0; iNdEx-- { + if len(m.Name) > 0 { + i -= len(m.Name) + copy(dAtA[i:], m.Name) + i = encodeVarintSimulator(dAtA, i, uint64(len(m.Name))) + i-- + dAtA[i] = 0xa + } + return len(dAtA) - i, nil +} + +func (m *WorkloadSpec) Marshal() (dAtA []byte, err error) { + size := m.Size() + dAtA = make([]byte, size) + n, err := m.MarshalToSizedBuffer(dAtA[:size]) + if err != nil { + return nil, err + } + return dAtA[:n], nil +} + +func (m *WorkloadSpec) MarshalTo(dAtA []byte) (int, error) { + size := m.Size() + return m.MarshalToSizedBuffer(dAtA[:size]) +} + +func (m *WorkloadSpec) MarshalToSizedBuffer(dAtA []byte) (int, error) { + i := len(dAtA) + _ = i + var l int + _ = l + if len(m.Queues) > 0 { + for iNdEx := len(m.Queues) - 1; iNdEx >= 0; iNdEx-- { { - size, err := m.Pools[iNdEx].MarshalToSizedBuffer(dAtA[:i]) + size, err := m.Queues[iNdEx].MarshalToSizedBuffer(dAtA[:i]) if err != nil { return 0, err } @@ -687,10 +869,10 @@ func (m *Pool) MarshalToSizedBuffer(dAtA []byte) (int, error) { _ = i var l int _ = l - if len(m.ExecutorGroups) > 0 { - for iNdEx := len(m.ExecutorGroups) - 1; iNdEx >= 0; iNdEx-- { + if len(m.ClusterGroups) > 0 { + for iNdEx := len(m.ClusterGroups) - 1; iNdEx >= 0; iNdEx-- { { - size, err := m.ExecutorGroups[iNdEx].MarshalToSizedBuffer(dAtA[:i]) + size, err := m.ClusterGroups[iNdEx].MarshalToSizedBuffer(dAtA[:i]) if err != nil { return 0, err } @@ -711,7 +893,7 @@ func (m *Pool) MarshalToSizedBuffer(dAtA []byte) (int, error) { return len(dAtA) - i, nil } -func (m *ExecutorGroup) Marshal() (dAtA []byte, err error) { +func (m *ClusterGroup) Marshal() (dAtA []byte, err error) { size := m.Size() dAtA = make([]byte, size) n, err := m.MarshalToSizedBuffer(dAtA[:size]) @@ -721,20 +903,20 @@ func (m *ExecutorGroup) Marshal() (dAtA []byte, err error) { return dAtA[:n], nil } -func (m *ExecutorGroup) MarshalTo(dAtA []byte) (int, error) { +func (m *ClusterGroup) 
MarshalTo(dAtA []byte) (int, error) { size := m.Size() return m.MarshalToSizedBuffer(dAtA[:size]) } -func (m *ExecutorGroup) MarshalToSizedBuffer(dAtA []byte) (int, error) { +func (m *ClusterGroup) MarshalToSizedBuffer(dAtA []byte) (int, error) { i := len(dAtA) _ = i var l int _ = l - if len(m.Executors) > 0 { - for iNdEx := len(m.Executors) - 1; iNdEx >= 0; iNdEx-- { + if len(m.Clusters) > 0 { + for iNdEx := len(m.Clusters) - 1; iNdEx >= 0; iNdEx-- { { - size, err := m.Executors[iNdEx].MarshalToSizedBuffer(dAtA[:i]) + size, err := m.Clusters[iNdEx].MarshalToSizedBuffer(dAtA[:i]) if err != nil { return 0, err } @@ -748,7 +930,7 @@ func (m *ExecutorGroup) MarshalToSizedBuffer(dAtA []byte) (int, error) { return len(dAtA) - i, nil } -func (m *Executor) Marshal() (dAtA []byte, err error) { +func (m *Cluster) Marshal() (dAtA []byte, err error) { size := m.Size() dAtA = make([]byte, size) n, err := m.MarshalToSizedBuffer(dAtA[:size]) @@ -758,12 +940,12 @@ func (m *Executor) Marshal() (dAtA []byte, err error) { return dAtA[:n], nil } -func (m *Executor) MarshalTo(dAtA []byte) (int, error) { +func (m *Cluster) MarshalTo(dAtA []byte) (int, error) { size := m.Size() return m.MarshalToSizedBuffer(dAtA[:size]) } -func (m *Executor) MarshalToSizedBuffer(dAtA []byte) (int, error) { +func (m *Cluster) MarshalToSizedBuffer(dAtA []byte) (int, error) { i := len(dAtA) _ = i var l int @@ -933,22 +1115,30 @@ func (m *JobTemplate) MarshalToSizedBuffer(dAtA []byte) (int, error) { _ = i var l int _ = l - if m.RuntimeVariance != 0 { - i = encodeVarintSimulator(dAtA, i, uint64(m.RuntimeVariance)) - i-- - dAtA[i] = 0x60 + { + size, err := m.RuntimeDistribution.MarshalToSizedBuffer(dAtA[:i]) + if err != nil { + return 0, err + } + i -= size + i = encodeVarintSimulator(dAtA, i, uint64(size)) } - if m.RuntimeMean != 0 { - i = encodeVarintSimulator(dAtA, i, uint64(m.RuntimeMean)) - i-- - dAtA[i] = 0x58 + i-- + dAtA[i] = 0x62 + n5, err5 := github_com_gogo_protobuf_types.StdDurationMarshalTo(m.EarliestSubmitTimeFromDependencyCompletion, dAtA[i-github_com_gogo_protobuf_types.SizeOfStdDuration(m.EarliestSubmitTimeFromDependencyCompletion):]) + if err5 != nil { + return 0, err5 } - n2, err2 := github_com_gogo_protobuf_types.StdTimeMarshalTo(m.MinSubmitTime, dAtA[i-github_com_gogo_protobuf_types.SizeOfStdTime(m.MinSubmitTime):]) - if err2 != nil { - return 0, err2 + i -= n5 + i = encodeVarintSimulator(dAtA, i, uint64(n5)) + i-- + dAtA[i] = 0x5a + n6, err6 := github_com_gogo_protobuf_types.StdDurationMarshalTo(m.EarliestSubmitTime, dAtA[i-github_com_gogo_protobuf_types.SizeOfStdDuration(m.EarliestSubmitTime):]) + if err6 != nil { + return 0, err6 } - i -= n2 - i = encodeVarintSimulator(dAtA, i, uint64(n2)) + i -= n6 + i = encodeVarintSimulator(dAtA, i, uint64(n6)) i-- dAtA[i] = 0x52 if len(m.Dependencies) > 0 { @@ -1016,6 +1206,45 @@ func (m *JobTemplate) MarshalToSizedBuffer(dAtA []byte) (int, error) { return len(dAtA) - i, nil } +func (m *ShiftedExponential) Marshal() (dAtA []byte, err error) { + size := m.Size() + dAtA = make([]byte, size) + n, err := m.MarshalToSizedBuffer(dAtA[:size]) + if err != nil { + return nil, err + } + return dAtA[:n], nil +} + +func (m *ShiftedExponential) MarshalTo(dAtA []byte) (int, error) { + size := m.Size() + return m.MarshalToSizedBuffer(dAtA[:size]) +} + +func (m *ShiftedExponential) MarshalToSizedBuffer(dAtA []byte) (int, error) { + i := len(dAtA) + _ = i + var l int + _ = l + n8, err8 := github_com_gogo_protobuf_types.StdDurationMarshalTo(m.TailMean, 
dAtA[i-github_com_gogo_protobuf_types.SizeOfStdDuration(m.TailMean):]) + if err8 != nil { + return 0, err8 + } + i -= n8 + i = encodeVarintSimulator(dAtA, i, uint64(n8)) + i-- + dAtA[i] = 0x12 + n9, err9 := github_com_gogo_protobuf_types.StdDurationMarshalTo(m.Minimum, dAtA[i-github_com_gogo_protobuf_types.SizeOfStdDuration(m.Minimum):]) + if err9 != nil { + return 0, err9 + } + i -= n9 + i = encodeVarintSimulator(dAtA, i, uint64(n9)) + i-- + dAtA[i] = 0xa + return len(dAtA) - i, nil +} + func encodeVarintSimulator(dAtA []byte, offset int, v uint64) int { offset -= sovSimulator(v) base := offset @@ -1027,7 +1256,7 @@ func encodeVarintSimulator(dAtA []byte, offset int, v uint64) int { dAtA[offset] = uint8(v) return base } -func (m *TestCase) Size() (n int) { +func (m *ClusterSpec) Size() (n int) { if m == nil { return 0 } @@ -1037,15 +1266,32 @@ func (m *TestCase) Size() (n int) { if l > 0 { n += 1 + l + sovSimulator(uint64(l)) } - if m.RandomSeed != 0 { - n += 1 + sovSimulator(uint64(m.RandomSeed)) - } if len(m.Pools) > 0 { for _, e := range m.Pools { l = e.Size() n += 1 + l + sovSimulator(uint64(l)) } } + l = m.WorkflowManagerDelayDistribution.Size() + n += 1 + l + sovSimulator(uint64(l)) + l = m.PendingDelayDistribution.Size() + n += 1 + l + sovSimulator(uint64(l)) + return n +} + +func (m *WorkloadSpec) Size() (n int) { + if m == nil { + return 0 + } + var l int + _ = l + l = len(m.Name) + if l > 0 { + n += 1 + l + sovSimulator(uint64(l)) + } + if m.RandomSeed != 0 { + n += 1 + sovSimulator(uint64(m.RandomSeed)) + } if len(m.Queues) > 0 { for _, e := range m.Queues { l = e.Size() @@ -1065,8 +1311,8 @@ func (m *Pool) Size() (n int) { if l > 0 { n += 1 + l + sovSimulator(uint64(l)) } - if len(m.ExecutorGroups) > 0 { - for _, e := range m.ExecutorGroups { + if len(m.ClusterGroups) > 0 { + for _, e := range m.ClusterGroups { l = e.Size() n += 1 + l + sovSimulator(uint64(l)) } @@ -1074,14 +1320,14 @@ func (m *Pool) Size() (n int) { return n } -func (m *ExecutorGroup) Size() (n int) { +func (m *ClusterGroup) Size() (n int) { if m == nil { return 0 } var l int _ = l - if len(m.Executors) > 0 { - for _, e := range m.Executors { + if len(m.Clusters) > 0 { + for _, e := range m.Clusters { l = e.Size() n += 1 + l + sovSimulator(uint64(l)) } @@ -1089,7 +1335,7 @@ func (m *ExecutorGroup) Size() (n int) { return n } -func (m *Executor) Size() (n int) { +func (m *Cluster) Size() (n int) { if m == nil { return 0 } @@ -1197,14 +1443,25 @@ func (m *JobTemplate) Size() (n int) { n += 1 + l + sovSimulator(uint64(l)) } } - l = github_com_gogo_protobuf_types.SizeOfStdTime(m.MinSubmitTime) + l = github_com_gogo_protobuf_types.SizeOfStdDuration(m.EarliestSubmitTime) n += 1 + l + sovSimulator(uint64(l)) - if m.RuntimeMean != 0 { - n += 1 + sovSimulator(uint64(m.RuntimeMean)) - } - if m.RuntimeVariance != 0 { - n += 1 + sovSimulator(uint64(m.RuntimeVariance)) + l = github_com_gogo_protobuf_types.SizeOfStdDuration(m.EarliestSubmitTimeFromDependencyCompletion) + n += 1 + l + sovSimulator(uint64(l)) + l = m.RuntimeDistribution.Size() + n += 1 + l + sovSimulator(uint64(l)) + return n +} + +func (m *ShiftedExponential) Size() (n int) { + if m == nil { + return 0 } + var l int + _ = l + l = github_com_gogo_protobuf_types.SizeOfStdDuration(m.Minimum) + n += 1 + l + sovSimulator(uint64(l)) + l = github_com_gogo_protobuf_types.SizeOfStdDuration(m.TailMean) + n += 1 + l + sovSimulator(uint64(l)) return n } @@ -1214,7 +1471,7 @@ func sovSimulator(x uint64) (n int) { func sozSimulator(x uint64) (n int) { return 
sovSimulator(uint64((x << 1) ^ uint64((int64(x) >> 63)))) } -func (m *TestCase) Unmarshal(dAtA []byte) error { +func (m *ClusterSpec) Unmarshal(dAtA []byte) error { l := len(dAtA) iNdEx := 0 for iNdEx < l { @@ -1237,10 +1494,10 @@ func (m *TestCase) Unmarshal(dAtA []byte) error { fieldNum := int32(wire >> 3) wireType := int(wire & 0x7) if wireType == 4 { - return fmt.Errorf("proto: TestCase: wiretype end group for non-group") + return fmt.Errorf("proto: ClusterSpec: wiretype end group for non-group") } if fieldNum <= 0 { - return fmt.Errorf("proto: TestCase: illegal tag %d (wire type %d)", fieldNum, wire) + return fmt.Errorf("proto: ClusterSpec: illegal tag %d (wire type %d)", fieldNum, wire) } switch fieldNum { case 1: @@ -1276,10 +1533,10 @@ func (m *TestCase) Unmarshal(dAtA []byte) error { m.Name = string(dAtA[iNdEx:postIndex]) iNdEx = postIndex case 2: - if wireType != 0 { - return fmt.Errorf("proto: wrong wireType = %d for field RandomSeed", wireType) + if wireType != 2 { + return fmt.Errorf("proto: wrong wireType = %d for field Pools", wireType) } - m.RandomSeed = 0 + var msglen int for shift := uint(0); ; shift += 7 { if shift >= 64 { return ErrIntOverflowSimulator @@ -1289,14 +1546,29 @@ func (m *TestCase) Unmarshal(dAtA []byte) error { } b := dAtA[iNdEx] iNdEx++ - m.RandomSeed |= int64(b&0x7F) << shift + msglen |= int(b&0x7F) << shift if b < 0x80 { break } } + if msglen < 0 { + return ErrInvalidLengthSimulator + } + postIndex := iNdEx + msglen + if postIndex < 0 { + return ErrInvalidLengthSimulator + } + if postIndex > l { + return io.ErrUnexpectedEOF + } + m.Pools = append(m.Pools, &Pool{}) + if err := m.Pools[len(m.Pools)-1].Unmarshal(dAtA[iNdEx:postIndex]); err != nil { + return err + } + iNdEx = postIndex case 3: if wireType != 2 { - return fmt.Errorf("proto: wrong wireType = %d for field Pools", wireType) + return fmt.Errorf("proto: wrong wireType = %d for field WorkflowManagerDelayDistribution", wireType) } var msglen int for shift := uint(0); ; shift += 7 { @@ -1323,12 +1595,145 @@ func (m *TestCase) Unmarshal(dAtA []byte) error { if postIndex > l { return io.ErrUnexpectedEOF } - m.Pools = append(m.Pools, &Pool{}) - if err := m.Pools[len(m.Pools)-1].Unmarshal(dAtA[iNdEx:postIndex]); err != nil { + if err := m.WorkflowManagerDelayDistribution.Unmarshal(dAtA[iNdEx:postIndex]); err != nil { return err } iNdEx = postIndex case 4: + if wireType != 2 { + return fmt.Errorf("proto: wrong wireType = %d for field PendingDelayDistribution", wireType) + } + var msglen int + for shift := uint(0); ; shift += 7 { + if shift >= 64 { + return ErrIntOverflowSimulator + } + if iNdEx >= l { + return io.ErrUnexpectedEOF + } + b := dAtA[iNdEx] + iNdEx++ + msglen |= int(b&0x7F) << shift + if b < 0x80 { + break + } + } + if msglen < 0 { + return ErrInvalidLengthSimulator + } + postIndex := iNdEx + msglen + if postIndex < 0 { + return ErrInvalidLengthSimulator + } + if postIndex > l { + return io.ErrUnexpectedEOF + } + if err := m.PendingDelayDistribution.Unmarshal(dAtA[iNdEx:postIndex]); err != nil { + return err + } + iNdEx = postIndex + default: + iNdEx = preIndex + skippy, err := skipSimulator(dAtA[iNdEx:]) + if err != nil { + return err + } + if (skippy < 0) || (iNdEx+skippy) < 0 { + return ErrInvalidLengthSimulator + } + if (iNdEx + skippy) > l { + return io.ErrUnexpectedEOF + } + iNdEx += skippy + } + } + + if iNdEx > l { + return io.ErrUnexpectedEOF + } + return nil +} +func (m *WorkloadSpec) Unmarshal(dAtA []byte) error { + l := len(dAtA) + iNdEx := 0 + for iNdEx < l { + preIndex := iNdEx 
+ var wire uint64 + for shift := uint(0); ; shift += 7 { + if shift >= 64 { + return ErrIntOverflowSimulator + } + if iNdEx >= l { + return io.ErrUnexpectedEOF + } + b := dAtA[iNdEx] + iNdEx++ + wire |= uint64(b&0x7F) << shift + if b < 0x80 { + break + } + } + fieldNum := int32(wire >> 3) + wireType := int(wire & 0x7) + if wireType == 4 { + return fmt.Errorf("proto: WorkloadSpec: wiretype end group for non-group") + } + if fieldNum <= 0 { + return fmt.Errorf("proto: WorkloadSpec: illegal tag %d (wire type %d)", fieldNum, wire) + } + switch fieldNum { + case 1: + if wireType != 2 { + return fmt.Errorf("proto: wrong wireType = %d for field Name", wireType) + } + var stringLen uint64 + for shift := uint(0); ; shift += 7 { + if shift >= 64 { + return ErrIntOverflowSimulator + } + if iNdEx >= l { + return io.ErrUnexpectedEOF + } + b := dAtA[iNdEx] + iNdEx++ + stringLen |= uint64(b&0x7F) << shift + if b < 0x80 { + break + } + } + intStringLen := int(stringLen) + if intStringLen < 0 { + return ErrInvalidLengthSimulator + } + postIndex := iNdEx + intStringLen + if postIndex < 0 { + return ErrInvalidLengthSimulator + } + if postIndex > l { + return io.ErrUnexpectedEOF + } + m.Name = string(dAtA[iNdEx:postIndex]) + iNdEx = postIndex + case 2: + if wireType != 0 { + return fmt.Errorf("proto: wrong wireType = %d for field RandomSeed", wireType) + } + m.RandomSeed = 0 + for shift := uint(0); ; shift += 7 { + if shift >= 64 { + return ErrIntOverflowSimulator + } + if iNdEx >= l { + return io.ErrUnexpectedEOF + } + b := dAtA[iNdEx] + iNdEx++ + m.RandomSeed |= int64(b&0x7F) << shift + if b < 0x80 { + break + } + } + case 3: if wireType != 2 { return fmt.Errorf("proto: wrong wireType = %d for field Queues", wireType) } @@ -1357,7 +1762,7 @@ func (m *TestCase) Unmarshal(dAtA []byte) error { if postIndex > l { return io.ErrUnexpectedEOF } - m.Queues = append(m.Queues, Queue{}) + m.Queues = append(m.Queues, &Queue{}) if err := m.Queues[len(m.Queues)-1].Unmarshal(dAtA[iNdEx:postIndex]); err != nil { return err } @@ -1446,7 +1851,7 @@ func (m *Pool) Unmarshal(dAtA []byte) error { iNdEx = postIndex case 2: if wireType != 2 { - return fmt.Errorf("proto: wrong wireType = %d for field ExecutorGroups", wireType) + return fmt.Errorf("proto: wrong wireType = %d for field ClusterGroups", wireType) } var msglen int for shift := uint(0); ; shift += 7 { @@ -1473,8 +1878,8 @@ func (m *Pool) Unmarshal(dAtA []byte) error { if postIndex > l { return io.ErrUnexpectedEOF } - m.ExecutorGroups = append(m.ExecutorGroups, &ExecutorGroup{}) - if err := m.ExecutorGroups[len(m.ExecutorGroups)-1].Unmarshal(dAtA[iNdEx:postIndex]); err != nil { + m.ClusterGroups = append(m.ClusterGroups, &ClusterGroup{}) + if err := m.ClusterGroups[len(m.ClusterGroups)-1].Unmarshal(dAtA[iNdEx:postIndex]); err != nil { return err } iNdEx = postIndex @@ -1499,7 +1904,7 @@ func (m *Pool) Unmarshal(dAtA []byte) error { } return nil } -func (m *ExecutorGroup) Unmarshal(dAtA []byte) error { +func (m *ClusterGroup) Unmarshal(dAtA []byte) error { l := len(dAtA) iNdEx := 0 for iNdEx < l { @@ -1522,15 +1927,15 @@ func (m *ExecutorGroup) Unmarshal(dAtA []byte) error { fieldNum := int32(wire >> 3) wireType := int(wire & 0x7) if wireType == 4 { - return fmt.Errorf("proto: ExecutorGroup: wiretype end group for non-group") + return fmt.Errorf("proto: ClusterGroup: wiretype end group for non-group") } if fieldNum <= 0 { - return fmt.Errorf("proto: ExecutorGroup: illegal tag %d (wire type %d)", fieldNum, wire) + return fmt.Errorf("proto: ClusterGroup: illegal tag %d 
(wire type %d)", fieldNum, wire) } switch fieldNum { case 1: if wireType != 2 { - return fmt.Errorf("proto: wrong wireType = %d for field Executors", wireType) + return fmt.Errorf("proto: wrong wireType = %d for field Clusters", wireType) } var msglen int for shift := uint(0); ; shift += 7 { @@ -1557,8 +1962,8 @@ func (m *ExecutorGroup) Unmarshal(dAtA []byte) error { if postIndex > l { return io.ErrUnexpectedEOF } - m.Executors = append(m.Executors, &Executor{}) - if err := m.Executors[len(m.Executors)-1].Unmarshal(dAtA[iNdEx:postIndex]); err != nil { + m.Clusters = append(m.Clusters, &Cluster{}) + if err := m.Clusters[len(m.Clusters)-1].Unmarshal(dAtA[iNdEx:postIndex]); err != nil { return err } iNdEx = postIndex @@ -1583,7 +1988,7 @@ func (m *ExecutorGroup) Unmarshal(dAtA []byte) error { } return nil } -func (m *Executor) Unmarshal(dAtA []byte) error { +func (m *Cluster) Unmarshal(dAtA []byte) error { l := len(dAtA) iNdEx := 0 for iNdEx < l { @@ -1606,10 +2011,10 @@ func (m *Executor) Unmarshal(dAtA []byte) error { fieldNum := int32(wire >> 3) wireType := int(wire & 0x7) if wireType == 4 { - return fmt.Errorf("proto: Executor: wiretype end group for non-group") + return fmt.Errorf("proto: Cluster: wiretype end group for non-group") } if fieldNum <= 0 { - return fmt.Errorf("proto: Executor: illegal tag %d (wire type %d)", fieldNum, wire) + return fmt.Errorf("proto: Cluster: illegal tag %d (wire type %d)", fieldNum, wire) } switch fieldNum { case 1: @@ -2370,7 +2775,7 @@ func (m *JobTemplate) Unmarshal(dAtA []byte) error { iNdEx = postIndex case 10: if wireType != 2 { - return fmt.Errorf("proto: wrong wireType = %d for field MinSubmitTime", wireType) + return fmt.Errorf("proto: wrong wireType = %d for field EarliestSubmitTime", wireType) } var msglen int for shift := uint(0); ; shift += 7 { @@ -2397,15 +2802,15 @@ func (m *JobTemplate) Unmarshal(dAtA []byte) error { if postIndex > l { return io.ErrUnexpectedEOF } - if err := github_com_gogo_protobuf_types.StdTimeUnmarshal(&m.MinSubmitTime, dAtA[iNdEx:postIndex]); err != nil { + if err := github_com_gogo_protobuf_types.StdDurationUnmarshal(&m.EarliestSubmitTime, dAtA[iNdEx:postIndex]); err != nil { return err } iNdEx = postIndex case 11: - if wireType != 0 { - return fmt.Errorf("proto: wrong wireType = %d for field RuntimeMean", wireType) + if wireType != 2 { + return fmt.Errorf("proto: wrong wireType = %d for field EarliestSubmitTimeFromDependencyCompletion", wireType) } - m.RuntimeMean = 0 + var msglen int for shift := uint(0); ; shift += 7 { if shift >= 64 { return ErrIntOverflowSimulator @@ -2415,16 +2820,30 @@ func (m *JobTemplate) Unmarshal(dAtA []byte) error { } b := dAtA[iNdEx] iNdEx++ - m.RuntimeMean |= int64(b&0x7F) << shift + msglen |= int(b&0x7F) << shift if b < 0x80 { break } } + if msglen < 0 { + return ErrInvalidLengthSimulator + } + postIndex := iNdEx + msglen + if postIndex < 0 { + return ErrInvalidLengthSimulator + } + if postIndex > l { + return io.ErrUnexpectedEOF + } + if err := github_com_gogo_protobuf_types.StdDurationUnmarshal(&m.EarliestSubmitTimeFromDependencyCompletion, dAtA[iNdEx:postIndex]); err != nil { + return err + } + iNdEx = postIndex case 12: - if wireType != 0 { - return fmt.Errorf("proto: wrong wireType = %d for field RuntimeVariance", wireType) + if wireType != 2 { + return fmt.Errorf("proto: wrong wireType = %d for field RuntimeDistribution", wireType) } - m.RuntimeVariance = 0 + var msglen int for shift := uint(0); ; shift += 7 { if shift >= 64 { return ErrIntOverflowSimulator @@ -2434,11 +2853,141 
@@ func (m *JobTemplate) Unmarshal(dAtA []byte) error { } b := dAtA[iNdEx] iNdEx++ - m.RuntimeVariance |= int64(b&0x7F) << shift + msglen |= int(b&0x7F) << shift if b < 0x80 { break } } + if msglen < 0 { + return ErrInvalidLengthSimulator + } + postIndex := iNdEx + msglen + if postIndex < 0 { + return ErrInvalidLengthSimulator + } + if postIndex > l { + return io.ErrUnexpectedEOF + } + if err := m.RuntimeDistribution.Unmarshal(dAtA[iNdEx:postIndex]); err != nil { + return err + } + iNdEx = postIndex + default: + iNdEx = preIndex + skippy, err := skipSimulator(dAtA[iNdEx:]) + if err != nil { + return err + } + if (skippy < 0) || (iNdEx+skippy) < 0 { + return ErrInvalidLengthSimulator + } + if (iNdEx + skippy) > l { + return io.ErrUnexpectedEOF + } + iNdEx += skippy + } + } + + if iNdEx > l { + return io.ErrUnexpectedEOF + } + return nil +} +func (m *ShiftedExponential) Unmarshal(dAtA []byte) error { + l := len(dAtA) + iNdEx := 0 + for iNdEx < l { + preIndex := iNdEx + var wire uint64 + for shift := uint(0); ; shift += 7 { + if shift >= 64 { + return ErrIntOverflowSimulator + } + if iNdEx >= l { + return io.ErrUnexpectedEOF + } + b := dAtA[iNdEx] + iNdEx++ + wire |= uint64(b&0x7F) << shift + if b < 0x80 { + break + } + } + fieldNum := int32(wire >> 3) + wireType := int(wire & 0x7) + if wireType == 4 { + return fmt.Errorf("proto: ShiftedExponential: wiretype end group for non-group") + } + if fieldNum <= 0 { + return fmt.Errorf("proto: ShiftedExponential: illegal tag %d (wire type %d)", fieldNum, wire) + } + switch fieldNum { + case 1: + if wireType != 2 { + return fmt.Errorf("proto: wrong wireType = %d for field Minimum", wireType) + } + var msglen int + for shift := uint(0); ; shift += 7 { + if shift >= 64 { + return ErrIntOverflowSimulator + } + if iNdEx >= l { + return io.ErrUnexpectedEOF + } + b := dAtA[iNdEx] + iNdEx++ + msglen |= int(b&0x7F) << shift + if b < 0x80 { + break + } + } + if msglen < 0 { + return ErrInvalidLengthSimulator + } + postIndex := iNdEx + msglen + if postIndex < 0 { + return ErrInvalidLengthSimulator + } + if postIndex > l { + return io.ErrUnexpectedEOF + } + if err := github_com_gogo_protobuf_types.StdDurationUnmarshal(&m.Minimum, dAtA[iNdEx:postIndex]); err != nil { + return err + } + iNdEx = postIndex + case 2: + if wireType != 2 { + return fmt.Errorf("proto: wrong wireType = %d for field TailMean", wireType) + } + var msglen int + for shift := uint(0); ; shift += 7 { + if shift >= 64 { + return ErrIntOverflowSimulator + } + if iNdEx >= l { + return io.ErrUnexpectedEOF + } + b := dAtA[iNdEx] + iNdEx++ + msglen |= int(b&0x7F) << shift + if b < 0x80 { + break + } + } + if msglen < 0 { + return ErrInvalidLengthSimulator + } + postIndex := iNdEx + msglen + if postIndex < 0 { + return ErrInvalidLengthSimulator + } + if postIndex > l { + return io.ErrUnexpectedEOF + } + if err := github_com_gogo_protobuf_types.StdDurationUnmarshal(&m.TailMean, dAtA[iNdEx:postIndex]); err != nil { + return err + } + iNdEx = postIndex default: iNdEx = preIndex skippy, err := skipSimulator(dAtA[iNdEx:]) diff --git a/internal/scheduler/simulator/simulator.proto b/internal/scheduler/simulator/simulator.proto index dbc02fb1b6b..ca06eda63d8 100644 --- a/internal/scheduler/simulator/simulator.proto +++ b/internal/scheduler/simulator/simulator.proto @@ -2,32 +2,35 @@ syntax = 'proto3'; package simulator; option go_package = "github.com/armadaproject/armada/internal/scheduler/simulator"; -import "google/protobuf/timestamp.proto"; +import "google/protobuf/duration.proto"; import 
"k8s.io/api/core/v1/generated.proto"; import "github.com/gogo/protobuf/gogoproto/gogo.proto"; import "internal/scheduler/schedulerobjects/schedulerobjects.proto"; +import "pkg/armadaevents/events.proto"; -// TODO: -// Runtime family. -// Workflow manager delay. -// Job pending delay. -message TestCase { +message ClusterSpec { + string name = 1; + repeated Pool pools = 2; + ShiftedExponential workflow_manager_delay_distribution = 3 [(gogoproto.nullable) = false]; + ShiftedExponential pending_delay_distribution = 4 [(gogoproto.nullable) = false]; +} + +message WorkloadSpec { string name = 1; int64 random_seed = 2; - repeated Pool pools = 3; - repeated Queue queues = 4 [(gogoproto.nullable) = false]; + repeated Queue queues = 3; } message Pool { string name = 1; - repeated ExecutorGroup executor_groups = 2; + repeated ClusterGroup cluster_groups = 2; } -message ExecutorGroup { - repeated Executor executors = 1; +message ClusterGroup { + repeated Cluster clusters = 1; } -message Executor { +message Cluster { string name = 1; repeated NodeTemplate node_templates = 2; } @@ -58,14 +61,28 @@ message JobTemplate { string job_set = 5; uint32 queue_priority = 6; string priority_class_name = 7; + // Scheduling requirements for the pod embedded in the job. schedulerobjects.PodRequirements requirements = 8 [(gogoproto.nullable) = false]; // List of template ids that must be completed before this template is submitted. repeated string dependencies = 9; - // Minimum time from which jobs are created from this template. - google.protobuf.Timestamp min_submit_time = 10 [(gogoproto.nullable) = false, (gogoproto.stdtime) = true]; - // Job runtime mean in seconds. - int64 runtime_mean = 11; - // Job runtime variance in seconds squared. - // If zero, runtime is deterministic. - int64 runtime_variance = 12; + // Earliest time at which jobs from this template are submitted. + // Measured from the start of the simulation. + google.protobuf.Duration earliest_submit_time = 10 [(gogoproto.nullable) = false, (gogoproto.stdduration) = true]; + // Earliest time job can be submitted from when all its dependencies have completed. + // This option is meant to model thinking or processing time, where some fixed amount of time + // needs to be spent between dependencies completing and the next batch of jobs being ready to submit. + google.protobuf.Duration earliest_submit_time_from_dependency_completion = 11 [(gogoproto.nullable) = false, (gogoproto.stdduration) = true]; + // Job runtimes are assumed to follow a shifted exponential distribution + // i.e., to be a fixed constant (runtime_minimum) plus a random amount of time + // drawn from an exponential distribution with known mean (runtime_tail_mean). + // + // The shifted-exponential distribution strikes a good balance between simplicity and accuracy; + // see https://bora.uib.no/bora-xmlui/bitstream/handle/11250/3014726/drthesis_2022_severinson.pdf?sequence=2 + // for a discussion on the topic. 
+ ShiftedExponential runtime_distribution = 12 [(gogoproto.nullable) = false]; +} + +message ShiftedExponential { + google.protobuf.Duration minimum = 1 [(gogoproto.nullable) = false, (gogoproto.stdduration) = true]; + google.protobuf.Duration tail_mean = 2 [(gogoproto.nullable) = false, (gogoproto.stdduration) = true]; } \ No newline at end of file diff --git a/internal/scheduler/simulator/simulator_test.go b/internal/scheduler/simulator/simulator_test.go index 0d891d3cfff..a6e7c4c22a1 100644 --- a/internal/scheduler/simulator/simulator_test.go +++ b/internal/scheduler/simulator/simulator_test.go @@ -1,36 +1,40 @@ package simulator import ( - fmt "fmt" - "strings" + "math/rand" "testing" "time" + "github.com/stretchr/testify/assert" "github.com/stretchr/testify/require" - v1 "k8s.io/api/core/v1" "k8s.io/apimachinery/pkg/api/resource" "github.com/armadaproject/armada/internal/armada/configuration" + "github.com/armadaproject/armada/internal/common/armadacontext" armadaslices "github.com/armadaproject/armada/internal/common/slices" "github.com/armadaproject/armada/internal/common/util" - "github.com/armadaproject/armada/internal/scheduler/schedulerobjects" + schedulerobjects "github.com/armadaproject/armada/internal/scheduler/schedulerobjects" "github.com/armadaproject/armada/internal/scheduler/testfixtures" "github.com/armadaproject/armada/pkg/armadaevents" ) func TestSimulator(t *testing.T) { tests := map[string]struct { - testCase *TestCase + clusterSpec *ClusterSpec + workloadSpec *WorkloadSpec schedulingConfig configuration.SchedulingConfig expectedEventSequences []*armadaevents.EventSequence + simulatedTimeLimit time.Duration }{ "Two jobs in parallel": { - testCase: &TestCase{ + clusterSpec: &ClusterSpec{ Name: "basic", Pools: []*Pool{Pool32Cpu("Pool", 1, 1, 2)}, - Queues: []Queue{ + }, + workloadSpec: &WorkloadSpec{ + Queues: []*Queue{ WithJobTemplatesQueue( - Queue{Name: "A", Weight: 1}, + &Queue{Name: "A", Weight: 1}, JobTemplate32Cpu(2, "foo", testfixtures.TestDefaultPriorityClass), ), }, @@ -43,14 +47,17 @@ func TestSimulator(t *testing.T) { {Queue: "A", JobSetName: "foo", Events: []*armadaevents.EventSequence_Event{JobSucceeded()}}, {Queue: "A", JobSetName: "foo", Events: []*armadaevents.EventSequence_Event{JobSucceeded()}}, }, + simulatedTimeLimit: 5 * time.Minute, }, "Two jobs in sequence": { - testCase: &TestCase{ + clusterSpec: &ClusterSpec{ Name: "basic", Pools: []*Pool{Pool32Cpu("Pool", 1, 1, 1)}, - Queues: []Queue{ + }, + workloadSpec: &WorkloadSpec{ + Queues: []*Queue{ WithJobTemplatesQueue( - Queue{Name: "A", Weight: 1}, + &Queue{Name: "A", Weight: 1}, JobTemplate32Cpu(2, "foo", testfixtures.TestDefaultPriorityClass), ), }, @@ -63,14 +70,17 @@ func TestSimulator(t *testing.T) { {Queue: "A", JobSetName: "foo", Events: []*armadaevents.EventSequence_Event{JobRunLeased()}}, {Queue: "A", JobSetName: "foo", Events: []*armadaevents.EventSequence_Event{JobSucceeded()}}, }, + simulatedTimeLimit: 5 * time.Minute, }, "10 jobs in sequence": { - testCase: &TestCase{ + clusterSpec: &ClusterSpec{ Name: "basic", Pools: []*Pool{Pool32Cpu("Pool", 1, 1, 1)}, - Queues: []Queue{ + }, + workloadSpec: &WorkloadSpec{ + Queues: []*Queue{ WithJobTemplatesQueue( - Queue{Name: "A", Weight: 1}, + &Queue{Name: "A", Weight: 1}, JobTemplate32Cpu(10, "foo", testfixtures.TestDefaultPriorityClass), ), }, @@ -86,14 +96,17 @@ func TestSimulator(t *testing.T) { &armadaevents.EventSequence{Queue: "A", JobSetName: "foo", Events: []*armadaevents.EventSequence_Event{JobSucceeded()}}, )..., ), + 
simulatedTimeLimit: 20 * time.Minute, }, "JobTemplate dependencies": { - testCase: &TestCase{ + clusterSpec: &ClusterSpec{ Name: "basic", Pools: []*Pool{Pool32Cpu("Pool", 1, 1, 3)}, - Queues: []Queue{ + }, + workloadSpec: &WorkloadSpec{ + Queues: []*Queue{ WithJobTemplatesQueue( - Queue{Name: "A", Weight: 1}, + &Queue{Name: "A", Weight: 1}, WithIdJobTemplate( JobTemplate32Cpu(2, "foo", testfixtures.TestDefaultPriorityClass), "jobTemplate", @@ -116,21 +129,24 @@ func TestSimulator(t *testing.T) { {Queue: "A", JobSetName: "foo", Events: []*armadaevents.EventSequence_Event{JobRunLeased()}}, {Queue: "A", JobSetName: "foo", Events: []*armadaevents.EventSequence_Event{JobSucceeded()}}, }, + simulatedTimeLimit: 5 * time.Minute, }, "Preemption": { - testCase: &TestCase{ + clusterSpec: &ClusterSpec{ Name: "basic", Pools: []*Pool{Pool32Cpu("Pool", 1, 1, 2)}, - Queues: []Queue{ + }, + workloadSpec: &WorkloadSpec{ + Queues: []*Queue{ WithJobTemplatesQueue( - Queue{Name: "A", Weight: 1}, + &Queue{Name: "A", Weight: 1}, JobTemplate32Cpu(2, "foo", testfixtures.PriorityClass0), ), WithJobTemplatesQueue( - Queue{Name: "B", Weight: 1}, + &Queue{Name: "B", Weight: 1}, WithMinSubmitTimeJobTemplate( JobTemplate32Cpu(1, "bar", testfixtures.PriorityClass0), - time.Time{}.Add(30*time.Second), + 30*time.Second, ), ), }, @@ -149,9 +165,10 @@ func TestSimulator(t *testing.T) { {Queue: "B", JobSetName: "bar", Events: []*armadaevents.EventSequence_Event{JobSucceeded()}}, {Queue: "A", JobSetName: "foo", Events: []*armadaevents.EventSequence_Event{JobSucceeded()}}, }, + simulatedTimeLimit: 5 * time.Minute, }, "Preemption cascade": { - testCase: &TestCase{ + clusterSpec: &ClusterSpec{ Name: "test", Pools: []*Pool{ WithExecutorGroupsPool( @@ -161,20 +178,22 @@ func TestSimulator(t *testing.T) { ExecutorGroup32Cpu(1, 1), ), }, - Queues: []Queue{ + }, + workloadSpec: &WorkloadSpec{ + Queues: []*Queue{ WithJobTemplatesQueue( - Queue{Name: "B", Weight: 1}, + &Queue{Name: "B", Weight: 1}, JobTemplate32Cpu(1, "foo", testfixtures.PriorityClass0), ), WithJobTemplatesQueue( - Queue{Name: "C", Weight: 1}, + &Queue{Name: "C", Weight: 1}, JobTemplate32Cpu(2, "foo", testfixtures.PriorityClass0), ), WithJobTemplatesQueue( - Queue{Name: "A", Weight: 1}, + &Queue{Name: "A", Weight: 1}, WithMinSubmitTimeJobTemplate( JobTemplate32Cpu(1, "foo", testfixtures.PriorityClass0), - time.Time{}.Add(30*time.Second), + 30*time.Second, ), ), }, @@ -199,9 +218,10 @@ func TestSimulator(t *testing.T) { {Queue: "B", JobSetName: "foo", Events: []*armadaevents.EventSequence_Event{JobSucceeded()}}, {Queue: "C", JobSetName: "foo", Events: []*armadaevents.EventSequence_Event{JobSucceeded()}}, }, + simulatedTimeLimit: 5 * time.Minute, }, "No preemption cascade with unified scheduling": { - testCase: &TestCase{ + clusterSpec: &ClusterSpec{ Name: "test", Pools: []*Pool{ WithExecutorGroupsPool( @@ -209,20 +229,22 @@ func TestSimulator(t *testing.T) { ExecutorGroup32Cpu(3, 1), ), }, - Queues: []Queue{ + }, + workloadSpec: &WorkloadSpec{ + Queues: []*Queue{ WithJobTemplatesQueue( - Queue{Name: "B", Weight: 1}, + &Queue{Name: "B", Weight: 1}, JobTemplate32Cpu(1, "foo", testfixtures.PriorityClass0), ), WithJobTemplatesQueue( - Queue{Name: "C", Weight: 1}, + &Queue{Name: "C", Weight: 1}, JobTemplate32Cpu(2, "foo", testfixtures.PriorityClass0), ), WithJobTemplatesQueue( - Queue{Name: "A", Weight: 1}, + &Queue{Name: "A", Weight: 1}, WithMinSubmitTimeJobTemplate( JobTemplate32Cpu(1, "foo", testfixtures.PriorityClass0), - time.Time{}.Add(30*time.Second), + 30*time.Second, 
), ), }, @@ -244,200 +266,111 @@ func TestSimulator(t *testing.T) { {Queue: "A", JobSetName: "foo", Events: []*armadaevents.EventSequence_Event{JobSucceeded()}}, {Queue: "C", JobSetName: "foo", Events: []*armadaevents.EventSequence_Event{JobSucceeded()}}, }, + simulatedTimeLimit: 5 * time.Minute, }, } for name, tc := range tests { t.Run(name, func(t *testing.T) { - s, err := NewSimulator(tc.testCase, tc.schedulingConfig) + s, err := NewSimulator(tc.clusterSpec, tc.workloadSpec, tc.schedulingConfig) require.NoError(t, err) - go func() { err = s.Run() }() - actualEventSequences := make([]*armadaevents.EventSequence, 0, len(tc.expectedEventSequences)) - for eventSequence := range s.C() { - t.Log(*eventSequence.Events[0].Created, eventSequenceSummary(eventSequence)) - actualEventSequences = append(actualEventSequences, eventSequence) - } + mc := NewMetricsCollector(s.Output()) + actualEventSequences := make([]*armadaevents.EventSequence, 0, 128) + c := s.Output() + + ctx := armadacontext.Background() + g, ctx := armadacontext.ErrGroup(ctx) + g.Go(func() error { + return mc.Run(ctx) + }) + g.Go(func() error { + for { + select { + case <-ctx.Done(): + return ctx.Err() + case eventSequence, ok := <-c: + if !ok { + return nil + } + t.Log(*eventSequence.Events[0].Created, EventSequenceSummary(eventSequence)) + actualEventSequences = append(actualEventSequences, eventSequence) + } + } + }) + g.Go(func() error { + return s.Run(ctx) + }) + err = g.Wait() require.NoError(t, err) - t.Logf("Simulated time: %s", s.time.Sub(time.Time{})) + + t.Logf("Simulation Results: %s", mc.String()) if tc.expectedEventSequences != nil { require.Equal( t, - util.Map(tc.expectedEventSequences, func(eventSequence *armadaevents.EventSequence) string { return eventSequenceSummary(eventSequence) }), - util.Map(actualEventSequences, func(eventSequence *armadaevents.EventSequence) string { return eventSequenceSummary(eventSequence) }), + util.Map(tc.expectedEventSequences, func(eventSequence *armadaevents.EventSequence) string { return EventSequenceSummary(eventSequence) }), + util.Map(actualEventSequences, func(eventSequence *armadaevents.EventSequence) string { return EventSequenceSummary(eventSequence) }), "Expected:\n%s\nReceived:\n%s", - eventSequencesSummary(tc.expectedEventSequences), - eventSequencesSummary(actualEventSequences), + EventSequencesSummary(tc.expectedEventSequences), + EventSequencesSummary(actualEventSequences), ) } + require.LessOrEqual(t, mc.OverallMetrics.TimeOfMostRecentJobSucceededEvent, tc.simulatedTimeLimit) }) } } -func WithExecutorGroupsPool(pool *Pool, executorGroups ...*ExecutorGroup) *Pool { - pool.ExecutorGroups = append(pool.ExecutorGroups, executorGroups...) - return pool -} - -func WithExecutorsExecutorGroup(executorGroup *ExecutorGroup, executors ...*Executor) *ExecutorGroup { - executorGroup.Executors = append(executorGroup.Executors, executors...) - return executorGroup +func TestSchedulingConfigsFromPattern(t *testing.T) { + actual, err := SchedulingConfigsFromPattern("./testdata/configs/basicSchedulingConfig.yaml") + require.NoError(t, err) + expected := []configuration.SchedulingConfig{GetBasicSchedulingConfig()} + assert.Equal(t, expected, actual) } -func WithNodeTemplatesExecutor(executor *Executor, nodeTemplates ...*NodeTemplate) *Executor { - executor.NodeTemplates = append(executor.NodeTemplates, nodeTemplates...) 
- return executor -} - -func Pool32Cpu(name string, numExecutorGroups, numExecutorsPerGroup, numNodesPerExecutor int64) *Pool { - executorGroups := make([]*ExecutorGroup, numExecutorGroups) - for i := 0; i < int(numExecutorGroups); i++ { - executorGroups[i] = ExecutorGroup32Cpu(numExecutorsPerGroup, numNodesPerExecutor) - } - return &Pool{ - Name: name, - ExecutorGroups: executorGroups, - } +func TestClusterSpecsFromPattern(t *testing.T) { + clusterSpecs, err := ClusterSpecsFromPattern("./testdata/clusters/tinyCluster.yaml") + require.NoError(t, err) + assert.Equal(t, []*ClusterSpec{GetTwoPoolTwoNodeCluster()}, clusterSpecs) + require.NoError(t, err) } -func ExecutorGroup32Cpu(numExecutors, numNodesPerExecutor int64) *ExecutorGroup { - executors := make([]*Executor, numExecutors) - for i := 0; i < int(numExecutors); i++ { - executors[i] = Executor32Cpu(numNodesPerExecutor) - } - return &ExecutorGroup{ - Executors: executors, - } +func TestWorkloadsFromPattern(t *testing.T) { + workloadSpecs, err := WorkloadsFromPattern("./testdata/workloads/basicWorkload.yaml") + require.NoError(t, err) + assert.Equal(t, []*WorkloadSpec{GetOneQueue10JobWorkload()}, workloadSpecs) + require.NoError(t, err) } -func Executor32Cpu(numNodes int64) *Executor { - return &Executor{ - NodeTemplates: []*NodeTemplate{ - NodeTemplate32Cpu(numNodes), +func TestClusterSpecTotalResources(t *testing.T) { + actual := GetTwoPoolTwoNodeCluster().TotalResources() + expected := schedulerobjects.ResourceList{ + Resources: map[string]resource.Quantity{ + "cpu": resource.MustParse("160"), + "memory": resource.MustParse("4352Gi"), + "nvidia.com/gpu": resource.MustParse("8"), }, } + assert.True(t, expected.Equal(actual), "expected %s, but got %s", expected.CompactString(), actual.CompactString()) } -func NodeTemplate32Cpu(n int64) *NodeTemplate { - return &NodeTemplate{ - Number: n, - TotalResources: schedulerobjects.ResourceList{ - Resources: map[string]resource.Quantity{ - "cpu": resource.MustParse("32"), - "memory": resource.MustParse("256Gi"), +func TestGenerateRandomShiftedExponentialDuration(t *testing.T) { + assert.Equal( + t, + time.Hour, + generateRandomShiftedExponentialDuration( + rand.New(rand.NewSource(0)), + ShiftedExponential{ + Minimum: time.Hour, }, - }, - } -} - -func WithJobTemplatesQueue(queue Queue, jobTemplate ...*JobTemplate) Queue { - queue.JobTemplates = append(queue.JobTemplates, jobTemplate...) - return queue -} - -func WithIdJobTemplate(jobTemplate *JobTemplate, id string) *JobTemplate { - jobTemplate.Id = id - return jobTemplate -} - -func WithDependenciesJobTemplate(jobTemplate *JobTemplate, dependencyIds ...string) *JobTemplate { - jobTemplate.Dependencies = append(jobTemplate.Dependencies, dependencyIds...) 
- return jobTemplate -} - -func WithMinSubmitTimeJobTemplate(jobTemplate *JobTemplate, minSubmitTime time.Time) *JobTemplate { - jobTemplate.MinSubmitTime = minSubmitTime - return jobTemplate -} - -func JobTemplate32Cpu(n int64, jobSet, priorityClassName string) *JobTemplate { - return &JobTemplate{ - Number: n, - JobSet: jobSet, - PriorityClassName: priorityClassName, - Requirements: schedulerobjects.PodRequirements{ - ResourceRequirements: v1.ResourceRequirements{ - Requests: v1.ResourceList{ - "cpu": resource.MustParse("32"), - "memory": resource.MustParse("256Gi"), - }, + ), + ) + assert.Less( + t, + time.Hour, + generateRandomShiftedExponentialDuration( + rand.New(rand.NewSource(0)), + ShiftedExponential{ + Minimum: time.Hour, + TailMean: time.Second, }, - }, - RuntimeMean: 60, - } -} - -func JobTemplate1Cpu(n int64, jobSet, priorityClassName string) *JobTemplate { - return &JobTemplate{ - Number: n, - JobSet: jobSet, - PriorityClassName: priorityClassName, - Requirements: schedulerobjects.PodRequirements{ - ResourceRequirements: v1.ResourceRequirements{ - Requests: v1.ResourceList{ - "cpu": resource.MustParse("1"), - "memory": resource.MustParse("8Gi"), - }, - }, - }, - RuntimeMean: 60, - } -} - -func SubmitJob() *armadaevents.EventSequence_Event { - return &armadaevents.EventSequence_Event{ - Event: &armadaevents.EventSequence_Event_SubmitJob{ - SubmitJob: &armadaevents.SubmitJob{}, - }, - } -} - -func JobRunLeased() *armadaevents.EventSequence_Event { - return &armadaevents.EventSequence_Event{ - Event: &armadaevents.EventSequence_Event_JobRunLeased{ - JobRunLeased: &armadaevents.JobRunLeased{}, - }, - } -} - -func JobRunPreempted() *armadaevents.EventSequence_Event { - return &armadaevents.EventSequence_Event{ - Event: &armadaevents.EventSequence_Event_JobRunPreempted{ - JobRunPreempted: &armadaevents.JobRunPreempted{}, - }, - } -} - -func JobSucceeded() *armadaevents.EventSequence_Event { - return &armadaevents.EventSequence_Event{ - Event: &armadaevents.EventSequence_Event_JobSucceeded{ - JobSucceeded: &armadaevents.JobSucceeded{}, - }, - } -} - -func eventSequencesSummary(eventSequences []*armadaevents.EventSequence) string { - var sb strings.Builder - for i, eventSequence := range eventSequences { - sb.WriteString(eventSequenceSummary(eventSequence)) - if i != len(eventSequences)-1 { - sb.WriteString("\n") - } - } - return sb.String() -} - -func eventSequenceSummary(eventSequence *armadaevents.EventSequence) string { - var sb strings.Builder - sb.WriteString(fmt.Sprintf("EventSequence{Queue: %s, JobSetName: %s, Events: [", eventSequence.Queue, eventSequence.JobSetName)) - for i, event := range eventSequence.Events { - sb.WriteString(eventSummary(event)) - if i != len(eventSequence.Events)-1 { - sb.WriteString(", ") - } - } - sb.WriteString("]}") - return sb.String() -} - -func eventSummary(event *armadaevents.EventSequence_Event) string { - return strings.ReplaceAll(fmt.Sprintf("%T", event.Event), "*armadaevents.EventSequence_Event_", "") + ), + ) } diff --git a/internal/scheduler/simulator/test_utils.go b/internal/scheduler/simulator/test_utils.go new file mode 100644 index 00000000000..031af7100c0 --- /dev/null +++ b/internal/scheduler/simulator/test_utils.go @@ -0,0 +1,315 @@ +package simulator + +import ( + "fmt" + "math" + "strings" + "time" + + v1 "k8s.io/api/core/v1" + "k8s.io/apimachinery/pkg/api/resource" + + "github.com/armadaproject/armada/internal/armada/configuration" + "github.com/armadaproject/armada/internal/common/types" + 
"github.com/armadaproject/armada/internal/scheduler/constraints" + "github.com/armadaproject/armada/internal/scheduler/schedulerobjects" + "github.com/armadaproject/armada/pkg/armadaevents" +) + +func GetTwoPoolTwoNodeCluster() *ClusterSpec { + cs := &ClusterSpec{ + Name: "Tiny Cluster", + Pools: []*Pool{ + Pool32Cpu("pool1", 1, 1, 1), + PoolGpu("pool2", 1, 1, 1), + }, + } + initialiseClusterSpec(cs) + return cs +} + +func GetOneQueue10JobWorkload() *WorkloadSpec { + ws := &WorkloadSpec{ + Name: "Basic Workload", + Queues: []*Queue{ + WithJobTemplatesQueue( + &Queue{Name: "A", Weight: 1}, + JobTemplate1Cpu(10, "", "armada-default", "myFirstJobTemplate"), + ), + }, + } + initialiseWorkloadSpec(ws) + return ws +} + +func GetBasicSchedulingConfig() configuration.SchedulingConfig { + return configuration.SchedulingConfig{ + Preemption: configuration.PreemptionConfig{ + NodeEvictionProbability: 1.0, + PriorityClasses: map[string]types.PriorityClass{ + "armada-default": { + Priority: 30000, + Preemptible: false, + }, + "armada-preemptible": { + Priority: 30000, + Preemptible: true, + }, + }, + }, + MaximumResourceFractionToSchedule: map[string]float64{ + "memory": 0.025, + "cpu": 0.025, + }, + FairnessModel: "DominantResourceFairness", + DominantResourceFairnessResourcesToConsider: []string{"cpu", "memory", "nvidia.com/gpu", "ephemeral-storage"}, + IndexedResources: []configuration.IndexedResource{ + { + Name: "cpu", + Resolution: resource.MustParse("1"), + }, + { + Name: "memory", + Resolution: resource.MustParse("1Mi"), + }, + { + Name: "nvidia.com/gpu", + Resolution: resource.MustParse("1"), + }, + }, + MaximumSchedulingRate: math.Inf(1), + MaximumSchedulingBurst: math.MaxInt, + MaximumPerQueueSchedulingRate: math.Inf(1), + MaximumPerQueueSchedulingBurst: math.MaxInt, + } +} + +// TotalResources returns the total resources available across all nodes in the ClusterSpec. +func (cs *ClusterSpec) TotalResources() schedulerobjects.ResourceList { + total := schedulerobjects.NewResourceListWithDefaultSize() + for _, pool := range cs.Pools { + for _, clusterGroup := range pool.ClusterGroups { + for _, cluster := range clusterGroup.Clusters { + for _, nt := range cluster.NodeTemplates { + for t, q := range nt.TotalResources.Resources { + total.AddQuantity(t, constraints.ScaleQuantity(q, float64(nt.Number))) + } + } + } + } + } + return total +} + +func WithExecutorGroupsPool(pool *Pool, executorGroups ...*ClusterGroup) *Pool { + pool.ClusterGroups = append(pool.ClusterGroups, executorGroups...) + return pool +} + +func WithExecutorsExecutorGroup(executorGroup *ClusterGroup, executors ...*Cluster) *ClusterGroup { + executorGroup.Clusters = append(executorGroup.Clusters, executors...) + return executorGroup +} + +func WithNodeTemplatesExecutor(executor *Cluster, nodeTemplates ...*NodeTemplate) *Cluster { + executor.NodeTemplates = append(executor.NodeTemplates, nodeTemplates...) 
+ return executor +} + +func Pool32Cpu(name string, numExecutorGroups, numExecutorsPerGroup, numNodesPerExecutor int64) *Pool { + executorGroups := make([]*ClusterGroup, numExecutorGroups) + for i := 0; i < int(numExecutorGroups); i++ { + executorGroups[i] = ExecutorGroup32Cpu(numExecutorsPerGroup, numNodesPerExecutor) + } + return &Pool{ + Name: name, + ClusterGroups: executorGroups, + } +} + +func PoolGpu(name string, numExecutorGroups, numExecutorsPerGroup, numNodesPerExecutor int64) *Pool { + executorGroups := make([]*ClusterGroup, numExecutorGroups) + for i := 0; i < int(numExecutorGroups); i++ { + executorGroups[i] = ExecutorGroupGpu(numExecutorsPerGroup, numNodesPerExecutor) + } + return &Pool{ + Name: name, + ClusterGroups: executorGroups, + } +} + +func ExecutorGroup32Cpu(numExecutors, numNodesPerExecutor int64) *ClusterGroup { + executors := make([]*Cluster, numExecutors) + for i := 0; i < int(numExecutors); i++ { + executors[i] = Executor32Cpu(numNodesPerExecutor) + } + return &ClusterGroup{ + Clusters: executors, + } +} + +func ExecutorGroupGpu(numExecutors, numNodesPerExecutor int64) *ClusterGroup { + executors := make([]*Cluster, numExecutors) + for i := 0; i < int(numExecutors); i++ { + executors[i] = ExecutorGpu(numNodesPerExecutor) + } + return &ClusterGroup{ + Clusters: executors, + } +} + +func Executor32Cpu(numNodes int64) *Cluster { + return &Cluster{ + NodeTemplates: []*NodeTemplate{ + NodeTemplate32Cpu(numNodes), + }, + } +} + +func ExecutorGpu(numNodes int64) *Cluster { + return &Cluster{ + NodeTemplates: []*NodeTemplate{ + NodeTemplateGpu(numNodes), + }, + } +} + +func NodeTemplate32Cpu(n int64) *NodeTemplate { + return &NodeTemplate{ + Number: n, + TotalResources: schedulerobjects.ResourceList{ + Resources: map[string]resource.Quantity{ + "cpu": resource.MustParse("32"), + "memory": resource.MustParse("256Gi"), + }, + }, + } +} + +func NodeTemplateGpu(n int64) *NodeTemplate { + return &NodeTemplate{ + Number: n, + TotalResources: schedulerobjects.ResourceList{ + Resources: map[string]resource.Quantity{ + "cpu": resource.MustParse("128"), + "memory": resource.MustParse("4096Gi"), + "nvidia.com/gpu": resource.MustParse("8"), + }, + }, + } +} + +func WithJobTemplatesQueue(queue *Queue, jobTemplate ...*JobTemplate) *Queue { + queue.JobTemplates = append(queue.JobTemplates, jobTemplate...) + return queue +} + +func WithIdJobTemplate(jobTemplate *JobTemplate, id string) *JobTemplate { + jobTemplate.Id = id + return jobTemplate +} + +func WithDependenciesJobTemplate(jobTemplate *JobTemplate, dependencyIds ...string) *JobTemplate { + jobTemplate.Dependencies = append(jobTemplate.Dependencies, dependencyIds...) 
+ return jobTemplate +} + +func WithMinSubmitTimeJobTemplate(jobTemplate *JobTemplate, minSubmitTime time.Duration) *JobTemplate { + jobTemplate.EarliestSubmitTime = minSubmitTime + return jobTemplate +} + +func JobTemplate32Cpu(n int64, jobSet, priorityClassName string) *JobTemplate { + return &JobTemplate{ + Number: n, + JobSet: jobSet, + PriorityClassName: priorityClassName, + Requirements: schedulerobjects.PodRequirements{ + ResourceRequirements: v1.ResourceRequirements{ + Requests: v1.ResourceList{ + "cpu": resource.MustParse("32"), + "memory": resource.MustParse("256Gi"), + }, + }, + }, + RuntimeDistribution: ShiftedExponential{Minimum: time.Minute}, + } +} + +func JobTemplate1Cpu(n int64, jobSet, priorityClassName string, id string) *JobTemplate { + return &JobTemplate{ + Number: n, + JobSet: jobSet, + Id: id, + PriorityClassName: priorityClassName, + Requirements: schedulerobjects.PodRequirements{ + ResourceRequirements: v1.ResourceRequirements{ + Requests: v1.ResourceList{ + "cpu": resource.MustParse("1"), + "memory": resource.MustParse("10Gi"), + }, + }, + }, + RuntimeDistribution: ShiftedExponential{Minimum: time.Minute}, + } +} + +func SubmitJob() *armadaevents.EventSequence_Event { + return &armadaevents.EventSequence_Event{ + Event: &armadaevents.EventSequence_Event_SubmitJob{ + SubmitJob: &armadaevents.SubmitJob{}, + }, + } +} + +func JobRunLeased() *armadaevents.EventSequence_Event { + return &armadaevents.EventSequence_Event{ + Event: &armadaevents.EventSequence_Event_JobRunLeased{ + JobRunLeased: &armadaevents.JobRunLeased{}, + }, + } +} + +func JobRunPreempted() *armadaevents.EventSequence_Event { + return &armadaevents.EventSequence_Event{ + Event: &armadaevents.EventSequence_Event_JobRunPreempted{ + JobRunPreempted: &armadaevents.JobRunPreempted{}, + }, + } +} + +func JobSucceeded() *armadaevents.EventSequence_Event { + return &armadaevents.EventSequence_Event{ + Event: &armadaevents.EventSequence_Event_JobSucceeded{ + JobSucceeded: &armadaevents.JobSucceeded{}, + }, + } +} + +func EventSequencesSummary(eventSequences []*armadaevents.EventSequence) string { + var sb strings.Builder + for i, eventSequence := range eventSequences { + sb.WriteString(EventSequenceSummary(eventSequence)) + if i != len(eventSequences)-1 { + sb.WriteString("\n") + } + } + return sb.String() +} + +func EventSequenceSummary(eventSequence *armadaevents.EventSequence) string { + var sb strings.Builder + sb.WriteString(fmt.Sprintf("EventSequence{Queue: %s, JobSetName: %s, Events: [", eventSequence.Queue, eventSequence.JobSetName)) + for i, event := range eventSequence.Events { + sb.WriteString(EventSummary(event)) + if i != len(eventSequence.Events)-1 { + sb.WriteString(", ") + } + } + sb.WriteString("]}") + return sb.String() +} + +func EventSummary(event *armadaevents.EventSequence_Event) string { + return strings.ReplaceAll(fmt.Sprintf("%T", event.Event), "*armadaevents.EventSequence_Event_", "") +} diff --git a/internal/scheduler/simulator/testdata/clusters/cpu_1_1_100.yaml b/internal/scheduler/simulator/testdata/clusters/cpu_1_1_100.yaml new file mode 100644 index 00000000000..6253602cf18 --- /dev/null +++ b/internal/scheduler/simulator/testdata/clusters/cpu_1_1_100.yaml @@ -0,0 +1,12 @@ +name: "1 CPU cluster with 100 nodes" +pools: + - name: "CPU" + clusterGroups: + - clusters: + - name: "cpu-01" + nodeTemplates: + - number: 100 + totalResources: + resources: + cpu: "32" + memory: "1024Gi" diff --git a/internal/scheduler/simulator/testdata/clusters/cpu_1_3_100.yaml 
b/internal/scheduler/simulator/testdata/clusters/cpu_1_3_100.yaml new file mode 100644 index 00000000000..b72678d4209 --- /dev/null +++ b/internal/scheduler/simulator/testdata/clusters/cpu_1_3_100.yaml @@ -0,0 +1,26 @@ +name: "3 CPU clusters with 100 nodes each in a single group" +pools: + - name: "CPU" + clusterGroups: + - clusters: + - name: "cpu-01" + nodeTemplates: + - number: 100 + totalResources: + resources: + cpu: "32" + memory: "1024Gi" + - name: "cpu-02" + nodeTemplates: + - number: 100 + totalResources: + resources: + cpu: "32" + memory: "1024Gi" + - name: "cpu-03" + nodeTemplates: + - number: 100 + totalResources: + resources: + cpu: "32" + memory: "1024Gi" diff --git a/internal/scheduler/simulator/testdata/clusters/tinyCluster.yaml b/internal/scheduler/simulator/testdata/clusters/tinyCluster.yaml new file mode 100644 index 00000000000..02269a49542 --- /dev/null +++ b/internal/scheduler/simulator/testdata/clusters/tinyCluster.yaml @@ -0,0 +1,23 @@ +name: "Tiny Cluster" +pools: + - name: "pool1" + clusterGroups: + - clusters: + - name: "pool1-0-0" + nodeTemplates: + - number: 1 + totalResources: + resources: + cpu: "32" + memory: "256Gi" + - name: "pool2" + clusterGroups: + - clusters: + - name: "pool2-0-0" + nodeTemplates: + - number: 1 + totalResources: + resources: + cpu: "128" + memory: "4096Gi" + nvidia.com/gpu: "8" diff --git a/internal/scheduler/simulator/testdata/clusters/tinyClusterAlt.yaml b/internal/scheduler/simulator/testdata/clusters/tinyClusterAlt.yaml new file mode 100644 index 00000000000..74858933101 --- /dev/null +++ b/internal/scheduler/simulator/testdata/clusters/tinyClusterAlt.yaml @@ -0,0 +1,21 @@ +name: "Tiny Cluster" +pools: + - name: "pool1" + clusterGroups: + - clusters: + - name: "" + nodeTemplates: + - number: 1 + totalResources: + resources: + cpu: "32" + memory: "256Gi" + - clusters: + - name: "" + nodeTemplates: + - number: 1 + totalResources: + resources: + cpu: "128" + memory: "4096Gi" + nvidia.com/gpu: "8" diff --git a/internal/scheduler/simulator/testdata/configs/basicSchedulingConfig.yaml b/internal/scheduler/simulator/testdata/configs/basicSchedulingConfig.yaml new file mode 100644 index 00000000000..6153509c001 --- /dev/null +++ b/internal/scheduler/simulator/testdata/configs/basicSchedulingConfig.yaml @@ -0,0 +1,29 @@ +maximumSchedulingRate: "+inf" +maximumSchedulingBurst: 9223372036854775807 +maximumPerQueueSchedulingRate: "+Inf" +maximumPerQueueSchedulingBurst: 9223372036854775807 +fairnessModel: "DominantResourceFairness" +dominantResourceFairnessResourcesToConsider: + - "cpu" + - "memory" + - "nvidia.com/gpu" + - "ephemeral-storage" +maximumResourceFractionToSchedule: + memory: 0.025 + cpu: 0.025 +indexedResources: + - name: "cpu" + resolution: "1" + - name: "memory" + resolution: "1Mi" + - name: "nvidia.com/gpu" + resolution: "1" +preemption: + nodeEvictionProbability: 1.0 + priorityClasses: + armada-default: + priority: 30000 + preemptible: false + armada-preemptible: + priority: 30000 + preemptible: true diff --git a/internal/scheduler/simulator/testdata/configs/defaultSchedulingConfig.yaml b/internal/scheduler/simulator/testdata/configs/defaultSchedulingConfig.yaml new file mode 100644 index 00000000000..01f5d4da7c9 --- /dev/null +++ b/internal/scheduler/simulator/testdata/configs/defaultSchedulingConfig.yaml @@ -0,0 +1,38 @@ +enableNewPreemptionStrategy: true +fairnessModel: "DominantResourceFairness" +dominantResourceFairnessResourcesToConsider: + - "cpu" + - "memory" + - "nvidia.com/gpu" + - "ephemeral-storage" 
+maxQueueLookback: 50000 +maximumResourceFractionToSchedule: + cpu: 1.0 +maximumSchedulingRate: 9223372036854775807 +maximumSchedulingBurst: 9223372036854775807 +maximumPerQueueSchedulingRate: 9223372036854775807 +maximumPerQueueSchedulingBurst: 9223372036854775807 +indexedResources: + - name: "cpu" + resolution: "1" + - name: "memory" + resolution: "1Mi" + - name: "nvidia.com/gpu" + resolution: "1" +preemption: + nodeEvictionProbability: 1.0 + nodeOversubscriptionEvictionProbability: 1.0 + protectedFractionOfFairShare: 1.0 + nodeIdLabel: kubernetes.io/hostname + priorityClasses: + armada-default: + priority: 1000 + preemptible: false + maximumResourceFractionPerQueue: + memory: 1.0 + cpu: 1.0 + armada-preemptible: + priority: 1000 + preemptible: true + defaultPriorityClass: armada-default + priorityClassNameOverride: armada-default \ No newline at end of file diff --git a/internal/scheduler/simulator/testdata/diva-plat.yaml b/internal/scheduler/simulator/testdata/diva-plat.yaml deleted file mode 100644 index a4106287879..00000000000 --- a/internal/scheduler/simulator/testdata/diva-plat.yaml +++ /dev/null @@ -1,30 +0,0 @@ -name: "DIVA-plat" -pools: - - name: "CPU" - executorGroups: - - executors: - - name: "Executor-CPU-1" - nodeTemplates: - - number: 1 - totalResources: - resources: - cpu: "1" - memory: "1Gi" - - name: "Executor-CPU-2" - nodeTemplates: - - number: 2 - totalResources: - resources: - cpu: "1" - memory: "1Gi" - - name: "GPU" - executorGroups: - - executors: - - name: "Executor-GPU" - nodeTemplates: - - number: 2 - totalResources: - resources: - cpu: "1" - memory: "1Gi" - ndivia.com/gpu: "1" \ No newline at end of file diff --git a/internal/scheduler/simulator/testdata/workloads/basicWorkload.yaml b/internal/scheduler/simulator/testdata/workloads/basicWorkload.yaml new file mode 100644 index 00000000000..5142e410fef --- /dev/null +++ b/internal/scheduler/simulator/testdata/workloads/basicWorkload.yaml @@ -0,0 +1,15 @@ +name: "Basic Workload" +queues: + - name: "A" + weight: 1 # = 1 / priorityFactor + jobTemplates: + - number: 10 + id: "myFirstJobTemplate" + priorityClassName: "armada-default" + requirements: + resourceRequirements: + requests: + cpu: 1 + memory: 10Gi + runtimeDistribution: + minimum: "1m" diff --git a/internal/scheduler/simulator/testdata/workloads/small_big/non-preemptible.yaml b/internal/scheduler/simulator/testdata/workloads/small_big/non-preemptible.yaml new file mode 100644 index 00000000000..6d26dcf639e --- /dev/null +++ b/internal/scheduler/simulator/testdata/workloads/small_big/non-preemptible.yaml @@ -0,0 +1,30 @@ +name: "Non-preemptible" +randomSeed: 12345 +queues: + - name: "small" + weight: 1.0 + jobTemplates: + - number: 38400 + priorityClassName: "armada-default" + requirements: + resourceRequirements: + requests: + cpu: 1 + memory: "32Gi" + runtimeDistribution: + minimum: "10m" + tailMean: "1m" + - name: "big" + weight: 1.0 + jobTemplates: + - number: 10 + priorityClassName: "armada-default" + requirements: + resourceRequirements: + requests: + cpu: 32 + memory: "1024Gi" + earliestSubmitTime: "30m" + runtimeDistribution: + minimum: "60m" + tailMean: "6m" diff --git a/internal/scheduler/simulator/testdata/workloads/small_big/only-big.yaml b/internal/scheduler/simulator/testdata/workloads/small_big/only-big.yaml new file mode 100644 index 00000000000..d1929028fea --- /dev/null +++ b/internal/scheduler/simulator/testdata/workloads/small_big/only-big.yaml @@ -0,0 +1,17 @@ +name: "Only big" +randomSeed: 12345 +queues: + - name: "big" + weight: 1.0 + 
jobTemplates: + - number: 10 + priorityClassName: "armada-default" + requirements: + resourceRequirements: + requests: + cpu: 32 + memory: "1024Gi" + earliestSubmitTime: "30m" + runtimeDistribution: + minimum: "60m" + tailMean: "6m" diff --git a/internal/scheduler/simulator/testdata/workloads/small_big/only-small.yaml b/internal/scheduler/simulator/testdata/workloads/small_big/only-small.yaml new file mode 100644 index 00000000000..d21bb0ba6d8 --- /dev/null +++ b/internal/scheduler/simulator/testdata/workloads/small_big/only-small.yaml @@ -0,0 +1,16 @@ +name: "Only small" +randomSeed: 12345 +queues: + - name: "small" + weight: 1.0 + jobTemplates: + - number: 38400 + priorityClassName: "armada-default" + requirements: + resourceRequirements: + requests: + cpu: 1 + memory: "32Gi" + runtimeDistribution: + minimum: "10m" + tailMean: "1m" diff --git a/internal/scheduler/simulator/testdata/workloads/small_big/preemptible.yaml b/internal/scheduler/simulator/testdata/workloads/small_big/preemptible.yaml new file mode 100644 index 00000000000..c2adac68bb5 --- /dev/null +++ b/internal/scheduler/simulator/testdata/workloads/small_big/preemptible.yaml @@ -0,0 +1,30 @@ +name: "Preemptible" +randomSeed: 12345 +queues: + - name: "small" + weight: 1.0 + jobTemplates: + - number: 38400 + priorityClassName: "armada-preemptible" + requirements: + resourceRequirements: + requests: + cpu: 1 + memory: "32Gi" + runtimeDistribution: + minimum: "10m" + tailMean: "1m" + - name: "big" + weight: 1.0 + jobTemplates: + - number: 10 + priorityClassName: "armada-preemptible" + requirements: + resourceRequirements: + requests: + cpu: 32 + memory: "1024Gi" + earliestSubmitTime: "30m" + runtimeDistribution: + minimum: "60m" + tailMean: "6m" diff --git a/magefiles/proto.go b/magefiles/proto.go index da07513d0df..b41ef489d82 100644 --- a/magefiles/proto.go +++ b/magefiles/proto.go @@ -91,6 +91,7 @@ func protoGenerate() error { "pkg/api/*.proto", "pkg/armadaevents/*.proto", "internal/scheduler/schedulerobjects/*.proto", + "internal/scheduler/simulator/*.proto", "pkg/api/lookout/*.proto", "pkg/api/binoculars/*.proto", "pkg/api/jobservice/*.proto",
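The tests above exercise generateRandomShiftedExponentialDuration, whose implementation is not included in this patch. A minimal sketch consistent with the ShiftedExponential message and with the expectations in TestGenerateRandomShiftedExponentialDuration might look as follows; the exact signature and the use of math/rand's ExpFloat64 are assumptions rather than the code actually added by this change.

    package simulator

    import (
        "math/rand"
        "time"
    )

    // generateRandomShiftedExponentialDuration draws a duration from the
    // shifted-exponential distribution described in simulator.proto:
    // a fixed minimum plus an exponentially distributed tail with the given mean.
    func generateRandomShiftedExponentialDuration(r *rand.Rand, d ShiftedExponential) time.Duration {
        // ExpFloat64 has mean 1, so scaling by TailMean yields an exponential
        // draw with mean TailMean; a zero TailMean returns exactly d.Minimum,
        // matching the equality asserted in the test.
        return d.Minimum + time.Duration(r.ExpFloat64()*float64(d.TailMean))
    }

The same ShiftedExponential message is reused for the workflow-manager and pending delay distributions in ClusterSpec, so a helper along these lines would cover those fields as well.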