Scaling strategy to limit number of Machine pending or provisioning #8808

Closed
lentzi90 opened this issue Jun 7, 2023 · 6 comments
Labels
area/machine Issues or PRs related to machine lifecycle management kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on.


@lentzi90
Contributor

lentzi90 commented Jun 7, 2023

What would you like to be added (User Story)?

As an operator, I would like to control how fast new Machines are created when I create large clusters, to avoid overwhelming controllers and infrastructure.

Detailed Description

I want a way to limit the number of Machines that are pending or provisioning at any one time. Currently, when creating large clusters, we start out small and scale up gradually to avoid issues. However, this could easily be automated and solved for all providers if it were built into CAPI.

In the Bare Metal Operator we have a PROVISIONING_LIMIT for exactly this reason. It limits the number of BareMetalHosts that are provisioned simultaneously. Having something similar in CAPI would be very useful.

I'm not sure where it would make sense to add this option though. It could be set on the Cluster, the KCP and/or the MachineDeployment, for example, to get granular control. Or it could be a flag for the controllers. What do you think would work best?
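
To make the request more concrete, here is a minimal sketch of the kind of check I have in mind. It is not existing CAPI code: `allowedCreates`, the `limit` parameter and the local `Phase` type are hypothetical names for illustration only, and the phase strings simply mirror the Machine phases CAPI already reports. The idea is that a controller would only create as many new Machines as fit within the remaining "in-flight" budget.

```go
package main

import "fmt"

// Phase is a local stand-in for the Machine phase string (assumption:
// values mirror the phases CAPI reports, but this type is not CAPI's own).
type Phase string

const (
	PhasePending      Phase = "Pending"
	PhaseProvisioning Phase = "Provisioning"
	PhaseRunning      Phase = "Running"
)

// inFlight counts Machines that are still pending or provisioning.
func inFlight(phases []Phase) int {
	n := 0
	for _, p := range phases {
		if p == PhasePending || p == PhaseProvisioning {
			n++
		}
	}
	return n
}

// allowedCreates returns how many new Machines may be created right now
// so that at most `limit` Machines are pending or provisioning.
// A limit of 0 or less is treated as "no limit" (hypothetical convention).
func allowedCreates(phases []Phase, desired, limit int) int {
	missing := desired - len(phases)
	if missing <= 0 {
		return 0 // already at or above the desired replica count
	}
	if limit <= 0 {
		return missing
	}
	budget := limit - inFlight(phases)
	if budget <= 0 {
		return 0 // wait for in-flight Machines to finish first
	}
	if budget < missing {
		return budget
	}
	return missing
}

func main() {
	current := []Phase{PhaseRunning, PhaseRunning, PhaseProvisioning, PhasePending}
	// 10 desired, 4 exist, 2 in flight, limit 5: room for 3 more.
	fmt.Println(allowedCreates(current, 10, 5))
}
```

Where the limit would come from (Cluster, KCP, MachineDeployment or a controller flag) is exactly the open question above; the check itself would look much the same in each case.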

Anything else you would like to add?

No response

Label(s) to be applied

/kind feature
/area machine

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. area/machine Issues or PRs related to machine lifecycle management needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 7, 2023
@killianmuldoon
Contributor

/triage accepted

This is an interesting idea, similar to what was implemented for upgrades in #8432. Maybe these could be generalized into an overall rollout strategy.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 7, 2023
@fabriziopandini
Member

As per the 14th Jun office hours discussion, we should discuss the UX/API in a doc.

@lentzi90
Contributor Author

Older related issue that was closed due to inactivity: #4022
Linking for reference.

@lentzi90
Contributor Author

I have started on a document here: https://docs.google.com/document/d/1FjX5rQGYHCyDqRdANWcAWP4AmoxudoPl8IARjXylGzY/edit?usp=sharing
Please check it and comment/edit 🙂

@lentzi90 lentzi90 changed the title Provisioning strategy to limit number of Machine pending or provisioning Scaling strategy to limit number of Machine pending or provisioning Jun 28, 2023
@fabriziopandini
Member

cc @vincepri @enxebre

@lentzi90
Contributor Author

lentzi90 commented Jan 8, 2024

We ended up setting limits in the "cloud" provider instead. Granted, this does not solve the issue for the controllers, but they can be handled pretty well with existing config options (e.g. concurrency, resources and rate limits).
Closing (but feel free to reopen if there is interest in continuing this in the community).

@lentzi90 lentzi90 closed this as completed Jan 8, 2024