Zero downtime deployments for stack updates #1010

Open
joeljeske opened this issue Apr 21, 2022 · 2 comments

Is your feature request related to a problem? Please describe.

I'm always frustrated when a stack update forces downtime on the agent pool. When I deploy updates, all agents in that group are terminated and replaced. This leads to jobs failing with "Exited with status -1 (agent lost)". I then have to manually restart all those jobs or rely on users to do so.

Describe the solution you'd like

I would like agents to drain their workload before being terminated and replaced during a stack update.

Describe alternatives you've considered

  • Performing the stack update during non-peak hours.
  • Manually creating an adjacent stack, migrating to it, and then turning off the original stack.

Additional context

Perhaps using AWS lifecycle hooks to put instances in a Terminating:Wait state to allow draining would be helpful.
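
A rough sketch of the lifecycle-hook idea, assuming boto3's Auto Scaling client is used. The hook name `drain-agent`, the helper names, and the `drain` callable are hypothetical; how the agent actually finishes its running job is not shown here and would need to be supplied.

```python
from typing import Any, Callable, Dict

# Transition name used by Auto Scaling for instances about to be terminated.
TERMINATING = "autoscaling:EC2_INSTANCE_TERMINATING"


def termination_hook_params(asg_name: str, hook_name: str = "drain-agent",
                            heartbeat_timeout: int = 3600) -> Dict[str, Any]:
    """Build keyword arguments for the client's put_lifecycle_hook call."""
    # The API documents a HeartbeatTimeout range of 30 to 7200 seconds.
    if not 30 <= heartbeat_timeout <= 7200:
        raise ValueError("heartbeat_timeout must be between 30 and 7200 seconds")
    return {
        "AutoScalingGroupName": asg_name,
        "LifecycleHookName": hook_name,
        "LifecycleTransition": TERMINATING,
        "HeartbeatTimeout": heartbeat_timeout,
        # CONTINUE lets termination proceed if nobody completes the action in time.
        "DefaultResult": "CONTINUE",
    }


def drain_then_release(client: Any, asg_name: str, hook_name: str,
                       instance_id: str, drain: Callable[[], None]) -> None:
    """Meant to run while the instance sits in Terminating:Wait."""
    drain()  # hypothetical: blocks until the agent has finished its current job
    client.complete_lifecycle_action(
        AutoScalingGroupName=asg_name,
        LifecycleHookName=hook_name,
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",
    )
```

With a real client this would be wired up roughly as `client = boto3.client("autoscaling")` followed by `client.put_lifecycle_hook(**termination_hook_params("my-asg"))`, where `my-asg` is a placeholder. A job that outlasts the heartbeat timeout would still be cut off unless something extends the heartbeat.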

Alternatively, I could detach all instances from the ASG before the stack update, but I then have the problem of determining when those agents are drained and can be terminated. It would help if the buildkite-agent service could detect the detached state and then drain its workload.
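
The detach approach could look something like the sketch below, again assuming a boto3 Auto Scaling client. The helper names are hypothetical, the 20-ID batch size is my reading of the per-call limit in the API docs, and nothing here is an existing buildkite-agent feature; `is_detached` is only what such a check might poll.

```python
from typing import Any, Iterator, List, Sequence

MAX_IDS_PER_CALL = 20  # assumed per-call limit for detach_instances


def chunks(ids: Sequence[str], size: int = MAX_IDS_PER_CALL) -> Iterator[List[str]]:
    """Yield successive batches of at most `size` instance IDs."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def detach_all(client: Any, asg_name: str, instance_ids: Sequence[str]) -> None:
    """Detach every instance from the ASG without terminating it."""
    for batch in chunks(instance_ids):
        client.detach_instances(
            AutoScalingGroupName=asg_name,
            InstanceIds=batch,
            # False keeps desired capacity, so the ASG launches replacements.
            ShouldDecrementDesiredCapacity=False,
        )


def is_detached(client: Any, instance_id: str) -> bool:
    """True if the instance no longer belongs to an ASG (or is leaving one)."""
    resp = client.describe_auto_scaling_instances(InstanceIds=[instance_id])
    records = resp["AutoScalingInstances"]
    # No record at all means the instance is not tracked by any ASG.
    return all(r["LifecycleState"] in ("Detaching", "Detached") for r in records)
```

This still leaves the original problem open: something has to notice that `is_detached` is true, wait for the running job to finish, and then terminate the instance.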


scadu commented Jun 13, 2023

I guess this could be controlled by CloudFormation and how it applies changes to the ASG:

It was configured like this 6 years ago, and there was most probably a reason for doing it that way, but I'm not sure if it's still relevant.
I haven't found any confirmation in this repo or the agent's repo.


scadu commented Jun 14, 2023

I guess this could be controlled by CloudFormation and how it applies changes to the ASG:

It was configured like this 6 years ago, and there was most probably a reason for doing it that way, but I'm not sure if it's still relevant. I haven't found any confirmation in this repo or the agent's repo.

OK, the reason is explained here: #764 (comment)
