Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Max instance lifetime #839

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

nitrocode
Copy link
Contributor

Related to PR #836

If this number is set to 86400 or greater, the instance will terminated.

@yob
Copy link
Contributor

yob commented May 11, 2021

Taking a quote from #836:

Isn't there a lifecycle hook in the stack that allows jobs to finish before continuing the termination ? If so, I believe the instance refresh respects this... but perhaps there isn't a hook. I don't see one defined in the cf stack.

In v4 of the stack and earlier there was a lifecycle hook to help make instance shutdown more graceful and give the agents time to finish jobs. It was essential in those versions, as we used AWS driven autoscaling, and any instance (even those running jobs) could be marked for termination if the ASG decided it wanted to scale down.

In v5 we fully switched to the current method of scale down, where the agent self terminates the instance after an idle period. We removed the lifecycle hooks, with the understanding that they weren't needed any more.

It's possible that we may want to add them back though, as there are still scenarios where AWS can decide to terminate an instance and the instance could be running job(s). One possibility is AZ rebalancing (which we worked a round in #751). Another could be introducing instance lifetimes.

Without lifecycle hooks, I think this PR is highly likely to interrupt running jobs. Sorry!

I think it's a neat feature though, and think we have customers that would use it.

@ptarjan
Copy link

ptarjan commented Aug 26, 2021

I would love to use this at Robinhood. Right now I have to set MinSize to 0 and hope that people don't use our agents all night to force us to recycle agents, but something like this would be really useful for us.

@keithduncan keithduncan added the agent lifecycle Agent boot, job lifecycle, agent shutdown label Sep 6, 2021
@jessebye
Copy link

@nitrocode there's a merge conflict, would you mind updating your branch?

@jessebye
Copy link

jessebye commented Sep 23, 2021

@keithduncan @yob anything blocking this? Just curious if this needs changes or is good to ship.

Without lifecycle hooks, I think this PR is highly likely to interrupt running jobs. Sorry!

Couldn't this still be added? The default behavior would be the same, and if a user enabled this, they are doing so at their own risk...

@keithduncan
Copy link
Contributor

keithduncan commented Sep 23, 2021

anything blocking this?

Better support for ASG initiated instance termination is the blocker I’m afraid.

Couldn't this still be added? The default behavior would be the same, and if a user enabled this, they are doing so at their own risk...

They could! And they are the approach I’m expecting us to use for this.

Unfortunately I’m not comfortable merging this without a reliable way to handle ASG initiated instance termination. The absence of this today already causes strife so I’m very hesitant to add another source of it. For those who are comfortable with the risk, it’s possible to apply a forked copy of this template to your stacks.

I’d also be thrilled to collaborate on a Lambda + SSM based lifecycle hooks implementation for the ASG events 😄 I believe it could solve several problems at once (e.g. #927 #838) if proven to be reliable, and enable features like agent instance quarantine/cordoning.

@@ -1054,6 +1054,8 @@ Resources:
- OldestLaunchConfiguration
- ClosestToNextInstanceHour
MaxInstanceLifetime: !If [UseMaxInstanceLifetime, !Ref MaxInstanceLifetime, !Ref "AWS::NoValue"]
NewInstancesProtectedFromScaleIn: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nitrocode I suspect (but haven’t tested) adding this by merging master will break the MaxInstanceLifetime functionality because none of the instances are eligible for ASG initiated termination. I’m not sure if you or anyone else are running stacks from this branch’s template but wanted to call it out in case you are 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
agent lifecycle Agent boot, job lifecycle, agent shutdown asg-initiated-termination
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants