NewInstancesProtectedFromScaleIn causing ASGs to take ages to update. #977

Open
toothbrush opened this issue Dec 22, 2021 · 6 comments
Comments

@toothbrush
Contributor

Good day! 👋

Describe the bug

We use realestate's stackup to manage rollouts of the aws-stack.yml you provide, and it mostly works great. In 178c253 you added NewInstancesProtectedFromScaleIn: true to the ASG. The behaviour I'm now seeing is that when I change the AWS Elastic stack (e.g. an updated AMI ID), the old ASG takes ages (1h+) to delete/stabilise, because its members are protected from scale-in.

Steps To Reproduce
Steps to reproduce the behavior:

  1. Spin up the AWS Elastic stack
  2. Wait for it to be ready and CREATE_COMPLETE
  3. Change a parameter, e.g. AMI ID
  4. A new LaunchTemplate will be created and the old one will attempt to delete (along with its instances), but the delete will hang for a long time (1h+ in our case).
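
To confirm that scale-in protection is what is holding up the old ASG, something like the following could be run against it while the update hangs. This is a minimal sketch, not part of the Elastic CI stack: the function name `scale_in_protection_summary` and the group name in the usage comment are hypothetical, and `client` is assumed to be a boto3 `autoscaling` client.

```python
def scale_in_protection_summary(client, asg_name):
    """Summarise scale-in protection for one Auto Scaling group.

    `client` is assumed to be boto3.client("autoscaling").
    Returns None if the group does not exist.
    """
    response = client.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    groups = response["AutoScalingGroups"]
    if not groups:
        return None

    group = groups[0]
    instances = group.get("Instances", [])
    # Each instance entry carries its own ProtectedFromScaleIn flag.
    protected = [i["InstanceId"] for i in instances if i.get("ProtectedFromScaleIn")]
    return {
        # Group-level default applied to newly launched instances.
        "new_instances_protected": group.get("NewInstancesProtectedFromScaleIn", False),
        "total_instances": len(instances),
        "protected_instances": len(protected),
    }


# Usage (hypothetical group name):
#   import boto3
#   print(scale_in_protection_summary(boto3.client("autoscaling"), "my-old-asg"))
```

If `protected_instances` equals `total_instances` while the group is being scaled to zero, the hang is probably the protection described above rather than something else in the stack.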

Expected behavior

Previously, updates were pretty snappy because the old ASG members would just be terminated.

Stack parameters (please complete the following information):

  • AWS Region: us-east-1
  • Version: v5.7.2
@toothbrush
Contributor Author

Ah, just looking around, maybe I'm in fact being re-bitten by #927?

In any case, I'm not convinced that the instance protection thing is good for us.

@eleanorakh
Contributor

Hey hey @toothbrush. Thanks for reporting, we'll look into it!

@freewil

freewil commented Jan 24, 2022

Somewhat related is #768

I've also been working on creating custom AMIs and updating the stack via ImageIdParameter. This causes a new ASG to be created, which is a problem in my case: the instances in the old ASG get terminated, which causes in-progress jobs to fail.

@freewil

freewil commented Jan 24, 2022

> when i make a change to the AWS Elastic stack (e.g. an updated AMI ID), the old ASG takes ages (1h+) to delete/stabilise, since the members are protected from scale-in.

I've run into this issue myself when I want to scale down a stack rapidly (~400 instances to 0) by manually changing the ASG desired count/min/max values. It hasn't been a major issue for me, as this is typically only done when launching new stacks to replace old stacks. I simply go to the instance management tab for the ASG in the AWS console and manually remove scale-in protection to speed up the scale-down.

@huguesb

huguesb commented Jun 10, 2022

I am running into this issue with v5.9.0 of the stack and have had to repeatedly go into the AWS console to manually disable protection for instances in the old stacks to allow the update to complete. This is obnoxious! It makes even the smallest configuration changes a major PITA, especially when some stacks have thousands of instances and AWS only allows removal of scale-in protection in batches of 50...

Please fix this asap.
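
Until there is a fix in the stack itself, the console workaround described in this thread could be scripted. The sketch below is one way to do it, assuming `client` is a boto3 `autoscaling` client. The function name `remove_scale_in_protection` and the group name in the usage comment are hypothetical, and the default batch size of 50 simply mirrors the batch limit mentioned above.

```python
def remove_scale_in_protection(client, asg_name, batch_size=50):
    """Clear scale-in protection on every protected instance in one ASG.

    `client` is assumed to be boto3.client("autoscaling").
    Returns the number of instances that were unprotected.
    """
    response = client.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    protected = [
        instance["InstanceId"]
        for group in response["AutoScalingGroups"]
        for instance in group.get("Instances", [])
        if instance.get("ProtectedFromScaleIn")
    ]

    # Send the instance IDs in slices so no single call exceeds the batch limit.
    for start in range(0, len(protected), batch_size):
        client.set_instance_protection(
            AutoScalingGroupName=asg_name,
            InstanceIds=protected[start:start + batch_size],
            ProtectedFromScaleIn=False,
        )
    return len(protected)


# Usage (hypothetical group name):
#   import boto3
#   remove_scale_in_protection(boto3.client("autoscaling"), "my-old-asg")
```

Note that this removes protection from every instance in the group, including ones that may still be running jobs, so it fits best when the old group is being retired anyway.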

@gitlon

gitlon commented Aug 25, 2022

We made a hacky but effective fix for this problem by co-opting the AzRebalancingSuspenderFunction to remove scale-in protection from running instances when the stack is updated or deleted. We're able to do this in our solution because we fork the ElasticCI template for other reasons. This also required changes to the function's role/permissions and timeout/duration.

e.g.:

    import boto3

    client = boto3.client('autoscaling')
    props = event['ResourceProperties']
    asg_name = props['AutoScalingGroupName']

    if event['RequestType'] in ('Delete', 'Update'):
      # describe_auto_scaling_instances is paginated, so walk every page
      paginator = client.get_paginator('describe_auto_scaling_instances')
      instances = [i['InstanceId'] for page in paginator.paginate()
                   for i in page['AutoScalingInstances'] if i['AutoScalingGroupName'] == asg_name]
      # set_instance_protection accepts at most 50 instance IDs per call
      for n in range(0, len(instances), 50):
        response = client.set_instance_protection(InstanceIds=instances[n:n + 50], AutoScalingGroupName=asg_name, ProtectedFromScaleIn=False)
    else:
      # Any other request type (i.e. Create): suspend AZ rebalancing
      response = client.suspend_processes(AutoScalingGroupName=asg_name, ScalingProcesses=['AZRebalance'])
    
etc
