Instance lifetime timeout #836

dbaggerman · 2021-05-04T00:56:10Z

This is something we've been running in our version of the elastic stack, which I thought might be of interest upstream. This is pretty much cut and paste from what we do internally, and while this behaviour (with these timeouts) may or may not be ideal as defaults for everyone I thought it would be worthwhile to start a discussion.

We have one agent queue which acts as a shared pool for a variety of tasks. These instances are also configured to run several agents per instance. Together, these meant that even with a reasonably short idle-timeout, instances could often go days (if not weeks) without hitting their idle timeout, until the disk filled up and jobs started failing. At that point we'd have to go and kill the instances manually.

After several attempts at managing disk space on long-running instances, this is the solution we came up with. We haven't had any of that kind of problem since implementing this, although it is what ultimately led to buildkite/buildkite-agent-scaler/issues/39.

What this does is start a pair of systemd timers when the instance starts. After three hours, the first timer will be reached and trigger a job which sends a TERM signal to the agent telling it to stop accepting new jobs. Once the running jobs complete the agent will shut down, triggering the standard shutdown behaviour.

If any builds are in a stuck/hung state (or otherwise still running after 21+ hours), then the second timer will be reached once the instance has been alive for 24 hours (21 hours after the soft stop). This timer tells systemd to stop the buildkite-agent service (forcefully if necessary), which again results in the instance shutting down so it can be replaced.
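
As a rough illustration of the two-timer idea described above, here is a minimal sketch using transient timers created with `systemd-run`. It is not the PR's actual implementation: the timer unit names (`agent-soft-stop`, `agent-hard-stop`), the environment variable names, and the `buildkite-agent.service` unit name are assumptions. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: not the PR's actual units. Timer names are hypothetical.
set -eu

SOFT_STOP_AFTER="${SOFT_STOP_AFTER:-3h}"   # soft stop, per the description
HARD_STOP_AFTER="${HARD_STOP_AFTER:-24h}"  # hard stop, per the description
AGENT_UNIT="${AGENT_UNIT:-buildkite-agent.service}"  # assumed unit name

# Print each command by default; execute it only when RUN=1.
run_or_print() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

schedule_lifetime_timers() {
  # Soft stop: TERM to the agent's main process so it stops taking new jobs.
  run_or_print systemd-run --unit=agent-soft-stop --on-boot="$SOFT_STOP_AFTER" \
    systemctl kill --kill-who=main --signal=TERM "$AGENT_UNIT"
  # Hard stop: stop the service outright, even if jobs are still running.
  run_or_print systemd-run --unit=agent-hard-stop --on-boot="$HARD_STOP_AFTER" \
    systemctl stop "$AGENT_UNIT"
}

schedule_lifetime_timers
```

Both timers are measured from boot (`--on-boot`), so the gap between the soft and hard stop follows from the two values rather than being configured separately.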

This adds a timeout based on instance lifetime, intended to be a supplement to idle-timeout in the agent itself. It prevents busy agents from living forever if they never get enough idle time to turn themselves off.

nitrocode · 2021-05-04T09:11:45Z

This is a creative solution. However, periodically restarting the agent seems like a band-aid fix for the agent failing to clean up after itself.

I wonder how the agents currently manage disk size.

yob

Thanks @dbaggerman! I've shared this internally to collect input from folks.

I assume most elastic stack users rely on the instances self-terminating after the idle timeout to avoid similar issues - does that work less effectively for you because you use AgentsPerInstance>1 and you need all agents on the instance to reach the idle timeout for an instance to self-terminate?

yob · 2021-05-04T13:22:25Z

> I wonder how the agents currently manage disk size.

Builds will early-fail if a docker image prune is unable to free up enough space:

elastic-ci-stack-for-aws/packer/linux/conf/buildkite-agent/hooks/environment

Lines 25 to 35 in bffd450

    
    if ! /usr/local/bin/bk-check-disk-space.sh ; then
      echo "Cleaning up docker resources older than ${DOCKER_PRUNE_UNTIL:-4h}"
      docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL:-4h}"
      echo "Checking disk space again"
      if ! /usr/local/bin/bk-check-disk-space.sh ; then
        echo "Disk health checks failed" >&2
        exit 1
      fi
    fi

.. and an hourly cron job will cause the instance to fail a healthcheck if the same thing happens:

elastic-ci-stack-for-aws/packer/linux/conf/docker/cron.hourly/docker-low-disk-gc

Lines 29 to 38 in bffd450

    
    if ! /usr/local/bin/bk-check-disk-space.sh ; then
      echo "Cleaning up docker resources older than ${DOCKER_PRUNE_UNTIL}"
      docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL}"
      docker builder prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL}"
      if ! /usr/local/bin/bk-check-disk-space.sh ; then
        echo "Disk health checks failed" >&2
        exit 1
      fi
    fi
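
Both snippets depend on `bk-check-disk-space.sh`, which isn't shown here. For readers wondering what such a check might look like, the following is a hypothetical stand-in, not the real script: the function names, the `MIN_FREE_KB` variable, and the 5 GiB threshold are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for a disk check; not the real bk-check-disk-space.sh.
set -eu

# Print the free space, in KiB, on the filesystem holding the given path.
disk_free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed when at least min_kb KiB are free under path.
has_free_space() {
  path="$1"
  min_kb="$2"
  [ "$(disk_free_kb "$path")" -ge "$min_kb" ]
}

# Example threshold (an assumption): 5 GiB free on the root filesystem.
if has_free_space / "${MIN_FREE_KB:-5242880}"; then
  echo "disk ok"
else
  echo "disk low: prune, then check again" >&2
fi
```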

No doubt there are many other ways to consume disk space that we currently have no tooling in place to mitigate, though.

dbaggerman · 2021-05-05T00:56:50Z

> I assume most elastic stack users rely on the instances self-terminating after the idle timeout to avoid similar issues - does that work less effectively for you because you use AgentsPerInstance>1 and you need all agents on the instance to reach the idle timeout for an instance to self-terminate?

We mostly rely on the idle timeout as well, but we have one queue where we used to regularly have instances alive for days despite having the idle time configured. I suspect that having multiple agents on the instance and needing them all idle at the same time to reach the timeout contributed to that, although the queue is shared by builds/teams in different timezones so it can have jobs queued around the clock.

Also, while disk space is the most common problem, we have run into other problems on long-running instances as well. For example, at one point we were accumulating running containers. It turned out that sidecar containers were being started by docker-compose for some jobs, which weren't explicitly calling docker-compose down to stop them.
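
One way to guard against leftover sidecars like these is to tie the teardown to the end of the job step. The sketch below is an assumption about how that could be done, not something from the PR: `run_with_compose_cleanup` and the overridable `COMPOSE` variable are hypothetical names, and `--remove-orphans` is optional.

```shell
#!/bin/sh
# Sketch: always tear down docker-compose sidecars when a job step ends.
set -eu

# Overridable so the sketch can be dry-run without Docker (e.g. COMPOSE=echo).
COMPOSE="${COMPOSE:-docker-compose}"

# Run a command in a subshell whose EXIT trap stops and removes the sidecars,
# whether the command passes or fails.
run_with_compose_cleanup() {
  (
    trap '$COMPOSE down --remove-orphans' EXIT
    "$@"
  )
}

# Usage (hypothetical service name):
#   run_with_compose_cleanup docker-compose run --rm app make test
```

Because the trap runs in its own subshell, the wrapped command's exit status is still what the caller sees.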

> .. and an hourly cron job will cause the instance to fail a healthcheck if the same thing happens:

Being hourly, that can result in a lot of jobs failing for up to an hour before the next time the cron runs. Running it more frequently would make it less painful when it does occur, but would still be reacting to a problem rather than proactively preventing it.

nitrocode · 2021-05-05T01:35:29Z

We had this problem on jenkins and the cloud node plugin has a feature to terminate the instance after x builds have run. We set it to 100 builds and we have a daily instance rotation.

An even simpler method could be configuring the MaxInstanceLifetime on the asg.

MaxInstanceLifetime
The maximum amount of time, in seconds, that an instance can be in service. The default is null. If specified, the value must be either 0 or a number equal to or greater than 86,400 seconds (1 day).
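
To make the constraint in that description concrete, here is a small sketch that validates a candidate value (0, or at least 86,400 seconds) and prints the AWS CLI command that would apply it. The function names and the group name `my-buildkite-asg` are placeholders; the command is printed rather than executed.

```shell
#!/bin/sh
# Sketch: validate a MaxInstanceLifetime value, then print the AWS CLI call.
set -eu

# Valid values, per the quoted documentation: 0, or at least 86400 seconds.
valid_max_instance_lifetime() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$1" -eq 0 ] || [ "$1" -ge 86400 ]
}

# Print (not run) the command; the group name is a placeholder.
print_update_command() {
  asg_name="$1"
  lifetime="$2"
  if ! valid_max_instance_lifetime "$lifetime"; then
    echo "invalid MaxInstanceLifetime: $lifetime" >&2
    return 1
  fi
  printf 'aws autoscaling update-auto-scaling-group --auto-scaling-group-name %s --max-instance-lifetime %s\n' \
    "$asg_name" "$lifetime"
}

print_update_command my-buildkite-asg 86400
```

Note that the one-day minimum means this option cannot express the three-hour soft stop used in the PR.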

dbaggerman · 2021-05-05T03:36:31Z

> We had this problem on jenkins and the cloud node plugin has a feature to terminate the instance after x builds have run. We set it to 100 builds and we have a daily instance rotation.

The buildkite agent has a disconnect-after-job setting to terminate after a single job. Extending this to disconnect after N jobs, or after a specified duration, may be an alternative way to achieve the same result.

> An even simpler method could be configuring the MaxInstanceLifetime on the asg.

This may have the side-effect of interrupting any jobs in progress on the agent. You don't get much of a grace period for running jobs to finish before the instance terminates.

nitrocode · 2021-05-05T14:02:13Z

> This may have the side-effect of interrupting any jobs in progress on the agent. You don't get much of a grace period for running jobs to finish before the instance terminates.

Isn't there a lifecycle hook in the stack that allows jobs to finish before continuing the termination ? If so, I believe the instance refresh respects this... but perhaps there isn't a hook. I don't see one defined in the cf stack.

dbaggerman · 2021-05-06T00:43:12Z

> Isn't there a lifecycle hook in the stack that allows jobs to finish before continuing the termination ? If so, I believe the instance refresh respects this... but perhaps there isn't a hook. I don't see one defined in the cf stack.

The elastic stack image does include https://github.com/buildkite/lifecycled to manage the lifecycle events, which is probably what you're thinking of. Even with that though, the autoscaling group will only let you delay the shutdown for an hour or so. We have jobs that run much longer than that, which would still get interrupted, although using it as an alternative to the hard timer in the PR would work.

Instance lifetime timeout

5ac5161

yob reviewed May 4, 2021

nitrocode mentioned this pull request May 6, 2021

Max instance lifetime #839

Open

keithduncan added the "agent lifecycle" label (Agent boot, job lifecycle, agent shutdown) Sep 6, 2021

keithduncan added the asg-initiated-termination label Oct 14, 2021
