Instance lifetime timeout #836

Open
wants to merge 1 commit into
base: main

Conversation

dbaggerman
Contributor

This is something we've been running in our version of the elastic stack, which I thought might be of interest upstream. This is pretty much cut and paste from what we do internally, and while this behaviour (with these timeouts) may or may not be ideal as defaults for everyone I thought it would be worthwhile to start a discussion.

We have one agent queue which acts as a shared pool for a variety of tasks. These instances are also configured to run several agents per instance. The combination of these two things means that even with a reasonably short idle-timeout, instances could often go days (if not weeks) without hitting their idle timeout, until the disk filled up and jobs started failing. At that point we'd have to go and kill them manually.

After several attempts at managing disk space on long-running instances, this is the solution we came up with. We haven't had any of that kind of problem since implementing this, although it is what ultimately led to buildkite/buildkite-agent-scaler/issues/39.

What this does is start a pair of systemd timers when the instance starts. After three hours, the first timer fires and triggers a job which sends a TERM signal to the agent, telling it to stop accepting new jobs. Once the running jobs complete, the agent shuts down, triggering the standard shutdown behaviour.

If any builds are in a stuck/hung state (or otherwise still running after 21+ hours), then the second timer fires once the instance has been alive for 24 hours (21 hours after the soft stop). This timer tells systemd to stop the buildkite-agent service (forcefully if necessary), which again results in the instance shutting down so it can be replaced.
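
As a rough illustration of the two-timer idea described above, here is a minimal sketch using transient timers created with systemd-run. It is not the code from this PR: the unit names (agent-soft-stop, agent-hard-stop), the DRY_RUN switch, and the choice to deliver the signal with `systemctl kill --kill-who=main` are all assumptions made for the example.

```shell
# Sketch only: unit names and the DRY_RUN switch are hypothetical, not taken from the PR.
SOFT_STOP_AFTER="${SOFT_STOP_AFTER:-3h}"    # soft stop: agent stops accepting new jobs
HARD_STOP_AFTER="${HARD_STOP_AFTER:-24h}"   # hard stop: the service is stopped outright
AGENT_UNIT="${AGENT_UNIT:-buildkite-agent}"

# With DRY_RUN=1, print each command instead of executing it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

schedule_lifetime_timers() {
  # Timer 1: send SIGTERM to the agent's main process so it finishes
  # its running jobs and then exits (assumed delivery mechanism).
  run systemd-run --on-boot="$SOFT_STOP_AFTER" --unit=agent-soft-stop \
    systemctl kill --signal=SIGTERM --kill-who=main "$AGENT_UNIT"

  # Timer 2: stop the service; systemd escalates to SIGKILL if the
  # unit does not exit within its stop timeout.
  run systemd-run --on-boot="$HARD_STOP_AFTER" --unit=agent-hard-stop \
    systemctl stop "$AGENT_UNIT"
}
```

Running `DRY_RUN=1 schedule_lifetime_timers` prints the two systemd-run commands without touching systemd, which is a cheap way to check the timings before calling it for real from an instance boot script.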

This adds a timeout based on instance lifetime, intended to be a
supplement to idle-timeout in the agent itself. It prevents busy agents
from living forever if they never get enough idle time to turn
themselves off.
@nitrocode
Contributor

This is a creative solution. However, periodically restarting the agent seems like a bandaid fix of the agent failing to clean up after itself.

I wonder how the agents currently manage disk size.

Contributor

@yob yob left a comment


Thanks @dbaggerman! I've shared this internally to collect input from folks.

I assume most elastic stack users rely on the instances self-terminating after the idle timeout to avoid similar issues - does that work less effectively for you because you use AgentsPerInstance>1 and you need all agents on the instance to reach the idle timeout for an instance to self-terminate?

@yob
Contributor

yob commented May 4, 2021

I wonder how the agents currently manage disk size.

Builds will early-fail if a docker image prune is unable to free up enough space:

    if ! /usr/local/bin/bk-check-disk-space.sh ; then
      echo "Cleaning up docker resources older than ${DOCKER_PRUNE_UNTIL:-4h}"
      docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL:-4h}"
      echo "Checking disk space again"
      if ! /usr/local/bin/bk-check-disk-space.sh ; then
        echo "Disk health checks failed" >&2
        exit 1
      fi
    fi

.. and an hourly cron job will cause the instance to fail a healthcheck if the same thing happens:

    if ! /usr/local/bin/bk-check-disk-space.sh ; then
      echo "Cleaning up docker resources older than ${DOCKER_PRUNE_UNTIL}"
      docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL}"
      docker builder prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL}"
      if ! /usr/local/bin/bk-check-disk-space.sh ; then
        echo "Disk health checks failed" >&2
        exit 1
      fi
    fi

No doubt there are many other ways to consume disk space that we currently have no tooling in place to mitigate though.
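
For readers unfamiliar with this kind of check, a free-space test of the sort the snippets above depend on could look roughly like the following. This is not the contents of bk-check-disk-space.sh; the function names and the percentage threshold are hypothetical and only meant to show the shape of such a check.

```shell
# Hypothetical helpers, not the real bk-check-disk-space.sh.

# Print the percentage of free space on the filesystem holding path $1.
# df -P guarantees one line per filesystem, with "Capacity" (used %) in column 5.
disk_free_percent() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Succeed only if at least $2 percent of the filesystem holding $1 is free.
has_min_free_space() {
  path="$1"
  min_percent="$2"
  free="$(disk_free_percent "$path")"
  [ "$free" -ge "$min_percent" ]
}
```

A caller would use it the same way as the checks above, for example `has_min_free_space /var/lib/docker 10 || echo "low on disk" >&2`, where both the path and the 10% threshold are illustrative values.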

@dbaggerman
Contributor Author

I assume most elastic stack users rely on the instances self-terminating after the idle timeout to avoid similar issues - does that work less effectively for you because you use AgentsPerInstance>1 and you need all agents on the instance to reach the idle timeout for an instance to self-terminate?

We mostly rely on the idle timeout as well, but we have one queue where we used to regularly have instances alive for days despite having the idle time configured. I suspect that having multiple agents on the instance and needing them all idle at the same time to reach the timeout contributed to that, although the queue is shared by builds/teams in different timezones so it can have jobs queued around the clock.
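
To make the "all agents idle at the same time" condition concrete, here is a small sketch of the reasoning. It does not reflect how the elastic stack actually implements termination; the function name and the idea of passing per-agent idle times as arguments are assumptions for illustration.

```shell
# Illustration only: succeed if every agent on the instance has been idle
# for at least the timeout.
# usage: instance_can_terminate TIMEOUT_SECONDS IDLE_SECONDS...
instance_can_terminate() {
  timeout="$1"
  shift
  for idle in "$@"; do
    # A single agent that ran a job recently keeps the whole instance alive.
    if [ "$idle" -lt "$timeout" ]; then
      return 1
    fi
  done
  return 0
}
```

With a 600 second timeout, `instance_can_terminate 600 900 700` succeeds, while `instance_can_terminate 600 900 30` fails because one agent was busy 30 seconds ago. On a queue that receives jobs around the clock, the more agents share an instance, the less likely it is that all of them clear the threshold together.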

Also, while disk space is the most common problem, we have run into other problems on long-running instances as well. For example, at one point we were accumulating running containers. It turned out that docker-compose was starting sidecar containers for some jobs, and those jobs weren't explicitly calling docker-compose down to stop them.
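
For the sidecar case specifically, one way a job script could guard against leaving containers behind is to pair every `up` with a `down` regardless of the job's exit status. The wrapper below is a hypothetical sketch (the function name and the COMPOSE override are made up for the example), and it does not handle the script being killed by a signal part-way through.

```shell
# Hypothetical wrapper: COMPOSE can be overridden, e.g. for testing.
COMPOSE="${COMPOSE:-docker-compose}"

# usage: run_job_with_sidecars COMMAND [ARGS...]
run_job_with_sidecars() {
  # Start the sidecars in the background; give up if they fail to start.
  $COMPOSE up -d || return 1

  # Run the job itself and remember how it exited.
  "$@"
  status=$?

  # Always stop and remove the sidecars, whether the job passed or failed.
  $COMPOSE down

  return "$status"
}
```

A job would then call something like `run_job_with_sidecars make test` instead of invoking docker-compose directly, so a failing test run still cleans up after itself.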

.. and an hourly cron job will cause the instance to fail a healthcheck if the same thing happens:

Being hourly, that can result in a lot of jobs failing for up to an hour before the next cron run. Running it more frequently would make it less painful when it does occur, but would still be reacting to a problem rather than proactively preventing it.

@nitrocode
Contributor

We had this problem on Jenkins and the cloud node plugin has a feature to terminate the instance after x builds have run. We set it to 100 builds and we have a daily instance rotation.

An even simpler method could be configuring the MaxInstanceLifetime on the ASG.

MaxInstanceLifetime
The maximum amount of time, in seconds, that an instance can be in service. The default is null. If specified, the value must be either 0 or a number equal to or greater than 86,400 seconds (1 day).
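
As a sketch of what setting this could look like from the command line, the helper below validates the value against the constraint quoted above (0, or at least 86,400 seconds) before calling the AWS CLI. The function names are hypothetical, and the stack itself would more likely set this property in its CloudFormation template than through the CLI.

```shell
# Succeed if $1 is a valid MaxInstanceLifetime: 0 or an integer >= 86400.
valid_max_instance_lifetime() {
  case "$1" in
    '' | *[!0-9]*) return 1 ;;   # reject empty and non-numeric input
  esac
  [ "$1" -eq 0 ] || [ "$1" -ge 86400 ]
}

# usage: set_max_instance_lifetime ASG_NAME SECONDS
set_max_instance_lifetime() {
  asg_name="$1"
  seconds="$2"
  if ! valid_max_instance_lifetime "$seconds"; then
    echo "MaxInstanceLifetime must be 0 or >= 86400 seconds, got: $seconds" >&2
    return 1
  fi
  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "$asg_name" \
    --max-instance-lifetime "$seconds"
}
```

For example, `set_max_instance_lifetime my-agent-asg 86400` (with a placeholder group name) would request a daily rotation, while a value such as 3600 is rejected locally before any API call is made.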

@dbaggerman
Contributor Author

We had this problem on Jenkins and the cloud node plugin has a feature to terminate the instance after x builds have run. We set it to 100 builds and we have a daily instance rotation.

The buildkite agent has a disconnect-after-job setting to terminate after a single job. Extending this to disconnect after N jobs, or after a specified duration, may be an alternative way to achieve the same result.

An even simpler method could be configuring the MaxInstanceLifetime on the ASG.

This may have the side-effect of interrupting any jobs in progress on the agent. You don't get much of a grace period for running jobs to finish before the instance terminates.

@nitrocode
Contributor

This may have the side-effect of interrupting any jobs in progress on the agent. You don't get much of a grace period for running jobs to finish before the instance terminates.

Isn't there a lifecycle hook in the stack that allows jobs to finish before continuing the termination? If so, I believe the instance refresh respects this... but perhaps there isn't a hook. I don't see one defined in the cf stack.

@dbaggerman
Contributor Author

Isn't there a lifecycle hook in the stack that allows jobs to finish before continuing the termination? If so, I believe the instance refresh respects this... but perhaps there isn't a hook. I don't see one defined in the cf stack.

The elastic stack image does include https://github.com/buildkite/lifecycled to manage the lifecycle events, which is probably what you're thinking of. Even with that though, the autoscaling group will only let you delay the shutdown for an hour or so. We have jobs that run much longer than that, which would still get interrupted, although using it as an alternative to the hard timer in the PR would work.

@keithduncan keithduncan added the agent lifecycle Agent boot, job lifecycle, agent shutdown label Sep 6, 2021
Labels
agent lifecycle Agent boot, job lifecycle, agent shutdown asg-initiated-termination
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants