auto release lock on failure #2519

Open
simonbyrne opened this issue Nov 27, 2023 · 3 comments

@simonbyrne

Is your feature request related to a problem? Please describe.
According to this discussion, locks are not automatically released if a job fails.

Describe the solution you'd like
Automatically release locks at exit

Describe alternatives you've considered
The suggested solutions (using pre-exit hooks) are somewhat cumbersome.

Additional context

@triarius
Contributor

triarius commented Nov 29, 2023

Hi @simonbyrne thanks for raising this.

We've had a bit of a guess of what your use case for this is, but please let us know if we're off track.

We think you're looking for a way to prevent more than one job from the same job definition from running on a machine at a time, with the lock automatically cleaned up after the job exits (whether it succeeds or fails).

We can look into ways to support this workflow. As you've noted, pre-exit hooks may be used as a workaround, and if you're executing a shell script, you may also want to look into trapping EXIT.
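
For illustration, here is a minimal sketch of the EXIT-trap approach, assuming the job runs under bash and that `buildkite-agent lock acquire` prints a token that `buildkite-agent lock release` accepts. The `with_lock` helper and the `LOCK_NAME` and `LOCK_TOKEN` variables are hypothetical names, not part of the agent.

```shell
#!/usr/bin/env bash

# Run a command while holding an agent lock, and release the lock when
# the shell exits, whether the command succeeded or failed.
# Usage: with_lock <lock-name> <command> [args...]
with_lock() {
  # Globals (hypothetical names) so the trap can still see them at exit.
  LOCK_NAME="$1"
  shift

  # Assumption: `lock acquire` blocks until the lock is free and prints
  # the token that must be handed back to `lock release`.
  LOCK_TOKEN="$(buildkite-agent lock acquire "${LOCK_NAME}")"

  # The EXIT trap is shell-wide: it runs when the script ends normally
  # or with an error. It cannot run if the shell is killed with SIGKILL.
  trap 'buildkite-agent lock release "${LOCK_NAME}" "${LOCK_TOKEN}"' EXIT

  "$@"
}
```

A job script might end with something like `with_lock e2e ./run-tests.sh`, where `./run-tests.sh` is a placeholder. This only covers exits the shell gets to observe, so it does not help when the process is killed outright.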

We agree that a way to avoid either of these would be useful, so we can take a look at creating it.

Another observation is that if you don't need the lock to be scoped to the machine, you can use concurrency groups to achieve this in the pipeline definition.
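
As a rough sketch of the concurrency-group alternative, the script below prints a one-step pipeline that uses the `concurrency` and `concurrency_group` step attributes. The `emit_pipeline` helper, the step label, and the group and command names in the usage note are made up for illustration.

```shell
#!/usr/bin/env bash

# Print a one-step pipeline whose step joins a concurrency group, so that
# at most one job in that group runs at a time.
# Usage: emit_pipeline <group-name> <command>
emit_pipeline() {
  local group="$1" cmd="$2"
  cat <<YAML
steps:
  - label: "Write to shared folder"
    command: "${cmd}"
    concurrency: 1
    concurrency_group: "${group}"
YAML
}
```

The output could be piped into the agent, for example `emit_pipeline "shared-fs/write" "./scripts/write.sh" | buildkite-agent pipeline upload`. Because the limit is applied by Buildkite when scheduling jobs rather than by a lock held on the agent machine, there is no token to release if the job dies.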

@simonbyrne
Author

My use case is that we have multiple machines (on an HPC cluster) using a shared file system, and we want to prevent race conditions at certain points when they write to a common folder.

I'm not 100% sure that pre-exit hooks will always be called (e.g. if jobs are killed by our cluster scheduler), so it would be helpful if the unlocking could be handled on the Buildkite side.

Thank you for the suggestion of concurrency groups: I wasn't aware they worked across builds, so that may suit our use case. Thanks!

@Maxim-Filimonov

We have another use case for auto-release lock.
We run gcloud deployments, which do some global auth magic to point the gcloud CLI at a specific account. As the auth is global, only one build can run on a machine at a time. We have 3 different machines, so we don't really want to lock it using concurrency groups.
I tried to write pre-exit hook:

#!/usr/bin/env bun
import shell from 'shelljs';

shell.echo('--- Cleaning locks...');

// Add any new locks to this array
const LOCKS = ['e2e', 'gcloud'];

LOCKS.forEach((lock) => {
  shell.echo(`Checking ${lock} lock...`);
  const result = shell.exec(`buildkite-agent lock get ${lock}`);
  if (result.code !== 0) {
    shell.echo(`Unable to retrieve ${lock}`);
    shell.exit(1);
  }

  const lockValue = result.stdout.trim();
  if (lockValue.length === 0) {
    shell.echo(`No ${lock} lock found`);
    return;
  }
  shell.echo(`Retrieved ${lock} lock: ${lockValue}`);
  // Example lock value: acquired(pid=87711,otp=ddfb06653d705099ae7a24162f3d23fb)
  // Take the number after "pid=" (first comma-separated field).
  const processId = lockValue.split(',')[0].split('=')[1];
  shell.echo(`Checking if process with id ${processId} is still running...`);
  const processResult = shell.exec(`ps -p ${processId}`);
  if (processResult.code === 0) {
    shell.echo(`Process with id ${processId} is still running. Skipping...`);
    shell.echo(`Process info: ${processResult.stdout}`);
    return;
  } else {
    shell.echo(
      `Process with id ${processId} is not running. Releasing lock...`,
    );
    const releaseResult = shell.exec(
      `buildkite-agent lock release ${lock} "${lockValue}"`,
    );
    if (releaseResult.code !== 0) {
      shell.echo(`Unable to release ${lock}`);
      shell.exit(1);
    } else {
      shell.echo(`Released ${lock} lock`);
    }
  }
});

I made a wrong assumption about pid there: it doesn't seem to be the ID of a process I can check. My question for this workaround is: how do I determine whether the process that acquired the lock is still running?
Is there any built-in way to get more info about the lock, or do we need to introduce our own? Even being able to tell how long ago the lock was acquired would be good enough.
We already save the lock token to a file and release it in our trap script:

echo "Running cleanup script"
# Read the saved lock token; this is empty if the lock file was never written.
BUILDKITE_LOCK=$(cat ./scripts/ci/resources/lock.txt 2>/dev/null || true)
if [[ -n "${BUILDKITE_LOCK}" ]]; then
  echo "Releasing lock: ${BUILDKITE_LOCK}"
  buildkite-agent lock release e2e "${BUILDKITE_LOCK}"
fi
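
One way to tell how long ago a lock was acquired, without any support from the agent, might be to store a timestamp next to the token in the same lock file. The sketch below assumes the file is written right after the lock is acquired. The helper names `save_lock`, `lock_token`, and `lock_age_seconds` are hypothetical.

```shell
#!/usr/bin/env bash

# Save the lock token together with the time it was acquired, as a
# single line: "<epoch-seconds> <token>".
# Usage: save_lock <file> <token>
save_lock() {
  local file="$1" token="$2"
  printf '%s %s\n' "$(date +%s)" "${token}" > "${file}"
}

# Print the token stored in <file>.
lock_token() {
  local file="$1" _acquired_at token
  read -r _acquired_at token < "${file}"
  echo "${token}"
}

# Print how many seconds ago the lock stored in <file> was recorded.
lock_age_seconds() {
  local file="$1" acquired_at _token
  read -r acquired_at _token < "${file}"
  echo $(( $(date +%s) - acquired_at ))
}
```

A cleanup hook could then release a lock only when `lock_age_seconds` exceeds the longest expected job duration. That threshold is a guess about job length, not proof that the holder is gone, so it trades a stuck lock for the risk of releasing one that is still in use.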

Unfortunately, this script doesn't run when deployment is cancelled.
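
A possible reason, offered as an assumption rather than a diagnosis, is that bash defers trap handlers until the foreground command finishes, so a cancellation signal may not be acted on before the process is killed. The sketch below runs the command in the background and waits on it, which lets a TERM or INT handler run promptly. `run_with_cleanup`, `CLEANUP_FN`, and `CHILD_PID` are hypothetical names, and SIGKILL still cannot be trapped.

```shell
#!/usr/bin/env bash

# Run a command in the background and wait for it, so that TERM or INT
# sent to this shell is handled right away instead of after the command
# finishes. The cleanup function runs on success, failure, and signal.
# Usage: run_with_cleanup <cleanup-function> <command> [args...]
run_with_cleanup() {
  CLEANUP_FN="$1"
  shift
  CHILD_PID=""

  # On TERM/INT: stop the child if it has started, clean up, exit non-zero.
  trap 'if [ -n "${CHILD_PID}" ]; then kill "${CHILD_PID}" 2>/dev/null; fi
        "${CLEANUP_FN}"
        exit 1' TERM INT

  "$@" &
  CHILD_PID=$!

  # `wait` is interruptible: a trapped signal makes it return immediately
  # and the handler above runs.
  local status=0
  wait "${CHILD_PID}" || status=$?

  # Normal path: remove the handler, clean up once, pass the status on.
  trap - TERM INT
  "${CLEANUP_FN}"
  return "${status}"
}
```

Here the cleanup function would be whatever releases the lock, for example a function wrapping the `buildkite-agent lock release` call from the trap script above.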
