auto release lock on failure #2519

simonbyrne · 2023-11-27T21:51:16Z

Is your feature request related to a problem? Please describe.
According to this discussion, locks are not automatically released if a job fails.

Describe the solution you'd like
Automatically release locks at exit

Describe alternatives you've considered
The suggested solutions (using pre-exit hooks) are somewhat cumbersome.

Additional context

triarius · 2023-11-29T00:20:28Z

Hi @simonbyrne, thanks for raising this.

We've taken a guess at your use case, but please let us know if we're off track.

We think you're looking for a way to prevent more than one of the same job definition from running on a machine at a time, but you want the lock to be automatically cleaned up after the job exits (whether it succeeds or fails).

We can look into ways to support this workflow. As you've noted, pre-exit hooks may be used as a workaround, and if you're executing a shell script, you may also want to look into trapping EXIT.
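The EXIT-trap workaround mentioned above could be sketched roughly as follows. This is a minimal illustration, not an official recipe: `with_lock` and the overridable `AGENT` variable are hypothetical names, and it assumes `buildkite-agent lock acquire` prints the lock token on stdout so that it can be passed back to `buildkite-agent lock release`.

```shell
#!/usr/bin/env bash
# Sketch: hold an agent lock while a command runs, and release it from an
# EXIT trap whether the command succeeds or fails.
# AGENT is a hypothetical override hook; it defaults to the real CLI.
AGENT="${AGENT:-buildkite-agent}"

# with_lock <key> <command> [args...]   (hypothetical helper)
with_lock() {
  (
    # Run in a subshell so the trap and variables stay local to this call.
    key="$1"; shift
    # Assumption: "lock acquire" blocks until the lock is held and prints a token.
    token="$("$AGENT" lock acquire "$key")" || exit 1
    # Release the lock when this subshell exits, on success or failure.
    trap '"$AGENT" lock release "$key" "$token"' EXIT
    "$@"
    exit $?
  )
}
```

Usage might look like `with_lock gcloud ./deploy.sh`, where the script name is a placeholder. An EXIT trap cannot run if the process receives SIGKILL, so this does not cover every way a job can die.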

We agree that a way to avoid either of these would be useful, so we can take a look at creating it.

Another observation is that if you don't need the lock to be scoped to the machine, you can use concurrency groups to achieve this in the pipeline definition.

simonbyrne · 2023-11-29T18:26:04Z

My use case is that we have multiple machines (on a HPC cluster) using a shared file system, and we want to prevent race conditions at certain points when they write to a common folder.

I'm not 100% sure that pre-exit hooks will always be called (e.g. if jobs are killed by our cluster scheduler), so it would be helpful if the unlocking could be handled on the Buildkite side.

Thank you for the suggestion of concurrency groups: I wasn't aware they worked across builds, so that may suit our use case. Thanks!

Maxim-Filimonov · 2024-01-22T06:39:08Z

We have another use case for an auto-release lock.
We run gcloud deployments, which do some global auth magic to point the gcloud CLI at a specific account. Because that state is global, only one build can run on a machine at a time. We have 3 different machines, so we don't really want to lock this using concurrency groups.
I tried to write a pre-exit hook:

#!/usr/bin/env bun
import shell from 'shelljs';

shell.echo('--- Cleaning locks...');

// Add any new locks to this array
const LOCKS = ['e2e', 'gcloud'];

LOCKS.forEach((lock) => {
  shell.echo(`Checking ${lock} lock...`);
  const result = shell.exec(`buildkite-agent lock get ${lock}`);
  if (result.code !== 0) {
    shell.echo(`Unable to retrieve ${lock}`);
    shell.exit(1);
  }

  const lockValue = result.stdout.trim();
  if (lockValue.length === 0) {
    shell.echo(`No ${lock} lock found`);
    return;
  }
  shell.echo(`Retrieved ${lock} lock: ${lockValue}`);
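  // Example lock value: acquired(pid=87711,otp=ddfb06653d705099ae7a24162f3d23fb)
  // NOTE: as discussed below, the pid field does not appear to be an OS process ID.
  const processId = lockValue.split(',')[0].split('=')[1];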
  shell.echo(`Checking if process with id ${processId} is still running...`);
  const processResult = shell.exec(`ps -p ${processId}`);
  if (processResult.code === 0) {
    shell.echo(`Process with id ${processId} is still running. Skipping...`);
    shell.echo(`Process info: ${processResult.stdout}`);
    return;
  } else {
    shell.echo(
      `Process with id ${processId} is not running. Releasing lock...`,
    );
    const releaseResult = shell.exec(
      `buildkite-agent lock release ${lock} "${lockValue}"`,
    );
    if (releaseResult.code !== 0) {
      shell.echo(`Unable to release ${lock}`);
      shell.exit(1);
    } else {
      shell.echo(`Released ${lock} lock`);
    }
  }
});

I made a wrong assumption about pid there, as it doesn't seem to be a process ID. My question for this workaround: how do I determine whether the process that acquired the lock is still running?
Is there any built-in way to get more info about the lock, or do we need to introduce our own? Even being able to tell how long ago the lock was acquired would be good enough.
We already save the lock to a file, which our trap script uses to release it:

echo "Running cleanup script"
BUILDKITE_LOCK=$(cat ./scripts/ci/resources/lock.txt)
if [[ -n "${BUILDKITE_LOCK}" ]]; then
  echo "Releasing lock: ${BUILDKITE_LOCK}"
  buildkite-agent lock release e2e "${BUILDKITE_LOCK}"
fi

Unfortunately, this script doesn't run when the deployment is cancelled.
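Absent a built-in way to see when a lock was acquired, one possible workaround is to record a timestamp next to the token when saving it to the lock file. The following is only a sketch under that assumption: `save_lock`, `saved_token`, `lock_age_seconds`, and the `LOCK_FILE` variable are hypothetical names and not part of buildkite-agent.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. LOCK_FILE defaults to the path used by the trap script above.
LOCK_FILE="${LOCK_FILE:-./scripts/ci/resources/lock.txt}"

# Store "<epoch seconds> <token>" on a single line.
save_lock() {
  printf '%s %s\n' "$(date +%s)" "$1" > "$LOCK_FILE"
}

# Print the saved token (everything after the timestamp).
saved_token() {
  local _ts token
  [ -r "$LOCK_FILE" ] || return 1
  read -r _ts token < "$LOCK_FILE"
  printf '%s\n' "$token"
}

# Print how many seconds ago the lock was saved.
lock_age_seconds() {
  local ts _token
  [ -r "$LOCK_FILE" ] || return 1
  read -r ts _token < "$LOCK_FILE"
  echo $(( $(date +%s) - ts ))
}
```

With this file format, a cleanup script would read the token with `saved_token` instead of `cat`, and a pre-exit hook could compare `lock_age_seconds` against a threshold of its choosing before releasing a lock it considers stale.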
