Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue with buildkite-agent Job API when forwarding the job to the VM #85

Merged
merged 9 commits into from
Mar 11, 2024

Conversation

AliSoftware
Copy link
Contributor

@AliSoftware AliSoftware commented Mar 8, 2024

What / TL;DR

Fixes an issue with the latest version of buildkite-agent used in the xcode-15.3 VM image (and later ones) that prevents jobs from being transferred from the host to the VM

Why / Issue details

In the latest versions of buildkite-agent, the Job API experiment has been de-experimented and enabled by default.

As a result, buildkite-agent bootstrap now tries to create a Unix socket at the BUILDKITE_SOCKETS_PATH path, then exposes the created socket path and token as BUILDKITE_AGENT_JOB_API_SOCKET and BUILDKITE_AGENT_JOB_API_TOKEN env vars.

The issue is that the default value for this path (aka --sockets-path option of buildkite-agent bootstrap) is $HOME/.buildkite-agent/sockets, so when our hostmgr generate buildkite-job command generates the script to handle the job in the VM, it exports all BUILDKITE_* env vars in that script… including the BUILDKITE_SOCKETS_PATH which was resolved to /Users/administrator/.buildkite-agent/sockets on the host. This resulted in buildkite-agent bootstrap failing on creating socket directory: mkdir /Users/administrator: permission denied error.

How

  • Override the BUILDKITE_SOCKETS_PATH env var in the generated script to /opt/ci/var/tmp/sockets
  • Disallow forwarding BUILDKITE_AGENT_JOB_API_SOCKET and BUILDKITE_AGENT_JOB_API_TOKEN in the generated script

I also took the occasion of this PR to:

  • Remove obsolete addEnvironmentVariable calls for BUILDKITE_BUILD_CHECKOUT_PATH, BUILDKITE_HOOKS_PATH and BUILDKITE_PLUGINS_PATH pointing to /usr/local/var/…, as those are legacy paths; besides those env keys are part of either disallowedKeys or overriddenKeys, so were removed or overridden later in the code, before scriptBuilder.build() is called… so those particular addEnvironmentVariable calls were not impacting the generated script code after all .
  • Remove duplicate Paths.tempFilePath constant which had the exact same value as Paths.tempDirectory, and replace its only call site
  • Turn var to let in Paths constants and rearrange their order and grouping a bit
  • Add a trailing newline to generated script (to avoid odd stdout when debugging its content with cat, for example 😉 )

Testing

As it wasn't easy to test this without releasing and deploying a new hostmgr version to our Mac hosts, instead I:

  • SSH'd into one of the hosts (I picked MV-MKE-ARM64-014)

  • Manually modify /opt/ci/hooks/command script like below, to unset BUILDKITE_AGENT_JOB_API_SOCKET and BUILDKITE_AGENT_JOB_API_TOKEN and set BUILDKITE_SOCKETS_PATH to hardcoded value, and thus simulate the same change made in this hostmgr code:

    administrator@MV-MKE-ARM64-014 hooks % diff -u /opt/ci/hooks/command{.bak,}             
    --- /opt/ci/hooks/command.bak	2024-03-08 13:21:28
    +++ /opt/ci/hooks/command	2024-03-08 13:23:59
    @@ -35,7 +35,15 @@
    echo "~~~ ➡️ Transferring Job to VM"
    echo "Received Job: $BUILDKITE_JOB_ID"
    
    -SCRIPT_PATH=$(hostmgr generate buildkite-job)
    +echo ">>> Olivier's Testing"
    +echo "env before:"
    +env
    +SCRIPT_PATH=$(unset BUILDKITE_AGENT_JOB_API_SOCKET; unset BUILDKITE_AGENT_JOB_API_TOKEN; export BUILDKITE_SOCKETS_PATH=/opt/ci/var/tmp/sockets; hostmgr generate buildkite-job)
    +echo "=== Generated script ==="
    +cat "$SCRIPT_PATH"
    +echo
    +echo "======"
    +echo "<<< Olivier's Testing"
    
    echo "Running $SCRIPT_PATH on builder@$IP_ADDRESS"
    ssh builder@"$IP_ADDRESS" < "$SCRIPT_PATH"
  • Modified the DayOne-Apple pipeline in the pending Xcode-15.3 update PR, to enforce the job to run on that specific MV-MKE-ARM64-014 host

  • Validated that the Job successfully passed the step related to Job API in the VM and the transfer and running of the job to the VM working as expected, fixing the issue.

  • I then removed my patch of /opt/ci/hooks/command to restore MV-MKE-ARM64-014 to its previous state.

What's Next

Once this lands, I'll generate a new release of hostmgr (probably a non-beta 0.50.0) and work on deploying it (but probably not today, as it's a Friday and thus submission + code freeze day for many apps, so not the best day to interrupt CI (or risk breaking it during failed deployment 😅 ).

@AliSoftware AliSoftware added the bug Something isn't working label Mar 8, 2024
@AliSoftware AliSoftware requested review from jkmassel, spencertransier and a team March 8, 2024 13:35
@AliSoftware AliSoftware self-assigned this Mar 8, 2024
 - Override the `BUILDKITE_SOCKETS_PATH` (which defaults to `$HOME/.buildkite-agent/sockets`, resolved on the host while `$HOME` is different on the VM, so that caused a failure when `buildkite-agent bootstrap` tried to `mkdir` that directory
 - Disallow `BUILDKITE_AGENT_JOB_API_SOCKET` and `BUILDKITE_AGENT_JOB_API_TOKEN`, which are exported by the host's `buildkite-agent` when it creates the Job API socket (see https://github.com/buildkite/agent/blob/45d491fdb44072fe4c2a7aac79490defc95b8dcf/internal/job/api.go#L37-L38), which we want the VM to create its own
 - `BUILDKITE_BUILD_CHECKOUT_PATH` is not used in VMs now that we use git-mirrors. Besides, this env var is listed as part of `disallowedKeys`, so that key will be pruned from the dictionary before the script is generated via `scriptBuilder.build()` anyway
 - `BUILDKITE_HOOKS_PATH` and `BUILDKITE_PLUGINS_PATH` are part of `overriddenKeys` so will be overridden in the next lines of the code before `scriptBuilder.build()` is called too

So in practice those `addEnvironmentVariables` didn't do anything and did not impact the generated script at all in the end.
As per POSIX conventions (and to avoid odd outputs when debugging the scripts' content with `cat`, for example)
That way when we temporarily switch that lane to use `readonly: false` when we need to regenerate the profile, this makes sure that the newly generated profile uses that specific cert, instead of _fastlane_ trying to guess and potentially picking your _personal_ Apple Development profile.

This is what happened to me when I renewed the cert 2 days ago trying to fix code signing issues when developing `hostmgr` locally.
@AliSoftware
Copy link
Contributor Author

AliSoftware commented Mar 8, 2024

Note for bookkeeping: that PR initially had issues with code-signing hostmgr that made the CI fail on Validate Release.

Turns out the profiles expired, but match kept giving me issues when I tried to renew them as usual:

  • First I already renewed the profile on Wed 6, but back then match complained about a corrupt cer file in S3
  • After trying to fix that and running match successfully, I later realized that it picked my personal certificate instead of the "Created by API" one
  • So I tried to renew it again today (Fri 8), but this time encountered an ASC failure when trying to create profile, claiming we had no macOS devices…)…

In the end, the error was due to a recent ASC API change between Wednesday and Friday 😞 . I submitted a fix in fastlane core, after which I was finally (!) able to renew the profiles and make CI go green.

@AliSoftware AliSoftware merged commit 59dab17 into trunk Mar 11, 2024
6 checks passed
@AliSoftware AliSoftware deleted the fix-job-api-path branch March 11, 2024 16:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants