Build steps hang on Windows if child processes still running #2202

Open
brianseeders opened this issue Jul 12, 2023 · 7 comments

@brianseeders

I have the following script used in a Buildkite step (for reproduction purposes):

echo 1
sleep 60 &
echo 2

Note that sleep 60 & goes to the background, so this script exits immediately.

On Linux, when running this script in a Buildkite step, the step finishes immediately (not after waiting 60s), as I would expect.

On Windows, however, the Buildkite step hangs for 60 seconds, waiting for the child process to finish, even though the parent process has already exited.

It doesn't matter what shell is specified (I've tried powershell, pwsh, bash, and no shell (which defaults to cmd)), the behavior is always the same. The script exits immediately if run on Windows outside of Buildkite. This all leads me to believe it's the Buildkite agent itself and how it manages processes.

Is this difference expected behavior? We have complex pipelines that spawn a lot of child processes (for example, a simple case is gradle daemons) and they hang indefinitely on Windows.

Is there a way around the behavior? I can't think of anything related to our environment that could cause this.

You can also run, for example, in batch:
bash.exe -c 'echo 1; sleep 60 ^& echo 2;'

Tested on: Windows 2022, 2019, 2016
buildkite-agent 3.48.0 and 3.49.0

Note that I e-mailed support, and they asked me to open an issue here.

@brianseeders
Author

Also, in order to rule out any issues related to our environment/images, I created a VM in GCP using their base Windows Server 2022 image. I installed chocolatey, git, bash, and buildkite-agent. I connected buildkite-agent to our org and ran my job, and it still hangs. I also tested buildkite-agent v3.0.0, and it still happens on this very old version.

@moskyb
Contributor

moskyb commented Jul 19, 2023

g'day @brianseeders! thanks for this - it definitely seems like something hinky is going on, we're gonna take a look into it.

thinking out loud, we don't run jobs in PTYs on windows, which seems like it could be the source of this - perhaps we could run in PTY mode on windows iff bash is the shell we're in? will think about this a bit more.

@brianseeders
Author

Thanks. I'm guessing from this that running in PTY mode on Windows won't be particularly easy to try?

I've been trying to find a workaround for this. I started experimenting with this: https://github.com/elastic/elasticsearch/blob/cec2769216409fc143cb05048f9ecd0fedc4341a/.buildkite/scripts/windows-end-job.ps1

I'm basically ending every step by sending ctrl-c to the agent process twice. It only works because we're using ephemeral one-shot agents. It works, but the exit status isn't being reported correctly by Buildkite for certain steps and I haven't been able to figure out why.

I have a powershell script that looks like this at the end:

echo "Exiting with $exitCode"
exit $exitCode

and the Buildkite log shows "Exiting with 1", but the step is sometimes reported as successful.

I'll try the latest release as I see there has been a ton of work/refactoring happening, and report back.

@brianseeders
Author

Nope, the exit code issue is still present on the latest agent as well. Job uuid 01898929-e64a-4306-90c9-2cffe43c1b2b if you'd like to see. It reproduces reliably for this job, but the job takes 90 minutes. I'm trying to come up with a smaller example for it.

@brianseeders
Author

Ending my day here, and wanted to give one more update. It turns out that the real exit code is lost when the job has to be forcibly terminated (e.g. when you see this: Job 01898929-e64a-4306-90c9-2cffe43c1b2b hasn't stopped in time, terminating). In that case the step always shows as successful. Graceful exits report the exit code correctly.

I'm guessing the hanging issue itself centers around the job object / process group handling, but I'm still trying to understand why. Maybe Go's cmd.Wait() ends up waiting for the whole job, not just the parent process, to complete? I'm not sure.
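
For anyone following along, one general mechanism that produces this symptom is an inherited output pipe: a parent that reads a child's output through a pipe sees end-of-file only once every process holding the write end has exited, background children included. Go's os/exec documentation notes that Wait also waits for output copying to finish when Stdout or Stderr is not an *os.File. The sketch below shows the effect with plain shell command substitution on a POSIX system, using a 3-second sleep instead of 60. It illustrates the mechanism only; it is not the agent's code, and nothing in this thread confirms that this is the cause on Windows.

```shell
#!/usr/bin/env bash
# Illustration only: a pipe reader sees EOF only after every process that
# holds the write end has exited, including background children.

# Same shape as the script in the issue, with a shorter sleep.
repro='echo 1; sleep 3 & echo 2'

# Case 1: output is captured through a pipe. The background sleep inherits
# the pipe's write end, so the capture waits for the sleep to finish.
start=$SECONDS
captured=$(bash -c "$repro")
held_elapsed=$((SECONDS - start))

# Case 2: the background process has its output redirected away from the
# pipe, so the capture ends as soon as the script itself exits.
detached='echo 1; sleep 3 >/dev/null 2>&1 & echo 2'
start=$SECONDS
captured_detached=$(bash -c "$detached")
detached_elapsed=$((SECONDS - start))

echo "inherited pipe: ${held_elapsed}s, redirected: ${detached_elapsed}s"
```

If the first case takes about as long as the sleep while the second returns promptly, the wait is tied to the inherited pipe rather than to the script itself.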

@brianseeders
Author

Another update here. I finally seem to have figured out a workaround, with an idea from @rjernst.

https://github.com/elastic/elasticsearch/blob/buildkite-migration/.buildkite/scripts/run-script.ps1

I'm creating my own nested job, and closing it when the main script finishes executing. This cleans up any lingering processes, which allows the buildkite-agent to move on.
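
The linked script does this with a Windows job object from PowerShell. As a rough POSIX analogue only, the same idea (run the main script inside a container you control, record its exit code, then tear down whatever is left) can be sketched with a process group standing in for the job object. The function name `run_and_reap` and the stand-in build command are hypothetical and do not come from the linked script.

```shell
#!/usr/bin/env bash
# POSIX analogue only: a process group stands in for the Windows job object.

run_and_reap() {
  set -m                 # job control: each background job gets its own process group
  "$@" &                 # start the main script as a background job
  local group=$!         # with set -m, the job's PID is also its process group ID
  wait "$group"          # wait for the main script only, not its leftovers
  local status=$?        # exit code of the main script, recorded before cleanup
  # Terminate anything still running in the group (lingering children).
  kill -TERM -- "-$group" 2>/dev/null || true
  set +m
  return "$status"
}

# Stand-in for a build script that leaves a background process behind.
status=0
run_and_reap bash -c 'echo 1; sleep 30 & echo 2; exit 7' || status=$?
echo "main script exited with $status"
```

A real wrapper would finish with `exit "$status"` so that the step reports the main script's exit code instead of the cleanup's.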

@triarius
Contributor

triarius commented Aug 9, 2023

Thanks for sharing the workaround here @brianseeders 💖
