Build steps hang on Windows if child processes still running #2202
Comments
Also, in order to rule out any issues related to our environment/images, I created a VM in GCP using their base Windows Server 2022 image. I installed chocolatey, git, bash, and buildkite-agent. I connected buildkite-agent to our org and ran my job, and it still hangs. I also tested buildkite-agent v3.0.0, and it still happens on this very old version.
g'day @brianseeders! thanks for this - it definitely seems like something hinky is going on, we're gonna take a look into it. thinking out loud, we don't run jobs in PTYs on windows, which seems like it could be the source of this - perhaps we could run in PTY mode on windows iff bash is the shell we're in? will think about this a bit more.
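For illustration, here is a minimal shell sketch of one way a finished parent can still appear to hang when its output is read through a pipe instead of a PTY. It assumes, without confirmation from this thread, that inherited output handles are part of the problem. The helper time_capture is hypothetical, and the 3-second sleep stands in for any lingering child process.

```shell
#!/bin/sh
# Hypothetical helper: capture a command's stdout through a pipe and print
# how many whole seconds the capture took.
time_capture() {
  start=$(date +%s)
  # Command substitution reads until EOF, i.e. until every process that
  # holds the write end of the pipe has exited or closed it.
  out=$("$@")
  end=$(date +%s)
  echo $((end - start))
}

# The background sleep inherits the pipe as its stdout, so the capture
# keeps waiting even though the parent shell has already exited.
inherited=$(time_capture sh -c 'echo 1; sleep 3 & echo 2')

# The background sleep has its output redirected away from the pipe, so the
# capture ends as soon as the parent shell exits.
detached=$(time_capture sh -c 'echo 1; sleep 3 >/dev/null 2>&1 & echo 2')

echo "inherited=${inherited}s detached=${detached}s"
```

On a POSIX system the first figure should be roughly 3 and the second roughly 0. Whether redirecting a child's output also avoids the hang under buildkite-agent on Windows is not established here.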
Thanks. I'm guessing based on this that it won't be particularly easy to try this? I've been trying to find a workaround for this. I started experimenting with this: https://github.com/elastic/elasticsearch/blob/cec2769216409fc143cb05048f9ecd0fedc4341a/.buildkite/scripts/windows-end-job.ps1 I'm basically ending every step by sending ctrl-c to the agent process twice. It only works because we're using ephemeral one-shot agents. It works, but the exit status isn't being reported correctly by Buildkite for certain steps and I haven't been able to figure out why. I have a powershell script that looks like this at the end:
and the Buildkite log shows
I'll try the latest release, as I see there has been a ton of work/refactoring happening, and report back.
Nope, the exit code issue is still present on the latest agent as well. Job uuid
Ending my day here, and wanted to give one more update. It turns out that the real exit code is lost when the job has to be forcibly terminated (e.g. when you see this:
I'm guessing the hanging issue itself centers around the job object / process group stuff, but I've been trying to understand why. Maybe golang's cmd.Wait() ends up waiting for the process job itself to complete? I'm not sure.
Another update here. I finally seem to have figured out a workaround, with an idea from @rjernst. https://github.com/elastic/elasticsearch/blob/buildkite-migration/.buildkite/scripts/run-script.ps1 I'm creating my own nested job, and closing it when the main script finishes executing. This cleans up any lingering processes, which allows the buildkite-agent to move on.
Thanks for sharing the workaround here @brianseeders 💖
I have the following script used in a Buildkite step (for reproduction purposes):
Note that sleep 60 & goes to the background, so this script exits immediately.
On Linux, when running this script in a Buildkite step, the step finishes immediately (not after waiting 60s), as I would expect.
On Windows, however, the buildkite step hangs for 60 seconds, waiting for the child process to finish, even though the parent process completed.
It doesn't matter what shell is specified (I've tried powershell, pwsh, bash, and no shell (which defaults to cmd)), the behavior is always the same. The script exits immediately if run on Windows outside of Buildkite. This all leads me to believe it's the Buildkite agent itself and how it manages processes.
Is this difference expected behavior? We have complex pipelines that spawn a lot of child processes (for example, a simple case is gradle daemons) and they hang indefinitely on Windows.
Is there a way around the behavior? I can't think of anything related to our environment that could cause this.
You can also run, for example, in batch:
bash.exe -c 'echo 1; sleep 60 ^& echo 2;'
Tested on: Windows 2022, 2019, 2016
buildkite-agent 3.48.0 and 3.49.0
Note that I e-mailed support, and they asked me to open an issue here.