ginkgo run in parallel can hang when child command outlives test suite #1191

Luap99 · 2023-05-02T19:07:37Z

I have been debugging hangs in ginkgo v2 in over last weeks and I think I finally found the problem.
I created https://github.com/Luap99/ginkgo-hang for a simple reproducer.

This diff seems to fix the issue for me:

diff --git a/internal/output_interceptor_unix.go b/internal/output_interceptor_unix.go
index f5ae15b..70d5647 100644
--- a/internal/output_interceptor_unix.go
+++ b/internal/output_interceptor_unix.go
@@ -26,6 +26,11 @@ func (impl *dupSyscallOutputInterceptorImpl) CreateStdoutStderrClones() (*os.Fil
        stdoutCloneFD, _ := unix.Dup(1)
        stderrCloneFD, _ := unix.Dup(2)
 
+       flags, _ := unix.FcntlInt(uintptr(stdoutCloneFD), unix.F_GETFD, 0)
+       unix.FcntlInt(uintptr(stdoutCloneFD), unix.F_SETFD, flags|unix.FD_CLOEXEC)
+       flags, _ = unix.FcntlInt(uintptr(stderrCloneFD), unix.F_GETFD, 0)
+       unix.FcntlInt(uintptr(stderrCloneFD), unix.F_SETFD, flags|unix.FD_CLOEXEC)
+
        // And then wrap the clone file descriptors in files.
        // One benefit of this (that we don't use yet) is that we can actually write
        // to these files to emit output to the console even though we're intercepting output

By using CLOEXEC we make sure the fds are never leaked into commands that are executed in the test suite.
I am happy to open a PR if you agree that this is the right approach to fix the problem.

onsi · 2023-05-02T19:24:00Z

oh snap! this has been a loooong-standing issue and one i've spent a lot of effort trying to understand and fix. the best i've been able to do is provide a flag to disable output interception when this failure mode occurs. your description of the issue lines up with my understanding - i didn't know about CLOEXEC but it sounds like it resolves the root issue.

yes please do submit a PR!

When you run any child process that stay around longer than the test suite ginkgo currently hangs. This is because the dup stdout/err fds are leaked into the child thus read() will block on it as there is at least one process still having the write pipe open. From ginkgos POV it looks like it is done, you see the ginkgo result output printed but then it just hangs and doe snot exit because of it. To fix it we set FD_CLOEXEC on the dup-ed fds, this prevents them from ever being leaking into other processes that could outlive the suite. There is a dup3() call the could be uses to set the CLOEXEC option directly but this seem linux only and thus is less portable. The fcntl call should be good enough here, we do not need to be worried about the race conditions described in the man page. Ideally we should do some error handling in that function for both the fcntl calls and the existing dup() above, however this seems like a rather big change. I am not so sure about how to test it properly, I added a test which just executes `ginkgo run -p` and a test which only starts `sleep 60`. Ginkgo then should exit right way and keep this process running. Then I just make sure the gingo exits in under 15 seconds. As long as it is below 60s it should fulfil the purpose. Fixes onsi#1191 Signed-off-by: Paul Holzinger <pholzing@redhat.com>

Luap99 · 2023-05-03T11:07:40Z

PR #1192

* integration: build interceptor binary automatically Trying to run this locally the interceptor binary did not exists. Lets just build this in SynchronizedBeforeSuite() automatically so users do not have to figure this out. Signed-off-by: Paul Holzinger <pholzing@redhat.com> * integration: make interceptor test parallel safe Using the same file name for all test cause conflicts when run in parallel, this causes the tests to fail for me. To fix this use a filename suffix with GinkgoParallelProcess() which prevents conflicts in the file name. Signed-off-by: Paul Holzinger <pholzing@redhat.com> * fix hang with ginkgo -p When you run any child process that stay around longer than the test suite ginkgo currently hangs. This is because the dup stdout/err fds are leaked into the child thus read() will block on it as there is at least one process still having the write pipe open. From ginkgos POV it looks like it is done, you see the ginkgo result output printed but then it just hangs and doe snot exit because of it. To fix it we set FD_CLOEXEC on the dup-ed fds, this prevents them from ever being leaking into other processes that could outlive the suite. There is a dup3() call the could be uses to set the CLOEXEC option directly but this seem linux only and thus is less portable. The fcntl call should be good enough here, we do not need to be worried about the race conditions described in the man page. Ideally we should do some error handling in that function for both the fcntl calls and the existing dup() above, however this seems like a rather big change. I am not so sure about how to test it properly, I added a test which just executes `ginkgo run -p` and a test which only starts `sleep 60`. Ginkgo then should exit right way and keep this process running. Then I just make sure the gingo exits in under 15 seconds. As long as it is below 60s it should fulfil the purpose. Fixes #1191 Signed-off-by: Paul Holzinger <pholzing@redhat.com> * Rearrange new interceptor hang tests and confirm they cover the issue in question --------- Signed-off-by: Paul Holzinger <pholzing@redhat.com> Co-authored-by: Onsi Fakhouri <onsijoe@gmail.com>

Luap99 changed the title ~~ginkgo run in parralell can hang when child command outlives test suite~~ ginkgo run in parallel can hang when child command outlives test suite May 2, 2023

Luap99 mentioned this issue May 3, 2023

fix hang with ginkgo -p #1192

Merged

onsi closed this as completed in #1192 May 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ginkgo run in parallel can hang when child command outlives test suite #1191

ginkgo run in parallel can hang when child command outlives test suite #1191

Luap99 commented May 2, 2023

onsi commented May 2, 2023

Luap99 commented May 3, 2023

ginkgo run in parallel can hang when child command outlives test suite #1191

ginkgo run in parallel can hang when child command outlives test suite #1191

Comments

Luap99 commented May 2, 2023

onsi commented May 2, 2023

Luap99 commented May 3, 2023