Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ginkgo run in parallel can hang when child command outlives test suite #1191

Closed
Luap99 opened this issue May 2, 2023 · 2 comments · Fixed by #1192
Closed

ginkgo run in parallel can hang when child command outlives test suite #1191

Luap99 opened this issue May 2, 2023 · 2 comments · Fixed by #1192

Comments

@Luap99
Copy link
Contributor

Luap99 commented May 2, 2023

I have been debugging hangs in ginkgo v2 in over last weeks and I think I finally found the problem.
I created https://github.com/Luap99/ginkgo-hang for a simple reproducer.

This diff seems to fix the issue for me:

diff --git a/internal/output_interceptor_unix.go b/internal/output_interceptor_unix.go
index f5ae15b..70d5647 100644
--- a/internal/output_interceptor_unix.go
+++ b/internal/output_interceptor_unix.go
@@ -26,6 +26,11 @@ func (impl *dupSyscallOutputInterceptorImpl) CreateStdoutStderrClones() (*os.Fil
        stdoutCloneFD, _ := unix.Dup(1)
        stderrCloneFD, _ := unix.Dup(2)
 
+       flags, _ := unix.FcntlInt(uintptr(stdoutCloneFD), unix.F_GETFD, 0)
+       unix.FcntlInt(uintptr(stdoutCloneFD), unix.F_SETFD, flags|unix.FD_CLOEXEC)
+       flags, _ = unix.FcntlInt(uintptr(stderrCloneFD), unix.F_GETFD, 0)
+       unix.FcntlInt(uintptr(stderrCloneFD), unix.F_SETFD, flags|unix.FD_CLOEXEC)
+
        // And then wrap the clone file descriptors in files.
        // One benefit of this (that we don't use yet) is that we can actually write
        // to these files to emit output to the console even though we're intercepting output

By using CLOEXEC we make sure the fds are never leaked into commands that are executed in the test suite.
I am happy to open a PR if you agree that this is the right approach to fix the problem.

@onsi
Copy link
Owner

onsi commented May 2, 2023

oh snap! this has been a loooong-standing issue and one i've spent a lot of effort trying to understand and fix. the best i've been able to do is provide a flag to disable output interception when this failure mode occurs. your description of the issue lines up with my understanding - i didn't know about CLOEXEC but it sounds like it resolves the root issue.

yes please do submit a PR!

@Luap99 Luap99 changed the title ginkgo run in parralell can hang when child command outlives test suite ginkgo run in parallel can hang when child command outlives test suite May 2, 2023
Luap99 added a commit to Luap99/ginkgo that referenced this issue May 3, 2023
When you run any child process that stay around longer than the test
suite ginkgo currently hangs. This is because the dup stdout/err fds are
leaked into the child thus read() will block on it as there is at least
one process still having the write pipe open.

From ginkgos POV it looks like it is done, you see the ginkgo result
output printed but then it just hangs and doe snot exit because of it.

To fix it we set FD_CLOEXEC on the dup-ed fds, this prevents them from
ever being leaking into other processes that could outlive the suite.
There is a dup3() call the could be uses to set the CLOEXEC option
directly but this seem linux only and thus is less portable.
The fcntl call should be good enough here, we do not need to be worried
about the race conditions described in the man page.
Ideally we should do some error handling in that function for both the
fcntl calls and the existing dup() above, however this seems like a
rather big change.

I am not so sure about how to test it properly, I added a test which
just executes `ginkgo run -p` and a test which only starts `sleep 60`.
Ginkgo then should exit right way and keep this process running. Then I
just make sure the gingo exits in under 15 seconds. As long as it is
below 60s it should fulfil the purpose.

Fixes onsi#1191

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
@Luap99
Copy link
Contributor Author

Luap99 commented May 3, 2023

PR #1192

@onsi onsi closed this as completed in #1192 May 3, 2023
onsi added a commit that referenced this issue May 3, 2023
* integration: build interceptor binary automatically

Trying to run this locally the interceptor binary did not exists.
Lets just build this in SynchronizedBeforeSuite() automatically so
users do not have to figure this out.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>

* integration: make interceptor test parallel safe

Using the same file name for all test cause conflicts when run in
parallel, this causes the tests to fail for me. To fix this use a
filename suffix with GinkgoParallelProcess() which prevents conflicts
in the file name.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>

* fix hang with ginkgo -p

When you run any child process that stay around longer than the test
suite ginkgo currently hangs. This is because the dup stdout/err fds are
leaked into the child thus read() will block on it as there is at least
one process still having the write pipe open.

From ginkgos POV it looks like it is done, you see the ginkgo result
output printed but then it just hangs and doe snot exit because of it.

To fix it we set FD_CLOEXEC on the dup-ed fds, this prevents them from
ever being leaking into other processes that could outlive the suite.
There is a dup3() call the could be uses to set the CLOEXEC option
directly but this seem linux only and thus is less portable.
The fcntl call should be good enough here, we do not need to be worried
about the race conditions described in the man page.
Ideally we should do some error handling in that function for both the
fcntl calls and the existing dup() above, however this seems like a
rather big change.

I am not so sure about how to test it properly, I added a test which
just executes `ginkgo run -p` and a test which only starts `sleep 60`.
Ginkgo then should exit right way and keep this process running. Then I
just make sure the gingo exits in under 15 seconds. As long as it is
below 60s it should fulfil the purpose.

Fixes #1191

Signed-off-by: Paul Holzinger <pholzing@redhat.com>

* Rearrange new interceptor hang tests and confirm they cover the issue in question

---------

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Co-authored-by: Onsi Fakhouri <onsijoe@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants