Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: fail fast may cause Serial spec or cleanup Node interrupted #1178

Merged
merged 4 commits into from
May 3, 2023

Conversation

cvvz
Copy link
Contributor

@cvvz cvvz commented Apr 4, 2023

Problem

When using fail-fast, ginkgo start a goroutine to poll from server every 500ms (ABORT_POLLING_INTERVAL) to see if current spec should be interrupted by other process.

There is a chance that before Serial Spec or cleanup Node start to run, the interruption status is still unset, so they won't be skipped and start to run. However, it may get interrupted by other process during running.

Please refer to https://github.com/cvvz/go-playground/tree/master/ginkgo/sample for the steps of reproduction.

Resolution

  1. Wait for ABORT_POLLING_INTERVAL before Serial Spec start to run, this will make sure the interruption status is the latest.
  2. Ignore interruption by other process when running cleanup Nodes.

@onsi
Copy link
Owner

onsi commented Apr 4, 2023

hey there - thanks for taking the time to put together a reproducer and for proposing a fix. this part of the codebase is some of the most complex and I appreciate the effort!

I agree that cleanup nodes should not be interrupted when a different process fails during --fail-fast. That's a great catch and I think your fix is correct.

I'm not sure I follow the issue around Serial specs. It's true there is a window of time in which they may start to run and then be interrupted - but I'm not sure I follow why this is problematic. I'd prefer to not add a 500ms delay for all suites in the case of Serial specs running. If this window of time is, indeed, problematic I'd prefer a solution where we teach InterruptHandler how to CheckForCrossProcessInterrupts() and call that in suite.go instead of sleeping. But first - can you share some more around why this window of time is proving problematic for you? (edit: never mind... after some thought I think I agree now that we should fix this gap)

Also - we'll need to add tests for these in internal_integration. Particularly for the After* cleanup nodes not aborting. I'd be happy to do that when I pull this in. In fact, I'm happy to take it from here if you'd prefer - or if you'd like to rework the InterruptHandler portion of the PR please feel free to. Let me know either way.

@cvvz
Copy link
Contributor Author

cvvz commented Apr 6, 2023

If this window of time is, indeed, problematic I'd prefer a solution where we teach InterruptHandler how to CheckForCrossProcessInterrupts() and call that in suite.go instead of sleeping.

Hi @onsi , I'm not so sure how to call CheckForCrossProcessInterrupts() in suite.go instead of sleeping, could you share more detailes? Or you could just patch on this PR directly, I'm ok with that. Thanks.

@onsi
Copy link
Owner

onsi commented Apr 6, 2023

no worries - i'll try to work on this in the next couple of days.

@onsi
Copy link
Owner

onsi commented May 3, 2023

hey - sorry, i had another project take priority over this. it's still on my radar, but it's going to be a few weeks before i can get to it

cvvz and others added 2 commits May 3, 2023 09:06
1. inter-process aborts should not interrupt cleanup nodes
2. whenever we fetch interrupt status, check and see if an abort has happened.  if it has ensure we return the latest, correct, abort state.  this allows us to avoid accidentally starting the next spec  because the ABORT_POLLING_INTERVAL hasn't fired yet
@onsi onsi force-pushed the fix-parallel-fail-fast branch from 2a8be0e to 15b4871 Compare May 3, 2023 19:51
@onsi
Copy link
Owner

onsi commented May 3, 2023

hey - i ended up prioritizing this today to get it done. CI is running now but assuming it succeeds i'll be pulling this in and cutting a release. i added tests to cover the case of abort interrupting cleanup nodes.

i also came up with a different approach for closing the gap between when an abort occurs and when the other ginkgo processes notice. now instead of just polling we also check whenever interrupt Status is requested. this ensures that nothing that shouldn't run ever runs (not just serial specs)

@onsi onsi merged commit 8dea88b into onsi:master May 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants