Skip to content

fix: carefully retry restarting HNS if it hangs #3529

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 27, 2025
Merged

Conversation

rbtr
Copy link
Contributor

@rbtr rbtr commented Mar 25, 2025

#3498 reverted two functional changes to the CNS init HNS interaction:

  1. always restart HNS after attempting to set the regkey
  2. retry restarting HNS if it hangs

It turns out that HNS hangs are common enough that they notably impact CNS on Windows becoming ready and are preventing that revert from passing tests and rolling out. Thus, it is not safe for CNS to block on HNS restarting - if it hangs, CNS never becomes ready, meaning the Node never becomes Ready.

This is a retake on (2) while keeping (1) reverted - we will only try to restart HNS if the regkey is not already set. But when we do try to restart it, we will retry this instead of hanging forever.

@Copilot Copilot AI review requested due to automatic review settings March 25, 2025 00:40
@rbtr rbtr requested a review from a team as a code owner March 25, 2025 00:40
@rbtr rbtr self-assigned this Mar 25, 2025
@rbtr rbtr added the cns Related to CNS. label Mar 25, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the robustness of the CNS HNS restart mechanism on Windows by incorporating retry logic to prevent indefinite hangs. The changes include:

  • Adding retry logic via the retry-go package to stop the HNS service.
  • Introducing a new helper function (tryStopServiceFn) to encapsulate stopping logic.
  • Expanding test coverage in platform/os_windows_test.go to validate multiple scenarios for stopping the HNS service.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
platform/os_windows.go Added retry logic and refactored service stop/start functionality.
platform/os_windows_test.go Added tests for tryStopServiceFn to cover various service scenarios.

@rbtr
Copy link
Contributor Author

rbtr commented Mar 25, 2025

/azp run Azure Container Networking PR

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rbtr rbtr enabled auto-merge March 25, 2025 16:11

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
@rbtr rbtr force-pushed the fix/retry-hns-restart branch from 47f6af9 to f11b9a0 Compare March 25, 2025 16:14
ashvindeodhar
ashvindeodhar previously approved these changes Mar 25, 2025
@rbtr
Copy link
Contributor Author

rbtr commented Mar 25, 2025

/azp run Azure Container Networking PR

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rbtr
Copy link
Contributor Author

rbtr commented Mar 25, 2025

/azp run Azure Container Networking PR

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rbtr rbtr force-pushed the fix/retry-hns-restart branch from d93c6d9 to 0d0817d Compare March 25, 2025 18:43
@rbtr
Copy link
Contributor Author

rbtr commented Mar 25, 2025

/azp run Azure Container Networking PR

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rbtr rbtr force-pushed the fix/retry-hns-restart branch from 0d0817d to 0e72c6c Compare March 25, 2025 20:46
@rbtr
Copy link
Contributor Author

rbtr commented Mar 25, 2025

/azp run Azure Container Networking PR

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
@rbtr rbtr force-pushed the fix/retry-hns-restart branch from 0e72c6c to 9bab884 Compare March 25, 2025 21:30
@rbtr
Copy link
Contributor Author

rbtr commented Mar 25, 2025

/azp run Azure Container Networking PR

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses the issue of HNS hanging when restarting by introducing retry logic so that CNS does not block indefinitely waiting on HNS.

  • Use of the retry-go library to repeatedly attempt stopping and starting the HNS service.
  • Addition of helper functions (tryStopServiceFn and tryStartServiceFn) to encapsulate the retry logic.
  • Expansion of unit tests in os_windows_test.go to cover the new retry-based service operations.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
platform/os_windows.go Implements retry logic with new helper functions for stopping/starting HNS.
platform/os_windows_test.go Adds tests and updates imports to cover the new retry-based service operation code.

@rbtr rbtr added this pull request to the merge queue Mar 27, 2025
Merged via the queue into master with commit ef873ba Mar 27, 2025
92 checks passed
@rbtr rbtr deleted the fix/retry-hns-restart branch March 27, 2025 18:54
rbtr added a commit that referenced this pull request Apr 4, 2025
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cns Related to CNS.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants