
daemon: handleContainerExit: ignore networking errors #49507

Merged

merged 1 commit into moby:master from fix-restart-port-already-in-use on Feb 20, 2025

Conversation

@akerouanton (Member) commented on Feb 20, 2025

Related to:

- What I did

Prior to commit fe856b9, containers' network sandbox and interfaces were created before the containerd task. Now, they're created after.

If this step fails, the containerd task is forcefully deleted, and an event is sent to the c8d event monitor, which triggers `handleContainerExit`. This method then tries to restart the faulty container.

This causes containers with a published port already in use to get stuck in a tight restart loop (if they're started with `--restart=always`) until the port becomes available, needlessly spamming the daemon logs.

Prior to that commit, a published port already in use wouldn't trigger the restart process.

- How I did it

This commit adds a check to `handleContainerExit` to ignore exit events if the latest container error is related to networking setup.

- How to verify it
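One way to reproduce the bug and check the fix (a hypothetical sketch using the Docker Go client SDK, not taken from the PR; the image name, port numbers, container name, and the exact `container.StartOptions` type are assumptions and vary across SDK versions): start something that already holds a host port, then start a container publishing the same port with `--restart=always` and watch its restart count.

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
	"github.com/docker/go-connections/nat"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	// Assumes host port 8080 is already bound by another process or
	// container, so networking setup for this container must fail.
	hostCfg := &container.HostConfig{
		PortBindings: nat.PortMap{
			"80/tcp": {{HostPort: "8080"}},
		},
		RestartPolicy: container.RestartPolicy{Name: "always"},
	}
	created, err := cli.ContainerCreate(ctx, &container.Config{Image: "nginx"}, hostCfg, nil, nil, "port-clash")
	if err != nil {
		panic(err)
	}

	// Start is expected to fail with a "port is already allocated" error.
	if err := cli.ContainerStart(ctx, created.ID, container.StartOptions{}); err != nil {
		fmt.Println("start error:", err)
	}

	// Before the fix, RestartCount climbs rapidly as the daemon loops;
	// with the fix, the container stays stopped with the error recorded.
	inspect, err := cli.ContainerInspect(ctx, created.ID)
	if err != nil {
		panic(err)
	}
	fmt.Println("restart count:", inspect.RestartCount)
}
```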

- Human readable description for the release notes

Fix a bug that caused containers with `--restart=always` and a published port already in use to restart in a tight loop.

- A picture of a cute animal (not mandatory but encouraged)

@akerouanton force-pushed the fix-restart-port-already-in-use branch from 874796b to a34be7e on February 20, 2025 16:09
@akerouanton changed the title from "daemon: handleContainerExit: ignore unstarted containers" to "daemon: handleContainerExit: ignore networking errors" on Feb 20, 2025
@akerouanton self-assigned this on Feb 20, 2025
@vvoland added this to the 29.0.0 milestone on Feb 20, 2025
@akerouanton force-pushed the fix-restart-port-already-in-use branch from a34be7e to aa45c80 on February 20, 2025 16:42
@akerouanton modified the milestones: 29.0.0, 28.0.1 on Feb 20, 2025
@akerouanton marked this pull request as ready for review on February 20, 2025 16:43
@vvoland (Contributor) left a comment:

LGTM

```go
// c.ErrorMsg is set by [daemon.containerStart], and doesn't preserve the
// error type (because this field is persisted on disk). So, use string
// matching instead of usual error comparison methods.
if strings.Contains(c.ErrorMsg, "failed to set up container networking") {
```
A reviewer (Contributor) commented on this line:

For posterity: We could also consider doing that in general for any `c.ErrorMsg != ""` error, but for now let's go with a minimal change to get the pre-v28 behavior.
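For contrast, a minimal sketch of that more general guard (hypothetical, not the merged change, which keeps the narrower string match; the enclosing function is assumed to return an error):

```go
// Generalized variant: skip the restart logic whenever the container
// recorded any start error, not only networking failures.
if c.ErrorMsg != "" {
	// The container never started successfully; don't restart it.
	return nil
}
```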

@robmry (Contributor) left a comment:

LGTM

```diff
@@ -34,5 +34,8 @@ func (daemon *Daemon) initializeCreatedTask(
 		return errdefs.System(err)
 	}
 }
-return daemon.allocateNetwork(ctx, cfg, ctr)
+if err := daemon.allocateNetwork(ctx, cfg, ctr); err != nil {
+	return fmt.Errorf("failed to set up container networking: %w", err)
```
A reviewer (Contributor) commented on this change:

The test failure should be enough to avoid regressions. But, as these are both in package daemon, could we put the error string in a const to make it clear where the message is used?

@akerouanton (Member, Author) replied:

Done.
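A minimal sketch of what that refactor might look like, assuming a hypothetical const name `errSetupNetworking` (the identifier in the merged commit may differ); these are fragments of package daemon, not a standalone program:

```go
// A single const ties the error wrapped by initializeCreatedTask to the
// string match in handleContainerExit, both in package daemon.
const errSetupNetworking = "failed to set up container networking"

// In initializeCreatedTask:
if err := daemon.allocateNetwork(ctx, cfg, ctr); err != nil {
	return fmt.Errorf("%s: %w", errSetupNetworking, err)
}

// In handleContainerExit:
if strings.Contains(c.ErrorMsg, errSetupNetworking) {
	// Networking setup failed before the container ever ran, so skip
	// the restart logic instead of entering a tight loop.
}
```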

Commit message:

Prior to commit fe856b9, containers' network sandbox and interfaces were
created before the containerd task. Now, they're created after.

If this step fails, the containerd task is forcefully deleted, and an
event is sent to the c8d event monitor, which triggers `handleContainerExit`.
This method then tries to restart the faulty container.

This causes containers with a published port already in use to get
stuck in a tight restart loop (if they're started with
`--restart=always`) until the port becomes available, needlessly
spamming the daemon logs.

Prior to that commit, a published port already in use wouldn't trigger
the restart process.

This commit adds a check to `handleContainerExit` to ignore exit events
if the latest container error is related to networking setup.

Signed-off-by: Albin Kerouanton <albinker@gmail.com>
@akerouanton force-pushed the fix-restart-port-already-in-use branch from aa45c80 to ac8b4e3 on February 20, 2025 17:03
@akerouanton requested a review from robmry on February 20, 2025 17:03
@neersighted (Member) left a comment:

LGTM

@thompson-shaun modified the milestones: 29.0.0, 28.0.1 on Feb 20, 2025
@thaJeztah merged commit f0f008b into moby:master on Feb 20, 2025
153 checks passed
@akerouanton deleted the fix-restart-port-already-in-use branch on February 21, 2025 09:09
Development

Successfully merging this pull request may close these issues:

Containers with a port already in use restart in a tight loop

6 participants