Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock during NRI plugin registration #79

Merged
merged 1 commit into from Apr 18, 2024

Conversation

bobbypage
Copy link
Contributor

During NRI external plugin registration:

  • acceptPluginConnections() is called
  • adaptation lock from nri/pkg/adaptation is acquired
  • syncFn is invoked
  • syncFn acquires NRI lock in pkg/nri/nri.go

During container lifecycle events such as ContainerStart

  • NRI lock is acquired in pkg/nri.go
  • adaptation lock is acquired in StateChange() in nri/pkg/adaptation

As a result, the locking order during NRI plugin registration is:

  • adaptation lock -> NRI lock

While the locking order during container starts is:

  • NRI lock -> adaptation lock

Due the fact that the locking order is inverted and not consistent, it it possible to encounter a deadlock.

To fix the issue, during NRI plugin registration, first acquire the NRI lock (done via syncFn call) and only after acquire the adaptation lock. This ensures that NRI plugin registration the locking order is adaption lock -> NRI lock, which is consistent with the locking order during container lifecycle events.

Fixes containerd/containerd#10085

@bobbypage
Copy link
Contributor Author

bobbypage commented Apr 18, 2024

cc @klihub @acurtiz @samuelkarp

@samuelkarp
Copy link
Member

WIP: Fix deadlock during NRI plugin registration

What are you looking for to make this non-WIP?

@klihub
Copy link
Member

klihub commented Apr 18, 2024

I think this should be fine.

During NRI external plugin registration:

* `acceptPluginConnections()` is called
* adaptation lock from `nri/pkg/adaptation` is acquired
* `syncFn` is invoked
* `syncFn` acquires NRI lock in `pkg/nri/nri.go`

During container lifecycle events such as `ContainerStart`
* NRI lock is acquired in pkg/nri.go
* adaptation lock is acquired in `StateChange()` in `nri/pkg/adaptation`

As a result, the locking order during NRI plugin registration is:
* adaptation lock -> NRI lock

While the locking order during container starts is:
* NRI lock -> adaptation lock

Due the fact that the locking order is inverted and not consistent, it
it possible to encounter a deadlock.

To fix the issue, during NRI plugin registration, first acquire the NRI
lock (done via `syncFn` call) and only after acquire the adaptation lock.
This ensures that NRI plugin registration the locking order is adaption
lock -> NRI lock, which is consistent with the locking order during
container lifecycle events.

Fixes containerd/containerd#10085

Signed-off-by: David Porter <porterdavid@google.com>
@bobbypage bobbypage changed the title WIP: Fix deadlock during NRI plugin registration Fix deadlock during NRI plugin registration Apr 18, 2024
@bobbypage
Copy link
Contributor Author

WIP: Fix deadlock during NRI plugin registration

What are you looking for to make this non-WIP?

I removed WIP, wanted to test it again a few times against the repro in containerd/containerd#10085 (comment)

After a few runs of those steps, I can confirm that without this patch the deadlock always repros and with this patch I am no longer able to repro the deadlock.

@estesp estesp merged commit fa64d11 into containerd:main Apr 18, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

NRI plugin registration can trigger a deadlock
5 participants