Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Initial support of idmapped mount points #5890

Merged
merged 6 commits into from
Sep 5, 2023

Conversation

artqzn
Copy link

@artqzn artqzn commented Aug 17, 2021

Hi,
this is this is initial support of idmapped mount points in containerd. The original PR was published by @mauriciovasquezbernal here #4734.

Updates:

  • Original PR was rebased on top of latest containerd.
  • Idmapped mount support was reimplemented using in Go.
  • Small changes in error handling.
  • Added some comments.
  • Updated container start time performance measurement results (numbers are in seconds).
Image ID mapped mount WithRemappedSnapshot
BusyBox 00.135 04.964
Ubuntu 00.171 15.713
Fedora 00.143 38.799

In current patch set the only idmapped mount points functionality is implemented. I think that user namespace support for containerd should be integrated separately, please share your opinion, maybe I am wrong here. I don't know if anybody is working on it, if no I would like to work on user namespaces support also.

Dear reviewers, could you please check this PR.
Thank you in advance!

cc @alban @rata @mauriciovasquezbernal @AkihiroSuda

@k8s-ci-robot
Copy link

Hi @artqzn. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tuananh
Copy link

tuananh commented Aug 18, 2021

Cant wait :)

@artqzn artqzn force-pushed the idmapped_mounts branch 4 times, most recently from 1edbe6e to cfb1212 Compare August 18, 2021 09:48
@artqzn
Copy link
Author

artqzn commented Aug 18, 2021

Previous PR's: #5075 #4734 #5888

@dmcgowan dmcgowan added this to New in Code Review via automation Aug 18, 2021
@dmcgowan dmcgowan moved this from New to Needs Discussion in Code Review Aug 18, 2021
@containerd containerd deleted a comment from theopenlab-ci bot Aug 19, 2021
@containerd containerd deleted a comment from theopenlab-ci bot Aug 19, 2021
@containerd containerd deleted a comment from theopenlab-ci bot Aug 19, 2021
@containerd containerd deleted a comment from theopenlab-ci bot Aug 19, 2021
Copy link
Member

@kzys kzys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may have a bigger discussion first though.

)

path = fmt.Sprintf("/proc/%d/ns/user", pid)
if userNsFile, err = os.OpenFile(path, os.O_RDONLY, 0); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using os.Open() here instead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed


path = fmt.Sprintf("/proc/%d/ns/user", pid)
if userNsFile, err = os.OpenFile(path, os.O_RDONLY, 0); err != nil {
return errors.Wrapf(err, "Unable to get user ns file descriptor for - %s\n", path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't include \n

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

attr.userNs = uint64(userNsFile.Fd())

defer userNsFile.Close()
if targetDir, err = os.OpenFile(target, os.O_RDONLY, 0); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.Open?

@@ -198,11 +209,13 @@ func UnmountAll(mount string, flags int) error {

// parseMountOptions takes fstab style mount options and parses them for
// use with a standard mount() syscall
func parseMountOptions(options []string) (int, string, bool) {
func parseMountOptions(options []string) (int, string, bool, string, string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be the time to return a struct.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it will be much better

@theopenlab-ci
Copy link

theopenlab-ci bot commented Aug 20, 2021

Build succeeded.

@artqzn
Copy link
Author

artqzn commented Aug 20, 2021

/retest

@k8s-ci-robot
Copy link

@artqzn: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Aug 20, 2021

Build succeeded.

@artqzn
Copy link
Author

artqzn commented Aug 20, 2021

@kzys Thanks for comment! Fixed

@theopenlab-ci
Copy link

theopenlab-ci bot commented Sep 15, 2021

Build succeeded.

@theopenlab-ci
Copy link

theopenlab-ci bot commented Sep 15, 2021

Build succeeded.

// MapMount applies GID/UID shift according to gidmap/uidmap for target path
func MapMount(uidmap string, gidmap string, target string) (err error) {
const (
userNsHelperBinary = "/bin/sh"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be /bin/echo or something smaller?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we can use any standard non-blocking program. Here we just need to run some small application in user namespace to create gid/uid mappings to run mount_setattr() before /proc/{pid}/ns/user becomes unavailable.
Possibly let's use /bin/true, it has the same size as echo.

-rwxr-xr-x 1 root root 39288 Sep 5 2019 uname
-rwxr-xr-x 1 root root 39256 Sep 5 2019 echo
-rwxr-xr-x 1 root root 39256 Sep 5 2019 false
-rwxr-xr-x 1 root root 39256 Sep 5 2019 sleep
-rwxr-xr-x 1 root root 39256 Sep 5 2019 sync
-rwxr-xr-x 1 root root 39256 Sep 5 2019 true

I'll update the PR.

@fuweid fuweid added this to the 2.0 milestone Aug 29, 2023
Copy link
Contributor

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mostly LGTM. Most of my comments are minor and easy to fix

snapshots/overlay/overlay_test.go Show resolved Hide resolved
snapshots/overlay/overlay_test.go Outdated Show resolved Hide resolved
snapshots/overlay/overlay_test.go Outdated Show resolved Hide resolved
}
}()

if err = os.Lchown(filepath.Join(td, "upper"), uidMap.HostID, gidMap.HostID); err != nil {
Copy link
Contributor

@rata rata Sep 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, lchown won't follow a symlink, right? Why do we want to change the symlink?

This will fail with new kernels (from https://lwn.net/SubscriberLink/942954/4a108b4f66074f2a/ ):

The kernel will no longer allow the changing of permissions on symbolic links.

Although I'm unsure if this means the owner/group or the read/write/exec... 🤔

Do we really need this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really not sure that there's a real case for that, but even if there isn't, I think it should be fixed separately, since main branch has this too (I didn't change it), so we're risking to break backward compatibility if there's a case for that, but it seems you're right, there might be issues with this later

I think maintainers can provide more info on this (@fuweid @AkihiroSuda @dmcgowan @kzys)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rata let's keep it as it is. And the upper in this context(function) is unlikely to be symlink right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's keep it like that if it is already like that in master

integration/client/container_idmapped_linux_test.go Outdated Show resolved Hide resolved
Previously remapping of a snapshotter has been done using
recursive chown.

Commit
containerd@31a6449 added a support
for "remap-ids" capability which allows snapshotter internals do
remappings in case of idmapped mounts support to avoid recursive
chown and creating a new remapped snapshot.

Signed-off-by: Ilya Hanov <ilya.hanov@huawei-partners.com>
Previously the only fuse-overlayfs supports "--remap-labels" option.
Since idmapped mounts were landed to Linux kernel v5.12 it becomes
possible to use it with overlayfs via mount_setattr() system call.

The changes are based on experimental patchset published by
Mauricio Vásquez containerd#4734.

Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Artem Kuzin <artem.kuzin@huawei.com>
Signed-off-by: Ilya Hanov <ilya.hanov@huawei-partners.com>
This patch introduces idmapped mounts support for
container rootfs.

The idmapped mounts support was merged in Linux kernel 5.12
torvalds/linux@7d6beb7.
This functionality allows to address chown overhead for containers that
use user namespace.

The changes are based on experimental patchset published by
Mauricio Vásquez containerd#4734.
Current version reiplements support of idmapped mounts using Golang.

Performance measurement results:
Image           idmapped mount  recursive chown
BusyBox         00.135          04.964
Ubuntu          00.171          15.713
Fedora          00.143          38.799

Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Artem Kuzin <artem.kuzin@huawei.com>
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
Signed-off-by: Ilya Hanov <ilya.hanov@huawei-partners.com>
Signed-off-by: Ilya Hanov <ilya.hanov@huawei-partners.com>
Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
Signed-off-by: Ilya Hanov <ilya.hanov@huawei-partners.com>
Signed-off-by: Ilya Hanov <ilya.hanov@huawei-partners.com>
Copy link
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AkihiroSuda AkihiroSuda merged commit 0ee2433 into containerd:main Sep 5, 2023
45 checks passed
Code Review automation moved this from Ready For Review to Done Sep 5, 2023
fuweid added a commit to fuweid/containerd that referenced this pull request Oct 12, 2023
It's followup for containerd#5890.

The containerd-shim process depends on the mount package to init rootfs
for container. For the container enable user namespace, the mount
package needs to fork child process to get the brand-new user namespace.
However, there are two reapers in one process (described by the
following list) and there are race-condition cases.

1. mount package
2. sys.Reaper as global one which watch all the SIGCHLD.

=== [kill(2)][kill] the wrong process ===

Currently, we use pipe to ensure that child process is alive. However,
the pide file descriptor can be hold by other process, which the child
process cannot exit by self. We should use [kill(2)][kill] to ensure the
child process. But we might kill the wrong process if the child process
might be reaped by containerd-shim and the PID might be reused by other
process.

=== [waitid(2)][waitid] on the wrong child process ===

```
containerd-shim process:

Goroutine 1(GetUsernsFD):                   Goroutine 2(Reaper)

1. Ready to wait for child process X

                                            2. Received SIGCHLD from X

                                            3. Reaped the zombie child process X

                                            (X has been reused by other child process)

4. Wait on process X

The goroutine 1 will be stuck until the process X has been terminated.
```

=== open `/proc/X/ns/user` on the wrong child process ===

There is also pid-reused risk between opening `/proc/$pid/ns/user` and
writing `/proc/$pid/u[g]id_map`.

```
containerd-shim process:

Goroutine 1(GetUsernsFD):                   Goroutine 2(Reaper)

1. Fork child process X

2. Write /proc/X/uid_map,gid_map

                                            3. Received SIGCHLD from X

                                            4. Reaped the zombie child process X

                                            (X has been reused by other process)

5. Open /proc/X/ns/user file as usernsFD

The usernsFD links to the wrong X!!!
```

In order to fix the race-condition, we should use [CLONE_PIDFD][clone2] (Since
Linux v5.2).

When we fork child process `X`, the kernel will return a process file
descriptor `X_PIDFD` referencing to child process `X`. With the pidfd, we can
use [pidfd_send_signal(2)][pidfd_send_signal] (Since Linux v5.1)
to send signal(0) to ensure the child process `X` is alive. If the `X` has
terminated and its PID has been recycled for another process. The
pidfd_send_signal fails with the error ESRCH.

Therefore, we can open `/proc/X/{ns/user,uid_map,gid_map}` file
descriptors as first and then use pidfd_send_signal to check the process
is still alive. If so, we can ensure the file descriptors are valid and
reference to the child process `X`. Even if the `X` PID has been reused
after pidfd_send_signal call, the file descriptors are still valid.

```code
X, pidfd = clone2(CLONE_PIDFD)

usernsFD = open /proc/X/ns/user
uidmapFD = open /proc/X/uid_map
gidmapFD = open /proc/X/gid_map

pidfd_send_signal pidfd, signal(0)
  return err if no such process

== When we arrive here, we can ensure usernsFD/uidmapFD/gidmapFD are correct
== even if X has been reused after pidfd_send_signal call.

update uid/gid mapping by uidmapFD/gidmapFD
return usernsFD
```

And the [waitid(2)][waitid] also supports pidfd type (Since Linux 5.4).
We can use pidfd type waitid to ensure we are waiting for the correct
process. All the PID related race-condition issues can be resolved by
pidfd.

```bash
➜  mount git:(followup-idmapped) pwd
/home/fuwei/go/src/github.com/containerd/containerd/mount

➜  mount git:(followup-idmapped) sudo go test -test.root -run TestGetUsernsFD -count=1000 -failfast -p 100  ./...
PASS
ok      github.com/containerd/containerd/mount  3.446s
```

[kill]: <https://man7.org/linux/man-pages/man2/kill.2.html>
[clone2]: <https://man7.org/linux/man-pages/man2/clone.2.html>
[pidfd_send_signal]: <https://man7.org/linux/man-pages/man2/pidfd_send_signal.2.html>
[waitid]: <https://man7.org/linux/man-pages/man2/waitid.2.html>

Signed-off-by: Wei Fu <fuweid89@gmail.com>
fuweid added a commit to fuweid/containerd that referenced this pull request Oct 12, 2023
It's followup for containerd#5890.

The containerd-shim process depends on the mount package to init rootfs
for container. For the container enable user namespace, the mount
package needs to fork child process to get the brand-new user namespace.
However, there are two reapers in one process (described by the
following list) and there are race-condition cases.

1. mount package
2. sys.Reaper as global one which watch all the SIGCHLD.

=== [kill(2)][kill] the wrong process ===

Currently, we use pipe to ensure that child process is alive. However,
the pide file descriptor can be hold by other process, which the child
process cannot exit by self. We should use [kill(2)][kill] to ensure the
child process. But we might kill the wrong process if the child process
might be reaped by containerd-shim and the PID might be reused by other
process.

=== [waitid(2)][waitid] on the wrong child process ===

```
containerd-shim process:

Goroutine 1(GetUsernsFD):                   Goroutine 2(Reaper)

1. Ready to wait for child process X

                                            2. Received SIGCHLD from X

                                            3. Reaped the zombie child process X

                                            (X has been reused by other child process)

4. Wait on process X

The goroutine 1 will be stuck until the process X has been terminated.
```

=== open `/proc/X/ns/user` on the wrong child process ===

There is also pid-reused risk between opening `/proc/$pid/ns/user` and
writing `/proc/$pid/u[g]id_map`.

```
containerd-shim process:

Goroutine 1(GetUsernsFD):                   Goroutine 2(Reaper)

1. Fork child process X

2. Write /proc/X/uid_map,gid_map

                                            3. Received SIGCHLD from X

                                            4. Reaped the zombie child process X

                                            (X has been reused by other process)

5. Open /proc/X/ns/user file as usernsFD

The usernsFD links to the wrong X!!!
```

In order to fix the race-condition, we should use [CLONE_PIDFD][clone2] (Since
Linux v5.2).

When we fork child process `X`, the kernel will return a process file
descriptor `X_PIDFD` referencing to child process `X`. With the pidfd, we can
use [pidfd_send_signal(2)][pidfd_send_signal] (Since Linux v5.1)
to send signal(0) to ensure the child process `X` is alive. If the `X` has
terminated and its PID has been recycled for another process. The
pidfd_send_signal fails with the error ESRCH.

Therefore, we can open `/proc/X/{ns/user,uid_map,gid_map}` file
descriptors as first and then use pidfd_send_signal to check the process
is still alive. If so, we can ensure the file descriptors are valid and
reference to the child process `X`. Even if the `X` PID has been reused
after pidfd_send_signal call, the file descriptors are still valid.

```code
X, pidfd = clone2(CLONE_PIDFD)

usernsFD = open /proc/X/ns/user
uidmapFD = open /proc/X/uid_map
gidmapFD = open /proc/X/gid_map

pidfd_send_signal pidfd, signal(0)
  return err if no such process

== When we arrive here, we can ensure usernsFD/uidmapFD/gidmapFD are correct
== even if X has been reused after pidfd_send_signal call.

update uid/gid mapping by uidmapFD/gidmapFD
return usernsFD
```

And the [waitid(2)][waitid] also supports pidfd type (Since Linux 5.4).
We can use pidfd type waitid to ensure we are waiting for the correct
process. All the PID related race-condition issues can be resolved by
pidfd.

```bash
➜  mount git:(followup-idmapped) pwd
/home/fuwei/go/src/github.com/containerd/containerd/mount

➜  mount git:(followup-idmapped) sudo go test -test.root -run TestGetUsernsFD -count=1000 -failfast -p 100  ./...
PASS
ok      github.com/containerd/containerd/mount  3.446s
```

[kill]: <https://man7.org/linux/man-pages/man2/kill.2.html>
[clone2]: <https://man7.org/linux/man-pages/man2/clone.2.html>
[pidfd_send_signal]: <https://man7.org/linux/man-pages/man2/pidfd_send_signal.2.html>
[waitid]: <https://man7.org/linux/man-pages/man2/waitid.2.html>

Signed-off-by: Wei Fu <fuweid89@gmail.com>
fuweid added a commit to fuweid/containerd that referenced this pull request Oct 12, 2023
It's followup for containerd#5890.

The containerd-shim process depends on the mount package to init rootfs
for container. For the container enable user namespace, the mount
package needs to fork child process to get the brand-new user namespace.
However, there are two reapers in one process (described by the
following list) and there are race-condition cases.

1. mount package
2. sys.Reaper as global one which watch all the SIGCHLD.

=== [kill(2)][kill] the wrong process ===

Currently, we use pipe to ensure that child process is alive. However,
the pide file descriptor can be hold by other process, which the child
process cannot exit by self. We should use [kill(2)][kill] to ensure the
child process. But we might kill the wrong process if the child process
might be reaped by containerd-shim and the PID might be reused by other
process.

=== [waitid(2)][waitid] on the wrong child process ===

```
containerd-shim process:

Goroutine 1(GetUsernsFD):                   Goroutine 2(Reaper)

1. Ready to wait for child process X

                                            2. Received SIGCHLD from X

                                            3. Reaped the zombie child process X

                                            (X has been reused by other child process)

4. Wait on process X

The goroutine 1 will be stuck until the process X has been terminated.
```

=== open `/proc/X/ns/user` on the wrong child process ===

There is also pid-reused risk between opening `/proc/$pid/ns/user` and
writing `/proc/$pid/u[g]id_map`.

```
containerd-shim process:

Goroutine 1(GetUsernsFD):                   Goroutine 2(Reaper)

1. Fork child process X

2. Write /proc/X/uid_map,gid_map

                                            3. Received SIGCHLD from X

                                            4. Reaped the zombie child process X

                                            (X has been reused by other process)

5. Open /proc/X/ns/user file as usernsFD

The usernsFD links to the wrong X!!!
```

In order to fix the race-condition, we should use [CLONE_PIDFD][clone2] (Since
Linux v5.2).

When we fork child process `X`, the kernel will return a process file
descriptor `X_PIDFD` referencing to child process `X`. With the pidfd, we can
use [pidfd_send_signal(2)][pidfd_send_signal] (Since Linux v5.1)
to send signal(0) to ensure the child process `X` is alive. If the `X` has
terminated and its PID has been recycled for another process. The
pidfd_send_signal fails with the error ESRCH.

Therefore, we can open `/proc/X/{ns/user,uid_map,gid_map}` file
descriptors as first and then use pidfd_send_signal to check the process
is still alive. If so, we can ensure the file descriptors are valid and
reference to the child process `X`. Even if the `X` PID has been reused
after pidfd_send_signal call, the file descriptors are still valid.

```code
X, pidfd = clone2(CLONE_PIDFD)

usernsFD = open /proc/X/ns/user
uidmapFD = open /proc/X/uid_map
gidmapFD = open /proc/X/gid_map

pidfd_send_signal pidfd, signal(0)
  return err if no such process

== When we arrive here, we can ensure usernsFD/uidmapFD/gidmapFD are correct
== even if X has been reused after pidfd_send_signal call.

update uid/gid mapping by uidmapFD/gidmapFD
return usernsFD
```

And the [waitid(2)][waitid] also supports pidfd type (Since Linux 5.4).
We can use pidfd type waitid to ensure we are waiting for the correct
process. All the PID related race-condition issues can be resolved by
pidfd.

```bash
➜  mount git:(followup-idmapped) pwd
/home/fuwei/go/src/github.com/containerd/containerd/mount

➜  mount git:(followup-idmapped) sudo go test -test.root -run TestGetUsernsFD -count=1000 -failfast -p 100  ./...
PASS
ok      github.com/containerd/containerd/mount  3.446s
```

[kill]: <https://man7.org/linux/man-pages/man2/kill.2.html>
[clone2]: <https://man7.org/linux/man-pages/man2/clone.2.html>
[pidfd_send_signal]: <https://man7.org/linux/man-pages/man2/pidfd_send_signal.2.html>
[waitid]: <https://man7.org/linux/man-pages/man2/waitid.2.html>

Signed-off-by: Wei Fu <fuweid89@gmail.com>
fuweid added a commit to fuweid/containerd that referenced this pull request Oct 12, 2023
It's followup for containerd#5890.

The containerd-shim process depends on the mount package to init rootfs
for container. For the container enable user namespace, the mount
package needs to fork child process to get the brand-new user namespace.
However, there are two reapers in one process (described by the
following list) and there are race-condition cases.

1. mount package
2. sys.Reaper as global one which watch all the SIGCHLD.

=== [kill(2)][kill] the wrong process ===

Currently, we use pipe to ensure that child process is alive. However,
the pide file descriptor can be hold by other process, which the child
process cannot exit by self. We should use [kill(2)][kill] to ensure the
child process. But we might kill the wrong process if the child process
might be reaped by containerd-shim and the PID might be reused by other
process.

=== [waitid(2)][waitid] on the wrong child process ===

```
containerd-shim process:

Goroutine 1(GetUsernsFD):                   Goroutine 2(Reaper)

1. Ready to wait for child process X

                                            2. Received SIGCHLD from X

                                            3. Reaped the zombie child process X

                                            (X has been reused by other child process)

4. Wait on process X

The goroutine 1 will be stuck until the process X has been terminated.
```

=== open `/proc/X/ns/user` on the wrong child process ===

There is also pid-reused risk between opening `/proc/$pid/ns/user` and
writing `/proc/$pid/u[g]id_map`.

```
containerd-shim process:

Goroutine 1(GetUsernsFD):                   Goroutine 2(Reaper)

1. Fork child process X

2. Write /proc/X/uid_map,gid_map

                                            3. Received SIGCHLD from X

                                            4. Reaped the zombie child process X

                                            (X has been reused by other process)

5. Open /proc/X/ns/user file as usernsFD

The usernsFD links to the wrong X!!!
```

In order to fix the race-condition, we should use [CLONE_PIDFD][clone2] (Since
Linux v5.2).

When we fork child process `X`, the kernel will return a process file
descriptor `X_PIDFD` referencing to child process `X`. With the pidfd, we can
use [pidfd_send_signal(2)][pidfd_send_signal] (Since Linux v5.1)
to send signal(0) to ensure the child process `X` is alive. If the `X` has
terminated and its PID has been recycled for another process. The
pidfd_send_signal fails with the error ESRCH.

Therefore, we can open `/proc/X/{ns/user,uid_map,gid_map}` file
descriptors as first and then use pidfd_send_signal to check the process
is still alive. If so, we can ensure the file descriptors are valid and
reference to the child process `X`. Even if the `X` PID has been reused
after pidfd_send_signal call, the file descriptors are still valid.

```code
X, pidfd = clone2(CLONE_PIDFD)

usernsFD = open /proc/X/ns/user
uidmapFD = open /proc/X/uid_map
gidmapFD = open /proc/X/gid_map

pidfd_send_signal pidfd, signal(0)
  return err if no such process

== When we arrive here, we can ensure usernsFD/uidmapFD/gidmapFD are correct
== even if X has been reused after pidfd_send_signal call.

update uid/gid mapping by uidmapFD/gidmapFD
return usernsFD
```

And the [waitid(2)][waitid] also supports pidfd type (Since Linux 5.4).
We can use pidfd type waitid to ensure we are waiting for the correct
process. All the PID related race-condition issues can be resolved by
pidfd.

```bash
➜  mount git:(followup-idmapped) pwd
/home/fuwei/go/src/github.com/containerd/containerd/mount

➜  mount git:(followup-idmapped) sudo go test -test.root -run TestGetUsernsFD -count=1000 -failfast -p 100  ./...
PASS
ok      github.com/containerd/containerd/mount  3.446s
```

[kill]: <https://man7.org/linux/man-pages/man2/kill.2.html>
[clone2]: <https://man7.org/linux/man-pages/man2/clone.2.html>
[pidfd_send_signal]: <https://man7.org/linux/man-pages/man2/pidfd_send_signal.2.html>
[waitid]: <https://man7.org/linux/man-pages/man2/waitid.2.html>

Signed-off-by: Wei Fu <fuweid89@gmail.com>
juliusl pushed a commit to juliusl/containerd that referenced this pull request Jan 26, 2024
It's followup for containerd#5890.

The containerd-shim process depends on the mount package to init rootfs
for container. For the container enable user namespace, the mount
package needs to fork child process to get the brand-new user namespace.
However, there are two reapers in one process (described by the
following list) and there are race-condition cases.

1. mount package
2. sys.Reaper as global one which watch all the SIGCHLD.

=== [kill(2)][kill] the wrong process ===

Currently, we use pipe to ensure that child process is alive. However,
the pide file descriptor can be hold by other process, which the child
process cannot exit by self. We should use [kill(2)][kill] to ensure the
child process. But we might kill the wrong process if the child process
might be reaped by containerd-shim and the PID might be reused by other
process.

=== [waitid(2)][waitid] on the wrong child process ===

```
containerd-shim process:

Goroutine 1(GetUsernsFD):                   Goroutine 2(Reaper)

1. Ready to wait for child process X

                                            2. Received SIGCHLD from X

                                            3. Reaped the zombie child process X

                                            (X has been reused by other child process)

4. Wait on process X

The goroutine 1 will be stuck until the process X has been terminated.
```

=== open `/proc/X/ns/user` on the wrong child process ===

There is also pid-reused risk between opening `/proc/$pid/ns/user` and
writing `/proc/$pid/u[g]id_map`.

```
containerd-shim process:

Goroutine 1(GetUsernsFD):                   Goroutine 2(Reaper)

1. Fork child process X

2. Write /proc/X/uid_map,gid_map

                                            3. Received SIGCHLD from X

                                            4. Reaped the zombie child process X

                                            (X has been reused by other process)

5. Open /proc/X/ns/user file as usernsFD

The usernsFD links to the wrong X!!!
```

In order to fix the race-condition, we should use [CLONE_PIDFD][clone2] (Since
Linux v5.2).

When we fork child process `X`, the kernel will return a process file
descriptor `X_PIDFD` referencing to child process `X`. With the pidfd, we can
use [pidfd_send_signal(2)][pidfd_send_signal] (Since Linux v5.1)
to send signal(0) to ensure the child process `X` is alive. If the `X` has
terminated and its PID has been recycled for another process. The
pidfd_send_signal fails with the error ESRCH.

Therefore, we can open `/proc/X/{ns/user,uid_map,gid_map}` file
descriptors as first and then use pidfd_send_signal to check the process
is still alive. If so, we can ensure the file descriptors are valid and
reference to the child process `X`. Even if the `X` PID has been reused
after pidfd_send_signal call, the file descriptors are still valid.

```code
X, pidfd = clone2(CLONE_PIDFD)

usernsFD = open /proc/X/ns/user
uidmapFD = open /proc/X/uid_map
gidmapFD = open /proc/X/gid_map

pidfd_send_signal pidfd, signal(0)
  return err if no such process

== When we arrive here, we can ensure usernsFD/uidmapFD/gidmapFD are correct
== even if X has been reused after pidfd_send_signal call.

update uid/gid mapping by uidmapFD/gidmapFD
return usernsFD
```

And the [waitid(2)][waitid] also supports pidfd type (Since Linux 5.4).
We can use pidfd type waitid to ensure we are waiting for the correct
process. All the PID related race-condition issues can be resolved by
pidfd.

```bash
➜  mount git:(followup-idmapped) pwd
/home/fuwei/go/src/github.com/containerd/containerd/mount

➜  mount git:(followup-idmapped) sudo go test -test.root -run TestGetUsernsFD -count=1000 -failfast -p 100  ./...
PASS
ok      github.com/containerd/containerd/mount  3.446s
```

[kill]: <https://man7.org/linux/man-pages/man2/kill.2.html>
[clone2]: <https://man7.org/linux/man-pages/man2/clone.2.html>
[pidfd_send_signal]: <https://man7.org/linux/man-pages/man2/pidfd_send_signal.2.html>
[waitid]: <https://man7.org/linux/man-pages/man2/waitid.2.html>

Signed-off-by: Wei Fu <fuweid89@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

None yet