*: fix several issues with namespace path handling #4124

cyphar · 2023-11-23T03:30:30Z

This fixes #4122 along with several other issues I found while writing integration tests for joining namespace paths. In particular, this fixes the following issues:

Starting a container in a target user namespace now works properly.
- NOTE This requires calling the nsenter binary. I've taken care to ensure that we don't do this in any context other than the host, but I'm not super happy with this. The other option would be to self-exec but that would require changing up how runc init works as well so we could have two self-exec entrypoints. I'm not sure it's worth it.
- I've reimplemented this using CGo, by forking in C code and doing the setns(2) there. Technically it's a little dodgy to be doing work after a fork(2), and while we don't do anything too complicated (like heap allocations), I'm not sure I'm comfortable saying that it won't ever have any issues.
Time namespace configurations now work with user namespaces (they worked with rootless containers, but not "standard" ones).
Disallow configurations where the timens and userns namespaces are configured to join an existing namespace and there are configuration entries for the timens and userns. This is needed because you cannot modify the configuration of these namespace has been configured. Previously we would not use the configuration to configure the container but would use the mapping to figure out how uids and gids map (which was broken).
Remap the rootfs when using user namespaces in our integration tests, to get rid of the mkdir -p rootfs/{proc,sys,dev} hack.

There is a test I want to add to #3985 which requires these changes (to verify that idmap uses the correct user namespace when joining a userns instead of creating a new one).

Signed-off-by: Aleksa Sarai cyphar@cyphar.com

libcontainer/configs/config_linux.go

rata

Left some comments

libcontainer/configs/config_linux.go

libcontainer/configs/validate/rootless.go

libcontainer/init_linux.go

libcontainer/specconv/spec_linux.go

libcontainer/userns/userns_maps.go

tests/integration/helpers.bash

tests/integration/run.bats

libcontainer/configs/validate/validator.go

libcontainer/userns/userns_maps.go

rata

My comments are minor (can be addressed in other PRs, if there is something to address) and mostly questions. This LGTM

rata

LGTM

libcontainer/userns/userns_maps.c

libcontainer/userns/userns_maps.go

AkihiroSuda · 2023-12-04T16:48:43Z

contrib/cmd/remap-rootfs/remap-rootfs.go

+func usage() {
+	fmt.Println("usage: remap-rootfs <bundle>")
+	os.Exit(1)
+}


Needs more documentation, as markdown or godoc

I'll switch to urfave/cli and add some docs, though I sometimes wonder if we need a separate place for programs that we only use for test purposes. fs-idmap is basically useless to anyone outside of our tests, and while remap-rootfs is theoretically useful I doubt anyone will use it.

I was thinking the same too :)

contrib/cmd/remap-rootfs/remap-rootfs.go

We had ignore lines for these warnings, but it turns out this is most likely a bug in shellcheck and we can work around it by moving the helper function definition before any of the functions that use the helper function. Hopefully it'll be fixed soon. Ref: koalaman/shellcheck#2873 Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Our handling for name space paths with user namespaces has been broken for a long time. In particular, the need to parse /proc/self/*id_map in quite a few places meant that we would treat userns configurations that had a namespace path as if they were a userns configuration without mappings, resulting in errors. The primary issue was down to the id translation helper functions, which could only handle configurations that had explicit mappings. Obviously, when joining a user namespace we need to map the ids but figuring out the correct mapping is non-trivial in comparison. In order to get the mapping, you need to read /proc/<pid>/*id_map of a process inside the userns -- while most userns paths will be of the form /proc/<pid>/ns/user (and we have a fast-path for this case), this is not guaranteed and thus it is necessary to spawn a process inside the container and read its /proc/<pid>/*id_map files in the general case. As Go does not allow us spawn a subprocess into a target userns, we have to use CGo to fork a sub-process which does the setns(2). To be honest, this is a little dodgy in regards to POSIX signal-safety(7) but since we do no allocations and we are executing in the forked context from a Go program (not a C program), it should be okay. The other alternative would be to do an expensive re-exec (a-la nsexec which would make several other bits of runc more complicated), or to use nsenter(1) which might not exist on the system and is less than ideal. Because we need to logically remap users quite a few times in runc (including in "runc init", where joining the namespace is not feasable), we cache the mapping inside the libcontainer config struct. A future patch will make sure that we stop allow invalid user configurations where a mapping is specified as well as a userns path to join. Finally, add an integration test to make sure we don't regress this again. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

While we do cache the mappings when using userns paths, there's no need to do this in this particular case, since we are in the namespace and set[ug]id() give unambiguous EINVAL error codes if the id is unmapped. This appears to also be the only code which does Host[UG]ID calculations from inside "runc init". Ref: 1a5fdc1 ("init: support setting -u with rootless containers") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

If a user has misconfigured their userns mappings, they need to know which id specifically is not mapped. There's no need to be vague. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

For userns and timens, the mappings (and offsets, respectively) cannot be changed after the namespace is first configured. Thus, configuring a container with a namespace path to join means that you cannot also provide configuration for said namespace. Previously we would silently ignore the configuration (and just join the provided path), but we really should be returning an error (especially when you consider that the configuration userns mappings are used quite a bit in runc with the assumption that they are the correct mapping for the userns -- but in this case they are not). In the case of userns, the mappings are also required if you _do not_ specify a path, while in the case of the time namespace you can have a container with a timens but no mappings specified. It should be noted that the case checking that the user has not specified a userns path and a userns mapping needs to be handled in specconv (as opposed to the configuration validator) because with this patchset we now cache the mappings of path-based userns configurations and thus the validator can't be sure whether the mapping is a cached mapping or a user-specified one. So we do the validation in specconv, and thus the test for this needs to be an integration test. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

The owner of /proc/self/timens_offsets doesn't change after creating a userns, meaning that we need to request stage-0 to write our timens mappings for us. Before this patch, attempting to use timens with a proper userns resulted in: FATA[0000] nsexec-1[18564]: failed to update /proc/self/timens_offsets: Permission denied FATA[0000] nsexec-0[18562]: failed to sync with stage-1: next state: Success ERRO[0000] runc run failed: unable to start container process: can't get final child's PID from pipe: EOF Fixes: ebc2e7c ("Support time namespace") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Given we've had several bugs in this behaviour that have now been fixed, add an integration test that makes sure that you can start a container that joins all of the namespaces of a second container. The only namespace we do not join is the mount namespace, because joining a namespace that has been pivot_root'd leads to a bunch of errors. In principle, removing everything from config.json that requires a mount _should_ work, but the root.path configuration is mandatory and we cannot just ignore setting up the rootfs in the namespace joining scenario (if the user has configured a different rootfs, we need to use it or error out, and there's no reasonable way of checking if if the rootfs paths are the same that doesn't result in spaghetti logic). Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Previously, all of our userns tests worked around the remapping issue by creating the paths that runc would attempt to create (like /proc). However, this isn't really accurate to how real userns containers are created, so it's much better to actually remap the rootfs. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

kolyshkin · 2023-12-14T23:31:06Z

1.1 backport: #4144

kolyshkin · 2023-12-19T02:17:02Z

@cyphar kinda sad that we've recently removed a bunch of C code, only to add another bunch of C code just because Go can't do setns in the middle of ForkExec.

That reminded me of work that @cpuguy83 is doing; PTAL https://go-review.googlesource.com/c/go/+/476095?usp=dashboard

cyphar · 2023-12-19T04:24:10Z

@kolyshkin I agree, and yeah it would be nice if we could do it with ForkExec. If we had that, we could do something similar to userns.Handles and just read the files from /proc/<pid>/ of the ptrace-stopped process rather than having to spawn two processes to do the reads. I've added some comments -- @cpuguy83 if you like, I can carry that PR for you so you don't need to rework it again.

cyphar added this to the 1.1.11 milestone Nov 23, 2023

cyphar mentioned this pull request Nov 23, 2023

libcontainer: remove all mount logic from nsexec #3985

Merged

cyphar commented Nov 23, 2023

View reviewed changes

libcontainer/configs/config_linux.go Outdated Show resolved Hide resolved

cyphar added the backport/todo/1.1 A PR in main branch which needs to be backported to release-1.1 label Nov 23, 2023

lifubang reviewed Nov 23, 2023

View reviewed changes

libcontainer/configs/config_linux.go Outdated Show resolved Hide resolved

cyphar requested review from lifubang, kolyshkin and AkihiroSuda November 27, 2023 07:22

lifubang approved these changes Nov 30, 2023

View reviewed changes

lifubang added the status/4-merge label Nov 30, 2023

rata reviewed Nov 30, 2023

View reviewed changes

rata reviewed Dec 1, 2023

View reviewed changes

libcontainer/userns/userns_maps.go Outdated Show resolved Hide resolved

rata approved these changes Dec 1, 2023

View reviewed changes