
support user namespace #8170

Open
bergwolf opened this issue Oct 8, 2023 · 17 comments
Labels
feature New functionality · needs-review Needs to be assessed by the team.

Comments


bergwolf commented Oct 8, 2023

Is your feature request related to a problem? Please describe.

Following the Kubernetes user namespace story and KEP, containerd has merged idmapped mount support and the relevant runc bits.

Now it is time for Kata to join the party and integrate with containerd to handle idmapped mounts and user namespaces, so that rootless containers can be truly rootless even inside the guest.

bergwolf added the feature and needs-review labels on Oct 8, 2023

yawqi commented Oct 9, 2023

Hi, I would like to take this one if that's OK, thanks a lot!


yawqi commented Oct 17, 2023

Currently, virtiofs in the kernel does not appear to support idmapped mounts, so inside the guest VM it seems impossible to use them without modifying virtiofs (correct me if I am wrong).

// No FS_ALLOW_IDMAP flag set
static struct file_system_type virtio_fs_type = {
	.owner		= THIS_MODULE,
	.name		= "virtiofs",
	.init_fs_context = virtio_fs_init_fs_context,
	.kill_sb	= virtio_kill_sb,
};

// So this check won't pass
static int can_idmap_mount(const struct mount_kattr *kattr, struct mount *mnt)
{
	...
	/* The underlying filesystem doesn't support idmapped mounts yet. */
	if (!(m->mnt_sb->s_type->fs_flags & FS_ALLOW_IDMAP))
		return -EINVAL;
	...
}

So, is it necessary to add idmapped mount support to virtiofs first? @bergwolf
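For illustration, the declaration side of such support would look roughly like the following. This is a hypothetical sketch, not merged code: setting FS_ALLOW_IDMAP only makes the can_idmap_mount() check pass, and the FUSE/virtiofs permission and attribute paths would also have to translate ids through the mount's idmapping for it to be correct.

```c
/* Hypothetical sketch: declaring idmapped-mount support for virtiofs.
 * The flag alone is not sufficient; the filesystem's permission and
 * setattr/getattr paths must also honor the mount's idmapping. */
static struct file_system_type virtio_fs_type = {
	.owner		= THIS_MODULE,
	.name		= "virtiofs",
	.init_fs_context = virtio_fs_init_fs_context,
	.kill_sb	= virtio_kill_sb,
	.fs_flags	= FS_ALLOW_IDMAP,	/* new */
};
```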


yawqi commented Oct 30, 2023

@bergwolf, currently I have made some changes so that sudo ctr run --runtime RUNTIME-RS --uidmap 0:3000:1000 --gidmap 0:3000:1000 --mount type=bind,src=...,dst=/v1,options=rbind --rm -t docker.io/library/busybox:latest sh is able to start a container.

The rest of the work is figuring out how to use the new Linux mount API and mount_setattr to mount the rootfs and volumes properly.

This is what it looks like inside the container:
[Screenshot 2023-10-30 23:10:42]

This is inside the guest VM:
[Screenshot 2023-10-30 23:11:19]
The process with PID 113, running as user 3000, is the container running inside the guest.


rata commented Oct 30, 2023

Hi!

Let me know if you have any questions regarding userns support. I'm the author of the k8s KEP and worked on the implementation in k8s, containerd, runc, some bits of CRI-O, crun, and the Linux kernel. I'll be happy to help with any questions about adopting this in Kata! :)

There are two avenues I see to explore for Kata support, mirroring the advantages that userns brings to containers based on Linux namespaces alone:

  1. Isolating Kata VM processes from each other. In Linux-ns based container runtimes this is the main advantage of userns: with userns, every pod runs as a different user on the host. In Kata, the users in the container are already decoupled from the users on the host, so to isolate containers from each other we would need to start the Kata VMs as different users on the host. I don't know which host user the Kata VM runs as — do all VM pods run as the same user?
  2. Making it harder to compromise the Kata VM. This is where using userns inside the container can help. For this we will need idmap support in the filesystem used for the container's rootfs and for all the volumes it mounts.


yawqi commented Oct 31, 2023

Hi @rata, thanks a lot! It's going to be a great help. I am currently working on the second avenue you mentioned.

As for the first one, I think the Kata VM processes are all started by the root user for now, but I am not completely sure — correct me if I am wrong. :)

For the second avenue, we do need idmap support in virtiofs, which Kata relies on heavily for rootfs and volumes.


yawqi commented Oct 31, 2023

Hi @rata, sorry to bother you again. I was wondering whether ctr run --mount supports uidmap and gidmap options — it seems it doesn't yet. If not, do you think it is worth adding such options to ctr run --mount? If so, I would be glad to do that too! :) Thanks a lot!


rata commented Oct 31, 2023

@yawqi cool, do you plan to work on the kernel side for virtiofs?

I don't know off the top of my head whether ctr supports idmapping for volumes. I'm not sure it would be useful — maybe for debugging, although manually editing the config.json works for that too.

I'd say let's figure out how to support this in Kata first, and we can look at the containerd debugging command on the side :)


yawqi commented Oct 31, 2023

Hi @rata, from my understanding we do need kernel-side support in virtiofs to fully support userns inside the guest VM — what do you think? Are you interested in the virtiofs kernel side? If so, I would be very grateful and willing to help with it however I can; if not, I'm happy to take it on myself, and any help is appreciated :). I just hope I can help and learn something from it. Thanks again!


rata commented Oct 31, 2023

@yawqi I'm not familiar with Kata — if you exec into the container and run df -T, is the filesystem you see for the rootfs virtiofs? In that case we might add support there, yes.

I'm not sure whether we can create an idmapped mount of the filesystem backing the rootfs (assuming the rootfs is on a filesystem that supports it, like ext4) and then create the virtiofs on top of that idmapped mount. Can you try that? Or is it not trivial to do?


yawqi commented Nov 1, 2023

@rata The output of df -T inside the container is as follows; both the rootfs and the v1 volume that I share are of type virtiofs. I also attach the mountinfo from inside the container. I start the container with sudo ctr run --runtime io.containerd.kata.v2 --uidmap 0:3000:1000 --gidmap 0:3000:1000 -t --mount type=bind,src=/home/mpiglet/playground/volumns/v1,dst=/v1,options=rbind --rm docker.io/library/busybox:latest hello-rs-wq-agent-39 sh

/ # df -T
Filesystem           Type       1K-blocks      Used Available Use% Mounted on
kataShared           virtiofs   489634808 160075428 304613872  34% /
tmpfs                tmpfs          65536         0     65536   0% /dev
shm                  tmpfs        1020628         0   1020628   0% /dev/shm
tmpfs                tmpfs        1020628         0   1020628   0% /sys/fs/cgroup
tmpfs                tmpfs          65536         0     65536   0% /run
kataShared           virtiofs   489634808 160075428 304613872  34% /v1
devtmpfs             devtmpfs     1019040         0   1019040   0% /dev/null
devtmpfs             devtmpfs     1019040         0   1019040   0% /dev/zero
devtmpfs             devtmpfs     1019040         0   1019040   0% /dev/full
devtmpfs             devtmpfs     1019040         0   1019040   0% /dev/tty
devtmpfs             devtmpfs     1019040         0   1019040   0% /dev/urandom
devtmpfs             devtmpfs     1019040         0   1019040   0% /dev/random
devtmpfs             devtmpfs     1019040         0   1019040   0% /proc/timer_list

/ # cat /proc/self/mountinfo
240 206 0:40 /passthrough/hello-rs-wq-agent-39/rootfs / rw,nodev,relatime master:72 - virtiofs kataShared rw
241 240 0:16 / /proc rw,nosuid,nodev,noexec,relatime master:11 - proc proc rw
242 241 0:35 / /proc/sys/fs/binfmt_misc rw,relatime master:24 - autofs systemd-1 rw,fd=28,pgrp=0,timeout=0,minproto=5,maxproto=5,direct
243 240 0:42 / /dev rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,uid=3000,gid=3000
244 243 0:43 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=3005,mode=620,ptmxmode=666
245 243 0:41 / /dev/shm rw,relatime master:74 - tmpfs shm rw
246 243 0:13 / /dev/mqueue rw,nosuid,nodev,noexec,relatime master:26 - mqueue mqueue rw
247 240 0:15 / /sys rw,nosuid,nodev,noexec,relatime master:5 - sysfs sysfs rw
248 247 0:14 / /sys/fs/selinux rw,relatime master:6 - selinuxfs selinuxfs rw
249 247 0:21 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:7 - tmpfs tmpfs ro,mode=755
250 249 0:22 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime master:8 - cgroup2 cgroup2 rw,nsdelegate
251 249 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:9 - cgroup cgroup rw,xattr,name=systemd
252 249 0:25 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:14 - cgroup cgroup rw,cpuset
253 249 0:26 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime master:15 - cgroup cgroup rw,cpu,cpuacct
254 249 0:27 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime master:16 - cgroup cgroup rw,memory
255 249 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime master:17 - cgroup cgroup rw,perf_event
256 249 0:29 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime master:18 - cgroup cgroup rw,net_cls,net_prio
257 249 0:30 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime master:19 - cgroup cgroup rw,pids
258 249 0:31 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:20 - cgroup cgroup rw,devices
259 249 0:32 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:21 - cgroup cgroup rw,freezer
260 249 0:33 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:22 - cgroup cgroup rw,blkio
261 249 0:34 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime master:23 - cgroup cgroup rw,hugetlb
262 247 0:24 / /sys/fs/bpf rw,nosuid,nodev,noexec,relatime master:10 - bpf none rw,mode=700
263 247 0:38 / /sys/fs/fuse/connections rw,nosuid,nodev,noexec,relatime master:28 - fusectl fusectl rw
264 240 0:44 / /run rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755,uid=3000,gid=3000
116 240 0:40 /passthrough/sandbox-eff1007e-v1 /v1 rw,nodev,relatime master:72 - virtiofs kataShared rw
117 243 0:5 /null /dev/null rw,relatime master:2 - devtmpfs devtmpfs rw,size=1019040k,nr_inodes=254760,mode=755
118 243 0:5 /zero /dev/zero rw,relatime master:2 - devtmpfs devtmpfs rw,size=1019040k,nr_inodes=254760,mode=755
119 243 0:5 /full /dev/full rw,relatime master:2 - devtmpfs devtmpfs rw,size=1019040k,nr_inodes=254760,mode=755
120 243 0:5 /tty /dev/tty rw,relatime master:2 - devtmpfs devtmpfs rw,size=1019040k,nr_inodes=254760,mode=755
121 243 0:5 /urandom /dev/urandom rw,relatime master:2 - devtmpfs devtmpfs rw,size=1019040k,nr_inodes=254760,mode=755
122 243 0:5 /random /dev/random rw,relatime master:2 - devtmpfs devtmpfs rw,size=1019040k,nr_inodes=254760,mode=755
123 241 0:5 /null /proc/timer_list rw,relatime master:2 - devtmpfs devtmpfs rw,size=1019040k,nr_inodes=254760,mode=755
124 241 0:16 /bus /proc/bus ro,nosuid,nodev,noexec,relatime master:11 - proc proc rw
125 241 0:16 /fs /proc/fs ro,nosuid,nodev,noexec,relatime master:11 - proc proc rw
126 241 0:16 /irq /proc/irq ro,nosuid,nodev,noexec,relatime master:11 - proc proc rw
127 241 0:16 /sys /proc/sys ro,nosuid,nodev,noexec,relatime master:11 - proc proc rw
128 127 0:35 / /proc/sys/fs/binfmt_misc rw,relatime master:24 - autofs systemd-1 rw,fd=28,pgrp=0,timeout=0,minproto=5,maxproto=5,direct
131 128 0:45 / /proc/sys/fs/binfmt_misc rw,nosuid,nodev,noexec,relatime master:77 - binfmt_misc binfmt_misc rw
130 242 0:45 / /proc/sys/fs/binfmt_misc rw,nosuid,nodev,noexec,relatime master:77 - binfmt_misc binfmt_misc rw
/ #

And this is the df -T and mountinfo inside the guest VM:

root@localhost:/# df -T
Filesystem     Type     1K-blocks   Used Available Use% Mounted on
/dev/root      ext4        118836 107872      4620  96% /
devtmpfs       devtmpfs   1019040      0   1019040   0% /dev
tmpfs          tmpfs      1020628      0   1020628   0% /dev/shm
tmpfs          tmpfs       204128      8    204120   1% /run
tmpfs          tmpfs         5120      0      5120   0% /run/lock
tmpfs          tmpfs      1020628      0   1020628   0% /sys/fs/cgroup
tmpfs          tmpfs      1020628      8   1020620   1% /tmp
shm            tmpfs      1020628      0   1020628   0% /run/kata-containers/sandbox/shm
root@localhost:/# cat /proc/self/mountinfo
16 1 254:1 / / ro,relatime shared:1 - ext4 /dev/root ro,errors=remount-ro,data=ordered
17 16 0:5 / /dev rw,relatime shared:2 - devtmpfs devtmpfs rw,size=1019040k,nr_inodes=254760,mode=755
18 16 0:15 / /sys rw,nosuid,nodev,noexec,relatime shared:5 - sysfs sysfs rw
19 16 0:16 / /proc rw,nosuid,nodev,noexec,relatime shared:11 - proc proc rw
21 18 0:14 / /sys/fs/selinux rw,relatime shared:6 - selinuxfs selinuxfs rw
20 17 0:17 / /dev/shm rw,nosuid,nodev shared:3 - tmpfs tmpfs rw
22 17 0:18 / /dev/pts rw,nosuid,noexec,relatime shared:4 - devpts devpts rw,gid=5,mode=620,ptmxmode=000
23 16 0:19 / /run rw,nosuid,nodev shared:12 - tmpfs tmpfs rw,size=204128k,mode=755
24 23 0:20 / /run/lock rw,nosuid,nodev,noexec,relatime shared:13 - tmpfs tmpfs rw,size=5120k
25 18 0:21 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:7 - tmpfs tmpfs ro,mode=755
26 25 0:22 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime shared:8 - cgroup2 cgroup2 rw,nsdelegate
27 25 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:9 - cgroup cgroup rw,xattr,name=systemd
28 18 0:24 / /sys/fs/bpf rw,nosuid,nodev,noexec,relatime shared:10 - bpf none rw,mode=700
29 25 0:25 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,cpuset
30 25 0:26 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,cpu,cpuacct
31 25 0:27 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,memory
32 25 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,perf_event
33 25 0:29 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,net_cls,net_prio
34 25 0:30 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,pids
35 25 0:31 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:20 - cgroup cgroup rw,devices
36 25 0:32 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
37 25 0:33 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:22 - cgroup cgroup rw,blkio
38 25 0:34 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:23 - cgroup cgroup rw,hugetlb
39 19 0:35 / /proc/sys/fs/binfmt_misc rw,relatime shared:24 - autofs systemd-1 rw,fd=28,pgrp=1,timeout=0,minproto=5,maxproto=5,direct
40 17 0:36 / /dev/hugepages rw,relatime shared:25 - hugetlbfs hugetlbfs rw,pagesize=2M
41 17 0:13 / /dev/mqueue rw,nosuid,nodev,noexec,relatime shared:26 - mqueue mqueue rw
42 16 0:37 / /tmp rw,nosuid,nodev shared:27 - tmpfs tmpfs rw
43 18 0:38 / /sys/fs/fuse/connections rw,nosuid,nodev,noexec,relatime shared:28 - fusectl fusectl rw
113 23 0:4 ipc:[4026532118] /run/sandbox-ns/ipc rw shared:68 - nsfs nsfs rw
186 23 0:4 uts:[4026532120] /run/sandbox-ns/uts rw shared:70 - nsfs nsfs rw
191 23 0:40 / /run/kata-containers/shared/containers rw,nodev,relatime shared:72 - virtiofs kataShared rw
196 23 0:41 / /run/kata-containers/sandbox/shm rw,relatime shared:74 - tmpfs shm rw
201 23 0:40 /passthrough/hello-rs-wq-agent-39/rootfs /run/kata-containers/hello-rs-wq-agent-39/rootfs rw,nodev,relatime shared:72 - virtiofs kataShared rw
129 39 0:45 / /proc/sys/fs/binfmt_misc rw,nosuid,nodev,noexec,relatime shared:77 - binfmt_misc binfmt_misc rw

And this is the mountinfo of kata runtime on the host:

❯ cat /proc/1613921/mountinfo
664 663 259:2 / / rw,relatime shared:401 master:1 - ext4 /dev/nvme0n1p2 rw,errors=remount-ro
665 664 0:5 / /dev rw,nosuid,relatime shared:402 master:2 - devtmpfs udev rw,size=8109972k,nr_inodes=2027493,mode=755,inode64
666 665 0:22 / /dev/pts rw,nosuid,noexec,relatime shared:403 master:3 - devpts devpts rw,gid=5,mode=620,ptmxmode=000
667 665 0:26 / /dev/shm rw,nosuid,nodev shared:404 master:4 - tmpfs tmpfs rw,inode64
668 665 0:33 / /dev/hugepages rw,relatime shared:405 master:15 - hugetlbfs hugetlbfs rw,pagesize=2M
669 665 0:19 / /dev/mqueue rw,nosuid,nodev,noexec,relatime shared:406 master:16 - mqueue mqueue rw
670 664 0:23 / /run rw,nosuid,nodev,noexec,relatime shared:407 master:5 - tmpfs tmpfs rw,size=1629708k,mode=755,inode64
671 670 0:27 / /run/lock rw,nosuid,nodev,noexec,relatime shared:408 master:6 - tmpfs tmpfs rw,size=5120k,inode64
672 670 0:42 / /run/user/1000 rw,nosuid,nodev,relatime shared:409 master:279 - tmpfs tmpfs rw,size=1629704k,nr_inodes=407426,mode=700,uid=1000,gid=1000,inode64
673 664 0:20 / /sys rw,nosuid,nodev,noexec,relatime shared:410 master:7 - sysfs sysfs rw
674 673 0:6 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:411 master:8 - securityfs securityfs rw
675 673 0:28 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime shared:412 master:9 - cgroup2 cgroup2 rw,nsdelegate,memory_recursiveprot
676 675 0:45 / /sys/fs/cgroup/net_cls rw,relatime shared:413 master:329 - cgroup net_cls rw,net_cls
677 673 0:29 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:414 master:10 - pstore pstore rw
678 673 0:30 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime shared:415 master:11 - efivarfs efivarfs rw
679 673 0:31 / /sys/fs/bpf rw,nosuid,nodev,noexec,relatime shared:416 master:12 - bpf bpf rw,mode=700
680 673 0:7 / /sys/kernel/debug rw,nosuid,nodev,noexec,relatime shared:417 master:17 - debugfs debugfs rw
681 673 0:12 / /sys/kernel/tracing rw,nosuid,nodev,noexec,relatime shared:418 master:18 - tracefs tracefs rw
682 673 0:34 / /sys/kernel/config rw,nosuid,nodev,noexec,relatime shared:419 master:19 - configfs configfs rw
683 673 0:35 / /sys/fs/fuse/connections rw,nosuid,nodev,noexec,relatime shared:420 master:20 - fusectl fusectl rw
684 664 0:21 / /proc rw,nosuid,nodev,noexec,relatime shared:421 master:13 - proc proc rw
685 684 0:32 / /proc/sys/fs/binfmt_misc rw,relatime shared:422 master:14 - autofs systemd-1 rw,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=14146
686 685 0:36 / /proc/sys/fs/binfmt_misc rw,nosuid,nodev,noexec,relatime shared:423 master:46 - binfmt_misc binfmt_misc rw
687 664 259:1 / /boot/efi rw,relatime shared:424 master:44 - vfat /dev/nvme0n1p1 rw,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro
688 670 0:23 /netns /run/netns rw,nosuid,nodev,noexec,relatime shared:407 master:5 - tmpfs tmpfs rw,size=1629708k,mode=755,inode64
689 688 0:4 net:[4026532621] /run/netns/cnitest-91ad023b-88da-1b7e-0559-fe757c9c0bea rw shared:425 - nsfs nsfs rw
690 670 0:4 net:[4026532621] /run/netns/cnitest-91ad023b-88da-1b7e-0559-fe757c9c0bea rw shared:425 - nsfs nsfs rw
691 670 0:23 /kata-containers/shared/sandboxes/hello-rs-wq-agent-39/rw /run/kata-containers/shared/sandboxes/hello-rs-wq-agent-39/ro ro,relatime master:407 - tmpfs tmpfs rw,size=1629708k,mode=755,inode64
695 670 0:47 / /run/containerd/io.containerd.runtime.v2.task/default/hello-rs-wq-agent-39/rootfs rw,relatime shared:426 - overlay overlay rw,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/140/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/1/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/150/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/150/work
771 670 0:47 / /run/kata-containers/shared/sandboxes/hello-rs-wq-agent-39/rw/passthrough/hello-rs-wq-agent-39/rootfs rw,relatime master:426 - overlay overlay rw,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/140/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/1/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/150/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/150/work
772 691 0:47 / /run/kata-containers/shared/sandboxes/hello-rs-wq-agent-39/ro/passthrough/hello-rs-wq-agent-39/rootfs rw,relatime master:426 - overlay overlay rw,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/140/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/1/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/150/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/150/work
773 670 259:2 /home/mpiglet/playground/volumns/v1 /run/kata-containers/shared/sandboxes/hello-rs-wq-agent-39/rw/passthrough/sandbox-eff1007e-v1 rw,relatime master:401 - ext4 /dev/nvme0n1p2 rw,errors=remount-ro
774 691 259:2 /home/mpiglet/playground/volumns/v1 /run/kata-containers/shared/sandboxes/hello-rs-wq-agent-39/ro/passthrough/sandbox-eff1007e-v1 rw,relatime master:401 - ext4 /dev/nvme0n1p2 rw,errors=remount-ro

Sorry for attaching so much output — I wanted to provide detailed information. :)


yawqi commented Nov 1, 2023

I'm not sure whether we can create an idmapped mount of the filesystem backing the rootfs (assuming the rootfs is on a filesystem that supports it, like ext4) and then create the virtiofs on top of that idmapped mount. Can you try that? Or is it not trivial to do?

Actually, in this example the rootfs is overlayfs on the host and the v1 volume is ext4; both support idmapped mounts. Did you mean that if the underlying filesystem supports idmap, virtiofs may be able to take advantage of it? Is my understanding correct?

Thanks a lot, I will look into it. If I make any progress I will let you know. :) And if you have any suggestions, please tell me!


rata commented Nov 1, 2023

@yawqi sure, let me know if that works when you tried it. Thanks! :)

If the filesystem of the rootfs doesn't support idmapped mounts, containerd does a recursive chown (very expensive, but it works). But that is for regular OCI runtimes — I don't know whether that happens when Kata is used, or whether Kata can benefit from it.

Let me know when you have tried it. Are you a maintainer of Kata?


yawqi commented Nov 1, 2023

@rata I am not a maintainer of Kata, just trying to help and learn something. :)

Did you mean the chown is performed by containerd when using runc? I thought it was the runtime's job — good to know.


yawqi commented Nov 1, 2023

Sorry to bother you again — if it's convenient, could you share how you test while developing? Could you share the config.json you use for your idmapping development tests? It would be a great help! :)


yawqi commented Nov 7, 2023

Currently, to support userns inside the guest VM for each container separately, I think there are two main problems to solve.

Firstly, setting up the userns for a container; this part can follow the runc implementation.

Secondly, idmap-mounting the volume mounts. For the rootfs, idmapped mounts are handled by containerd's overlayfs snapshotter (either via an idmapped mount or a chown, depending on the kernel version). So I believe we should first support idmapped mounts for volume mounts; the rootfs part should be taken care of by the snapshotter, I guess?

As for idmapped mount support for volume mounts: containerd/containerd#7063 proposed 3 methods to handle the ownership of the rootfs, which can also provide some insight for us:

(a) rely on idmap mounts (containerd/containerd#5890), although for overlayfs we need a 5.19+ kernel
(b) chown the image and add support for metacopy overlayfs param (containerd/containerd#6310)
(c) fuse-overlayfs with its own usermode idmap.

Ideally, (a) would be the best option, but it has kernel version requirements and, more importantly, the underlying filesystem needs to support idmapped mounts — in our case, virtiofs. @rata has mentioned that we should look into whether virtiofs can build on the underlying filesystem's idmap support; I will look into it, but I think it will take some time.

For method (c), in our case I believe https://github.com/cloud-hypervisor/fuse-backend-rs/pull/159 provides what we need. So maybe we can start there for volume idmapped-mount support, and then try to add support via method (a).

How do you all feel about this? Please feel free to leave any comments and suggestions, thanks a lot! :) @bergwolf @rata


rata commented Nov 15, 2023

@yawqi Sorry for being silent. I'll be away until the end of the month, sorry :(

Some notes:

As for idmapped mount support for volume mounts: containerd/containerd#7063 proposed 3 methods to handle the ownership of the rootfs, which can also provide some insight for us:

No, that is for the rootfs. For volumes we use idmapped mounts, but the OCI runtime creates them (runc or crun, for example).

If you are not a maintainer here, I wonder what the maintainers think about how the userns integration should look. Their input is probably valuable before doing more work.

Sorry to bother you again — if it's convenient, could you share how you test while developing? Could you share the config.json you use for your idmapping development tests? It would be a great help! :)

I don't have one handy right now, but you can look at the runc tests for this (https://github.com/opencontainers/runc/blob/a2ba98557d25996532fae62fe560cdb7688a74a2/tests/integration/idmap.bats). They basically run "runc spec" to get a default config.json and then call the "update_config" function, which just uses jq to modify the JSON. You should be able to infer the config from that, or even run the tests with an extra step that copies the config.json somewhere for you to use.
