Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Id mapped mounts for shared volumes #3429

Conversation

AlexeyPerevalov
Copy link

This pull request adds support for id map mount feature for shared volumes.
For rootfs this is already implemented in containerd/containerd#5890.

Both commit has a code in common, which should be placed on the same place in the future.

There are two possibilities to use mount_setattr syscall with fd returned by open_tree or with fd returned by fsmount syscall. This PR uses open_tree, since this approach was used in examples, as well as in crun.

As a fallback scenario, mount syscall still used for bind mount. Open_tree syscall as well as mount_setattr are quite new syscall, to use it latest kernel version is required. In this patch set, kernel version is checked, but as an alternative approach we could just check syscall existence, any suggestions are welcome.
Also this PR includes specs-go which could be moved to separate PR.

@AlexeyPerevalov AlexeyPerevalov changed the title Id map mounts move to open tree Support Id mapped mounts for shared volumes Mar 23, 2022
@AkihiroSuda AkihiroSuda added this to the 1.2.0 milestone Mar 23, 2022
go.mod Outdated
@@ -24,3 +25,5 @@ require (
golang.org/x/sys v0.0.0-20211116061358-0a5406a5449c
google.golang.org/protobuf v1.27.1
)

replace github.com/opencontainers/runtime-spec/specs-go => ./staging/src/github.com/opencontainers/runtime-spec/specs-go
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please open PR for runtime-spec

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -97,6 +97,7 @@ func newContainerInit(t initType, pipe *os.File, consoleSocket *os.File, fifoFd,
logFd: logFd,
}, nil
case initStandard:
logrus.Infof("mountFds: %v", mountFds)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logrus.Infof("mountFds: %v", mountFds)
logrus.Debugf("mountFds: %v", mountFds)

go.mod Outdated
github.com/opencontainers/selinux v1.10.0
github.com/pkg/errors v0.9.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deprecated

Comment on lines 13 to 16
OPEN_TREE_CLONE = 1
OPEN_TREE_CLOEXEC = 02000000
MOVE_MOUNT_F_EMPTY_PATH = 0x00000004
AT_FDCWD = -100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be defined in x/sys/unix? Please open a CL against it -- they are usually more than happy to accept such stuff.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I wrote about it, in cover message, but definitely x/sys/unix is better candidate.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kolyshkin I've seen your https://go-review.googlesource.com/c/sys/+/363444 I assume this change was made w/o code generation e.g. by sys/unix/mkall.sh ?


//move_mount(int from_dfd, const char *from_pathname, int to_dfd,
// const char *to_pathname, unsigned int flags)
func moveMount(from_dfd int, fromPathName string, to_dfd int, toPathName string, flags uint) (err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here -- seems that this belongs to x/sys/unix.

}

// int dfd, const char *filename, unsigned int flags)
func openTree(dfd int, fileName string, flags uint) (r int, err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here -- seems that this belongs to x/sys/unix.

@@ -68,6 +74,7 @@ func prepareRootfs(pipe io.ReadWriter, iConfig *initConfig, mountFds []int) (err
cgroupns: config.Namespaces.Contains(configs.NEWCGROUP),
}
setupDev := needsSetupDev(config)
//time.Sleep(30 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug session leftover?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, initial version, after moving helpers to x/sys/unix and runtime-spec, it would be clear.

@@ -380,6 +387,9 @@ func mountToRootfs(m *configs.Mount, c *mountConfig) error {
return err
}

if mountFd != nil {
logrus.Infof("mountFd: %d", *mountFd)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either logrus.Debugf or remove it.

Comment on lines 1066 to 1067
logrus.Infof("m.Data: %v\n", m.Data)
logrus.Infof("flags: %v\n", flags)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for \n here. Also seems like a debug leftover.

@@ -1055,7 +1079,9 @@ func remount(m *configs.Mount, rootfs string, mountFd *int) error {
}
// ... and retry the mount with ro flag set.
flags |= unix.MS_RDONLY
return mount(source, m.Destination, procfd, m.Device, flags, "")

logrus.Infof("m.Data: %v\n", m.Data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for \n here. Also seems like a debug leftover.

@kolyshkin kolyshkin marked this pull request as draft March 23, 2022 17:33
@kolyshkin
Copy link
Contributor

OK, this should be a draft / DNM until changes to runtime-spec and x/sys/unix are landed.

@kolyshkin
Copy link
Contributor

In this patch set, kernel version is checked

This obviously won't work because of e.g. RHEL kernels that backport lots of stuff.

but as an alternative approach we could just check syscall existence, any suggestions are welcome.

Right, a simple test checking if a syscall is available (perhaps calling it with wrong arguments and expecting EINVAL in return is sufficient), when wrap this into once.Do and use it freely.

As an example, see how we check if openat2(2) is available here:

cgroupFd int = -1
prepOnce sync.Once
prepErr error
resolveFlags uint64
)
func prepareOpenat2() error {
prepOnce.Do(func() {
fd, err := unix.Openat2(-1, cgroupfsDir, &unix.OpenHow{
Flags: unix.O_DIRECTORY | unix.O_PATH,
})
if err != nil {
prepErr = &os.PathError{Op: "openat2", Path: cgroupfsDir, Err: err}
if err != unix.ENOSYS { //nolint:errorlint // unix errors are bare
logrus.Warnf("falling back to securejoin: %s", prepErr)
} else {
logrus.Debug("openat2 not available, falling back to securejoin")
}
return
}

@AlexeyPerevalov
Copy link
Author

This PR depends on golang's x/sys library modifications:
https://go-review.googlesource.com/c/sys/+/397095
https://go-review.googlesource.com/c/sys/+/397094
and depends on opencontainers/runtime-spec#1143.
Patches for x/sys are merged, just need to wait for appropriate version release of lib.
runtime-spec change is stuck on review, how to initiate it?


func mountByMove(m *configs.Mount, dest string) error {
// Check once for all mount point that syscall exists
syscallOnce.Do(func() {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be exported as an interface so other parts may query the syscall availability status? We face the same issue too and seems it could be reused.

@rata
Copy link
Contributor

rata commented Jul 12, 2022

@AlexeyPerevalov the runtime spec PR was already merged (opencontainers/runtime-spec#1143). It would be great if you can finish this PR now :)

@AlexeyPerevalov AlexeyPerevalov marked this pull request as ready for review July 14, 2022 10:25
@AlexeyPerevalov
Copy link
Author

runtime-spec changes are not yet tagged, latest tag is v1.0.2
the same with x/sys golang lib, not yet published

@rata
Copy link
Contributor

rata commented Jul 14, 2022

@AlexeyPerevalov No need to wait for a runtime-spec tag, IMHO. We are not using a tag already:

runc/go.mod

Line 15 in a776ec9

github.com/opencontainers/runtime-spec v1.0.3-0.20210326190908-1c3f411f0417

For the x/sys pkg, I think it makes sense to define ourselves the constants if that is not going to be released soon. Do you happen to know when they plan to release that with the constants we need? If there is no visibility into that, I don't know what others think, but I would guess it seems reasonable to use some tmp file with the constants defined here. We can switch whenever that becomes available.

@kolyshkin
Copy link
Contributor

runtime-spec changes are not yet tagged, latest tag is v1.0.2

We are already using a sha for it (due to the lack of runtime-spec maintainers I guess).

the same with x/sys golang lib, not yet published

This one is never tagged, i.e. all users are just pin it to a particular sha.

@AlexeyPerevalov AlexeyPerevalov force-pushed the IdMapMountsMoveToOpenTree branch 3 times, most recently from 1ba3c92 to 28c40da Compare July 25, 2022 13:49
@AlexeyPerevalov
Copy link
Author

in centos-stream-9 two tests are failed:
TestUsernsCheckpoint
TestCheckpoint

Tests related to checkpoint restore failed due to segfault in criu, I reproduced it on Centos Stream 8 (I don't have Stream 9), but It is not reproducible on Ubuntu 20.04, with working criu.

Comment on lines +492 to +494
if mountByMove(m, dest) != nil {
if err := mountPropagate(m, rootfs, mountLabel, mountFd); err != nil {
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means, if we fail to mount with idmap mounts, it just does a bind-mount?

I'm not sure what is the right behavior, because just mounting and ignoring the mapping for that mount will probably create issues to access the files (if the container running inside a userns).

I'm not sure what the right behavior should be. What do you think? And other maintainers?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaning towards in case of errors just failing the mount, rather than doing a bind-mount and ignoring the mapping.

It will definitely cause access issues, or even worse: files might get written with completely different UIDs/GIDs and it will be painful to fix once that happens.

@AlexeyPerevalov What do you think?

@AkihiroSuda @kolyshkin What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking a little bit how higher layers will use this, yes, I think we need to error out. At the k8s layer we want to start the container with an id mapped mount or not start it at all. If it is started without the id mapped mounts, the files created will have just random garbage as UID/GID, won't be able to read files... it will be a complete mess.

I think if we just return an error if an id mapped mount was requested and we failed to create it is the right path here.

Also, even if in the future we want to open this door (doubt about it), we can always open a door. But we won't be able to close it if we open it now.

@AlexeyPerevalov Can we make this error out if the id map mount fails?

@rata
Copy link
Contributor

rata commented Jul 29, 2022

@AlexeyPerevalov but it seems all test have passed... Was that maybe before the latest fix merged yesterday to fix the CI? (although IIRC it was only about one of those test, not both)

@rata
Copy link
Contributor

rata commented Aug 9, 2022

@AlexeyPerevalov friendly ping? It would be great to have this in for runc 1.2.

(otherwise, we will have to wait for another runc release, that takes a while, some more time for stable versions of high level container runtimes to include it, etc. :))

Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
to configs.Mount from spec

This patch is using mount_setattr for idmapped mounts

mount_setattr syscall requires anonymous mount point,
IOW mount point should not be published, but it published by
mount syscall. So either fsmount or open_tree should be used
here.

So open_tree syscall is used instead of mount, for bind mount.

Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
This is not based on kernel version, just to check syscall existence.

Signed-off-by: Alexey Perevalov <alexey.perevalov@huawei.com>
Copy link
Contributor

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlexeyPerevalov Hey, I was testing this and can't make it work.

It seems the code to create an idmap mount is fine (except probably the move mount should be last), but as this code is run already inside the userns, we always get EPERM for the mountSetAttr() syscall, because that can only be run from the initial userns to create a IDMAP mount.

Have you tested this with runc and it worked for you? It would be great if I'm missing something and there is a simple way to make this work :)

if err := mountPropagate(m, rootfs, mountLabel, mountFd); err != nil {
return err

if mountByMove(m, dest) != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm testing this, and it seems this can't work. At this point we are inside the userns (I've added some logrus.Debugf to print the content of /proc/self/uid_map to verify this) and therefore we create an idmap mount. They only work from the initial userns.

Did you made this runc code work for you at some point? Or was this a c&p from working code in a standalone binary (that you tried from the host, so the initial userns) that never was tested in the runc binary (that IIUC this part runs inside the userns).

Am I missing something? Or do we need a big rework to make this work?

@rata
Copy link
Contributor

rata commented Jan 27, 2023

I really wish to include this functionality in the next runc release. This is required for our userns work in Kubernetes. I'll try to make this work.

If anyone has any thoughts on the approach to take, please let me know :)

@AkihiroSuda
Copy link
Member

I really wish to include this functionality in the next runc release. This is required for our userns work in Kubernetes. I'll try to make this work.

If anyone has any thoughts on the approach to take, please let me know :)

Thanks @rata , could you open a new PR to carry this PR?

@rata
Copy link
Contributor

rata commented Jan 31, 2023

@AkihiroSuda Thanks! Yes, I'm working on it and I'll open a new PR in the next days.

I need to start from scratch and do most things in nsenter/nsexec, so nothing really to carry from here. And quite more involved, that is why I haven't opened it yet :)

@rata
Copy link
Contributor

rata commented Feb 1, 2023

Opened this draft PR with an approach that works: #3717

@AkihiroSuda AkihiroSuda removed this from the 1.2.0 milestone Feb 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants