seccomp: add support for "clone3" syscall in default policy #42681

Merged
merged 1 commit into from
Jul 30, 2021

Conversation

berrange
Contributor

@berrange berrange commented Jul 27, 2021

- What I did
Modified the default seccomp profile so that clone3 explicitly returns ENOSYS instead of the default EPERM when CAP_SYS_ADMIN is unset.

If CAP_SYS_ADMIN is set, then clone3 is simply allowed unconditionally.
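
As a rough illustration of the rule described above (a sketch, not the actual profile): the dictionary keys below only loosely imitate the profile's layout, and both the keys and the helper name `clone3_action` are assumptions made for this example.

```python
import errno

# Hypothetical, simplified stand-in for the two clone3 entries.
CLONE3_RULES = [
    # With CAP_SYS_ADMIN the syscall is allowed unconditionally.
    {"names": ["clone3"], "action": "allow", "requires_cap": "CAP_SYS_ADMIN"},
    # Without it, fail with ENOSYS rather than the default EPERM.
    {"names": ["clone3"], "action": "errno", "errno": errno.ENOSYS,
     "excludes_cap": "CAP_SYS_ADMIN"},
]


def clone3_action(container_caps):
    """Return ("allow", None) or ("errno", code) for a clone3 call."""
    for rule in CLONE3_RULES:
        required = rule.get("requires_cap")
        excluded = rule.get("excludes_cap")
        if required is not None and required not in container_caps:
            continue  # rule only applies when the capability is present
        if excluded is not None and excluded in container_caps:
            continue  # rule only applies when the capability is absent
        return rule["action"], rule.get("errno")
    # No rule matched: fall through to the historical default errno.
    return "errno", errno.EPERM
```

With these assumed rules, `clone3_action(set())` yields the ENOSYS result and `clone3_action({"CAP_SYS_ADMIN"})` yields the allow result.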

- How to verify it
Test by using

$ docker run registry.fedoraproject.org/fedora:rawhide curl google.com

It should dump the HTML if seccomp is correctly triggering fallback from clone3 to clone.
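
If curl is not convenient, a direct check of thread creation may also work. This is only a sketch and not part of the PR: it assumes the image ships a `python3` interpreter linked against the affected glibc, and the function name `can_spawn_thread` is made up for the example.

```python
import threading


def can_spawn_thread():
    """Return True if a new thread can be started and joined."""
    ran = []
    worker = threading.Thread(target=lambda: ran.append(True))
    try:
        worker.start()
    except RuntimeError:
        # CPython raises RuntimeError ("can't start new thread") when
        # thread creation fails at the OS level.
        return False
    worker.join()
    return bool(ran)


if __name__ == "__main__":
    if can_spawn_thread():
        print("thread creation works")
    else:
        print("thread creation FAILED")
```

Run inside the container, a failure here would point at the same problem that makes curl's resolver thread fail to start.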

- Description for the changelog

Explicitly set the clone3 syscall to return ENOSYS so that glibc correctly falls back to using clone. This fixes the ability to spawn threads in Fedora 35 rawhide container images, which now default to clone3. The default errno of EPERM results in a fatal error, making the images unusable when seccomp is enabled.

Fixes #42680
fixes #42963
fixes #42876

If no seccomp policy is requested, then the built-in default policy in
dockerd applies. This has no rule for "clone3" defined, nor any default
errno defined. So when runc receives the config it attempts to determine
a default errno, using logic defined in its commit:

  opencontainers/runc@7a8d716

As explained in the above commit message, runc uses a heuristic to
decide which errno to return by default:

[quote]
  The solution applied here is to prepend a "stub" filter which returns
  -ENOSYS if the requested syscall has a larger syscall number than any
  syscall mentioned in the filter. The reason for this specific rule is
  that syscall numbers are (roughly) allocated sequentially and thus newer
  syscalls will (usually) have a larger syscall number -- thus causing our
  filters to produce -ENOSYS if the filter was written before the syscall
  existed.
[/quote]

Unfortunately clone3 appears to be one of the edge cases that does not
result in use of ENOSYS, instead ending up with the historical EPERM
errno.
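
A minimal sketch of the quoted heuristic may help show how such an edge case can arise. This is not runc's code: `default_errno` and both example profiles are invented for illustration, and whether this is the exact reason clone3 ends up with EPERM here is an assumption. The syscall numbers are the x86_64 ones.

```python
import errno


def default_errno(syscall_nr, profile_syscall_nrs):
    """Errno for a syscall that no rule in the profile mentions."""
    if syscall_nr > max(profile_syscall_nrs):
        # Larger than anything the profile knows about: presumably newer
        # than the profile, so pretend the syscall does not exist.
        return errno.ENOSYS
    # Otherwise keep the historical default.
    return errno.EPERM


# x86_64 syscall numbers.
READ, WRITE, CLONE = 0, 1, 56
CLONE3, OPENAT2, FACCESSAT2 = 435, 437, 439

# Made-up profiles: one written before clone3 existed, and one that
# already mentions syscalls numbered above clone3.
older_profile = {READ, WRITE, CLONE}
newer_profile = older_profile | {OPENAT2, FACCESSAT2}
```

Under these assumptions `default_errno(CLONE3, older_profile)` is ENOSYS, while `default_errno(CLONE3, newer_profile)` is EPERM, because the profile already lists a higher-numbered syscall.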

Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use
clone3 by default. If it sees ENOSYS then it will automatically
fall back to using clone. Any other errno is treated as a fatal
error. Thus when docker seccomp policy triggers EPERM from clone3,
no fallback occurs and programs are thus unable to spawn threads.
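
The fallback behaviour described above can be modelled roughly as follows. This is a simulation, not glibc's implementation: the two syscalls are passed in as plain callables returning 0 on success or a negative errno, and `spawn_thread` and `ThreadSpawnError` are names invented for the sketch.

```python
import errno


class ThreadSpawnError(Exception):
    """Raised when neither syscall could create the thread."""


def spawn_thread(clone3, clone):
    """Try clone3 first; fall back to clone only on ENOSYS."""
    used = "clone3"
    rc = clone3()
    if rc == -errno.ENOSYS:
        # "No such syscall" is the one error treated as a cue to fall back.
        used = "clone"
        rc = clone()
    if rc < 0:
        # Any other error, EPERM included, is fatal.
        raise ThreadSpawnError(f"{used} failed with {errno.errorcode[-rc]}")
    return used
```

In this model `spawn_thread(lambda: -errno.ENOSYS, lambda: 0)` succeeds via clone, whereas `spawn_thread(lambda: -errno.EPERM, lambda: 0)` raises without ever trying clone.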

The clone3 syscall is much more complicated than clone, most notably its
flags are not exposed as a direct argument any more. Instead they are
hidden inside a struct. This means that seccomp filters are unable to
apply policy based on values seen in flags. Thus we can't directly
replicate the current "clone" filtering for "clone3". We can at least
ensure "clone3" returns ENOSYS errno, to trigger fallback to "clone"
at which point we can filter on flags.
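
To make the limitation concrete, here is a small model of what a filter gets to look at. It is a sketch under assumptions: `SyscallView` and `visible_clone_flags` are hypothetical names, the syscall numbers and argument order are those of x86_64, and the pointer value in the usage note is arbitrary.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

SYS_CLONE = 56    # x86_64
SYS_CLONE3 = 435  # x86_64


@dataclass(frozen=True)
class SyscallView:
    """What a seccomp filter sees: a syscall number and six raw arguments."""
    nr: int
    args: Tuple[int, int, int, int, int, int]


def visible_clone_flags(call: SyscallView) -> Optional[int]:
    """Return the clone flags if the filter can read them, else None."""
    if call.nr == SYS_CLONE:
        # On x86_64 the flags are passed by value as the first argument.
        return call.args[0]
    if call.nr == SYS_CLONE3:
        # args[0] is only the address of the struct and args[1] its size;
        # the filter cannot follow the pointer to read the flags field.
        return None
    raise ValueError("not a clone-family syscall")
```

For example, `visible_clone_flags(SyscallView(SYS_CLONE3, (0x7FFC0000, 88, 0, 0, 0, 0)))` returns `None`, whatever flags the struct at that address contains.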

Fixes: moby#42680
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Member

@thaJeztah thaJeztah left a comment

LGTM, thanks!

@thaJeztah thaJeztah added this to the 21.xx milestone Jul 30, 2021
@justincormack justincormack merged commit f07e53e into moby:master Jul 30, 2021
@thaJeztah
Member

@AkihiroSuda I see you marked this for backporting, but doing so would also require #42005 to be included to add support for ErrNoRet. Perhaps we should keep this one for the next release (but be sure to do that soon)

@AkihiroSuda
Member

Perhaps we should keep this one for the next release (but be sure to do that soon)

SGTM

mrc0mmand added a commit to mrc0mmand/restraint that referenced this pull request Aug 25, 2021
Current Docker version on Ubuntu 20.04 used by GH Actions suffers from
an incompatibility with newer glibc [0] used by Fedora Rawhide, causing
Rawhide containers in CI to fail with:

```
Errors during downloading metadata for repository 'fedora-cisco-openh264':
  - Curl error (6): Couldn't resolve host name for https://mirrors.fedoraproject.org/metalink?repo=fedora-cisco-openh264-rawhide&arch=x86_64 [getaddrinfo() thread failed to start]
```

glibc 2.34 and later tries to use the clone3 syscall (for
hardware-assisted security hardening on x86_64), and falls back to clone
on ENOSYS. However, with the current seccomp profile Docker returns EPERM
instead, which is considered a "hard" fail.

A fix [1] has been merged in upstream, but until then let's run the CI Docker
containers without any seccomp profiles to allow Rawhide jobs to do their job.
(I tried to disable seccomp only for the Rawhide jobs, but I couldn't procure
any solution which wouldn't make my eyes bleed...)

[0] moby/moby#42680
[1] moby/moby#42681
@gotmax23

This comment has been minimized.

@pascallj

pascallj commented Sep 6, 2021

@gotmax23 It must be a coincidence, but today I have actually been working on a backport of this for Ubuntu 21.04, because Ubuntu 21.10 containers now also require this fix since they have started to use glibc >= 2.34.

I have backported both this fix and #42005 (with some slight modifications for Ubuntu) as mentioned in this PR and it seems to do the trick for me though.

If I am understanding it correctly, Docker 20.10.8 still doesn't contain this fix. Also the milestone of this PR is 21.xx. Personally I think it's important enough to backport it (or release the next version very soon 😉).

@gotmax23

gotmax23 commented Sep 7, 2021

it seems to do the trick for me though.

@pascallj, you are right! I have updated my comment accordingly. Could you please add support for Ubuntu 20.04 to your PPA?

@pascallj

pascallj commented Sep 7, 2021

Could you please add support for Ubuntu 20.04 to your PPA?

Your wish is my command ✌️

@gotmax23

gotmax23 commented Sep 7, 2021

Could you please add support for Ubuntu 20.04 to your PPA?

Your wish is my command ✌️

Thank you, @pascallj! I tested the docker.io package from your PPA on both Ubuntu 20.04 and 21.04; it fixes the problem for regular Docker containers. However, I still faced the same problem with docker buildx, which I installed by copying the pre-built binary to ~/.docker/cli-plugins/docker-buildx. Does Docker Buildx just need to be rebuilt or is this a problem that still needs to be fixed?

@pascallj

pascallj commented Sep 8, 2021

I had no previous experience with buildx or BuildKit, so I'm not quite sure. These are way too complicated for my Docker use cases. I did some testing, but I might be completely wrong.

It seems to depend on which driver your builder instance uses. If your builder instance uses the docker driver, it uses the docker daemon and therefore works fine (with the ppa packages). If your builder instance uses the docker-container driver instead, it loads a BuildKit container and therefore you completely depend on the capabilities of this container.

The issue is fixed as of commit moby/buildkit@8021a3e. By default, buildx uses the latest stable BuildKit image, which is two months old at the moment and therefore does not contain that commit. However, if you create a builder instance with an image tag built after this commit (master, nightly), or build BuildKit yourself and specify that image, it works fine:

docker buildx create --driver-opt image=moby/buildkit:master --use

So if I'm right, the problem is also fixed in BuildKit (which is used by buildx), but is just not released to stable.

AeroStun added a commit to ItJustWorksTM/libSMCE that referenced this pull request Sep 8, 2021
See moby/moby#42681

Signed-off-by: AeroStun <24841307+AeroStun@users.noreply.github.com>
@tianon
Member

tianon commented Sep 9, 2021

I admit I'm not well-versed in the details around this syscall, but we do allow clone (at least to some degree, if I'm reading our policy correctly), so I'm wondering why we're going for explicit ENOSYS instead of just ALLOW? Are there known insecurities around clone3, or is this rather an overabundance of caution given it's so new?

@berrange
Contributor Author

berrange commented Sep 9, 2021

The clone syscall is not permitted unconditionally. The seccomp rules for clone match on the flags parameter bitmask to prevent creation of namespaces. In clone3 there is no flags parameter; instead the syscall is passed a struct, and the flags are now just a field inside this struct. This is a problem because struct contents are not accessible for the purpose of seccomp rule filtering. So if we allowed clone3, it would weaken the seccomp policy compared to the historical state.

Blocking clone3 with ENOSYS forces glibc to fall back to clone, where the existing seccomp filtering works as desired, preserving current seccomp policy semantics. We get away with this for now because usage of clone3 is not critical for commonly used apps, but at some point in the future there might be important features in common use that are only accessible via clone3. Hopefully that's far enough in the future that we don't need to spend time worrying about it now.
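
For readers unfamiliar with matching on a flags bitmask, the check on clone can be pictured like this. It is a sketch, not the profile itself: the flag values are the ones from `<linux/sched.h>`, while `NAMESPACE_MASK` and `clone_allowed` are names made up for the example.

```python
# Namespace-creating clone flags, values from <linux/sched.h>.
CLONE_NEWNS = 0x00020000
CLONE_NEWCGROUP = 0x02000000
CLONE_NEWUTS = 0x04000000
CLONE_NEWIPC = 0x08000000
CLONE_NEWUSER = 0x10000000
CLONE_NEWPID = 0x20000000
CLONE_NEWNET = 0x40000000

# Ordinary flags that thread creation uses.
CLONE_VM = 0x00000100
CLONE_THREAD = 0x00010000

NAMESPACE_MASK = (
    CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWUTS | CLONE_NEWIPC
    | CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET
)


def clone_allowed(flags):
    """Allow clone only when no namespace-creating bit is set."""
    return (flags & NAMESPACE_MASK) == 0
```

A thread-style call such as `clone_allowed(CLONE_VM | CLONE_THREAD)` passes, while any call that sets, say, `CLONE_NEWUSER` is rejected. No equivalent test can be written for clone3, since its flags never appear as a syscall argument.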

@thaJeztah
Member

We're looking into whether we can backport this to the 20.10 branch; we previously tried to do so, but it would also pull in a (rather large) refactor, so perhaps we should have an implementation of this that targets the 20.10 branch (before the refactor).

rofirrim added a commit to rofirrim/eiciel that referenced this pull request Sep 4, 2022
pmatilai added a commit to pmatilai/rpm that referenced this pull request Sep 15, 2022
It appears that some container deity somewhere has fixed the Docker
issue [1] that prevented us from upgrading beyond F34, but there was
another gotcha introduced in the meanwhile on Fedora side:
glibc-gconv-extras is now needed for our UTF-8 encoding check to work.

While at it, optimize the dnf side a bit: get rid of modularity repos
entirely so they don't come back via updates, and disable the H.264
repo too, we don't need *that* for building or testing rpm...

[1] moby/moby#42681
pmatilai added a commit to rpm-software-management/rpm that referenced this pull request Sep 15, 2022
pmatilai added a commit to pmatilai/rpm that referenced this pull request Sep 16, 2022

(cherry picked from commit 6761c39)
pmatilai added a commit to rpm-software-management/rpm that referenced this pull request Sep 20, 2022

(cherry picked from commit 6761c39)
jlebon added a commit to jlebon/rpm-ostree that referenced this pull request Nov 18, 2022
Seems like a combination of `ubuntu-latest` and/or the move to f37 glibc
is causing `createrepo_c` to hit the classic `clone3` Docker seccomp
issue:

moby/moby#42681

Hack around this by running the container in privileged mode.
sshyran pushed a commit to sshyran/Tools-for-Container-Optimized-OS that referenced this pull request Dec 28, 2022
Ubuntu archived short-term release 21.10 and moved it to the
old-releases.ubuntu.com site. We still have to use it because
older Docker versions are affected by moby/moby#42681

To fix the build switch apt sources to old-releases before installing
packages.

Change-Id: I0432cd0002b4e955399539a5b0ddaba21b4535cc
Reviewed-on: https://cos-review.googlesource.com/c/cos/tools/+/36309
Reviewed-by: Arnav Kansal <rnv@google.com>
Tested-by: Oleksandr Tymoshenko <ovt@google.com>
Cloud-Build: GCB Service account <228075978874@cloudbuild.gserviceaccount.com>
MaxMustermann2 added a commit to MaxMustermann2/harmony that referenced this pull request Jun 21, 2023
On Docker versions < 20.10.9, `apt update` fails due to the use of
syscall `clone3` by `Glibc >= 2.34`. This change upgrades the base
distribution used by Travis to `jammy`, which contains Docker engine
20.10.12.

See https://docs.travis-ci.com/user/reference/jammy/#docker and
moby/moby#42681 for reference.
ONECasey pushed a commit to harmony-one/harmony that referenced this pull request Jun 21, 2023