Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[20.10 backport] seccomp: add support for "clone3" syscall in default policy #42836

Merged
merged 1 commit into from
Oct 4, 2021

Conversation

tianon
Copy link
Member

@tianon tianon commented Sep 9, 2021

This is a backport of 9f6b562 (#42681, see also #42680), adapted to avoid the refactoring that happened in d927397 (#42005).

fixes #42963
fixes #42876

@tianon tianon changed the title seccomp: add support for "clone3" syscall in default policy [20.10 backport] seccomp: add support for "clone3" syscall in default policy Sep 9, 2021
@thaJeztah thaJeztah added this to the 20.10.9 milestone Sep 9, 2021
@tianon
Copy link
Member Author

tianon commented Sep 9, 2021

Doh, build_userns_linux_test.go:73: assertion failed: error is not nil: OCI runtime create failed: container_linux.go:380: starting container process caused: error adding seccomp filter rule for syscall clone3: requested action matches default action of filter: unknown looks very related. 😞

@thaJeztah
Copy link
Member

hmmmm yes. ISTR there was a recent change in runc related to that (at least, I recall a PR about ignoring defaults)

@thaJeztah
Copy link
Member

Yup; fix for that is in v1.0.2. But that seems to indicate we may not need this patch / backport? opencontainers/runc#3173

@tianon
Copy link
Member Author

tianon commented Sep 10, 2021

Now that I'm reading that error message with a fresh eye, I realize it's complaining that SCMP_ACT_ERRNO is the default action, which ... yep, what of it? The errnoRet is different?? 😕

@tianon
Copy link
Member Author

tianon commented Sep 10, 2021

That makes me wonder if the "fix" in opencontainers/runc#3129 is actually correct? Skipping rules like this doesn't feel like the right answer.

Scratch that; looking at the implementation there, I see it does take errnoRet into account.

@tianon
Copy link
Member Author

tianon commented Sep 10, 2021

I think I found what was wrong. 🤞

@tianon tianon marked this pull request as ready for review September 11, 2021 00:47
@tianon
Copy link
Member Author

tianon commented Sep 11, 2021

Got it -- the only failure on the last run was gofmt, which I've corrected, so this one should be good.

@tianon
Copy link
Member Author

tianon commented Sep 13, 2021

FWIW, @lucaskanashiro tested this and reported in https://bugs.launchpad.net/cloud-images/+bug/1943049/comments/25 that it does fix the problem 😄

This is a backport of 9f6b562, adapted to avoid the refactoring that happened in d927397.

Original commit message is as follows:

> If no seccomp policy is requested, then the built-in default policy in
> dockerd applies. This has no rule for "clone3" defined, nor any default
> errno defined. So when runc receives the config it attempts to determine
> a default errno, using logic defined in its commit:
>
>   opencontainers/runc@7a8d716
>
> As explained in the above commit message, runc uses a heuristic to
> decide which errno to return by default:
>
> [quote]
>   The solution applied here is to prepend a "stub" filter which returns
>   -ENOSYS if the requested syscall has a larger syscall number than any
>   syscall mentioned in the filter. The reason for this specific rule is
>   that syscall numbers are (roughly) allocated sequentially and thus newer
>   syscalls will (usually) have a larger syscall number -- thus causing our
>   filters to produce -ENOSYS if the filter was written before the syscall
>   existed.
> [/quote]
>
> Unfortunately clone3 appears to one of the edge cases that does not
> result in use of ENOSYS, instead ending up with the historical EPERM
> errno.
>
> Latest glibc (2.33.9000, in Fedora 35 rawhide) will attempt to use
> clone3 by default. If it sees ENOSYS then it will automatically
> fallback to using clone. Any other errno is treated as a fatal
> error. Thus when docker seccomp policy triggers EPERM from clone3,
> no fallback occurs and programs are thus unable to spawn threads.
>
> The clone3 syscall is much more complicated than clone, most notably its
> flags are not exposed as a directly argument any more. Instead they are
> hidden inside a struct. This means that seccomp filters are unable to
> apply policy based on values seen in flags. Thus we can't directly
> replicate the current "clone" filtering for "clone3". We can at least
> ensure "clone3" returns ENOSYS errno, to trigger fallback to "clone"
> at which point we can filter on flags.

Signed-off-by: Tianon Gravi <admwiggin@gmail.com>
Co-authored-by: Daniel P. Berrangé <berrange@redhat.com>
@tianon
Copy link
Member Author

tianon commented Sep 13, 2021

Re-reviewing this, it felt silly to still have the createSpecsSyscall function which is only invoked once (and with an unused argument to boot, because I didn't notice in my refactoring 😂), so I've updated again to remove that completely.

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gotmax23
Copy link

Hi,

Can you please consider merging this change and expediting a release containing the fix? This clone3 issue has caused a lot of problems over the past couple months.

Thanks you for your work. I apologize if I'm being too pushy.

-- Maxwell 😀

flouthoc added a commit to flouthoc/crun that referenced this pull request Nov 5, 2021
Following commit migrates CI to latest stable `f35`.

* Make sure we are using latest docker: Since support
  for `clone3` was added in default seccomp profile was
  backported via moby/moby#42836.

* Bump podman to `v3.4.0` for `f35`

* New versions of `clang-check` adds a check for
  `-Wunused-but-set-variable` so mark variables with `__attribute__
((unused))` where its needed.

Signed-off-by: Aditya Rajan <arajan@redhat.com>
flouthoc added a commit to flouthoc/crun that referenced this pull request Nov 5, 2021
Following commit migrates CI to latest stable `f35`.

* Make sure we are using latest docker: Since support
  for `clone3` was added in default seccomp profile was
  backported via moby/moby#42836.

* Bump podman to `v3.4.0` for `f35` and skip newly added tests which are
  not supported in this env.

* New versions of `clang-check` adds a check for
  `-Wunused-but-set-variable` so mark variables with `__attribute__
((unused))` where its needed.

Signed-off-by: Aditya Rajan <arajan@redhat.com>
flouthoc added a commit to flouthoc/crun that referenced this pull request Nov 5, 2021
Following commit migrates CI to latest stable `f35`.

* Make sure we are using latest docker: Since support
  for `clone3` was added in default seccomp profile was
  backported via moby/moby#42836.

* Bump podman to `v3.4.0` for `f35` and skip newly added tests which are
  not supported in this env.

Signed-off-by: Aditya Rajan <arajan@redhat.com>
@tianon
Copy link
Member Author

tianon commented Nov 8, 2021

@zhsj brought https://buildd.debian.org/status/fetch.php?pkg=docker.io&arch=mips64el&ver=20.10.10%2Bdfsg1-1&stamp=1636015943&raw=0 to my attention in which TestUnmarshalDefaultProfile fails on mips64le because ENOSYS has a different numeric value (because of course it does, right? we can't possibly have nice things)

=== FAIL: profiles/seccomp TestUnmarshalDefaultProfile (0.19s)
    seccomp_test.go:68: assertion failed: 
        --- expected.Syscalls
        +++ profile.Syscalls
          []*seccomp.Syscall{
          	... // 14 identical elements
          	&{Names: {"clone"}, Action: "SCMP_ACT_ALLOW", Args: {&{Value: 2114060288, Op: "SCMP_CMP_MASKED_EQ"}}, Excludes: {Caps: {"CAP_SYS_ADMIN"}, Arches: {"s390", "s390x"}}},
          	&{Names: {"clone"}, Action: "SCMP_ACT_ALLOW", Args: {&{Index: 1, Value: 2114060288, Op: "SCMP_CMP_MASKED_EQ"}}, Comment: "s390 parameter ordering for clone is different", ...},
          	&{
          		Name:     "",
          		Names:    {"clone3"},
          		Action:   "SCMP_ACT_ERRNO",
        - 		ErrnoRet: &89,
        + 		ErrnoRet: &38,
          		Args:     {},
          		Comment:  "",
          		... // 2 identical fields
          	},
          	&{Names: {"reboot"}, Action: "SCMP_ACT_ALLOW", Args: {}, Includes: {Caps: {"CAP_SYS_BOOT"}}, ...},
          	&{Names: {"chroot"}, Action: "SCMP_ACT_ALLOW", Args: {}, Includes: {Caps: {"CAP_SYS_CHROOT"}}, ...},
          	... // 8 identical elements
          }

I think this test will probably fail on MIPS arches on the version that made it into master, too (for the same reason).

Edit: I've raised this further up the chain at opencontainers/runtime-spec#1087 (comment) now

@tianon
Copy link
Member Author

tianon commented Nov 8, 2021

On the plus side, at least this only affects MIPS* and sparc64? 😬

$ git grep -inE 'ENOSYS[[:space:]]+=' | grep linux | column -t
unix/zerrors_linux_386.go:588:       ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_amd64.go:588:     ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_arm.go:594:       ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_arm64.go:585:     ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_mips.go:591:      ENOSYS  =  syscall.Errno(0x59)
unix/zerrors_linux_mips64.go:591:    ENOSYS  =  syscall.Errno(0x59)
unix/zerrors_linux_mips64le.go:591:  ENOSYS  =  syscall.Errno(0x59)
unix/zerrors_linux_mipsle.go:591:    ENOSYS  =  syscall.Errno(0x59)
unix/zerrors_linux_ppc.go:646:       ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_ppc64.go:650:     ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_ppc64le.go:650:   ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_riscv64.go:575:   ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_s390x.go:650:     ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_sparc64.go:639:   ENOSYS  =  syscall.Errno(0x5a)

Edit: I've raised this further up the chain at opencontainers/runtime-spec#1087 (comment) now

@thaJeztah
Copy link
Member

Ahm.. yes, so we could adjust the test and use a different default.json for that architecture, or (bit of a hack), make the rule conditional, and used a hard-coded value for mips and sparc64. Looks like for clone, we also had to use a conditional rule;

Comment: "s390 parameter ordering for clone is different",
Includes: &Filter{
Arches: []string{"s390", "s390x"},
},
Excludes: &Filter{
Caps: []string{"CAP_SYS_ADMIN"},
},

@thaJeztah
Copy link
Member

But I guess the "portable" way would be to use a string for this, and have the conversion done by the runtime that sets the actual profile

thaJeztah added a commit to thaJeztah/docker that referenced this pull request Nov 9, 2021
…MIPS

The value of ENOSYS differs between MIPS and non-MIPS architectures. While
this is not problematic for the embedded seccomp profile, it prevents the
profile from being saved as a portable JSON file that can be used for both
architectures.

To work around this situation, we include conditional rules for both arches.
and hard-code the value for ENOSYS in both.

For more details, refer to moby#42836 (comment)
and opencontainers/runtime-spec#1087 (comment)

A test was added, which can be run with:

    go test --tags=seccomp -run TestLoadConditionalClone3 -v ./profiles/seccomp/

    === RUN   TestLoadConditionalClone3
    === RUN   TestLoadConditionalClone3/clone3_default_amd64
    === RUN   TestLoadConditionalClone3/clone3_default_mips
    === RUN   TestLoadConditionalClone3/clone3_cap_sys_admin_amd64
    === RUN   TestLoadConditionalClone3/clone3_cap_sys_admin_mips
    --- PASS: TestLoadConditionalClone3 (0.01s)
        --- PASS: TestLoadConditionalClone3/clone3_default_amd64 (0.00s)
        --- PASS: TestLoadConditionalClone3/clone3_default_mips (0.00s)
        --- PASS: TestLoadConditionalClone3/clone3_cap_sys_admin_amd64 (0.00s)
        --- PASS: TestLoadConditionalClone3/clone3_cap_sys_admin_mips (0.00s)
    PASS
    ok  	github.com/docker/docker/profiles/seccomp	0.015s

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
thaJeztah added a commit to thaJeztah/docker that referenced this pull request Nov 9, 2021
…MIPS

The value of ENOSYS differs between MIPS and non-MIPS architectures. While
this is not problematic for the embedded seccomp profile, it prevents the
profile from being saved as a portable JSON file that can be used for both
architectures.

To work around this situation, we include conditional rules for both arches.
and hard-code the value for ENOSYS in both.

For more details, refer to moby#42836 (comment)
and opencontainers/runtime-spec#1087 (comment)

A test was added, which can be run with:

    go test --tags=seccomp -run TestLoadConditionalClone3 -v ./profiles/seccomp/

    === RUN   TestLoadConditionalClone3
    === RUN   TestLoadConditionalClone3/clone3_default_amd64
    === RUN   TestLoadConditionalClone3/clone3_default_mips
    === RUN   TestLoadConditionalClone3/clone3_cap_sys_admin_amd64
    === RUN   TestLoadConditionalClone3/clone3_cap_sys_admin_mips
    --- PASS: TestLoadConditionalClone3 (0.01s)
        --- PASS: TestLoadConditionalClone3/clone3_default_amd64 (0.00s)
        --- PASS: TestLoadConditionalClone3/clone3_default_mips (0.00s)
        --- PASS: TestLoadConditionalClone3/clone3_cap_sys_admin_amd64 (0.00s)
        --- PASS: TestLoadConditionalClone3/clone3_cap_sys_admin_mips (0.00s)
    PASS
    ok  	github.com/docker/docker/profiles/seccomp	0.015s

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
umarcor added a commit to dbhi/qus that referenced this pull request Nov 25, 2021
Fedora containers fail with QEMU 6.1.0 and Docker 20.10.9.
It can be worked around with option '--security-opt seccomp=unconfined',
but that is not allowed on GitHub Actions.
Anyway, it is fixed in Docker 20.10.10.

* moby/moby#42963
* moby/moby#42836
flokli added a commit to wireapp/wire-server that referenced this pull request Jun 8, 2022
glibc 2.34 uses the clone3 syscall, which is not part of the seccomp
filters that moby ships on older versions.

While as a workaround you might be able to run containers with
`--privileged`, it's the better call to just run a more recent Docker
runtime.

References:
 - docker/buildx#772
 - moby/buildkit#2379
 - moby/moby#42836
 - NixOS/nixpkgs#170900
flokli added a commit to wireapp/wire-server that referenced this pull request Jun 8, 2022
glibc 2.34 uses the clone3 syscall, which is not part of the seccomp
filters that moby ships on older versions.

While as a workaround you might be able to run containers with
`--privileged`, it's the better call to just run a more recent Docker
runtime.

References:
 - docker/buildx#772
 - moby/buildkit#2379
 - moby/moby#42836
 - NixOS/nixpkgs#170900
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants