Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seccomp: allow to override default errno return code #1087

Merged
merged 2 commits into from Mar 3, 2021

Conversation

giuseppe
Copy link
Member

the specs already support overriding the errno code for the syscalls
but the default value is hardcoded to EPERM.

Add a new attribute to override the default value.

Signed-off-by: Giuseppe Scrivano gscrivan@redhat.com

@rhatdan
Copy link
Contributor

rhatdan commented Feb 19, 2021

LGTM
@mrunalp @cyphar PTAL

@mrunalp
Copy link
Contributor

mrunalp commented Feb 19, 2021

LGTM

Approved with PullApprove

@fweimer-rh
Copy link

fweimer-rh commented Feb 19, 2021

This makes the runc workaround explicitly non-compliant, doesn't it? It sometimes returns EPERM, and sometimes ENOSYS, and not always EPERM.

As previously discussed, it is not entirely clear when that default should be applied. To any system call not listed in the policy?

Is EPERM really the right default here?

@richfelker

@richfelker
Copy link

I have always believed (1) ENOSYS should be the default and (2) the system should not have any method for causing as-yet-unknown syscalls (at the time of profile authorship) to fail with any error code other than ENOSYS. Of these, (2) is the most important, and I understand there's a backwards-compatibility argument for returning EPERM as default for syscalls that were in the known range when the filter was written. I'm indifferent to adding the ability to change the default for this range as long as the unknown range is always forcibly ENOSYS.

@giuseppe
Copy link
Member Author

Is EPERM really the right default here?

Probably not, I've picked it just for backward compatibility.

I have always believed (1) ENOSYS should be the default and (2) the system should not have any method for causing as-yet-unknown syscalls (at the time of profile authorship) to fail with any error code other than ENOSYS. Of these, (2) is the most important, and I understand there's a backwards-compatibility argument for returning EPERM as default for syscalls that were in the known range when the filter was written. I'm indifferent to adding the ability to change the default for this range as long as the unknown range is always forcibly ENOSYS.

I personally think the default for other actions should still be EPERM as they are known at the time the profile was written but it is just a matter of taste. If a container engine uses the new defaultErrnoRet then it knows about its meaning and we can pick the behavior we prefer and a seccomp profile can already override the errno value for each specified syscall.

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Actually had a branch with this, but didn't find the time yet to open 😂

@giuseppe
Copy link
Member Author

thanks anyone for the reviews. Before merging let's wait for @cyphar to weight in as he was looking the most at this issue.

Copy link

@vrothberg vrothberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cyphar
Copy link
Member

cyphar commented Feb 20, 2021

I'm fine with this PR, but as I've discussed before -- just to make it completely explicit -- this does not solve the -ENOSYS problem. It's good that we're doing this for completeness's sake, but you would need all profiles to switch to ENOSYS and even then you're going to end up in scenarios where socket(2) returns -ENOSYS spuriously.

I think it would've been nicer to do this as one big changeset -- I think it would be a bad idea to release a new runtime-spec which adds this and doesn't address the larger -ENOSYS problem.

(2) the system should not have any method for causing as-yet-unknown syscalls (at the time of profile authorship) to fail with any error code other than ENOSYS

I agree, and the reason I wanted to wait before we did this change is because I wanted us to include all of the other changes we need to solve this problem properly in the spec. But we can take this change now and do the other necessary changes later.

@h-vetinari
Copy link
Contributor

@cyphar: I think it would be a bad idea to release a new runtime-spec which adds this and doesn't address the larger -ENOSYS problem.

By that argument, merging this PR would make the spec "unreleaseable". Given how thorny this problem seems to be, and how there have been calls for a new version for ~8 months, would it be feasible to release 1.0.3 before merging this?

@richfelker
Copy link

Why can't the spec just be fixed to specify what runc is doing to fix the problem?

@giuseppe
Copy link
Member Author

It's good that we're doing this for completeness's sake, but you would need all profiles to switch to ENOSYS and even then you're going to end up in scenarios where socket(2) returns -ENOSYS spuriously.

would this happen when a syscall is not fully specified in the profile and the defaultErrnoRet is used or are there other cases?

@cyphar
Copy link
Member

cyphar commented Feb 21, 2021

@giuseppe

would this happen when a syscall is not fully specified in the profile and the defaultErrnoRet is used or are there other cases?

That is the main issue -- but given that libseccomp today doesn't let you create fully-specified rules (depending on how complicated the allow rule is), that's sufficient to say that defaultErrnoRet cannot alone be used to solve the -ENOSYS problem.

In addition, in my view we should not be relying on profile authors to solve the -ENOSYS problem at all -- this is such a footgun that runtimes should handle it automatically (similarly to what runc is doing today). So in my view the complete solution for this problem is that we specify that the runtime should (or must) resolve this problem in a particular way. (Maybe there should be an opt-out from this behaviour, I'd have to think about exactly how that would work.)

To answer @richfelker, I tihnk that the spec-level solution should be a little bit less janky than the current runc solution. (Ideally we would do this through minimum kernel version information.)

@h-vetinari

Given how thorny this problem seems to be, and how there have been calls for a new version for ~8 months, would it be feasible to release 1.0.3 before merging this?

Yeah let's do that first.

@giuseppe
Copy link
Member Author

@giuseppe

would this happen when a syscall is not fully specified in the profile and the defaultErrnoRet is used or are there other cases?

That is the main issue -- but given that libseccomp today doesn't let you create fully-specified rules (depending on how complicated the allow rule is), that's sufficient to say that defaultErrnoRet cannot alone be used to solve the -ENOSYS problem.

true that won't fully solve the issue with MASKED_EQ, that must be addressed in libseccomp first (and it seems it is likely going to be added soon (2.6.0?))

In addition, in my view we should not be relying on profile authors to solve the -ENOSYS problem at all -- this is such a footgun that runtimes should handle it automatically (similarly to what runc is doing today). So in my view the complete solution for this problem is that we specify that the runtime should (or must) resolve this problem in a particular way. (Maybe there should be an opt-out from this behaviour, I'd have to think about exactly how that would work.)

To answer @richfelker, I tihnk that the spec-level solution should be a little bit less janky than the current runc solution. (Ideally we would do this through minimum kernel version information.)

do you see the kernel version information as a replacement for defaultErrnoRet or the two features can coexist together?

@cyphar
Copy link
Member

cyphar commented Feb 21, 2021

@giuseppe

true that won't fully solve the issue with MASKED_EQ

MASKED_EQ is a problem (though it is one you can work around in a fairly dodgy manner by generating 2^n MASKED_EQ rules). But the biggest issue is that you cannot construct multiple AND rules for the same argument with today's libseccomp (and there is no roadmap for fixing this). This means you cannot generate the inverse of (arg0 == 1 || arg1 == 2) because (arg0 != 1 && arg1 != 2) is an invalid rule -- and as far as I can tell there is no clever way to work around this without having to generate an exhaustive set of 2^64 rules in the pathological case (and that's only if one argument has || rules -- it'll get even more complicated with multiple arguments).

Do you see the kernel version information as a replacement for defaultErrnoRet or the two features can coexist together?

They can both co-exist -- the lack of defaultErrnoRet should be fixed purely for completeness's sake (and I can imagine a few cases where someone might want to use a different default errno).

@giuseppe
Copy link
Member Author

MASKED_EQ is a problem (though it is one you can work around in a fairly dodgy manner by generating 2^n MASKED_EQ rules). But the biggest issue is that you cannot construct multiple AND rules for the same argument with today's libseccomp (and there is no roadmap for fixing this). This means you cannot generate the inverse of (arg0 == 1 || arg1 == 2) because (arg0 != 1 && arg1 != 2) is an invalid rule -- and as far as I can tell there is no clever way to work around this without having to generate an exhaustive set of 2^64 rules in the pathological case (and that's only if one argument has || rules -- it'll get even more complicated with multiple arguments).

Did you mean (arg0 == 1 || arg0 == 2)? Yeah, that is another possible problem with libseccomp :(

In the long term I really hope we can bypass the seccomp generation and add to the specs something like containers/crun#578 (in the seccomp section). Alternatively, at least agree on something nicer than JSON and that users are not expected to deal directly with the libseccomp limitations.

@cyphar
Copy link
Member

cyphar commented Feb 22, 2021

Did you mean (arg0 == 1 || arg0 == 2)? Yeah, that is another possible problem with libseccomp :(

Oops, yeah that's what I meant. And practically this is more critical than MASKED_EQ because there is a workable fix for MASKED_EQ (not to mention it will be fixed soon since it's fairly simple to add) -- the lack of && support is pretty major.

In the long term I really hope we can bypass the seccomp generation and add to the specs something like containers/crun#578 (in the seccomp section).

I agree, though I am on the fence on whether we should still have -ENOSYS stubs be auto-added by runtimes even with custom seccomp profiles. Maybe it makes sense to allow users to opt-out of it, but I still think seccomp is far too footgun-friendly for us to leave it all up to profile generators.

HOWEVER there are upsides to the current system:

  • It's possible to understand the seccomp profile logic -- inserting compiled BPF is not really ideal either.
  • Profile generators can just use the syscall names rather than needing to know the actual number (in practice you can have the higher-level runtime just support a better generator -- but the issue is that part of the container configuration is now outside of the runtime-spec).

It should also be noted that while the current seccomp schema is based around libseccomp's limitations, nothing is stopping us from switching away from libseccomp internally.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
the specs already support overriding the errno code for the syscalls
but the default value is hardcoded to EPERM.

Add a new attribute to override the default value.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe
Copy link
Member Author

I agree, though I am on the fence on whether we should still have -ENOSYS stubs be auto-added by runtimes even with custom seccomp profiles. Maybe it makes sense to allow users to opt-out of it, but I still think seccomp is far too footgun-friendly for us to leave it all up to profile generators.

I still think it is useful to allow the raw BPF. I have already considered the pre-compilation of BPF in the past, since most containers use the same BPF profile, to avoid the cost of looking up syscalls in libseccomp. The lookup function in 2.4.0 had O(n) complexity and it is used for each syscall specified in the profile (but then I preferred to go with seccomp/libseccomp#223 and address it in libseccomp itself).

The raw BPF feature should not replace the existing mechanism, neither be the default; but it would open up the possibility to experiment with different ways of generating the BPF. IMO, users should be able to shoot themselves in the foot if they wish so.

I'm fine with this PR, but as I've discussed before -- just to make it completely explicit -- this does not solve the -ENOSYS problem. It's good that we're doing this for completeness's sake, but you would need all profiles to switch to ENOSYS and even then you're going to end up in scenarios where socket(2) returns -ENOSYS spuriously

Could we move forward with it as it is just for completeness and we can address the ENOSYS issue separately? I'd expect the discussion to take some time, while this feature will at least help us to temporarily address the issue with most seccomp profiles.

@cyphar
Copy link
Member

cyphar commented Feb 23, 2021

Could we move forward with it as it is just for completeness and we can address the ENOSYS issue separately? I'd expect the discussion to take some time, while this feature will at least help us to temporarily address the issue with most seccomp profiles.

Yeah we can merge this first.

Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@mrunalp
Copy link
Contributor

mrunalp commented Mar 3, 2021

LGTM

Approved with PullApprove

@mrunalp mrunalp merged commit 43e4633 into opencontainers:master Mar 3, 2021
flo-renaud pushed a commit to flo-renaud/freeipa that referenced this pull request Oct 18, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error was unconditionally EPERM.
There are many issues about glibc failed with new syscalls in containerized
environments if their host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json

Fixes: https://pagure.io/freeipa/issue/9008
Signed-off-by: Stanislav Levin <slev@altlinux.org>
flo-renaud pushed a commit to flo-renaud/freeipa that referenced this pull request Oct 18, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error was unconditionally EPERM.
There are many issues about glibc failed with new syscalls in containerized
environments if their host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json

Fixes: https://pagure.io/freeipa/issue/9008
Signed-off-by: Stanislav Levin <slev@altlinux.org>
stanislavlevin added a commit to stanislavlevin/freeipa that referenced this pull request Oct 19, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error unconditionally was EPERM.
There are many issues about glibc failed to new syscalls in containerized
environments for which host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json
abbra pushed a commit to freeipa/freeipa that referenced this pull request Oct 19, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error was unconditionally EPERM.
There are many issues about glibc failed with new syscalls in containerized
environments if their host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json

Fixes: https://pagure.io/freeipa/issue/9008
Signed-off-by: Stanislav Levin <slev@altlinux.org>
Reviewed-By: Alexander Bokovoy <abokovoy@redhat.com>
stanislavlevin added a commit to stanislavlevin/freeipa that referenced this pull request Oct 21, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error unconditionally was EPERM.
There are many issues about glibc failed to new syscalls in containerized
environments for which host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json
rcritten pushed a commit to flo-renaud/freeipa that referenced this pull request Oct 21, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error was unconditionally EPERM.
There are many issues about glibc failed with new syscalls in containerized
environments if their host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json

Fixes: https://pagure.io/freeipa/issue/9008
Signed-off-by: Stanislav Levin <slev@altlinux.org>
abbra pushed a commit to freeipa/freeipa that referenced this pull request Oct 23, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error was unconditionally EPERM.
There are many issues about glibc failed with new syscalls in containerized
environments if their host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json

Fixes: https://pagure.io/freeipa/issue/9008
Signed-off-by: Stanislav Levin <slev@altlinux.org>
Reviewed-By: Alexander Bokovoy <abokovoy@redhat.com>
stanislavlevin added a commit to stanislavlevin/freeipa that referenced this pull request Oct 25, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error unconditionally was EPERM.
There are many issues about glibc failed to new syscalls in containerized
environments for which host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json
@tianon
Copy link
Member

tianon commented Nov 8, 2021

Arg, we've run into some fun with this downstream in moby/moby thanks to errno values being a per-architecture thing: moby/moby#42836 (comment) / moby/moby#42681 (comment)

$ git grep -inE 'ENOSYS[[:space:]]+=' | grep linux | column -t
unix/zerrors_linux_386.go:588:       ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_amd64.go:588:     ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_arm.go:594:       ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_arm64.go:585:     ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_mips.go:591:      ENOSYS  =  syscall.Errno(0x59)
unix/zerrors_linux_mips64.go:591:    ENOSYS  =  syscall.Errno(0x59)
unix/zerrors_linux_mips64le.go:591:  ENOSYS  =  syscall.Errno(0x59)
unix/zerrors_linux_mipsle.go:591:    ENOSYS  =  syscall.Errno(0x59)
unix/zerrors_linux_ppc.go:646:       ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_ppc64.go:650:     ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_ppc64le.go:650:   ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_riscv64.go:575:   ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_s390x.go:650:     ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_sparc64.go:639:   ENOSYS  =  syscall.Errno(0x5a)

Perhaps this should've been a string? 😞

@utam0k
Copy link
Member

utam0k commented Nov 9, 2021

Hi! I love this change. Is there any plan to add a check for this to the runtime-tools tests?

@giuseppe
Copy link
Member Author

giuseppe commented Nov 9, 2021

Arg, we've run into some fun with this downstream in moby/moby thanks to errno values being a per-architecture thing: moby/moby#42836 (comment) / moby/moby#42681 (comment)

$ git grep -inE 'ENOSYS[[:space:]]+=' | grep linux | column -t
unix/zerrors_linux_386.go:588:       ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_amd64.go:588:     ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_arm.go:594:       ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_arm64.go:585:     ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_mips.go:591:      ENOSYS  =  syscall.Errno(0x59)
unix/zerrors_linux_mips64.go:591:    ENOSYS  =  syscall.Errno(0x59)
unix/zerrors_linux_mips64le.go:591:  ENOSYS  =  syscall.Errno(0x59)
unix/zerrors_linux_mipsle.go:591:    ENOSYS  =  syscall.Errno(0x59)
unix/zerrors_linux_ppc.go:646:       ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_ppc64.go:650:     ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_ppc64le.go:650:   ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_riscv64.go:575:   ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_s390x.go:650:     ENOSYS  =  syscall.Errno(0x26)
unix/zerrors_linux_sparc64.go:639:   ENOSYS  =  syscall.Errno(0x5a)

Perhaps this should've been a string? disappointed

indeed, that was a mistake :/

Any suggestions on how we can fix it? Do we need a new field and obsolete the current one?

@thaJeztah
Copy link
Member

The portable way would probably to have a string representation, so that the runtime that sets the actual profile can do the conversion 🤔

FWIW, I think this would only be a problem if a profile is read from a JSON, and the JSON is created on / with a non-matching architecture in mind, but I can see a string representation easier to use (when writing a profile).

@thaJeztah
Copy link
Member

Do we need a new field and obsolete the current one?

If we decide to add a new field, we could make the new filed take precedence (if new field is set, old field must be ignored)

@giuseppe
Copy link
Member Author

giuseppe commented Nov 9, 2021

If we decide to add a new field, we could make the new filed take precedence (if new field is set, old field must be ignored)

yes, I agree with it. I've used the same technique for containers/common seccomp profiles: containers/common#821

thaJeztah added a commit to thaJeztah/docker that referenced this pull request Nov 9, 2021
…MIPS

The value of ENOSYS differs between MIPS and non-MIPS architectures. While
this is not problematic for the embedded seccomp profile, it prevents the
profile from being saved as a portable JSON file that can be used for both
architectures.

To work around this situation, we include conditional rules for both arches.
and hard-code the value for ENOSYS in both.

For more details, refer to moby#42836 (comment)
and opencontainers/runtime-spec#1087 (comment)

A test was added, which can be run with:

    go test --tags=seccomp -run TestLoadConditionalClone3 -v ./profiles/seccomp/

    === RUN   TestLoadConditionalClone3
    === RUN   TestLoadConditionalClone3/clone3_default_amd64
    === RUN   TestLoadConditionalClone3/clone3_default_mips
    === RUN   TestLoadConditionalClone3/clone3_cap_sys_admin_amd64
    === RUN   TestLoadConditionalClone3/clone3_cap_sys_admin_mips
    --- PASS: TestLoadConditionalClone3 (0.01s)
        --- PASS: TestLoadConditionalClone3/clone3_default_amd64 (0.00s)
        --- PASS: TestLoadConditionalClone3/clone3_default_mips (0.00s)
        --- PASS: TestLoadConditionalClone3/clone3_cap_sys_admin_amd64 (0.00s)
        --- PASS: TestLoadConditionalClone3/clone3_cap_sys_admin_mips (0.00s)
    PASS
    ok  	github.com/docker/docker/profiles/seccomp	0.015s

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@thaJeztah
Copy link
Member

For the meantime, for moby, I opened moby/moby#43005, which uses conditional rules for MIPS / non-MIPS, with hard-coded values for ENOSYS. It's a bit dirty, but should do the trick for now (to allow users to save the profile in JSON).

thaJeztah added a commit to thaJeztah/docker that referenced this pull request Nov 9, 2021
…MIPS

The value of ENOSYS differs between MIPS and non-MIPS architectures. While
this is not problematic for the embedded seccomp profile, it prevents the
profile from being saved as a portable JSON file that can be used for both
architectures.

To work around this situation, we include conditional rules for both arches.
and hard-code the value for ENOSYS in both.

For more details, refer to moby#42836 (comment)
and opencontainers/runtime-spec#1087 (comment)

A test was added, which can be run with:

    go test --tags=seccomp -run TestLoadConditionalClone3 -v ./profiles/seccomp/

    === RUN   TestLoadConditionalClone3
    === RUN   TestLoadConditionalClone3/clone3_default_amd64
    === RUN   TestLoadConditionalClone3/clone3_default_mips
    === RUN   TestLoadConditionalClone3/clone3_cap_sys_admin_amd64
    === RUN   TestLoadConditionalClone3/clone3_cap_sys_admin_mips
    --- PASS: TestLoadConditionalClone3 (0.01s)
        --- PASS: TestLoadConditionalClone3/clone3_default_amd64 (0.00s)
        --- PASS: TestLoadConditionalClone3/clone3_default_mips (0.00s)
        --- PASS: TestLoadConditionalClone3/clone3_cap_sys_admin_amd64 (0.00s)
        --- PASS: TestLoadConditionalClone3/clone3_cap_sys_admin_mips (0.00s)
    PASS
    ok  	github.com/docker/docker/profiles/seccomp	0.015s

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@tianon
Copy link
Member

tianon commented Nov 9, 2021

From the perspective of the specification, this hasn't actually been released yet, so I think we have a small amount of latitude in fixing this "properly", right? I imagine actual users in the wild are limited to the two cases already identified, and probably only really ENOSYS or EPERM? 😅

(I doubt this has been an extremely popular feature for users to take advantage of in their custom profiles -- the number of times I've seen someone write a non-trivial custom seccomp profile that wasn't just "download the default from my runtime and change one minor thing" I can count on a single hand. 🙈)

@giuseppe
Copy link
Member Author

From the perspective of the specification, this hasn't actually been released yet, so I think we have a small amount of latitude in fixing this "properly", right? I imagine actual users in the wild are limited to the two cases already identified, and probably only really ENOSYS or EPERM?

I think we can probably drop defaultErrnoRet and errnoRet once the new version that uses a string settles down.

Would you be fine if I add two fields defaultErrno and errno that accept a string and have higher precedence over defaultErrnoRet and errnoRet that will be marked as obsolete?

@thaJeztah
Copy link
Member

From the perspective of the specification, this hasn't actually been released yet, so I think we have a small amount of latitude in fixing this "properly", right?

Good call; looks indeed like it was not in a release yet; v1.0.2...main

From that perspective, I'd be 👍 to fix it before it's released. With the added note that we need to check what projects already shipped with this feature in use (mostly concerned if, e.g., runc uses it already and/or what happens if containers were created with this option set, and it's upgraded 🤔)

@AkihiroSuda AkihiroSuda mentioned this pull request Jan 24, 2023
@AkihiroSuda AkihiroSuda added this to the v1.1.0 milestone Feb 1, 2023
@AkihiroSuda AkihiroSuda mentioned this pull request Jun 26, 2023
12 tasks
buytenh added a commit to buytenh/ivykis that referenced this pull request May 16, 2024
Commit 491daf4 ("iv_fd_epoll: Add support for epoll_pwait2().")
added support for epoll_pwait2(), with a fallback to epoll_wait() in
case epoll_pwait2() is not supported by the kernel we are running on,
which would be indicated by epoll_pwait2() returning -ENOSYS.

Some reports (e.g. axoflow/axosyslog#85 ,
#33 (comment) )
suggest that some container technologies can cause -EPERM to be
returned for epoll_pwait2(), independently of whether or not
epoll_pwait2() is actually supported by the kernel we are running on,
and this trips us up because we don't currently handle -EPERM
gracefully, as we did not expect that we would have to do so.

Making system calls return -EPERM to indicate that they were filtered
out by a security policy framework seems somewhat dubious, especially
when considering the amount of application and user confusion generated
by system calls that are not documented as being able to fail with
-EPERM now suddenly being able to fail with -EPERM, but there is not
much we can do about this.

I would be against adding EPERM-as-ENOSYS fallbacks for every current
or future case where we handle ENOSYS, but:

1. it seems that this is the only case where this triggers;

2. upstream seems to agree that this EPERM behavior is a bug (see
   e.g. these links dug up by László Várady:
   containers/common#573 ,
   containers/podman#10337 ,
   opencontainers/runtime-spec#1087 ), so
   there will hopefully be no new cases of this in the future;

3. there's at least one container technology release (podman on
   CentOS 7) where this bug triggers and where the platform is
   sufficiently old to no longer be receiving updates, as pointed
   out by Balazs Scheidler, so this issue can't be fixed by users
   updating their container software.

Under these circumstances, adding a workaround on our end seems
reasonable, and this commit does so.

This issue was originally reported by @mstopa-splunk on GitHub.
Workaround originally by Balazs Scheidler.

Signed-off-by: Lennert Buytenhek <buytenh@wantstofly.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet