[25.0 backport] profiles/seccomp: add syscalls for kernel v5.17 - v6.6, match containerd's profile #47344

Merged
merged 7 commits into from Feb 6, 2024

Commits on Feb 6, 2024

  1. seccomp: add set_mempolicy_home_node syscall (kernel v5.17, libseccomp v2.5.4)

    
    This syscall is gated by CAP_SYS_NICE, matching the profile in containerd.
    
    containerd: containerd/containerd@a6e52c7
    libseccomp: seccomp/libseccomp@d83cb7a
    kernel: torvalds/linux@c6018b4
    
        mm/mempolicy: add set_mempolicy_home_node syscall
        This syscall can be used to set a home node for the MPOL_BIND and
        MPOL_PREFERRED_MANY memory policy.  Users should use this syscall after
        setting up a memory policy for the specified range as shown below.
    
          mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
                new_nodes->size + 1, 0);
          sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
                        home_node, 0);
    
        The syscall allows specifying a home node/preferred node from which
        the kernel will fulfill memory allocation requests first.
        ...
    
    Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
    (cherry picked from commit 1251982)
    Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
    thaJeztah committed Feb 6, 2024
    61b82be
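
To check how the CAP_SYS_NICE gating shows up from inside a container, one option is to issue the syscall with a deliberately invalid argument and inspect `errno`. The following is a minimal sketch and not part of this change. The fallback number 450 (the value on x86-64 and on architectures that share the common syscall table), the `probe_result` enum, and both function names are assumptions made for illustration. It also assumes that a kernel implementing the call rejects a start address that is not page aligned with `EINVAL`.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: older headers may lack the constant; 450 is the number on
 * x86-64 and on architectures using the common syscall table. */
#ifndef SYS_set_mempolicy_home_node
#define SYS_set_mempolicy_home_node 450
#endif

/* Hypothetical result type for this sketch. */
enum probe_result {
    PROBE_IMPLEMENTED,   /* EINVAL: the kernel ran the call and rejected the bad argument */
    PROBE_NOT_AVAILABLE, /* ENOSYS: kernel without the call, or a filter answering ENOSYS */
    PROBE_DENIED,        /* EPERM: typically a seccomp filter or a missing capability */
    PROBE_INCONCLUSIVE   /* anything else */
};

/* Map the raw return value and errno of the probe to a result. */
enum probe_result classify_probe(long rc, int err)
{
    if (rc != -1)
        return PROBE_INCONCLUSIVE;
    switch (err) {
    case EINVAL: return PROBE_IMPLEMENTED;
    case ENOSYS: return PROBE_NOT_AVAILABLE;
    case EPERM:  return PROBE_DENIED;
    default:     return PROBE_INCONCLUSIVE;
    }
}

/* Probe with start = 1, which is not page aligned, so no memory policy
 * is changed even when the call is allowed. */
enum probe_result probe_set_mempolicy_home_node(void)
{
    errno = 0;
    long rc = syscall(SYS_set_mempolicy_home_node, 1UL, 0UL, 0UL, 0UL);
    return classify_probe(rc, errno);
}
```

Running `probe_set_mempolicy_home_node()` once with and once without `--cap-add SYS_NICE` would be one way to observe the difference; the exact errno a runtime returns for a filtered syscall is not something this sketch can guarantee.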
  2. seccomp: add cachestat syscall (kernel v6.5, libseccomp v2.5.5)

    Add this syscall to match the profile in containerd
    
    containerd: containerd/containerd@a6e52c7
    libseccomp: seccomp/libseccomp@53267af
    kernel: torvalds/linux@cf264e1
    
        NAME
            cachestat - query the page cache statistics of a file.
    
        SYNOPSIS
            #include <sys/mman.h>
    
            struct cachestat_range {
                __u64 off;
                __u64 len;
            };
    
            struct cachestat {
                __u64 nr_cache;
                __u64 nr_dirty;
                __u64 nr_writeback;
                __u64 nr_evicted;
                __u64 nr_recently_evicted;
            };
    
            int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
                struct cachestat *cstat, unsigned int flags);
    
        DESCRIPTION
            cachestat() queries the number of cached pages, number of dirty
            pages, number of pages marked for writeback, number of evicted
            pages, and number of recently evicted pages, in the byte range given
            by `off` and `len`.
    
    Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
    (cherry picked from commit 4d0d5ee)
    Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
    thaJeztah committed Feb 6, 2024
    67e9aa6
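
The synopsis above can be exercised through the raw `syscall(2)` interface, which is useful when the C library has no wrapper yet. This is a sketch under stated assumptions: the fallback number 451, the locally defined `cstat_range_arg` and `cstat_result` structs (mirroring the layout quoted above under different names to avoid clashing with newer headers), and the `query_page_cache` helper are illustrative and not taken from this change. It also assumes that `len == 0` means "from `off` to the end of the file".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 451 on x86-64 and the common syscall table. */
#ifndef SYS_cachestat
#define SYS_cachestat 451
#endif

/* Local mirrors of the structs from the synopsis (names are hypothetical). */
struct cstat_range_arg {
    uint64_t off;
    uint64_t len;
};

struct cstat_result {
    uint64_t nr_cache;
    uint64_t nr_dirty;
    uint64_t nr_writeback;
    uint64_t nr_evicted;
    uint64_t nr_recently_evicted;
};

/* Guard the layout, since these structs are passed straight to the kernel. */
static_assert(sizeof(struct cstat_range_arg) == 16, "unexpected range layout");
static_assert(sizeof(struct cstat_result) == 40, "unexpected result layout");

/* Query page cache statistics for the whole file at `path`.
 * Returns 0 and fills *out on success, or -1 with errno set. */
int query_page_cache(const char *path, struct cstat_result *out)
{
    assert(path != NULL && out != NULL);

    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    struct cstat_range_arg range = { .off = 0, .len = 0 };
    long rc = syscall(SYS_cachestat, (unsigned int)fd, &range, out, 0U);

    int saved = errno; /* close() may overwrite errno */
    close(fd);
    errno = saved;
    return rc == 0 ? 0 : -1;
}
```

A caller should expect `ENOSYS` on kernels older than v6.5, and may see `EPERM` or `ENOSYS` when a seccomp profile that predates this change filters the call.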
  3. seccomp: add fchmodat2 syscall (kernel v6.6, libseccomp v2.5.5)

    Add this syscall to match the profile in containerd
    
    containerd: containerd/containerd@a6e52c7
    libseccomp: seccomp/libseccomp@53267af
    kernel: torvalds/linux@09da082
    
        fs: Add fchmodat2()
    
        On the userspace side fchmodat(3) is implemented as a wrapper
        function which implements the POSIX-specified interface. This
        interface differs from the underlying kernel system call, which does not
        have a flags argument. Most implementations require procfs [1][2].
    
        There doesn't appear to be a good userspace workaround for this issue
        but the implementation in the kernel is pretty straight-forward.
    
        The new fchmodat2() syscall allows passing the AT_SYMLINK_NOFOLLOW flag,
        unlike the existing fchmodat.
    
    Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
    (cherry picked from commit 6f242f1)
    Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
    thaJeztah committed Feb 6, 2024
    5fb4eb9
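
As a rough illustration of the flags argument described above, the call can be issued directly via `syscall(2)`. The fallback number 452 and the `chmod_nofollow` helper are assumptions for this sketch, not something defined by this change.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 452 on x86-64 and the common syscall table. */
#ifndef SYS_fchmodat2
#define SYS_fchmodat2 452
#endif

/* Change the mode of `path` (relative to the current directory) without
 * following a trailing symlink. Returns 0 on success, -1 with errno set. */
int chmod_nofollow(const char *path, mode_t mode)
{
    assert(path != NULL);

    long rc = syscall(SYS_fchmodat2, (long)AT_FDCWD, path,
                      (unsigned int)mode, (unsigned int)AT_SYMLINK_NOFOLLOW);
    return rc == 0 ? 0 : -1;
}
```

Kernels older than v6.6 are expected to return `ENOSYS` here. A C library may issue this syscall on behalf of `fchmodat(3)` when flags are passed, which is one reason a seccomp profile would need to allow it.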
  4. seccomp: add map_shadow_stack syscall (kernel v6.6, libseccomp v2.5.5)

    Add this syscall to match the profile in containerd
    
    containerd: containerd/containerd@a6e52c7
    libseccomp: seccomp/libseccomp@53267af
    kernel: torvalds/linux@c35559f
    
        x86/shstk: Introduce map_shadow_stack syscall
    
        When operating with shadow stacks enabled, the kernel will automatically
        allocate shadow stacks for new threads. However, in some cases userspace
        will need additional shadow stacks. The main example of this is the
        ucontext family of functions, which require userspace to allocate and
        pivot to userspace-managed stacks.
    
        Unlike most other user memory permissions, shadow stacks need to be
        provisioned with special data in order to be useful. They need to be set up
        with a restore token so that userspace can pivot to them via the RSTORSSP
        instruction. But, the security design of shadow stacks is that they
        should not be written to except in limited circumstances. This presents a
        problem for userspace, as to how userspace can provision this special
        data, without allowing for the shadow stack to be generally writable.
    
    Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
    (cherry picked from commit 8826f40)
    Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
    thaJeztah committed Feb 6, 2024
    f9f9e7f
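
Because shadow stacks depend on both kernel and CPU support, a program may want to test for them before relying on the syscall. The sketch below probes with an invalid flags value so that nothing is ever mapped. The fallback number 453, the `shadow_stack_supported` helper, and the errno interpretation (`EINVAL` when the feature is active, `EOPNOTSUPP` when the kernel has the call but user shadow stacks are not enabled) are assumptions for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 453 on x86-64 and the common syscall table. */
#ifndef SYS_map_shadow_stack
#define SYS_map_shadow_stack 453
#endif

/* Returns 1 if the kernel appears to accept shadow stack requests,
 * 0 if the syscall or the feature is unavailable, and -1 if the result
 * is unclear (for example EPERM from a seccomp filter). */
int shadow_stack_supported(void)
{
    errno = 0;
    /* 0x80000000 is not a defined flag, so a supporting kernel is expected
     * to reject the request with EINVAL before allocating anything. */
    long rc = syscall(SYS_map_shadow_stack, 0UL, 0UL, 0x80000000U);
    if (rc != -1)
        return -1; /* unexpected success: treat as unclear */
    if (errno == EINVAL)
        return 1;
    if (errno == ENOSYS || errno == EOPNOTSUPP)
        return 0;
    return -1;
}
```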
  5. seccomp: add futex_requeue syscall (kernel v6.7, libseccomp v2.5.5)

    Add this syscall to match the profile in containerd
    
    containerd: containerd/containerd@a6e52c7
    libseccomp: seccomp/libseccomp@53267af
    kernel: torvalds/linux@0f4b5f9
    
        futex: Add sys_futex_requeue()
    
        Finish off the 'simple' futex2 syscall group by adding
        sys_futex_requeue(). Unlike sys_futex_{wait,wake}() its arguments are
        too numerous to fit into a regular syscall. As such, use struct
        futex_waitv to pass the 'source' and 'destination' futexes to the
        syscall.
    
        This syscall implements what was previously known as FUTEX_CMP_REQUEUE
        and uses {val, uaddr, flags} for source and {uaddr, flags} for
        destination.
    
        This design explicitly allows requeueing between different types of
        futex by having a different flags word per uaddr.
    
    Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
    (cherry picked from commit df57a08)
    Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
    thaJeztah committed Feb 6, 2024
    4cc0416
  6. seccomp: add futex_wait syscall (kernel v6.7, libseccomp v2.5.5)

    Add this syscall to match the profile in containerd
    
    containerd: containerd/containerd@a6e52c7
    libseccomp: seccomp/libseccomp@53267af
    kernel: torvalds/linux@cb8c431
    
        futex: Add sys_futex_wait()
    
        To complement sys_futex_waitv()/wake(), add sys_futex_wait(). This
        syscall implements what was previously known as FUTEX_WAIT_BITSET
        except it uses 'unsigned long' for the value and bitmask arguments,
        takes timespec and clockid_t arguments for the absolute timeout and
        uses FUTEX2 flags.
    
        The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms.
    
    Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
    (cherry picked from commit 10d344d)
    Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
    thaJeztah committed Feb 6, 2024
    74e3b4f
  7. seccomp: add futex_wake syscall (kernel v6.7, libseccomp v2.5.5)

    Add this syscall to match the profile in containerd
    
    containerd: containerd/containerd@a6e52c7
    libseccomp: seccomp/libseccomp@53267af
    kernel: torvalds/linux@9f6c532
    
        futex: Add sys_futex_wake()
    
        To complement sys_futex_waitv() add sys_futex_wake(). This syscall
        implements what was previously known as FUTEX_WAKE_BITSET except it
        uses 'unsigned long' for the bitmask and takes FUTEX2 flags.
    
        The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms.
    
    Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
    (cherry picked from commit d69729e)
    Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
    thaJeztah committed Feb 6, 2024
    ed7c263
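
The futex2 wait and wake calls described in the last two commits can be tried without blocking. The sketch below is illustrative only: the fallback numbers 454 and 455, the `DEMO_` constants (intended to mirror `FUTEX2_SIZE_U32`, `FUTEX2_PRIVATE`, and the match-any bitset from the kernel's uapi headers), and the helper names are assumptions. It uses a 32-bit futex word and a bitmask with all bits set.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Assumption: numbers on x86-64 and the common syscall table. */
#ifndef SYS_futex_wake
#define SYS_futex_wake 454
#endif
#ifndef SYS_futex_wait
#define SYS_futex_wait 455
#endif

/* Local copies of uapi values, in case the headers predate kernel v6.7. */
#define DEMO_FUTEX2_SIZE_U32 0x02U
#define DEMO_FUTEX2_PRIVATE  128U
#define DEMO_MATCH_ANY       0xffffffffUL

/* Wake up to `nr` waiters on `word`. Returns the number of waiters woken,
 * or -1 with errno set. */
long futex2_wake(uint32_t *word, int nr)
{
    assert(word != NULL);
    return syscall(SYS_futex_wake, word, DEMO_MATCH_ANY, nr,
                   DEMO_FUTEX2_SIZE_U32 | DEMO_FUTEX2_PRIVATE);
}

/* Wait on `word` while it still holds `expected`, with no timeout.
 * If *word != expected the kernel is expected to return at once with
 * EAGAIN; otherwise the call blocks until a wake arrives. */
long futex2_wait(uint32_t *word, uint32_t expected)
{
    assert(word != NULL);
    return syscall(SYS_futex_wait, word, (unsigned long)expected,
                   DEMO_MATCH_ANY,
                   DEMO_FUTEX2_SIZE_U32 | DEMO_FUTEX2_PRIVATE,
                   (void *)NULL, (int)CLOCK_MONOTONIC);
}
```

With `uint32_t word = 1;`, calling `futex2_wait(&word, 0)` should fail immediately rather than sleep, because the stored value differs from the expected one. On kernels older than v6.7, or under a seccomp profile that does not list these syscalls, both helpers are expected to fail with `ENOSYS` or `EPERM`.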