Gunicorn worker processes killed by main process under gramine #1798

Open · jonathan-sha opened this issue Mar 7, 2024 · 1 comment

Description of the problem

Under Gramine, the gunicorn main process kills its worker processes once the configured timeout elapses, even though the workers are alive.

gunicorn uses the following mechanism:

  • The main process spawns child worker processes via Python's os.fork(); under Gramine each worker runs in a separate enclave.
  • Each child process inherits an fd to a temp file created by the master process, which it uses to signal to the master that it is "alive".
  • The child process alternately calls chmod(tmp_file_fd, 0) and chmod(tmp_file_fd, 1) between requests and while waiting for new requests; this is supposed to update the ctime of the temp file (see the sketch after this list).
  • The master process monitors the workers by checking the ctime (via fstat) and sends a SIGKILL if the timeout has elapsed.
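For reference, the heartbeat boils down to the following pattern (a minimal C sketch of the gunicorn behavior described above, not gunicorn's actual Python code):

    #include <stdbool.h>
    #include <sys/stat.h>
    #include <time.h>

    /* Worker side: alternate the temp file's mode between 0 and 1; on Linux this
     * bumps the file's ctime, which serves as the liveness heartbeat. */
    static void worker_notify(int tmp_file_fd) {
        static int spinner = 0;
        spinner = (spinner + 1) % 2;
        (void)fchmod(tmp_file_fd, spinner);
    }

    /* Master side: if the heartbeat file's ctime is older than `timeout` seconds,
     * the worker is considered stuck and gets SIGKILLed. */
    static bool worker_timed_out(int tmp_file_fd, time_t timeout) {
        struct stat st;
        if (fstat(tmp_file_fd, &st) != 0)
            return false;
        return time(NULL) - st.st_ctime > timeout;
    }

Under Gramine, the master's fstat() is answered from its own enclave's cached inode, so the ctime it sees never advances.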

Note - the problem was discussed here:
#1134
gramineproject/examples#80

There are a few issues here:

  1. fstat uses the Gramine inode structure, which can be out of sync both for allowed files (which the host can change) and in a multi-process (multi-enclave) scenario like gunicorn's.
  2. chmod doesn't update the internal inode ctime. This doesn't really matter in this case, since gunicorn creates a separate enclave per worker process, but it is an easy fix and can be useful (in case threads rely on file stat to communicate, e.g. with Gramine tmpfs files).

Regarding (2), I have a merge request I can open in case we want to fix this.

Regarding (1), I think the best way to solve this is to use an eager "slow-path" stat for allowed files. Or, as @pwmarcz suggested, we can periodically refresh the inode stat for allowed files. Otherwise, we can add a manifest option to explicitly "force stat refresh" on selected URIs, if the app needs it. Though I'm not sure why we wouldn't want this by default: if the app is calling stat, it shouldn't get stale data.

Steps to reproduce

I used a gsc container running a gunicorn app:

# gramine manifest

sgx.enclave_size = "512M"
sgx.max_threads = 64
sgx.allowed_files = [
  "file://tmp/",                            # needed for python tmpfile
]

# NOTE: this won't work, since the master thread calls os.fork() to create a new enclave.
# [[fs.mounts]]
# type = "tmpfs"
# path = "/tmp/"

The container is run by docker compose with the following command:

    command: [ "/usr/local/bin/gunicorn", "-b", "0.0.0.0:8888", "--capture-output", "--timeout", "180", "--graceful-timeout", "180", "--max-requests", "20000", "-w", "10", "app:app" ]

This issue should be reproducible whenever running gunicorn under gramine with a --timeout option.

Expected results

The master process should correctly detect that the workers are alive and well.

Actual results

The master process calls stat on the temp files, reads the (stale) cached ctime, and concludes that the timeout has elapsed. It then kills the worker processes, even though they are alive and well and handling requests.

Gramine commit hash

v1.6

dimakuv (Contributor) commented Mar 8, 2024

@jonathan-sha Thanks for a great description of the problem!

This was already discussed in #1134, so I'm not sure whether we should consider this issue a duplicate and move your description to #1134 as a new comment. I'll keep it as a separate issue for now.

Regarding (2), I have a merge request I can open in case we want to fix this.

Sure, feel free to create a PR, and we'll take a look at it (depending on the complexity of this PR, this may take a while before we do the reviews). Make sure to properly synchronize -- you'll need to take libos_inode::lock whenever you access ctime, mtime, atime.
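For illustration, the locking discipline would look roughly like this (a minimal sketch, not the actual patch; update_perm_and_ctime is a hypothetical helper, and PalSystemTimeQuery() is used here just as one way to obtain the current time in microseconds):

    /* Sketch only: in the LibOS chmod/fchmod path, both the permission change
     * and the ctime refresh have to happen under the inode lock. */
    static void update_perm_and_ctime(struct libos_inode* inode, mode_t new_perm) {
        uint64_t time_us = 0;
        if (PalSystemTimeQuery(&time_us) < 0)
            time_us = 0; /* best effort: fall back to a zero timestamp */

        lock(&inode->lock);
        inode->perm  = new_perm;
        inode->ctime = time_us / 1000000; /* PAL time is in microseconds */
        unlock(&inode->lock);
    }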

Regarding (1), I think the best way to solve this is to use an eager "slow-path" stat for allowed files.

Yes, I agree.

Btw, I don't like the approach of "periodic refresh" because this will introduce more complexity (who performs this periodic refresh? what should be the refresh period? how does this helper thread learn which inodes to refresh?). Also, fstat() should be a sufficiently rare syscall in applications, so that we can afford the overhead of a "slow-path" host-level fstat() on each libos_syscall_fstat().

Currently we have a "use cached values" implementation of fstat(), through the following chain:

  1. long libos_syscall_fstat(int fd, struct stat* stat) {
  2. static int do_hstat(struct libos_handle* hdl, struct stat* stat) {
  3. .hstat = &generic_inode_hstat,
  4. int generic_inode_hstat(struct libos_handle* hdl, struct stat* buf) {
  5. static int generic_istat(struct libos_inode* inode, struct stat* buf) {
         memset(buf, 0, sizeof(*buf));
         lock(&inode->lock);
         buf->st_mode = inode->type | inode->perm;
         buf->st_size = inode->size;
         buf->st_uid = inode->uid;
         buf->st_gid = inode->gid;
         buf->st_atime = inode->atime;
         buf->st_mtime = inode->mtime;
         buf->st_ctime = inode->ctime;
         /* Some programs (e.g. some tests from LTP) require this value. We've picked some random,
          * pretty looking constant - exact value should not affect anything (perhaps except
          * performance). */
         buf->st_blksize = 0x1000;
         /*
          * Pretend `nlink` is 2 for directories (to account for "." and ".."), 1 for other files.
          *
          * Applications are unlikely to depend on exact value of `nlink`, and for us, it's
          * inconvenient to keep track of the exact value (we would have to list the directory, and
          * also take into account synthetic files created by Gramine, such as named pipes and
          * sockets).
          */
         buf->st_nlink = (inode->type == S_IFDIR ? 2 : 1);
         if (inode->mount->uri)
             buf->st_dev = hash_str(inode->mount->uri);
         unlock(&inode->lock);
         return 0;
     }

We can add a call to PalStreamAttributesQuery() in generic_istat(), to request the up-to-date atime/mtime/ctime attributes of the file (in addition to the already-existing size and others). NOTE: the file may already be deleted on the host, in which case PalStreamAttributesQuery() will return an error; in that case we should probably just return without updating the inode and without reporting an error, since some FDs/handles may still be open on this file and won't expect the failure.
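A rough sketch of what that refresh could look like (maybe_refresh_host_times and inode_uri are hypothetical helpers, and attr.atime/mtime/ctime are the new PAL_STREAM_ATTR fields discussed below; none of this exists in this form today):

    /* Sketch only: refresh the cached timestamps from the host before filling
     * the stat buffer; called on the fstat path, before taking inode->lock to
     * copy the values out. */
    static void maybe_refresh_host_times(struct libos_inode* inode) {
        const char* uri = inode_uri(inode); /* hypothetical: host URI backing this inode */
        if (!uri)
            return;

        PAL_STREAM_ATTR attr;
        if (PalStreamAttributesQuery(uri, &attr) < 0) {
            /* The file may have been deleted on the host; keep the cached values
             * and report no error, since open handles still expect fstat to work. */
            return;
        }

        lock(&inode->lock);
        inode->atime = attr.atime; /* hypothetical new PAL_STREAM_ATTR fields */
        inode->mtime = attr.mtime;
        inode->ctime = attr.ctime;
        unlock(&inode->lock);
    }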

We'll need to add atime/mtime/ctime to PAL_STREAM_ATTR for file objects:

typedef struct _PAL_STREAM_ATTR {
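
The addition could look roughly like this (existing fields elided; the new timestamp fields and their exact types are only an illustration):

    typedef struct _PAL_STREAM_ATTR {
        /* ... existing fields (handle_type, nonblocking, share_flags,
         *     pending_size, ...) stay as they are ... */

        /* hypothetical new fields: host timestamps for file objects,
         * in seconds since the epoch */
        uint64_t atime;
        uint64_t mtime;
        uint64_t ctime;
    } PAL_STREAM_ATTR;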

We'll also need to update this:

file_attrcopy(attr, &stat_buf);

In particular, we'll need to add more fields in this helper func:

    void file_attrcopy(PAL_STREAM_ATTR* attr, struct stat* stat) {
        attr->handle_type = file_stat_type(stat);
        attr->nonblocking = false;
        attr->share_flags = stat->st_mode & PAL_SHARE_MASK;
        attr->pending_size = stat->st_size;
    }
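
A sketch of the extended helper, again assuming the hypothetical new timestamp fields in PAL_STREAM_ATTR:

    /* Sketch only: also copy the host timestamps into the (hypothetically
     * extended) PAL_STREAM_ATTR. */
    void file_attrcopy(PAL_STREAM_ATTR* attr, struct stat* stat) {
        attr->handle_type = file_stat_type(stat);
        attr->nonblocking = false;
        attr->share_flags = stat->st_mode & PAL_SHARE_MASK;
        attr->pending_size = stat->st_size;

        /* hypothetical new fields */
        attr->atime = stat->st_atime;
        attr->mtime = stat->st_mtime;
        attr->ctime = stat->st_ctime;
    }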

NOTES:

  • We don't want to do this for Trusted Files -- these files are not supposed to be updatable anyway. Maybe Trusted Files should get some synthetic/dummy ctime/mtime/atime values upon first open, i.e. the timestamp of when Gramine started.
  • We can do this for Allowed Files (the focus of this issue). But to learn that a particular file URI is an Allowed File, we can either go the slow way of checking the URI via get_trusted_or_allowed_file() on every call, or cache this info inside LibOS and only call PalStreamAttributesQuery() on Allowed Files (LibOS must have this switch logic; see the sketch after this list).
  • Encrypted Files would need a completely different approach -- ctime/mtime/atime should be part of the encrypted metadata. This could be left unimplemented, as it's not currently required.
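One possible shape for that switch (purely illustrative: the is_allowed_file flag, where it gets set, and maybe_refresh_host_times() from the sketch above are all hypothetical):

    /* Sketch only: cache "is this inode backed by an Allowed File?" once, e.g.
     * when the inode is created (by consulting get_trusted_or_allowed_file()),
     * and consult that cached flag on the fstat path. */
    static int generic_istat(struct libos_inode* inode, struct stat* buf) {
        if (inode->is_allowed_file) /* hypothetical new field */
            maybe_refresh_host_times(inode); /* slow path: query the host */

        /* ... then fill `buf` from the (possibly refreshed) cached inode,
         *     exactly as the current implementation does ... */
        return 0;
    }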
