[Metaissue] Memory allocators in Gramine, their performance and possible alternatives #1767

dimakuv · 2024-02-09T09:40:13Z

Description of the feature

Memory allocators in Gramine were written ~15 years ago. It's time to re-evaluate them and propose alternatives.

I want to start a discussion on this topic. In particular, the next posts will contain information on:

Why Gramine should implement it?

We see more and more performance bottlenecks in memory/object allocation inside Gramine itself. They all stem from the fact that our memory allocators are old, ad-hoc, not scalable, and not CPU- and cache-friendly; also our allocators do not perform object caching.

Just a couple of recent examples:

For possible alternatives (in Rust), see this comment: #1723 (review)

The text was updated successfully, but these errors were encountered:

dimakuv · 2024-02-09T11:02:09Z

Description of current memory allocators

There are two memory allocators in Gramine: MEMMGR and SLAB.

Both allocators rely on the following shared logic:

system_malloc() and system_free() are macros that are defined inside LibOS and in PAL
- these macros/functions are "back-end" allocations of page-sized contiguous memory regions; i.e. these are actual host memory allocations
- these macros/functions are used to dynamically grow (enlarge) and shrink (not really used!) the slabs/caches in MEMMGR and SLAB
- defined in LibOS like this:
  - generally:
    
    gramine/libos/src/libos_malloc.c
    
    Line 34 in e740728
    
    void* __system_malloc(size_t size) {
  - special case of VMAs:
    
    gramine/libos/src/bookkeep/libos_vma.c
    
    Line 315 in e740728
    
    static void* _vma_malloc(size_t size) {
- defined in PAL like this:
  
  gramine/pal/src/slab.c
  
  Line 28 in e740728
  
  static inline void* system_mem_alloc(size_t size) {

SYSTEM_LOCK() and SYSTEM_UNLOCK() are macros that are defined inside LibOS and in PAL

protect the specific MEMMGR "memory managers", each MEMMGR manager object has its own lock; this allows more parallelism as each LibOS subsystem that uses MEMMGR introduces its own lock

defined in LibOS like this:

for SLAB:

gramine/libos/src/libos_malloc.c

Lines 20 to 24 in e740728

    
           static struct libos_lock slab_mgr_lock; 
        
           #define SYSTEM_LOCK()   lock(&slab_mgr_lock) 
        
           #define SYSTEM_UNLOCK() unlock(&slab_mgr_lock) 
        
           #define SYSTEM_LOCKED() locked(&slab_mgr_lock)

for MEMMGR, each LibOS subsystem has its own lock, e.g. for FS operations:

gramine/libos/src/fs/libos_dcache.c

Lines 19 to 23 in e740728

    
           static struct libos_lock dcache_mgr_lock; 
        
           #define SYSTEM_LOCK()   lock(&dcache_mgr_lock) 
        
           #define SYSTEM_UNLOCK() unlock(&dcache_mgr_lock) 
        
           #define SYSTEM_LOCKED() locked(&dcache_mgr_lock)

defined in PAL like this (only SLAB):

gramine/pal/src/slab.c

Lines 15 to 19 in e740728

    
           static spinlock_t g_slab_mgr_lock = INIT_SPINLOCK_UNLOCKED; 
        
           #define SYSTEM_LOCK()   spinlock_lock(&g_slab_mgr_lock) 
        
           #define SYSTEM_UNLOCK() spinlock_unlock(&g_slab_mgr_lock) 
        
           #define SYSTEM_LOCKED() spinlock_is_locked(&g_slab_mgr_lock)

MEMMGR fixed-size allocator

Used to allocate specific objects in specific subsystems. Currently used only in LibOS.

Each subsystem of LibOS that uses MEMMGR specifies its own (global to the subsystem) lock. Thus, MEMMGR object allocs/frees in the same subsystem are synchronized on this lock, but object allocs/frees in different subsystems can run in parallel.

The current users of MEMMGR:

VMA subsystem (with some specific features, like disallowing automatic growth of MEMMGR underlying memory regions)
- see https://github.com/gramineproject/gramine/blob/master/libos/src/bookkeep/libos_vma.c
- managed objects: struct libos_vma
- ~~MEMMGR lock: struct libos_lock vma_mgr_lock~~ no MEMMGR-special lock, rather VMA does its own locking via struct libos_lock vma_mgr_lock
FS mounts
- see https://github.com/gramineproject/gramine/blob/master/libos/src/fs/libos_fs.c
- managed objects: struct libos_mount
- MEMMGR lock: struct libos_lock g_mount_mgr_lock
FS dentries
- see https://github.com/gramineproject/gramine/blob/master/libos/src/fs/libos_dcache.c
- managed objects: struct libos_dentry
- MEMMGR lock: libos_lock dcache_mgr_lock
LibOS handles
- see https://github.com/gramineproject/gramine/blob/master/libos/src/bookkeep/libos_handle.c
- managed objects: struct libos_handle
- MEMMGR lock: struct libos_lock handle_mgr_lock

Every managed object is wrapped into a MEM_OBJ_TYPE struct. This struct doesn't have additional fields (if not built with ASan), so it's the most compact representation possible. When the object is "freed" and its underlying memory is moved to the free list, MEM_OBJ_TYPE's list field is used instead.

Design and implementation are very simple:

There is a single "memory manager" object created for each LibOS subsystem.
This memory manager keeps two lists:
- available areas (disjoint allocated memory regions)
- free list of re-usable objects
Each subsystem chooses the initial size (number of objects) this manager allocates at startup. The first area is allocated which can host this initial number of objects. This first area is also marked as active (when the first area becomes full, the second area will be allocated and will be marked as active, and first area becomes inactive).
When the subsystem needs a new object (alloc), the manager first checks if its currently active area has any untouched memory left, and if yes, object is "allocated" in this memory. If no memory in active area, then the manager checks the free list, and if there is one free slot, object is "allocated" in this slot. If no free slots, then (if subsystem allowed it), the manager creates a new area, makes it active, and "allocates" the object at the base address of this new area.
When the subsystem frees the object, there are two possibilities:
- The to-be-freed object is inside the checkpoint memory region (i.e., the object was copied from the parent process). This corner case is not handled (checkpoint memory is never freed or reused), so the manager returns and the object is leaked.
- Otherwise, the object is added to the free list.

The MEMMGR memory managers are never "reset", or shrunk, or deleted. Thus, if LibOS allocated a lot of MEMMGR objects initially, and then freed them all, then this MEMMGR memory is leaked. This should be a very rare and unimportant case though.

Backend-memory (page-sized) allocation happens via __system_malloc() declared here: https://github.com/gramineproject/gramine/blob/master/libos/src/libos_malloc.c

Backend-memory (page-sized) deallocation, as mentioned above, doesn't really happen. But if it would, then it would be via __system_free() also declared here: https://github.com/gramineproject/gramine/blob/master/libos/src/libos_malloc.c

Support for ASan

Allocated objects have an 8-byte redzone (padding)
When object is freed, ASan poisons the "freed" memory

Open issues

Objects in migrated memory (in the child) leak because their "slots" are never re-used:

gramine/common/include/memmgr.h

Lines 248 to 251 in e740728

    
           static inline void free_mem_obj_to_mgr(MEM_MGR mgr, OBJ_TYPE* obj) { 
        
               if (memory_migrated(obj)) { 
        
                   return; 
        
               }

TODO: create a separate GitHub issue about this.

Names are bad, in particular get_mem_obj_from_mgr_enlarge(MEM_MGR mgr, size_t size) -- the size here is actually "by how many bytes to increase the pool of available memory if there is not free memory in areas and no free slots in the free list".
Something is fishy with size arguments in functions. These args are in bytes (at least that's what the callers assume), but function implementations seem to treat the argument as count. TODO: verify and fix this; we may have a memory leak here.

SLAB variable-size allocator

Generic backend for malloc and free in all other subsystems. Used both in LibOS and in PAL.

When any (random size) object needs to be allocated/freed in LibOS or in PAL, the traditional malloc() and free() calls are used. They are wrappers around slab_alloc() and slab_free(). See:

Backend-memory (page-sized) allocation and deallocation is implemented via:

for LibOS: __system_malloc() and __system_free() in https://github.com/gramineproject/gramine/blob/master/libos/src/libos_malloc.c
for PAL: system_mem_alloc() and system_mem_free() in https://github.com/gramineproject/gramine/blob/master/pal/src/slab.c

There is a single global slab manager and corresponding lock for LibOS and similarly a single global slab manager and corresponding lock for PAL. See these:

LibOS:
- gramine/libos/src/libos_malloc.c
  
  Line 20 in e740728
  
  static struct libos_lock slab_mgr_lock;
- gramine/libos/src/libos_malloc.c
  
  Line 31 in e740728
  
  static SLAB_MGR slab_mgr = NULL;
PAL:
- gramine/pal/src/slab.c
  
  Line 15 in e740728
  
  static spinlock_t g_slab_mgr_lock = INIT_SPINLOCK_UNLOCKED;
- gramine/pal/src/slab.c
  
  Line 63 in e740728
  
  static SLAB_MGR g_slab_mgr = NULL;

NOTE We have a struct libos_lock in LibOS. This lock is implemented via PalEventWait() and PalEventSet() which do have a fast path, but the slow path results in ocall_futex(), which is super-expensive in SGX. Maybe we could replace it with a spinlock()? Technically most of the time we'll hit the slab-allocator's cache, which is a fast operation; so it seems like no real need to ocall_futex() in this case.

Design and implementation is based on the MEMMGR allocator for the common case, and has a trivial fallback for the large-object case:

If the requested object's size is greater than the max allowed in the SLAB allocator's levels (currently ~2KB, but [common] Add encrypted-file-node-size slab into slab allocator #1763 changes it to ~8KB), then SLAB falls back to backend-memory allocation:

gramine/common/include/slabmgr.h

Lines 308 to 322 in f35d8e0

    
               if (level == (size_t)-1) { 
        
                   size = ALIGN_UP_POW2(size, MIN_MALLOC_ALIGNMENT); 
        
                   LARGE_MEM_OBJ mem = (LARGE_MEM_OBJ)system_malloc(sizeof(LARGE_MEM_OBJ_TYPE) + size); 
        
                   if (!mem) 
        
                       return NULL; 
        
                   mem->size = size; 
        
                   OBJ_LEVEL(mem) = (unsigned char)-1; 
        
           #ifdef ASAN 
        
                   asan_unpoison_region((uintptr_t)OBJ_RAW(mem), size); 
        
           #endif 
        
                   return OBJ_RAW(mem); 
        
               }

If the requested object's size is less than one of the SLAB allocator's levels, then SLAB chooses the highest level this object fits in (to pack objects as dense as possible) and then proceeds in the same way as MEMMGR.
- In other words, each SLAB level has the same rules as MEMMGR (search free space in active area, then fallback to free list, then fallback to enlarging the level by allocating a new area).

Deallocation happens similarly to the allocation description above:

Get the SLAB allocator's metadata for the to-be-freed object. Part of this metadata is the level value.

If level == -1, it means that the object's size is greater than the max allowed in SLAB, so it was allocated via backend-memory allocation, thus it must be deallocated via backend-memory free:

gramine/common/include/slabmgr.h

Lines 402 to 412 in f35d8e0

    
               if (level == (unsigned char)-1) { 
        
                   LARGE_MEM_OBJ mem = RAW_TO_OBJ(obj, LARGE_MEM_OBJ_TYPE); 
        
           #ifdef DEBUG 
        
                   _real_memset(obj, 0xCC, mem->size); 
        
           #endif 
        
           #ifdef ASAN 
        
                   asan_unpoison_region((uintptr_t)mem, mem->size + sizeof(LARGE_MEM_OBJ_TYPE)); 
        
           #endif 
        
                   system_free(mem, mem->size + sizeof(LARGE_MEM_OBJ_TYPE)); 
        
                   return; 
        
               }

Otherwise this object's memory is returned back to the SLAB allocator's level's free list:

gramine/common/include/slabmgr.h

Lines 440 to 443 in f35d8e0

    
           SYSTEM_LOCK(); 
        
           INIT_LIST_HEAD(mobj, __list); 
        
           LISTP_ADD_TAIL(mobj, &mgr->free_list[level], __list); 
        
           SYSTEM_UNLOCK();

Open issues

No realloc() implementation.
- I'm not sure what's the technical reason for not having it.
- If we provide realloc(), then we have many places in LibOS and PAL that could benefit from this function (currently they implement such realloc in ad-hoc ways combining malloc memcpy and free).
SLAB_CANARY is defined for LibOS, but not for PAL.
- This leads to the SLAB object metadata (header) being 32B in LibOS,
- and only 16B in PAL.

dimakuv mentioned this issue Feb 12, 2024

WIP [LibOS,PAL] Print stats of internal-memory allocators #1768

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Metaissue] Memory allocators in Gramine, their performance and possible alternatives #1767

[Metaissue] Memory allocators in Gramine, their performance and possible alternatives #1767

dimakuv commented Feb 9, 2024

dimakuv commented Feb 9, 2024 •

edited

[Metaissue] Memory allocators in Gramine, their performance and possible alternatives #1767

[Metaissue] Memory allocators in Gramine, their performance and possible alternatives #1767

Comments

dimakuv commented Feb 9, 2024

Description of the feature

Why Gramine should implement it?

dimakuv commented Feb 9, 2024 • edited

Description of current memory allocators

MEMMGR fixed-size allocator

Support for ASan

Open issues

SLAB variable-size allocator

Open issues

dimakuv commented Feb 9, 2024 •

edited