Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LibOS,PAL/Linux-SGX] Add EDMM lazy allocation support #1513

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

kailun-qin
Copy link
Contributor

@kailun-qin kailun-qin commented Aug 28, 2023

Description of the changes

The initial implementation of EDMM support pre-accepted every mmapped region of enclave pages. However in some cases, applications (e.g., Java runtime) may rely on lazy allocation of pages, where the VMAs are reserved but not actually committed to physical memory. In particular, mmap(..., MAP_NORESERVE) requests are used in such cases -- to mmap a huge chunk of memory (possibly never used in the future) at once and then commit pages on demand on page fault events. Due to the lack of lazy allocation concept in Gramine, this incurred performance and physical memory overhead on allocating and accepting the whole range of enclave pages.

The commit adds the lazy allocation support by deferring page accepts to page-fault events when Gramine notices MAP_NORESERVE flag in the mmap request. It also introduces a bit vector indicating the commit status of each enclave page and an upcall from the SGX EDMM backend into the LibOS to ask for additional info on a particular page.

Fixes #1099.
Fixes #1754.

How to test this PR?

Updated regression tests + CI.


This change is Reviewable

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 18 files at r1.
Reviewable status: 1 of 18 files reviewed, 4 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @kailun-qin)

a discussion (no related file):
Haven't done a deep review of this PR yet, but several high-level notes:

  1. Please add a paragraph about this optimization here: https://gramine.readthedocs.io/en/stable/devel/features.html#memory-management
  2. Can you run some benchmarks and report results? E.g., some Java app, and/or a microbenchmark that uses MAP_NORESERVE.
  3. Please rename PAL_PROT_NORESERVE to something more easily recognizable. I suggest PAL_PROT_LAZYALLOC.


libos/test/regression/munmap.c line 45 at r1 (raw file):

    test_mmap_munmap(PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE);

    test_mmap_munmap(PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE);

This test looks quite useless... What is the side effect of MAP_NORESERVE? I mean, functionality-wise we won't notice the effect of your PR in this test. Similarly, because it's just 3 pages, we won't notice the effect of your PR performance-wise -- it will be too fast to recognize the difference.

I understand that since this is a perf optimization, it can't be easily tested. But what about modifying large_mmap.c -- just add MAP_NORESERVE to that huge mmap syscall, and at least in manual testing, Gramine developers will notice a significant difference.

You also won't need to modify the generic manifest.template, instead you can modify only large_mmap.manifest.template.


pal/src/host/linux-sgx/pal_memory.c line 23 at r1 (raw file):

#include "spinlock.h"

enclave_page_tracker_t* g_enclave_page_tracker = NULL;

I would prefer to have a top-level comment about this tracker, and why it is needed. I think you can steal bits of info from the commit message and also from the original GitHub issue.

In particular, it is important to explain that this tracker is used purely for EDMM lazy allocation, and is required in the memory-free flows (otherwise we don't know whether the enclave page was already EACCEPTed). It is also important to explain that other info, like the enclave-page permissions (required in the #PF flow) are taken from the LibOS VMA subsystem, for which we use a special upcall.


pal/src/host/linux-sgx/pal_memory.c line 29 at r1 (raw file):

#define ENCLAVE_PAGE_LOCK()   spinlock_lock(&g_enclave_page_lock)
#define ENCLAVE_PAGE_UNLOCK() spinlock_unlock(&g_enclave_page_lock)
#define ENCLAVE_PAGE_LOCKED() spinlock_is_locked(&g_enclave_page_lock)

What's the point in these three macros? Why not just use the spinlock_xxx() functions?

Copy link
Contributor Author

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 23 files reviewed, 4 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)

a discussion (no related file):
Thanks for the feedbacks!

I suppose this should be done (in batch for all new features) before each release? But added anyway.

  • Can you run some benchmarks and report results? E.g., some Java app, and/or a microbenchmark that uses MAP_NORESERVE.

Sure, will do.

  • Please rename PAL_PROT_NORESERVE to something more easily recognizable. I suggest PAL_PROT_LAZYALLOC.

Done.



pal/src/host/linux-sgx/pal_memory.c line 23 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I would prefer to have a top-level comment about this tracker, and why it is needed. I think you can steal bits of info from the commit message and also from the original GitHub issue.

In particular, it is important to explain that this tracker is used purely for EDMM lazy allocation, and is required in the memory-free flows (otherwise we don't know whether the enclave page was already EACCEPTed). It is also important to explain that other info, like the enclave-page permissions (required in the #PF flow) are taken from the LibOS VMA subsystem, for which we use a special upcall.

Done.


pal/src/host/linux-sgx/pal_memory.c line 29 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

What's the point in these three macros? Why not just use the spinlock_xxx() functions?

Done.


libos/test/regression/munmap.c line 45 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This test looks quite useless... What is the side effect of MAP_NORESERVE? I mean, functionality-wise we won't notice the effect of your PR in this test. Similarly, because it's just 3 pages, we won't notice the effect of your PR performance-wise -- it will be too fast to recognize the difference.

I understand that since this is a perf optimization, it can't be easily tested. But what about modifying large_mmap.c -- just add MAP_NORESERVE to that huge mmap syscall, and at least in manual testing, Gramine developers will notice a significant difference.

You also won't need to modify the generic manifest.template, instead you can modify only large_mmap.manifest.template.

Done.

Copy link
Contributor Author

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 23 files reviewed, 4 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)


Documentation/devel/features.md line 1717 at r2 (raw file):

are lazily accepted on page-fault events. Note that the current implementation has one caveat that
when a process forks, the child process inherits the `MAP_NORESERVE` memory mappings of the parent
process, but the content isn't copied (via checkpoint-and-restore in Gramine).

This was not covered by the initial proposal. Let's dissuss it here. Thanks!

Code quote:

when a process forks, the child process inherits the `MAP_NORESERVE` memory mappings of the parent
process, but the content isn't copied (via checkpoint-and-restore in Gramine).

Copy link
Contributor

@llly llly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 18 files at r1, 8 of 12 files at r2, all commit messages.
Reviewable status: 14 of 23 files reviewed, 7 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @kailun-qin)


libos/src/bookkeep/libos_vma.c line 1151 at r2 (raw file):

    int ret = lookup_vma((void*)addr, &vma_info);
    if (ret < 0) {
        *out_prot_flags = 0;

Don't need to set out_prot_flags when vma not found.


pal/src/host/linux-sgx/pal_memory.c line 316 at r2 (raw file):

        if (consecutive_count > 0) {
            /* invoke the `callback` on the consecutive pages */
            ret = callback(consecutive_start_addr, consecutive_count, arg);

Is callback really needed?
I can see direct functions can be called according to walk_set. When false, sgx_edmm_remove_pages() unset_enclave_addr_range(); when true, sgx_edmm_add_pages() set_enclave_addr_range().

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 14 of 23 files reviewed, 10 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin)


libos/src/libos_checkpoint.c line 221 at r2 (raw file):

        pal_prot_flags_t mem_prot = entry->prot;

        if (entry->dummy || mem_prot & PAL_PROT_LAZYALLOC) {

Actually, why do you skip MAP_NORESERVE regions on fork?

If you would remove these tricks, you would end up with much worse performance (because to copy such region on EDMM, there will be a ton of #PFs). But at least the correctness will be preserved, right? Also, what if we are on gramine-direct or SGX without EDMM? Looks like you're breaking other backends, just to improve performance for SGX EDMM?


libos/src/libos_checkpoint.c line 330 at r2 (raw file):

            }

            if (prot & PAL_PROT_LAZYALLOC) {

Why would you even get into this IF condition? Looks like this is deadcode, because send_memory_on_stream() skips such memory regions.

If it's indeed deadcode, I suggest to replace this thing with simply assert(!(prot & PAL_PROT_LAZYALLOC)).


libos/src/bookkeep/libos_vma.c line 1458 at r2 (raw file):

                DO_CP_SIZE(memory, vma->addr, vma->length, &mem);
                /* Propogate `MAP_NORESERVE` flag if it's set. */
                mem->prot = LINUX_PROT_TO_PAL(vma->prot, vma->flags & MAP_NORESERVE);

I don't understand all these tricks with checkpoint/restore of MAP_NORESERVE. Can you explain all these flows?

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 18 files at r1, 3 of 12 files at r2.
Reviewable status: 20 of 23 files reviewed, 17 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin)


-- commits line 19 at r2:
We will need to add a note on the current handling of MAP_NORESERVE VMAs during fork.


Documentation/devel/features.md line 1717 at r2 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

This was not covered by the initial proposal. Let's dissuss it here. Thanks!

Left my comments in different places. General feedback: I think we need to fix it.


libos/src/bookkeep/libos_vma.c line 1458 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I don't understand all these tricks with checkpoint/restore of MAP_NORESERVE. Can you explain all these flows?

Ok, I thought about this problem of having only a subset of pages committed for MAP_NORESERVE VMAs.

I think we should not cut corners (or unnecessarily worsen performance) and we should send only committed memory regions. For this, we'll need to query the PAL to learn which pages from the VMA were committed, and create as many struct libos_mem_entry objects as there are committed pages. My proposed approach would look smth like this:

if (vma->flags & MAP_NORESERVE) {
    /* lazy allocation of pages, send only committed pages */
    char* bitvector = malloc(vma->length / 4096 / 8);
    PalGetCommittedPages(vma->addr, vma->length, bitvector);
    for each bit in bitvector {
        if (bit is set) {
            struct libos_mem_entry* mem;
            DO_CP_SIZE(memory, vma->addr + page_addr(bit), 4096, &mem);
            mem->prot = LINUX_PROT_TO_PAL(vma->prot, /*map_flags=*/0);
        }
    }
} else {
    /* no lazy allocation of pages, send the whole memory region of VMA */
    struct libos_mem_entry* mem;
    DO_CP_SIZE(memory, vma->addr, vma->length, &mem);
    mem->prot = LINUX_PROT_TO_PAL(vma->prot, /*map_flags=*/0);
}

Several notes here:

  1. We will need to introduce a new PalGetCommittedPages() API
    1. Returns all-ones for non-SGX non-EDMM PALs
    2. Returns a populated bitvector slice for SGX EDMM PAL
  2. Each committed page will be sent separately -- this may be sub-par in terms of performance, but good enough for the initial version. If we ever find it to be a bottleneck, we'll optimize to send memory regions, not single pages.

pal/src/host/linux-sgx/pal_main.c line 631 at r2 (raw file):

    init_slab_mgr();

    /* initialize the enclave page tracker as soon as we Initialized the slab memory allocator */

Initialized -> use small letter, not capital letter


pal/src/host/linux-sgx/pal_memory.c line 24 at r2 (raw file):

/*
 * Global enclave page tracker purely used for EDMM lazy allocation (based on a bitmap vector).

purely used -> used


pal/src/host/linux-sgx/pal_memory.c line 27 at r2 (raw file):

 *
 * This is required for the memory-free flows, where we have to know whether the to-be-freed pages
 * were already EACCEPTed (so that we need to remove them) or not not (so that we can simply skip

not not -> not


pal/src/host/linux-sgx/pal_memory.c line 33 at r2 (raw file):

 *
 * Besides, for the additional info required in the #PF flow (e.g., enclave-page permissions), it's
 * taken from the LibOS VMA subsystem, for which we use a special upcall.

I would reformulate this last paragraph to:

We don't use this page tracker to store info required in the #PF flows (e.g. enclave-page permissions).
Instead, we get this info from the LibOS VMA subsystem, for which we use a special upcall, see g_mem_bkeep_get_vma_info_upcall.

pal/src/host/linux-sgx/pal_memory.c line 329 at r2 (raw file):

                   int (*callback)(uintptr_t, size_t, void*), void* arg) {
    return walk_pages(start_addr, count, true, callback, arg);
}

This function is overcomplicated. (A similar concern that @llly raised.)

Why not replace it with:

int remove_commited_pages(uintptr_t start_addr, size_t count) {
    walk_pages(start_addr, count, /*walk_set=*/true, sgx_edmm_remove_pages_callback, NULL);
}

int add_uncommitted_pages(uintptr_t start_addr, size_t count,  uint64_t prot_flags) {
    walk_pages(start_addr, count, /*walk_set=*/false, sgx_edmm_add_pages_callback, &prot_flags);
}

Otherwise you have too many levels of indirection, and it takes a while to untangle this sequence of funcs...


pal/src/host/linux-sgx/pal_memory.c line 335 at r2 (raw file):

                     int (*callback)(uintptr_t, size_t, void*), void* arg) {
    return walk_pages(start_addr, count, false, callback, arg);
}

Please move all these funcs to sgx_edmm.c file, as these funcs are only used for EDMM flows, and are not part of the generic PAL functions.


pal/src/host/linux-sgx/pal_sgx.h line 3 at r2 (raw file):

/* SPDX-License-Identifier: LGPL-3.0-or-later */
/* Copyright (C) 2022 Intel Corporation
 *                    Borys Popławski <borysp@invisiblethingslab.com>

You can add your copyright, btw, since the changes you do are big.


pal/src/host/linux-sgx/pal_sgx.h line 8 at r2 (raw file):

#include "pal.h"
#include "sgx_arch.h"
#include "spinlock.h"

This include won't be needed if you remove the lock field


pal/src/host/linux-sgx/pal_sgx.h line 16 at r2 (raw file):

}

typedef struct {

Could you add a top-level comment with a bit of arithmetic -- how much memory will this data bitvector occupy if the enclave size is 1GB, 128GB, 1TB?

If my calculations are correct, then for a 1TB enclave, this bitvector will occupy 32MB, or 0.003%. Which is reasonable.


pal/src/host/linux-sgx/pal_sgx.h line 19 at r2 (raw file):

    unsigned char* data;    /* bit array to store page allocation status */
    uintptr_t base_address; /* base address of the enclave memory region */
    size_t size;            /* number of pages in the memory region */

Comments for base_address and size don't seem to be true? It's really about the whole enclave memory address space and the whole enclave size, not about some "memory region".


pal/src/host/linux-sgx/pal_sgx.h line 21 at r2 (raw file):

    size_t size;            /* number of pages in the memory region */
    size_t page_size;       /* size of each page in bytes */
    spinlock_t lock;        /* lock for the page tracker */

Unused field?

Copy link
Contributor Author

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 10 of 29 files reviewed, 17 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "squash! " and "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @llly)


-- commits line 19 at r2:

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

We will need to add a note on the current handling of MAP_NORESERVE VMAs during fork.

Done.


libos/src/bookkeep/libos_vma.c line 1151 at r2 (raw file):

Previously, llly (Li Xun) wrote…

Don't need to set out_prot_flags when vma not found.

Done.


libos/src/bookkeep/libos_vma.c line 1458 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Ok, I thought about this problem of having only a subset of pages committed for MAP_NORESERVE VMAs.

I think we should not cut corners (or unnecessarily worsen performance) and we should send only committed memory regions. For this, we'll need to query the PAL to learn which pages from the VMA were committed, and create as many struct libos_mem_entry objects as there are committed pages. My proposed approach would look smth like this:

if (vma->flags & MAP_NORESERVE) {
    /* lazy allocation of pages, send only committed pages */
    char* bitvector = malloc(vma->length / 4096 / 8);
    PalGetCommittedPages(vma->addr, vma->length, bitvector);
    for each bit in bitvector {
        if (bit is set) {
            struct libos_mem_entry* mem;
            DO_CP_SIZE(memory, vma->addr + page_addr(bit), 4096, &mem);
            mem->prot = LINUX_PROT_TO_PAL(vma->prot, /*map_flags=*/0);
        }
    }
} else {
    /* no lazy allocation of pages, send the whole memory region of VMA */
    struct libos_mem_entry* mem;
    DO_CP_SIZE(memory, vma->addr, vma->length, &mem);
    mem->prot = LINUX_PROT_TO_PAL(vma->prot, /*map_flags=*/0);
}

Several notes here:

  1. We will need to introduce a new PalGetCommittedPages() API
    1. Returns all-ones for non-SGX non-EDMM PALs
    2. Returns a populated bitvector slice for SGX EDMM PAL
  2. Each committed page will be sent separately -- this may be sub-par in terms of performance, but good enough for the initial version. If we ever find it to be a bottleneck, we'll optimize to send memory regions, not single pages.

Done.


pal/src/host/linux-sgx/pal_main.c line 631 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Initialized -> use small letter, not capital letter

Done.


pal/src/host/linux-sgx/pal_memory.c line 24 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

purely used -> used

Done.


pal/src/host/linux-sgx/pal_memory.c line 27 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

not not -> not

Done.


pal/src/host/linux-sgx/pal_memory.c line 33 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I would reformulate this last paragraph to:

We don't use this page tracker to store info required in the #PF flows (e.g. enclave-page permissions).
Instead, we get this info from the LibOS VMA subsystem, for which we use a special upcall, see g_mem_bkeep_get_vma_info_upcall.

Done.


pal/src/host/linux-sgx/pal_memory.c line 316 at r2 (raw file):

Previously, llly (Li Xun) wrote…

Is callback really needed?
I can see direct functions can be called according to walk_set. When false, sgx_edmm_remove_pages() unset_enclave_addr_range(); when true, sgx_edmm_add_pages() set_enclave_addr_range().

Done.


pal/src/host/linux-sgx/pal_memory.c line 329 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This function is overcomplicated. (A similar concern that @llly raised.)

Why not replace it with:

int remove_commited_pages(uintptr_t start_addr, size_t count) {
    walk_pages(start_addr, count, /*walk_set=*/true, sgx_edmm_remove_pages_callback, NULL);
}

int add_uncommitted_pages(uintptr_t start_addr, size_t count,  uint64_t prot_flags) {
    walk_pages(start_addr, count, /*walk_set=*/false, sgx_edmm_add_pages_callback, &prot_flags);
}

Otherwise you have too many levels of indirection, and it takes a while to untangle this sequence of funcs...

Done.


pal/src/host/linux-sgx/pal_memory.c line 335 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please move all these funcs to sgx_edmm.c file, as these funcs are only used for EDMM flows, and are not part of the generic PAL functions.

Done.


pal/src/host/linux-sgx/pal_sgx.h line 3 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

You can add your copyright, btw, since the changes you do are big.

Done.


pal/src/host/linux-sgx/pal_sgx.h line 8 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This include won't be needed if you remove the lock field

Done.


pal/src/host/linux-sgx/pal_sgx.h line 16 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Could you add a top-level comment with a bit of arithmetic -- how much memory will this data bitvector occupy if the enclave size is 1GB, 128GB, 1TB?

If my calculations are correct, then for a 1TB enclave, this bitvector will occupy 32MB, or 0.003%. Which is reasonable.

Done.


pal/src/host/linux-sgx/pal_sgx.h line 19 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Comments for base_address and size don't seem to be true? It's really about the whole enclave memory address space and the whole enclave size, not about some "memory region".

Done.


pal/src/host/linux-sgx/pal_sgx.h line 21 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Unused field?

Yeah, left over. Removed.


libos/src/libos_checkpoint.c line 221 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Actually, why do you skip MAP_NORESERVE regions on fork?

If you would remove these tricks, you would end up with much worse performance (because to copy such region on EDMM, there will be a ton of #PFs). But at least the correctness will be preserved, right? Also, what if we are on gramine-direct or SGX without EDMM? Looks like you're breaking other backends, just to improve performance for SGX EDMM?

Not relevant any more.


libos/src/libos_checkpoint.c line 330 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why would you even get into this IF condition? Looks like this is deadcode, because send_memory_on_stream() skips such memory regions.

If it's indeed deadcode, I suggest to replace this thing with simply assert(!(prot & PAL_PROT_LAZYALLOC)).

Not relevant any more.

@kailun-qin kailun-qin force-pushed the kailun-qin/edmm-lazy-alloc branch 5 times, most recently from a2f3548 to b0fc282 Compare September 2, 2023 14:07
Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 18 files at r1, 2 of 12 files at r2, 11 of 18 files at r3, 1 of 5 files at r5, all commit messages.
Reviewable status: 21 of 29 files reviewed, 6 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "squash! " found in commit messages' one-liners (waiting on @kailun-qin and @llly)


libos/src/bookkeep/libos_vma.c line 1463 at r5 (raw file):

                    int ret = PalGetCommittedPages((uintptr_t)vma->addr, vma->length, bitvector,
                                                   &bv_size, &start_idx);

Meh, this function signature is complicated. I suggest to return a bitvector that reflects pages starting from vma->addr, instead of the current bitvector that contains pages with an offset.

Let's start with something simple and easy to understand (even if not super-performant). If really needed, we will modify this in follow up PRs.


libos/src/bookkeep/libos_vma.c line 1470 at r5 (raw file):

                    size_t byte_idx = 0;
                    while (byte_idx < bv_size) {

Why not a for loop?


libos/src/bookkeep/libos_vma.c line 1473 at r5 (raw file):

                        unsigned char byte = bitvector[byte_idx];
                        size_t bit_idx = 0;
                        while (bit_idx < 8) {

Why not a for loop?


pal/include/pal/pal.h line 1014 at r5 (raw file):

 * \param[in,out] bitvector    On success, will contain the commit status of the pages in the memory
 *                             area.
 * \param[in,out] bv_size      On success, will contain the actual size of the bitvector.

Please rename to bitvector_size, for readability.


pal/include/pal/pal.h line 1015 at r5 (raw file):

 *                             area.
 * \param[in,out] bv_size      On success, will contain the actual size of the bitvector.
 * \param[out]    out_bv_index On success, will contain the starting index of the corresponding

What's the point in this starting index? This looks like an optimization, but why overcomplicate this PAL function? You could calculate the same in the LibOS.

I feel like this is a premature optimization, and I would remove this arg.

I think the out logic of bv_size already covers a common case of MAP_NORESRERVE area having zero committed pages.


pal/include/pal/pal.h line 1018 at r5 (raw file):

 *                             address in the bitvector.
 *
 * Returns all-ones for non-SGX non-EDMM PALs; returns a populated bitvector slice for SGX EDMM PAL.

Please also add that this func is currently used for checkpoint-and-restore logic (to learn which memory areas need to be sent to the child process, as a perf optimization specific for SGX EDMM).

Copy link
Contributor Author

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 15 of 29 files reviewed, 6 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @llly)


libos/src/bookkeep/libos_vma.c line 1463 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Meh, this function signature is complicated. I suggest to return a bitvector that reflects pages starting from vma->addr, instead of the current bitvector that contains pages with an offset.

Let's start with something simple and easy to understand (even if not super-performant). If really needed, we will modify this in follow up PRs.

Done.


libos/src/bookkeep/libos_vma.c line 1470 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why not a for loop?

Done.


libos/src/bookkeep/libos_vma.c line 1473 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why not a for loop?

Done.


pal/include/pal/pal.h line 1014 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please rename to bitvector_size, for readability.

Done.


pal/include/pal/pal.h line 1015 at r5 (raw file):

What's the point in this starting index? This looks like an optimization

Yes, it's for perf considerations. Updated.


pal/include/pal/pal.h line 1018 at r5 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please also add that this func is currently used for checkpoint-and-restore logic (to learn which memory areas need to be sent to the child process, as a perf optimization specific for SGX EDMM).

Done.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 12 files at r2, 2 of 18 files at r3, 9 of 9 files at r6, all commit messages.
Reviewable status: 27 of 29 files reviewed, 21 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @llly)


libos/src/bookkeep/libos_vma.c line 1456 at r6 (raw file):

                if (vma->flags & MAP_NORESERVE) {
                    /* lazy allocation of pages, send only committed pages */
                    size_t bitvector_size = ((vma->length + PAGE_SIZE - 1) / PAGE_SIZE + 7) / 8;

What's this calculation?

I think this is more readable:

size_t bitvector_size = ALIGN_UP(ALIGN_UP(vma->length, PAGE_SIZE) / PAGE_SIZE, 8) / 8;

pal/include/pal/pal.h line 1012 at r6 (raw file):

 * \param         addr           Starting address of the memory area.
 * \param         size           Size of the memory area.
 * \param[in,out] bitvector      On success, will contain the commit status of the pages in the memory

Please rewrap to 100 chars per line


pal/src/host/linux/pal_misc.c line 89 at r6 (raw file):

    assert(bitvector_size);

    size_t num_pages = (size + g_page_size - 1) / g_page_size;

Please use ALIGN_UP()


pal/src/host/linux/pal_misc.c line 91 at r6 (raw file):

    size_t num_pages = (size + g_page_size - 1) / g_page_size;
    size_t end_page = num_pages - 1;
    size_t num_bytes = (num_pages + 7) / 8;

Please use ALIGN_UP()


pal/src/host/linux/pal_misc.c line 102 at r6 (raw file):

        /* clear the leading bits in the last byte of the slice */
        bitvector[num_bytes - 1] &= (0xFF >> (7 - (end_page % 8)));
    }

I can't parse this logic. Aren't you supposed to just set the last N pages in the last byte?

Isn't something like this not enough:

leftover_pages = num_pages % 8;
if (leftover_pages)
    bitvector[num_bytes - 1] = (1 << leftover_pages) - 1;

pal/src/host/linux-sgx/enclave_edmm.c line 24 at r6 (raw file):

 * Global enclave page tracker used for EDMM lazy allocation (based on a bitmap vector).
 *
 * This is required for the memory-free flows, where we have to know whether the to-be-freed pages

memory-free flows -> "memory free()" flows

(memory-free sounds weird, like "flows that do not use memory")


pal/src/host/linux-sgx/enclave_edmm.c line 36 at r6 (raw file):

enclave_page_tracker_t* g_enclave_page_tracker = NULL;

static spinlock_t g_enclave_page_lock = INIT_SPINLOCK_UNLOCKED;

Can we rename to g_enclave_page_tracker_lock


pal/src/host/linux-sgx/enclave_edmm.c line 197 at r6 (raw file):

    enclave_page_tracker_t* tracker = malloc(sizeof(enclave_page_tracker_t));
    if (!tracker)
        INIT_FAIL("cannot initialize enclave page tracker");

initialize -> allocate


pal/src/host/linux-sgx/enclave_edmm.c line 198 at r6 (raw file):

    if (!tracker)
        INIT_FAIL("cannot initialize enclave page tracker");
    tracker->data = (unsigned char*)calloc((num_pages + 7) / 8, sizeof(unsigned char));

Please use ALIGN_UP


pal/src/host/linux-sgx/enclave_edmm.c line 200 at r6 (raw file):

    tracker->data = (unsigned char*)calloc((num_pages + 7) / 8, sizeof(unsigned char));
    if (!tracker->data)
        INIT_FAIL("cannot initialize enclave page tracker data");

initialize -> allocate


pal/src/host/linux-sgx/enclave_edmm.c line 214 at r6 (raw file):

    assert(!g_enclave_page_tracker);

    size_t num_pages = (memory_size + page_size - 1) / page_size;

Isn't memory_size guaranteed to be page-aligned? Please add an assert on this and remove this complicated arithmetic.


pal/src/host/linux-sgx/enclave_edmm.c line 273 at r6 (raw file):

/* get a copy of bitvector slice according to the memory address and the size in bytes; also give
 * the starting index of the corresponding address in the slice */

also give the starting index ... -- stale comment?


pal/src/host/linux-sgx/enclave_edmm.c line 277 at r6 (raw file):

                        size_t* bitvector_size) {
    size_t start_page = address_to_index(addr);
    size_t num_pages = (size + g_enclave_page_tracker->page_size - 1) /

Please use ALIGN_UP() and other macros (if needed) in this code


pal/src/host/linux-sgx/enclave_edmm.c line 296 at r6 (raw file):

            unsigned char val_next = g_enclave_page_tracker->data[start_byte + i + 1];
            bitvector[i] = ((val_next & (0xFF >> (8 - start_offset))) << (8 - start_offset)) |
                           ((val_cur & (0xFF << start_offset)) >> start_offset);

What is this madness. Please simplify!

Why not have three distinct parts in your code:

if (start_offset != 0)
    ... copy the highest bits in the starting byte ...

for (loop)
    ... copy each next byte entirely ...

if (end_offset != 0)
    ... copy the lowest bits in the ending byte ...

pal/src/host/linux-sgx/enclave_edmm.c line 306 at r6 (raw file):

        /* clear the leading bits in the last byte of the slice */
        bitvector[num_bytes - 1] &= (0xFF >> (7 - end_offset));
    }

Please see my comment for Linux PAL


pal/src/host/linux-sgx/enclave_edmm.c line 313 at r6 (raw file):

/* iterate over the given range of enclave pages in the tracker and perform the specified `callback`
 * on the consecutive set/unset pages; return error when `callback` failed */
static int walk_pages(uintptr_t start_addr, size_t count, bool walk_set,

walk_set -> walk_set_pages


pal/src/host/linux-sgx/pal_exception.c line 289 at r6 (raw file):

    if (ADDR_IN_PAL(uc->rip) && event_num != PAL_EVENT_QUIT &&
                                event_num != PAL_EVENT_INTERRUPTED &&
                                event_num != PAL_EVENT_MEMFAULT) {

In which case this can happen? What's the scenario when we are doing something in PAL (so that ADDR_IN_PAL(uc->rip) is true) and we hit this to-be-lazily-allocated page?


pal/src/host/linux-sgx/pal_exception.c line 346 at r6 (raw file):

            ret = _PalVirtualMemoryAlloc((void*)addr, g_page_size, prot_flags);
            if (ret < 0) {
                log_error("failed to allocate page at 0x%lx: %s", addr, pal_strerror(ret));

failed to allocate page -> failed to lazily allocate page


pal/src/host/linux-sgx/pal_exception.c line 350 at r6 (raw file):

            }
            goto out;
        }

Now if we are inside PAL and we hit a memfault on a NOT lazily-allocated page, we'll pass through this new logic and will end up calling the LibOS upcall -- which is wrong because LibOS shouldn't try to deal with inside-PAL faults.

See my other comment -- when is it really possible that while we're in PAL, a #PF happens?

If it's possible, then we'll need to add here a case of checking if (ADDR_IN_PAL(uc->rip)) and terminating on this case.


pal/src/host/linux-sgx/pal_memory.c line 29 at r6 (raw file):

    if (g_pal_linuxsgx_state.edmm_enabled) {
        /* defer page accepts to page-fault events when `MAP_NORESERVE` is set */

MAP_NORESERVE -> PAL_PROT_LAZYALLOC


pal/src/host/linux-sgx/pal_misc.c line 824 at r6 (raw file):

        return get_bitvector_slice(addr, size, bitvector, bitvector_size);
    } else {
        size_t num_pages = (size + g_page_size - 1) / g_page_size;

ditto (all comments from Linux PAL)

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 27 of 29 files reviewed, 22 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @llly)

a discussion (no related file):

Previously, kailun-qin (Kailun Qin) wrote…

Thanks for the feedbacks!

I suppose this should be done (in batch for all new features) before each release? But added anyway.

  • Can you run some benchmarks and report results? E.g., some Java app, and/or a microbenchmark that uses MAP_NORESERVE.

Sure, will do.

  • Please rename PAL_PROT_NORESERVE to something more easily recognizable. I suggest PAL_PROT_LAZYALLOC.

Done.

Please run some benchmarks and report results.

I'm afraid the complexity of this change sky-rocketed, and I will be unhappy if this all is for 1% performance improvement of 1% of use cases...


Copy link
Contributor Author

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 27 of 29 files reviewed, 22 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @llly)


pal/src/host/linux-sgx/enclave_edmm.c line 296 at r6 (raw file):

What is this madness. Please simplify!

This was a bit like implementing a variant of memcpy w/ offsets by bits support.

Why not have three distinct parts in your code:
...

This was my original impl. But we wanted to get rid of the start_offset (-- "I suggest to return a bitvector that reflects pages starting from vma->addr, instead of the current bitvector that contains pages with an offset"). How can your proposal achieve that?

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 27 of 29 files reviewed, 22 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @llly)


pal/src/host/linux-sgx/enclave_edmm.c line 296 at r6 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

What is this madness. Please simplify!

This was a bit like implementing a variant of memcpy w/ offsets by bits support.

Why not have three distinct parts in your code:
...

This was my original impl. But we wanted to get rid of the start_offset (-- "I suggest to return a bitvector that reflects pages starting from vma->addr, instead of the current bitvector that contains pages with an offset"). How can your proposal achieve that?

Oh, I think I know what you mean. What you're saying is that LibOS (the caller) can issue PalGetCommittedPages(/*addr=*/4096, /*size=*/1TB, &bitvector).

In this case, bitvector[0][bit 0] must contain g_enclave_page_tracker->data[0][bit 1], and so on -- all the bits will be with the offset, and we can't just copy-paste byte-by-byte.

@kailun-qin Given this complexity, could you split this code into helper functions, so that it's clear how things work? The current code has too much bit-twiddling magic...

Copy link
Contributor Author

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 20 of 29 files reviewed, 22 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @llly)

a discussion (no related file):

Please run some benchmarks and report results.

I tested our openjdk example.

Without this PR (release build, commit b984089, EDMM enabled):

$ time gramine-sgx java -Xmx8G MultiThreadMain
...
real    33m27.868s
user    9m30.793s
sys     25m20.908s

With this PR:

$ time gramine-sgx java -Xmx8G MultiThreadMain
...
real    0m34.768s
user    0m18.902s
sys     0m21.610s


libos/src/bookkeep/libos_vma.c line 1456 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

What's this calculation?

I think this is more readable:

size_t bitvector_size = ALIGN_UP(ALIGN_UP(vma->length, PAGE_SIZE) / PAGE_SIZE, 8) / 8;

Done.


pal/include/pal/pal.h line 1012 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please rewrap to 100 chars per line

Done.


pal/src/host/linux/pal_misc.c line 89 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please use ALIGN_UP()

Done.


pal/src/host/linux/pal_misc.c line 91 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please use ALIGN_UP()

Done.


pal/src/host/linux/pal_misc.c line 102 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I can't parse this logic. Aren't you supposed to just set the last N pages in the last byte?

Isn't something like this not enough:

leftover_pages = num_pages % 8;
if (leftover_pages)
    bitvector[num_bytes - 1] = (1 << leftover_pages) - 1;

Done


pal/src/host/linux-sgx/enclave_edmm.c line 24 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

memory-free flows -> "memory free()" flows

(memory-free sounds weird, like "flows that do not use memory")

Done.


pal/src/host/linux-sgx/enclave_edmm.c line 36 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Can we rename to g_enclave_page_tracker_lock

Done.


pal/src/host/linux-sgx/enclave_edmm.c line 197 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

initialize -> allocate

Done.


pal/src/host/linux-sgx/enclave_edmm.c line 198 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please use ALIGN_UP

Done.


pal/src/host/linux-sgx/enclave_edmm.c line 200 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

initialize -> allocate

Done.


pal/src/host/linux-sgx/enclave_edmm.c line 214 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Isn't memory_size guaranteed to be page-aligned? Please add an assert on this and remove this complicated arithmetic.

Done.


pal/src/host/linux-sgx/enclave_edmm.c line 273 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

also give the starting index ... -- stale comment?

Done.


pal/src/host/linux-sgx/enclave_edmm.c line 277 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please use ALIGN_UP() and other macros (if needed) in this code

Done.


pal/src/host/linux-sgx/enclave_edmm.c line 296 at r6 (raw file):

In this case, bitvector[0][bit 0] must contain g_enclave_page_tracker->data[0][bit 1], and so on -- all the bits will be with the offset, and we can't just copy-paste byte-by-byte.

Yes, exactly. Splitted it to a helper function.


pal/src/host/linux-sgx/enclave_edmm.c line 306 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Please see my comment for Linux PAL

Done.


pal/src/host/linux-sgx/enclave_edmm.c line 313 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

walk_set -> walk_set_pages

Done.


pal/src/host/linux-sgx/pal_exception.c line 289 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

In which case this can happen? What's the scenario when we are doing something in PAL (so that ADDR_IN_PAL(uc->rip) is true) and we hit this to-be-lazily-allocated page?

I think this was a leftover, as we switched to the new approach for MAP_NORESERVE checkpointing, it should not be needed anymore. Reverted.


pal/src/host/linux-sgx/pal_exception.c line 346 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

failed to allocate page -> failed to lazily allocate page

Done.


pal/src/host/linux-sgx/pal_exception.c line 350 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Now if we are inside PAL and we hit a memfault on a NOT lazily-allocated page, we'll pass through this new logic and will end up calling the LibOS upcall -- which is wrong because LibOS shouldn't try to deal with inside-PAL faults.

See my other comment -- when is it really possible that while we're in PAL, a #PF happens?

If it's possible, then we'll need to add here a case of checking if (ADDR_IN_PAL(uc->rip)) and terminating on this case.

Not relevant anymore.


pal/src/host/linux-sgx/pal_memory.c line 29 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

MAP_NORESERVE -> PAL_PROT_LAZYALLOC

Done.


pal/src/host/linux-sgx/pal_misc.c line 824 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

ditto (all comments from Linux PAL)

Done.

Copy link
Contributor

@llly llly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 20 of 29 files reviewed, 25 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @kailun-qin)


pal/src/host/linux-sgx/enclave_edmm.c line 144 at r7 (raw file):

        return ret;

    set_enclave_addr_range((uintptr_t)addr, count);

casting (uintptr_t) not needed.


pal/src/host/linux-sgx/enclave_edmm.c line 154 at r7 (raw file):

        return ret;

    unset_enclave_addr_range((uintptr_t)addr, count);

ditto.


pal/src/host/linux-sgx/pal_files.c line 279 at r7 (raw file):

        /* Enclave pages will be written to below, so we must add W permission. */
        uint64_t prot_flags = PAL_TO_SGX_PROT(prot | PAL_PROT_WRITE);
        int ret = add_uncommitted_pages((uintptr_t)addr, size / PAGE_SIZE, prot_flags);

int is not needed.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 18 files at r1, 1 of 4 files at r4, 1 of 5 files at r5, 7 of 7 files at r7, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin)

a discussion (no related file):

Previously, kailun-qin (Kailun Qin) wrote…

Please run some benchmarks and report results.

I tested our openjdk example.

Without this PR (release build, commit b984089, EDMM enabled):

$ time gramine-sgx java -Xmx8G MultiThreadMain
...
real    33m27.868s
user    9m30.793s
sys     25m20.908s

With this PR:

$ time gramine-sgx java -Xmx8G MultiThreadMain
...
real    0m34.768s
user    0m18.902s
sys     0m21.610s

Wow, this is a great boost. The PR clearly has value. Thanks for these numbers.



libos/test/regression/large_mmap.c line 2 at r7 (raw file):

#define _GNU_SOURCE
#include <err.h>

Sorry, could you instead introduce a new test file?

I initially thought it would be a good idea to put MAP_NORESERVE in the already-existing file, but now I see that you modified this test significantly, and there is no benefit in cramming two independent tests in a single file...

Let's just revert the changes here, and instead will create a separate test. Otherwise it's very hard to review.


pal/src/pal_memory.c line 63 at r7 (raw file):

static int (*g_mem_bkeep_alloc_upcall)(size_t size, uintptr_t* out_addr) = NULL;
static int (*g_mem_bkeep_free_upcall)(uintptr_t addr, size_t size) = NULL;
int (*g_mem_bkeep_get_vma_info_upcall)(uintptr_t addr, pal_prot_flags_t* out_prot_flags) = NULL;

Why is it not static like the others?


pal/src/host/linux-sgx/enclave_edmm.c line 283 at r7 (raw file):

        unsigned char val_cur = src_bitvector[0];
        for (size_t i = 0; i < num_bytes; i++) {
            unsigned char val_next = src_bitvector[i + 1];

You have a buffer overread here.

You need to have i < num_bytes - 1 in the for loop condition then, and take special treatment of the very last byte in the source bitvector.


pal/src/host/linux-sgx/enclave_edmm.c line 285 at r7 (raw file):

            unsigned char val_next = src_bitvector[i + 1];
            dest_bitvector[i] = ((val_next & (0xFF >> (8 - offset))) << (8 - offset)) |
                                ((val_cur & (0xFF << offset)) >> offset);

Ok, I finally understand what happens here :)


pal/src/host/linux-sgx/enclave_edmm.c line 313 at r7 (raw file):

    size_t leftover_pages = num_pages % 8;
    if (leftover_pages)
        bitvector[num_bytes - 1] &= (1 << leftover_pages) - 1;

This logic is wrong? Here you unconditionally set bits to 1 for all leftover pages. But you actually should get the information (whether to set or unset each bit) from the g_enclave_page_tracker->data[] array.

Or am I misunderstanding something?

Copy link
Contributor Author

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 8 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)


pal/src/pal_memory.c line 63 at r7 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why is it not static like the others?

Because we need to invoke this call in the exception handling code:

ret = g_mem_bkeep_get_vma_info_upcall(addr, &prot_flags);
?


pal/src/host/linux-sgx/enclave_edmm.c line 283 at r7 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

You have a buffer overread here.

You need to have i < num_bytes - 1 in the for loop condition then, and take special treatment of the very last byte in the source bitvector.

hmm... But there is no lengh of buffer specified (so no sense of overreading). Actually, when offset != 0, the length of src_bitvector would actually be larger thandest_bitvector. As num_bytes is the number of bytes to copy from src_bitvector so we will need to copy (partial) bits in src_bitvector[num_bytes] into dest[num_bytes - 1] (which is exactly the "special treatment of the very last byte in the source bitvector").


pal/src/host/linux-sgx/enclave_edmm.c line 313 at r7 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This logic is wrong? Here you unconditionally set bits to 1 for all leftover pages. But you actually should get the information (whether to set or unset each bit) from the g_enclave_page_tracker->data[] array.

Or am I misunderstanding something?

But this is &= not =, i.e., in the last byte of the dest/copied bitvector, we only keep the bits for the leftover pages?

Copy link
Contributor Author

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 23 of 34 files reviewed, 8 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @llly)


pal/src/host/linux-sgx/enclave_edmm.c line 144 at r7 (raw file):

Previously, llly (Li Xun) wrote…

casting (uintptr_t) not needed.

Done.


pal/src/host/linux-sgx/enclave_edmm.c line 154 at r7 (raw file):

Previously, llly (Li Xun) wrote…

ditto.

Done.


pal/src/host/linux-sgx/pal_files.c line 279 at r7 (raw file):

Previously, llly (Li Xun) wrote…

int is not needed.

Done.


libos/test/regression/large_mmap.c line 2 at r7 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Sorry, could you instead introduce a new test file?

I initially thought it would be a good idea to put MAP_NORESERVE in the already-existing file, but now I see that you modified this test significantly, and there is no benefit in cramming two independent tests in a single file...

Let's just revert the changes here, and instead will create a separate test. Otherwise it's very hard to review.

Done.

Copy link
Contributor

@llly llly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 12 files at r2, 6 of 18 files at r3, 4 of 9 files at r6, 5 of 7 files at r7, 11 of 11 files at r8, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @kailun-qin)

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 11 of 11 files at r8, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin)


libos/test/regression/mmap_map_noreserve.c line 6 at r8 (raw file):

 */

/* Test for creating and accessing anonymous mappings with `MAP_NORESERVE`. */

Can you add an explanation why this test is required? We should say that this test works on both EDMM and non-EDMM platforms, but if EDMM (and EXINFO) is enabled, then the mmap doesn't really commit the enclave pages, but instead they are lazily committed on the first access. This test also stresses the lazy-allocation logic on fork (again, only the actually-accessed enclave pages will be copied to the child enclave).

So on EDMM-enabled platforms, this test is supposed to be significantly faster than on non-EDMM-enabled platforms. But functionality-wise it will be the same.

Also will be ideal if you'll specify approximate timings on e.g. an ICX machine.


libos/test/regression/mmap_map_noreserve.c line 21 at r8 (raw file):

#define TEST_LENGTH  0x10000f000
#define TEST_LENGTH2 0x800f000

Why such weird lengths?


pal/src/pal_memory.c line 63 at r7 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

Because we need to invoke this call in the exception handling code:

ret = g_mem_bkeep_get_vma_info_upcall(addr, &prot_flags);
?

Ok


pal/src/host/linux-sgx/enclave_edmm.c line 283 at r7 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

hmm... But there is no lengh of buffer specified (so no sense of overreading). Actually, when offset != 0, the length of src_bitvector would actually be larger thandest_bitvector. As num_bytes is the number of bytes to copy from src_bitvector so we will need to copy (partial) bits in src_bitvector[num_bytes] into dest[num_bytes - 1] (which is exactly the "special treatment of the very last byte in the source bitvector").

Ok, got it. Quite convoluted, but I have no ideas how to make it more visible.


pal/src/host/linux-sgx/enclave_edmm.c line 313 at r7 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

But this is &= not =, i.e., in the last byte of the dest/copied bitvector, we only keep the bits for the leftover pages?

Yes, I missed the &= assignment. Understood now.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin)

a discussion (no related file):
Jenkins-SGX-Sanitizers pipeline failed, please check what's wrong and fix.


Copy link
Contributor Author

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 32 of 34 files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @llly)

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Jenkins-SGX-Sanitizers pipeline failed, please check what's wrong and fix.

I reduced the enclave size (as we're testing forking large enclaves now, which we commented out in the large_mmap test). Let's give it another try.



libos/test/regression/mmap_map_noreserve.c line 6 at r8 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Can you add an explanation why this test is required? We should say that this test works on both EDMM and non-EDMM platforms, but if EDMM (and EXINFO) is enabled, then the mmap doesn't really commit the enclave pages, but instead they are lazily committed on the first access. This test also stresses the lazy-allocation logic on fork (again, only the actually-accessed enclave pages will be copied to the child enclave).

So on EDMM-enabled platforms, this test is supposed to be significantly faster than on non-EDMM-enabled platforms. But functionality-wise it will be the same.

Also will be ideal if you'll specify approximate timings on e.g. an ICX machine.

Done, also added some approximate timings on an ICX machine.


libos/test/regression/mmap_map_noreserve.c line 21 at r8 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why such weird lengths?

It was a legacy from the large_mmap test. Updated now.


pal/src/host/linux-sgx/enclave_edmm.c line 283 at r7 (raw file):

Quite convoluted

Yeah, I also dislike it.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r9, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin)

a discussion (no related file):

Previously, kailun-qin (Kailun Qin) wrote…

I reduced the enclave size (as we're testing forking large enclaves now, which we commented out in the large_mmap test). Let's give it another try.

Why do you think this may be the problem?


Copy link
Contributor Author

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)

a discussion (no related file):

Why do you think this may be the problem?

Well, from the error log of the Jenkins-SGX-Sanitizers pipeline:

...
[2023-09-12T15:20:34.288Z] =================================== FAILURES ===================================
[2023-09-12T15:20:34.288Z] __________________ TC_30_Syscall.test_05B_mmap_map_noreserve ___________________
[2023-09-12T15:20:34.288Z] test_libos.py:856: in test_05B_mmap_map_noreserve
...
[2023-09-12T15:20:34.288Z] E   subprocess.CalledProcessError: Command '['/home/jenkins/workspace/graphene-sgx-sanitizers/usr/lib/x86_64-linux-gnu/gramine/sgx/loader', '/home/jenkins/workspace/graphene-sgx-sanitizers/usr/lib/x86_64-linux-gnu/gramine/sgx/libpal.so', 'init', 'mmap_map_noreserve']' died with <Signals.SIGKILL: 9>.
[2023-09-12T15:20:34.288Z] ----------------------------- Captured stderr call -----------------------------
[2023-09-12T15:20:34.288Z] [0.006] Gramine is starting. Parsing TOML manifest file, this may take some time...
...
[2023-09-12T15:20:34.288Z] [57.402] error: Failed to initialize child process: Broken pipe (PAL_ERROR_CONNFAILED_PIPE)

it seemed to me that the failure was caused by the OOM killer. (But w/o access to the machine, I can not confirm it.)

Now that the Jenkins-SGX-Sanitizers pipeline passes, and I also could not reproduce it w/ the same config (using SGX sanitizer + 8GB enclave config) on my local setup, let me know if you still want a double check on that CI machine.


Signed-off-by: Kailun Qin <kailun.qin@intel.com>
Copy link
Contributor Author

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 40 of 41 files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


pal/src/host/linux-sgx/pal_exception.c line 465 at r37 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

Yes I agree. I was thinking that the behavior can be arch-specific -- since we're in SGX EDMM PAL which should allow things like XOM (execute-only-memory), where users only set PROT_EXEC but read on it would get segfaults. But I'm also fine w/ following Linux. Updated as suggested.

Ok, after quick talk w/ @dimakuv, I keep it as is and add a comment. See if it makes sense. Thanks!

dimakuv
dimakuv previously approved these changes Apr 4, 2024
Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r41, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)


pal/src/host/linux-sgx/pal_exception.c line 465 at r37 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

Ok, after quick talk w/ @dimakuv, I keep it as is and add a comment. See if it makes sense. Thanks!

Yep, I agree that in the SGX world, we better keep the SGX semantics.

Copy link
Contributor

@mwkmwkmwk mwkmwkmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r39, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r38, 1 of 1 files at r39, 1 of 1 files at r41.
Reviewable status: 40 of 41 files reviewed, 2 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @kailun-qin)


pal/src/host/linux-sgx/pal_exception.c line 465 at r37 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Yep, I agree that in the SGX world, we better keep the SGX semantics.

Actually, what is inside ctx.err in such a case? Now you only check for the lack of flags, not for presence of some. And the code assumes it's not 0, because there's such an assert right before the if.

Btw. could you add a XOM case to the tests? (whether reading triggers an exception and the signal handler is called). Would need to be gated behind some check for PKEYs or something like that, I'm not sure how Linux communicates whether this feature is present or not.


pal/src/host/linux-sgx/pal_exception.c line 470 at r41 (raw file):

                 * eXecute-Only-Memory (XOM) (specified with `PROT_EXEC` alone). Note that on Linux,
                 * `PROT_READ` is not required to be set when `PROT_WRITE` or `PROT_EXEC` are set.
                 * But since we're in SGX EDMM PAL, XOM should be allowed -- reading it would cause

But? This sentence is in line with the previous one? Also, "allowed" is a wrong word here IMO, here we aren't allowing or disallowing it, only propagating the exception or not.

Code quote:

But

Signed-off-by: Kailun Qin <kailun.qin@intel.com>
dimakuv
dimakuv previously approved these changes Apr 5, 2024
Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r44, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @mkow)


pal/src/host/linux-sgx/pal_exception.c line 465 at r37 (raw file):

Actually, what is inside ctx.err in such a case?

I don't understand the confusion. ctx.err has the same info as classic #PF error code, see https://wiki.osdev.org/Paging#Handling.

And the code assumes it's not 0, because there's such an assert right before the if.

SGX is a ring-3 technology, so hardware exceptions reported inside an SGX enclave always have at least bit 2 (Userspace/Kernelspace) set to 1. Thus ctx.err can never be 0 if the HW exception was raised by the SGX enclave.

Btw. could you add a XOM case to the tests?

We could try, but if it's too cumbersome to implement, I would just ignore this.


pal/src/host/linux-sgx/pal_exception.c line 470 at r44 (raw file):

             * perspective), the host kernel not only resolves the enclave page (brings it back into
             * EPC) but also delivers it to Gramine, ending up in this code path. */
            goto out;

Intel validation team noticed this intermittent bug on client machines (only). @kailun-qin tried to fix it yesterday, and I debugged it today.

I hope the message explains the gist, but I'll leave a couple comments.

The bug happened in a nested-exception way: while the first-#PF-exception code in here called commit_lazy_alloc_pages(), it went to the EACCEPT instruction, and this instruction is special in that it raises a #PF on a non-present page (even though the host kernel will resolve it by swapping the enclave page in and updating the PTEs and the EPCM entries).

See https://www.felixcloutier.com/x86/eaccept, the first THEN #PF(DS:RCX); FI; case.

To recall the PF handling, consult this: https://wiki.osdev.org/Paging#Handling


This was a data race between several threads (they put pressure on the EPC on a client machine, forcing the ksgxd daemon to start swapping some enclave pages out, leading to the bug). So it wasn't possible to debug it via GDB, as everything got slower and didn't overlap in time. Thus I had to reconstruct the context + stack trace manually, here's a dump of the code I used:

        pal_prot_flags_t prot_flags;

        if ((addr & 0xFFF) == 0xFFF) {
                log_always("--- addr = %p", (void*)addr);
                log_always("--- trusted_exit_info = %u", trusted_exit_info_);
                log_always("--- rip  = %p", (void*)uc->rip);
                log_always("--- rax  = %p", (void*)uc->rax);
                log_always("--- rbx  = %p", (void*)uc->rbx);
                log_always("--- rcx  = %p", (void*)uc->rcx);
                log_always("--- rsp  = %p", (void*)uc->rsp);
                log_always("--- rbp  = %p", (void*)uc->rbp);
                log_always("--- ret1 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret2 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret3 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret4 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret5 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret6 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret7 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret8 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret9 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret10 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                log_always("--- err  = %lu", (uint64_t)ctx.err);
                log_always("--- cr2  = %lu", (uint64_t)ctx.cr2);
                log_always("--- event_num  = %lu", (uint64_t)event_num);

                BUG();
        }

        if (g_mem_bkeep_get_vma_info_upcall(addr, &prot_flags) == 0) {
            prot_flags &= ~PAL_PROT_LAZYALLOC;

Note that the bug always reports 0xFFF-ending enclave page address (this seems to be some weirdness of EACCEPT behavior), that's why we have such a condition to catch it.

When printing it, it corresponded to the following stack trace:

enclu
sgx_eaccept
sgx_edmm_add_pages_callback
walk_pages
commit_lazy_alloc_pages
_PalExceptionHandler

Reverts the previous fixup commit (incorrect attempt at fixing the bug
by Kailun). And adds the correct fix to the intermittent bug observed on
client platforms.

Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
dimakuv
dimakuv previously approved these changes Apr 5, 2024
Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r45, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @mkow)


pal/src/host/linux-sgx/pal_exception.c line 470 at r44 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Intel validation team noticed this intermittent bug on client machines (only). @kailun-qin tried to fix it yesterday, and I debugged it today.

I hope the message explains the gist, but I'll leave a couple comments.

The bug happened in a nested-exception way: while the first-#PF-exception code in here called commit_lazy_alloc_pages(), it went to the EACCEPT instruction, and this instruction is special in that it raises a #PF on a non-present page (even though the host kernel will resolve it by swapping the enclave page in and updating the PTEs and the EPCM entries).

See https://www.felixcloutier.com/x86/eaccept, the first THEN #PF(DS:RCX); FI; case.

To recall the PF handling, consult this: https://wiki.osdev.org/Paging#Handling


This was a data race between several threads (they put pressure on the EPC on a client machine, forcing the ksgxd daemon to start swapping some enclave pages out, leading to the bug). So it wasn't possible to debug it via GDB, as everything got slower and didn't overlap in time. Thus I had to reconstruct the context + stack trace manually, here's a dump of the code I used:

        pal_prot_flags_t prot_flags;

        if ((addr & 0xFFF) == 0xFFF) {
                log_always("--- addr = %p", (void*)addr);
                log_always("--- trusted_exit_info = %u", trusted_exit_info_);
                log_always("--- rip  = %p", (void*)uc->rip);
                log_always("--- rax  = %p", (void*)uc->rax);
                log_always("--- rbx  = %p", (void*)uc->rbx);
                log_always("--- rcx  = %p", (void*)uc->rcx);
                log_always("--- rsp  = %p", (void*)uc->rsp);
                log_always("--- rbp  = %p", (void*)uc->rbp);
                log_always("--- ret1 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret2 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret3 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret4 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret5 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret6 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret7 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret8 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret9 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                uc->rbp = *((uint64_t*)uc->rbp);
                log_always("--- ret10 = 0x%lx", *((uint64_t*)uc->rbp + 1));
                log_always("--- err  = %lu", (uint64_t)ctx.err);
                log_always("--- cr2  = %lu", (uint64_t)ctx.cr2);
                log_always("--- event_num  = %lu", (uint64_t)event_num);

                BUG();
        }

        if (g_mem_bkeep_get_vma_info_upcall(addr, &prot_flags) == 0) {
            prot_flags &= ~PAL_PROT_LAZYALLOC;

Note that the bug always reports 0xFFF-ending enclave page address (this seems to be some weirdness of EACCEPT behavior), that's why we have such a condition to catch it.

When printing it, it corresponded to the following stack trace:

enclu
sgx_eaccept
sgx_edmm_add_pages_callback
walk_pages
commit_lazy_alloc_pages
_PalExceptionHandler

Btw, another function that is special in this way (raises #PF that is resolved by host kernel but is still delivered to the Gramine SGX enclave) is EACCEPTCOPY. But we don't use that one (yet), so I didn't add it. I can though, if reviewers think we should safe us future debugs when we start using EACCEPTCOPY.


pal/src/host/linux-sgx/pal_exception.c line 241 at r45 (raw file):

    /* instruction must be ENCLU and leaf must be EACCEPT */
    uint8_t* instr = (uint8_t*)uc->rip;
    if (instr[0] == 0x0f && instr[1] == 0x01 && instr[2] == 0xd7 && uc->rax == EACCEPT)

See https://www.felixcloutier.com/x86/enclu for the opcode.

Signed-off-by: Kailun Qin <kailun.qin@intel.com>
dimakuv
dimakuv previously approved these changes Apr 8, 2024
Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r46, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @mkow)

Signed-off-by: Kailun Qin <kailun.qin@intel.com>
Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r47, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @mkow)

Copy link
Contributor

@llly llly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 14 files at r24, 3 of 4 files at r30, 3 of 3 files at r31, 1 of 1 files at r32.
Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @mkow)


libos/src/bookkeep/libos_vma.c line 1165 at r47 (raw file):

static int pal_mem_bkeep_get_vma_info(uintptr_t addr, pal_prot_flags_t* out_prot_flags) {
    struct libos_vma_info vma_info;
    int ret = lookup_vma((void*)addr, &vma_info);
           if (vma_info.file) {
                put_handle(vma_info.file);
            }

is needed because get_handle is called in dump_vma.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin, @llly, and @mkow)


libos/src/bookkeep/libos_vma.c line 1165 at r47 (raw file):

Previously, llly (Li Xun) wrote…
           if (vma_info.file) {
                put_handle(vma_info.file);
            }

is needed because get_handle is called in dump_vma.

+1. See:

if (vma_info->file) {
get_handle(vma_info->file);

Signed-off-by: Kailun Qin <kailun.qin@intel.com>
Copy link
Contributor Author

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 43 of 44 files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @llly, @mkow, and @mwkmwkmwk)


libos/src/bookkeep/libos_vma.c line 1165 at r47 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

+1. See:

if (vma_info->file) {
get_handle(vma_info->file);

Good catch! Done.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r48, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin, @llly, and @mkow)

Copy link
Contributor

@llly llly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 6 files at r25, 2 of 18 files at r33, 1 of 12 files at r34, 4 of 8 files at r36, 19 of 24 files at r37, 1 of 3 files at r38, 1 of 1 files at r45, 3 of 3 files at r46, 3 of 3 files at r47, 1 of 1 files at r48, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @mkow)

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @mkow)


libos/src/bookkeep/libos_vma.c line 1353 at r48 (raw file):

    uintptr_t end = MIN(ctx->end, vma->end);
    if (PalFreeThenLazyReallocCommittedPages((void*)start, end - start) < 0)
        return false;

Why don't we set ctx->error here?

Signed-off-by: Kailun Qin <kailun.qin@intel.com>
Copy link
Contributor Author

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 43 of 44 files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @llly, and @mkow)


libos/src/bookkeep/libos_vma.c line 1353 at r48 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why don't we set ctx->error here?

Done.

Copy link
Contributor

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r49, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions, not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin and @mkow)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants