Crash dumps and/or stack traces #1617

dimakuv · 2023-10-24T08:03:43Z

Description of the feature

This is a request from a particular Gramine user.

There is a C++ application that runs on Gramine in a production environment. Analysis of crashes in production traditionally relies on a core dump. Developers and operators are familiar with GDB and know how to root cause crashes using GDB.

Gramine does not support core dumps. Given the threat model of SGX, it is unclear how exactly to produce a core dump -- what does it mean to produce a secure core dump? Or, if the core dump cannot be made secure, how can we lower the possibility of accidental data leaks on crashes?

Alternatively, Gramine could support logging crash stack traces.

Caveats:

Core dumps/stack traces must be printed in response to catastrophic signals (SIGSEGV, SIGBUS, SIGILL, etc.). Therefore, the core dump/stack trace logic must adhere to the strict rules of async-signal safety. It is also important to make sure that security vulnerabilities associated with signal handlers are not flagged by security scanner tools (as this would block production usage).
Stack size limitations. If part of the logic runs on the Gramine altstack, I think that altstack is very small, so we need to be careful not to overflow it.
It would be great to log the stack traces of all threads of the process. But this is probably impossible to achieve, as it would require some form of IPC between threads.
The solution should work in a Kubernetes (K8s) environment. In particular, a K8s application must stream its logs to STDOUT, and then the logging agent or kubelet streams them to an appropriate destination.

There are various open source libraries such as backward-cpp. But it's not clear whether the handlers (the code in that library) that print stack traces are async-signal-safe.

TLDR: Several questions:

Does anyone know good tools to perform core dumps and stack traces for C/C++ apps in a limited environment such as a Gramine SGX enclave?
How to make these secure, or at least mark them explicitly as insecure in Gramine?
Should this logic, or part of this logic, be implemented in Gramine itself?

Why Gramine should implement it?

It's a reasonable request from a customer.

mkow · 2023-10-24T23:52:28Z

> It is also important to make sure that security vulnerabilities associated with signal handlers are not flagged by security scanner tools

These scanners don't matter at all for security; they are supposed to be used only as hints about where bugs could be (and more often they just spam false positives). What's important is to not introduce a security issue when adding this feature, not to make automatic scanners happy.

Also, writing down what I said on today's call (plus some more thoughts):

Printing a stack trace can leak secrets from the app (e.g. by accidentally interpreting data bytes on the stack as a return address and printing them). This could be mitigated by printing only symbols, but that may limit the usefulness of the stack traces.
"or at least mark them explicitly as insecure in Gramine?" - but that goes contrary to the case you originally mentioned? "There is a C++ application that runs on Gramine in a production environment" - it's a production setup, insecure flag won't help there.
Doing any complicated logic after a memory corruption may be risky, because Gramine's state may be corrupted (and may make some unexploitable bugs exploitable).
Someone proposed asymmetrically encrypted stack traces, which is sound on the design level, but sounds complicated to do when we're already in a corrupted memory state.
Overall, I don't like the idea of adding any introspection into production configurations, because the very idea of SGX is to make introspection impossible for the untrusted host.
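
As an illustration of the first point, one variation on the "print only symbols" idea is to emit a stack value only if it falls inside a known code range, and then only as a module-relative offset that can be symbolized offline. The sketch below is hypothetical: `struct code_range` and `sanitize_trace` are names invented here, and how the handler would obtain the range and the candidate frames is not addressed. It narrows the leak rather than eliminating it, since a secret that happens to look like a code address would still be printed as an offset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one module's executable address range. */
struct code_range {
    uintptr_t start; /* inclusive */
    uintptr_t end;   /* exclusive */
};

/* For each candidate return address, record whether it lies inside the code
 * range and, if so, its offset from the start of the range. Values outside the
 * range are marked invalid and their offset is zeroed, so the raw stack
 * contents never reach the output. Returns the number of frames kept. */
size_t sanitize_trace(const struct code_range* range, const uintptr_t* frames,
                      size_t count, uintptr_t* offsets, bool* valid) {
    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        valid[i] = frames[i] >= range->start && frames[i] < range->end;
        offsets[i] = valid[i] ? frames[i] - range->start : 0;
        if (valid[i])
            kept++;
    }
    return kept;
}
```

The function uses only plain loads, stores and comparisons, so it makes no library calls that could be unsafe in a signal handler. It still assumes that the range and frame arrays themselves have not been corrupted, which is exactly the concern raised above.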

From my side, I'd recommend trying to reproduce the issue locally or on a test deployment instead and just attaching GDB (but yes, unfortunately it requires significantly more effort than just looking at logs from production).

lejunzhuintel · 2023-10-26T00:23:23Z

> Someone proposed asymmetrically encrypted stack traces, which is sound on the design level, but sounds complicated to do when we're already in a corrupted memory state.

Will "encrypt to mrsigner" be simple enough to work in the corrupted memory state?

kailun-qin · 2023-10-26T02:54:49Z

> Will "encrypt to mrsigner" be simple enough to work in the corrupted memory state?

I think this may indeed simplify the encryption flow (e.g., by skipping DEK generation + encryption if using KEM) and mitigate the risk of encryption key corruption in a corrupted memory state.

However, the dump encryption may still be unreliable or insecure during a crash, e.g., the data used by the encryption routine or the keyrequest used for the hardware key retrieval can be corrupted.

dimakuv · 2023-10-26T09:40:06Z

Please also see the notes from our discussion: #1616 (first topic).

> These scanners don't matter at all for security; they are supposed to be used only as hints about where bugs could be (and more often they just spam false positives). What's important is to not introduce a security issue when adding this feature, not to make automatic scanners happy.

I disagree in the sense that vulnerability scanners (and their final reports) matter for customers, especially in strictly regulated areas. So the request to "make scanners happy" is legitimate in certain business areas.

mkow · 2023-10-26T10:58:21Z

> I disagree in the sense that vulnerability scanners (and their final reports) matter for customers, especially in strictly regulated areas. So the request to "make scanners happy" is legitimate in certain business areas.

But you originally said:

> It is also important to make sure that security vulnerabilities associated with signal handlers are not flagged by security scanner tools (as this would block production usage).

Which completely twists the priorities. The most important thing is to not have security vulnerabilities, not to hide them from scanners or silence scanners.
If some users want to have clean scanner results then we can do the scans, but this is more like bureaucracy / paperwork than something real.

dimakuv · 2023-10-26T11:20:16Z

Ok, I guess I formulated that sentence wrong. I agree with @mkow; what I originally meant was to "get rid of security vulnerabilities, and then make sure nothing is flagged by security scanner tools".
