Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash dumps and/or stack traces #1617

Open
dimakuv opened this issue Oct 24, 2023 · 6 comments
Open

Crash dumps and/or stack traces #1617

dimakuv opened this issue Oct 24, 2023 · 6 comments

Comments

@dimakuv
Copy link
Contributor

dimakuv commented Oct 24, 2023

Description of the feature

This is a request from a particular Gramine user.

There is a C++ application that runs on Gramine in a production environment. Analysis of crashes in production traditionally relies on a core dump. Developers and operators are familiar with GDB and know how to root cause crashes using GDB.

Gramine does not support core dump. Given the threat model of SGX, it is unclear how exactly to produce a core dump -- what does it mean to produce a secure core dump? Or if the core dump cannot be made secure, how to lower the possibility of accidental data leaks on crashes?

Alternatively, Gramine could support logging crash stack traces.

Caveats:

  • Core dumps/stack traces must be printed in response to catastrophic signals (SIGSEGV, SIGBUS, SIGILL, etc). Therefore, the core dump/stack traces logic must adhere to strict rules of "async safety". It is also important to make sure that security vulnerabilities associated with signal handlers are not flagged by security scanner tools (as this would block production usage).
  • Stack size limitations. If doing part of the logic on the Gramine altstack, I think the size of that altstack is very-very small. So need to be careful to not overflow.
  • It would be great to log stack traces of all thread of the process. But this is probably impossible to achieve, as this would require some form of IPC between threads.
  • The solution should work in a Kubernetes (K8s) environment. In particular, K8s application must stream logs to STDOUT, and then the logging agent or kubelet would stream them to an appropriate destination.

There are various open source libraries such as backward-cpp. But it's not clear that the handlers (the code in that library) to print stack traces is async-safe.

TLDR: Several questions:

  • Anyone knows good tools to perform core dumps and stack traces for C/C++ apps in a limited environment such as Gramine SGX enclave?
  • How to make these secure, or at least mark them explicitly as insecure in Gramine?
  • Should this logic, or part of this logic, be implemented in Gramine itself?

Why Gramine should implement it?

It's a reasonable request from customer.

@mkow
Copy link
Member

mkow commented Oct 24, 2023

It is also important to make sure that security vulnerabilities associated with signal handlers are not flagged by security scanner tools

These scanners don't matter at all for security, they are supposed to be used only as hints where could be bugs (and more often just spam false-positives). What's important is to not introduce a security issue when adding this feature. Not to make automatic scanners happy.

Also, writing down what I said on the today's call: (plus some more thoughts)

  • Printing a stack trace can leak secrets from the app (e.g. by accidentally interpreting data bytes on the stack as a return address and printing them). Could be mitigated by printing only symbols, but then it may limit usefulness of the stack traces.
  • "or at least mark them explicitly as insecure in Gramine?" - but that goes contrary to the case you originally mentioned? "There is a C++ application that runs on Gramine in a production environment" - it's a production setup, insecure flag won't help there.
  • Doing any complicated logic after a memory corruption may be risky, because Gramine's state may be corrupted (and may make some unexploitable bugs exploitable).
  • Someone proposed asymmetrically encrypted stack traces, which is sound on the design level, but sounds complicated to do when we're already in a corrupted memory state.
  • Overall, I don't like the idea of adding any introspection into production configurations, because the very idea of SGX is to make introspection impossible for untrusted host.

From my side I'd recommend trying to reproduce the issue locally / on test deployment instead and just plug GDB (but yes, it requires significantly more effort than just looking at logs from production, unfortunately).

@lejunzhuintel
Copy link

lejunzhuintel commented Oct 26, 2023

  • Someone proposed asymmetrically encrypted stack traces, which is sound on the design level, but sounds complicated to do when we're already in a corrupted memory state.

Will "encrypt to mrsigner" be simple enough to work in the corrupted memory state?

@kailun-qin
Copy link
Contributor

Will "encrypt to mrsigner" be simple enough to work in the corrupted memory state?

I think this may indeed simplify the encryption flow (by e.g., skipping DEK generation + encryption if using KEM) and mitigate the risk of the encryption key corruption when in a corrupted memory state.

However, the dump encryption may still occur unreliably or insecurely during a crash, e.g., the data used by the encryption routine or the keyrequest used by the hardware key retrieval can be corrupted.

@dimakuv
Copy link
Contributor Author

dimakuv commented Oct 26, 2023

Please also see the notes from our discussion: #1616 (first topic).

These scanners don't matter at all for security, they are supposed to be used only as hints where could be bugs (and more often just spam false-positives). What's important is to not introduce a security issue when adding this feature. Not to make automatic scanners happy.

I disagree in the sense that vulnerability scanners (and their final reports) matter for customers, especially in strictly regulated areas. So the request to "make scanners happy" is legitimate in certain business areas.

@mkow
Copy link
Member

mkow commented Oct 26, 2023

I disagree in the sense that vulnerability scanners (and their final reports) matter for customers, especially in strictly regulated areas. So the request to "make scanners happy" is legitimate in certain business areas.

But you originally said:

  • It is also important to make sure that security vulnerabilities associated with signal handlers are not flagged by security scanner tools (as this would block production usage).

Which completely twists the priorities. The most important thing is to not have security vulnerabilities, not to hide them from scanners or silence scanners.
If some users want to have clean scanner results then we can do the scans, but this is more like bureaucracy / paperwork than something real.

@dimakuv
Copy link
Contributor Author

dimakuv commented Oct 26, 2023

Ok, I guess I formulated that sentence wrong. I agree with @mkow; the original sense in my sentence was to "get rid of security vulnerabilities, and then make sure nothing is flagged by security scanner tools".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants