Public Key Infrastructure for Rust Project #3579

Open · walterhpearce wants to merge 4 commits into base: master
Conversation

@walterhpearce commented Feb 27, 2024:

This RFC is the first of a series describing a PKI model for Rust. It includes the design and implementation of a PKI CA and a resilient quorum model for the project, plus next steps for signing across the project. Crate and release signing will follow in a separate, subsequent RFC.

Rendered

@pietroalbini added the T-infra label (Relevant to the infrastructure team, which will review and decide on the RFC) on Feb 27, 2024
- Documented delegate key usage and expectations for implementation
- Elaborated the justifications for algorithm usage
- Allowed delegate keys to use other algorithms
- Justified key expiration
- Removed the Infra key and its usage descriptions (mTLS et al.)
- Gave justifications for a bors-specific key for git commit signing vs. the GitHub key
- Elaborated on keys being usable for code signing
- Clarified future use of mTLS
- Added details on CloudHSM backups and retention
- Changed cloud-agnostic descriptions to instead describe the steps taken to avoid cloud-provider-specific solutions
- Fixed some typos and broken links
- Changed the physical recommendation to call specifically for geographic and political dispersion; removed language around events
- Addressed transparency logs

### Key Transparency (`pki.rust-lang.org`)

All certificate operations on the Root Key and the top-level Delegate Keys will be logged via Certificate Transparency, on the domain [pki.rust-lang.org][pki-rust-lang-org]. (This prevents issuing or using a Delegate Key surreptitiously.) Each issued key must have proof of the Certificate Transparency record stapled to it.
Member:
I think this needs more detail. Just logging the intermediates doesn't feel like enough: it just moves the point of interest for a secret attacker to an intermediate. I'd expect more discussion here of how what we're doing differs from (for example) web (public) CT logs and why we think our constraints are different.

Validating that these CT logs are accurate (i.e. that no false stamp is issued) also seems easiest, or possible at all, only with independent systems; it's not clear to me that's achievable if we operate both (and the same humans are pretty thoroughly involved with both).

@burdges commented Feb 29, 2024:
Isn't this saying you use ordinary public web CT logs? You'd then have some CT log inclusion statements embedded in the full certificate (the pre-certificate has a poison extension making it invalid without those inclusion proofs). Afaik, that'd work fine for Rust project releases.

As for external crates, you could have some MMR root for ecosystem crates on crates.io certified in this way too, and then provide crates their extension proof whenever the Rust crates.io CT log updates. You provide the MMR root transition proof too, so that only one successor exists. Anyone paranoid could impose some delay upon updates, maybe once per week or even per day.
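To make the stapled-proof idea concrete, here is a minimal sketch (not part of the RFC) of verifying an RFC 6962/9162-style Merkle audit path in Rust with the `sha2` crate; a client holding a certificate with a stapled inclusion proof would run roughly this check against the log's signed tree head:

```rust
use sha2::{Digest, Sha256}; // [dependencies] sha2 = "0.10"

/// RFC 6962 leaf hash: SHA-256(0x00 || leaf).
fn leaf_hash(leaf: &[u8]) -> [u8; 32] {
    let mut h = Sha256::new();
    h.update([0x00u8]);
    h.update(leaf);
    h.finalize().into()
}

/// RFC 6962 interior node hash: SHA-256(0x01 || left || right).
fn node_hash(left: &[u8; 32], right: &[u8; 32]) -> [u8; 32] {
    let mut h = Sha256::new();
    h.update([0x01u8]);
    h.update(left);
    h.update(right);
    h.finalize().into()
}

/// Verify that `leaf` sits at `index` in a tree of `tree_size` leaves
/// with head `root`, given the audit path returned by the log
/// (the algorithm from RFC 9162, section 2.1.3.2).
fn verify_inclusion(
    leaf: &[u8],
    index: u64,
    tree_size: u64,
    path: &[[u8; 32]],
    root: &[u8; 32],
) -> bool {
    if index >= tree_size {
        return false;
    }
    let (mut fnode, mut snode) = (index, tree_size - 1);
    let mut hash = leaf_hash(leaf);
    for p in path {
        if snode == 0 {
            return false; // the path is longer than the tree is deep
        }
        if fnode % 2 == 1 || fnode == snode {
            hash = node_hash(p, &hash);
            // Skip levels where our subtree has no right sibling.
            while fnode % 2 == 0 && fnode != 0 {
                fnode >>= 1;
                snode >>= 1;
            }
        } else {
            hash = node_hash(&hash, p);
        }
        fnode >>= 1;
        snode >>= 1;
    }
    snode == 0 && hash == *root
}
```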

Reply:
@burdges typical CT logs will only accept certificates anchored in a trusted public root for the Web PKI, so it sounds like this would involve standing up and running a CT log specifically for this CA

@burdges:

They could get certificates on pki.rust-lang.org but okay.

Reply:

I believe a certificate issued for pki.rust-lang.org would be insufficient here: that would chain up to a Web PKI root, but I'm not aware of any Web PKI CA that will issue a subordinate root like this RFC requires (or even a subscriber cert with non-CABF EKUs/policy constraints).

(I could have misunderstood what you meant by "get certificates on".)


We may also wish to set up an OCSP service; we defer this question for evaluation during implementation.

### Rotation & Revocation
Member:

I would like to see a concrete story on how we expect to deal with incidents that would normally prompt revocation, and how that plays with our expectation on the timeline of the roots.

In particular, I'd like to see a clear argument made for why we shouldn't mint roots that never expire. OpenBSD for example does (did?) use keys without expiry time to sign releases: https://www.openbsd.org/papers/bsdcan-signify.html, but each key is only trusted for the next release, which gives you an automatic window of ~12ish weeks for our purposes. (And practically makes revocation unnecessary IMO).

Reply:

I want to second this point: IME revocation handling is one of the most common operational failures in self-deployed PKIs, because it hits the sweet spot between "absolutely must happen seamlessly" and "virtually never happens, meaning operators have no muscle memory for it."

Short-lived key material with automated rotations (either like OpenBSD or like the Web PKI) is a lot easier to operationalize 🙂


- Delegate Keys can be re-issued and re-signed by the Root Key
- Root Key rotation may require a Rust/Cargo/rustup update to update the local certificate
- Any rotation, especially any out-of-cycle rotation (e.g. compromise or infrastructure change) will by design be very *loud* and involve many people who can raise a red flag over any concerns of impropriety.
Member:

I feel like the argument could also go the other way, though: we would be hesitant to revoke and incur the very high costs in doing so if there was a mild incident. For example, if I'm part of the quorum and my private key used to authenticate is posted on GitHub, do we treat that as an event warranting revocation? Probably not - after all, it's one of 5. (Or N). But it feels like the line gets murky at 3, even if there's not public evidence of compromise.

(Obviously I can invent other scenarios).


The Rust Infra Team will deploy a new SSH bastion host, `pki-bastion.rust-lang.org`, which shall be used for performing all PKI operations. This system will have available to it a suite of scripts for performing all the common scenarios mentioned in [When the Quorum will be needed][when-the-quorum-will-be-needed]. Current members of the quorum will authenticate to this system utilizing IAM roles via EC2 Instance Connect, which will force them to use their IAM role with two-factor authentication (which will double as the role used for the actual quorum process).

- This system shall be isolated and powered down except for scheduled quorum events or regular security updates to the system.
Member:

It seems pretty likely that no one will be able to access the system after such a long downtime, especially if it is network isolated. For example, I think we'd need to use a less-secure SHA-1 connection for an OpenSSH released 7 years ago.

Plus, presuming our expectation is that the instance has all 5 keys in memory at some point, all an attacker would need to do is update the underlying EBS volume or access the instance at the point of a ceremony, right? It's not clear to me how this fits into our threat model.

# Motivation
[motivation]: #motivation

The Rust Project currently lacks any official Public Key Infrastructure (PKI) required to perform proper package, mirror, code, and release signing across the project. There is high demand for capabilities that require such PKI infrastructure, such as mirroring and binary signing. This RFC provides the necessary cryptographic infrastructure to unblock such work, and an initial usage for cryptographic verification of Rust releases and `crates.io` downloads.
@Mark-Simulacrum (Member):
I think this is not quite true. We do sign git tags and Rust releases with a keypair today; that isn't used for validation and the keypair isn't particularly secure (it has been on multiple machines and laptops and servers), but it is something.

Contributor:
> We do sign git tags and Rust releases with a keypair today; that isn't used for validation and the keypair isn't particularly secure (it has been on multiple machines and laptops and servers), but it is something.

hi @Mark-Simulacrum, what do you mean by "that isn't used for validation"? From my understanding, the *.asc file in rustup is for users to validate the integrity of Rust release packages.

Member:

Rustup doesn't validate the pgp signature anymore: rust-lang/rustup#3277. It was never a hard error anyway, the validation code had been incorrectly rejecting older rustc versions (rust-lang/rustup#3250), and it is expected to be replaced by a new system for which this RFC lays the foundations.

- **Threat**: Prevent compromise of key material by nation-state or corporate actor (whether by coercing the cooperation of an individual or obtaining the key material illicitly).
**Addressed**: By use of an HSM, quorum model, and geographically and politically distributed quorum members.
- **Threat**: Supply chain compromise of the Rust release process, such that Rust releases a signed binary containing malicious code
**Addressed**: Either by revocation and rotation of the key used, or by verifiers having a mechanism (not specified here) for finer-grained revocation of individual signed artifacts.
Member:

I think this "not specified here" is going to make this difficult to independently evaluate. How we deal with revocation is a big question, but I absolutely agree that it depends on clients. The current model of a shared key hierarchy and root across different use cases increases this difficulty IMO, since we'd need to support all of them with the same system.

Maybe I just need to wait for the forthcoming RFCs on actual usage.


**Expiration:** Root and Top-Level Delegate keys shall follow a `7 year expiration` schedule, except for the first keys of this proposal, which shall have an expiration date of the expected release date of the Rust 2030 edition plus 1 year.

Expiration allows us to re-evaluate algorithm choice and algorithm strength, archive transparency logs, and have a well-tested path for root key replacement.
Member:

All of these can be equally achieved by adding new roots and removing old ones on some cadence. In other words, embedding the time into the root isn't actually required for them IMO: we (to some extent) control clients and regularly ship new clients.

### Using the GitHub bot key to sign bors commits
[using-github-bot-sign]: #using-the-github-bot-key-to-sign-bors-commits

As an alternative to having a dedicated delegate key for bors that chains to the Rust Root, we could rely on GitHub's commit signatures for bot accounts. However, this would not allow users of git mirrors to be resilient against impersonation of rust-lang/rust. Any GitHub project that uses bors could push a branch that is a clone of rust-lang/rust and get bors to add and sign malicious commits. That history could then be pushed to a purported "mirror" and would appear the same as a legitimate mirror of rust-lang/rust, allowing for social engineering attacks. Using per-repository keys makes mirrors easier to verify and reduces attack surface area.
Member:

I don't think per-repository keys managed by us help with this problem at all.

To preface, we are using an instance of bors specific to our organization, so regardless of which solution we choose all repositories will be in the rust-lang org, and attackers would need to be members of the team or compromise members of the team.

For this attack to work, you need to have a base branch with the content of rust-lang/rust, and a PR you instructed bors to merge with that branch. There are two candidates for such branch:

  1. The main/master branch, so it's either rust-lang/rust itself and the attacker compromised rustc proper (signatures wouldn't matter here), or it's a different repository with a separate history (let's say clippy) which would result in a failed merge due to endless conflicts.

  2. An attacker-controlled branch (let's call it mastr to sprinkle some typosquatting) containing malicious code, to which the attacker sends a benign PR tricking reviewers into thinking it's going to be merged into master. There is no difference from this happening in let's say clippy or rustc proper, so separate keys wouldn't help.

Case 1. is not relevant, and case 2. can be fixed by adding a check to bors preventing it from merging into non-protected branches, regardless of whether we use GitHub keys or our own.

Member:

In the case of 1., the attacker themselves could merge the default branch of the target repo with `--keep-ours` to resolve all conflicts and only then open the PR on the target repo. Merging the PR would then result in no conflicts at all.

Member:

Wouldn't that still cause a huge, noticeable PR changing hundreds of thousands of lines of code? If we, let's say, merge rustc into clippy.

Member:

It does. Other repos have less oversight and get reviewers more easily afaik, so there is a higher risk of the attacker themselves getting review permissions for one of these repos and the attack not being detected than for the main Rust repo. Whether this attack scenario matters enough to do this is another question though.

Member:

If we're really worried about this attack GitHub keys would still work, we would just need to create a separate GitHub App for rust-lang/rust. I don't think realistically this is going to happen, as smaller repos are being (very slowly) migrated away from bors to GitHub Merge Queues.

@@ -247,15 +256,15 @@ Relevant past RFCs:

**Crate Signing / Mirroring**: A subsequent RFC will specify how we handle crates.io index signing, to enable mirroring and other requirements.

- **Code Signing**: This RFC provides a chain of trust that can in the future be used for signing binaries for Apple/Microsoft binary authentication. However, this RFC does not specify a mechanism or process for such signing; establishing that is future work. A subsequent RFC will specify how we handle code signing.
+ **Code Signing**: This RFC provides a chain of trust that can in the future be used for signing binaries for Apple/Microsoft binary authentication. (This would require generating a CSR for a key chaining to the Rust Root, and then getting that CSR signed by the appropriate Microsoft/Apple process.) However, this RFC does not specify a mechanism or process for such signing; establishing that is future work. A subsequent RFC will specify how we handle code signing.
Member:

What is the purpose of tying those certificates to the root?

Regardless of which signing scheme we end up with for releases, we will need to sign more than just releases (source tarballs, reproducibility artifacts, Windows/Apple installers). So code-signed things will already be signed by the release key like any other of our artifacts.

What advantage would tying the code signing key to our root provide? We would be code signing those things just to keep the operating systems from complaining, and they will not care about our root.


**Git mirroring**: This RFC specifies a delegate key for bors to sign git commits, but does not specify any scheme for mirroring Rust git repositories. Future RFCs will specify how we handle git repository mirroring.

**OCSP/CRL/Transparency Log/Stapling**: Do we wish to set up an OCSP service, CRL, Transparency Logs, and/or perform stapling? Future implementations of these types of services can reside on [pki.rust-lang.org][pki-rust-lang-org], meeting its purpose of providing transparency to our PKI. We leave it to future implementation and RFCs to discuss the use cases of these systems.

**Internal Signing Service**: A microservice endpoint to provide authenticated signing of binary blobs with various keys for roles. This could be implemented in AWS via roles for allowing different teams to perform binary blob signing using their issued keys without disclosure of the keys.
Member:

This (internal signing service) should be explicitly out of scope and we should not do this. This sentence is describing reimplementing AWS KMS from scratch.


Delegate keys shall be used for purposes within the realms of their designated responsibility and team(s). It shall be up to the individual implementors using these keys to verify their trust up to the Rust root or to the appropriate delegate key for their usage. For example, it shall be an implementation detail for the crates.io and cargo teams whether to verify signatures up to the Rust root key, or to their specific delegate keys for other purposes. This is done to allow the greatest flexibility for users of the keys.

Delegate keys will use appropriate mechanisms within the key format to specify the purposes the key can be used for, and verifiers relying on delegate keys should then verify those purposes.
Member:

Can the mechanism be described, for people not deeply familiar with how it usually works in a X.509 CA?

Reply:

Yeah, this section should specify what the EKU (Extended Key Usage) OIDs are for each of the usages.

Replace "appropriate mechanisms within the key format" with "an EKU ([Extended Key Usage][rfc5280])" and note that the correct EKU must be present in every intermediate and end-entity certificate but not necessarily the root.

Reply:

As an operational note from someone who implemented X.509 validation: handling of EKUs and similar policy mechanisms (like policy constraints) is not super consistent between different implementations -- the "right" thing to do in many cases is to over-index and support only what the Web PKI requires, which may not be compatible with the extensions/profile this PKI defines.

(Many implementations also make it hard to define these kinds of profile requirements, e.g. "only leaf has EKU" and "everything has EKU" are both common but "everything except the trust anchor has EKU" will involve rejecting chains after they've already been validated.)
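To illustrate the profile rule suggested above ("the correct EKU must be present in every intermediate and end-entity certificate but not necessarily the root"), here is a minimal, self-contained sketch over a simplified certificate representation. The `CODE_SIGNING` OID is the real `id-kp-codeSigning` from RFC 5280; the chain and subject names are hypothetical:

```rust
/// id-kp-codeSigning from RFC 5280.
const CODE_SIGNING: &str = "1.3.6.1.5.5.7.3.3";

/// Simplified stand-in for a parsed certificate: just a subject and the
/// EKU OIDs, or `None` if the certificate carries no EKU extension.
struct Cert {
    subject: &'static str,
    eku: Option<Vec<&'static str>>,
}

/// Enforce the profile rule discussed above: every certificate in the
/// chain except the trust anchor (the last element here) must carry the
/// expected EKU; the root itself is exempt.
fn check_eku_profile(chain: &[Cert], required: &str) -> Result<(), String> {
    let (_anchor, rest) = chain.split_last().ok_or("empty chain")?;
    for cert in rest {
        match &cert.eku {
            Some(oids) if oids.contains(&required) => {}
            _ => return Err(format!("{} is missing EKU {}", cert.subject, required)),
        }
    }
    Ok(())
}

fn main() {
    // Hypothetical chain, leaf first: leaf -> delegate -> Rust root.
    let chain = [
        Cert { subject: "release-signing leaf", eku: Some(vec![CODE_SIGNING]) },
        Cert { subject: "release delegate", eku: Some(vec![CODE_SIGNING]) },
        Cert { subject: "Rust root", eku: None }, // trust anchor: no EKU needed
    ];
    assert!(check_eku_profile(&chain, CODE_SIGNING).is_ok());
}
```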

Comment on lines +221 to +225
This RFC does not preclude deploying TUF as a solution for crate and mirror signing. However, one thing to keep in mind is that TUF does not solve the issues around the quorum: how it is managed, where signing occurs, or how it occurs. TUF provides a model hierarchy for signing, a way to establish trust among keys for a file index, and easily rotatable keylists for hierarchical trust. We would still need to determine how we select our TUF quorum, who its members are, and how and where they perform signing; those problems are deferred by TUF but iterated in this RFC.

Part of this approach is meeting our requirements in a model which is understood and applicable by a broader audience: doing this for a CA is well understood and widely implemented, while doing it within TUF is only understood and applicable in that context.

In a future where crates.io implements TUF, this CA and quorum would exist as the Root Role; we would then delegate Target roles to the crates.io delegate key and utilize the TUF model underneath that. We'd still need to solve our quorum, ceremonies, etc., but they would only apply to TUF, a problem space addressed in this RFC. Said another way, a majority of this RFC applies to problems that we would also need to solve in TUF, but at a broader scale, utilizing a full CA and not strictly within the TUF model.
@tarcieri commented Feb 29, 2024:

TUF has built-in first-class support for k-of-n threshold signing for establishing a quorum, which can be used for any role, including the root, and which can be validated end-to-end by clients.

This gives it greater deployment flexibility around how the quorum could operate, e.g. instead of an HSM storing a single key-to-the-kingdom and enforcing the quorum around key usage via trusted hardware, it could instead support a distributed quorum of multiple signers who independently store and manage keys using e.g. PIN-protected offline Yubikeys/YubiHSMs/Solokeys/Nitrokeys in tandem with secure endpoints like a Chromebook.

Achieving quorum for signed updates could simply be making PRs to flat files managed through e.g. GitHub, where someone opens a PR that performs an update, and then k-of-n signers compute a signature over the new contents and add their signatures to the PR. When k-of-n have been achieved, the update can be merged and deployed via CDN.

See the Python Software Foundation's TUF runbook for a real-world example and working Rust code for managing a TUF root role using this sort of process, which was used to provision the PyPI root role.

Member:

@tarcieri FWIW, I am hugely in favor of the idea of trying to eliminate the bastion server in favor of just having individual quorum members perform the signing operations locally and submitting the results, if we can arrange to make that work.

To me, the big concern is the degree to which we expose the result of that to individual users. "There's a root key for Rust which signs things" is a very well-understood model. "There are N quorum members whose individual signatures must be on anything that's trusted" is a less well-understood model, if we actually expose that to users as what the client is doing to figure out if something is trusted. That seems even more true if the quorum members are rotated.

If we can abstract that away, to where we have N quorum members individually signing things but to clients doing verification we can make that look like there's a single root key that signs things, I'm all for it. For instance, if we could embed that logic in an HSM or similar, such that the HSM signs things if and only if they're threshold-signed like that, and then others who rely on us can just look at the HSM signature, that seems perfectly reasonable.

Reply:

In TUF, threshold signing is built-in and routine. The root role would be used to delegate authority to the other roles, and the keys could remain offline and unused unless changes need to be made to the top-level keys of any of the other roles. TUF clients must be able to verify threshold signatures.

There are multiple Rust implementations of multiparty threshold ECDSA which result in a single signature which could potentially be used for an X.509 root and verified by unmodified X.509 certificate verification libraries, but such schemes use a complicated online protocol which would require a "dealer" service to coordinate and couldn't be as simple as adding signatures to an open PR the way it would be with TUF. I'm also not sure any of those schemes would be amenable to storing key shares in a hardware device, and it would be an awful lot of work.

Member:

> To me, the big concern is the degree to which we expose the result of that to individual users. "There's a root key for Rust which signs things" is a very well-understood model. "There are N quorum members whose individual signatures must be on anything that's trusted" is a less well-understood model, if we actually expose that to users as what the client is doing to figure out if something is trusted. That seems even more true if the quorum members are rotated.

I'll be using TUF as an example here, as it's the implementation I'm familiar with.

Quorum would not be exposed directly to users, unless they want to dig deeper into how the signatures are made (at which point they'd need to learn about the quorum described in this RFC anyway). In TUF, all the root keys would be included in a single JSON file called the "root role". This file contains the public keys of all trusted keys, plus signatures from at least a quorum of keys.

If a rotation needs to happen, tooling or users would not go and swap a few individual keys. They'd replace the old JSON root role as a whole with a new JSON root role containing the new set of trusted keys. If it's a routine rotation the new root role would need to be signed by a quorum of the keys of the previous role.

From the user's point of view it would be a single file providing the root of trust, like the CA key described in the RFC. If we're worried about confusing users, we could just PEM-encode the TUF root role.

Member:

As an example, here is the TUF root role for sigstore. It's a single JSON file, even though it contains 5 quorum keys, 4 signatures, and 2 delegated keys.
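For readers less familiar with TUF metadata, here is a rough sketch of the root-role shape described above, written as Rust types with serde (assuming `serde = { version = "1", features = ["derive"] }`); the field names are approximate, and the TUF specification is authoritative:

```rust
use serde::{Deserialize, Serialize};
use std::collections::HashMap;

/// Rough shape of a TUF-style root role: one JSON document carrying all
/// trusted keys, per-role thresholds, and a quorum of signatures over
/// its own signed portion. (Approximate; see the TUF spec for the real format.)
#[derive(Serialize, Deserialize)]
struct RootRole {
    signed: SignedPortion,
    /// For a routine rotation, at least `threshold` of these signatures
    /// must come from keys listed in the *previous* root role.
    signatures: Vec<SignatureEntry>,
}

#[derive(Serialize, Deserialize)]
struct SignedPortion {
    version: u64,
    expires: String, // e.g. "2031-01-01T00:00:00Z"
    /// All trusted public keys, indexed by key ID.
    keys: HashMap<String, PublicKey>,
    /// Which key IDs may sign each role ("root", "targets", ...) and how
    /// many of them must sign.
    roles: HashMap<String, RoleKeys>,
}

#[derive(Serialize, Deserialize)]
struct PublicKey {
    keytype: String, // e.g. "ed25519"
    public: String,  // encoded key material
}

#[derive(Serialize, Deserialize)]
struct RoleKeys {
    keyids: Vec<String>,
    threshold: u32, // e.g. 5, mirroring the 5-of-9 quorum in this RFC
}

#[derive(Serialize, Deserialize)]
struct SignatureEntry {
    keyid: String,
    sig: String,
}
```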

Reply:

Threshold EdDSA is far simpler than threshold ECDSA, but anyway threshold signatures are not "accountable" in the sense that you do not know who produced a bad signature. Accountability sounds pretty important when defending against supply-chain attacks.

There are good reasons why organizations might use threshold signatures for their own internal opsec, even when distributing software, like hiding the number of signers from the signers themselves, but accountability sounds more important here, and in general TUF usage.


This HSM instance will be created within the Rust Project AWS account. After initial creation and configuration, the quorum shall be the only entity capable of administering or utilizing the root key in the HSM. Network access to the HSM shall be limited to a PKI bastion [Signing Console][signing-console], which shall only be powered on for scheduled signing events. See the reference-level [Signing Console][signing-console] description for further details on this system.

The root key shall follow a `5-of-9` authentication model for all operations. We consider this a reasonable middle ground of quorum to prevent malicious activity, while allowing for rapid response to an event requiring a quorum. These events are iterated below in [When the Quorum will be needed][when-the-quorum-will-be-needed].

Reply:

I think this might need elaboration: is the 5-of-9 scheme here a cryptographic one (i.e. a reconstructed secret key, or threshold signing), or is it an informal one? If the latter, what are the checks (and transparency mechanisms) to prevent access by a malicious or faithless quorum member?

(Similarly, when quorum membership rotates, how do threshold keys/shared secrets/access credentials get demonstrably revoked or burnt for departing members?)

@woodruffw commented:

(Preface: I apologize for the top-level comment; if this isn't the right venue for this, just let me know and I'll drop it!)

I think the overall design here is sensible, for an independent PKI.

However, I want to issue a general caution about self-deployed PKIs and operational burdens: both PKI standup and long-term maintenance are operationally complex, and fit into the middle of the iron triangle of "hard to do right", "manually intensive", and "not performed often enough to establish muscle memory." The operational overhead of recovery and rotation in PKIs also presents a "normalization of deviance" risk, in the sense that operations that are challenging to perform (such as rotating out weak keys or non-conforming certificates) get deferred due to stability/breakage risks for downstream users. This latter risk is heightened by technical constraints in the X.509 PKI ecosystem (e.g. poor real-world support for policy mapping, name constraints, mixed key/signature types in chains).

If I read the RFC correctly, it sounds like there are really three underlying motivations for a new PKI:

  1. Signing "official" Rust artifacts, such as releases of Rust itself, including eventually for the purpose of trust by OS codesigning mechanisms;
  2. Signing crate distributions (not end-user signing, but signing by the index itself);
  3. Infrastructure operations, such as internal services or unforeseen additional delegations.

If this understanding is correct, I'd like to suggest the following alternatives:

  1. To sign Rust releases, consider artifact transparency with either a fixed identity via Sigstore or a single KMS-managed key, which could then be subject to key transparency. My bias is towards encouraging Sigstore since it sidesteps the need to do any key management whatsoever, but you could also do key distribution through a .well-known URI on a dedicated host with transit security through a Web PKI CA issued certificate.

  2. As a potentially controversial opinion: I think index-side package signing has relatively little security value in the presence of strong transport security (i.e. TLS) and cryptographic digests, unless there's a firm security boundary between the signing/uploading component and the rest of the codebase. This might be the case, but it can't be taken for granted -- without that boundary, it's difficult to quantify the damage an attacker with some access to crates.io's infrastructure can do.

    (End-user signing is incredibly valuable, and much harder to get right (cf. PyPI's challenges getting users to sign with reasonably secure PGP keys, much less sign at all).)

    Rather than prioritize index signing, I'd like to suggest that Rust consider an artifact transparency scheme similar to Go's: Go's checksum database ("sumdb") has a proven track record in the Go ecosystem, and provides similar properties (notably, committing the index to an immutable mapping between package names and their contents).

  3. This is a big space, so I don't want to make assumptions about the total constellation of things that the Rust project may end up needing key materials or certificates for 🙂. However, in the most general case: services can and should (continue to) use the Web PKI for TLS certificates, including for internal services.

Again, I apologize if this is a premature response or an inappropriate venue for a "general" opinion like this one. I'm happy to discuss synchronously as well, based on experiences implementing X.509 and working on the same problem space within PyPI and other ecosystems.

@burdges commented Mar 7, 2024:

I think CT winds up much less fragile than CAs do. If git used sha256, then you'd gain pretty powerful forensic tools from repository history plus git hashes in the package.

I've noted above that existing web CT infrastructure could provide the ultimate CT roots, although Rust would still require its own infrastructure to maintain the extension Merkle trees.

@bjorn3 (Member) commented Mar 8, 2024:

> I think index-side package signing has relatively little security value in the presence of strong transport security (i.e. TLS) and cryptographic digests

The reason we want package signing is to allow mirrors of crates.io where there is bad or no internet connectivity with crates.io like China or airgapped systems. In those cases it is impossible to use TLS for security.

### Preventing Vendor Lock-in

This RFC attempts to limit our exposure to singular points of failure or compromise by avoiding reliance on any one entity for the totality of our security. A choice was made to utilize Amazon CloudHSM as a hardware solution for key storage; but we have chosen not to use internal cloud-specific CA mechanisms, to avoid being further bound to a single provider's ecosystem. The scheme described in this RFC can be moved to other HSM or storage backends with no other dependencies on a specific service.

Contributor:

From my understanding, migration between different HSM services may run into incompatible key formats and encoding methods; considering the key replacement involved, the workload of such a migration is almost equivalent to establishing a new key management mechanism from scratch.

### Alternative Quorum Tools
[alternative-quorum-tools]: #alternative-quorum-tools

Other solutions exist for performing quorum-based authentication for key access, which support various storage backends. The main alternative for such a standalone solution would be utilizing a quorum authentication plugin on a cloud-agnostic storage system (such as HashiCorp Vault, or a cloud-agnostic authentication scheme atop a cloud key store, such as the AWS key store). These solutions would allow us to deploy and move independent of cloud providers; however, this comes with the added infrastructure overhead of the Infra team needing to implement and maintain such a solution. This choice was considered but thought prohibitive to implement and maintain with the current staffing available to the project. Finally, these solutions do not exist in hardware, due to the need to remain cloud and hardware agnostic.
Contributor:

For Vault-like tools, although they require some extra time and effort to implement and maintain, they could serve not only as a key management tool for this PKI, but also as a secret management tool for the entire Rust project, managing other secrets like tokens or access keys (AK/SK) in a unified manner. So I think it's worth considering.

@wangkirin (Contributor) commented Mar 8, 2024:

> The reason we want package signing is to allow mirrors of crates.io where there is bad or no internet connectivity with crates.io like China or airgapped systems. In those cases it is impossible to use TLS for security.

In addition to network-restricted scenarios, package signing could also enable a decentralized crate distribution system in the future, thereby reducing the bandwidth costs of a single system.

@woodruffw commented:

> I think index-side package signing has relatively little security value in the presence of strong transport security (i.e. TLS) and cryptographic digests

> The reason we want package signing is to allow mirrors of crates.io where there is bad or no internet connectivity with crates.io like China or airgapped systems. In those cases it is impossible to use TLS for security.

These are reasonable scenarios, but to point out: in an airgapped scenario, the verifier necessarily lacks access to the revocation parts of the PKI. In a scenario with a country that interferes with secure transport: you have limited ability to prevent the country/network operator from stripping away signatures during transport, as well as securely distributing the root of trust in the first place. This compounds in complexity when mirrors need to host their own distributions, which may not be able to receive upstream index signatures 🙂

None of these are insurmountable problems, but they're ones that we've similarly thought about in terms of PKI/signing designs for PyPI as changing the cost-benefit envelope for index signatures.

Finally, I'll note that artifact transparency in the form of a sumdb is suitable for both of these cases, and has similar security properties: a sumdb can be mirrored or placed in an airgapped environment, with the added benefit (over signatures absent reliable revocation) of being auditable by all parties that depend on it.

@ijackson commented:

I'm sorry to say that I think this is a bad idea.

**Background**

Perhaps I should introduce myself. Most relevantly, I was Chief Cryptographer at the HSM manufacturer nCipher; I've a PhD in Computer Security from Cambridge University; and I'm a longstanding Member (and former Leader) of the Debian Project.

Debian's experience is very relevant here. Debian has for decades operated one of the most successful and resilient code signing systems in existence - the apt repository and package signing system. (To be clear: I did not design or implement that system, although I did have a hand in some of Debian's internal package signing processes.)

**Lack of a solution-neutral problem statement**

The RFC is quite vague about why. It seems to take for granted that we should do some more signing of things, to improve software supply chain integrity presumably, and therefore we should have a "PKI".

The conclusion does not follow from the premise. Code and data signing can be done very successfully without an X.509 PKI. We should start again from scratch with clear objectives.

**Single trust root is wrong**

Traditional "PKI" as proposed here tends to have a single root of trust. As is often the case, that's inappropriate for the Rust Project.

There is no necessary linkage between (for example) code signing keys used to verify downloads by Rustup of new versions of Rust (and self-updates), on the one hand, and (for example) the crates.io index, on the other.

Instead of an X.509-style PKI, with a single cryptographic trust root, the necessary public keys for each role should be embedded in the appropriate software. So, for example, Rustup should contain the public keys with which it verifies the things it downloads. Cargo should contain the crates.io repository public keys.

Linking these use cases (and others) into a single hierarchical structure, is a mistake. (Sharing tooling and workflows is likely to be helpful.)

**Revocation should be done by key lifetimes, rollover and update**

In practice, revocation turns out to be quite rare. Debian has had to do an emergency archive signing key rollover about once.

Revocation checks should be done without online revocation services (eg, OCSP). Instead, if it is necessary to revoke keys, this can be done by promulgating updated relying software: after all, the relying software contains the public keys.

**Avoid X.509**

The X.509 protocols are, frankly, terrible. Many of the libraries for dealing with them are poor too. X.509 has been a constant source of misbehaviours in critical security systems. It is often difficult to persuade X.509 software to apply the correct criteria; most X.509 software is designed to serve the (very poor) web PKI.

Debian and git have had great success with OpenPGP-based signing systems. Rust should choose OpenPGP. Note that we should not be using the OpenPGP "web of trust" trust model; OpenPGP implementations do support simpler trust models.

The once-leading OpenPGP implementation, GnuPG, is rather troublesome, but in the Rust world we could use Sequoia.

**Don't use threshold schemes, do thresholding at the protocol level**

Threshold cryptography is complicated and very limiting. For example, what will we do when we want to move to a postquantum signature scheme? We want a free choice of algorithms.

Doing thresholding at the cryptographic algorithm layer is great for getting papers into crypto conferences (lots of really advanced mathematics, yay) but is only good engineering if you don't control the relying software, so you need to publish just one public key to a naive relying party. That's not our situation: we control and distribute the relying software.

So there is no need for threshold cryptography here. Instead, to implement a k/n control, we can just publish (embed with the relying software distribution) n keys along with an instruction that k signatures are required. That keeps our mathematics as boring as possible (and affords a wider choice of algorithms and HSMs).
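As a concrete illustration of this protocol-level thresholding, here is a minimal sketch assuming the `ed25519-dalek` 2.x API: the relying software embeds the n public keys, each quorum member produces an ordinary signature, and the k-of-n rule is a plain loop rather than threshold cryptography:

```rust
use ed25519_dalek::{Signature, Verifier, VerifyingKey};

/// Embedded trust policy: n public keys shipped with the relying
/// software, of which at least `k` must have signed the artifact.
struct TrustPolicy {
    keys: Vec<VerifyingKey>,
    k: usize,
}

impl TrustPolicy {
    /// Accept `artifact` iff at least k *distinct* embedded keys have a
    /// valid signature over it. Each signature is an ordinary Ed25519
    /// signature; the threshold lives here, at the protocol level.
    fn verify(&self, artifact: &[u8], sigs: &[(usize, Signature)]) -> bool {
        let mut seen = vec![false; self.keys.len()];
        let mut valid = 0;
        for (key_index, sig) in sigs {
            let Some(key) = self.keys.get(*key_index) else { continue };
            if !seen[*key_index] && key.verify(artifact, sig).is_ok() {
                seen[*key_index] = true; // count each key at most once
                valid += 1;
            }
        }
        valid >= self.k
    }
}
```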

**Real improvements**

ISTM that there are indeed real improvements that could be made by making more use of digital signatures:

Right now crates.io relies on the web X.509 PKI for index signing. The web X.509 PKI is very poor. crates.io should be digitally signing the index. That would improve transparency, traceability, and mirrorability.

I don't know precisely how rustup verifies new Rust releases, but IMO that ought also to be done with code signing.

So there is real scope for improvement here, but this RFC is the wrong approach.

@woodruffw commented:

I agree with a lot of the above comment: if you don't need a single root of trust (or threshold signing at the protocol level), your design is better off without it.

I don't, however, agree with the OpenPGP recommendation 🙂 -- OpenPGP implementations share much of the same spotted security history as X.509 implementations do, with weaker algorithmic controls and design decisions that aren't consistent with modern cryptographic best practices (e.g. compression-before-operation, MDC). These don't all necessarily apply to digital signatures, but they are indicative of the compromises OpenPGP makes in practice for legacy support (which, for a greenfield design, should not be a significant priority for Rust).

(Some of this is ameliorated by better implementations, like Sequoia. But some of it is baked into OpenPGP-the-standard or, worse, impossible to change because GnuPG is the 800lb gorilla in the room. The recent fracas with LibrePGP is broadly indicative of the PGP ecosystem's practical inability to modernize and discharge unsafe defaults.)

Some supporting resources from well-regarded sources:

(Note the general age on these: the industry/security consensus against PGP in greenfield designs is well established. That consensus is also why git and other applications have (slowly) moved towards SSH-based signing, minisign, age, etc. as needed.)

TL;DR: I agree that this RFC could use a better problem statement, and that an improved problem statement may reveal an architecture better than a PKI here (certainly for operational reasons that I mentioned in an earlier comment). But IMO it would be a mistake to build any subsequent architecture on any variant of PGP.

@bjorn3 (Member) commented Mar 21, 2024:

> Revocation should be done by key lifetimes, rollover and update

For crates.io, requiring a cargo update after key rotation is not an option. To be able to bootstrap rustc, cargo versions that are many years old need to keep the ability to connect to crates.io. And users may want to keep using an older rustc (and thus cargo) version for other reasons, including but not limited to bisecting regressions, using a distro-provided rustc (which is almost always very outdated), or for any other reason. This needs either a way to override the key using a config option (ideally only necessary in case of a compromised key) or signing the new key with the old key.

> The once-leading OpenPGP implementation, GnuPG, is rather troublesome, but in the Rust world we could use Sequoia.

Sequoia stopped verifying rust releases on 2023-01-02 (yes, they checked the system time), causing everyone to get a warning that the signature verification failed. Luckily the verification logic was still considered experimental so it didn't actually break anyone. See rust-lang/rustup#3186. The entire verification logic was removed in rust-lang/rustup#3277. This RFC is supposed to create the foundations for a replacement of the check.
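On the "signing the new key with the old key" option above: a minimal sketch, again assuming the `ed25519-dalek` 2.x API, of how an old cargo that only knows the old root could extend trust to a rotated key via a signed rotation statement:

```rust
use ed25519_dalek::{Signature, Verifier, VerifyingKey};

/// A rotation statement: the raw bytes of the new root key, signed by
/// the old root key. An old client that only trusts `old_root` can use
/// this to accept the new key without shipping a new binary.
fn accept_rotation(
    old_root: &VerifyingKey,
    new_key_bytes: &[u8; 32],
    rotation_sig: &Signature,
) -> Option<VerifyingKey> {
    // The old key must vouch for the exact bytes of the new key.
    old_root.verify(new_key_bytes, rotation_sig).ok()?;
    VerifyingKey::from_bytes(new_key_bytes).ok()
}
```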
