Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PKI - Fix order of chain building writes #17772

Merged
merged 5 commits into from Nov 3, 2022

Conversation

cipherboy
Copy link
Contributor

Probably best to defer to the commit messages here:

commit 8010a8755755afd19d5a62fd5073acb2fea1e1ba
Author: Alexander Scheel <alex.scheel@hashicorp.com>
Date:   Wed Nov 2 08:59:44 2022 -0400

    Remigrate ca_chains to fix any missing issuers
    
    In the previous commit, we identified an issue that would occur on
    legacy issuer migration to the new storage format. This is easy enough
    to detect for any given mount (by an operator), but automating scanning
    and remediating all PKI mounts in large deployments might be difficult.
    
    Write a new storage migration version to regenerate all chains on
    upgrade, once.
    
    Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>

commit 8a786b0c57c27a6e3dc1d06edacd03672d8f8bc9
Author: Alexander Scheel <alex.scheel@hashicorp.com>
Date:   Wed Nov 2 08:12:34 2022 -0400

    Ensure correct write ordering in rebuildIssuersChains
    
    When troubleshooting a recent migration failure from 1.10->1.11, it was
    noted that some PKI mounts had bad chain construction despite having
    valid, chaining issuers. Due to the cluster's leadership trashing
    between nodes, the migration logic was re-executed several times,
    partially succeeding each time. While the legacy CA bundle migration
    logic was written with this in mind, one shortcoming in the chain
    building code lead us to truncate the ca_chain: by sorting the list of
    issuers after including non-written issuers (with random IDs), these
    issuers would occasionally be persisted prior to storage _prior_ to
    existing CAs with modified chains.
    
    The migration code carefully imported the active issuer prior to its
    parents. However, due to this bug, there was a chance that, if write to
    the pending parent succeeded but updating the active issuer didn't, the
    active issuer's ca_chain field would only contain the self-reference and
    not the parent's reference as well. Ultimately, a workaround of setting
    and subsequently unsetting a manual chain would force a chain
    regeneration.
    
    In this patch, we simply fix the write ordering: because we need to
    ensure a stable chain sorting, we leave the sort location in the same
    place, but delay writing the provided referenceCert to the last
    position. This is because the reference is meant to be the user-facing
    action: without transactional write capabilities, other chains may
    succeed, but if the last user-facing action fails, the user will
    hopefully retry the action. This will also correct migration, by
    ensuring the subsequent issuer import will be attempted again,
    triggering another chain build and only persisting this issuer when
    all other issuers have also been updated.
    
    Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>

As noted above, patching the issuer to set manual_chain=self followed by manual_chain= will be an adequate workaround.

@cipherboy cipherboy added bug Used to indicate a potential bug secret/pki backport/1.11.x labels Nov 2, 2022
@cipherboy cipherboy added this to the 1.13.0-rc1 milestone Nov 2, 2022
Copy link
Contributor

@stevendpclark stevendpclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

builtin/logical/pki/chain_util.go Outdated Show resolved Hide resolved
When troubleshooting a recent migration failure from 1.10->1.11, it was
noted that some PKI mounts had bad chain construction despite having
valid, chaining issuers. Due to the cluster's leadership trashing
between nodes, the migration logic was re-executed several times,
partially succeeding each time. While the legacy CA bundle migration
logic was written with this in mind, one shortcoming in the chain
building code lead us to truncate the ca_chain: by sorting the list of
issuers after including non-written issuers (with random IDs), these
issuers would occasionally be persisted prior to storage _prior_ to
existing CAs with modified chains.

The migration code carefully imported the active issuer prior to its
parents. However, due to this bug, there was a chance that, if write to
the pending parent succeeded but updating the active issuer didn't, the
active issuer's ca_chain field would only contain the self-reference and
not the parent's reference as well. Ultimately, a workaround of setting
and subsequently unsetting a manual chain would force a chain
regeneration.

In this patch, we simply fix the write ordering: because we need to
ensure a stable chain sorting, we leave the sort location in the same
place, but delay writing the provided referenceCert to the last
position. This is because the reference is meant to be the user-facing
action: without transactional write capabilities, other chains may
succeed, but if the last user-facing action fails, the user will
hopefully retry the action. This will also correct migration, by
ensuring the subsequent issuer import will be attempted again,
triggering another chain build and only persisting this issuer when
all other issuers have also been updated.

Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
In the previous commit, we identified an issue that would occur on
legacy issuer migration to the new storage format. This is easy enough
to detect for any given mount (by an operator), but automating scanning
and remediating all PKI mounts in large deployments might be difficult.

Write a new storage migration version to regenerate all chains on
upgrade, once.

Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
@cipherboy cipherboy force-pushed the cipherboy-fix-order-of-chain-building-writes branch from e0acf81 to 6c0a7de Compare November 2, 2022 13:52
Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
@cipherboy cipherboy force-pushed the cipherboy-fix-order-of-chain-building-writes branch from 6c0a7de to b964e77 Compare November 2, 2022 14:08
Copy link
Contributor

@kitography kitography left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really great.

@@ -43,7 +43,7 @@ func (sc *storageContext) rebuildIssuersChains(referenceCert *issuerEntry /* opt
// To begin, we fetch all known issuers from disk.
issuers, err := sc.listIssuers()
if err != nil {
return fmt.Errorf("unable to list issuers to build chain: %v", err)
return fmt.Errorf("unable to list issuers to build chain: %w", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice fix, thank you

// Ensure the chains are correct on the intermediate. By using the
// issuerId saved above, this ensures we didn't change any issuerIds,
// we merely updated the existing issuers.
intIssuer, err := sc.fetchIssuerById(intIssuerId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really well done, thank you!

@cipherboy
Copy link
Contributor Author

Thank you both! Merging...

@cipherboy cipherboy merged commit 866a47d into main Nov 3, 2022
@cipherboy cipherboy deleted the cipherboy-fix-order-of-chain-building-writes branch December 1, 2022 14:56
jayant07-yb pushed a commit to jayant07-yb/hashicorp-vault-integrations that referenced this pull request Mar 15, 2023
* Ensure correct write ordering in rebuildIssuersChains

When troubleshooting a recent migration failure from 1.10->1.11, it was
noted that some PKI mounts had bad chain construction despite having
valid, chaining issuers. Due to the cluster's leadership trashing
between nodes, the migration logic was re-executed several times,
partially succeeding each time. While the legacy CA bundle migration
logic was written with this in mind, one shortcoming in the chain
building code lead us to truncate the ca_chain: by sorting the list of
issuers after including non-written issuers (with random IDs), these
issuers would occasionally be persisted prior to storage _prior_ to
existing CAs with modified chains.

The migration code carefully imported the active issuer prior to its
parents. However, due to this bug, there was a chance that, if write to
the pending parent succeeded but updating the active issuer didn't, the
active issuer's ca_chain field would only contain the self-reference and
not the parent's reference as well. Ultimately, a workaround of setting
and subsequently unsetting a manual chain would force a chain
regeneration.

In this patch, we simply fix the write ordering: because we need to
ensure a stable chain sorting, we leave the sort location in the same
place, but delay writing the provided referenceCert to the last
position. This is because the reference is meant to be the user-facing
action: without transactional write capabilities, other chains may
succeed, but if the last user-facing action fails, the user will
hopefully retry the action. This will also correct migration, by
ensuring the subsequent issuer import will be attempted again,
triggering another chain build and only persisting this issuer when
all other issuers have also been updated.

Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>

* Remigrate ca_chains to fix any missing issuers

In the previous commit, we identified an issue that would occur on
legacy issuer migration to the new storage format. This is easy enough
to detect for any given mount (by an operator), but automating scanning
and remediating all PKI mounts in large deployments might be difficult.

Write a new storage migration version to regenerate all chains on
upgrade, once.

Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>

* Add changelog entry

Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>

* Add issue to PKI considerations documentation

Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>

* Correct %v -> %w in chain building errs

Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>

Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Used to indicate a potential bug secret/pki
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants