Add a periodic test of the autoseal to detect loss of connectivity. #13078

Merged
sgmiller merged 17 commits into main from autoseal-health-check on Nov 10, 2021

@sgmiller sgmiller commented Nov 8, 2021

This allows operators to spot problems with connections to their HSMs and
KMSes that would otherwise go unnoticed until the next unseal, where time is
critical.

@@ -18,16 +21,20 @@ import (
// applicable in the OSS side
var barrierTypeUpgradeCheck = func(_ string, _ *SealConfig) {}

const sealHeathTestInterval = 1 * time.Minute
Contributor

Didn't we want to check this hourly?

Collaborator Author

Yeah, I'm back and forth on that. Since it's a cheap encrypt/decrypt I coded it to be more frequent. Open to thoughts.

Contributor

I'm assuming this is being done on all auto-unseal implementations, not just pkcs11 implementations. If that is the case, could the costs associated with using the various cloud KMS providers start adding up? I don't have a good sense of those costs across the various implementations.

Contributor

sealHeathTestInterval - should this be sealHealthTestInterval?

Collaborator Author

It sure should.

for {
	select {
	case <-d.healthCheckStop:
		d.healthCheck.Stop()
Contributor

I'm probably being paranoid here, but if we get multiple healthCheckStop signals we will get a nil dereference error. Is it worth adding a quick check and returning if d.healthCheck is nil?

Contributor

nvm the case statement will never fire.

Collaborator Author

yes it is, thanks.

Contributor

@stevendpclark stevendpclark left a comment

👍 Just review the seal test interval to confirm we really want to run this every minute.

Collaborator

@ncabatoff ncabatoff left a comment

Sorry, I somehow failed to submit these comments yesterday.

	lastTestOk = false
	d.core.MetricSink().SetGauge(autoSealUnavailableDuration, 0)
} else {
	plaintext, err := d.Access.Decrypt(ctx, ciphertext, nil)
Collaborator

Same timeout (shared) across both Encrypt/Decrypt, or should we have a 1m timeout for each?

ciphertext, err := d.Access.Encrypt(ctx, []byte(testVal), nil)

if err != nil {
	fail("failed to encrypt seal health test value, seal backend may be unreachable", "error", err)
Collaborator

Now that we're in a func we could do an early return instead of an else.

@sgmiller sgmiller merged commit 87c2b1a into main Nov 10, 2021
@sgmiller sgmiller deleted the autoseal-health-check branch November 10, 2021 20:46
qk4l pushed a commit to qk4l/vault that referenced this pull request Feb 4, 2022
…ashicorp#13078)

* Add a periodic test of the autoseal to detect loss of connectivity

* Keep the logic adjacent to autoseal

* imports

* typo, plus unnecessary constant time compare

* changelog

* pr feedback

* More feedback

* Add locking and a unit test

* unnecessary

* Add timeouts to encrypt/decrypt operations, capture activeContext before starting loop

* Add a block scope for the timeout

* copy/paste ftl

* Refactor to use two timeouts, and cleanup the repetitive failure code

* Readd 0ing gauge

* use millis

* Invert the unit test logic

heliobmartins commented May 28, 2022

Hello @sgmiller,

I was wondering if you could please clarify a quick question I have around this PR (which is a great feature by the way).

From my understanding, the only way to verify whether the seal backend is healthy is to check the Vault logs. Do you know if there is any API that we could call to get the status of the seal backend without having to check the logs?

If not, do you reckon that maybe v1/sys/seal-status could be an endpoint that could return the health status of our seal backend?
