Nomad with Consul Connect Service Mesh randomly fails with a bootstrap error #20516
Comments
I've observed this as well.. more information in support case #147547
got a link?
sorry, I reported this to Hashicorp's Enterprise support portal, behind auth. They are now tracking it internally, but public issues help raise awareness.. and might provide some restored sanity for anyone else experiencing this issue as well.
++ also seeing this fairly repeatedly, unsure if the same cause. @ryanhulet @colinbruner - does restarting Consul fix this for you (temporarily at least)?
Sometimes. I finally gave up and destroyed my entire cluster and brought up a new one on 1.6.1 |
Possibly related to #20185? I wonder if the Consul sync is getting stuck because it doesn't have tokens to deregister services - exiting early - and therefore never registering newly created services, meaning that Envoy can't bootstrap?
Not sure why it would have gotten worse in newer releases, though. We also see it much more often recently (on 1.7.5 currently), but that could just be because we see more deploys now.
As far as I can tell it is 100% related to workload identities; 1.6.1 doesn't use them, and I haven't had that error a single time since downgrading.
this is interesting.. the consul acl stanza settings the quoted issue lists in step #1 do match my consul acl configuration:

```hcl
acl = {
  enabled                  = true
  default_policy           = "deny"
  enable_token_persistence = true
}
```

I wonder if setting `enable_token_persistence = false` would help.
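For reference, the change being discussed in this thread would look like the following. This is an illustration of the suggestion only, not a confirmed fix:

```hcl
acl = {
  enabled                  = true
  default_policy           = "deny"
  # Disable persistence so the agent re-reads its token from config/Vault
  # on restart instead of reusing a possibly-stale persisted token.
  enable_token_persistence = false
}
```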
Yeah I'm also wondering if disabling persistence will fix it too, was also suggested by support on my ticket. We're (obviously) not going in and manually deleting the ACL tokens like the original issue repro, but if it's getting stuck with some old expired token because it's been persisted then maybe we could end up with similar behaviour. In our case we are reloading Consul once per day (with a SIGHUP) when we rotate its agent token - which may also be a source for this, not sure if you're doing similar?
similar, but less frequent. We have consul-template generating a Consul ACL token from Vault's Consul secret engine.. this is regenerated and a SIGHUP is issued at the half-life, which I believe to be ~16 days. However, I've seen this on Nomad clients that have only been around for a few hours.. so not necessarily correlated to after the token is rolled.
I'm going to do this in a non-prod environment today, will report back if I'm still seeing errors after the change is made.
@ryanhulet @t-davies upon the advice of Hashicorp support, I set `enable_token_persistence = false`. Will continue to monitor and update if this is not a full mitigation of the error.
Thanks, yeah we are in a similar situation.
I've been seeing this problem in a variety of forms for a while now, possibly since I started using Workload Identities. Many times, I see entries like this in my logs:

So it looks like Nomad is failing to bootstrap Envoy because there's something else that's failing to get deregistered. I have an AWFUL Nomad job that scans through Consul's logs for those errors and uses a management token to do the deregistration. Here's a slimmed down version of it:
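The job itself wasn't preserved above, so purely as an illustrative sketch of the workaround described (scan Consul's logs for failed deregistrations, then force-deregister those services with a management token), something like the following could do it. The log pattern is an assumption and will need adjusting to your actual Consul log format; the agent API endpoint used (`PUT /v1/agent/service/deregister/<service-id>`) is a real Consul endpoint:

```python
import re
import urllib.request

# Hypothetical log pattern: adjust to match the exact error lines your
# Consul agent emits when it fails to deregister a service.
DEREG_FAIL = re.compile(r'Deregistering service failed.*?service=([\w./:-]+)')

def failed_service_ids(log_lines):
    """Extract service IDs from Consul log lines about failed deregistrations."""
    ids = []
    for line in log_lines:
        m = DEREG_FAIL.search(line)
        if m:
            ids.append(m.group(1))
    return ids

def deregister(service_id, consul_addr="http://127.0.0.1:8500", token=""):
    """Force-deregister a service via Consul's agent API with a management token."""
    req = urllib.request.Request(
        f"{consul_addr}/v1/agent/service/deregister/{service_id}",
        method="PUT",
        headers={"X-Consul-Token": token},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Usage sketch (run periodically, e.g. as a batch/periodic Nomad job):
#   for sid in failed_service_ids(open("/var/log/consul.log")):
#       deregister(sid, token=MANAGEMENT_TOKEN)
```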
You're quite right. In our case we use Vault agent to rotate the token and store it in the config file, so we don't really need token persistence anyway, since Consul will just read it from the file if it restarts.
Nomad version
Operating system and Environment details
Issue
Ever since upgrading Nomad to 1.7.7 and Consul to 1.18.1, Nomad jobs will fail repeatedly with the error:
envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1; see: <https://developer.hashicorp.com/nomad/s/envoy-bootstrap-error>
This will happen multiple times until eventually the job schedules and starts correctly. This results in multiple jobs stuck in pending for hours on end because of rescheduling backoff.
Reproduction steps
Literally run a job similar to the one attached, using Consul with auto-config and mTLS, and Envoy version 1.27.3. I am also using Vault 1.15.4 for secrets, and Nomad is using Vault and Consul workload identities.
Expected Result
A job to schedule and register with Consul Connect successfully
Actual Result
REPEATED failures with the envoy_bootstrap error
Job file (if appropriate)
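The original job file was not included; purely as an illustration of the shape described in the reproduction steps (a Connect-enabled service with an Envoy sidecar), a minimal job might look like this. All names, images, and ports are placeholders:

```hcl
job "web" {
  datacenters = ["dc1"]

  group "web" {
    network {
      mode = "bridge"
    }

    service {
      name = "web"
      port = "8080"

      connect {
        # Requests a Consul Connect sidecar; Nomad bootstraps Envoy for it,
        # which is the step that fails with envoy_bootstrap errors.
        sidecar_service {}
      }
    }

    task "web" {
      driver = "docker"
      config {
        image = "hashicorp/http-echo"
        args  = ["-listen", ":8080", "-text", "hello"]
      }
    }
  }
}
```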
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)