Nomad with Consul Connect Service Mesh randomly fails with a bootstrap error #20516

Open
ryanhulet opened this issue May 2, 2024 · 14 comments

@ryanhulet

Nomad version

Nomad v1.7.7
BuildDate 2024-04-16T19:26:43Z
Revision 0f34c85ee63f6472bd2db1e2487611f4b176c70c

Operating system and Environment details

NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"

Issue

Ever since upgrading Nomad to 1.7.7 and Consul to 1.18.1, Nomad jobs fail repeatedly with the error:
envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1; see: <https://developer.hashicorp.com/nomad/s/envoy-bootstrap-error>
This happens multiple times until the job eventually schedules and starts correctly, which leaves multiple jobs stuck in pending for hours on end because of rescheduling backoff.
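The client logs below show the sidecar task exhausting the default restart policy (attempts = 2 in a 30m interval with mode "fail"). As a stopgap only, not a fix, a more forgiving group-level restart block keeps the allocation retrying in place instead of falling into rescheduling backoff; the values here are purely illustrative:

  group "landscape" {
    # Stopgap while the root cause is investigated: keep retrying the failing
    # envoy_bootstrap hook instead of failing the allocation (illustrative values).
    restart {
      attempts = 10
      interval = "30m"
      delay    = "15s"
      mode     = "delay"
    }
  }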

Reproduction steps

Literally run a job similar to the one attached, using Consul with auto-config and mTLS and Envoy 1.27.3. I am also using Vault 1.15.4 for secrets, and Nomad is using Vault and Consul workload identities.
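For context, a minimal sketch of the kind of Nomad server consul block used for Consul workload identities (the address, audience, and TTL values here are placeholders, not the exact configuration from this cluster):

consul {
  address = "127.0.0.1:8500"

  # Default workload identities that workloads exchange with Consul's
  # auth method for ACL tokens (placeholder values).
  service_identity {
    aud = ["consul.io"]
    ttl = "1h"
  }

  task_identity {
    aud = ["consul.io"]
    ttl = "1h"
  }
}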

Expected Result

A job to schedule and register with Consul Connect successfully

Actual Result

REPEATED failures with the envoy_bootstrap error

Job file (if appropriate)

job "canonical-landscape" {
  datacenters = ["<DATACENTER>"]

  namespace = "ops"

  type = "service"

  vault {}
  
  affinity {
    attribute = "${node.unique.name}"
    operator = "set_contains_any"
    value = "<HOSTNAMES>"
    weight = -50
  }

  group "landscape" {

    volume "canonical-landscape" {
      type      = "host"
      read_only = false
      source    = "canonical-landscape"
    }

    volume "gpgs_keys" {
      type      = "host"
      read_only = false
      source    = "gpgs_keys"
    }

    volume "landscape-repository" {
      type      = "host"
      read_only = false
      source    = "landscape-repository"
    }

    network {
      mode = "bridge"

      port "landscape" {
        to = 443
      }

      port "landscape80" {
        to = 80
      }
    }

    service {
      name = "landscape"
      port = "landscape"
      address_mode = "alloc"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "landscape-postgres"
              local_bind_port  = 5432
            }
            upstreams {
              destination_name = "landscape-rabbitmq"
              local_bind_port  = 5672
            }
          }
        }
      }
    }

    service {
      name = "landscape80"
      port = "landscape80"
      address_mode = "alloc"

      connect {
        sidecar_service {}
      }
    }

    task "landscape" {

      driver = "docker"

      resources {
        memory = 4096
      }

      template {
        data = <<EOH
[stores]
user = landscape
password = {{- with secret "nomad-kv/default/canonical-landscape" }}{{- .Data.data.landscape_pw -}}{{ end }}
host = 127.0.0.1
main = landscape-standalone-main
account-1 = landscape-standalone-account-1
resource-1 = landscape-standalone-resource-1
package = landscape-standalone-package
session = landscape-standalone-session
session-autocommit = landscape-standalone-session?isolation=autocommit
knowledge = landscape-standalone-knowledge

[global]
oops-path = /var/lib/landscape/landscape-oops
# Syslog address is either a socket path or server:port string.
syslog-address = /dev/log
deployment-mode = standalone

[broker]
port = 5672
host = 127.0.0.1 
user = landscape 
password = {{- with secret "nomad-kv/default/canonical-landscape" }}{{- .Data.data.rmq_pw -}}{{ end }}
vhost = landscape 
hostagent_virtual_host = landscape-hostagent
hostagent_task_queue = landscape-server-hostagent-task-queue

[landscape]
base-port = 8080
threads = 8
stores = main account-1 resource-1 package session session-autocommit knowledge
oops-key = DF%(instance-letter)s
#soft-timeout = 5
mailer = queue
mailer-path = /var/lib/landscape/landscape-mail-queue
reprepro-binary = /opt/canonical/landscape/scripts/reprepro-wrapper.sh
repository-path = /var/lib/landscape/landscape-repository

[message-server]
base-port = 8090
threads = 8
stores = main account-1 resource-1 package
oops-key = DM%(instance-letter)s
#soft-timeout = 5
#ping-interval = 30

[load-shaper]
critical-load = 10.0
good-load = 3.0
good-duration = 60.0

[pingserver]
base-port = 8070
stores = main account-1 resource-1
threads = 2

[job-handler]
stores = main account-1 resource-1 package
threads = 10
mailer = queue
mailer-path = /var/lib/landscape/landscape-mail-queue

[async-frontend]
base-port = 9090

[api]
stores = main account-1 resource-1 package
threads = 10
base-port = 9080
mailer = queue
mailer-path = /var/lib/landscape/landscape-mail-queue

[maintenance]
threads = 1
stores = main account-1 resource-1 package session session-autocommit knowledge
mailer = queue
mailer-path = /var/lib/landscape/landscape-mail-queue

[schema]
threads = 1
stores = main account-1 resource-1 package session knowledge
store_user = landscape_superuser
store_password = {{- with secret "nomad-kv/default/canonical-landscape" }}{{- .Data.data.postgres_pw -}}{{ end }}

[scripts]
threads = 1
stores = main account-1 resource-1 package session knowledge
mailer = queue
mailer-path = /var/lib/landscape/landscape-mail-queue

[package-upload]
stores = main account-1
threads = 10
port = 9100
root-url = http://localhost:9100
mailer = queue
mailer-path = /var/lib/landscape/landscape-mail-queue

[package-search]
port = 9099
stores = main package resource-1
pid-path = /var/run/landscape/landscape-package-search-1.pid
account-threshold = 0
EOH
        // Setting destination to secrets makes this file inaccessible from the host machine
        destination = "secrets/service.conf"
        change_mode = "noop"
      }

      template {
        data = <<EOH
{{- with secret "/nomad-kv/svwi-certs" -}}{{- .Data.data.cert }}{{- end }}
EOH

        destination = "local/svwi.crt"
        change_mode = "restart"
      }

      template {
        data = <<EOH
{{- with secret "/nomad-kv/svwi-certs" -}}{{- .Data.data.key }}{{- end }}
EOH

        destination = "secrets/svwi.key"
        change_mode = "restart"
      }

      template {
        data = <<EOH
{{- with secret "/nomad-kv/default/canonical-landscape" -}}{{- .Data.data.license }}{{- end }}
EOH

        destination = "secrets/license.txt"
        change_mode = "restart"
      }

      template {
        data = <<EOH
[smtp.sendgrid.net]:587 apikey:{{- with secret "/nomad-kv/default/canonical-landscape" -}}{{- .Data.data.sendgrid_pw }}{{- end }}
EOH
        destination = "secrets/postfix"
        change_mode = "restart"
      }

      template {
        data = <<EOH
DB_HOST=127.0.0.1
RMQ_HOST=127.0.0.1
DB_PORT=5432
DB_USER=landscape_superuser
DB_LANDSCAPE_USER=landscape
DB_NAME=postgres
INITIALIZE_SCHEMA=yes
EOH
        // Setting destination to local makes this file accessible from the host machine
        destination = "local/lds.env"
        env = true
        change_mode = "noop"
      }

      template {
        data = <<EOH
DB_LANDSCAPE_PASS="{{- with secret "nomad-kv/default/canonical-landscape" }}{{- .Data.data.landscape_pw -}}{{ end }}"
GCP_TOKEN="{{- with secret "/gcp/roleset/<ROLE>/token" }}{{- .Data.token | regexReplaceAll "\\.{3,}" "" }}{{- end }}"
EOH
        destination = "secrets/postgres.env"
        env = true
        change_mode = "noop"
      }

      volume_mount {
        volume      = "canonical-landscape"
        destination = "/var/lib/landscape/hash-id-databases"
        read_only   = false
      }

      volume_mount {
        volume      = "gpgs_keys"
        destination = "/root/.gnupg"
        read_only   = false
      }

      volume_mount {
        volume      = "landscape-repository"
        destination = "/var/lib/landscape/landscape-repository"
        read_only   = false
      }

      config {
        image = "canonical-landscape/app-server"

        volumes = [
          "/etc/ssl/certs:/etc/ssl/certs",
          "secrets/service.conf:/etc/landscape/service.conf",
          "secrets/license.txt:/etc/landscape/license.txt",
          "secrets/postfix:/etc/postfix/sasl_passwd"
          //
        ]

        ports = ["landscape","landscape80"]
      }
    }
  }

  group "postgres" {

    volume "landscape-postgres" {
      type      = "host"
      read_only = false
      source    = "landscape-postgres"
    }

    network {
      mode = "bridge"

      port "postgres" {
        to = 5432
      }
    }

    service {
      name = "landscape-postgres"
      port = "postgres"
      address_mode = "alloc"

      connect {
        sidecar_service {}

      }
    }

    task "postgres" {

      driver = "docker"

      resources {
        memory     = 2048
        memory_max = 4096
      }

      template {
        data = <<EOH
# TYPE  DATABASE        USER            ADDRESS                 METHOD

# "local" is for Unix domain socket connections only
local   all             all                                     trust
# IPv4 local connections:
host    all             all             127.0.0.1/32            trust
# IPv6 local connections:
host    all             all             ::1/128                 trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
local   replication     all                                     trust
host    replication     all             127.0.0.1/32            trust
host    replication     all             ::1/128                 trust
EOH
        // Setting destination to secrets makes this file inaccessible from the host machine
        destination = "local/pg_hba.conf"
        change_mode = "noop"
      }

      template {
        data = <<EOH
listen_addresses = '127.0.0.1'
max_connections = 100                   # (change requires restart)
max_prepared_transactions = 100
shared_buffers = 128MB                  # min 128kB
dynamic_shared_memory_type = posix      # the default is the first option
max_wal_size = 1GB
min_wal_size = 80MB
log_timezone = 'Etc/UTC'
datestyle = 'iso, mdy'
timezone = 'Etc/UTC'
lc_messages = 'en_US.utf8'                      # locale for system error message
lc_monetary = 'en_US.utf8'                      # locale for monetary formatting
lc_numeric = 'en_US.utf8'                       # locale for number formatting
lc_time = 'en_US.utf8'                          # locale for time formatting
default_text_search_config = 'pg_catalog.english'
EOH
        // Setting destination to secrets makes this file inaccessible from the host machine
        destination = "local/postgresql.conf"
        change_mode = "noop"
      }

      template {
        data = <<EOH
POSTGRES_PASSWORD="{{- with secret "nomad-kv/default/canonical-landscape" }}{{- .Data.data.postgres_pw -}}{{ end }}"        
SUPERUSER_PW="{{- with secret "nomad-kv/default/canonical-landscape" }}{{- .Data.data.postgres_pw -}}{{ end }}"
GCP_TOKEN="{{- with secret "/gcp/roleset/<ROLE>/token" }}{{- .Data.token | regexReplaceAll "\\.{3,}" "" }}{{- end }}"
EOH
        destination = "secrets/postgres.env"
        env = true
        change_mode = "noop"

      }

      volume_mount {
        volume      = "landscape-postgres"
        destination = "/var/lib/postgresql/data"
        read_only   = false
      }

      config {
        image = "postgres"

        volumes = [
          "/etc/ssl/certs:/etc/ssl/certs",
          // Remove these volumes if the db is being initialized, readd after persistent data is created
          //
        ]

        ports = ["postgres"]
      }
    }
  }

  group "rebbitmq" {


    network {
      mode = "bridge"

      port "rabbitmq" {
        to = 5672
      }
    }

    service {
      name = "landscape-rabbitmq"
      port = "rabbitmq"
      address_mode = "alloc"

      connect {
        sidecar_service {}

      }
    }

    task "rabbitmq" {

      driver = "docker"

      template {
        data = <<EOH
RABBITMQ_DEFAULT_USER=landscape
RABBITMQ_NODE_IP_ADDRESS=127.0.0.1
RABBITMQ_DEFAULT_VHOST=landscape
EOH
        destination = "local/rabbit.env"
        env = true
        change_mode = "noop"
      }

      template {
        data = <<EOH
RABBITMQ_DEFAULT_PASS="{{- with secret "nomad-kv/default/canonical-landscape" }}{{- .Data.data.rmq_pw -}}{{ end }}"
EOH
        destination = "secrets/rabbit.env"
        env = true
        change_mode = "noop"
      }

      config {
        image = "rabbitmq:3.12.7-management"

        volumes = [
          "/etc/ssl/certs:/etc/ssl/certs",
        ]

        ports = ["rabbitmq"]
      }
    }
  }
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

May  2 14:07:02 nomadrunner2 consul[3179549]: 2024-05-02T14:07:02.279Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=10.138.0.12:7300 error="rpc error making call: token does not exist: ACL not found"
May  2 14:07:02 nomadrunner2 consul[3179549]: 2024-05-02T14:07:02.279Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-7c5db155-9feb-ddb7-e4b8-fcc01f74d3e8-group-record_device_status-record-device-status- from=[::1]:60828 error="ACL not found"
May  2 14:07:27 nomadrunner2 nomad[372726]:     2024-05-02T14:07:27.544Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres type="Task hook failed" msg="envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1; see: <https://developer.hashicorp.com/nomad/s/envoy-bootstrap-error>" failed=false
May  2 14:07:27 nomadrunner2 nomad[372726]:     2024-05-02T14:07:27.602Z [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres error="prestart hook \"envoy_bootstrap\" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1; see: <https://developer.hashicorp.com/nomad/s/envoy-bootstrap-error>"
May  2 14:07:27 nomadrunner2 nomad[372726]:     2024-05-02T14:07:27.602Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres reason="Restart within policy" delay=16.651674377s
May  2 14:07:27 nomadrunner2 nomad[372726]:     2024-05-02T14:07:27.602Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres type=Restarting msg="Task restarting in 16.651674377s" failed=false
May  2 14:07:32 nomadrunner2 consul[3179549]: 2024-05-02T14:07:32.281Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=10.10.5.104:7300 error="rpc error making call: token does not exist: ACL not found"
May  2 14:07:32 nomadrunner2 consul[3179549]: 2024-05-02T14:07:32.281Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-7c5db155-9feb-ddb7-e4b8-fcc01f74d3e8-group-record_device_status-record-device-status- from=[::1]:60828 error="ACL not found"
May  2 14:07:44 nomadrunner2 nomad[372726]:     2024-05-02T14:07:44.346Z [INFO]  client.alloc_runner.task_runner.task_hook.envoy_bootstrap: bootstrapping envoy: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres namespace="" proxy_id=_nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy service=landscape-postgres gateway="" bootstrap_file=/opt/nomad/alloc/83f0ace8-c38a-d642-b902-fac10a19e252/connect-proxy-landscape-postgres/secrets/envoy_bootstrap.json grpc_addr=unix://alloc/tmp/consul_grpc.sock admin_bind=127.0.0.2:19001 ready_bind=127.0.0.1:19101
May  2 14:07:44 nomadrunner2 consul[3179549]: 2024-05-02T14:07:44.467Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/service/_nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy from=[::1]:57568 error="unknown service ID: _nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy"
May  2 14:07:46 nomadrunner2 consul[3179549]: 2024-05-02T14:07:46.035Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/service/_nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy from=[::1]:57570 error="unknown service ID: _nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy"
May  2 14:07:46 nomadrunner2 nomad[372726]:     2024-05-02T14:07:46.818Z [INFO]  client.gc: garbage collecting allocation: alloc_id=9df3c3a3-faa0-d023-f0e2-3addfd778f51 reason="forced collection"
May  2 14:07:46 nomadrunner2 nomad[372726]:     2024-05-02T14:07:46.926Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=9df3c3a3-faa0-d023-f0e2-3addfd778f51 task=postgres type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
May  2 14:07:47 nomadrunner2 nomad[372726]:     2024-05-02T14:07:47.117Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=9df3c3a3-faa0-d023-f0e2-3addfd778f51 task=connect-proxy-landscape-postgres type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
May  2 14:07:47 nomadrunner2 systemd[1]: opt-nomad-alloc-9df3c3a3\x2dfaa0\x2dd023\x2df0e2\x2d3addfd778f51-connect\x2dproxy\x2dlandscape\x2dpostgres-private.mount: Deactivated successfully.
May  2 14:07:47 nomadrunner2 systemd[1]: opt-nomad-alloc-9df3c3a3\x2dfaa0\x2dd023\x2df0e2\x2d3addfd778f51-connect\x2dproxy\x2dlandscape\x2dpostgres-secrets.mount: Deactivated successfully.
May  2 14:07:48 nomadrunner2 consul[3179549]: 2024-05-02T14:07:48.253Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/service/_nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy from=[::1]:57578 error="unknown service ID: _nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy"
May  2 14:07:52 nomadrunner2 consul[3179549]: 2024-05-02T14:07:52.676Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/service/_nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy from=[::1]:34052 error="unknown service ID: _nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy"
May  2 14:08:00 nomadrunner2 consul[3179549]: 2024-05-02T14:08:00.801Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/service/_nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy from=[::1]:58406 error="unknown service ID: _nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy"
May  2 14:08:02 nomadrunner2 consul[3179549]: 2024-05-02T14:08:02.373Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=10.138.0.8:7300 error="rpc error making call: token does not exist: ACL not found"
May  2 14:08:02 nomadrunner2 consul[3179549]: 2024-05-02T14:08:02.373Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-7c5db155-9feb-ddb7-e4b8-fcc01f74d3e8-group-record_device_status-record-device-status- from=[::1]:60828 error="ACL not found"
May  2 14:08:17 nomadrunner2 consul[3179549]: 2024-05-02T14:08:17.113Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/service/_nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy from=[::1]:34180 error="unknown service ID: _nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy"
May  2 14:08:32 nomadrunner2 consul[3179549]: 2024-05-02T14:08:32.465Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=10.138.0.12:7300 error="rpc error making call: token does not exist: ACL not found"
May  2 14:08:32 nomadrunner2 consul[3179549]: 2024-05-02T14:08:32.465Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-7c5db155-9feb-ddb7-e4b8-fcc01f74d3e8-group-record_device_status-record-device-status- from=[::1]:60828 error="ACL not found"
May  2 14:08:44 nomadrunner2 nomad[372726]:     2024-05-02T14:08:44.982Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres type="Task hook failed" msg="envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1; see: <https://developer.hashicorp.com/nomad/s/envoy-bootstrap-error>" failed=false
May  2 14:08:45 nomadrunner2 nomad[372726]:     2024-05-02T14:08:45.133Z [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres error="prestart hook \"envoy_bootstrap\" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1; see: <https://developer.hashicorp.com/nomad/s/envoy-bootstrap-error>"
May  2 14:08:45 nomadrunner2 nomad[372726]:     2024-05-02T14:08:45.133Z [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres reason="Exceeded allowed attempts 2 in interval 30m0s and mode is \"fail\""
May  2 14:08:45 nomadrunner2 nomad[372726]:     2024-05-02T14:08:45.133Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres type="Not Restarting" msg="Exceeded allowed attempts 2 in interval 30m0s and mode is \"fail\"" failed=true
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation May 6, 2024
@colinbruner

I've observed this as well.. more information in support case #147547

@louievandyke louievandyke added the hcc/cst Admin - internal label May 10, 2024
@ryanhulet
Author

I've observed this as well.. more information in support case #147547

got a link?

@colinbruner

I've observed this as well.. more information in support case #147547

got a link?

Sorry, I reported this to HashiCorp's Enterprise support portal, behind auth. They are now tracking it internally, but public issues help raise awareness... and might provide some restored sanity for anyone else experiencing this issue as well.

@t-davies
Contributor

++ also seeing this fairly repeatedly, unsure if the same cause.
See support case #149941.

@ryanhulet @colinbruner - does restarting Consul fix this for you (temporarily at least)?

@ryanhulet
Author

ryanhulet commented May 28, 2024

Sometimes.

I finally gave up and destroyed my entire cluster and brought up a new one on 1.6.1

@t-davies
Contributor

t-davies commented May 28, 2024

Possibly related to #20185?

I wonder if the Consul sync is getting stuck because it doesn't have tokens to deregister services - exiting early - and therefore never registering newly created services, meaning that Envoy can't bootstrap?

[root@xx logs]# cat envoy_bootstrap.stderr.0
failed fetch proxy config from local agent: Unexpected response code: 404 (unknown service ID: default/xx-xxx/_nomad-task-29d6f08e-ee4a-d2e0-f6e2-39d32be26e68-group-default-xxx-000-sidecar-proxy)

Not sure why it would have gotten worse in newer releases, though. We also see it much more often recently (on 1.7.5 currently), but that could just be because we run more deploys now.

@ryanhulet
Author

As far as I can tell it is 100% related to workload identities; 1.6.1 doesn't use them, and I haven't had that error a single time since downgrading.

@colinbruner

Possibly related to #20185?

This is interesting... the Consul acl stanza settings the quoted issue lists in step #1 do match my Consul ACL configuration:

acl = {
  enabled                  = true
  default_policy           = "deny"
  enable_token_persistence = true
}

I wonder if setting enable_token_persistence to false would mitigate this at all...
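For reference, the change being discussed is just that one flag (a sketch, not a full agent config):

acl = {
  enabled                  = true
  default_policy           = "deny"
  enable_token_persistence = false
}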

@t-davies
Contributor

t-davies commented May 29, 2024

Yeah, I'm also wondering if disabling persistence will fix it; that was also suggested by support on my ticket.

We're (obviously) not going in and manually deleting the ACL tokens like the original issue's repro, but if it's getting stuck with some old expired token because it's been persisted, then maybe we could end up with similar behaviour. In our case we are reloading Consul once per day (with a SIGHUP) when we rotate its agent token, which may also be a source for this; not sure if you're doing similar?

@colinbruner

In our case we are reloading Consul once per day (with a SIGHUP) when we rotate its agent token - which may also a source for this, not sure if you're doing similar?

Similar, but less frequent. We have consul-template generating a Consul ACL token from Vault's Consul secrets engine; this is regenerated and a SIGHUP is issued at the half-life, which I believe to be ~16 days.
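A rough sketch of that rotation, assuming a hypothetical Vault Consul secrets engine role named consul-agent and hypothetical paths and reload command; consul-template renders a fresh agent token into a config file and reloads Consul so it picks the new token up:

template {
  contents = <<-EOT
    acl {
      tokens {
        agent = "{{ with secret "consul/creds/consul-agent" }}{{ .Data.token }}{{ end }}"
      }
    }
  EOT

  destination = "/etc/consul.d/agent-token.hcl"
  command     = "systemctl reload consul"
}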

However, I've seen this on Nomad clients that have only been around for a few hours, so it's not necessarily correlated with when the token is rolled.

Yeah I'm also wondering if disabling persistence will fix it too, was also suggested by support on my ticket.

I'm going to do this in a non-prod environment today, will report back if I'm still seeing errors after the change is made.

@colinbruner

@ryanhulet @t-davies upon the advice of HashiCorp support, I set enable_token_persistence to false and upgraded to Nomad 1.8.0 in a non-production environment, and I have yet to see these bootstrap errors.

Will continue to monitor and update if this is not a full mitigation of the error.

@t-davies
Contributor

t-davies commented Jun 4, 2024

Thanks, yeah, we are in a similar situation with enable_token_persistence off and running 1.8.0 - it has not recurred since either.

@sorenisanerd
Contributor

I've been seeing this problem in a variety of forms for a while now, possibly since I started using Workload Identities.

Many times, I see entries like this in my logs:

Jun 05 17:13:40 direct-newt consul[2616748]: 2024-06-05T17:13:40.746Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-003d7834-ed06-fb31-9052-23774fb0ce3c-group-db-demoapp1--db-db-sidecar-proxy from=127.0.0.1:37742 error="ACL not found"

So it looks like Nomad is failing to bootstrap Envoy because there's something else that's failing to get deregistered. Why is the token

I have an AWFUL Nomad job that scans through Consul's logs for those errors and uses a management token to do the deregistration. Here's a slimmed-down version of it:

job "one-off" {
  type      = "sysbatch"
  group "cmdgroup" {
    task "cmd" {
      driver = "raw_exec"
      config {
        command = "/usr/bin/bash"
        args = ["-c",
          <<-EOA
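            # Find recent "ACL not found" deregister failures in Consul's logs,
            # extract the failing request path, and replay each one with a management token.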
            journalctl -u consul -n 5000 | grep ACL.not.found | grep deregister | cut -f3 -d= | cut -f1 -d' ' | sort -u | while read x
            do
              echo "Overriding $x request with management token"
              curl -X PUT -v -H "X-Consul-Token: 12345678-abcd-1234-dcba-1234567890ab" http://127.0.0.1:8500$x
              rv=1
            done
            EOA
        ]
      }
    }
  }
}

enable_token_persistence should only affect the Consul agent's own tokens, shouldn't it? I.e. the default, agent, replication, config_file_service_registration, and possibly recovery tokens (see the sketch below).
a) If token persistence is disabled, do you inject the tokens whenever the agent starts, or do you keep them in your config files?
b) What ACLs are assigned to these tokens? Might I be missing something there?
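For reference, a minimal sketch of the agent-level tokens that persistence applies to (placeholder values), as opposed to the per-service tokens Nomad now derives via workload identity:

acl {
  enabled                  = true
  default_policy           = "deny"
  enable_token_persistence = false  # only affects agent tokens set via the agent API

  tokens {
    default     = "00000000-0000-0000-0000-000000000001"  # placeholder
    agent       = "00000000-0000-0000-0000-000000000002"  # placeholder
    replication = "00000000-0000-0000-0000-000000000003"  # placeholder
  }
}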

@t-davies
Contributor

t-davies commented Jun 9, 2024

You're quite right, enable_token_persistence should only affect the Consul agent's own tokens. Nomad should be using the WLI token to register/deregister the service as far as I can tell. In our case, because we're rotating the Consul agent token pretty frequently, the hypothesis was that something is happening within Consul (possibly even unrelated to Nomad) that then causes the sync loop within Nomad to get stuck - but it's quite possible that enable_token_persistence being set is completely unrelated.

In our case we use Vault agent to rotate the token and store it in the config file, so we don't really need token persistence anyway since Consul will just read it from the file if it restarts.
