Nomad 1.7.7 raw exec task using consul kv put no longer working #20566

Closed
sirbudd opened this issue May 13, 2024 · 6 comments

@sirbudd

sirbudd commented May 13, 2024

Nomad version

Nomad 1.7.7

Operating system

Ubuntu 22.04 jammy

Issue

After upgrading from Nomad 1.6.9 to Nomad 1.7.7, a raw_exec task that runs a bash script calling the consul kv put command no longer works.

Reproduction steps

The following bash script is executed by the raw_exec task:

#!/usr/bin/env bash
set -x
while ! docker ps | egrep -i "${DOCKER_CONTAINER}" | egrep -vi "${NON_INCLUDE}"; do sleep 3; done
sleep 3;
DATE=$(date +%Y-%m-%d)
consul kv put "job_state/ANY_PATH/${NOMAD_ALLOC_ID}" "${DATE}";
while true; do sleep 6000; done

Expected Result

A successful kv put: Success! Data written to: job_state/......

Actual Result

/var/lib/nomad/alloc/9b74628a-f6b9-07b2-898d-e986c227b73d/Consul-Set-Dependency/set_dependency.sh: line 6: 1080306 Killed consul kv put "job_state/ANY_PATH/${NOMAD_ALLOC_ID}" "${DATE}"
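The "Killed" message above is what bash prints when a child process dies from SIGKILL. As a minimal sketch for narrowing this down, the script could decode the exit status of the failing command. The `describe_exit` helper is hypothetical (it is not part of the original script), and it only tells you that a signal ended the process, not who sent it:

```shell
#!/usr/bin/env bash
# Hypothetical helper: describe how a command ended, given its exit status.
describe_exit() {
  local status=$1
  if [ "$status" -eq 0 ]; then
    echo "ok"
  elif [ "$status" -gt 128 ]; then
    # The shell reports death-by-signal as 128 + signal number,
    # so SIGKILL (signal 9) shows up as status 137.
    echo "killed by signal $((status - 128))"
  else
    echo "failed with exit status ${status}"
  fi
}

# Demo: a child shell that sends SIGKILL to itself.
sh -c 'kill -9 $$' || describe_exit "$?"
```

In the script from this report, the same pattern would be `consul kv put ... || describe_exit "$?"`, which makes the signal number visible in the task's stderr log.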

Job file (if appropriate)

The raw_exec task worked until upgrading to the latest Nomad version. The only way to have it work again was to downgrade back to 1.6.9.

If I run the bash script by hand there are no issues.

This is the Nomad agent client options config:
"driver.raw_exec.no_cgroups" = "true"
"docker.cleanup.image" = "true"
"docker.volumes.enabled" = "true"
"docker.cleanup.image.delay" = "16h"
"driver.raw_exec.enable" = "1"
"docker.caps.whitelist" = "CHOWN,DAC_OVERRIDE,FSETID,FOWNER,MKNOD,NET_RAW,SETGID,SETUID,SETFCAP,SETPCAP,NET_BIND_SERVICE,SYS_CHROOT,KILL,AUDIT_WRITE,AUDIT_CONTROL,AUDIT_READ,SYS_PTRACE"

Please let me know if I need to provide any further info.

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation May 13, 2024
@sirbudd sirbudd changed the title Nomad 1.7.7 raw exec task using consul kv pu no longer working Nomad 1.7.7 raw exec task using consul kv put no longer working May 15, 2024
@pkazmierczak
Contributor

Hi @sirbudd, thanks for reporting an issue! Could you give us more detail about the job you're submitting? Is your server configured with ACLs?

@sirbudd
Author

sirbudd commented May 23, 2024

Hello!
The server is not configured with ACLs.

The job file contains a simple raw_exec task that runs the script above, nothing special:
task "Consul-Set-Dependency" {
  driver = "raw_exec"

  config {
    command      = "script.sh"
  }

  template {
    data = <<EOH{{"\n"}}
    {{- file "templates/scripts/script.sh.ctmpl" }}
    {{- "\n"  }}EOH
    destination = "script.sh"
  }

  resources {
    cpu    = 30
    memory = 16
  }
}

I did some more testing and concluded that all consul kv put/get commands no longer work when executed by the raw_exec driver.

@pkazmierczak
Contributor

Hi @sirbudd, I'm having a hard time reproducing your issue. The task definition you've pasted won't even get parsed by Nomad: there's invalid syntax inside your template block. I want to help and investigate the issue, but it's hard without more details on how to reproduce this.

If you can't provide a full jobspec for some reason, could you provide more details about the error message? Perhaps running Nomad with debug level logging?

@sirbudd
Author

sirbudd commented May 23, 2024

Hello @pkazmierczak
Sorry for not giving enough info.

This is a Nomad job file that I used for debugging the issue:

job "Test-Nomad" {
  priority = 50
  datacenters = ["development"]

  type = "sysbatch"

  constraint {
    attribute = "${meta.environment}"
    value     = "development"
  }

  group "test_nomad_17x" {
    count = 1

    constraint {
      attribute = "${meta.component}"
      value     = "COMPONENT_TEST"
    }


    task "run_script" {
      driver = "raw_exec"

      template {
        data = <<EOH
#!/bin/bash
# set -x

consul kv put "test_consul_put/test/foobar2";
sleep 10

set -x
while ! docker ps | egrep -i "${INCLUDE}" | egrep -vi "${NON_INCLUDE}"; do sleep 3; done
sleep 3;
DATE=$(date +%Y-%m-%d)
consul kv put "job_state/PATH/${NOMAD_ALLOC_ID}" "${DATE}";
while true; do sleep 6000; done

        EOH

        destination= "local/script.sh"
        perms = 755
      }

      env {
        "CONSUL_ADDR"                     = "https://${attr.unique.network.ip-address}:8500"
        "CONSUL_HTTP_ADDR"                = "https://${attr.unique.network.ip-address}:8500"
        "CONSUL_HTTP_SSL"                 = "true"
        "CONSUL_CACERT"                   = "/etc/consul/ssl/ca_cert.pem"
        "CONSUL_CLIENT_CERT"              = "/etc/consul/ssl/server.pem"
        "CONSUL_CLIENT_KEY"               = "/etc/consul/ssl/server.key"
        "INCLUDE"                         = "Component-subEnvironment-${NOMAD_ALLOC_ID}"
        "NON_INCLUDE"                     = "config|metricbeat|logstash|antivirus"
      }

      config {
        command = "local/script.sh"
      }
    }
  }
}

With the above job file everything worked without any issues. It turns out the raw_exec task was being killed because it did not have enough resources.
Before the upgrade, the raw_exec task had the following resources stanza:

      resources {
        cpu    = 30
        memory = 16
      }

After the Nomad upgrade from 1.6.x to 1.7.7, those resources were no longer enough, which is why the consul kv put command was getting killed.
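If the kill was memory-related (the thread leaves the exact cause open), one way to choose a new `memory` value is to compare the command's peak resident set size against the limit. The sketch below assumes the peak is reported in KiB; both helper names are hypothetical and no measured values from this environment are implied:

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a peak RSS reading in KiB to MiB, rounded up,
# so it can be compared with the task's `memory` value (which is in MB).
kib_to_mib() {
  echo $(( ($1 + 1023) / 1024 ))
}

# Hypothetical helper: report whether a peak RSS (KiB) fits in a limit (MiB).
fits_limit() {
  local peak_kib=$1
  local limit_mib=$2
  if [ "$(kib_to_mib "$peak_kib")" -le "$limit_mib" ]; then
    echo "fits"
  else
    echo "exceeds"
  fi
}
```

On a host with GNU time installed, `/usr/bin/time -f '%M' consul kv put <key> <value>` prints the peak RSS in KiB on stderr, and that number can be passed as the first argument, for example `fits_limit <peak_kib> 16` for the stanza above.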

@pkazmierczak
Contributor

Hey @sirbudd, thanks for providing all the detail. It's hard to say whether the resource exhaustion is due to the newer version of Nomad, to Consul or Docker needing more resources to execute, or to yet another factor. I'll close the issue for now, but please feel free to re-open it if you encounter more problems.

Nomad - Community Issues Triage automation moved this from Needs Triage to Done May 23, 2024
@sirbudd
Author

sirbudd commented May 23, 2024

Thank you for your assistance!
