CSI: restart task on failing initial probe, instead of killing it #25307

tgross · 2025-03-06T21:36:07Z

When a CSI plugin is launched, we probe it until the csi_plugin.health_timeout expires (30s by default). But if the plugin never becomes healthy, we kill the task instead of restarting it as documented.

Update the plugin supervisor to trigger a restart instead. We still exit the supervisor loop at that point, so that the supervisor doesn't send probes to a task that isn't running yet. This requires reworking the poststart hook so that the supervisor loop can be started again when the task restarts. In doing so, I identified that we weren't respecting the task kill context from the poststart hook.

Fixes: #25293
Ref: https://hashicorp.atlassian.net/browse/NET-12264

Testing & Reproduction steps

For the happy path, run the demo in demo/csi/hostpath. For the failing path:

jobspec that will never work as a CSI plugin

job "example" {

  group "group" {
    task "task" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
      }

      csi_plugin {
        id             = "whatever"
        type           = "monolith"
        health_timeout = "5s"
      }

      resources {
        cpu    = 100
        memory = 100
      }

    }
  }
}

task events

$ nomad alloc status af5c
...
Recent Events:
Time                       Type                     Description
2025-03-06T16:31:17-05:00  Restarting               Task restarting in 16.104906834s
2025-03-06T16:31:17-05:00  Terminated               Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2025-03-06T16:31:12-05:00  Restarting               CSI plugin did not become healthy before configured 5s health timeout
2025-03-06T16:31:12-05:00  Plugin became unhealthy  Error: CSI plugin failed probe: timeout while connecting to gRPC socket: failed to stat socket: stat /var/nomad/data/client/csi/plugins/af58f31a-0733-7c83-2231-4e97d956ad74/csi.sock: no such file or directory
2025-03-06T16:31:07-05:00  Started                  Task started by client
2025-03-06T16:30:50-05:00  Restarting               Task restarting in 16.928817516s
2025-03-06T16:30:50-05:00  Terminated               Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2025-03-06T16:30:45-05:00  Restarting               CSI plugin did not become healthy before configured 5s health timeout
2025-03-06T16:30:45-05:00  Plugin became unhealthy  Error: CSI plugin failed probe: timeout while connecting to gRPC socket: failed to stat socket: stat /var/nomad/data/client/csi/plugins/af58f31a-0733-7c83-2231-4e97d956ad74/csi.sock: no such file or directory
2025-03-06T16:30:40-05:00  Started                  Task started by client

Contributor Checklist

Changelog Entry: If this PR changes user-facing behavior, please generate and add a changelog entry using the make cl command.
Testing: Please add tests to cover any new functionality or to demonstrate bug fixes and ensure regressions will be caught.
Documentation: If the change impacts user-facing functionality such as the CLI, API, UI, and job configuration, please update the Nomad website documentation to reflect this. Refer to the website README for docs guidelines. Please also consider whether the change requires notes within the upgrade guide.

Reviewer Checklist

Backport Labels: Please add the correct backport labels as described by the internal backporting document.
Commit Type: Ensure the correct merge method is selected which should be "squash and merge" in the majority of situations. The main exceptions are long-lived feature branches or merges where history should be preserved.
Enterprise PRs: If this is an enterprise only PR, please add any required changelog entry within the public repository.

When a CSI plugin is launched, we probe it until the csi_plugin.health_timeout expires (by default 30s). But if the plugin never becomes healthy, we're not restarting the task as documented.

Update the plugin supervisor to trigger a restart instead. We still exit the supervisor loop at that point to avoid having the supervisor send probes to a task that isn't running yet. This requires reworking the poststart hook to allow the supervisor loop to be restarted when the task restarts.

In doing so, I identified that we weren't respecting the task kill context from the post start hook, which would leave the supervisor running in the window between when a task is killed because it failed and its stop hooks were triggered. Combine the two contexts to make sure we stop the supervisor whichever context gets closed first.

Fixes: #25293
Ref: https://hashicorp.atlassian.net/browse/NET-12264

gulducat

main hitch I see is the ineffectual lock + immediate unlock. otherwise lgtm!

client/allocrunner/taskrunner/plugin_supervisor_hook.go

gulducat · 2025-03-06T22:22:36Z

 			SetDisplayMessage(fmt.Sprintf("CSI plugin did not become healthy before configured %v health timeout", h.task.CSIPluginConfig.HealthTimeout.String())),
-	); err != nil {
-		h.logger.Error("failed to kill task", "kill_reason", reason, "error", err)
+		true); err != nil {


just noting that true here represents failure, which means that each restart will count against the task's restart.attempts limit. seems reasonable to me - we would want it to run out of attempts and be rescheduled elsewhere, if possible (not possible with node plugins as system jobs, but hey)

gulducat

lgtm!

tgross force-pushed the 25293-csi-plugin-supervisor-restart branch from 43ceb3d to 5a076b9 Compare March 6, 2025 21:38

tgross added theme/storage type/bug backport/ent/1.7.x+ent backport/ent/1.8.x+ent backport/1.9.x labels Mar 6, 2025

tgross added this to the 1.9.x milestone Mar 6, 2025

vercel bot deployed to Preview – nomad-ui March 6, 2025 21:39 View deployment

tgross marked this pull request as ready for review March 6, 2025 21:57

tgross requested review from a team as code owners March 6, 2025 21:57

tgross requested review from gulducat, jrasell and Juanadelacuesta March 6, 2025 21:57

gulducat reviewed Mar 6, 2025

address comments from code review

89efe69

vercel bot deployed to Preview – nomad-ui March 7, 2025 14:09 View deployment

gulducat approved these changes Mar 7, 2025

tgross merged commit f3d53e3 into main Mar 7, 2025
31 checks passed

tgross deleted the 25293-csi-plugin-supervisor-restart branch March 7, 2025 15:05

hc-github-team-nomad-core mentioned this pull request Mar 7, 2025

Backport of CSI: restart task on failing initial probe, instead of killing it into release/1.9.x #25314

Merged

6 tasks
