Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI: restart task on failing initial probe, instead of killing it #25307

Merged
merged 2 commits into from
Mar 7, 2025

Conversation

tgross
Copy link
Member

@tgross tgross commented Mar 6, 2025

When a CSI plugin is launched, we probe it until the csi_plugin.health_timeout expires (by default 30s). But if the plugin never becomes healthy, we're not restarting the task as documented.

Update the plugin supervisor to trigger a restart instead. We still exit the supervisor loop at that point to avoid having the supervisor send probes to a task that isn't running yet. This requires reworking the poststart hook to allow the supervisor loop to be restarted when the task restarts. In doing so, I identified that we weren't respecting the task kill context from the post start hook.

Fixes: #25293
Ref: https://hashicorp.atlassian.net/browse/NET-12264

Testing & Reproduction steps

For the happy path, run the demo/csi/hostpath. For the failing path:

jobspec that will never work as a CSI plugin
job "example" {

  group "group" {
    task "task" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
      }

      csi_plugin {
        id = "whatever"
        type = "monolith"
        health_timeout = "5s"
      }

      resources {
        cpu    = 100
        memory = 100
      }

    }
  }
}
task events
$ nomad alloc status af5c
...
Recent Events:
Time                       Type                     Description
2025-03-06T16:31:17-05:00  Restarting               Task restarting in 16.104906834s
2025-03-06T16:31:17-05:00  Terminated               Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2025-03-06T16:31:12-05:00  Restarting               CSI plugin did not become healthy before configured 5s health timeout
2025-03-06T16:31:12-05:00  Plugin became unhealthy  Error: CSI plugin failed probe: timeout while connecting to gRPC socket: failed to stat socket: stat /var/nomad/data/client/csi/plugins/af58f31a-0733-7c83-2231-4e97d956ad74/csi.sock: no such file or directory
2025-03-06T16:31:07-05:00  Started                  Task started by client
2025-03-06T16:30:50-05:00  Restarting               Task restarting in 16.928817516s
2025-03-06T16:30:50-05:00  Terminated               Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2025-03-06T16:30:45-05:00  Restarting               CSI plugin did not become healthy before configured 5s health timeout
2025-03-06T16:30:45-05:00  Plugin became unhealthy  Error: CSI plugin failed probe: timeout while connecting to gRPC socket: failed to stat socket: stat /var/nomad/data/client/csi/plugins/af58f31a-0733-7c83-2231-4e97d956ad74/csi.sock: no such file or directory
2025-03-06T16:30:40-05:00  Started                  Task started by client

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

Sorry, something went wrong.

When a CSI plugin is launched, we probe it until the csi_plugin.health_timeout
expires (by default 30s). But if the plugin never becomes healthy, we're not
restarting the task as documented.

Update the plugin supervisor to trigger a restart instead. We still exit the
supervisor loop at that point to avoid having the supervisor send probes to a
task that isn't running yet. This requires reworking the poststart hook to allow
the supervisor loop to be restarted when the task restarts.

In doing so, I identified that we weren't respecting the task kill context from
the post start hook, which would leave the supervisor running in the window
between when a task is killed because it failed and its stop hooks were
triggered. Combine the two contexts to make sure we stop the supervisor
whichever context gets closed first.

Fixes: #25293
Ref: https://hashicorp.atlassian.net/browse/NET-12264
@tgross tgross force-pushed the 25293-csi-plugin-supervisor-restart branch from 43ceb3d to 5a076b9 Compare March 6, 2025 21:38
@tgross tgross added theme/storage type/bug backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/1.9.x backport to 1.9.x release line labels Mar 6, 2025
@tgross tgross added this to the 1.9.x milestone Mar 6, 2025
@tgross tgross marked this pull request as ready for review March 6, 2025 21:57
@tgross tgross requested review from a team as code owners March 6, 2025 21:57
Copy link
Member

@gulducat gulducat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main hitch I see is the ineffectual lock + immediate unlock. otherwise lgtm!

SetDisplayMessage(fmt.Sprintf("CSI plugin did not become healthy before configured %v health timeout", h.task.CSIPluginConfig.HealthTimeout.String())),
); err != nil {
h.logger.Error("failed to kill task", "kill_reason", reason, "error", err)
true); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just noting that true here represents failure, which means that each restart will count against the task's restart.attempts limit. seems reasonable to me - we would want it to run out of attempts and be rescheduled elsewhere, if possible (not possible with node plugins as system jobs, but hey)

Copy link
Member

@gulducat gulducat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@tgross tgross merged commit f3d53e3 into main Mar 7, 2025
31 checks passed
@tgross tgross deleted the 25293-csi-plugin-supervisor-restart branch March 7, 2025 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/ent/1.7.x+ent Changes are backported to 1.7.x+ent backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/1.9.x backport to 1.9.x release line theme/storage type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CSI: failing plugin probe should restart task, not mark it dead
2 participants