Batch job reschedule attempts sometimes not being respected #20462

Open
sagebind opened this issue Apr 19, 2024 · 10 comments

@sagebind

Nomad version

Output from nomad version

Nomad v1.7.6
BuildDate 2024-03-12T07:27:36Z
Revision 594fedbfbc4f0e532b65e8a69b28ff9403eb822e

Operating system and Environment details

Output from uname -a:

Linux ip-10-151-99-125 6.5.0-1017-aws #17~22.04.2-Ubuntu SMP Mon Mar 25 20:47:22 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Issue

It seems that the reschedule attempts limit set on a job is sometimes not respected. This job configuration was definitely working when we were running Nomad 1.4 (we recently upgraded to catch up with the latest version). We run a lot of small, discrete operations as batch jobs, and we use the reschedule policy to let Nomad automatically retry an operation with a new allocation on a different node in case the underlying node had an issue, with attempts set to 2.

Normally, this does what we expect: if a task exits with a nonzero exit code or gets OOM-killed, the allocation fails (because restart.attempts is set to zero and mode is "fail"), and up to 2 additional allocations may be created to attempt the task again. However, since upgrading to 1.7, we've noticed that this limit is sometimes not enforced, and Nomad behaves as if unlimited were set to true, which it is not. Some failing tasks end up generating an unbounded number of new allocations over time, filling our clusters with wasted work and causing quite a few problems for us. We've had jobs with attempts set to 2 somehow end up with hundreds of failed allocations. Here's a screenshot of the dashboard for a job that already had 63 before we found and purged it manually:

[Screenshot 2024-04-18: job dashboard showing the failed allocations]

So far I haven't found any rhyme or reason as to why this behavior only happens sometimes, but over time the percentage of our cluster filled with these "zombie" jobs grows until most of our capacity is taken up by them, leaving no room for new work.

My current theory is that sometimes, when a task gets OOM-killed under the docker driver, the exit code is reported as 0 and therefore doesn't increment whatever counter tracks the number of failed allocations so far. Normally the event message is

Exit Code: 137, Exit Message: "OOM Killed"

But for these "zombie" jobs that reschedule infinitely, it seems like the event message is

Exit Code: 0, Exit Message: "OOM Killed"
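
One way to hunt for allocations in this state is to scan each allocation's task events for an "OOM Killed" message paired with exit code 0. This is only a sketch, assuming the allocation JSON exposes task events with DisplayMessage and ExitCode fields matching the messages above, and using a placeholder job ID:

# Sketch: list task events that claim "OOM Killed" but carry exit code 0.
# Assumes NOMAD_ADDR is set; JOB_ID is a placeholder, and the DisplayMessage /
# ExitCode field names are assumptions based on the event messages above.
JOB_ID="zombie"

curl -s "$NOMAD_ADDR/v1/job/$JOB_ID/allocations" | jq -r '.[].ID' |
while read -r alloc; do
  nomad alloc status -json "$alloc" |
    jq --arg alloc "$alloc" '
      .TaskStates // {} | to_entries[] | .value.Events[]?
      | select((.DisplayMessage // "") | test("OOM Killed"))
      | {alloc: $alloc, exit_code: .ExitCode, message: .DisplayMessage}'
done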

Below are the allocations and events for the example zombie job from the reproduction steps section, which works correctly and stops after attempts reschedules (the JSON is too big for a GitHub comment):

zombie-allocations.json

And below is the allocations response for a buggy job in production that spawned more allocations than its attempts limit allows:

buggy-allocations.json

Reproduction steps

I haven't been able to consistently reproduce this issue yet, but the general idea would be to use a job like this one:

job "zombie" {
    type = "batch"
    datacenters = ["dc1"]
    group "zombie" {
        task "zombie" {
            driver = "docker"
            config {
                image = "alpine:3"
                entrypoint = ["sh"]
                args = ["-c", "sleep 60; head -c 64m /dev/zero | tail"]
                network_mode = "host"
                auth_soft_fail = true
                memory_hard_limit = 32
            }
            resources {
                cpu = 100
                memory = 16
            }
            restart {
                attempts = 0
                mode = "fail"
            }
        }
    }
    reschedule {
        attempts = 2
        interval = "168h"
        unlimited = false
    }
}
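
A rough way to drive the repro and watch the allocation count is to submit the job and poll the allocations endpoint; this is just a sketch, assuming the job above is saved as zombie.nomad.hcl (the file name and polling interval are arbitrary):

# Submit the repro job, then poll its allocation count. With restart.attempts = 0
# and reschedule.attempts = 2, the count should settle at 3 (the original
# allocation plus 2 reschedules); a "zombie" job keeps growing past that.
nomad job run zombie.nomad.hcl

while true; do
  count=$(curl -s "$NOMAD_ADDR/v1/job/zombie/allocations" | jq 'length')
  echo "$(date -u +%H:%M:%S) allocations: $count"
  sleep 30
done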

Expected Result

The job should be rescheduled up to attempts times, and then stay in a failed state after that.

Actual Result

The job is rescheduled without any limit, creating far more than attempts allocations over time.

Job file (if appropriate)

We use the Nomad API programmatically to generate these jobs, so we don't have HCL for them, but here is the job definition (as returned by the Nomad API) of an example job that experienced this problem, with some sensitive bits redacted.

{
  "Stop": false,
  "Region": "global",
  "Namespace": "default",
  "ID": "25b3cee6-86f0-4641-b72b-7e014c4049fb",
  "ParentID": "",
  "Name": "{redacted}",
  "Type": "batch",
  "Priority": 50,
  "AllAtOnce": false,
  "Datacenters": [
    "dc1"
  ],
  "NodePool": "default",
  "Constraints": [
    {
      "LTarget": "${attr.cpu.arch}",
      "RTarget": "amd64",
      "Operand": "set_contains_any"
    },
    {
      "LTarget": "${attr.driver.docker.os_type}",
      "RTarget": "linux",
      "Operand": "="
    },
    {
      "LTarget": "${meta.network_type}",
      "RTarget": "public",
      "Operand": "!="
    },
    {
      "LTarget": "${attr.unique.storage.volume}",
      "RTarget": "^/dev/nvme.+",
      "Operand": "regexp"
    }
  ],
  "Affinities": null,
  "Spreads": null,
  "TaskGroups": [
    {
      "Name": "executor",
      "Count": 1,
      "Update": null,
      "Migrate": null,
      "Constraints": null,
      "Scaling": null,
      "RestartPolicy": {
        "Attempts": 3,
        "Interval": 86400000000000,
        "Delay": 15000000000,
        "Mode": "fail",
        "RenderTemplates": false
      },
      "Tasks": [
        {
          "Name": "executor",
          "Driver": "docker",
          "User": "",
          "Config": {
            "network_mode": "host",
            "memory_hard_limit": 8192,
            "mounts": [
              {
                "type": "bind",
                "source": "tmp",
                "target": "/tmp"
              },
              {
                "source": "tmp",
                "target": "/var/tmp",
                "type": "bind"
              }
            ],
            "work_dir": "/alloc/data",
            "image": "{redacted}"
          },
          "Env": {
            "_": "{redacted}"
          },
          "Services": null,
          "Vault": null,
          "Consul": null,
          "Templates": [
            {
              "SourcePath": "",
              "DestPath": "local/args.json",
              "EmbeddedTmpl": "{redacted}",
              "ChangeMode": "noop",
              "ChangeSignal": "",
              "ChangeScript": null,
              "Splay": 5000000000,
              "Perms": "0644",
              "Uid": null,
              "Gid": null,
              "LeftDelim": "\"\"\"",
              "RightDelim": "\"\"\"",
              "Envvars": false,
              "VaultGrace": 0,
              "Wait": null,
              "ErrMissingKey": false
            }
          ],
          "Constraints": null,
          "Affinities": null,
          "Resources": {
            "CPU": 2000,
            "Cores": 0,
            "MemoryMB": 2048,
            "MemoryMaxMB": 0,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": null,
            "Devices": null,
            "NUMA": null
          },
          "RestartPolicy": {
            "Attempts": 0,
            "Interval": 86400000000000,
            "Delay": 15000000000,
            "Mode": "fail",
            "RenderTemplates": false
          },
          "DispatchPayload": null,
          "Lifecycle": null,
          "Meta": null,
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 1,
            "MaxFileSizeMB": 10,
            "Disabled": false
          },
          "Artifacts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "VolumeMounts": null,
          "ScalingPolicies": null,
          "KillSignal": "",
          "Kind": "",
          "CSIPluginConfig": null,
          "Identity": {
            "Name": "default",
            "Audience": [
              "nomadproject.io"
            ],
            "ChangeMode": "",
            "ChangeSignal": "",
            "Env": false,
            "File": false,
            "ServiceName": "",
            "TTL": 0
          },
          "Identities": null,
          "Actions": null
        }
      ],
      "EphemeralDisk": {
        "Sticky": false,
        "SizeMB": 50,
        "Migrate": false
      },
      "Meta": null,
      "ReschedulePolicy": {
        "Attempts": 2,
        "Interval": 604800000000000,
        "Delay": 5000000000,
        "DelayFunction": "constant",
        "MaxDelay": 0,
        "Unlimited": false
      },
      "Affinities": null,
      "Spreads": null,
      "Networks": null,
      "Consul": {
        "Namespace": "",
        "Cluster": "default",
        "Partition": ""
      },
      "Services": null,
      "Volumes": null,
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null,
      "MaxClientDisconnect": null,
      "PreventRescheduleOnLost": false
    }
  ],
  "Update": {
    "Stagger": 0,
    "MaxParallel": 0,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "AutoRevert": false,
    "AutoPromote": false,
    "Canary": 0
  },
  "Multiregion": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Dispatched": false,
  "DispatchIdempotencyToken": "",
  "Payload": null,
  "Meta": {
    "_": "{redacted}"
  },
  "ConsulToken": "",
  "ConsulNamespace": "",
  "VaultToken": "",
  "VaultNamespace": "",
  "NomadTokenID": "",
  "Status": "pending",
  "StatusDescription": "",
  "Stable": false,
  "Version": 0,
  "SubmitTime": 1713431957190805800,
  "CreateIndex": 25929346,
  "ModifyIndex": 30491074,
  "JobModifyIndex": 25929346
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Apr 23, 2024
@sagebind
Author

Digging into this a bit more: I'm not familiar with the codebase, but the fact that RescheduleTracker is getting reset to null on some of these reschedules seems dubious. Here are just the RescheduleTrackers of the allocations of the job above exhibiting the buggy behavior:

{
  "ID": "b9fd3bb7-7b8b-28cd-a848-717fc995893c",
  "RescheduleTracker": null
}
{
  "ID": "b550b588-fcf3-8431-e00d-c1271e307dd0",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1713540616182219192,
        "PrevAllocID": "b9fd3bb7-7b8b-28cd-a848-717fc995893c",
        "PrevNodeID": "2d4576df-c1f6-8b8a-a0da-ca05123cf849",
        "Delay": 5000000000
      }
    ]
  }
}
{
  "ID": "52a123ff-2e23-4860-e802-3c314e2b3a6a",
  "RescheduleTracker": null
}
{
  "ID": "46eb89fa-ef85-19cc-b79a-c60edf63ba01",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1713540907034600526,
        "PrevAllocID": "52a123ff-2e23-4860-e802-3c314e2b3a6a",
        "PrevNodeID": "7349b23c-b638-15d2-ec50-f9ca5e4103f3",
        "Delay": 5000000000
      }
    ]
  }
}
{
  "ID": "51147d13-cef8-a522-091e-8ede42887a17",
  "RescheduleTracker": null
}
{
  "ID": "eb529550-4ec9-eb9b-24da-f3892da18b81",
  "RescheduleTracker": null
}
{
  "ID": "1cd8ae47-39d8-d5e1-7b34-a42678fcee66",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1713541254356376383,
        "PrevAllocID": "eb529550-4ec9-eb9b-24da-f3892da18b81",
        "PrevNodeID": "5b244e29-d814-84dd-d83c-a235ada79a3e",
        "Delay": 5000000000
      }
    ]
  }
}
{
  "ID": "fa019977-b038-a8e8-6b3b-837a4bf04661",
  "RescheduleTracker": null
}
{
  "ID": "0505a317-7470-06cf-dfba-9ecef6e92e52",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1713541591399485131,
        "PrevAllocID": "fa019977-b038-a8e8-6b3b-837a4bf04661",
        "PrevNodeID": "3ebe29d2-5c40-643c-6ddb-345a01d315c5",
        "Delay": 5000000000
      }
    ]
  }
}
{
  "ID": "92022347-e05a-c3b2-877f-ba3458fb00fc",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1713541591399485131,
        "PrevAllocID": "fa019977-b038-a8e8-6b3b-837a4bf04661",
        "PrevNodeID": "3ebe29d2-5c40-643c-6ddb-345a01d315c5",
        "Delay": 5000000000
      },
      {
        "RescheduleTime": 1713541647009844424,
        "PrevAllocID": "0505a317-7470-06cf-dfba-9ecef6e92e52",
        "PrevNodeID": "214d3a1c-3b69-8d67-c707-812bf8a943c6",
        "Delay": 5000000000
      }
    ]
  }
}

It seems like RescheduleTracker.Events should normally get an event appended for each prior allocation until the size of the list reaches the reschedule Attempts value. But because it is getting reset to null for some reason, it never reaches Attempts events, or only does so eventually (by chance?).
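
For reference, the per-allocation view above can be reproduced with something along these lines; a sketch, assuming the per-job allocations endpoint includes RescheduleTracker as in the dumps above (the job ID is the one from the job definition, used here as a placeholder):

# Sketch: show each allocation's ID and how many reschedule events it carries.
JOB_ID="25b3cee6-86f0-4641-b72b-7e014c4049fb"   # placeholder

curl -s "$NOMAD_ADDR/v1/job/$JOB_ID/allocations" |
  jq '[.[] | {ID, events: (.RescheduleTracker.Events // [] | length)}]
      | sort_by(.events)'

With attempts set to 2, no allocation should ever carry more than 2 events, and the chain should stop once one does.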

@tgross
Member

tgross commented Apr 25, 2024

Hi @sagebind! Any chance that these allocations failed a placement on reschedule, and then got placed? There's a known issue around that, which I have a test for in #12319 but we never got back to figuring out how to actually fix it.

You should be able to verify this by looking at the evaluation history for the job.
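
For example, something along these lines would show whether any evaluation in the job's history recorded a placement failure; a sketch, assuming the per-job evaluations endpoint and the TriggeredBy / BlockedEval / FailedTGAllocs fields (the job ID is a placeholder):

# Sketch: summarize a job's evaluation history, flagging evals that carry
# placement failures or spawned a blocked eval.
JOB_ID="<job-id>"   # placeholder

curl -s "$NOMAD_ADDR/v1/job/$JOB_ID/evaluations" |
  jq '[.[] | {ID, TriggeredBy, Status, BlockedEval,
              placement_failures: (.FailedTGAllocs != null)}]'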

@sagebind
Author

@tgross Here's the evaluation history for a couple of jobs that placed 7 allocations even though Attempts was set to 2, along with their allocations:

5df4fd90-d0b3-46e1-992d-45f61b08c537-allocations.json
5df4fd90-d0b3-46e1-992d-45f61b08c537-evaluations.json
f290af7c-5bb9-41da-b3db-d643f8e70600-allocations.json
f290af7c-5bb9-41da-b3db-d643f8e70600-evaluations.json

The only evaluation trigger I see is alloc-failure, and there are no placement failures. It looks pretty ordinary, other than the fact that there are too many of them.

@tgross
Member

tgross commented May 15, 2024

Just a heads up that I'm digging into this. The cause of the problem seems straightforward and tied up in the logic we use for marking allocations for reschedule, but the solution is going to take a bit of figuring out. I'll update here as I know more.

tgross added a commit that referenced this issue May 16, 2024
While working on #20462, I discovered that some of the scheduler tests for
disconnected clients were making long blocking queries. The tests used
`testutil.WaitForResult` to wait for an evaluation to be written to the state
store. The evaluation was never written, but the tests were not correctly
returning an error for an empty query. This resulted in the tests blocking for
5s and then continuing anyways.

In practice, the evaluation is never written to the state store as part of the
test harness `Process` method, so this test assertion was meaningless. Remove
the broken assertion from the two top-level tests that used it, and upgrade
these tests to use `shoenig/test` in the process. This will save us ~50s per
test run.
tgross added a commit that referenced this issue May 22, 2024
While working on #20462 and #12319, I found that some of our scheduler tests around
down nodes or disconnected clients were enforcing invariants that were
unclear. This changeset pulls out some minor refactorings so that the bug fix PR
is easier to review. This includes:

* Migrating a few tests from `testify` to `shoenig/test` that I'm going to touch
  in #12319 anyways.
* Adding test names to the node down test
* Update the disconnected client test so that we always re-process the
  pending/blocked eval it creates; this eliminates 2 redundant sub-tests.
* Update the disconnected client test assertions so that they're explicit in the
  test setup rather than implied by whether we re-process the pending/blocked
  eval.

Ref: #20462
Ref: #12319
@gibbonsjohnm

We are seeing something tangentially related to this. Rescheduling is disabled, but occasionally allocations are getting rescheduled after failing or even completing. No rhyme or reason identified yet for why this is occurring, but I'm wondering if we're dealing with the same issue identified above.

@tgross
Member

tgross commented May 31, 2024

Just wanted to drop an update here. I've got a working solution to the problem described in #12147 and #17072, but @sagebind's report is slightly different because there's no failed placement visible. I strongly suspect the solution there will also fix this problem because of how we attach the reschedule tracker, but until I can reproduce this problem I don't feel confident in closing this out with it.

One of the issues I'm running into with the available evidence is that some of the evaluations in the "chain" of evaluations are missing. That shouldn't matter here, as all the evaluations that are missing should be the first eval after the allocation failure, which then creates a new eval to be run after the reschedule.delay.

For job f290af7c-5bb9-41da-b3db-d643f8e70600, the allocs 05cac123 and eb6eeeb9 have a chain that's intact but shows the dropped reschedule tracker.

example allocs
$ cat f290af7c-5bb9-41da-b3db-d643f8e70600-allocations.json| jq '.[] | select(.ID == "05cac123-d5e1-b767-8e13-32fca12970fa") | {ID: .ID, RescheduleTracker: .RescheduleTracker, FollowupEvalID: .FollowupEvalID, DesiredDescription: .DesiredDescription, EvalID: .EvalID, NextAllocation: .NextAllocation}'
{
  "ID": "05cac123-d5e1-b767-8e13-32fca12970fa",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1714060539543172000,
        "PrevAllocID": "dceaf833-8e6d-db00-9280-a2572e09fd4d",
        "PrevNodeID": "b3d352a2-8c4f-556c-df8c-8c70d5271779",
        "Delay": 15000000000
      }
    ]
  },
  "FollowupEvalID": "e218c7d6-3acf-7232-e8c0-65b9be87b2f4",
  "DesiredDescription": "alloc was rescheduled because it failed",
  "EvalID": "6c398c16-af9c-ad61-9f59-cd7af671f30a",
  "NextAllocation": ""
}

$ cat f290af7c-5bb9-41da-b3db-d643f8e70600-allocations.json| jq '.[] | select(.ID == "eb6eeeb9-e357-3f47-b447-aa3b8a4d52d7") | {ID: .ID, RescheduleTracker: .RescheduleTracker, FollowupEvalID: .FollowupEvalID, DesiredDescription: .DesiredDescription, EvalID: .EvalID, NextAllocation: .NextAllocation}'
{
  "ID": "eb6eeeb9-e357-3f47-b447-aa3b8a4d52d7",
  "RescheduleTracker": null,
  "FollowupEvalID": "00a7ca3b-8ff0-fa0a-2e37-7297769153c3",
  "DesiredDescription": "alloc was rescheduled because it failed",
  "EvalID": "e218c7d6-3acf-7232-e8c0-65b9be87b2f4",
  "NextAllocation": "819e549e-7b37-7329-46c1-1e0809c18f1b"
}

I'm still working on trying to figure out the code path that can reproduce this.

@gibbonsjohnm wrote:

Rescheduling is disabled, but occasionally allocations are getting rescheduled after failing or even completing.

It might or might not be related. Just to clarify: you mean you've got a reschedule block like this (as documented), right?

reschedule {
  attempts  = 0
  unlimited = false
}

If so, it would be helpful if you could provide the allocations and evaluations list like @sagebind did above. Debug-level server log snippets from when the evaluations were processed would be helpful too.

@tgross
Member

tgross commented May 31, 2024

I'm at the end of my week here, but wanted to leave one last update. It looks like in e8efe2d we introduced a code path where allocs that are "untainted" but disconnecting should pass through the reschedule check. In practice, though, untainted allocs that aren't disconnecting now also pass through the reschedule check. This could potentially cause a spurious reschedule, but I still need to figure out the exact circumstances where that could happen. The fix for that would be:

diff --git a/scheduler/reconcile_util.go b/scheduler/reconcile_util.go
index 09cc0e42f5..661db532ec 100644
--- a/scheduler/reconcile_util.go
+++ b/scheduler/reconcile_util.go
@@ -414,6 +414,7 @@ func (a allocSet) filterByRescheduleable(isBatch, isDisconnecting bool, now time
                isUntainted, ignore := shouldFilter(alloc, isBatch)
                if isUntainted && !isDisconnecting {
                        untainted[alloc.ID] = alloc
+                       continue // TODO: right?
                }

                if ignore {

But right now that doesn't cause any tests to fail, which at least suggests a gap in test coverage may be why we didn't catch this earlier. The only way that explains the failed allocs not getting their reschedule tracker is if they don't get marked reschedulable in that code path and just get dropped into the "untainted" list. That would cause a later stage of the reconciler to determine we were short healthy allocs and backfill them. I'll pick this all back up on Monday.

@tgross
Member

tgross commented Jun 4, 2024

Ok folks, I've got #12319 marked as ready for review to cover the similar-but-not-quite identical case, and that should ship in the next regular version of Nomad.

For the remaining cases described in this issue, I have a high degree of confidence at this point that we're not going anywhere near the reschedule logic. Instead, the scheduler is considering the job to be "underprovisioned", so it's creating allocations that it doesn't consider to be replacements for existing allocations at all. That is, the scheduler isn't saying "I have a count=3 but one alloc failed, so make a new alloc to replace it" and is instead saying "I have count=3 and 2 running allocs, so place a new alloc". I admit this sounds only subtly different but the internals of the scheduler treat those cases very differently.

Unfortunately I don't have a reproduction for that yet. What would help me at this point, if you're seeing allocations being replaced without a reschedule tracker and without a placement failure during rescheduling, or completed allocations getting replaced, is to post the following (a rough collection sketch follows the list):

  • Whether it's a batch or service job
  • The evaluations for that job (via /v1/evaluations?job=$jobid)
  • The debug-level server logs where that evaluation was processed
  • The DesiredStatus and ClientStatus of the allocations impacted
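
Something along these lines should capture the evaluation and allocation pieces; a sketch, where the job ID and output file names are placeholders and the jq field names mirror the allocation dumps earlier in this issue:

# Sketch for collecting the requested data; JOB_ID is a placeholder.
JOB_ID="<job-id>"

# Evaluations for the job
curl -s "$NOMAD_ADDR/v1/evaluations?job=$JOB_ID" > "evaluations-$JOB_ID.json"

# DesiredStatus / ClientStatus (plus reschedule tracking) for its allocations
curl -s "$NOMAD_ADDR/v1/job/$JOB_ID/allocations" |
  jq '[.[] | {ID, DesiredStatus, ClientStatus, RescheduleTracker, FollowupEvalID}]' \
  > "allocations-$JOB_ID.json"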

@sagebind I did a little digging on the OOM report exit codes, and it looks like that's a case of #13119, which we're aware of and are waiting on an upstream fix for.

@tgross changed the title from "Reschedule attempts sometimes not being respected" to "Batch job reschedule attempts sometimes not being respected" Jun 4, 2024
@tgross
Member

tgross commented Jun 4, 2024

If you are reading this to report that you've seen similar issues and have not followed along with the above: the key factor in this issue is that the replacement happens without an apparent attempt to reschedule. If you have seen placement failures during rescheduling, then you're seeing #17072 / #12147, which we already have a patch for.

@gibbonsjohnm

It might or might not be related. Just to clarify: you mean you've got a reschedule block like this (as documented), right?

reschedule {
  attempts  = 0
  unlimited = false
}

If so, it would be helpful if you could provide the allocations and evaluations list like @sagebind did above. Debug-level server log snippets from when the evaluations were processed would be helpful too.

We do have that reschedule block, yes. If the issue happens again, I will provide the data requested. For what it's worth, these are batch jobs.
