Batch job reschedule attempts sometimes not being respected #20462

Open
sagebind opened this issue Apr 19, 2024 · 10 comments

@sagebind

Nomad version

Output from nomad version

Nomad v1.7.6
BuildDate 2024-03-12T07:27:36Z
Revision 594fedbfbc4f0e532b65e8a69b28ff9403eb822e

Operating system and Environment details

Output from uname -a:

Linux ip-10-151-99-125 6.5.0-1017-aws #17~22.04.2-Ubuntu SMP Mon Mar 25 20:47:22 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Issue

It seems that the reschedule attempts limit set on a job is sometimes not respected. This job configuration was definitely working when we were running Nomad 1.4 (we recently upgraded to catch up with the latest version). We run a lot of small, discrete operations as batch jobs, and we use the reschedule policy to let Nomad automatically retry an operation with a new allocation on a different node in case the underlying node had an issue, with attempts set to 2.

Normally, this does what we expect: if a task exits with a nonzero exit code or gets OOM-killed, the allocation fails (because restart.attempts is set to zero and mode is "fail"), and up to 2 additional allocations may be created to attempt the task again. However, since upgrading to 1.7, we've noticed that this limit is sometimes not enforced, and Nomad behaves as if unlimited were set to true, which it is not. Some failing tasks end up generating an unbounded number of new allocations over time, filling our clusters with wasted work and causing quite a few problems for us. We've had jobs with attempts set to 2 somehow end up with hundreds of failed allocations. Here's a screenshot of the dashboard for a job that already had 63 before we found and purged it manually:

[Screenshot 2024-04-18: job dashboard showing the failed allocations]

So far I haven't found any rhyme or reason as to why this behavior only happens sometimes, but over time the percentage of our cluster filled with these "zombie" jobs grows until most of our capacity is taken up by them, leaving no room for new work.

My current theory is that sometimes, when a task gets OOM-killed under the docker driver, the exit code is reported as 0 and therefore doesn't increment whatever counter tracks the number of failed allocations so far. Normally the event message is

Exit Code: 137, Exit Message: "OOM Killed"

But for these "zombie" jobs that reschedule infinitely, it seems like the event message is

Exit Code: 0, Exit Message: "OOM Killed"
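
One way to hunt for allocations in this state is to scan each allocation's task events for an "OOM Killed" message paired with exit code 0. This is only a sketch, assuming the allocation JSON exposes task events with DisplayMessage and ExitCode fields matching the messages above, and using a placeholder job ID:

# Sketch: list task events that claim "OOM Killed" but carry exit code 0.
# Assumes NOMAD_ADDR is set; JOB_ID is a placeholder, and the DisplayMessage /
# ExitCode field names are assumptions based on the event messages above.
JOB_ID="zombie"

curl -s "$NOMAD_ADDR/v1/job/$JOB_ID/allocations" | jq -r '.[].ID' |
while read -r alloc; do
  nomad alloc status -json "$alloc" |
    jq --arg alloc "$alloc" '
      .TaskStates // {} | to_entries[] | .value.Events[]?
      | select((.DisplayMessage // "") | test("OOM Killed"))
      | {alloc: $alloc, exit_code: .ExitCode, message: .DisplayMessage}'
done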

Below are the allocations and events for the example zombie job from the reproduction steps section, which works correctly and stops after attempts reschedules (the JSON is too big for a GitHub comment):

zombie-allocations.json

And below is the allocations response for a buggy job in production that spawned more allocations than its attempts limit allows:

buggy-allocations.json

Reproduction steps

I haven't been able to consistently reproduce this issue yet, but the general idea would be to use a job like this one:

job "zombie" {
    type = "batch"
    datacenters = ["dc1"]
    group "zombie" {
        task "zombie" {
            driver = "docker"
            config {
                image = "alpine:3"
                entrypoint = ["sh"]
                args = ["-c", "sleep 60; head -c 64m /dev/zero | tail"]
                network_mode = "host"
                auth_soft_fail = true
                memory_hard_limit = 32
            }
            resources {
                cpu = 100
                memory = 16
            }
            restart {
                attempts = 0
                mode = "fail"
            }
        }
    }
    reschedule {
        attempts = 2
        interval = "168h"
        unlimited = false
    }
}
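
A rough way to drive the repro and watch the allocation count is to submit the job and poll the allocations endpoint; this is just a sketch, assuming the job above is saved as zombie.nomad.hcl (the file name and polling interval are arbitrary):

# Submit the repro job, then poll its allocation count. With restart.attempts = 0
# and reschedule.attempts = 2, the count should settle at 3 (the original
# allocation plus 2 reschedules); a "zombie" job keeps growing past that.
nomad job run zombie.nomad.hcl

while true; do
  count=$(curl -s "$NOMAD_ADDR/v1/job/zombie/allocations" | jq 'length')
  echo "$(date -u +%H:%M:%S) allocations: $count"
  sleep 30
done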

Expected Result

The job should be rescheduled up to attempts times, and then stay in a failed state after that.

Actual Result

The job is rescheduled without any limit, creating far more than attempts allocations over time.

Job file (if appropriate)

We use the Nomad API programmatically to generate these jobs, so we don't have HCL for them, but here is the job definition (as returned by the Nomad API) of an example job that experienced this problem, with some sensitive bits redacted.

{
  "Stop": false,
  "Region": "global",
  "Namespace": "default",
  "ID": "25b3cee6-86f0-4641-b72b-7e014c4049fb",
  "ParentID": "",
  "Name": "{redacted}",
  "Type": "batch",
  "Priority": 50,
  "AllAtOnce": false,
  "Datacenters": [
    "dc1"
  ],
  "NodePool": "default",
  "Constraints": [
    {
      "LTarget": "${attr.cpu.arch}",
      "RTarget": "amd64",
      "Operand": "set_contains_any"
    },
    {
      "LTarget": "${attr.driver.docker.os_type}",
      "RTarget": "linux",
      "Operand": "="
    },
    {
      "LTarget": "${meta.network_type}",
      "RTarget": "public",
      "Operand": "!="
    },
    {
      "LTarget": "${attr.unique.storage.volume}",
      "RTarget": "^/dev/nvme.+",
      "Operand": "regexp"
    }
  ],
  "Affinities": null,
  "Spreads": null,
  "TaskGroups": [
    {
      "Name": "executor",
      "Count": 1,
      "Update": null,
      "Migrate": null,
      "Constraints": null,
      "Scaling": null,
      "RestartPolicy": {
        "Attempts": 3,
        "Interval": 86400000000000,
        "Delay": 15000000000,
        "Mode": "fail",
        "RenderTemplates": false
      },
      "Tasks": [
        {
          "Name": "executor",
          "Driver": "docker",
          "User": "",
          "Config": {
            "network_mode": "host",
            "memory_hard_limit": 8192,
            "mounts": [
              {
                "type": "bind",
                "source": "tmp",
                "target": "/tmp"
              },
              {
                "source": "tmp",
                "target": "/var/tmp",
                "type": "bind"
              }
            ],
            "work_dir": "/alloc/data",
            "image": "{redacted}"
          },
          "Env": {
            "_": "{redacted}"
          },
          "Services": null,
          "Vault": null,
          "Consul": null,
          "Templates": [
            {
              "SourcePath": "",
              "DestPath": "local/args.json",
              "EmbeddedTmpl": "{redacted}",
              "ChangeMode": "noop",
              "ChangeSignal": "",
              "ChangeScript": null,
              "Splay": 5000000000,
              "Perms": "0644",
              "Uid": null,
              "Gid": null,
              "LeftDelim": "\"\"\"",
              "RightDelim": "\"\"\"",
              "Envvars": false,
              "VaultGrace": 0,
              "Wait": null,
              "ErrMissingKey": false
            }
          ],
          "Constraints": null,
          "Affinities": null,
          "Resources": {
            "CPU": 2000,
            "Cores": 0,
            "MemoryMB": 2048,
            "MemoryMaxMB": 0,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": null,
            "Devices": null,
            "NUMA": null
          },
          "RestartPolicy": {
            "Attempts": 0,
            "Interval": 86400000000000,
            "Delay": 15000000000,
            "Mode": "fail",
            "RenderTemplates": false
          },
          "DispatchPayload": null,
          "Lifecycle": null,
          "Meta": null,
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 1,
            "MaxFileSizeMB": 10,
            "Disabled": false
          },
          "Artifacts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "VolumeMounts": null,
          "ScalingPolicies": null,
          "KillSignal": "",
          "Kind": "",
          "CSIPluginConfig": null,
          "Identity": {
            "Name": "default",
            "Audience": [
              "nomadproject.io"
            ],
            "ChangeMode": "",
            "ChangeSignal": "",
            "Env": false,
            "File": false,
            "ServiceName": "",
            "TTL": 0
          },
          "Identities": null,
          "Actions": null
        }
      ],
      "EphemeralDisk": {
        "Sticky": false,
        "SizeMB": 50,
        "Migrate": false
      },
      "Meta": null,
      "ReschedulePolicy": {
        "Attempts": 2,
        "Interval": 604800000000000,
        "Delay": 5000000000,
        "DelayFunction": "constant",
        "MaxDelay": 0,
        "Unlimited": false
      },
      "Affinities": null,
      "Spreads": null,
      "Networks": null,
      "Consul": {
        "Namespace": "",
        "Cluster": "default",
        "Partition": ""
      },
      "Services": null,
      "Volumes": null,
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null,
      "MaxClientDisconnect": null,
      "PreventRescheduleOnLost": false
    }
  ],
  "Update": {
    "Stagger": 0,
    "MaxParallel": 0,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "AutoRevert": false,
    "AutoPromote": false,
    "Canary": 0
  },
  "Multiregion": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Dispatched": false,
  "DispatchIdempotencyToken": "",
  "Payload": null,
  "Meta": {
    "_": "{redacted}"
  },
  "ConsulToken": "",
  "ConsulNamespace": "",
  "VaultToken": "",
  "VaultNamespace": "",
  "NomadTokenID": "",
  "Status": "pending",
  "StatusDescription": "",
  "Stable": false,
  "Version": 0,
  "SubmitTime": 1713431957190805800,
  "CreateIndex": 25929346,
  "ModifyIndex": 30491074,
  "JobModifyIndex": 25929346
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Apr 23, 2024
@sagebind
Author

Digging into this a bit more: I'm not familiar with the codebase, but the fact that RescheduleTracker is getting reset to null on some of these reschedules seems dubious. Here are just the RescheduleTrackers of the allocations of the job above exhibiting the buggy behavior:

{
  "ID": "b9fd3bb7-7b8b-28cd-a848-717fc995893c",
  "RescheduleTracker": null
}
{
  "ID": "b550b588-fcf3-8431-e00d-c1271e307dd0",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1713540616182219192,
        "PrevAllocID": "b9fd3bb7-7b8b-28cd-a848-717fc995893c",
        "PrevNodeID": "2d4576df-c1f6-8b8a-a0da-ca05123cf849",
        "Delay": 5000000000
      }
    ]
  }
}
{
  "ID": "52a123ff-2e23-4860-e802-3c314e2b3a6a",
  "RescheduleTracker": null
}
{
  "ID": "46eb89fa-ef85-19cc-b79a-c60edf63ba01",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1713540907034600526,
        "PrevAllocID": "52a123ff-2e23-4860-e802-3c314e2b3a6a",
        "PrevNodeID": "7349b23c-b638-15d2-ec50-f9ca5e4103f3",
        "Delay": 5000000000
      }
    ]
  }
}
{
  "ID": "51147d13-cef8-a522-091e-8ede42887a17",
  "RescheduleTracker": null
}
{
  "ID": "eb529550-4ec9-eb9b-24da-f3892da18b81",
  "RescheduleTracker": null
}
{
  "ID": "1cd8ae47-39d8-d5e1-7b34-a42678fcee66",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1713541254356376383,
        "PrevAllocID": "eb529550-4ec9-eb9b-24da-f3892da18b81",
        "PrevNodeID": "5b244e29-d814-84dd-d83c-a235ada79a3e",
        "Delay": 5000000000
      }
    ]
  }
}
{
  "ID": "fa019977-b038-a8e8-6b3b-837a4bf04661",
  "RescheduleTracker": null
}
{
  "ID": "0505a317-7470-06cf-dfba-9ecef6e92e52",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1713541591399485131,
        "PrevAllocID": "fa019977-b038-a8e8-6b3b-837a4bf04661",
        "PrevNodeID": "3ebe29d2-5c40-643c-6ddb-345a01d315c5",
        "Delay": 5000000000
      }
    ]
  }
}
{
  "ID": "92022347-e05a-c3b2-877f-ba3458fb00fc",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1713541591399485131,
        "PrevAllocID": "fa019977-b038-a8e8-6b3b-837a4bf04661",
        "PrevNodeID": "3ebe29d2-5c40-643c-6ddb-345a01d315c5",
        "Delay": 5000000000
      },
      {
        "RescheduleTime": 1713541647009844424,
        "PrevAllocID": "0505a317-7470-06cf-dfba-9ecef6e92e52",
        "PrevNodeID": "214d3a1c-3b69-8d67-c707-812bf8a943c6",
        "Delay": 5000000000
      }
    ]
  }
}

It seems like RescheduleTracker.Events should normally get an event appended for each prior allocation until the size of the list reaches the reschedule Attempts value. But because it is getting reset to null for some reason, it never reaches Attempts events, or only does so eventually (by chance?).
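
For reference, the per-allocation view above can be reproduced with something along these lines; a sketch, assuming the per-job allocations endpoint includes RescheduleTracker as in the dumps above (the job ID is the one from the job definition, used here as a placeholder):

# Sketch: show each allocation's ID and how many reschedule events it carries.
JOB_ID="25b3cee6-86f0-4641-b72b-7e014c4049fb"   # placeholder

curl -s "$NOMAD_ADDR/v1/job/$JOB_ID/allocations" |
  jq '[.[] | {ID, events: (.RescheduleTracker.Events // [] | length)}]
      | sort_by(.events)'

With attempts set to 2, no allocation should ever carry more than 2 events, and the chain should stop once one does.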

@tgross
Member

tgross commented Apr 25, 2024

Hi @sagebind! Any chance that these allocations failed a placement on reschedule, and then got placed? There's a known issue around that, which I have a test for in #12319 but we never got back to figuring out how to actually fix it.

You should be able to verify this by looking at the evaluation history for the job.
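
For example, something along these lines would show whether any evaluation in the job's history recorded a placement failure; a sketch, assuming the per-job evaluations endpoint and the TriggeredBy / BlockedEval / FailedTGAllocs fields (the job ID is a placeholder):

# Sketch: summarize a job's evaluation history, flagging evals that carry
# placement failures or spawned a blocked eval.
JOB_ID="<job-id>"   # placeholder

curl -s "$NOMAD_ADDR/v1/job/$JOB_ID/evaluations" |
  jq '[.[] | {ID, TriggeredBy, Status, BlockedEval,
              placement_failures: (.FailedTGAllocs != null)}]'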

@sagebind
Author

@tgross Here's the evaluation history for a couple of jobs that placed 7 allocations even though Attempts was set to 2, along with their allocations:

5df4fd90-d0b3-46e1-992d-45f61b08c537-allocations.json
5df4fd90-d0b3-46e1-992d-45f61b08c537-evaluations.json
f290af7c-5bb9-41da-b3db-d643f8e70600-allocations.json
f290af7c-5bb9-41da-b3db-d643f8e70600-evaluations.json

The only evaluation trigger I see is alloc-failure, and there are no placement failures. It looks pretty ordinary, other than the fact that there are too many of them.

@tgross
Member

tgross commented May 15, 2024

Just a heads up that I'm digging into this. The cause of the problem seems straightforward and tied up in the logic we use for marking allocations for reschedule, but the solution is going to take a bit of figuring out. I'll update here as I know more.

tgross added a commit that referenced this issue May 16, 2024
While working on #20462, I discovered that some of the scheduler tests for
disconnected clients were making long blocking queries. The tests used
`testutil.WaitForResult` to wait for an evaluation to be written to the state
store. The evaluation was never written, but the tests were not correctly
returning an error for an empty query. This resulted in the tests blocking for
5s and then continuing anyways.

In practice, the evaluation is never written to the state store as part of the
test harness `Process` method, so this test assertion was meaningless. Remove
the broken assertion from the two top-level tests that used it, and upgrade
these tests to use `shoenig/test` in the process. This will save us ~50s per
test run.
tgross added a commit that referenced this issue May 22, 2024
While working on #20462 and #12319, I found that some of our scheduler tests around
down nodes or disconnected clients were enforcing invariants that were
unclear. This changeset pulls out some minor refactorings so that the bug fix PR
is easier to review. This includes:

* Migrating a few tests from `testify` to `shoenig/test` that I'm going to touch
  in #12319 anyways.
* Adding test names to the node down test
* Update the disconnected client test so that we always re-process the
  pending/blocked eval it creates; this eliminates 2 redundant sub-tests.
* Update the disconnected client test assertions so that they're explicit in the
  test setup rather than implied by whether we re-process the pending/blocked
  eval.

Ref: #20462
Ref: #12319
@gibbonsjohnm

We are seeing something tangentially related to this. Rescheduling is disabled, but occasionally allocations are getting rescheduled after failing or even completing. No rhyme or reason identified yet for why this is occurring, but I'm wondering if we're dealing with the same issue identified above.

@tgross
Member

tgross commented May 31, 2024

Just wanted to drop an update here. I've got a working solution to the problem described in #12147 and #17072, but @sagebind's report is slightly different because there's no failed placement visible. I strongly suspect the solution there will also fix this problem because of how we attach the reschedule tracker, but until I can reproduce this problem I don't feel confident in closing this out with it.

One of the issues I'm running into with the available evidence is that some of the evaluations in the "chain" of evaluations are missing. That shouldn't matter here, as all the evaluations that are missing should be the first eval after the allocation failure, which then creates a new eval to be run after the reschedule.delay.

For job f290af7c-5bb9-41da-b3db-d643f8e70600, the allocs 05cac123 and eb6eeeb9 have a chain that's intact but shows the dropped reschedule tracker.

example allocs
$ cat f290af7c-5bb9-41da-b3db-d643f8e70600-allocations.json| jq '.[] | select(.ID == "05cac123-d5e1-b767-8e13-32fca12970fa") | {ID: .ID, RescheduleTracker: .RescheduleTracker, FollowupEvalID: .FollowupEvalID, DesiredDescription: .DesiredDescription, EvalID: .EvalID, NextAllocation: .NextAllocation}'
{
  "ID": "05cac123-d5e1-b767-8e13-32fca12970fa",
  "RescheduleTracker": {
    "Events": [
      {
        "RescheduleTime": 1714060539543172000,
        "PrevAllocID": "dceaf833-8e6d-db00-9280-a2572e09fd4d",
        "PrevNodeID": "b3d352a2-8c4f-556c-df8c-8c70d5271779",
        "Delay": 15000000000
      }
    ]
  },
  "FollowupEvalID": "e218c7d6-3acf-7232-e8c0-65b9be87b2f4",
  "DesiredDescription": "alloc was rescheduled because it failed",
  "EvalID": "6c398c16-af9c-ad61-9f59-cd7af671f30a",
  "NextAllocation": ""
}

$ cat f290af7c-5bb9-41da-b3db-d643f8e70600-allocations.json| jq '.[] | select(.ID == "eb6eeeb9-e357-3f47-b447-aa3b8a4d52d7") | {ID: .ID, RescheduleTracker: .RescheduleTracker, FollowupEvalID: .FollowupEvalID, DesiredDescription: .DesiredDescription, EvalID: .EvalID, NextAllocation: .NextAllocation}'
{
  "ID": "eb6eeeb9-e357-3f47-b447-aa3b8a4d52d7",
  "RescheduleTracker": null,
  "FollowupEvalID": "00a7ca3b-8ff0-fa0a-2e37-7297769153c3",
  "DesiredDescription": "alloc was rescheduled because it failed",
  "EvalID": "e218c7d6-3acf-7232-e8c0-65b9be87b2f4",
  "NextAllocation": "819e549e-7b37-7329-46c1-1e0809c18f1b"
}

I'm still working on trying to figure out the code path that can reproduce this.

@gibbonsjohnm wrote:

Rescheduling is disabled, but occasionally allocations are getting rescheduled after failing or even completing.

It might or might not be related. Just to clarify: you mean you've got a reschedule block like this (as documented), right?

reschedule {
  attempts  = 0
  unlimited = false
}

If so, it would be helpful if you could provide the allocations and evaluations list like @sagebind did above. Debug-level server log snippets from when the evaluations were processed would be helpful too.

@tgross
Member

tgross commented May 31, 2024

I'm at the end of my week here, but wanted to leave one last update. It looks like in e8efe2d we introduced a code path where allocs that are "untainted" but disconnecting should pass through the reschedule check. In practice, though, untainted allocs that aren't disconnecting now also pass through the reschedule check. This could potentially cause a spurious reschedule, but I still need to figure out the exact circumstances where that could happen. The fix for that would be:

diff --git a/scheduler/reconcile_util.go b/scheduler/reconcile_util.go
index 09cc0e42f5..661db532ec 100644
--- a/scheduler/reconcile_util.go
+++ b/scheduler/reconcile_util.go
@@ -414,6 +414,7 @@ func (a allocSet) filterByRescheduleable(isBatch, isDisconnecting bool, now time
                isUntainted, ignore := shouldFilter(alloc, isBatch)
                if isUntainted && !isDisconnecting {
                        untainted[alloc.ID] = alloc
+                       continue // TODO: right?
                }

                if ignore {

But right now that doesn't cause any tests to fail, which at least suggests a gap in test coverage may be why we didn't catch this earlier. The only way that explains the failed allocs not getting their reschedule tracker is if they don't get marked reschedulable in that code path and just get dropped into the "untainted" list. That would cause a later stage of the reconciler to determine we were short healthy allocs and backfill them. I'll pick this all back up on Monday.

@tgross
Member

tgross commented Jun 4, 2024

Ok folks, I've got #12319 marked as ready for review to cover the similar-but-not-quite identical case, and that should ship in the next regular version of Nomad.

For the remaining cases described in this issue, I have a high degree of confidence at this point that we're not going anywhere near the reschedule logic. Instead, the scheduler is considering the job to be "underprovisioned", so it's creating allocations that it doesn't consider to be replacements for existing allocations at all. That is, the scheduler isn't saying "I have a count=3 but one alloc failed, so make a new alloc to replace it" and is instead saying "I have count=3 and 2 running allocs, so place a new alloc". I admit this sounds only subtly different but the internals of the scheduler treat those cases very differently.

Unfortunately I don't have a reproduction for that yet. What would help me at this point, if you're seeing allocations being replaced without a reschedule tracker and without a placement failure during rescheduling, or completed allocations getting replaced, is to post the following (a rough collection sketch follows the list):

  • Whether it's a batch or service job
  • The evaluations for that job (via /v1/evaluations?job=$jobid)
  • The debug-level server logs where that evaluation was processed
  • The DesiredStatus and ClientStatus of the allocations impacted
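
Something along these lines should capture the evaluation and allocation pieces; a sketch, where the job ID and output file names are placeholders and the jq field names mirror the allocation dumps earlier in this issue:

# Sketch for collecting the requested data; JOB_ID is a placeholder.
JOB_ID="<job-id>"

# Evaluations for the job
curl -s "$NOMAD_ADDR/v1/evaluations?job=$JOB_ID" > "evaluations-$JOB_ID.json"

# DesiredStatus / ClientStatus (plus reschedule tracking) for its allocations
curl -s "$NOMAD_ADDR/v1/job/$JOB_ID/allocations" |
  jq '[.[] | {ID, DesiredStatus, ClientStatus, RescheduleTracker, FollowupEvalID}]' \
  > "allocations-$JOB_ID.json"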

@sagebind I did a little digging on the OOM report exit codes, and it looks like that's a case of #13119, which we're aware of and are waiting on an upstream fix for.

@tgross changed the title from "Reschedule attempts sometimes not being respected" to "Batch job reschedule attempts sometimes not being respected" Jun 4, 2024
@tgross
Member

tgross commented Jun 4, 2024

If you are reading this to report that you've seen similar issues and have not followed along with the above: the key factor in this issue is that the replacement happens without an apparent attempt to reschedule. If you have seen placement failures during rescheduling, then you're seeing #17072 / #12147, which we already have a patch for.

@gibbonsjohnm

It might or might not be related. Just to clarify: you mean you've got a reschedule block like this (as documented), right?

reschedule {
  attempts  = 0
  unlimited = false
}

If so, it would be helpful if you could provide the allocations and evaluations list like @sagebind did above. Debug-level server log snippets from when the evaluations were processed would be helpful too.

We do have that reschedule block, yes. If the issue happens again, I will provide the data requested. For what it's worth, these are batch jobs.
