
fix: workflow stuck in running state when using activeDeadlineSeconds on template level. Fixes: #12329 #12761

Merged

Conversation

shuangkun
Member

@shuangkun shuangkun commented Mar 7, 2024

Fixes: #12329
Root cause: when the pod is set to Failed from Pending, the wait container's Terminated state is nil, so the node is reset to Pending and the workflow gets stuck.

Motivation

Fail the workflow when activeDeadlineSeconds is exceeded.

Modifications

Set the node to Failed when the pod has gone to Failed from Pending, so the workflow itself can fail.
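
A minimal, hypothetical sketch of the intended behavior (not the actual controller change; `assessFailedPod`, the `NodePhase` type, and the hard-coded `wait` container name are illustrative stand-ins): once Kubernetes reports the pod as Failed, the node is failed too, unless the wait container is genuinely running and may still need to save outputs.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// NodePhase is a simplified stand-in for the controller's node phase type.
type NodePhase string

const (
	NodePending NodePhase = "Pending"
	NodeFailed  NodePhase = "Failed"
)

// assessFailedPod is a hypothetical helper, not the actual controller code:
// if Kubernetes has marked the pod Failed (for example because
// activeDeadlineSeconds was exceeded) while the wait container never left its
// waiting state, the node is failed rather than reset to Pending, so the
// workflow can finish.
func assessFailedPod(pod *corev1.Pod) NodePhase {
	if pod.Status.Phase != corev1.PodFailed {
		return NodePending
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name == "wait" && cs.State.Running != nil {
			// The wait container is actually running, so give it a chance
			// to save outputs before the node is finalized.
			return NodePending
		}
	}
	return NodeFailed
}

func main() {
	pod := &corev1.Pod{
		Status: corev1.PodStatus{
			Phase:   corev1.PodFailed,
			Message: "Pod was active on the node longer than the specified deadline",
			ContainerStatuses: []corev1.ContainerStatus{{
				Name:  "wait",
				State: corev1.ContainerState{Waiting: &corev1.ContainerStateWaiting{Reason: "PodInitializing"}},
			}},
		},
	}
	fmt.Println(assessFailedPod(pod)) // Failed, not Pending
}
```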

Verification

Tested locally and added unit tests and an e2e test.

@shuangkun shuangkun marked this pull request as draft March 7, 2024 15:22
@shuangkun shuangkun added the area/controller Controller issues, panics label Mar 7, 2024
@shuangkun shuangkun force-pushed the fix/StuckWhenExceedActiveDeadlineSeconds branch from 98c4206 to 0133ffe Compare March 8, 2024 03:52
… on template level. Fixes: argoproj#12329

Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun shuangkun force-pushed the fix/StuckWhenExceedActiveDeadlineSeconds branch from 0133ffe to cd5a7f0 Compare March 8, 2024 06:43
Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun shuangkun marked this pull request as ready for review March 8, 2024 07:25
@shuangkun shuangkun added the prioritized-review For members of the Sustainability Effort label Mar 11, 2024
Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun shuangkun closed this Mar 11, 2024
@shuangkun shuangkun reopened this Mar 11, 2024
@juliev0 juliev0 self-assigned this Mar 12, 2024
@juliev0
Contributor

juliev0 commented Mar 12, 2024

I've started looking at this, so I assigned it to myself, but if anyone feels they know the code well here and prefers to reassign it to themselves, I'm good. :)

@juliev0
Contributor

juliev0 commented Mar 12, 2024

Can you help me understand the sequence of events that led to the Issue? So, the containers were in "Pending" state and then the Pod's activeDeadlineSeconds was reached. I see at that point from the documentation that Kubernetes "will actively try to mark it failed and kill associated containers". Do we understand what's preventing Kubernetes from marking it Terminated at that point?

@shuangkun
Member Author

Kubernetes has marked it as Failed, and Argo observed the Failed phase when recording, but it still reset the node from Failed to Pending because the wait container's terminated state is nil.

@shuangkun
Member Author

I think it's because the wait container never got to run, so it can never receive a terminated state.
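
In status terms (an illustrative Go snippet, mirroring the container status shown later in this thread): a container that never began execution only ever has `State.Waiting` set, so any logic that waits on `State.Terminated` will wait forever.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Status of a wait container that never started (mirrors the pod dump
	// later in this thread): only Waiting is set, Terminated stays nil.
	cs := corev1.ContainerStatus{
		Name: "wait",
		State: corev1.ContainerState{
			Waiting: &corev1.ContainerStateWaiting{Reason: "PodInitializing"},
		},
	}

	// Any logic that blocks on cs.State.Terminated before finalizing the
	// node will never make progress for this container.
	fmt.Println("terminated:", cs.State.Terminated)         // <nil>
	fmt.Println("waiting reason:", cs.State.Waiting.Reason) // PodInitializing
}
```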

@shuangkun
Member Author

kubernetes/kubernetes#108733

kubernetes/kubernetes#108366

@juliev0
Copy link
Contributor

juliev0 commented Mar 13, 2024

kubernetes/kubernetes#108733

kubernetes/kubernetes#108366

Thanks for the links. Okay, I see from the first one that it links to this official documentation which says "A container in the Terminated state began execution and then either ran to completion or failed for some reason." So, you're right that if it didn't begin execution, it seems it should just stay Pending I guess.

@juliev0
Contributor

juliev0 commented Mar 13, 2024

Part of me is wondering what Kubernetes will do in this case - should it just leave it in the Pending state?

In any case, though, maybe your change is reasonable - if the purpose of that code is to make sure we save off any outputs from the wait container, but we hit the activeDeadlineSeconds (or terminated for some other reason), then maybe we don't want to allow the container to get into the Running state in the first place I suppose.

@juliev0
Contributor

juliev0 commented Mar 13, 2024

> Part of me is wondering what Kubernetes will do in this case - should it just leave it in the Pending state?
>
> In any case, though, maybe your change is reasonable - if the purpose of that code is to make sure we save off any outputs from the wait container, but we hit the activeDeadlineSeconds (or terminated for some other reason), then maybe we don't want to allow the container to get into the Running state in the first place I suppose.

Although, I guess this code doesn't prevent it from getting into a Running state, does it? It just presumes that it won't get into a Running state, right? (and if it did, presumably we won't wait to save the outputs)

@juliev0
Contributor

juliev0 commented Mar 13, 2024

Can you help point me to where in the code, after this point, this node.Phase can enable vs. prevent workflow shutdown?

@shuangkun
Member Author

shuangkun commented Mar 13, 2024

> Can you help point me to where in the code, after this point, this node.Phase can enable vs. prevent workflow shutdown?

If the pod has Failed, Kubernetes won't update it any further.
The container status is always:

  - image: quay.io/argoproj/argoexec:latest
    imageID: ""
    lastState: {}
    name: wait
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: PodInitializing
  hostIP: 172.18.0.2
  initContainerStatuses:
  - containerID: containerd://66eb4112a8061ff6da5f48b0c15b9c5c5ccf70f8a4ffa079d258b4315b4bbf74
    image: quay.io/argoproj/argoexec:latest
    imageID: quay.io/argoproj/argoexec@sha256:61c8e55d00437565a2823c802cece0b3323aed758e870a5055768337a87d3546
    lastState: {}
    name: init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://66eb4112a8061ff6da5f48b0c15b9c5c5ccf70f8a4ffa079d258b4315b4bbf74
        exitCode: 0
        finishedAt: "2024-03-13T05:34:25Z"
        reason: Completed
        startedAt: "2024-03-13T05:34:21Z"
  message: Pod was active on the node longer than the specified deadline
  phase: Failed

So if Argo doesn't update the node phase, the node always shows Pending in the workflow, and the workflow is stuck.

      boundaryID: memoized-bug-4nh77
      displayName: fanout(3:4)
      finishedAt: null
      hostNodeName: kind-control-plane
      id: memoized-bug-4nh77-4195166634
      inputs:
        parameters:
        - name: item
          value: "4"
      message: Pod was active on the node longer than the specified deadline
      name: memoized-bug-4nh77[0].fanout(3:4)
      phase: Pending
      progress: 0/1
      startedAt: "2024-03-13T05:34:20Z"
      templateName: echo
      templateScope: local/memoized-bug-4nh77
      type: Pod

Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun
Member Author

shuangkun commented Mar 13, 2024

> Although, I guess this code doesn't prevent it from getting into a Running state, does it? It just presumes that it won't get into a Running state, right? (and if it did, presumably we won't wait to save the outputs)

Yes, it does not prevent the pod from running.

@shuangkun shuangkun force-pushed the fix/StuckWhenExceedActiveDeadlineSeconds branch from fd5334d to 86c2f4e Compare March 13, 2024 06:15
@shuangkun shuangkun requested a review from juliev0 March 13, 2024 06:52
@juliev0
Contributor

juliev0 commented Mar 14, 2024

Okay, so the code being changed isn't particular to failed pods, but for any pod for which new.Phase.Completed()==true. So, we should make sure this will be okay for any of those cases. Have you verified that?

If I try to look at those myself, I believe:

  • this is the case of all containers having Succeeded, marking the Pod as Succeeded
  • this is Phase.Failed which Kubernetes says means "All containers in the Pod have terminated, and at least one container has terminated in failure. That is, the container either exited with non-zero status or was terminated by the system."
  • this one wouldn't actually happen
  • this one seems like some json unmarshalling error, which I don't think would happen
  • this one is the InitContainer failing

If I look at the original bug, I see the InitContainer succeeded, and the wait and main containers are in Pending. So, how did we get to new.Phase.Completed?

@shuangkun
Member Author

shuangkun commented Mar 14, 2024

> Okay, so the code being changed isn't particular to failed pods, but for any pod for which new.Phase.Completed()==true. So, we should make sure this will be okay for any of those cases. Have you verified that?
>
> If I try to look at those myself, I believe:
>
>   • this is the case of all containers having Succeeded, marking the Pod as Succeeded
>   • this is Phase.Failed which Kubernetes says means "All containers in the Pod have terminated, and at least one container has terminated in failure. That is, the container either exited with non-zero status or was terminated by the system."

There seems to be something wrong with this description.

>   • this one wouldn't actually happen
>   • this one seems like some json unmarshalling error, which I don't think would happen
>   • this one is the InitContainer failing
>
> If I look at the original bug, I see the InitContainer succeeded, and the wait and main containers are in Pending. So, how did we get to new.Phase.Completed?

The Pod from yesterday is still in this state now. K8s version is v1.25.3, Argo is latest. Phase is Failed.

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-03-13T05:34:26Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-03-13T05:34:20Z"
    reason: PodFailed
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-03-13T05:34:20Z"
    reason: PodFailed
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-03-13T05:34:20Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: alpine:latest
    imageID: ""
    lastState: {}
    name: main
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: PodInitializing
  - image: quay.io/argoproj/argoexec:latest
    imageID: ""
    lastState: {}
    name: wait
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: PodInitializing
  hostIP: 172.18.0.2
  initContainerStatuses:
  - containerID: containerd://66eb4112a8061ff6da5f48b0c15b9c5c5ccf70f8a4ffa079d258b4315b4bbf74
    image: quay.io/argoproj/argoexec:latest
    imageID: quay.io/argoproj/argoexec@sha256:61c8e55d00437565a2823c802cece0b3323aed758e870a5055768337a87d3546
    lastState: {}
    name: init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://66eb4112a8061ff6da5f48b0c15b9c5c5ccf70f8a4ffa079d258b4315b4bbf74
        exitCode: 0
        finishedAt: "2024-03-13T05:34:25Z"
        reason: Completed
        startedAt: "2024-03-13T05:34:21Z"
  message: Pod was active on the node longer than the specified deadline
  phase: Failed
  podIP: 10.244.0.123
  podIPs:
  - ip: 10.244.0.123
  qosClass: Burstable
  reason: DeadlineExceeded
  startTime: "2024-03-13T05:34:20Z"

Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun shuangkun force-pushed the fix/StuckWhenExceedActiveDeadlineSeconds branch from 86c2f4e to 61611e3 Compare March 14, 2024 04:31
@juliev0
Contributor

juliev0 commented Mar 14, 2024

> There seems to be something wrong with this description.

Interesting. Thanks for sending that. Okay, so the Pod is clearly in a failed state.

So, going through my cases above of Phase==Complete:

  • If the Pod is Failed, we'll hold off if the wait container is in Running.
  • If the Pod is Succeeded, presumably all of the containers succeeded, so we won't ever hold off.
  • If the InitContainer failed (and perhaps Pod is not necessarily considered Failed itself), we'll hold off if the wait container is for some reason Running.

We won't hold off any longer if for some reason the wait container is in Pending. I can't really think of a case where this would be a problem. I will probably approve this - may just want to take another look later with fresh eyes.
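
In other words, the cases above could be summarized by a predicate like the hypothetical `shouldHoldOff` below (a simplified sketch, not the code path being changed, and omitting the init-container case for brevity): once the pod has completed, only a genuinely Running wait container should hold the node back.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// shouldHoldOff is a hypothetical summary of the cases above:
//   - pod Failed + wait container Running          -> hold off so outputs can be saved
//   - pod Succeeded                                -> all containers finished, never hold off
//   - pod completed + wait container still Pending -> do not hold off (this PR's fix)
func shouldHoldOff(pod *corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodSucceeded && pod.Status.Phase != corev1.PodFailed {
		return false // pod not completed yet; normal assessment applies
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name == "wait" && cs.State.Running != nil {
			return true
		}
	}
	return false
}

func main() {
	stuck := &corev1.Pod{Status: corev1.PodStatus{
		Phase: corev1.PodFailed,
		ContainerStatuses: []corev1.ContainerStatus{{
			Name:  "wait",
			State: corev1.ContainerState{Waiting: &corev1.ContainerStateWaiting{Reason: "PodInitializing"}},
		}},
	}}
	fmt.Println(shouldHoldOff(stuck)) // false: the node can be failed right away
}
```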

Co-authored-by: Julie Vogelman <julievogelman0@gmail.com>
Signed-off-by: shuangkun tian <72060326+shuangkun@users.noreply.github.com>
@shuangkun shuangkun requested a review from juliev0 March 14, 2024 06:23
@juliev0 juliev0 merged commit 16cfef9 into argoproj:main Mar 15, 2024
27 checks passed
@shuangkun
Member Author

Thanks!

@agilgur5 agilgur5 added this to the v3.5.x patches milestone Apr 19, 2024
agilgur5 pushed a commit that referenced this pull request Apr 19, 2024
… on template level. Fixes: #12329 (#12761)

Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun tian <72060326+shuangkun@users.noreply.github.com>
Co-authored-by: Julie Vogelman <julievogelman0@gmail.com>
(cherry picked from commit 16cfef9)
@agilgur5
Member

Backported cleanly to release-3.5 as 19a7ede

isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request May 6, 2024
… on template level. Fixes: argoproj#12329 (argoproj#12761)

Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun tian <72060326+shuangkun@users.noreply.github.com>
Co-authored-by: Julie Vogelman <julievogelman0@gmail.com>
(cherry picked from commit 16cfef9)
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request May 7, 2024
… on template level. Fixes: argoproj#12329 (argoproj#12761)

Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun tian <72060326+shuangkun@users.noreply.github.com>
Co-authored-by: Julie Vogelman <julievogelman0@gmail.com>
(cherry picked from commit 16cfef9)