
fix(containerSet): mark container deleted when pod deleted. Fixes: #12210 #12756

Merged
merged 9 commits into argoproj:main on Mar 25, 2024

Conversation

@shuangkun (Member) commented Mar 7, 2024

Fixes #12210

Motivation

Avoid the incorrect judgment that the DAG is still running, which causes the workflow to hang. The root cause is that when the pod is cleaned up, its container nodes are still displayed as Running, so assessDAGPhase judges the DAG as Running.

Modifications

Set the container nodes to Error instead of leaving them Running when the pod is deleted.
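To illustrate the idea, here is a minimal, self-contained sketch; the node type and function below are made up for illustration and are not the controller's actual types or the code in this PR:

package main

import "fmt"

// Illustrative node shape; the real controller uses its own node status type.
type node struct {
    ID, BoundaryID, Type, Phase, Message string
}

// markContainersDeleted marks still-running ContainerSet children of a deleted
// Pod node as Error, so assessDAGPhase no longer sees them as Running.
func markContainersDeleted(nodes map[string]*node, podNodeID string) {
    for _, n := range nodes {
        if n.BoundaryID == podNodeID && n.Type == "Container" && n.Phase == "Running" {
            n.Phase = "Error"
            n.Message = "container deleted"
        }
    }
}

func main() {
    nodes := map[string]*node{
        "pod":   {ID: "pod", Type: "Pod", Phase: "Error", Message: "pod deleted"},
        "main":  {ID: "main", BoundaryID: "pod", Type: "Container", Phase: "Running"},
        "main2": {ID: "main2", BoundaryID: "pod", Type: "Container", Phase: "Running"},
    }
    markContainersDeleted(nodes, "pod")
    fmt.Println(nodes["main"].Phase, nodes["main"].Message) // prints: Error container deleted
}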

Verification

Tested locally and with e2e tests.

Before the fix, the pod is deleted but the container nodes remain Running:

    lovely-rhino-544408588:
      boundaryID: lovely-rhino
      children:
      - lovely-rhino-3630214831
      - lovely-rhino-1026744615
      displayName: A
      finishedAt: "2024-03-07T07:07:07Z"
      hostNodeName: kind-control-plane
      id: lovely-rhino-544408588
      message: pod deleted
      name: lovely-rhino.A
      outputs:
        artifacts:
        - name: main-logs
          s3:
            key: lovely-rhino/lovely-rhino-run-544408588/main.log
        - name: main2-logs
          s3:
            key: lovely-rhino/lovely-rhino-run-544408588/main2.log
      phase: Error
      progress: 0/1
      startedAt: "2024-03-07T07:06:34Z"
      templateName: run
      templateScope: local/lovely-rhino
      type: Pod
    lovely-rhino-1026744615:
      boundaryID: lovely-rhino-544408588
      displayName: main2
      finishedAt: null
      id: lovely-rhino-1026744615
      name: lovely-rhino.A.main2
      phase: Running
      progress: 0/1
      startedAt: "2024-03-07T07:06:34Z"
      templateName: run
      templateScope: local/lovely-rhino
      type: Container
    lovely-rhino-3630214831:
      boundaryID: lovely-rhino-544408588
      displayName: main
      finishedAt: null
      id: lovely-rhino-3630214831
      name: lovely-rhino.A.main
      phase: Running
      progress: 0/1
      startedAt: "2024-03-07T07:06:34Z"
      templateName: run
      templateScope: local/lovely-rhino
      type: Container
  phase: Running

After the fix, both container nodes are marked Error:

      lovely-rhino-544408588:
        boundaryID: lovely-rhino
        children:
        - lovely-rhino-3630214831
        - lovely-rhino-1026744615
        displayName: A
        finishedAt: "2024-03-07T07:25:56Z"
        hostNodeName: kind-control-plane
        id: lovely-rhino-544408588
        message: pod deleted
        name: lovely-rhino.A
        outputs:
          artifacts:
          - name: main-logs
            s3:
              key: lovely-rhino/lovely-rhino-run-544408588/main.log
          - name: main2-logs
            s3:
              key: lovely-rhino/lovely-rhino-run-544408588/main2.log
        phase: Error
        progress: 0/1
        startedAt: "2024-03-07T07:25:32Z"
        templateName: run
        templateScope: local/lovely-rhino
        type: Pod
      lovely-rhino-1026744615:
        boundaryID: lovely-rhino-544408588
        displayName: main2
        finishedAt: "2024-03-07T07:25:56Z"
        id: lovely-rhino-1026744615
        message: container deleted
        name: lovely-rhino.A.main2
        phase: Error
        progress: 0/1
        startedAt: "2024-03-07T07:25:33Z"
        templateName: run
        templateScope: local/lovely-rhino
        type: Container
      lovely-rhino-3630214831:
        boundaryID: lovely-rhino-544408588
        displayName: main
        finishedAt: "2024-03-07T07:25:56Z"
        id: lovely-rhino-3630214831
        message: container deleted
        name: lovely-rhino.A.main
        phase: Error
        progress: 0/1
        startedAt: "2024-03-07T07:25:33Z"
        templateName: run
        templateScope: local/lovely-rhino
        type: Container
    phase: Error

@shuangkun shuangkun marked this pull request as draft March 7, 2024 07:49
@shuangkun shuangkun closed this Mar 9, 2024
@shuangkun shuangkun reopened this Mar 9, 2024
@shuangkun shuangkun marked this pull request as ready for review March 9, 2024 14:32
@shuangkun shuangkun added the prioritized-review label Mar 11, 2024
}

// delete pod
time.Sleep(10 * time.Second)
@tczhao (Member) commented Mar 19, 2024

Be mindful of how long a test takes; with so many tests in place, the time to run tests can accumulate very quickly.

Use
_ = os.Setenv("RECENTLY_STARTED_POD_DURATION", "0")
and set the grace period to 0 on metav1.DeleteOptions{} in the deletePods function, so that we can get rid of the time.Sleep here.
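For illustration, a sketch of what that suggestion could look like; the deletePodsImmediately helper, its signature, and the package name are assumptions, not the actual test code:

package controller

import (
    "context"
    "os"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// deletePodsImmediately deletes all pods in the namespace with grace period 0,
// skipping graceful termination, and sets RECENTLY_STARTED_POD_DURATION to 0
// so pod reconciliation acts on the deletion right away -- removing the need
// for a fixed time.Sleep in the test.
func deletePodsImmediately(ctx context.Context, kube kubernetes.Interface, namespace string) error {
    _ = os.Setenv("RECENTLY_STARTED_POD_DURATION", "0")

    zero := int64(0)
    return kube.CoreV1().Pods(namespace).DeleteCollection(ctx,
        metav1.DeleteOptions{GracePeriodSeconds: &zero},
        metav1.ListOptions{},
    )
}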

@shuangkun (Member Author) commented

Yeah, I added it. Thanks!

Member commented, quoting the suggestion above:

use
_ = os.Setenv("RECENTLY_STARTED_POD_DURATION", "0")

This sets it for the whole process though, no? Not just this single test?

@shuangkun (Member Author) commented

It's a single test, and I remove it after the test.

@agilgur5 (Member) commented Mar 24, 2024

That's not quite sufficient. Env vars are set for an entire process -- meaning it affects any tests that are run in parallel and use the same env var. I wasn't sure whether Go had any special handling for this, given that I've seen this in other tests, but according to Go's own docs (for the t.Setenv function), it does not, so this would indeed affect any parallel tests.

While this is done in other tests in this codebase (50 occurrences), which probably need a larger refactoring, this env var is a bit more global in its effects.

This is one of the reasons why tests shouldn't rely on globals.
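For reference, a small sketch of the scoping trade-off (the test name and package are made up): t.Setenv restores the previous value when the test finishes, but the variable is still process-wide, which is why Go refuses to combine it with t.Parallel().

package controller

import "testing"

func TestContainerSetPodDeleted(t *testing.T) {
    // Scoped to this test: the previous value is restored on cleanup.
    t.Setenv("RECENTLY_STARTED_POD_DURATION", "0")

    // t.Parallel() // not allowed: a test that uses t.Setenv cannot run in
    // parallel, because the env var would leak into concurrently running tests.

    // ... exercise the controller and assert the container nodes are marked Error ...
}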

@shuangkun (Member Author) commented

Removed it. I tested, and a 5s sleep is not enough; it needs 10s.

@agilgur5 (Member) commented Mar 24, 2024

Hmm, not needing to sleep in the tests is good to have though... @tczhao what do you think?

Perhaps we leave the sleep in and then refactor in a separate PR to avoid globals? (Note that refactoring the globals might be substantially easier said than done.)

Member commented

Regarding "removed it, and i test sleep 5s is not enough. need 10s":

The 10s is because RECENTLY_STARTED_POD_DURATION defaults to 10s; podReconciliation will not update the workflow status until RECENTLY_STARTED_POD_DURATION has elapsed.
https://github.com/argoproj/argo-workflows/blob/v3.5.5/workflow/controller/operator.go#L1209
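A rough sketch of the behaviour being described (illustrative only, not the actual code behind the link above):

package controller

import (
    "os"
    "time"
)

// recentlyStartedSketch: a pod that started less than
// RECENTLY_STARTED_POD_DURATION ago (default 10s) is treated as recently
// started, and pod reconciliation holds off on updating its node status --
// which is why a 10s sleep was needed with the default in place.
func recentlyStartedSketch(startedAt time.Time) bool {
    duration := 10 * time.Second
    if v := os.Getenv("RECENTLY_STARTED_POD_DURATION"); v != "" {
        if d, err := time.ParseDuration(v); err == nil {
            duration = d
        }
    }
    return time.Since(startedAt) <= duration
}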


Regarding "perhaps we leave one in and then refactor in a separate PR to avoid globals?":

Sounds good to me.

@agilgur5 (Member) commented Mar 25, 2024

Let's make sure to add a // TODO: comment here to modify this in the future. It can reference this thread in the comment: #12756 (comment).
This would be a really easy follow-up to miss (and may not happen soon due to the possible complexity of the refactor), so being explicit is good (and TODOs can be searched for in the codebase easily).
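For example, something along these lines (illustrative placement, mirroring the snippet under review):

// TODO: remove this fixed sleep once RECENTLY_STARTED_POD_DURATION can be
// scoped per test instead of being a process-wide global; see the review
// discussion on #12756.
time.Sleep(10 * time.Second)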

@tczhao tczhao self-assigned this Mar 19, 2024
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun shuangkun requested a review from tczhao March 19, 2024 15:09
@agilgur5 (Member) commented Mar 21, 2024

Quoting the PR title and description:

"containerset does not stop container when pod removed."

"The fundamental reason is that when the pod is cleaned up, the container is still displayed as running, which causes the dag to be judged as running in assessDAGPhase."

"Set container node error instead of running when pod was deleted."

To make sure I understand this correctly -- the Pod and container were correctly stopped, but the container's status node was incorrectly marked?

If so, we should retitle this and clarify in the issue -- the container was stopped and is not running, it's just incorrectly marked as running.

@shuangkun shuangkun changed the title fix: containerset does not stop container when pod removed. Fixes: #12210 fix: mark containerset container error when pod removed. Fixes: #12210 Mar 22, 2024
@shuangkun shuangkun changed the title fix: mark containerset container error when pod removed. Fixes: #12210 fix: mark containerset container deleted when pod deleted. Fixes: #12210 Mar 22, 2024
Signed-off-by: shuangkun <tsk2013uestc@163.com>
@agilgur5 agilgur5 changed the title fix: mark containerset container deleted when pod deleted. Fixes: #12210 fix(containerSet): mark container deleted when pod deleted. Fixes: #12210 Mar 22, 2024
@shuangkun shuangkun requested a review from agilgur5 March 22, 2024 06:14
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
@agilgur5 (Member) left a comment

Thanks for tracking this down, fixing it, and adding a regression test!

@agilgur5 agilgur5 added the area/controller label Mar 25, 2024
@agilgur5 agilgur5 enabled auto-merge (squash) March 25, 2024 07:11
@agilgur5 agilgur5 merged commit cfe2bb7 into argoproj:main Mar 25, 2024
27 checks passed
@shuangkun (Member Author) commented

Regarding "Thanks for tracking this down, fixing it, and adding a regression test!":

Thank you, and thank you @tczhao for your suggestions!

@agilgur5 agilgur5 added this to the v3.5.x patches milestone Apr 19, 2024
agilgur5 pushed a commit that referenced this pull request Apr 19, 2024
…2210 (#12756)

Signed-off-by: shuangkun <tsk2013uestc@163.com>
(cherry picked from commit cfe2bb7)
@agilgur5 (Member) commented
Backported cleanly to release-3.5 as faaddf3

Labels: area/controller, area/templates/container-set, prioritized-review

Successfully merging this pull request may close these issues.

ContainerSet does not stop container task when Pod is removed
3 participants