Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: ensure workflowtaskresults complete before mark workflow completed status. Fixes: #12615 #12574

Merged
merged 7 commits into from
Mar 10, 2024

Conversation

shuangkun
Copy link
Member

@shuangkun shuangkun commented Jan 25, 2024

Fixes #12615

I encountered a problem:

My workflow failed but don‘t have key LabelKeyCompleted. So enter “Processing workflow”, at the same time,my user find the workflow failed and retry it.

At Last, the workflow is Running but have LabelKeyCompleted, so it will never be dealt with again.

So I think we need to ensure workflowtaskresults complete before mark workflow completed status

time="2024-01-25T02:58:46.957Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=dongdong
time="2024-01-25T03:00:25.693Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=41288141 workflow=dongdong
time="2024-01-25T03:00:31.005Z" level=info msg="Processing workflow" namespace=argo workflow=dongdong
time="2024-01-25T03:03:08.430Z" level=warning msg="Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io \"dongdong\": the object has been modified; please apply your changes to the latest version and try again Conflict" namespace=argo workflow=dongdong

Reproduce

This problem is relatively difficult to reproduce. This situation can only occur if the user performs a retry during the last “Processing workflow”. This may occur when very large workflows and API Servers are under pressure.

This is difficult to reproduce in unit testing and e2e testing. But I designed a reasonable experiment to reproduce it and added it to the test for reference.

First, add the code to func "taskResultReconciliation". (I think this is possible, when the Wokflow is large and has complex outputs, processing the workflow may take more than 2s.)

if woc.checkReconciliationComplete() {
	time.Sleep(time.Second*2)
}

Second, run the test. Retry the failed workflow when "Processing workflow".

make TestRetryStoppedButIncompleteWorkflow

Finally, you will get a workflow in Running status but its labelCompleted is true.

tianshuangkun@B-QGHAMD6M-2124 logs % kubectl get workflow -o yaml -w | grep workflows.argoproj.io/completed
    workflows.argoproj.io/completed: "false"
    workflows.argoproj.io/completed: "false"
    workflows.argoproj.io/completed: "true"
    workflows.argoproj.io/completed: "true"
^C
tianshuangkun@B-QGHAMD6M-2124 logs % argo list
NAME                     STATUS    AGE   DURATION   PRIORITY
wf-retry-stopped-b57pd   Running   8s    5s         0

Solution

I hope to define a state to describe the stage of workflow from reaching completion (Failed, Error, Succedd) to truly Completed, named Completing.

Motivation

Modifications

Verification

@shuangkun shuangkun changed the title fix: cancel label completed key merge fix: cancel label completed key merge when reapplyUpdate Jan 25, 2024
@agilgur5 agilgur5 requested a review from juliev0 January 28, 2024 19:46
@Joibel
Copy link
Member

Joibel commented Jan 29, 2024

As a simple code change, this feels wrong.
It feels as though we are heaping workarounds on top of workarounds in the hope that we'll end up with something that works, and that there is an underlying problem with the design that needs fixing instead.

@shuangkun
Copy link
Member Author

As a simple code change, this feels wrong. It feels as though we are heaping workarounds on top of workarounds in the hope that we'll end up with something that works, and that there is an underlying problem with the design that needs fixing instead.

Yes, I think I should change markWorkflowPhase logic, it is the root cause, shouldn't mark it completed status (Failed、Error) when the workflowtaskresults Incompleted.

@Joibel
Copy link
Member

Joibel commented Jan 29, 2024

I agree, so I'd prefer to reject and close this PR. Are you happy to do that?

@shuangkun
Copy link
Member Author

shuangkun commented Jan 29, 2024

As a simple code change, this feels wrong. It feels as though we are heaping workarounds on top of workarounds in the hope that we'll end up with something that works, and that there is an underlying problem with the design that needs fixing instead.

I think so, it is not a good

I agree, so I'd prefer to reject and close this PR. Are you happy to do that?

No need to close it yet. I will submit a new commit in the branch soon.(My customer encountered this problem again. He has a large workflow (8000 steps), but after encountering this problem, he cannot retry, stop, terminate, and can hardly perform any operations. I did a lot of non-standard operations today and finally helped him retry this workflow. And I realized that the final solution should be shouldn't mark it completed status (Failed、Error) when the workflowtaskresults Incompleted. )
We can review it here, because it solve the same question. Thanks!

@shuangkun shuangkun marked this pull request as draft January 29, 2024 14:43
@shuangkun shuangkun changed the title fix: cancel label completed key merge when reapplyUpdate fix: ensure workflowtaskresults complete before mark workflow completed status Jan 29, 2024
@shuangkun shuangkun force-pushed the fix/CancleCompletedLabelMerge branch 3 times, most recently from cb28771 to bd1db45 Compare January 30, 2024 05:55
@juliev0
Copy link
Contributor

juliev0 commented Jan 30, 2024

CC @Garett-MacGowan

@juliev0
Copy link
Contributor

juliev0 commented Jan 30, 2024

Just read this from the original description: "So we need to ensure workflowtaskresults complete before mark workflow completed status".
But that seems to be what we're doing here, no?

@shuangkun
Copy link
Member Author

shuangkun commented Jan 30, 2024

Just read this from the original description: "So we need to ensure workflowtaskresults complete before mark workflow completed status". But that seems to be what we're doing here, no?

I think we should check it a little earlier, because if the code is executed here, the status of the workflow has been set to completed (error、Failed、successd), but it is still possible that common.LabelKeyCompleted is not set because the check fails.
And then,the user may see its status is completed and to make some action update the workflow. At the sametime, the workflow-controller still processing the workflow because of no "common.LabelKeyCompleted" and may update the workflow's common.LabelKeyCompleted.
At Last, the two updates will be Conflict and be merged,the workflow may be Running but have LabelKeyCompleted finnally.

@shuangkun shuangkun force-pushed the fix/CancleCompletedLabelMerge branch from a51f62e to 642b9f4 Compare January 30, 2024 06:50
@Garett-MacGowan
Copy link
Contributor

Will take a look at this after 😴

@shuangkun shuangkun marked this pull request as ready for review January 30, 2024 06:55
@shuangkun shuangkun marked this pull request as draft January 30, 2024 07:23
@shuangkun shuangkun force-pushed the fix/CancleCompletedLabelMerge branch 4 times, most recently from 559dffa to 8e10bf5 Compare January 30, 2024 14:12
@shuangkun shuangkun marked this pull request as ready for review January 30, 2024 14:22
@shuangkun shuangkun marked this pull request as draft January 30, 2024 14:31
@shuangkun shuangkun force-pushed the fix/CancleCompletedLabelMerge branch from 8e10bf5 to 54cf7b5 Compare January 30, 2024 14:59
@juliev0
Copy link
Contributor

juliev0 commented Jan 30, 2024

Just read this from the original description: "So we need to ensure workflowtaskresults complete before mark workflow completed status". But that seems to be what we're doing here, no?

I think we should check it a little earlier, because if the code is executed here, the status of the workflow has been set to completed (error、Failed、successd), but it is still possible that common.LabelKeyCompleted is not set because the check fails. And then,the user may see its status is completed and to make some action update the workflow. At the sametime, the workflow-controller still processing the workflow because of no "common.LabelKeyCompleted" and may update the workflow's common.LabelKeyCompleted. At Last, the two updates will be Conflict and be merged,the workflow may be Running but have LabelKeyCompleted finnally.

Thanks for the explanation.

@Garett-MacGowan
Copy link
Contributor

Does a restarted workflow not get a new ID? @juliev0

@Garett-MacGowan
Copy link
Contributor

This seems to imply that common.LabelKeyCompleted isn't considered in the UI, right?

We can't mark common.LabelKeyCompleted before tasks results have reconciled.

I'll take a better look once I'm at my computer.

@Garett-MacGowan
Copy link
Contributor

Ok, @shuangkun I'm going to take a crack at creating a test case that fails in the way you describe. Once we have that, we can figure out how we should go about resolving this issue.

@juliev0
Copy link
Contributor

juliev0 commented Jan 30, 2024

Does a restarted workflow not get a new ID? @juliev0

A new name? As far as I can tell it doesn't, looking at the code.

@juliev0
Copy link
Contributor

juliev0 commented Mar 2, 2024

Do you mean adding a condition to func (p WorkflowPhase) Completed() bool that checks if woc.wf.ObjectMeta.Labels[common.LabelKeyCompleted] == "true"?

If so, we need to change below to keep old Phase.Completed() functionality.
[...]

What I meant was keeping the Completed() function as it is in master: only returning true for phase={WorkflowSucceeded, WorkflowFailed, WorkflowError}. But basically holding off on setting phase={WorkflowSucceeded, WorkflowFailed, WorkflowError} until that time when we set LabelKeyCompleted=true (when all task results have been reconciled, etc). So, essentially the phase would still be "Running" for a little extra longer.

@juliev0
Copy link
Contributor

juliev0 commented Mar 2, 2024

Oh, now i see the problem with what I said.

  1. checkReconciliationComplete() is dependent on woc.wf.Status.Phase.Completed().

    func (woc *wfOperationCtx) checkReconciliationComplete() bool {
      return woc.wf.Status.Phase.Completed() && !woc.wf.Status.TaskResultsInProgress()
    }

    (We had to include woc.wf.Status.Phase.Completed() in there to know that all TaskResults were in fact complete; otherwise we had no way of knowing that new ones might not be started up in the future (unless there is perhaps another way to track that?). )

  2. And we only set LabelKeyCompleted=true once checkReconciliationComplete()

@juliev0
Copy link
Contributor

juliev0 commented Mar 2, 2024

@Garett-MacGowan sorry I didn't read your message thoroughly before responding. This is an interesting solution you've proposed. Maybe that can work.

…ed status. Fixes: argoproj#12615

Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun shuangkun force-pushed the fix/CancleCompletedLabelMerge branch from a2c22e5 to 522a13c Compare March 2, 2024 16:23
@juliev0
Copy link
Contributor

juliev0 commented Mar 3, 2024

This looks like a nice clean solution to the problem as far as I can tell so far.
Only once the workflow has ended and its reconciliation tasks are done, the Status.Phase changes to a Completed state, and its LabelKeyCompleted is set.

I will dig deeper when I get a chance. Thanks!

Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun
Copy link
Member Author

This looks like a nice clean solution to the problem as far as I can tell so far. Only once the workflow has ended and its reconciliation tasks are done, the Status.Phase changes to a Completed state, and its LabelKeyCompleted is set.

I will dig deeper when I get a chance. Thanks!

Thanks, I have solved the related errors and look forward to your review

@shuangkun shuangkun changed the title fix: ensure workflowtaskresults complete before mark workflow completed status fix: ensure workflowtaskresults complete before mark workflow completed status. Fixes: #12615 Mar 3, 2024
@shuangkun shuangkun requested a review from juliev0 March 3, 2024 13:49
@@ -1704,7 +1706,7 @@ func (woc *wfOperationCtx) deletePVCs(ctx context.Context) error {

switch gcStrategy {
case wfv1.VolumeClaimGCOnSuccess:
if woc.wf.Status.Phase == wfv1.WorkflowError || woc.wf.Status.Phase == wfv1.WorkflowFailed {
if woc.wf.Status.Phase == wfv1.WorkflowError || woc.wf.Status.Phase == wfv1.WorkflowFailed || woc.wf.Status.Phase == wfv1.WorkflowRunning {
Copy link
Contributor

@juliev0 juliev0 Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes me wonder why the original logic wasn't just simply woc.wf.Status.Phase != wfv1.WorkflowSucceeded. That seems more readable. And I assume there's no risk for the cases of WorkflowUnknown and WorkflowPending.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the original function was written with the assumption that it would only ever be called if the Workflow were in fact Completed. If we want to make it safer to call this even when the Workflow isn't completed, then we should also add a check to the wfv1.VolumeClaimGCOnCompletion case.

@@ -513,7 +510,7 @@ func (woc *wfOperationCtx) operate(ctx context.Context) {
woc.markWorkflowError(ctx, err)
}

if woc.execWf.Spec.Metrics != nil {
if woc.execWf.Spec.Metrics != nil && woc.wf.Status.Fulfilled() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: If it were previously assumed that this line until the end of the function should only be invoked after everything was officially completed, then perhaps we just add a line right above here to terminate early if we're done instead of having the condition here. What do you think? This function is so long and windy that I find myself looking for any way to improve readability and maintainability. :)

@juliev0
Copy link
Contributor

juliev0 commented Mar 8, 2024

You know, I'm seeing we may no longer need this method:

func (woc *wfOperationCtx) checkReconciliationComplete() bool {
	woc.log.Debugf("Task results completion status: %v", woc.wf.Status.TaskResultsCompletionStatus)
	return woc.wf.Status.Phase.Completed() && !woc.wf.Status.TaskResultsInProgress()
}

If woc.wf.Status.Phase.Completed(), then it's impossible for woc.wf.Status.TaskResultsInProgress(), right? So, should we remove this method and replace any calls with a simple check to see if Status.Phase is completed?

Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun
Copy link
Member Author

checkReconciliationComplete

OK, I remove it.

Signed-off-by: shuangkun <tsk2013uestc@163.com>
@shuangkun shuangkun requested a review from juliev0 March 9, 2024 04:17
@juliev0
Copy link
Contributor

juliev0 commented Mar 10, 2024

This looks great! Thanks for going through the iterations.

Before merging, @Garett-MacGowan would you mind taking a look as well? Now only once the workflow has ended and its reconciliation tasks are done, the Status.Phase changes to a Completed state, and its LabelKeyCompleted is set. Given that, checkReconciliationComplete() was no longer needed as a method per this comment and was removed.

@Garett-MacGowan
Copy link
Contributor

Looks good to me! Thanks @shuangkun & @juliev0

@juliev0 juliev0 merged commit ebce8ef into argoproj:main Mar 10, 2024
28 checks passed
@agilgur5
Copy link
Member

Thanks everyone for all the efforts here! This iteration looks simpler and cleaner too

@agilgur5 agilgur5 added this to the v3.5.x patches milestone Apr 19, 2024
agilgur5 pushed a commit that referenced this pull request Apr 19, 2024
…ed status. Fixes: #12615 (#12574)

Signed-off-by: shuangkun <tsk2013uestc@163.com>
(cherry picked from commit ebce8ef)
@agilgur5
Copy link
Member

Backported cleanly to release-3.5 as 68c089d

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics prioritized-review For members of the Sustainability Effort
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Workflow in Running but workflows.argoproj.io/completed is true
5 participants