fix: ensure workflowtaskresults complete before mark workflow completed status. Fixes: #12615 #12574
As a simple code change, this feels wrong. |
Yes, I think I should change the markWorkflowPhase logic, which is the root cause: it shouldn't mark the workflow with a completed status (Failed, Error) while the workflowtaskresults are incomplete. |
I agree, so I'd prefer to reject and close this PR. Are you happy to do that? |
I think so, it is not a good
No need to close it yet. I will push a new commit to the branch soon. (My customer encountered this problem again. He has a large workflow (8000 steps), but after hitting this problem he could not retry, stop, or terminate it, and could hardly perform any operation on it. I did a lot of non-standard operations today and finally helped him retry the workflow. I realized that the final solution is to not mark the workflow with a completed status (Failed, Error) while the workflowtaskresults are incomplete.) |
cb28771
to
bd1db45
Compare
Just read this from the original description: "So we need to ensure workflowtaskresults complete before mark workflow completed status". |
I think we should check it a little earlier. By the time execution reaches this point, the workflow status has already been set to a completed phase (Error, Failed, Succeeded), but common.LabelKeyCompleted may still be unset because the check fails. |
a51f62e
to
642b9f4
Compare
Will take a look at this after 😴 |
559dffa
to
8e10bf5
Compare
8e10bf5
to
54cf7b5
Compare
Thanks for the explanation. |
Does a restarted workflow not get a new ID? @juliev0 |
This seems to imply that we can't mark common.LabelKeyCompleted before task results have reconciled. I'll take a better look once I'm at my computer. |
Ok, @shuangkun I'm going to take a crack at creating a test case that fails in the way you describe. Once we have that, we can figure out how we should go about resolving this issue. |
What I meant was keeping the |
Oh, now I see the problem with what I said.
|
@Garett-MacGowan sorry I didn't read your message thoroughly before responding. This is an interesting solution you've proposed. Maybe that can work. |
…ed status. Fixes: argoproj#12615 Signed-off-by: shuangkun <tsk2013uestc@163.com>
a2c22e5
to
522a13c
Compare
This looks like a nice clean solution to the problem as far as I can tell so far. I will dig deeper when I get a chance. Thanks! |
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Thanks, I have fixed the related errors and look forward to your review. |
workflow/controller/operator.go
Outdated
@@ -1704,7 +1706,7 @@ func (woc *wfOperationCtx) deletePVCs(ctx context.Context) error {

 	switch gcStrategy {
 	case wfv1.VolumeClaimGCOnSuccess:
-		if woc.wf.Status.Phase == wfv1.WorkflowError || woc.wf.Status.Phase == wfv1.WorkflowFailed {
+		if woc.wf.Status.Phase == wfv1.WorkflowError || woc.wf.Status.Phase == wfv1.WorkflowFailed || woc.wf.Status.Phase == wfv1.WorkflowRunning {
This makes me wonder why the original logic wasn't just simply woc.wf.Status.Phase != wfv1.WorkflowSucceeded. That seems more readable. And I assume there's no risk for the cases of WorkflowUnknown and WorkflowPending.
I guess the original function was written with the assumption that it would only ever be called if the Workflow were in fact Completed. If we want to make it safer to call this even when the Workflow isn't completed, then we should also add a check to the wfv1.VolumeClaimGCOnCompletion case.
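To make the two review comments above concrete, here is a minimal sketch of a PVC garbage-collection decision that stays safe when called on a workflow that has not completed. The types, the constant values, and the helper shouldDeletePVCs are local stand-ins invented for illustration; they are not the real wfv1 definitions or the code in deletePVCs.

```go
package main

import "fmt"

// WorkflowPhase is a local stand-in for wfv1.WorkflowPhase (assumption).
type WorkflowPhase string

const (
	WorkflowPending   WorkflowPhase = "Pending"
	WorkflowRunning   WorkflowPhase = "Running"
	WorkflowSucceeded WorkflowPhase = "Succeeded"
	WorkflowFailed    WorkflowPhase = "Failed"
	WorkflowError     WorkflowPhase = "Error"
)

// VolumeClaimGCStrategy is a local stand-in for the wfv1 strategy type (assumption).
type VolumeClaimGCStrategy string

const (
	VolumeClaimGCOnSuccess    VolumeClaimGCStrategy = "OnWorkflowSuccess"
	VolumeClaimGCOnCompletion VolumeClaimGCStrategy = "OnWorkflowCompletion"
)

// completed reports whether the phase is terminal.
func (p WorkflowPhase) completed() bool {
	return p == WorkflowSucceeded || p == WorkflowFailed || p == WorkflowError
}

// shouldDeletePVCs is a hypothetical helper. OnSuccess deletes only for a
// succeeded workflow, which is the "!= WorkflowSucceeded means keep" reading
// from the review. OnCompletion deletes only for a terminal phase, so a
// Running or Pending workflow never loses its PVCs. Unknown strategies keep
// the PVCs in this sketch.
func shouldDeletePVCs(strategy VolumeClaimGCStrategy, phase WorkflowPhase) bool {
	switch strategy {
	case VolumeClaimGCOnSuccess:
		return phase == WorkflowSucceeded
	case VolumeClaimGCOnCompletion:
		return phase.completed()
	default:
		return false
	}
}

func main() {
	for _, p := range []WorkflowPhase{WorkflowPending, WorkflowRunning, WorkflowSucceeded, WorkflowFailed} {
		fmt.Println(p, shouldDeletePVCs(VolumeClaimGCOnSuccess, p), shouldDeletePVCs(VolumeClaimGCOnCompletion, p))
	}
}
```

Expressing the decision as a single predicate keeps both strategy cases guarded in one place, rather than listing non-terminal phases one by one in the OnSuccess branch.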
workflow/controller/operator.go
Outdated
@@ -513,7 +510,7 @@ func (woc *wfOperationCtx) operate(ctx context.Context) {
 		woc.markWorkflowError(ctx, err)
 	}

-	if woc.execWf.Spec.Metrics != nil {
+	if woc.execWf.Spec.Metrics != nil && woc.wf.Status.Fulfilled() {
nit: If it were previously assumed that this line until the end of the function should only be invoked after everything was officially completed, then perhaps we just add a line right above here to terminate early if we're done instead of having the condition here. What do you think? This function is so long and windy that I find myself looking for any way to improve readability and maintainability. :)
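The early-termination idea in the nit above could look roughly like the following sketch. The opCtx struct, its fields, and the two recorded steps are hypothetical stand-ins for the tail of operate; the real function does more than this and may order its steps differently.

```go
package main

import "fmt"

// opCtx is a hypothetical, stripped-down stand-in for wfOperationCtx.
type opCtx struct {
	fulfilled  bool     // stand-in for woc.wf.Status.Fulfilled()
	hasMetrics bool     // stand-in for woc.execWf.Spec.Metrics != nil
	steps      []string // records which post-completion steps ran
}

// finalize models the end of operate. One guard at the top replaces a
// Fulfilled() check on every individual post-completion step.
func (c *opCtx) finalize() {
	if !c.fulfilled {
		// Not done yet: skip everything that assumes a completed workflow.
		return
	}
	if c.hasMetrics {
		c.steps = append(c.steps, "emit workflow metrics")
	}
	c.steps = append(c.steps, "delete PVCs")
}

func main() {
	running := &opCtx{hasMetrics: true}
	running.finalize()
	fmt.Println(len(running.steps), "steps ran for a running workflow")

	done := &opCtx{fulfilled: true, hasMetrics: true}
	done.finalize()
	fmt.Println(done.steps)
}
```

With the guard in one place, later additions to the post-completion section inherit the check automatically.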
You know, I'm seeing we may no longer need this method: func (woc *wfOperationCtx) checkReconciliationComplete() bool {
woc.log.Debugf("Task results completion status: %v", woc.wf.Status.TaskResultsCompletionStatus)
return woc.wf.Status.Phase.Completed() && !woc.wf.Status.TaskResultsInProgress()
} If |
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun <tsk2013uestc@163.com>
OK, I removed it. |
Signed-off-by: shuangkun <tsk2013uestc@163.com>
This looks great! Thanks for going through the iterations. Before merging, @Garett-MacGowan would you mind taking a look as well? Now only once the workflow has ended and its reconciliation tasks are done, the Status.Phase changes to a Completed state, and its LabelKeyCompleted is set. Given that, |
Looks good to me! Thanks @shuangkun & @juliev0 |
Thanks everyone for all the efforts here! This iteration looks simpler and cleaner too |
Backported cleanly to |
Fixes #12615
I encountered a problem:
My workflow failed but did not have the LabelKeyCompleted label, so the controller entered "Processing workflow" again. At the same time, my user saw that the workflow had failed and retried it.
In the end, the workflow is Running but has LabelKeyCompleted set, so the controller will never process it again.
So I think we need to ensure the workflowtaskresults are complete before marking the workflow with a completed status.
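The interleaving described above can be modelled in a few lines. This is a simplified sketch under stated assumptions: the workflow struct, retry, and markPhaseGated are hypothetical, the steps run sequentially rather than concurrently, and the model does not reproduce the controller's real update path. It only shows why gating the phase change on task-result completion closes the window for the retry.

```go
package main

import "fmt"

type Phase string

const (
	Running Phase = "Running"
	Failed  Phase = "Failed"
	Error   Phase = "Error"
)

// workflow is a hypothetical model of the fields involved in the race.
type workflow struct {
	phase                 Phase
	completedLabel        bool // stand-in for common.LabelKeyCompleted
	taskResultsInProgress bool
}

// retry models a user retry, which is only accepted for Failed or Error.
func (w *workflow) retry() bool {
	if w.phase != Failed && w.phase != Error {
		return false
	}
	w.phase = Running
	return true
}

// markPhaseGated models the fix: the terminal phase and the completed label
// are set together, and only once task results have been reconciled.
func (w *workflow) markPhaseGated(p Phase) bool {
	if w.taskResultsInProgress {
		return false
	}
	w.phase = p
	w.completedLabel = true
	return true
}

// ungatedInterleaving replays the reported sequence without the gate.
func ungatedInterleaving() workflow {
	w := workflow{phase: Running, taskResultsInProgress: true}
	w.phase = Failed                // phase marked terminal early, label withheld
	w.retry()                       // user sees Failed and retries
	w.taskResultsInProgress = false // in-flight pass finishes reconciliation
	w.completedLabel = true         // label written without rechecking the phase
	return w
}

// gatedInterleaving replays the same sequence with the gate in place.
func gatedInterleaving() workflow {
	w := workflow{phase: Running, taskResultsInProgress: true}
	w.markPhaseGated(Failed) // refused: task results still in progress
	w.retry()                // refused: the workflow is still Running
	w.taskResultsInProgress = false
	w.markPhaseGated(Failed) // phase and label now change together
	return w
}

func main() {
	fmt.Printf("ungated: %+v\n", ungatedInterleaving())
	fmt.Printf("gated:   %+v\n", gatedInterleaving())
}
```

In the ungated replay the workflow ends up Running with the completed label set, which is the stuck state from the report. In the gated replay the retry is refused until the phase and label have changed together.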
Reproduce
This problem is relatively difficult to reproduce. It can only occur if the user retries during the last "Processing workflow" pass, which may happen when workflows are very large and the API server is under pressure.
It is difficult to reproduce in unit and e2e tests, but I designed an experiment that reproduces it and added it to the tests for reference.
First, add the code to func "taskResultReconciliation". (I think this is realistic: when the Workflow is large and has complex outputs, processing it may take more than 2s.)
Second, run the test. Retry the failed workflow during "Processing workflow".
Finally, you will get a workflow in Running status whose LabelKeyCompleted label is true.
Solution
I would like to define a phase, named Completing, that covers the stage between a workflow reaching a terminal result (Failed, Error, Succeeded) and being truly Completed.