Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestTemplateExecutor fails sometimes (flakey test) #12832

Closed
4 tasks done
shuangkun opened this issue Mar 22, 2024 · 5 comments · Fixed by #12835
Closed
4 tasks done

TestTemplateExecutor fails sometimes (flakey test) #12832

shuangkun opened this issue Mar 22, 2024 · 5 comments · Fixed by #12835
Assignees
Labels

Comments

@shuangkun
Copy link
Member

shuangkun commented Mar 22, 2024

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

Some pr not pass because TestTemplateExecutor:

https://github.com/argoproj/argo-workflows/actions/runs/8306974296/job/22735726882?pr=12787
https://github.com/argoproj/argo-workflows/actions/runs/8260166456/job/22595317286?pr=12741
https://github.com/argoproj/argo-workflows/actions/runs/8490016818/job/23260724522?pr=12862
I've encountered it several times

I want

The test always pass.

Version

latest

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

no need

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@jswxstw
Copy link
Member

jswxstw commented Mar 22, 2024

It seems like related to this PR I submitted: #12620, which I add a new check with WorkflowTaskSet.
I think the root cause is that workflow was updated to Succeeded before woc.removeCompletedTaskSetStatus(ctx).

@shuangkun
Copy link
Member Author

This should be the reason. I was just wondering why the removeCompletedTaskSetStatus was not implemented. This error in testing is basically impossible to reproduce locally.
According to this log, agent didn't patch the label. Maybe we should patch the taskset completed before mark completed.

2024-03-22T04:55:13.6660504Z controller: time="2024-03-22T04:55:11.983Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=861 namespace=argo workflow=template-executor-2x442
2024-03-22T04:55:13.6662026Z controller: time="2024-03-22T04:55:11.983Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=template-executor-2x442
2024-03-22T04:55:13.6663542Z controller: time="2024-03-22T04:55:11.983Z" level=info msg=updateAgentPodStatus namespace=argo workflow=template-executor-2x442
2024-03-22T04:55:13.6664899Z controller: time="2024-03-22T04:55:11.983Z" level=info msg=assessAgentPodStatus namespace=argo podName=template-executor-2x442-1340600742-agent
2024-03-22T04:55:13.6666518Z controller: time="2024-03-22T04:55:11.983Z" level=error msg="was unable to obtain node for template-executor-2x442-2166136261" namespace=argo workflow=template-executor-2x442
2024-03-22T04:55:13.6668036Z controller: time="2024-03-22T04:55:11.983Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=template-executor-2x442
2024-03-22T04:55:13.6669304Z controller: time="2024-03-22T04:55:11.983Z" level=info msg="Creating TaskSet" namespace=argo workflow=template-executor-2x442
2024-03-22T04:55:13.6670534Z controller: time="2024-03-22T04:55:11.990Z" level=info msg=reconcileAgentPod namespace=argo workflow=template-executor-2x442
2024-03-22T04:55:13.6671793Z controller: time="2024-03-22T04:55:11.990Z" level=info msg=updateAgentPodStatus namespace=argo workflow=template-executor-2x442
2024-03-22T04:55:13.6673479Z controller: time="2024-03-22T04:55:11.990Z" level=info msg=assessAgentPodStatus namespace=argo podName=template-executor-2x442-1340600742-agent
2024-03-22T04:55:13.6674721Z controller: time="2024-03-22T04:55:11.990Z" level=info msg="Workflow to be dehydrated" Workflow Size=1281
2024-03-22T04:55:13.6676105Z controller: time="2024-03-22T04:55:11.994Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=883 workflow=template-executor-2x442
2024-03-22T04:55:13.6677771Z controller: time="2024-03-22T04:55:12.997Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=883 namespace=argo workflow=template-executor-2x442
2024-03-22T04:55:13.6679301Z controller: time="2024-03-22T04:55:12.997Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=template-executor-2x442
2024-03-22T04:55:13.6680663Z controller: time="2024-03-22T04:55:12.997Z" level=info msg=updateAgentPodStatus namespace=argo workflow=template-executor-2x442
2024-03-22T04:55:13.6682203Z controller: time="2024-03-22T04:55:12.997Z" level=info msg=assessAgentPodStatus namespace=argo podName=template-executor-2x442-1340600742-agent
2024-03-22T04:55:13.6684001Z controller: time="2024-03-22T04:55:12.997Z" level=error msg="was unable to obtain node for template-executor-2x442-2166136261" namespace=argo workflow=template-executor-2x442
2024-03-22T04:55:13.6685534Z controller: time="2024-03-22T04:55:12.997Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=template-executor-2x442
2024-03-22T04:55:13.6686802Z controller: time="2024-03-22T04:55:12.997Z" level=info msg=reconcileAgentPod namespace=argo workflow=template-executor-2x442
2024-03-22T04:55:13.6688133Z controller: time="2024-03-22T04:55:12.997Z" level=info msg="Updated phase Running -> Succeeded" namespace=argo workflow=template-executor-2x442
2024-03-22T04:55:13.6689497Z controller: time="2024-03-22T04:55:12.997Z" level=info msg="Marking workflow completed" namespace=argo workflow=template-executor-2x442
2024-03-22T04:55:13.6690668Z controller: time="2024-03-22T04:55:12.997Z" level=info msg="Workflow to be dehydrated" Workflow Size=1313
2024-03-22T04:55:13.6691973Z controller: time="2024-03-22T04:55:13.003Z" level=info msg="cleaning up pod" action=deletePod key=argo/template-executor-2x442-1340600742-agent/deletePod
2024-03-22T04:55:13.6693618Z controller: time="2024-03-22T04:55:13.003Z" level=info msg="Workflow update successful" namespace=argo phase=Succeeded resourceVersion=890 workflow=template-executor-2x442

@jswxstw
Copy link
Member

jswxstw commented Mar 22, 2024

I was just wondering why the removeCompletedTaskSetStatus was not implemented.

I think removeCompletedTaskSetStatus was actually eventually executed, just not before the ExpectWorkflowTaskSet or not completely executed yet.

I think the root cause is that workflow was updated to Succeeded before woc.removeCompletedTaskSetStatus(ctx).

WaitForWorkflow(fixtures.ToBeSucceeded) can not guarantee that removeCompletedTaskSetStatus has been executed completely.

According to this log, agent didn't patch the label. Maybe we should patch the taskset completed before mark completed.

Currently, logs will only be printed when the taskset patch failed.

@agilgur5
Copy link
Member

@shuangkun
Copy link
Member Author

=== RUN   TestExecutorPluginsSuite
=== RUN   TestExecutorPluginsSuite/TestTemplateExecutor
Submitting workflow  template-executor-
Waiting 1m30s for workflow metadata.name=template-executor-sc68c
 ? template-executor-sc68c Workflow 0s      

 ● template-executor-sc68c   Workflow 0s      
 └ ◷ template-executor-sc68c Plugin   0s      

 ● template-executor-sc68c   Workflow 0s      
 └ ✔ template-executor-sc68c Plugin   7s      Hello template!

 ✔ template-executor-sc68c   Workflow 8s      
 └ ✔ template-executor-sc68c Plugin   7s      Hello template!

Condition "to be Succeeded" met after 7s
Checking expectation template-executor-sc68c
template-executor-sc68c : Succeeded 
    executor_plugins_test.go:85: 
        	Error Trace:	/home/runner/work/argo-workflows/argo-workflows/test/e2e/executor_plugins_test.go:85
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/fixtures/then.go:[27](https://github.com/argoproj/argo-workflows/actions/runs/8306974296/job/22735726882?pr=12787#step:16:28)4
        	            				/home/runner/work/argo-workflows/argo-workflows/test/e2e/executor_plugins_test.go:81
        	Error:      	Not equal: 
        	            	expected: "true"
        	            	actual  : "false"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-true
        	            	+false
        	Test:       	TestExecutorPluginsSuite/TestTemplateExecutor
=== FAIL: ExecutorPluginsSuite/TestTemplateExecutor
FAIL	github.com/argoproj/argo-workflows/v3/test/e2e	7.9[33](https://github.com/argoproj/argo-workflows/actions/runs/8306974296/job/22735726882?pr=12787#step:16:34)s

@agilgur5 agilgur5 changed the title TestTemplateExecutor fails sometimes TestTemplateExecutor fails sometimes (flakey test) Mar 24, 2024
isubasinghe pushed a commit that referenced this issue Apr 2, 2024
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Apr 19, 2024
agilgur5 pushed a commit that referenced this issue Apr 19, 2024
…12832 (#12835)

Signed-off-by: shuangkun <tsk2013uestc@163.com>
(cherry picked from commit cd0c58e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants