Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutex check happens after memoization cache check and lock info missing from UI #11219

Closed
2 of 3 tasks
mnarozny opened this issue Jun 15, 2023 · 1 comment · Fixed by #11456
Closed
2 of 3 tasks

Mutex check happens after memoization cache check and lock info missing from UI #11219

mnarozny opened this issue Jun 15, 2023 · 1 comment · Fixed by #11456

Comments

@mnarozny
Copy link

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issues exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

When using both memoize and mutex in a template, the cache check happens prior to the mutex check and the UI does not show any information about the job waiting for a lock to be released.

If I have a long running template and multiple workflows run it, I was expecting the other workflows to wait for the cached result behind the lock before even starting, it appears the behaviour is that all workflows will all check cache in parallel for a cache-hit: NO and then proceed to run consecutively. Both in the UI and CLI, the information about the job waiting for a lock is missing.

Example workflow simplified to include two jobs but the same issue happens when multiple workflows use the template.

Version

v3.4.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: example-steps-simple
spec:
  entrypoint: main

  templates:
    - name: main
      steps:
        - - name: job-1
            template: sleep
            arguments:
              parameters:
                - name: sleep_duration
                  value: 30
          - name: job-2
            template: sleep
            arguments:
              parameters:
                - name: sleep_duration
                  value: 15

    - name: sleep
      synchronization:
        mutex:
          name: mutex-example-steps-simple
      inputs:
        parameters:
          - name: sleep_duration
      script:
        image: alpine:latest
        command: [/bin/sh]
        source: |
          echo "Sleeping for {{ inputs.parameters.sleep_duration }}"
          sleep {{ inputs.parameters.sleep_duration }}
      memoize:
        key: "memo-key-5"
        cache:
          configMap:
            name: cache-example-steps-simple

Logs from the workflow controller

time="2023-06-15T13:17:42.050Z" level=info msg="Processing workflow" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:42.156Z" level=info msg="Updated phase  -> Running" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:42.156Z" level=info msg="Steps node example-steps-simple4xnjt initialized Running" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:42.156Z" level=info msg="StepGroup node example-steps-simple4xnjt-3033033798 initialized Running" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:42.305Z" level=info msg="Pod node example-steps-simple4xnjt-3810396681 initialized Pending" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:42.305Z" level=info msg="ci-workflows/Mutex/mutex-example-steps-simple acquired by ci-workflows/example-steps-simple4xnjt/example-steps-simple4xnjt-3810396681. Lock availability: 0/1" mutex=ci-workflows/Mutex/mutex-example-steps-simple
time="2023-06-15T13:17:42.305Z" level=info msg="Node example-steps-simple4xnjt[0].job-1 acquired synchronization lock" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:42.417Z" level=info msg="Created pod: example-steps-simple4xnjt[0].job-1 (example-steps-simple4xnjt-sleep-3810396681)" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:42.503Z" level=info msg="Pod node example-steps-simple4xnjt-3760063824 initialized Pending" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:42.503Z" level=info msg="Workflow step group node example-steps-simple4xnjt-3033033798 not yet completed" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:42.503Z" level=info msg="TaskSet Reconciliation" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:42.503Z" level=info msg=reconcileAgentPod namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:42.511Z" level=info msg="Workflow update successful" namespace=ci-workflows phase=Running resourceVersion=435875605 workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:52.418Z" level=info msg="Processing workflow" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:52.419Z" level=info msg="Task-result reconciliation" namespace=ci-workflows numObjs=0 workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:52.419Z" level=info msg="node changed" namespace=ci-workflows new.message= new.phase=Running new.progress=0/1 nodeID=example-steps-simple4xnjt-3810396681 old.message= old.phase=Pending old.progress=0/1 workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:52.419Z" level=info msg="Workflow pod is missing" namespace=ci-workflows nodeName="example-steps-simple4xnjt[0].job-2" nodePhase=Pending recentlyStarted=false workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:52.419Z" level=info msg="Node example-steps-simple4xnjt[0].job-1 acquired synchronization lock" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:52.420Z" level=info msg="Workflow step group node example-steps-simple4xnjt-3033033798 not yet completed" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:52.420Z" level=info msg="TaskSet Reconciliation" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:52.420Z" level=info msg=reconcileAgentPod namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:17:52.430Z" level=info msg="Workflow update successful" namespace=ci-workflows phase=Running resourceVersion=435876641 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:26.602Z" level=info msg="Processing workflow" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:26.603Z" level=info msg="Task-result reconciliation" namespace=ci-workflows numObjs=1 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:26.603Z" level=info msg="task-result changed" namespace=ci-workflows nodeID=example-steps-simple4xnjt-3810396681 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:26.603Z" level=info msg="node changed" namespace=ci-workflows new.message= new.phase=Succeeded new.progress=0/1 nodeID=example-steps-simple4xnjt-3810396681 old.message= old.phase=Running old.progress=0/1 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:26.603Z" level=info msg="Saving ConfigMap cache entry" key=memo-key-5 name=cache-example-steps-simple namespace=argo-workflows nodeId=example-steps-simple4xnjt-3810396681
time="2023-06-15T13:18:26.786Z" level=info msg="Workflow pod is missing" namespace=ci-workflows nodeName="example-steps-simple4xnjt[0].job-2" nodePhase=Pending recentlyStarted=false workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:26.786Z" level=info msg="Lock has been released by ci-workflows/example-steps-simple4xnjt/example-steps-simple4xnjt-3810396681. Available locks: 1" mutex=ci-workflows/Mutex/mutex-example-steps-simple
time="2023-06-15T13:18:26.787Z" level=info msg="ci-workflows/Mutex/mutex-example-steps-simple acquired by ci-workflows/example-steps-simple4xnjt/example-steps-simple4xnjt-3760063824. Lock availability: 0/1" mutex=ci-workflows/Mutex/mutex-example-steps-simple
time="2023-06-15T13:18:26.787Z" level=info msg="Node example-steps-simple4xnjt[0].job-2 acquired synchronization lock" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:26.893Z" level=info msg="Created pod: example-steps-simple4xnjt[0].job-2 (example-steps-simple4xnjt-sleep-3760063824)" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:26.893Z" level=info msg="Workflow step group node example-steps-simple4xnjt-3033033798 not yet completed" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:26.893Z" level=info msg="TaskSet Reconciliation" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:26.893Z" level=info msg=reconcileAgentPod namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:26.902Z" level=info msg="Workflow update successful" namespace=ci-workflows phase=Running resourceVersion=435879700 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:27.787Z" level=info msg="Processing workflow" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:27.788Z" level=info msg="Task-result reconciliation" namespace=ci-workflows numObjs=1 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:27.788Z" level=info msg="node changed" namespace=ci-workflows new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=example-steps-simple4xnjt-3760063824 old.message= old.phase=Pending old.progress=0/1 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:27.788Z" level=info msg="node unchanged" namespace=ci-workflows nodeID=example-steps-simple4xnjt-3810396681 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:27.788Z" level=info msg="Node example-steps-simple4xnjt[0].job-2 acquired synchronization lock" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:27.788Z" level=info msg="Workflow step group node example-steps-simple4xnjt-3033033798 not yet completed" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:27.788Z" level=info msg="TaskSet Reconciliation" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:27.788Z" level=info msg=reconcileAgentPod namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:27.798Z" level=info msg="Workflow update successful" namespace=ci-workflows phase=Running resourceVersion=435879775 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:37.813Z" level=info msg="Processing workflow" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:37.814Z" level=info msg="Task-result reconciliation" namespace=ci-workflows numObjs=1 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:37.814Z" level=info msg="node changed" namespace=ci-workflows new.message= new.phase=Running new.progress=0/1 nodeID=example-steps-simple4xnjt-3760063824 old.message= old.phase=Pending old.progress=0/1 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:37.814Z" level=info msg="node unchanged" namespace=ci-workflows nodeID=example-steps-simple4xnjt-3810396681 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:37.815Z" level=info msg="Node example-steps-simple4xnjt[0].job-2 acquired synchronization lock" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:37.815Z" level=info msg="Workflow step group node example-steps-simple4xnjt-3033033798 not yet completed" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:37.815Z" level=info msg="TaskSet Reconciliation" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:37.815Z" level=info msg=reconcileAgentPod namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:37.826Z" level=info msg="Workflow update successful" namespace=ci-workflows phase=Running resourceVersion=435880541 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.606Z" level=info msg="Processing workflow" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.607Z" level=info msg="Task-result reconciliation" namespace=ci-workflows numObjs=2 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.607Z" level=info msg="task-result changed" namespace=ci-workflows nodeID=example-steps-simple4xnjt-3760063824 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.607Z" level=info msg="node changed" namespace=ci-workflows new.message= new.phase=Succeeded new.progress=0/1 nodeID=example-steps-simple4xnjt-3760063824 old.message= old.phase=Running old.progress=0/1 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.607Z" level=info msg="Saving ConfigMap cache entry" key=memo-key-5 name=cache-example-steps-simple namespace=argo-workflows nodeId=example-steps-simple4xnjt-3760063824
time="2023-06-15T13:18:54.617Z" level=info msg="node unchanged" namespace=ci-workflows nodeID=example-steps-simple4xnjt-3810396681 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.618Z" level=info msg="Lock has been released by ci-workflows/example-steps-simple4xnjt/example-steps-simple4xnjt-3760063824. Available locks: 1" mutex=ci-workflows/Mutex/mutex-example-steps-simple
time="2023-06-15T13:18:54.618Z" level=info msg="Step group node example-steps-simple4xnjt-3033033798 successful" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.618Z" level=info msg="node example-steps-simple4xnjt-3033033798 phase Running -> Succeeded" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.618Z" level=info msg="node example-steps-simple4xnjt-3033033798 finished: 2023-06-15 13:18:54.618227534 +0000 UTC" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.618Z" level=info msg="Outbound nodes of example-steps-simple4xnjt-3810396681 is [example-steps-simple4xnjt-3810396681]" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.618Z" level=info msg="Outbound nodes of example-steps-simple4xnjt-3760063824 is [example-steps-simple4xnjt-3760063824]" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.618Z" level=info msg="Outbound nodes of example-steps-simple4xnjt is [example-steps-simple4xnjt-3810396681 example-steps-simple4xnjt-3760063824]" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.618Z" level=info msg="node example-steps-simple4xnjt phase Running -> Succeeded" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.618Z" level=info msg="node example-steps-simple4xnjt finished: 2023-06-15 13:18:54.618324516 +0000 UTC" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.618Z" level=info msg="Checking daemoned children of example-steps-simple4xnjt" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.618Z" level=info msg="TaskSet Reconciliation" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.618Z" level=info msg=reconcileAgentPod namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.618Z" level=info msg="Updated phase Running -> Succeeded" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.618Z" level=info msg="Marking workflow completed" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.618Z" level=info msg="Marking workflow as pending archiving" namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.618Z" level=info msg="Checking daemoned children of " namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.618Z" level=info msg="Released all acquired locks" key=example-steps-simple4xnjt namespace=ci-workflows workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:54.624Z" level=info msg="cleaning up pod" action=deletePod key=ci-workflows/example-steps-simple4xnjt-1340600742-agent/deletePod
time="2023-06-15T13:18:54.625Z" level=info msg="Workflow update successful" namespace=ci-workflows phase=Succeeded resourceVersion=435881554 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:59.014Z" level=info msg="archiving workflow" namespace=ci-workflows uid=f9c8ed34-d088-43c4-a421-6317321ab8b3 workflow=example-steps-simple4xnjt
time="2023-06-15T13:18:59.414Z" level=info msg="Queueing Succeeded workflow ci-workflows/example-steps-simple4xnjt for delete in 59m55s due to TTL"
time="2023-06-15T13:19:04.015Z" level=info msg="cleaning up pod" action=deletePod key=ci-workflows/example-steps-simple4xnjt-sleep-3760063824/deletePod
time="2023-06-15T13:19:04.015Z" level=info msg="cleaning up pod" action=deletePod key=ci-workflows/example-steps-simple4xnjt-sleep-3810396681/deletePod

Logs from in your workflow's wait container

time="2023-06-15T13:17:45 UTC" level=info msg="Starting Workflow Executor" version=untagged
time="2023-06-15T13:17:45 UTC" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-06-15T13:17:45 UTC" level=info msg="Executor initialized" deadline="2023-06-15 14:17:42 +0000 UTC" includeScriptOutput=false namespace=ci-workflows podName=example-step
s-simple4xnjt-sleep-3810396681 template="{\"name\":\"sleep\",\"inputs\":{\"parameters\":[{\"name\":\"sleep_duration\",\"value\":\"30\"}]},\"outputs\":{},\"metadata\":{},\"script\"
:{\"name\":\"\",\"image\":\"alpine:latest\",\"command\":[\"/bin/sh\"],\"resources\":{},\"source\":\"echo \\\"Sleeping for 30\\\"\\nsleep 30\\n\"},\"archiveLocation\":{\"archiveLog
s\":true,\"s3\":{\"endpoint\":\"s3.amazonaws.com\",\"bucket\":\"argo-workflows-artifacts-infrastructure-production\",\"region\":\"us-east-1\",\"insecure\":false,\"useSDKCreds\":tr
ue,\"key\":\"example-steps-simple4xnjt/example-steps-simple4xnjt-sleep-3810396681\"}},\"activeDeadlineSeconds\":1800,\"synchronization\":{\"mutex\":{\"name\":\"mutex-example-steps
-simple\"}},\"memoize\":{\"key\":\"memo-key-5\",\"cache\":{\"configMap\":{\"name\":\"cache-example-steps-simple\",\"key\":\"\"}},\"maxAge\":\"\"},\"timeout\":\"45m\"}" version="&V
ersion{Version:untagged,BuildDate:2023-06-13T00:26:22Z,GitCommit:bc457cae6a431461f858e2a2f953fbb16af43a8f,GitTag:untagged,GitTreeState:clean,GoVersion:go1.20.4,Compiler:gc,Platfor
m:linux/amd64,}"
time="2023-06-15T13:17:45 UTC" level=info msg="Starting deadline monitor"
time="2023-06-15T13:18:16 UTC" level=info msg="Main container completed" error="<nil>"
time="2023-06-15T13:18:16 UTC" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-06-15T13:18:16 UTC" level=info msg="No output parameters"
time="2023-06-15T13:18:16 UTC" level=info msg="No output artifacts"
time="2023-06-15T13:18:16 UTC" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: example-steps-simple4xnjt/example-steps-simple4xnjt-sleep-3810396681/main.log"
time="2023-06-15T13:18:16 UTC" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-06-15T13:18:16 UTC" level=info msg="Saving file to s3" bucket=argo-workflows-artifacts-infrastructure-production endpoint=s3.amazonaws.com key=example-steps-simple4xnjt
/example-steps-simple4xnjt-sleep-3810396681/main.log path=/tmp/argo/outputs/logs/main.log
time="2023-06-15T13:18:16 UTC" level=info msg="Save artifact" artifactName=main-logs duration=93.346992ms error="<nil>" key=example-steps-simple4xnjt/example-steps-simple4xnjt-sle
ep-3810396681/main.log
time="2023-06-15T13:18:16 UTC" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2023-06-15T13:18:16 UTC" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2023-06-15T13:18:16 UTC" level=info msg="Create workflowtaskresults 201"
time="2023-06-15T13:18:16 UTC" level=info msg="Alloc=11816 TotalAlloc=20511 Sys=44157 NumGC=5 Goroutines=10"
time="2023-06-15T13:18:16 UTC" level=info msg="Deadline monitor stopped"
time="2023-06-15T13:18:16 UTC" level=info msg="stopping progress monitor (context done)" error="context canceled"
@shmruin
Copy link
Contributor

shmruin commented Jul 23, 2023

Hi, @mnarozny. Can I be assigned to this issue? I check the workflow you paste and so far I notice,

  1. Mutex synchronization works fine. job-1 runs and job-2 runs after that.
  2. Memoization not working. Never hit on both job-1 and job-2.

I want to fix this. It seems when job-1 is completed then job-2 should not run. as cache is hit.

shmruin added a commit to shmruin/argo-workflows that referenced this issue Jul 26, 2023
shmruin added a commit to shmruin/argo-workflows that referenced this issue Aug 2, 2023
shmruin added a commit to shmruin/argo-workflows that referenced this issue Aug 3, 2023
juliev0 pushed a commit that referenced this issue Aug 7, 2023
jaen pushed a commit to jaen/argo-workflows that referenced this issue Aug 12, 2023
jaen pushed a commit to jaen/argo-workflows that referenced this issue Aug 12, 2023
shmruin added a commit to shmruin/argo-workflows that referenced this issue Aug 15, 2023
terrytangyuan pushed a commit that referenced this issue Aug 15, 2023
dpadhiar pushed a commit to dpadhiar/argo-workflows that referenced this issue May 9, 2024
…argoproj#11219 (argoproj#11578)

Signed-off-by: shmruin <meme_hm@naver.com>
Signed-off-by: Dillen Padhiar <dillen_padhiar@intuit.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants