Controller deletes Pods before available Nodes are provisioned, leaves workflows in pending state #3468

Closed
dillon-cullinan opened this issue Apr 24, 2024 · 2 comments
Labels: bug (Something isn't working) · gha-runner-scale-set (Related to the gha-runner-scale-set mode) · needs triage (Requires review from the maintainers)

Comments

@dillon-cullinan commented Apr 24, 2024


Controller Version

0.9.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Installed the runner controller based on the docs: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/quickstart-for-actions-runner-controller
2. Deployed a runner scale set based on the docs with a slightly customized values.yaml; minRunners must be 0 for minimal testing (sample install commands below).
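
A minimal sketch of the two quickstart Helm installs, for reference; the release names and namespaces here are illustrative placeholders, not necessarily the exact ones used:

# 1. Controller (chart path from the quickstart; release name/namespace are illustrative)
helm install arc \
  --namespace arc-systems \
  --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller

# 2. Runner scale set, using the custom values.yaml shown under "Additional Context"
helm install arc-runner-set \
  --namespace arc-runners \
  --create-namespace \
  -f values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set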

Describe the bug

The runner controller kills the pod after 1 minute of it being unable to obtain a node to run on. The workflow never starts, is left in a pending state, and there is no attempt to try again.

After the pod is killed, the provisioned node becomes available shortly afterwards, and cancelling and re-running the workflow allows it to run properly.

It consistently happens at the 1 minute mark every time, so I'm guessing it's internal to the controller and is some kind of timeout. For the record, there is another, similar bug related to runner registration: if the Docker image you are pulling takes too long, the controller revokes the registration, causing the pod to die after the pull finishes. I can create another ticket for this if needed, but it looks like very similar timeout behavior.
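
One way to watch the timing is to follow pod events in the runner namespace while a job is queued (the namespace name below is an assumption; adjust it to wherever the runner pods are created):

# The runner pod typically sits in Pending with a FailedScheduling event,
# then gets deleted by the controller roughly 60 seconds later.
kubectl get events --namespace arc-runners --watch \
  --field-selector involvedObject.kind=Pod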

Describe the expected behavior

The controller should be more patient with node provisioning and Docker pulls, or these timeouts should be configurable. This issue does not exist in 0.9.0. The workflow should also not be left in a pending state: if the controller gives up on obtaining a pod, the workflow should be cancelled or the controller should retry.

Additional Context

Exact values.yaml used for the runner scale set. The only requirements to reproduce both described issues are a large image and a node that takes longer than 1 minute to spin up. The other values are not relevant.

---
runnerScaleSetName: <redacted>
githubConfigUrl: <redacted>
githubConfigSecret: <redacted>
maxRunners: 16
minRunners: 0
metadata:
  name: <redacted>
  namespace: gha-runner-scale-set-controller
template:
  metadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  spec:
    dockerdWithinRunnerContainer: true
    nodeSelector:
      cloud.google.com/gke-nodepool: build-heavy-compute
      kubernetes.io/arch: amd64
      kubernetes.io/os: linux
    containers:
    - name: runner
      image: <redacted> # This is a large image ~15GB, ~6 min to pull uncached
      command:
        - bash
        - -c
        - "mkdir -p /home/runner/.docker/docker /home/runner/.local/share && ln -s /home/runner/.docker/docker /home/runner/.local/share/docker && /bin/bash /usr/bin/entrypoint-dind-rootless.sh && /bin/bash /runner/run.sh"
      securityContext:
        privileged: true
      volumeMounts:
      - mountPath: /tmp
        name: tmpdir

    tolerations:
    - key: "<redacted>.com/workload"
      operator: "Equal"
      value: "build-heavy-compute"
      effect: "NoSchedule"        
    volumes:
    - name: tmpdir
      emptyDir: {}
    resources:
      requests:
        cpu: "12000m"
        memory: "28Gi"
        ephemeral-storage: "48Gi"
      limits:
        cpu: "12000m"
        memory: "30Gi"

Controller Logs

https://gist.github.com/dillon-cullinan/db470ee50ab1b411589142d907764e9c

Runner Pod Logs

Describe Logs

https://gist.github.com/dillon-cullinan/8fafe89e61e325c6f82db977e7d52e7c

Pod Logs

None; the pod was never able to obtain a node.

@dillon-cullinan added the bug, gha-runner-scale-set, and needs triage labels on Apr 24, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@nikola-jokic (Member) commented:

Closing in favor of #3450
