Controller deletes Pods before available Nodes are provisioned, leaves workflows in pending state #3468

Closed
dillon-cullinan opened this issue Apr 24, 2024 · 2 comments
Labels: bug (Something isn't working) · gha-runner-scale-set (Related to the gha-runner-scale-set mode) · needs triage (Requires review from the maintainers)

Comments

@dillon-cullinan commented Apr 24, 2024


Controller Version

0.9.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Installed the runner controller based on the docs: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/quickstart-for-actions-runner-controller
2. Deployed a runner scale set based on the docs with a slightly customized values.yaml; minRunners must be 0 for minimal testing (sample install commands below).
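
A minimal sketch of the two quickstart Helm installs, for reference; the release names and namespaces here are illustrative placeholders, not necessarily the exact ones used:

# 1. Controller (chart path from the quickstart; release name/namespace are illustrative)
helm install arc \
  --namespace arc-systems \
  --create-namespace \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller

# 2. Runner scale set, using the custom values.yaml shown under "Additional Context"
helm install arc-runner-set \
  --namespace arc-runners \
  --create-namespace \
  -f values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set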

Describe the bug

The runner controller kills the pod after 1 minute of it being unable to obtain a node to run on. The workflow never starts, is left in a pending state, and there is no attempt to try again.

After the pod is killed, the provisioned node becomes available shortly afterwards, and cancelling and re-running the workflow allows it to run properly.

It consistently happens at the 1 minute mark every time, so I'm guessing it's internal to the controller and is some kind of timeout. For the record, there is another, similar bug related to runner registration: if the Docker image you are pulling takes too long, the controller revokes the registration, causing the pod to die after the pull finishes. I can create another ticket for this if needed, but it looks like very similar timeout behavior.
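
One way to watch the timing is to follow pod events in the runner namespace while a job is queued (the namespace name below is an assumption; adjust it to wherever the runner pods are created):

# The runner pod typically sits in Pending with a FailedScheduling event,
# then gets deleted by the controller roughly 60 seconds later.
kubectl get events --namespace arc-runners --watch \
  --field-selector involvedObject.kind=Pod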

Describe the expected behavior

The controller should be more patient with node provisioning and Docker pulls, or these timeouts should be configurable. This issue does not exist in 0.9.0. The workflow should also not be left in a pending state: if the controller gives up on obtaining a pod, the workflow should be cancelled or the controller should retry.

Additional Context

Exact values.yaml used for the runner scale set. The only requirements to reproduce both described issues are a large image and a node that takes longer than 1 minute to spin up. The other values are not relevant.

---
runnerScaleSetName: <redacted>
githubConfigUrl: <redacted>
githubConfigSecret: <redacted>
maxRunners: 16
minRunners: 0
metadata:
  name: <redacted>
  namespace: gha-runner-scale-set-controller
template:
  metadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  spec:
    dockerdWithinRunnerContainer: true
    nodeSelector:
      cloud.google.com/gke-nodepool: build-heavy-compute
      kubernetes.io/arch: amd64
      kubernetes.io/os: linux
    containers:
    - name: runner
      image: <redacted> # This is a large image ~15GB, ~6 min to pull uncached
      command:
        - bash
        - -c
        - "mkdir -p /home/runner/.docker/docker /home/runner/.local/share && ln -s /home/runner/.docker/docker /home/runner/.local/share/docker && /bin/bash /usr/bin/entrypoint-dind-rootless.sh && /bin/bash /runner/run.sh"
      securityContext:
        privileged: true
      volumeMounts:
      - mountPath: /tmp
        name: tmpdir

    tolerations:
    - key: "<redacted>.com/workload"
      operator: "Equal"
      value: "build-heavy-compute"
      effect: "NoSchedule"        
    volumes:
    - name: tmpdir
      emptyDir: {}
    resources:
      requests:
        cpu: "12000m"
        memory: "28Gi"
        ephemeral-storage: "48Gi"
      limits:
        cpu: "12000m"
        memory: "30Gi"

Controller Logs

https://gist.github.com/dillon-cullinan/db470ee50ab1b411589142d907764e9c

Runner Pod Logs

Describe Logs

https://gist.github.com/dillon-cullinan/8fafe89e61e325c6f82db977e7d52e7c

Pod Logs

None; the pod was never able to obtain a node.

@dillon-cullinan added the bug, gha-runner-scale-set, and needs triage labels on Apr 24, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@nikola-jokic (Member) commented:

Closing in favor of #3450
