Scale Set Listener Stops Responding #3204

Closed
4 tasks done
jameshounshell opened this issue Jan 5, 2024 · 25 comments · Fixed by #3307
Labels
bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode

Comments

@jameshounshell

jameshounshell commented Jan 5, 2024

Controller Version

0.7.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (for Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes.

To Reproduce

1. Install both the ARC and the scale set Helm charts in a Kubernetes cluster.
2. Run GitHub Actions workflows/jobs normally with a self-hosted runner.

Beyond this, we do not have specific steps to reproduce, and we would very much appreciate suggestions for how we could gather more information or trigger this bug.

Describe the bug

Symptom:

  • Developers reported that jobs were pending for over 20 minutes (in the GitHub workflow job view)
  • Restarted the controller and nothing happened
  • Deleted the listener; once a new listener pod starts, new runner pods start. (Alternatively, we have since discovered that after about an hour the listener begins responding again and the logs say "refreshing token".)

Diagnostics:

  • The listener stops logging
  • The only signal we can alert on is an expression like ((gha_assigned_jobs > 0) and (rate(gha_assigned_jobs[10m]) == 0)) against the metrics provided by the listener (see the example alert rule sketched just below).
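For anyone who wants to page on this condition, here is a minimal sketch of a Prometheus alerting rule built around that expression. The rule name, for: duration, labels, and annotations are hypothetical additions; only the expr comes from the report above, and it assumes the listener's metrics endpoint is being scraped.

groups:
  - name: arc-scale-set-listener
    rules:
      - alert: ScaleSetListenerFlatlined
        # Fires when jobs are assigned but the gha_assigned_jobs series has
        # shown no movement for 10 minutes (the expression suggested above).
        expr: (gha_assigned_jobs > 0) and (rate(gha_assigned_jobs[10m]) == 0)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "ARC scale set listener appears to have stopped processing jobs"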

Another symptom I've noticed: when the controller pod is updated (for instance, from 0.7.0 to 0.8.0), two inconsistencies happen:

  • If there are old job pods running, they block the new listener from starting. I need to delete the pods and do a rollout restart of the controller to get the listener going.
  • Just updating the deployment doesn't seem to consistently start the listener either. There is always a chance that a rollout restart is needed to get it going.

A separate instance where we observed the issue

Currently running listener gha-runner-scale-set-controller:0.7.0
No logs were being generated, and we had found a workflow job stuck in a pending state (from the listener logs).
Ten runners were online, but the runner group showed as offline in our org's Actions runner group settings (URL like https://github.com/organizations/bigcorp/settings/actions/runner-groups/9).

We verified that the listener metrics had flatlined.

Then the queued job being watched was assigned with no intervention.
Below is the section of logs where the listener had stopped and then started working again with no intervention.

2023-12-28T17:10:14Z	INFO	service	process batched runner scale set job messages.	{"messageId": 13462, "batchSize": 4}
2023-12-28T17:10:14Z	INFO	service	job started message received.	{"RequestId": 1026947, "RunnerId": 1062477}
2023-12-28T17:10:14Z	INFO	service	update job info for runner	{"runnerName": "self-hosted-pqbsj-runner-dkmf2", "ownerName": "bigcorp", "repoName": "cc-api-orc-kt", "workflowRef": "bigcorp/github-actions-shared/.github/workflows/sql-migration-lint.yaml@refs/heads/v1", "workflowRunId": 7349734264, "jobDisplayName": "build / sql-lint / lint-migrations", "requestId": 1026947}
2023-12-28T17:10:14Z	INFO	KubernetesManager	Created merge patch json for EphemeralRunner status update	{"json": "{\"status\":{\"jobDisplayName\":\"build / sql-lint / lint-migrations\",\"jobRepositoryName\":\"bigcorp/cc-api-orc-kt\",\"jobRequestId\":1026947,\"jobWorkflowRef\":\"bigcorp/github-actions-shared/.github/workflows/sql-migration-lint.yaml@refs/heads/v1\",\"workflowRunId\":7349734264}}"}
2023-12-28T17:10:14Z	INFO	service	job started message received.	{"RequestId": 1026948, "RunnerId": 1062478}
2023-12-28T17:10:14Z	INFO	service	update job info for runner	{"runnerName": "self-hosted-pqbsj-runner-gl8pq", "ownerName": "bigcorp", "repoName": "cc-api-orc-kt", "workflowRef": "bigcorp/github-actions-shared/.github/workflows/gradle.yaml@refs/heads/v1", "workflowRunId": 7349734264, "jobDisplayName": "build / build / code build", "requestId": 1026948}
2023-12-28T17:10:14Z	INFO	KubernetesManager	Created merge patch json for EphemeralRunner status update	{"json": "{\"status\":{\"jobDisplayName\":\"build / build / code build\",\"jobRepositoryName\":\"bigcorp/cc-api-orc-kt\",\"jobRequestId\":1026948,\"jobWorkflowRef\":\"bigcorp/github-actions-shared/.github/workflows/gradle.yaml@refs/heads/v1\",\"workflowRunId\":7349734264}}"}
2023-12-28T17:10:14Z	INFO	service	job assigned message received.	{"RequestId": 1026947}
2023-12-28T17:10:14Z	INFO	service	job assigned message received.	{"RequestId": 1026948}
2023-12-28T17:10:14Z	INFO	auto_scaler	acquiring jobs.	{"request count": 0, "requestIds": "[]"}
2023-12-28T17:10:15Z	INFO	auto_scaler	deleted message.	{"messageId": 13462}
2023-12-28T17:10:15Z	INFO	service	waiting for message...

2023-12-28T18:10:15Z	INFO	refreshing_client	message queue token is expired during GetNextMessage, refreshing...
2023-12-28T18:10:15Z	INFO	refreshing token	{"githubConfigUrl": "https://github.com/bigcorp"}
2023-12-28T18:10:15Z	INFO	getting access token for GitHub App auth	{"accessTokenURL": "https://api.github.com/app/installations/43625644/access_tokens"}
2023-12-28T18:10:15Z	INFO	getting runner registration token	{"registrationTokenURL": "https://api.github.com/orgs/bigcorp/actions/runners/registration-token"}
2023-12-28T18:10:15Z	INFO	getting Actions tenant URL and JWT	{"registrationURL": "https://api.github.com/actions/runner-registration"}
2023-12-28T18:10:16Z	INFO	service	process message.	{"messageId": 13463, "messageType": "RunnerScaleSetJobMessages"}
2023-12-28T18:10:16Z	INFO	service	current runner scale set statistics.	{"available jobs": 26, "acquired jobs": 0, "assigned jobs": 0, "running jobs": 0, "registered runners": 10, "busy runners": 0, "idle runners": 10}
2023-12-28T18:10:16Z	INFO	service	process batched runner scale set job messages.	{"messageId": 13463, "batchSize": 1}
2023-12-28T18:10:16Z	INFO	service	job completed message received.	{"RequestId": 1026947, "Result": "succeeded", "RunnerId": 1062477, "RunnerName": "self-hosted-pqbsj-runner-dkmf2"}
2023-12-28T18:10:16Z	INFO	auto_scaler	acquiring jobs.	{"request count": 0, "requestIds": "[]"}

Describe the expected behavior

Ideally the listener would never stop responding.

Additional Context

The only additional thing we tried was using the opt-out button in our GitHub App's advanced features. This was something of a Hail Mary, since we saw the logs related to refreshing the token.
It seems to have helped, but maybe that's just a fluke.

Controller Logs

Here are the logs from 10 minutes before through 10 minutes after the listener stops responding.

https://gist.github.com/jameshounshell/597358f9b0b624d1f80f98057ddddcf0

Runner Pod Logs

N/A
@jameshounshell jameshounshell added bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers labels Jan 5, 2024
@jameshounshell
Author

jameshounshell commented Jan 5, 2024

We've had a pretty huge break in the case.

It appears that the listener's failure to respond is directly linked to when the token is refreshed.

In the screenshot, highlighted in yellow, you can see each token refresh via the "refreshing_client" log message.
The refresh appears to be reactive, in response to a failed HTTP call, and happens roughly every 60-65 minutes (which jibes with the one-hour maximum token lifetime for a GitHub App).
But before the outage there is no token refresh; then, one hour later, the token refreshes, the logs begin again, and jobs start running again.

@nikola-jokic
Member

Hey @jameshounshell,

Is it possible that there are network interruptions causing problems? In version 0.8.0, we added client timeouts to the retryable client, which could help determine whether there are any network issues. Just keep in mind that we are missing metrics in the newest listener app (they will be part of 0.8.2; to enable metrics in the meantime, you can use the old listener app as described here).

@jameshounshell
Author

We can sit tight and wait for 0.8.2 and report back on this issue.

@nikola-jokic nikola-jokic removed the needs triage Requires review from the maintainers label Jan 23, 2024
@jasonluck

I am seeing this exact same problem with our listener pod. I can also confirm it appears to be related to the token refresh: it's always the last thing logged by the pod before it stops responding.

@nikola-jokic
Member

Hey @jasonluck,

Could you also please post the logs so we can inspect them? It is always better to have more evidence of an issue to help us find the root cause.

@nikola-jokic
Member

I am going to close this issue until we hear back from you. Please, let us know if the issue is resolved. If not, we can always re-open it ☺️

@ccureau

ccureau commented Feb 12, 2024

Now that we've updated to the 0.8.2 runner, we are getting a better indication of what happens when the listener pod stops. Instead of simply no longer polling for jobs, the pod now stops in an error state. The controller does not attempt to kill or restart the listener.

Here are the logs I collected this morning:

{"severity":"info","ts":"2024-02-11T02:27:36Z","logger":"listener-app.listener","message":"Getting next message","lastMessageID":557}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app.listener","message":"Processing message","messageId":558,"messageType":"RunnerScaleSetJobMessages"}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app.listener","message":"New runner scale set statistics.","statistics":{"totalAvailableJobs":1,"totalAcquiredJobs":1,"totalAssignedJobs":1,"totalRunningJobs":1,"totalRegisteredRunners":11,"totalBusyRunners":1,"totalIdleRunners":10}}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app.listener","message":"Job available message received","jobId":1132810}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app.listener","message":"Acquiring jobs","count":1,"requestIds":"[1132810]"}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app.listener","message":"Message queue token is expired during GetNextMessage, refreshing..."}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app","message":"refreshing token","githubConfigUrl":"https://github.com/missionlane"}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app","message":"getting access token for GitHub App auth","accessTokenURL":"https://api.github.com/app/installations/43625644/access_tokens"}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app","message":"getting runner registration token","registrationTokenURL":"https://api.github.com/orgs/missionlane/actions/runners/registration-token"}
{"severity":"info","ts":"2024-02-11T02:27:57Z","logger":"listener-app","message":"getting Actions tenant URL and JWT","registrationURL":"https://api.github.com/actions/runner-registration"}
{"severity":"info","ts":"2024-02-11T02:27:58Z","logger":"listener-app.listener","message":"Deleting message session"}
2024/02/11 02:27:58 Application returned an error: failed to handle message: failed to acquire jobs: failed to acquire jobs after session refresh: 400 - had issue communicating with Actions backend: Value cannot be null. (Parameter 'source')

@ccureau

ccureau commented Feb 12, 2024

@nikola-jokic Please re-open this ticket :)

@nikola-jokic nikola-jokic reopened this Feb 12, 2024
@RalphSleighK

RalphSleighK commented Feb 14, 2024

We are on 0.8.1 (with the LISTENER_ENTRYPOINT workaround), and I think we saw this today: one of our four listeners just stopped responding. The controller finalised and removed its remaining runner pods as they finished their jobs, but it didn't produce any errors or otherwise appear to notice. About 45 minutes later a node scale-down evicted the controller, and it all came back to life after that.

We also saw a token refresh just before this happened, listener logs:

2024-02-14T11:55:41Z INFO refreshing_client message queue token is expired during AcquireJobs, refreshing...
2024-02-14T11:55:41Z INFO refreshing token {"githubConfigUrl": "https://github.com/********"}
2024-02-14T11:55:42Z INFO getting access token for GitHub App auth {"accessTokenURL": "https://api.github.com/app/installations/********/access_tokens"}
2024-02-14T11:55:48Z INFO getting runner registration token {"registrationTokenURL": "https://api.github.com/orgs/********/actions/runners/registration-token"}
2024-02-14T11:55:52Z INFO getting Actions tenant URL and JWT {"registrationURL": "https://api.github.com/actions/runner-registration"}
2024-02-14T11:56:02Z INFO auto_scaler acquired jobs. {"requested": 5, "acquired": 5}
  
2024-02-14T12:41:20Z INFO Starting metrics server {"address": ":8080"}

We have been using the current version for about a month without seeing it until today, but we are on EKS and upgraded from 1.28 to 1.29 yesterday. If it happens again I may downgrade and see if it goes away.

@ccureau

ccureau commented Feb 14, 2024

@RalphSleighK This is the behavior we saw with 0.8.1 as well. If left alone, it seems to recover after about an hour of being unresponsive. Otherwise it takes us deleting the pod to make it happy again.

@angusjellis

@ccureau Out of curiosity, what version of Kubernetes are you using? We've just upgraded to 1.29 and are wondering whether it's a coincidence that we saw this issue post-upgrade.

@ccureau

ccureau commented Feb 14, 2024

@angusjellis We're on 1.27.10 in a GKE cluster. For what it's worth, we did an upgrade from 1.26 to 1.27 while this was happening, without any change in listener behavior.

@ccureau

ccureau commented Feb 14, 2024

We got a new error this morning; we're unsure what to make of it:

{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app.listener","message":"Deleting last message","lastMessageID":1298}
{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app.listener","message":"Deleting message session"}
{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app","message":"refreshing token","githubConfigUrl":"https://github.com/missionlane"}
{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app","message":"getting access token for GitHub App auth","accessTokenURL":"https://api.github.com/app/installations/43625644/access_tokens"}
{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app","message":"getting runner registration token","registrationTokenURL":"https://api.github.com/orgs/missionlane/actions/runners/registration-token"}
{"severity":"info","ts":"2024-02-14T11:21:43Z","logger":"listener-app","message":"getting Actions tenant URL and JWT","registrationURL":"https://api.github.com/actions/runner-registration"}
2024/02/14 11:21:43 Application returned an error: failed to handle message: failed to delete message: failed to delete message: {"$id":"1","innerException":null,"message":"The user 'System:PublicAccess;aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa' is not authorized to access this resource.","typeName":"Microsoft.TeamFoundation.Framework.Server.UnauthorizedRequestException, Microsoft.TeamFoundation.Framework.Server","typeKey":"UnauthorizedRequestException","errorCode":0,"eventId":3000}

@pavelsmolensky
Contributor

With ghcr.io/actions/gha-runner-scale-set-controller:0.8.3 we still get

2024/03/06 13:19:57 Application returned an error: failed to handle message: failed to delete message: failed to delete message: {"$id":"1","innerException":null,"message":"The user 'System:PublicAccess;aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa' is not authorized to access this resource.","typeName":"Microsoft.TeamFoundation.Framework.Server.UnauthorizedRequestException, Microsoft.TeamFoundation.Framework.Server","typeKey":"UnauthorizedRequestException","errorCode":0,"eventId":3000}

once/twice a day (k8s v1.28.5)

@clementnero

(Quoting the UnauthorizedRequestException error from pavelsmolensky's comment above.)

Same here with GKE v1.29.1 and gha-runner-scale-set-0.9.0

@nikola-jokic
Member

Re-opening this one for #3204 (comment)

@ohookins

ohookins commented Apr 29, 2024

Also seeing this here, with 0.9.1. The listener seems to be running but we don't see any new jobs being created in the cluster. What's frustrating is that the log level is already debug by default, and there's no way to get any further detail from the listener to see what might be happening.

I did take a quick look at the listener queue-handling code, and while I'm not certain that I'm correct, or that this possible problem is even related to this issue, I wonder if we have a goroutine leak here:
https://github.com/actions/actions-runner-controller/blob/master/github/actions/client.go#L610

When a queue message is requested and read, if any of the error paths are followed, we defer the closing of resp.Body. On the happy path, we pass resp.Body to json.NewDecoder (which only accepts an io.Reader) and never close it. Possibly we are tying up all the client connections in leaked goroutines? I might make a PR to address it.

EDIT: I missed the full response body buffering in client.Do, so this is not a problem.
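For illustration only, here is a minimal Go sketch of the resp.Body idiom being discussed. It is not the actual actions-runner-controller client code; the queueMessage type, getNextMessage function, and url parameter are hypothetical. The point is that a single deferred Close placed right after the request succeeds covers the happy path as well as the error paths, so the underlying connection can be reused. (As the edit above notes, the real client buffers the body in client.Do, so this pattern turned out not to be the culprit.)

package example

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// queueMessage is a hypothetical stand-in for a message read from the
// Actions message queue.
type queueMessage struct {
    MessageID   int    `json:"messageId"`
    MessageType string `json:"messageType"`
}

// getNextMessage fetches and decodes one message. The single deferred
// Close covers every return path, including the successful decode.
func getNextMessage(client *http.Client, url string) (*queueMessage, error) {
    resp, err := client.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("unexpected status: %s", resp.Status)
    }

    var msg queueMessage
    if err := json.NewDecoder(resp.Body).Decode(&msg); err != nil {
        return nil, err
    }
    return &msg, nil
}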

@nikola-jokic
Member

Hey @ohookins,

If your runners are taking a long time to be scheduled on nodes, it may be related to this issue: #3450

@ohookins

@nikola-jokic yes, that sounds very much like it! Thanks :)

@verdel

verdel commented Apr 29, 2024

I added restartPolicy: Always to listenerTemplate.spec in the AutoscalingRunnerSet (a sketch of the change is shown below). It seems to be a temporary fix that has helped: when the listener container exits with code 1 due to the issue described here, it now restarts immediately instead of the pod sitting in an Error state until it is deleted.

So far, one day has passed, and I am waiting for more statistical data to accumulate.
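For reference, here is a minimal sketch of that workaround expressed as Helm values for the gha-runner-scale-set chart (or set directly on the AutoscalingRunnerSet's listenerTemplate). This is an assumption based on the comment above, not an official recommendation; verify the exact structure against your chart and controller version.

# values.yaml for the gha-runner-scale-set chart (hypothetical excerpt)
listenerTemplate:
  spec:
    # Restart the listener container in place when it exits with an error,
    # instead of leaving the pod in an Error state until it is deleted.
    restartPolicy: Always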

@nikola-jokic
Member

Hey @verdel,

When you can, please provide the log from when the listener restarts. Since this has also been an issue with the older version of the listener, it may be related to the Actions server. In the 0.9.1 release, we added a trace ID, which should help us track down the issue.

@verdel

verdel commented Apr 29, 2024

I'll send the logs a bit later. We are staying on version 0.9.0 as we are waiting for a fix that will be added in this PR.

@verdel

verdel commented Apr 29, 2024

Lately, I've been seeing the following messages in the listener:

2024-04-27T07:07:53Z	INFO	listener-app.listener	Job started message received.	{"RequestId": 653192, "RunnerId": 76946}
2024-04-27T07:07:53Z	INFO	listener-app.listener	Job started message received.	{"RequestId": 653194, "RunnerId": 76949}
2024-04-27T07:07:53Z	INFO	listener-app.listener	Job started message received.	{"RequestId": 653193, "RunnerId": 76950}
2024-04-27T07:07:53Z	INFO	listener-app.listener	Deleting last message	{"lastMessageID": 180}
2024-04-27T07:07:53Z	INFO	listener-app.listener	Deleting message session
2024-04-27T07:07:53Z	INFO	listener-app	refreshing token	{"githubConfigUrl": "https://github.com/<org_name>"}
2024-04-27T07:07:53Z	INFO	listener-app	getting access token for GitHub App auth	{"accessTokenURL": "https://api.github.com/app/installations/<installation_id>/access_tokens"}
2024-04-27T07:07:53Z	INFO	listener-app	getting runner registration token	{"registrationTokenURL": "https://api.github.com/orgs/<org_name>/actions/runners/registration-token"}
2024-04-27T07:07:53Z	INFO	listener-app	getting Actions tenant URL and JWT	{"registrationURL": "https://api.github.com/actions/runner-registration"}
2024/04/27 07:07:54 Application returned an error: failed to handle message: failed to delete message: failed to delete message: {"$id":"1","innerException":null,"message":"The user 'System:PublicAccess;aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa' is not authorized to access this resource.","typeName":"Microsoft.TeamFoundation.Framework.Server.UnauthorizedRequestException, Microsoft.TeamFoundation.Framework.Server","typeKey":"UnauthorizedRequestException","errorCode":0,"eventId":3000}

We use runners that are launched rarely (no more than once or twice a day). This message appears during the first launch of the day.

After that, ghalistener terminates with exit code 1 and the container transitions to an Error state. Consequently, jobs stop being processed.

Today, after I added a restart policy, there have been no messages so far. I will continue to observe and provide additional information when it becomes available.

@nikola-jokic
Member

Hey everyone, I believe we found the root cause of this issue, and it is now fixed in 0.9.2.
This was the fix: #3529

I will leave this issue open for now until someone confirms they are not seeing these problems with the newest release ☺️

@nikola-jokic
Member

I think this issue is safe to close now since there are no reports of listener crashes in the latest release. Thank you all for providing more information, and please let us know if you experience any issues like this again!
