scaleSetListener ignores scaling settings while acquiring the jobs #3446

prizov · 2024-04-16T15:06:11Z

Checks

I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
I am using charts that are officially provided

Controller Version

0.9.0

Deployment Method

Helm

Checks

This isn't a question or user support case (For Q&A and community support, go to Discussions).
I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Set `spec.maxRunners` to `0` for the `AutoscalingRunnerSet`.
2. Start workflow jobs
3. Note that the listener acquires the jobs despite lacking the capacity to process them

Describe the bug

According to the official docs, the listener is supposed to check if it's able to scale up to the desired count before acquiring a job:

> When the Runner ScaleSet Listener receives the Job Available message, it checks whether it can scale up to the desired count. If it can, the Runner ScaleSet Listener acknowledges the message.

However, when we set maxRunners to 0 to prevent listeners from acquiring new jobs while waiting for the completion of the jobs that are currently in progress, we noticed that the listener continues to acquire jobs despite the autoscaling settings (see logs).

We have a multi-cluster setup and expected one cluster to handle all jobs while the other has `maxRunners` set to 0. Instead, jobs stayed queued, because the listener with `maxRunners` set to 0 was still acquiring them.

I don't see any checks in the listener code before it acquires the jobs:

actions-runner-controller/cmd/githubrunnerscalesetlistener/autoScalerService.go

Lines 100 to 205 in f7eb88c

    
    func (s *Service) processMessage(message *actions.RunnerScaleSetMessage) error {
    	s.logger.Info("process message.", "messageId", message.MessageId, "messageType", message.MessageType)
    	if message.Statistics == nil {
    		return fmt.Errorf("can't process message with empty statistics")
    	}

    	s.logger.Info("current runner scale set statistics.",
    		"available jobs", message.Statistics.TotalAvailableJobs,
    		"acquired jobs", message.Statistics.TotalAcquiredJobs,
    		"assigned jobs", message.Statistics.TotalAssignedJobs,
    		"running jobs", message.Statistics.TotalRunningJobs,
    		"registered runners", message.Statistics.TotalRegisteredRunners,
    		"busy runners", message.Statistics.TotalBusyRunners,
    		"idle runners", message.Statistics.TotalIdleRunners)

    	s.metricsExporter.publishStatistics(message.Statistics)

    	if message.MessageType != "RunnerScaleSetJobMessages" {
    		s.logger.Info("skip message with unknown message type.", "messageType", message.MessageType)
    		return nil
    	}

    	if message.MessageId == 0 && message.Body == "" { // initial message with statistics only
    		return s.scaleForAssignedJobCount(message.Statistics.TotalAssignedJobs)
    	}

    	var batchedMessages []json.RawMessage
    	if err := json.NewDecoder(strings.NewReader(message.Body)).Decode(&batchedMessages); err != nil {
    		return fmt.Errorf("could not decode job messages. %w", err)
    	}

    	s.logger.Info("process batched runner scale set job messages.", "messageId", message.MessageId, "batchSize", len(batchedMessages))

    	var availableJobs []int64
    	for _, message := range batchedMessages {
    		var messageType actions.JobMessageType
    		if err := json.Unmarshal(message, &messageType); err != nil {
    			return fmt.Errorf("could not decode job message type. %w", err)
    		}

    		switch messageType.MessageType {
    		case "JobAvailable":
    			var jobAvailable actions.JobAvailable
    			if err := json.Unmarshal(message, &jobAvailable); err != nil {
    				return fmt.Errorf("could not decode job available message. %w", err)
    			}
    			s.logger.Info(
    				"job available message received.",
    				"RequestId",
    				jobAvailable.RunnerRequestId,
    			)
    			availableJobs = append(availableJobs, jobAvailable.RunnerRequestId)
    		case "JobAssigned":
    			var jobAssigned actions.JobAssigned
    			if err := json.Unmarshal(message, &jobAssigned); err != nil {
    				return fmt.Errorf("could not decode job assigned message. %w", err)
    			}
    			s.logger.Info(
    				"job assigned message received.",
    				"RequestId",
    				jobAssigned.RunnerRequestId,
    			)
    			// s.metricsExporter.publishJobAssigned(&jobAssigned)
    		case "JobStarted":
    			var jobStarted actions.JobStarted
    			if err := json.Unmarshal(message, &jobStarted); err != nil {
    				return fmt.Errorf("could not decode job started message. %w", err)
    			}
    			s.logger.Info(
    				"job started message received.",
    				"RequestId",
    				jobStarted.RunnerRequestId,
    				"RunnerId",
    				jobStarted.RunnerId,
    			)
    			s.metricsExporter.publishJobStarted(&jobStarted)
    			s.updateJobInfoForRunner(jobStarted)
    		case "JobCompleted":
    			var jobCompleted actions.JobCompleted
    			if err := json.Unmarshal(message, &jobCompleted); err != nil {
    				return fmt.Errorf("could not decode job completed message. %w", err)
    			}
    			s.logger.Info(
    				"job completed message received.",
    				"RequestId",
    				jobCompleted.RunnerRequestId,
    				"Result",
    				jobCompleted.Result,
    				"RunnerId",
    				jobCompleted.RunnerId,
    				"RunnerName",
    				jobCompleted.RunnerName,
    			)
    			s.metricsExporter.publishJobCompleted(&jobCompleted)
    		default:
    			s.logger.Info("unknown job message type.", "messageType", messageType.MessageType)
    		}
    	}

    	err := s.rsClient.AcquireJobsForRunnerScaleSet(s.ctx, availableJobs)
    	if err != nil {
    		return fmt.Errorf("could not acquire jobs. %w", err)
    	}

    	return s.scaleForAssignedJobCount(message.Statistics.TotalAssignedJobs)
    }

Describe the expected behavior

The listener should verify the autoscaling settings to ensure it can handle a job before acquiring it.
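To make the expectation concrete, here is a minimal sketch of the kind of check I have in mind. It is not taken from the controller: `remainingCapacity` and `limitAcquirable` are hypothetical helpers, and treating `maxRunners - assignedJobs` as the free capacity is my assumption about how capacity could be measured. The idea is only that the list passed to `AcquireJobsForRunnerScaleSet` would be trimmed first.

```go
package main

import "fmt"

// remainingCapacity is a hypothetical helper: it returns how many more jobs
// the scale set could take, assuming each assigned job occupies one runner.
func remainingCapacity(maxRunners, assignedJobs int) int {
	if maxRunners <= assignedJobs {
		return 0
	}
	return maxRunners - assignedJobs
}

// limitAcquirable is a hypothetical helper: it keeps only as many available
// job request IDs as the remaining capacity allows.
func limitAcquirable(available []int64, maxRunners, assignedJobs int) []int64 {
	n := remainingCapacity(maxRunners, assignedJobs)
	if n >= len(available) {
		return available
	}
	return available[:n]
}

func main() {
	jobs := []int64{101, 102, 103}

	// With maxRunners set to 0, nothing should be acquired.
	fmt.Println(limitAcquirable(jobs, 0, 0))

	// With maxRunners 2 and one job already assigned, one slot is left.
	fmt.Println(limitAcquirable(jobs, 2, 1))
}
```

With something like this in place, a scale set with `maxRunners` set to 0 would pass an empty list to the acquire call, so the jobs would stay available for the other cluster.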

Listener Logs

https://gist.github.com/prizov/64a1045e83cb8b1612087109602cc2c1

Note the scaling settings on line 10: https://gist.github.com/prizov/64a1045e83cb8b1612087109602cc2c1#file-gistfile1-txt-L10


github-actions · 2024-04-16T15:06:36Z

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

nikola-jokic · 2024-04-18T12:13:57Z

Hey @prizov,

We are aware of this problem, and we are gradually working on the fix. This problem is the reason why we added capacity information in the latest release, so the server can be aware of it, and will not offer any jobs to the listener that it won't be able to handle. I will however bring this issue to the team, to check if we should do something on the listener before server changes are ready.

prizov added the bug, gha-runner-scale-set, and needs triage labels on Apr 16, 2024

nikola-jokic removed the needs triage label on Apr 18, 2024
