
Distribute jobs more evenly across hosts #1929

Open
nick-f opened this issue Jan 24, 2023 · 4 comments

Comments

@nick-f

nick-f commented Jan 24, 2023

Is your feature request related to a problem? Please describe.

We have hosts that can have different numbers of spawned agents. The priority for these is set by the spawn ID with the spawn-with-priority option.

The way priorities work now is that higher-numbered priorities are used first.

If hostA has 1 spawned agent running and hostB has 3 spawned agents running, hostB is going to be running at least 2 or maybe 3 tests while hostA is sitting idle waiting for jobs to be assigned to it.

Assuming all agents are idle, the order that jobs are assigned is:

  1. hostB agent3
  2. hostB agent2
  3. hostA agent1 or hostB agent1
  4. hostA agent1 or hostB agent1 (whichever was not given a job before)
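The ordering above can be sketched in a few lines (Python used purely for illustration; the agent records and `assignment_order` function are hypothetical, not the buildkite-agent's actual data model):

```python
# A minimal sketch of the current behaviour, assuming each agent is
# identified by (host, spawn priority) and the scheduler picks idle
# agents in descending priority order, with ties broken arbitrarily.

def assignment_order(agents):
    """Return idle agents sorted so higher spawn priorities come first."""
    return sorted(agents, key=lambda a: -a["priority"])

agents = [
    {"host": "hostA", "name": "agent1", "priority": 1},
    {"host": "hostB", "name": "agent1", "priority": 1},
    {"host": "hostB", "name": "agent2", "priority": 2},
    {"host": "hostB", "name": "agent3", "priority": 3},
]

order = assignment_order(agents)
# hostB agent3 and hostB agent2 are picked before either agent1,
# so hostB fills up while hostA sits idle.
```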

Describe the solution you'd like

The next agent would be chosen based on the spawned agent utilisation of each host.

hostA has 1 spawned agent with 1 job running (100% utilisation)
hostB has 3 spawned agents with 1 job running (33% utilisation)
hostC has 5 spawned agents with 1 job running (20% utilisation)

The next host to be assigned work would be hostC because the current utilisation is the lowest. The agent on hostC that is given the work is determined based on the priority.

(Ideally that spawned agent prioritisation could also be flipped so hostC agent1 would be the first to be used instead of hostC agent5. Having that as a configuration option would be ace! I can split that out into a separate feature request if needed.)

hostA has 1 spawned agent with 1 job running (100% utilisation)
hostB has 3 spawned agents with 1 job running (33% utilisation)
hostC has 5 spawned agents with 2 jobs running (40% utilisation)

Now, with hostC utilisation at 40%, the next host to be assigned a job would be hostB.
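Both steps of the walkthrough above can be sketched as follows (a hypothetical illustration only; host names, the `(busy, total)` representation, and `pick_host` are assumptions, not an existing Buildkite API):

```python
# Sketch of the proposed utilisation-based selection: pick the host
# whose fraction of busy spawned agents is lowest; which agent on that
# host gets the job would then be decided by spawn priority.

def pick_host(hosts):
    """hosts maps host name -> (busy_agents, total_agents)."""
    return min(hosts, key=lambda h: hosts[h][0] / hosts[h][1])

hosts = {
    "hostA": (1, 1),  # 100% utilisation
    "hostB": (1, 3),  # ~33% utilisation
    "hostC": (1, 5),  # 20% utilisation
}
first = pick_host(hosts)   # hostC: lowest utilisation

hosts["hostC"] = (2, 5)    # hostC now at 40% utilisation
second = pick_host(hosts)  # hostB is now the least utilised
```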

Describe alternatives you've considered

I've spoken with Jarryd from Buildkite about this issue, but there don't appear to be any existing solutions for this use case.

Setting host priority doesn't work for situations where there are, say, two agents on a host. If that host is meant to be used first due to host priority, then the same situation would occur as the original problem, where one host is doing all the work while the other is sitting idle.

Additional context

We set the number of spawned agents in each host's config.

There are a variety of hardware profiles for our hosts, so some can only run one agent at a time, some run 3, and we're about to start trialling hosts that should be able to run 6 or more agents 🤞

@nick-f nick-f changed the title Add ability to use "lower" priority agents first Distribute jobs more evenly across hosts Jan 24, 2023
@triarius
Contributor

triarius commented Feb 8, 2023

Hi @nick-f thanks for your interest in the buildkite-agent! Apologies for taking a long time to get back to you.

The experience of running multiple agents on different-sized hosts is somewhat lacking, as you are finding. In particular, the backend scheduler is not fully aware of the assignment of agents to hosts. From its perspective, there are scheduled jobs and there are agents available to run those jobs, and it assigns the jobs to the agents without knowledge of how the agents are utilising their hosts. This decoupling keeps the scheduler simple, but an unfortunate side effect is situations where hosts are not being fully utilised.

The spawn-with-priority option was an attempt to address this. However, as you are finding, it is not a complete solution. The place to fix this is in the backend scheduler, and it is currently not designed to schedule jobs with this in mind.

It would be a significant redesign of the scheduler to make it more aware of both the hosts and the worker agents, and while this is a pain point for a significant portion of our customers, it is not a problem at all for others. So we are concentrating our efforts at the moment on making the buildkite-agent runnable in Kubernetes clusters. There, agent workers are spun up on demand, and we can take advantage of primitives offered in that ecosystem to bin-pack jobs onto hosts. So hopefully we will soon have a better story to tell in this space.

@nick-f
Author

nick-f commented Feb 9, 2023

This decoupling keeps the scheduler simple, but an unfortunate side effect is situations where hosts are not being fully utilised.

If the priorities were flipped (i.e. agents with spawn priority 1 were used first, etc.) then that would at least give the ability to spread the load across all hosts. For my example situation, the extra agents on the more powerful hosts would be used as overflow, once all the other hosts' agents are in use. The priority as it is now doesn't allow for this.
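The flipped ordering suggested here can be sketched like so (again purely illustrative; the agent records and `ascending_order` function are hypothetical):

```python
# Sketch of using the *lowest* spawn priority first: every host's
# agent1 is exhausted before any host's agent2, so load spreads across
# hosts, and extra agents on larger hosts act only as overflow.

def ascending_order(agents):
    """Return idle agents sorted so lower spawn priorities come first."""
    return sorted(agents, key=lambda a: a["priority"])

agents = [
    {"host": "hostA", "name": "agent1", "priority": 1},
    {"host": "hostB", "name": "agent1", "priority": 1},
    {"host": "hostB", "name": "agent2", "priority": 2},
    {"host": "hostB", "name": "agent3", "priority": 3},
]

order = ascending_order(agents)
# Both agent1s come first; hostB's agent2 and agent3 are used only
# once every host already has a job running.
```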

So we are concentrating our efforts at the moment on making the buildkite-agent runnable in Kubernetes clusters. There, agent workers are spun up on demand, and we can take advantage of primitives offered in that ecosystem to bin-pack jobs onto hosts. So hopefully we will soon have a better story to tell in this space.

Unfortunately that won't help us at all with our use case (we're running iOS tests on physical Mac Minis) and doesn't seem to be related to this issue or a solution to it at all.

If there's somewhere else to submit this feedback to as well I'm happy to do it. Just let me know where it should go.

@nick-f
Author

nick-f commented Mar 19, 2023

With the release of v3.45.0 and enabling the experimental flag, the load is being spread out across hosts now 🎉

I'll leave this open while #1967 is still open, but it's looking good so far. Thanks!

@DrJosh9000
Contributor

I've closed #1967, but I'm happy to leave this open while we decide how to de-experiment-ify (make descending-spawn-priority the default? or re-use some ideas from #1967?)
