
Support scalability tests #126

Open
xyhuang opened this issue Oct 9, 2018 · 8 comments

xyhuang (Member) commented Oct 9, 2018

Kubebench should be extended to support scalability tests in two ways (a rough sketch of both modes follows below):

  • run a job with many workers
  • run many jobs in parallel

It should also be able to collect metrics during the runs and make it easy to analyze the results; refer to #124 for this.

Also related to kubeflow/training-operator#830.
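
As a hypothetical sketch (not kubebench's actual implementation), a test driver could exercise both modes by creating TFJob custom resources through the Kubernetes API. The namespace, image, and replica counts below are placeholder assumptions, and `kubeflow.org/v1beta1` is the TFJob API version from around the time of this issue:

```python
# Sketch only: drives the two scale modes against the tf-operator CRD.
# Namespace, image, and counts are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

def make_tfjob(name, num_workers):
    """Build a TFJob (v1beta1 API as of this issue) with num_workers workers."""
    container = {"name": "tensorflow",
                 "image": "gcr.io/my-project/tf-benchmark:latest",  # placeholder
                 "command": ["python", "benchmark.py"]}
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": "kubebench-scale"},
        "spec": {"tfReplicaSpecs": {
            "Worker": {"replicas": num_workers,
                       "template": {"spec": {"containers": [container]}}}}},
    }

# Mode 1: a single job with many workers.
api.create_namespaced_custom_object(
    "kubeflow.org", "v1beta1", "kubebench-scale", "tfjobs",
    make_tfjob("scale-wide", num_workers=100))

# Mode 2: many small jobs in parallel.
for i in range(50):
    api.create_namespaced_custom_object(
        "kubeflow.org", "v1beta1", "kubebench-scale", "tfjobs",
        make_tfjob(f"scale-many-{i}", num_workers=2))
```

The two shapes stress different things: the wide job stresses fan-out within a single TFJob, while the many-job loop stresses the operator's reconcile queue.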

xyhuang (Member, Author) commented Oct 9, 2018

/priority p1

jbottum commented Nov 4, 2018

/priority p2

chrisheecho commented

No engineering resource assigned; priority 2.

@xyhuang, are you planning to take this on? If so, please assign yourself and tag us so that the PMs can add this to the kanban board.

chrisheecho commented

/remove-priority p1

xyhuang (Member, Author) commented Nov 17, 2018

@chrisheecho, sorry for the late response. I will try to implement this if I get time, but we will likely move it to 0.5; I agree with keeping it at p2 for now.

xyhuang removed the area/0.4.0 label on Jan 3, 2019
richardsliu commented
@xyhuang @swiftdiaries

Let's try to have this for 0.5.

A few things to consider:

  1. How should we automate this?
    I think it makes sense to create a periodic Prow workflow that runs this daily. We should use a separate GCP project so that the scale tests don't interfere with regular presubmits.

  2. What tests should we run?
    As a starting point we can consider these items for tf-operator:

What would it take for kubebench to support these?

  3. Collecting metrics (see the measurement sketch after this comment)
  • Error rates
  • Latency (how fast does each pending job get processed? And how long does it take for a worker to start?)
  • Throughput (how many jobs/workers can we run concurrently?)
  • CPU/memory usage of the operator
  4. Dashboard
    We currently have a kubebench-dashboard. Can we use it to track load test results?
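
For item 3, here is a minimal sketch of one way such metrics could be gathered, assuming TFJob v1beta1 status conventions and hypothetical metric names and namespace; it watches job status transitions and exposes start latency and failure counts for Prometheus to scrape:

```python
# Sketch only: derive start latency and error rate from TFJob status
# transitions. Metric names, namespace, and API version are assumptions.
import calendar
import time

from kubernetes import client, config, watch
from prometheus_client import Counter, Histogram, start_http_server

config.load_kube_config()
api = client.CustomObjectsApi()

START_LATENCY = Histogram(
    "kubebench_tfjob_start_latency_seconds",  # hypothetical metric name
    "Seconds from TFJob creation until its Running condition is True")
FAILURES = Counter(
    "kubebench_tfjob_failures_total",  # hypothetical metric name
    "TFJobs that reached a Failed condition")

def true_condition(job, cond_type):
    """Return the condition of the given type if its status is True."""
    for c in job.get("status", {}).get("conditions", []):
        if c["type"] == cond_type and c["status"] == "True":
            return c
    return None

def parse_rfc3339(ts):
    """Parse the RFC3339 UTC timestamps used in Kubernetes objects."""
    return calendar.timegm(time.strptime(ts, "%Y-%m-%dT%H:%M:%SZ"))

start_http_server(8000)  # Prometheus scrape target
started, failed = set(), set()
w = watch.Watch()
for event in w.stream(api.list_namespaced_custom_object,
                      "kubeflow.org", "v1beta1", "kubebench-scale", "tfjobs"):
    job = event["object"]
    name = job["metadata"]["name"]
    running = true_condition(job, "Running")
    if running and name not in started:
        started.add(name)
        START_LATENCY.observe(
            parse_rfc3339(running["lastTransitionTime"]) -
            parse_rfc3339(job["metadata"]["creationTimestamp"]))
    if true_condition(job, "Failed") and name not in failed:
        failed.add(name)
        FAILURES.inc()
```

Throughput could be derived from the same stream (e.g. completed jobs per unit time), while the operator's CPU/memory usage would more likely come from the cAdvisor/kubelet metrics Prometheus already scrapes.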

xyhuang self-assigned this on Jan 8, 2019
xyhuang (Member, Author) commented Jan 8, 2019

@richardsliu, here are some quick answers; we can discuss in more detail:

1 & 2: Agreed. I will create a few issues to track the required changes; they should be doable within 0.5.
3: Today, benchmark metrics/results are collected in two ways: (1) sending run-time metrics through monitoring infra (e.g. Prometheus); (2) collecting job performance numbers through a user-defined "postjob", which interprets the outputs from TF jobs (a postjob sketch follows at the end of this comment). If the required info can be collected in one of these ways it should be easy; otherwise we will figure it out.
4: Possible with some changes. Currently we use an on-prem backend for result storage/visualization; for tracking results in Prow tests it's probably easier to leverage gcloud resources (Bigtable? Stackdriver?), which should be supported with a few small changes.

That being said, here are the needed items in my mind:

  • create prow workflow (for 1)
  • add support for benchmarking concurrent jobs in a single kubebench job (for 2)
  • minor workload improvements (yaml configs + minor code changes if needed) for actual tests (for 2)
  • support collecting required metrics (for 3)
  • leverage gcloud backend for result storage and viz (for 4)
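
To make the postjob mechanism concrete, here is a minimal sketch of what a postjob-style collector might look like; the log path, output path, log-line format, and result schema are all assumptions for illustration. It parses throughput lines from a TF job's output and writes a summary record that a results backend could pick up:

```python
# Sketch only: a postjob-style collector that parses benchmark output and
# writes a summary for the results backend. Paths and formats are assumed.
import json
import re
import sys

LOG_PATH = "/var/log/kubebench/worker-0.log"     # placeholder path
OUT_PATH = "/var/kubebench/results/result.json"  # placeholder path

# Assume each worker logs lines like "images/sec: 142.7" (format is assumed).
THROUGHPUT_RE = re.compile(r"images/sec:\s*([0-9.]+)")

def main():
    samples = []
    with open(LOG_PATH) as f:
        for line in f:
            m = THROUGHPUT_RE.search(line)
            if m:
                samples.append(float(m.group(1)))
    if not samples:
        sys.exit("no throughput samples found in " + LOG_PATH)
    result = {
        "samples": len(samples),
        "mean_images_per_sec": sum(samples) / len(samples),
        "max_images_per_sec": max(samples),
    }
    with open(OUT_PATH, "w") as f:
        json.dump(result, f, indent=2)

if __name__ == "__main__":
    main()
```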

richardsliu commented
Let's split up the work. I can take care of item 1 (set up project, cluster, and Prow workflow).
