
Support scalability tests #126

Open
xyhuang opened this issue Oct 9, 2018 · 8 comments

xyhuang (Member) commented Oct 9, 2018

Kubebench should be extended to support scalability tests in two ways (a rough sketch of both modes follows below):

  • run a job with many workers
  • run many jobs in parallel

It should also be able to collect metrics during the runs and make it easy to analyze the results; refer to #124 for this.

Also related to kubeflow/training-operator#830.
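
As a hypothetical sketch (not kubebench's actual implementation), a test driver could exercise both modes by creating TFJob custom resources through the Kubernetes API. The namespace, image, and replica counts below are placeholder assumptions, and `kubeflow.org/v1beta1` is the TFJob API version from around the time of this issue:

```python
# Sketch only: drives the two scale modes against the tf-operator CRD.
# Namespace, image, and counts are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

def make_tfjob(name, num_workers):
    """Build a TFJob (v1beta1 API as of this issue) with num_workers workers."""
    container = {"name": "tensorflow",
                 "image": "gcr.io/my-project/tf-benchmark:latest",  # placeholder
                 "command": ["python", "benchmark.py"]}
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": "kubebench-scale"},
        "spec": {"tfReplicaSpecs": {
            "Worker": {"replicas": num_workers,
                       "template": {"spec": {"containers": [container]}}}}},
    }

# Mode 1: a single job with many workers.
api.create_namespaced_custom_object(
    "kubeflow.org", "v1beta1", "kubebench-scale", "tfjobs",
    make_tfjob("scale-wide", num_workers=100))

# Mode 2: many small jobs in parallel.
for i in range(50):
    api.create_namespaced_custom_object(
        "kubeflow.org", "v1beta1", "kubebench-scale", "tfjobs",
        make_tfjob(f"scale-many-{i}", num_workers=2))
```

The two shapes stress different things: the wide job stresses fan-out within a single TFJob, while the many-job loop stresses the operator's reconcile queue.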

xyhuang (Member, Author) commented Oct 9, 2018

/priority p1

jbottum commented Nov 4, 2018

/priority p2

chrisheecho commented

No engineering resource assigned; priority 2.

@xyhuang, are you planning to take this on? If so, please assign yourself and tag us so that the PMs can add this to the kanban board.

chrisheecho commented

/remove-priority p1

xyhuang (Member, Author) commented Nov 17, 2018

@chrisheecho, sorry for the late response. I will try to implement this if I get time, but we will likely move it to 0.5; I agree with keeping it at p2 for now.

xyhuang removed the area/0.4.0 label on Jan 3, 2019
richardsliu commented
@xyhuang @swiftdiaries

Let's try to have this for 0.5.

A few things to consider:

  1. How should we automate this?
    I think it makes sense to create a periodic Prow workflow that runs this daily. We should use a separate GCP project so that the scale tests don't interfere with regular presubmits.

  2. What tests should we run?
    As a starting point we can consider these items for tf-operator:

What would it take for kubebench to support these?

  3. Collecting metrics (see the measurement sketch after this comment)
  • Error rates
  • Latency (how fast does each pending job get processed? And how long does it take for a worker to start?)
  • Throughput (how many jobs/workers can we run concurrently?)
  • CPU/memory usage of the operator
  4. Dashboard
    We currently have a kubebench-dashboard. Can we use it to track load test results?
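
For item 3, here is a minimal sketch of one way such metrics could be gathered, assuming TFJob v1beta1 status conventions and hypothetical metric names and namespace; it watches job status transitions and exposes start latency and failure counts for Prometheus to scrape:

```python
# Sketch only: derive start latency and error rate from TFJob status
# transitions. Metric names, namespace, and API version are assumptions.
import calendar
import time

from kubernetes import client, config, watch
from prometheus_client import Counter, Histogram, start_http_server

config.load_kube_config()
api = client.CustomObjectsApi()

START_LATENCY = Histogram(
    "kubebench_tfjob_start_latency_seconds",  # hypothetical metric name
    "Seconds from TFJob creation until its Running condition is True")
FAILURES = Counter(
    "kubebench_tfjob_failures_total",  # hypothetical metric name
    "TFJobs that reached a Failed condition")

def true_condition(job, cond_type):
    """Return the condition of the given type if its status is True."""
    for c in job.get("status", {}).get("conditions", []):
        if c["type"] == cond_type and c["status"] == "True":
            return c
    return None

def parse_rfc3339(ts):
    """Parse the RFC3339 UTC timestamps used in Kubernetes objects."""
    return calendar.timegm(time.strptime(ts, "%Y-%m-%dT%H:%M:%SZ"))

start_http_server(8000)  # Prometheus scrape target
started, failed = set(), set()
w = watch.Watch()
for event in w.stream(api.list_namespaced_custom_object,
                      "kubeflow.org", "v1beta1", "kubebench-scale", "tfjobs"):
    job = event["object"]
    name = job["metadata"]["name"]
    running = true_condition(job, "Running")
    if running and name not in started:
        started.add(name)
        START_LATENCY.observe(
            parse_rfc3339(running["lastTransitionTime"]) -
            parse_rfc3339(job["metadata"]["creationTimestamp"]))
    if true_condition(job, "Failed") and name not in failed:
        failed.add(name)
        FAILURES.inc()
```

Throughput could be derived from the same stream (e.g. completed jobs per unit time), while the operator's CPU/memory usage would more likely come from the cAdvisor/kubelet metrics Prometheus already scrapes.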

xyhuang self-assigned this on Jan 8, 2019
xyhuang (Member, Author) commented Jan 8, 2019

@richardsliu, here are some quick answers; we can discuss in more detail:

1 & 2: Agreed. I will create a few issues to track the required changes; they should be doable within 0.5.
3: Today, benchmark metrics/results are collected in two ways: (1) sending run-time metrics through monitoring infra (e.g. Prometheus); (2) collecting job performance numbers through a user-defined "postjob", which interprets the outputs from TF jobs (a postjob sketch follows at the end of this comment). If the required info can be collected in one of these ways it should be easy; otherwise we will figure it out.
4: Possible with some changes. Currently we use an on-prem backend for result storage/visualization; for tracking results in Prow tests it's probably easier to leverage gcloud resources (Bigtable? Stackdriver?), which should be supported with a few small changes.

That being said, here are the needed items in my mind:

  • create prow workflow (for 1)
  • add support for benchmarking concurrent jobs in a single kubebench job (for 2)
  • minor workload improvements (yaml configs + minor code changes if needed) for actual tests (for 2)
  • support collecting required metrics (for 3)
  • leverage gcloud backend for result storage and viz (for 4)
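
To make the postjob mechanism concrete, here is a minimal sketch of what a postjob-style collector might look like; the log path, output path, log-line format, and result schema are all assumptions for illustration. It parses throughput lines from a TF job's output and writes a summary record that a results backend could pick up:

```python
# Sketch only: a postjob-style collector that parses benchmark output and
# writes a summary for the results backend. Paths and formats are assumed.
import json
import re
import sys

LOG_PATH = "/var/log/kubebench/worker-0.log"     # placeholder path
OUT_PATH = "/var/kubebench/results/result.json"  # placeholder path

# Assume each worker logs lines like "images/sec: 142.7" (format is assumed).
THROUGHPUT_RE = re.compile(r"images/sec:\s*([0-9.]+)")

def main():
    samples = []
    with open(LOG_PATH) as f:
        for line in f:
            m = THROUGHPUT_RE.search(line)
            if m:
                samples.append(float(m.group(1)))
    if not samples:
        sys.exit("no throughput samples found in " + LOG_PATH)
    result = {
        "samples": len(samples),
        "mean_images_per_sec": sum(samples) / len(samples),
        "max_images_per_sec": max(samples),
    }
    with open(OUT_PATH, "w") as f:
        json.dump(result, f, indent=2)

if __name__ == "__main__":
    main()
```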

richardsliu commented
Let's split up the work. I can take care of item 1 (set up project, cluster, and Prow workflow).
