Draft: Performance mega boost - queue per app #1990

bnetzi · 2024-04-17T17:48:42Z

Issues related

Purpose of this PR

First of all, this PR is mainly a draft that we think should be discussed, which is why we are submitting it even though we have not yet added documentation or unit tests.

The original design of spark-operator uses a single queue for all applications.
This design causes a huge latency issue when handling hundreds or thousands of applications concurrently.

In benchmarks that my colleagues from Mobileye and I performed, latency clearly increased linearly with the number of apps handled.
With more than 500 applications, the average time from creating a Spark application object until pod creation is ~130 seconds. With 1200 concurrent apps, it can take up to 20 minutes on average for each Spark application to be created.
Scaling up vertically would not help, because the CPU is mostly idle; most of the time is spent on the queue mutex.

The change we are presenting here is to create a queue for each app.
It required large changes throughout the code, but it does not change the main flow in any way.
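
To make the idea concrete, here is a minimal, standard-library-only sketch of a queue per app. It is not the code in this PR: the `appQueues` type, its methods, and the event strings are hypothetical names chosen for illustration, and the real controller works with Kubernetes work queues rather than plain channels. The sketch only shows the shape of the technique: a map of queues guarded by an RWMutex, with one worker per application.

```go
package main

import (
	"fmt"
	"sync"
)

// appQueues is a hypothetical registry (not the PR's actual type): one
// buffered channel per application, each drained by its own goroutine.
type appQueues struct {
	mu      sync.RWMutex
	queues  map[string]chan string
	wg      sync.WaitGroup
	handler func(app, event string)
}

func newAppQueues(handler func(app, event string)) *appQueues {
	return &appQueues{queues: make(map[string]chan string), handler: handler}
}

// getOrCreate returns the queue for app, starting its worker on first use.
func (a *appQueues) getOrCreate(app string) chan string {
	a.mu.RLock() // fast path: most lookups only need a read lock
	q, ok := a.queues[app]
	a.mu.RUnlock()
	if ok {
		return q
	}
	a.mu.Lock()
	defer a.mu.Unlock()
	if q, ok = a.queues[app]; ok { // re-check: another goroutine may have created it
		return q
	}
	q = make(chan string, 16)
	a.queues[app] = q
	a.wg.Add(1)
	go func() {
		defer a.wg.Done()
		for ev := range q { // events of one app are handled in order
			a.handler(app, ev)
		}
	}()
	return q
}

// enqueue adds an event to the queue that belongs to app.
func (a *appQueues) enqueue(app, event string) { a.getOrCreate(app) <- event }

// shutdown closes every queue and waits for the workers to drain them.
func (a *appQueues) shutdown() {
	a.mu.Lock()
	for app, q := range a.queues {
		close(q)
		delete(a.queues, app)
	}
	a.mu.Unlock()
	a.wg.Wait()
}

// runDemo feeds events for three apps and returns what each worker saw.
func runDemo() map[string][]string {
	var mu sync.Mutex
	seen := make(map[string][]string)
	qs := newAppQueues(func(app, event string) {
		mu.Lock()
		defer mu.Unlock()
		seen[app] = append(seen[app], event)
	})
	for _, ev := range []string{"submitted", "running", "completed"} {
		for _, app := range []string{"app-a", "app-b", "app-c"} {
			qs.enqueue(app, ev)
		}
	}
	qs.shutdown()
	return seen
}

func main() {
	for app, events := range runDemo() {
		fmt.Println(app, events)
	}
}
```

In this sketch a slow application only delays its own queue, and events for a single application are still processed in order, which is the property that keeps the main flow unchanged.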

Our benchmarks showed that even with 1000 concurrent apps, the average time for application creation is ~7 seconds.

We also added a nice feature: a memoryLimit option for the driver / executor that can be larger than the memory request, applied through the admission webhook.
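
As a rough illustration of how a mutating webhook could apply such a limit, the sketch below builds a JSON Patch (RFC 6902) operation that sets the memory limit of one container in a pod. It is an assumption-laden sketch, not the PR's implementation: `patchOperation` and `renderMemoryLimitPatch` are hypothetical names, the value is assumed to already be a Kubernetes quantity such as `6Gi`, and the container's `resources.limits` map is assumed to exist already.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOperation is one RFC 6902 JSON Patch operation.
type patchOperation struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// renderMemoryLimitPatch (hypothetical helper) returns a JSON Patch that sets
// the memory limit of the container at containerIndex. An empty memoryLimit
// means "leave the pod untouched" and yields an empty patch.
func renderMemoryLimitPatch(containerIndex int, memoryLimit string) (string, error) {
	ops := []patchOperation{}
	if memoryLimit != "" {
		ops = append(ops, patchOperation{
			// "add" on an existing object member replaces its value.
			// Assumes resources.limits already exists on the container.
			Op:    "add",
			Path:  fmt.Sprintf("/spec/containers/%d/resources/limits/memory", containerIndex),
			Value: memoryLimit,
		})
	}
	out, err := json.Marshal(ops)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	patch, err := renderMemoryLimitPatch(0, "6Gi")
	if err != nil {
		panic(err)
	}
	fmt.Println(patch)
}
```

The memory request is left as it is, so the limit can end up larger than the request.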

Proposed changes:

Create a queue for each app
Add a parameter to control QPS and burst for the k8s API
Add a memoryLimit option for executors and drivers

Change Category

Feature (non-breaking change which adds functionality)

What are we still missing:

fixing broken unit test
documentation
Peer review

I would point out that this code currently runs in our production environment at massive scale without any issues.

* Add MemoryLimit as option that will override spark pods limits (by using webhook)
* queue per spark app - improved performance by far
* added logs
* prevent concurrent access to the appsQueues map
* use RWmutex when accessing the appQueues

---------

Co-authored-by: Netanel Levine <netanel.levine@mobileye.com>
Co-authored-by: Eran Ben Ami <eranba@mobileye.com>

# Conflicts: # pkg/controller/sparkapplication/controller.go

Merge kubeflow

Updated

google-oss-prow · 2024-06-05T16:31:36Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

# Conflicts: # Dockerfile

Fix tests

vara-bonthu · 2024-06-08T18:47:06Z

/assign @ChenYi015
/assign @yuchaoran2011

Please review the changes when you get a chance

bnetzi and others added 3 commits April 17, 2024 19:40

Merge remote-tracking branch 'kubeflow/master'

d9ccd87

# Conflicts: # pkg/controller/sparkapplication/controller.go

Merge pull request #3 from bnetzi/merge-kubeflow

b8f32f1

Merge kubeflow

google-oss-prow bot added the do-not-merge/work-in-progress label Apr 17, 2024

google-oss-prow bot requested review from mwielgus and vara-bonthu April 17, 2024 17:48

google-oss-prow bot added the size/L label Apr 17, 2024

Qps merger (#4)

dcae610

Updated

google-oss-prow bot added size/XXL and removed size/L labels Jun 5, 2024

Merge branch 'master' of github.com:kubeflow/spark-operator

5ee95e6

# Conflicts: # Dockerfile

google-oss-prow bot added size/XL and removed size/XXL labels Jun 5, 2024

Qps merger (#5)

926f185

Fix tests

google-oss-prow bot assigned ChenYi015 and yuchaoran2011 Jun 8, 2024
