Draft: Performance mega boost - queue per app #1990

Draft · wants to merge 6 commits into base: master

Conversation


@bnetzi bnetzi commented Apr 17, 2024

Issues related

#1138
#1574
#783
#1489

Purpose of this PR

First of all, this PR is mainly a draft that we think should be discussed, which is why we are submitting it even though we have not yet added documentation or unit tests.

The original design of spark-operator uses a single queue for all applications.
This design causes a huge latency problem when handling hundreds or thousands of applications concurrently.

In benchmarks that my colleagues at Mobileye and I performed, we clearly observed a latency increase that grows linearly with the number of apps handled.
With more than 500 concurrent applications, the average time from creating a SparkApplication object until pod creation is ~130 seconds. At 1200 concurrent apps it can reach up to 20 minutes on average per Spark application.
Scaling up vertically does not help: the CPU is mostly idle, and most of the time is spent waiting on the queue mutex.

The change we are presenting here is to create a queue for each app.
This required changes throughout the codebase, but it does not alter the main flow in any way.
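
To make the idea concrete, here is a minimal sketch of a per-application queue map guarded by an RWMutex, using client-go's workqueue package. This is only an illustration of the approach described above and in the commit messages below, not the actual code in this PR; the names appQueueManager, getOrCreate, and remove are hypothetical.

```go
package sparkapplication

import (
	"sync"

	"k8s.io/client-go/util/workqueue"
)

// appQueueManager keeps one rate-limited work queue per SparkApplication,
// so a busy application no longer blocks every other application behind a
// single shared queue. The map is guarded by an RWMutex, in the spirit of
// the commit messages below.
type appQueueManager struct {
	mu     sync.RWMutex
	queues map[string]workqueue.RateLimitingInterface // keyed by "namespace/name"
}

func newAppQueueManager() *appQueueManager {
	return &appQueueManager{queues: map[string]workqueue.RateLimitingInterface{}}
}

// getOrCreate returns the queue for the given app key, creating it lazily.
func (m *appQueueManager) getOrCreate(key string) workqueue.RateLimitingInterface {
	m.mu.RLock()
	q, ok := m.queues[key]
	m.mu.RUnlock()
	if ok {
		return q
	}

	m.mu.Lock()
	defer m.mu.Unlock()
	// Re-check after acquiring the write lock; another goroutine may have
	// created the queue in the meantime.
	if q, ok = m.queues[key]; ok {
		return q
	}
	q = workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	m.queues[key] = q
	return q
}

// remove shuts down and forgets an app's queue once the app is deleted.
func (m *appQueueManager) remove(key string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if q, ok := m.queues[key]; ok {
		q.ShutDown()
		delete(m.queues, key)
	}
}
```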

Our benchmarks showed that even with 1000 concurrent apps, the average time for application creation is ~7 seconds.

We also added a useful feature: the driver/executor memory limit can be set larger than the request, applied via the admission webhook.
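
For illustration, a minimal sketch of that memory-limit override (applyMemoryLimit is a hypothetical helper, not the PR's actual webhook code): the mutating webhook patches the container's memory limit independently of its request, so the limit can exceed the request.

```go
package webhook

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// applyMemoryLimit sets the container's memory limit while leaving the
// memory request untouched, allowing limit > request. Hypothetical sketch,
// not the code in this PR.
func applyMemoryLimit(container *corev1.Container, memoryLimit string) error {
	limit, err := resource.ParseQuantity(memoryLimit)
	if err != nil {
		return err
	}
	if container.Resources.Limits == nil {
		container.Resources.Limits = corev1.ResourceList{}
	}
	container.Resources.Limits[corev1.ResourceMemory] = limit
	return nil
}
```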

Proposed changes:

  • Create a queue for each app
  • Add a parameter to control QPS and burst for the Kubernetes API client (see the sketch after this list)
  • Add a memoryLimit option for executors and drivers
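
For the QPS/burst parameter, here is a minimal sketch of the idea using client-go's rest.Config fields. The flag names api-qps and api-burst are assumptions and may not match the flags added in this PR.

```go
package main

import (
	"flag"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// Hypothetical flag names; the actual flags added in this PR may differ.
var (
	apiQPS   = flag.Float64("api-qps", 50, "QPS limit for requests to the Kubernetes API server")
	apiBurst = flag.Int("api-burst", 100, "Burst limit for requests to the Kubernetes API server")
)

// buildClient raises the client-side rate limits so the operator can keep up
// when reconciling hundreds of SparkApplications concurrently.
func buildClient(cfg *rest.Config) (*kubernetes.Clientset, error) {
	cfg.QPS = float32(*apiQPS)
	cfg.Burst = *apiBurst
	return kubernetes.NewForConfig(cfg)
}
```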

Change Category

  • Feature (non-breaking change which adds functionality)

What are we still missing:

  • Fixing the broken unit tests
  • Documentation
  • Peer review

I would point out that this code currently runs in our production environment at massive scale without any issues.

bnetzi and others added 3 commits April 17, 2024 19:40
* Add MemoryLimit as option that will override spark pods limits (by using webhook)

* queue per spark app - improved performance by far 

* added logs

* prevent concurrent access to the appsQueues map

* use RWmutex when accessing the appQueues

---------

Co-authored-by: Netanel Levine <netanel.levine@mobileye.com>
Co-authored-by: Eran Ben Ami <eranba@mobileye.com>
# Conflicts:
#	pkg/controller/sparkapplication/controller.go

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot added size/XXL and removed size/L labels Jun 5, 2024
@google-oss-prow google-oss-prow bot added size/XL and removed size/XXL labels Jun 5, 2024
Fix tests
@vara-bonthu (Contributor)

/assign @ChenYi015
/assign @yuchaoran2011

Please review the changes when you get a chance
