[core][dashboard] Reworking dashboard_max_actors_to_cache to RAY_maximum_gcs_destroyed_actor_cached_count #48229
Conversation
Signed-off-by: dayshah <dhyey2019@gmail.com>
@@ -45,12 +45,10 @@ Checks: >
-modernize-make-shared,
-modernize-use-override,
performance-*,
readability-avoid-const-params-in-decls,
Why are we deleting this?
This is just something that isn't followed at all through the codebase; almost every file has a function that takes a parameter by const value, so it adds a lot of noise when looking at clang-tidy lints. It's also more of a personal-preference lint than something that actually detracts from performance.
If we want to add them back, we should enforce them: i.e. treat these warnings as errors.
Agreed. I want to condense this list down to the more important lints, or the ones that are already followed, first; start treating those as errors; and then slowly expand the list again.
@@ -613,7 +611,7 @@ void GcsActorManager::HandleGetAllActorInfo(rpc::GetAllActorInfoRequest request,
absl::flat_hash_map<ActorID, rpc::ActorTableData>>(arena, std::move(result));
size_t count = 0;
size_t num_filtered = 0;
for (const auto &pair : *ptr) {
for (auto &pair : *ptr) {
Why are we changing this?
Below, at line 625, there's a const_cast for the UnsafeArenaAddAllocated call. We shouldn't have const casts, and just taking the const away here lets us remove that const_cast.
node_id = actor["address"].get("rayletId")
if node_id and node_id != actor_consts.NIL_NODE_ID:
    del DataSource.node_actors[node_id][actor_id]
while len(self.dead_actors_queue) > MAX_DELETED_ACTORS_TO_CACHE:
Is it possible that GCS and dashboard evict different dead actors?
Ya, good point. Post Python 3.7, dictionaries preserve insertion order, and dead actors arrive from the GCS actor channel in the same order they're added to the GCS's dead actor queue, because the GCS both publishes and adds to its cache inside the actor manager's DestroyActor function. So both queues should have the same order, and both just pop from the front whenever they exceed the max.
One concern I have: the GCS only adds to its dead actor cache if the actor is not restartable, but the publish happens either way. Can restartable actors still be marked dead? If so, the two sides will end up with different dead actor queues.
This does feel like a very weak guarantee, though, and I don't think it makes sense to expect people to keep the dashboard queue in mind while making changes to the GCS queue. Two possible solutions for a stronger relationship: a min-heap keyed on death timestamp (which would have to be a newly published field), or having the dashboard not keep its own cache at all. But I see Alexey revamped this just last month for performance reasons.
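To make the ordering argument concrete, here's a minimal sketch (illustrative names and a made-up cap, not the actual Ray code) of two FIFO caches fed in the same order and trimmed from the front:

from collections import deque

# Stand-in for RAY_maximum_gcs_destroyed_actor_cached_count (value is made up).
MAX_DEAD_ACTORS = 3

gcs_queue: deque = deque()
dashboard_queue: deque = deque()

def on_actor_destroyed(actor_id: str) -> None:
    # GCS side: cache the dead actor, then publish it on the actor channel.
    gcs_queue.append(actor_id)
    while len(gcs_queue) > MAX_DEAD_ACTORS:
        gcs_queue.popleft()
    # Dashboard side: sees the publish in the same order it was sent.
    dashboard_queue.append(actor_id)
    while len(dashboard_queue) > MAX_DEAD_ACTORS:
        dashboard_queue.popleft()

for i in range(5):
    on_actor_destroyed(f"actor-{i}")

# Both sides evicted actor-0 and actor-1, in the same order.
assert list(gcs_queue) == list(dashboard_queue) == ["actor-2", "actor-3", "actor-4"]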
Let's defer this to the next PR since we have the same issue today.
Signed-off-by: dayshah <dhyey2019@gmail.com>
@@ -235,9 +235,9 @@ Use the Actors view to see the logs for an Actor and which Job created the Actor
<div style="position: relative; height: 0; overflow: hidden; max-width: 100%; height: auto;">
<iframe width="560" height="315" src="https://www.youtube.com/embed/MChn6O1ecEQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</div>

The information for up to 1000 dead Actors is stored.
The default value for RAY_maximum_gcs_destroyed_actor_cached_count is 100000.
@@ -21,7 +22,10 @@
logger = logging.getLogger(__name__)
routes = dashboard_optional_utils.DashboardHeadRouteTable

MAX_ACTORS_TO_CACHE = int(os.environ.get("RAY_DASHBOARD_MAX_ACTORS_TO_CACHE", 1000))
MAX_DELETED_ACTORS_TO_CACHE = max(
Suggested change:
- MAX_DELETED_ACTORS_TO_CACHE = max(
+ MAX_DESTROYED_ACTORS_TO_CACHE = max(
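For reference, a minimal sketch of what the unified constant could look like on the dashboard side, assuming it reads the same env var as the GCS (the exact code in the PR may differ; the 100000 default matches the GCS default noted above):

import os

# Sketch only: derive the dashboard cap from the GCS env var instead of
# RAY_DASHBOARD_MAX_ACTORS_TO_CACHE; the clamping to >= 0 is an assumption.
MAX_DESTROYED_ACTORS_TO_CACHE = max(
    0, int(os.environ.get("RAY_maximum_gcs_destroyed_actor_cached_count", 100000))
)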
if node_id and node_id != actor_consts.NIL_NODE_ID:
    del DataSource.node_actors[node_id][actor_id]
while len(self.dead_actors_queue) > MAX_DELETED_ACTORS_TO_CACHE:
    actor_id = self.dead_actors_queue.popleft()
Oh, I found one bug: we should only add an actor to dead_actors_queue if it's dead and non-restartable.
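A hedged sketch of that fix (the dict keys here are illustrative, not the exact dashboard schema): only append actors that are dead and cannot be restarted, mirroring what the GCS caches.

from collections import deque

MAX_DESTROYED_ACTORS_TO_CACHE = 1000  # placeholder; see the env-var sketch above

dead_actors_queue: deque = deque()

def on_actor_update(actor: dict) -> None:
    # Only remember actors that are gone for good, mirroring the GCS cache.
    if actor["state"] == "DEAD" and not actor.get("isRestartable", False):
        dead_actors_queue.append(actor["actorId"])
        while len(dead_actors_queue) > MAX_DESTROYED_ACTORS_TO_CACHE:
            dead_actors_queue.popleft()  # evict the oldest dead actor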
Signed-off-by: dayshah <dhyey2019@gmail.com>
Why are these changes needed?
The issue is that there are two env variables: RAY_DASHBOARD_MAX_ACTORS_TO_CACHE, which controls the cache on the dashboard server side, and RAY_maximum_gcs_destroyed_actor_cached_count, which controls how many deleted actors the gcs_actor_manager holds. Here we're unifying RAY_DASHBOARD_MAX_ACTORS_TO_CACHE and RAY_maximum_gcs_destroyed_actor_cached_count into the latter.
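As a usage note, with the unified variable a user caps the destroyed-actor cache for both the GCS and the dashboard with one setting, for example (illustrative value; it must be set before the cluster starts so the GCS reads it):

import os
import ray

# Example only: keep up to 50000 destroyed actors; set before ray.init()
# launches the local GCS so the value is picked up.
os.environ["RAY_maximum_gcs_destroyed_actor_cached_count"] = "50000"
ray.init()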
Related issue number
Closes #47930
Checks
- I've signed off every commit (using git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.