[core][dashboard] Reworking dashboard_max_actors_to_cache to RAY_maximum_gcs_destroyed_actor_cached_count #48229
Conversation
Signed-off-by: dayshah <dhyey2019@gmail.com>
@@ -45,12 +45,10 @@ Checks: >
-modernize-make-shared,
-modernize-use-override,
performance-*,
readability-avoid-const-params-in-decls,
Why are we deleting this?
This is just something that isn't followed at all through the codebase; almost every file has a function that takes a parameter by const value, so it adds a lot of noise when looking at clang-tidy lints. It's also more of a personal-preference lint than something that actually detracts from performance.
If we want to add them back, we should enforce them: i.e. treat these warnings as errors.
Agreed. I want to condense this list down to the more important lints, or the ones that are already followed, first; start treating those as errors; and then slowly expand the list again.
@@ -613,7 +611,7 @@ void GcsActorManager::HandleGetAllActorInfo(rpc::GetAllActorInfoRequest request,
absl::flat_hash_map<ActorID, rpc::ActorTableData>>(arena, std::move(result));
size_t count = 0;
size_t num_filtered = 0;
for (const auto &pair : *ptr) {
for (auto &pair : *ptr) {
Why are we changing this?
Below, at line 625, there's a const_cast for the UnsafeArenaAddAllocated call. We shouldn't have const casts, and just taking the const away here lets us remove that const_cast.
node_id = actor["address"].get("rayletId")
if node_id and node_id != actor_consts.NIL_NODE_ID:
    del DataSource.node_actors[node_id][actor_id]
while len(self.dead_actors_queue) > MAX_DELETED_ACTORS_TO_CACHE:
Is it possible that GCS and dashboard evict different dead actors?
Ya, good point. Post Python 3.7, dictionaries preserve insertion order, and dead actors arrive from the GCS actor channel in the same order they're added to the GCS's dead actor queue, because the GCS both publishes and adds to its cache inside the actor manager's DestroyActor function. So both queues should have the same order, and both just pop from the front whenever they exceed the max.
One concern I have: the GCS only adds to its dead actor cache if the actor is not restartable, but the publish happens either way. Can restartable actors still be marked dead? If so, the two sides will end up with different dead actor queues.
This does feel like a very weak guarantee, though, and I don't think it makes sense to expect people to keep the dashboard queue in mind while making changes to the GCS queue. Two possible solutions for a stronger relationship: a min-heap keyed on death timestamp (which would have to be a newly published field), or having the dashboard not keep its own cache at all. But I see Alexey revamped this just last month for performance reasons.
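To make the ordering argument concrete, here's a minimal sketch (illustrative names and a made-up cap, not the actual Ray code) of two FIFO caches fed in the same order and trimmed from the front:

from collections import deque

# Stand-in for RAY_maximum_gcs_destroyed_actor_cached_count (value is made up).
MAX_DEAD_ACTORS = 3

gcs_queue: deque = deque()
dashboard_queue: deque = deque()

def on_actor_destroyed(actor_id: str) -> None:
    # GCS side: cache the dead actor, then publish it on the actor channel.
    gcs_queue.append(actor_id)
    while len(gcs_queue) > MAX_DEAD_ACTORS:
        gcs_queue.popleft()
    # Dashboard side: sees the publish in the same order it was sent.
    dashboard_queue.append(actor_id)
    while len(dashboard_queue) > MAX_DEAD_ACTORS:
        dashboard_queue.popleft()

for i in range(5):
    on_actor_destroyed(f"actor-{i}")

# Both sides evicted actor-0 and actor-1, in the same order.
assert list(gcs_queue) == list(dashboard_queue) == ["actor-2", "actor-3", "actor-4"]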
Let's defer this to the next PR since we have the same issue today.
Signed-off-by: dayshah <dhyey2019@gmail.com>
@@ -235,9 +235,9 @@ Use the Actors view to see the logs for an Actor and which Job created the Actor
<div style="position: relative; height: 0; overflow: hidden; max-width: 100%; height: auto;">
<iframe width="560" height="315" src="https://www.youtube.com/embed/MChn6O1ecEQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</div>

The information for up to 1000 dead Actors is stored.
The default value for RAY_maximum_gcs_destroyed_actor_cached_count is 100000.
@@ -21,7 +22,10 @@
logger = logging.getLogger(__name__)
routes = dashboard_optional_utils.DashboardHeadRouteTable

MAX_ACTORS_TO_CACHE = int(os.environ.get("RAY_DASHBOARD_MAX_ACTORS_TO_CACHE", 1000))
MAX_DELETED_ACTORS_TO_CACHE = max(
Suggested change:
- MAX_DELETED_ACTORS_TO_CACHE = max(
+ MAX_DESTROYED_ACTORS_TO_CACHE = max(
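For reference, a minimal sketch of what the unified constant could look like on the dashboard side, assuming it reads the same env var as the GCS (the exact code in the PR may differ; the 100000 default matches the GCS default noted above):

import os

# Sketch only: derive the dashboard cap from the GCS env var instead of
# RAY_DASHBOARD_MAX_ACTORS_TO_CACHE; the clamping to >= 0 is an assumption.
MAX_DESTROYED_ACTORS_TO_CACHE = max(
    0, int(os.environ.get("RAY_maximum_gcs_destroyed_actor_cached_count", 100000))
)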
if node_id and node_id != actor_consts.NIL_NODE_ID:
    del DataSource.node_actors[node_id][actor_id]
while len(self.dead_actors_queue) > MAX_DELETED_ACTORS_TO_CACHE:
    actor_id = self.dead_actors_queue.popleft()
Oh, I found one bug: we should only add an actor to dead_actors_queue if it's dead and non-restartable.
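A hedged sketch of that fix (the dict keys here are illustrative, not the exact dashboard schema): only append actors that are dead and cannot be restarted, mirroring what the GCS caches.

from collections import deque

MAX_DESTROYED_ACTORS_TO_CACHE = 1000  # placeholder; see the env-var sketch above

dead_actors_queue: deque = deque()

def on_actor_update(actor: dict) -> None:
    # Only remember actors that are gone for good, mirroring the GCS cache.
    if actor["state"] == "DEAD" and not actor.get("isRestartable", False):
        dead_actors_queue.append(actor["actorId"])
        while len(dead_actors_queue) > MAX_DESTROYED_ACTORS_TO_CACHE:
            dead_actors_queue.popleft()  # evict the oldest dead actor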
Signed-off-by: dayshah <dhyey2019@gmail.com>
Why are these changes needed?
The issue is that there are two env variables: RAY_DASHBOARD_MAX_ACTORS_TO_CACHE, which controls the cache on the dashboard server side, and RAY_maximum_gcs_destroyed_actor_cached_count, which controls how many deleted actors the gcs_actor_manager holds. Here we're unifying RAY_DASHBOARD_MAX_ACTORS_TO_CACHE and RAY_maximum_gcs_destroyed_actor_cached_count into the latter.
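As a usage note, with the unified variable a user caps the destroyed-actor cache for both the GCS and the dashboard with one setting, for example (illustrative value; it must be set before the cluster starts so the GCS reads it):

import os
import ray

# Example only: keep up to 50000 destroyed actors; set before ray.init()
# launches the local GCS so the value is picked up.
os.environ["RAY_maximum_gcs_destroyed_actor_cached_count"] = "50000"
ray.init()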
Related issue number
Closes #47930
Checks
- I've signed off every commit (using git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.