
[core][dashboard] Reworking dashboard_max_actors_to_cache to RAY_maximum_gcs_destroyed_actor_cached_count #48229

Merged
merged 7 commits into from
Oct 28, 2024

Conversation

dayshah
Contributor

@dayshah dayshah commented Oct 23, 2024

Why are these changes needed?

The issue is that there are two environment variables: RAY_DASHBOARD_MAX_ACTORS_TO_CACHE, which controls the cache on the dashboard server side, and RAY_maximum_gcs_destroyed_actor_cached_count, which controls how many deleted actors the gcs_actor_manager holds.

Here we're unifying RAY_DASHBOARD_MAX_ACTORS_TO_CACHE and RAY_maximum_gcs_destroyed_actor_cached_count, so that RAY_maximum_gcs_destroyed_actor_cached_count controls both.
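As a rough sketch of the unification described above, the dashboard side would read the same environment variable the GCS uses instead of its own separate one. The function name here is illustrative, not Ray's actual code; the 100000 default matches the default for RAY_maximum_gcs_destroyed_actor_cached_count mentioned later in this review.

```python
import os


def get_max_destroyed_actors_to_cache() -> int:
    """Illustrative sketch: derive the dashboard's dead-actor cache cap
    from the GCS-side env var rather than a separate dashboard-only one."""
    # Clamp to a non-negative value; default mirrors the GCS default.
    return max(
        0,
        int(os.environ.get("RAY_maximum_gcs_destroyed_actor_cached_count", 100000)),
    )
```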

Related issue number

Closes #47930

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(


Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah added dashboard Issues specific to the Ray Dashboard core Issues that should be addressed in Ray Core labels Oct 23, 2024
@dayshah dayshah requested a review from rynewang October 23, 2024 20:35
@dayshah dayshah changed the title [core][dashboard] Reworking dashboard_max_actors_to_cache [core][dashboard] Reworking dashboard_max_actors_to_cache to RAY_maximum_gcs_destroyed_actor_cached_count Oct 23, 2024
Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah marked this pull request as ready for review October 23, 2024 20:52
@dayshah dayshah requested review from a team as code owners October 23, 2024 20:52
@@ -45,12 +45,10 @@ Checks: >
-modernize-make-shared,
-modernize-use-override,
performance-*,
readability-avoid-const-params-in-decls,
Contributor

why deleting this?

Contributor Author

This is just something that isn't followed at all through the codebase; almost every file has a function that takes a parameter by const value, so it adds a lot of noise when looking at clang-tidy lints. It's also more of a personal-preference lint than something that may actually detract from performance.

Collaborator

If we want to add them back, we should enforce them: i.e. treat these warnings as errors.

Contributor Author

Agreed. I want to condense this list down to the more important lints, or things that are already followed, first; start treating those as errors; and then slowly expand the list again.

@@ -613,7 +611,7 @@ void GcsActorManager::HandleGetAllActorInfo(rpc::GetAllActorInfoRequest request,
absl::flat_hash_map<ActorID, rpc::ActorTableData>>(arena, std::move(result));
size_t count = 0;
size_t num_filtered = 0;
for (const auto &pair : *ptr) {
for (auto &pair : *ptr) {
Contributor

why changing this?

Contributor Author

Below, at line 625, there's a const_cast for the UnsafeArenaAddAllocated call. We shouldn't have const casts, and just taking the const away here allows us to remove the const_cast.

@jjyao jjyao added P0 Issues that should be fixed in short order go add ONLY when ready to merge, run all tests and removed P0 Issues that should be fixed in short order labels Oct 24, 2024
node_id = actor["address"].get("rayletId")
if node_id and node_id != actor_consts.NIL_NODE_ID:
del DataSource.node_actors[node_id][actor_id]
while len(self.dead_actors_queue) > MAX_DELETED_ACTORS_TO_CACHE:
Collaborator

Is it possible that GCS and dashboard evict different dead actors?

Contributor Author

Ya, good point. Post Python 3.7, dictionaries preserve insertion order, and the actors come in, in order, from the GCS actor channel in the same order that they're added to the GCS's dead actor queue, because the GCS both publishes and adds to the cache inside the actor manager's DestroyActor function. So both queues should have the same order, and both just pop from the front whenever they go over the max.

One concern I have is that the GCS only adds to its dead actor cache if the actor is not restartable, but the publish happens either way. Can restartable actors still be marked dead? If so, the two sides will end up with different dead actor queues.

This does feel like a very weak guarantee, though, and I don't think it makes sense to expect people to keep the dashboard queue in mind while making changes to the GCS queue. Two possible solutions for a stronger relationship are a min-heap based on death timestamp (which would have to be a newly pushed field), or having the dashboard not keep its own cache at all. But I see Alexey revamped this just last month for performance reasons:

https://github.com/ray-project/ray/pull/47617/files#diff-0df41fb197ae91ea90ce4414703c9bb809d01ac60f01639fe0bdaca4e4331b1e
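The ordering argument above can be sketched as plain FIFO eviction. This is an illustrative toy, not Ray's actual code: as long as both the GCS and the dashboard see dead actors in the same publish order and evict from the front once over the cap, their caches hold the same set of actors.

```python
from collections import deque

# Tiny cap for demonstration; in Ray this would come from
# RAY_maximum_gcs_destroyed_actor_cached_count.
MAX_DESTROYED_ACTORS_TO_CACHE = 3


def record_dead_actor(queue: deque, cache: dict, actor_id: str, data: dict) -> None:
    """Cache a dead actor and evict the oldest entries once over the cap.

    Both sides performing exactly this, on the same input sequence,
    is what keeps their caches in sync."""
    cache[actor_id] = data
    queue.append(actor_id)
    while len(queue) > MAX_DESTROYED_ACTORS_TO_CACHE:
        evicted = queue.popleft()
        cache.pop(evicted, None)
```

Feeding the same sequence of five actors through two independent copies of this state evicts `actor-0` and `actor-1` from both, leaving identical caches, which is the (admittedly implicit) invariant the discussion above relies on.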

Collaborator

Let's defer this to the next PR since we have the same issue today.

Signed-off-by: dayshah <dhyey2019@gmail.com>
format
Signed-off-by: dayshah <dhyey2019@gmail.com>
@@ -235,9 +235,9 @@ Use the Actors view to see the logs for an Actor and which Job created the Actor
<div style="position: relative; height: 0; overflow: hidden; max-width: 100%; height: auto;">
<iframe width="560" height="315" src="https://www.youtube.com/embed/MChn6O1ecEQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</div>


The information for up to 1000 dead Actors is stored.
Collaborator

The default value for RAY_maximum_gcs_destroyed_actor_cached_count is 100000

@@ -21,7 +22,10 @@
logger = logging.getLogger(__name__)
routes = dashboard_optional_utils.DashboardHeadRouteTable

MAX_ACTORS_TO_CACHE = int(os.environ.get("RAY_DASHBOARD_MAX_ACTORS_TO_CACHE", 1000))
MAX_DELETED_ACTORS_TO_CACHE = max(
Collaborator

Suggested change
MAX_DELETED_ACTORS_TO_CACHE = max(
MAX_DESTROYED_ACTORS_TO_CACHE = max(

if node_id and node_id != actor_consts.NIL_NODE_ID:
del DataSource.node_actors[node_id][actor_id]
while len(self.dead_actors_queue) > MAX_DELETED_ACTORS_TO_CACHE:
actor_id = self.dead_actors_queue.popleft()
Collaborator

Oh, I found one bug: we should only add an actor to dead_actors_queue if it's dead and non-restartable.

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
to 100k
Signed-off-by: dayshah <dhyey2019@gmail.com>
@jjyao jjyao merged commit ad2925f into ray-project:master Oct 28, 2024
5 checks passed
edoakes pushed a commit to edoakes/ray that referenced this pull request Oct 30, 2024
…mum_gcs_destroyed_actor_cached_count (ray-project#48229)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Jay-ju pushed a commit to Jay-ju/ray that referenced this pull request Nov 5, 2024
…mum_gcs_destroyed_actor_cached_count (ray-project#48229)

Signed-off-by: dayshah <dhyey2019@gmail.com>
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
…mum_gcs_destroyed_actor_cached_count (ray-project#48229)

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah deleted the dashboard-actors branch November 15, 2024 04:35
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
…mum_gcs_destroyed_actor_cached_count (ray-project#48229)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>