[core] Add metrics ray_io_context_event_loop_lag_ms. #47989

Merged
merged 14 commits into from
Oct 30, 2024

Conversation

rynewang
Contributor

@rynewang rynewang commented Oct 11, 2024

Adds a metric to monitor event loop lag for `instrumented_io_context`. By default, this is applied only to the IO contexts in GCS.

By default, every 250ms we post an async task and use `std::chrono::steady_clock` to measure the lag between posting the task and executing it. The metric is a GAUGE tagged with the thread name.

I tested this on my laptop and it works: the metric reads ~0 on an idle cluster, and 1000ms if I set `RAY_testing_asio_delay_us=event_loop_lag_probe=1000000:1000000`.
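
The probe idea described above can be illustrated with a minimal, standard-library-only sketch. This is not the code from this PR: `TinyEventLoop`, `PostLagProbe`, and `MeasureLagBehindBlockingTask` are hypothetical names, the loop is a toy single-threaded FIFO queue standing in for an IO context, and the sketch takes one measurement instead of repeating every 250ms and exporting a gauge. It only shows how a timestamp captured at post time and compared at execution time yields the lag.

```cpp
#include <cassert>
#include <chrono>
#include <deque>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical stand-in for an IO context: a single-threaded FIFO task queue.
class TinyEventLoop {
 public:
  void Post(std::function<void()> task) { tasks_.push_back(std::move(task)); }

  // Runs queued tasks in FIFO order until the queue is empty.
  void RunUntilIdle() {
    while (!tasks_.empty()) {
      auto task = std::move(tasks_.front());
      tasks_.pop_front();
      task();
    }
  }

 private:
  std::deque<std::function<void()>> tasks_;
};

// Posts a probe task. When the probe eventually runs, it stores the time that
// elapsed since it was posted (in milliseconds) into *lag_ms_out.
void PostLagProbe(TinyEventLoop &loop, double *lag_ms_out) {
  assert(lag_ms_out != nullptr);
  // steady_clock is monotonic, so the difference is never negative.
  const auto posted_at = std::chrono::steady_clock::now();
  loop.Post([posted_at, lag_ms_out]() {
    const auto ran_at = std::chrono::steady_clock::now();
    *lag_ms_out =
        std::chrono::duration<double, std::milli>(ran_at - posted_at).count();
  });
}

// Queues a task that blocks the loop for `block_for`, then a probe behind it,
// and returns the lag the probe observed.
double MeasureLagBehindBlockingTask(std::chrono::milliseconds block_for) {
  TinyEventLoop loop;
  double lag_ms = -1.0;
  loop.Post([block_for]() { std::this_thread::sleep_for(block_for); });
  PostLagProbe(loop, &lag_ms);
  loop.RunUntilIdle();
  return lag_ms;
}
```

Calling `MeasureLagBehindBlockingTask(std::chrono::milliseconds(50))` should report a lag of roughly 50ms or more, because the probe cannot run until the blocking task ahead of it finishes. This mirrors the observation above that an injected 1-second delay shows up as about 1000ms of lag.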

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang requested a review from a team as a code owner October 11, 2024 04:58
rynewang and others added 2 commits October 11, 2024 13:26
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang added the go add ONLY when ready to merge, run all tests label Oct 11, 2024
@jjyao
Collaborator

jjyao commented Oct 11, 2024

@MengjinYan could you review the PR? It is related to observability.

cpp tests
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
rynewang and others added 2 commits October 23, 2024 16:48
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang assigned rynewang and unassigned jjyao, MengjinYan and rkooo567 Oct 24, 2024
Collaborator

@jjyao jjyao left a comment

Nice

@@ -145,6 +145,12 @@ DEFINE_stats(placement_groups,
/// ===================== INTERNAL SYSTEM METRICS =================================
/// ===============================================================================

DEFINE_stats(io_context_event_loop_lag_ms,
"Latency of a task from post to execution",
Collaborator

This is basically the queuing time we print in the debug state, right?

Contributor Author

I think so. It's just that the debug-state queuing time is not exported as a metric, and we don't always want to look at logs.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
fix
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
rynewang and others added 2 commits October 28, 2024 13:51
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@rynewang rynewang enabled auto-merge (squash) October 28, 2024 20:52
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@github-actions github-actions bot disabled auto-merge October 29, 2024 18:16
@rynewang rynewang enabled auto-merge (squash) October 29, 2024 18:18
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@github-actions github-actions bot disabled auto-merge October 29, 2024 21:30
@rynewang rynewang enabled auto-merge (squash) October 29, 2024 21:33
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
@github-actions github-actions bot disabled auto-merge October 30, 2024 02:43
@rynewang rynewang enabled auto-merge (squash) October 30, 2024 02:43
@rynewang rynewang merged commit cb4526a into ray-project:master Oct 30, 2024
6 checks passed
@rynewang rynewang deleted the ioctx-lag-metrics branch October 30, 2024 17:47
Jay-ju pushed a commit to Jay-ju/ray that referenced this pull request Nov 5, 2024
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
Signed-off-by: mohitjain2504 <mohit.jain@dream11.com>