[core] Add metrics ray_io_context_event_loop_lag_ms. #47989
Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
Force-pushed from 7f144ed to 97191d8.
@MengjinYan could you review the PR? This is related to observability.
Nice
@@ -145,6 +145,12 @@ DEFINE_stats(placement_groups,
/// ===================== INTERNAL SYSTEM METRICS =================================
/// ===============================================================================

DEFINE_stats(io_context_event_loop_lag_ms,
             "Latency of a task from post to execution",
This is basically the queuing time we print in the debug state right?
I think so. It's just that that one does not have a metric, and we don't always want to look at logs.
Adds a metric to monitor event loop lag for `instrumented_io_context`. By default this is applied to IO contexts in GCS only. Every 250ms (by default) we post an async task and use `std::chrono::steady_clock` to record the lag between posting the task and executing it. The metric is a GAUGE tagged with the thread name.

I tested this on my laptop and it works: the lag is ~0 on an idle cluster, and 1000ms if I use `RAY_testing_asio_delay_us=event_loop_lag_probe=1000000:1000000`.

Signed-off-by: Ruiyang Wang <rywang014@gmail.com>
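For readers unfamiliar with the technique, here is a minimal sketch of a post-to-execution lag probe. It is not the code from this PR: `ToyEventLoop`, `PostLagProbe`, and `MeasureLagBehindBusyTask` are hypothetical names, and a toy single-threaded task queue stands in for `instrumented_io_context`. It takes a single measurement, whereas the PR re-posts the probe periodically and publishes the result as a gauge.

```cpp
#include <cassert>
#include <chrono>
#include <deque>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical stand-in for an io_context: a single-threaded FIFO task queue.
class ToyEventLoop {
 public:
  void Post(std::function<void()> task) {
    assert(task != nullptr);  // Reject empty callables up front.
    queue_.push_back(std::move(task));
  }

  // Runs queued tasks in FIFO order until the queue is empty.
  void RunUntilIdle() {
    while (!queue_.empty()) {
      auto task = std::move(queue_.front());
      queue_.pop_front();
      task();
    }
  }

 private:
  std::deque<std::function<void()>> queue_;
};

// Posts a probe task. When the probe finally runs, it writes the elapsed
// time between post and execution (in milliseconds) to *lag_ms_out.
void PostLagProbe(ToyEventLoop &loop, double *lag_ms_out) {
  const auto posted_at = std::chrono::steady_clock::now();
  loop.Post([posted_at, lag_ms_out]() {
    const auto ran_at = std::chrono::steady_clock::now();
    *lag_ms_out =
        std::chrono::duration<double, std::milli>(ran_at - posted_at).count();
  });
}

// Queues a task that blocks the loop for `busy_for`, then a probe behind it,
// and returns the lag the probe observed.
double MeasureLagBehindBusyTask(std::chrono::milliseconds busy_for) {
  ToyEventLoop loop;
  double lag_ms = -1.0;
  loop.Post([busy_for]() { std::this_thread::sleep_for(busy_for); });
  PostLagProbe(loop, &lag_ms);
  loop.RunUntilIdle();
  return lag_ms;
}
```

Calling `MeasureLagBehindBusyTask(std::chrono::milliseconds(50))` should report a lag of roughly 50ms or more, because the probe cannot run until the blocking task ahead of it finishes. That is the same effect the injected delay in the test above is meant to surface.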