[train v2+tune] Add `TuneReportCallback` for propagating intermediate Train results to Tune #49927

justinvyu · 2025-01-17T23:18:46Z

Summary

Add TuneReportCallback, which implements the UserCallback interface introduced in #49819.

Ray Train provides this callback out of the box to support the Ray Tune integration. It collects the intermediate metrics reported by Train workers and propagates the rank 0 metrics to the Tune driver, so that Ray Tune searchers, schedulers, etc. can act on intermediate results.
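To make the rank 0 propagation concrete, here is a minimal, Ray-free sketch of the idea. It is not the code from this PR: `SketchTuneReportCallback`, its `after_report` signature, the injected `report_fn` (standing in for `ray.tune.report`), and the string value of `CHECKPOINT_PATH_KEY` are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Optional

# Assumed value for illustration only; the real constant is defined in Ray.
CHECKPOINT_PATH_KEY = "checkpoint_path"


class SketchTuneReportCallback:
    """Hypothetical stand-in that forwards rank 0 metrics to a report function."""

    def __init__(self, report_fn: Callable[[Dict[str, Any]], None]):
        # report_fn plays the role of ray.tune.report in this sketch.
        self._report_fn = report_fn

    def after_report(
        self,
        metrics: List[Dict[str, Any]],
        checkpoint_path: Optional[str] = None,
    ) -> None:
        # metrics holds one dict per worker, ordered by rank.
        if not metrics:
            return
        # Copy so the worker's own dict is not mutated.
        rank0_metrics = dict(metrics[0])
        if checkpoint_path:
            rank0_metrics[CHECKPOINT_PATH_KEY] = checkpoint_path
        self._report_fn(rank0_metrics)


if __name__ == "__main__":
    received: List[Dict[str, Any]] = []
    callback = SketchTuneReportCallback(received.append)
    callback.after_report([{"loss": 0.5}, {"loss": 0.7}], checkpoint_path="/tmp/ckpt")
    print(received)
```

Only the first entry is forwarded; metrics from other ranks are dropped in this sketch, which mirrors the "rank 0 metrics" wording in the summary.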

Implementation details

TuneReportCallback must execute in the same process as the Tune FunctionTrainable and the session. By default, Ray Train runs its control loop in a separate actor, where ray.tune.report wouldn't work properly. This integration therefore relies on the RAY_TRAIN_RUN_CONTROLLER_AS_ACTOR environment variable introduced in #49522. Ray Tune sets this environment variable automatically, so users do not need to be aware of it.
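As a rough illustration of how such a flag might be read, the sketch below parses the environment variable into a boolean. The helper name `run_controller_as_actor`, the default of `"1"`, and the treatment of `"0"`/`"false"` as "run in-process" are assumptions, not details confirmed by this PR.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "RAY_TRAIN_RUN_CONTROLLER_AS_ACTOR"


def run_controller_as_actor(env: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper: True if the controller should run in its own actor."""
    env = os.environ if env is None else env
    # Assumption: unset means "run as actor"; "0" or "false" disables it.
    value = env.get(ENV_VAR, "1").strip().lower()
    return value not in ("0", "false")


if __name__ == "__main__":
    # With the flag disabled, the control loop would stay in the caller's
    # process, which is where a Tune session would be available.
    print(run_controller_as_actor({ENV_VAR: "0"}))
```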

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu · 2025-01-17T23:19:28Z

python/ray/tune/integration/ray_train.py

+        if checkpoint:
+            metrics[CHECKPOINT_PATH_KEY] = checkpoint.path


Note: if it's an S3 checkpoint, this path does not have the s3:// prefix. We might want to introduce a checkpoint.uri utility that returns a more usable path.

Agreed, I think the checkpoint.uri feature will be useful in a few more use cases.
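A `checkpoint.uri` utility does not exist in this PR; the sketch below only illustrates what such a helper might compute. The function name `to_uri` and the idea of passing the filesystem scheme as a plain string are assumptions for illustration.

```python
from typing import Optional


def to_uri(path: str, scheme: Optional[str]) -> str:
    """Hypothetical helper: prefix a checkpoint path with its filesystem scheme."""
    # Local paths are returned unchanged.
    if not scheme or scheme in ("local", "file"):
        return path
    # Remote paths get "<scheme>://" prepended, avoiding a triple slash.
    return f"{scheme}://{path.lstrip('/')}"


if __name__ == "__main__":
    print(to_uri("bucket/run/checkpoint_000", "s3"))
    print(to_uri("/tmp/run/checkpoint_000", None))
```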

justinvyu · 2025-01-17T23:20:20Z

python/ray/tune/integration/ray_train.py

Q1: should this be in ray.train or ray.tune?
Q2: should this be exposed with a top-level module alias? We probably want to avoid users importing from ray.tune.integration...

Do we have a plan to drop the air folder after rolling out Train v2? Otherwise, do you think air would be a proper location for the intersection of Train and Tune?

QQ: historically, where was the implementation of TuneReportCallback located in v1?

I think ray.air is a fine internal package that we can use to share internal utilities between libraries. Might want to make it a hidden package like ray._air instead. Cleaning up the ray.air namespace has been a todo for a long time that we can do with the v2 cleanup.

QQ: historically, where was the implementation of TuneReportCallback located in v1?

There's no similar callback, but this is where the propagation happens: https://github.com/ray-project/ray/blob/master/python/ray/train/data_parallel_trainer.py#L372

This is a good question. I can see both sides, but from a discoverability/readability perspective I'm currently leaning towards putting it in Train...

I was hoping to maintain imports in one direction (Tune only ever importing from Train) by putting it in Tune.

I think if we added other utilities like calculating the max concurrent Train trials, it'd also make more sense in ray.tune.

I think we cannot do the top-level module import though, since ray.tune.TuneReportCallback doesn't make much sense. The ray.tune.integration.ray_train module actually seems like a nice place to put things.

Okay let's go with that

hongpeng-guo

Thanks for the integration!

hongpeng-guo · 2025-01-17T23:44:17Z

python/ray/tune/integration/ray_train.py

QQ: historically, where was the implementation of TuneReportCallback located in v1?

matthewdeng

so clean!

matthewdeng · 2025-01-18T00:41:46Z

python/ray/tune/integration/ray_train.py

This is a good question. I can see both sides, but from a discoverability/readability perspective I'm currently leaning towards putting it in Train...


add TuneReportCallback

1581f15

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu assigned matthewdeng and hongpeng-guo Jan 17, 2025

justinvyu requested review from hongpeng-guo, matthewdeng, raulchen and woshiyyya as code owners January 17, 2025 23:18

justinvyu commented Jan 17, 2025

hongpeng-guo reviewed Jan 17, 2025

matthewdeng approved these changes Jan 18, 2025

justinvyu added 2 commits January 21, 2025 11:01

lint

d871fae

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Merge branch 'master' of https://github.com/ray-project/ray into tune_revamp/report_callback

32116d8

hongpeng-guo approved these changes Jan 21, 2025

better comment

0c590d9

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu enabled auto-merge (squash) January 21, 2025 19:24

github-actions bot added the go label Jan 21, 2025

justinvyu merged commit 84dba3a into ray-project:master Jan 21, 2025
6 of 7 checks passed

justinvyu deleted the tune_revamp/report_callback branch January 21, 2025 22:00
