-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[train v2+tune] Add TuneReportCallback
for propagating intermediate Train results to Tune
#49927
[train v2+tune] Add TuneReportCallback
for propagating intermediate Train results to Tune
#49927
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
if checkpoint: | ||
metrics[CHECKPOINT_PATH_KEY] = checkpoint.path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: if it's an s3 checkpoint, this path does not have the s3://
prefix. we might want to introduce a checkpoint.uri
utility that returns a more usable path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I think the checkpoint.uri
feature will be useful in a few more use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q1: should this be in ray.train
or ray.tune
?
Q2: should this be exposed with a the top-level module alias? probably want to avoid users importing from ray.tune.integration...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a plan to drop the air
folder after rolling out train v2? Otherwise, do you think air should be a proper location for the intersection of train and tune.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ. historically, where do we locate the implementation of the TuneReportCallback
in v1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ray.air
is a fine internal package that we can use to share internal utilities between libraries. Might want to make it a hidden package like ray._air
instead. Cleaning up the ray.air
namespace has been a todo for a long time that we can do with the v2 cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ. historically, where do we locate the implementation of the TuneReportCallback in v1?
There's no similar callback, but this is where the propagation happens: https://github.com/ray-project/ray/blob/master/python/ray/train/data_parallel_trainer.py#L372
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good question. I can see both sides, but from a discoverability/readability perspective I'm currently leaning towards putting it in Train...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was hoping to maintain imports in one direction (Tune only ever importing from Train) by putting it in Tune.
I think if we added other utilities like calculating the max concurrent Train trials, it'd also make more sense in ray.tune.
I think we cannot do the top-level module imports though since ray.tune.TuneReportCallback
doesn't make much sense. The ray.tune.integration.ray_train
actually seems like a nice place to put things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay let's go with that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the integration!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ. historically, where do we locate the implementation of the TuneReportCallback
in v1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so clean!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good question. I can see both sides, but from a discoverability/readability perspective I'm currently leaning towards putting it in Train...
…_revamp/report_callback
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
… Train results to Tune (ray-project#49927) Add `TuneReportCallback`, which implements the `UserCallback` interface introduced in ray-project#49819. This is a callback provided by Ray Train out of the box to support the Ray Tune integration. The callback collects intermediate metrics reported by Train workers and propagates the rank 0 metrics to the Tune driver. This allows Ray Tune searchers, schedulers, etc. to kick in. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com>
… Train results to Tune (ray-project#49927) Add `TuneReportCallback`, which implements the `UserCallback` interface introduced in ray-project#49819. This is a callback provided by Ray Train out of the box to support the Ray Tune integration. The callback collects intermediate metrics reported by Train workers and propagates the rank 0 metrics to the Tune driver. This allows Ray Tune searchers, schedulers, etc. to kick in. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com>
… Train results to Tune (ray-project#49927) Add `TuneReportCallback`, which implements the `UserCallback` interface introduced in ray-project#49819. This is a callback provided by Ray Train out of the box to support the Ray Tune integration. The callback collects intermediate metrics reported by Train workers and propagates the rank 0 metrics to the Tune driver. This allows Ray Tune searchers, schedulers, etc. to kick in. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Anson Qian <anson627@gmail.com>
… Train results to Tune (#49927) Add `TuneReportCallback`, which implements the `UserCallback` interface introduced in #49819. This is a callback provided by Ray Train out of the box to support the Ray Tune integration. The callback collects intermediate metrics reported by Train workers and propagates the rank 0 metrics to the Tune driver. This allows Ray Tune searchers, schedulers, etc. to kick in. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com>
… Train results to Tune (ray-project#49927) Add `TuneReportCallback`, which implements the `UserCallback` interface introduced in ray-project#49819. This is a callback provided by Ray Train out of the box to support the Ray Tune integration. The callback collects intermediate metrics reported by Train workers and propagates the rank 0 metrics to the Tune driver. This allows Ray Tune searchers, schedulers, etc. to kick in. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com> Signed-off-by: Puyuan Yao <williamyao034@gmail.com>
… Train results to Tune (ray-project#49927) Add `TuneReportCallback`, which implements the `UserCallback` interface introduced in ray-project#49819. This is a callback provided by Ray Train out of the box to support the Ray Tune integration. The callback collects intermediate metrics reported by Train workers and propagates the rank 0 metrics to the Tune driver. This allows Ray Tune searchers, schedulers, etc. to kick in. --------- Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Summary
Add
TuneReportCallback
, which implements theUserCallback
interface introduced in #49819.This is a callback provided by Ray Train out of the box to support the Ray Tune integration. The callback collects intermediate metrics reported by Train workers and propagates the rank 0 metrics to the Tune driver. This allows Ray Tune searchers, schedulers, etc. to kick in.
Implementation details
The
TuneReportCallback
execution must be in the same process as the TuneFunctionTrainable
and thesession
. Ray Train runs its control loop in a separate actor by default, whereray.tune.report
wouldn't work properly. This integration relies on theRAY_TRAIN_RUN_CONTROLLER_AS_ACTOR
environment variable introduced in #49522. This environment variable is automatically set by Ray Tune, so the user does not need to be aware of this.