Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[train v2+tune] Add TuneReportCallback for propagating intermediate Train results to Tune #49927

Merged
merged 4 commits into from
Jan 21, 2025

Conversation

justinvyu
Copy link
Contributor

Summary

Add TuneReportCallback, which implements the UserCallback interface introduced in #49819.

This is a callback provided by Ray Train out of the box to support the Ray Tune integration. The callback collects intermediate metrics reported by Train workers and propagates the rank 0 metrics to the Tune driver. This allows Ray Tune searchers, schedulers, etc. to kick in.

Implementation details

The TuneReportCallback execution must be in the same process as the Tune FunctionTrainable and the session. Ray Train runs its control loop in a separate actor by default, where ray.tune.report wouldn't work properly. This integration relies on the RAY_TRAIN_RUN_CONTROLLER_AS_ACTOR environment variable introduced in #49522. This environment variable is automatically set by Ray Tune, so the user does not need to be aware of this.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Comment on lines +30 to +31
if checkpoint:
metrics[CHECKPOINT_PATH_KEY] = checkpoint.path
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: if it's an s3 checkpoint, this path does not have the s3:// prefix. we might want to introduce a checkpoint.uri utility that returns a more usable path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I think the checkpoint.uri feature will be useful in a few more use cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q1: should this be in ray.train or ray.tune?
Q2: should this be exposed with a the top-level module alias? probably want to avoid users importing from ray.tune.integration...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a plan to drop the air folder after rolling out train v2? Otherwise, do you think air should be a proper location for the intersection of train and tune.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ. historically, where do we locate the implementation of the TuneReportCallback in v1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ray.air is a fine internal package that we can use to share internal utilities between libraries. Might want to make it a hidden package like ray._air instead. Cleaning up the ray.air namespace has been a todo for a long time that we can do with the v2 cleanup.

Copy link
Contributor Author

@justinvyu justinvyu Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ. historically, where do we locate the implementation of the TuneReportCallback in v1?

There's no similar callback, but this is where the propagation happens: https://github.com/ray-project/ray/blob/master/python/ray/train/data_parallel_trainer.py#L372

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good question. I can see both sides, but from a discoverability/readability perspective I'm currently leaning towards putting it in Train...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping to maintain imports in one direction (Tune only ever importing from Train) by putting it in Tune.

I think if we added other utilities like calculating the max concurrent Train trials, it'd also make more sense in ray.tune.

I think we cannot do the top-level module imports though since ray.tune.TuneReportCallback doesn't make much sense. The ray.tune.integration.ray_train actually seems like a nice place to put things.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay let's go with that

Copy link
Contributor

@hongpeng-guo hongpeng-guo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the integration!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ. historically, where do we locate the implementation of the TuneReportCallback in v1?

Copy link
Contributor

@matthewdeng matthewdeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so clean!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good question. I can see both sides, but from a discoverability/readability perspective I'm currently leaning towards putting it in Train...

lint
Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu enabled auto-merge (squash) January 21, 2025 19:24
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jan 21, 2025
@justinvyu justinvyu merged commit 84dba3a into ray-project:master Jan 21, 2025
6 of 7 checks passed
@justinvyu justinvyu deleted the tune_revamp/report_callback branch January 21, 2025 22:00
win5923 pushed a commit to win5923/ray that referenced this pull request Jan 23, 2025

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
… Train results to Tune (ray-project#49927)

Add `TuneReportCallback`, which implements the `UserCallback` interface
introduced in ray-project#49819.

This is a callback provided by Ray Train out of the box to support the
Ray Tune integration. The callback collects intermediate metrics
reported by Train workers and propagates the rank 0 metrics to the Tune
driver. This allows Ray Tune searchers, schedulers, etc. to kick in.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Jan 23, 2025

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
… Train results to Tune (ray-project#49927)

Add `TuneReportCallback`, which implements the `UserCallback` interface
introduced in ray-project#49819.

This is a callback provided by Ray Train out of the box to support the
Ray Tune integration. The callback collects intermediate metrics
reported by Train workers and propagates the rank 0 metrics to the Tune
driver. This allows Ray Tune searchers, schedulers, etc. to kick in.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
anson627 pushed a commit to anson627/ray that referenced this pull request Jan 31, 2025
… Train results to Tune (ray-project#49927)

Add `TuneReportCallback`, which implements the `UserCallback` interface
introduced in ray-project#49819.

This is a callback provided by Ray Train out of the box to support the
Ray Tune integration. The callback collects intermediate metrics
reported by Train workers and propagates the rank 0 metrics to the Tune
driver. This allows Ray Tune searchers, schedulers, etc. to kick in.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Anson Qian <anson627@gmail.com>
srinathk10 pushed a commit that referenced this pull request Feb 2, 2025
… Train results to Tune (#49927)

Add `TuneReportCallback`, which implements the `UserCallback` interface
introduced in #49819.

This is a callback provided by Ray Train out of the box to support the
Ray Tune integration. The callback collects intermediate metrics
reported by Train workers and propagates the rank 0 metrics to the Tune
driver. This allows Ray Tune searchers, schedulers, etc. to kick in.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025
… Train results to Tune (ray-project#49927)

Add `TuneReportCallback`, which implements the `UserCallback` interface
introduced in ray-project#49819.

This is a callback provided by Ray Train out of the box to support the
Ray Tune integration. The callback collects intermediate metrics
reported by Train workers and propagates the rank 0 metrics to the Tune
driver. This allows Ray Tune searchers, schedulers, etc. to kick in.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Puyuan Yao <williamyao034@gmail.com>
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
… Train results to Tune (ray-project#49927)

Add `TuneReportCallback`, which implements the `UserCallback` interface
introduced in ray-project#49819.

This is a callback provided by Ray Train out of the box to support the
Ray Tune integration. The callback collects intermediate metrics
reported by Train workers and propagates the rank 0 metrics to the Tune
driver. This allows Ray Tune searchers, schedulers, etc. to kick in.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants