[data] Fix DataContext sealing for multiple datasets. #49096

Merged
merged 16 commits into from
Dec 6, 2024

Conversation

raulchen
Contributor

@raulchen raulchen commented Dec 5, 2024

Why are these changes needed?

When users work with multiple datasets, they may want to set a different DataContext configuration for each one.
The recommended way is to modify the object returned by DataContext.get_current() before creating a Dataset. The DataContext is supposed to be captured and sealed by a Dataset when it's created. For example:

    import ray

    context = ray.data.DataContext.get_current()

    context.target_max_block_size = 100 * 1024 ** 2
    ds1 = ray.data.range(1)
    context.target_max_block_size = 1 * 1024 ** 2
    ds2 = ray.data.range(1)

    # ds1's target_max_block_size will be 100MB
    ds1.take_all()
    # ds2's target_max_block_size will be 1MB
    ds2.take_all()

However, Ray Data's internal code has widely called DataContext.get_current() directly instead of using the context captured by the Dataset, which is incorrect. This PR fixes most (but not all) of the outstanding issues by explicitly passing the captured DataContext object as an argument to each component.
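The difference between the two patterns can be illustrated with a small toy model. This is only a sketch and not Ray code: `ToyContext`, `ToyDataset`, `get_current`, `plan_with_global_lookup`, and `plan_with_captured` are hypothetical names, the default block size is arbitrary, and the use of `copy.deepcopy` to seal the context is an assumption made for illustration.

```python
import copy
from dataclasses import dataclass


@dataclass
class ToyContext:
    """Hypothetical stand-in for a global, mutable configuration object."""

    target_max_block_size: int = 128 * 1024 ** 2  # arbitrary default


# Process-wide "current" context (hypothetical singleton).
_current = ToyContext()


def get_current() -> ToyContext:
    return _current


def plan_with_global_lookup() -> int:
    """Problematic pattern: the component re-reads the global at execution time."""
    return get_current().target_max_block_size


def plan_with_captured(context: ToyContext) -> int:
    """Fixed pattern: the component receives the captured context as an argument."""
    return context.target_max_block_size


class ToyDataset:
    def __init__(self) -> None:
        # Capture and seal: snapshot the current context at creation time.
        self._context = copy.deepcopy(get_current())

    def block_size_buggy(self) -> int:
        return plan_with_global_lookup()

    def block_size_fixed(self) -> int:
        return plan_with_captured(self._context)


ctx = get_current()
ctx.target_max_block_size = 100 * 1024 ** 2
ds1 = ToyDataset()
ctx.target_max_block_size = 1 * 1024 ** 2
ds2 = ToyDataset()
```

In this toy model, `ds1.block_size_fixed()` returns the value that was set when `ds1` was created, while `ds1.block_size_buggy()` returns whatever the global happens to hold at call time, so later changes intended for `ds2` leak into `ds1`.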

Related issue number

#41573

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(


@raulchen raulchen requested a review from a team as a code owner December 5, 2024 08:26
@srinathk10
Contributor

LGTM. Nice change! I am wondering how DataContext affects Checkpoint restore, if at all. I suppose that if the DataContext changes are inline in the user code, there should not be a problem.

Contributor

@alexeykudinkin alexeykudinkin left a comment

Nice clean up! Thank you for doing it!

@raulchen raulchen added the go add ONLY when ready to merge, run all tests label Dec 6, 2024
@raulchen raulchen enabled auto-merge (squash) December 6, 2024 01:22
@github-actions github-actions bot disabled auto-merge December 6, 2024 04:19
Signed-off-by: Hao Chen <chenh1024@gmail.com>
@raulchen raulchen enabled auto-merge (squash) December 6, 2024 04:21
@raulchen raulchen merged commit 02a7b1c into ray-project:master Dec 6, 2024
6 checks passed
@raulchen raulchen deleted the context-multi-ds branch December 6, 2024 12:55
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Dec 17, 2024

## Why are these changes needed?

When users work with multiple datasets, they may want to set a different
DataContext configuration for each one.
The recommended way is to modify the object returned by
`DataContext.get_current()` before creating a Dataset. The DataContext is
supposed to be captured and sealed by a Dataset when it's created. For example:

```python
import ray

context = ray.data.DataContext.get_current()

context.target_max_block_size = 100 * 1024 ** 2
ds1 = ray.data.range(1)
context.target_max_block_size = 1 * 1024 ** 2
ds2 = ray.data.range(1)

# ds1's target_max_block_size will be 100MB
ds1.take_all()
# ds2's target_max_block_size will be 1MB
ds2.take_all()
```

However, Ray Data's internal code has widely called
`DataContext.get_current()` directly instead of using the context captured
by the Dataset, which is incorrect. This PR fixes most (but not all) of the
outstanding issues by explicitly passing the captured DataContext object as
an argument to each component.

## Related issue number
ray-project#41573

---------

Signed-off-by: Hao Chen <chenh1024@gmail.com>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Labels
go add ONLY when ready to merge, run all tests

4 participants