[Data] Add support for shuffling input files #40154
Conversation
@@ -284,6 +284,7 @@ def _get_block_metadata(
         if (
             prefetched_metadata is not None
             and len(prefetched_metadata) == num_fragments
+            and all(m is not None for m in prefetched_metadata)
This is needed for the Parquet datasource: previously this code relied on the Parquet datasource returning an empty list when metadata is unknown.
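In other words, the condition only trusts prefetched metadata when it covers every fragment and contains no unknown entries. A small self-contained sketch of that behavior (the `usable` helper and the sample values are illustrative only, not code from the PR):

```python
num_fragments = 3

def usable(prefetched_metadata):
    # The condition from the hunk above, extracted for illustration.
    return (
        prefetched_metadata is not None
        and len(prefetched_metadata) == num_fragments
        and all(m is not None for m in prefetched_metadata)
    )

assert not usable([])                  # old Parquet "unknown" signal: empty list
assert not usable(["m0", None, "m2"])  # newly rejected: a None entry
assert usable(["m0", "m1", "m2"])      # fully known metadata is used
```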
LGTM! A couple of questions:
python/ray/data/tests/test_image.py
Outdated
def test_random_shuffle(self, ray_start_regular_shared):
    # NOTE: set preserve_order to True to allow consistent output behavior.
    context = ray.data.DataContext.get_current()
    preserve_order = context.execution_options.preserve_order
Nit: define a context manager in conftest.py for enabling preserve_order. That would be clearer and reusable.
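For illustration, a minimal sketch of such a helper (hypothetical, not part of this PR; it assumes only the public DataContext API):

```python
# Hypothetical helper for conftest.py, as suggested above.
import contextlib

import ray


@contextlib.contextmanager
def preserve_order_enabled():
    """Temporarily enable preserve_order and restore the old value on exit."""
    ctx = ray.data.DataContext.get_current()
    original = ctx.execution_options.preserve_order
    ctx.execution_options.preserve_order = True
    try:
        yield
    finally:
        ctx.execution_options.preserve_order = original
```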
We also have the restore_data_context fixture.
+1 let's use a pytest fixture here
Updated to use restore_data_context, thanks folks!
python/ray/data/tests/test_image.py
Outdated
        file_paths == sorted(output_paths)
        for output_paths in output_paths_list
    ]
)
Also test read_parquet, as it has a different implementation.
Yes, added.
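For reference, a hedged sketch of what such a test could look like (the directory path and the "id" column are hypothetical; it assumes the ray_start_regular_shared and restore_data_context fixtures mentioned above and the shuffle="files" argument this PR adds):

```python
import ray


def test_read_parquet_random_shuffle(ray_start_regular_shared, restore_data_context):
    # Preserve order so the output order reflects the file-level shuffle.
    ray.data.DataContext.get_current().execution_options.preserve_order = True

    # Read the same Parquet directory several times and record each ordering.
    orderings = []
    for _ in range(5):
        ds = ray.data.read_parquet("/tmp/parquet_dir", shuffle="files")
        orderings.append([row["id"] for row in ds.take_all()])

    # With file shuffling enabled, the runs should not all come back sorted.
    assert any(ordering != sorted(ordering) for ordering in orderings)
```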
        The file paths and their sizes after shuffling.
        """
        raise NotImplementedError


class FileMetadataShuffler:
IMO this abstraction adds an unnecessary layer of indirection. It essentially just wraps a single function. I think it'd be simpler if we did something like this in FileBasedDatasource:

if shuffle == "files":
    np.random.shuffle(metadata)  # note: shuffles in place and returns None

If we introduce different shuffling methods in the future, we can always revisit this and introduce a new abstraction. But at this point, I think it's premature.
I was also thinking about this. I wanted to avoid some code duplication, but it looks like there's not much duplication now. Will remove this class for now.
Looks like the FileMetadataShuffler class is still here (?)
@stephanie-wang, according to discussion offline earlier,
That's a good question; probably not in 2.8. It needs more discussion together with Ray Train. I'd prefer to introduce it in Data first and gather user feedback.
All comments should be addressed, PTAL, thanks! cc @raulchen, @stephanie-wang and @bveeramani.
I think we still need to remove file_metadata_shuffler.py (?), but other than that LGTM.
Let me do the code removal in a separate PR. Several places need to be cleaned up, such as file_metadata_shuffler.py, https://github.com/ray-project/ray/blob/master/python/ray/data/_default_config.py, and https://github.com/ray-project/ray/blob/master/python/ray/data/context.py#L174.
Ah, gotcha. Sounds good.
Sounds good, thanks for the context.
Signed-off-by: Cheng Su <scnju13@gmail.com>
Curious how this would work with respect to checkpointing and determinism. If you want reproducibility and the ability to resume without re-iterating over data you've already trained on, how would you ensure that?
As a follow-up to #40154 (comment), remove the `FileMetadataShuffler` and the config setting in `DataContext`; they are not used any more. Signed-off-by: Cheng Su <scnju13@gmail.com>
@kszlim - good question. This PR only enables randomness for training. More design discussion is needed to integrate with checkpointing and achieve resumability.
Why are these changes needed?
This PR adds support for shuffling the order of input files for all file-based data sources. The behavior is controlled through the `shuffle` argument in all read APIs for file-based data sources (see the usage sketch after the list below). Several alternative interfaces were considered but not chosen:

1. Add a setting to `DataContext`. The drawback is that a `DataContext` config would control a subtle semantic difference between operators, which could introduce bugs later, and it is not consistent with the rest of the APIs.
2. An optimizer rule that pushes `randomize_block_order` down to the data source: `read_xxx().randomize_block_order()`. This has the benefit of not introducing any new interface, but it has drawbacks: (1) the `randomize_block_order()` API exposes the block concept, which past user feedback shows is hard to understand and use; (2) the pushdown can only happen when `randomize_block_order()` is applied immediately after `read_xxx()`, and it is not safe to push down when there are more operations in between, e.g. `read_xxx().map_batches().randomize_block_order()`. This behavior would be hard for users to understand and would cause issues as user code gets more complicated.
3. Introduce a new argument with an optimizer rule to push it down to the data source: `read_xxx().random_shuffle(file=True)`. This has a drawback similar to 2.(2), and it is hard to pick a name for the new argument: `file` is not a good name, because `random_shuffle()` should not care whether the data comes from a file (that is a data source concept), and `random_shuffle(block=True)` would expose the block concept, making it even more confusing given we already have `randomize_block_order()`.
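A minimal usage sketch of the chosen interface (the bucket path is hypothetical; `shuffle="files"` is the argument this PR adds):

```python
import ray

# Shuffle the order of input files before reading. The shuffle seed is
# currently fixed internally (see the note below).
ds = ray.data.read_parquet("s3://my-bucket/my-dir", shuffle="files")
```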
Note:

The seed always uses the default value and is not user-configurable. After 2.8, we shall iterate on how to expose a seed option to users. One option is a `Dataset.manual/set_seed()` API to control the global seed of the random number generator, but it's a bit too early to introduce that without user feedback.
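A purely hypothetical sketch of what such a seed API could look like; neither the name nor the placement is decided, and none of this exists in the PR:

```python
import ray

# Hypothetical future API, per the note above; commented out because it
# does not exist yet:
# ray.data.Dataset.set_seed(42)

# Today, file shuffling always uses the default internal seed:
ds = ray.data.read_parquet("s3://my-bucket/my-dir", shuffle="files")
```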
Related issue number

Checks

- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.