Faster folder based builder + parquet support + allow repeated media + use torchvideo #7424

Merged: 17 commits, Mar 5, 2025

@lhoestq lhoestq commented Feb 26, 2025

This will be useful for LeRobotDataset (robotics datasets for lerobot based on videos)

Impacted builders:

  • ImageFolder
  • AudioFolder
  • VideoFolder

Improvements:

  • faster to stream (got a 5x speed up on an image dataset)
  • improved RAM usage
  • support for metadata.parquet
  • the same image/audio/video file can be linked multiple times
  • support for pyarrow filters (mostly efficient for parquet)
  • link to files using field names of the form *_file_name (in addition to the already existing file_name)
    • this allows multiple image/audio/video files per row
    • there are also file_names and *_file_names for lists of image/audio/video files
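
A minimal sketch of what such a metadata file could look like, using only the standard library. The column names `front_image_file_name` and `side_image_file_name`, the folder layout, and the helper `write_metadata` are hypothetical, chosen only to illustrate the `*_file_name` convention described above. The image files themselves are assumed to already exist next to the metadata file.

```python
import csv
import tempfile
from pathlib import Path


def write_metadata(root: Path, rows: list[dict]) -> Path:
    """Write a metadata.csv whose *_file_name columns point at media files."""
    path = root / "metadata.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


# Hypothetical rows: two image columns per row, and cam_front/0.png is
# referenced by both rows (repeated media).
rows = [
    {"front_image_file_name": "cam_front/0.png", "side_image_file_name": "cam_side/0.png", "label": "pick"},
    {"front_image_file_name": "cam_front/0.png", "side_image_file_name": "cam_side/1.png", "label": "place"},
]

root = Path(tempfile.mkdtemp())
metadata_path = write_metadata(root, rows)
header = metadata_path.read_text().splitlines()[0]
print(header)
```

With the image files in place, loading would presumably be `load_dataset("imagefolder", data_dir=root)`. The names of the resulting media columns are not stated in this description, so they are left out here.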

Changes:

  • the builders iterate over the metadata files instead of the media files
  • the builders iterate over chunks of metadata instead of loading the metadata into RAM completely
  • metadata files are no longer handled separately in data_files
  • added the filters argument to pass to load_dataset
    • either as an Expression
    • or as tuples like filters=[('event_name', '=', 'SomeEvent')]
  • small breaking change: you can't add labels to a dataset with drop_labels=False if it has a metadata file
  • small breaking change: you can't use one metadata file for multiple splits anymore
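
The chunked iteration over metadata described above can be sketched as follows. This illustrates the idea only and is not the builders' actual code: `iter_metadata_chunks` and `chunk_size` are hypothetical names, and CSV is used only because it needs no extra dependencies.

```python
import csv
import io
from itertools import islice
from typing import Iterator


def iter_metadata_chunks(f, chunk_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most chunk_size metadata rows without reading the whole file."""
    reader = csv.DictReader(f)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical metadata with five rows; a real file would be opened from disk.
metadata = io.StringIO(
    "file_name,label\n"
    "a.png,cat\n"
    "b.png,dog\n"
    "c.png,cat\n"
    "d.png,dog\n"
    "e.png,cat\n"
)

chunk_sizes = [len(chunk) for chunk in iter_metadata_chunks(metadata, chunk_size=2)]
print(chunk_sizes)
```

Only one chunk of rows is held in memory at a time, which is where the RAM improvement of this approach comes from.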

Example: lhoestq/pusht-videofolder is a video dataset with metadata.parquet where multiple rows can point to the same video

In [1]: from datasets import load_dataset

In [2]: load_dataset("lhoestq/pusht-videofolder")
Resolving data files: 100%|██████████████████████████████| 207/207 [00:00<00:00, 1087.32it/s]
Out[2]: 
DatasetDict({
    train: Dataset({
        features: ['video', 'observation.state', 'action', 'episode_index', 'frame_index', 'timestamp', 'next.reward', 'next.done', 'next.success', 'index', 'task_index'],
        num_rows: 25650
    })
})

In [3]: load_dataset("lhoestq/pusht-videofolder", filters=[("next.reward", ">", 0.5)])
Resolving data files: 100%|██████████████████████████████| 207/207 [00:01<00:00, 183.03it/s]
Out[3]: 
DatasetDict({
    train: Dataset({
        features: ['video', 'observation.state', 'action', 'episode_index', 'frame_index', 'timestamp', 'next.reward', 'next.done', 'next.success', 'index', 'task_index'],
        num_rows: 5773
    })
})
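
To make the tuple form concrete, here is a small pure-Python sketch of how a flat list of `(column, operator, value)` tuples selects rows: every condition must hold (a logical AND), as in pyarrow's filter convention. `row_matches` and the sample rows are hypothetical; the real filtering is done by pyarrow, not by code like this.

```python
import operator

# Comparison operators supported by this sketch (a subset of what pyarrow accepts).
OPS = {
    "=": operator.eq,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def row_matches(row: dict, filters: list[tuple]) -> bool:
    """Return True if the row satisfies every (column, op, value) condition."""
    return all(OPS[op](row[column], value) for column, op, value in filters)


# Hypothetical rows mimicking two columns of the dataset above.
rows = [
    {"index": 0, "next.reward": 0.19},
    {"index": 1, "next.reward": 0.62},
    {"index": 2, "next.reward": 0.91},
]

filters = [("next.reward", ">", 0.5)]
kept = [row["index"] for row in rows if row_matches(row, filters)]
print(kept)
```

The Expression form of the same filter would presumably be a pyarrow expression such as `pyarrow.compute.field("next.reward") > 0.5`. That exact spelling is an assumption, since the description only says "an Expression".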

Additional change for VideoFolder:

  • decord can't be installed in many setups, so I switched the backend to torchvision instead
  • I also added streaming capability from HF (you can get video frames without downloading the full video from HF)

Example: load a robotics dataset

In [1]: from datasets import load_dataset

In [2]: ds = load_dataset("lhoestq/pusht-videofolder")
Resolving data files: 100%|██████████████████████████████| 207/207 [00:00<00:00, 624.81it/s]

In [3]: ds["train"][0]
Out[3]: 
{'video': <torchvision.io.video_reader.VideoReader at 0x1145dc290>,
 'observation.state': [222.0, 97.0],
 'action': [233.0, 71.0],
 'episode_index': 0,
 'frame_index': 0,
 'timestamp': 0.0,
 'next.reward': 0.19029748439788818,
 'next.done': False,
 'next.success': False,
 'index': 0,
 'task_index': 0}

Example: stream frames without downloading full videos

In [1]: from datasets import load_dataset

In [2]: ds = load_dataset("BrianGuo/Tennis_Data", streaming=True)

In [3]: example = next(iter(ds["train"]))

In [4]: video = example["video"]

In [5]: video.get_metadata()
Out[5]: 
{'audio': {'framerate': [44100.0], 'duration': [2027.35]},
 'video': {'fps': [59.00002712894387], 'duration': [2027.355]}}

In [6]: video.seek(1800, keyframes_only=True)  # 30min
Out[6]: <torchvision.io.video_reader.VideoReader at 0x148d4d010>

In [7]: next(video)
Out[7]: 
{'data': tensor([[[ 76,  77,  79,  ...,  41,  39,  38],
          [ 76,  77,  79,  ...,  40,  39,  35],
          [ 76,  77,  79,  ...,  34,  30,  26],
          ...,
          [127, 127, 127,  ..., 125, 125, 125],
          [125, 126, 126,  ..., 125, 125, 125],
          [122, 124, 126,  ..., 125, 125, 125]]], dtype=torch.uint8),
 'pts': 1800.0}
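
The seek-then-read pattern above can be wrapped in a small helper, for example to fetch the frame matching a row's `timestamp`. This is a sketch: `frame_at` is a hypothetical helper, and `FakeReader` is a stand-in that only mimics the `seek` and `next` behavior shown above, so the snippet runs without torchvision or a network connection.

```python
def frame_at(video, timestamp: float) -> dict:
    """Seek to timestamp (in seconds) and return the next decoded frame dict."""
    video.seek(timestamp)
    return next(video)


class FakeReader:
    """Stand-in for a video reader: yields {'data', 'pts'} dicts at a fixed fps."""

    def __init__(self, fps: float, duration: float):
        self.fps = fps
        self.duration = duration
        self.position = 0.0

    def seek(self, time_s: float):
        self.position = time_s
        return self  # mirrors the output shown above, where seek() returns the reader

    def __iter__(self):
        return self

    def __next__(self) -> dict:
        if self.position >= self.duration:
            raise StopIteration
        frame = {"data": None, "pts": self.position}  # no pixels in this stand-in
        self.position += 1.0 / self.fps
        return frame


video = FakeReader(fps=10.0, duration=5.0)
frame = frame_at(video, 2.0)
print(frame["pts"])
```

With a real reader, `frame["data"]` would hold the frame tensor and `frame["pts"]` its presentation timestamp, as in the output above.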

TODO:

  • docs
  • fix tests

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq changed the title from "Faster folder based builder + parquet support + allow repeated media" to "Faster folder based builder + parquet support + allow repeated media + use torchvideo" on Feb 28, 2025
lhoestq added 6 commits March 4, 2025 17:55
fix
@lhoestq marked this pull request as ready for review March 5, 2025 17:02
@lhoestq merged commit 5c8869f into main Mar 5, 2025
15 checks passed
@lhoestq deleted the faster-folder-based-builder branch March 5, 2025 17:41