Faster folder based builder + parquet support + allow repeated media + use torchvideo #7424

lhoestq · 2025-02-26T19:55:18Z

This will be useful for LeRobotDataset (robotics datasets for lerobot based on videos)

Impacted builders:

ImageFolder
AudioFolder
VideoFolder

Improvements:

faster streaming (got a 5x speed-up on an image dataset)
lower RAM usage
support for metadata.parquet
the same image/audio/video file can be linked from multiple rows
support for pyarrow filters (most efficient with parquet)
files can be linked using field names of the form *_file_name (in addition to the already existing file_name)
- this allows multiple images/audio files/videos per row
- file_names and *_file_names are also supported for lists of images/audio files/videos
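
To make the `*_file_name` convention concrete, here is a minimal sketch of a metadata file in which each row links to two images and one image is reused across rows. The column names `front_image_file_name`, `side_image_file_name` and `label`, and the helpers `write_metadata` and `media_columns`, are made up for illustration. The detection helper only mirrors the naming rule described above; it is not the builders' actual code.

```python
import csv
import tempfile
from pathlib import Path


def write_metadata(folder: Path) -> Path:
    """Write a metadata.csv whose rows each link to several media files."""
    rows = [
        # Both rows reuse front_0.png: linking the same file twice is allowed.
        {"front_image_file_name": "front_0.png", "side_image_file_name": "side_0.png", "label": "cat"},
        {"front_image_file_name": "front_0.png", "side_image_file_name": "side_1.png", "label": "dog"},
    ]
    path = folder / "metadata.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return path


def media_columns(path: Path) -> list[str]:
    """Return the header columns named file_name or ending in _file_name."""
    with path.open(newline="") as f:
        header = next(csv.reader(f))
    return [c for c in header if c == "file_name" or c.endswith("_file_name")]


with tempfile.TemporaryDirectory() as tmp:
    metadata_path = write_metadata(Path(tmp))
    detected = media_columns(metadata_path)

print(detected)  # ['front_image_file_name', 'side_image_file_name']
```

The same layout should carry over to metadata.parquet; CSV is used here only to keep the sketch dependency-free.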

Changes:

the builders iterate over the metadata files instead of the media files
the builders iterate over chunks of metadata instead of loading it into RAM completely
metadata files are no longer handled separately in data_files
added a filters argument to load_dataset
- either as an Expression
- or as tuples like filters=[('event_name', '=', 'SomeEvent')]
small breaking change: you can't add labels to a dataset with drop_labels=False if it has a metadata file
small breaking change: you can't use one metadata file for multiple splits anymore
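
As a rough illustration of the two ideas above (iterating over metadata in chunks, and tuple-style filters), here is a pure-Python sketch. It is not the implementation in this PR, which relies on pyarrow. The names `OPS`, `iter_chunks` and `filter_rows` are hypothetical, and the sketch assumes that a list of tuples is combined with a logical AND, as in pyarrow's tuple filter format.

```python
import operator
from itertools import islice
from typing import Any, Iterable, Iterator

# Hypothetical operator table for tuple filters such as ("next.reward", ">", 0.5).
OPS = {
    "=": operator.eq, "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge,
}

Row = dict[str, Any]
Filter = tuple[str, str, Any]


def iter_chunks(rows: Iterable[Row], chunk_size: int) -> Iterator[list[Row]]:
    """Yield lists of at most chunk_size rows, so only one chunk is held at a time."""
    it = iter(rows)
    while chunk := list(islice(it, chunk_size)):
        yield chunk


def filter_rows(rows: Iterable[Row], filters: list[Filter], chunk_size: int = 2) -> Iterator[Row]:
    """Yield rows that satisfy every (column, op, value) filter (logical AND)."""
    for chunk in iter_chunks(rows, chunk_size):
        for row in chunk:
            if all(OPS[op](row[col], value) for col, op, value in filters):
                yield row


metadata = [
    {"file_name": "a.mp4", "next.reward": 0.19},
    {"file_name": "a.mp4", "next.reward": 0.71},  # same video linked from a second row
    {"file_name": "b.mp4", "next.reward": 0.55},
]
kept = list(filter_rows(metadata, [("next.reward", ">", 0.5)]))
print([row["next.reward"] for row in kept])  # [0.71, 0.55]
```

With parquet metadata, such filters can be pushed down to the file reader, which is presumably why they are described as most efficient for parquet.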

Example: lhoestq/pusht-videofolder is a video dataset with metadata.parquet where multiple rows can point to the same video

In [1]: from datasets import load_dataset

In [2]: load_dataset("lhoestq/pusht-videofolder")
Resolving data files: 100%|██████████████████████████████| 207/207 [00:00<00:00, 1087.32it/s]
Out[2]: 
DatasetDict({
    train: Dataset({
        features: ['video', 'observation.state', 'action', 'episode_index', 'frame_index', 'timestamp', 'next.reward', 'next.done', 'next.success', 'index', 'task_index'],
        num_rows: 25650
    })
})

In [3]: load_dataset("lhoestq/pusht-videofolder", filters=[("next.reward", ">", 0.5)])
Resolving data files: 100%|██████████████████████████████| 207/207 [00:01<00:00, 183.03it/s]
Out[3]: 
DatasetDict({
    train: Dataset({
        features: ['video', 'observation.state', 'action', 'episode_index', 'frame_index', 'timestamp', 'next.reward', 'next.done', 'next.success', 'index', 'task_index'],
        num_rows: 5773
    })
})

Additional change for VideoFolder:

decord can't be installed in many setups, so I switched the backend to torchvision
I also added streaming capability from HF (you can get video frames without downloading the full video from HF)

Example: load a robotics dataset

In [1]: from datasets import load_dataset

In [2]: ds = load_dataset("lhoestq/pusht-videofolder")
Resolving data files: 100%|██████████████████████████████| 207/207 [00:00<00:00, 624.81it/s]

In [3]: ds["train"][0]
Out[3]: 
{'video': <torchvision.io.video_reader.VideoReader at 0x1145dc290>,
 'observation.state': [222.0, 97.0],
 'action': [233.0, 71.0],
 'episode_index': 0,
 'frame_index': 0,
 'timestamp': 0.0,
 'next.reward': 0.19029748439788818,
 'next.done': False,
 'next.success': False,
 'index': 0,
 'task_index': 0}

Example: stream frames without downloading full videos

In [1]: from datasets import load_dataset

In [2]: ds = load_dataset("BrianGuo/Tennis_Data", streaming=True)

In [3]: example = next(iter(ds["train"]))

In [4]: video = example["video"]

In [5]: video.get_metadata()
Out[5]: 
{'audio': {'framerate': [44100.0], 'duration': [2027.35]},
 'video': {'fps': [59.00002712894387], 'duration': [2027.355]}}

In [6]: video.seek(1800, keyframes_only=True)  # 30min
Out[6]: <torchvision.io.video_reader.VideoReader at 0x148d4d010>

In [7]: next(video)
Out[7]: 
{'data': tensor([[[ 76,  77,  79,  ...,  41,  39,  38],
          [ 76,  77,  79,  ...,  40,  39,  35],
          [ 76,  77,  79,  ...,  34,  30,  26],
          ...,
          [127, 127, 127,  ..., 125, 125, 125],
          [125, 126, 126,  ..., 125, 125, 125],
          [122, 124, 126,  ..., 125, 125, 125]]], dtype=torch.uint8),
 'pts': 1800.0}
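
Since the reader yields dicts with 'data' and 'pts' keys, as in the session above, a short clip can be collected by iterating until a target timestamp. The sketch below uses a stand-in generator (`fake_reader`, hypothetical) instead of a real video so that it runs without torchvision. `frames_until` is also a made-up helper, not part of datasets or torchvision.

```python
from itertools import takewhile
from typing import Any, Iterable, Iterator


def fake_reader(start: float, fps: float, n: int) -> Iterator[dict[str, Any]]:
    """Stand-in for a video reader: yields frame dicts with 'data' and 'pts' keys."""
    for i in range(n):
        yield {"data": None, "pts": start + i / fps}


def frames_until(reader: Iterable[dict[str, Any]], end_pts: float) -> Iterator[dict[str, Any]]:
    """Yield frames in order, stopping at the first frame whose pts exceeds end_pts."""
    return takewhile(lambda frame: frame["pts"] <= end_pts, reader)


clip = list(frames_until(fake_reader(start=1800.0, fps=2.0, n=10), end_pts=1801.0))
print([frame["pts"] for frame in clip])  # [1800.0, 1800.5, 1801.0]
```

With a real video, one would presumably call `video.seek(1800, keyframes_only=True)` first and then pass `video` in place of `fake_reader(...)`. Note that `takewhile` consumes one frame past the end of the clip.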

TODO:

docs
fix tests

HuggingFaceDocBuilderDev · 2025-02-26T19:57:30Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

faster folder based builder + parquet support + allow repeated media

7011039

lhoestq added 10 commits February 27, 2025 16:28

add _visit_with_path in features

b06910b

support image/audio/video in nested data

514febf

docs

cb3e789

use filters even without metadata

aec88df

minor

b1e58c0

replace decord by torchcodec

f80a845

switch to torchvision

349b1c8

update video docs

19566bf

minor

2df8112

fix tests

1a5d5c6

lhoestq changed the title ~~Faster folder based builder + parquet support + allow repeated media~~ Faster folder based builder + parquet support + allow repeated media + use torchvideo Feb 28, 2025

lhoestq added 6 commits March 4, 2025 17:55

fix tests

3612cb9

fix tests

3d93441

better webdataset docs

aaa82f7

style

4982a7a

Merge branch 'main' into faster-folder-based-builder

1b52338

fix

817fda5

lhoestq marked this pull request as ready for review March 5, 2025 17:02

lhoestq merged commit 5c8869f into main Mar 5, 2025
15 checks passed

lhoestq deleted the faster-folder-based-builder branch March 5, 2025 17:41
