dataloader crashes after several epochs if the trained model contains triton-based operators #126620
Labels: module: dataloader, triaged
🐛 Describe the bug
The complete code is uploaded to:
minifer.py
Key code (lines 990 to 1068):
Before running the code, put some images in matching subfolders under /tmp/ILSVRC2012_debug/train and /tmp/ILSVRC2012_debug/val, or modify lines 1010-1011 to point to your own data path.
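For anyone reproducing this, the data setup described above could be scripted roughly as follows. This is only a sketch using the standard library: the helper name `populate_debug_dataset` and the class subfolder names `class_a` and `class_b` are hypothetical placeholders, not taken from minifer.py, and it assumes "the same subfolders" means identically named class subfolders under both train and val.

```python
import shutil
from pathlib import Path


def populate_debug_dataset(image_paths, root, class_names=("class_a", "class_b")):
    """Copy the given images into root/{train,val}/<class_name>/.

    class_names are hypothetical placeholders (an assumption, not from the
    original script); the same subfolder names are created under both splits.
    """
    root = Path(root)
    created = []
    for split in ("train", "val"):
        for class_name in class_names:
            target_dir = root / split / class_name
            # Create the split/class folder if it does not exist yet.
            target_dir.mkdir(parents=True, exist_ok=True)
            for src in image_paths:
                dst = target_dir / Path(src).name
                shutil.copyfile(src, dst)
                created.append(dst)
    return created
```

Calling `populate_debug_dataset(["/path/to/some.jpg"], "/tmp/ILSVRC2012_debug")` with a few real image paths should produce a layout the script can read without editing lines 1010-1011.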
Running the code with the command
passes, but running it with the command
crashes. The only difference between the two runs is whether the trained model uses the Triton kernel in block_attention.
Crash info
Versions
cc @andrewkho @gokulavasan @ssnl @VitalyFedyunin @dzhulgakov