performance of gate_recurrent.py #1555

nkkbr · 2024-05-20T04:12:44Z

This is for YOCO/yoco/models/decoder/kernel/gate_recurrent.py

I assume this code is meant to accelerate part of the computation with Triton.

After running

python3 gate_recurrent.py

I got the following output:

naive time: 0.04773402214050293
triton time: 0.5681734085083008
False
tensor(0.0078, device='cuda:0', dtype=torch.float16, grad_fn=<MaxBackward1>) tensor(0.0001, device='cuda:0', dtype=torch.float16, grad_fn=<MeanBackward0>)
False
tensor(0.0078, device='cuda:0', dtype=torch.float16) tensor(0.0001, device='cuda:0', dtype=torch.float16)
False
tensor(0.0078, device='cuda:0', dtype=torch.float16) tensor(0.0002, device='cuda:0', dtype=torch.float16)
False

It seems the Triton version takes more time than the naive one (0.568 versus 0.048 in the output above, roughly 12x slower). Is this expected?
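A possible cause, not confirmed for this script: Triton compiles a kernel the first time it is called, and CUDA kernel launches are asynchronous, so a single timed call can be dominated by one-time setup or can stop the clock before the GPU has finished. The sketch below shows one way to time a function with warm-up calls and an optional synchronization hook. The names `benchmark`, `make_lazy_fn`, `warmup`, `iters`, and `sync` are hypothetical and are not taken from gate_recurrent.py; the demo uses a plain Python function with a simulated one-time setup cost so that it runs without a GPU.

```python
import time


def benchmark(fn, warmup=3, iters=10, sync=None):
    """Return the mean wall-clock seconds per call of fn.

    Warm-up calls run first and are not timed. If sync is given, it is
    called after each warm-up call and once after the timed loop, so that
    any asynchronous work has finished before the clock is read.
    """
    for _ in range(warmup):
        fn()
        if sync is not None:
            sync()

    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters


def make_lazy_fn(setup_cost=0.05):
    """Build a function whose first call pays a one-time cost.

    This only imitates first-call compilation; it is an assumption made
    for the demo, not a model of the real kernel.
    """
    state = {"ready": False, "calls": 0}

    def fn():
        state["calls"] += 1
        if not state["ready"]:
            time.sleep(setup_cost)  # simulated one-time setup
            state["ready"] = True

    return fn, state


if __name__ == "__main__":
    cold_fn, _ = make_lazy_fn()
    warm_fn, _ = make_lazy_fn()
    # Without warm-up, the one-time cost lands inside the timed region.
    print("cold:", benchmark(cold_fn, warmup=0, iters=1))
    # With warm-up, only steady-state calls are timed.
    print("warm:", benchmark(warm_fn, warmup=2, iters=5))
```

For GPU code, `torch.cuda.synchronize` could be passed as `sync`, for example `benchmark(run_kernel, sync=torch.cuda.synchronize)`, where `run_kernel` is a placeholder for whichever call is being measured. If the Triton path is still slower after warm-up and synchronization, the slowdown is in the kernel itself rather than in the measurement.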

donglixp self-assigned this May 20, 2024