Add Op(_scaled_dot_product_flash_attention) | feat(torchlib) #1043

Merged: 8 commits merged into microsoft:main from titaiwang/support_flash on Sep 6, 2023

Conversation

titaiwangms (Contributor) commented Aug 31, 2023

_scaled_dot_product_flash_attention is one of the three ATen implementations of nn.functional.scaled_dot_product_attention documented at https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html.

Which of the three ATen operators represents nn.functional.scaled_dot_product_attention in a model is decided by a backend context manager: https://pytorch.org/docs/stable/backends.html. From the ONNX perspective, they differ only in their function signatures.
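For context, a minimal example (shapes and dtype are illustrative assumptions; a flash-attention-capable CUDA device is assumed) of pinning SDPA to the flash-attention backend with that context manager:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only; flash attention needs fp16/bf16 tensors on CUDA.
query = torch.randn(2, 4, 64, 32, device="cuda", dtype=torch.float16)
key = torch.randn(2, 4, 64, 32, device="cuda", dtype=torch.float16)
value = torch.randn(2, 4, 64, 32, device="cuda", dtype=torch.float16)

# Restrict dispatch to the flash-attention kernel so that a traced/exported
# graph contains aten._scaled_dot_product_flash_attention.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(query, key, value)
```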

Only the first output matters for the model prediction; the shapes and dtypes of the remaining outputs follow the PyTorch meta implementation below:

```python
@register_meta(
    [
        aten._scaled_dot_product_flash_attention,
    ]
)
def meta__scaled_dot_product_flash(
    query: Tensor,
    key: Tensor,
    value: Tensor,
    dropout_p: float = 0.0,
    is_causal: bool = False,
    return_debug_mask: bool = False,
    scale: Optional[float] = None,
):
    batch_size = query.size(0)
    num_heads = query.size(1)
    max_seqlen_batch_q = query.size(2)
    head_dim = query.size(3)

    max_seqlen_batch_k = key.size(2)
    if device_hint(query) == "cpu":
        Nnz_q = batch_size * max_seqlen_batch_q
        query_t = query.transpose(1, 2)
        query_reshaped = query_t.reshape(Nnz_q, num_heads, head_dim)
        attention = torch.empty_like(query_reshaped, device=query.device)
        attention = attention.view(
            batch_size, max_seqlen_batch_q, num_heads, head_dim
        ).transpose(1, 2)
        logsumexp = torch.empty(
            (
                batch_size,
                max_seqlen_batch_q,
                num_heads,
            ),
            dtype=torch.float,
            device=query.device,
        ).transpose(1, 2)
        return (
            attention,
            logsumexp,
            torch.empty((), dtype=torch.int32, device="meta"),
            torch.empty((), dtype=torch.int32, device="meta"),
            0,
            0,
            torch.empty((), dtype=torch.long, device="meta"),
            torch.empty((), dtype=torch.long, device="meta"),
            torch.empty((), dtype=query.dtype, device=query.device),
        )

    # Cuda Path
    query_t = query.transpose(1, 2)
    attention = torch.empty_like(query_t).transpose(1, 2)
    logsumexp = torch.empty(
        (batch_size, num_heads, max_seqlen_batch_q),
        dtype=torch.float,
        device=query.device,
    )
    cumulative_sequence_length_q = torch.empty(
        batch_size + 1, dtype=torch.int32, device="meta"
    )
    cumulative_sequence_length_k = torch.empty(
        batch_size + 1, dtype=torch.int32, device="meta"
    )

    if return_debug_mask:
        blocksize_c = 128 if head_dim > 64 else 256
        max_seqlen_k = math.ceil(max_seqlen_batch_q / blocksize_c)
        if max_seqlen_batch_k <= 128:
            max_seqlen_k = 128
        elif max_seqlen_batch_k <= 256:
            max_seqlen_k = 256
        debug_mask = torch.empty(
            (batch_size, num_heads, max_seqlen_batch_q, max_seqlen_k),
            dtype=query.dtype,
            device=query.device,
        )
    else:
        debug_mask = torch.empty(0, dtype=query.dtype, device=query.device)

    # Note [Seed and Offset]: device for seed and offset below depends on whether we are
    # capturing or not, but at the time of tracing we don't know if we
    # are going to use cudagraphs or not, so we return meta tensors here
    # it's possible we'll need to have some special handling in inductor for sdpa

    return (
        attention,
        logsumexp,
        None,
        None,
        max_seqlen_batch_q,
        max_seqlen_batch_k,
        torch.empty((), dtype=torch.long, device="meta"),
        torch.empty((), dtype=torch.long, device="meta"),
        debug_mask,
    )
```
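As a hedged illustration of the point above (shapes are arbitrary; a flash-attention-capable CUDA build is assumed), calling the ATen op directly shows that only the first element of its 9-tuple is the attention result:

```python
import torch

query = torch.randn(2, 4, 64, 32, device="cuda", dtype=torch.float16)
key = torch.randn(2, 4, 64, 32, device="cuda", dtype=torch.float16)
value = torch.randn(2, 4, 64, 32, device="cuda", dtype=torch.float16)

# The ATen op returns a 9-element tuple; only element 0 feeds the model output.
outputs = torch.ops.aten._scaled_dot_product_flash_attention(query, key, value)
attention = outputs[0]   # shape (batch, num_heads, seq_len_q, head_dim)
logsumexp = outputs[1]   # bookkeeping for the backward pass
# outputs[2:] carry cumulative/max sequence lengths, RNG seed/offset, and the
# (normally empty) debug mask described by the meta function above.
```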

NOTE: The PyTorch converter should account for None appearing among the values passed to `_fill_tensor_shape_type`; otherwise, the exporter crashes.
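A hypothetical sketch of the guard the converter needs; the helper name `fill_shape_type_skipping_none` and its arguments are made up for illustration, with the real `_fill_tensor_shape_type` represented by the `fill_fn` callback:

```python
from typing import Any, Callable, Optional, Sequence

def fill_shape_type_skipping_none(
    outputs: Sequence[Optional[Any]],
    fill_fn: Callable[[Any], None],
) -> None:
    # aten._scaled_dot_product_flash_attention yields None for some outputs on
    # the CUDA path; skip those entries instead of handing None to the
    # shape/type filler (which would trip its beartype checks).
    for out in outputs:
        if out is None:
            continue
        fill_fn(out)
```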

@titaiwangms titaiwangms added the topic: torch_lib Related to the torch/aten function lib in development label Aug 31, 2023
codecov bot commented Aug 31, 2023

Codecov Report

Merging #1043 (376765b) into main (0c25215) will increase coverage by 0.04%.
The diff coverage is 100.00%.

```
@@            Coverage Diff             @@
##             main    #1043      +/-   ##
==========================================
+ Coverage   77.68%   77.73%   +0.04%     
==========================================
  Files         114      114              
  Lines       14445    14473      +28     
  Branches     1545     1546       +1     
==========================================
+ Hits        11222    11250      +28     
  Misses       2857     2857              
  Partials      366      366              
```

| Files Changed | Coverage | Δ |
| --- | --- | --- |
| ...ipt/tests/function_libs/torch_lib/ops_test_data.py | 96.03% <ø> | (ø) |
| onnxscript/function_libs/torch_lib/ops/nn.py | 80.06% <100.00%> | (+0.44%) ⬆️ |
| ...ript/tests/function_libs/torch_lib/extra_opinfo.py | 98.29% <100.00%> | (+0.08%) ⬆️ |

@titaiwangms titaiwangms added the hold on merging Don't merge yet label Aug 31, 2023
@titaiwangms titaiwangms marked this pull request as draft August 31, 2023 20:45
@titaiwangms titaiwangms marked this pull request as ready for review September 5, 2023 17:27
```python
return (
    result,
    logsumexp,
    empty_tensor_int,
```
titaiwangms (Contributor Author):

Should I create TInt for these guys?

Contributor:

INT64?

titaiwangms (Contributor Author):

The one in embedding remains TFloat though, but I can use INT64 in this case. It depends on whether we should follow the native-function signature or what we actually return.

Contributor:

If the return types for the empty float values need to be TFloat, do we need a CastLike on self here? Otherwise it would be FLOAT, because the dtype is set explicitly and does not depend on the input?
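For context on that discussion, a hedged sketch (not necessarily the code merged in this PR) of how an empty INT64 placeholder could be built with ONNX Script ops, so the unused integer outputs carry a concrete dtype instead of following the input's float type:

```python
from onnxscript import opset18 as op
from onnxscript.onnx_types import INT64

def _empty_int64_placeholder():
    # A zero-element tensor cast to INT64, usable as a dummy value for the
    # unused integer outputs of the flash-attention function.
    zero_shape = op.Constant(value_ints=[0])   # shape [0] -> empty tensor
    empty = op.ConstantOfShape(zero_shape)     # float zeros by default
    return op.Cast(empty, to=INT64.dtype)
```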

@justinchuby justinchuby self-requested a review September 5, 2023 17:55
@titaiwangms titaiwangms removed the hold on merging Don't merge yet label Sep 5, 2023
@justinchuby
Contributor

lgtm with the return types fixed

@titaiwangms
Contributor Author

@justinchuby I found that all CI runs fail except torch-nightly. I guess testing this op requires torch-nightly?

@justinchuby
Contributor

> @justinchuby I found that all CI runs fail except torch-nightly. I guess testing this op requires torch-nightly?

Looks like so. We can skip the tests for older torch by using `.skip(enabled_if=version_utils.torch_older_than("2.1"))`
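A hedged sketch of what that skip might look like in ops_test_data.py; the exact TorchLibOpInfo entry and field names are assumptions, and `TorchLibOpInfo`, `nn_ops`, and `version_utils` come from that module's existing imports:

```python
# Sketch only: a registration entry in
# onnxscript/tests/function_libs/torch_lib/ops_test_data.py
TorchLibOpInfo(
    "nn.functional.scaled_dot_product_flash_attention",
    nn_ops.aten__scaled_dot_product_flash_attention,
    trace_only=True,
).skip(
    # Skip on stable torch: the ATen op under test only matches
    # torch >= 2.1 (nightly at the time of this PR).
    enabled_if=version_utils.torch_older_than("2.1"),
    reason="the op is not available before torch 2.1",
)
```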

@titaiwangms titaiwangms merged commit 0e9c495 into microsoft:main Sep 6, 2023
29 of 30 checks passed
@titaiwangms titaiwangms deleted the titaiwang/support_flash branch September 6, 2023 00:01
@justinchuby justinchuby mentioned this pull request Sep 8, 2023
titaiwangms added a commit to pytorch/pytorch that referenced this pull request Sep 12, 2023
titaiwangms added a commit to pytorch/pytorch that referenced this pull request Sep 12, 2023
titaiwangms added a commit to pytorch/pytorch that referenced this pull request Sep 12, 2023
titaiwangms added a commit to pytorch/pytorch that referenced this pull request Sep 12, 2023
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Sep 13, 2023
Prior to this PR, if None is returned from intermediate nodes, it crashes the export because None is not expected to be passed into `_fill_tensor_shape_type`, which raises a beartype error. The function fills in the shape and type of a TorchScriptTensor according to its info from the FX graph.

This was discovered after microsoft/onnxscript#1043 was supported. The op generates None for some of its outputs, but the only output that is consumed is the first one (which is not None).

Reference test from a TorchBench model:
```python

    def test_nanogpt(self):
        import sys

        sys.path.append("/home/titaiwang")

        from nanoGPT.model import GPT, GPTConfig

        # Load the model
        kwargs = {
            "block_size": 256,
            "vocab_size": 8096,  # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
            "n_layer": 2,
            "n_head": 2,
            "n_embd": 128,
            "dropout": 0.0,
            "bias": False,  # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
        }
        config = GPTConfig(**kwargs)
        with torch.backends.cuda.sdp_kernel(
            enable_flash=True, enable_mem_efficient=True
        ):
            model = GPT(config)
        print("Done loading model")
        inputs = torch.arange(128).view(2, 64)
        targets = torch.arange(128).view(2, 64)

        self.run_test_with_fx_to_onnx_exporter_and_onnx_runtime(
            model,
            (inputs,),
            input_kwargs={
                "targets": targets,
            },
            verbose=True,
        )
```
Pull Request resolved: #108708
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
titaiwangms added a commit that referenced this pull request Dec 1, 2023
Labels
topic: torch_lib Related to the torch/aten function lib in development