
[Operation Question] How to separate truncation and matmul operations #672

Open
wilyub opened this issue Apr 30, 2024 · 9 comments

wilyub commented Apr 30, 2024

Issue Type

Support

Modules Involved

MPC protocol, SPU runtime

Have you reproduced the bug with SPU HEAD?

Yes

Have you searched existing issues?

Yes

SPU Version

spu 0.9.0b1

OS Platform and Distribution

Linux Ubuntu 22.04

Python Version

3.11.5

Compiler Version

No response

Current Behavior?

I'd like to do some latency testing for multiplication and truncation operations. Here is a short overview of how it would look:

Input x and weight y are each represented in 16-bit fixed point. After one multiplication, the product therefore has a 32-bit fixed-point representation. If the ring size is 64 bits, you could then do another 32-bit × 32-bit multiplication before overflowing the 64-bit ring.
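As a concrete illustration of the arithmetic above, here is a minimal plain-Python sketch of deferred truncation (the 2^16 scale, the toy values, and the `encode` helper are illustrative assumptions, not SPU internals; non-negative values only, for simplicity):

```python
RING_BITS = 64                 # ring size
FXP_BITS = 16                  # fractional bits assumed for x and y
SCALE = 1 << FXP_BITS
MASK = (1 << RING_BITS) - 1    # arithmetic mod 2^64

def encode(v):
    """Encode a real number as fixed point with 16 fractional bits."""
    return int(round(v * SCALE)) & MASK

x = encode(0.75)               # scale 2^16
y = encode(1.25)               # scale 2^16

xy = (x * y) & MASK            # scale 2^32 -- no truncation yet
z = encode(0.5)
xyz = (xy * z) & MASK          # scale 2^48 -- still fits in the 64-bit ring

# Truncate once at the end instead of after every multiplication.
result = (xyz >> (2 * FXP_BITS)) / SCALE
print(result)                  # 0.46875 == 0.75 * 1.25 * 0.5
```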

How can I do this in SecretFlow? In particular, I want to control the fixed-point representation size of each input to the matrix multiplication, and truncate only after specific multiplications (not after every multiplication).

Thank you for your help!

Standalone code to reproduce the issue

N/A

Relevant log output

N/A
Chrisdehe assigned tpppppub and 6fj and unassigned tpppppub May 6, 2024
6fj (Member) commented May 6, 2024

Hi @wilyub

I don't think you can separate matmul and truncate in the Python bindings of SPU. You may have to dive into the kernels of the SPU runtime.

wilyub (Author) commented May 6, 2024

Thanks for the advice. Any idea where I could start looking? I'm not too familiar with the inner workings of SPU, as I've only used it from Python. Thanks!

6fj (Member) commented May 7, 2024

I would suggest having a look at https://github.com/secretflow/spu/blob/main/REPO_LAYOUT.md.

fionser (Contributor) commented May 7, 2024

@wilyub To benchmark the matmul alone, without the truncation, you can use an "integer" matrix instead of a floating-point matrix.
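A minimal sketch of that idea in plain JAX (the shapes, values, and the way the function would be handed to SPU are assumptions for illustration; the point is only that integer dtypes avoid the fixed-point rescaling step):

```python
import numpy as np
import jax.numpy as jnp

def int_matmul(x, y):
    # With integer inputs there is no fixed-point scale to rescale, so the
    # compiled program should contain only the matmul kernel and no trunc_*
    # kernel in the SPU profile.
    return jnp.matmul(x, y)

rng = np.random.default_rng(0)
x = rng.integers(-2**15, 2**15, size=(256, 256), dtype=np.int32)
y = rng.integers(-2**15, 2**15, size=(256, 256), dtype=np.int32)

# Plain-JAX sanity check; under SPU the same function would be compiled and
# run on secret-shared tensors through whichever frontend you already use.
print(int_matmul(x, y).dtype)  # int32
```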

wilyub (Author) commented May 7, 2024

Thanks for the update, I'll take a look at integer containers. Another thing that has been confusing me is benchmarking the time for softmax/silu/division, etc. From my reading of the literature (PUMA, CipherGPT, etc.), these operations should introduce a high latency cost because they require a lot of communication. However, in my tests they do not seem to add significant time to the end-to-end runtime of one layer of a transformer model; the largest share of the latency is attributed to the matrix multiplications instead.

I've added a screenshot of my log from benchmarking one layer of Llama7B. The nonlinearities are implemented in JAX (jax.nn.softmax and jax.nn.silu). However, as you can see in the log, the vast majority of the time is spent in the mmul_aa operation (I assume this is the matrix multiplication). Any idea if I've messed something up here?

Note: I'm using SEMI2K with FM64. I also did some 3PC testing with ABY3 (also FM64) and saw similar results.
[screenshot: mpc_weird]

anakinxc (Contributor) commented May 8, 2024

> Another thing that has been confusing me is benchmarking the time for softmax/silu/division […] the vast majority of the time spent is on the mmul_aa operation […] Any idea if I've messed something up here?

@Ye-D

wilyub (Author) commented May 8, 2024

Please let me know if you find anything out about the nonlinearities issue above. I can also provide my code if it helps with reproducibility (although, as I said earlier, I just used the out-of-the-box jnn.softmax() and jnn.silu() alongside some matrix multiplications).

One other thing that has been odd for me is the matmul benchmark. I ran two tests. In the first, I did a single matrix multiplication between the value weight matrix and the hidden states (one matmul call). In the second, I did that matmul plus matmuls of the hidden states with the key weight matrix and with the query weight matrix (three matmul calls). However, the logs show nearly identical matmul latency for both tests, and, even stranger, exactly the same number of bytes sent/received. This seems wrong to me, since I would expect the matmul time to roughly triple with three matmul calls. I have attached a snapshot of my log and the code.
[screenshots: attn_code, matmul_1, matmul_2]

anakinxc (Contributor) commented May 8, 2024

> One other thing that has been weird for me is the matmul benchmark. […] I would expect matmul time to be tripled if we call three matmuls?

Both query_states and key_states are unused values, so the matmuls defining them are dead code and should be eliminated during optimization.
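If the goal is to force all three matmuls to actually execute, one simple fix is to keep their results live, e.g. by returning them. A minimal sketch, with illustrative names and shapes rather than the exact code from the screenshot:

```python
import jax.numpy as jnp

def attn_projections(hidden, w_q, w_k, w_v):
    query_states = hidden @ w_q
    key_states = hidden @ w_k
    value_states = hidden @ w_v
    # Returning (or otherwise consuming) all three tensors keeps them live,
    # so the compiler cannot eliminate the query/key matmuls as dead code
    # and all three show up in the communication/latency profile.
    return query_states, key_states, value_states

hidden = jnp.ones((8, 4096))
w = jnp.ones((4096, 4096))
q, k, v = attn_projections(hidden, w, w, w)
```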

wilyub (Author) commented May 10, 2024

I tested out the integer matmul and it works (no truncation shows up in the log). Which function should I call to get trunc_a to show up in the log by itself? I tried jnp.trunc and didn't get that result. Thanks! Also, any update on why the nonlinearities seem so cheap compared to the matmuls?
