
[Feature]: CUDA acceleration for Cheetah protocol #503

Open · BeStrongok opened this issue Jan 17, 2024 · 7 comments

@BeStrongok commented Jan 17, 2024

Feature Request Type

Performance

Have you searched existing issues?

No

Is your feature request related to a problem?

No.

Describe features you want to add to SPU

Hi, SPU team:
The Cheetah protocol is a high-performance two-party inference protocol that currently runs on CPUs. I'm wondering whether there is a way to apply CUDA acceleration to this protocol,
for example to the matrix encoding step, or to the computation that produces the results for each modulus. Do you have any corresponding development plans?

@fionser (Contributor) commented Jan 17, 2024

Try playing with https://github.com/privateLLM001/, which already integrates CUDA into the SEAL library to some extent.

@anakinxc (Contributor) commented Jan 17, 2024

Hi @BeStrongok

Based on our experience accelerating the ABY3 matmul with CUDA, the improvement might be marginal.

Consider that MPC protocols usually have tasks the GPU cannot handle, such as sending/receiving data over the network, so there is a lot of data movement between GPU and CPU, and I/O becomes a huge bottleneck. From some preliminary data collected from the ABY3 GPT-2 inference example, copying data to/from the GPU can take ~95% of the matmul time.

Another common issue is that MPC protocols usually work on integers like int64/int128; these types are not well optimized for computation on either CPU or GPU, and they lack support from libraries like cuBLAS.

But feel free to give it a shot :P
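
To make the data-movement point concrete, here is a standalone micro-benchmark sketch (illustrative only, not SPU code; buffer sizes and names are made up) that times host-to-device and device-to-host copies against a trivial elementwise kernel over uint64 shares. For a bandwidth-bound kernel like this, the PCIe copies usually dominate:

```cpp
// Hypothetical micro-benchmark: how much of the end-to-end time is
// spent moving data over PCIe versus actually computing on the GPU.
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Elementwise share addition in the ring Z_{2^64}; unsigned wraparound
// is exactly the modular addition MPC protocols need.
__global__ void add_shares(const uint64_t* a, const uint64_t* b,
                           uint64_t* c, size_t n) {
  size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
}

int main() {
  const size_t n = 1 << 24;  // ~16M elements, 128 MiB per buffer
  const size_t bytes = n * sizeof(uint64_t);
  std::vector<uint64_t> ha(n, 1), hb(n, 2), hc(n);
  uint64_t *da, *db, *dc;
  cudaMalloc(&da, bytes);
  cudaMalloc(&db, bytes);
  cudaMalloc(&dc, bytes);

  cudaEvent_t t0, t1, t2, t3;
  cudaEventCreate(&t0); cudaEventCreate(&t1);
  cudaEventCreate(&t2); cudaEventCreate(&t3);

  cudaEventRecord(t0);
  cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);
  cudaEventRecord(t1);
  add_shares<<<(n + 255) / 256, 256>>>(da, db, dc, n);
  cudaEventRecord(t2);
  cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost);
  cudaEventRecord(t3);
  cudaEventSynchronize(t3);

  float h2d = 0, kern = 0, d2h = 0;
  cudaEventElapsedTime(&h2d, t0, t1);
  cudaEventElapsedTime(&kern, t1, t2);
  cudaEventElapsedTime(&d2h, t2, t3);
  printf("H2D %.2f ms | kernel %.2f ms | D2H %.2f ms\n", h2d, kern, d2h);

  cudaFree(da); cudaFree(db); cudaFree(dc);
  return 0;
}
```

A real matmul kernel does more arithmetic per byte than this, but this three-way timing split is the kind of measurement behind the ~95% figure above.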

@BeStrongok (Author) commented Jan 17, 2024

> Try playing with https://github.com/privateLLM001/, which already integrates CUDA into the SEAL library to some extent.

Thanks for pointing out this repo :). I'm also trying to run some experiments applying the CUDA version of SEAL to Cheetah.

@BeStrongok (Author) commented Jan 17, 2024

> Hi @BeStrongok
>
> Based on our experience accelerating the ABY3 matmul with CUDA, the improvement might be marginal.
>
> Consider that MPC protocols usually have tasks the GPU cannot handle, such as sending/receiving data over the network, so there is a lot of data movement between GPU and CPU, and I/O becomes a huge bottleneck. From some preliminary data collected from the ABY3 GPT-2 inference example, copying data to/from the GPU can take ~95% of the matmul time.
>
> Another common issue is that MPC protocols usually work on integers like int64/int128; these types are not well optimized for computation on either CPU or GPU, and they lack support from libraries like cuBLAS.
>
> But feel free to give it a shot :P

Thank you for providing this useful information. :)
Yes, there may be a performance bottleneck in accelerating the secret-sharing protocols due to the frequent I/O, but accelerating the homomorphic encryption used in Cheetah may still be worthwhile.

@fionser (Contributor) commented Jan 17, 2024

> Try playing with https://github.com/privateLLM001/, which already integrates CUDA into the SEAL library to some extent.

> Thanks for pointing out this repo :). I'm also trying to run some experiments applying the CUDA version of SEAL to Cheetah.

I would expect key switching to run about 60x faster than a single-core CPU implementation if you put the whole key-switching logic onto the GPU. However, it might take "a little bit" of work to do so.

The GPU code in privateLLM001 is pretty simple, so the acceleration from their code will be very marginal.
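
For a sense of what "putting the whole key-switching logic onto the GPU" involves: after the forward NTTs, the hot loop of RNS key switching is just a pointwise modular multiply-accumulate per RNS limb, which maps well onto CUDA. Below is a minimal, hypothetical sketch (not taken from SEAL, seal-cuda, or SPU) using Shoup-style precomputed quotients; the modulus and test data are illustrative:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Shoup modular multiplication: returns x*w mod q given the precomputed
// quotient w_shoup = floor(w * 2^64 / q). Requires w < q and q < 2^63.
__device__ __forceinline__ uint64_t mulmod_shoup(uint64_t x, uint64_t w,
                                                 uint64_t w_shoup, uint64_t q) {
  uint64_t hi = __umul64hi(x, w_shoup);  // high 64 bits of x * w_shoup
  uint64_t r = x * w - hi * q;           // both products taken mod 2^64
  return r >= q ? r - q : r;             // r < 2q, one conditional subtract
}

// Pointwise multiply-accumulate in the NTT domain, the inner loop of
// RNS key switching for one limb: acc[i] = (acc[i] + ct[i] * key[i]) mod q.
__global__ void ks_mac(const uint64_t* ct, const uint64_t* key,
                       const uint64_t* key_shoup, uint64_t* acc,
                       uint64_t q, size_t n) {
  size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    uint64_t t = mulmod_shoup(ct[i], key[i], key_shoup[i], q);
    uint64_t s = acc[i] + t;  // both terms < q, so s < 2q < 2^64
    acc[i] = s >= q ? s - q : s;
  }
}

int main() {
  // 2^61 - 1 is prime; real schemes use NTT-friendly primes instead.
  const uint64_t q = (1ULL << 61) - 1;
  const size_t n = 1 << 16;
  const size_t bytes = n * sizeof(uint64_t);
  std::vector<uint64_t> ct(n), key(n), key_sh(n), acc(n, 0);
  for (size_t i = 0; i < n; ++i) {
    ct[i] = i % q;
    key[i] = (3 * i + 1) % q;
    // Host-side Shoup precomputation (needs 128-bit arithmetic).
    key_sh[i] = (uint64_t)(((unsigned __int128)key[i] << 64) / q);
  }

  uint64_t *dct, *dkey, *dsh, *dacc;
  cudaMalloc(&dct, bytes); cudaMalloc(&dkey, bytes);
  cudaMalloc(&dsh, bytes); cudaMalloc(&dacc, bytes);
  cudaMemcpy(dct, ct.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dkey, key.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dsh, key_sh.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dacc, acc.data(), bytes, cudaMemcpyHostToDevice);

  ks_mac<<<(n + 255) / 256, 256>>>(dct, dkey, dsh, dacc, q, n);
  cudaMemcpy(acc.data(), dacc, bytes, cudaMemcpyDeviceToHost);

  printf("acc[1] = %llu (expect %llu)\n", (unsigned long long)acc[1],
         (unsigned long long)((unsigned __int128)ct[1] * key[1] % q));
  cudaFree(dct); cudaFree(dkey); cudaFree(dsh); cudaFree(dacc);
  return 0;
}
```

The "a little bit" of work is everything around this loop: batched NTT/INTT kernels, modulus switching, and keeping ciphertexts resident on the GPU so the transfer overhead discussed above doesn't eat the speedup.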

@BeStrongok (Author) commented Jan 17, 2024

Thanks for your guidance. I just ran a benchmark comparing seal-cuda with SEAL on the BFV scheme, and the acceleration is obvious.
Original version:
[screenshot: SEAL BFV benchmark results]
CUDA version:
[screenshot: seal-cuda BFV benchmark results]
But I haven't implemented it in Cheetah yet; this work is indeed non-trivial and requires familiarity with both SEAL and CUDA.
I hope I can produce some useful results. :)

@fionser (Contributor) commented Jan 17, 2024

A less-than-10x speedup on RotateRows is not that impressive to me, since a 10-core CPU is much easier to get than an x100-class NVIDIA card.
