
[Feature]: CUDA acceleration for Cheetah protocol #503

Open · BeStrongok opened this issue Jan 17, 2024 · 7 comments

@BeStrongok commented Jan 17, 2024

Feature Request Type

Performance

Have you searched existing issues?

No

Is your feature request related to a problem?

No.

Describe features you want to add to SPU

Hi, SPU team:
The Cheetah protocol is a high-performance two-party inference protocol that currently runs on CPUs. I'm wondering whether there is a way to apply CUDA acceleration to this protocol,
for example to the matrix encoding step, or to the computation that produces the results for each modulus. Do you have any corresponding development plans?

@fionser (Contributor) commented Jan 17, 2024

Try playing with https://github.com/privateLLM001/, which already integrates CUDA into the SEAL library to some extent.

@anakinxc (Contributor) commented Jan 17, 2024

Hi @BeStrongok

Based on our experience accelerating the ABY3 matmul with CUDA, the improvement might be marginal.

Consider that MPC protocols usually have tasks the GPU cannot handle, such as sending/receiving data over the network, so there is a lot of data movement between GPU and CPU, and I/O becomes a huge bottleneck. From some preliminary data collected from the ABY3 GPT-2 inference example, copying data to/from the GPU can take ~95% of the matmul time.

Another common issue is that MPC protocols usually work on integers like int64/int128; these types are not well optimized for computation on either CPU or GPU, and they lack support from libraries like cuBLAS.

But feel free to give it a shot :P
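
To make the data-movement point concrete, here is a standalone micro-benchmark sketch (illustrative only, not SPU code; buffer sizes and names are made up) that times host-to-device and device-to-host copies against a trivial elementwise kernel over uint64 shares. For a bandwidth-bound kernel like this, the PCIe copies usually dominate:

```cpp
// Hypothetical micro-benchmark: how much of the end-to-end time is
// spent moving data over PCIe versus actually computing on the GPU.
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Elementwise share addition in the ring Z_{2^64}; unsigned wraparound
// is exactly the modular addition MPC protocols need.
__global__ void add_shares(const uint64_t* a, const uint64_t* b,
                           uint64_t* c, size_t n) {
  size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
}

int main() {
  const size_t n = 1 << 24;  // ~16M elements, 128 MiB per buffer
  const size_t bytes = n * sizeof(uint64_t);
  std::vector<uint64_t> ha(n, 1), hb(n, 2), hc(n);
  uint64_t *da, *db, *dc;
  cudaMalloc(&da, bytes);
  cudaMalloc(&db, bytes);
  cudaMalloc(&dc, bytes);

  cudaEvent_t t0, t1, t2, t3;
  cudaEventCreate(&t0); cudaEventCreate(&t1);
  cudaEventCreate(&t2); cudaEventCreate(&t3);

  cudaEventRecord(t0);
  cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice);
  cudaEventRecord(t1);
  add_shares<<<(n + 255) / 256, 256>>>(da, db, dc, n);
  cudaEventRecord(t2);
  cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost);
  cudaEventRecord(t3);
  cudaEventSynchronize(t3);

  float h2d = 0, kern = 0, d2h = 0;
  cudaEventElapsedTime(&h2d, t0, t1);
  cudaEventElapsedTime(&kern, t1, t2);
  cudaEventElapsedTime(&d2h, t2, t3);
  printf("H2D %.2f ms | kernel %.2f ms | D2H %.2f ms\n", h2d, kern, d2h);

  cudaFree(da); cudaFree(db); cudaFree(dc);
  return 0;
}
```

A real matmul kernel does more arithmetic per byte than this, but this three-way timing split is the kind of measurement behind the ~95% figure above.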

@BeStrongok (Author) commented Jan 17, 2024

> Try playing with https://github.com/privateLLM001/, which already integrates CUDA into the SEAL library to some extent.

Thanks for pointing out this repo :). I'm also trying to run some experiments applying the CUDA version of SEAL to Cheetah.

@BeStrongok (Author) commented Jan 17, 2024

> Hi @BeStrongok
>
> Based on our experience accelerating the ABY3 matmul with CUDA, the improvement might be marginal.
>
> Consider that MPC protocols usually have tasks the GPU cannot handle, such as sending/receiving data over the network, so there is a lot of data movement between GPU and CPU, and I/O becomes a huge bottleneck. From some preliminary data collected from the ABY3 GPT-2 inference example, copying data to/from the GPU can take ~95% of the matmul time.
>
> Another common issue is that MPC protocols usually work on integers like int64/int128; these types are not well optimized for computation on either CPU or GPU, and they lack support from libraries like cuBLAS.
>
> But feel free to give it a shot :P

Thank you for providing this useful information. :)
Yes, there may be a performance bottleneck in accelerating the secret-sharing protocols due to the frequent I/O, but accelerating the homomorphic encryption used in Cheetah may still be worthwhile.

@fionser (Contributor) commented Jan 17, 2024

> Try playing with https://github.com/privateLLM001/, which already integrates CUDA into the SEAL library to some extent.

> Thanks for pointing out this repo :). I'm also trying to run some experiments applying the CUDA version of SEAL to Cheetah.

I would expect key switching to run about 60x faster than a single-core CPU implementation if you put the whole key-switching logic onto the GPU. However, it might take "a little bit" of work to do so.

The GPU code in privateLLM001 is pretty simple, so the acceleration from their code will be very marginal.
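
For a sense of what "putting the whole key-switching logic onto the GPU" involves: after the forward NTTs, the hot loop of RNS key switching is just a pointwise modular multiply-accumulate per RNS limb, which maps well onto CUDA. Below is a minimal, hypothetical sketch (not taken from SEAL, seal-cuda, or SPU) using Shoup-style precomputed quotients; the modulus and test data are illustrative:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Shoup modular multiplication: returns x*w mod q given the precomputed
// quotient w_shoup = floor(w * 2^64 / q). Requires w < q and q < 2^63.
__device__ __forceinline__ uint64_t mulmod_shoup(uint64_t x, uint64_t w,
                                                 uint64_t w_shoup, uint64_t q) {
  uint64_t hi = __umul64hi(x, w_shoup);  // high 64 bits of x * w_shoup
  uint64_t r = x * w - hi * q;           // both products taken mod 2^64
  return r >= q ? r - q : r;             // r < 2q, one conditional subtract
}

// Pointwise multiply-accumulate in the NTT domain, the inner loop of
// RNS key switching for one limb: acc[i] = (acc[i] + ct[i] * key[i]) mod q.
__global__ void ks_mac(const uint64_t* ct, const uint64_t* key,
                       const uint64_t* key_shoup, uint64_t* acc,
                       uint64_t q, size_t n) {
  size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    uint64_t t = mulmod_shoup(ct[i], key[i], key_shoup[i], q);
    uint64_t s = acc[i] + t;  // both terms < q, so s < 2q < 2^64
    acc[i] = s >= q ? s - q : s;
  }
}

int main() {
  // 2^61 - 1 is prime; real schemes use NTT-friendly primes instead.
  const uint64_t q = (1ULL << 61) - 1;
  const size_t n = 1 << 16;
  const size_t bytes = n * sizeof(uint64_t);
  std::vector<uint64_t> ct(n), key(n), key_sh(n), acc(n, 0);
  for (size_t i = 0; i < n; ++i) {
    ct[i] = i % q;
    key[i] = (3 * i + 1) % q;
    // Host-side Shoup precomputation (needs 128-bit arithmetic).
    key_sh[i] = (uint64_t)(((unsigned __int128)key[i] << 64) / q);
  }

  uint64_t *dct, *dkey, *dsh, *dacc;
  cudaMalloc(&dct, bytes); cudaMalloc(&dkey, bytes);
  cudaMalloc(&dsh, bytes); cudaMalloc(&dacc, bytes);
  cudaMemcpy(dct, ct.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dkey, key.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dsh, key_sh.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(dacc, acc.data(), bytes, cudaMemcpyHostToDevice);

  ks_mac<<<(n + 255) / 256, 256>>>(dct, dkey, dsh, dacc, q, n);
  cudaMemcpy(acc.data(), dacc, bytes, cudaMemcpyDeviceToHost);

  printf("acc[1] = %llu (expect %llu)\n", (unsigned long long)acc[1],
         (unsigned long long)((unsigned __int128)ct[1] * key[1] % q));
  cudaFree(dct); cudaFree(dkey); cudaFree(dsh); cudaFree(dacc);
  return 0;
}
```

The "a little bit" of work is everything around this loop: batched NTT/INTT kernels, modulus switching, and keeping ciphertexts resident on the GPU so the transfer overhead discussed above doesn't eat the speedup.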

@BeStrongok (Author) commented Jan 17, 2024

Thanks for your guidance. I just ran a benchmark comparing seal-cuda with SEAL on the BFV scheme, and the acceleration is obvious.
Original version:
[screenshot: SEAL BFV benchmark results]
CUDA version:
[screenshot: seal-cuda BFV benchmark results]
But I haven't implemented it in Cheetah yet; this work is indeed non-trivial and requires familiarity with both SEAL and CUDA.
I hope I can produce some useful results. :)

@fionser (Contributor) commented Jan 17, 2024

A less-than-10x speedup on RotateRows is not that impressive to me, since a 10-core CPU is much easier to get than an x100-class NVIDIA card.
