
ARM assembly support? #31

Open
tomleavy opened this issue Nov 3, 2020 · 23 comments

Comments

@tomleavy

tomleavy commented Nov 3, 2020

I'm working on a PoC for mobile devices and wondering whether anyone has attempted to translate the AVX2 code to something compatible with ARM64 CPUs?

@cryptojedi
Contributor

cryptojedi commented Nov 5, 2020 via email

@tomleavy
Author

tomleavy commented Nov 5, 2020

I don't have much experience with ASM coding, but maybe I can find some time to try and kick-start the process. I'm assuming an instruction-by-instruction port would be fine, unless there are any particular nuances I should know about?

@cryptojedi
Contributor

cryptojedi commented Nov 7, 2020 via email

@cothan

cothan commented Nov 7, 2020

Hi @cryptojedi and @tomleavy,

My NEON implementation is at https://github.com/cothan/kyber/tree/round3
The NTT is sped up by 3x; the overall speedup is 1.5x. I'm close to finishing the NEON implementation. The critical remaining component is Keccak; I have NEON code written for it, but my preliminary results show only a fractional speedup.

This is my NEON benchmark on ARMv8, Raspberry Pi 4 8 GB at 1.9 GHz (overclocked), for just the polynomial multiplication module in Kyber.

NTT: 
median: 72 cycles/ticks
average: 72 cycles/ticks

INVNTT: 
median: 115 cycles/ticks
average: 115 cycles/ticks

kyber_keypair: 
median: 7162 cycles/ticks
average: 7185 cycles/ticks

kyber_encaps: 
median: 8207 cycles/ticks
average: 8247 cycles/ticks

kyber_decaps: 
median: 8313 cycles/ticks
average: 8342 cycles/ticks

And this is the reference code on the same platform:

NTT: 
median: 281 cycles/ticks
average: 281 cycles/ticks

INVNTT: 
median: 358 cycles/ticks
average: 358 cycles/ticks

kyber_keypair: 
median: 11160 cycles/ticks
average: 11198 cycles/ticks

kyber_encaps: 
median: 12835 cycles/ticks
average: 12877 cycles/ticks

kyber_decaps: 
median: 14646 cycles/ticks
average: 14696 cycles/ticks

@tomleavy
Author

tomleavy commented Nov 7, 2020

Good stuff, @cothan. Will gladly utilize your version when it's ready.

@cryptojedi
Contributor

cryptojedi commented Nov 12, 2020 via email

@cothan

cothan commented Nov 12, 2020

> Just a question: the benchmarks below look more like a factor-4 speedup for the NTT and a factor-3 speedup for the inverse NTT. That's great, but not quite the factor of 6 you're mentioning. Am I missing something?

@cryptojedi
It's my bad; I've edited the message above. The NTT speedup is only 3x and the overall speedup is only 1.5x.
I'm at the very last step for Kyber: finishing the NEON rejection sampling (reject_sampling). There are a few if-else conditions where I'm not sure how to handle the permutation part (see the sketch below).
Anyway, once I finish that last step, I'm happy to create a PR to upstream.
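
For context, the scalar loop being vectorized looks roughly like this (a sketch close to the reference rej_uniform; the data-dependent accepts are what force a NEON version to compact lanes with a permutation):

#include <stdint.h>

/* Sketch of reference rejection sampling: unpack 12-bit candidates and keep
 * only those < q = 3329. The branchy accept/reject below is what a NEON
 * version has to turn into masked compaction (a permutation). */
static unsigned int rej_uniform(int16_t *r, unsigned int len,
                                const uint8_t *buf, unsigned int buflen)
{
  unsigned int ctr = 0, pos = 0;
  while (ctr < len && pos + 3 <= buflen) {
    uint16_t val0 = ((buf[pos+0] >> 0) | ((uint16_t)buf[pos+1] << 8)) & 0xFFF;
    uint16_t val1 = ((buf[pos+1] >> 4) | ((uint16_t)buf[pos+2] << 4)) & 0xFFF;
    pos += 3;
    if (val0 < 3329)                  /* accept: data-dependent */
      r[ctr++] = (int16_t)val0;
    if (ctr < len && val1 < 3329)     /* accept: data-dependent */
      r[ctr++] = (int16_t)val1;
  }
  return ctr;
}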

@cothan

cothan commented Nov 20, 2020

So I've mostly finished my implementation.
There are two functions left, poly_tomsg and poly_frommsg (see the sketch below). They won't affect the total execution time much, but they still need to be implemented.
I'm happy if anyone can help.
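
For anyone who wants to help: the scalar reference for poly_tomsg is roughly the following (a sketch assuming coefficients are already reduced to [0, q); poly_frommsg is the inverse, expanding each message bit to 0 or round(q/2)):

#include <stdint.h>

/* Sketch close to the reference poly_tomsg: compress each of the 256
 * coefficients to one bit, round(2*x/q) mod 2 with q = 3329, and pack
 * 8 bits per byte. Assumes coefficients already lie in [0, q). */
void poly_tomsg(uint8_t msg[32], const int16_t coeffs[256])
{
  for (unsigned int i = 0; i < 32; i++) {
    msg[i] = 0;
    for (unsigned int j = 0; j < 8; j++) {
      uint32_t t = (uint32_t)coeffs[8*i + j];
      t = (((t << 1) + 3329/2) / 3329) & 1;   /* round to nearest bit */
      msg[i] |= (uint8_t)(t << j);
    }
  }
}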

Okay, let's see the benchmarks for ./test_kyber1024 in neon/ vs. ref:

NEON:

gen_a: 
median: 2950 cycles/ticks
average: 2983 cycles/ticks

neon_poly_getnoise_eta1_2x/2: 
median: 138 cycles/ticks
average: 140 cycles/ticks

neon_poly_getnoise_eta2: 
median: 73 cycles/ticks
average: 73 cycles/ticks

poly_tomsg: 
median: 16 cycles/ticks
average: 17 cycles/ticks

poly_frommsg: 
median: 6 cycles/ticks
average: 6 cycles/ticks

NTT: 
median: 72 cycles/ticks
average: 72 cycles/ticks

INVNTT: 
median: 115 cycles/ticks
average: 116 cycles/ticks

kyber_keypair: 
median: 5917 cycles/ticks
average: 5960 cycles/ticks

kyber_encaps: 
median: 7137 cycles/ticks
average: 7168 cycles/ticks

kyber_decaps: 
median: 7171 cycles/ticks
average: 7206 cycles/ticks

kex_uake_initA: 
median: 13062 cycles/ticks
average: 13141 cycles/ticks

kex_uake_sharedB: 
median: 14374 cycles/ticks
average: 14437 cycles/ticks

kex_uake_sharedA: 
median: 7224 cycles/ticks
average: 7265 cycles/ticks

kex_ake_initA: 
median: 13059 cycles/ticks
average: 13128 cycles/ticks

kex_ake_sharedB: 
median: 21521 cycles/ticks
average: 21622 cycles/ticks

kex_ake_sharedA: 
median: 14408 cycles/ticks
average: 14487 cycles/ticks

C_REF:

gen_a: 
median: 3748 cycles/ticks
average: 3774 cycles/ticks

poly_getnoise_eta1: 
median: 69 cycles/ticks
average: 68 cycles/ticks

poly_getnoise_eta2: 
median: 69 cycles/ticks
average: 69 cycles/ticks

poly_tomsg: 
median: 61 cycles/ticks
average: 60 cycles/ticks

poly_frommsg: 
median: 6 cycles/ticks
average: 5 cycles/ticks

NTT: 
median: 280 cycles/ticks
average: 281 cycles/ticks

INVNTT: 
median: 358 cycles/ticks
average: 358 cycles/ticks

kyber_keypair: 
median: 10500 cycles/ticks
average: 10537 cycles/ticks

kyber_encaps: 
median: 12226 cycles/ticks
average: 12280 cycles/ticks

kyber_decaps: 
median: 13788 cycles/ticks
average: 13833 cycles/ticks

kex_uake_initA: 
median: 22755 cycles/ticks
average: 22854 cycles/ticks

kex_uake_sharedB: 
median: 26199 cycles/ticks
average: 26321 cycles/ticks

kex_uake_sharedA: 
median: 13870 cycles/ticks
average: 13944 cycles/ticks

kex_ake_initA: 
median: 22750 cycles/ticks
average: 22843 cycles/ticks

kex_ake_sharedB: 
median: 38417 cycles/ticks
average: 38825 cycles/ticks

kex_ake_sharedA: 
median: 27555 cycles/ticks
average: 27771 cycles/ticks

Benchmarked on ARMv8, Pi 4 8 GB at 1.9 GHz (overclocked).
My SHA3 implementation can be seen here: https://github.com/cothan/NEON-SHA3_2x. Too bad, the speedup is very minimal:
NEON Keccak-2x is only 2%-5% faster than the C reference implementation (compiled without -fno-tree-vectorize).

https://github.com/cothan/kyber/tree/round3

Next week I will create a PR.

@cryptojedi
Contributor

cryptojedi commented Nov 24, 2020 via email

@gregorseiler
Contributor

Hi Peter,

Let me give some perspective on this. The AVX2-optimized implementations of the (de)compression functions, including poly_{to,from}msg, are around 10x faster than the reference versions. This results in a 30% speed-up of indcpa_enc and a >60% speed-up of indcpa_dec after our other optimizations have been applied. In the KEM this difference gets drowned out because of the huge fraction of time spent on hashing the public key and ciphertext, due to our very conservative choice of CCA transform. But we'll soon see numbers from Intel CPUs with hardware support for SHA2 (and much higher AES throughput due to the new vector AES instructions, which speed up the matrix expansion and noise sampling).

Cheers,
Gregor

@cryptojedi
Contributor

cryptojedi commented Dec 8, 2020 via email

@BeechatNetworkSystemsLtd

BeechatNetworkSystemsLtd commented Mar 22, 2021

Hello @tomleavy @cryptojedi and @gregorseiler,

We have forked the Kyber repo and made ARM32 and ARM64 implementations of Kyber:
https://github.com/BeechatNetworkSystemsLtd/Kyber-ARM

We have also made a Java wrapper for Kyber:
https://github.com/BeechatNetworkSystemsLtd/JNIPQC

And we are currently working on Python bindings for Kyber as well.

We hope this helps, and we hope to contribute to the Kyber project and its implementations on multiple platforms.

Cheers,

The Beechat Network team

@cothan

cothan commented Mar 26, 2021

After iterating on the code, my NEON ARMv8 implementation is finally complete. I also have some benchmarks on the Apple M1; the speedup there is a bit better.

Kyber M1 NEON         neon Level 1   neon Level 3   neon Level 5
gen_matrix                   7,680         17,944         30,743
neon_ntt                       413            413            413
neon_invntt                    428            428            428
crypto_kem_keypair          22,958         36,342         55,951
crypto_kem_enc              32,522         49,134         71,579
crypto_kem_dec              29,412         45,100         67,026

Kyber M1 REF           ref Level 1    ref Level 3    ref Level 5
gen_matrix                  11,989         26,894         47,558
ref_ntt                      3,171          3,223          3,217
ref_invntt                   5,171          5,118          5,148
crypto_kem_keypair          59,622        105,058        163,075
crypto_kem_enc              76,513        120,766        175,568
crypto_kem_dec              90,254        138,813        198,509

EDIT: the unit is cycles.

I think this NEON version is ready; what else should I do to make a pull request?

@hanno-becker

@cothan Could you also report the cycle counts for your latest code running on your Raspberry Pis?

@cothan

cothan commented Apr 6, 2021

Kyber A72 NEON               neon Level 1   neon Level 3   neon Level 5
gen_matrix                         27,152         59,800        108,518
neon_poly_getnoise_eta1_2x          9,247          4,422          4,416
neon_poly_getnoise_eta2             2,679          2,679          2,683
poly_tomsg                            976            979            980
poly_frommsg                          810            816            815
neon_ntt                            1,496          1,473          1,476
neon_invntt                         1,659          1,661          1,657
crypto_kem_keypair                 72,015        116,390        184,965
crypto_kem_enc                     95,287        150,959        223,786
crypto_kem_dec                     94,069        149,845        220,671

Kyber A72 REF                 ref Level 1    ref Level 3    ref Level 5
gen_matrix                         28,921         70,117        127,776
ref_poly_getnoise_eta1              4,397          3,317          3,315
ref_poly_getnoise_eta2              3,311          3,314          3,317
poly_tomsg                            976            979            979
poly_frommsg                          810            815            817
ref_ntt                             8,496          8,500          8,551
ref_invntt                         12,530         12,533         12,409
crypto_kem_keypair                136,934        237,601        371,906
crypto_kem_enc                    184,533        298,928        440,645
crypto_kem_dec                    223,359        349,122        503,820

Yes, sure. I collected these results using a patched PAPI for the Cortex-A72: https://github.com/cothan/PAPI_ARMv8_Cortex_A72

Note: neon_poly_getnoise_eta1_2x includes my NEON SHA3-2x implementation on the Cortex-A72 and processes two polynomials at once; when comparing with ref_poly_getnoise_eta1 at Level 3 and Level 5, you have to divide its numbers by 2.

@marco-palumbi

@cothan Which OS and compiler did you use on the Raspberry Pi?

@cothan

cothan commented May 7, 2021

Hi @marco-palumbi,

Here is the configuration on my RPi 4:

cothan@manjaro-rp4 ~> uname -a
Linux manjaro-rp4 5.10.20-1-MANJARO-ARM #1 SMP PREEMPT Thu Mar 4 14:36:16 CST 2021 aarch64 GNU/Linux

cothan@manjaro-rp4 ~> gcc -v 
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-unknown-linux-gnu/10.2.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://github.com/archlinuxarm/PKGBUILDs/issues --enable-languages=c,c++,fortran,go,lto,objc,obj-c++,d --with-isl --with-linker-hash-style=gnu --with-system-zlib --enable-__cxa_atexit --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object --enable-install-libiberty --enable-linker-build-id --enable-lto --enable-plugin --enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch --disable-libunwind-exceptions --disable-multilib --disable-werror --host=aarch64-unknown-linux-gnu --build=aarch64-unknown-linux-gnu --with-arch=armv8-a --enable-fix-cortex-a53-835769 --enable-fix-cortex-a53-843419 gdc_include_dir=/usr/include/dlang/gdc
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.2.0 (GCC) 

cothan@manjaro-rp4 ~> clang -v 
clang version 11.1.0
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/aarch64-unknown-linux-gnu/10.2.0
Found candidate GCC installation: /usr/lib/gcc/aarch64-unknown-linux-gnu/10.2.0
Selected GCC installation: /usr/bin/../lib/gcc/aarch64-unknown-linux-gnu/10.2.0
Candidate multilib: .;@m64
Selected multilib: .;@m64

I used Clang for all my results.

@hanno-becker

Thanks a lot @cothan for sharing your work on developing optimized PQC implementations for Arm, it's great to see more interest and progress in this area.

Note that you can further improve performance by using the fixed-point doubling multiply-high VQDMULH to implement @gregorseiler's approach to Montgomery multiplication from Faster AVX2 optimized NTT multiplication for Ring-LWE lattice cryptography (use the doubling VQDMULH for the high multiply, VMUL for the low multiply, and the halving subtract VHSUB to compensate for the doubling). This reduces the Montgomery multiplication sequence from 9 to 5 instructions and should be a drop-in replacement for your fqmul macro. If you further precompute twisted constants following NTTRU: Truly Fast NTRU Using NTT, you'll get to 4 instructions.

Finally, you can use the fixed-point multiply-high-accumulate VQRDMLAH (v8.1-A onwards) to get to 3 instructions, but this is quite a bit trickier because of the carry-in mentioned in @gregorseiler's paper, as well as the rounding, doubling, and especially impacted range analysis in case of the Kyber prime.
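
For reference, the scalar Montgomery multiplication these sequences replace looks roughly like this (a sketch close to the Kyber reference code, with q = 3329 and QINV = 62209 = q^(-1) mod 2^16):

#include <stdint.h>

#define KYBER_Q 3329
#define QINV    62209  /* q^(-1) mod 2^16 */

/* Montgomery reduction: for a in (-q*2^15, q*2^15), returns a centered
 * representative of a * 2^(-16) mod q. */
static int16_t montgomery_reduce(int32_t a)
{
  int16_t t = (int16_t)((uint32_t)a * QINV);            /* low product mod 2^16 */
  return (int16_t)((a - (int32_t)t * KYBER_Q) >> 16);   /* high half of correction */
}

/* fqmul: multiply two coefficients and Montgomery-reduce the 32-bit product. */
static int16_t fqmul(int16_t a, int16_t b)
{
  return montgomery_reduce((int32_t)a * b);
}

The VQDMULH/VMUL/VHSUB sequence computes the same high-half arithmetic without ever materializing the 32-bit product.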

If you publish or present your code with those modifications, a mention would be appreciated. A cite-able paper on the use of fixed-point instructions for modular arithmetic -- which also applies to MVE, SVE and SVE2 -- will hopefully appear shortly.

@cothan

cothan commented Jun 2, 2021

Hi @hanno-arm,

Yes, you're right, that is the multiply instruction I'm looking for. I'm aware that a specialty of AVX2 is that it can multiply high or low in a single instruction. I looked for a similar instruction in the past, but due to time limits I didn't learn much about rounding and saturation.
I describe exactly the issue you mentioned in a submitted paper (mine is going to be published soon), and you've solved my bottleneck in fqmul. Please let me know when you publish your paper; I'm happy to cite it.
I have a precise Kyber range analysis that improves on Bas Westerbaan's paper When to Barrett reduce in the inverse NTT; I hope I can apply it to the new approach.
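
For reference, this is the Barrett reduction the range analysis decides when to apply (a sketch close to the round-3 reference barrett_reduce):

#include <stdint.h>

#define KYBER_Q 3329

/* Barrett reduction: returns a centered representative congruent to a mod q.
 * v = round(2^26 / q) is a compile-time constant. */
static int16_t barrett_reduce(int16_t a)
{
  const int16_t v = ((1 << 26) + KYBER_Q / 2) / KYBER_Q;      /* round(2^26 / q) */
  int16_t t = (int16_t)(((int32_t)v * a + (1 << 25)) >> 26);  /* t = round(a / q) */
  return a - t * KYBER_Q;
}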

Thank you very much. I will definitely mention you, and cite your work.

@hanno-becker

hanno-becker commented Jun 3, 2021

> I describe exactly the issue you mentioned in a submitted paper (mine is going to be published soon), and you've solved my bottleneck in fqmul.

Great! Here's a complete patch, by the way:

#define fqmul(out, in, zeta, t)                                                                  \
    t.val[0] = (int16x8_t)vqdmulhq_s16(in, zeta);                  /* (2*a)_H */                 \
    t.val[1] = (int16x8_t)vmulq_s16(in, zeta);                     /* a_L */                     \
    t.val[2] = vmulq_s16(t.val[1], neon_qinv);                     /* a_L = a_L * QINV */        \
    t.val[3] = (int16x8_t)vqdmulhq_s16(t.val[2], neon_kyberq);     /* (2*a_L*Q)_H */             \
    out = vhsubq_s16(t.val[0], t.val[3]);                          /* ((2*a)_H - (2*a_L*Q)_H)/2 */

This, of course, doesn't yet implement constant merging, though.
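
With merging it would look something like this (a hypothetical sketch: fqmul_merged and zeta_qinv are not in the patch above; zeta_qinv = zeta * QINV mod 2^16 is precomputed per twiddle, which makes the explicit multiply by neon_qinv unnecessary):

#define fqmul_merged(out, in, zeta, zeta_qinv, t)                              \
    t.val[0] = (int16x8_t)vqdmulhq_s16(in, zeta);              /* (2*a)_H */   \
    t.val[1] = vmulq_s16(in, zeta_qinv);                       /* a_L * QINV */\
    t.val[2] = (int16x8_t)vqdmulhq_s16(t.val[1], neon_kyberq); /* (2*a_L*Q)_H */\
    out = vhsubq_s16(t.val[0], t.val[2]);                      /* as before */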

It would be interesting to see what difference it actually makes on various CPUs. Since the bottleneck may be the boilerplate around the multiplication sequence rather than the sequence itself, the gain might not be as large as one would expect from the mere instruction count.

> Please let me know when you publish your paper; I'm happy to cite it.

Yes, will do!

@hanno-becker

hanno-becker commented Jun 3, 2021

> I'm aware that a specialty of AVX2 is that it can multiply high or low in a single instruction.

Note that MVE and SVE also have separate high/low multiply instructions. It is also noteworthy that in contrast to AVX2, those instructions aren't limited to 16-bit lanes, which is of interest e.g. for Dilithium or an NTT-based implementation of Saber.

When looking through your code, I noticed that you often duplicate twiddle factors across vectors. Have you considered loading multiple twiddles into a single vector and using lane-indexed instructions? This should work for layers 0-4 and save you some instructions (and free up some vectors).
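
Something along these lines (illustrative only; zetas_table and the index k are placeholders, and the lane-indexed intrinsics pick one 16-bit lane of the twiddle vector as the scalar operand):

/* Load 8 twiddles into one register instead of duplicating each across a
 * full vector with vdupq_n_s16; lane-indexed multiplies select the twiddle. */
int16x8_t zetas = vld1q_s16(&zetas_table[k]);      /* zetas_table, k: hypothetical */
int16x8_t t_hi  = vqdmulhq_laneq_s16(a, zetas, 0); /* (2*a*zeta)_H, twiddle in lane 0 */
int16x8_t t_lo  = vmulq_laneq_s16(a, zetas, 0);    /* (a*zeta)_L,   twiddle in lane 0 */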

@Ruofei-Li

@cothan Hello, is this "neon" version sensitive to big-endian vs. little-endian mode?
https://github.com/cothan/kyber/tree/round3

@cothan

cothan commented Jun 15, 2022

Hi @Ruofei-Li,

I only tested my code on the A72 and Apple M1, so I used the default endianness on those platforms (whichever that is).
Feel free to open an issue over there if you find bugs, or to add support for the other endianness.
