ARM assembly support? #31

tomleavy · 2020-11-03T15:43:00Z

I'm attempting to work on a PoC using mobile devices, wondering if anyone has attempted to translate the AVX2 code over to something that is compatible with ARM64 CPUs?

cryptojedi · 2020-11-05T06:24:01Z

Tom Leavy <notifications@github.com> wrote:

I'm attempting to work on a PoC using mobile devices, wondering if anyone has attempted to translate the AVX2 code over to something that is compatible with ARM64 CPUs?

Not yet as far as I know, but an ARM-NEON implementation is certainly on our TODO list.

tomleavy · 2020-11-05T15:58:59Z

Don't have much experience with ASM coding but maybe I can find some time to try and kick start the process. I'm assuming a command by command port would be fine, unless there are any particular nuances I should know about?

cryptojedi · 2020-11-07T12:56:28Z

Tom Leavy <notifications@github.com> wrote:

Don't have much experience with ASM coding but maybe I can find some time to try and kick start the process.

That would of course be great and highly appreciated!

I'm assuming a command by command port would be fine, unless there are any particular nuances I should know about?

That won't quite work. The AVX2 implementation uses 256-bit vector registers, ARM-NEON registers have a length of only 128 bits. Also, instructions operating on those registers are slightly different, in particular the multiplication instructions.

cothan · 2020-11-07T19:10:25Z

Hi @cryptojedi and @tomleavy,

I have my NEON implementation is at https://github.com/cothan/kyber/tree/round3
The NTT part is speed up by 3x. Overal speed up is 1.5x. I'm close to finish NEON implementation, the critical component is Keccak, I have my NEON code written for Keccak, my preliminary result for NEON Keccak is a fraction of speed up.

This is my NEON benchmark on ARMv8, Rasberry Pi 4 8Gb 1.9 Ghz (overclocked). Just Polynomial Multiplication module in Kyber.

NTT: 
median: 72 cycles/ticks
average: 72 cycles/ticks

INVNTT: 
median: 115 cycles/ticks
average: 115 cycles/ticks

kyber_keypair: 
median: 7162 cycles/ticks
average: 7185 cycles/ticks

kyber_encaps: 
median: 8207 cycles/ticks
average: 8247 cycles/ticks

kyber_decaps: 
median: 8313 cycles/ticks
average: 8342 cycles/ticks

And this is reference code on the same platform

NTT: 
median: 281 cycles/ticks
average: 281 cycles/ticks

INVNTT: 
median: 358 cycles/ticks
average: 358 cycles/ticks

kyber_keypair: 
median: 11160 cycles/ticks
average: 11198 cycles/ticks

kyber_encaps: 
median: 12835 cycles/ticks
average: 12877 cycles/ticks

kyber_decaps: 
median: 14646 cycles/ticks
average: 14696 cycles/ticks

tomleavy · 2020-11-07T19:55:14Z

Good stuff @cothan . Will gladly utilize your version when it's ready

cryptojedi · 2020-11-12T10:55:09Z

cothan <notifications@github.com> wrote:

Hi @cryptojedi and @tomleavy,

Hi @cothan,

I have my NEON implementation is at https://github.com/cothan/kyber/tree/round3 The NTT part is speed up by 6x. Overal speed up is 1.5x. I'm close to finish NEON implementation, the critical component is Keccak, I have my NEON code written for Keccak, my preliminary result for NEON Keccak is a fraction of speed up.

This sounds great! Would you be willing to send a PR to upstream? Just a question: the benchmarks below look more like a factor-4 speedup for the NTT and a factor-3 speedup for the inverse NTT. That's great, but not quite the factor of 6 you're mentioning. Am I missing something?

…

This is my NEON benchmark on ARMv8, Rasberry Pi 4 8Gb 1.9 Ghz. Just Polynomial Multiplication module in Kyber. ``` NTT: median: 72 cycles/ticks average: 72 cycles/ticks INVNTT: median: 115 cycles/ticks average: 115 cycles/ticks kyber_keypair: median: 7162 cycles/ticks average: 7185 cycles/ticks kyber_encaps: median: 8207 cycles/ticks average: 8247 cycles/ticks kyber_decaps: median: 8313 cycles/ticks average: 8342 cycles/ticks ``` And this is reference code on the same platform ``` NTT: median: 281 cycles/ticks average: 281 cycles/ticks INVNTT: median: 358 cycles/ticks average: 358 cycles/ticks kyber_keypair: median: 11160 cycles/ticks average: 11198 cycles/ticks kyber_encaps: median: 12835 cycles/ticks average: 12877 cycles/ticks kyber_decaps: median: 14646 cycles/ticks average: 14696 cycles/ticks ``` -- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: #31 (comment)

cothan · 2020-11-12T16:40:08Z

Just a question: the benchmarks below look more like a factor-4 speedup
for the NTT and a factor-3 speedup for the inverse NTT. That's great,
but not quite the factor of 6 you're mentioning. Am I missing something?

@cryptojedi
It's my bad, I've edited the message above. The NTT speedup is only 3x and overall speed up is only 1.5x.
I'm at very last step for Kyber, I'm trying to finish NEON reject_sampling, there is a few if-else condition that I'm not sure how to handle the permutation part.
Anyway, when I finish the last step, I'm happy to create a PR to upstream.

cothan · 2020-11-20T01:52:57Z

So I'm kinda finish my implementation.
There are 2 functions left, poly_tomsg and poly_frommsg. It won't affect much the total execution time, but anyway they need to be implemented.
I'm happy if anyone can help.

Okay, let's see the benchmark for ./test_kyber1024 in neon/ vs ref

.NEON

gen_a: 
median: 2950 cycles/ticks
average: 2983 cycles/ticks

neon_poly_getnoise_eta1_2x/2: 
median: 138 cycles/ticks
average: 140 cycles/ticks

neon_poly_getnoise_eta2: 
median: 73 cycles/ticks
average: 73 cycles/ticks

poly_tomsg: 
median: 16 cycles/ticks
average: 17 cycles/ticks

poly_frommsg: 
median: 6 cycles/ticks
average: 6 cycles/ticks

NTT: 
median: 72 cycles/ticks
average: 72 cycles/ticks

INVNTT: 
median: 115 cycles/ticks
average: 116 cycles/ticks

kyber_keypair: 
median: 5917 cycles/ticks
average: 5960 cycles/ticks

kyber_encaps: 
median: 7137 cycles/ticks
average: 7168 cycles/ticks

kyber_decaps: 
median: 7171 cycles/ticks
average: 7206 cycles/ticks

kex_uake_initA: 
median: 13062 cycles/ticks
average: 13141 cycles/ticks

kex_uake_sharedB: 
median: 14374 cycles/ticks
average: 14437 cycles/ticks

kex_uake_sharedA: 
median: 7224 cycles/ticks
average: 7265 cycles/ticks

kex_ake_initA: 
median: 13059 cycles/ticks
average: 13128 cycles/ticks

kex_ake_sharedB: 
median: 21521 cycles/ticks
average: 21622 cycles/ticks

kex_ake_sharedA: 
median: 14408 cycles/ticks
average: 14487 cycles/ticks

.C_REF

gen_a: 
median: 3748 cycles/ticks
average: 3774 cycles/ticks

poly_getnoise_eta1: 
median: 69 cycles/ticks
average: 68 cycles/ticks

poly_getnoise_eta2: 
median: 69 cycles/ticks
average: 69 cycles/ticks

poly_tomsg: 
median: 61 cycles/ticks
average: 60 cycles/ticks

poly_frommsg: 
median: 6 cycles/ticks
average: 5 cycles/ticks

NTT: 
median: 280 cycles/ticks
average: 281 cycles/ticks

INVNTT: 
median: 358 cycles/ticks
average: 358 cycles/ticks

kyber_keypair: 
median: 10500 cycles/ticks
average: 10537 cycles/ticks

kyber_encaps: 
median: 12226 cycles/ticks
average: 12280 cycles/ticks

kyber_decaps: 
median: 13788 cycles/ticks
average: 13833 cycles/ticks

kex_uake_initA: 
median: 22755 cycles/ticks
average: 22854 cycles/ticks

kex_uake_sharedB: 
median: 26199 cycles/ticks
average: 26321 cycles/ticks

kex_uake_sharedA: 
median: 13870 cycles/ticks
average: 13944 cycles/ticks

kex_ake_initA: 
median: 22750 cycles/ticks
average: 22843 cycles/ticks

kex_ake_sharedB: 
median: 38417 cycles/ticks
average: 38825 cycles/ticks

kex_ake_sharedA: 
median: 27555 cycles/ticks
average: 27771 cycles/ticks

Benchmark on ARMv8, Pi 4 8Gb 1.9 Ghz (overclocked).
My SHA3 implementation can be seen here: https://github.com/cothan/NEON-SHA3_2x. Too bad, the speed up is very minimal.
NEON Keccak-2x is only 5%-2% faster than C reference implementation, C reference implementation is compiled without -fno-tree-vectorize

https://github.com/cothan/kyber/tree/round3

Next week I will create a PR.

cryptojedi · 2020-11-24T07:08:22Z

cothan <notifications@github.com> wrote:

So I'm kinda finish my implementation. There are 2 functions left, `poly_tomsg` and `poly_frommsg`. It won't affect much the total execution time, but anyway they need to be implemented.

Is there a reason for not using the implementations from the "ref" implementation?

gregorseiler · 2020-12-07T16:40:36Z

Hi Peter,

Let me give some perspective on this. The avx2 optimized implementations of the (de)compression functions including poly_{to,from}msg are around 10x faster than the reference versions. This results in a 30% speed-up of indcpa_enc and a >60% speed-up of indcpa_dec after our other optimizations have been applied. In the KEM this difference gets drowned out because of the huge fraction of time spend on hashing the public key and ciphertext due to our very conservative choice of CCA transform. But we'll soon see numbers from Intel CPUs with hardware support for SHA2 (and much higher AES throughput due to the new vector AES instructions that speed-up the matrix expansion and noise sampling).

Cheers,
Gregor

cryptojedi · 2020-12-08T16:21:59Z

Gregor Seiler <notifications@github.com> wrote:

Hi Peter,

Hi Gregor,

Let me give some perspective on this. The avx2 optimized implementations of the (de)compression functions including poly_{to,from}msg are around 10x faster than the reference versions. This results in a 30% speed-up of indcpa_enc and a >60% speed-up of indcpa_dec after our other optimizations have been applied. In the KEM this difference gets drowned out because of the huge fraction of time spend on hashing the public key and ciphertext due to our very conservative choice of CCA transform. But we'll soon see numbers from Intel CPUs with hardware support for SHA2 (and much higher AES throughput due to the new vector AES instructions that speed-up the matrix expansion and noise sampling).

Oh, didn't mean to say that we don't want an optimized implementation eventually; I was just wondering if there is any reason that the existing C implementation wouldn't work. Cheers, Peter

BeechatNetworkSystemsLtd · 2021-03-22T12:51:08Z

Hello @tomleavy @cryptojedi and @gregorseiler,

We have made a fork of the Kyber repo and made an ARM32 and ARM64 implementation of Kyber:
https://github.com/BeechatNetworkSystemsLtd/Kyber-ARM

We have also made a Java wrapper for Kyber:
https://github.com/BeechatNetworkSystemsLtd/JNIPQC

And we are currently working on a Python bindings for Kyber as we speak as well.

We hope this helps and to contribute to the Kyber project as well as its implementation on multiple platforms.

Cheers,

The Beechat Network team

cothan · 2021-03-26T23:11:35Z

After iteration of code, finally my NEON ARMv8 code is complete. I also have some benchmarks on Apple M1. The speed up is a bit better.

Kyber M1 NEON	neon Level 1	neon Level 3	neon Level 5
gen_matrix	7,680	17,944	30,743
neon_ntt	413	413	413
neon_invntt	428	428	428
crypto_kem_keypair	22,958	36,342	55,951
crypto_kem_enc	32,522	49,134	71,579
crypto_kem_dec	29,412	45,100	67,026

Kyber M1 REF	ref Level 1	ref Level 3	ref Level 5
gen_matrix	11,989	26,894	47,558
ref_ntt	3,171	3,223	3,217
ref_invntt	5,171	5,118	5,148
crypto_kem_keypair	59,622	105,058	163,075
crypto_kem_enc	76,513	120,766	175,568
crypto_kem_dec	90,254	138,813	198,509

EDIT: the unit is in cycles.

I think this neon version is ready, what else should I do to make a pull request?

hanno-becker · 2021-04-06T15:07:51Z

@cothan Could you also report the CCs for your latest code running on your Raspberry Pi's?

cothan · 2021-04-06T15:47:13Z

Kyber A72 NEON	neon Level 1	neon Level 3	neon Level 5
gen_matrix	27,152	59,800	108,518
neon_poly_getnoise_eta1_2x	9,247	4,422	4,416
neon_poly_getnoise_eta2	2,679	2,679	2,683
poly_tomsg	976	979	980
poly_frommsg	810	816	815
neon_ntt	1,496	1,473	1,476
neon_invntt	1,659	1,661	1,657
crypto_kem_keypair	72,015	116,390	184,965
crypto_kem_enc	95,287	150,959	223,786
crypto_kem_dec	94,069	149,845	220,671

Kyber A72 REF	ref Level 1	ref Level 3	ref Level 5
gen_matrix	28,921	70,117	127,776
ref_poly_getnoise_eta1	4,397	3,317	3,315
ref_poly_getnoise_eta2	3,311	3,314	3,317
poly_tomsg	976	979	979
poly_frommsg	810	815	817
ref_ntt	8,496	8,500	8,551
ref_invntt	12,530	12,533	12,409
crypto_kem_keypair	136,934	237,601	371,906
crypto_kem_enc	184,533	298,928	440,645
crypto_kem_dec	223,359	349,122	503,820

Yes, sure, I conduct the result by using a patch of PAPI for Cortex-A72. https://github.com/cothan/PAPI_ARMv8_Cortex_A72

Note: neon_poly_getnoise_eta1_2x includes NEON implementation of NEON_SHA2x on Cortex-A72, when compare with ref_poly_getnoise_eta1 in Level 3 and Level 5, you have to divide by 2.

marco-palumbi · 2021-05-07T13:58:07Z

@cothan which OS and compiler did you use on Raspberry PI?

cothan · 2021-05-07T16:58:52Z

Hi @marco-palumbi ,

Here is the configuration in my RP4:

cothan@manjaro-rp4 ~> uname -a
Linux manjaro-rp4 5.10.20-1-MANJARO-ARM #1 SMP PREEMPT Thu Mar 4 14:36:16 CST 2021 aarch64 GNU/Linux

cothan@manjaro-rp4 ~> gcc -v 
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-unknown-linux-gnu/10.2.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://github.com/archlinuxarm/PKGBUILDs/issues --enable-languages=c,c++,fortran,go,lto,objc,obj-c++,d --with-isl --with-linker-hash-style=gnu --with-system-zlib --enable-__cxa_atexit --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object --enable-install-libiberty --enable-linker-build-id --enable-lto --enable-plugin --enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch --disable-libunwind-exceptions --disable-multilib --disable-werror --host=aarch64-unknown-linux-gnu --build=aarch64-unknown-linux-gnu --with-arch=armv8-a --enable-fix-cortex-a53-835769 --enable-fix-cortex-a53-843419 gdc_include_dir=/usr/include/dlang/gdc
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.2.0 (GCC) 

cothan@manjaro-rp4 ~> clang -v 
clang version 11.1.0
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/aarch64-unknown-linux-gnu/10.2.0
Found candidate GCC installation: /usr/lib/gcc/aarch64-unknown-linux-gnu/10.2.0
Selected GCC installation: /usr/bin/../lib/gcc/aarch64-unknown-linux-gnu/10.2.0
Candidate multilib: .;@m64
Selected multilib: .;@m64

I use Clang across all my results.

hanno-becker · 2021-06-02T20:03:08Z

Thanks a lot @cothan for sharing your work on developing optimized PQC implementations for Arm, it's great to see more interest and progress in this area.

Note that you can further improve performance by using the fixed-point doubling multiply-high VQDMULH to implement @gregorseiler's approach to Montgomery multiplication from Faster AVX2 optimized NTT multiplication for Ring-LWE lattice cryptography (use the doubling VQDMULH for the high multiply, VMUL for the low multiply, and the halving subtract VHSUB to compensate for the doubling). This reduces the Montgomery multiplication sequence from 9 to 5 instructions and should be a drop-in replacement for your fqmul macro. If you further precompute twisted constants following NTTRU: Truly Fast NTRU Using NTT, you'll get to 4 instructions.

Finally, you can use the fixed-point multiply-high-accumulate VQRDMLAH (v8.1-A onwards) to get to 3 instructions, but this is quite a bit trickier because of the carry-in mentioned in @gregorseiler's paper, as well as the rounding, doubling, and especially impacted range analysis in case of the Kyber prime.

If you publish or present your code with those modifications, a mention would be appreciated. A cite-able paper on the use of
fixed point instructions for modular arithmetic -- which also applies to MVE, SVE and SVE2 -- will hopefully appear shortly.

cothan · 2021-06-02T20:43:59Z

Hi @hanno-arm,

Yes, you're right. It is the multiply instruction I'm looking for. I'm fully aware that the specialty of AVX2 that they can multiply high or low in single instruction. I tried to look similar instruction in the past, due to the time limit, so I don't know much about rounding and saturating.
I declare exact the issue you mentioned in a submitted paper (mine is going to be published soon), and you solve my bottleneck in fqmul. Please let me know your paper after you publish it, I am happy to cite it.
I had a precise Kyber range analysis, improve upon the paper of Bas Westerbaan: When to Barrett reduce in the inverse NTT, I hope I can apply it to the new approach.

Thank you very much. I will definitely mention you, and cite your work.

hanno-becker · 2021-06-03T04:44:02Z

I declare exact the issue you mentioned in a submitted paper (mine is going to be published soon), and you solve my bottleneck in fqmul.

Great! Here's a complete patch, by the way:

#define fqmul(out, in, zeta, t)                                                                  \
    t.val[0] = (int16x8_t)vqdmulhq_s16(in, zeta);                  /* (2*a)_H */                 \ 
    t.val[1] = (int16x8_t)vmulq_s16(in, zeta);                     /* a_L */                     \
    t.val[2] = vmulq_s16(t.val[1], neon_qinv);                     /* a_L = a_L * QINV */        \
    t.val[3] = (int16x8_t)vqdmulhq_s16(t.val[2], neon_kyberq);     /* (2*a_L*Q)_H */             \
    out = vhsubq_s16(t.val[0], t.val[3]);                          /* ((2*a)_H - (2*a_L*Q)_H)/2 */

This, of course, doesn't yet implement constant merging, though.

It would be interesting see what difference it actually makes on various CPUs. Since the bottleneck may be the multiplication sequence, rather than the boilerplate around it, it might not be as large as one would expect from the mere instruction count.

Please let me know your paper after you publish it, I am happy to cite it.

Yes, will do!

hanno-becker · 2021-06-03T06:58:59Z

I'm fully aware that the specialty of AVX2 that they can multiply high or low in single instruction

Note that MVE and SVE also have separate high/low multiply instructions. It is also noteworthy that in contrast to AVX2, those instructions aren't limited to 16-bit lanes, which is of interest e.g. for Dilithium or an NTT-based implementation of Saber.

When looking through your code, I noticed that you often duplicate twiddle factors across vectors. Have you considered loading multiple twiddles into a single vector and using lane-indexed instructions? This should work for layers 0-4 and save you some instructions (and free up some vectors).

Ruofei-Li · 2022-06-14T10:45:30Z

@cothan hello，is this "neon" version sensitive to big-endian and little-endian mode?
https://github.com/cothan/kyber/tree/round3

cothan · 2022-06-15T04:44:57Z

Hi @Ruofei-Li,

I only test my code in A72 and Apple M1 so I use the default endian on such platform (whatever endian it is).
You can feel free to make an issue over there if you find it bugs, or support mismatch endian.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARM assembly support? #31

ARM assembly support? #31

tomleavy commented Nov 3, 2020

cryptojedi commented Nov 5, 2020 via email

tomleavy commented Nov 5, 2020

cryptojedi commented Nov 7, 2020 via email

cothan commented Nov 7, 2020 •

edited

tomleavy commented Nov 7, 2020

cryptojedi commented Nov 12, 2020 via email

cothan commented Nov 12, 2020

cothan commented Nov 20, 2020

cryptojedi commented Nov 24, 2020 via email

gregorseiler commented Dec 7, 2020

cryptojedi commented Dec 8, 2020 via email

BeechatNetworkSystemsLtd commented Mar 22, 2021 •

edited

cothan commented Mar 26, 2021 •

edited

hanno-becker commented Apr 6, 2021

cothan commented Apr 6, 2021

marco-palumbi commented May 7, 2021

cothan commented May 7, 2021

hanno-becker commented Jun 2, 2021

cothan commented Jun 2, 2021

hanno-becker commented Jun 3, 2021 •

edited

hanno-becker commented Jun 3, 2021 •

edited

Ruofei-Li commented Jun 14, 2022

cothan commented Jun 15, 2022

ARM assembly support? #31

ARM assembly support? #31

Comments

tomleavy commented Nov 3, 2020

cryptojedi commented Nov 5, 2020 via email

tomleavy commented Nov 5, 2020

cryptojedi commented Nov 7, 2020 via email

cothan commented Nov 7, 2020 • edited

tomleavy commented Nov 7, 2020

cryptojedi commented Nov 12, 2020 via email

cothan commented Nov 12, 2020

cothan commented Nov 20, 2020

cryptojedi commented Nov 24, 2020 via email

gregorseiler commented Dec 7, 2020

cryptojedi commented Dec 8, 2020 via email

BeechatNetworkSystemsLtd commented Mar 22, 2021 • edited

cothan commented Mar 26, 2021 • edited

hanno-becker commented Apr 6, 2021

cothan commented Apr 6, 2021

marco-palumbi commented May 7, 2021

cothan commented May 7, 2021

hanno-becker commented Jun 2, 2021

cothan commented Jun 2, 2021

hanno-becker commented Jun 3, 2021 • edited

hanno-becker commented Jun 3, 2021 • edited

Ruofei-Li commented Jun 14, 2022

cothan commented Jun 15, 2022

cothan commented Nov 7, 2020 •

edited

BeechatNetworkSystemsLtd commented Mar 22, 2021 •

edited

cothan commented Mar 26, 2021 •

edited

hanno-becker commented Jun 3, 2021 •

edited

hanno-becker commented Jun 3, 2021 •

edited