Add AdvSimd in ComponentProcessor #2429

stefannikolei · 2023-04-02T20:52:11Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
I have provided test coverage for my change (where applicable)

Description

Added Arm intrinsics in the ComponentProcessor.

Only SumHorizontal is not ported to Arm.

src/ImageSharp/Formats/Jpeg/Components/Encoder/ComponentProcessor.cs

JimBobSquarePants · 2023-04-04T13:21:27Z

Only SumHorizontal is not ported to Arm.

Are you hitting an API wall here?

stefannikolei · 2023-04-04T16:13:03Z

Only SumHorizontal is not ported to Arm.

Are you hitting an API wall here?

I had no clue how to implement Permute4x64 to Arm

JimBobSquarePants · 2023-04-04T23:35:58Z

Only SumHorizontal is not ported to Arm.

Are you hitting an API wall here?

I had no clue how to implement Permute4x64 to Arm

Yeah, that's got me stumped also. @tannergooding is there anything we can do here?

stefannikolei · 2023-04-05T10:22:00Z

Only SumHorizontal is not ported to Arm.

Are you hitting an API wall here?

I had no clue how to implement Permute4x64 to Arm

Yeah, that's got me stumped also. @tannergooding is there anything we can do here?

Another Api which I have no clue how to port is Sse.Shuffle.

If I knew how to write that in AdvSimd, then I could go forward porting other methods.

tannergooding · 2023-04-05T13:51:20Z

Another Api which I have no clue how to port is Sse.Shuffle.

Depending on exactly how you're shuffling, you want one of:

ExtractVector64/ExtractVector128
TransposeEven/TransposeOdd
UnzipEven/UnzipOdd
ZipHigh/ZipLow

There is also the more powerful:

VectorTableLookup
VectorTableLookupExtension

Extract extracts Count sequential elements from two inputs. So with ExtractVector64 over ushort you'd act as though the two input vectors left and right are one contiguous Vector128<ushort>, with right being the "upper" half. If you had specified index: 3 then you'd grab (left[3], right[0], right[1], right[2]). With ExtractVector128 over ushort the same index would give (left[3], left[4], left[5], left[6], left[7], right[0], right[1], right[2]).

Transpose interleaves alternating even/odd numbered pairs. So with TransposeEven you get (left[0], right[0], left[2], right[2], ...). With TransposeOdd you get (left[1], right[1], left[3], right[3], ...)

Unzip reads corresponding even/odd numbered elements. So with UnzipEven you get (left[0], left[2], ..., right[0], right[2], ...). With UnzipOdd you get (left[1], left[3], ..., right[1], right[3], ...)

Zip reads adjacent vector elements from the lower/upper halves. So with ZipLow you get (left[0], right[0], left[1], right[1], ...). With ZipHigh you get (left[4], right[4], left[5], right[5], ...). The 4 assumes you're operating on Vector128<ushort>, where Count == 8; in general the starting index is Count / 2.
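The element orderings described above can be modelled in a few lines. This is a minimal plain-Python sketch of the semantics only, not the .NET API: vectors are lists, and the function names (`extract`, `transpose_even`, and so on) are hypothetical stand-ins for the intrinsics.

```python
# Plain-Python model of the element orderings described above.
# Vectors are lists; `left` and `right` must have the same length (Count).

def extract(left, right, index):
    """Take Count sequential elements from left followed by right, starting at index."""
    count = len(left)
    return (left + right)[index:index + count]

def transpose_even(left, right):
    """Interleave the even-numbered elements of both inputs."""
    return [x for i in range(0, len(left), 2) for x in (left[i], right[i])]

def transpose_odd(left, right):
    """Interleave the odd-numbered elements of both inputs."""
    return [x for i in range(1, len(left), 2) for x in (left[i], right[i])]

def unzip_even(left, right):
    """Even-numbered elements of left, then even-numbered elements of right."""
    return left[0::2] + right[0::2]

def unzip_odd(left, right):
    """Odd-numbered elements of left, then odd-numbered elements of right."""
    return left[1::2] + right[1::2]

def zip_low(left, right):
    """Interleave the lower halves of both inputs."""
    half = len(left) // 2
    return [x for pair in zip(left[:half], right[:half]) for x in pair]

def zip_high(left, right):
    """Interleave the upper halves of both inputs, starting at Count / 2."""
    half = len(left) // 2
    return [x for pair in zip(left[half:], right[half:]) for x in pair]

# Eight-element example, as for Vector128<ushort>.
left = [f"L{i}" for i in range(8)]
right = [f"R{i}" for i in range(8)]

print(extract(left, right, 3))      # ['L3', 'L4', 'L5', 'L6', 'L7', 'R0', 'R1', 'R2']
print(transpose_even(left, right))  # ['L0', 'R0', 'L2', 'R2', 'L4', 'R4', 'L6', 'R6']
print(unzip_odd(left, right))       # ['L1', 'L3', 'L5', 'L7', 'R1', 'R3', 'R5', 'R7']
print(zip_high(left, right))        # ['L4', 'R4', 'L5', 'R5', 'L6', 'R6', 'L7', 'R7']
```

Writing the desired output order down this way first can help with picking which of the four families matches a given Sse.Shuffle control.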

VectorTableLookup works much like Ssse3.Shuffle. It operates purely on 8-bit elements and allows you to select any index from the input on a per-element basis. So you could choose (table[0], table[0], ..., table[0]) the entire way, you could reverse with (table[7], table[6], ..., table[0]), etc. The main difference is in how an out-of-range index is handled. Ssse3.Shuffle has one of two behaviors: if the most significant bit of the index is "clear", it masks off the upper bits and uses the remaining low bits as the index, while if the most significant bit is "set", the resulting value for that element is 0. For VectorTableLookup, on the other hand, any out-of-range index produces 0 for that element.

VectorTableLookupExtension is basically the same premise. The difference is that rather than setting the resulting value to 0 for "out of range", it instead selects a default value from a different vector.
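The three out-of-range behaviors are easier to compare side by side. The following is a rough plain-Python model, assuming a single 16-byte table. The function names and the `defaults` parameter are illustrative and are not the .NET signatures.

```python
# Plain-Python model of the three byte-lookup behaviors described above.
# `table` is a list of 16 byte values; `indexes` is a list of byte indexes (0..255).

def ssse3_shuffle(table, indexes):
    """High bit set gives 0; otherwise only the low 4 bits of the index are used."""
    return [0 if i & 0x80 else table[i & 0x0F] for i in indexes]

def vector_table_lookup(table, indexes):
    """Any out-of-range index gives 0 for that element."""
    return [table[i] if i < len(table) else 0 for i in indexes]

def vector_table_lookup_extension(defaults, table, indexes):
    """Any out-of-range index keeps the element at the same position in `defaults`."""
    return [table[i] if i < len(table) else d for d, i in zip(defaults, indexes)]

table = list(range(100, 116))       # table[0] == 100 ... table[15] == 115
indexes = [0, 15, 16, 0x80]         # two in range, two out of range
defaults = [1, 2, 3, 4]

print(ssse3_shuffle(table, indexes))                            # [100, 115, 100, 0]
print(vector_table_lookup(table, indexes))                      # [100, 115, 0, 0]
print(vector_table_lookup_extension(defaults, table, indexes))  # [100, 115, 3, 4]
```

Index 16 is where the behaviors diverge: the x86 shuffle wraps it to 0, the plain lookup zeroes the element, and the extension form falls back to the default.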

VectorTableLookup and VectorTableLookupExtension in .NET 5/6/7 only support one vector in the table. In .NET 8 (should be preview 4), we support 2, 3, or 4 input vectors in the table. This can be useful for doing things like Matrix4x4.Transpose in 4 instructions, as an example.

Yeah, that's got me stumped also. @tannergooding is there anything we can do here?

Depends on exactly how you're permuting the 4x doubles, but you can at worst use 2x VectorTableLookupExtension today.
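One possible reading of that suggestion, sketched in plain Python rather than real intrinsics: treat the 256-bit value as two 16-byte halves and build each output half with two single-table lookups. The first lookup pulls bytes from the low half. The second is an extension-style lookup on the high half that keeps the first result wherever its index is out of range. The helper names are hypothetical, and the sketch assumes a 4-entry control of 64-bit element indexes in the style of Permute4x64.

```python
# Sketch: a 4x64-bit permute built from single-table byte lookups (Python model).

def tbl(table, indexes):
    """Model of a table lookup: out-of-range index gives 0."""
    return [table[i] if i < len(table) else 0 for i in indexes]

def tbx(defaults, table, indexes):
    """Model of the extension lookup: out-of-range index keeps the default."""
    return [table[i] if i < len(table) else d for d, i in zip(defaults, indexes)]

def byte_mask(element_indexes, size):
    """Expand element indexes into byte indexes (index * size, then the next size bytes)."""
    return [e * size + b for e in element_indexes for b in range(size)]

def permute_half(lo, hi, element_indexes, size=8):
    """Build one 16-byte output half from a 32-byte value split into lo and hi."""
    idx = byte_mask(element_indexes, size)      # byte indexes in 0..31
    from_lo = tbl(lo, idx)                      # indexes >= 16 give 0 here
    hi_idx = [(i - 16) & 0xFF for i in idx]     # byte subtraction; low-half indexes wrap out of range
    return tbx(from_lo, hi, hi_idx)             # fill in the bytes that live in hi

def permute4x64(lo, hi, control):
    """control holds four element indexes (0..3), one per output element."""
    return permute_half(lo, hi, control[:2]), permute_half(lo, hi, control[2:])

lo = list(range(0, 16))     # elements 0 and 1
hi = list(range(16, 32))    # elements 2 and 3
out_lo, out_hi = permute4x64(lo, hi, [0, 2, 1, 3])
print(out_lo)  # bytes of element 0, then element 2
print(out_hi)  # bytes of element 1, then element 3
```

Whether this is cheaper than a combination of Zip/Unzip/Extract depends on the specific control, so it is best treated as a fallback.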

JimBobSquarePants · 2023-04-07T23:22:05Z

Thanks Tanner!! That's gonna take me a few reads to get my head around 🤣

tannergooding · 2023-04-07T23:35:20Z

No worries, happy to provide additional suggestions and/or review if needed. Feel free to tag me :)

stefannikolei · 2023-04-11T17:23:18Z

Depends on exactly how you're permuting the 4x doubles, but you can at worst use 2x VectorTableLookupExtension today.

Am I missing something? VectorTableLookupExtension only supports byte and sbyte. Unfortunately this code (SumHorizontal) is working on floats.

tannergooding · 2023-04-11T17:28:52Z

Yes, because it does things bytewise, much as Ssse3.Shuffle. You have to move bytes in groups of 4 if you want to do it for float/int/uint.

That is, if you wanted to pick 3, 2, 1, 0, you'd use a shuffle mask of 12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3. Basically, take each element index, scale it by the element size (4 in this case), and grab that byte plus the following ones, size bytes in total. This assumes you started with float and need to shuffle as byte. If you're instead starting with bytes and want floats, you may have to account for endianness as well.
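The scaling rule can be written as a small helper. This is a plain-Python sketch with a hypothetical function name; it assumes little-endian float storage when checking the result.

```python
import struct

def byte_shuffle_mask(element_indexes, element_size):
    """For each element index, emit index * size followed by the next size - 1 byte indexes."""
    return [index * element_size + offset
            for index in element_indexes
            for offset in range(element_size)]

mask = byte_shuffle_mask([3, 2, 1, 0], 4)
print(mask)  # [12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3]

# Apply the byte mask to four floats stored little-endian and reinterpret the result.
raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
shuffled = bytes(raw[i] for i in mask)
print(struct.unpack("<4f", shuffled))  # (4.0, 3.0, 2.0, 1.0)
```

Because whole 4-byte groups are moved intact, the byte order inside each float is preserved and the values come out reversed rather than corrupted.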

stefannikolei · 2023-04-12T11:01:28Z

@JimBobSquarePants I think I will try to tackle the missing method in a different PR. I probably need much more time to first understand what Permute4x64 does and then work out an equivalent on Arm.

Do you know of any Benchmarks which cover the ComponentProcessor?

JimBobSquarePants · 2023-04-12T11:30:23Z

@stefannikolei Yeah I'm happy for that to be separate. There's a lot of figuring out to do to implement.

Do you know of any Benchmarks which cover the ComponentProcessor?

Not to my knowledge no.

stefannikolei added 2 commits April 2, 2023 17:24

Add Arm for MultiplyToAverage

c190951

Add Arm for SumVertical

03a988b

gfoidl reviewed Apr 3, 2023

src/ImageSharp/Formats/Jpeg/Components/Encoder/ComponentProcessor.cs

add DebugGuard to check for multiple of 8

4571325

Merge branch 'main' into stefannikolei/arm/componentconverter

574ec8a

Merge branch 'main' into stefannikolei/arm/componentconverter

8a68c67

JimBobSquarePants approved these changes Apr 12, 2023

JimBobSquarePants merged commit e646c4b into SixLabors:main Apr 12, 2023
7 checks passed

stefannikolei deleted the stefannikolei/arm/componentconverter branch April 12, 2023 12:13
