Add AdvSimd in ComponentProcessor #2429

stefannikolei
Contributor

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practices as demonstrated in the repository. These follow strict StyleCop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Added Arm intrinsics in the ComponentProcessor.

Only SumHorizontal is not ported to Arm.

@JimBobSquarePants
Member

> Only SumHorizontal is not ported to Arm.

Are you hitting an API wall here?

@stefannikolei
Contributor Author

stefannikolei commented Apr 4, 2023

> > Only SumHorizontal is not ported to Arm.
>
> Are you hitting an API wall here?

I had no clue how to implement Permute4x64 to Arm

@JimBobSquarePants
Member

> > > Only SumHorizontal is not ported to Arm.
> >
> > Are you hitting an API wall here?
>
> I had no clue how to implement Permute4x64 to Arm

Yeah, that's got me stumped also. @tannergooding is there anything we can do here?

@stefannikolei
Contributor Author

> > > > Only SumHorizontal is not ported to Arm.
> > >
> > > Are you hitting an API wall here?
> >
> > I had no clue how to implement Permute4x64 to Arm
>
> Yeah, that's got me stumped also. @tannergooding is there anything we can do here?

Another API which I have no clue how to port is Sse.Shuffle.

If I knew how to write that in AdvSimd, then I could go forward porting other methods.

@tannergooding
Contributor

> Another API which I have no clue how to port is Sse.Shuffle.

Depending on exactly how you're shuffling, you want one of:

  • ExtractVector64/ExtractVector128
  • TransposeEven/TransposeOdd
  • UnzipEven/UnzipOdd
  • ZipHigh/ZipLow

There is also the more powerful:

  • VectorTableLookup
  • VectorTableLookupExtension

Extract extracts Count sequential elements from two inputs. So with ExtractVector64 over ushort you'd act as though the two input vectors left and right are one contiguous Vector128<ushort>, with right being the "upper". If you had specified index: 3 then you'd grab (left[3], right[0], right[1], right[2]). With ExtractVector128 over ushort, where Count == 8, the same index would give (left[3], left[4], left[5], left[6], left[7], right[0], right[1], right[2]).
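As a rough illustration of the extract behaviour described above, here is a plain Python model of the lane selection. It is only a sketch of the semantics, not the .NET API: the function name `extract` and its parameter names are made up for this example, and lists stand in for vectors.

```python
def extract(lower, upper, index):
    """Model an extract: treat lower + upper as one contiguous run and
    return len(lower) consecutive elements starting at `index`."""
    count = len(lower)
    if len(upper) != count or not 0 <= index < count:
        raise ValueError("inputs must be the same length and index in range")
    return (lower + upper)[index:index + count]


# Four ushort lanes per input (the 64-bit case).
left4 = [f"left[{i}]" for i in range(4)]
right4 = [f"right[{i}]" for i in range(4)]
print(extract(left4, right4, 3))

# Eight ushort lanes per input (the 128-bit case).
left8 = [f"left[{i}]" for i in range(8)]
right8 = [f"right[{i}]" for i in range(8)]
print(extract(left8, right8, 3))
```

The result always has as many lanes as one input, so the lane count depends on which vector width is being extracted.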

Transpose interleaves alternating even/odd numbered pairs. So with TransposeEven you get (left[0], right[0], left[2], right[2], ...). With TransposeOdd you get (left[1], right[1], left[3], right[3], ...)
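The transpose pattern can be modelled the same way. This is a hedged Python sketch of the lane order only; `transpose_even` and `transpose_odd` are hypothetical helper names, not the intrinsics themselves.

```python
def transpose_even(left, right):
    """Pair up the even-numbered lanes: left[0], right[0], left[2], right[2], ..."""
    out = []
    for i in range(0, len(left), 2):
        out += [left[i], right[i]]
    return out


def transpose_odd(left, right):
    """Pair up the odd-numbered lanes: left[1], right[1], left[3], right[3], ..."""
    out = []
    for i in range(1, len(left), 2):
        out += [left[i], right[i]]
    return out


left = ["L0", "L1", "L2", "L3"]
right = ["R0", "R1", "R2", "R3"]
print(transpose_even(left, right))
print(transpose_odd(left, right))
```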

Unzip reads corresponding even/odd numbered elements. So with UnzipEven you get (left[0], left[2], ..., right[0], right[2], ...). With UnzipOdd you get (left[1], left[3], ..., right[1], right[3], ...)
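A similar Python model for the unzip pattern, again only a sketch of the lane order with hypothetical helper names (`unzip_even`, `unzip_odd`):

```python
def unzip_even(left, right):
    """Even-numbered lanes of left, followed by even-numbered lanes of right."""
    return left[0::2] + right[0::2]


def unzip_odd(left, right):
    """Odd-numbered lanes of left, followed by odd-numbered lanes of right."""
    return left[1::2] + right[1::2]


left = ["L0", "L1", "L2", "L3"]
right = ["R0", "R1", "R2", "R3"]
print(unzip_even(left, right))
print(unzip_odd(left, right))
```

Unzip undoes an interleave: applying it to the two halves of interleaved data separates the two original streams.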

Zip reads adjacent vector elements from the lower/upper halves. So with ZipLow you get (left[0], right[0], left[1], right[1], ...). With ZipHigh you get (left[4], right[4], left[5], right[5], ...) (where 4 assumes you're operating on Vector128<ushort> and therefore the Count == 8. This is in practice Count / 2).
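The zip pattern, modelled in plain Python under the same assumptions (lists stand in for vectors, and `zip_low` / `zip_high` are names invented for this sketch):

```python
def zip_low(left, right):
    """Interleave the lower halves: left[0], right[0], left[1], right[1], ..."""
    half = len(left) // 2
    return [x for pair in zip(left[:half], right[:half]) for x in pair]


def zip_high(left, right):
    """Interleave the upper halves, starting at lane Count / 2."""
    half = len(left) // 2
    return [x for pair in zip(left[half:], right[half:]) for x in pair]


# Eight lanes per input, as for Vector128<ushort>.
left = [f"left[{i}]" for i in range(8)]
right = [f"right[{i}]" for i in range(8)]
print(zip_low(left, right))
print(zip_high(left, right))
```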

VectorTableLookup works much like Ssse3.Shuffle. It operates purely on 8-bit elements and allows you to select any index from the input on a per-element basis. So you could choose (table[0], table[0], ..., table[0]) the entire way, you could reverse with (table[7], table[6], ..., table[0]), etc. The main difference is in how out-of-range indices are handled. Ssse3.Shuffle has one of two behaviors: if the most significant bit of the index is "clear", it masks off the upper bits; if the most significant bit is "set", the resulting value for that element is 0. With VectorTableLookup, on the other hand, any out-of-range index produces 0 for that element.

VectorTableLookupExtension is basically the same premise. The difference is that rather than setting the resulting value to 0 for "out of range", it instead selects a default value from a different vector.
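To make the out-of-range behaviours concrete, here is a small Python model of the three lookups. It is a sketch of the semantics as described above, not the real intrinsics: the function and parameter names are invented, indices are plain integers in 0..255, and the Ssse3.Shuffle model assumes a 16-byte table.

```python
def ssse3_shuffle(table, indices):
    """Top bit set gives 0; otherwise only the low 4 bits of the index are used."""
    return [0 if idx & 0x80 else table[idx & 0x0F] for idx in indices]


def table_lookup(table, indices):
    """Any index past the end of the table gives 0."""
    return [table[idx] if idx < len(table) else 0 for idx in indices]


def table_lookup_extension(defaults, table, indices):
    """Any index past the end of the table keeps the matching default element."""
    return [table[idx] if idx < len(table) else d
            for d, idx in zip(defaults, indices)]


table = list(range(100, 116))       # 16 table bytes: 100..115
indices = [0, 15, 16, 0x80]         # two in range, two out of range
defaults = [1, 2, 3, 4]

print(ssse3_shuffle(table, indices))
print(table_lookup(table, indices))
print(table_lookup_extension(defaults, table, indices))
```

Index 16 is where the first two differ: the Ssse3.Shuffle model wraps it back to element 0, while the table lookup model returns 0.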

  • VectorTableLookup and VectorTableLookupExtension in .NET 5/6/7 only support one vector in the table. In .NET 8 (should be preview 4), we support 2, 3, or 4 input vectors in the table. This can be useful for doing things like Matrix4x4.Transpose in 4 instructions, as an example.

> Yeah, that's got me stumped also. @tannergooding is there anything we can do here?

Depends on exactly how you're permuting the 4x doubles, but you can at worst use 2x VectorTableLookupExtension today.
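One possible reading of this, sketched in Python: treat the four doubles as two 16-byte halves and build each output half from two single-table byte lookups, one per source half. This is an assumption about how the lookups could be combined, not code from the PR. All names are hypothetical, the control is given as four lane numbers rather than an immediate byte, and the first lookup uses a zeroing lookup where an extension lookup with zero defaults would behave the same.

```python
def tbl(table, indices):
    """Single-table lookup: out-of-range index gives 0."""
    return [table[i] if i < len(table) else 0 for i in indices]


def tbx(defaults, table, indices):
    """Extension lookup: out-of-range index keeps the default element."""
    return [table[i] if i < len(table) else d for d, i in zip(defaults, indices)]


def permute4x64(lo, hi, control):
    """lo holds lanes 0-1 and hi holds lanes 2-3, 8 bytes per lane.
    control is four source lane numbers (0..3), one per output lane."""
    halves = []
    for pair in (control[:2], control[2:]):
        # Byte indices 0..31 into the conceptual 32-byte source.
        idx = [lane * 8 + b for lane in pair for b in range(8)]
        from_lo = tbl(lo, idx)                     # bytes that live in lo
        idx_hi = [(i - 16) & 0xFF for i in idx]    # rebase onto hi; lo indices wrap out of range
        halves.append(tbx(from_lo, hi, idx_hi))    # fill in bytes that live in hi
    return halves[0], halves[1]


lo = list(range(0, 16))
hi = list(range(16, 32))
out_lo, out_hi = permute4x64(lo, hi, (3, 2, 1, 0))  # reverse the four lanes
print(out_lo)
print(out_hi)
```

Indices that point into the low half wrap to values of 240 and above after rebasing, so the extension lookup leaves those bytes as they were.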

@JimBobSquarePants
Member

Thanks Tanner!! That's gonna take me a few reads to get my head around 🤣

@tannergooding
Contributor

No worries, happy to provide additional suggestions and/or review if needed. Feel free to tag me :)

@stefannikolei
Contributor Author

> Depends on exactly how you're permuting the 4x doubles, but you can at worst use 2x VectorTableLookupExtension today.

Am I missing something? VectorTableLookupExtension only supports byte and sbyte. Unfortunately this code (SumHorizontal) is working on floats.

@tannergooding
Contributor

Yes, because it does things bytewise, much like Ssse3.Shuffle. You have to move bytes in groups of 4 if you want to do it for float/int/uint.

That is, if you wanted to pick 3, 2, 1, 0, you'd use a shuffle mask of 12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3. Basically, take each element index, scale it by the element size (4 in this case), and grab that byte and the next size - 1 bytes. This assumes you started with float and need to shuffle as byte. If you're instead starting with bytes and want floats, you may have to account for endianness as well.
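The index scaling can be checked with a short Python sketch. The helper name `byte_shuffle_mask` is invented for this example, and the float round trip uses little-endian packing as an assumption about the memory layout.

```python
import struct


def byte_shuffle_mask(element_indices, element_size=4):
    """Expand element indices into byte indices: index * size, then the
    following size - 1 bytes."""
    return [i * element_size + b
            for i in element_indices
            for b in range(element_size)]


mask = byte_shuffle_mask([3, 2, 1, 0])
print(mask)

# Apply the byte mask to four little-endian floats and read them back.
source = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
shuffled = bytes(source[i] for i in mask)
print(struct.unpack("<4f", shuffled))
```

Because whole 4-byte groups move together, the bytes inside each float keep their order and the values come back intact, just in reversed lane order.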

@stefannikolei
Contributor Author

@JimBobSquarePants I think I will try to tackle the missing method in a separate PR. I probably need much more time to first understand what Permute4x64 does and then work out the Arm equivalent.

Do you know of any benchmarks which cover the ComponentProcessor?

@JimBobSquarePants
Member

@stefannikolei Yeah, I'm happy for that to be separate. There's a lot of figuring out to do to implement it.

> Do you know of any benchmarks which cover the ComponentProcessor?

Not to my knowledge no.

@JimBobSquarePants JimBobSquarePants merged commit e646c4b into SixLabors:main Apr 12, 2023
7 checks passed
@stefannikolei stefannikolei deleted the stefannikolei/arm/componentconverter branch April 12, 2023 12:13