Restructure `sum` for better auto-vectorization for floats #4560

simonvandel · 2023-07-22T13:27:31Z

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Restructure the code for sum to allow for better auto-vectorization.
The code in the simd module is very similar, but it depends on packed_simd.
The auto-vectorized non-null case now has the same performance as the simd feature impl. See benchmarks.
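
As a rough illustration of the restructuring described above (not the code from this PR), the sketch below sums a slice into several independent accumulators so that the compiler has a chance to map them onto SIMD lanes. The name `sum_lanes` and the fixed `f32` element type are assumptions made for the example, and `LANES` must be at least 1.

```rust
/// Sums `values` using `LANES` independent accumulators.
/// Hypothetical helper for illustration; `LANES` must be >= 1.
pub fn sum_lanes<const LANES: usize>(values: &[f32]) -> f32 {
    let mut acc = [0.0f32; LANES];
    let mut chunks = values.chunks_exact(LANES);

    // Each accumulator slot only ever sees elements at the same offset
    // within a chunk, so the slots do not depend on each other.
    for chunk in &mut chunks {
        for i in 0..LANES {
            acc[i] += chunk[i];
        }
    }

    // Reduce the accumulators, then add the tail that did not fill a chunk.
    let mut sum: f32 = acc.iter().sum();
    for v in chunks.remainder() {
        sum += *v;
    }
    sum
}
```

Because the elements are added in a different order than in a plain sequential loop, the floating-point result can differ slightly from the sequential sum.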

I didn't manage to make the null case quite as fast as the simd feature impl, but it's pretty close.
If I/someone else manages to also make the null case identical, I think we can remove the simd version altogether to remove duplicated code.
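
For the null case, one possible shape (a sketch under assumptions, not the implementation in this PR) is to select either the value or `0.0` per element based on a validity bit, which avoids a data-dependent branch in the inner loop. Here the validity bitmap is assumed to be a plain `&[u64]` with one bit per element, least significant bit first; `sum_masked` and the choice of 16 lanes are hypothetical.

```rust
/// Sums the valid elements of `values`; bit `i % 64` of `validity[i / 64]`
/// is assumed to be 1 when element `i` is valid. Hypothetical helper.
pub fn sum_masked(values: &[f32], validity: &[u64]) -> f32 {
    assert!(validity.len() * 64 >= values.len(), "validity bitmap too short");
    const LANES: usize = 16;
    let mut acc = [0.0f32; LANES];

    // One 64-bit mask word covers one chunk of up to 64 values.
    for (chunk, &word) in values.chunks(64).zip(validity) {
        for (j, lane_chunk) in chunk.chunks(LANES).enumerate() {
            let bits = word >> (j * LANES);
            for (i, &v) in lane_chunk.iter().enumerate() {
                // Select instead of skipping: null slots contribute 0.0,
                // so whatever value sits in a null slot is ignored.
                let valid = (bits >> i) & 1 == 1;
                acc[i] += if valid { v } else { 0.0 };
            }
        }
    }
    acc.iter().sum()
}
```

This sketch returns `0.0` when every element is null; a caller that needs to distinguish "all null" from a zero sum would have to check the null count separately.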

Benchmarks:

Before:

[svs@nixos:~/code/arrow-rs]$ RUSTFLAGS='-C target-cpu=native' cargo +nightly bench --bench aggregate_kernels "sum"
    Finished bench [optimized] target(s) in 0.10s
     Running benches/aggregate_kernels.rs (target/release/deps/aggregate_kernels-da08a889b5821ed5)
sum 512                 time:   [404.54 ns 405.84 ns 407.31 ns]

Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

sum nulls 512           time:   [222.33 ns 223.23 ns 224.51 ns]

Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe

After:

[svs@nixos:~/code/arrow-rs]$ RUSTFLAGS='-C target-cpu=native' cargo +nightly bench --features="simd" --bench aggregate_kernels "sum"
    Finished bench [optimized] target(s) in 0.11s
     Running benches/aggregate_kernels.rs (target/release/deps/aggregate_kernels-20ec62142ba71a42)
sum 512                 time:   [30.901 ns 31.125 ns 31.385 ns]

Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

sum nulls 512           time:   [72.911 ns 74.171 ns 75.687 ns]

Found 19 outliers among 100 measurements (19.00%)
  4 (4.00%) low mild
  3 (3.00%) high mild
  12 (12.00%) high severe


[svs@nixos:~/code/arrow-rs]$ RUSTFLAGS='-C target-cpu=native' cargo +nightly bench --bench aggregate_kernels "sum"
    Finished bench [optimized] target(s) in 0.10s
     Running benches/aggregate_kernels.rs (target/release/deps/aggregate_kernels-da08a889b5821ed5)
sum 512                 time:   [27.895 ns 27.952 ns 28.018 ns]

Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  9 (9.00%) high severe

sum nulls 512           time:   [79.906 ns 80.048 ns 80.211 ns]

Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe

Are there any user-facing changes?

Faster implementation of sum.
Non-null case is more than 10x faster.
Null case is around 3x faster.


jhorstmann · 2023-07-22T15:18:02Z

Interesting, I can reproduce the benchmark results on my machine (i9-11900KB with AVX-512), the version without simd feature is even slightly faster. Very nice improvement!

With target-cpu=skylake there is still a small difference for the nullable version, 70ns vs 56ns with simd feature. The packed_simd code is not taking full advantage of avx512 mask registers and therefore runs at the same speed when targeting either skylake or native.

What cpu did you run your benchmarks on?

simonvandel · 2023-07-22T16:16:59Z

> Interesting, I can reproduce the benchmark results on my machine (i9-11900KB with AVX-512), the version without simd feature is even slightly faster. Very nice improvement!

Great! Do you mean that both the non-null and null version are competitive with the simd feature on your machine?

> What cpu did you run your benchmarks on?

It's an i7-10750H. I used the following rustc: rustc 1.73.0-nightly (0308df23e 2023-07-21)

jhorstmann · 2023-07-23T16:28:31Z

These are my results, simd feature is a tiny bit ahead, but maybe not enough to justify the additional code complexity.

$ RUSTFLAGS='-C target-cpu=native' cargo +nightly bench --bench aggregate_kernels "sum"
sum 512                 time:   [20.207 ns 20.244 ns 20.285 ns]
sum nulls 512           time:   [58.970 ns 59.000 ns 59.035 ns]

$ RUSTFLAGS='-C target-cpu=native' cargo +nightly bench --features simd --bench aggregate_kernels "sum"
sum 512                 time:   [17.095 ns 17.107 ns 17.120 ns]
sum nulls 512           time:   [56.853 ns 56.887 ns 56.925 ns]

simonvandel · 2023-07-23T19:43:51Z

I'll let you/others decide if we should replace the simd feature impl with the code in this PR. And if so, should this be done in this PR, or another one?

I'll push a commit today or tomorrow that resolves the todo, picking a proper value for LANES based on T.

simonvandel · 2023-07-25T19:12:57Z

I tried expanding the benchmarks in f472f3f, and then comparing before this PR (but with f472f3f) and this PR:

$ RUSTFLAGS='-C target-cpu=native' cargo +nightly bench --bench aggregate_kernels "sum" -- --baseline=before
    Finished bench [optimized] target(s) in 0.10s
     Running benches/aggregate_kernels.rs (target/release/deps/aggregate_kernels-da08a889b5821ed5)
sum 512 u8 no nulls     time:   [17.561 ns 17.569 ns 17.577 ns]
                        change: [+151.69% +157.94% +162.51%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

sum 512 u8 50% nulls    time:   [672.19 ns 672.84 ns 673.62 ns]
                        change: [+254.36% +255.68% +257.08%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

sum 512 ts_millis no nulls
                        time:   [158.65 ns 158.72 ns 158.80 ns]
                        change: [+443.21% +444.17% +445.21%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  6 (6.00%) high severe

sum 512 ts_millis 50% nulls
                        time:   [84.507 ns 84.543 ns 84.577 ns]
                        change: [-56.741% -56.521% -56.356%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

sum 512 f32 no nulls    time:   [28.857 ns 28.886 ns 28.920 ns]
                        change: [-93.041% -92.961% -92.887%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

sum 512 f32 50% nulls   time:   [87.285 ns 87.330 ns 87.380 ns]
                        change: [-61.882% -61.786% -61.684%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

This is still on an i7-10750H, and rustc 1.73.0-nightly (0308df23e 2023-07-21).

Interestingly, the speedups are only for f32 with/without nulls, and ts_millis with nulls. For all others, it's not a speedup.
@jhorstmann can you reproduce?

In any case, some more investigation into the regressions is needed before this can be merged.

tustvold · 2023-07-25T21:13:52Z

Marking as draft whilst we work out the details. Feel free to mark ready for review when you would like me to take another look.

FWIW when I was playing with this a few days ago in Godbolt, I found that nightly did a better job optimising the code than stable, and this was borne out by benchmarks. We may be in the unfortunate territory of LLVM being temperamental.

simonvandel · 2023-07-30T17:37:19Z

I couldn't find a single implementation that would speed up both integer and floating point numbers, so I decided to have an implementation for both.
This should keep the current performance for all integer types, but give significant speed-ups for floating points.
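
One plausible reason for the split (an assumption, not something stated in this thread) is that floating-point addition is not associative, so the compiler is not free to reorder a sequential float sum, whereas wrapping integer addition gives the same result under any grouping. The hypothetical helpers below just make that difference observable.

```rust
/// Returns `(a + b) + c` and `a + (b + c)` for f64 inputs.
pub fn float_groupings(a: f64, b: f64, c: f64) -> (f64, f64) {
    ((a + b) + c, a + (b + c))
}

/// Returns the same two groupings using wrapping i32 addition.
pub fn int_groupings(a: i32, b: i32, c: i32) -> (i32, i32) {
    (
        a.wrapping_add(b).wrapping_add(c),
        a.wrapping_add(b.wrapping_add(c)),
    )
}
```

With `a = 1e20`, `b = -1e20`, `c = 1.0` the two float groupings differ, because `-1e20 + 1.0` rounds back to `-1e20`. The integer groupings agree even when an intermediate sum wraps.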

Marked ready to review.

tustvold

Left some comments, thank you for sticking with this.

I'm not sure what your area of focus is, but FWIW following the recent improvements to grouping in DataFusion, it no longer actually uses these kernels as it now performs aggregation of all groups at once

tustvold · 2023-07-30T20:10:07Z

arrow-arith/src/aggregate.rs

+        | DataType::Decimal128(_, _)
+        | DataType::Decimal256(_, _) => match T::lanes() {


Why is decimal here?

tustvold · 2023-07-30T20:12:27Z

arrow/benches/aggregate_kernels.rs

+    sum_min_max_bench::<TimestampMillisecondType>(c, 512, 0.0, "ts_millis no nulls");
+    sum_min_max_bench::<TimestampMillisecondType>(c, 512, 0.5, "ts_millis 50% nulls");


FWIW arithmetic on timestamps, as this does, is not especially meaningful: adding two timestamps doesn't yield a timestamp. DurationMillisecondType might be more meaningful.

tustvold · 2023-07-30T20:15:26Z

arrow-array/src/numeric.rs

-pub trait ArrowNumericType: ArrowPrimitiveType {}
+pub trait ArrowNumericType: ArrowPrimitiveType {
+    /// The number of SIMD lanes available
+    fn lanes() -> usize;


It feels a little off to define this for all the types, but then only use it for a special case of floats 🤔

tustvold · 2023-07-30T20:16:04Z

arrow-arith/src/aggregate.rs

+        | DataType::Decimal256(_, _) => match T::lanes() {
+            1 => sum_impl_floating::<T, 1>(array),
+            2 => sum_impl_floating::<T, 2>(array),
+            4 => sum_impl_floating::<T, 4>(array),


It occurs to me that we have 3 floating point types, we could just dispatch to sum_impl_floating with the appropriate constant specified, without needing ArrowNumericType?
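
A minimal sketch of that suggestion, assuming a const-generic helper and lane counts chosen purely for illustration (16 for `f32`, 8 for `f64`, as if targeting 512-bit vectors). The body of `sum_impl_floating` here is a stand-in rather than the PR's code, and the third float type (`f16`) is left out because it needs an external crate.

```rust
use std::ops::Add;

/// Stand-in lane-wise sum; `LANES` must be >= 1.
pub fn sum_impl_floating<T, const LANES: usize>(values: &[T]) -> T
where
    T: Copy + Default + Add<Output = T>,
{
    let mut acc = [T::default(); LANES];
    let mut chunks = values.chunks_exact(LANES);
    for chunk in &mut chunks {
        for i in 0..LANES {
            acc[i] = acc[i] + chunk[i];
        }
    }
    let mut sum = T::default();
    for a in acc.iter() {
        sum = sum + *a;
    }
    for &v in chunks.remainder() {
        sum = sum + v;
    }
    sum
}

/// The lane count is picked at each call site, so no `lanes()` trait
/// method is needed. The constants 16 and 8 are assumptions.
pub fn sum_f32(values: &[f32]) -> f32 {
    sum_impl_floating::<f32, 16>(values)
}

pub fn sum_f64(values: &[f64]) -> f64 {
    sum_impl_floating::<f64, 8>(values)
}
```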

tustvold · 2023-07-30T20:40:44Z

arrow-arith/src/aggregate.rs

@@ -285,44 +285,178 @@ where
        return None;
    }

-    let data: &[T::Native] = array.values();
+    fn sum_impl_integer<T>(array: &PrimitiveArray<T>) -> Option<T::Native>


FWIW if you changed the signature to

fn sum_impl_integer<T: ArrowNativeType>(values: &[T], nulls: Option<&NullBuffer>) -> Option<T>

It would potentially save on codegen, as it would be instantiated per native type not per primitive type
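
To illustrate the codegen point, the sketch below is generic over the native value type only, so two logical types that share a native type (for example an `i64`-backed integer array and an `i64`-backed timestamp array) would reuse a single instantiation. The `WrappingAdd` trait and the `Option<&[bool]>` validity argument are hypothetical stand-ins for `ArrowNativeType` and `Option<&NullBuffer>`.

```rust
/// Hypothetical stand-in for the native-type trait used in arrow.
pub trait WrappingAdd: Copy + Default {
    fn add_wrapping(self, rhs: Self) -> Self;
}

impl WrappingAdd for i64 {
    fn add_wrapping(self, rhs: Self) -> Self {
        self.wrapping_add(rhs)
    }
}

impl WrappingAdd for u8 {
    fn add_wrapping(self, rhs: Self) -> Self {
        self.wrapping_add(rhs)
    }
}

/// Sums native values; `valid[i] == false` marks element `i` as null.
/// Returns `None` when there is nothing valid to sum.
pub fn sum_native<T: WrappingAdd>(values: &[T], valid: Option<&[bool]>) -> Option<T> {
    match valid {
        None => {
            if values.is_empty() {
                return None;
            }
            Some(values.iter().fold(T::default(), |acc, &v| acc.add_wrapping(v)))
        }
        Some(valid) => {
            assert_eq!(values.len(), valid.len());
            if !valid.iter().any(|&ok| ok) {
                return None;
            }
            let sum = values
                .iter()
                .zip(valid)
                .fold(T::default(), |acc, (&v, &ok)| {
                    if ok { acc.add_wrapping(v) } else { acc }
                });
            Some(sum)
        }
    }
}
```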

tustvold · 2023-07-30T20:41:12Z

arrow-arith/src/aggregate.rs

-                            sum = sum.add_wrapping(*value);
+    }
+
+    fn sum_impl_floating<T, const LANES: usize>(


Same comment as above

tustvold · 2023-07-30T20:45:44Z

arrow-arith/src/aggregate.rs

+        }
+    }
+
+    match T::DATA_TYPE {


This match block is kind of grim, but I don't have a better solution off the top of my head... Perhaps some sort of trait 🤔


tustvold · 2023-09-05T15:04:32Z

Marking this as a draft to make clear it isn't awaiting review, feel free to unmark when you would like me to take another look

tustvold · 2023-12-07T16:38:24Z

This code has been incorporated into #5100 and merged, thank you for starting this process

simonvandel added 3 commits July 22, 2023 12:24

almost same performance for null as well (72ns simd or 83ns for auto-…

0e41362

…vectorized)

more

4912ae3

simd sum

51be103

github-actions bot added the arrow Changes to the arrow crate label Jul 22, 2023

fix clippy

857fdd4

tustvold previously approved these changes Jul 24, 2023


simonvandel added 2 commits July 25, 2023 16:51

handle all number of lanes

4bc686a

expand benchmarks for min, max, sum

f472f3f

tustvold marked this pull request as draft July 25, 2023 21:12

simonvandel added 4 commits July 29, 2023 18:07

don't change code under simd feature flag

be00492

separate impl for floating and integers

94f7b18

make not pub

181c6da

remove unneeded mod

0a03c83

simonvandel marked this pull request as ready for review July 30, 2023 17:35

simonvandel changed the title ~~Restructure sum for better auto-vectorization~~ Restructure sum for better auto-vectorization for floats Jul 30, 2023

tustvold reviewed Jul 30, 2023


tustvold marked this pull request as draft September 5, 2023 15:04

tustvold mentioned this pull request Nov 2, 2023

Re-evaluate Explicit SIMD Aggregations #5032

Closed

jhorstmann mentioned this pull request Nov 19, 2023

Use Total Ordering for Aggregates and Refactor for Better Auto-Vectorization #5100

Merged

tustvold closed this Dec 7, 2023
