
VectorizedReduction

Note for Julia >= 1.11

Maintenance of this package will continue.

Clarification

The performance afforded by LoopVectorization (see the deprecation announcement for background) will be present so long as the test suites for LoopVectorization and VectorizedReduction pass on the relevant Julia version. At time of writing (2024-05-17), this corresponds to Julia versions 1.5 - 1.11 (inclusive).

See the tracking issue for 1.11 for more details.

Tentative support for the Julia master branch (1.12-DEV at time of writing) appears to be justified; please file an issue with an MWE and versioninfo() output if this package raises errors.

Installation

using Pkg
Pkg.add("VectorizedReduction")

Usage

This library provides "vectorized" (with/without multithreading) versions of the following functions:

  1. mapreduce and common derived functions: reduce, sum, prod, minimum, maximum, extrema
  2. count, any, all
  3. findmin, findmax, argmin, argmax
  4. logsumexp, softmax, logsoftmax ("safe" versions: avoid underflow/overflow)

The naming convention is as follows: a vectorized (without threading) version is prefixed by v, and a vectorized, multithreaded version is prefixed by vt. There is a single exception to this rule: the vectorized (without threading) versions of the functions listed in 1. are prefixed by vv in order to avoid name collisions with LoopVectorization and VectorizedStatistics.
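For instance (a minimal sketch; apart from vvmapreduce, vfindmax, vany, and vall, which appear elsewhere in this README, the names below are inferred from the convention and should be checked against the package's exported symbols):

using VectorizedReduction

A = rand(5, 5, 5, 5);

# functions from 1. use the vv prefix (vt when threaded)
vvsum(A, dims=(1, 3))                  # vectorized sum
vtsum(A, dims=(1, 3))                  # vectorized, multithreaded sum
vvmapreduce(abs2, +, A, dims=(2, 4))   # vectorized mapreduce

# everything else uses v / vt
vfindmax(A)                            # vectorized findmax over the whole array
vtfindmax(A, dims=(2, 4))              # vectorized, multithreaded findmax over dims 2 and 4
vlogsumexp(A, dims=4)                  # "safe" logsumexp over dim 4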

This library also provides other, less common, reductions (all of which follow the naming convention above):

  1. mapreducethen: Apply function f to each element of A, reduce the result over the dimensions dims using the binary function op, then apply g to the result
  2. distances: manhattan, euclidean, chebyshev, minkowski
  3. norms: norm, treating arbitrary slices via dims keyword
  4. deviances: counteq, countne, meanad, maxad, mse, rmse
  5. means: mean, geomean, harmmean
  6. entropies: crossentropy, shannonentropy, collisionentropy, minentropy, maxentropy, renyientropy
  7. divergences: kldivergence, gkldiv, renyidivergence
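As a rough sketch of these less common reductions (the names below are assumptions derived mechanically from the convention above, not quoted from the package's documentation; verify them against the exported names):

A, B = rand(5, 5, 5, 5), rand(5, 5, 5, 5);

vmean(A, dims=(1, 2))          # arithmetic mean over dims 1 and 2 (vtmean when threaded)
vgeomean(A, dims=3)            # geometric mean over dim 3
vmanhattan(A, B, dims=(1, 3))  # Manhattan (L₁) distance over dims 1 and 3
vmse(A, B)                     # mean squared error over the whole arrays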

Motivation

When writing numerical code, one may wish to perform a reduction, perhaps across multiple dimensions, as the most natural means of expressing the relevant mathematical operation. For readers well-acquainted with LoopVectorization.jl, the thought immediately comes to mind: writing out the loops will inevitably yield a large performance gain. Thus, in that never-ending pursuit of fast code, we write the loops -- but this produces specific code, tailored to the dimensions of the problem. Instead, we might have liked to write generic code, parameterized by an index set of dimensions. This package attempts to resolve this perpetual dilemma using metaprogramming. The hope is that, the next time one asks whether it is worthwhile to write the loops (gaining performance, losing genericity) or whether one can make do with the Base implementation, one can confidently reach for one of the "vectorized" versions provided by this package.

Now, before diving into other interesting topics, a few salient comments. Foremost, this package is not intended as a universal replacement for Base implementations of the same functions. The scope of functions provided by this package is limited to numerical code, and is subject to all restrictions imposed by LoopVectorization (the user is encouraged to familiarize themselves with said package). Moreover, there exist limitations of vfindmin and friends which are not yet resolved (a workaround is provided, but not without a performance cost). That being said, the user may find useful the expanded capabilities of vfindmin/vfindmax, which are not yet part of Base (an aspect the author hopes to rectify). The user is encouraged to read the additional commentary below, but it is not strictly necessary.

Commentary

Why provide dims directly, rather than as kwargs? (an optional performance enhancement for small arrays)

To define a multi-dimensional mapreduce, one specifies an index set of dimensions, dims, over which the reduction will occur. In Base, one passes dims using keyword arguments, e.g. sum(A, dims=(2,4)). However, this sometimes incurs an overhead of ~20 ns. Admittedly small, and in almost all cases negligible; but if one wishes to ensure that such costs are avoided, the interface also supports direct calls which provide init and dims as positional arguments.
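A sketch of the two call styles (the keyword form matches the examples later in this README; the positional argument order shown below is an assumption, so consult the method signatures, e.g. via methods(vvmapreduce)):

A = rand(5, 5, 5, 5);

# keyword form, as in Base (may pay the small keyword-argument overhead)
vvmapreduce(abs2, +, A, dims=(2, 4))

# positional form: init and dims supplied directly (argument order assumed here)
vvmapreduce(abs2, +, zero(eltype(A)), A, (2, 4))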

When to consider replacing any/all with vectorized versions

Assumptions: x is inherently unordered, such that the probability of a "success" is independent of the index (i.e. independent of a random permutation of the effective vector over which the reduction occurs). This is a very reasonable assumption for certain types of data, e.g. Monte Carlo simulations. However, for cases in which there is some inherent order (e.g. a solution to an ODE, or even very simply, cos.(-2π:0.1:2π)), the analysis below does not hold. If inherent order exists, then the probability of success depends on said ordering -- the caller must then exercise judgment based on where the first success might land (if at all); as this is inevitably problem-specific, I am unable to offer general advice.

For inherently unordered data: define p as the probability of "success", i.e. the probability of true w.r.t. any and the probability of false w.r.t. all. The probability that all n elements must be evaluated, i.e. the probability of zero successes, Pr(X = 0), is

binomial(n, 0) * p^0 * (1 - p)^(n - 0) # (1 - p)^n

Define a linear cost model with n the length of the vector: denote t₀ as the non-vectorized evaluation time per element, and tᵥ as the vectorized evaluation time per element. A crude estimate for the expected cost of the non-vectorized call (which can break out early on the first success) is therefore

t₀ * n * (1 - p)^n

The vectorized evaluation always touches all n elements, at cost tᵥ * n; comparing the two costs and dividing both sides by n, non-vectorized evaluation is optimal when

t₀ * (1 - p)^n < tᵥ

Or, rearranging: non-vectorized is optimal when p > 1 - (tᵥ/t₀)^(1/n). Intuitively, as (tᵥ/t₀) becomes smaller, larger p is needed to make the non-vectorized option optimal. Holding (tᵥ/t₀) constant, increasing n results in a rapid decrease in the p required for the non-vectorized option to be optimal. Consider the following examples, denoting r = (tᵥ/t₀)

julia> p(r, n) = 1 - r^(1/n)
p (generic function with 1 method)

julia> p.(.1, 10 .^ (1:4))
4-element Vector{Float64}:
 0.2056717652757185
 0.02276277904418933
 0.0022999361774467264
 0.0002302320018434667

julia> p.(.01, 10 .^ (1:4))
4-element Vector{Float64}:
 0.36904265551980675
 0.045007413978564004
 0.004594582648473011
 0.0004604109969121861

However, due to the current implementation details of Base any/all, early breakout occurs only when the reduction is carried out across the entire array (i.e. it does not occur when reducing over a subset of dimensions). Thus, the current advice is to use vany/vall whenever reducing over a subset of dimensions; when reducing over the entire array, weigh the p and n of one's problem as described above.
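For example (a sketch, assuming vany/vall accept a predicate and a dims keyword analogous to Base's any/all):

x = rand(10^4);

# whole-array reduction: Base can break out early, so weigh p and n before switching
any(>(0.999), x)
vany(>(0.999), x)

# reduction over a subset of dimensions: Base has no early breakout here,
# so the vectorized version is generally the better default
A = rand(10, 10, 10, 10);
all(<(0.999), A, dims=(1, 3))
vall(<(0.999), A, dims=(1, 3))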

Examples

Simple examples

A very simple comparison.

julia> A = rand(5,5,5,5);

julia> @benchmark mapreduce($abs2, $+, $A, dims=$(1,2,4))
BenchmarkTools.Trial: 10000 samples with 133 evaluations.
 Range (min … max):  661.038 ns … 139.234 μs  ┊ GC (min … max): 0.00% … 99.24%
 Time  (median):     746.880 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   798.069 ns ±   1.957 μs  ┊ GC (mean ± σ):  3.46% ±  1.40%

 Memory estimate: 368 bytes, allocs estimate: 8.

julia> @benchmark vvmapreduce($abs2, $+, $A, dims=$(1,2,4))
BenchmarkTools.Trial: 10000 samples with 788 evaluations.
 Range (min … max):  160.538 ns …  29.430 μs  ┊ GC (min … max):  0.00% … 99.11%
 Time  (median):     203.479 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   212.916 ns ± 761.848 ns  ┊ GC (mean ± σ):  10.68% ±  2.97%

 Memory estimate: 240 bytes, allocs estimate: 6.

julia> @benchmark extrema($A, dims=$(1,2))
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.813 μs …   5.827 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.990 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.039 μs ± 149.676 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 960 bytes, allocs estimate: 14.

julia> @benchmark vvextrema($A, dims=$(1,2))
BenchmarkTools.Trial: 10000 samples with 202 evaluations.
 Range (min … max):  381.743 ns … 86.288 μs  ┊ GC (min … max):  0.00% … 99.05%
 Time  (median):     689.658 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   712.113 ns ±  2.851 μs  ┊ GC (mean ± σ):  13.84% ±  3.43%

 Memory estimate: 1.19 KiB, allocs estimate: 8.

Varargs examples

These are somewhat standard fare, but can be quite convenient for expressing certain Bayesian computations.

julia> A1, A2, A3, A4 = rand(5,5,5,5), rand(5,5,5,5), rand(5,5,5,5), rand(5,5,5,5);

julia> @benchmark mapreduce($+, $+, $A1, $A2, $A3, $A4, dims=$(1,2,4))
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.597 μs …  1.181 ms  ┊ GC (min … max): 0.00% … 97.71%
 Time  (median):     1.867 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.257 μs ± 14.216 μs  ┊ GC (mean ± σ):  8.56% ±  1.38%

 Memory estimate: 5.66 KiB, allocs estimate: 14.

julia> @benchmark vvmapreduce($+, $+, $A1, $A2, $A3, $A4, dims=$(1,2,4))
BenchmarkTools.Trial: 10000 samples with 203 evaluations.
 Range (min … max):  384.768 ns … 150.041 μs  ┊ GC (min … max): 0.00% … 99.57%
 Time  (median):     437.601 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   478.179 ns ±   2.117 μs  ┊ GC (mean ± σ):  7.50% ±  1.72%

 Memory estimate: 304 bytes, allocs estimate: 6.

# And for really strange stuff (e.g. posterior predictive transformations)
julia> @benchmark vvmapreduce((x,y,z) -> ifelse(x*y+z ≥ 1, 1, 0), +, $A1, $A2, $A3)
BenchmarkTools.Trial: 10000 samples with 198 evaluations.
 Range (min … max):  438.126 ns …  5.704 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     439.995 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   442.020 ns ± 63.038 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 0 bytes, allocs estimate: 0.

# using ifelse for just a boolean is quite slow, but the above is just for demonstration
julia> @benchmark vvmapreduce((x,y,z) -> (x*y+z ≥ 1), +, $A1, $A2, $A3)
BenchmarkTools.Trial: 10000 samples with 975 evaluations.
 Range (min … max):  70.558 ns …  2.085 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     70.888 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   71.425 ns ± 23.489 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 0 bytes, allocs estimate: 0.

# What I mean by posterior predictive transformation? Well, one might encounter
# this in Bayesian model checking, which provides a convenient example.
# If one wishes to compute the Pr = ∫∫𝕀(T(yʳᵉᵖ, θ) ≥ T(y, θ))p(yʳᵉᵖ|θ)p(θ|y)dyʳᵉᵖdθ
# Let's imagine that A1 represents T(yʳᵉᵖ, θ) and A2 represents T(y, θ)
# i.e. the test variable samples computed as a functional of the Markov chain (samples of θ)
# Then, Pr is computed as
vvmapreduce(≥, +, A1, A2) / length(A1)
# Or, if only the probability is of interest, and we do not wish to use the functionals
# for any other purpose, we could compute it as:
vvmapreduce((x, y) -> (f(x) ≥ f(y)), +, A1, A2) / length(A1)
# where `f` is the functional of interest, e.g.
vvmapreduce((x, y) -> (abs2(x) ≥ abs2(y)), +, A1, A2) / length(A1)

# One can also express commonly encountered reductions with ease;
# these will be fused once a post-reduction operator can be specified
# Mean squared error
vvmapreduce((x, y) -> abs2(x - y), +, A1, A2, dims=(2,4)) ./ (size(A1, 2) * size(A1, 4))
# Euclidean distance
(√).(vvmapreduce((x, y) -> abs2(x - y), +, A1, A2, dims=(2,4)))

findmin/findmax examples

Examples of extended syntax using vfindmin(f, A; dims), vfindmin(f, A...; dims). In the former case, f : ℝ → ℝ; in the latter, f : ℝᴺ → ℝ. Also applies to vfindmax, vargmin, vargmax.
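For instance, the single-array form applies f on the fly (abs2 below is an arbitrary choice of f):

A = rand(5, 5, 5, 5);

# findmin of abs2 applied elementwise, over dims 1 and 3,
# without materializing the mapped array
vfindmin(abs2, A, dims=(1, 3))

# compare with the Base route, which materializes abs2.(A) first
findmin(abs2.(A), dims=(1, 3))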

# Easy to express without the extended syntax, but not efficient.
julia> B1, B2, B3 = rand(5,5,5,5), rand(5,5,5,5), rand(5,5,5,5);

julia> B′ = @. B1 + B2 + B3;

julia> findmax(B′) == vfindmax(+, B1, B2, B3)
true

julia> @benchmark findmin(@. $B1 + $B2 + $B3)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.905 μs … 943.922 μs  ┊ GC (min … max): 0.00% … 94.06%
 Time  (median):     4.011 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.128 μs ±   9.418 μs  ┊ GC (mean ± σ):  2.15% ±  0.94%

 Memory estimate: 5.11 KiB, allocs estimate: 2.

julia> @benchmark vfindmin(+, $B1, $B2, $B3)
BenchmarkTools.Trial: 10000 samples with 943 evaluations.
 Range (min … max):  100.346 ns … 151.376 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     100.821 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   100.866 ns ±   0.651 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 0 bytes, allocs estimate: 0.

# A multidimensional example

julia> @benchmark findmin((@. abs2($B1) * $B2 + $B3), dims=$(3,4))
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.026 μs …  1.132 ms  ┊ GC (min … max): 0.00% … 94.30%
 Time  (median):     4.311 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.494 μs ± 11.335 μs  ┊ GC (mean ± σ):  2.37% ±  0.94%

 Memory estimate: 6.55 KiB, allocs estimate: 12.

julia> @benchmark vfindmin((x, y, z) -> abs2(x) * y + z, $B1, $B2, $B3, dims=$(3,4))
BenchmarkTools.Trial: 10000 samples with 173 evaluations.
 Range (min … max):  615.145 ns … 27.463 μs  ┊ GC (min … max):  0.00% … 95.61%
 Time  (median):     635.491 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   850.233 ns ±  1.487 μs  ┊ GC (mean ± σ):  10.58% ±  5.89%

 Memory estimate: 1.62 KiB, allocs estimate: 9.

mapreducethen examples

Examples of a seemingly strange but useful concept: mapreduce(f, op, ...), then apply g to each element of the result. The post-transform g can be fused such that the output array is populated in a single pass; hence, mapreducethen(f, op, g, ...). It happens that many familiar quantities follow this pattern, as shown below.

# L₂ norm
julia> A = rand(10,10,10,10); B = rand(10,10,10,10);

julia> @benchmark vmapreducethen(abs2, +, √, $A, dims=(2,4))
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.634 μs … 620.474 μs  ┊ GC (min … max): 0.00% … 99.00%
 Time  (median):     1.969 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.040 μs ±   6.188 μs  ┊ GC (mean ± σ):  3.01% ±  0.99%

 Memory estimate: 960 bytes, allocs estimate: 3.

julia> @benchmark .√mapreduce(abs2, +, $A, dims=(2,4))
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.462 μs …  13.938 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.957 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.017 μs ± 378.040 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 2.05 KiB, allocs estimate: 10.

# Euclidean distance
julia> euclidean(x, y; dims=:) = .√mapreduce(abs2 ∘ -, +, x, y, dims=dims);

julia> veuclidean(x, y; dims=:) = vmapreducethen((a, b) -> abs2(a - b), +, √, x, y, dims=dims);

julia> @benchmark veuclidean(A, B, dims=(1,3))
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.277 μs …   6.065 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.576 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.602 μs ± 202.787 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 992 bytes, allocs estimate: 4.

julia> @benchmark euclidean(A, B, dims=(1,3))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.103 μs …  2.024 ms  ┊ GC (min … max): 0.00% … 95.82%
 Time  (median):     13.781 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   17.502 μs ± 58.495 μs  ┊ GC (mean ± σ):  9.72% ±  2.90%

 Memory estimate: 80.28 KiB, allocs estimate: 13.

Acknowledgments

The original motivation for this work was a vectorized & multithreaded multi-dimensional findmin, taking a variable number of array arguments -- it's a long story, but the similarity between findmin and mapreduce motivated a broad approach. My initial attempt (visible in the /attic) did not deliver all the performance possible -- this was only apparent through comparison to C. Elrod's approach to multidimensional forms in VectorizedStatistics. Having fully appreciated the beauty of branching through @generated functions, I decided to take a tour of some low-hanging fruit -- this package is the result.

Future work

  1. ✓ post-reduction operators
  2. □ reductions over index subsets within a dimension.
  3. □ actual documentation

Elsewhere