Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
ENH: Adopt new macOS Accelerate BLAS/LAPACK Interfaces, including ILP…
…64 (numpy#24053) macOS 13.3 shipped with an updated Accelerate framework that provides BLAS / LAPACK. The new version is aligned with Netlib's v3.9.1 and also supports ILP64. The changes here adopt those new interfaces when available. - New interfaces are used when ACCELERATE_NEW_LAPACK is defined. - ILP64 interfaces are used when both ACCELERATE_NEW_LAPACK and ACCELERATE_LAPACK_ILP64 are defined. macOS 13.3 now ships with 3 different sets of BLAS / LAPACK interfaces: - LP64 / LAPACK v3.2.1 - legacy interfaces kept for compatibility - LP64 / LAPACK v3.9.1 - new interfaces - ILP64 / LAPACK v3.9.1 - new interfaces with ILP64 support For LP64, we want to support building against macOS 13.3+ SDK, but having it work on pre-13.3 systems. To that end, we created wrappers for each API that do a runtime check on which set of API is available and should be used. However, these were deemed potentially too complex to include during review of numpygh-24053, and left out in this commit. Please see numpygh-24053 for those. ILP64 is only supported on macOS 13.3+ and does not use additional wrappers. We've included support for both distutils and Meson builds. All tests pass on Apple silicon and Intel based Macs. A new CI job for Accelerate ILP64 on x86-64 was added as well. Benchmarks ILP64 Accelerate vs OpenBLAS before after ratio [73f0cf4f] [d1572653] <openblas-ilp64> <accelerate-ilp64> n/a n/a n/a bench_linalg.Linalg.time_op('det', 'float16') n/a n/a n/a bench_linalg.Linalg.time_op('pinv', 'float16') n/a n/a n/a bench_linalg.Linalg.time_op('svd', 'float16') failed failed n/a bench_linalg.LinalgSmallArrays.time_det_small_array + 3.96±0.1μs 5.04±0.4μs 1.27 bench_linalg.Linalg.time_op('norm', 'float32') 1.43±0.04ms 1.43±0ms 1.00 bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float32'>) 12.7±0.4μs 12.7±0.3μs 1.00 bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float32'>) 24.1±0.8μs 24.1±0.04μs 1.00 bench_linalg.Linalg.time_op('norm', 'float16') 9.48±0.2ms 9.48±0.3ms 1.00 bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float64'>) 609±20μs 609±2μs 1.00 bench_linalg.Einsum.time_einsum_noncon_outer(<class 'numpy.float32'>) 64.9±2μs 64.7±0.07μs 1.00 bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>) 1.24±0.03ms 1.24±0.01ms 1.00 bench_linalg.Einsum.time_einsum_noncon_outer(<class 'numpy.float64'>) 102±3μs 102±0.2μs 1.00 bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>) 21.9±0.8μs 21.8±0.02μs 1.00 bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float64'>) 22.8±0.2ms 22.7±0.3ms 0.99 bench_linalg.Eindot.time_einsum_ijk_jil_kl 13.3±0.4μs 13.3±0.02μs 0.99 bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float64'>) 9.56±0.3μs 9.49±0.2μs 0.99 bench_linalg.Einsum.time_einsum_noncon_contig_contig(<class 'numpy.float64'>) 7.31±0.2μs 7.26±0.08μs 0.99 bench_linalg.Einsum.time_einsum_noncon_contig_outstride0(<class 'numpy.float32'>) 5.60±0.2ms 5.55±0.02ms 0.99 bench_linalg.Eindot.time_einsum_ij_jk_a_b 37.1±1μs 36.7±0.1μs 0.99 bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>) 13.5±0.4μs 13.4±0.05μs 0.99 bench_linalg.Einsum.time_einsum_sum_mul(<class 'numpy.float64'>) 1.03±0.03μs 1.02±0μs 0.99 bench_linalg.LinalgSmallArrays.time_norm_small_array 51.6±2μs 51.0±0.09μs 0.99 bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>) 15.2±0.5μs 15.0±0.04μs 0.99 bench_linalg.Einsum.time_einsum_noncon_sum_mul2(<class 'numpy.float64'>) 13.9±0.4μs 13.7±0.02μs 0.99 bench_linalg.Einsum.time_einsum_noncon_sum_mul2(<class 'numpy.float32'>) 415±10μs 409±0.4μs 0.99 bench_linalg.Eindot.time_einsum_i_ij_j 9.29±0.3μs 9.01±0.03μs 0.97 bench_linalg.Einsum.time_einsum_noncon_mul(<class 'numpy.float64'>) 18.2±0.6μs 17.6±0.04μs 0.97 bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float32'>) 509±40μs 492±10μs 0.97 bench_linalg.Einsum.time_einsum_mul(<class 'numpy.float64'>) 9.63±0.3μs 9.28±0.09μs 0.96 bench_linalg.Einsum.time_einsum_noncon_contig_contig(<class 'numpy.float32'>) 9.08±0.2μs 8.73±0.02μs 0.96 bench_linalg.Einsum.time_einsum_noncon_mul(<class 'numpy.float32'>) 15.6±0.5μs 15.0±0.04μs 0.96 bench_linalg.Einsum.time_einsum_noncon_sum_mul(<class 'numpy.float64'>) 7.74±0.2μs 7.39±0.04μs 0.95 bench_linalg.Einsum.time_einsum_noncon_contig_outstride0(<class 'numpy.float64'>) 18.6±0.6μs 17.7±0.03μs 0.95 bench_linalg.Einsum.time_einsum_noncon_multiply(<class 'numpy.float32'>) 14.5±0.4μs 13.7±0.03μs 0.95 bench_linalg.Einsum.time_einsum_noncon_sum_mul(<class 'numpy.float32'>) 13.3±0.6μs 12.5±0.3μs 0.94 bench_linalg.Einsum.time_einsum_sum_mul(<class 'numpy.float32'>) 23.5±0.5μs 21.9±0.05μs 0.93 bench_linalg.Einsum.time_einsum_noncon_multiply(<class 'numpy.float64'>) 264±20μs 243±4μs 0.92 bench_linalg.Einsum.time_einsum_mul(<class 'numpy.float32'>) - 177±50μs 132±0.6μs 0.75 bench_linalg.Eindot.time_dot_trans_at_a - 10.7±0.3μs 7.13±0.01μs 0.67 bench_linalg.Linalg.time_op('norm', 'int16') - 97.5±2μs 64.7±0.1μs 0.66 bench_linalg.Eindot.time_matmul_trans_a_at - 8.87±0.3μs 5.76±0μs 0.65 bench_linalg.Linalg.time_op('norm', 'longfloat') - 8.90±0.3μs 5.77±0.01μs 0.65 bench_linalg.Linalg.time_op('norm', 'float64') - 8.48±0.3μs 5.40±0.01μs 0.64 bench_linalg.Linalg.time_op('norm', 'int64') - 106±2μs 66.5±8μs 0.63 bench_linalg.Eindot.time_inner_trans_a_a - 8.25±0.3μs 5.16±0μs 0.62 bench_linalg.Linalg.time_op('norm', 'int32') - 103±5ms 64.6±0.5ms 0.62 bench_import.Import.time_linalg - 106±3μs 66.0±0.1μs 0.62 bench_linalg.Eindot.time_dot_trans_a_at - 202±20μs 124±0.6μs 0.61 bench_linalg.Eindot.time_matmul_trans_at_a - 31.5±10μs 19.3±0.02μs 0.61 bench_linalg.Eindot.time_dot_d_dot_b_c - 32.4±20μs 19.7±0.03μs 0.61 bench_linalg.Eindot.time_matmul_d_matmul_b_c - 5.05±1ms 3.06±0.09ms 0.61 bench_linalg.Linalg.time_op('svd', 'complex128') - 5.35±0.9ms 3.09±0.09ms 0.58 bench_linalg.Linalg.time_op('svd', 'complex64') - 6.37±3ms 3.27±0.1ms 0.51 bench_linalg.Linalg.time_op('pinv', 'complex128') - 7.26±8ms 3.24±0.1ms 0.45 bench_linalg.Linalg.time_op('pinv', 'complex64') - 519±100μs 219±0.8μs 0.42 bench_linalg.Linalg.time_op('det', 'complex64') - 31.3±0.9μs 12.8±0.1μs 0.41 bench_linalg.Linalg.time_op('norm', 'complex128') - 2.44±0.7ms 924±1μs 0.38 bench_linalg.Linalg.time_op('pinv', 'float64') - 29.9±0.8μs 10.8±0.01μs 0.36 bench_linalg.Linalg.time_op('norm', 'complex64') - 2.56±0.5ms 924±1μs 0.36 bench_linalg.Linalg.time_op('pinv', 'float32') - 2.63±0.5ms 924±0.6μs 0.35 bench_linalg.Linalg.time_op('pinv', 'int64') - 2.68±0.7ms 927±10μs 0.35 bench_linalg.Linalg.time_op('pinv', 'int32') - 2.68±0.5ms 927±10μs 0.35 bench_linalg.Linalg.time_op('pinv', 'int16') - 2.93±0.6ms 925±2μs 0.32 bench_linalg.Linalg.time_op('pinv', 'longfloat') - 809±500μs 215±0.2μs 0.27 bench_linalg.Linalg.time_op('det', 'complex128') - 3.67±0.9ms 895±20μs 0.24 bench_linalg.Eindot.time_tensordot_a_b_axes_1_0_0_1 - 489±100μs 114±20μs 0.23 bench_linalg.Eindot.time_inner_trans_a_ac - 3.64±0.7ms 777±0.3μs 0.21 bench_linalg.Lstsq.time_numpy_linalg_lstsq_a__b_float64 - 755±90μs 157±10μs 0.21 bench_linalg.Eindot.time_dot_a_b - 4.63±1ms 899±9μs 0.19 bench_linalg.Linalg.time_op('svd', 'longfloat') - 5.19±1ms 922±10μs 0.18 bench_linalg.Linalg.time_op('svd', 'float64') - 599±200μs 89.4±2μs 0.15 bench_linalg.Eindot.time_matmul_trans_atc_a - 956±200μs 140±10μs 0.15 bench_linalg.Eindot.time_matmul_a_b - 6.45±3ms 903±10μs 0.14 bench_linalg.Linalg.time_op('svd', 'float32') - 6.42±3ms 896±0.7μs 0.14 bench_linalg.Linalg.time_op('svd', 'int32') - 6.47±4ms 902±5μs 0.14 bench_linalg.Linalg.time_op('svd', 'int64') - 6.52±1ms 899±2μs 0.14 bench_linalg.Linalg.time_op('svd', 'int16') - 799±300μs 109±2μs 0.14 bench_linalg.Eindot.time_dot_trans_atc_a - 502±100μs 65.0±0.2μs 0.13 bench_linalg.Eindot.time_dot_trans_a_atc - 542±300μs 64.2±0.05μs 0.12 bench_linalg.Eindot.time_matmul_trans_a_atc - 458±300μs 41.6±0.09μs 0.09 bench_linalg.Linalg.time_op('det', 'int32') - 471±100μs 41.9±0.03μs 0.09 bench_linalg.Linalg.time_op('det', 'float32') - 510±100μs 43.6±0.06μs 0.09 bench_linalg.Linalg.time_op('det', 'int16') - 478±200μs 39.6±0.05μs 0.08 bench_linalg.Linalg.time_op('det', 'longfloat') - 599±200μs 39.6±0.09μs 0.07 bench_linalg.Linalg.time_op('det', 'float64') - 758±300μs 41.6±0.1μs 0.05 bench_linalg.Linalg.time_op('det', 'int64') Co-authored-by: Ralf Gommers <ralf.gommers@gmail.com>
- Loading branch information