ENH: Adopt new macOS Accelerate BLAS/LAPACK Interfaces, including ILP64 #24053
Merged: rgommers merged 9 commits into numpy:main from Developer-Ecosystem-Engineering:adopt_new_blas_lapack_ilp64 on Aug 31, 2023
Commits on Aug 3, 2023
-
ENH: Adopt new BLAS/LAPACK Interfaces, including ILP64
macOS 13.3 shipped with an updated Accelerate framework that provides BLAS / LAPACK. The new version is aligned with Netlib's v3.9.1 and also supports ILP64. The changes here adopt those new interfaces when available.

- New interfaces are used when ACCELERATE_NEW_LAPACK is defined.
- ILP64 interfaces are used when both ACCELERATE_NEW_LAPACK and ACCELERATE_LAPACK_ILP64 are defined.

macOS 13.3 now ships with 3 different sets of BLAS / LAPACK interfaces:

- LP64 / LAPACK v3.2.1 - legacy interfaces kept for compatibility
- LP64 / LAPACK v3.9.1 - new interfaces
- ILP64 / LAPACK v3.9.1 - new interfaces with ILP64 support

For LP64, we want to support building against the macOS 13.3+ SDK while still working on pre-13.3 systems. To that end, we create wrappers for each API that do a runtime check on which set of APIs is available and should be used. ILP64 is only supported on macOS 13.3+ and does not use additional wrappers.

We've included support for both distutils and Meson builds. All tests pass on Apple silicon and Intel-based Macs.
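The selection rules described above can be modelled in a few lines. This is an illustrative Python sketch only, not the C wrapper code from the PR: the function `select_interface` and its labels are hypothetical, the two booleans stand in for the build-time macros, and the runtime check is assumed to be a simple comparison of the running macOS version against 13.3.

```python
from typing import Tuple

# Labels for the three interface sets listed in the commit message.
LEGACY_LP64 = "LP64 / LAPACK v3.2.1 (legacy)"
NEW_LP64 = "LP64 / LAPACK v3.9.1 (new)"
NEW_ILP64 = "ILP64 / LAPACK v3.9.1 (new)"


def select_interface(
    new_lapack: bool,
    ilp64: bool,
    runtime_macos: Tuple[int, int],
) -> str:
    """Model which interface set ends up being used.

    new_lapack / ilp64 stand in for ACCELERATE_NEW_LAPACK and
    ACCELERATE_LAPACK_ILP64 being defined at build time.
    runtime_macos is the (major, minor) version of the running system.
    """
    has_new_api = runtime_macos >= (13, 3)

    if not new_lapack:
        # Without ACCELERATE_NEW_LAPACK only the legacy interfaces are used
        # (assumed to hold even if the ILP64 macro alone is set).
        return LEGACY_LP64

    if ilp64:
        # ILP64 has no fallback wrappers: it needs macOS 13.3+ at runtime.
        if not has_new_api:
            raise RuntimeError("ILP64 Accelerate requires macOS 13.3 or newer")
        return NEW_ILP64

    # LP64 built against the 13.3+ SDK: pick the API set at runtime.
    return NEW_LP64 if has_new_api else LEGACY_LP64
```

For example, `select_interface(True, False, (12, 6))` falls back to the legacy LP64 set, which is the case the per-API wrappers exist to handle.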
Benchmarks: ILP64 Accelerate vs OpenBLAS

before: [73f0cf4f] <openblas-ilp64>
after:  [d1572653] <accelerate-ilp64>

      before        after         ratio  benchmark
      n/a           n/a           n/a    bench_linalg.Linalg.time_op('det', 'float16')
      n/a           n/a           n/a    bench_linalg.Linalg.time_op('pinv', 'float16')
      n/a           n/a           n/a    bench_linalg.Linalg.time_op('svd', 'float16')
      failed        failed        n/a    bench_linalg.LinalgSmallArrays.time_det_small_array
    + 3.96±0.1μs    5.04±0.4μs    1.27   bench_linalg.Linalg.time_op('norm', 'float32')
      1.43±0.04ms   1.43±0ms      1.00   bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float32'>)
      12.7±0.4μs    12.7±0.3μs    1.00   bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float32'>)
      24.1±0.8μs    24.1±0.04μs   1.00   bench_linalg.Linalg.time_op('norm', 'float16')
      9.48±0.2ms    9.48±0.3ms    1.00   bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float64'>)
      609±20μs      609±2μs       1.00   bench_linalg.Einsum.time_einsum_noncon_outer(<class 'numpy.float32'>)
      64.9±2μs      64.7±0.07μs   1.00   bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)
      1.24±0.03ms   1.24±0.01ms   1.00   bench_linalg.Einsum.time_einsum_noncon_outer(<class 'numpy.float64'>)
      102±3μs       102±0.2μs     1.00   bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
      21.9±0.8μs    21.8±0.02μs   1.00   bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float64'>)
      22.8±0.2ms    22.7±0.3ms    0.99   bench_linalg.Eindot.time_einsum_ijk_jil_kl
      13.3±0.4μs    13.3±0.02μs   0.99   bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float64'>)
      9.56±0.3μs    9.49±0.2μs    0.99   bench_linalg.Einsum.time_einsum_noncon_contig_contig(<class 'numpy.float64'>)
      7.31±0.2μs    7.26±0.08μs   0.99   bench_linalg.Einsum.time_einsum_noncon_contig_outstride0(<class 'numpy.float32'>)
      5.60±0.2ms    5.55±0.02ms   0.99   bench_linalg.Eindot.time_einsum_ij_jk_a_b
      37.1±1μs      36.7±0.1μs    0.99   bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
      13.5±0.4μs    13.4±0.05μs   0.99   bench_linalg.Einsum.time_einsum_sum_mul(<class 'numpy.float64'>)
      1.03±0.03μs   1.02±0μs      0.99   bench_linalg.LinalgSmallArrays.time_norm_small_array
      51.6±2μs      51.0±0.09μs   0.99   bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
      15.2±0.5μs    15.0±0.04μs   0.99   bench_linalg.Einsum.time_einsum_noncon_sum_mul2(<class 'numpy.float64'>)
      13.9±0.4μs    13.7±0.02μs   0.99   bench_linalg.Einsum.time_einsum_noncon_sum_mul2(<class 'numpy.float32'>)
      415±10μs      409±0.4μs     0.99   bench_linalg.Eindot.time_einsum_i_ij_j
      9.29±0.3μs    9.01±0.03μs   0.97   bench_linalg.Einsum.time_einsum_noncon_mul(<class 'numpy.float64'>)
      18.2±0.6μs    17.6±0.04μs   0.97   bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float32'>)
      509±40μs      492±10μs      0.97   bench_linalg.Einsum.time_einsum_mul(<class 'numpy.float64'>)
      9.63±0.3μs    9.28±0.09μs   0.96   bench_linalg.Einsum.time_einsum_noncon_contig_contig(<class 'numpy.float32'>)
      9.08±0.2μs    8.73±0.02μs   0.96   bench_linalg.Einsum.time_einsum_noncon_mul(<class 'numpy.float32'>)
      15.6±0.5μs    15.0±0.04μs   0.96   bench_linalg.Einsum.time_einsum_noncon_sum_mul(<class 'numpy.float64'>)
      7.74±0.2μs    7.39±0.04μs   0.95   bench_linalg.Einsum.time_einsum_noncon_contig_outstride0(<class 'numpy.float64'>)
      18.6±0.6μs    17.7±0.03μs   0.95   bench_linalg.Einsum.time_einsum_noncon_multiply(<class 'numpy.float32'>)
      14.5±0.4μs    13.7±0.03μs   0.95   bench_linalg.Einsum.time_einsum_noncon_sum_mul(<class 'numpy.float32'>)
      13.3±0.6μs    12.5±0.3μs    0.94   bench_linalg.Einsum.time_einsum_sum_mul(<class 'numpy.float32'>)
      23.5±0.5μs    21.9±0.05μs   0.93   bench_linalg.Einsum.time_einsum_noncon_multiply(<class 'numpy.float64'>)
      264±20μs      243±4μs       0.92   bench_linalg.Einsum.time_einsum_mul(<class 'numpy.float32'>)
    - 177±50μs      132±0.6μs     0.75   bench_linalg.Eindot.time_dot_trans_at_a
    - 10.7±0.3μs    7.13±0.01μs   0.67   bench_linalg.Linalg.time_op('norm', 'int16')
    - 97.5±2μs      64.7±0.1μs    0.66   bench_linalg.Eindot.time_matmul_trans_a_at
    - 8.87±0.3μs    5.76±0μs      0.65   bench_linalg.Linalg.time_op('norm', 'longfloat')
    - 8.90±0.3μs    5.77±0.01μs   0.65   bench_linalg.Linalg.time_op('norm', 'float64')
    - 8.48±0.3μs    5.40±0.01μs   0.64   bench_linalg.Linalg.time_op('norm', 'int64')
    - 106±2μs       66.5±8μs      0.63   bench_linalg.Eindot.time_inner_trans_a_a
    - 8.25±0.3μs    5.16±0μs      0.62   bench_linalg.Linalg.time_op('norm', 'int32')
    - 103±5ms       64.6±0.5ms    0.62   bench_import.Import.time_linalg
    - 106±3μs       66.0±0.1μs    0.62   bench_linalg.Eindot.time_dot_trans_a_at
    - 202±20μs      124±0.6μs     0.61   bench_linalg.Eindot.time_matmul_trans_at_a
    - 31.5±10μs     19.3±0.02μs   0.61   bench_linalg.Eindot.time_dot_d_dot_b_c
    - 32.4±20μs     19.7±0.03μs   0.61   bench_linalg.Eindot.time_matmul_d_matmul_b_c
    - 5.05±1ms      3.06±0.09ms   0.61   bench_linalg.Linalg.time_op('svd', 'complex128')
    - 5.35±0.9ms    3.09±0.09ms   0.58   bench_linalg.Linalg.time_op('svd', 'complex64')
    - 6.37±3ms      3.27±0.1ms    0.51   bench_linalg.Linalg.time_op('pinv', 'complex128')
    - 7.26±8ms      3.24±0.1ms    0.45   bench_linalg.Linalg.time_op('pinv', 'complex64')
    - 519±100μs     219±0.8μs     0.42   bench_linalg.Linalg.time_op('det', 'complex64')
    - 31.3±0.9μs    12.8±0.1μs    0.41   bench_linalg.Linalg.time_op('norm', 'complex128')
    - 2.44±0.7ms    924±1μs       0.38   bench_linalg.Linalg.time_op('pinv', 'float64')
    - 29.9±0.8μs    10.8±0.01μs   0.36   bench_linalg.Linalg.time_op('norm', 'complex64')
    - 2.56±0.5ms    924±1μs       0.36   bench_linalg.Linalg.time_op('pinv', 'float32')
    - 2.63±0.5ms    924±0.6μs     0.35   bench_linalg.Linalg.time_op('pinv', 'int64')
    - 2.68±0.7ms    927±10μs      0.35   bench_linalg.Linalg.time_op('pinv', 'int32')
    - 2.68±0.5ms    927±10μs      0.35   bench_linalg.Linalg.time_op('pinv', 'int16')
    - 2.93±0.6ms    925±2μs       0.32   bench_linalg.Linalg.time_op('pinv', 'longfloat')
    - 809±500μs     215±0.2μs     0.27   bench_linalg.Linalg.time_op('det', 'complex128')
    - 3.67±0.9ms    895±20μs      0.24   bench_linalg.Eindot.time_tensordot_a_b_axes_1_0_0_1
    - 489±100μs     114±20μs      0.23   bench_linalg.Eindot.time_inner_trans_a_ac
    - 3.64±0.7ms    777±0.3μs     0.21   bench_linalg.Lstsq.time_numpy_linalg_lstsq_a__b_float64
    - 755±90μs      157±10μs      0.21   bench_linalg.Eindot.time_dot_a_b
    - 4.63±1ms      899±9μs       0.19   bench_linalg.Linalg.time_op('svd', 'longfloat')
    - 5.19±1ms      922±10μs      0.18   bench_linalg.Linalg.time_op('svd', 'float64')
    - 599±200μs     89.4±2μs      0.15   bench_linalg.Eindot.time_matmul_trans_atc_a
    - 956±200μs     140±10μs      0.15   bench_linalg.Eindot.time_matmul_a_b
    - 6.45±3ms      903±10μs      0.14   bench_linalg.Linalg.time_op('svd', 'float32')
    - 6.42±3ms      896±0.7μs     0.14   bench_linalg.Linalg.time_op('svd', 'int32')
    - 6.47±4ms      902±5μs       0.14   bench_linalg.Linalg.time_op('svd', 'int64')
    - 6.52±1ms      899±2μs       0.14   bench_linalg.Linalg.time_op('svd', 'int16')
    - 799±300μs     109±2μs       0.14   bench_linalg.Eindot.time_dot_trans_atc_a
    - 502±100μs     65.0±0.2μs    0.13   bench_linalg.Eindot.time_dot_trans_a_atc
    - 542±300μs     64.2±0.05μs   0.12   bench_linalg.Eindot.time_matmul_trans_a_atc
    - 458±300μs     41.6±0.09μs   0.09   bench_linalg.Linalg.time_op('det', 'int32')
    - 471±100μs     41.9±0.03μs   0.09   bench_linalg.Linalg.time_op('det', 'float32')
    - 510±100μs     43.6±0.06μs   0.09   bench_linalg.Linalg.time_op('det', 'int16')
    - 478±200μs     39.6±0.05μs   0.08   bench_linalg.Linalg.time_op('det', 'longfloat')
    - 599±200μs     39.6±0.09μs   0.07   bench_linalg.Linalg.time_op('det', 'float64')
    - 758±300μs     41.6±0.1μs    0.05   bench_linalg.Linalg.time_op('det', 'int64')
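To read the benchmark results: the ratio column is the `after` time divided by the `before` time, so values below 1 mean the Accelerate build was faster and values above 1 mean it was slower. The sketch below recomputes that ratio from two timing strings. It assumes timings are written as `<value>±<error><unit>`, as in the results above; `parse_time` and `ratio` are hypothetical helper names for illustration, not part of asv or NumPy.

```python
import re

# Multipliers to convert each unit suffix to seconds.
_UNITS = {"ns": 1e-9, "μs": 1e-6, "ms": 1e-3, "s": 1.0}

# Matches strings such as "758±300μs" or "1.43±0ms".
_TIME_RE = re.compile(r"^([0-9.]+)±[0-9.]+(ns|μs|ms|s)$")


def parse_time(text: str) -> float:
    """Convert a timing string to seconds (the error term is ignored)."""
    match = _TIME_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised time: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def ratio(before: str, after: str) -> float:
    """Return after / before, rounded to two decimals like the results."""
    return round(parse_time(after) / parse_time(before), 2)


# det on int64: the largest improvement reported.
print(ratio("758±300μs", "41.6±0.1μs"))
# norm on float32: the only benchmark marked as slower.
print(ratio("3.96±0.1μs", "5.04±0.4μs"))
```

A ratio of 0.05 for `det` on `int64` corresponds to roughly a 20x speedup, while the `norm` on `float32` ratio of 1.27 is the single reported regression.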
Commit: 57364f0
-
Use fortran_int for emscripten, remove debug prints
Emscripten doesn't use an external BLAS / LAPACK. It uses an f2c version that's embedded in NumPy. That version declares some LAPACK APIs as returning int instead of void, because that's how f2c translates subroutines. Also remove some debug prints from umath_linalg.
Commit: 6abd114
-
Remove prints and revert dispatch
Remove the prints and provide an option that removes the dispatching for Accelerate.
Commit: 992dda7
-
Commit: 18d1672
Commits on Aug 30, 2023
-
Commit: d408397
-
Commit: 4fc68ff
-
Commit: 9e6db51
Commits on Aug 31, 2023
-
Commit: 6faa102
-
Commit: bc94c48