ENH: Adopt new macOS Accelerate BLAS/LAPACK Interfaces, including ILP64 #24053

Commits on Aug 3, 2023

  1. ENH: Adopt new BLAS/LAPACK Interfaces, including ILP64

    macOS 13.3 shipped with an updated Accelerate framework that provides BLAS / LAPACK.  The new version is aligned with Netlib's v3.9.1 and also supports ILP64.  The changes here adopt those new interfaces when available.
    
    - New interfaces are used when ACCELERATE_NEW_LAPACK is defined.
    - ILP64 interfaces are used when both ACCELERATE_NEW_LAPACK and ACCELERATE_LAPACK_ILP64 are defined.
    
    macOS 13.3 now ships with 3 different sets of BLAS / LAPACK interfaces:
    - LP64 / LAPACK v3.2.1 - legacy interfaces kept for compatibility
    - LP64 / LAPACK v3.9.1 - new interfaces
    - ILP64 / LAPACK v3.9.1 - new interfaces with ILP64 support
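
The LP64 / ILP64 distinction in the list above comes down to the width of the integer type used for dimensions and indices in BLAS / LAPACK calls: 32-bit for LP64, 64-bit for ILP64. The following is only an illustrative sketch of why that matters, using Python's ctypes fixed-width integers as stand-ins. The helper names `as_lp64_int` and `as_ilp64_int` are hypothetical and are not part of NumPy or Accelerate.

```python
import ctypes


def as_lp64_int(n):
    """Return n as it would read back from a 32-bit signed integer (LP64-style)."""
    # ctypes performs no overflow checking, so out-of-range values wrap around.
    return ctypes.c_int32(n).value


def as_ilp64_int(n):
    """Return n as it would read back from a 64-bit signed integer (ILP64-style)."""
    return ctypes.c_int64(n).value


if __name__ == "__main__":
    # Element count of a hypothetical 50,000 x 50,000 matrix.
    elements = 50_000 * 50_000
    print("fits in 32 bits:", as_lp64_int(elements) == elements)
    print("fits in 64 bits:", as_ilp64_int(elements) == elements)
```

Any count above 2**31 - 1 no longer survives the round trip through a 32-bit integer, which is the situation the ILP64 interfaces are meant to handle.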
    
    For LP64, we want to support building against the macOS 13.3+ SDK while still running on pre-13.3 systems.  To that end, we create a wrapper for each API that checks at runtime which set of APIs is available and should be used.
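
The wrapper idea described above can be sketched as a small dispatch pattern. This is not the code from this PR: the real wrappers are written in C, and the commit message does not say how the runtime check is performed. The sketch below assumes a plain OS version comparison, and every name in it (`has_new_interfaces`, `make_wrapper`, `NEW_INTERFACE_MIN_VERSION`) is hypothetical.

```python
import platform

# Assumption for this sketch: the new interfaces exist from macOS 13.3 onward.
NEW_INTERFACE_MIN_VERSION = (13, 3)


def parse_version(text):
    """Turn a string such as '13.3.1' into (13, 3, 1); '' becomes ()."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def has_new_interfaces(macos_version):
    """Return True when the given macOS version is 13.3 or later."""
    return parse_version(macos_version) >= NEW_INTERFACE_MIN_VERSION


def make_wrapper(legacy_impl, new_impl, macos_version=None):
    """Pick the implementation a wrapper should forward to."""
    if macos_version is None:
        # platform.mac_ver()[0] is an empty string on non-macOS systems,
        # which this sketch treats as "new interfaces not available".
        macos_version = platform.mac_ver()[0]
    return new_impl if has_new_interfaces(macos_version) else legacy_impl


if __name__ == "__main__":
    solve = make_wrapper(lambda: "legacy LP64 call", lambda: "new LP64 call")
    print(solve())
```

Doing the selection once and reusing the chosen callable keeps the per-call overhead of such a wrapper small, which matters for routines that are called on very small arrays.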
    
    ILP64 is only supported on macOS 13.3+ and does not use additional wrappers.
    
    We've included support for both distutils and Meson builds.
    
    All tests pass on Apple silicon and Intel-based Macs.
    
    Benchmarks
    ILP64 Accelerate vs OpenBLAS
           before           after         ratio
         [73f0cf4f]       [d1572653]
         <openblas-ilp64>       <accelerate-ilp64>
                  n/a              n/a      n/a  bench_linalg.Linalg.time_op('det', 'float16')
                  n/a              n/a      n/a  bench_linalg.Linalg.time_op('pinv', 'float16')
                  n/a              n/a      n/a  bench_linalg.Linalg.time_op('svd', 'float16')
               failed           failed      n/a  bench_linalg.LinalgSmallArrays.time_det_small_array
    +      3.96±0.1μs       5.04±0.4μs     1.27  bench_linalg.Linalg.time_op('norm', 'float32')
          1.43±0.04ms         1.43±0ms     1.00  bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float32'>)
           12.7±0.4μs       12.7±0.3μs     1.00  bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float32'>)
           24.1±0.8μs      24.1±0.04μs     1.00  bench_linalg.Linalg.time_op('norm', 'float16')
           9.48±0.2ms       9.48±0.3ms     1.00  bench_linalg.Einsum.time_einsum_outer(<class 'numpy.float64'>)
             609±20μs          609±2μs     1.00  bench_linalg.Einsum.time_einsum_noncon_outer(<class 'numpy.float32'>)
             64.9±2μs      64.7±0.07μs     1.00  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float64'>)
          1.24±0.03ms      1.24±0.01ms     1.00  bench_linalg.Einsum.time_einsum_noncon_outer(<class 'numpy.float64'>)
              102±3μs        102±0.2μs     1.00  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float64'>)
           21.9±0.8μs      21.8±0.02μs     1.00  bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float64'>)
           22.8±0.2ms       22.7±0.3ms     0.99  bench_linalg.Eindot.time_einsum_ijk_jil_kl
           13.3±0.4μs      13.3±0.02μs     0.99  bench_linalg.Einsum.time_einsum_sum_mul2(<class 'numpy.float64'>)
           9.56±0.3μs       9.49±0.2μs     0.99  bench_linalg.Einsum.time_einsum_noncon_contig_contig(<class 'numpy.float64'>)
           7.31±0.2μs      7.26±0.08μs     0.99  bench_linalg.Einsum.time_einsum_noncon_contig_outstride0(<class 'numpy.float32'>)
           5.60±0.2ms      5.55±0.02ms     0.99  bench_linalg.Eindot.time_einsum_ij_jk_a_b
             37.1±1μs       36.7±0.1μs     0.99  bench_linalg.Einsum.time_einsum_contig_outstride0(<class 'numpy.float32'>)
           13.5±0.4μs      13.4±0.05μs     0.99  bench_linalg.Einsum.time_einsum_sum_mul(<class 'numpy.float64'>)
          1.03±0.03μs         1.02±0μs     0.99  bench_linalg.LinalgSmallArrays.time_norm_small_array
             51.6±2μs      51.0±0.09μs     0.99  bench_linalg.Einsum.time_einsum_contig_contig(<class 'numpy.float32'>)
           15.2±0.5μs      15.0±0.04μs     0.99  bench_linalg.Einsum.time_einsum_noncon_sum_mul2(<class 'numpy.float64'>)
           13.9±0.4μs      13.7±0.02μs     0.99  bench_linalg.Einsum.time_einsum_noncon_sum_mul2(<class 'numpy.float32'>)
             415±10μs        409±0.4μs     0.99  bench_linalg.Eindot.time_einsum_i_ij_j
           9.29±0.3μs      9.01±0.03μs     0.97  bench_linalg.Einsum.time_einsum_noncon_mul(<class 'numpy.float64'>)
           18.2±0.6μs      17.6±0.04μs     0.97  bench_linalg.Einsum.time_einsum_multiply(<class 'numpy.float32'>)
             509±40μs         492±10μs     0.97  bench_linalg.Einsum.time_einsum_mul(<class 'numpy.float64'>)
           9.63±0.3μs      9.28±0.09μs     0.96  bench_linalg.Einsum.time_einsum_noncon_contig_contig(<class 'numpy.float32'>)
           9.08±0.2μs      8.73±0.02μs     0.96  bench_linalg.Einsum.time_einsum_noncon_mul(<class 'numpy.float32'>)
           15.6±0.5μs      15.0±0.04μs     0.96  bench_linalg.Einsum.time_einsum_noncon_sum_mul(<class 'numpy.float64'>)
           7.74±0.2μs      7.39±0.04μs     0.95  bench_linalg.Einsum.time_einsum_noncon_contig_outstride0(<class 'numpy.float64'>)
           18.6±0.6μs      17.7±0.03μs     0.95  bench_linalg.Einsum.time_einsum_noncon_multiply(<class 'numpy.float32'>)
           14.5±0.4μs      13.7±0.03μs     0.95  bench_linalg.Einsum.time_einsum_noncon_sum_mul(<class 'numpy.float32'>)
           13.3±0.6μs       12.5±0.3μs     0.94  bench_linalg.Einsum.time_einsum_sum_mul(<class 'numpy.float32'>)
           23.5±0.5μs      21.9±0.05μs     0.93  bench_linalg.Einsum.time_einsum_noncon_multiply(<class 'numpy.float64'>)
             264±20μs          243±4μs     0.92  bench_linalg.Einsum.time_einsum_mul(<class 'numpy.float32'>)
    -        177±50μs        132±0.6μs     0.75  bench_linalg.Eindot.time_dot_trans_at_a
    -      10.7±0.3μs      7.13±0.01μs     0.67  bench_linalg.Linalg.time_op('norm', 'int16')
    -        97.5±2μs       64.7±0.1μs     0.66  bench_linalg.Eindot.time_matmul_trans_a_at
    -      8.87±0.3μs         5.76±0μs     0.65  bench_linalg.Linalg.time_op('norm', 'longfloat')
    -      8.90±0.3μs      5.77±0.01μs     0.65  bench_linalg.Linalg.time_op('norm', 'float64')
    -      8.48±0.3μs      5.40±0.01μs     0.64  bench_linalg.Linalg.time_op('norm', 'int64')
    -         106±2μs         66.5±8μs     0.63  bench_linalg.Eindot.time_inner_trans_a_a
    -      8.25±0.3μs         5.16±0μs     0.62  bench_linalg.Linalg.time_op('norm', 'int32')
    -         103±5ms       64.6±0.5ms     0.62  bench_import.Import.time_linalg
    -         106±3μs       66.0±0.1μs     0.62  bench_linalg.Eindot.time_dot_trans_a_at
    -        202±20μs        124±0.6μs     0.61  bench_linalg.Eindot.time_matmul_trans_at_a
    -       31.5±10μs      19.3±0.02μs     0.61  bench_linalg.Eindot.time_dot_d_dot_b_c
    -       32.4±20μs      19.7±0.03μs     0.61  bench_linalg.Eindot.time_matmul_d_matmul_b_c
    -        5.05±1ms      3.06±0.09ms     0.61  bench_linalg.Linalg.time_op('svd', 'complex128')
    -      5.35±0.9ms      3.09±0.09ms     0.58  bench_linalg.Linalg.time_op('svd', 'complex64')
    -        6.37±3ms       3.27±0.1ms     0.51  bench_linalg.Linalg.time_op('pinv', 'complex128')
    -        7.26±8ms       3.24±0.1ms     0.45  bench_linalg.Linalg.time_op('pinv', 'complex64')
    -       519±100μs        219±0.8μs     0.42  bench_linalg.Linalg.time_op('det', 'complex64')
    -      31.3±0.9μs       12.8±0.1μs     0.41  bench_linalg.Linalg.time_op('norm', 'complex128')
    -      2.44±0.7ms          924±1μs     0.38  bench_linalg.Linalg.time_op('pinv', 'float64')
    -      29.9±0.8μs      10.8±0.01μs     0.36  bench_linalg.Linalg.time_op('norm', 'complex64')
    -      2.56±0.5ms          924±1μs     0.36  bench_linalg.Linalg.time_op('pinv', 'float32')
    -      2.63±0.5ms        924±0.6μs     0.35  bench_linalg.Linalg.time_op('pinv', 'int64')
    -      2.68±0.7ms         927±10μs     0.35  bench_linalg.Linalg.time_op('pinv', 'int32')
    -      2.68±0.5ms         927±10μs     0.35  bench_linalg.Linalg.time_op('pinv', 'int16')
    -      2.93±0.6ms          925±2μs     0.32  bench_linalg.Linalg.time_op('pinv', 'longfloat')
    -       809±500μs        215±0.2μs     0.27  bench_linalg.Linalg.time_op('det', 'complex128')
    -      3.67±0.9ms         895±20μs     0.24  bench_linalg.Eindot.time_tensordot_a_b_axes_1_0_0_1
    -       489±100μs         114±20μs     0.23  bench_linalg.Eindot.time_inner_trans_a_ac
    -      3.64±0.7ms        777±0.3μs     0.21  bench_linalg.Lstsq.time_numpy_linalg_lstsq_a__b_float64
    -        755±90μs         157±10μs     0.21  bench_linalg.Eindot.time_dot_a_b
    -        4.63±1ms          899±9μs     0.19  bench_linalg.Linalg.time_op('svd', 'longfloat')
    -        5.19±1ms         922±10μs     0.18  bench_linalg.Linalg.time_op('svd', 'float64')
    -       599±200μs         89.4±2μs     0.15  bench_linalg.Eindot.time_matmul_trans_atc_a
    -       956±200μs         140±10μs     0.15  bench_linalg.Eindot.time_matmul_a_b
    -        6.45±3ms         903±10μs     0.14  bench_linalg.Linalg.time_op('svd', 'float32')
    -        6.42±3ms        896±0.7μs     0.14  bench_linalg.Linalg.time_op('svd', 'int32')
    -        6.47±4ms          902±5μs     0.14  bench_linalg.Linalg.time_op('svd', 'int64')
    -        6.52±1ms          899±2μs     0.14  bench_linalg.Linalg.time_op('svd', 'int16')
    -       799±300μs          109±2μs     0.14  bench_linalg.Eindot.time_dot_trans_atc_a
    -       502±100μs       65.0±0.2μs     0.13  bench_linalg.Eindot.time_dot_trans_a_atc
    -       542±300μs      64.2±0.05μs     0.12  bench_linalg.Eindot.time_matmul_trans_a_atc
    -       458±300μs      41.6±0.09μs     0.09  bench_linalg.Linalg.time_op('det', 'int32')
    -       471±100μs      41.9±0.03μs     0.09  bench_linalg.Linalg.time_op('det', 'float32')
    -       510±100μs      43.6±0.06μs     0.09  bench_linalg.Linalg.time_op('det', 'int16')
    -       478±200μs      39.6±0.05μs     0.08  bench_linalg.Linalg.time_op('det', 'longfloat')
    -       599±200μs      39.6±0.09μs     0.07  bench_linalg.Linalg.time_op('det', 'float64')
    -       758±300μs       41.6±0.1μs     0.05  bench_linalg.Linalg.time_op('det', 'int64')
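
In the table above, the ratio column is the "after" time divided by the "before" time, so values below 1.00 mean the Accelerate build was faster; rows marked `-` are improvements and the row marked `+` is a regression. A small sketch for recomputing a ratio from two table cells follows. The helpers `to_seconds` and `ratio` are hypothetical, written only for this illustration, and a ratio recomputed from the rounded cells may differ from the printed one in the last digit.

```python
import re

# Unit suffixes as they appear in the table; both spellings of "micro" are accepted.
UNIT_SECONDS = {"ns": 1e-9, "μs": 1e-6, "µs": 1e-6, "ms": 1e-3, "s": 1.0}

CELL_PATTERN = re.compile(r"([\d.]+)±[\d.]+(ns|μs|µs|ms|s)")


def to_seconds(cell):
    """Convert a cell such as '3.96±0.1μs' to seconds, ignoring the ± spread."""
    match = CELL_PATTERN.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"not a timing cell: {cell!r}")
    return float(match.group(1)) * UNIT_SECONDS[match.group(2)]


def ratio(before, after):
    """Return after / before, the quantity shown in the ratio column."""
    return to_seconds(after) / to_seconds(before)


if __name__ == "__main__":
    print(round(ratio("3.96±0.1μs", "5.04±0.4μs"), 2))  # norm, float32
    print(round(ratio("2.44±0.7ms", "924±1μs"), 2))     # pinv, float64
```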
    Developer-Ecosystem-Engineering committed Aug 3, 2023
    57364f0
  2. Use fortran_int for emscripten, remove debug prints

    Emscripten builds don't use an external BLAS / LAPACK; they use an f2c-translated version that's embedded in NumPy.  That version declares some LAPACK APIs as returning int instead of void, because that is how f2c translates subroutines.
    
    Also remove some debug prints from umath_linalg.
    Developer-Ecosystem-Engineering committed Aug 3, 2023
    6abd114
  3. Remove prints and revert dispatch

    Remove the prints and provide an option that disables the dispatching for Accelerate.
    Developer-Ecosystem-Engineering committed Aug 3, 2023
    992dda7
  4. 18d1672

Commits on Aug 30, 2023

  1. d408397
  2. 4fc68ff
  3. CI: add macOS job using Accelerate ILP64

    [skip circle]
    rgommers committed Aug 30, 2023
    9e6db51

Commits on Aug 31, 2023

  1. BLD: match CBLAS header name with BLAS library name

    [skip circle]
    rgommers committed Aug 31, 2023
    6faa102
  2. DOC: add release note

    [skip ci]
    rgommers committed Aug 31, 2023
    bc94c48