
Default to Accelerate on macOS and add wheels for macOS >=14 #24905

Closed
rgommers opened this issue Oct 12, 2023 · 14 comments

Comments

@rgommers (Member)

rgommers commented Oct 12, 2023

The benefits of using the updated Accelerate library are (1) performance (xref gh-24053 for benchmarks) and (2) smaller wheels because we're no longer vendoring OpenBLAS.

The first part of this is done: after gh-24893 we are defaulting to the new Accelerate on macOS >=13.3 when building from source.

The second part is adding a new set of wheel builds that use Accelerate rather than OpenBLAS on macOS >=14. Note that the minimum can only be macOS 14 rather than 13.3, because the packaging library (and hence pip) does not take the macOS minor version into account, so using 13_3 in wheel names would lead to pip installing those wheels on 13.0-13.2 as well, where they would then crash with missing symbols at runtime.
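
As an illustration of that packaging limitation, here is a minimal sketch (assuming the packaging library is installed; the version tuple is just an example):

```sh
# Sketch: for macOS >= 11, packaging only generates major-version platform tags,
# so a 13.3 minimum cannot be expressed in a wheel tag.
python -c "
from packaging.tags import mac_platforms
tags = list(mac_platforms(version=(13, 2), arch='arm64'))
print(tags[:4])  # only macosx_13_0_*, macosx_12_0_*, ... - macosx_13_3_* never appears
"
```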

We need a macOS 14 image to build wheels (ideally; in a pinch we could rename wheels manually). The status of those CI images is:

  • Cirrus CI made macOS 14 images available on the release date of macOS 14, three weeks ago. We want to use macos-sonoma-xcode (see here for all images). So we can do this now for arm64 wheels.
  • GitHub does not yet have a timeline for making runner images available. The tracking issue is Add macOS 14 (Sonoma) Runner Image actions/runner-images#7508. So for x86_64 we have to wait. I'd expect support to materialize well before the 2.0 release date.
rgommers self-assigned this Oct 12, 2023
rgommers added this to the 2.0.0 release milestone Oct 12, 2023
rgommers added commits to rgommers/numpy that referenced this issue on Oct 26 and Nov 22, 2023
@rgommers (Member Author)

The most important part of this, for arm64, was done in gh-25255. In case GitHub Actions releases macOS 14 images in time, we can also get x86-64 wheels in for 2.0. But if not, then not.

@Mousius (Member)

Mousius commented Dec 21, 2023

> The benefits of using the updated Accelerate library are (1) performance (xref gh-24053 for benchmarks) and (2) smaller wheels because we're no longer vendoring OpenBLAS.

Hi @rgommers, is the smaller wheel really that important given the sizes we're talking about here? Asking because OpenBLAS builds a lot of different CPU variants for dynamic dispatch, and I could look into shrinking that based on the expected users of NumPy or on overlapping cores (https://github.com/OpenMathLib/OpenBLAS/blob/67779177b94f03493a44888f6a682065d5f59618/Makefile.system#L676-L700).

@rgommers (Member Author)

I'd say yes, that matters a lot, for multiple reasons:

  • NumPy is now downloaded over 200 million times a month from PyPI, and given our wheel size we are one of the packages that puts the highest load on PyPI
  • For end users the download will be proportionally faster, which matters when you're on a lower-bandwidth connection
  • The size reduction also helps reduce the size of an installed environment. Right now we're in the neighborhood of ~200 MB unpacked for a very basic stack (numpy/scipy/pandas/scikit-learn), and that adds up locally, plus for some deployments (think AWS Lambda & co) it may push it over the max limit

The performance gain from Accelerate is still more important than the size, I'd say.

> Asking because OpenBLAS builds a lot of different CPU variants for dynamic dispatch, and I could look into shrinking that based on the expected users of NumPy or on overlapping cores.

The OpenBLAS we vendor in wheels is built in https://github.com/MacPython/openblas-libs. I'm not sure how much we optimized the settings or if there's much to gain. Maybe @mattip knows?

@mattip (Member)

mattip commented Dec 23, 2023

We set baseline CPU targets in the build script and then OpenBLAS fills in any optional dynamic dispatch variants above those minima. I guess we could try increasing that minimum, although we do occasionally get complaints that even PRESCOTT is too advanced for x86.
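
For illustration, that corresponds roughly to an OpenBLAS invocation like the one below (a sketch with a hypothetical baseline; the actual openblas-libs build scripts set more options and may use different targets):

```sh
# Baseline CPU target plus runtime dispatch kernels for anything newer
make DYNAMIC_ARCH=1 TARGET=PRESCOTT
```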

@martin-frbg

Unless I'm missing something, you will not need DYNAMIC_ARCH for Mac/aarch64, as there's only the one target for M1 and newer (moot if you're switching to Accelerate for its proprietary AMX support). For the other architectures you can mix your own DYNAMIC_LIST subset of the CPUs you deem relevant.

@rgommers (Member Author)

rgommers commented Jan 31, 2024

Good news finally on the GitHub Actions front:

@rgommers (Member Author)

rgommers commented Mar 5, 2024

Okay, I think I've got this working. There was a fairly large hiccup: the macOS 14 image is only supported for arm64. So when trying the macos-14 image we get:

ERROR: numpy-2.0.0.dev0-cp312-cp312-macosx_14_0_x86_64.whl is not a supported wheel on this platform.

because the wheel has to target macOS 14.0 and hence cannot be tested on the platform on which it's built.

I considered setting MACOSX_DEPLOYMENT_TARGET to 13.3 instead and then manually renaming the wheel at the end - that would allow running the tests - but that's not very clean either, so I just skipped the cibuildwheel test phase instead. We do have regular CI with macos-13 and Accelerate, so this shouldn't be too much of a problem. @andyfaff if you prefer the other way, please let me know.
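
For reference, the chosen approach corresponds roughly to something like this in cibuildwheel terms (a sketch using the standard CIBW_* variables; the exact skip pattern is illustrative and the actual NumPy CI config may differ):

```sh
# Target macOS 14.0 but skip the cibuildwheel test phase for these wheels,
# since they cannot be installed on the runner that builds them.
export MACOSX_DEPLOYMENT_TARGET="14.0"
export CIBW_TEST_SKIP="*-macosx_x86_64"
```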

Here is a first successful run with a wheel that can be downloaded, if anyone wants to try on an Intel Mac: https://github.com/rgommers/numpy/actions/runs/8163109512

@andyfaff (Contributor)

andyfaff commented Mar 6, 2024

To be honest I hadn't even thought about making Accelerate-only wheels (macosx_14_0_x86_64) for Intel. I guess it makes sense. However:

  • there are going to be fewer and fewer Intel Macs
  • if there's going to be official support (i.e. it's on PyPI), then it would make sense to run some tests. I think the existing macos-13 infra is probably ok in lieu of testing the actual wheel. (If we only want to support Accelerate-only wheels on macOS >=14 then why are we doing regular CI with macos-13?)

Looking at your commits I don't think it's necessary to have a USE_ACCELERATE variable. One can just specify the INSTALL_OPENBLAS=false variable.

My plan for mac wheel building was to move the arm wheel build from Cirrus to GHA, but I was thinking of doing that after NumPy 2.0 was released. Do we want to go ahead with that transition now? The transition could include the kinds of changes you linked to.

@rgommers (Member Author)

rgommers commented Mar 6, 2024

> (If we only want to support Accelerate-only wheels on macOS >=14 then why are we doing regular CI with macos-13?)

Because it was the only option; I wanted testing of Accelerate and there was no macos-14 image. It's also fine for regular CI. The problem with wheels is a flaw in packaging/pip: you cannot target minor macOS versions. Since the new Accelerate version we need was introduced in macOS 13.3, we can use it in CI just fine but we cannot release macosx_13_3-tagged wheels because the _3 will be ignored.

> Looking at your commits I don't think it's necessary to have a USE_ACCELERATE variable. One can just specify the INSTALL_OPENBLAS=false variable.

Agreed. Those cibw_ bash scripts are a little painful to work with (or blame my bash-fu); I just took a shortcut to get things to work. Your way in gh-25945 looks nicer.

> My plan for mac wheel building was to move the arm wheel build from Cirrus to GHA, but I was thinking of doing that after NumPy 2.0 was released. Do we want to go ahead with that transition now? The transition could include the kinds of changes you linked to.

We may as well do that I think, if CI is green. It's a lot easier to retrigger nightly builds in GHA, or debug a single job on your own fork.

> there are going to be fewer and fewer Intel Macs

That is the key question, yes. I had always planned on doing this - although it would have been more clear-cut if GHA had macos-14 images for x86-64. MacBooks can last a long time, and wheels that are >2x smaller and up to 8x faster on some key linear algebra functions like svd are worth it I think.

In case it becomes a burden in the future, these wheels are still safe to drop in a later release. I think it'll be fine though. I expect the real burden for macOS x86-64 to show up once GHA drops native hardware support completely. According to the roadmap in actions/runner-images#9255 we're good for 2024, but will only have macos-13 left at the end of the year. So 2025 may be the year when x86-64 gets dropped, and then we're back to cross-compiling.

@martin-frbg

Wait, you're seeing an 8x speedup for linalg.svd using Accelerate on aging Intel Macs too, where the M-series AMX tile plays no role at all, and where I'm used to seeing OpenBLAS within about ten percent of MKL?

@rgommers (Member Author)

rgommers commented Mar 6, 2024

Ehhh no, apologies - that is my tired brain not working at all. The binary size reduction is of interest for x86-64; the performance isn't. We do not have new benchmark results that say otherwise at least. The x86-64 implementations also got upgraded in Accelerate, so they are available with NEWLAPACK symbols and should be more robust - but no indication that they are much faster. It's also not easy to find older benchmarks (things like https://blog.kiyohiroyabuuchi.com/en/c-cpp-en/comparison-between-mac-accelerate-framework-and-intel-mkl/ hint at "roughly on par with MKL and OpenBLAS").

The binary size gain is actually larger than 2x (e.g., from this CI run):

  • cp310-macosx_arm64-accelerate: 4.94 MB
  • cp310-macosx_x86_64-accelerate: 6.43 MB
  • cp310-macosx_x86_64-openblas: 19.9 MB

cp310-macosx_arm64-openblas from here weighs in at about 12.8 MB.

So it's about 3x for x86-64, and 2.6x for arm64.

I no longer have an Intel Mac, so no way to easily check performance.

@martin-frbg

Thanks for the clarification :)
Given that there never were Macs with AMD processors, or (AFAIK) with AVX-512, you could reduce the size of your library build by adding DYNAMIC_LIST="CORE2 NEHALEM SANDYBRIDGE HASWELL", removing the dedicated BLAS kernels for 10 irrelevant cpus (if you don't already do this). Probably no longer worth it for you, but I guess I should add this as the default for x86_64 builds with OSNAME=Darwin...
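
For example, such a restricted build could be invoked roughly like this (a sketch; the real openblas-libs scripts set more options than shown here):

```sh
# Build only the listed runtime-dispatch kernels (plus the generic fallback)
make DYNAMIC_ARCH=1 DYNAMIC_LIST="CORE2 NEHALEM SANDYBRIDGE HASWELL"
```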

@rgommers (Member Author)

rgommers commented Mar 6, 2024

With gh-25945 this is now complete - let's see how it goes 🤞🏼. Thanks all!

rgommers closed this as completed Mar 6, 2024
@rgommers (Member Author)

rgommers commented Mar 6, 2024

As a clarification for future readers: the order of auto-detection now is:

  1. scipy-openblas (our own wheel builds of OpenBLAS on PyPI)
  2. Accelerate (macOS only)
  3. MKL (x86-64 only)
  4. OpenBLAS
  5. FlexiBLAS
  6. BLIS
  7. reference BLAS

Wheels are still all built with OpenBLAS, except for macOS >=14.0 (Accelerate) and 32-bit Windows (no BLAS).
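
For anyone who wants to override the auto-detection when building from source, something along these lines should work (a sketch, assuming the -Dblas/-Dlapack meson options described in the NumPy building-from-source docs):

```sh
# Force Accelerate instead of relying on BLAS/LAPACK auto-detection
pip install numpy --no-binary numpy -Csetup-args=-Dblas=accelerate -Csetup-args=-Dlapack=accelerate
```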
