
Default to Accelerate on macOS and add wheels for macOS >=14 #24905

Closed
rgommers opened this issue Oct 12, 2023 · 14 comments

Comments

@rgommers (Member)

rgommers commented Oct 12, 2023

The benefits of using the updated Accelerate library are (1) performance (xref gh-24053 for benchmarks) and (2) smaller wheels because we're no longer vendoring OpenBLAS.

The first part of this is done: after gh-24893 we are defaulting to the new Accelerate on macOS >=13.3 when building from source.

The second part is adding a new set of wheel builds that use Accelerate rather than OpenBLAS on macOS >=14. Note that the minimum can only be macOS 14 rather than 13.3, because the packaging library (and hence pip) does not take the macOS minor version into account, so using 13_3 in wheel names would lead to pip installing those wheels on 13.0-13.2 as well, where they would then crash with missing symbols at runtime.
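
As an illustration of that packaging limitation, here is a minimal sketch (assuming the packaging library is installed; the version tuple is just an example):

```sh
# Sketch: for macOS >= 11, packaging only generates major-version platform tags,
# so a 13.3 minimum cannot be expressed in a wheel tag.
python -c "
from packaging.tags import mac_platforms
tags = list(mac_platforms(version=(13, 2), arch='arm64'))
print(tags[:4])  # only macosx_13_0_*, macosx_12_0_*, ... - macosx_13_3_* never appears
"
```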

We need a macOS 14 image to build wheels (ideally; in a pinch we could rename wheels manually). The status of those CI images is:

  • Cirrus CI made macOS 14 images available on the release date of macOS 14, three weeks ago. We want to use macos-sonoma-xcode (see here for all images). So we can do this now for arm64 wheels.
  • GitHub does not yet have a timeline for making runner images available. The tracking issue is Add macOS 14 (Sonoma) Runner Image actions/runner-images#7508. So for x86_64 we have to wait. I'd expect support to materialize well before the 2.0 release date.
rgommers self-assigned this Oct 12, 2023
rgommers added this to the 2.0.0 release milestone Oct 12, 2023
rgommers added commits to rgommers/numpy that referenced this issue on Oct 26 and Nov 22, 2023
@rgommers (Member Author)

The most important part of this, for arm64, was done in gh-25255. In case GitHub Actions releases macOS 14 images in time, we can also get x86-64 wheels in for 2.0. But if not, then not.

@Mousius (Member)

Mousius commented Dec 21, 2023

> The benefits of using the updated Accelerate library are (1) performance (xref gh-24053 for benchmarks) and (2) smaller wheels because we're no longer vendoring OpenBLAS.

Hi @rgommers, is the smaller wheel really that important given the sizes we're talking about here? Asking because OpenBLAS builds a lot of different CPU variants for dynamic dispatch, and I could look into shrinking that based on the expected users of NumPy or on overlapping cores (https://github.com/OpenMathLib/OpenBLAS/blob/67779177b94f03493a44888f6a682065d5f59618/Makefile.system#L676-L700).

@rgommers (Member Author)

I'd say yes, that matters a lot, for multiple reasons:

  • NumPy is now downloaded over 200 million times a month from PyPI, and given our wheel size we are one of the packages that puts the highest load on PyPI
  • For end users the download will be proportionally faster, which matters when you're on a lower-bandwidth connection
  • The size reduction also helps reduce the size of an installed environment. Right now we're in the neighborhood of ~200 MB unpacked for a very basic stack (numpy/scipy/pandas/scikit-learn), and that adds up locally, plus for some deployments (think AWS Lambda & co) it may push it over the max limit

The performance gain from Accelerate is still more important than the size, I'd say.

> Asking because OpenBLAS builds a lot of different CPU variants for dynamic dispatch, and I could look into shrinking that based on the expected users of NumPy or on overlapping cores.

The OpenBLAS we vendor in wheels is built in https://github.com/MacPython/openblas-libs. I'm not sure how much we optimized the settings or if there's much to gain. Maybe @mattip knows?

@mattip (Member)

mattip commented Dec 23, 2023

We set baseline CPU targets in the build script and then OpenBLAS fills in any optional dynamic dispatch variants above those minima. I guess we could try increasing that minimum, although we do occasionally get complaints that even PRESCOTT is too advanced for x86.
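
For illustration, that corresponds roughly to an OpenBLAS invocation like the one below (a sketch with a hypothetical baseline; the actual openblas-libs build scripts set more options and may use different targets):

```sh
# Baseline CPU target plus runtime dispatch kernels for anything newer
make DYNAMIC_ARCH=1 TARGET=PRESCOTT
```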

@martin-frbg

Unless I'm missing something, you will not need DYNAMIC_ARCH for Mac/aarch64, as there's only the one target for M1 and newer (moot if you're switching to Accelerate for its proprietary AMX support). For the other architectures you can mix your own DYNAMIC_LIST subset of the CPUs you deem relevant.

@rgommers (Member Author)

rgommers commented Jan 31, 2024

Good news finally on the GitHub Actions front:

@rgommers (Member Author)

rgommers commented Mar 5, 2024

Okay, I think I've got this working. There was a fairly large hiccup: the macOS 14 image is only supported for arm64. So when trying the macos-14 image we get:

ERROR: numpy-2.0.0.dev0-cp312-cp312-macosx_14_0_x86_64.whl is not a supported wheel on this platform.

because the wheel has to target macOS 14.0 and hence cannot be tested on the platform on which it's built.

I considered setting MACOSX_DEPLOYMENT_TARGET to 13.3 instead and then manually renaming the wheel at the end - that would allow running the tests - but that's not very clean either, so I just skipped the cibuildwheel test phase instead. We do have regular CI with macos-13 and Accelerate, so this shouldn't be too much of a problem. @andyfaff if you prefer the other way, please let me know.
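
For reference, the chosen approach corresponds roughly to something like this in cibuildwheel terms (a sketch using the standard CIBW_* variables; the exact skip pattern is illustrative and the actual NumPy CI config may differ):

```sh
# Target macOS 14.0 but skip the cibuildwheel test phase for these wheels,
# since they cannot be installed on the runner that builds them.
export MACOSX_DEPLOYMENT_TARGET="14.0"
export CIBW_TEST_SKIP="*-macosx_x86_64"
```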

Here is a first successful run with a wheel that can be downloaded, if anyone wants to try on an Intel Mac: https://github.com/rgommers/numpy/actions/runs/8163109512

@andyfaff (Contributor)

andyfaff commented Mar 6, 2024

To be honest I hadn't even thought about making Accelerate-only wheels (macosx_14_0_x86_64) for Intel. I guess it makes sense. However:

  • there are going to be fewer and fewer Intel Macs
  • if there's going to be official support (i.e. it's on PyPI), then it would make sense to run some tests. I think the existing macos-13 infra is probably ok in lieu of testing the actual wheel. (If we only want to support Accelerate-only wheels on macOS >=14 then why are we doing regular CI with macos-13?)

Looking at your commits I don't think it's necessary to have a USE_ACCELERATE variable. One can just specify the INSTALL_OPENBLAS=false variable.

My plan for mac wheel building was to move the arm wheel build from Cirrus to GHA, but I was thinking of doing that after NumPy 2.0 was released. Do we want to go ahead with that transition now? The transition could include the kinds of changes you linked to.

@rgommers (Member Author)

rgommers commented Mar 6, 2024

> (If we only want to support Accelerate-only wheels on macOS >=14 then why are we doing regular CI with macos-13?)

Because it was the only option; I wanted testing of Accelerate and there was no macos-14 image. It's also fine for regular CI. The problem with wheels is a flaw in packaging/pip: you cannot target minor macOS versions. Since the new Accelerate version we need was introduced in macOS 13.3, we can use it in CI just fine but we cannot release macosx_13_3-tagged wheels because the _3 will be ignored.

> Looking at your commits I don't think it's necessary to have a USE_ACCELERATE variable. One can just specify the INSTALL_OPENBLAS=false variable.

Agreed. Those cibw_ bash scripts are a little painful to work with (or blame my bash-fu); I just took a shortcut to get things to work. Your way in gh-25945 looks nicer.

> My plan for mac wheel building was to move the arm wheel build from Cirrus to GHA, but I was thinking of doing that after NumPy 2.0 was released. Do we want to go ahead with that transition now? The transition could include the kinds of changes you linked to.

We may as well do that I think, if CI is green. It's a lot easier to retrigger nightly builds in GHA, or debug a single job on your own fork.

> there are going to be fewer and fewer Intel Macs

That is the key question, yes. I had always planned on doing this - although it would have been more clear-cut if GHA had macos-14 images for x86-64. MacBooks can last a long time, and wheels that are >2x smaller and up to 8x faster on some key linear algebra functions like svd are worth it I think.

In case it becomes a burden in the future, these wheels are still safe to drop in a later release. I think it'll be fine though. I expect the real burden for macOS x86-64 to show up once GHA drops native hardware support completely. According to the roadmap in actions/runner-images#9255 we're good for 2024, but will only have macos-13 left at the end of the year. So 2025 may be the year when x86-64 gets dropped, and then we're back to cross-compiling.

@martin-frbg

Wait, you're seeing an 8x speedup for linalg.svd using Accelerate on aging Intel Macs too, where the M-series AMX tile plays no role at all, and where I'm used to seeing OpenBLAS within about ten percent of MKL?

@rgommers (Member Author)

rgommers commented Mar 6, 2024

Ehhh no, apologies - that is my tired brain not working at all. The binary size reduction is of interest for x86-64; the performance isn't. We do not have new benchmark results that say otherwise at least. The x86-64 implementations also got upgraded in Accelerate, so they are available with NEWLAPACK symbols and should be more robust - but no indication that they are much faster. It's also not easy to find older benchmarks (things like https://blog.kiyohiroyabuuchi.com/en/c-cpp-en/comparison-between-mac-accelerate-framework-and-intel-mkl/ hint at "roughly on par with MKL and OpenBLAS").

The binary size gain is actually larger than 2x (e.g., from this CI run):

  • cp310-macosx_arm64-accelerate: 4.94 MB
  • cp310-macosx_x86_64-accelerate: 6.43 MB
  • cp310-macosx_x86_64-openblas: 19.9 MB

cp310-macosx_arm64-openblas from here weighs in at about 12.8 MB.

So it's about 3x for x86-64, and 2.6x for arm64.

I no longer have an Intel Mac, so no way to easily check performance.

@martin-frbg

Thanks for the clarification :)
Given that there never were Macs with AMD processors, or (AFAIK) with AVX-512, you could reduce the size of your library build by adding DYNAMIC_LIST="CORE2 NEHALEM SANDYBRIDGE HASWELL", removing the dedicated BLAS kernels for 10 irrelevant cpus (if you don't already do this). Probably no longer worth it for you, but I guess I should add this as the default for x86_64 builds with OSNAME=Darwin...
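
For example, such a restricted build could be invoked roughly like this (a sketch; the real openblas-libs scripts set more options than shown here):

```sh
# Build only the listed runtime-dispatch kernels (plus the generic fallback)
make DYNAMIC_ARCH=1 DYNAMIC_LIST="CORE2 NEHALEM SANDYBRIDGE HASWELL"
```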

@rgommers (Member Author)

rgommers commented Mar 6, 2024

With gh-25945 this is now complete - let's see how it goes 🤞🏼. Thanks all!

rgommers closed this as completed Mar 6, 2024
@rgommers (Member Author)

rgommers commented Mar 6, 2024

As a clarification for future readers: the order of auto-detection now is:

  1. scipy-openblas (our own wheel builds of OpenBLAS on PyPI)
  2. Accelerate (macOS only)
  3. MKL (x86-64 only)
  4. OpenBLAS
  5. FlexiBLAS
  6. BLIS
  7. reference BLAS

Wheels are still all built with OpenBLAS, except for macOS >=14.0 (Accelerate) and 32-bit Windows (no BLAS).
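
For anyone who wants to override the auto-detection when building from source, something along these lines should work (a sketch, assuming the -Dblas/-Dlapack meson options described in the NumPy building-from-source docs):

```sh
# Force Accelerate instead of relying on BLAS/LAPACK auto-detection
pip install numpy --no-binary numpy -Csetup-args=-Dblas=accelerate -Csetup-args=-Dlapack=accelerate
```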
