Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAINT, BUG: bump OpenBLAS #20362

Merged

Conversation

tylerjereddy
Copy link
Contributor

@tylerjereddy tylerjereddy commented Mar 30, 2024

  [210/1479] Linking target scipy/special/_ellip_harm_2.cpython-311-x86_64-linux-gnu.so
  FAILED: scipy/special/_ellip_harm_2.cpython-311-x86_64-linux-gnu.so
  cc  -o scipy/special/_ellip_harm_2.cpython-311-x86_64-linux-gnu.so scipy/special/_ellip_harm_2.cpython-311-x86_64-linux-gnu.so.p/meson-generated__ellip_harm_2.c.o scipy/special/_ellip_harm_2.cpython-311-x86_64-linux-gnu.so.p/sf_error.c.o -Wl,--as-needed -Wl,--allow-shlib-undefined -Wl,-O1 -shared -fPIC -Wl,--start-group -lm -Wl,--version-script=/project/scipy/_build_utils/link-version-pyinit.map -L/usr/local/lib '-l$(libprefix}openblas' -Wl,--end-group
  /opt/rh/devtoolset-10/root/usr/libexec/gcc/x86_64-redhat-linux/10/ld: cannot find -l$(libprefix}openblas
  collect2: error: ld returned 1 exit status

[skip circle]

* Fixes scipy#20294.

* Pull in new OpenBLAS binaries as discussed in the above issue,
to solve a bug that affects downstream consumers like `scikit-learn`.
The OpenBLAS binary builds happened via MacPython/openblas-libs#149
with CI results at links below; the single s390x arch failure should
be safe to ignore, we don't distribute binaries for that arch:
https://app.travis-ci.com/github/MacPython/openblas-libs/builds/269741262
https://github.com/MacPython/openblas-libs/actions/runs/8493017997
https://github.com/MacPython/openblas-libs/actions/runs/8493018404

* I had an issue when testing locally with
`CIBW_BUILD=cp311-* cibuildwheel --platform linux --archs x86_64`.
Let's see if the regular CI catches anything before I try
a full blown wheel build here, hopefully that's just
some local issue I have.

```
  [210/1479] Linking target scipy/special/_ellip_harm_2.cpython-311-x86_64-linux-gnu.so
  FAILED: scipy/special/_ellip_harm_2.cpython-311-x86_64-linux-gnu.so
  cc  -o scipy/special/_ellip_harm_2.cpython-311-x86_64-linux-gnu.so scipy/special/_ellip_harm_2.cpython-311-x86_64-linux-gnu.so.p/meson-generated__ellip_harm_2.c.o scipy/special/_ellip_harm_2.cpython-311-x86_64-linux-gnu.so.p/sf_error.c.o -Wl,--as-needed -Wl,--allow-shlib-undefined -Wl,-O1 -shared -fPIC -Wl,--start-group -lm -Wl,--version-script=/project/scipy/_build_utils/link-version-pyinit.map -L/usr/local/lib '-l$(libprefix}openblas' -Wl,--end-group
  /opt/rh/devtoolset-10/root/usr/libexec/gcc/x86_64-redhat-linux/10/ld: cannot find -l$(libprefix}openblas
  collect2: error: ld returned 1 exit status
```

[skip circle]
@tylerjereddy tylerjereddy added maintenance Items related to regular maintenance tasks backport-candidate This fix should be ported by a maintainer to previous SciPy versions. labels Mar 30, 2024
@tylerjereddy tylerjereddy added this to the 1.14.0 milestone Mar 30, 2024
@github-actions github-actions bot added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Mar 30, 2024
@tylerjereddy
Copy link
Contributor Author

Ok, the CI replicates my failure, maybe the hash size/format on the new strings then.

@tylerjereddy
Copy link
Contributor Author

Ah, the shared lib naming scheme probably just changed a bit.

@tylerjereddy
Copy link
Contributor Author

Current layout from the tools/openblas_support.py download in /tmp/openblas/lib locally on this branch:

cmake  libopenblas.a  libopenblasp-r0.3.26.dev.a  libopenblasp-r0.3.26.dev.so  libopenblas.so  libopenblas.so.0  pkgconfig

Last known working layout from a non-stable release:

cmake  libopenblas.a  libopenblasp-r0.3.21.dev.a  libopenblasp-r0.3.21.dev.so  libopenblas.so  libopenblas.so.0  pkgconfig

I confirmed that reverting to the original versions/hashes before gh-20215 allows cibuildwheel to get past the linking problem. So, will need to dig around a bit to debug that I think, probably upstream somewhere.

@tylerjereddy
Copy link
Contributor Author

tylerjereddy commented Mar 30, 2024

On Windows the error looks like:

c:/rtools40/ucrt64/bin/../lib/gcc/x86_64-w64-mingw32/10.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: cannot find -l$(libprefix}openblas_v0.3.26-382-gb1e8ba50-

On local Linux box:

/opt/rh/devtoolset-10/root/usr/libexec/gcc/x86_64-redhat-linux/10/ld: cannot find -l$(libprefix}openblas

Of course git grep -E -i "openblas_support" shows various calls for each wheel build platform.

There are quite a few harsh/version/searching related commits at: https://github.com/MacPython/openblas-libs/commits/scipy/

@tylerjereddy
Copy link
Contributor Author

On the assumption that the linker stuff gets propagated by pkg-config/meson somehow I compared the Libs: lines in .pc files for a few different versions of the OpenBLAS binary we use (I renamed the files myself for clarity):

openblas_1_12_0.pc:Libs: -L${libdir} -lopenblas${libsuffix}
openblas_1_13_rc1.pc:Libs: -L${libdir} -lopenblas${libsuffix}
openblas_new.pc:Libs: -L${libdir} -l$(libprefix}openblas${libnamesuffix}

Probably not a coincidence that there's a change in substitution pattern at the first observed failure (last line above/version used in this branch).

There are some suffix changes recently upstream here:
MacPython/openblas-libs@9c64109

@larsoner
Copy link
Member

larsoner commented Mar 31, 2024

Maybe a stupid question but in your earlier comment is it really correct to have cannot find -l$(libprefix}openblas with a $( money-open-paren to start and a } close-brace to end? That seems like a typo but maybe that's just some gcc naivete from my end in which case feel free to ignore!

@tylerjereddy
Copy link
Contributor Author

I think you may be right. The smoke seems to be coming from the pkg-config *.pc file and there are at least two things that have changed:

  1. The general substitution pattern.
  2. A likely-invalid syntax usage.

tylerjereddy added a commit to MacPython/openblas-libs that referenced this pull request Mar 31, 2024
* attempt to deal with this issue in `*.pc`
`Libs` formatting:
scipy/scipy#20362 (comment)

* this initial attempt aims to effectively revert
back to the original `Libs:` line format that
was last functional in SciPy wheel builds
@tylerjereddy
Copy link
Contributor Author

I've initiated an initial attempt to rebuild the OpenBLAS binaries upstream with this patch: MacPython/openblas-libs@082affc

Wheel builds are at:

This assumes that the uploads will be able to overwrite the previous binaries.

tylerjereddy added a commit to tylerjereddy/scipy that referenced this pull request Mar 31, 2024
* `openblas_support` module now contains a shim to change
the pkg-config `openblas.pc` `Libs:` line back to its previous
formatting in attempt to do deal with:
scipy#20362 (comment)

[skip circle]
@tylerjereddy
Copy link
Contributor Author

Although my attempt to fix this upstream failed miserably (the .pc files were completely unchanged), adding a shim to our own openblas_support.py to restore the old formatting of openblas.pc was successful for me locally on x86_64 Linux with cibuildwheel testing.

So, I'll flush the CI to see if the Windows version is also happy here, and if so then I'll try full blown wheel builds here after that. Perhaps the increasingly-ugly openblas_support.py shims can be forgiven given our impending migration away from that (gh-20074) to the wheels infrastructure anyway.

tylerjereddy added a commit to tylerjereddy/scipy that referenced this pull request Mar 31, 2024
* `openblas_support` module now contains a shim to change
the pkg-config `openblas.pc` `Libs:` line to avoid a typo
in `libprefix` specification in attempt to do deal with:
scipy#20362 (comment)

[skip circle]
@tylerjereddy tylerjereddy force-pushed the treddy_bump_openblas_issue_20294 branch from c93f00e to 8511570 Compare March 31, 2024 19:41
tylerjereddy added a commit to tylerjereddy/scipy that referenced this pull request Mar 31, 2024
* `openblas_support` module now contains a shim to change
the pkg-config `openblas.pc` `Libs:` line to avoid a typo
in `libprefix` specification in attempt to do deal with:
scipy#20362 (comment)

[skip circle]
@tylerjereddy tylerjereddy force-pushed the treddy_bump_openblas_issue_20294 branch from 8511570 to aa22e80 Compare March 31, 2024 20:14
* `openblas_support` module and `cibw_before_build_win.sh` now contain
shims to change the pkg-config `openblas.pc` `Libs:` line to avoid a typo
in `libprefix` specification in attempt to do deal with:
scipy#20362 (comment)

[skip circle] [skip cirrus]
@tylerjereddy tylerjereddy force-pushed the treddy_bump_openblas_issue_20294 branch from aa22e80 to 408681a Compare March 31, 2024 21:21
* shims needed for pkg-config file modification
on Windows

[skip circle] [skip cirrus]
@tylerjereddy
Copy link
Contributor Author

The only failing CI check here now is unrelated (gh-20365). So, I'll push in an empty commit to probe the wheel builds proper with the new OpenBLAS + shims in this branch.

* empty commit to test wheel builds

[wheel build]
@tylerjereddy
Copy link
Contributor Author

tylerjereddy commented Apr 1, 2024

Summarizing current status here, after solving some of the initial problems with shims in this PR branch:

  1. 2 CI failures are coming from MacOS arm64 wheel builds, which fail for Python 3.10 and 3.11 only. I note that MacOS M1 GHA runners are still in beta, and the failures do not appear to be related to my shims here at first glance. In short, pip install pipx fails with No such file or directory: '/usr/local/opt/pipx/venvs'. Something looks weird here, in the cp311 matrix entry, I see a 3.12 version of Python: /Users/runner/hostedtoolcache/Python/3.12.2/arm64/bin/python" -m pip install pipx. Same for the 3.10 failure. Python 3.9 and 3.12 (which both pass), use Python 3.11.8 to do the initial pipx install, so maybe the mismatch isn't a problem upfront before the the real cibuildwheel work starts, but the inconsistency is still a bit weird to read.
  2. All 4 musllinux x86_64 wheel build jobs fail at python tools/openblas_support.py because their download target URLs don't seem to exist. The 3.9 example is: Could not download "https://anaconda.org/multibuild-wheels-staging/openblas-libs/v0.3.26-382-gb1e8ba50/download/openblas-v0.3.26-382-gb1e8ba50-musllinux_1_1_x86_64.tar.gz". The last time this job succeeded, for 1.13.0rc1, the URL was: https://anaconda.org/multibuild-wheels-staging/openblas-libs/v0.3.26/download/openblas-v0.3.26-musllinux_1_1_x86_64.tar.gz. I'm not sure what's going on with ILP64 stuff vs. other prefix/postfix naming work, but the closest I see at the moment is: https://anaconda.org/multibuild-wheels-staging/openblas-libs/v0.3.26-382-gb1e8ba50/download/openblas64_-v0.3.26-382-gb1e8ba50-musllinux_1_1_x86_64.tar.gz. I suppose we could add the trailing 64_ here if we're comfortable with that and it doesn't mean something else that we don't want?

Certainly it is clear why I don't like to frequently try bumping OpenBLAS, I guess some upstream changes have already made things trickier between the fairly recent last bump for the non-wheel infra.

@rgommers
Copy link
Member

rgommers commented Apr 1, 2024

2. I suppose we could add the trailing 64_ here if we're comfortable with that and it doesn't mean something else that we don't want?

No, those are the ILP64 builds, they're not usable for SciPy. We need the ones without 64 in the name.

@rgommers
Copy link
Member

rgommers commented Apr 1, 2024

The musllinux problem is because the builds weren't uploaded (see CI log) due to:

 Error:  ('Service Unavailable: The server cannot process the request due to a high load ([GET] https://api.anaconda.org/package/multibuild-wheels-staging/openblas-libs -> 503)', 503)

Just rerunning the failed builds on that job should fix the problem. (side note: I don't have permissions to do that, could I be added @mattip?)

@rgommers
Copy link
Member

rgommers commented Apr 1, 2024

The macOS arm64 failures look like a cibuildwheel problem that depends on the runner image version:

  • Jobs with macOS 14.2.1 and runner image version 20240116.1 (which has Python 3.11.8) pass
  • Jobs with macOS 14.4 and runner image version 20240325.1 (which has Python 3.12.2) fail

The runners get picked at random, a re-run made one of the two jobs pass. The second one I retriggered three times, but no luck getting the older runner image.

@rgommers
Copy link
Member

rgommers commented Apr 1, 2024

@tylerjereddy I bet an upgrade to cibuildwheel 2.17.0 will fix the macOS issue. We're still using 2.16.5, and that only has preliminary support, 2.17.0 has official support and contains a bunch of fixes: https://github.com/pypa/cibuildwheel#v2170

@tylerjereddy
Copy link
Contributor Author

The musllinux problem is because the builds weren't uploaded (see CI log) due to:

Ah, I ignored that because the job had already passed in the previous iteration, but now I see not only did it fail to upload, it also deleted the correct asset before it did:

Warning: Distribution "openblas-v0.3.26-382-gb1e8ba50-musllinux_1_1_x86_64.tar.gz" already exists. Removing.

@tylerjereddy
Copy link
Contributor Author

Just rerunning the failed builds on that job should fix the problem. (side note: I don't have permissions to do that, could I be added @mattip?)

@rgommers Thanks for the suggestions, I added you as an admin on that repo, so that should do the trick. Once the job that failed to upload completes I'll push a cibuildwheel version bump and flush wheel builds again here.

* Based on reviewer feedback, bump the `cibuildwheel` version
in attempt to deal with GitHub actions MacOS ARM wheel build
flakiness.

* Also fixed an upstream OpenBLAS build/CI issue that prevented
a binary from being uploaded for our consumption in our wheel
builds.

[wheel build]
@tylerjereddy
Copy link
Contributor Author

Upstream CI upload seems fixed now, and I'm retrying the wheel builds with the latest cibuildwheel stable release to see if things are green yet.

@tylerjereddy
Copy link
Contributor Author

I think the one Cirrus Linux ARM job just timed out at the 1 hour mark exactly, and I believe we've seen that before on occasion and is not related to any changes here. The rest of the CI/wheel builds are green so I may proceed with squash-merging.

@tylerjereddy
Copy link
Contributor Author

The shims in openblas_support are a bit hacky, but seem to get the job done for now.

@tylerjereddy tylerjereddy merged commit 3fa9450 into scipy:main Apr 1, 2024
48 of 50 checks passed
@tylerjereddy tylerjereddy deleted the treddy_bump_openblas_issue_20294 branch April 1, 2024 19:42
@tylerjereddy tylerjereddy modified the milestones: 1.14.0, 1.13.0 Apr 1, 2024
tylerjereddy added a commit to tylerjereddy/scipy that referenced this pull request Apr 1, 2024
* Fixes scipy#20294.

* Pull in new OpenBLAS binaries as discussed in the above issue,
to solve a bug that affects downstream consumers like `scikit-learn`.

* Some additional shims were needed: bumping `cibuildwheel` to
improve GitHub actions MacOS M1 runner support, and `openblas_support`
module needed an adjustment to process out an apparent syntax
error in the pkg-config file for openblas that we pull in from upstream.
@tylerjereddy tylerjereddy removed the backport-candidate This fix should be ported by a maintainer to previous SciPy versions. label May 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect A clear bug or issue that prevents SciPy from being installed or used as expected maintenance Items related to regular maintenance tasks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Hang on Windows in scikit-learn with 1.13rc1 and 1.14.dev (maybe due to OpenBLAS 0.3.26 update?)
3 participants