
query: Switch to cython_<lapack,blas> based wrappers and deprecate custom scipy.linalg.<lapack,blas> #20682

Open
ilayn opened this issue May 9, 2024 · 18 comments
Labels
Cython Issues with the internal Cython code base deprecated Items related to behavior that has been deprecated Fortran Items related to the internal Fortran code base query A question or suggestion that requires further information scipy.linalg

Comments

@ilayn
Member

ilayn commented May 9, 2024

Historically, we have gone through many stages of BLAS/LAPACK wrappers, and consequently we have a lot of home-baked wrappers around the formal BLAS/LAPACK routines.

This has actually allowed users to conveniently call native routines through f2py glue. While that is very valuable, the burden of carrying the historical inconsistencies has been quite high. Just to name a few:

  • Character switches are sometimes kept as-is, à la uplo='L', and sometimes converted into 0/1 flags, à la lower=1 (since bool is not really available).
  • The pivot arrays are sometimes converted at the wrapper with C injection, sometimes kept in Fortran (1-based) order.
  • lwork queries are sometimes included as a separate public function <routine>_lwork, but sometimes left to the user, who calls the same routine twice with the exposed lwork keyword set to -1.
  • Switching would also make handling the OpenBLAS linkage (or other libraries) a bit more ergonomic.
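To make these inconsistencies concrete, here is a short Python sketch against wrappers that exist in current SciPy (a new, uniform layer would have to normalize all three conventions; the exact return shapes of the `_lwork` helpers are as I understand today's f2py-generated signatures):

```python
import numpy as np
from scipy.linalg.lapack import dpotrf, dgeqrf, dgesdd_lwork

a = np.array([[4.0, 2.0],
              [2.0, 3.0]])

# 1) Character switch turned into an integer flag: lower=1 instead of uplo='L'.
c, info = dpotrf(a, lower=1)          # info == 0; c @ c.T reproduces a

# 2) Workspace query exposed as a separate public function ...
work, info = dgesdd_lwork(4, 4)
lwork_gesdd = int(work)

# 3) ... while other routines leave the query to the user via lwork=-1.
qr, tau, work, info = dgeqrf(a, lwork=-1)
lwork_geqrf = int(work[0])            # optimal workspace size reported by LAPACK
```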

I think we can switch to Cython-based linkage to get more automated wrapper generation and less strain on our testing framework. It would also make things more straightforward to compile.

On the downside, we currently let f2py handle the normalization of arguments, and this has to be equally robust in the Cythonized wrappers. Moreover, we will have to make sure that the F-contiguity handling and the <s,d,c,z> flavorizing happen properly.
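As an illustration of the flavorizing step, the existing public helper scipy.linalg.blas.find_best_blas_type already performs this dtype-to-prefix mapping, and a Cython-based layer would need an equivalent:

```python
import numpy as np
from scipy.linalg.blas import find_best_blas_type

# The returned prefix selects the <s,d,c,z> flavor of a routine.
prefix_d, dtype_d, prefer_fortran = find_best_blas_type((np.ones(3, dtype=np.float64),))
# 'd' selects daxpy, dgemm, ...

prefix_c, dtype_c, _ = find_best_blas_type((np.ones(3, dtype=np.complex64),))
# 'c' selects caxpy, cgemm, ...
```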

We have touched upon this many times over the years, but let's consolidate the discussions here (the last time we mentioned this was, I think, #13663 (comment)).

One particular difficulty is that if we want to solve the historical issues, we unfortunately have to get rid of all the wrappers and offer new ones under different names. I'm not necessarily a big fan of this, but to do it without breaking anybody's current code, this seems like the only option. It would probably also need more than two versions for the deprecation cycles.

Hence it would be the case that scipy.linalg.lapack and scipy.linalg.blas get deprecated and eventually go away, and new auto-generated submodules, whatever their names might be, take their place.

@ilayn ilayn added scipy.linalg Fortran Items related to the internal Fortran code base Cython Issues with the internal Cython code base RFC Request for Comments; typically used to gather feedback for a substantial change proposal labels May 9, 2024
@ev-br
Member

ev-br commented May 9, 2024

cross-ref #4516

@ilayn
Member Author

ilayn commented May 9, 2024

It just occurred to me that we can at least do this for our own internal use and leave the deprecation for later.

@lucascolley lucascolley changed the title RfC:Switch to cython_<lapack,blas> based wrappers and deprecate custom scipy.linalg.<lapack,blas> RFC: Switch to cython_<lapack,blas> based wrappers and deprecate custom scipy.linalg.<lapack,blas> May 9, 2024
@lucascolley lucascolley added the deprecated Items related to behavior that has been deprecated label May 9, 2024
@rgommers
Member

This proposal isn't very clear to me, I'm afraid. We should consider the public API we expose separately from what we do internally. For the public API:

  • scipy.linalg.blas and scipy.linalg.lapack are quite low-level indeed. If someone wants to write nicer wrappers, then sure why not.
    • Once those exist, we can consider if the old ones can be deprecated.
  • scipy.linalg.cython_blas/cython_lapack are LP64 only, and that cannot be changed without breaking backwards compatibility. We could add a second set for ILP64 builds perhaps, but that isn't done today and I haven't seen anyone planning to work on this.

For SciPy's internal usage:

  • We're using the f2py-based wrappers internally. Since we also want ILP64 support for SciPy itself, the f2py-based wrappers must be kept.
  • The machinery to generate the wrappers was improved a fair bit recently.

It just occurred to me that we can at least do this for our own internal use

I don't think so, because of what I wrote above.

@ilayn
Member Author

ilayn commented May 22, 2024

scipy.linalg.cython_blas/cython_lapack are LP64 only, and that cannot be changed without breaking backwards compatibility. We could add a second set for ILP64 builds perhaps, but that isn't done today and I haven't seen anyone planning to work on this.

I can do that at some point after the Fortran rewrites; that is not the problem. The actual problem is probably having two copies of OpenBLAS, or picking up what we have at compile time. How does the Cython set pick up ILP64 if the linked OpenBLAS is 32-bit?

@rgommers
Member

How does the Cython set pick up ILP64 if the linked OpenBLAS is 32-bit?

It doesn't, that was my point. Internally we can use ILP64 everywhere when the user builds for that, but the requirements for the Cython API are different.

having two copies of OpenBLAS

Yeah, we probably don't want to put that back. That was a really weird construction. We should have a unified build that provides both LP64 and ILP64 symbols. Currently MKL and Accelerate provide that. OpenBLAS could in the future, but not today.

@ilayn
Member Author

ilayn commented May 22, 2024

Ah OK, the Intel lingo for 32-bit integer indexing is LP64; I didn't pay attention. We need to do something about this at some point anyway. I'll try something when I can spare the time.

@chillenb

If you deprecate the F2PY wrappers, would it be possible to expose the Cython ones as Python functions? Something along these lines.. The ability to call BLAS from Python is really nice, especially ?trsm, which is faster than scipy.linalg.solve_triangular (?trtrs) when B is row-major and nrhs is large. Also ?axpy, which is much faster than single-threaded addition of ndarrays. If these wrappers are Cython-only, then they cannot be used interactively, and packages that use them will need to add a Cython dependency.
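For reference, the Python-level BLAS call being discussed looks like this with today's f2py wrappers (daxpy computes a*x + y; with a contiguous float64 y, the wrapper typically updates y in place):

```python
import numpy as np
from scipy.linalg.blas import daxpy

x = np.arange(4.0)        # [0., 1., 2., 3.]
y = np.ones(4)

z = daxpy(x, y, a=2.0)    # z = 2*x + y
```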

@rgommers
Member

If you deprecate the F2PY wrappers,

I don't think we will; at least I haven't seen a reason for that yet. Either way, we would only deprecate the current wrappers once a new and improved API for low-level access to BLAS/LAPACK functions is available to switch to.

@ilayn I'm all for improving our BLAS/LAPACK wrapping; however, it requires some prototyping and a very crisp and concise proposal, after laying out all the constraints that the new situation should meet. This RFC as written should probably be closed, because it is fairly incomplete and, as written, it cannot be executed, due to the points I raised above.

Maybe let's use this one as a discussion thread, and at some point open a new one which is complete?

@ilayn
Member Author

ilayn commented May 26, 2024

If you deprecate the F2PY wrappers,

If we are going to deprecate wrappers, we are not going to remove them without alternatives. The problem with the existing ones is historical inconsistency, and quite a lot of it. There is hardly any consistency across the wrappers; every one of them feels unique. Some expose the m, n array sizes, some hide them, and so on, with many weird quirks as I mentioned above. So we will deprecate only when we have a proper alternative, which is what this issue is about.

Also ?axpy, which is much faster than single-threaded addition of ndarrays.

That's very unlikely to beat NumPy if the same optimizations are given at build time to both.

@ilayn
Member Author

ilayn commented May 26, 2024

This RFC as written should probably be closed, because it is fairly incomplete and as written it cannot be executed, due to the points I raised above.

Not sure what you mean by this. This is an RfC, not a PR. And as I am trying to make a case, there are ways to do this, including 64-bit Cython wrappers. And there are many similar open discussions, as linked above. So what is missing from the RfC? Maybe you just didn't like it?

We just have not chosen which way to go yet. But I don't agree that backwards compatibility is an issue here, because a break will be needed at some point anyway, due to the const-addition saga and many other things waiting for similar handling. The NumPy 2.0 changes were much more drastic than this; this is not even close to that scale. And we can easily do the big changes in the other downstream packages.

@rgommers
Member

An RFC is usually a detailed proposal about what to do. This is more a "hey let's get rid of the f2py wrappers" without (so far) a concrete way of achieving that, nor laying out all the constraints/requirements. So I just really don't know what to do with this at the moment beyond the input I have already given in my first comment - it's too vague. All I can say is "sure, I'd love better/simpler wrappers and less duplication", but that's about it. To me, that is not a very useful open issue.

Let's turn this around: what do you actually want as input from other maintainers here? Or what are the next steps?

@chillenb

chillenb commented May 26, 2024

Thanks everybody for the clarifications! I'm glad to hear that the BLAS wrappers are going to be kept in the public API, at least in some form.

That said, I'd like to push back against the following statement a little:

Also ?axpy, which is much faster than single-threaded addition of ndarrays.

That's very unlikely to beat numpy if same optimizations are given at build time to both.

The multithreading does make a difference! These days, single-core memory bandwidth is often a fraction of the total. This gets worse if there is NUMA.

Here is a benchmark on a machine with 2x Intel Xeon 6252N @2.6 GHz. NumPy was compiled with -march=native -mtune=native and BLAS is OpenBLAS from conda-forge:

In [2]: X=np.random.random(1000**3)

In [3]: Y=np.random.random(1000**3)

In [4]: %timeit Z=X+Y
3.15 s ± 3.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: from scipy.linalg.blas import daxpy

In [7]: %timeit daxpy(X,Y)
656 ms ± 27.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit np.add(X,Y,out=Y)
1.25 s ± 358 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

With Intel Distribution for Python, even np.add is multithreaded. It's a pretty significant boost:

In [5]: %timeit Z=X+Y
339 ms ± 8.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit daxpy(X,Y)
413 ms ± 53.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit np.add(X,Y,out=Y)
209 ms ± 49.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@ilayn
Member Author

ilayn commented May 29, 2024

@chillenb That is because the default OpenBLAS is single-threaded, so it is true that it would make a difference indeed. What I meant to say is that if you also use a BLAS/LAPACK with threading enabled, NumPy does the same thing. The Intel distro has multithreaded MKL, hence the boost. At some point I think NumPy uses daxpy anyway.

Could you please also try with the in-place operator Y += X?

@ilayn
Member Author

ilayn commented May 29, 2024

Let's turn this around: what do you actually want as input from other maintainers here? Or what are the next steps?

Would it be OK if I remove the RfC label and put query instead? I want to discuss the very next steps, which are currently scattered across different issues. I don't think this is an impossible problem at all, but it needs coordination and will break things. However, we have been delaying these things for a very long time now, leaving a lot of performance on the table through the f2py glue.

Spending a lot of time on a possible solution and then realizing that we can't do it serves no one and wastes limited time. Your comments are quite valuable in that regard, but I don't think they pose an immovable block so far.

@chillenb

@chillenb That is because the default OpenBLAS is single-threaded, so it is true that it would make a difference indeed. What I meant to say is that if you also use a BLAS/LAPACK with threading enabled, NumPy does the same thing. The Intel distro has multithreaded MKL, hence the boost. At some point I think NumPy uses daxpy anyway.

Could you please also try with the in-place operator Y += X?

Sure thing. Here is what I found:

# incY() does Y += X

In [14]: %timeit incY()
1.13 s ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

To get more data I also tried conda-forge NumPy, with libblas=MKL, and found this:

In [10]: %timeit scipy.linalg.blas.daxpy(X,Y)
539 ms ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit np.add(X,Y,out=Y)
1.22 s ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [19]: %timeit incY()
1.29 s ± 150 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Both OpenBLAS and MKL used 100% of the CPUs, but np.add(X, Y, out=Y) and Y += X were both single-threaded. This suggests np.add did not touch BLAS at all.

To be sure of this, I also checked MKL's verbose mode:

>>> X=np.random.random(1024**2)
>>> Y=np.random.random(1024**2)
>>> np.add(X,Y,out=Y)
array([1.34086246, 0.20451872, 0.92893604, ..., 0.51186964, 0.59766295,
       0.39522988])
>>> scipy.linalg.blas.daxpy(X,Y)
MKL_VERBOSE DAXPY(1048576,0x7ffc0562e8b8,0x7dfe5d200010,1,0x7dfe5c800010,1) 1.26ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:48
array([1.97819284, 0.20506346, 1.21871963, ..., 0.72971799, 0.8775876 ,
       0.61443089])
>>> 

@rgommers
Member

That's all correct. OpenBLAS defaults to multi-threading just like MKL (some details depend on build settings, but there is no single-threaded default build in use anywhere important AFAIK), and np.add doesn't use BLAS.

Would that be OK if I remove RfC and put query instead?

Yes, that would be better.

Your comments are quite valuable in that regard but I don't think they pose an immovable block so far.

I'll point out another big can of worms: symbol names. We use Fortran symbol names internally in Fortran code, and have adaptors for C/C++ code to bridge the C/Fortran symbol-naming conventions (stuff like NO_APPEND_FORTRAN and use_g77_abi). It would be very complex to invert this by making C symbol names canonical and adapting names on the Fortran side; it switches one set of (solved) problems for a new set of (unsolved) problems. I think this alone is a blocker until the relevant Fortran code is gone (xref gh-18566).

@lucascolley lucascolley added query A question or suggestion that requires further information and removed RFC Request for Comments; typically used to gather feedback for a substantial change proposal labels May 29, 2024
@lucascolley lucascolley changed the title RFC: Switch to cython_<lapack,blas> based wrappers and deprecate custom scipy.linalg.<lapack,blas> query: Switch to cython_<lapack,blas> based wrappers and deprecate custom scipy.linalg.<lapack,blas> May 29, 2024
@ilayn
Member Author

ilayn commented May 29, 2024

I think this alone is a blocker until the relevant Fortran code is gone (xref #18566).

Yes, I'm working on it, hang in there 😅 I know it is really doable in one sitting for many libs, but work is crazy these days. I'm already a bit bummed about missing 1.14.

But the LAPACK wrappers are part of #18566, just saying. This is hence pre-work for that.

@rgommers
Member

But the LAPACK wrappers are part of #18566, just saying. This is hence pre-work for that.

It should be the last part of it. If nothing is left except the wrappers themselves, then it is possible to touch them without having to deal with the Fortran side of symbol naming.

The note on "BLAS/LAPACK demuxing: FlexiBLAS, libblastrampoline & SciPy's cython_blas/cython_lapack" at https://pypackaging-native.github.io/key-issues/native-dependencies/blas_openmp/ is very relevant here.
