
Use "tensorised-Pauli decomposition" in SparsePauliOp.from_operator #11133

Merged
merged 9 commits into Qiskit:main Jan 12, 2024

Conversation

jakelishman
Member

@jakelishman jakelishman commented Oct 27, 2023

Summary

This switches the implementation of SparsePauliOp.from_operator to the "tensorised-Pauli decomposition" described in https://arxiv.org/abs/2310.13421.

This initial implementation is a relatively naive implementation of the algorithm as described in the paper, with a couple of engineering improvements over the paper's accompaniment at HANTLUK/PauliDecomposition@932ce3926 (i.e. constant-factor improvements, rather than complexity improvements). It makes the "zero check" on the recursions short-circuiting, which means rather fewer matrix elements need to be examined on average after each recursive step. Further, the constant factor is tracked separately as a single scalar throughout the recursion to avoid a full-matrix complex multiplication at every recursive step (each complex multiplication is approximately six real floating-point operations).

There is plenty of space in this implementation to reduce internal memory allocations by reusing swap space, and to dispatch the different components of the decomposition to threaded workers. Effective threading is less trivial, as a typical operator will have structure that could be exploited to dispatch the threads more optimally. These can be investigated later.
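For reference (this is a sketch, not the Rust code added by this PR), the recursion described above might look roughly like the following in Python. The helper name tpd_sketch is made up for illustration; the per-level factor of 1/2 and the factor of i on the Y branch are folded into the single scalar coefficient, and the zero check here uses NumPy's non-short-circuiting ndarray.any where the Rust version short-circuits.

import numpy as np


def tpd_sketch(matrix, tol=1e-12):
    """Illustrative tensorised-Pauli decomposition; returns {Pauli label: coefficient}."""
    terms = {}

    def recurse(block, label, coeff):
        # Zero check; ``ndarray.any`` scans the whole block, whereas the Rust
        # implementation stops at the first non-zero element.
        if not block.any():
            return
        if block.shape == (1, 1):
            terms[label] = coeff * block[0, 0]
            return
        half = block.shape[0] // 2
        a00, a01 = block[:half, :half], block[:half, half:]
        a10, a11 = block[half:, :half], block[half:, half:]
        # Only additions/subtractions of sub-blocks; the 1/2 and the i are
        # carried in the scalar ``coeff`` instead of multiplying the matrices.
        recurse(a00 + a11, label + "I", coeff * 0.5)
        recurse(a01 + a10, label + "X", coeff * 0.5)
        recurse(a01 - a10, label + "Y", coeff * 0.5j)
        recurse(a00 - a11, label + "Z", coeff * 0.5)

    recurse(np.asarray(matrix, dtype=complex), "", 1.0 + 0.0j)
    return {label: c for label, c in terms.items() if abs(c) > tol}

For example, tpd_sketch(np.array([[1, 0], [0, -1]])) returns {"Z": (1+0j)}.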

Details and comments

Timing using the script (IPython because I'm lazy about doing manual timeit):

import numpy as np
from qiskit.quantum_info import SparsePauliOp, Operator
from TensorizedPauliDecomposition import PauliDecomposition


def random_op(num_qubits, rng):
    size = 1 << num_qubits
    out = np.empty((size, size), dtype=complex)
    out.real = rng.random((size, size))
    out.imag = rng.random((size, size))
    return Operator(out)


ns = list(range(1, 11))

qiskit = []
rng = np.random.default_rng(0)
for n in ns:
    op = random_op(n, rng)
    result = %timeit -o SparsePauliOp.from_operator(op)
    qiskit.append(result.best)

paper = []
rng = np.random.default_rng(0)
for n in ns:
    op = random_op(n, rng).data
    result = %timeit -o PauliDecomposition(op)
    paper.append(result.best)

I got these results:

[Screenshot, 2023-10-27: plot of the timing results for this PR vs the paper's implementation]

where in the num_qubits=10 case, this PR's time is 217ms vs the paper's 13.2s. Some of that will be Rust vs Python, but my version also allocates fewer matrices and has drastically smaller constant factors on each recursive step by avoiding matrix-long complex multiplications. This means that this PR's version does $2 \times 4^{\text{num qubits}}$ floating-point additions/subtractions for the recursive step at size num_qubits, whereas the paper does that plus $4 \times 4^{\text{num qubits}}$ complex multiplications, where each complex multiplication is approximately 6 real floating-point operations in non-SIMD code, though Numpy will have that optimised via its runtime-CPU-feature dispatch to be much more efficient.
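As a quick sanity check of those counts (just evaluating the expressions above, not a benchmark):

n = 10
additions = 2 * 4**n         # additions/subtractions, per the expression above
complex_muls = 4 * 4**n      # the extra complex multiplications in the paper's version
print(additions)             # 2097152
print(complex_muls * 6)      # 25165824 real floating-point operations at ~6 per multiply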

edit: I made the minor modifications needed to avoid the extra complex-number multiplications, and it looks like the non-short-circuiting np.any and the Python-space overhead dominate the time of the paper's Python implementation, though the complex multiplication is still very visible. I didn't go far down that path, though.

I intend to go further in a follow-up PR to reduce the number of matrix allocations that are done in Rust space by re-using the same scratch space all the way down, and potentially avoiding using the helper Pauli enum.

The algorithm is trivially naively threadable (dispatch the four top-level recursive calls to a thread each), but only when discounting the structure of real operators, which means the top level is unlikely to always split neatly into four, leaving quite unbalanced threads. Taking efficient advantage of that structure in the threading is a little less trivial.
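Just to illustrate the naive dispatch shape (a Python sketch only; the real implementation would thread in Rust, and np.count_nonzero here merely stands in for "decompose this branch"):

from concurrent.futures import ThreadPoolExecutor

import numpy as np


def top_level_branches(matrix):
    # The four independent sub-problems produced by the first recursion level.
    half = matrix.shape[0] // 2
    a00, a01 = matrix[:half, :half], matrix[:half, half:]
    a10, a11 = matrix[half:, :half], matrix[half:, half:]
    return {"I": a00 + a11, "X": a01 + a10, "Y": a01 - a10, "Z": a00 - a11}


rng = np.random.default_rng(0)
op = rng.random((16, 16)) + 1j * rng.random((16, 16))
# One worker per top-level branch.  For a structured (e.g. real) operator some
# branches can be entirely zero, which is exactly the imbalance described above.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {label: pool.submit(np.count_nonzero, block)
               for label, block in top_level_branches(op).items()}
    print({label: future.result() for label, future in futures.items()})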

@jakelishman jakelishman added the performance, Changelog: New Feature, mod: quantum info, and Rust labels Oct 27, 2023
@jakelishman jakelishman added this to the 1.0.0 milestone Oct 27, 2023
@jakelishman jakelishman requested review from ikkoham and a team as code owners October 27, 2023 15:53
@qiskit-bot
Collaborator

One or more of the following people are requested to review this:

Comment on lines -134 to +135
- if isinstance(coeffs, np.ndarray) and coeffs.dtype == object:
-     dtype = object
+ if isinstance(coeffs, np.ndarray):
+     dtype = object if coeffs.dtype == object else complex
Member Author

This rearrangement prevents an unnecessary Python-space iteration over coeffs when it's given as a Numpy array that's not an object array; we don't need to check each element individually but can let Numpy perform the cast (if needed) in C space.
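As a rough standalone illustration of that point (a hypothetical helper, not the actual class code):

import numpy as np


def coerce_coeffs(coeffs: np.ndarray) -> np.ndarray:
    # Object arrays keep dtype=object; everything else is cast to complex by
    # NumPy in C space, with no per-element check in Python.
    dtype = object if coeffs.dtype == object else complex
    return np.asarray(coeffs, dtype=dtype)


print(coerce_coeffs(np.arange(4.0)).dtype)                     # complex128
print(coerce_coeffs(np.array([1, None], dtype=object)).dtype)  # object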

@coveralls

coveralls commented Oct 27, 2023

Pull Request Test Coverage Report for Build 7507348255

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+1.8%) to 89.34%

Totals (Coverage Status):
  • Change from base Build 7198837138: +1.8%
  • Covered Lines: 59662
  • Relevant Lines: 66781

💛 - Coveralls

@jakelishman
Member Author

jakelishman commented Oct 27, 2023

Very casual flame-graphing suggests this implementation is bottlenecked on its very inefficient allocations, which is pretty much exactly what I'd expect. I'd prefer to rearrange everything into a scratch-reusing, potentially non-recursive version of the code in a follow-up, though, since I suspect that it will be rather harder to read.

Member

@mtreinish mtreinish left a comment

Overall this looks great; it's a big speedup and the algorithm is pretty straightforward. I just had a couple of questions and comments inline, nothing major though.

Review comments on crates/accelerate/src/sparse_pauli_op.rs (5 threads, all resolved; 2 outdated)
Member

@mtreinish mtreinish left a comment

LGTM, thanks for the updates

@mtreinish mtreinish added this pull request to the merge queue Jan 12, 2024
Merged via the queue into Qiskit:main with commit 52affa5 Jan 12, 2024
13 checks passed
@jakelishman jakelishman deleted the fast-pauli-decomposition branch January 12, 2024 22:54
ShellyGarion pushed a commit to ShellyGarion/qiskit-terra that referenced this pull request Jan 18, 2024
…Qiskit#11133)

* Use "tensorised-Pauli decomposition" in `SparsePauliOp.from_operator`

* Remove unused import

* Remove superfluous `num_qubits` argument

* Export `ZXPaulis` to Python space

* Add comment on array-builder logic

* Use `num_traits::Zero` trait for zero checking

* Use custom ilog2

* Use stdlib `ilog2`
Labels
Changelog: New Feature, mod: quantum info, performance, Rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

avoid typechecking coefficients individually in SparsePauliOp.__init__()
5 participants