
Use "tensorised-Pauli decomposition" in SparsePauliOp.from_operator #11133

Merged
merged 9 commits into Qiskit:main Jan 12, 2024

Conversation

jakelishman
Member

@jakelishman jakelishman commented Oct 27, 2023

Summary

This switches the implementation of SparsePauliOp.from_operator to the "tensorised-Pauli decomposition" described in https://arxiv.org/abs/2310.13421.

This initial implementation is a relatively naive implementation of the algorithm as described in the paper, with a couple of engineering improvements over the paper's accompaniment at HANTLUK/PauliDecomposition@932ce3926 (i.e. constant-factor improvements, rather than complexity improvements). It makes the "zero check" on the recursions short-circuiting, which means rather fewer matrix elements need to be examined on average after each recursive step. Further, the constant factor is tracked separately as a single scalar throughout the recursion to avoid a full-matrix complex multiplication at every recursive step (each complex multiplication is approximately six real floating-point operations).

There is plenty of space in this implementation to reduce internal memory allocations by reusing swap space, and to dispatch the different components of the decomposition to threaded workers. Effective threading is less trivial, as a typical operator will have structure that could be exploited to dispatch the threads more optimally. These can be investigated later.
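For reference (this is a sketch, not the Rust code added by this PR), the recursion described above might look roughly like the following in Python. The helper name tpd_sketch is made up for illustration; the per-level factor of 1/2 and the factor of i on the Y branch are folded into the single scalar coefficient, and the zero check here uses NumPy's non-short-circuiting ndarray.any where the Rust version short-circuits.

import numpy as np


def tpd_sketch(matrix, tol=1e-12):
    """Illustrative tensorised-Pauli decomposition; returns {Pauli label: coefficient}."""
    terms = {}

    def recurse(block, label, coeff):
        # Zero check; ``ndarray.any`` scans the whole block, whereas the Rust
        # implementation stops at the first non-zero element.
        if not block.any():
            return
        if block.shape == (1, 1):
            terms[label] = coeff * block[0, 0]
            return
        half = block.shape[0] // 2
        a00, a01 = block[:half, :half], block[:half, half:]
        a10, a11 = block[half:, :half], block[half:, half:]
        # Only additions/subtractions of sub-blocks; the 1/2 and the i are
        # carried in the scalar ``coeff`` instead of multiplying the matrices.
        recurse(a00 + a11, label + "I", coeff * 0.5)
        recurse(a01 + a10, label + "X", coeff * 0.5)
        recurse(a01 - a10, label + "Y", coeff * 0.5j)
        recurse(a00 - a11, label + "Z", coeff * 0.5)

    recurse(np.asarray(matrix, dtype=complex), "", 1.0 + 0.0j)
    return {label: c for label, c in terms.items() if abs(c) > tol}

For example, tpd_sketch(np.array([[1, 0], [0, -1]])) returns {"Z": (1+0j)}.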

Details and comments

Timing using the script (IPython because I'm lazy about doing manual timeit):

import numpy as np
from qiskit.quantum_info import SparsePauliOp, Operator
from TensorizedPauliDecomposition import PauliDecomposition


def random_op(num_qubits, rng):
    size = 1 << num_qubits
    out = np.empty((size, size), dtype=complex)
    out.real = rng.random((size, size))
    out.imag = rng.random((size, size))
    return Operator(out)


ns = list(range(1, 11))

qiskit = []
rng = np.random.default_rng(0)
for n in ns:
    op = random_op(n, rng)
    result = %timeit -o SparsePauliOp.from_operator(op)
    qiskit.append(result.best)

paper = []
rng = np.random.default_rng(0)
for n in ns:
    op = random_op(n, rng).data
    result = %timeit -o PauliDecomposition(op)
    paper.append(result.best)

I got these results:

[Screenshot, 2023-10-27: plot of the timing results for this PR vs the paper's implementation]

where in the num_qubits=10 case, this PR's time is 217ms vs the paper's 13.2s. Some of that will be Rust vs Python, but my version also allocates fewer matrices and has drastically smaller constant factors on each recursive step by avoiding matrix-long complex multiplications. This means that this PR's version does $2 \times 4^{\text{num qubits}}$ floating-point additions/subtractions for the recursive step at size num_qubits, whereas the paper does that plus $4 \times 4^{\text{num qubits}}$ complex multiplications, where each complex multiplication is approximately 6 real floating-point operations in non-SIMD code, though Numpy will have that optimised via its runtime-CPU-feature dispatch to be much more efficient.
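As a quick sanity check of those counts (just evaluating the expressions above, not a benchmark):

n = 10
additions = 2 * 4**n         # additions/subtractions, per the expression above
complex_muls = 4 * 4**n      # the extra complex multiplications in the paper's version
print(additions)             # 2097152
print(complex_muls * 6)      # 25165824 real floating-point operations at ~6 per multiply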

edit: I made the minor modifications needed to avoid the extra complex-number multiplications, and it looks like the non-short-circuiting np.any and the Python-space overhead dominate the time of the paper's Python implementation, though the complex multiplication is still very visible. I didn't go far down that path, though.

I intend to go further in a follow-up PR to reduce the number of matrix allocations that are done in Rust space by re-using the same scratch space all the way down, and potentially avoiding using the helper Pauli enum.

The algorithm is trivially naively threadable (dispatch the four top-level recursive calls to a thread each), but only when discounting the structure of real operators, which means the top level is unlikely to always split neatly into four, leaving quite unbalanced threads. Taking efficient advantage of that structure in the threading is a little less trivial.
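Just to illustrate the naive dispatch shape (a Python sketch only; the real implementation would thread in Rust, and np.count_nonzero here merely stands in for "decompose this branch"):

from concurrent.futures import ThreadPoolExecutor

import numpy as np


def top_level_branches(matrix):
    # The four independent sub-problems produced by the first recursion level.
    half = matrix.shape[0] // 2
    a00, a01 = matrix[:half, :half], matrix[:half, half:]
    a10, a11 = matrix[half:, :half], matrix[half:, half:]
    return {"I": a00 + a11, "X": a01 + a10, "Y": a01 - a10, "Z": a00 - a11}


rng = np.random.default_rng(0)
op = rng.random((16, 16)) + 1j * rng.random((16, 16))
# One worker per top-level branch.  For a structured (e.g. real) operator some
# branches can be entirely zero, which is exactly the imbalance described above.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {label: pool.submit(np.count_nonzero, block)
               for label, block in top_level_branches(op).items()}
    print({label: future.result() for label, future in futures.items()})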

@jakelishman jakelishman added the performance, Changelog: New Feature, mod: quantum info, and Rust labels Oct 27, 2023
@jakelishman jakelishman added this to the 1.0.0 milestone Oct 27, 2023
@jakelishman jakelishman requested review from ikkoham and a team as code owners October 27, 2023 15:53
@qiskit-bot
Collaborator

One or more of the following people are requested to review this:

Comment on lines -134 to +135
- if isinstance(coeffs, np.ndarray) and coeffs.dtype == object:
-     dtype = object
+ if isinstance(coeffs, np.ndarray):
+     dtype = object if coeffs.dtype == object else complex
Member Author

This rearrangement prevents an unnecessary Python-space iteration over coeffs when it's given as a Numpy array that's not an object array; we don't need to check each element individually but can let Numpy perform the cast (if needed) in C space.
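As a rough standalone illustration of that point (a hypothetical helper, not the actual class code):

import numpy as np


def coerce_coeffs(coeffs: np.ndarray) -> np.ndarray:
    # Object arrays keep dtype=object; everything else is cast to complex by
    # NumPy in C space, with no per-element check in Python.
    dtype = object if coeffs.dtype == object else complex
    return np.asarray(coeffs, dtype=dtype)


print(coerce_coeffs(np.arange(4.0)).dtype)                     # complex128
print(coerce_coeffs(np.array([1, None], dtype=object)).dtype)  # object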

@coveralls

coveralls commented Oct 27, 2023

Pull Request Test Coverage Report for Build 7507348255

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+1.8%) to 89.34%

Totals (Coverage Status):
  • Change from base Build 7198837138: +1.8%
  • Covered Lines: 59662
  • Relevant Lines: 66781

💛 - Coveralls

@jakelishman
Member Author

jakelishman commented Oct 27, 2023

Very casual flame-graphing suggests this implementation is bottlenecked on its very inefficient allocations, which is pretty much exactly what I'd expect. I'd prefer to rearrange everything into a scratch-reusing, potentially non-recursive version of the code in a follow-up, though, since I suspect that it will be rather harder to read.

Member

@mtreinish mtreinish left a comment

Overall this looks great; it's a big speedup and the algorithm is pretty straightforward. I just had a couple of questions and comments inline, nothing major though.

Review comments on crates/accelerate/src/sparse_pauli_op.rs (5 threads, all resolved; 2 outdated)
Member

@mtreinish mtreinish left a comment

LGTM, thanks for the updates

@mtreinish mtreinish added this pull request to the merge queue Jan 12, 2024
Merged via the queue into Qiskit:main with commit 52affa5 Jan 12, 2024
13 checks passed
@jakelishman jakelishman deleted the fast-pauli-decomposition branch January 12, 2024 22:54
ShellyGarion pushed a commit to ShellyGarion/qiskit-terra that referenced this pull request Jan 18, 2024
…Qiskit#11133)

* Use "tensorised-Pauli decomposition" in `SparsePauliOp.from_operator`

* Remove unused import

* Remove superfluous `num_qubits` argument

* Export `ZXPaulis` to Python space

* Add comment on array-builder logic

* Use `num_traits::Zero` trait for zero checking

* Use custom ilog2

* Use stdlib `ilog2`
Labels
Changelog: New Feature, mod: quantum info, performance, Rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

avoid typechecking coefficients individually in SparsePauliOp.__init__()
5 participants