pith. sign in

arxiv: 2605.25974 · v1 · pith:2C5RMQVSnew · submitted 2026-05-25 · 🪐 quant-ph · cs.ET

PauLIB: A High-Performance Library for Processing Pauli Strings

Pith reviewed 2026-06-29 22:03 UTC · model grok-4.3

classification 🪐 quant-ph cs.ET
keywords Pauli stringsquantum chemistryC++ librarySIMD vectorizationHamiltonian processingvariational algorithmsperformance benchmarksmemory reduction
0
0 comments X

The pith

PauLIB multiplies Pauli strings in 25 nanoseconds and shrinks Hamiltonian memory sevenfold through a two-bit-per-qubit encoding and sorted arrays.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PauLIB as a header-only C++20 library that removes Python interpreter overhead and hash-map limitations when handling large Pauli sums for quantum chemistry, propagation, and compilation. It establishes that three architectural choices—a bit-packed binary symplectic representation, replacement of hash maps by sorted arrays, and struct-of-arrays memory layout—enable branch-predictable SIMD vectorization and multi-threaded merging. Benchmarks on 500-qubit instances show single-string multiplication at 25 nanoseconds, Hamiltonian outer-product work roughly ten times faster than PauliEngine, and greedy commutation grouping up to 21,000 times faster than PennyLane, while cutting the memory footprint of a one-million-term Hamiltonian from 1,036 MB to 142 MB. A sympathetic reader would care because these concrete gains directly increase the size of tractable problems on fixed hardware.

Core claim

PauLIB encodes each Pauli string in a binary symplectic form with two bits per qubit so that multiplication reduces to a bitwise XOR followed by a population count. Strings are kept in sorted arrays rather than hash maps to support contiguous SIMD bulk operations and use a struct-of-arrays layout to expose word arrays for explicit vectorization. On representative 500-qubit workloads the design yields single-string multiplication at 25 nanoseconds (14 times faster than PauliEngine, 660 times faster than Qiskit-flat), outer-product Hamiltonian multiplication approximately 10 times faster than PauliEngine and 45 times faster than Qiskit, greedy commutation grouping up to 21,000 times faster tha

What carries the argument

The bit-packed binary symplectic representation that encodes each qubit in two bits, reducing Pauli multiplication to a bitwise XOR and a population count.

If this is right

  • Single Pauli string multiplication completes in 25 nanoseconds, delivering a 14-fold speedup over PauliEngine and a 660-fold speedup over Qiskit-flat across tested pair counts.
  • Hamiltonian outer-product multiplication runs approximately 10 times faster than PauliEngine and 45 times faster than Qiskit at all tested sizes.
  • Greedy commutation grouping, the dominant preprocessing step in variational algorithms, achieves up to a 21,000-fold speedup over PennyLane.
  • A one-million-term Hamiltonian at 500 qubits occupies 142 MB instead of 1,036 MB, a 7.3-fold reduction that directly enables larger problem sizes within a fixed memory budget.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same compact representation could be reused to accelerate Pauli propagation or stabilizer simulation workloads that currently rely on the same operator algebra.
  • Embedding PauLIB inside higher-level quantum frameworks would allow existing variational quantum eigensolver pipelines to scale to larger qubit counts without source changes.
  • The sorted-array layout for sparse operator sums may transfer to other algebraic structures that appear in quantum error correction or tensor-network contractions.

Load-bearing premise

The reported speedups and memory reductions are measured on representative workloads and hardware configurations that match typical user environments, and the library's C++ and Python interfaces incur no hidden overhead that would negate the gains in real applications.

What would settle it

A direct timing run of one million random 500-qubit Pauli-string multiplications on a standard x86-64 CPU that yields an average time per multiplication clearly above 25 nanoseconds would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2605.25974 by Florian Kr\"otz.

Figure 1
Figure 1. Figure 1: Memory layout of PauliString. The struct stores ⌈N/64⌉ 64-bit Z-words, ⌈N/64⌉ 64-bit X-words, and a one-byte flags field encoding sign (±) and imaginary phase (i). H. Sorting and Deduplication After multiplication or Pauli propagation, a sum may con￾tain duplicate Pauli strings whose coefficients must be added together. PauLIB uses a sort-and-combine strategy. First, an index array of size M is sorted usin… view at source ↗
Figure 2
Figure 2. Figure 2: Memory layout of PauliSumSoA. For a sum of M terms, the struct stores ⌈N/64⌉ contiguous arrays of M 64-bit Z-words, ⌈N/64⌉ arrays of M 64-bit X-words, and an array of M flags bytes encoding sign (±) and imaginary phase (i). insertion and lookup for the workloads we target, in line with the design of the Stim simulator. B. SIMD Vectorization PauLIB uses two layers of SIMD acceleration. The first layer is co… view at source ↗
Figure 3
Figure 3. Figure 3: shows the time per PauliString×PauliString operation for 100, 200, 500, and 1 000 random pairs at 500 qubits. The three sets of bars span three orders of magnitude on the log axis. The PauLIB SoA bars (blue) sit at the bottom, flat at 25 ns per operation across all tested pair counts. The PauliEngine bars (orange) are constant at approximately 350 ns. Qiskit (red) requires approximately 16.5 µs, placing it… view at source ↗
Figure 4
Figure 4. Figure 4: PauliSum×PauliSum outer product time at 500 qubits vs. number of output terms N2 (log-log scale). PauLIB AoS and SoA (solid lines, lower band) are one to two orders of magnitude faster than PauliEngine and Qiskit (dashed lines, upper band), with all four lines following O(N2 ) scaling. At 500 qubits AoS outperforms SoA due to better cache locality. C. Greedy Grouping [PITH_FULL_IMAGE:figures/full_fig_p007… view at source ↗
Figure 6
Figure 6. Figure 6: RSS memory footprint for Pauli sums at 500 qubits (isolated [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
read the original abstract

Processing large Pauli sums is a significant bottleneck in quantum chemistry, Pauli propagation, and Pauli-based compilation. Existing frameworks often suffer from Python interpreter overhead or utilize hash-map data structures that hinder SIMD vectorization and complicate multi-threaded merging. We present PauLIB, a header-only C++20 library designed to eliminate these bottlenecks through three key architectural choices. A bit-packed binary symplectic representation that encodes each qubit in two bits, reducing Pauli multiplication to a bitwise XOR and a population count; a sorted array layout that replaces hash maps to enable branch-predictable SIMD bulk operations; and a struct-of-arrays (SoA) memory layout that exposes contiguous word arrays for explicit SIMD vectorization. Benchmarks at 500 qubits show that single Pauli string multiplication runs at 25ns per operation-14 times faster than PauliEngine and 660 times faster than Qiskit-flat across all pair counts tested. Hamiltonian outer-product multiplication is approximately 10 times faster than PauliEngine and 45 times faster than Qiskit at all tested sizes. Greedy commutation grouping, the dominant preprocessing cost in variational algorithms, achieves up to 21,000 times speedup over PennyLane, driven by the compact bit-packed representation. The compact layout reduces the memory footprint of a one-million-term Hamiltonian at 500 qubits from 1,036MB (Qiskit) to 142MB, a 7.3 times reduction that directly enables larger problem sizes within a fixed memory budget. PauLIB is open source and provides C++ and Python interfaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces PauLIB, a header-only C++20 library for processing Pauli strings that employs a bit-packed binary symplectic representation (two bits per qubit), a sorted array layout instead of hash maps, and a struct-of-arrays memory layout to enable SIMD operations. It reports concrete performance claims at 500 qubits: single Pauli multiplication at 25 ns/op (14× faster than PauliEngine, 660× faster than Qiskit-flat), Hamiltonian outer-product multiplication (10× faster than PauliEngine, 45× faster than Qiskit), greedy commutation grouping (up to 21,000× faster than PennyLane), and a 7.3× memory reduction for a 1M-term Hamiltonian (142 MB vs. 1,036 MB in Qiskit). The library provides both C++ and Python interfaces and is open source.

Significance. If the reported speedups and memory reductions hold under representative conditions, PauLIB could meaningfully accelerate Pauli propagation, variational algorithms, and quantum chemistry workloads by enabling larger Hamiltonians within fixed memory budgets. The open-source release and explicit support for both C++ and Python interfaces are positive attributes that facilitate adoption.

major comments (2)
  1. [Abstract] Abstract: the headline performance numbers (25 ns/op single multiplication, 14–660× speedups, 10–45× for outer products, 21,000× for grouping, 7.3× memory reduction) are presented without any description of hardware platform, compiler flags, measurement methodology, statistical error bars, or whether timings isolate the C++ core versus the Python bindings. Because the central claims rest entirely on these benchmarks and the library explicitly advertises Python interfaces, the absence of this information is load-bearing for assessing whether the reported factors translate to end-user workloads.
  2. [Abstract] Abstract / claimed results: no isolation or quantification of Python binding overhead (data marshalling, GIL, per-call cost) is provided, even though the skeptic note correctly identifies this as the weakest link between micro-benchmark numbers and practical Python usage. This directly affects the applicability of the 14–660× and 10–45× factors for the dominant user path.
minor comments (1)
  1. [Abstract] Abstract contains a typographic issue: "operation-14 times" lacks a space or em-dash.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. The points raised correctly identify areas where the abstract and benchmark presentation can be strengthened to better support the performance claims. We address each comment below and have revised the manuscript to incorporate the requested information.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline performance numbers (25 ns/op single multiplication, 14–660× speedups, 10–45× for outer products, 21,000× for grouping, 7.3× memory reduction) are presented without any description of hardware platform, compiler flags, measurement methodology, statistical error bars, or whether timings isolate the C++ core versus the Python bindings. Because the central claims rest entirely on these benchmarks and the library explicitly advertises Python interfaces, the absence of this information is load-bearing for assessing whether the reported factors translate to end-user workloads.

    Authors: We agree that the abstract lacks essential methodological context for the reported numbers. In the revised manuscript we will expand the abstract to state the hardware platform, compiler and optimization flags used, and explicitly note that the timings isolate the C++ core. The existing benchmarks section will be augmented with a concise description of the timing methodology (high-resolution timers, repetition counts), inclusion of statistical error bars, and a clear statement confirming isolation from Python bindings. revision: yes

  2. Referee: [Abstract] Abstract / claimed results: no isolation or quantification of Python binding overhead (data marshalling, GIL, per-call cost) is provided, even though the skeptic note correctly identifies this as the weakest link between micro-benchmark numbers and practical Python usage. This directly affects the applicability of the 14–660× and 10–45× factors for the dominant user path.

    Authors: We concur that the absence of quantified Python-binding overhead limits the interpretability of the speedup factors for the primary Python user path. In the revised manuscript we will add dedicated measurements in the performance section that isolate the overhead of the pybind11 bindings (marshalling, GIL acquisition/release, and per-call cost) relative to the C++ core, allowing readers to assess effective speedups under typical Python usage. revision: yes

Circularity Check

0 steps flagged

No derivation chain present; claims are benchmark-driven engineering results

full rationale

The manuscript describes a C++ library implementation using bit-packed representations, sorted arrays, and SoA layouts, with all performance claims (25 ns/op, speedups of 14–660×, 10–45×, 21 000×, 7.3× memory reduction) resting on reported benchmarks against other libraries. No equations, first-principles derivations, fitted parameters, or predictions appear. No self-citations are invoked to justify uniqueness theorems or ansatzes. The work is self-contained against external benchmarks and contains no load-bearing steps that reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are required; the contribution is an implementation of known representation techniques rather than a theoretical derivation.

pith-pipeline@v0.9.1-grok · 5801 in / 1132 out tokens · 41115 ms · 2026-06-29T22:03:36.835340+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 13 canonical work pages · 6 internal anchors

  1. [1]

    M. A. Nielsen and I. L. Chuang,Quantum Computation and Quantum Information, 10th ed. Cambridge: Cambridge University Press, 2010

  2. [2]

    Stabilizer Codes and Quantum Error Correction

    D. Gottesman, “Stabilizer codes and quantum error correction,” 1997. [Online]. Available: https://arxiv.org/abs/quant-ph/9705052

  3. [3]

    NetQASM—a low-level instruction set architecture for hybrid quantum–classical programs in a quantum internet,

    McCleanet al., “Openfermion: the electronic structure package for quantum computers,”Quantum Science and Technology, vol. 5, no. 3, p. 034014, jun 2020. [Online]. Available: https://doi.org/10.1088/2058- 9565/ab8ebc

  4. [4]

    Stim: a fast stabilizer circuit simulator,

    C. Gidney, “Stim: a fast stabilizer circuit simulator,” vol. 5, p. 497. [Online]. Available: http://arxiv.org/abs/2103.02202

  5. [5]

    Improved Simulation of Stabilizer Circuits

    S. Aaronson and D. Gottesman, “Improved simulation of stabilizer circuits,”Physical Review A, vol. 70, no. 5, p. 052328, 2004. [Online]. Available: http://arxiv.org/abs/quant-ph/0406196

  6. [6]

    Trading classical and quantum computational resources,

    S. Bravyi, G. Smith, and J. Smolin, “Trading classical and quantum computational resources,”Physical Review X, vol. 6, no. 2, p. 021043,

  7. [7]

    Trading classical and quantum computational resources

    [Online]. Available: http://arxiv.org/abs/1506.01396

  8. [8]

    Classically estimating observables of noiseless quantum circuits,

    A. Angrisani, A. Schmidhuber, M. S. Rudolph, M. Cerezo, Z. Holmes, and H.-Y . Huang, “Classically estimating observables of noiseless quantum circuits,”Physical Review Letters, vol. 135, no. 17, p. 170602,

  9. [9]

    Available: https://link.aps.org/doi/10.1103/lh6x-7rc3

    [Online]. Available: https://link.aps.org/doi/10.1103/lh6x-7rc3

  10. [10]

    Pauli Propagation: A Computational Framework for Simulating Quantum Systems

    M. S. Rudolph, T. Jones, Y . Teng, A. Angrisani, and Z. Holmes, “Pauli propagation: A computational framework for simulating quantum systems.” [Online]. Available: http://arxiv.org/abs/2505.21606

  11. [11]

    Lie-algebraic classical simulations for quantum computing,

    M. L. Goh, M. Larocca, L. Cincio, M. Cerezo, and F. Sauvage, “Lie-algebraic classical simulations for quantum computing,”Physical Review Research, vol. 7, no. 3, p. 033266, 2025. [Online]. Available: https://link.aps.org/doi/10.1103/3y65-f5w6

  12. [12]

    [46]https://github.com/nepomechie/qudit-definite-weight-states

    C. Developers,Cirq. Zenodo, Aug. 2025. [Online]. Available: https://zenodo.org/doi/10.5281/zenodo.4062499

  13. [13]

    PennyLane: Automatic differentiation of hybrid quantum-classical computations

    V . Bergholmet al., “Pennylane: Automatic differentiation of hybrid quantum-classical computations,” 2022. [Online]. Available: https://arxiv.org/abs/1811.04968

  14. [14]

    Quantum computing with Qiskit,

    A. Javadi-Abhari, M. Treinish, K. Krsulich, C. J. Wood, J. Lishman, J. Gacon, S. Martiel, P. D. Nation, L. S. Bishop, A. W. Cross, B. R. Johnson, and J. M. Gambetta, “Quantum computing with Qiskit,” 2024

  15. [15]

    PauLIB: A high-performance library for processing pauli strings,

    F. Kr ¨otz, “PauLIB: A high-performance library for processing pauli strings,” https://github.com/Flousen/PauLIB, 2026

  16. [16]

    PauliEngine: High-Performant Symbolic Arithmetic for Quantum Operations

    L. M ¨uller, A. B ¨arligea, A. Knapp, and J. S. Kottmann, “PauliEngine: High-performant symbolic arithmetic for quantum operations.” [Online]. Available: http://arxiv.org/abs/2601.02233

  17. [17]

    STABSim: A parallelized clifford simulator with features beyond direct simulation

    S. Garner, C. Liu, M. Wang, S. Stein, and A. Li, “STABSim: A parallelized clifford simulator with features beyond direct simulation.” [Online]. Available: http://arxiv.org/abs/2507.03092

  18. [18]

    The clifford hierarchy for one qubit or qudit

    N. d. Silva and O. Lautsch, “The clifford hierarchy for one qubit or qudit.” [Online]. Available: http://arxiv.org/abs/2501.07939