PauLIB: A High-Performance Library for Processing Pauli Strings

Florian Kr\"otz

arxiv: 2605.25974 · v1 · pith:2C5RMQVSnew · submitted 2026-05-25 · 🪐 quant-ph · cs.ET

PauLIB: A High-Performance Library for Processing Pauli Strings

Florian Kr\"otz This is my paper

Pith reviewed 2026-06-29 22:03 UTC · model grok-4.3

classification 🪐 quant-ph cs.ET

keywords Pauli stringsquantum chemistryC++ librarySIMD vectorizationHamiltonian processingvariational algorithmsperformance benchmarksmemory reduction

0 comments

The pith

PauLIB multiplies Pauli strings in 25 nanoseconds and shrinks Hamiltonian memory sevenfold through a two-bit-per-qubit encoding and sorted arrays.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PauLIB as a header-only C++20 library that removes Python interpreter overhead and hash-map limitations when handling large Pauli sums for quantum chemistry, propagation, and compilation. It establishes that three architectural choices—a bit-packed binary symplectic representation, replacement of hash maps by sorted arrays, and struct-of-arrays memory layout—enable branch-predictable SIMD vectorization and multi-threaded merging. Benchmarks on 500-qubit instances show single-string multiplication at 25 nanoseconds, Hamiltonian outer-product work roughly ten times faster than PauliEngine, and greedy commutation grouping up to 21,000 times faster than PennyLane, while cutting the memory footprint of a one-million-term Hamiltonian from 1,036 MB to 142 MB. A sympathetic reader would care because these concrete gains directly increase the size of tractable problems on fixed hardware.

Core claim

PauLIB encodes each Pauli string in a binary symplectic form with two bits per qubit so that multiplication reduces to a bitwise XOR followed by a population count. Strings are kept in sorted arrays rather than hash maps to support contiguous SIMD bulk operations and use a struct-of-arrays layout to expose word arrays for explicit vectorization. On representative 500-qubit workloads the design yields single-string multiplication at 25 nanoseconds (14 times faster than PauliEngine, 660 times faster than Qiskit-flat), outer-product Hamiltonian multiplication approximately 10 times faster than PauliEngine and 45 times faster than Qiskit, greedy commutation grouping up to 21,000 times faster tha

What carries the argument

The bit-packed binary symplectic representation that encodes each qubit in two bits, reducing Pauli multiplication to a bitwise XOR and a population count.

If this is right

Single Pauli string multiplication completes in 25 nanoseconds, delivering a 14-fold speedup over PauliEngine and a 660-fold speedup over Qiskit-flat across tested pair counts.
Hamiltonian outer-product multiplication runs approximately 10 times faster than PauliEngine and 45 times faster than Qiskit at all tested sizes.
Greedy commutation grouping, the dominant preprocessing step in variational algorithms, achieves up to a 21,000-fold speedup over PennyLane.
A one-million-term Hamiltonian at 500 qubits occupies 142 MB instead of 1,036 MB, a 7.3-fold reduction that directly enables larger problem sizes within a fixed memory budget.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same compact representation could be reused to accelerate Pauli propagation or stabilizer simulation workloads that currently rely on the same operator algebra.
Embedding PauLIB inside higher-level quantum frameworks would allow existing variational quantum eigensolver pipelines to scale to larger qubit counts without source changes.
The sorted-array layout for sparse operator sums may transfer to other algebraic structures that appear in quantum error correction or tensor-network contractions.

Load-bearing premise

The reported speedups and memory reductions are measured on representative workloads and hardware configurations that match typical user environments, and the library's C++ and Python interfaces incur no hidden overhead that would negate the gains in real applications.

What would settle it

A direct timing run of one million random 500-qubit Pauli-string multiplications on a standard x86-64 CPU that yields an average time per multiplication clearly above 25 nanoseconds would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2605.25974 by Florian Kr\"otz.

**Figure 1.** Figure 1: Memory layout of PauliString. The struct stores ⌈N/64⌉ 64-bit Z-words, ⌈N/64⌉ 64-bit X-words, and a one-byte flags field encoding sign (±) and imaginary phase (i). H. Sorting and Deduplication After multiplication or Pauli propagation, a sum may contain duplicate Pauli strings whose coefficients must be added together. PauLIB uses a sort-and-combine strategy. First, an index array of size M is sorted usin… view at source ↗

**Figure 2.** Figure 2: Memory layout of PauliSumSoA. For a sum of M terms, the struct stores ⌈N/64⌉ contiguous arrays of M 64-bit Z-words, ⌈N/64⌉ arrays of M 64-bit X-words, and an array of M flags bytes encoding sign (±) and imaginary phase (i). insertion and lookup for the workloads we target, in line with the design of the Stim simulator. B. SIMD Vectorization PauLIB uses two layers of SIMD acceleration. The first layer is co… view at source ↗

**Figure 3.** Figure 3: shows the time per PauliString×PauliString operation for 100, 200, 500, and 1 000 random pairs at 500 qubits. The three sets of bars span three orders of magnitude on the log axis. The PauLIB SoA bars (blue) sit at the bottom, flat at 25 ns per operation across all tested pair counts. The PauliEngine bars (orange) are constant at approximately 350 ns. Qiskit (red) requires approximately 16.5 µs, placing it… view at source ↗

**Figure 4.** Figure 4: PauliSum×PauliSum outer product time at 500 qubits vs. number of output terms N2 (log-log scale). PauLIB AoS and SoA (solid lines, lower band) are one to two orders of magnitude faster than PauliEngine and Qiskit (dashed lines, upper band), with all four lines following O(N2 ) scaling. At 500 qubits AoS outperforms SoA due to better cache locality. C. Greedy Grouping [PITH_FULL_IMAGE:figures/full_fig_p007… view at source ↗

**Figure 6.** Figure 6: RSS memory footprint for Pauli sums at 500 qubits (isolated [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

read the original abstract

Processing large Pauli sums is a significant bottleneck in quantum chemistry, Pauli propagation, and Pauli-based compilation. Existing frameworks often suffer from Python interpreter overhead or utilize hash-map data structures that hinder SIMD vectorization and complicate multi-threaded merging. We present PauLIB, a header-only C++20 library designed to eliminate these bottlenecks through three key architectural choices. A bit-packed binary symplectic representation that encodes each qubit in two bits, reducing Pauli multiplication to a bitwise XOR and a population count; a sorted array layout that replaces hash maps to enable branch-predictable SIMD bulk operations; and a struct-of-arrays (SoA) memory layout that exposes contiguous word arrays for explicit SIMD vectorization. Benchmarks at 500 qubits show that single Pauli string multiplication runs at 25ns per operation-14 times faster than PauliEngine and 660 times faster than Qiskit-flat across all pair counts tested. Hamiltonian outer-product multiplication is approximately 10 times faster than PauliEngine and 45 times faster than Qiskit at all tested sizes. Greedy commutation grouping, the dominant preprocessing cost in variational algorithms, achieves up to 21,000 times speedup over PennyLane, driven by the compact bit-packed representation. The compact layout reduces the memory footprint of a one-million-term Hamiltonian at 500 qubits from 1,036MB (Qiskit) to 142MB, a 7.3 times reduction that directly enables larger problem sizes within a fixed memory budget. PauLIB is open source and provides C++ and Python interfaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PauLIB is a practical C++ library for Pauli-string ops that claims large speedups via bit-packing and SoA layout, but the reported numbers rest on unverified benchmark details and unquantified Python overhead.

read the letter

The core contribution is a header-only C++20 library that packs Pauli strings into a two-bit-per-qubit symplectic form, stores them in sorted arrays instead of hash maps, and uses struct-of-arrays memory to feed SIMD operations directly. This replaces the interpreter and hash-map costs that slow down Qiskit, PennyLane, and similar tools. The memory claim is concrete: a million-term Hamiltonian at 500 qubits drops from over a gigabyte to 142 MB. If the implementation is clean and the code ships, that alone is useful for anyone running large variational workflows where preprocessing dominates.

The speed numbers are the part that needs checking. The abstract states 25 ns per multiplication, 14–660× gains, and up to 21 000× on commutation grouping, all at 500 qubits. Those figures are presented without hardware model, compiler flags, measurement method, or error bars. The Python bindings are mentioned but the cost of crossing the interface is not isolated, so it is unclear how much of the C++ micro-benchmark advantage survives in the typical user path. That is the main soft spot.

The work is straightforward engineering rather than a new algorithm. The individual techniques are standard in high-performance computing; the novelty is their targeted combination for this workload. No circular math or fitting is involved, which keeps the claims falsifiable once the code and full benchmark scripts are examined.

This paper is for groups that already hit memory or time limits on Pauli preprocessing and are willing to drop into a C++ core. A serious referee should see it, mainly to press for reproducible benchmark methodology and explicit Python-overhead measurements. The artifact itself looks worth testing if the claims hold up.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces PauLIB, a header-only C++20 library for processing Pauli strings that employs a bit-packed binary symplectic representation (two bits per qubit), a sorted array layout instead of hash maps, and a struct-of-arrays memory layout to enable SIMD operations. It reports concrete performance claims at 500 qubits: single Pauli multiplication at 25 ns/op (14× faster than PauliEngine, 660× faster than Qiskit-flat), Hamiltonian outer-product multiplication (10× faster than PauliEngine, 45× faster than Qiskit), greedy commutation grouping (up to 21,000× faster than PennyLane), and a 7.3× memory reduction for a 1M-term Hamiltonian (142 MB vs. 1,036 MB in Qiskit). The library provides both C++ and Python interfaces and is open source.

Significance. If the reported speedups and memory reductions hold under representative conditions, PauLIB could meaningfully accelerate Pauli propagation, variational algorithms, and quantum chemistry workloads by enabling larger Hamiltonians within fixed memory budgets. The open-source release and explicit support for both C++ and Python interfaces are positive attributes that facilitate adoption.

major comments (2)

[Abstract] Abstract: the headline performance numbers (25 ns/op single multiplication, 14–660× speedups, 10–45× for outer products, 21,000× for grouping, 7.3× memory reduction) are presented without any description of hardware platform, compiler flags, measurement methodology, statistical error bars, or whether timings isolate the C++ core versus the Python bindings. Because the central claims rest entirely on these benchmarks and the library explicitly advertises Python interfaces, the absence of this information is load-bearing for assessing whether the reported factors translate to end-user workloads.
[Abstract] Abstract / claimed results: no isolation or quantification of Python binding overhead (data marshalling, GIL, per-call cost) is provided, even though the skeptic note correctly identifies this as the weakest link between micro-benchmark numbers and practical Python usage. This directly affects the applicability of the 14–660× and 10–45× factors for the dominant user path.

minor comments (1)

[Abstract] Abstract contains a typographic issue: "operation-14 times" lacks a space or em-dash.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. The points raised correctly identify areas where the abstract and benchmark presentation can be strengthened to better support the performance claims. We address each comment below and have revised the manuscript to incorporate the requested information.

read point-by-point responses

Referee: [Abstract] Abstract: the headline performance numbers (25 ns/op single multiplication, 14–660× speedups, 10–45× for outer products, 21,000× for grouping, 7.3× memory reduction) are presented without any description of hardware platform, compiler flags, measurement methodology, statistical error bars, or whether timings isolate the C++ core versus the Python bindings. Because the central claims rest entirely on these benchmarks and the library explicitly advertises Python interfaces, the absence of this information is load-bearing for assessing whether the reported factors translate to end-user workloads.

Authors: We agree that the abstract lacks essential methodological context for the reported numbers. In the revised manuscript we will expand the abstract to state the hardware platform, compiler and optimization flags used, and explicitly note that the timings isolate the C++ core. The existing benchmarks section will be augmented with a concise description of the timing methodology (high-resolution timers, repetition counts), inclusion of statistical error bars, and a clear statement confirming isolation from Python bindings. revision: yes
Referee: [Abstract] Abstract / claimed results: no isolation or quantification of Python binding overhead (data marshalling, GIL, per-call cost) is provided, even though the skeptic note correctly identifies this as the weakest link between micro-benchmark numbers and practical Python usage. This directly affects the applicability of the 14–660× and 10–45× factors for the dominant user path.

Authors: We concur that the absence of quantified Python-binding overhead limits the interpretability of the speedup factors for the primary Python user path. In the revised manuscript we will add dedicated measurements in the performance section that isolate the overhead of the pybind11 bindings (marshalling, GIL acquisition/release, and per-call cost) relative to the C++ core, allowing readers to assess effective speedups under typical Python usage. revision: yes

Circularity Check

0 steps flagged

No derivation chain present; claims are benchmark-driven engineering results

full rationale

The manuscript describes a C++ library implementation using bit-packed representations, sorted arrays, and SoA layouts, with all performance claims (25 ns/op, speedups of 14–660×, 10–45×, 21 000×, 7.3× memory reduction) resting on reported benchmarks against other libraries. No equations, first-principles derivations, fitted parameters, or predictions appear. No self-citations are invoked to justify uniqueness theorems or ansatzes. The work is self-contained against external benchmarks and contains no load-bearing steps that reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are required; the contribution is an implementation of known representation techniques rather than a theoretical derivation.

pith-pipeline@v0.9.1-grok · 5801 in / 1132 out tokens · 41115 ms · 2026-06-29T22:03:36.835340+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 13 canonical work pages · 6 internal anchors

[1]

M. A. Nielsen and I. L. Chuang,Quantum Computation and Quantum Information, 10th ed. Cambridge: Cambridge University Press, 2010

2010
[2]

Stabilizer Codes and Quantum Error Correction

D. Gottesman, “Stabilizer codes and quantum error correction,” 1997. [Online]. Available: https://arxiv.org/abs/quant-ph/9705052

work page internal anchor Pith review Pith/arXiv arXiv 1997
[3]

NetQASM—a low-level instruction set architecture for hybrid quantum–classical programs in a quantum internet,

McCleanet al., “Openfermion: the electronic structure package for quantum computers,”Quantum Science and Technology, vol. 5, no. 3, p. 034014, jun 2020. [Online]. Available: https://doi.org/10.1088/2058- 9565/ab8ebc

work page doi:10.1088/2058- 2020
[4]

Stim: a fast stabilizer circuit simulator,

C. Gidney, “Stim: a fast stabilizer circuit simulator,” vol. 5, p. 497. [Online]. Available: http://arxiv.org/abs/2103.02202

work page arXiv
[5]

Improved Simulation of Stabilizer Circuits

S. Aaronson and D. Gottesman, “Improved simulation of stabilizer circuits,”Physical Review A, vol. 70, no. 5, p. 052328, 2004. [Online]. Available: http://arxiv.org/abs/quant-ph/0406196

work page internal anchor Pith review Pith/arXiv arXiv 2004
[6]

Trading classical and quantum computational resources,

S. Bravyi, G. Smith, and J. Smolin, “Trading classical and quantum computational resources,”Physical Review X, vol. 6, no. 2, p. 021043,
[7]

Trading classical and quantum computational resources

[Online]. Available: http://arxiv.org/abs/1506.01396

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Classically estimating observables of noiseless quantum circuits,

A. Angrisani, A. Schmidhuber, M. S. Rudolph, M. Cerezo, Z. Holmes, and H.-Y . Huang, “Classically estimating observables of noiseless quantum circuits,”Physical Review Letters, vol. 135, no. 17, p. 170602,
[9]

Available: https://link.aps.org/doi/10.1103/lh6x-7rc3

[Online]. Available: https://link.aps.org/doi/10.1103/lh6x-7rc3

work page doi:10.1103/lh6x-7rc3
[10]

Pauli Propagation: A Computational Framework for Simulating Quantum Systems

M. S. Rudolph, T. Jones, Y . Teng, A. Angrisani, and Z. Holmes, “Pauli propagation: A computational framework for simulating quantum systems.” [Online]. Available: http://arxiv.org/abs/2505.21606

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Lie-algebraic classical simulations for quantum computing,

M. L. Goh, M. Larocca, L. Cincio, M. Cerezo, and F. Sauvage, “Lie-algebraic classical simulations for quantum computing,”Physical Review Research, vol. 7, no. 3, p. 033266, 2025. [Online]. Available: https://link.aps.org/doi/10.1103/3y65-f5w6

work page doi:10.1103/3y65-f5w6 2025
[12]

[46]https://github.com/nepomechie/qudit-definite-weight-states

C. Developers,Cirq. Zenodo, Aug. 2025. [Online]. Available: https://zenodo.org/doi/10.5281/zenodo.4062499

work page doi:10.5281/zenodo.4062499 2025
[13]

PennyLane: Automatic differentiation of hybrid quantum-classical computations

V . Bergholmet al., “Pennylane: Automatic differentiation of hybrid quantum-classical computations,” 2022. [Online]. Available: https://arxiv.org/abs/1811.04968

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Quantum computing with Qiskit,

A. Javadi-Abhari, M. Treinish, K. Krsulich, C. J. Wood, J. Lishman, J. Gacon, S. Martiel, P. D. Nation, L. S. Bishop, A. W. Cross, B. R. Johnson, and J. M. Gambetta, “Quantum computing with Qiskit,” 2024

2024
[15]

PauLIB: A high-performance library for processing pauli strings,

F. Kr ¨otz, “PauLIB: A high-performance library for processing pauli strings,” https://github.com/Flousen/PauLIB, 2026

2026
[16]

PauliEngine: High-Performant Symbolic Arithmetic for Quantum Operations

L. M ¨uller, A. B ¨arligea, A. Knapp, and J. S. Kottmann, “PauliEngine: High-performant symbolic arithmetic for quantum operations.” [Online]. Available: http://arxiv.org/abs/2601.02233

work page internal anchor Pith review Pith/arXiv arXiv
[17]

STABSim: A parallelized clifford simulator with features beyond direct simulation

S. Garner, C. Liu, M. Wang, S. Stein, and A. Li, “STABSim: A parallelized clifford simulator with features beyond direct simulation.” [Online]. Available: http://arxiv.org/abs/2507.03092

work page arXiv
[18]

The clifford hierarchy for one qubit or qudit

N. d. Silva and O. Lautsch, “The clifford hierarchy for one qubit or qudit.” [Online]. Available: http://arxiv.org/abs/2501.07939

work page arXiv

[1] [1]

M. A. Nielsen and I. L. Chuang,Quantum Computation and Quantum Information, 10th ed. Cambridge: Cambridge University Press, 2010

2010

[2] [2]

Stabilizer Codes and Quantum Error Correction

D. Gottesman, “Stabilizer codes and quantum error correction,” 1997. [Online]. Available: https://arxiv.org/abs/quant-ph/9705052

work page internal anchor Pith review Pith/arXiv arXiv 1997

[3] [3]

NetQASM—a low-level instruction set architecture for hybrid quantum–classical programs in a quantum internet,

McCleanet al., “Openfermion: the electronic structure package for quantum computers,”Quantum Science and Technology, vol. 5, no. 3, p. 034014, jun 2020. [Online]. Available: https://doi.org/10.1088/2058- 9565/ab8ebc

work page doi:10.1088/2058- 2020

[4] [4]

Stim: a fast stabilizer circuit simulator,

C. Gidney, “Stim: a fast stabilizer circuit simulator,” vol. 5, p. 497. [Online]. Available: http://arxiv.org/abs/2103.02202

work page arXiv

[5] [5]

Improved Simulation of Stabilizer Circuits

S. Aaronson and D. Gottesman, “Improved simulation of stabilizer circuits,”Physical Review A, vol. 70, no. 5, p. 052328, 2004. [Online]. Available: http://arxiv.org/abs/quant-ph/0406196

work page internal anchor Pith review Pith/arXiv arXiv 2004

[6] [6]

Trading classical and quantum computational resources,

S. Bravyi, G. Smith, and J. Smolin, “Trading classical and quantum computational resources,”Physical Review X, vol. 6, no. 2, p. 021043,

[7] [7]

Trading classical and quantum computational resources

[Online]. Available: http://arxiv.org/abs/1506.01396

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Classically estimating observables of noiseless quantum circuits,

A. Angrisani, A. Schmidhuber, M. S. Rudolph, M. Cerezo, Z. Holmes, and H.-Y . Huang, “Classically estimating observables of noiseless quantum circuits,”Physical Review Letters, vol. 135, no. 17, p. 170602,

[9] [9]

Available: https://link.aps.org/doi/10.1103/lh6x-7rc3

[Online]. Available: https://link.aps.org/doi/10.1103/lh6x-7rc3

work page doi:10.1103/lh6x-7rc3

[10] [10]

Pauli Propagation: A Computational Framework for Simulating Quantum Systems

M. S. Rudolph, T. Jones, Y . Teng, A. Angrisani, and Z. Holmes, “Pauli propagation: A computational framework for simulating quantum systems.” [Online]. Available: http://arxiv.org/abs/2505.21606

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Lie-algebraic classical simulations for quantum computing,

M. L. Goh, M. Larocca, L. Cincio, M. Cerezo, and F. Sauvage, “Lie-algebraic classical simulations for quantum computing,”Physical Review Research, vol. 7, no. 3, p. 033266, 2025. [Online]. Available: https://link.aps.org/doi/10.1103/3y65-f5w6

work page doi:10.1103/3y65-f5w6 2025

[12] [12]

[46]https://github.com/nepomechie/qudit-definite-weight-states

C. Developers,Cirq. Zenodo, Aug. 2025. [Online]. Available: https://zenodo.org/doi/10.5281/zenodo.4062499

work page doi:10.5281/zenodo.4062499 2025

[13] [13]

PennyLane: Automatic differentiation of hybrid quantum-classical computations

V . Bergholmet al., “Pennylane: Automatic differentiation of hybrid quantum-classical computations,” 2022. [Online]. Available: https://arxiv.org/abs/1811.04968

work page internal anchor Pith review Pith/arXiv arXiv 2022

[14] [14]

Quantum computing with Qiskit,

A. Javadi-Abhari, M. Treinish, K. Krsulich, C. J. Wood, J. Lishman, J. Gacon, S. Martiel, P. D. Nation, L. S. Bishop, A. W. Cross, B. R. Johnson, and J. M. Gambetta, “Quantum computing with Qiskit,” 2024

2024

[15] [15]

PauLIB: A high-performance library for processing pauli strings,

F. Kr ¨otz, “PauLIB: A high-performance library for processing pauli strings,” https://github.com/Flousen/PauLIB, 2026

2026

[16] [16]

PauliEngine: High-Performant Symbolic Arithmetic for Quantum Operations

L. M ¨uller, A. B ¨arligea, A. Knapp, and J. S. Kottmann, “PauliEngine: High-performant symbolic arithmetic for quantum operations.” [Online]. Available: http://arxiv.org/abs/2601.02233

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

STABSim: A parallelized clifford simulator with features beyond direct simulation

S. Garner, C. Liu, M. Wang, S. Stein, and A. Li, “STABSim: A parallelized clifford simulator with features beyond direct simulation.” [Online]. Available: http://arxiv.org/abs/2507.03092

work page arXiv

[18] [18]

The clifford hierarchy for one qubit or qudit

N. d. Silva and O. Lautsch, “The clifford hierarchy for one qubit or qudit.” [Online]. Available: http://arxiv.org/abs/2501.07939

work page arXiv