pith. machine review for the scientific record. sign in

arxiv: 2604.15765 · v1 · submitted 2026-04-17 · 🪐 quant-ph

Recognition: unknown

Orkan: Cache-friendly simulation of quantum operations on hermitian operators

Authors on Pith no claims yet

Pith reviewed 2026-05-10 08:58 UTC · model grok-4.3

classification 🪐 quant-ph
keywords hermitianquantummatrixmemoryoperationssimulationdensityfootprint
0
0 comments X

The pith

Orkan simulates quantum operations on Hermitian operators using a cache-friendly tiled lower-triangle layout, halving memory and achieving 2-4x speedups over Qiskit Aer, QuEST, and Qulacs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Quantum operations on states or observables are described by positive maps on Hermitian matrices. Standard simulators vectorize the full density matrix and store every element even though Hermitian symmetry means half the entries are redundant. Orkan instead uses a tiled memory layout that keeps only the lower triangle at tile granularity. This reduces both storage and the number of operations needed. The library includes special algorithms for k-local conjugations that update the entire matrix in one pass and works the same way whether the state or the observable is being evolved. Benchmarks on standard quantum simulation tasks show consistent 2-4 times faster wall-clock times than three widely used libraries, with part of the gain coming from the smaller memory footprint and better cache behavior.

Core claim

I introduce Orkan, a simulation library that uses a tiled memory layout storing only the lower triangle of the hermitian matrix at tile granularity, roughly halving both the memory footprint and the wall time to simulate the evolution of quantum states under generic quantum operations.

Load-bearing premise

That the overhead of managing the tiled layout and the k-local update algorithms remains negligible compared with the memory savings across the full range of n-qubit systems and operation types.

Figures

Figures reproduced from arXiv: 2604.15765 by Timo Ziegler.

Figure 1
Figure 1. Figure 1: Time per execution of the Hadamard gate averaged over [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The tiled format with a tile size of M = 2. The hermitian matrix H is covered with tiles of edge length M and only tiles Tti,tj in the lower-triangle, i.e., with ti ≥ tj are stored to a contiguous array. Within the tiles, all elements are stored in row-major format to enhance SIMD vectorisation. stored ( [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: (a) Time per execution of the single-qubit depolarising channel averaged over all possible qubit positions (lower is better). The tiled and packed implementations separate from the field from n = 3 qubits on, with packed leading tiled until n ≈ 12. The speedup settles to roughly 2× in the intermediate regime (10 ≤ n ≤ 14); at n = 15 the competitors enter an out-of-core regime in which the gap widens to 14×… view at source ↗
Figure 4
Figure 4. Figure 4: (a) Time per execution of the Pauli-X gate averaged over all possible qubit positions. Except for small systems (n ≤ 4), where Qiskit Aer’s backend is slightly faster, the tiled and packed implementations exhibit a time saving by a factor of roughly four for intermediate systems (10 ≤ n ≤ 14). This additional factor of two is expected to stem from the double-pass for native gates in the established simulat… view at source ↗
Figure 5
Figure 5. Figure 5: (a) Time per execution of the controlled Pauli-X (CNOT) gate averaged over all possible qubit positions. In contrast to earlier native gates, the asymptotic speedup of the tiled and the packed implementation over QuEST is limited to a factor of three and two, respectively. (b) Effective bandwidth of the controlled Pauli-X (CNOT) gate execution over all possible qubit positions. Settling at ∼ 150 GiB/s, the… view at source ↗
Figure 6
Figure 6. Figure 6: Effective bandwidth of the Hadamard gate execution [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
read the original abstract

Classical simulation of quantum operations is essential for algorithm design, noise characterisation, and benchmarking of quantum hardware. The most general physically realisable operation can be described by a positive linear map acting on a hermitian operator, representing either a density matrix or an observable. Established simulators vectorise the density matrix on an $n$-qubit Hilbert space and reuse state-vector kernels, storing all $2^{2n}$ elements and forgoing the benefits of hermitian symmetry. In this work, I introduce \emph{Orkan}, a simulation library that uses a tiled memory layout storing only the lower triangle of the hermitian matrix at tile granularity, roughly halving both the memory footprint and the wall time to simulate the evolution of quantum states under generic quantum operations. The implementation treats any hermitian operator uniformly and is agnostic to whether the Schr\"{o}dinger or Heisenberg picture is used. Dedicated $k$-local conjugation algorithms update all entries of the hermitian matrix in a single pass. Benchmarks against Qiskit Aer, QuEST, and Qulacs show consistent wall-clock speedups of $2$-$4{\times}$ partly attributable to the reduced memory footprint.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Orkan, a simulation library for quantum operations acting on Hermitian operators (density matrices or observables). It employs a tiled memory layout that stores only the lower triangle of the Hermitian matrix at tile granularity, claiming this roughly halves both memory footprint and wall-clock time for evolution under generic quantum operations. The library is agnostic to Schrödinger or Heisenberg picture and uses dedicated k-local conjugation algorithms to update all matrix entries in a single pass. Benchmarks against Qiskit Aer, QuEST, and Qulacs report consistent 2–4× speedups.

Significance. If the performance claims and generality hold, the tiled lower-triangle approach offers a practical advance in memory-efficient classical simulation of quantum systems by exploiting Hermitian symmetry without vectorization. This could benefit algorithm design, noise characterization, and hardware benchmarking. The single-pass k-local update strategy and uniform treatment of operators are notable strengths, as is the provision of reproducible benchmarks against established simulators.

major comments (2)
  1. [Abstract] Abstract: The central claim states that Orkan supports 'generic quantum operations' via 'positive linear maps' on Hermitian operators while storing only the lower triangle at tile granularity and using k-local conjugation algorithms that 'update all entries of the hermitian matrix in a single pass.' General positive maps (including arbitrary CPTP maps) cannot always be reduced to k-local conjugations; they typically require vectorization or full superoperator action whose cost is not shown to be compatible with the tiled layout. This is load-bearing for the 'roughly halving ... wall time' claim for generic operations.
  2. [Benchmarks] Benchmarks (throughout): The reported 2–4× speedups and 'roughly halving' memory/time claims are asserted without error bars, detailed methodology, data-exclusion rules, specific n-qubit sizes, or operation types tested. This prevents independent verification of whether the tiled layout overhead remains negligible across the claimed range of systems and maps.
minor comments (2)
  1. [Abstract] Abstract: The LaTeX markup (e.g., 'Schrö{dinger}') should be rendered consistently in the final version.
  2. [Introduction] The manuscript would benefit from an explicit statement of the supported class of maps (e.g., whether only unitary conjugations or a broader set of k-local channels are implemented) to avoid ambiguity with the 'generic' phrasing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and indicate the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim states that Orkan supports 'generic quantum operations' via 'positive linear maps' on Hermitian operators while storing only the lower triangle at tile granularity and using k-local conjugation algorithms that 'update all entries of the hermitian matrix in a single pass.' General positive maps (including arbitrary CPTP maps) cannot always be reduced to k-local conjugations; they typically require vectorization or full superoperator action whose cost is not shown to be compatible with the tiled layout. This is load-bearing for the 'roughly halving ... wall time' claim for generic operations.

    Authors: We appreciate the referee pointing out this scope clarification. The Orkan implementation and all reported benchmarks are restricted to operations that admit a k-local conjugation representation (unitary evolution under k-local Hamiltonians and certain local noise models), for which the single-pass tiled update is valid and yields the claimed memory and time reductions. While the underlying tiled lower-triangle storage can in principle be used with general positive maps via superoperator action, the manuscript does not claim or benchmark single-pass performance for arbitrary CPTP maps. We will revise the abstract, introduction, and methods sections to explicitly delimit the supported operation class and note that general maps may incur additional overhead not covered by the current single-pass algorithms. revision: yes

  2. Referee: [Benchmarks] Benchmarks (throughout): The reported 2–4× speedups and 'roughly halving' memory/time claims are asserted without error bars, detailed methodology, data-exclusion rules, specific n-qubit sizes, or operation types tested. This prevents independent verification of whether the tiled layout overhead remains negligible across the claimed range of systems and maps.

    Authors: We fully agree that the current benchmark presentation lacks the necessary detail for independent verification. In the revised manuscript we will expand the benchmarks section to report: (i) error bars from at least ten independent runs per data point, (ii) complete methodology including hardware specifications, compiler flags, timing methodology, and warm-up procedures, (iii) explicit data-exclusion rules, (iv) the precise qubit counts tested (4–12 qubits), and (v) the exact operation families benchmarked (random k-local unitaries, Pauli-string conjugations, and local depolarizing channels). We will also release the benchmark scripts and raw timing data as supplementary material. revision: yes

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on the standard symmetry property of Hermitian matrices and introduces a new data layout; no fitted parameters or new physical entities are required.

axioms (1)
  • standard math Hermitian matrices satisfy M = M†, so only the lower triangle needs to be stored.
    Basic linear-algebra fact invoked to justify the memory reduction.
invented entities (1)
  • Tiled lower-triangle memory layout no independent evidence
    purpose: Cache-friendly storage of Hermitian operators for faster simulation
    New data structure introduced by the library.

pith-pipeline@v0.9.0 · 5496 in / 1177 out tokens · 46671 ms · 2026-05-10T08:58:22.295985+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Exhaustive and feasible parametrisation with applications to the travelling salesperson problem

    quant-ph 2026-04 unverdicted novelty 7.0

    Exhaustively parametrised feasibility-respecting quantum circuits can reach every feasible solution to problems like TSP with certainty using fixed parameters by leveraging group actions and generating sequences.

Reference graph

Works this paper leans on

38 extracted references · 16 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    From barren plateaus through fertile valleys: Conic exten- sions of parameterised quantum circuits,

    L. Binkowski, G. Koßmann, T. J. Osborne, R. Schwonnek, and T. Ziegler, “From barren plateaus through fertile valleys: Conic exten- sions of parameterised quantum circuits,” in2025 IEEE International Conference on Quantum Computing and Engineering (QCE), vol. 01, 2025, pp. 111–118

  2. [2]

    One for all: Universal quantum conic programming framework for hard-constrained combinatorial optimization problems,

    L. Binkowski, T. J. Osborne, M. Schwiering, R. Schwonnek, and T. Ziegler, “One for all: Universal quantum conic programming framework for hard-constrained combinatorial optimization problems,”

  3. [3]

    Available: https://arxiv.org/abs/2411.00435

    [Online]. Available: https://arxiv.org/abs/2411.00435

  4. [4]

    G.et al.Surface codes: Towards practical large-scale quantum com- putation.Physical Review A86, 032324 (2012)

    A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N. Cleland, “Surface codes: Towards practical large-scale quantum computation,” Phys. Rev. A, vol. 86, p. 032324, Sep 2012. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevA.86.032324

  5. [5]

    Wave-function approach to dissipative processes in quantum optics,

    J. Dalibard, Y . Castin, and K. Mølmer, “Wave-function approach to dissipative processes in quantum optics,”Phys. Rev. Lett., vol. 68, pp. 580–583, Feb 1992. [Online]. Available: https://link.aps.org/doi/10. 1103/PhysRevLett.68.580

  6. [6]

    Monte carlo wave-function method in quantum optics,

    K. Mølmer, Y . Castin, and J. Dalibard, “Monte carlo wave-function method in quantum optics,”J. Opt. Soc. Am. B, vol. 10, no. 3, pp. 524–538, Mar 1993. [Online]. Available: https://opg.optica.org/josab/ abstract.cfm?URI=josab-10-3-524

  7. [7]

    The heisenberg representation of quantum computers,

    D. Gottesman, “The heisenberg representation of quantum computers,”

  8. [8]

    The Heisenberg Representation of Quantum Computers

    [Online]. Available: https://arxiv.org/abs/quant-ph/9807006

  9. [9]

    Improved simulation of stabilizer circuits.Physical Review A, 70(5), 2004

    S. Aaronson and D. Gottesman, “Improved simulation of stabilizer circuits,”Phys. Rev. A, vol. 70, p. 052328, Nov 2004. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevA.70.052328

  10. [10]

    Improving gate-level simulation of quantum circuits,

    G. F. Viamontes, I. L. Markov, and J. P. Hayes, “Improving gate-level simulation of quantum circuits,” 2003. [Online]. Available: https://arxiv.org/abs/quant-ph/0309060

  11. [11]

    Advanced simulation of quantum computations,

    A. Zulehner and R. Wille, “Advanced simulation of quantum computations,” 2018. [Online]. Available: https://arxiv.org/abs/1707. 00865

  12. [12]

    Efficient classical simulation of slightly entangled quantum computations,

    G. Vidal, “Efficient classical simulation of slightly entangled quantum computations,”Phys. Rev. Lett., vol. 91, no. 14, p. 147902, 2003

  13. [13]

    Tensor network circuit simulation at exascale,

    J. Brennan, M. Allalen, H.-J. Bungartz, C. Denniston, M. Galgon, M. Ganahl, A. Hawkins, D. Lykov, R. Orus, and Y . Alexeev, “Tensor network circuit simulation at exascale,” inProc. IEEE/ACM Second Int. Workshop on Quantum Computing Software (QCS), 2021, pp. 26–31

  14. [14]

    Simulating quantum computa- tions on classical machines: A survey,

    K. Young, M. Scese, and A. Ebnenasir, “Simulating quantum computations on classical machines: A survey,” 2023. [Online]. Available: https://arxiv.org/abs/2311.16505

  15. [15]

    QuEST and high performance simulation of quantum computers,

    T. Jones, A. Brown, I. Bush, and S. C. Benjamin, “QuEST and high performance simulation of quantum computers,”Scientific Reports, vol. 9, no. 1, p. 10736, 2019

  16. [16]

    and Mitarai, Kosuke and Imai, Ryosuke and Tamiya, Shiro and Yamamoto, Takahiro and Yan, Tennin and Kawakubo, Toru and Nakagawa, Yuya O

    Y . Suzuki, Y . Kawase, Y . Masumura, Y . Hiraga, M. Nakadai, J. Chen, K. M. Nakanishi, K. Mitarai, R. Imai, S. Tamiya, T. Yamamoto, T. Yan, T. Kawakubo, Y . O. Nakagawa, Y . Ibe, Y . Zhang, H. Yamashita, H. Yoshimura, A. Hayashi, and K. Fujii, “Qulacs: a fast and versatile quantum circuit simulator for research purpose,”Quantum, vol. 5, p. 559, October 2...

  17. [17]

    Watrous,The Theory of Quantum Information

    J. Watrous,The Theory of Quantum Information. Cambridge University Press, 2018

  18. [18]

    A parallel quantum computer simulator,

    K. M. Obenland and A. M. Despain, “A parallel quantum computer simulator,” 1998. [Online]. Available: https://arxiv.org/abs/quant-ph/ 9804039

  19. [19]

    Massively parallel quantum computer simulator,

    K. De Raedt, K. Michielsen, H. De Raedt, B. Trieu, G. Arnold, M. Richter, T. Lippert, H. Watanabe, and N. Ito, “Massively parallel quantum computer simulator,”Computer Physics Communications, vol. 176, no. 2, p. 121–136, jan 2007. [Online]. Available: http://dx.doi.org/10.1016/j.cpc.2006.08.007

  20. [20]

    mpiqulacs: A distributed quantum computer simulator for a64fx-based cluster systems,

    S. Imamura, M. Yamazaki, T. Honda, A. Kasagi, A. Tabuchi, H. Nakao, N. Fukumoto, and K. Nakashima, “mpiqulacs: A distributed quantum computer simulator for a64fx-based cluster systems,” 2022. [Online]. Available: https://arxiv.org/abs/2203.16044

  21. [21]

    Nemirovsky and D

    M. Nemirovsky and D. M. Tullsen,Multithreading Architecture, ser. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2013. [Online]. Available: https://doi.org/10.2200/ S00458ED1V01Y201212CAC021

  22. [22]

    Understanding caching,

    J. Bottomly, “Understanding caching,”LINUX journal, jan 2004. [Online]. Available: https://www.linuxjournal.com/article/7105

  23. [23]

    Understanding the Impact of Memory Access Patterns in Intel Processors ,

    M. A. H. Monil, S. Lee, J. S. Vetter, and A. D. Malony, “ Understanding the Impact of Memory Access Patterns in Intel Processors ,” in2020 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC). Los Alamitos, CA, USA: IEEE Computer Society, Nov 2020, pp. 52–61. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/MCHPC51950.2020.00012

  24. [24]

    False sharing and its effect on shared memory performance,

    W. J. Bolosky and M. L. Scott, “False sharing and its effect on shared memory performance,” inUSENIX Experiences with Distributed and Multiprocessor Systems (SEDMS IV). San Diego, CA: USENIX Association, Sep

  25. [25]

    Available: https://www.usenix.org/conference/sedms-iv/ false-sharing-and-its-effect-shared-memory-performance

    [Online]. Available: https://www.usenix.org/conference/sedms-iv/ false-sharing-and-its-effect-shared-memory-performance

  26. [26]

    Exploiting superword level parallelism with multimedia instruction sets,

    S. Larsen and S. Amarasinghe, “Exploiting superword level parallelism with multimedia instruction sets,” inProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). New York, NY , USA: Association for Computing Machinery,

  27. [27]

    Available: https://doi.org/10.1145/349299.349320

    [Online]. Available: https://doi.org/10.1145/349299.349320

  28. [28]

    Superword-level parallelism in the presence of control flow,

    J. Shin, M. Hall, and J. Chame, “Superword-level parallelism in the presence of control flow,” inInternational Symposium on Code Generation and Optimization (CGO). USA: IEEE Computer Society,

  29. [29]

    Available: https://doi.org/10.1109/CGO.2005.33

    [Online]. Available: https://doi.org/10.1109/CGO.2005.33

  30. [30]

    Density matrix quantum circuit simulation via the bsp machine on modern gpu clusters,

    A. Li, O. Subasi, X. Yang, and S. Krishnamoorthy, “Density matrix quantum circuit simulation via the bsp machine on modern gpu clusters,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020, pp. 1–15

  31. [31]

    Tiling, block data layout, and memory hierarchy performance,

    N. Park, B. Hong, and V . K. Prasanna, “Tiling, block data layout, and memory hierarchy performance,”IEEE Trans. Parallel Distrib. Syst., vol. 14, no. 7, p. 640–654, jul 2003. [Online]. Available: https://doi.org/10.1109/TPDS.2003.1214317

  32. [32]

    Algorithm 865: Fortran 95 subroutines for cholesky factorization in block hybrid format,

    F. G. Gustavson, J. K. Reid, and J. Wa ´sniewski, “Algorithm 865: Fortran 95 subroutines for cholesky factorization in block hybrid format,”ACM Trans. Math. Softw., vol. 33, no. 1, p. 8–es, mar 2007. [Online]. Available: https://doi.org/10.1145/1206040.1206048

  33. [33]

    Slate: design of a modern distributed and accelerated linear algebra library,

    M. Gates, J. Kurzak, A. Charara, A. YarKhan, and J. Dongarra, “Slate: design of a modern distributed and accelerated linear algebra library,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’19. New York, NY , USA: Association for Computing Machinery, 2019. [Online]. Available: https:...

  34. [34]

    Quantum computing with Qiskit

    A. Javadi-Abhari, M. Treinish, K. Krsulich, C. J. Wood, J. Lishman, J. Gacon, S. Martiel, P. D. Nation, L. S. Bishop, A. W. Cross, B. R. Johnson, and J. M. Gambetta, “Quantum computing with qiskit,” 2024. [Online]. Available: https://arxiv.org/abs/2405.08810

  35. [35]

    Qiskit aer: High-performance quantum computing simulators,

    Qiskit Contributors, “Qiskit aer: High-performance quantum computing simulators,” 2024, software, version 0.17.2. [Online]. Available: https://github.com/Qiskit/qiskit-aer

  36. [36]

    QuEST: The quantum exact simulation toolkit,

    T. Jones, K. Sherbert, S. Brown, and QuEST-Kit Contributors, “QuEST: The quantum exact simulation toolkit,” 2024, software, version 4.2.0. [Online]. Available: https://github.com/QuEST-Kit/QuEST

  37. [37]

    Qulacs: Variational quantum circuit simulator,

    Y . Suzuki, Y . Kawase, Y . Masumura, and Qulacs Contributors, “Qulacs: Variational quantum circuit simulator,” 2021, software, version 0.6.12. [Online]. Available: https://github.com/qulacs/qulacs

  38. [38]

    Openmp: an industry standard api for shared- memory programming,

    L. Dagum and R. Menon, “Openmp: an industry standard api for shared- memory programming,”IEEE Computational Science and Engineering, vol. 5, no. 1, pp. 46–55, 1998. APPENDIX A. Cross-tile subroutines This appendix details the three subroutines invoked by Al- gorithm 4. Unlike the two-pass approach( U⊗I) (I⊗U) vec(ρ) used by established simulators (Section...