arxiv: 2604.15765 · v1 · submitted 2026-04-17 · 🪐 quant-ph

Recognition: unknown

Orkan: Cache-friendly simulation of quantum operations on hermitian operators

Timo Ziegler

Authors on Pith no claims yet

Pith reviewed 2026-05-10 08:58 UTC · model grok-4.3

classification 🪐 quant-ph

keywords hermitianquantummatrixmemoryoperationssimulationdensityfootprint

0 comments

The pith

Orkan simulates quantum operations on Hermitian operators using a cache-friendly tiled lower-triangle layout, halving memory and achieving 2-4x speedups over Qiskit Aer, QuEST, and Qulacs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Quantum operations on states or observables are described by positive maps on Hermitian matrices. Standard simulators vectorize the full density matrix and store every element even though Hermitian symmetry means half the entries are redundant. Orkan instead uses a tiled memory layout that keeps only the lower triangle at tile granularity. This reduces both storage and the number of operations needed. The library includes special algorithms for k-local conjugations that update the entire matrix in one pass and works the same way whether the state or the observable is being evolved. Benchmarks on standard quantum simulation tasks show consistent 2-4 times faster wall-clock times than three widely used libraries, with part of the gain coming from the smaller memory footprint and better cache behavior.

Core claim

I introduce Orkan, a simulation library that uses a tiled memory layout storing only the lower triangle of the hermitian matrix at tile granularity, roughly halving both the memory footprint and the wall time to simulate the evolution of quantum states under generic quantum operations.

Load-bearing premise

That the overhead of managing the tiled layout and the k-local update algorithms remains negligible compared with the memory savings across the full range of n-qubit systems and operation types.

Figures

Figures reproduced from arXiv: 2604.15765 by Timo Ziegler.

**Figure 2.** Figure 2: The tiled format with a tile size of M = 2. The hermitian matrix H is covered with tiles of edge length M and only tiles Tti,tj in the lower-triangle, i.e., with ti ≥ tj are stored to a contiguous array. Within the tiles, all elements are stored in row-major format to enhance SIMD vectorisation. stored ( [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: (a) Time per execution of the single-qubit depolarising channel averaged over all possible qubit positions (lower is better). The tiled and packed implementations separate from the field from n = 3 qubits on, with packed leading tiled until n ≈ 12. The speedup settles to roughly 2× in the intermediate regime (10 ≤ n ≤ 14); at n = 15 the competitors enter an out-of-core regime in which the gap widens to 14×… view at source ↗

**Figure 4.** Figure 4: (a) Time per execution of the Pauli-X gate averaged over all possible qubit positions. Except for small systems (n ≤ 4), where Qiskit Aer’s backend is slightly faster, the tiled and packed implementations exhibit a time saving by a factor of roughly four for intermediate systems (10 ≤ n ≤ 14). This additional factor of two is expected to stem from the double-pass for native gates in the established simulat… view at source ↗

**Figure 5.** Figure 5: (a) Time per execution of the controlled Pauli-X (CNOT) gate averaged over all possible qubit positions. In contrast to earlier native gates, the asymptotic speedup of the tiled and the packed implementation over QuEST is limited to a factor of three and two, respectively. (b) Effective bandwidth of the controlled Pauli-X (CNOT) gate execution over all possible qubit positions. Settling at ∼ 150 GiB/s, the… view at source ↗

**Figure 6.** Figure 6: Effective bandwidth of the Hadamard gate execution [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

read the original abstract

Classical simulation of quantum operations is essential for algorithm design, noise characterisation, and benchmarking of quantum hardware. The most general physically realisable operation can be described by a positive linear map acting on a hermitian operator, representing either a density matrix or an observable. Established simulators vectorise the density matrix on an $n$-qubit Hilbert space and reuse state-vector kernels, storing all $2^{2n}$ elements and forgoing the benefits of hermitian symmetry. In this work, I introduce \emph{Orkan}, a simulation library that uses a tiled memory layout storing only the lower triangle of the hermitian matrix at tile granularity, roughly halving both the memory footprint and the wall time to simulate the evolution of quantum states under generic quantum operations. The implementation treats any hermitian operator uniformly and is agnostic to whether the Schr\"{o}dinger or Heisenberg picture is used. Dedicated $k$-local conjugation algorithms update all entries of the hermitian matrix in a single pass. Benchmarks against Qiskit Aer, QuEST, and Qulacs show consistent wall-clock speedups of $2$-$4{\times}$ partly attributable to the reduced memory footprint.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Orkan's tiled lower-triangle storage for Hermitian operators is a solid engineering tweak that can cut memory and time, but the paper's claim to handle fully generic positive maps rests on k-local conjugations that don't obviously extend to arbitrary CPTP maps.

read the letter

The core contribution is a cache-friendly tiled layout that stores only the lower triangle of Hermitian matrices at tile level, paired with single-pass k-local conjugation routines. This is new relative to the vectorized approaches in Qiskit Aer, QuEST, and Qulacs, and it directly exploits the symmetry that those tools ignore. The reported 2-4x wall-time gains and halved memory footprint are plausible for the cases where the operations stay within k-local conjugations, and the uniform treatment of Schrödinger and Heisenberg pictures is a nice practical touch. Benchmarks against three established simulators give the work some grounding, even if the abstract leaves out error bars and exact methodology details. That said, the generality claim is the soft spot. Positive linear maps on Hermitian operators include arbitrary CPTP channels, yet the described algorithms are built around k-local conjugations that update entries in one pass. Not every physically valid map reduces cleanly to that form without vectorization or full superoperator overhead, so the halving claim for truly generic operations may only hold under an unstated restriction to unitary or k-local cases. If the full code and paper clarify the supported map class, this becomes a minor scope issue rather than a load-bearing one. The work is aimed at simulator developers and users who run repeated Hermitian evolutions on mid-sized systems and want straightforward memory wins. It shows clear thinking on the implementation side and honest comparison to prior tools. I would bring it to a reading group for the concrete storage idea and would send it to peer review so the scope and benchmark details can be tightened.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Orkan, a simulation library for quantum operations acting on Hermitian operators (density matrices or observables). It employs a tiled memory layout that stores only the lower triangle of the Hermitian matrix at tile granularity, claiming this roughly halves both memory footprint and wall-clock time for evolution under generic quantum operations. The library is agnostic to Schrödinger or Heisenberg picture and uses dedicated k-local conjugation algorithms to update all matrix entries in a single pass. Benchmarks against Qiskit Aer, QuEST, and Qulacs report consistent 2–4× speedups.

Significance. If the performance claims and generality hold, the tiled lower-triangle approach offers a practical advance in memory-efficient classical simulation of quantum systems by exploiting Hermitian symmetry without vectorization. This could benefit algorithm design, noise characterization, and hardware benchmarking. The single-pass k-local update strategy and uniform treatment of operators are notable strengths, as is the provision of reproducible benchmarks against established simulators.

major comments (2)

[Abstract] Abstract: The central claim states that Orkan supports 'generic quantum operations' via 'positive linear maps' on Hermitian operators while storing only the lower triangle at tile granularity and using k-local conjugation algorithms that 'update all entries of the hermitian matrix in a single pass.' General positive maps (including arbitrary CPTP maps) cannot always be reduced to k-local conjugations; they typically require vectorization or full superoperator action whose cost is not shown to be compatible with the tiled layout. This is load-bearing for the 'roughly halving ... wall time' claim for generic operations.
[Benchmarks] Benchmarks (throughout): The reported 2–4× speedups and 'roughly halving' memory/time claims are asserted without error bars, detailed methodology, data-exclusion rules, specific n-qubit sizes, or operation types tested. This prevents independent verification of whether the tiled layout overhead remains negligible across the claimed range of systems and maps.

minor comments (2)

[Abstract] Abstract: The LaTeX markup (e.g., 'Schrö{dinger}') should be rendered consistently in the final version.
[Introduction] The manuscript would benefit from an explicit statement of the supported class of maps (e.g., whether only unitary conjugations or a broader set of k-local channels are implemented) to avoid ambiguity with the 'generic' phrasing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and indicate the revisions we will make to the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim states that Orkan supports 'generic quantum operations' via 'positive linear maps' on Hermitian operators while storing only the lower triangle at tile granularity and using k-local conjugation algorithms that 'update all entries of the hermitian matrix in a single pass.' General positive maps (including arbitrary CPTP maps) cannot always be reduced to k-local conjugations; they typically require vectorization or full superoperator action whose cost is not shown to be compatible with the tiled layout. This is load-bearing for the 'roughly halving ... wall time' claim for generic operations.

Authors: We appreciate the referee pointing out this scope clarification. The Orkan implementation and all reported benchmarks are restricted to operations that admit a k-local conjugation representation (unitary evolution under k-local Hamiltonians and certain local noise models), for which the single-pass tiled update is valid and yields the claimed memory and time reductions. While the underlying tiled lower-triangle storage can in principle be used with general positive maps via superoperator action, the manuscript does not claim or benchmark single-pass performance for arbitrary CPTP maps. We will revise the abstract, introduction, and methods sections to explicitly delimit the supported operation class and note that general maps may incur additional overhead not covered by the current single-pass algorithms. revision: yes
Referee: [Benchmarks] Benchmarks (throughout): The reported 2–4× speedups and 'roughly halving' memory/time claims are asserted without error bars, detailed methodology, data-exclusion rules, specific n-qubit sizes, or operation types tested. This prevents independent verification of whether the tiled layout overhead remains negligible across the claimed range of systems and maps.

Authors: We fully agree that the current benchmark presentation lacks the necessary detail for independent verification. In the revised manuscript we will expand the benchmarks section to report: (i) error bars from at least ten independent runs per data point, (ii) complete methodology including hardware specifications, compiler flags, timing methodology, and warm-up procedures, (iii) explicit data-exclusion rules, (iv) the precise qubit counts tested (4–12 qubits), and (v) the exact operation families benchmarked (random k-local unitaries, Pauli-string conjugations, and local depolarizing channels). We will also release the benchmark scripts and raw timing data as supplementary material. revision: yes

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on the standard symmetry property of Hermitian matrices and introduces a new data layout; no fitted parameters or new physical entities are required.

axioms (1)

standard math Hermitian matrices satisfy M = M†, so only the lower triangle needs to be stored.
Basic linear-algebra fact invoked to justify the memory reduction.

invented entities (1)

Tiled lower-triangle memory layout no independent evidence
purpose: Cache-friendly storage of Hermitian operators for faster simulation
New data structure introduced by the library.

pith-pipeline@v0.9.0 · 5496 in / 1177 out tokens · 46671 ms · 2026-05-10T08:58:22.295985+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Exhaustive and feasible parametrisation with applications to the travelling salesperson problem
quant-ph 2026-04 unverdicted novelty 7.0

Exhaustively parametrised feasibility-respecting quantum circuits can reach every feasible solution to problems like TSP with certainty using fixed parameters by leveraging group actions and generating sequences.

Reference graph

Works this paper leans on

38 extracted references · 16 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

From barren plateaus through fertile valleys: Conic exten- sions of parameterised quantum circuits,

L. Binkowski, G. Koßmann, T. J. Osborne, R. Schwonnek, and T. Ziegler, “From barren plateaus through fertile valleys: Conic exten- sions of parameterised quantum circuits,” in2025 IEEE International Conference on Quantum Computing and Engineering (QCE), vol. 01, 2025, pp. 111–118

2025
[2]

One for all: Universal quantum conic programming framework for hard-constrained combinatorial optimization problems,

L. Binkowski, T. J. Osborne, M. Schwiering, R. Schwonnek, and T. Ziegler, “One for all: Universal quantum conic programming framework for hard-constrained combinatorial optimization problems,”
[3]

Available: https://arxiv.org/abs/2411.00435

[Online]. Available: https://arxiv.org/abs/2411.00435

work page arXiv
[4]

G.et al.Surface codes: Towards practical large-scale quantum com- putation.Physical Review A86, 032324 (2012)

A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N. Cleland, “Surface codes: Towards practical large-scale quantum computation,” Phys. Rev. A, vol. 86, p. 032324, Sep 2012. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevA.86.032324

work page doi:10.1103/physreva.86.032324 2012
[5]

Wave-function approach to dissipative processes in quantum optics,

J. Dalibard, Y . Castin, and K. Mølmer, “Wave-function approach to dissipative processes in quantum optics,”Phys. Rev. Lett., vol. 68, pp. 580–583, Feb 1992. [Online]. Available: https://link.aps.org/doi/10. 1103/PhysRevLett.68.580

1992
[6]

Monte carlo wave-function method in quantum optics,

K. Mølmer, Y . Castin, and J. Dalibard, “Monte carlo wave-function method in quantum optics,”J. Opt. Soc. Am. B, vol. 10, no. 3, pp. 524–538, Mar 1993. [Online]. Available: https://opg.optica.org/josab/ abstract.cfm?URI=josab-10-3-524

1993
[7]

The heisenberg representation of quantum computers,

D. Gottesman, “The heisenberg representation of quantum computers,”
[8]

The Heisenberg Representation of Quantum Computers

[Online]. Available: https://arxiv.org/abs/quant-ph/9807006

work page internal anchor Pith review arXiv
[9]

Improved simulation of stabilizer circuits.Physical Review A, 70(5), 2004

S. Aaronson and D. Gottesman, “Improved simulation of stabilizer circuits,”Phys. Rev. A, vol. 70, p. 052328, Nov 2004. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevA.70.052328

work page doi:10.1103/physreva.70.052328 2004
[10]

Improving gate-level simulation of quantum circuits,

G. F. Viamontes, I. L. Markov, and J. P. Hayes, “Improving gate-level simulation of quantum circuits,” 2003. [Online]. Available: https://arxiv.org/abs/quant-ph/0309060

work page arXiv 2003
[11]

Advanced simulation of quantum computations,

A. Zulehner and R. Wille, “Advanced simulation of quantum computations,” 2018. [Online]. Available: https://arxiv.org/abs/1707. 00865

2018
[12]

Efficient classical simulation of slightly entangled quantum computations,

G. Vidal, “Efficient classical simulation of slightly entangled quantum computations,”Phys. Rev. Lett., vol. 91, no. 14, p. 147902, 2003

2003
[13]

Tensor network circuit simulation at exascale,

J. Brennan, M. Allalen, H.-J. Bungartz, C. Denniston, M. Galgon, M. Ganahl, A. Hawkins, D. Lykov, R. Orus, and Y . Alexeev, “Tensor network circuit simulation at exascale,” inProc. IEEE/ACM Second Int. Workshop on Quantum Computing Software (QCS), 2021, pp. 26–31

2021
[14]

Simulating quantum computa- tions on classical machines: A survey,

K. Young, M. Scese, and A. Ebnenasir, “Simulating quantum computations on classical machines: A survey,” 2023. [Online]. Available: https://arxiv.org/abs/2311.16505

work page arXiv 2023
[15]

QuEST and high performance simulation of quantum computers,

T. Jones, A. Brown, I. Bush, and S. C. Benjamin, “QuEST and high performance simulation of quantum computers,”Scientific Reports, vol. 9, no. 1, p. 10736, 2019

2019
[16]

and Mitarai, Kosuke and Imai, Ryosuke and Tamiya, Shiro and Yamamoto, Takahiro and Yan, Tennin and Kawakubo, Toru and Nakagawa, Yuya O

Y . Suzuki, Y . Kawase, Y . Masumura, Y . Hiraga, M. Nakadai, J. Chen, K. M. Nakanishi, K. Mitarai, R. Imai, S. Tamiya, T. Yamamoto, T. Yan, T. Kawakubo, Y . O. Nakagawa, Y . Ibe, Y . Zhang, H. Yamashita, H. Yoshimura, A. Hayashi, and K. Fujii, “Qulacs: a fast and versatile quantum circuit simulator for research purpose,”Quantum, vol. 5, p. 559, October 2...

work page doi:10.22331/q-2021-10-06-559 2021
[17]

Watrous,The Theory of Quantum Information

J. Watrous,The Theory of Quantum Information. Cambridge University Press, 2018

2018
[18]

A parallel quantum computer simulator,

K. M. Obenland and A. M. Despain, “A parallel quantum computer simulator,” 1998. [Online]. Available: https://arxiv.org/abs/quant-ph/ 9804039

1998
[19]

Massively parallel quantum computer simulator,

K. De Raedt, K. Michielsen, H. De Raedt, B. Trieu, G. Arnold, M. Richter, T. Lippert, H. Watanabe, and N. Ito, “Massively parallel quantum computer simulator,”Computer Physics Communications, vol. 176, no. 2, p. 121–136, jan 2007. [Online]. Available: http://dx.doi.org/10.1016/j.cpc.2006.08.007

work page doi:10.1016/j.cpc.2006.08.007 2007
[20]

mpiqulacs: A distributed quantum computer simulator for a64fx-based cluster systems,

S. Imamura, M. Yamazaki, T. Honda, A. Kasagi, A. Tabuchi, H. Nakao, N. Fukumoto, and K. Nakashima, “mpiqulacs: A distributed quantum computer simulator for a64fx-based cluster systems,” 2022. [Online]. Available: https://arxiv.org/abs/2203.16044

work page arXiv 2022
[21]

Nemirovsky and D

M. Nemirovsky and D. M. Tullsen,Multithreading Architecture, ser. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2013. [Online]. Available: https://doi.org/10.2200/ S00458ED1V01Y201212CAC021

2013
[22]

Understanding caching,

J. Bottomly, “Understanding caching,”LINUX journal, jan 2004. [Online]. Available: https://www.linuxjournal.com/article/7105

2004
[23]

Understanding the Impact of Memory Access Patterns in Intel Processors ,

M. A. H. Monil, S. Lee, J. S. Vetter, and A. D. Malony, “ Understanding the Impact of Memory Access Patterns in Intel Processors ,” in2020 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC). Los Alamitos, CA, USA: IEEE Computer Society, Nov 2020, pp. 52–61. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/MCHPC51950.2020.00012

work page doi:10.1109/mchpc51950.2020.00012 2020
[24]

False sharing and its effect on shared memory performance,

W. J. Bolosky and M. L. Scott, “False sharing and its effect on shared memory performance,” inUSENIX Experiences with Distributed and Multiprocessor Systems (SEDMS IV). San Diego, CA: USENIX Association, Sep
[25]

Available: https://www.usenix.org/conference/sedms-iv/ false-sharing-and-its-effect-shared-memory-performance

[Online]. Available: https://www.usenix.org/conference/sedms-iv/ false-sharing-and-its-effect-shared-memory-performance
[26]

Exploiting superword level parallelism with multimedia instruction sets,

S. Larsen and S. Amarasinghe, “Exploiting superword level parallelism with multimedia instruction sets,” inProceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). New York, NY , USA: Association for Computing Machinery,
[27]

Available: https://doi.org/10.1145/349299.349320

[Online]. Available: https://doi.org/10.1145/349299.349320

work page doi:10.1145/349299.349320
[28]

Superword-level parallelism in the presence of control flow,

J. Shin, M. Hall, and J. Chame, “Superword-level parallelism in the presence of control flow,” inInternational Symposium on Code Generation and Optimization (CGO). USA: IEEE Computer Society,
[29]

Available: https://doi.org/10.1109/CGO.2005.33

[Online]. Available: https://doi.org/10.1109/CGO.2005.33

work page doi:10.1109/cgo.2005.33 2005
[30]

Density matrix quantum circuit simulation via the bsp machine on modern gpu clusters,

A. Li, O. Subasi, X. Yang, and S. Krishnamoorthy, “Density matrix quantum circuit simulation via the bsp machine on modern gpu clusters,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020, pp. 1–15

2020
[31]

Tiling, block data layout, and memory hierarchy performance,

N. Park, B. Hong, and V . K. Prasanna, “Tiling, block data layout, and memory hierarchy performance,”IEEE Trans. Parallel Distrib. Syst., vol. 14, no. 7, p. 640–654, jul 2003. [Online]. Available: https://doi.org/10.1109/TPDS.2003.1214317

work page doi:10.1109/tpds.2003.1214317 2003
[32]

Algorithm 865: Fortran 95 subroutines for cholesky factorization in block hybrid format,

F. G. Gustavson, J. K. Reid, and J. Wa ´sniewski, “Algorithm 865: Fortran 95 subroutines for cholesky factorization in block hybrid format,”ACM Trans. Math. Softw., vol. 33, no. 1, p. 8–es, mar 2007. [Online]. Available: https://doi.org/10.1145/1206040.1206048

work page doi:10.1145/1206040.1206048 2007
[33]

Slate: design of a modern distributed and accelerated linear algebra library,

M. Gates, J. Kurzak, A. Charara, A. YarKhan, and J. Dongarra, “Slate: design of a modern distributed and accelerated linear algebra library,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’19. New York, NY , USA: Association for Computing Machinery, 2019. [Online]. Available: https:...

work page doi:10.1145/3295500.3356223 2019
[34]

Quantum computing with Qiskit

A. Javadi-Abhari, M. Treinish, K. Krsulich, C. J. Wood, J. Lishman, J. Gacon, S. Martiel, P. D. Nation, L. S. Bishop, A. W. Cross, B. R. Johnson, and J. M. Gambetta, “Quantum computing with qiskit,” 2024. [Online]. Available: https://arxiv.org/abs/2405.08810

work page internal anchor Pith review arXiv 2024
[35]

Qiskit aer: High-performance quantum computing simulators,

Qiskit Contributors, “Qiskit aer: High-performance quantum computing simulators,” 2024, software, version 0.17.2. [Online]. Available: https://github.com/Qiskit/qiskit-aer

2024
[36]

QuEST: The quantum exact simulation toolkit,

T. Jones, K. Sherbert, S. Brown, and QuEST-Kit Contributors, “QuEST: The quantum exact simulation toolkit,” 2024, software, version 4.2.0. [Online]. Available: https://github.com/QuEST-Kit/QuEST

2024
[37]

Qulacs: Variational quantum circuit simulator,

Y . Suzuki, Y . Kawase, Y . Masumura, and Qulacs Contributors, “Qulacs: Variational quantum circuit simulator,” 2021, software, version 0.6.12. [Online]. Available: https://github.com/qulacs/qulacs

2021
[38]

Openmp: an industry standard api for shared- memory programming,

L. Dagum and R. Menon, “Openmp: an industry standard api for shared- memory programming,”IEEE Computational Science and Engineering, vol. 5, no. 1, pp. 46–55, 1998. APPENDIX A. Cross-tile subroutines This appendix details the three subroutines invoked by Al- gorithm 4. Unlike the two-pass approach( U⊗I) (I⊗U) vec(ρ) used by established simulators (Section...

1998