Qimax: Efficient quantum simulation via GPU-accelerated extended stabilizer formalism

Bui Cao Doanh; Le Vu Trung Duong; Pham Hoai Luan; Vu Tuan Hai; Yasuhiko Nakashima

arxiv: 2505.03307 · v3 · pith:M4TAIZOUnew · submitted 2025-05-06 · 🪐 quant-ph · cs.SE

Qimax: Efficient quantum simulation via GPU-accelerated extended stabilizer formalism

Vu Tuan Hai , Bui Cao Doanh , Le Vu Trung Duong , Pham Hoai Luan , Yasuhiko Nakashima This is my paper

Pith reviewed 2026-05-22 17:04 UTC · model grok-4.3

classification 🪐 quant-ph cs.SE

keywords quantum simulationstabilizer formalismGPU accelerationnear-Clifford circuitsClifford circuitsquantum error correctionparallel computing

0 comments

The pith

A GPU-parallelized extended stabilizer formalism simulates near-Clifford circuits more efficiently than full state-vector methods in tested cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a parallel version of the extended stabilizer formalism that runs on GPUs to simulate Clifford and near-Clifford quantum circuits. Instead of tracking the full state vector, the method works with stabilizers, which uses less memory and compute for many circuits of interest in quantum error correction. The authors implement this in Python and report that it runs faster than Qiskit and Pennylane on selected examples. A reader cares because faster classical simulation lets researchers test larger error-correcting codes and near-Clifford algorithms without waiting for scarce quantum hardware.

Core claim

We introduce a parallelized version of the extended stabilizer formalism, enabling efficient execution on multi-core devices such as GPU. Experimental results demonstrate that, in certain scenarios, our Python-based implementation outperforms state-of-the-art simulators such as Qiskit and Pennylane.

What carries the argument

GPU-accelerated parallel operations on the extended stabilizer formalism that replace sequential stabilizer updates with concurrent multi-core updates for near-Clifford circuits.

If this is right

Near-Clifford circuits become simulable with far lower memory than state-vector methods.
Sequential bottlenecks in stabilizer updates are removed by distributing operations across GPU cores.
A pure-Python implementation can match or exceed the speed of established simulators on supported hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If parallel efficiency remains high at larger ranks, the method could scale to useful sizes of quantum error-correcting codes.
The same GPU kernel pattern might transfer to other multi-core accelerators such as TPUs or clusters.
Integration with existing circuit compilers could let users switch to this simulator for the near-Clifford fragments of larger workloads.

Load-bearing premise

The reported speedups hold only for circuits whose stabilizer rank stays modest and whose ancilla overhead remains small.

What would settle it

Measure runtime and output fidelity on circuits deliberately chosen to have high stabilizer rank; if the GPU version slows down or produces incorrect results while sequential methods remain accurate, the performance claim fails.

read the original abstract

Simulating Clifford and near-Clifford circuits using the extended stabilizer formalism has become increasingly popular, particularly in quantum error correction. Compared to the state-vector approach, the extended stabilizer formalism can solve the same problems with fewer computational resources, as it operates on stabilizers rather than full state vectors. Most existing studies on near-Clifford circuits focus on balancing the trade-off between the number of ancilla qubits and simulation accuracy, often overlooking performance considerations. Furthermore, in the presence of high-rank stabilizers, performance is limited by the sequential property of the stabilizer formalism. In this work, we introduce a parallelized version of the extended stabilizer formalism, enabling efficient execution on multi-core devices such as GPU. Experimental results demonstrate that, in certain scenarios, our Python-based implementation outperforms state-of-the-art simulators such as Qiskit and Pennylane.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Qimax gives a practical GPU speed-up for modest-rank near-Clifford circuits over Qiskit and Pennylane, but the parallelization's behavior at high stabilizer rank is not demonstrated.

read the letter

Qimax is a Python package that parallelizes the extended stabilizer formalism on GPUs and reports faster runtimes than Qiskit or Pennylane on selected near-Clifford test cases. The authors start from the known extended stabilizer method, which uses ancilla qubits to keep simulation cost lower than full state vectors, and they move the tableau updates to GPU kernels to cut wall-clock time. This is a direct engineering response to the sequential bottleneck the abstract itself flags for high-rank stabilizers. The timing results are the concrete new piece, and they appear to come from actual runs rather than just theory. If the released code matches the described kernels, that adds real value for anyone who needs to simulate many such circuits on available hardware. The paper explains the formalism clearly enough and gives credit to the prior work it builds on. The soft spot is the limited scope of the benchmarks. The abstract notes that high-rank stabilizers remain constrained by sequential steps, yet the reported comparisons use circuits where rank and ancilla overhead stay modest. No scaling curves versus rank or checks for race conditions and load balance in the GPU schedule are described in the provided text. If the parallel decomposition still serializes on the highest-rank generators, the claimed advantage would shrink precisely on the harder instances that matter most for error-correction studies. This paper is for researchers running repeated near-Clifford simulations who have GPU access and want shorter runtimes in the regimes it covers. A reader who needs a faster tool for modest-rank cases will get immediate use from it. I would send it to peer review because the empirical claim is worth documenting and the authors can be asked to add the missing scaling data and circuit details without changing the core contribution.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Qimax, a Python implementation of a GPU-parallelized extended stabilizer formalism for simulating Clifford and near-Clifford circuits. It argues that this approach reduces resource requirements relative to state-vector methods and, in certain scenarios, outperforms established simulators such as Qiskit and Pennylane by addressing sequential bottlenecks in high-rank stabilizer updates.

Significance. If the performance advantage is shown to hold under rigorous scaling tests, the work would be useful for quantum error correction simulations that rely on extended stabilizer techniques, offering a practical GPU route to larger near-Clifford instances. The open Python code base is a clear strength for reproducibility.

major comments (2)

[Abstract and §5] Abstract and §5 (Experimental results): the central claim that the GPU-parallelized implementation outperforms Qiskit and Pennylane is presented without any quantitative timing data, error bars, circuit specifications, or stabilizer-rank values. Because the abstract itself identifies sequential limitations at high rank, the absence of scaling plots versus rank or explicit high-rank benchmarks makes the outperformance claim impossible to evaluate.
[§4] §4 (Parallelization method): the description of the GPU kernel for tableau updates and ancilla injection does not address potential race conditions or load imbalance when the number of independent stabilizers grows. Given the acknowledged sequential property of the formalism, a concrete argument or test that the parallel schedule remains asymptotically efficient at high rank is required to support the performance claims.

minor comments (2)

[§2] §2 (Background): the notation distinguishing the extended stabilizer generators from standard Pauli tableaux is introduced without a small worked example; adding one would improve readability.
[Figure 3] Figure 3 caption: axis labels and the meaning of the plotted quantity (wall-clock time versus what parameter) are not stated explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and agree that revisions will improve clarity and rigor. We will update the manuscript to include additional quantitative details and discussions as outlined.

read point-by-point responses

Referee: [Abstract and §5] Abstract and §5 (Experimental results): the central claim that the GPU-parallelized implementation outperforms Qiskit and Pennylane is presented without any quantitative timing data, error bars, circuit specifications, or stabilizer-rank values. Because the abstract itself identifies sequential limitations at high rank, the absence of scaling plots versus rank or explicit high-rank benchmarks makes the outperformance claim impossible to evaluate.

Authors: We acknowledge that the abstract and §5 would benefit from more explicit quantitative support. While the manuscript reports outperformance in selected scenarios, we agree the current presentation lacks sufficient detail for full evaluation. In the revised manuscript we will add timing measurements with error bars, explicit circuit specifications including stabilizer-rank values, and scaling plots versus rank (including high-rank cases) to substantiate the claims. revision: yes
Referee: [§4] §4 (Parallelization method): the description of the GPU kernel for tableau updates and ancilla injection does not address potential race conditions or load imbalance when the number of independent stabilizers grows. Given the acknowledged sequential property of the formalism, a concrete argument or test that the parallel schedule remains asymptotically efficient at high rank is required to support the performance claims.

Authors: We thank the referee for this point. The GPU kernel processes independent stabilizers concurrently with explicit synchronization barriers during tableau updates to eliminate race conditions, and ancilla injection is batched to reduce imbalance. We will revise §4 to include a detailed argument on the parallel schedule together with empirical scaling tests at high ranks, showing that the approach remains efficient despite the sequential aspects of the formalism. revision: yes

Circularity Check

0 steps flagged

No derivation chain or fitted parameters; empirical benchmark claim only

full rationale

The manuscript introduces a GPU-parallelized implementation of the extended stabilizer formalism and reports wall-clock comparisons against Qiskit and Pennylane on selected circuits. No equations, ansatzes, uniqueness theorems, or parameter-fitting steps appear in the provided text. The central claim is therefore an empirical performance observation rather than a mathematical derivation that could reduce to its own inputs by construction. The abstract itself flags the sequential limitation at high stabilizer rank, but this is an acknowledged engineering constraint, not a self-referential loop. Consequently the paper is self-contained against external benchmarks and exhibits no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities. The performance claim rests on the unstated assumption that the chosen benchmark circuits are representative and that GPU memory access patterns remain efficient for the stabilizer data structures.

pith-pipeline@v0.9.0 · 5684 in / 1013 out tokens · 33438 ms · 2026-05-22T17:04:20.540437+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

Nature , author=

Arute, F., Arya, K., Babbush, R., Bacon, D., Bardin, J.C., Barends, R., Biswas, R., Boixo, S., Brandao, F.G.S.L., Buell, D.A., Burkett, B., Chen, Y., Chen, Z., Chiaro, B., Collins, R., Courtney, W., Dunsworth, A., Farhi, E., Foxen, B., Fowler, A., Gidney, C., Giustina, M., Graff, R., Guerin, K., Habegger, S., Harrigan, M.P., Hartmann, M.J., Ho, A., Hoffma...

work page doi:10.1038/s41586-019-1666-5 2019
[2]

Javadi-Abhari, A., al: Quantum computing with Qiskit (2024)

work page 2024
[3]

Almudever, and Sebastian Feld

Bayraktar, al: cuquantum sdk: A high-performance library for accelerating quan- tum science. In: 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), vol. 01, pp. 1050–1061 (2023). https://doi.org/10.1109/ QCE57702.2023.00119

work page arXiv 2023
[4]

Quantum7, 1108 (2023) https://doi.org/10.22331/q-2023-09-11-1108

Vinkhuijzen, L., Coopmans, T., Elkouss, D., Dunjko, V., Laarman, A.: LIMDD: A Decision Diagram for Simulation of Quantum Computing Including Stabilizer States. Quantum7, 1108 (2023) https://doi.org/10.22331/q-2023-09-11-1108

work page doi:10.22331/q-2023-09-11-1108 2023
[6]

In: Gurfinkel, A., Ganesh, V

Mei, J., Bonsangue, M., Laarman, A.: Simulating quantum circuits by model counting. In: Gurfinkel, A., Ganesh, V. (eds.) Computer Aided Verification, pp. 555–578. Springer, Cham (2024) 14

work page 2024
[7]

Quantum3, 181 (2019) https://doi.org/10.22331/q-2019-09-02-181

Bravyi, S., Browne, D., Calpin, P., Campbell, E., Gosset, D., Howard, M.: Sim- ulation of quantum circuits by low-rank stabilizer decompositions. Quantum3, 181 (2019) https://doi.org/10.22331/q-2019-09-02-181

work page doi:10.22331/q-2019-09-02-181 2019
[8]

Stim: a fast stabilizer circuit simulator.Quantum, 5:497, July 2021

Gidney, C.: Stim: a fast stabilizer circuit simulator. Quantum5, 497 (2021) https: //doi.org/10.22331/q-2021-07-06-497

work page doi:10.22331/q-2021-07-06-497 2021
[9]

Quantum Science and Technology7(4), 044001 (2022)

Kissinger, A., Wetering, J.: Simulating quantum circuits with zx-calculus reduced stabiliser decompositions. Quantum Science and Technology7(4), 044001 (2022)

work page 2022
[10]

In: Coecke, B., Leifer, M

Kissinger, A., Wetering, J.: PyZX: Large Scale Automated Diagrammatic Reason- ing. In: Coecke, B., Leifer, M. (eds.) Proceedings 16th International Conference on Quantum Physics and Logic, Chapman University, Orange, CA, USA., 10- 14 June 2019. Electronic Proceedings in Theoretical Computer Science, vol. 318, pp. 229–241. Open Publishing Association, ??? ...

work page 2019
[11]

In: Gurfinkel, A., Heule, M

Osama, M., Thanos, D., Laarman, A.: Parallel equivalence checking of stabilizer quantum circuits on gpus. In: Gurfinkel, A., Heule, M. (eds.) Tools and Algo- rithms for the Construction and Analysis of Systems, pp. 109–128. Springer, Cham (2025)

work page 2025
[12]

In: Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS) (2017)

Okuta, R., Unno, Y., Nishino, D., Hido, S., Loomis, C.: Cupy: A numpy- compatible library for nvidia gpu calculations. In: Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS) (2017)

work page 2017
[13]

Bergholm, V., al: PennyLane: Automatic differentiation of hybrid quantum- classical computations (2022)

work page 2022
[14]

Quantum, doi:10.22331/q-2023-07-20-1062

Quetschlich, N., Burgholzer, L., Wille, R.: MQT Bench: Benchmarking Software and Design Automation Tools for Quantum Computing. Quantum (2023) https: //doi.org/10.22331/q-2023-07-20-1062 arxiv:2204.13719 [quant-ph] 15

work page doi:10.22331/q-2023-07-20-1062 2023

[1] [1]

Nature , author=

Arute, F., Arya, K., Babbush, R., Bacon, D., Bardin, J.C., Barends, R., Biswas, R., Boixo, S., Brandao, F.G.S.L., Buell, D.A., Burkett, B., Chen, Y., Chen, Z., Chiaro, B., Collins, R., Courtney, W., Dunsworth, A., Farhi, E., Foxen, B., Fowler, A., Gidney, C., Giustina, M., Graff, R., Guerin, K., Habegger, S., Harrigan, M.P., Hartmann, M.J., Ho, A., Hoffma...

work page doi:10.1038/s41586-019-1666-5 2019

[2] [2]

Javadi-Abhari, A., al: Quantum computing with Qiskit (2024)

work page 2024

[3] [3]

Almudever, and Sebastian Feld

Bayraktar, al: cuquantum sdk: A high-performance library for accelerating quan- tum science. In: 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), vol. 01, pp. 1050–1061 (2023). https://doi.org/10.1109/ QCE57702.2023.00119

work page arXiv 2023

[4] [4]

Quantum7, 1108 (2023) https://doi.org/10.22331/q-2023-09-11-1108

Vinkhuijzen, L., Coopmans, T., Elkouss, D., Dunjko, V., Laarman, A.: LIMDD: A Decision Diagram for Simulation of Quantum Computing Including Stabilizer States. Quantum7, 1108 (2023) https://doi.org/10.22331/q-2023-09-11-1108

work page doi:10.22331/q-2023-09-11-1108 2023

[5] [6]

In: Gurfinkel, A., Ganesh, V

Mei, J., Bonsangue, M., Laarman, A.: Simulating quantum circuits by model counting. In: Gurfinkel, A., Ganesh, V. (eds.) Computer Aided Verification, pp. 555–578. Springer, Cham (2024) 14

work page 2024

[6] [7]

Quantum3, 181 (2019) https://doi.org/10.22331/q-2019-09-02-181

Bravyi, S., Browne, D., Calpin, P., Campbell, E., Gosset, D., Howard, M.: Sim- ulation of quantum circuits by low-rank stabilizer decompositions. Quantum3, 181 (2019) https://doi.org/10.22331/q-2019-09-02-181

work page doi:10.22331/q-2019-09-02-181 2019

[7] [8]

Stim: a fast stabilizer circuit simulator.Quantum, 5:497, July 2021

Gidney, C.: Stim: a fast stabilizer circuit simulator. Quantum5, 497 (2021) https: //doi.org/10.22331/q-2021-07-06-497

work page doi:10.22331/q-2021-07-06-497 2021

[8] [9]

Quantum Science and Technology7(4), 044001 (2022)

Kissinger, A., Wetering, J.: Simulating quantum circuits with zx-calculus reduced stabiliser decompositions. Quantum Science and Technology7(4), 044001 (2022)

work page 2022

[9] [10]

In: Coecke, B., Leifer, M

Kissinger, A., Wetering, J.: PyZX: Large Scale Automated Diagrammatic Reason- ing. In: Coecke, B., Leifer, M. (eds.) Proceedings 16th International Conference on Quantum Physics and Logic, Chapman University, Orange, CA, USA., 10- 14 June 2019. Electronic Proceedings in Theoretical Computer Science, vol. 318, pp. 229–241. Open Publishing Association, ??? ...

work page 2019

[10] [11]

In: Gurfinkel, A., Heule, M

Osama, M., Thanos, D., Laarman, A.: Parallel equivalence checking of stabilizer quantum circuits on gpus. In: Gurfinkel, A., Heule, M. (eds.) Tools and Algo- rithms for the Construction and Analysis of Systems, pp. 109–128. Springer, Cham (2025)

work page 2025

[11] [12]

In: Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS) (2017)

Okuta, R., Unno, Y., Nishino, D., Hido, S., Loomis, C.: Cupy: A numpy- compatible library for nvidia gpu calculations. In: Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Thirty-first Annual Conference on Neural Information Processing Systems (NIPS) (2017)

work page 2017

[12] [13]

Bergholm, V., al: PennyLane: Automatic differentiation of hybrid quantum- classical computations (2022)

work page 2022

[13] [14]

Quantum, doi:10.22331/q-2023-07-20-1062

Quetschlich, N., Burgholzer, L., Wille, R.: MQT Bench: Benchmarking Software and Design Automation Tools for Quantum Computing. Quantum (2023) https: //doi.org/10.22331/q-2023-07-20-1062 arxiv:2204.13719 [quant-ph] 15

work page doi:10.22331/q-2023-07-20-1062 2023