pith. machine review for the scientific record.

arxiv: 2604.12256 · v1 · submitted 2026-04-14 · 🪐 quant-ph · cs.SE

Recognition: unknown

Large-Scale Quantum Circuit Simulation on HPC Cluster via Cache Blocking, Boosting, and Gate Fusion Optimization

Pith reviewed 2026-05-10 16:10 UTC · model grok-4.3

classification 🪐 quant-ph cs.SE
keywords quantum circuit simulation · gate fusion · cache blocking · HPC optimization · merge booster · diagonal detector · circuit restructuring · full-state simulation

The pith

An extensible framework with merge booster and diagonal detector accelerates full-state quantum circuit simulations by optimizing data locality and gate operations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that full-state quantum circuit simulation, essential for developing and debugging quantum algorithms before hardware implementation, can be made much faster through targeted optimizations that improve how data is accessed and how gates are combined during execution. It introduces an extensible framework that restructures circuits automatically and adapts simulation strategies depending on the type of quantum operations involved. Central to this are two new components: a merge booster that fuses gates to reduce computations and a diagonal detector that identifies simplifications based on entanglement patterns. Benchmarks show these changes deliver substantial speedups over prior simulators, making it practical to handle larger qubit systems without proportional increases in runtime.

Core claim

The authors present a framework that integrates circuit restructuring with adaptive execution, plus the merge booster and diagonal detector, to enhance both data locality via cache blocking and computational efficiency via gate fusion. This yields speedups reaching 160 times on circuit-level benchmarks and 34 times on diagonal-heavy gate-level benchmarks relative to existing simulators.
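
As a rough illustration of the cache-blocking idea named in the core claim (not the paper's implementation, which this page does not describe), the Python sketch below applies a run of single-qubit gates chunk by chunk instead of sweeping the whole state vector once per gate. The function names, the chunk size, and the restriction to gates on low-order ("chunk-local") qubits are all assumptions made for illustration.

```python
import math

def apply_1q(state, gate, q, start=0, size=None):
    """Apply a 2x2 gate to qubit q on state[start:start+size], in place."""
    size = len(state) if size is None else size
    stride = 1 << q
    for base in range(start, start + size, 2 * stride):
        for i in range(base, base + stride):
            a0, a1 = state[i], state[i + stride]
            state[i] = gate[0][0] * a0 + gate[0][1] * a1
            state[i + stride] = gate[1][0] * a0 + gate[1][1] * a1

def run_gate_by_gate(state, gates):
    """Baseline: every gate sweeps the full state vector."""
    for gate, q in gates:
        apply_1q(state, gate, q)

def run_cache_blocked(state, gates, chunk_qubits):
    """Apply all gates to one chunk before moving on to the next chunk.

    Only valid when every gate targets a qubit below chunk_qubits, so that
    each amplitude pair lives inside a single chunk.
    """
    chunk = 1 << chunk_qubits
    if any(q >= chunk_qubits for _, q in gates):
        raise ValueError("gate targets a qubit outside the chunk")
    for start in range(0, len(state), chunk):
        for gate, q in gates:
            apply_1q(state, gate, q, start, chunk)

# Hypothetical 4-qubit example with 3 chunk-local qubits.
s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]
X = [[0, 1], [1, 0]]
gates = [(H, 0), (X, 1), (H, 2), (H, 0)]

baseline = [0j] * 16
baseline[0] = 1 + 0j
blocked = list(baseline)

run_gate_by_gate(baseline, gates)
run_cache_blocked(blocked, gates, chunk_qubits=3)
max_diff = max(abs(a - b) for a, b in zip(baseline, blocked))
```

Both orderings produce the same amplitudes; the blocked version simply touches each chunk once for the whole gate run. In a real simulator the chunk would be sized to a cache level or a device's memory, and gates on higher qubits would first need a qubit swap or data exchange, a step omitted here.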

What carries the argument

The merge booster and diagonal detector, which apply entanglement-inspired fusion and detection rules to restructure and simplify quantum circuit execution while preserving correctness.
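
The merge booster's rules are not spelled out on this page, so the following is only a generic gate-fusion sketch in Python: two consecutive single-qubit gates on the same qubit are multiplied into one 2x2 matrix, so the state vector is swept once rather than twice. The names `fuse`, `matmul2`, and `apply_1q` are illustrative assumptions, not identifiers from the paper.

```python
import cmath
import math

def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[a[i][0] * b[0][j] + a[i][1] * b[1][j] for j in range(2)]
            for i in range(2)]

def fuse(first, second):
    """Matrix for 'first, then second' on the same qubit: second @ first."""
    return matmul2(second, first)

def apply_1q(state, gate, q):
    """Return a new state with the 2x2 gate applied to qubit q."""
    out = list(state)
    stride = 1 << q
    for i in range(len(state)):
        if not i & stride:
            a0, a1 = state[i], state[i | stride]
            out[i] = gate[0][0] * a0 + gate[0][1] * a1
            out[i | stride] = gate[1][0] * a0 + gate[1][1] * a1
    return out

s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]
T = [[1, 0], [0, cmath.exp(1j * math.pi / 4)]]

state = [1 + 0j, 0j, 0j, 0j]                         # |00>
sequential = apply_1q(apply_1q(state, H, 1), T, 1)   # two sweeps
fused = apply_1q(state, fuse(H, T), 1)               # one sweep
max_diff = max(abs(a - b) for a, b in zip(sequential, fused))
```

The fused and sequential results agree to rounding error. The saving comes from memory traffic: the matrix product is tiny, while each avoided sweep skips a full pass over the exponentially large state vector.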

If this is right

  • Larger qubit counts become feasible to simulate classically within practical time limits, supporting more extensive algorithm prototyping.
  • Diagonal-dominant circuits in particular benefit from reduced operation counts without loss of accuracy.
  • The extensible design allows the same optimizations to be applied across different hardware backends for portable gains.
  • Redundant computations decrease overall, lowering the energy and resource demands of simulation runs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same locality and fusion principles might transfer to tensor-network or other approximate simulation methods for even larger systems.
  • Hardware architects could use the detected patterns to prioritize features that classical emulators handle efficiently.
  • Integration with hybrid workflows could let developers alternate quickly between simulation and real-device runs.

Load-bearing premise

The new merge booster, diagonal detector, and circuit restructuring components deliver consistent speed gains across many different circuit types without adding hidden costs that vary by hardware.

What would settle it

A direct comparison on a broad set of random or highly entangled circuits where the new framework shows no net speedup or higher memory use than a standard simulator would disprove the performance claims.

Figures

Figures reproduced from arXiv: 2604.12256 by Chia-Heng Tu, Chuan-Chi Wang, Shih-Hao Hung, Yan-Jie Wang.

Figure 1. Figure 1 illustrates a typical workflow adopted by modern quantum circuit simulation. The input consists of a structured file in a specific format (e.g., Quil [33] and OpenQASM [9]) to represent a raw quantum circuit. Subsequently, a quantum circuit optimizer, similar to a quantum compiler, is proficient in performing various quantum circuit optimizations, such as combining sequential quantum gates to reduce circ…
Figure 2. Memory allocation and quantum gate arrangements
Figure 3. Memory allocation and quantum gate arrangements
Figure 4. Different ways to enable diagonal gate fusion: (a) a naive approach, (b) the proposed diagonal detector optimization.
Figure 5. Elapsed time of the circuit-level benchmarks ranges from 31 to 38 qubits.
Abstract

Quantum circuit simulation is crucial for the development of quantum algorithms, particularly given the high cost and noise limitations of physical quantum hardware. While full-state quantum circuit simulation is commonly employed for prototyping and debugging, it poses challenges because of the exponential increase in simulation time for large quantum systems. In this work, we propose an extensible framework designed to enhance simulation performance by optimizing both data locality and computational efficiency, thereby addressing these challenges. This framework is seamlessly integrated with an optimizer that restructures quantum circuits and a simulator that adjusts execution strategies for various quantum operations. For the newly developed components, merge booster and diagonal detector, the underlying algorithms are inspired by the principles of quantum entanglement and gate fusion, as well as by the limitations identified in existing third-party simulation libraries. The experiments were conducted on eight DGX-H100 workstations, each equipped with eight NVIDIA H100 GPUs, employing both gate-level and circuit-level benchmarks. The results indicate a speedup of up to 160 times for circuit-level benchmarks and an acceleration of up to 34 times for diagonal-heavy gate-level benchmarks compared to existing simulators. The proposed methodologies are anticipated to deliver more robust and faster quantum circuit simulations, thereby fostering the advancement of novel quantum algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes an extensible framework for large-scale quantum circuit simulation on HPC clusters, combining cache blocking, a merge booster, a diagonal detector, gate fusion, and a circuit restructuring optimizer. The new components are motivated by entanglement and fusion principles. Experiments on eight DGX-H100 nodes (each with eight H100 GPUs) using gate-level and circuit-level benchmarks report peak speedups of up to 160x for circuit-level cases and 34x for diagonal-heavy gate-level cases relative to existing simulators.

Significance. If the reported gains hold under broader testing, the work could meaningfully extend the scale of simulatable quantum circuits, aiding algorithm prototyping. The multi-node GPU cluster evaluation demonstrates practical scalability on current HPC hardware, and the combination of locality and fusion optimizations offers a coherent engineering approach. However, the absence of component ablations and baseline specifications limits the ability to attribute gains specifically to the novel elements.

major comments (3)
  1. [Experiments] Experiments section: No ablation studies isolate the performance impact of the merge booster, diagonal detector, cache blocking, and gate fusion; without these, it is impossible to verify that the newly introduced components are responsible for the claimed speedups rather than baseline optimizations or hardware effects.
  2. [Experiments] Experiments section: Speedup results are given exclusively as peak 'up to' values (160x circuit-level, 34x diagonal-heavy gate-level) with no average-case metrics, standard deviations, error bars, or details on the number of runs, circuit sizes, or gate distributions used in the benchmarks.
  3. [Experiments] Experiments section: Comparisons are made only to unspecified 'existing simulators' without naming the libraries, versions, or optimization flags employed, and no portability results are shown on non-NVIDIA hardware or non-diagonal circuits, undermining the assertion of robust gains.
minor comments (2)
  1. [Abstract] Abstract: Lacks any mention of error bars, benchmark circuit specifications, or the precise baseline simulators, which would help readers assess the scope of the performance claims.
  2. The manuscript would benefit from pseudocode or high-level algorithmic descriptions for the merge booster and diagonal detector to allow independent verification of the entanglement- and fusion-inspired logic.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions planned for the manuscript.

  1. Referee: Experiments section: No ablation studies isolate the performance impact of the merge booster, diagonal detector, cache blocking, and gate fusion; without these, it is impossible to verify that the newly introduced components are responsible for the claimed speedups rather than baseline optimizations or hardware effects.

    Authors: We agree that the current manuscript does not include explicit ablation studies isolating the contribution of each component. In the revised version, we will add ablation experiments that start from a baseline implementation and incrementally enable cache blocking, gate fusion, the merge booster, and the diagonal detector, reporting the resulting performance deltas to attribute gains to the novel elements. revision: yes

  2. Referee: Experiments section: Speedup results are given exclusively as peak 'up to' values (160x circuit-level, 34x diagonal-heavy gate-level) with no average-case metrics, standard deviations, error bars, or details on the number of runs, circuit sizes, or gate distributions used in the benchmarks.

    Authors: The reported figures are the maximum observed speedups. We will revise the Experiments section to report average speedups across the full benchmark suite, along with details on the number of circuits, their qubit counts, gate distributions, and the number of runs performed. Because each benchmark was executed once given the high cost of HPC cluster time, we will provide the observed range of results rather than standard deviations or error bars. revision: partial

  3. Referee: Experiments section: Comparisons are made only to unspecified 'existing simulators' without naming the libraries, versions, or optimization flags employed, and no portability results are shown on non-NVIDIA hardware or non-diagonal circuits, undermining the assertion of robust gains.

    Authors: We will update the manuscript to name the specific baseline simulators, their versions, and optimization flags used in all comparisons. We will also add a discussion of the framework's design for NVIDIA GPU clusters and note that experiments were limited to the DGX-H100 nodes; while the optimizations are not inherently NVIDIA-specific, we did not evaluate non-NVIDIA hardware or non-diagonal circuits and will state this scope limitation explicitly. revision: partial
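
One way to report the averages and ranges promised in the responses above, alongside the peak values, is sketched below in Python. The choice of geometric mean is an editorial assumption (the rebuttal does not say which average will be used), and the timings are made-up placeholders, not measurements from the paper.

```python
import statistics

def summarize_speedups(baseline_seconds, optimized_seconds):
    """Per-benchmark speedups reduced to min, max, and geometric mean."""
    if len(baseline_seconds) != len(optimized_seconds):
        raise ValueError("timing lists must have the same length")
    speedups = [b / o for b, o in zip(baseline_seconds, optimized_seconds)]
    return {
        "min": min(speedups),
        "max": max(speedups),
        "geomean": statistics.geometric_mean(speedups),
    }

# Placeholder timings in seconds (NOT from the paper).
baseline = [10.0, 40.0, 90.0]
optimized = [5.0, 10.0, 10.0]
summary = summarize_speedups(baseline, optimized)
```

With these placeholder numbers the peak speedup is 9x while the geometric mean is roughly 4.2x, which shows why an "up to" figure on its own can overstate typical gains.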

Circularity Check

0 steps flagged

No circularity: empirical performance claims on hardware benchmarks

The paper describes an engineering framework for quantum circuit simulation optimizations (cache blocking, merge booster, diagonal detector, gate fusion, circuit restructuring) and reports measured speedups from direct execution-time experiments on eight DGX-H100 nodes with H100 GPUs. No equations, fitted parameters, self-definitional quantities, or derivation chains appear in the provided text. Claims rest on external benchmarks against third-party simulators rather than reducing to self-citations or inputs by construction. The work is self-contained as an empirical optimization study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The performance claims rest on standard assumptions about quantum state-vector simulation costs and the effectiveness of HPC memory optimizations; the two new components are introduced without independent theoretical justification beyond inspiration from entanglement and gate fusion.

axioms (1)
  • domain assumption: Full-state quantum circuit simulation time grows exponentially with qubit count due to state-vector size
    Stated in the abstract as the core challenge being addressed.
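
The exponential growth behind this axiom can be made concrete with a back-of-the-envelope calculation. The Python sketch below assumes 16 bytes per amplitude (double-precision complex); this page does not state which precision the paper uses, so treat that figure as an assumption. The 31- and 38-qubit endpoints are the range quoted in the Figure 5 caption.

```python
def state_vector_bytes(num_qubits, bytes_per_amplitude=16):
    """Memory for a full state vector: 2**n amplitudes."""
    return (1 << num_qubits) * bytes_per_amplitude

for n in (31, 38):
    gib = state_vector_bytes(n) / 2**30
    print(f"{n} qubits: {gib:,.0f} GiB")
```

Under that assumption, 31 qubits need 32 GiB and 38 qubits need 4,096 GiB (4 TiB); every additional qubit doubles the footprint.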
invented entities (2)
  • merge booster (no independent evidence)
    purpose: Restructures circuits to exploit entanglement-like patterns for better fusion and locality
    New component added to the optimizer; no independent evidence provided beyond empirical speedup.
  • diagonal detector (no independent evidence)
    purpose: Identifies diagonal-heavy gates for specialized fusion to accelerate simulation
    New component added to the simulator; no independent evidence provided beyond empirical speedup.
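
To make the diagonal-gate idea concrete, here is a minimal Python sketch under stated assumptions: it checks whether 2x2 gates are diagonal and, if so, applies a whole run of them in one pass by multiplying each amplitude by the product of the matching diagonal entries. This is a generic illustration of diagonal fusion, not the paper's diagonal detector, whose algorithm is not described on this page; all function names are hypothetical.

```python
import cmath
import math

def is_diagonal(gate, tol=1e-12):
    """True when both off-diagonal entries of a 2x2 gate are (near) zero."""
    return abs(gate[0][1]) <= tol and abs(gate[1][0]) <= tol

def apply_1q(state, gate, q):
    """Generic path: return a new state with the 2x2 gate applied to qubit q."""
    out = list(state)
    stride = 1 << q
    for i in range(len(state)):
        if not i & stride:
            a0, a1 = state[i], state[i | stride]
            out[i] = gate[0][0] * a0 + gate[0][1] * a1
            out[i | stride] = gate[1][0] * a0 + gate[1][1] * a1
    return out

def apply_diagonal_run(state, gates):
    """Apply a run of diagonal (gate, qubit) pairs in a single pass."""
    if not all(is_diagonal(gate) for gate, _ in gates):
        raise ValueError("run contains a non-diagonal gate")
    out = []
    for i, amp in enumerate(state):
        phase = 1
        for gate, q in gates:
            bit = (i >> q) & 1          # which diagonal entry applies
            phase *= gate[bit][bit]
        out.append(amp * phase)
    return out

s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]
Z = [[1, 0], [0, -1]]
S = [[1, 0], [0, 1j]]
T = [[1, 0], [0, cmath.exp(1j * math.pi / 4)]]
run = [(Z, 0), (T, 1), (S, 0)]

state = [complex(1 / math.sqrt(8))] * 8   # uniform 3-qubit superposition

generic = state
for gate, q in run:                       # three sweeps
    generic = apply_1q(generic, gate, q)
fused = apply_diagonal_run(state, run)    # one sweep
max_diff = max(abs(a - b) for a, b in zip(generic, fused))
```

Because diagonal gates never mix amplitudes, they commute with one another and need no pairing of indices, which is why a run of them can be collapsed this way regardless of which qubits they target.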

Reference graph

Works this paper leans on

41 extracted references · 31 canonical work pages · 2 internal anchors

  1. [1]

    2024. cuBLAS: Basic Linear Algebra on NVIDIA GPUs. https://developer.nvidia.com/cublas

  2. [2]

    2024. Qibojit Benchmarks: Benchmarking quantum simulation. https://github.com/qiboteam/qibojit-benchmarks

  3. [3]

    2026. NVIDIA DGX H100. https://www.nvidia.com/zh-tw/data-center/dgx-h100/

  4. [4]

    Ethan Bernstein and Umesh Vazirani. 1997. Quantum Complexity Theory. SIAM J. Comput. 26, 5 (1997), 1411–1473. doi:10.1137/S0097539796300921

  5. [5]

    Shin-Wei Chiu, Chuo-Min Yang, Shan-Jung Hou, Po-Hsuan Huang, Chuan-Chi Wang, Chia-Heng Tu, and Shih-Hao Hung. 2025. FOR-QAOA: Fully Optimized Resource-Efficient QAOA Circuit Simulation for Solving the Max-Cut Problems. In Practice and Experience in Advanced Research Computing 2025: The Power of Collaboration (PEARC ’25). Association for Computing Machinery...

  6. [6]

    Jerry Chow, Oliver Dial, and Jay Gambetta. 2021. IBM Quantum breaks the 100-qubit processor barrier. https://research.ibm.com/blog/127-qubit-quantum-processor-eagle

  7. [7]

    D. Coppersmith. 2002. An approximate Fourier transform useful in quantum factoring. arXiv:quant-ph/0201067 [quant-ph]

  8. [8]

    Andrew W. Cross, Lev S. Bishop, Sarah Sheldon, Paul D. Nation, and Jay M. Gambetta. 2019. Validating quantum computers using randomized model circuits. Physical Review A 100, 3 (Sept. 2019). doi:10.1103/physreva.100.032328

  9. [9]

    Andrew W. Cross, Lev S. Bishop, John A. Smolin, and Jay M. Gambetta. 2017. Open Quantum Assembly Language. doi:10.48550/ARXIV.1707.03429

  10. [10]

    The cuQuantum development team. 2023. cuQuantum. doi:10.5281/zenodo.7806810

  11. [11]

    Cirq development team. 2022. Cirq is a Python library for writing, manipulating, and optimizing quantum circuits and running them against quantum computers and simulators. https://github.com/quantumlib/Cirq

  12. [12]

    Jun Doi and Hiroshi Horii. 2020. Cache Blocking Technique to Large Scale Quantum Computing Simulation on Supercomputers. In 2020 IEEE International Conference on Quantum Computing and Engineering (QCE). IEEE. doi:10.1109/qce49297.2020.00035

  13. [13]

    Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. 2014. A Quantum Approximate Optimization Algorithm. arXiv:1411.4028 [quant-ph]

  14. [14]

    Vlad Gheorghiu. 2018. Quantum++: A modern C++ quantum computing library. PLOS ONE 13, 12 (Dec. 2018), e0208073

  15. [15]

    Hiroshi Horii and Jun Doi. 2021. Optimization of Quantum Computing Simulation with Gate Fusion. https://ipsj.ixsq.nii.ac.jp/record/210570/files/IPSJ-QS21002023.pdf

  16. [16]

    Chia-Hsin Hsu, Chuan-Chi Wang, Nai-Wei Hsu, Chia-Heng Tu, and Shih-Hao Hung. 2023. Towards Scalable Quantum Circuit Simulation via RDMA. In Proceedings of the 2023 International Conference on Research in Adaptive and Convergent Systems (Gdansk, Poland) (RACS ’23). Association for Computing Machinery, New York, NY, USA, Article 3, 8 pages. doi:10.1145/35999...

  17. [17]

    Antti-Pekka Hynninen and Dmitry I. Lyakh. 2017. cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs. arXiv:1705.01598 [cs.MS] https://arxiv.org/abs/1705.01598

  18. [18]

    Thomas Häner and Damian S. Steiger. 2017. 0.5 petabyte simulation of a 45-qubit quantum circuit. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM. doi:10.1145/3126908.3126947

  19. [19]

    Satoshi Imamura, Masafumi Yamazaki, Takumi Honda, Akihiko Kasagi, Akihiro Tabuchi, Hiroshi Nakao, Naoto Fukumoto, and Kohta Nakashima. 2022. mpiQulacs: A Distributed Quantum Computer Simulator for A64FX-based Cluster Systems. arXiv:2203.16044 [cs.DC]

  20. [20]

    Ali Javadi-Abhari, Matthew Treinish, Kevin Krsulich, Christopher J. Wood, Jake Lishman, Julien Gacon, Simon Martiel, Paul D. Nation, Lev S. Bishop, Andrew W. Cross, Blake R. Johnson, and Jay M. Gambetta. 2024. Quantum computing with Qiskit. arXiv:2405.08810 [quant-ph] doi:10.48550/arXiv.2405.08810

  21. [21]

    Chenyang Jiao, Weihua Zhang, and Li Shen. 2023. Communication Optimizations for State-vector Quantum Simulator on CPU+GPU Clusters. In Proceedings of the 52nd International Conference on Parallel Processing (Salt Lake City, UT, USA) (ICPP ’23). Association for Computing Machinery, New York, NY, USA, 203–212. https://doi.org/10.1145/3605573.3605631

  22. [22]

    Tyson Jones, Anna Brown, Ian Bush, and Simon Benjamin. 2019. QuEST and High Performance Simulation of Quantum Computers. Scientific Reports 9 (July 2019). doi:10.1038/s41598-019-47174-9

  23. [23]

    Yu-Cheng Lin, Chuan-Chi Wang, Chia-Heng Tu, and Shih-Hao Hung. 2024. Towards Optimizations of Quantum Circuit Simulation for Solving Max-Cut Problems with QAOA. In Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing (SAC ’24). ACM, 1487–1494. doi:10.1145/3605098.3635897

  24. [24]

    Ji Liu, Peiyi Li, and Huiyang Zhou. 2022. Not All SWAPs Have the Same Cost: A Case for Optimization-Aware Qubit Routing. arXiv:2205.10596 [quant-ph] https://arxiv.org/abs/2205.10596

  25. [25]

    Igor L. Markov, Aneeqa Fatima, Sergei V. Isakov, and Sergio Boixo. 2018. Quantum Supremacy Is Both Closer and Farther than It Appears. arXiv:1807.10749

  26. [26]

    Hsu Nai-Wei, Chuan-Chi Wang, Chia-Hsin Hsu, Chia-Heng Tu, and Hung Shih-Hao. 2024. Toward cost-effective quantum circuit simulation with performance tuning techniques. Connection Science 36, 1 (2024), 2349541. doi:10.1080/09540091.2024.2349541

  27. [27]

    NVIDIA Corporation. 2024. NVIDIA NCCL. https://developer.nvidia.com/nccl

  28. [28]

    Daeyoung Park, Heehoon Kim, Jinpyo Kim, Taehyun Kim, and Jaejin Lee. 2022. SnuQS: scaling quantum circuit simulation using storage devices. In Proceedings of the 36th ACM International Conference on Supercomputing. 1–13

  29. [29]

    Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J. Love, Alán Aspuru-Guzik, and Jeremy L. O’Brien. 2014. A variational eigenvalue solver on a photonic quantum processor. Nature Communications 5, 1 (July 2014). doi:10.1038/ncomms5213

  30. [30]

    Vicente Pina-Canelles, Adrian Auer, and Inés de Vega. 2025. Improving and benchmarking NISQ qubit routers. arXiv:2502.03908 [quant-ph] https://arxiv.org/abs/2502.03908

  31. [31]

    Qiskit contributors. 2023. Qiskit: An Open-source Framework for Quantum Computing. doi:10.5281/zenodo.2573505

  32. [32]

    Mikhail Smelyanskiy, Nicolas P. D. Sawaya, and Alán Aspuru-Guzik. 2016. qHiPSTER: The Quantum High Performance Software Testing Environment. arXiv:1601.07195 [quant-ph]

  33. [33]

    Robert S. Smith, Michael J. Curtis, and William J. Zeng. 2017. A Practical Quantum Instruction Set Architecture. arXiv:1608.03355 [quant-ph] https://arxiv.org/abs/1608.03355

  34. [34]

    Yasunari Suzuki, Yoshiaki Kawase, Yuya Masumura, Yuria Hiraga, Masahiro Nakadai, Jiabao Chen, Ken M. Nakanishi, Kosuke Mitarai, Ryosuke Imai, Shiro Tamiya, Takahiro Yamamoto, Tennin Yan, Toru Kawakubo, Yuya O. Nakagawa, Yohei Ibe, Youyuan Zhang, Hirotsugu Yamashita, Hikaru Yoshimura, Akihiro Hayashi, and Keisuke Fujii. 2021. Qulacs: a fast and versatile q...

  35. [35]

    Quantum AI team and collaborators. 2020. qsim. doi:10.5281/zenodo.4023103

  36. [36]

    Wim van Dam, Sean Hallgren, and Lawrence Ip. 2002. Quantum Algorithms for some Hidden Shift Problems. arXiv:quant-ph/0211140

  37. [37]

    Chuan-Chi Wang, Yu-Cheng Lin, Yan-Jie Wang, Chia-Heng Tu, and Shih-Hao Hung. 2024. Queen: A quick, scalable, and comprehensive quantum circuit simulation for supercomputing. arXiv:2406.14084 [quant-ph] https://arxiv.org/abs/2406.14084

  38. [38]

    Mingkuan Xu, Shiyi Cao, Xupeng Miao, Umut A. Acar, and Zhihao Jia. 2024. Atlas: Hierarchical Partitioning for Quantum Circuit Simulation on GPUs (Extended Version). arXiv:2408.09055 [cs.DC] https://arxiv.org/abs/2408.09055

  39. [39]

    Ge Yan, Wenjie Wu, Yuheng Chen, Kaisen Pan, Xudong Lu, Zixiang Zhou, Yuhan Wang, Ruocheng Wang, and Junchi Yan. 2025. Quantum Circuit Synthesis and Compilation Optimization: Overview and Prospects. arXiv:2407.00736 [quant-ph] https://arxiv.org/abs/2407.00736

  40. [40]

    Chen Zhang, Zeyu Song, Haojie Wang, Kaiyuan Rong, and Jidong Zhai. 2021. HyQuas: hybrid partitioner based quantum circuit simulation system on GPU. In Proceedings of the 35th ACM International Conference on Supercomputing (Virtual Event, USA) (ICS ’21). Association for Computing Machinery, New York, NY, USA, 443–454. doi:10.1145/3447818.3460357

  41. [41]

    Chen Zhang, Haojie Wang, Zixuan Ma, Lei Xie, Zeyu Song, and Jidong Zhai. 2022. UniQ: A Unified Programming Model for Efficient Quantum Circuit Simulation. In 2022 SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE Computer Society, 692–707.