Large-Scale Quantum Circuit Simulation on HPC Cluster via Cache Blocking, Boosting, and Gate Fusion Optimization
Pith reviewed 2026-05-10 16:10 UTC · model grok-4.3
The pith
An extensible framework with merge booster and diagonal detector accelerates full-state quantum circuit simulations by optimizing data locality and gate operations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present a framework that couples a circuit-restructuring optimizer with a simulator that adapts its execution strategy to each kind of operation, and adds two new components, the merge booster and the diagonal detector. Together these target data locality (via cache blocking) and computational efficiency (via gate fusion). The reported peak speedups are up to 160x on circuit-level benchmarks and up to 34x on diagonal-heavy gate-level benchmarks relative to existing simulators.
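To make the gate-fusion idea concrete, here is a minimal NumPy sketch. It is not the paper's implementation: it only merges runs of adjacent single-qubit gates that act on the same qubit into one matrix, so the state vector is swept once per run instead of once per gate. The circuit representation (a list of `(target, matrix)` pairs) and the name `fuse_single_qubit_gates` are assumptions made for illustration.

```python
import numpy as np


def fuse_single_qubit_gates(circuit):
    """Merge runs of adjacent single-qubit gates on the same qubit.

    `circuit` is a list of (target_qubit, 2x2 matrix) pairs in execution
    order. This representation is hypothetical, chosen for the sketch.
    """
    fused = []
    for target, gate in circuit:
        if fused and fused[-1][0] == target:
            # The later gate multiplies on the left: (g2 @ g1)|psi> = g2(g1|psi>).
            fused[-1] = (target, gate @ fused[-1][1])
        else:
            fused.append((target, gate))
    return fused


# Small demonstration: H, T, H on qubit 0 collapse into one matrix.
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
T = np.diag([1, np.exp(1j * np.pi / 4)])
X = np.array([[0, 1], [1, 0]], dtype=complex)
demo = fuse_single_qubit_gates([(0, H), (0, T), (0, H), (1, X)])
```

A production fuser would also combine gates on different qubits into larger multi-qubit matrices; this sketch deliberately stops at the simplest case.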
What carries the argument
The merge booster and diagonal detector, which apply entanglement-inspired fusion and detection rules to restructure and simplify quantum circuit execution while preserving correctness.
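The summary does not spell out the detection rules, so the following is only an illustrative sketch of why detecting diagonal gates can pay off: a diagonal gate never mixes amplitudes, so it can be applied as an elementwise scaling instead of a matrix product. The helper names, the tolerance, and the qubit ordering (qubit 0 is the least significant index bit) are assumptions, not details from the paper.

```python
import numpy as np


def is_diagonal(gate, atol=1e-12):
    """Return True if every off-diagonal entry of `gate` is numerically zero."""
    off_diagonal = gate - np.diag(np.diag(gate))
    return bool(np.all(np.abs(off_diagonal) <= atol))


def apply_single_qubit_gate(state, gate, target):
    """Apply a 2x2 `gate` to qubit `target` of a full state vector.

    Assumes qubit 0 is the least significant bit of the amplitude index.
    """
    num_qubits = state.size.bit_length() - 1
    # Axis 1 of this view indexes the value (0 or 1) of the target qubit.
    view = state.reshape(2 ** (num_qubits - target - 1), 2, 2 ** target)
    if is_diagonal(gate):
        # Fast path: scale the two halves; no amplitudes are combined.
        scaled = view * np.diag(gate)[None, :, None]
        return scaled.reshape(-1)
    # General path: 2x2 matrix product along the target-qubit axis.
    return np.einsum("ab,ibj->iaj", gate, view).reshape(-1)
```

The fast path reads and writes each amplitude once and needs no pairing of amplitudes, which is the property a diagonal-aware simulator can exploit.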
If this is right
- Larger qubit counts become feasible to simulate classically within practical time limits, supporting more extensive algorithm prototyping.
- Diagonal-dominant circuits in particular benefit from reduced operation counts without loss of accuracy.
- The extensible design allows the same optimizations to be applied across different hardware backends for portable gains.
- Redundant computations decrease overall, lowering the energy and resource demands of simulation runs.
Where Pith is reading between the lines
- The same locality and fusion principles might transfer to tensor-network or other approximate simulation methods for even larger systems.
- Hardware architects could use the detected patterns to prioritize features that classical emulators handle efficiently.
- Integration with hybrid workflows could let developers alternate quickly between simulation and real-device runs.
Load-bearing premise
The new merge booster, diagonal detector, and circuit restructuring components deliver consistent speed gains across many different circuit types without adding hidden costs that vary by hardware.
What would settle it
A direct comparison on a broad set of random or highly entangled circuits would settle it: if the new framework showed no net speedup over a standard simulator, or used more memory than one, the performance claims would not hold.
Original abstract
Quantum circuit simulation is crucial for the development of quantum algorithms, particularly given the high cost and noise limitations of physical quantum hardware. While full-state quantum circuit simulation is commonly employed for prototyping and debugging, it poses challenges because of the exponential increase in simulation time for large quantum systems. In this work, we propose an extensible framework designed to enhance simulation performance by optimizing both data locality and computational efficiency, thereby addressing these challenges. This framework is seamlessly integrated with an optimizer that restructures quantum circuits and a simulator that adjusts execution strategies for various quantum operations. For the newly developed components, merge booster and diagonal detector, the underlying algorithms are inspired by the principles of quantum entanglement and gate fusion, as well as by the limitations identified in existing third-party simulation libraries. The experiments were conducted on eight DGX-H100 workstations, each equipped with eight NVIDIA H100 GPUs, employing both gate-level and circuit-level benchmarks. The results indicate a speedup of up to 160 times for circuit-level benchmarks and an acceleration of up to 34 times for diagonal-heavy gate-level benchmarks compared to existing simulators. The proposed methodologies are anticipated to deliver more robust and faster quantum circuit simulations, thereby fostering the advancement of novel quantum algorithms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an extensible framework for large-scale quantum circuit simulation on HPC clusters, combining cache blocking, a merge booster, a diagonal detector, gate fusion, and a circuit restructuring optimizer. The new components are motivated by entanglement and fusion principles. Experiments on eight DGX-H100 nodes (each with eight H100 GPUs) using gate-level and circuit-level benchmarks report peak speedups of up to 160x for circuit-level cases and 34x for diagonal-heavy gate-level cases relative to existing simulators.
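As a rough illustration of cache blocking in general (the report does not describe the authors' scheme), the sketch below greedily splits a gate list into chunks that each touch at most `block_qubits` distinct qubits. A simulator could then remap those qubits to the low-order index bits so that each chunk operates on contiguous, cache-sized slices of the state vector. The function name and the tuple-of-qubit-indices gate format are hypothetical.

```python
def partition_for_blocking(gates, block_qubits):
    """Greedily group gates into chunks touching <= block_qubits qubits.

    `gates` is a list of tuples of qubit indices (hypothetical format).
    Returns a list of (sorted_active_qubits, gates_in_chunk) pairs.
    """
    chunks = []
    current, active = [], set()
    for qubits in gates:
        touched = set(qubits)
        if len(touched) > block_qubits:
            raise ValueError("gate acts on more qubits than one block holds")
        needed = active | touched
        if len(needed) > block_qubits:
            # Close the current chunk and start a new one with this gate.
            chunks.append((sorted(active), current))
            current, needed = [], touched
        current.append(qubits)
        active = needed
    if current:
        chunks.append((sorted(active), current))
    return chunks
```

For example, with `block_qubits=2` the gate list `[(0,), (1,), (0, 1), (2, 3), (3,), (0,)]` splits into three chunks, acting on qubits `[0, 1]`, `[2, 3]`, and `[0]` respectively.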
Significance. If the reported gains hold under broader testing, the work could meaningfully extend the scale of simulatable quantum circuits, aiding algorithm prototyping. The multi-node GPU cluster evaluation demonstrates practical scalability on current HPC hardware, and the combination of locality and fusion optimizations offers a coherent engineering approach. However, the absence of component ablations and baseline specifications limits the ability to attribute gains specifically to the novel elements.
major comments (3)
- [Experiments] No ablation studies isolate the performance impact of the merge booster, diagonal detector, cache blocking, and gate fusion; without these, it is impossible to verify that the newly introduced components are responsible for the claimed speedups rather than baseline optimizations or hardware effects.
- [Experiments] Speedup results are given exclusively as peak 'up to' values (160x circuit-level, 34x diagonal-heavy gate-level) with no average-case metrics, standard deviations, error bars, or details on the number of runs, circuit sizes, or gate distributions used in the benchmarks.
- [Experiments] Comparisons are made only to unspecified 'existing simulators' without naming the libraries, versions, or optimization flags employed, and no portability results are shown on non-NVIDIA hardware or non-diagonal circuits, undermining the assertion of robust gains.
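The second comment asks for average-case metrics alongside the peak figures. One common way to report them, sketched here under assumptions, is the geometric mean of per-benchmark speedups together with the observed range. The function name and the input format (parallel lists of wall-clock seconds) are hypothetical, and no timings from the paper are used.

```python
from statistics import geometric_mean


def summarize_speedups(baseline_seconds, optimized_seconds):
    """Summarize per-benchmark speedups as min, geometric mean, and max.

    Both arguments are equal-length lists of wall-clock times, one entry
    per benchmark circuit (hypothetical input format).
    """
    if len(baseline_seconds) != len(optimized_seconds):
        raise ValueError("need one optimized time per baseline time")
    speedups = [b / o for b, o in zip(baseline_seconds, optimized_seconds)]
    return {
        "min": min(speedups),
        "geomean": geometric_mean(speedups),
        "max": max(speedups),
    }
```

The geometric mean is the usual choice for ratios because a single outlier benchmark cannot dominate it the way it dominates an "up to" figure.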
minor comments (2)
- [Abstract] The abstract does not mention error bars, benchmark circuit specifications, or the specific baseline simulators, any of which would help readers assess the scope of the performance claims.
- The manuscript would benefit from pseudocode or high-level algorithmic descriptions for the merge booster and diagonal detector to allow independent verification of the entanglement- and fusion-inspired logic.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions planned for the manuscript.
- Referee: Experiments section: No ablation studies isolate the performance impact of the merge booster, diagonal detector, cache blocking, and gate fusion; without these, it is impossible to verify that the newly introduced components are responsible for the claimed speedups rather than baseline optimizations or hardware effects.
  Authors: We agree that the current manuscript does not include explicit ablation studies isolating the contribution of each component. In the revised version, we will add ablation experiments that start from a baseline implementation and incrementally enable cache blocking, gate fusion, the merge booster, and the diagonal detector, reporting the resulting performance deltas to attribute gains to the novel elements.
  Revision: yes
- Referee: Experiments section: Speedup results are given exclusively as peak 'up to' values (160x circuit-level, 34x diagonal-heavy gate-level) with no average-case metrics, standard deviations, error bars, or details on the number of runs, circuit sizes, or gate distributions used in the benchmarks.
  Authors: The reported figures are the maximum observed speedups. We will revise the Experiments section to report average speedups across the full benchmark suite, along with details on the number of circuits, their qubit counts, gate distributions, and the number of runs performed. Because each benchmark was executed once given the high cost of HPC cluster time, we will provide the observed range of results rather than standard deviations or error bars.
  Revision: partial
- Referee: Experiments section: Comparisons are made only to unspecified 'existing simulators' without naming the libraries, versions, or optimization flags employed, and no portability results are shown on non-NVIDIA hardware or non-diagonal circuits, undermining the assertion of robust gains.
  Authors: We will update the manuscript to name the specific baseline simulators, their versions, and optimization flags used in all comparisons. We will also add a discussion of the framework's design for NVIDIA GPU clusters and note that experiments were limited to the DGX-H100 nodes; while the optimizations are not inherently NVIDIA-specific, we did not evaluate non-NVIDIA hardware or non-diagonal circuits and will state this scope limitation explicitly.
  Revision: partial
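The incremental ablation promised in the first response could be organized along the following lines. This is a sketch under assumptions: the component flag names mirror the rebuttal's wording, and `run_benchmark` is a hypothetical callable supplied by the experimenter that takes the set of enabled components and returns wall-clock seconds.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

# Order follows the rebuttal: each step enables one more component.
COMPONENTS = ["cache_blocking", "gate_fusion", "merge_booster", "diagonal_detector"]


def ablation_configs(components: List[str]) -> List[Tuple[str, FrozenSet[str]]]:
    """Return the baseline plus cumulative configurations, one per component."""
    configs = [("baseline", frozenset())]
    enabled: List[str] = []
    for name in components:
        enabled.append(name)
        configs.append(("+" + name, frozenset(enabled)))
    return configs


def run_ablation(
    run_benchmark: Callable[[FrozenSet[str]], float],
    components: List[str] = COMPONENTS,
) -> List[Tuple[str, float, Optional[float]]]:
    """Run every configuration and report the speedup over the previous step."""
    results = []
    previous = None
    for label, enabled in ablation_configs(components):
        seconds = run_benchmark(enabled)
        step_speedup = None if previous is None else previous / seconds
        results.append((label, seconds, step_speedup))
        previous = seconds
    return results
```

A cumulative design like this attributes gains in one fixed order only; a leave-one-out variant would additionally show whether the components interact.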
Circularity Check
No circularity: empirical performance claims on hardware benchmarks
The paper describes an engineering framework for quantum circuit simulation optimizations (cache blocking, merge booster, diagonal detector, gate fusion, circuit restructuring) and reports measured speedups from direct execution-time experiments on eight DGX-H100 nodes with H100 GPUs. No equations, fitted parameters, self-definitional quantities, or derivation chains appear in the provided text. Claims rest on external benchmarks against third-party simulators rather than reducing to self-citations or inputs by construction. The work is self-contained as an empirical optimization study.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: full-state quantum circuit simulation time grows exponentially with qubit count because the state vector doubles in size with every added qubit.
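The exponential-growth assumption can be made concrete with a back-of-the-envelope memory estimate. The sketch below assumes double-precision complex amplitudes (16 bytes each) and 80 GB of memory per H100 GPU, and it ignores communication buffers and workspace; neither the precision nor the usable memory is stated in the text above, and the function names are illustrative.

```python
def state_vector_bytes(num_qubits, bytes_per_amplitude=16):
    """Memory needed for a full state vector of 2**num_qubits amplitudes."""
    return (1 << num_qubits) * bytes_per_amplitude


def max_qubits(total_bytes, bytes_per_amplitude=16):
    """Largest qubit count whose full state vector fits in `total_bytes`."""
    n = 0
    while state_vector_bytes(n + 1, bytes_per_amplitude) <= total_bytes:
        n += 1
    return n


# Assumed cluster: 8 nodes x 8 GPUs x 80 GB (decimal) per GPU.
ASSUMED_CLUSTER_BYTES = 8 * 8 * 80 * 10**9
largest = max_qubits(ASSUMED_CLUSTER_BYTES)
```

Under these assumptions a 38-qubit state vector (about 4.4 TB) fits in the combined GPU memory, while 39 qubits would need roughly twice that, which is why each additional qubit is so costly for full-state simulation.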
invented entities (2)
- merge booster: no independent evidence
- diagonal detector: no independent evidence