arxiv: 2604.12256 · v1 · submitted 2026-04-14 · 🪐 quant-ph · cs.SE

Recognition: unknown

Large-Scale Quantum Circuit Simulation on HPC Cluster via Cache Blocking, Boosting, and Gate Fusion Optimization

Chuan-Chi Wang, Yan-Jie Wang, Chia-Heng Tu, Shih-Hao Hung


Pith reviewed 2026-05-10 16:10 UTC · model grok-4.3

classification 🪐 quant-ph cs.SE

keywords: quantum circuit simulation, gate fusion, cache blocking, HPC optimization, merge booster, diagonal detector, circuit restructuring, full-state simulation


The pith

An extensible framework with merge booster and diagonal detector accelerates full-state quantum circuit simulations by optimizing data locality and gate operations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that full-state quantum circuit simulation, which is essential for developing and debugging quantum algorithms before running them on hardware, can be made much faster through targeted optimizations of how data is accessed and how gates are combined during execution. It introduces an extensible framework that restructures circuits automatically and adapts its simulation strategy to the type of quantum operation involved. Central to this are two new components inspired by entanglement and gate-fusion principles: a merge booster that fuses gates to reduce computation, and a diagonal detector that identifies diagonal gates so they can be fused more efficiently. Benchmarks show substantial speedups over prior simulators, making larger qubit counts more practical to simulate, although the cost of full-state simulation still grows exponentially with qubit count.

Core claim

The authors present a framework that integrates circuit restructuring with adaptive execution, plus the merge booster and diagonal detector, to enhance both data locality via cache blocking and computational efficiency via gate fusion. This yields speedups reaching 160 times on circuit-level benchmarks and 34 times on diagonal-heavy gate-level benchmarks relative to existing simulators.
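To make "data locality via cache blocking" concrete, here is a minimal Python sketch of the general idea, not the authors' code and not the implementation in [12]. The state vector is split into chunks of 2^c amplitudes, and a gate whose target is one of the c low-order ("local") qubits can be applied chunk by chunk, so each chunk is finished while it is resident in cache or in one GPU's memory. Gates on higher-order ("global") qubits would first need a qubit swap or data exchange, which the sketch only flags. The little-endian qubit ordering, the chunk size, and the helper names (`apply_1q`, `apply_1q_blocked`, `needs_exchange`) are illustrative assumptions.

```python
import math
from typing import List, Sequence

Gate1Q = Sequence[Sequence[complex]]  # 2x2 matrix acting on one qubit


def apply_1q(state: List[complex], gate: Gate1Q, q: int) -> None:
    """Apply a single-qubit gate to qubit q (bit q of the index) in place."""
    stride = 1 << q
    for i in range(len(state)):
        if i & stride == 0:  # i has bit q = 0; its partner i | stride has bit q = 1
            a0, a1 = state[i], state[i | stride]
            state[i] = gate[0][0] * a0 + gate[0][1] * a1
            state[i | stride] = gate[1][0] * a0 + gate[1][1] * a1


def apply_1q_blocked(state: List[complex], gate: Gate1Q, q: int, chunk_qubits: int) -> None:
    """Apply the same gate one chunk of 2**chunk_qubits amplitudes at a time.

    Only local qubits (q < chunk_qubits) qualify: every amplitude pair then
    lies inside one chunk, so chunks can be processed independently.
    """
    if q >= chunk_qubits:
        raise ValueError(f"qubit {q} is global for {chunk_qubits}-qubit chunks; "
                         "a qubit swap / data exchange is needed first")
    size = 1 << chunk_qubits
    for start in range(0, len(state), size):
        chunk = state[start:start + size]  # start is a multiple of size, so bit q is unchanged
        apply_1q(chunk, gate, q)
        state[start:start + size] = chunk


def needs_exchange(target_qubits: Sequence[int], chunk_qubits: int) -> List[int]:
    """Hypothetical helper: positions of gates whose target qubit is global."""
    return [k for k, q in enumerate(target_qubits) if q >= chunk_qubits]


# Usage: for a local qubit, the blocked update matches the global one.
h = 1 / math.sqrt(2)
H = [[h, h], [h, -h]]
psi = [complex(i + 1) for i in range(16)]  # 4 qubits; unnormalized is fine (the map is linear)
ref, blk = list(psi), list(psi)
apply_1q(ref, H, 1)
apply_1q_blocked(blk, H, 1, chunk_qubits=2)
assert all(abs(a - b) < 1e-12 for a, b in zip(ref, blk))
print(needs_exchange([0, 1, 3, 2, 0], chunk_qubits=2))  # [2, 3]: the gates on qubits 3 and 2
```

The chunk size is the tuning knob: larger chunks make more qubits local but must still fit in the fast memory level being targeted.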

What carries the argument

The merge booster and diagonal detector, which apply entanglement-inspired fusion and detection rules to restructure and simplify quantum circuit execution while preserving correctness.
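The report does not spell out the merge booster's rules, so the sketch below shows only the generic gate-fusion step such a component builds on, under stated assumptions. Consecutive gates are merged greedily while the union of their qubits stays within a hypothetical `max_qubits` limit, and each group is multiplied into one unitary, so the state vector is swept once per group instead of once per gate. This follows the spirit of earlier fusion work (see [15]), not the authors' entanglement-inspired algorithm; all names are illustrative.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[complex]]
Gate = Tuple[Tuple[int, ...], Matrix]  # (qubits, 2^k x 2^k matrix); qubits[j] <-> bit j of the matrix index


def apply_gate(state: List[complex], qubits: Sequence[int], matrix: Matrix) -> None:
    """Apply a k-qubit gate to a state vector in place (qubit q = bit q of the index)."""
    mask = sum(1 << q for q in qubits)
    dim = 1 << len(qubits)
    offsets = [sum(((s >> j) & 1) << q for j, q in enumerate(qubits)) for s in range(dim)]
    for base in range(len(state)):
        if base & mask:
            continue  # visit each group of 2^k coupled amplitudes exactly once
        amps = [state[base + off] for off in offsets]
        for r in range(dim):
            state[base + offsets[r]] = sum(matrix[r][c] * amps[c] for c in range(dim))


def merge(group: List[Gate]) -> Gate:
    """Multiply a run of gates into one unitary over the union of their qubits."""
    targets = sorted({q for qubits, _ in group for q in qubits})
    local: Dict[int, int] = {q: j for j, q in enumerate(targets)}
    dim = 1 << len(targets)
    columns = []
    for b in range(dim):  # column b is the image of basis state |b>
        vec = [0j] * dim
        vec[b] = 1 + 0j
        for qubits, matrix in group:
            apply_gate(vec, [local[q] for q in qubits], matrix)
        columns.append(vec)
    return tuple(targets), [[columns[c][r] for c in range(dim)] for r in range(dim)]


def fuse(gates: List[Gate], max_qubits: int) -> List[Gate]:
    """Greedily fuse consecutive gates while their combined support fits in max_qubits."""
    fused: List[Gate] = []
    group: List[Gate] = []
    support: set = set()
    for gate in gates:
        widened = support | set(gate[0])
        if group and len(widened) > max_qubits:
            fused.append(merge(group))
            group, widened = [], set(gate[0])
        group.append(gate)
        support = widened
    if group:
        fused.append(merge(group))
    return fused


# Usage: four gates on three qubits become two fused two-qubit gates.
h = 2 ** -0.5
H = [[h, h], [h, -h]]
S = [[1, 0], [0, 1j]]
CX = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]  # qubits (control, target)
circuit = [((0,), H), ((0, 1), CX), ((2,), S), ((1, 2), CX)]
fused = fuse(circuit, max_qubits=2)
print([q for q, _ in fused])  # [(0, 1), (1, 2)]
```

Raising `max_qubits` cuts the number of sweeps but grows each fused matrix as 4^k entries, which is the trade-off any fusion heuristic has to balance.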

If this is right

Larger qubit counts become feasible to simulate classically within practical time limits, supporting more extensive algorithm prototyping.
Diagonal-dominant circuits in particular benefit from reduced operation counts without loss of accuracy.
The extensible design allows the same optimizations to be applied across different hardware backends for portable gains.
Redundant computations decrease overall, lowering the energy and resource demands of simulation runs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same locality and fusion principles might transfer to tensor-network or other approximate simulation methods for even larger systems.
Hardware architects could use the detected patterns to prioritize features that classical emulators handle efficiently.
Integration with hybrid workflows could let developers alternate quickly between simulation and real-device runs.

Load-bearing premise

The new merge booster, diagonal detector, and circuit restructuring components deliver consistent speed gains across many different circuit types without adding hidden costs that vary by hardware.

What would settle it

A direct comparison on a broad set of random or highly entangled circuits where the new framework shows no net speedup or higher memory use than a standard simulator would disprove the performance claims.

Figures

Figures reproduced from arXiv: 2604.12256 by Chia-Heng Tu, Chuan-Chi Wang, Shih-Hao Hung, Yan-Jie Wang.

**Figure 1.** A typical workflow adopted by modern quantum circuit simulators. The input is a structured file in a specific format (e.g., Quil [33] or OpenQASM [9]) that represents a raw quantum circuit. A quantum circuit optimizer, similar to a quantum compiler, then performs various circuit optimizations, such as combining sequential quantum gates to reduce circ…

**Figure 2.** Memory allocation and quantum gate arrangements.

**Figure 3.** Memory allocation and quantum gate arrangements.

**Figure 4.** Different ways to enable diagonal gate fusion: (a) a naive approach; (b) the proposed diagonal detector optimization.

**Figure 5.** Elapsed time of the circuit-level benchmarks for 31 to 38 qubits.

Original abstract

Quantum circuit simulation is crucial for the development of quantum algorithms, particularly given the high cost and noise limitations of physical quantum hardware. While full-state quantum circuit simulation is commonly employed for prototyping and debugging, it poses challenges because of the exponential increase in simulation time for large quantum systems. In this work, we propose an extensible framework designed to enhance simulation performance by optimizing both data locality and computational efficiency, thereby addressing these challenges. This framework is seamlessly integrated with an optimizer that restructures quantum circuits and a simulator that adjusts execution strategies for various quantum operations. For the newly developed components, merge booster and diagonal detector, the underlying algorithms are inspired by the principles of quantum entanglement and gate fusion, as well as by the limitations identified in existing third-party simulation libraries. The experiments were conducted on eight DGX-H100 workstations, each equipped with eight NVIDIA H100 GPUs, employing both gate-level and circuit-level benchmarks. The results indicate a speedup of up to 160 times for circuit-level benchmarks and an acceleration of up to 34 times for diagonal-heavy gate-level benchmarks compared to existing simulators. The proposed methodologies are anticipated to deliver more robust and faster quantum circuit simulations, thereby fostering the advancement of novel quantum algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper delivers concrete speedups on H100 clusters via layered optimizations but the gains look hardware-specific and lack the breakdowns needed to judge how much is truly new or portable.


This paper reports measured speedups for quantum circuit simulation on DGX-H100 clusters, obtained by combining cache blocking, gate fusion, and two new optimizers called the merge booster and the diagonal detector. The framework restructures circuits and adapts its execution strategy to different operations. Experiments on eight nodes with H100 GPUs report peak speedups of 160x for circuit-level benchmarks and 34x for diagonal-heavy gate-level benchmarks over existing simulators. Concrete performance data at this scale is useful for anyone running large simulations to develop quantum algorithms. The new components draw on entanglement and gate-fusion ideas to address shortcomings of third-party libraries, and covering both gate-level and circuit-level benchmarks helps show the approach across scales. The main limitations are the lack of ablation studies showing each component's contribution, the reporting of only peak ("up to") numbers without averages or error bars, and unspecified baselines. All results come from this NVIDIA hardware, so the gains may be less consistent on other platforms or circuit types, and the robustness claim needs broader testing to support it. The audience is people building or using quantum simulators on GPU-based HPC systems; practitioners looking for optimization ideas will find value in the described methods and results. The paper has enough empirical content to go to serious peer review, where referees can probe the comparisons and portability. I would send it for review but flag the need for more detailed baselines, component breakdowns, and tests beyond the H100 setup.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes an extensible framework for large-scale quantum circuit simulation on HPC clusters, combining cache blocking, a merge booster, a diagonal detector, gate fusion, and a circuit restructuring optimizer. The new components are motivated by entanglement and fusion principles. Experiments on eight DGX-H100 nodes (each with eight H100 GPUs) using gate-level and circuit-level benchmarks report peak speedups of up to 160x for circuit-level cases and 34x for diagonal-heavy gate-level cases relative to existing simulators.

Significance. If the reported gains hold under broader testing, the work could meaningfully extend the scale of simulatable quantum circuits, aiding algorithm prototyping. The multi-node GPU cluster evaluation demonstrates practical scalability on current HPC hardware, and the combination of locality and fusion optimizations offers a coherent engineering approach. However, the absence of component ablations and baseline specifications limits the ability to attribute gains specifically to the novel elements.

major comments (3)

[Experiments] Experiments section: No ablation studies isolate the performance impact of the merge booster, diagonal detector, cache blocking, and gate fusion; without these, it is impossible to verify that the newly introduced components are responsible for the claimed speedups rather than baseline optimizations or hardware effects.
[Experiments] Experiments section: Speedup results are given exclusively as peak 'up to' values (160x circuit-level, 34x diagonal-heavy gate-level) with no average-case metrics, standard deviations, error bars, or details on the number of runs, circuit sizes, or gate distributions used in the benchmarks.
[Experiments] Experiments section: Comparisons are made only to unspecified 'existing simulators' without naming the libraries, versions, or optimization flags employed, and no portability results are shown on non-NVIDIA hardware or non-diagonal circuits, undermining the assertion of robust gains.
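The second comment asks for more than peak numbers. As a sketch of what that reporting could look like, the snippet below reduces per-benchmark timings to the maximum, minimum, median, and geometric-mean speedup. The function name, benchmark names, and timings are hypothetical placeholders, not data from the paper.

```python
from statistics import geometric_mean, median
from typing import Dict


def summarize_speedups(baseline_s: Dict[str, float], optimized_s: Dict[str, float]) -> Dict[str, float]:
    """Reduce per-benchmark speedups (baseline time / optimized time) to summary statistics."""
    speedups = [baseline_s[name] / optimized_s[name] for name in baseline_s]
    return {
        "max": max(speedups),                 # the "up to" figure
        "min": min(speedups),
        "median": median(speedups),
        "geomean": geometric_mean(speedups),  # conventional average for ratios
    }


# Hypothetical elapsed times in seconds, for illustration only.
baseline = {"qft_34": 100.0, "qaoa_34": 10.0, "grover_34": 40.0}
optimized = {"qft_34": 1.0, "qaoa_34": 10.0, "grover_34": 4.0}
print(summarize_speedups(baseline, optimized))
```

The geometric mean is the usual choice for averaging ratios: here it is about 10x, while the arithmetic mean of the same three speedups would be 37x, pulled up by the single 100x outlier.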

minor comments (2)

[Abstract] Abstract: Lacks any mention of error bars, benchmark circuit specifications, or the precise baseline simulators, which would help readers assess the scope of the performance claims.
The manuscript would benefit from pseudocode or high-level algorithmic descriptions for the merge booster and diagonal detector to allow independent verification of the entanglement- and fusion-inspired logic.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions planned for the manuscript.


Referee: Experiments section: No ablation studies isolate the performance impact of the merge booster, diagonal detector, cache blocking, and gate fusion; without these, it is impossible to verify that the newly introduced components are responsible for the claimed speedups rather than baseline optimizations or hardware effects.

Authors: We agree that the current manuscript does not include explicit ablation studies isolating the contribution of each component. In the revised version, we will add ablation experiments that start from a baseline implementation and incrementally enable cache blocking, gate fusion, the merge booster, and the diagonal detector, reporting the resulting performance deltas to attribute gains to the novel elements. revision: yes
Referee: Experiments section: Speedup results are given exclusively as peak 'up to' values (160x circuit-level, 34x diagonal-heavy gate-level) with no average-case metrics, standard deviations, error bars, or details on the number of runs, circuit sizes, or gate distributions used in the benchmarks.

Authors: The reported figures are the maximum observed speedups. We will revise the Experiments section to report average speedups across the full benchmark suite, along with details on the number of circuits, their qubit counts, gate distributions, and the number of runs performed. Because each benchmark was executed once given the high cost of HPC cluster time, we will provide the observed range of results rather than standard deviations or error bars. revision: partial
Referee: Experiments section: Comparisons are made only to unspecified 'existing simulators' without naming the libraries, versions, or optimization flags employed, and no portability results are shown on non-NVIDIA hardware or non-diagonal circuits, undermining the assertion of robust gains.

Authors: We will update the manuscript to name the specific baseline simulators, their versions, and optimization flags used in all comparisons. We will also add a discussion of the framework's design for NVIDIA GPU clusters and note that experiments were limited to the DGX-H100 nodes; while the optimizations are not inherently NVIDIA-specific, we did not evaluate non-NVIDIA hardware or non-diagonal circuits and will state this scope limitation explicitly. revision: partial
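The promised ablation (start from a baseline, then enable cache blocking, gate fusion, the merge booster, and the diagonal detector in turn) can be summarized by turning cumulative timings into per-step factors. Because the step factors multiply to the total speedup, each component's share of the log-speedup sums to 1. The function name and timings below are hypothetical.

```python
import math
from typing import List, Tuple


def attribute_speedup(stages: List[Tuple[str, float]]) -> Tuple[float, List[Tuple[str, float, float]]]:
    """Split a cumulative ablation into per-component speedup factors.

    stages: (label, elapsed seconds), baseline first, with one more
    optimization enabled at each later stage.
    """
    total = stages[0][1] / stages[-1][1]
    rows = []
    for (_, before), (label, after) in zip(stages, stages[1:]):
        step = before / after
        share = math.log(step) / math.log(total) if total != 1 else 0.0
        rows.append((label, step, share))
    return total, rows


# Hypothetical timings in seconds; the enabling order is itself a choice.
stages = [
    ("baseline", 120.0),
    ("+ cache blocking", 30.0),
    ("+ gate fusion", 15.0),
    ("+ merge booster", 7.5),
    ("+ diagonal detector", 7.5),
]
total, rows = attribute_speedup(stages)
for label, step, share in rows:
    print(f"{label:22s} x{step:5.2f}  {share:6.1%} of log-speedup")
```

Components can interact (fusing gates, for instance, changes the gate stream that cache blocking operates on), so shares computed this way depend on the enabling order; reporting more than one ordering would make the attribution more convincing.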

Circularity Check

0 steps flagged

No circularity: empirical performance claims on hardware benchmarks


The paper describes an engineering framework for quantum circuit simulation optimizations (cache blocking, merge booster, diagonal detector, gate fusion, circuit restructuring) and reports measured speedups from direct execution-time experiments on eight DGX-H100 nodes with H100 GPUs. No equations, fitted parameters, self-definitional quantities, or derivation chains appear in the provided text. Claims rest on external benchmarks against third-party simulators rather than reducing to self-citations or inputs by construction. The work is self-contained as an empirical optimization study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The performance claims rest on standard assumptions about quantum state-vector simulation costs and the effectiveness of HPC memory optimizations; the two new components are introduced without independent theoretical justification beyond inspiration from entanglement and gate fusion.

axioms (1)

domain assumption: Full-state quantum circuit simulation time grows exponentially with qubit count because the state vector holds 2^n amplitudes for n qubits.
Stated in the abstract as the core challenge being addressed.
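To put the exponential growth in numbers, the sketch below computes the memory needed for a full state vector of n qubits and the smallest power-of-two number of GPUs that could hold it. It assumes complex amplitudes of 16 bytes (double precision) or 8 bytes (single precision), 80 GB of usable memory per H100, and a state split into 2^g equal slices, and it ignores workspace and communication buffers; none of these figures comes from the paper.

```python
BYTES_PER_AMPLITUDE = {"complex64": 8, "complex128": 16}


def state_vector_bytes(n_qubits: int, dtype: str = "complex128") -> int:
    """Memory for a full state vector of 2**n_qubits amplitudes."""
    return (1 << n_qubits) * BYTES_PER_AMPLITUDE[dtype]


def min_gpus(n_qubits: int, gpu_bytes: int = 80 * 10**9, dtype: str = "complex128") -> int:
    """Smallest power-of-two GPU count whose combined memory holds the state.

    Assumes the state is split into 2**g equal slices (one per GPU) and
    ignores workspace/communication buffers.
    """
    need = state_vector_bytes(n_qubits, dtype)
    gpus = 1
    while gpus * gpu_bytes < need:
        gpus *= 2
    return gpus


for n in (31, 34, 38, 39):
    tib = state_vector_bytes(n) / 2**40
    print(f"{n} qubits: {tib:8.3f} TiB, >= {min_gpus(n)} GPUs")
```

Under these assumptions a 38-qubit double-precision state (4 TiB) needs all 64 GPUs of an eight-node DGX-H100 setup, and every added qubit doubles the requirement, which is consistent with the 31 to 38 qubit range shown in Figure 5.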

invented entities (2)

merge booster (no independent evidence)
purpose: Restructures circuits to exploit entanglement-like patterns for better fusion and locality
New component added to the optimizer; no independent evidence provided beyond empirical speedup.
diagonal detector (no independent evidence)
purpose: Identifies diagonal-heavy gates for specialized fusion to accelerate simulation
New component added to the simulator; no independent evidence provided beyond empirical speedup.
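Since no algorithm is given for the diagonal detector, the sketch below illustrates only the general principle such a component relies on: a diagonal gate multiplies each amplitude by a phase, so a run of consecutive diagonal gates can be collapsed into one phase table over the union of their qubits and applied in a single elementwise pass. The grouping rule, the `max_qubits` cap, and all function names are assumptions, not the authors' method. A more aggressive detector could also exploit the fact that diagonal gates commute with one another; this sketch only merges adjacent runs.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[complex]]
Gate = Tuple[Tuple[int, ...], Matrix]  # qubits[j] <-> bit j of the matrix index


def is_diagonal(matrix: Matrix, tol: float = 1e-12) -> bool:
    n = len(matrix)
    return all(abs(matrix[r][c]) <= tol for r in range(n) for c in range(n) if r != c)


def local_index(i: int, qubits: Sequence[int]) -> int:
    """Gather the bits of i at positions `qubits` into a small index."""
    return sum(((i >> q) & 1) << j for j, q in enumerate(qubits))


def fuse_diagonal_run(run: List[Gate]) -> Tuple[Tuple[int, ...], List[complex]]:
    """Fuse diagonal gates into one phase table over the union of their qubits."""
    targets = sorted({q for qubits, _ in run for q in qubits})
    pos: Dict[int, int] = {q: j for j, q in enumerate(targets)}
    phases = [1 + 0j] * (1 << len(targets))
    for qubits, matrix in run:
        local = [pos[q] for q in qubits]
        for s in range(len(phases)):
            g = local_index(s, local)  # this gate's basis index inside fused basis state s
            phases[s] *= matrix[g][g]
    return tuple(targets), phases


def detect_and_fuse(gates: List[Gate], max_qubits: int = 10) -> List[tuple]:
    """Group consecutive diagonal gates into ("diag", ...) ops; keep dense gates as-is."""
    ops: List[tuple] = []
    run: List[Gate] = []
    support: set = set()

    def flush() -> None:
        if run:
            ops.append(("diag",) + fuse_diagonal_run(run))
            run.clear()
            support.clear()

    for qubits, matrix in gates:
        if not is_diagonal(matrix):
            flush()
            ops.append(("dense", qubits, matrix))
            continue
        if len(support | set(qubits)) > max_qubits:
            flush()
        run.append((qubits, matrix))
        support.update(qubits)
    flush()
    return ops


def apply_diagonal(state: List[complex], targets: Sequence[int], phases: List[complex]) -> None:
    """One elementwise pass over the state, however many gates were fused."""
    for i in range(len(state)):
        state[i] *= phases[local_index(i, targets)]


# Usage: the dense H splits the diagonal gates into two fused runs.
h = 2 ** -0.5
H = [[h, h], [h, -h]]
T = [[1, 0], [0, complex(h, h)]]  # diag(1, e^{i*pi/4})
S = [[1, 0], [0, 1j]]
CZ = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]
circuit = [((0, 1), CZ), ((0,), T), ((1,), H), ((1, 2), CZ), ((2,), S)]
ops = detect_and_fuse(circuit)
print([(op[0], op[1]) for op in ops])  # [('diag', (0, 1)), ('dense', (1,)), ('diag', (1, 2))]
```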

pith-pipeline@v0.9.0 · 5524 in / 1280 out tokens · 36754 ms · 2026-05-10T16:10:44.367252+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 31 canonical work pages · 2 internal anchors

[1] 2024. cuBLAS: Basic Linear Algebra on NVIDIA GPUs. https://developer.nvidia.com/cublas

[2] 2024. Qibojit Benchmarks: Benchmarking quantum simulation. https://github.com/qiboteam/qibojit-benchmarks

[3] 2026. NVIDIA DGX H100. https://www.nvidia.com/zh-tw/data-center/dgx-h100/

[4] Ethan Bernstein and Umesh Vazirani. 1997. Quantum Complexity Theory. SIAM J. Comput. 26, 5 (1997), 1411–1473. doi:10.1137/S0097539796300921

[5] Shin-Wei Chiu, Chuo-Min Yang, Shan-Jung Hou, Po-Hsuan Huang, Chuan-Chi Wang, Chia-Heng Tu, and Shih-Hao Hung. 2025. FOR-QAOA: Fully Optimized Resource-Efficient QAOA Circuit Simulation for Solving the Max-Cut Problems. In Practice and Experience in Advanced Research Computing 2025: The Power of Collaboration (PEARC ’25). Association for Computing Machinery... doi:10.1145/3708035.3736006

[6] Jerry Chow, Oliver Dial, and Jay Gambetta. 2021. IBM Quantum breaks the 100-qubit processor barrier. https://research.ibm.com/blog/127-qubit-quantum-processor-eagle

[7] D. Coppersmith. 2002. An approximate Fourier transform useful in quantum factoring. arXiv:quant-ph/0201067 [quant-ph]

[8] Andrew W. Cross, Lev S. Bishop, Sarah Sheldon, Paul D. Nation, and Jay M. Gambetta. 2019. Validating quantum computers using randomized model circuits. Physical Review A 100, 3 (Sept. 2019). doi:10.1103/physreva.100.032328

[9] Andrew W. Cross, Lev S. Bishop, John A. Smolin, and Jay M. Gambetta. 2017. Open Quantum Assembly Language. doi:10.48550/ARXIV.1707.03429

[10] The cuQuantum development team. 2023. cuQuantum. doi:10.5281/zenodo.7806810

[11] Cirq development team. 2022. Cirq is a Python library for writing, manipulating, and optimizing quantum circuits and running them against quantum computers and simulators. https://github.com/quantumlib/Cirq

[12] Jun Doi and Hiroshi Horii. 2020. Cache Blocking Technique to Large Scale Quantum Computing Simulation on Supercomputers. In 2020 IEEE International Conference on Quantum Computing and Engineering (QCE). IEEE. doi:10.1109/qce49297.2020.00035

[13] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. 2014. A Quantum Approximate Optimization Algorithm. arXiv:1411.4028 [quant-ph]

[14] Vlad Gheorghiu. 2018. Quantum++: A modern C++ quantum computing library. PLOS ONE 13, 12 (Dec. 2018), e0208073

[15] Hiroshi Horii and Jun Doi. 2021. Optimization of Quantum Computing Simulation with Gate Fusion. https://ipsj.ixsq.nii.ac.jp/record/210570/files/IPSJ-QS21002023.pdf

[16] Chia-Hsin Hsu, Chuan-Chi Wang, Nai-Wei Hsu, Chia-Heng Tu, and Shih-Hao Hung. 2023. Towards Scalable Quantum Circuit Simulation via RDMA. In Proceedings of the 2023 International Conference on Research in Adaptive and Convergent Systems (Gdansk, Poland) (RACS ’23). Association for Computing Machinery, New York, NY, USA, Article 3, 8 pages. doi:10.1145/3599957.3606215

[17] Antti-Pekka Hynninen and Dmitry I. Lyakh. 2017. cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs. arXiv:1705.01598 [cs.MS] https://arxiv.org/abs/1705.01598

[18] Thomas Häner and Damian S. Steiger. 2017. 0.5 petabyte simulation of a 45-qubit quantum circuit. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM. doi:10.1145/3126908.3126947

[19] Satoshi Imamura, Masafumi Yamazaki, Takumi Honda, Akihiko Kasagi, Akihiro Tabuchi, Hiroshi Nakao, Naoto Fukumoto, and Kohta Nakashima. 2022. mpiQulacs: A Distributed Quantum Computer Simulator for A64FX-based Cluster Systems. arXiv:2203.16044 [cs.DC]

[20] Ali Javadi-Abhari, Matthew Treinish, Kevin Krsulich, Christopher J. Wood, Jake Lishman, Julien Gacon, Simon Martiel, Paul D. Nation, Lev S. Bishop, Andrew W. Cross, Blake R. Johnson, and Jay M. Gambetta. 2024. Quantum computing with Qiskit. arXiv:2405.08810 [quant-ph] doi:10.48550/arXiv.2405.08810

[21] Chenyang Jiao, Weihua Zhang, and Li Shen. 2023. Communication Optimizations for State-vector Quantum Simulator on CPU+GPU Clusters. In Proceedings of the 52nd International Conference on Parallel Processing (Salt Lake City, UT, USA) (ICPP ’23). Association for Computing Machinery, New York, NY, USA, 203–212. https://doi.org/10.1145/3605573.3605631

[22] Tyson Jones, Anna Brown, Ian Bush, and Simon Benjamin. 2019. QuEST and High Performance Simulation of Quantum Computers. Scientific Reports 9 (July 2019). doi:10.1038/s41598-019-47174-9

[23] Yu-Cheng Lin, Chuan-Chi Wang, Chia-Heng Tu, and Shih-Hao Hung. 2024. Towards Optimizations of Quantum Circuit Simulation for Solving Max-Cut Problems with QAOA. In Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing (SAC ’24). ACM, 1487–1494. doi:10.1145/3605098.3635897

[24] Ji Liu, Peiyi Li, and Huiyang Zhou. 2022. Not All SWAPs Have the Same Cost: A Case for Optimization-Aware Qubit Routing. arXiv:2205.10596 [quant-ph] https://arxiv.org/abs/2205.10596

[25] Igor L. Markov, Aneeqa Fatima, Sergei V. Isakov, and Sergio Boixo. 2018. Quantum Supremacy Is Both Closer and Farther than It Appears. arXiv:1807.10749

[26] Hsu Nai-Wei, Chuan-Chi Wang, Chia-Hsin Hsu, Chia-Heng Tu, and Hung Shih-Hao. 2024. Toward cost-effective quantum circuit simulation with performance tuning techniques. Connection Science 36, 1 (2024), 2349541. doi:10.1080/09540091.2024.2349541

[27] NVIDIA Corporation. 2024. NVIDIA NCCL. https://developer.nvidia.com/nccl

[28] Daeyoung Park, Heehoon Kim, Jinpyo Kim, Taehyun Kim, and Jaejin Lee. 2022. SnuQS: scaling quantum circuit simulation using storage devices. In Proceedings of the 36th ACM International Conference on Supercomputing. 1–13

[29] Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J. Love, Alán Aspuru-Guzik, and Jeremy L. O’Brien. 2014. A variational eigenvalue solver on a photonic quantum processor. Nature Communications 5, 1 (July 2014). doi:10.1038/ncomms5213

[30] Vicente Pina-Canelles, Adrian Auer, and Inés de Vega. 2025. Improving and benchmarking NISQ qubit routers. arXiv:2502.03908 [quant-ph] https://arxiv.org/abs/2502.03908

[31] Qiskit contributors. 2023. Qiskit: An Open-source Framework for Quantum Computing. doi:10.5281/zenodo.2573505

[32] Mikhail Smelyanskiy, Nicolas P. D. Sawaya, and Alán Aspuru-Guzik. 2016. qHiPSTER: The Quantum High Performance Software Testing Environment. arXiv:1601.07195 [quant-ph]

[33] Robert S. Smith, Michael J. Curtis, and William J. Zeng. 2017. A Practical Quantum Instruction Set Architecture. arXiv:1608.03355 [quant-ph] https://arxiv.org/abs/1608.03355

[34] Yasunari Suzuki, Yoshiaki Kawase, Yuya Masumura, Yuria Hiraga, Masahiro Nakadai, Jiabao Chen, Ken M. Nakanishi, Kosuke Mitarai, Ryosuke Imai, Shiro Tamiya, Takahiro Yamamoto, Tennin Yan, Toru Kawakubo, Yuya O. Nakagawa, Yohei Ibe, Youyuan Zhang, Hirotsugu Yamashita, Hikaru Yoshimura, Akihiro Hayashi, and Keisuke Fujii. 2021. Qulacs: a fast and versatile q...

[35] Quantum AI team and collaborators. 2020. qsim. doi:10.5281/zenodo.4023103

[36] Wim van Dam, Sean Hallgren, and Lawrence Ip. 2002. Quantum Algorithms for some Hidden Shift Problems. arXiv:quant-ph/0211140

[37] Chuan-Chi Wang, Yu-Cheng Lin, Yan-Jie Wang, Chia-Heng Tu, and Shih-Hao Hung. 2024. Queen: A quick, scalable, and comprehensive quantum circuit simulation for supercomputing. arXiv:2406.14084 [quant-ph] https://arxiv.org/abs/2406.14084

[38] Mingkuan Xu, Shiyi Cao, Xupeng Miao, Umut A. Acar, and Zhihao Jia. 2024. Atlas: Hierarchical Partitioning for Quantum Circuit Simulation on GPUs (Extended Version). arXiv:2408.09055 [cs.DC] https://arxiv.org/abs/2408.09055

[39] Ge Yan, Wenjie Wu, Yuheng Chen, Kaisen Pan, Xudong Lu, Zixiang Zhou, Yuhan Wang, Ruocheng Wang, and Junchi Yan. 2025. Quantum Circuit Synthesis and Compilation Optimization: Overview and Prospects. arXiv:2407.00736 [quant-ph] https://arxiv.org/abs/2407.00736

[40] Chen Zhang, Zeyu Song, Haojie Wang, Kaiyuan Rong, and Jidong Zhai. 2021. HyQuas: hybrid partitioner based quantum circuit simulation system on GPU. In Proceedings of the 35th ACM International Conference on Supercomputing (Virtual Event, USA) (ICS ’21). Association for Computing Machinery, New York, NY, USA, 443–454. doi:10.1145/3447818.3460357

[41] Chen Zhang, Haojie Wang, Zixuan Ma, Lei Xie, Zeyu Song, and Jidong Zhai. 2022. UniQ: A Unified Programming Model for Efficient Quantum Circuit Simulation. In 2022 SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE Computer Society, 692–707.