Parallelizing Large-Scale Tensor Network Contraction on Multiple GPUs
Pith reviewed 2026-06-28 12:51 UTC · model grok-4.3
The pith
Distributing intermediate tensors across GPUs with explicit communication converts a fixed contraction path into a schedule that beats slicing by orders of magnitude.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a multi-GPU framework that distributes intermediate tensors across devices with explicit communication, converting a fixed contraction path into a communication-efficient schedule via GEMM-oriented mode reordering and communication-aware mode distribution planning. Within a single DGX H100 node (8 GPUs, NVLink), distribution delivers 7--173× extra speedup beyond embarrassingly parallel slicing, capturing nearly all of the available compute reduction (87--101%) because NVLink's high bandwidth keeps communication small relative to compute. Scaling the same four workloads to 1024 H100 GPUs over InfiniBand, the extra speedup beyond slicing ranges from 42× to 67,869×, demonstrating tha
What carries the argument
GEMM-oriented mode reordering combined with communication-aware mode distribution planning, which turns any fixed contraction path into a schedule that moves only the data required for each distributed matrix multiplication.
If this is right
- On NVLink-connected GPUs the method captures 87-101% of the theoretical compute reduction available from avoiding slicing redundancy.
- The same four workloads that are limited by slicing on 1024 GPUs become feasible when distribution is used instead.
- Communication overhead remains small enough that the approach continues to scale when moving from a single node to a full InfiniBand cluster.
- The framework applies directly to the contraction workloads that appear in quantum circuit simulation and combinatorial optimization.
Where Pith is reading between the lines
- The same mode-reordering and distribution logic could be applied to tensor networks that arise in classical machine-learning models whose contraction graphs are currently handled by slicing.
- If the planning step can be made dynamic rather than static, the method might adapt to tensor networks whose optimal contraction paths change during execution.
- The reported scaling behavior implies that further increases in GPU count will continue to favor communication-aware distribution over slicing as long as interconnect bandwidth grows with compute.
- Because the schedule preserves the original contraction path, existing path-finding heuristics can be reused without modification.
Load-bearing premise
Any fixed contraction path can be turned into a communication-efficient schedule by reordering modes for matrix multiplication and planning their distribution while keeping communication volume small relative to the compute that is saved.
What would settle it
Run one of the four reported workloads on 1024 GPUs and measure whether total communication time plus any extra compute exceeds the reduction in redundant floating-point operations compared with slicing; if the net time is longer, the claimed speedups do not hold.
Figures
read the original abstract
Exact tensor network contraction underpins quantum circuit simulation, quantum error correction, combinatorial optimization, and many-body dynamics. The dominant parallelization strategy, slicing, scales exponentially and incurs redundant computation. We present a multi-GPU framework that instead distributes intermediate tensors across devices with explicit communication, converting a fixed contraction path into a communication-efficient schedule via GEMM-oriented mode reordering and communication-aware mode distribution planning. Within a single DGX H100 node (8 GPUs, NVLink), distribution delivers $7$--$173\times$ extra speedup beyond embarrassingly parallel slicing, capturing nearly all of the available compute reduction (87--101%) because NVLink's high bandwidth keeps communication small relative to compute. Scaling the same four workloads to 1024 H100 GPUs over InfiniBand, the extra speedup beyond slicing ranges from $42\times$ to $67{,}869\times$, demonstrating that communication-aware distributed contraction far surpasses slicing-based scaling limits for frontier tensor networks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a multi-GPU framework for exact tensor network contraction that distributes intermediate tensors with explicit communication rather than relying on slicing. A fixed contraction path is converted into a communication-efficient schedule via GEMM-oriented mode reordering and communication-aware mode distribution planning. On four workloads within a single DGX H100 node (8 GPUs, NVLink), the approach yields 7--173× extra speedup beyond embarrassingly parallel slicing while capturing 87--101% of the available compute reduction; scaling the same workloads to 1024 H100 GPUs over InfiniBand produces extra speedups ranging from 42× to 67,869×.
Significance. If the empirical results hold under detailed verification, the work would be significant for distributed tensor computations in quantum simulation and related domains. It demonstrates that communication-aware distribution can largely eliminate the redundant compute of slicing while keeping communication overhead low relative to compute savings on both NVLink and InfiniBand, offering a practical path to larger-scale contractions than slicing-based methods allow.
major comments (1)
- [Abstract] Abstract: the quantitative claims (7--173× on-node, 87--101% capture, 42×--67,869× at 1024 GPUs) are stated without workload definitions, implementation details, error analysis, or verification steps. These elements are load-bearing for assessing support for the central empirical claim that the GEMM-oriented reordering plus distribution planning keeps communication cost small relative to avoided redundant compute.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address the single major comment below and agree that enhancing the abstract will improve clarity without altering the manuscript's core contributions.
read point-by-point responses
-
Referee: [Abstract] Abstract: the quantitative claims (7--173× on-node, 87--101% capture, 42×--67,869× at 1024 GPUs) are stated without workload definitions, implementation details, error analysis, or verification steps. These elements are load-bearing for assessing support for the central empirical claim that the GEMM-oriented reordering plus distribution planning keeps communication cost small relative to avoided redundant compute.
Authors: We acknowledge the abstract's conciseness omits these details, which are present in the body. Workloads are defined in Section 4.1 (four quantum circuit simulation networks with explicit tensor dimensions and paths). Implementation (GEMM reordering and distribution planning) is in Section 3. Error analysis (multi-run timings) and verification (87-101% capture vs. theoretical reduction) appear in Section 5. To address the concern directly, we will revise the abstract to add brief workload descriptors (e.g., 'four 20-40 qubit quantum simulation workloads') and parenthetical references to the relevant sections. This change supports the empirical claims without misrepresenting results. revision: yes
Circularity Check
No significant circularity
full rationale
The paper's central claims consist of measured speedups (7-173× on-node, 42-67869× at 1024 GPUs) obtained by running four specific workloads on DGX H100 hardware under the proposed GEMM-oriented reordering and communication-aware distribution. These are direct empirical timings, not quantities derived from a fitted model, self-referential definition, or self-citation chain. The method description (mode reordering plus distribution planning) is presented as an algorithmic technique whose correctness and efficiency are verified by the reported measurements rather than presupposed by them. No load-bearing step reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
Pushing the Classical Frontier of 1D Fermi-Hubbard Quench Dynamics Beyond Current Quantum Simulations
Symmetric TDVP on GPUs achieves converged 1D Fermi-Hubbard quench dynamics at chi~62000 up to t=7, certifying the high-entanglement regime and lowering the reported quantum advantage to ~36x.
Reference graph
Works this paper leans on
-
[1]
The density-matrix renormalization group in the age of matrix product states,
U. Schollw ¨ock, “The density-matrix renormalization group in the age of matrix product states,”Annals of Physics, vol. 326, no. 1, pp. 96–192,
-
[2]
Available: https://doi.org/10.1016/j.aop.2010.09.012
[Online]. Available: https://doi.org/10.1016/j.aop.2010.09.012
-
[3]
R. Or ´us, “A practical introduction to tensor networks: Matrix product states and projected entangled pair states,”Annals of Physics, vol. 349, pp. 117–158, 2014. [Online]. Available: https: //doi.org/10.1016/j.aop.2014.06.013
-
[4]
Hyper-optimized tensor network contraction,
J. Gray and S. Kourtis, “Hyper-optimized tensor network contraction,” Quantum, vol. 5, p. 410, 2021. [Online]. Available: https://doi.org/10. 22331/q-2021-03-15-410
2021
-
[5]
Supervised learning with tensor networks,
E. M. Stoudenmire and D. J. Schwab, “Supervised learning with tensor networks,” inAdvances in Neural Information Processing Systems 29, 2016, pp. 4799–4807
2016
-
[6]
Tensorizing neural networks,
A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, “Tensorizing neural networks,” inAdvances in Neural Information Processing Systems 28, 2015, pp. 442–450
2015
-
[7]
Tensor networks and quantum error correction,
A. J. Ferris and D. Poulin, “Tensor networks and quantum error correction,”Physical review letters, vol. 113, no. 3, p. 030501, 2014. [Online]. Available: https://doi.org/10.1103/PhysRevLett.113.030501
-
[8]
J.-G. Liu, L. Wang, and P. Zhang, “Tropical tensor network for ground states of spin glasses,”Physical Review Letters, vol. 126, no. 9, p. 090506, 2021. [Online]. Available: https://doi.org/10.1103/PhysRevLett. 126.090506
-
[9]
The state of quantum computing applications in challenging many-body quantum dynamics,
B. Fauseweh, “The state of quantum computing applications in challenging many-body quantum dynamics,”Nature Communications, vol. 15, p. 2123, 2024. [Online]. Available: https://doi.org/10.1038/ s41467-024-46402-9
2024
-
[10]
Quantum supremacy us- ing a programmable superconducting proces- sor
F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends, R. Biswas, S. Boixo, F. G. Brandao, D. A. Buellet al., “Quantum supremacy using a programmable superconducting processor,”Nature, vol. 574, no. 7779, pp. 505–510, 2019. [Online]. Available: https://doi.org/10.1038/s41586-019-1666-5
-
[11]
Suppressing quantum errors by scaling a surface code logical qubit
Google Quantum AI, “Suppressing quantum errors by scaling a surface code logical qubit,”Nature, vol. 614, no. 7949, pp. 676–681, 2023. [Online]. Available: https://doi.org/10.1038/s41586-022-05434-1
-
[12]
Independent set enumeration in king’s graphs by tensor network contractions,
K. Liang, “Independent set enumeration in king’s graphs by tensor network contractions,”arXiv preprint arXiv:2505.12776, 2025. [Online]. Available: https://arxiv.org/abs/2505.12776
arXiv 2025
-
[13]
Science354(6317), 1240–1241 (2016) https://doi.org/10.1126/science
A. D. King, A. Nocera, M. Rams, J. Dziarmaga, R. Wiersema et al., “Beyond-classical computation in quantum simulation,”Science, pp. 199–204, 2025. [Online]. Available: https://doi.org/10.1126/science. ado6285
-
[14]
Simulating quantum computation by contract- ing tensor networks,
I. L. Markov and Y . Shi, “Simulating quantum computation by contract- ing tensor networks,”SIAM Journal on Computing, vol. 38, no. 3, pp. 963–981, 2008
2008
-
[15]
NVIDIA H100 Tensor Core GPU Architecture,
NVIDIA Corporation, “NVIDIA H100 Tensor Core GPU Architecture,” https://resources.nvidia.com/en-us-tensor-core, 2022
2022
-
[16]
cuTENSORMp: Multi-process tensor contraction library,
——, “cuTENSORMp: Multi-process tensor contraction library,” https: //docs.nvidia.com/cuda/cutensor/latest/user guide cutensorMp.html, 2024
2024
-
[17]
NCCL: NVIDIA collective communications library,
——, “NCCL: NVIDIA collective communications library,” https:// developer.nvidia.com/nccl, 2024
2024
-
[18]
Efficient parallelization of tensor network contraction for simulating quantum computation,
C. Huang, F. Zhang, M. Newman, X. Ni, D. Ding, J. Cai, X. Gao, T. Wang, F. Wu, G. Zhang, H.-S. Ku, Z. Tian, J. Wu, H. Xu, H. Yu, B. Yuan, M. Szegedy, Y . Shi, H.-H. Zhao, C. Deng, and J. Chen, “Efficient parallelization of tensor network contraction for simulating quantum computation,”Nature Computational Science, vol. 1, pp. 578–587, 2021. [Online]. Avai...
-
[19]
Closing the “quantum supremacy
Y . Liu, X. Liu, F. Li, Y . Yang, J. Song, P. Zhao, Z. Wang, D. Peng, H. Fu, D. Chen, W. Wu, H. Huang, and C. Guo, “Closing the “quantum supremacy” gap: Achieving real-time simulation of a random quantum circuit using a new Sunway supercomputer,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis...
-
[20]
Simulation of quantum circuits using the big-batch tensor network method,
F. Pan and P. Zhang, “Simulation of quantum circuits using the big-batch tensor network method,”Physical Review Letters, vol. 128, no. 3, p. 030501, 2022. [Online]. Available: https: //doi.org/10.1103/PhysRevLett.128.030501
-
[21]
Solving the sampling problem of the Sycamore quantum circuits,
F. Pan, K. Chen, and P. Zhang, “Solving the sampling problem of the Sycamore quantum circuits,”Physical Review Letters, vol. 129, no. 9, p. 090502, 2022. [Online]. Available: https://doi.org/10.1103/ PhysRevLett.129.090502
2022
-
[22]
Leapfrogging Sycamore: Harnessing 1432 GPUs for 7x faster quantum random circuit sampling,
X.-H. Zhao, H.-S. Zhong, F. Panet al., “Leapfrogging Sycamore: Harnessing 1432 GPUs for 7x faster quantum random circuit sampling,” National Science Review, vol. 12, no. 3, p. nwae317, 2025. [Online]. Available: https://doi.org/10.1093/nsr/nwae317
-
[23]
R. Fu, Z. Su, H.-S. Zhong, X.-H. Zhao, J. Zhang, F. Panet al., “Surpassing Sycamore: Achieving energetic superiority through system- level circuit simulation,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2024. [Online]. Available: https://doi.org/10.1109/SC41406.2024. 00085
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/sc41406.2024 2024
-
[24]
SW-TNC: Reaching the most complex random quantum circuit via tensor network contraction,
Y . Chen, Z. Sun, C. Qiu, Z. Li, Y . Liu, L. Gan, X. Duan, and G. Yang, “SW-TNC: Reaching the most complex random quantum circuit via tensor network contraction,”arXiv preprint arXiv:2504.09186, 2025. [Online]. Available: https://arxiv.org/abs/2504.09186
arXiv 2025
-
[25]
M. Xu, S. Cao, X. Miao, U. A. Acar, and Z. Jia, “Atlas: Hierarchical partitioning for quantum circuit simulation on GPUs,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC24), 2024. [Online]. Available: https://doi.org/10.1109/SC41406.2024.00072
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/sc41406.2024.00072 2024
-
[26]
Multi-GPU quantum circuit simulation and the impact of network performance,
W. M. Brown, A. Ramesh, T. Lubinski, T. Nguyen, and D. E. B. Neira, “Multi-GPU quantum circuit simulation and the impact of network performance,”arXiv preprint arXiv:2511.14664, 2025. [Online]. Available: https://arxiv.org/abs/2511.14664
arXiv 2025
-
[27]
Simulation of quantum computers: Review and acceleration opportunities,
A. Cicero, M. A. Maleki, M. W. Azhar, A. F. Kockum, and P. Trancoso, “Simulation of quantum computers: Review and acceleration opportunities,”ACM Transactions on Quantum Computing, vol. 7, no. 1, p. 3, 2025. [Online]. Available: https://doi.org/10.1145/ 3701725
2025
-
[28]
Strong quantum computational advantage using a superconducting quantum processor,
Y . Wu, W.-S. Bao, S. Cao, F. Chen, Y . Chen, X. Chen, T.-H. Chung, H. Deng, Y . Du, D. Fanet al., “Strong quantum computational advantage using a superconducting quantum processor,”Physical Review Letters, vol. 127, no. 18, p. 180501, 2021. [Online]. Available: https://doi.org/10.1103/PhysRevLett.127.180501
-
[29]
The computational boundaries of quantum advantage,
A. Zlokapa, F. Fuchs, L. Schaeffer, A. M. Dalzell, E. Lau, E. T. Hollandet al., “The computational boundaries of quantum advantage,” npj Quantum Information, vol. 9, p. 36, 2023. [Online]. Available: https://doi.org/10.1038/s41534-023-00744-7
-
[30]
F. Pan, P. Zhou, S. Li, and P. Zhang, “Contracting arbitrary tensor networks: General approximate algorithm and applications in graphical models and quantum circuit simulations,”Physical Review Letters, vol. 125, no. 6, p. 060503, 2020. [Online]. Available: https://doi.org/10.1103/PhysRevLett.125.060503
-
[31]
Efficient quantum circuit simulation by tensor network methods on modern GPUs,
F. Pan, H. Gu, L. Kuang, B. Liu, and P. Zhang, “Efficient quantum circuit simulation by tensor network methods on modern GPUs,”ACM Transactions on Quantum Computing, vol. 5, no. 4, 2024. [Online]. Available: https://doi.org/10.1145/3696465
-
[32]
Efficient algorithms for maximum likelihood decoding of quantum error-correcting codes,
S. Bravyi, M. Suchara, and A. Vargo, “Efficient algorithms for maximum likelihood decoding of quantum error-correcting codes,” Physical Review A, vol. 90, no. 3, p. 032326, 2014. [Online]. Available: https://doi.org/10.1103/PhysRevA.90.032326
-
[33]
Surface codes: Towards practical large-scale quantum computation,
A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N. Cleland, “Surface codes: Towards practical large-scale quantum computation,” Physical Review A, vol. 86, no. 3, p. 032324, 2012. [Online]. Available: https://doi.org/10.1103/PhysRevA.86.032324
-
[34]
Tensor-network decoding beyond 2d,
C. Piveteau, C. T. Chubb, and J. M. Renes, “Tensor-network decoding beyond 2d,”PRX Quantum, vol. 5, no. 4, p. 040303, 2024. [Online]. Available: https://doi.org/10.1103/PRXQuantum.5.040303
-
[35]
Learning high-accuracy error decoding for quantum processors,
J. Bausch, M. S. Kesselring, A. Elben, V . Swaroop, B. Yao, A. Molleet al., “Learning high-accuracy error decoding for quantum processors,”Nature, vol. 635, pp. 834–840, 2024. [Online]. Available: https://doi.org/10.1038/s41586-024-08148-8
-
[36]
Google Quantum AIet al., “Quantum error correction below the surface code threshold,”Nature, vol. 638, pp. 920–926, 2025. [Online]. Available: https://doi.org/10.1038/s41586-024-08449-y
-
[37]
M. Suzuki, “Generalized trotter’s formula and systematic approximants of exponential operators and inner derivations with applications to many-body problems,”Communications in Mathematical Physics, vol. 51, no. 2, pp. 183–190, 1976. [Online]. Available: https: //doi.org/10.1007/BF01609348
-
[38]
doi:10.1103/PhysRevLett.93.040502 , url =
G. Vidal, “Efficient simulation of one-dimensional quantum many-body systems,”Physical Review Letters, vol. 93, no. 4, p. 040502, 2004. [Online]. Available: https://doi.org/10.1103/PhysRevLett.93.040502
-
[39]
doi:10.1038/s41586-023-06096-3 , url =
Y . Kim, A. Eddins, S. Anand, K. X. Wei, E. van den Berg, S. Rosenblatt, H. Nayfeh, Y . Wu, M. Zaletel, K. Temme, and A. Kandala, “Evidence for the utility of quantum computing before fault tolerance,”Nature, vol. 618, pp. 500–505, 2023. [Online]. Available: https://doi.org/10.1038/s41586-023-06096-3
-
[40]
Uncovering local integrability in quantum many- body dynamics,
O. Shtanko, D. S. Wang, H. Zhang, N. Harle, A. Seif, R. Movassagh, and Z. Minev, “Uncovering local integrability in quantum many- body dynamics,”Nature Communications, 2025. [Online]. Available: https://doi.org/10.1038/s41467-025-57623-x
-
[41]
Quantum critical dynamics in a 5,000-qubit pro- grammable spin glass,
A. D. Kinget al., “Quantum critical dynamics in a 5,000-qubit pro- grammable spin glass,”Nature, vol. 617, pp. 61–66, 2023
2023
-
[42]
Confinement in a Z2 lattice gauge theory on a quantum computer,
J. Mildenberger, Z. Jiang, W. Mruczkiewicz, J. C. Halimeh, and P. Hauke, “Confinement in a Z2 lattice gauge theory on a quantum computer,”Nature Physics, 2025. [Online]. Available: https://doi.org/10.1038/s41567-024-02723-6
-
[43]
cuTENSOR: A high-performance CUDA library for tensor primitives,
NVIDIA Corporation, “cuTENSOR: A high-performance CUDA library for tensor primitives,” https://developer.nvidia.com/cutensor, 2024
2024
-
[44]
High-performance tensor contraction without transposition,
D. A. Matthews, “High-performance tensor contraction without transposition,”SIAM Journal on Scientific Computing, vol. 40, no. 1, pp. C1–C24, 2018. [Online]. Available: https://doi.org/10.1137/ 16M108968X
2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.