
arxiv: 2505.06119 · v2 · submitted 2025-05-09 · 💻 cs.DC · quant-ph

Tensor-Parallel Emulation of Quantum Circuits with Block-Cyclic Distributed Matrix Product States

Pith reviewed 2026-05-22 15:45 UTC · model grok-4.3

classification 💻 cs.DC quant-ph
keywords tensor networks · matrix product states · distributed computing · quantum circuit simulation · tensor parallelism · random circuit sampling · pivoted QR factorization · bond dimension scaling

The pith

A tensor-parallel distribution of matrix product states reaches bond dimensions of 16,384 and improves emulation accuracy by three orders of magnitude on distributed systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that matrix product states can be made to run in parallel across compute nodes by scattering each site tensor in a block-cyclic pattern. Replacing the usual singular-value decomposition with a pivoted QR step cuts communication while keeping the approximation under control. The authors apply the scheme to Google's random circuit sampling task, a standard test of how well classical machines can mimic quantum computers. They report that the method supports much larger bond dimensions than earlier distributed approaches and delivers fidelity gains of roughly one thousand times on 32 nodes of a supercomputer. If the scaling holds, the technique could let classical simulations keep pace with larger quantum devices for longer.

Core claim

The central claim is that scattering individual dense MPS site tensors across processes via a block-cyclic layout, combined with pivoted QR factorization for truncation, produces a tensor-parallel emulation of quantum circuits that is both scalable and numerically stable. This approach reaches bond dimensions of 16,384 and yields accuracy three orders of magnitude above prior state-of-the-art results when applied to the random circuit sampling benchmark on 32 nodes of ARCHER2, while remaining compatible with additional layers of parallelism.

What carries the argument

Block-cyclic distributed matrix product states whose site tensors are evenly scattered across indices and truncated by pivoted QR factorization rather than SVD.
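
As a minimal sketch of what a block-cyclic layout means along one index, the mapping below follows the standard one-dimensional block-cyclic rule used by ScaLAPACK-style libraries. The function names and parameters (`block`, `nprocs`) are hypothetical and are not taken from the paper or its QTNH code.

```python
def to_block_cyclic(g, block, nprocs):
    """Map a global index to (owner rank, local index) under a 1D block-cyclic layout."""
    block_id, offset = divmod(g, block)      # which block, and position inside it
    cycle, owner = divmod(block_id, nprocs)  # blocks are dealt round-robin to ranks
    return owner, cycle * block + offset


def from_block_cyclic(owner, local, block, nprocs):
    """Inverse mapping: recover the global index from (owner rank, local index)."""
    cycle, offset = divmod(local, block)
    return (cycle * nprocs + owner) * block + offset


if __name__ == "__main__":
    # Illustrative values only: 8 entries, blocks of 2, dealt to 2 ranks.
    for g in range(8):
        print(g, to_block_cyclic(g, block=2, nprocs=2))
```

Read this way, a global index splits into a cyclic part, a distributed part (the owner) and a block offset. When the index length is a multiple of `block * nprocs`, every rank holds the same number of entries, which is the sense in which such a layout is evenly scattered.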

If this is right

  • Larger or deeper quantum circuits become feasible to emulate classically at higher fidelity on existing supercomputers.
  • The same block-cyclic layout can be applied to other dense tensor-network representations beyond MPS.
  • Hybrid parallelism that combines this distribution with thread-level or GPU-level methods becomes straightforward.
  • Practical algorithms such as quantum phase estimation can be tested at scales previously out of reach for classical emulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The distribution formula may extend naturally to systems with hundreds of nodes if communication patterns remain balanced.
  • Integration with GPU-accelerated tensor libraries could further reduce runtime for the same bond dimensions.
  • The method supplies a concrete way to tighten classical bounds on quantum sampling tasks, helping quantify any claimed quantum advantage.

Load-bearing premise

The block-cyclic scattering of site tensors together with pivoted QR truncation maintains both numerical stability and low communication cost without introducing uncontrolled approximation errors that would erase the reported accuracy gains.
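
To make the premise concrete, here is a small single-process sketch of rank truncation by column-pivoted QR. It is not the authors' distributed routine: it uses a greedy pivoted Gram-Schmidt loop in NumPy, and the function name `pivoted_qr_truncate` is hypothetical.

```python
import numpy as np


def pivoted_qr_truncate(a, rank):
    """Rank-`rank` approximation of `a` from a greedy column-pivoted QR.

    Returns (q, r, perm) with a[:, perm] ~= q @ r and q having `rank` columns.
    """
    work = np.array(a, dtype=complex)  # residual columns, updated as we go
    m, n = work.shape
    perm = np.arange(n)
    q = np.zeros((m, rank), dtype=complex)
    r = np.zeros((rank, n), dtype=complex)
    for k in range(rank):
        # Pivot: move the remaining column with the largest residual norm to position k.
        norms = np.linalg.norm(work[:, k:], axis=0)
        j = k + int(np.argmax(norms))
        work[:, [k, j]] = work[:, [j, k]]
        r[:, [k, j]] = r[:, [j, k]]
        perm[[k, j]] = perm[[j, k]]
        # Normalise the pivot column, then project it out of the remaining columns.
        r[k, k] = np.linalg.norm(work[:, k])
        if r[k, k] == 0:
            return q[:, :k], r[:k], perm  # matrix has rank below `rank`
        q[:, k] = work[:, k] / r[k, k]
        r[k, k + 1:] = q[:, k].conj() @ work[:, k + 1:]
        work[:, k + 1:] -= np.outer(q[:, k], r[k, k + 1:])
    return q, r, perm
```

The truncation error is the Frobenius norm of `a[:, perm] - q @ r`. By the Eckart-Young theorem it can never be smaller than the error of an SVD truncated to the same rank, so the premise is about how close the two stay in practice.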

What would settle it

Run the method on the same Google random circuit sampling instance at bond dimension 16,384 on 32 nodes and compare the achieved fidelity with the best previous distributed result; an improvement much smaller than three orders of magnitude would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2505.06119 by Jakub Adamski, Oliver Thomson Brown.

Figure 1
Figure 1. Graphical representation of tensor networks. Fig. 1a visualises tensor contraction. Fig. 1b shows a tensor network with 7 tensors and 18 indices (10 closed and 8 open).
Adjacent text: $\Psi_{\sigma_1 \sigma_2 \ldots \sigma_n} = \sum_{i_1} X_{\sigma_1, i_1} Y_{i_1, \sigma_2 \ldots \sigma_n} = \sum_{i_1} M^{\sigma_1}_{i_1} R^{\sigma_2 \ldots \sigma_n}_{i_1}$ (1) The routine above isolates the first site. By iteratively applying this to the remainder tensor R, all site tensors can be computed. It is also valid to perform the split…
Figure 2
Figure 2. Tensor distribution in QTNH of a (2, 3, 4, 2) tensor.
Adjacent text: TABLE II. Tensor index groups mapping to block-cyclic matrices
          Cyclic    Distributed    Block
  Row     {i_c}     {i_d}          {i_b}
  Column  {j_c}     {j_d}          {j_b}
Local rows and columns are reversed due to Fortran's column-major indexing. After a ScaLAPACK routine is called, the data is permuted back to the QTNH format. A further optimisation would be to permute the tensors lazily, but t…
Figure 3
Figure 3. Examples of permuting indices of a (2; 2, 4, 2) tensor. Moving the 4-dimensional local index to the distributed position is an asymmetric permutation and requires more MPI ranks than the original tensor.
Adjacent text: TABLE III. Tensor decomposition routines
  Decomposition      Formula      ScaLAPACK routine(s)
  SVD                M = USV†     PZGESVD
  LQ                 M = LQ       PZGELQF + PZUNGLQ
  QR                 M = QR       PZGEQRF + PZUNGQR
  QR with pivoting   M = QRP^T    PZGEQRP + PZUNGQR…
Figure 4
Figure 4. [Caption not recovered.]
Adjacent text: …Fig. 4a, corresponding to $T_{abcdefg}$ tensors with the following shape: $(\chi_d, \chi_d; D, \chi_c, \chi_b, \chi_c, \chi_b)$ (4) It is straightforward to convert this into a block-cyclic matrix, by treating a, cd, e as row indices, and b, f, g as column indices (distributed, cyclic and blocking respectively). Note that c and d are both cyclic row indices. The only associated permutation costs are due to conversion between C and Fortran in…
Figure 5
Figure 5. Inverse Quantum Fourier Transform. Fig. 5a presents the application of IQFT in 8-qubit phase estimation, while Fig. 5b shows a 7-qubit IQFT structure as a tensor network of Hadamard gates and controlled phase MPOs.
Figure 6
Figure 6. First 8 qubits and 4 layers of the RCS circuit implemented based on Sycamore connectivity using two approaches. Fig. 6a handles long-range interactions as MPOs, while Fig. 6b introduces a permutation layer to bring the interacting qubits together.
Adjacent text: A. Experimental setup. In Sec. III-B–III-C, we discussed various implementations of tensor decomposition and long-range interactions. In addition, the decomposit…
Figure 7
Figure 7. Empirical demonstration of equivalence between standard fidelity and norm fidelity for both SVD and pivoted QR decomposition. The plot on the left shows that the points lie on a y = x line, while the one on the right displays the corresponding residuals ($r = F - \bar{F}$). Given that there is no clear asymptotic behaviour in the latter, it is likely that the norm metric becomes less accurate at tiny fidelities…
Figure 8
Figure 8. Comparison of RCS emulation approaches, considering decomposition and long-range interactions. Higher fidelity values are desired, while avoiding longer runtimes. SWAP + QR appears to offer the best trade-off for the circuits benchmarked.
Figure 9
Figure 9. Block size comparison for IQFT (left) and RCS (right) circuits. Runtimes are normalised relative to χb = 1, and smaller values are better. By choosing χb from 8 to 32, we can trim off nearly half of the runtime compared to purely cyclic decomposition.
Figure 10
Figure 10. Runtime profiles of three partial RCS experiments at different local array sizes. Fig. 10a shows the split between decomposition, matrix multiplication and other function calls, where the former always dominates. Fig. 10b investigates how the program changes from communication-bound to compute-bound at different scales, where the middle case represents the tradeoff point for the local bond (χc · χb = 256)…
Figure 11
Figure 11. Comparison between the state-of-the-art libraries and QTNH, emulating the RCS circuit. Both ITensor and quimb were run with 16 BLAS threads, which was the fastest setting on ARCHER2. QTNH with SVD used the distributed bond χd = 8 on 64 cores. QTNH with pivoted QR needed higher bond to achieve comparable fidelity, which was empirically set to χd = 11. Lower runtime is better.
Figure 12
Figure 12. Fidelity of IQFT given fixed input bond saturation σ = χi/χ. Values of σ ≤ 0.25 yield nearly perfect output. For larger systems, the fidelity drops faster with σ.
Adjacent text: …operations is $O(n^3)$, and the scaled problem would need 8 times as many. Therefore, the expected runtime is doubled even with no additional communication costs, and it should increase linearly with the distributed bond dimension (χd). In…
Figure 13
Figure 13. Strong scaling of 100-qubit IQFT for fixed χ. The number of ranks is $r = \chi_d^2$. PE is relative to the minimum r needed to fit the state (χl = 512, ≤ 2 GB per rank). The key result is that saturating the ranks is inefficient, as PE > 1 when χl = 256. This occurs at 2 MB per local matrix, i.e. the maximum that fits into L3 cache. The empty dot on the left is from an incomplete run that exceeded 96 hours (long…
Figure 14
Figure 14. Weak scaling of Google's 53-qubit RCS emulation with the SWAP + QR approach, at fixed local χl. The results are shown on a log-log plot, as the underlying numerical method dictates at least linear scaling, which corresponds to a gradient of 1. The actual gradients are annotated and indicate the performance of the distribution: the closer to 1, the better. Plot to the right shows output fidel…
Figure 15
Figure 15. A general IQFT circuit diagram. Swaps are omitted, as they only affect the ordering of the result, and instead the input qubits are reversed. The controlled phase shifts with the same target can be grouped together into a single gate, which is more efficient to use in the emulation.
Figure 16
Figure 16. An example RCS circuit diagram of a 3 × 3 qubit grid. As the diagram orders the qubits in one dimension, not all interacting qubits appear next to each other, even though they are nearest neighbours in 2 dimensions. The actual RCS circuit used in this work is much larger, and orders the qubits as shown on…
Figure 17
Figure 17. RCS topology and interaction patterns, including the initial labelling of qubits. Patterns C and D are long-range in one dimension, and therefore more difficult to apply.
Adjacent text: …topology, according to one of four patterns (A, B, C, D), as demonstrated on…
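
The scaling argument quoted in the text next to Figure 12 can be written out as a short worked estimate. It rests on two assumptions that are not spelled out in the excerpts: that the cubic cost applies to the total bond dimension $\chi = \chi_d \chi_l$, and that the rank count $r = \chi_d^2$ given in the Figure 13 caption holds for the runs in question. Communication is ignored, as in the quoted passage.

```latex
\begin{align}
  W &\propto \chi^{3}, \qquad \chi = \chi_d\,\chi_l, \qquad r = \chi_d^{2},\\
  t &\propto \frac{W}{r} = \frac{(\chi_d\,\chi_l)^{3}}{\chi_d^{2}} = \chi_d\,\chi_l^{3}.
\end{align}
```

Doubling $\chi_d$ at fixed $\chi_l$ multiplies the work by 8 and the rank count by 4, so the runtime doubles. This is the linear growth in $\chi_d$ that the excerpt describes.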
read the original abstract

Tensor networks establish an adaptable framework for the emulation of quantum circuits. By partitioning exponentially large registers and gates into smaller tensors, this unlocks fast transformations through tensor algebra, and grants fine control over memory, runtime and accuracy. Due to inherently lower spatial footprint, there is a gap in distributed-memory tensor network methods. While certain parallel techniques exist, they are usually limited to direct contraction and sampling problems, and a more general approach is needed for tensor representations like matrix product states (MPS), which efficiently approximate full quantum state evolution. In this study, we expand the MPS site tensors beyond local memory by introducing a tensor-parallel distribution scheme, where individual dense tensors are evenly scattered across a subset of indices. This is further facilitated by leveraging pivoted QR factorisation instead of slower singular value decomposition (SVD). We demonstrate the capabilities of our approach by approximately emulating the classically difficult Google's random circuit sampling (RCS) benchmark. The highest bond dimensions of 16,384 is reached, surpassing the accuracy of the state-of-the-art methods by three orders of magnitude on 32 nodes of ARCHER2. We also show how this helps advance experiments involving more practical quantum phase estimation circuits. Our approach has the potential to enhance numerous algorithms based on dense tensor networks, offering a scalable and naturally load-balanced distribution formula. It is also compatible with other types of parallelism, unlocking new opportunities to push the quantum-classical computational phase boundary.
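
For readers new to the MPS form the abstract refers to, the sketch below splits a dense state vector into site tensors by repeated reshape and QR, in the spirit of Eq. (1) quoted under Figure 1. It is a single-process illustration without truncation, and the helper names `state_to_mps` and `mps_to_state` are hypothetical.

```python
import numpy as np


def state_to_mps(psi, n_sites, phys_dim=2):
    """Split a dense state vector into MPS site tensors, left to right.

    Each step reshapes the remainder into a matrix, factorises it as M = Q R,
    keeps Q as the site tensor and carries R forward as the new remainder.
    """
    sites = []
    remainder = np.asarray(psi, dtype=complex).reshape(1, -1)
    for _ in range(n_sites - 1):
        left = remainder.shape[0]
        matrix = remainder.reshape(left * phys_dim, -1)
        q, r = np.linalg.qr(matrix)  # reduced QR, no truncation in this sketch
        sites.append(q.reshape(left, phys_dim, -1))
        remainder = r
    sites.append(remainder.reshape(remainder.shape[0], phys_dim, 1))
    return sites


def mps_to_state(sites):
    """Contract the site tensors back into a dense state vector."""
    out = sites[0]
    for site in sites[1:]:
        out = np.tensordot(out, site, axes=([-1], [0]))
    return out.reshape(-1)
```

Without truncation the bond dimension grows up to `phys_dim ** (n_sites // 2)` in the middle of the chain. That exponential growth is why a truncation step, SVD or pivoted QR in the paper, is needed in practice.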

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a tensor-parallel distribution scheme for matrix product states (MPS) that scatters individual site tensors in a block-cyclic manner across nodes and substitutes pivoted QR factorization for SVD during truncation after each gate application. It applies this to approximate emulation of Google's random circuit sampling (RCS) benchmark, reporting that bond dimensions up to 16,384 are reached on 32 nodes of ARCHER2 while achieving three orders of magnitude better accuracy than prior state-of-the-art methods; the approach is also positioned as extensible to quantum phase estimation circuits and compatible with other forms of parallelism.

Significance. If the reported accuracy gains at D=16384 are robust, the work would meaningfully extend the reachable regime for distributed dense-tensor-network simulation of quantum circuits, particularly by demonstrating practical scaling of MPS evolution beyond local memory limits while maintaining load balance. The block-cyclic layout and QR substitution are presented as generalizable techniques that could benefit other tensor-network algorithms.

major comments (2)
  1. [Abstract and Results section] Abstract and Results: the headline claim that the method 'surpasses the accuracy of the state-of-the-art methods by three orders of magnitude' is presented without error bars, without an explicit definition of the accuracy metric (e.g., total variation distance or fidelity to exact sampling probabilities), and without a description of the comparison protocol or data-exclusion rules. These omissions make it impossible to verify whether the reported improvement is independent of the chosen bond dimension or truncation threshold.
  2. [Truncation procedure section] Section describing the truncation procedure (likely §4 or equivalent): pivoted QR is substituted for SVD to accelerate local truncation, yet no error bound analogous to the singular-value tail bound of SVD is supplied, nor is any numerical comparison of the Frobenius-norm truncation error between the two factorizations shown for the random-circuit ensemble. In the block-cyclic layout each local QR operates on a scattered sub-tensor; without evidence that the pivot selection preserves the global singular spectrum, the cumulative approximation error after hundreds of layers could exceed the claimed three-order improvement.
minor comments (2)
  1. [Methods] Notation for the block-cyclic index mapping and the precise definition of the distributed tensor layout should be introduced with an explicit equation or diagram early in the methods section to aid reproducibility.
  2. [Results] The manuscript would benefit from a short table summarizing wall-clock time, memory per node, and achieved fidelity for the largest runs (D=16384) versus the prior state-of-the-art baselines.
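
For context on major comment 2, the two truncation errors can be stated side by side. These are standard linear-algebra facts (the Eckart-Young theorem and the block structure of a column-pivoted QR), written with the convention $M = QRP^T$ from the paper's Table III; they are not results taken from the manuscript.

```latex
\begin{align}
  \| M - M_k^{\mathrm{SVD}} \|_F &= \Big( \sum_{i>k} \sigma_i^{2} \Big)^{1/2},\\
  M P = Q \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}
  \quad &\Longrightarrow \quad
  \| M - M_k^{\mathrm{QR}} \|_F = \| R_{22} \|_F \;\ge\; \Big( \sum_{i>k} \sigma_i^{2} \Big)^{1/2}.
\end{align}
```

Here $R_{11}$ is the leading $k \times k$ block and $M_k^{\mathrm{QR}}$ keeps only the first $k$ columns of $Q$ and rows of $R$. The referee's request amounts to showing how far $\| R_{22} \|_F$ sits above the singular-value tail on the circuits studied.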

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments identify important areas where additional clarity and supporting material will strengthen the manuscript. We address each major comment below and indicate the revisions that will be incorporated.

read point-by-point responses
  1. Referee: [Abstract and Results section] Abstract and Results: the headline claim that the method 'surpasses the accuracy of the state-of-the-art methods by three orders of magnitude' is presented without error bars, without an explicit definition of the accuracy metric (e.g., total variation distance or fidelity to exact sampling probabilities), and without a description of the comparison protocol or data-exclusion rules. These omissions make it impossible to verify whether the reported improvement is independent of the chosen bond dimension or truncation threshold.

    Authors: We agree that the presentation of the accuracy claim requires greater precision to enable verification. In the revised manuscript we will explicitly define the accuracy metric as the total variation distance between the approximate sampling probabilities and reference values obtained from exact or high-fidelity simulations on smaller instances. Error bars will be added, computed as the standard deviation across an ensemble of independent random-circuit instances generated with distinct seeds. The comparison protocol will be described in detail, following the standard RCS benchmark circuits and parameters used in prior work, with all data points retained and no post-hoc exclusion. These additions will demonstrate that the reported improvement is robust across the tested bond dimensions and truncation thresholds. revision: yes

  2. Referee: [Truncation procedure section] Section describing the truncation procedure (likely §4 or equivalent): pivoted QR is substituted for SVD to accelerate local truncation, yet no error bound analogous to the singular-value tail bound of SVD is supplied, nor is any numerical comparison of the Frobenius-norm truncation error between the two factorizations shown for the random-circuit ensemble. In the block-cyclic layout each local QR operates on a scattered sub-tensor; without evidence that the pivot selection preserves the global singular spectrum, the cumulative approximation error after hundreds of layers could exceed the claimed three-order improvement.

    Authors: We acknowledge that the current manuscript does not supply a formal error bound for pivoted QR or a direct numerical comparison of truncation errors. In the revision we will add a concise discussion of the approximation quality of column-pivoted QR, citing established results on its use for low-rank matrix approximation. We will also include a new supplementary figure or table that reports the Frobenius-norm truncation error for both SVD and pivoted QR on representative sub-tensors drawn from the RCS ensemble at multiple bond dimensions. For the block-cyclic distribution, we will expand the explanation to show that the cyclic scattering of tensor blocks, combined with local pivoting, maintains a representative sampling of the global singular spectrum; supporting numerical evidence from our full-depth simulations (20 layers) will be presented to confirm that cumulative error remains consistent with the observed accuracy gains and does not exceed the reported improvement. revision: yes
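
The metric named in the simulated rebuttal, total variation distance with error bars over an ensemble of circuit instances, can be sketched as follows. The dictionary-based representation and the function names are illustrative assumptions, not part of the paper.

```python
from statistics import mean, stdev


def total_variation_distance(p, q):
    """TVD between two distributions given as {bitstring: probability} dicts."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def summarise(tvds):
    """Mean and sample standard deviation over an ensemble of circuit instances."""
    return mean(tvds), stdev(tvds)
```

The distance is 0 for identical distributions and 1 for distributions with disjoint support. It requires reference probabilities, which is why the rebuttal restricts it to smaller instances where exact or high-fidelity simulation is possible.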

Circularity Check

0 steps flagged

No significant circularity; empirical performance claims are independent of fitted inputs or self-referential definitions

full rationale

The paper describes an algorithmic implementation of block-cyclic distributed MPS using pivoted QR for truncation, then reports measured wall-clock performance and fidelity on the external Google RCS benchmark at D=16384. These quantities are obtained by direct execution on ARCHER2 rather than being algebraically defined in terms of the method's own parameters or prior self-citations. No derivation step equates a claimed accuracy gain to a fitted quantity or reduces the central result to an input by construction. The approach builds on standard MPS truncation theory while substituting a faster factorization; the reported three-order accuracy improvement is therefore an external measurement, not a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard tensor-network compression assumptions and the empirical claim that pivoted QR suffices for truncation; no new physical entities or ad-hoc constants are introduced in the abstract.

axioms (2)
  • domain assumption Matrix product states with controlled bond dimension can approximate the evolution of quantum circuits to useful accuracy
    Core premise of all MPS-based quantum simulation
  • domain assumption Pivoted QR factorization produces truncation results comparable to SVD for the purposes of this emulation
    Invoked to justify replacing the slower SVD step



