Parallelized contraction of tensor trains or matrix product operators

Hiroshi Shinaoka; Jan von Delft; Marc K. Ritter; Simone Foderà

arxiv: 2606.23274 · v1 · pith:TLD7TH22new · submitted 2026-06-22 · ⚛️ physics.comp-ph

Pith reviewed 2026-06-26 06:12 UTC · model grok-4.3

keywords: tensor trains, matrix product operators, fit algorithm, MPO contraction, parallelization, randomized projections, inverse canonical gauge

The pith

Inverse canonical gauge yields near-ideal parallel speedup for MPO-MPO fit contractions across all sizes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents two accelerations for the fit algorithm used to contract matrix product operators, usable separately or in combination. First, an MPI-based distributed-memory parallelization built on the inverse canonical gauge distributes the work so that the speedup stays close to ideal across all problem sizes, including small ones. Second, randomized projections lower the cost of each local update from 2-site to 1-site scaling while retaining the accuracy of the original 2-site procedure, and also speed up contractions of environment tensors with MPO tensors. These changes lower the dominant costs of MPO-MPO contraction while leaving the overall structure of the fit algorithm intact.

Core claim

We present two strategies for accelerating the fit algorithm, usable in combination: (1) MPI-based distributed-memory parallelization tailored for MPO-MPO contractions, employing the inverse canonical gauge, which yields near-ideal parallelization speedup across all problem sizes; and (2) randomized projections to reduce the cost of local updates from 2-site to 1-site costs while retaining 2-site accuracy, and to speed up contractions of environment tensors with MPO tensors.
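As a rough illustration of what a randomized projection does, the sketch below applies the generic randomized range finder from the numerical linear algebra literature to a plain matrix. It is not the paper's local update: the function name `randomized_low_rank`, the oversampling parameter, and the toy matrix sizes are all assumptions made for this example.

```python
import numpy as np

def randomized_low_rank(M, rank, oversample=10, seed=0):
    """Return (Q, B) with M ~ Q @ B, where Q has orthonormal columns."""
    rng = np.random.default_rng(seed)
    k = min(rank + oversample, min(M.shape))       # number of sketch columns
    omega = rng.standard_normal((M.shape[1], k))   # random test matrix
    Y = M @ omega                                  # sample the column space of M
    Q, _ = np.linalg.qr(Y)                         # orthonormal basis for the sample
    B = Q.conj().T @ M                             # M expressed in that basis
    return Q, B

# Toy input: a 200 x 150 matrix of exact rank 8 (illustrative sizes).
rng = np.random.default_rng(1)
M = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 150))

Q, B = randomized_low_rank(M, rank=8)
rel_err = np.linalg.norm(M - Q @ B) / np.linalg.norm(M)
print(f"sketch columns: {Q.shape[1]}, relative error: {rel_err:.2e}")
```

The point of the design is that the large matrix is only ever multiplied by a thin random matrix, so the expensive factorization acts on an object with `rank + oversample` columns rather than on the full matrix. How this idea is embedded in the 1-site update of the fit algorithm is specified in the paper, not here.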

What carries the argument

The inverse canonical gauge choice together with randomized projections inside the fit algorithm for contracting two MPOs.

If this is right

Near-ideal parallel speedup holds for any problem size when the inverse canonical gauge is used.
Local update cost drops from two-site to one-site scaling while the final contraction accuracy remains the same.
Site-canonical gauge still delivers excellent speedup once the problem is large enough to need multiple sweeps.
Environment-tensor contractions with MPO tensors become cheaper under the same randomized-projection treatment.
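One generic way to make "near-ideal speedup" quantitative is Amdahl's law, which relates the speedup on p processes to the fraction of work that cannot be parallelized. The sketch below is a textbook model, not a measurement from the paper; the function names and the serial fractions are illustrative assumptions.

```python
def amdahl_speedup(p, serial_fraction):
    """Speedup on p processes when a fixed fraction of the work stays serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

def parallel_efficiency(p, serial_fraction):
    """Speedup divided by process count; 1.0 corresponds to ideal scaling."""
    return amdahl_speedup(p, serial_fraction) / p

# Illustrative serial fractions only (assumed, not taken from the paper).
for s in (0.0, 0.01, 0.10):
    print(s, [round(amdahl_speedup(p, s), 2) for p in (2, 4, 8, 16)])
```

In this model, any extra work that must be done serially to keep the distributed tensors consistent shows up as a larger serial fraction and caps the achievable speedup.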

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gauge and projection techniques could be tested on contractions involving more than two MPOs or on other tensor-network algorithms that rely on similar sweeps.
If the one-site randomized version matches two-site accuracy on a wider set of physical operators, existing MPO-based codes could adopt it with only local changes to the update step.
For very large problems the site-canonical gauge may become preferable once communication overhead dominates over the extra consistency work.

Load-bearing premise

Randomized projections preserve the accuracy of the original 2-site fit algorithm when applied to MPO-MPO contractions.

What would settle it

A side-by-side run on a small, exactly solvable MPO-MPO contraction in which the fit with randomized projections needs more sweeps, or ends with a larger error, than the standard 2-site fit.

Figures

Figures reproduced from arXiv: 2606.23274 by Hiroshi Shinaoka, Jan von Delft, Marc K. Ritter, Simone Foderà.

**Figure 1.** Diagrammatic representation of the MPO $A = \prod_{\ell=1}^{L} A_\ell$. An MPO-MPO contraction $C = AB$ is defined as the contraction of the output indices of $B$ with the corresponding input indices of $A$, as schematically depicted in …

**Figure 2.** Diagrammatic representations of (a) the contraction of two MPOs, …

**Figure 3.** Diagrammatic representation of the zip-up contraction.

**Figure 4.** Diagrammatic representation of the conditions for (a) left orthogonality …

**Figure 5.** Diagrammatic representation of an MPO in (a) the Vidal canonical …

**Figure 6.** (a,b) Diagrammatic depictions of the left and right sides of Eq. …

**Figure 7.** Construction of the left environment in the inverse canonical form.

**Figure 8.** (a) Visual representation of the sizes of the matrices involved in the …

**Figure 10.** Computation of environment tensor using two separate randomized …

**Figure 11.** Illustrating steps (2) to (5) of the parallel fit algorithm, for an odd-even …

**Figure 12.** Accuracy (averaged out of 5 runs with di…

**Figure 13.** Speedup (averaged out of 5 runs with different seeds) obtained by the parallel fit algorithm. The top-left (a) and top-right (b) plots show runs for different amounts of χ, with a TT with L = 102 and Ns = 1. The bottom-left (c) and bottom-right (d) plots show runs for different amounts of sweeps Ns, with a TT with L = 102 and χ = 100.

**Figure 14.** Logarithmic plot of the runtime of fit algorithm on a of TTs with …
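The definition in the Figure 1 caption, C = AB with the output indices of B contracted against the input indices of A, can be written out directly for small examples. The sketch below performs the exact, uncompressed site-by-site contraction and checks it against dense matrices. The index convention (left bond, output, input, right bond), the helper names, and the sizes are assumptions for this illustration; this is the naive product whose bond dimensions multiply, not the fit algorithm itself.

```python
import numpy as np

def random_mpo(L, d, D, rng):
    """Random MPO; each tensor has shape (left bond, out, in, right bond)."""
    dims = [1] + [D] * (L - 1) + [1]
    return [rng.standard_normal((dims[i], d, d, dims[i + 1])) for i in range(L)]

def contract_mpo_mpo(A, B):
    """Exact C = A B: contract the input index of A with the output index of B."""
    C = []
    for a, b in zip(A, B):
        c = np.einsum("asub,cutd->acstbd", a, b)   # u is the shared physical index
        Da, Db, d_out, d_in, Da2, Db2 = c.shape
        C.append(c.reshape(Da * Db, d_out, d_in, Da2 * Db2))  # merge bond pairs
    return C

def mpo_to_dense(W):
    """Multiply out an MPO into a dense matrix (feasible only for small L)."""
    T = np.ones((1, 1, 1, 1))
    for w in W:
        T = np.einsum("aijb,bsuc->aisjuc", T, w)
        a, i, s, j, u, c = T.shape
        T = T.reshape(a, i * s, j * u, c)
    return T[0, :, :, 0]

rng = np.random.default_rng(0)
A = random_mpo(L=4, d=2, D=3, rng=rng)
B = random_mpo(L=4, d=2, D=3, rng=rng)
C = contract_mpo_mpo(A, B)

print(C[1].shape)  # bond dimensions of C are products of those of A and B
print(np.allclose(mpo_to_dense(C), mpo_to_dense(A) @ mpo_to_dense(B)))
```

The growth of the bond dimension from D to D squared in this exact product is what makes a compressing scheme such as the fit algorithm necessary in practice.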

Original abstract

Tensor Trains (TT), also known as Matrix Product States (MPS) and Matrix Product Operators (MPO), provide a compact and structured representation for high-dimensional data and operators. One of the most expensive manipulations involving tensor trains is the contraction of two MPOs. A popular and accurate method for mitigating this cost is the fit algorithm. However, it is still comparatively costly since it involves 2-site updates. Moreover, the parallelization of the fit algorithm when used for MPO-MPO contractions has received comparatively little attention. In this work, we present two strategies for accelerating the fit algorithm, usable in combination: (1) We use MPI-based distributed-memory parallelization tailored for MPO-MPO contractions, employing one of two MPO gauge choices: (1a) the inverse canonical gauge, which yields near-ideal parallelization speedup across all problem sizes; and (1b) the site-canonical gauge, which avoids the inversion of singular values but requires extra computations to ensure global consistency, thus yielding excellent parallelization speedup only for large problems requiring several sweeps before convergence. (2) We use randomized projections to reduce the cost of local updates from 2-site to 1-site costs while retaining 2-site accuracy, and to speed up contractions of environment tensors with MPO tensors.
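The abstract does not define the inverse canonical gauge, but it does say that the site-canonical alternative avoids the inversion of singular values. The sketch below shows, on a single matrix, the kind of identity that such an inversion typically serves in real-space parallel schemes: both neighbours keep a copy of the singular values, and one inverse factor on the shared bond removes the double counting. Treat this as an assumed reading for illustration, with hypothetical variable names, rather than the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 5))          # stands in for a two-block object

U, S, Vh = np.linalg.svd(M, full_matrices=False)

left = U * S                # U @ diag(S): block held by the "left" process
right = S[:, None] * Vh     # diag(S) @ Vh: block held by the "right" process

# Both blocks carry S, so the bond between them needs diag(1 / S).
rebuilt = (left / S) @ right             # left @ diag(1 / S) @ right

print(np.allclose(rebuilt, M))
print("condition number of the bond matrix:", S.max() / S.min())
```

Dividing by very small singular values amplifies rounding errors, which is one plausible reason to offer a gauge that avoids the inversion.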

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MPI parallelization with two gauges plus randomized 1-site projections for MPO-MPO fit, but accuracy retention rests on an unvalidated claim.

The paper's core contribution is a pair of concrete acceleration strategies for the fit algorithm on MPO-MPO contractions: MPI-based distributed-memory parallelization using either the inverse canonical or the site-canonical gauge, and randomized projections that drop local updates from 2-site to 1-site cost while asserting that 2-site accuracy is kept. The gauge choices are presented as practical adaptations to the operator case: the inverse canonical gauge gives near-ideal scaling across sizes, while the site-canonical gauge avoids inverting singular values but pays for extra global-consistency computations, so it scales well only on large problems that need several sweeps. The randomized projection step is also positioned as a way to speed up environment contractions.

What stands out is the focus on MPO-MPO rather than the more common MPS-MPS setting, and the explicit tailoring of the parallel scheme to how environments are built and contracted in the operator case. That is a useful engineering detail for people already running these codes on clusters.

The soft spot is exactly the one flagged in the stress-test note. The abstract states that randomized 1-site projections retain 2-site accuracy, yet supplies no error bounds, convergence rates, or numerical checks on MPO-MPO instances. If the projection error behaves differently once the tensors carry operator structure, the claimed speedup could come with uncontrolled loss of fidelity. Without benchmarks or analysis in the manuscript, that central assertion remains untested.

The work is aimed at computational physicists who already use tensor-network codes and need to push MPO contractions to larger scales on distributed machines. Readers who care about implementation tricks for the fit algorithm will find the gauge discussion and parallel layout worth examining.

It deserves a serious referee. The algorithmic ideas are specific enough to be checked, the problem is real, and the absence of validation is fixable rather than fatal. I would send it to review.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes two strategies to accelerate the fit algorithm for contracting matrix product operators (MPOs): MPI-based distributed parallelization using either the inverse canonical gauge (near-ideal speedup across sizes) or site-canonical gauge (good speedup for large problems), and randomized projections that reduce local 2-site updates to 1-site costs while retaining 2-site accuracy and accelerating environment contractions.

Significance. If the accuracy-retention claim for randomized projections holds with supporting validation, the combined approach would offer practical speedups for MPO-MPO contractions in tensor-network applications; the parallelization strategy builds directly on established gauge choices without introducing new parameters.

major comments (1)

[Abstract] The central claim that randomized projections 'retain 2-site accuracy' for MPO-MPO contractions is asserted without error bounds, convergence rates, or numerical validation on operator contractions; this is load-bearing because the 1-site speedup is only useful if the accuracy guarantee transfers from the MPS-MPS case to the MPO setting with its distinct environment structure.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and for highlighting the need for stronger support of the randomized-projection claim in the MPO-MPO setting. We address the single major comment below and will revise the manuscript accordingly.

Referee: [Abstract] The central claim that randomized projections 'retain 2-site accuracy' for MPO-MPO contractions is asserted without error bounds, convergence rates, or numerical validation on operator contractions; this is load-bearing because the 1-site speedup is only useful if the accuracy guarantee transfers from the MPS-MPS case to the MPO setting with its distinct environment structure.

Authors: We agree that the original manuscript asserted retention of 2-site accuracy for randomized projections without providing explicit error bounds, convergence rates, or numerical validation on MPO-MPO contractions. While the underlying randomized-projection technique is mathematically the same as in the MPS-MPS literature and the environment tensors share the same low-rank structure (differing only by the presence of operator indices), we did not demonstrate transfer of the accuracy guarantee in the operator setting. To address this, the revised manuscript will include (i) a short theoretical paragraph clarifying why the projection error analysis carries over to the MPO environment contractions and (ii) a new numerical subsection with direct comparisons of 2-site versus randomized 1-site updates on representative MPO-MPO problems, reporting relative errors, sweep convergence, and wall-time speedups. These additions will be placed in the results section and referenced from the abstract.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper describes algorithmic improvements (MPI parallelization with specific MPO gauges and randomized projections for 1-site updates) to the standard fit algorithm for MPO-MPO contractions. These build on established tensor network techniques without any claimed first-principles derivation, uniqueness theorem, or prediction that reduces by construction to fitted parameters or self-citations defined within the paper. Performance claims rest on implementation details and numerical tests rather than self-referential logic, satisfying the criteria for a non-circular methods contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; all elements are standard in tensor network literature.

pith-pipeline@v0.9.1-grok · 5764 in / 942 out tokens · 19722 ms · 2026-06-26T06:12:17.720269+00:00 · methodology
