Minimizing the Arithmetic and Communication Complexity of Jacobi's Method for Eigenvalues and Singular Values: Part One -- Serial Algorithms

Hengrui Luo; James Demmel; Ryan Schneider; Yifu Wang

arxiv: 2506.03466 · v2 · submitted 2025-06-04 · 🧮 math.NA · cs.CC· cs.NA

Minimizing the Arithmetic and Communication Complexity of Jacobi's Method for Eigenvalues and Singular Values: Part One -- Serial Algorithms

James Demmel , Hengrui Luo , Ryan Schneider , Yifu Wang This is my paper

Pith reviewed 2026-05-19 12:04 UTC · model grok-4.3

classification 🧮 math.NA cs.CCcs.NA

keywords Jacobi methodeigenvalue problemcommunication complexityarithmetic complexityblocked algorithmsmatrix multiplicationsingular value decomposition

0 comments

The pith

A blocked Jacobi method for the symmetric eigenvalue problem attains the communication lower bound for matrix multiplication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops serial algorithms based on Jacobi's method that minimize both the number of arithmetic operations and the amount of data movement for computing eigenvalues and singular values. It proves that a blocked implementation achieves the known lower bound on communication for classical matrix multiplication, which is the dominant cost in large-scale linear algebra. A recursive blocked version extends this to fast matrix multiplication, reaching near-optimal costs in both arithmetic and communication. Readers should care because in modern computing, moving data often costs more than performing operations, so these bounds indicate when Jacobi methods can be efficient. The analysis also covers one-sided Jacobi for singular values.

Core claim

In the classical O(n^3) setting, a blocked Jacobi method attains the communication lower bound for matrix multiplication and is thus expected to be communication optimal among O(n^3) eigensolvers. In the fast matrix multiplication setting, a recursive version of blocked Jacobi reaches essentially optimal complexity in both arithmetic operations and communication.

What carries the argument

Blocked and recursive implementations of Jacobi's method, which organize rotations to leverage efficient matrix multiplications as black boxes.

If this is right

The blocked version is expected to be communication optimal among O(n^3) eigensolvers.
A recursive version reaches essentially optimal complexity with fast matrix multiplication.
Analogous complexity bounds hold for one-sided Jacobi SVD algorithms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This optimality suggests Jacobi methods could be competitive with other eigensolvers like QR algorithm in bandwidth-limited settings.
Similar blocking and recursion techniques might apply to other eigenvalue algorithms to reduce communication.
Practical implementations could test these bounds on large matrices to verify performance gains.

Load-bearing premise

The matrix multiplication routine acts as a black box with exactly known arithmetic and communication costs, and blocking or recursion introduces no significant hidden lower-order terms.

What would settle it

Implementing the blocked Jacobi method and measuring the total volume of data moved between fast and slow memory to determine if it matches or beats the communication lower bound for matrix multiplication.

Figures

Figures reproduced from arXiv: 2506.03466 by Hengrui Luo, James Demmel, Ryan Schneider, Yifu Wang.

**Figure 2.** Figure 2: Comparison of scalar, blocked and recursive Jacobi on a 512 [PITH_FULL_IMAGE:figures/full_fig_p025_2.png] view at source ↗

**Figure 3.** Figure 3: Performance of various implementations of blocked Jacobi, including Algorithm 2 without pivoting [PITH_FULL_IMAGE:figures/full_fig_p026_3.png] view at source ↗

**Figure 4.** Figure 4: Impact of base-case accuracy on the convergence of recursive Jacobi for a fixed 512 [PITH_FULL_IMAGE:figures/full_fig_p033_4.png] view at source ↗

**Figure 5.** Figure 5: Impact of pre-determined maximum recursion depth on the convergence of recursive Jacobi for a [PITH_FULL_IMAGE:figures/full_fig_p033_5.png] view at source ↗

read the original abstract

We analyze several versions of Jacobi's method for the symmetric eigenvalue problem. Our goal is to reduce the asymptotic cost of the algorithm as much as possible, as measured by the number of arithmetic operations performed and associated (serial or parallel) communication, i.e., the amount of data moved between slow and fast memory or between processors in a network. The first half of this effort, which considers the serial setting, is presented here; this paper contains rigorous complexity bounds for a variety of serial Jacobi algorithms, built on both classic $O(n^3)$ matrix multiplication and fast, Strassen-like $O(n^{\omega_0})$ alternatives. In the classical case, we show that a blocked implementation of Jacobi's method attains the communication lower bound for $O(n^3)$ matrix multiplication (and is therefore expected to be communication optimal among $O(n^3)$ eigensolvers). In the fast setting, we demonstrate that a recursive version of blocked Jacobi can go further, reaching essentially optimal complexity in both measures. We also derive analogous complexity bounds for (one-sided) Jacobi SVD algorithms. A forthcoming sequel to this paper will extend our complexity analysis to the parallel case.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Blocked Jacobi reaches the communication lower bound for classical matrix multiplication and nearly optimal complexity with fast multiplication in the serial case.

read the letter

The key thing to take away is that blocked Jacobi attains the communication lower bound for classical matrix multiplication and a recursive version gets close with fast matrix multiplication. This paper analyzes the arithmetic and communication costs of various serial Jacobi algorithms for symmetric eigenvalues and singular values. It shows that a blocked implementation matches the lower bound in the O(n^3) case, which positions it as communication optimal among eigensolvers using standard arithmetic. In the fast matrix multiplication setting, recursion helps reach essentially optimal complexity. Similar bounds are derived for one-sided Jacobi SVD. The analysis relies on established matrix multiplication complexity results, which avoids circularity and allows clean derivations. The paper does a good job extending prior complexity work to this setting with rigorous bounds. The central claims appear consistent with the models used. A minor concern is whether the black-box treatment of matrix multiplications fully accounts for all data movement in the blocked or recursive structure. The multiple sweeps and any re-layout of blocks could introduce additional communication volumes that need to be shown as lower order. The stress-test note highlights this possibility, and it would be good to confirm that the proofs bound these terms tightly enough to preserve the optimality statements. This is the kind of paper that specialists in numerical linear algebra and high-performance computing will want to read for the asymptotic targets it sets. It is not for a general audience but provides useful benchmarks for algorithm designers. Given the novelty of the tight bounds and the grounding in standard theory, it should go to peer review rather than being desk rejected.

Referee Report

1 major / 1 minor

Summary. The manuscript analyzes serial versions of Jacobi's method for the symmetric eigenvalue problem and one-sided SVD. It derives rigorous asymptotic bounds on arithmetic operations and communication volume (in a two-level memory model) for both classical O(n^3) matrix multiplication and fast Strassen-like multiplication with exponent ω₀. The central claims are that a blocked implementation attains the communication lower bound for O(n^3) MM and is therefore expected to be communication-optimal among O(n^3) eigensolvers, while a recursive blocked version reaches essentially optimal complexity when fast matrix multiplication is substituted.

Significance. If the bounds are correct, the work establishes the first rigorous communication-optimal serial analysis for Jacobi methods, which is significant for communication-bound regimes in high-performance computing. The use of standard MM cost models to obtain parameter-free lower-bound attainment for the classical case and near-optimality for the fast case is a clear strength; the forthcoming parallel sequel is noted as a natural extension.

major comments (1)

[blocked and recursive Jacobi sections (complexity derivations)] The optimality claims rest on treating each matrix multiplication (classical or fast) as a black-box routine whose communication cost exactly matches the known lower bound. In the serial two-level memory model, the O(n) sweeps of the blocked or recursive algorithm require additional data movement for block extraction, rotation application, and re-layouting of the working set between sweeps. The manuscript must explicitly bound these auxiliary volumes and show they are absorbed into lower-order terms (or cancel across sweeps) so that the total communication remains asymptotically equal to the pure-MM bound; otherwise the “attains the lower bound” and “essentially optimal” statements are not justified.

minor comments (1)

Ensure consistent notation for the fast-MM exponent (ω₀ vs. ω) and for the memory parameters (M, B) across all complexity statements and tables.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. The point raised about auxiliary communication costs is well taken and we address it directly below.

read point-by-point responses

Referee: [blocked and recursive Jacobi sections (complexity derivations)] The optimality claims rest on treating each matrix multiplication (classical or fast) as a black-box routine whose communication cost exactly matches the known lower bound. In the serial two-level memory model, the O(n) sweeps of the blocked or recursive algorithm require additional data movement for block extraction, rotation application, and re-layouting of the working set between sweeps. The manuscript must explicitly bound these auxiliary volumes and show they are absorbed into lower-order terms (or cancel across sweeps) so that the total communication remains asymptotically equal to the pure-MM bound; otherwise the “attains the lower bound” and “essentially optimal” statements are not justified.

Authors: We agree that an explicit accounting of auxiliary data movement strengthens the communication analysis. In the blocked Jacobi algorithm the dominant communication arises from the matrix multiplications performed inside each sweep; block extraction, rotation application, and re-layouting are performed on sub-blocks whose aggregate volume per sweep is O(n^2 log n) words (or lower when fast memory is reused across consecutive rotations). Over the O(n) sweeps required for convergence this contributes an auxiliary term of O(n^3 log n) words moved. In the two-level memory model this term is asymptotically absorbed into the leading communication lower bound Ω(n^3 / √M) of classical matrix multiplication whenever M = o(n^2), and remains lower order even for larger fast-memory sizes. The recursive blocked variant further reduces re-layout cost by preserving block structure across recursion levels. We will add a short dedicated paragraph (or subsection) in Sections 3 and 4 that states these bounds explicitly and confirms they do not alter the asymptotic attainment of the pure-MM lower bound. The same argument applies, mutatis mutandis, to the one-sided SVD variant. revision: yes

Circularity Check

0 steps flagged

No circularity: complexity claims compose external matrix-multiplication lower bounds with new blocked/recursive Jacobi constructions.

full rationale

The paper's central results state that a blocked Jacobi attains the known communication lower bound for O(n^3) matrix multiplication and that a recursive version reaches essentially optimal cost when fast matrix multiplication is substituted. These statements rest on treating matrix multiplication as an opaque black-box routine whose arithmetic and communication costs are taken from prior, independently established results. No equation or definition inside the paper is shown to be defined in terms of the claimed optimality, no parameter is fitted to a subset of the target data and then renamed a prediction, and no load-bearing uniqueness theorem or ansatz is imported solely via self-citation. The analysis therefore remains self-contained against external benchmarks and does not reduce the claimed optimality to a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on standard arithmetic and communication cost models for matrix multiplication; no free parameters are fitted, no new entities are postulated, and the axioms are the usual ones of asymptotic analysis.

axioms (2)

standard math Matrix multiplication costs O(n^3) arithmetic operations and satisfies known communication lower bounds in the serial two-level memory model.
Invoked when stating that the blocked Jacobi attains the lower bound.
standard math Fast matrix multiplication costs O(n^ω0) with corresponding communication bounds.
Used for the recursive version analysis.

pith-pipeline@v0.9.0 · 5756 in / 1273 out tokens · 28784 ms · 2026-05-19T12:04:33.940129+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages

[1]

Alman, R

J. Alman, R. Duan, V. V. Williams, Y. Xu, Z. Xu, and R. Zhou. More asymmetry yields faster matrix multiplication. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2005–2039, 2025

work page 2025
[2]

M. V. Athi, S. R. Zekavat, and A. A. Struthers. Real-time signal processing of massive sensor arrays via a parallel fast converging SVD algorithm: Latency, throughput, and resource analysis. IEEE Sensors Journal, 16(8):2519–2526, 2016

work page 2016
[3]

Ballard, E

G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz. Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica, 23:1–155, May 2014

work page 2014
[4]

Ballard, J

G. Ballard, J. Demmel, and I. Dumitriu. Communication-optimal parallel and sequential eigenvalue and singular value algorithms. EECS Technical Report EECS-2011-14, UC Berkeley, Feb. 2011

work page 2011
[5]

Ballard, J

G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. Communication-optimal parallel algo- rithm for Strassen’s matrix multiplication. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12, pages 193–204, New York, NY, USA, June 2012. ACM

work page 2012
[6]

Ballard, J

G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Minimizing communication in numerical linear algebra. SIAM J. Mat. Anal. Appl. , 32:866–901, 2011

work page 2011
[7]

Ballard, J

G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Sequential communication bounds for fast linear algebra. Technical Report UCB/EECS-2012-36, UC Berkeley, March 2012

work page 2012
[8]

Ballard, J

G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Communication costs of Strassen’s matrix multi- plication. Comm. of the ACM , 57(02):107–114, Feb 2014

work page 2014
[9]

Ballard, J

G. Ballard, J. Demmel, and N. Knight. Communication avoiding successive band reduction. In Pro- ceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming , PPoPP ’12, pages 35–44. ACM, 2012

work page 2012
[10]

Berry and A

M. Berry and A. Sameh. An overview of parallel algorithms for the singular value and symmetric eigenvalue problems. J. Comp. Appl. Math. , 27:191–213, 1989

work page 1989
[11]

Bl¨ aser

M. Bl¨ aser. Fast matrix multiplication. Theory of Computing , pages 1–60, 2013

work page 2013
[12]

Y. Cho, J. Demmel, M. Derezi´ nski, H. Li, H. Luo, M. Mahoney, and R. Murray. Surrogate-based autotuning for randomized sketching algorithms in regression problems. SIAM Journal on Matrix Analysis and Applications, 46(2):1247–1279, 2025

work page 2025
[13]

J. J. M. Cuppen. A divide and conquer method for the symmetric tridiagonal eigenproblem. Numerische Mathematik, 36:177–195, 1980

work page 1980
[14]

Demmel, I

J. Demmel, I. Dumitriu, and O. Holtz. Fast linear algebra is stable. Numerische Mathematik, 108(1):59– 91, 2007

work page 2007
[15]

Demmel, I

J. Demmel, I. Dumitriu, O. Holtz, and R. Kleinberg. Fast matrix multiplication is stable. Numerische Mathematik, 106, 2006

work page 2006
[16]

Demmel, M

J. Demmel, M. Gu, S. Eisenstat, I. Slapniˇ car, K. Veseli´ c, and Z. Drmaˇ c. Computing the singular value decomposition with high relative accuracy. Lin. Alg. Appl. , 299:21–80, 1999

work page 1999
[17]

Demmel and K

J. Demmel and K. Veseli´ c. Jacobi’s Method is More Accurate Than QR. SIAM J. Mat. Anal. Appl. , 13(4):1204–1246, 1992. 27

work page 1992
[18]

J. W. Demmel. Applied Numerical Linear Algebra . Society for Industrial and Applied Mathematics, 1997

work page 1997
[19]

J. W. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-optimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comput. , 34(1):A206–A239, 2012

work page 2012
[20]

Detherage and R

I. Detherage and R. Shah. A unified perspective on orthogonalization and diagonalization. arXiv:2505.02023, 2025

work page arXiv 2025
[21]

I. S. Dhillon and B. N. Parlett. Orthogonal eigenvectors and relative gaps. SIAM Journal on Matrix Analysis and Applications, 25(3):858–899, 2003

work page 2003
[22]

Dongarra, S

J. Dongarra, S. Hammarling, and D. Sorensen. Block reduction of matrices to condensed forms for eigenvalue computations. J. Comput. Appl. Math. , 27:215–227, 1989

work page 1989
[23]

Drineas, M

P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling algorithms for ℓ2 regression and appli- cations. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , pages 1127–1136, 2006

work page 2006
[24]

Z. Drmac. A global convergence proof for cyclic Jacobi methods with block rotations. SIAM Journal on Matrix Analysis and Applications , 31(3):1329–1350, 2009

work page 2009
[25]

Drmaˇ c and K

K. Drmaˇ c and K. Veseli´ c. New fast and accurate Jacobi SVD algorithm, I.SIAM J. Mat. Anal. Appl. , 29(4):1322–1342, 2008

work page 2008
[26]

Drmaˇ c and K

K. Drmaˇ c and K. Veseli´ c. New fast and accurate Jacobi SVD algorithm, II.SIAM J. Mat. Anal. Appl. , 29(4):1343–1362, 2008

work page 2008
[27]

J. A. Duersch and M. Gu. Randomized projection for rank-revealing matrix factorizations and low-rank approximations. SIAM Review, 62(3):661–682, 2020

work page 2020
[28]

Fawzi, M

A. Fawzi, M. Balog, A. Huang, T. Hubert, B. Romera-Paredes, M. Barekatain, A. Novikov, F. J. R. Ruiz, J. Schrittwieser, G. Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, 2022

work page 2022
[29]

Forsythe and P

G. Forsythe and P. Henrici. The cyclic Jacobi method for computing the principal values of a complex matrix. Trans. Amer. Math. Soc., 94:1–23, 1960

work page 1960
[30]

D. E. Foulser. A blocked Jacobi method for the symmetric eigenproblem . Yale University, Department of Computer Science, 1989

work page 1989
[31]

W. Gao, Y. Ma, and M. Shao. A mixed precision Jacobi SVD algorithm. ACM Trans. Math. Softw. , 51(1), Apr. 2025

work page 2025
[32]

I. Gemp, B. McWilliams, C. Vernade, and T. Graepel. Eigengame: PCA as a Nash equilibrium. arXiv:2010.00554, 2020

work page arXiv 2010
[33]

G. H. Golub and H. A. Van der Vorst. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics , 123(1-2):35–65, 2000

work page 2000
[34]

E. R. Hansen. On Cyclic Jacobi Methods. Journal of the Society for Industrial and Applied Mathematics, 11(2):448–459, 1963

work page 1963
[35]

V. Hari. Convergence to diagonal form of block Jacobi-type methods. Numer. Math. , 129:449–481, 2015. 28

work page 2015
[36]

P. Henrici. On the speed of convergence of cyclic and quasicyclic Jacobi methods for computing eigenval- ues of Hermitian matrices. Journal of the Society for Industrial and Applied Mathematics , 6(2):144–162, 1958

work page 1958
[37]

N. J. Higham, F. Tisseur, M. Webb, and Z. Zhou. Computing accurate eigenvalues using a mixed- precision jacobi algorithm. arXiv preprint arXiv:2501.03742 , 2025

work page arXiv 2025
[38]

J. W. Hong and H. T. Kung. I/O complexity: The red-blue pebble game. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing , STOC ’81, pages 326–333. ACM, 1981

work page 1981
[39]

C. G. J. Jacobi. ¨Uber ein leichtes Verfahren die in der Theorie der S¨ acularst¨ orungen vorkommenden Gleichungen numerisch aufzul¨ osen.Journal f¨ ur die reine und angewandte Mathematik, 30:51–94, 1846

work page
[40]

T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009

work page 2009
[41]

Kulesza and B

A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) , pages 1193–1200, 2011

work page 2011
[42]

C. L. Lawson and R. J. Hanson. Solving Least Squares Problems . Society for Industrial and Applied Mathematics, 1995

work page 1995
[43]

P. Ma, M. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. In International Conference on Machine Learning , pages 91–99. PMLR, 2014

work page 2014
[44]

W. F. Mascarenhas. On the convergence of the Jacobi method for arbitrary orderings. SIAM Journal on Matrix Analysis and Applications , 16(4):1197–1209, 1995

work page 1995
[45]

Matejaˇ s

J. Matejaˇ s. Accuracy of the Jacobi method on scaled diagonally dominant symmetric matrices. SIAM journal on matrix analysis and applications , 31(1):133–153, 2009

work page 2009
[46]

N. Megiddo. Applying parallel computation algorithms in the design of serial algorithms. Journal of the ACM (JACM) , 30(4):852–865, 1983

work page 1983
[47]

Murray, J

R. Murray, J. Demmel, M. Mahoney, N. B. Erichson, M. Melnichenko, O. A. Malik, L. Grigori, P. Luszczek, M. Derezinski, M. E. Lopes, T. Liang, H. Luo, and J. Dongarra. Randomized numerical lin- ear algebra: A perspective on the field with an eye to software. Technical Report UCB/EECS-2023-19, EECS Department, University of California, Berkeley, 2023

work page 2023
[48]

Nakatsukasa and N

Y. Nakatsukasa and N. Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM J. Sci. Comp. , 35(3), 2013

work page 2013
[49]

J. Poulson. High-performance sampling of generic determinantal point processes. Philosophical Trans- actions of the Royal Society A , 378(2166):20190059, 2020

work page 2020
[50]

Y. Saad. Revisiting the (block) Jacobi subspace rotation method for the symmetric eigenvalue problem. Numerical Algorithms, 92(1):917–944, 2023

work page 2023
[51]

A. Sameh. On Jacobi and Jacobi-like algorithms for parallel computers. Math. Comp., 25(115):579–590, July 1971

work page 1971
[52]

Scott, O

J. Scott, O. Holtz, and O. Schwartz. Matrix multiplication I/O-complexity by path routing. In Pro- ceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures , SPAA ’15, pages 35–45, New York, NY, USA, 2015. ACM

work page 2015
[53]

R. Shah. Hermitian diagonalization in linear precision. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , pages 5599–5615. 29

work page 2025
[54]

Shiri and G

A. Shiri and G. K. Khosroshahi. An FPGA implementation of singular value decomposition. In 2019 27th Iranian Conference on Electrical Engineering (ICEE) , pages 416–422. IEEE, 2019

work page 2019
[55]

Shroff and R

G. Shroff and R. Schreiber. On the convergence of the cyclic Jacobi method for parallel block orderings. SIAM J. Mat. Anal. Appl. , 10(3):326–346, 1989

work page 1989
[56]

A. Sobczyk. Deterministic complexity analysis of Hermitian eigenproblems. arXiv:2410.21550, 2024

work page arXiv 2024
[57]

Solomonik and J

E. Solomonik and J. Demmel. Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In E. Jeannot, R. Namyst, and J. Roman, editors, Euro-Par 2011 Parallel Processing, volume 6853 of Lecture Notes in Computer Science , pages 90–109. Springer Berlin Heidel- berg, 2011

work page 2011
[58]

Strassen

V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik , 13:354–356, 1969. 10.1007/BF02165411

work page doi:10.1007/bf02165411 1969
[59]

F. Tisseur. Parallel implementation of the Yau and Lu method for eigenvalue computation. The International Journal of Supercomputer Applications and High Performance Computing , 11(3):197–204, 1997

work page 1997
[60]

S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM Journal on Matrix Analysis and Applications, 18(4):1065–1081, 1997

work page 1997
[61]

Van Loan

C. Van Loan. The block Jacobi method for computing the singular value decomposition. Technical report, Cornell University, USA, 1985

work page 1985
[62]

LL(1 : ⌊ n 2 ⌋, : ) 0 PRLL(⌊ n 2 ⌋ + 1 : m, : ) LR # ; U =

S.-T. Yau and Y. Y. Lu. Reducing the symmetric matrix eigenvalue problem to matrix multiplications. SIAM Journal on Scientific Computing , 14(1):121–136, 1993. A Fast, Recursive LUPP This appendix contains pseudocode for the recursive LUPP routine used to guarantee convergence in blocked/recursive Jacobi (i.e., Algorithm 6). This also appears in section 4...

work page 1993

[1] [1]

Alman, R

J. Alman, R. Duan, V. V. Williams, Y. Xu, Z. Xu, and R. Zhou. More asymmetry yields faster matrix multiplication. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2005–2039, 2025

work page 2025

[2] [2]

M. V. Athi, S. R. Zekavat, and A. A. Struthers. Real-time signal processing of massive sensor arrays via a parallel fast converging SVD algorithm: Latency, throughput, and resource analysis. IEEE Sensors Journal, 16(8):2519–2526, 2016

work page 2016

[3] [3]

Ballard, E

G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz. Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica, 23:1–155, May 2014

work page 2014

[4] [4]

Ballard, J

G. Ballard, J. Demmel, and I. Dumitriu. Communication-optimal parallel and sequential eigenvalue and singular value algorithms. EECS Technical Report EECS-2011-14, UC Berkeley, Feb. 2011

work page 2011

[5] [5]

Ballard, J

G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. Communication-optimal parallel algo- rithm for Strassen’s matrix multiplication. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12, pages 193–204, New York, NY, USA, June 2012. ACM

work page 2012

[6] [6]

Ballard, J

G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Minimizing communication in numerical linear algebra. SIAM J. Mat. Anal. Appl. , 32:866–901, 2011

work page 2011

[7] [7]

Ballard, J

G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Sequential communication bounds for fast linear algebra. Technical Report UCB/EECS-2012-36, UC Berkeley, March 2012

work page 2012

[8] [8]

Ballard, J

G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Communication costs of Strassen’s matrix multi- plication. Comm. of the ACM , 57(02):107–114, Feb 2014

work page 2014

[9] [9]

Ballard, J

G. Ballard, J. Demmel, and N. Knight. Communication avoiding successive band reduction. In Pro- ceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming , PPoPP ’12, pages 35–44. ACM, 2012

work page 2012

[10] [10]

Berry and A

M. Berry and A. Sameh. An overview of parallel algorithms for the singular value and symmetric eigenvalue problems. J. Comp. Appl. Math. , 27:191–213, 1989

work page 1989

[11] [11]

Bl¨ aser

M. Bl¨ aser. Fast matrix multiplication. Theory of Computing , pages 1–60, 2013

work page 2013

[12] [12]

Y. Cho, J. Demmel, M. Derezi´ nski, H. Li, H. Luo, M. Mahoney, and R. Murray. Surrogate-based autotuning for randomized sketching algorithms in regression problems. SIAM Journal on Matrix Analysis and Applications, 46(2):1247–1279, 2025

work page 2025

[13] [13]

J. J. M. Cuppen. A divide and conquer method for the symmetric tridiagonal eigenproblem. Numerische Mathematik, 36:177–195, 1980

work page 1980

[14] [14]

Demmel, I

J. Demmel, I. Dumitriu, and O. Holtz. Fast linear algebra is stable. Numerische Mathematik, 108(1):59– 91, 2007

work page 2007

[15] [15]

Demmel, I

J. Demmel, I. Dumitriu, O. Holtz, and R. Kleinberg. Fast matrix multiplication is stable. Numerische Mathematik, 106, 2006

work page 2006

[16] [16]

Demmel, M

J. Demmel, M. Gu, S. Eisenstat, I. Slapniˇ car, K. Veseli´ c, and Z. Drmaˇ c. Computing the singular value decomposition with high relative accuracy. Lin. Alg. Appl. , 299:21–80, 1999

work page 1999

[17] [17]

Demmel and K

J. Demmel and K. Veseli´ c. Jacobi’s Method is More Accurate Than QR. SIAM J. Mat. Anal. Appl. , 13(4):1204–1246, 1992. 27

work page 1992

[18] [18]

J. W. Demmel. Applied Numerical Linear Algebra . Society for Industrial and Applied Mathematics, 1997

work page 1997

[19] [19]

J. W. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-optimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comput. , 34(1):A206–A239, 2012

work page 2012

[20] [20]

Detherage and R

I. Detherage and R. Shah. A unified perspective on orthogonalization and diagonalization. arXiv:2505.02023, 2025

work page arXiv 2025

[21] [21]

I. S. Dhillon and B. N. Parlett. Orthogonal eigenvectors and relative gaps. SIAM Journal on Matrix Analysis and Applications, 25(3):858–899, 2003

work page 2003

[22] [22]

Dongarra, S

J. Dongarra, S. Hammarling, and D. Sorensen. Block reduction of matrices to condensed forms for eigenvalue computations. J. Comput. Appl. Math. , 27:215–227, 1989

work page 1989

[23] [23]

Drineas, M

P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling algorithms for ℓ2 regression and appli- cations. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , pages 1127–1136, 2006

work page 2006

[24] [24]

Z. Drmac. A global convergence proof for cyclic Jacobi methods with block rotations. SIAM Journal on Matrix Analysis and Applications , 31(3):1329–1350, 2009

work page 2009

[25] [25]

Drmaˇ c and K

K. Drmaˇ c and K. Veseli´ c. New fast and accurate Jacobi SVD algorithm, I.SIAM J. Mat. Anal. Appl. , 29(4):1322–1342, 2008

work page 2008

[26] [26]

Drmaˇ c and K

K. Drmaˇ c and K. Veseli´ c. New fast and accurate Jacobi SVD algorithm, II.SIAM J. Mat. Anal. Appl. , 29(4):1343–1362, 2008

work page 2008

[27] [27]

J. A. Duersch and M. Gu. Randomized projection for rank-revealing matrix factorizations and low-rank approximations. SIAM Review, 62(3):661–682, 2020

work page 2020

[28] [28]

Fawzi, M

A. Fawzi, M. Balog, A. Huang, T. Hubert, B. Romera-Paredes, M. Barekatain, A. Novikov, F. J. R. Ruiz, J. Schrittwieser, G. Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, 2022

work page 2022

[29] [29]

Forsythe and P

G. Forsythe and P. Henrici. The cyclic Jacobi method for computing the principal values of a complex matrix. Trans. Amer. Math. Soc., 94:1–23, 1960

work page 1960

[30] [30]

D. E. Foulser. A blocked Jacobi method for the symmetric eigenproblem . Yale University, Department of Computer Science, 1989

work page 1989

[31] [31]

W. Gao, Y. Ma, and M. Shao. A mixed precision Jacobi SVD algorithm. ACM Trans. Math. Softw. , 51(1), Apr. 2025

work page 2025

[32] [32]

I. Gemp, B. McWilliams, C. Vernade, and T. Graepel. Eigengame: PCA as a Nash equilibrium. arXiv:2010.00554, 2020

work page arXiv 2010

[33] [33]

G. H. Golub and H. A. Van der Vorst. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics , 123(1-2):35–65, 2000

work page 2000

[34] [34]

E. R. Hansen. On Cyclic Jacobi Methods. Journal of the Society for Industrial and Applied Mathematics, 11(2):448–459, 1963

work page 1963

[35] [35]

V. Hari. Convergence to diagonal form of block Jacobi-type methods. Numer. Math. , 129:449–481, 2015. 28

work page 2015

[36] [36]

P. Henrici. On the speed of convergence of cyclic and quasicyclic Jacobi methods for computing eigenval- ues of Hermitian matrices. Journal of the Society for Industrial and Applied Mathematics , 6(2):144–162, 1958

work page 1958

[37] [37]

N. J. Higham, F. Tisseur, M. Webb, and Z. Zhou. Computing accurate eigenvalues using a mixed- precision jacobi algorithm. arXiv preprint arXiv:2501.03742 , 2025

work page arXiv 2025

[38] [38]

J. W. Hong and H. T. Kung. I/O complexity: The red-blue pebble game. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing , STOC ’81, pages 326–333. ACM, 1981

work page 1981

[39] [39]

C. G. J. Jacobi. ¨Uber ein leichtes Verfahren die in der Theorie der S¨ acularst¨ orungen vorkommenden Gleichungen numerisch aufzul¨ osen.Journal f¨ ur die reine und angewandte Mathematik, 30:51–94, 1846

work page

[40] [40]

T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009

work page 2009

[41] [41]

Kulesza and B

A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) , pages 1193–1200, 2011

work page 2011

[42] [42]

C. L. Lawson and R. J. Hanson. Solving Least Squares Problems . Society for Industrial and Applied Mathematics, 1995

work page 1995

[43] [43]

P. Ma, M. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. In International Conference on Machine Learning , pages 91–99. PMLR, 2014

work page 2014

[44] [44]

W. F. Mascarenhas. On the convergence of the Jacobi method for arbitrary orderings. SIAM Journal on Matrix Analysis and Applications , 16(4):1197–1209, 1995

work page 1995

[45] [45]

Matejaˇ s

J. Matejaˇ s. Accuracy of the Jacobi method on scaled diagonally dominant symmetric matrices. SIAM journal on matrix analysis and applications , 31(1):133–153, 2009

work page 2009

[46] [46]

N. Megiddo. Applying parallel computation algorithms in the design of serial algorithms. Journal of the ACM (JACM) , 30(4):852–865, 1983

work page 1983

[47] [47]

Murray, J

R. Murray, J. Demmel, M. Mahoney, N. B. Erichson, M. Melnichenko, O. A. Malik, L. Grigori, P. Luszczek, M. Derezinski, M. E. Lopes, T. Liang, H. Luo, and J. Dongarra. Randomized numerical lin- ear algebra: A perspective on the field with an eye to software. Technical Report UCB/EECS-2023-19, EECS Department, University of California, Berkeley, 2023

work page 2023

[48] [48]

Nakatsukasa and N

Y. Nakatsukasa and N. Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM J. Sci. Comp. , 35(3), 2013

work page 2013

[49] [49]

J. Poulson. High-performance sampling of generic determinantal point processes. Philosophical Trans- actions of the Royal Society A , 378(2166):20190059, 2020

work page 2020

[50] [50]

Y. Saad. Revisiting the (block) Jacobi subspace rotation method for the symmetric eigenvalue problem. Numerical Algorithms, 92(1):917–944, 2023

work page 2023

[51] [51]

A. Sameh. On Jacobi and Jacobi-like algorithms for parallel computers. Math. Comp., 25(115):579–590, July 1971

work page 1971

[52] [52]

Scott, O

J. Scott, O. Holtz, and O. Schwartz. Matrix multiplication I/O-complexity by path routing. In Pro- ceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures , SPAA ’15, pages 35–45, New York, NY, USA, 2015. ACM

work page 2015

[53] [53]

R. Shah. Hermitian diagonalization in linear precision. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , pages 5599–5615. 29

work page 2025

[54] [54]

Shiri and G

A. Shiri and G. K. Khosroshahi. An FPGA implementation of singular value decomposition. In 2019 27th Iranian Conference on Electrical Engineering (ICEE) , pages 416–422. IEEE, 2019

work page 2019

[55] [55]

Shroff and R

G. Shroff and R. Schreiber. On the convergence of the cyclic Jacobi method for parallel block orderings. SIAM J. Mat. Anal. Appl. , 10(3):326–346, 1989

work page 1989

[56] [56]

A. Sobczyk. Deterministic complexity analysis of Hermitian eigenproblems. arXiv:2410.21550, 2024

work page arXiv 2024

[57] [57]

Solomonik and J

E. Solomonik and J. Demmel. Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In E. Jeannot, R. Namyst, and J. Roman, editors, Euro-Par 2011 Parallel Processing, volume 6853 of Lecture Notes in Computer Science , pages 90–109. Springer Berlin Heidel- berg, 2011

work page 2011

[58] [58]

Strassen

V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik , 13:354–356, 1969. 10.1007/BF02165411

work page doi:10.1007/bf02165411 1969

[59] [59]

F. Tisseur. Parallel implementation of the Yau and Lu method for eigenvalue computation. The International Journal of Supercomputer Applications and High Performance Computing , 11(3):197–204, 1997

work page 1997

[60] [60]

S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM Journal on Matrix Analysis and Applications, 18(4):1065–1081, 1997

work page 1997

[61] [61]

Van Loan

C. Van Loan. The block Jacobi method for computing the singular value decomposition. Technical report, Cornell University, USA, 1985

work page 1985

[62] [62]

LL(1 : ⌊ n 2 ⌋, : ) 0 PRLL(⌊ n 2 ⌋ + 1 : m, : ) LR # ; U =

S.-T. Yau and Y. Y. Lu. Reducing the symmetric matrix eigenvalue problem to matrix multiplications. SIAM Journal on Scientific Computing , 14(1):121–136, 1993. A Fast, Recursive LUPP This appendix contains pseudocode for the recursive LUPP routine used to guarantee convergence in blocked/recursive Jacobi (i.e., Algorithm 6). This also appears in section 4...

work page 1993