Minimizing the Arithmetic and Communication Complexity of Jacobi's Method for Eigenvalues and Singular Values: Part One -- Serial Algorithms
Pith reviewed 2026-05-19 12:04 UTC · model grok-4.3
The pith
A blocked Jacobi method for the symmetric eigenvalue problem attains the communication lower bound for matrix multiplication.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the classical O(n^3) setting, a blocked Jacobi method attains the communication lower bound for matrix multiplication and is thus expected to be communication optimal among O(n^3) eigensolvers. In the fast matrix multiplication setting, a recursive version of blocked Jacobi reaches essentially optimal complexity in both arithmetic operations and communication.
What carries the argument
Blocked and recursive implementations of Jacobi's method, which organize rotations to leverage efficient matrix multiplications as black boxes.
If this is right
- The blocked version is expected to be communication optimal among O(n^3) eigensolvers.
- A recursive version reaches essentially optimal complexity with fast matrix multiplication.
- Analogous complexity bounds hold for one-sided Jacobi SVD algorithms.
Where Pith is reading between the lines
- This optimality suggests Jacobi methods could be competitive with other eigensolvers like QR algorithm in bandwidth-limited settings.
- Similar blocking and recursion techniques might apply to other eigenvalue algorithms to reduce communication.
- Practical implementations could test these bounds on large matrices to verify performance gains.
Load-bearing premise
The matrix multiplication routine acts as a black box with exactly known arithmetic and communication costs, and blocking or recursion introduces no significant hidden lower-order terms.
What would settle it
Implementing the blocked Jacobi method and measuring the total volume of data moved between fast and slow memory to determine if it matches or beats the communication lower bound for matrix multiplication.
Figures
read the original abstract
We analyze several versions of Jacobi's method for the symmetric eigenvalue problem. Our goal is to reduce the asymptotic cost of the algorithm as much as possible, as measured by the number of arithmetic operations performed and associated (serial or parallel) communication, i.e., the amount of data moved between slow and fast memory or between processors in a network. The first half of this effort, which considers the serial setting, is presented here; this paper contains rigorous complexity bounds for a variety of serial Jacobi algorithms, built on both classic $O(n^3)$ matrix multiplication and fast, Strassen-like $O(n^{\omega_0})$ alternatives. In the classical case, we show that a blocked implementation of Jacobi's method attains the communication lower bound for $O(n^3)$ matrix multiplication (and is therefore expected to be communication optimal among $O(n^3)$ eigensolvers). In the fast setting, we demonstrate that a recursive version of blocked Jacobi can go further, reaching essentially optimal complexity in both measures. We also derive analogous complexity bounds for (one-sided) Jacobi SVD algorithms. A forthcoming sequel to this paper will extend our complexity analysis to the parallel case.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes serial versions of Jacobi's method for the symmetric eigenvalue problem and one-sided SVD. It derives rigorous asymptotic bounds on arithmetic operations and communication volume (in a two-level memory model) for both classical O(n^3) matrix multiplication and fast Strassen-like multiplication with exponent ω₀. The central claims are that a blocked implementation attains the communication lower bound for O(n^3) MM and is therefore expected to be communication-optimal among O(n^3) eigensolvers, while a recursive blocked version reaches essentially optimal complexity when fast matrix multiplication is substituted.
Significance. If the bounds are correct, the work establishes the first rigorous communication-optimal serial analysis for Jacobi methods, which is significant for communication-bound regimes in high-performance computing. The use of standard MM cost models to obtain parameter-free lower-bound attainment for the classical case and near-optimality for the fast case is a clear strength; the forthcoming parallel sequel is noted as a natural extension.
major comments (1)
- [blocked and recursive Jacobi sections (complexity derivations)] The optimality claims rest on treating each matrix multiplication (classical or fast) as a black-box routine whose communication cost exactly matches the known lower bound. In the serial two-level memory model, the O(n) sweeps of the blocked or recursive algorithm require additional data movement for block extraction, rotation application, and re-layouting of the working set between sweeps. The manuscript must explicitly bound these auxiliary volumes and show they are absorbed into lower-order terms (or cancel across sweeps) so that the total communication remains asymptotically equal to the pure-MM bound; otherwise the “attains the lower bound” and “essentially optimal” statements are not justified.
minor comments (1)
- Ensure consistent notation for the fast-MM exponent (ω₀ vs. ω) and for the memory parameters (M, B) across all complexity statements and tables.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. The point raised about auxiliary communication costs is well taken and we address it directly below.
read point-by-point responses
-
Referee: [blocked and recursive Jacobi sections (complexity derivations)] The optimality claims rest on treating each matrix multiplication (classical or fast) as a black-box routine whose communication cost exactly matches the known lower bound. In the serial two-level memory model, the O(n) sweeps of the blocked or recursive algorithm require additional data movement for block extraction, rotation application, and re-layouting of the working set between sweeps. The manuscript must explicitly bound these auxiliary volumes and show they are absorbed into lower-order terms (or cancel across sweeps) so that the total communication remains asymptotically equal to the pure-MM bound; otherwise the “attains the lower bound” and “essentially optimal” statements are not justified.
Authors: We agree that an explicit accounting of auxiliary data movement strengthens the communication analysis. In the blocked Jacobi algorithm the dominant communication arises from the matrix multiplications performed inside each sweep; block extraction, rotation application, and re-layouting are performed on sub-blocks whose aggregate volume per sweep is O(n^2 log n) words (or lower when fast memory is reused across consecutive rotations). Over the O(n) sweeps required for convergence this contributes an auxiliary term of O(n^3 log n) words moved. In the two-level memory model this term is asymptotically absorbed into the leading communication lower bound Ω(n^3 / √M) of classical matrix multiplication whenever M = o(n^2), and remains lower order even for larger fast-memory sizes. The recursive blocked variant further reduces re-layout cost by preserving block structure across recursion levels. We will add a short dedicated paragraph (or subsection) in Sections 3 and 4 that states these bounds explicitly and confirms they do not alter the asymptotic attainment of the pure-MM lower bound. The same argument applies, mutatis mutandis, to the one-sided SVD variant. revision: yes
Circularity Check
No circularity: complexity claims compose external matrix-multiplication lower bounds with new blocked/recursive Jacobi constructions.
full rationale
The paper's central results state that a blocked Jacobi attains the known communication lower bound for O(n^3) matrix multiplication and that a recursive version reaches essentially optimal cost when fast matrix multiplication is substituted. These statements rest on treating matrix multiplication as an opaque black-box routine whose arithmetic and communication costs are taken from prior, independently established results. No equation or definition inside the paper is shown to be defined in terms of the claimed optimality, no parameter is fitted to a subset of the target data and then renamed a prediction, and no load-bearing uniqueness theorem or ansatz is imported solely via self-citation. The analysis therefore remains self-contained against external benchmarks and does not reduce the claimed optimality to a tautology.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Matrix multiplication costs O(n^3) arithmetic operations and satisfies known communication lower bounds in the serial two-level memory model.
- standard math Fast matrix multiplication costs O(n^ω0) with corresponding communication bounds.
Reference graph
Works this paper leans on
- [1]
-
[2]
M. V. Athi, S. R. Zekavat, and A. A. Struthers. Real-time signal processing of massive sensor arrays via a parallel fast converging SVD algorithm: Latency, throughput, and resource analysis. IEEE Sensors Journal, 16(8):2519–2526, 2016
work page 2016
-
[3]
G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz. Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica, 23:1–155, May 2014
work page 2014
-
[4]
G. Ballard, J. Demmel, and I. Dumitriu. Communication-optimal parallel and sequential eigenvalue and singular value algorithms. EECS Technical Report EECS-2011-14, UC Berkeley, Feb. 2011
work page 2011
-
[5]
G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. Communication-optimal parallel algo- rithm for Strassen’s matrix multiplication. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12, pages 193–204, New York, NY, USA, June 2012. ACM
work page 2012
-
[6]
G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Minimizing communication in numerical linear algebra. SIAM J. Mat. Anal. Appl. , 32:866–901, 2011
work page 2011
-
[7]
G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Sequential communication bounds for fast linear algebra. Technical Report UCB/EECS-2012-36, UC Berkeley, March 2012
work page 2012
-
[8]
G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Communication costs of Strassen’s matrix multi- plication. Comm. of the ACM , 57(02):107–114, Feb 2014
work page 2014
-
[9]
G. Ballard, J. Demmel, and N. Knight. Communication avoiding successive band reduction. In Pro- ceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming , PPoPP ’12, pages 35–44. ACM, 2012
work page 2012
-
[10]
M. Berry and A. Sameh. An overview of parallel algorithms for the singular value and symmetric eigenvalue problems. J. Comp. Appl. Math. , 27:191–213, 1989
work page 1989
- [11]
-
[12]
Y. Cho, J. Demmel, M. Derezi´ nski, H. Li, H. Luo, M. Mahoney, and R. Murray. Surrogate-based autotuning for randomized sketching algorithms in regression problems. SIAM Journal on Matrix Analysis and Applications, 46(2):1247–1279, 2025
work page 2025
-
[13]
J. J. M. Cuppen. A divide and conquer method for the symmetric tridiagonal eigenproblem. Numerische Mathematik, 36:177–195, 1980
work page 1980
- [14]
- [15]
- [16]
-
[17]
J. Demmel and K. Veseli´ c. Jacobi’s Method is More Accurate Than QR. SIAM J. Mat. Anal. Appl. , 13(4):1204–1246, 1992. 27
work page 1992
-
[18]
J. W. Demmel. Applied Numerical Linear Algebra . Society for Industrial and Applied Mathematics, 1997
work page 1997
-
[19]
J. W. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-optimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comput. , 34(1):A206–A239, 2012
work page 2012
-
[20]
I. Detherage and R. Shah. A unified perspective on orthogonalization and diagonalization. arXiv:2505.02023, 2025
-
[21]
I. S. Dhillon and B. N. Parlett. Orthogonal eigenvectors and relative gaps. SIAM Journal on Matrix Analysis and Applications, 25(3):858–899, 2003
work page 2003
-
[22]
J. Dongarra, S. Hammarling, and D. Sorensen. Block reduction of matrices to condensed forms for eigenvalue computations. J. Comput. Appl. Math. , 27:215–227, 1989
work page 1989
-
[23]
P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling algorithms for ℓ2 regression and appli- cations. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , pages 1127–1136, 2006
work page 2006
-
[24]
Z. Drmac. A global convergence proof for cyclic Jacobi methods with block rotations. SIAM Journal on Matrix Analysis and Applications , 31(3):1329–1350, 2009
work page 2009
-
[25]
K. Drmaˇ c and K. Veseli´ c. New fast and accurate Jacobi SVD algorithm, I.SIAM J. Mat. Anal. Appl. , 29(4):1322–1342, 2008
work page 2008
-
[26]
K. Drmaˇ c and K. Veseli´ c. New fast and accurate Jacobi SVD algorithm, II.SIAM J. Mat. Anal. Appl. , 29(4):1343–1362, 2008
work page 2008
-
[27]
J. A. Duersch and M. Gu. Randomized projection for rank-revealing matrix factorizations and low-rank approximations. SIAM Review, 62(3):661–682, 2020
work page 2020
- [28]
-
[29]
G. Forsythe and P. Henrici. The cyclic Jacobi method for computing the principal values of a complex matrix. Trans. Amer. Math. Soc., 94:1–23, 1960
work page 1960
-
[30]
D. E. Foulser. A blocked Jacobi method for the symmetric eigenproblem . Yale University, Department of Computer Science, 1989
work page 1989
-
[31]
W. Gao, Y. Ma, and M. Shao. A mixed precision Jacobi SVD algorithm. ACM Trans. Math. Softw. , 51(1), Apr. 2025
work page 2025
- [32]
-
[33]
G. H. Golub and H. A. Van der Vorst. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics , 123(1-2):35–65, 2000
work page 2000
-
[34]
E. R. Hansen. On Cyclic Jacobi Methods. Journal of the Society for Industrial and Applied Mathematics, 11(2):448–459, 1963
work page 1963
-
[35]
V. Hari. Convergence to diagonal form of block Jacobi-type methods. Numer. Math. , 129:449–481, 2015. 28
work page 2015
-
[36]
P. Henrici. On the speed of convergence of cyclic and quasicyclic Jacobi methods for computing eigenval- ues of Hermitian matrices. Journal of the Society for Industrial and Applied Mathematics , 6(2):144–162, 1958
work page 1958
- [37]
-
[38]
J. W. Hong and H. T. Kung. I/O complexity: The red-blue pebble game. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing , STOC ’81, pages 326–333. ACM, 1981
work page 1981
-
[39]
C. G. J. Jacobi. ¨Uber ein leichtes Verfahren die in der Theorie der S¨ acularst¨ orungen vorkommenden Gleichungen numerisch aufzul¨ osen.Journal f¨ ur die reine und angewandte Mathematik, 30:51–94, 1846
-
[40]
T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009
work page 2009
-
[41]
A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) , pages 1193–1200, 2011
work page 2011
-
[42]
C. L. Lawson and R. J. Hanson. Solving Least Squares Problems . Society for Industrial and Applied Mathematics, 1995
work page 1995
-
[43]
P. Ma, M. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. In International Conference on Machine Learning , pages 91–99. PMLR, 2014
work page 2014
-
[44]
W. F. Mascarenhas. On the convergence of the Jacobi method for arbitrary orderings. SIAM Journal on Matrix Analysis and Applications , 16(4):1197–1209, 1995
work page 1995
- [45]
-
[46]
N. Megiddo. Applying parallel computation algorithms in the design of serial algorithms. Journal of the ACM (JACM) , 30(4):852–865, 1983
work page 1983
-
[47]
R. Murray, J. Demmel, M. Mahoney, N. B. Erichson, M. Melnichenko, O. A. Malik, L. Grigori, P. Luszczek, M. Derezinski, M. E. Lopes, T. Liang, H. Luo, and J. Dongarra. Randomized numerical lin- ear algebra: A perspective on the field with an eye to software. Technical Report UCB/EECS-2023-19, EECS Department, University of California, Berkeley, 2023
work page 2023
-
[48]
Y. Nakatsukasa and N. Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM J. Sci. Comp. , 35(3), 2013
work page 2013
-
[49]
J. Poulson. High-performance sampling of generic determinantal point processes. Philosophical Trans- actions of the Royal Society A , 378(2166):20190059, 2020
work page 2020
-
[50]
Y. Saad. Revisiting the (block) Jacobi subspace rotation method for the symmetric eigenvalue problem. Numerical Algorithms, 92(1):917–944, 2023
work page 2023
-
[51]
A. Sameh. On Jacobi and Jacobi-like algorithms for parallel computers. Math. Comp., 25(115):579–590, July 1971
work page 1971
- [52]
-
[53]
R. Shah. Hermitian diagonalization in linear precision. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , pages 5599–5615. 29
work page 2025
-
[54]
A. Shiri and G. K. Khosroshahi. An FPGA implementation of singular value decomposition. In 2019 27th Iranian Conference on Electrical Engineering (ICEE) , pages 416–422. IEEE, 2019
work page 2019
-
[55]
G. Shroff and R. Schreiber. On the convergence of the cyclic Jacobi method for parallel block orderings. SIAM J. Mat. Anal. Appl. , 10(3):326–346, 1989
work page 1989
- [56]
-
[57]
E. Solomonik and J. Demmel. Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In E. Jeannot, R. Namyst, and J. Roman, editors, Euro-Par 2011 Parallel Processing, volume 6853 of Lecture Notes in Computer Science , pages 90–109. Springer Berlin Heidel- berg, 2011
work page 2011
-
[58]
V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik , 13:354–356, 1969. 10.1007/BF02165411
-
[59]
F. Tisseur. Parallel implementation of the Yau and Lu method for eigenvalue computation. The International Journal of Supercomputer Applications and High Performance Computing , 11(3):197–204, 1997
work page 1997
-
[60]
S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM Journal on Matrix Analysis and Applications, 18(4):1065–1081, 1997
work page 1997
- [61]
-
[62]
LL(1 : ⌊ n 2 ⌋, : ) 0 PRLL(⌊ n 2 ⌋ + 1 : m, : ) LR # ; U =
S.-T. Yau and Y. Y. Lu. Reducing the symmetric matrix eigenvalue problem to matrix multiplications. SIAM Journal on Scientific Computing , 14(1):121–136, 1993. A Fast, Recursive LUPP This appendix contains pseudocode for the recursive LUPP routine used to guarantee convergence in blocked/recursive Jacobi (i.e., Algorithm 6). This also appears in section 4...
work page 1993
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.