pith. sign in

arxiv: 2506.03466 · v2 · submitted 2025-06-04 · 🧮 math.NA · cs.CC· cs.NA

Minimizing the Arithmetic and Communication Complexity of Jacobi's Method for Eigenvalues and Singular Values: Part One -- Serial Algorithms

Pith reviewed 2026-05-19 12:04 UTC · model grok-4.3

classification 🧮 math.NA cs.CCcs.NA
keywords Jacobi methodeigenvalue problemcommunication complexityarithmetic complexityblocked algorithmsmatrix multiplicationsingular value decomposition
0
0 comments X

The pith

A blocked Jacobi method for the symmetric eigenvalue problem attains the communication lower bound for matrix multiplication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops serial algorithms based on Jacobi's method that minimize both the number of arithmetic operations and the amount of data movement for computing eigenvalues and singular values. It proves that a blocked implementation achieves the known lower bound on communication for classical matrix multiplication, which is the dominant cost in large-scale linear algebra. A recursive blocked version extends this to fast matrix multiplication, reaching near-optimal costs in both arithmetic and communication. Readers should care because in modern computing, moving data often costs more than performing operations, so these bounds indicate when Jacobi methods can be efficient. The analysis also covers one-sided Jacobi for singular values.

Core claim

In the classical O(n^3) setting, a blocked Jacobi method attains the communication lower bound for matrix multiplication and is thus expected to be communication optimal among O(n^3) eigensolvers. In the fast matrix multiplication setting, a recursive version of blocked Jacobi reaches essentially optimal complexity in both arithmetic operations and communication.

What carries the argument

Blocked and recursive implementations of Jacobi's method, which organize rotations to leverage efficient matrix multiplications as black boxes.

If this is right

  • The blocked version is expected to be communication optimal among O(n^3) eigensolvers.
  • A recursive version reaches essentially optimal complexity with fast matrix multiplication.
  • Analogous complexity bounds hold for one-sided Jacobi SVD algorithms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This optimality suggests Jacobi methods could be competitive with other eigensolvers like QR algorithm in bandwidth-limited settings.
  • Similar blocking and recursion techniques might apply to other eigenvalue algorithms to reduce communication.
  • Practical implementations could test these bounds on large matrices to verify performance gains.

Load-bearing premise

The matrix multiplication routine acts as a black box with exactly known arithmetic and communication costs, and blocking or recursion introduces no significant hidden lower-order terms.

What would settle it

Implementing the blocked Jacobi method and measuring the total volume of data moved between fast and slow memory to determine if it matches or beats the communication lower bound for matrix multiplication.

Figures

Figures reproduced from arXiv: 2506.03466 by Hengrui Luo, James Demmel, Ryan Schneider, Yifu Wang.

Figure 1
Figure 1. Figure 1: Block structure at each recursive level of Algorithm 3 for [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of scalar, blocked and recursive Jacobi on a 512 [PITH_FULL_IMAGE:figures/full_fig_p025_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Performance of various implementations of blocked Jacobi, including Algorithm 2 without pivoting [PITH_FULL_IMAGE:figures/full_fig_p026_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Impact of base-case accuracy on the convergence of recursive Jacobi for a fixed 512 [PITH_FULL_IMAGE:figures/full_fig_p033_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Impact of pre-determined maximum recursion depth on the convergence of recursive Jacobi for a [PITH_FULL_IMAGE:figures/full_fig_p033_5.png] view at source ↗
read the original abstract

We analyze several versions of Jacobi's method for the symmetric eigenvalue problem. Our goal is to reduce the asymptotic cost of the algorithm as much as possible, as measured by the number of arithmetic operations performed and associated (serial or parallel) communication, i.e., the amount of data moved between slow and fast memory or between processors in a network. The first half of this effort, which considers the serial setting, is presented here; this paper contains rigorous complexity bounds for a variety of serial Jacobi algorithms, built on both classic $O(n^3)$ matrix multiplication and fast, Strassen-like $O(n^{\omega_0})$ alternatives. In the classical case, we show that a blocked implementation of Jacobi's method attains the communication lower bound for $O(n^3)$ matrix multiplication (and is therefore expected to be communication optimal among $O(n^3)$ eigensolvers). In the fast setting, we demonstrate that a recursive version of blocked Jacobi can go further, reaching essentially optimal complexity in both measures. We also derive analogous complexity bounds for (one-sided) Jacobi SVD algorithms. A forthcoming sequel to this paper will extend our complexity analysis to the parallel case.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript analyzes serial versions of Jacobi's method for the symmetric eigenvalue problem and one-sided SVD. It derives rigorous asymptotic bounds on arithmetic operations and communication volume (in a two-level memory model) for both classical O(n^3) matrix multiplication and fast Strassen-like multiplication with exponent ω₀. The central claims are that a blocked implementation attains the communication lower bound for O(n^3) MM and is therefore expected to be communication-optimal among O(n^3) eigensolvers, while a recursive blocked version reaches essentially optimal complexity when fast matrix multiplication is substituted.

Significance. If the bounds are correct, the work establishes the first rigorous communication-optimal serial analysis for Jacobi methods, which is significant for communication-bound regimes in high-performance computing. The use of standard MM cost models to obtain parameter-free lower-bound attainment for the classical case and near-optimality for the fast case is a clear strength; the forthcoming parallel sequel is noted as a natural extension.

major comments (1)
  1. [blocked and recursive Jacobi sections (complexity derivations)] The optimality claims rest on treating each matrix multiplication (classical or fast) as a black-box routine whose communication cost exactly matches the known lower bound. In the serial two-level memory model, the O(n) sweeps of the blocked or recursive algorithm require additional data movement for block extraction, rotation application, and re-layouting of the working set between sweeps. The manuscript must explicitly bound these auxiliary volumes and show they are absorbed into lower-order terms (or cancel across sweeps) so that the total communication remains asymptotically equal to the pure-MM bound; otherwise the “attains the lower bound” and “essentially optimal” statements are not justified.
minor comments (1)
  1. Ensure consistent notation for the fast-MM exponent (ω₀ vs. ω) and for the memory parameters (M, B) across all complexity statements and tables.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. The point raised about auxiliary communication costs is well taken and we address it directly below.

read point-by-point responses
  1. Referee: [blocked and recursive Jacobi sections (complexity derivations)] The optimality claims rest on treating each matrix multiplication (classical or fast) as a black-box routine whose communication cost exactly matches the known lower bound. In the serial two-level memory model, the O(n) sweeps of the blocked or recursive algorithm require additional data movement for block extraction, rotation application, and re-layouting of the working set between sweeps. The manuscript must explicitly bound these auxiliary volumes and show they are absorbed into lower-order terms (or cancel across sweeps) so that the total communication remains asymptotically equal to the pure-MM bound; otherwise the “attains the lower bound” and “essentially optimal” statements are not justified.

    Authors: We agree that an explicit accounting of auxiliary data movement strengthens the communication analysis. In the blocked Jacobi algorithm the dominant communication arises from the matrix multiplications performed inside each sweep; block extraction, rotation application, and re-layouting are performed on sub-blocks whose aggregate volume per sweep is O(n^2 log n) words (or lower when fast memory is reused across consecutive rotations). Over the O(n) sweeps required for convergence this contributes an auxiliary term of O(n^3 log n) words moved. In the two-level memory model this term is asymptotically absorbed into the leading communication lower bound Ω(n^3 / √M) of classical matrix multiplication whenever M = o(n^2), and remains lower order even for larger fast-memory sizes. The recursive blocked variant further reduces re-layout cost by preserving block structure across recursion levels. We will add a short dedicated paragraph (or subsection) in Sections 3 and 4 that states these bounds explicitly and confirms they do not alter the asymptotic attainment of the pure-MM lower bound. The same argument applies, mutatis mutandis, to the one-sided SVD variant. revision: yes

Circularity Check

0 steps flagged

No circularity: complexity claims compose external matrix-multiplication lower bounds with new blocked/recursive Jacobi constructions.

full rationale

The paper's central results state that a blocked Jacobi attains the known communication lower bound for O(n^3) matrix multiplication and that a recursive version reaches essentially optimal cost when fast matrix multiplication is substituted. These statements rest on treating matrix multiplication as an opaque black-box routine whose arithmetic and communication costs are taken from prior, independently established results. No equation or definition inside the paper is shown to be defined in terms of the claimed optimality, no parameter is fitted to a subset of the target data and then renamed a prediction, and no load-bearing uniqueness theorem or ansatz is imported solely via self-citation. The analysis therefore remains self-contained against external benchmarks and does not reduce the claimed optimality to a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on standard arithmetic and communication cost models for matrix multiplication; no free parameters are fitted, no new entities are postulated, and the axioms are the usual ones of asymptotic analysis.

axioms (2)
  • standard math Matrix multiplication costs O(n^3) arithmetic operations and satisfies known communication lower bounds in the serial two-level memory model.
    Invoked when stating that the blocked Jacobi attains the lower bound.
  • standard math Fast matrix multiplication costs O(n^ω0) with corresponding communication bounds.
    Used for the recursive version analysis.

pith-pipeline@v0.9.0 · 5756 in / 1273 out tokens · 28784 ms · 2026-05-19T12:04:33.940129+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages

  1. [1]

    Alman, R

    J. Alman, R. Duan, V. V. Williams, Y. Xu, Z. Xu, and R. Zhou. More asymmetry yields faster matrix multiplication. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2005–2039, 2025

  2. [2]

    M. V. Athi, S. R. Zekavat, and A. A. Struthers. Real-time signal processing of massive sensor arrays via a parallel fast converging SVD algorithm: Latency, throughput, and resource analysis. IEEE Sensors Journal, 16(8):2519–2526, 2016

  3. [3]

    Ballard, E

    G. Ballard, E. Carson, J. Demmel, M. Hoemmen, N. Knight, and O. Schwartz. Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica, 23:1–155, May 2014

  4. [4]

    Ballard, J

    G. Ballard, J. Demmel, and I. Dumitriu. Communication-optimal parallel and sequential eigenvalue and singular value algorithms. EECS Technical Report EECS-2011-14, UC Berkeley, Feb. 2011

  5. [5]

    Ballard, J

    G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. Communication-optimal parallel algo- rithm for Strassen’s matrix multiplication. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12, pages 193–204, New York, NY, USA, June 2012. ACM

  6. [6]

    Ballard, J

    G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Minimizing communication in numerical linear algebra. SIAM J. Mat. Anal. Appl. , 32:866–901, 2011

  7. [7]

    Ballard, J

    G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Sequential communication bounds for fast linear algebra. Technical Report UCB/EECS-2012-36, UC Berkeley, March 2012

  8. [8]

    Ballard, J

    G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Communication costs of Strassen’s matrix multi- plication. Comm. of the ACM , 57(02):107–114, Feb 2014

  9. [9]

    Ballard, J

    G. Ballard, J. Demmel, and N. Knight. Communication avoiding successive band reduction. In Pro- ceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming , PPoPP ’12, pages 35–44. ACM, 2012

  10. [10]

    Berry and A

    M. Berry and A. Sameh. An overview of parallel algorithms for the singular value and symmetric eigenvalue problems. J. Comp. Appl. Math. , 27:191–213, 1989

  11. [11]

    Bl¨ aser

    M. Bl¨ aser. Fast matrix multiplication. Theory of Computing , pages 1–60, 2013

  12. [12]

    Y. Cho, J. Demmel, M. Derezi´ nski, H. Li, H. Luo, M. Mahoney, and R. Murray. Surrogate-based autotuning for randomized sketching algorithms in regression problems. SIAM Journal on Matrix Analysis and Applications, 46(2):1247–1279, 2025

  13. [13]

    J. J. M. Cuppen. A divide and conquer method for the symmetric tridiagonal eigenproblem. Numerische Mathematik, 36:177–195, 1980

  14. [14]

    Demmel, I

    J. Demmel, I. Dumitriu, and O. Holtz. Fast linear algebra is stable. Numerische Mathematik, 108(1):59– 91, 2007

  15. [15]

    Demmel, I

    J. Demmel, I. Dumitriu, O. Holtz, and R. Kleinberg. Fast matrix multiplication is stable. Numerische Mathematik, 106, 2006

  16. [16]

    Demmel, M

    J. Demmel, M. Gu, S. Eisenstat, I. Slapniˇ car, K. Veseli´ c, and Z. Drmaˇ c. Computing the singular value decomposition with high relative accuracy. Lin. Alg. Appl. , 299:21–80, 1999

  17. [17]

    Demmel and K

    J. Demmel and K. Veseli´ c. Jacobi’s Method is More Accurate Than QR. SIAM J. Mat. Anal. Appl. , 13(4):1204–1246, 1992. 27

  18. [18]

    J. W. Demmel. Applied Numerical Linear Algebra . Society for Industrial and Applied Mathematics, 1997

  19. [19]

    J. W. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-optimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comput. , 34(1):A206–A239, 2012

  20. [20]

    Detherage and R

    I. Detherage and R. Shah. A unified perspective on orthogonalization and diagonalization. arXiv:2505.02023, 2025

  21. [21]

    I. S. Dhillon and B. N. Parlett. Orthogonal eigenvectors and relative gaps. SIAM Journal on Matrix Analysis and Applications, 25(3):858–899, 2003

  22. [22]

    Dongarra, S

    J. Dongarra, S. Hammarling, and D. Sorensen. Block reduction of matrices to condensed forms for eigenvalue computations. J. Comput. Appl. Math. , 27:215–227, 1989

  23. [23]

    Drineas, M

    P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling algorithms for ℓ2 regression and appli- cations. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , pages 1127–1136, 2006

  24. [24]

    Z. Drmac. A global convergence proof for cyclic Jacobi methods with block rotations. SIAM Journal on Matrix Analysis and Applications , 31(3):1329–1350, 2009

  25. [25]

    Drmaˇ c and K

    K. Drmaˇ c and K. Veseli´ c. New fast and accurate Jacobi SVD algorithm, I.SIAM J. Mat. Anal. Appl. , 29(4):1322–1342, 2008

  26. [26]

    Drmaˇ c and K

    K. Drmaˇ c and K. Veseli´ c. New fast and accurate Jacobi SVD algorithm, II.SIAM J. Mat. Anal. Appl. , 29(4):1343–1362, 2008

  27. [27]

    J. A. Duersch and M. Gu. Randomized projection for rank-revealing matrix factorizations and low-rank approximations. SIAM Review, 62(3):661–682, 2020

  28. [28]

    Fawzi, M

    A. Fawzi, M. Balog, A. Huang, T. Hubert, B. Romera-Paredes, M. Barekatain, A. Novikov, F. J. R. Ruiz, J. Schrittwieser, G. Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, 2022

  29. [29]

    Forsythe and P

    G. Forsythe and P. Henrici. The cyclic Jacobi method for computing the principal values of a complex matrix. Trans. Amer. Math. Soc., 94:1–23, 1960

  30. [30]

    D. E. Foulser. A blocked Jacobi method for the symmetric eigenproblem . Yale University, Department of Computer Science, 1989

  31. [31]

    W. Gao, Y. Ma, and M. Shao. A mixed precision Jacobi SVD algorithm. ACM Trans. Math. Softw. , 51(1), Apr. 2025

  32. [32]

    I. Gemp, B. McWilliams, C. Vernade, and T. Graepel. Eigengame: PCA as a Nash equilibrium. arXiv:2010.00554, 2020

  33. [33]

    G. H. Golub and H. A. Van der Vorst. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics , 123(1-2):35–65, 2000

  34. [34]

    E. R. Hansen. On Cyclic Jacobi Methods. Journal of the Society for Industrial and Applied Mathematics, 11(2):448–459, 1963

  35. [35]

    V. Hari. Convergence to diagonal form of block Jacobi-type methods. Numer. Math. , 129:449–481, 2015. 28

  36. [36]

    P. Henrici. On the speed of convergence of cyclic and quasicyclic Jacobi methods for computing eigenval- ues of Hermitian matrices. Journal of the Society for Industrial and Applied Mathematics , 6(2):144–162, 1958

  37. [37]

    N. J. Higham, F. Tisseur, M. Webb, and Z. Zhou. Computing accurate eigenvalues using a mixed- precision jacobi algorithm. arXiv preprint arXiv:2501.03742 , 2025

  38. [38]

    J. W. Hong and H. T. Kung. I/O complexity: The red-blue pebble game. In Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing , STOC ’81, pages 326–333. ACM, 1981

  39. [39]

    C. G. J. Jacobi. ¨Uber ein leichtes Verfahren die in der Theorie der S¨ acularst¨ orungen vorkommenden Gleichungen numerisch aufzul¨ osen.Journal f¨ ur die reine und angewandte Mathematik, 30:51–94, 1846

  40. [40]

    T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009

  41. [41]

    Kulesza and B

    A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) , pages 1193–1200, 2011

  42. [42]

    C. L. Lawson and R. J. Hanson. Solving Least Squares Problems . Society for Industrial and Applied Mathematics, 1995

  43. [43]

    P. Ma, M. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. In International Conference on Machine Learning , pages 91–99. PMLR, 2014

  44. [44]

    W. F. Mascarenhas. On the convergence of the Jacobi method for arbitrary orderings. SIAM Journal on Matrix Analysis and Applications , 16(4):1197–1209, 1995

  45. [45]

    Matejaˇ s

    J. Matejaˇ s. Accuracy of the Jacobi method on scaled diagonally dominant symmetric matrices. SIAM journal on matrix analysis and applications , 31(1):133–153, 2009

  46. [46]

    N. Megiddo. Applying parallel computation algorithms in the design of serial algorithms. Journal of the ACM (JACM) , 30(4):852–865, 1983

  47. [47]

    Murray, J

    R. Murray, J. Demmel, M. Mahoney, N. B. Erichson, M. Melnichenko, O. A. Malik, L. Grigori, P. Luszczek, M. Derezinski, M. E. Lopes, T. Liang, H. Luo, and J. Dongarra. Randomized numerical lin- ear algebra: A perspective on the field with an eye to software. Technical Report UCB/EECS-2023-19, EECS Department, University of California, Berkeley, 2023

  48. [48]

    Nakatsukasa and N

    Y. Nakatsukasa and N. Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM J. Sci. Comp. , 35(3), 2013

  49. [49]

    J. Poulson. High-performance sampling of generic determinantal point processes. Philosophical Trans- actions of the Royal Society A , 378(2166):20190059, 2020

  50. [50]

    Y. Saad. Revisiting the (block) Jacobi subspace rotation method for the symmetric eigenvalue problem. Numerical Algorithms, 92(1):917–944, 2023

  51. [51]

    A. Sameh. On Jacobi and Jacobi-like algorithms for parallel computers. Math. Comp., 25(115):579–590, July 1971

  52. [52]

    Scott, O

    J. Scott, O. Holtz, and O. Schwartz. Matrix multiplication I/O-complexity by path routing. In Pro- ceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures , SPAA ’15, pages 35–45, New York, NY, USA, 2015. ACM

  53. [53]

    R. Shah. Hermitian diagonalization in linear precision. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , pages 5599–5615. 29

  54. [54]

    Shiri and G

    A. Shiri and G. K. Khosroshahi. An FPGA implementation of singular value decomposition. In 2019 27th Iranian Conference on Electrical Engineering (ICEE) , pages 416–422. IEEE, 2019

  55. [55]

    Shroff and R

    G. Shroff and R. Schreiber. On the convergence of the cyclic Jacobi method for parallel block orderings. SIAM J. Mat. Anal. Appl. , 10(3):326–346, 1989

  56. [56]

    A. Sobczyk. Deterministic complexity analysis of Hermitian eigenproblems. arXiv:2410.21550, 2024

  57. [57]

    Solomonik and J

    E. Solomonik and J. Demmel. Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In E. Jeannot, R. Namyst, and J. Roman, editors, Euro-Par 2011 Parallel Processing, volume 6853 of Lecture Notes in Computer Science , pages 90–109. Springer Berlin Heidel- berg, 2011

  58. [58]

    Strassen

    V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik , 13:354–356, 1969. 10.1007/BF02165411

  59. [59]

    F. Tisseur. Parallel implementation of the Yau and Lu method for eigenvalue computation. The International Journal of Supercomputer Applications and High Performance Computing , 11(3):197–204, 1997

  60. [60]

    S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM Journal on Matrix Analysis and Applications, 18(4):1065–1081, 1997

  61. [61]

    Van Loan

    C. Van Loan. The block Jacobi method for computing the singular value decomposition. Technical report, Cornell University, USA, 1985

  62. [62]

    LL(1 : ⌊ n 2 ⌋, : ) 0 PRLL(⌊ n 2 ⌋ + 1 : m, : ) LR # ; U =

    S.-T. Yau and Y. Y. Lu. Reducing the symmetric matrix eigenvalue problem to matrix multiplications. SIAM Journal on Scientific Computing , 14(1):121–136, 1993. A Fast, Recursive LUPP This appendix contains pseudocode for the recursive LUPP routine used to guarantee convergence in blocked/recursive Jacobi (i.e., Algorithm 6). This also appears in section 4...