arXiv: 2601.17979 · v2 · submitted 2026-01-25 · 💻 cs.MS

An Efficient Batch Solver for the Singular Value Decomposition on GPUs

Pith reviewed 2026-05-16 11:31 UTC · model grok-4.3

classification 💻 cs.MS
keywords singular value decomposition · batch SVD · GPU computing · one-sided Jacobi · numerical linear algebra · high-performance computing · parallel algorithms · matrix decompositions

The pith

A GPU batch SVD solver based on the one-sided Jacobi algorithm achieves significant speedups over vendor and open-source methods while remaining numerically robust.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a dedicated solver for handling many small singular value decomposition problems simultaneously on GPUs. It starts with the one-sided Jacobi algorithm because this method supports fine-grained parallelism and then applies a sequence of algorithmic and design optimizations to improve throughput. Experiments confirm the solver handles matrices with different conditioning, shapes, and floating-point precisions without accuracy loss. Benchmarks on NVIDIA and AMD hardware show clear performance gains compared with existing libraries, which matters for applications that repeatedly compute SVDs on modest-sized inputs.

Core claim

The central claim is that the one-sided Jacobi algorithm, when combined with targeted GPU-specific optimizations starting from a baseline implementation, produces a batch SVD solver that is both stable across varied problem conditions and substantially faster than current vendor and open-source alternatives on contemporary GPU platforms.

What carries the argument

One-sided Jacobi algorithm mapped to fine-grained GPU parallelism through incremental algorithmic and design optimizations applied to a baseline solver.
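
For orientation, the textbook one-sided Jacobi (Hestenes) iteration can be sketched in a few lines of plain Python. This is a sequential, unblocked illustration under stated assumptions, not the paper's GPU implementation: the function names `one_sided_jacobi_svd` and `batch_svd`, the tolerance `tol`, and the sweep limit `max_sweeps` are all hypothetical choices made for this sketch, and the input is assumed to have at least as many rows as columns.

```python
import math


def one_sided_jacobi_svd(a, tol=1e-12, max_sweeps=60):
    """Thin SVD of an m x n list-of-lists (assumes m >= n).

    Returns (u, sigma, v) such that a ~= u * diag(sigma) * v^T.
    Illustrative sketch only; not the solver described in the paper.
    """
    m, n = len(a), len(a[0])
    w = [list(map(float, row)) for row in a]  # working copy, converges to U * diag(sigma)
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # Entries of the 2 x 2 Gram matrix of columns p and q.
                alpha = sum(row[p] * row[p] for row in w)
                beta = sum(row[q] * row[q] for row in w)
                gamma = sum(row[p] * row[q] for row in w)
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns already numerically orthogonal
                rotated = True
                # Plane rotation that makes columns p and q orthogonal.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the same rotation to the working matrix and to V.
                for mat in (w, v):
                    for row in mat:
                        xp, xq = row[p], row[q]
                        row[p] = c * xp - s * xq
                        row[q] = s * xp + c * xq
        if not rotated:
            break  # a full sweep without rotations: converged
    # Column norms are the singular values; normalized columns form U.
    sigma = [math.sqrt(sum(row[j] * row[j] for row in w)) for j in range(n)]
    order = sorted(range(n), key=lambda j: -sigma[j])  # descending order
    u = [[row[j] / sigma[j] if sigma[j] > 0.0 else 0.0 for j in order] for row in w]
    v = [[row[j] for j in order] for row in v]
    return u, [sigma[j] for j in order], v


def batch_svd(batch):
    """Solve each small problem independently (serially here)."""
    return [one_sided_jacobi_svd(a) for a in batch]
```

Each rotation touches only two columns, which is why rotations on disjoint column pairs can run concurrently. A batch solver additionally processes the independent matrices in parallel, whereas `batch_svd` above simply loops over them.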

If this is right

  • Batch SVD workloads in principal component analysis and low-rank approximation can execute faster on GPUs.
  • The solver maintains accuracy across different matrix shapes, conditioning, and arithmetic precisions.
  • Significant speedups are realized on both NVIDIA and AMD GPU systems relative to existing solutions.
  • The approach supports randomized algorithms and other methods that rely on repeated small SVD computations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The optimization pattern could extend to other dense linear-algebra kernels that currently lack mature GPU batch support.
  • Adoption in scientific computing libraries would reduce wall-clock time for workflows that process thousands of modest matrices.
  • Mixed-precision variants of the same mapping might yield additional throughput for applications tolerant of lower accuracy.

Load-bearing premise

The one-sided Jacobi algorithm can be parallelized on GPUs in the manner described without losing numerical stability or accuracy.

What would settle it

A set of timing and accuracy measurements on ill-conditioned small matrices that shows either slower execution than vendor libraries or singular-value errors exceeding standard double-precision tolerances would disprove the performance and robustness claims.

Figures

Figures reproduced from arXiv: 2601.17979 by Ahmad Abdelfattah, Massimiliano Fasi.

Figure 1: Example of the parallel ordering with eight block columns. Each Jacobi sweep consists of seven iterations. Each iteration ...
Figure 2: Computing the Gram matrix using three concurrent matrix multiplications
Figure 3: Heatmaps showing the speedups for MAGMA’s own batch GEMM kernel over the vendor’s BLAS library for the use case of ...
Figure 4: Time-to-solution of MAGMA’s batch Hermitian eigensolver, with vectors computed. Results are shown for ...
Figure 5: GEMM shape for updating the singular vectors.
Figure 6: Time breakdown of the baseline batch SVD solver on the GH200 system (left) and on the MI300A APU (right). Experiments are ...
Figure 7: Performance gain of Design-2 over the baseline of the batch SVD solver on the GH200 system (left) and on the MI300A APU ...
Figure 8: Time breakdown of Design-2 of the batch SVD solver on the GH200 system (left) and on the MI300A APU (right). Experiments ...
Figure 9: Performance gain of Design-3 over the baseline of the batch SVD solver on the GH200 system (left) and on the MI300A APU ...
Figure 10: Performance gain of Design-4 over the MAGMA baseline on the GH200 system (left) and on the MI300A APU (right).
Figure 11: Numerical accuracy of the batch SVD solver using the matrix types described in Table ...
Figure 12: Performance of the batch SVD on the GH200 GPU. Results are shown for batches of 10,000 very small matrices.
Figure 13: Performance of the batch SVD on the MI300A APU. Results are shown for batches of 10,000 very small matrices.
Figure 14: Performance of the batch SVD on the GH200 system. Results are shown for batches of 1,000 square matrices.
Figure 15: Performance of the batch SVD on the MI300A APU. Results are shown for batches of 1,000 square matrices.
Figure 16: Performance of the batch SVD on the GH200 system. Results are shown for batches of 1,000 tall-skinny matrices.
Figure 17: Performance of the batch SVD on the MI300A APU. Results are shown for batches of 1,000 tall-skinny matrices.
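
The caption of Figure 1 mentions a parallel ordering in which eight block columns are processed in seven iterations per sweep. One classic way to obtain such a schedule is the round-robin (tournament) scheme, sketched below. The paper's exact ordering may differ, and the function name `round_robin_ordering` is hypothetical.

```python
def round_robin_ordering(n):
    """Return n - 1 rounds, each a list of n // 2 disjoint index pairs.

    Uses the circle method: index 0 stays fixed while the others rotate.
    Assumes n is even and at least 2.
    """
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be an even integer >= 2")
    ring = list(range(n))
    rounds = []
    for _ in range(n - 1):
        # Pair position k with position n - 1 - k.
        rounds.append(
            [tuple(sorted((ring[k], ring[n - 1 - k]))) for k in range(n // 2)]
        )
        # Keep ring[0] in place and rotate the remaining entries by one.
        ring = [ring[0], ring[-1]] + ring[1:-1]
    return rounds


if __name__ == "__main__":
    for step, pairs in enumerate(round_robin_ordering(8), start=1):
        print(step, pairs)
```

For eight columns this yields seven rounds of four disjoint pairs, and every one of the 28 possible pairs appears exactly once. Because the pairs within a round share no column, their rotations are independent and can be applied at the same time.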
Original abstract

The singular value decomposition (SVD) is a powerful tool in modern numerical linear algebra, which underpins computational methods such as principal component analysis (PCA), low-rank approximations, and randomized algorithms. Many practical scenarios require solving numerous small SVD problems, a regime generally referred to as "batch SVD". Existing programming models can handle this efficiently on parallel CPU architectures, but high-performance solutions for GPUs remain immature. A GPU-oriented batch SVD solver is introduced. This solver exploits the one-sided Jacobi algorithm to exploit fine-grained parallelism, and a number of algorithmic and design optimizations achieve unmatched performance. Starting from a baseline solver, a sequence of optimizations is applied to obtain incremental performance gains. Numerical experiments show that the new solver is robust across problems with different numerical properties, matrix shapes, and arithmetic precisions. Performance benchmarks on both NVIDIA and AMD systems show significant performance speedups over vendor solutions as well as existing open-source solvers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces a GPU-oriented batch SVD solver based on the one-sided Jacobi algorithm. It describes a sequence of algorithmic and design optimizations applied incrementally to a baseline implementation, followed by numerical experiments demonstrating robustness across varying matrix properties, shapes, condition numbers, and arithmetic precisions. Performance benchmarks on NVIDIA and AMD systems report significant speedups relative to vendor libraries and existing open-source solvers.

Significance. If the empirical results hold, the work addresses an important gap in high-performance batch SVD for GPUs, with direct relevance to PCA, low-rank approximations, and randomized algorithms in scientific computing and machine learning. The cross-vendor benchmarks and focus on fine-grained parallelism provide practical value, and the incremental optimization approach aids reproducibility of the performance gains.

minor comments (2)
  1. In the numerical experiments section, explicitly state the error metrics (e.g., relative residual norms or orthogonality measures), the precise baseline implementations compared, and any rules for excluding ill-conditioned test cases to strengthen verification of the robustness claims.
  2. Ensure all benchmark tables report both runtime and accuracy results side-by-side for each matrix shape and precision to make the tradeoff between speed and stability immediately visible.
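
The error metrics named in the first comment can be made concrete with a small sketch. It computes a relative residual, ‖A − UΣVᵀ‖_F / ‖A‖_F, and an orthogonality error, ‖QᵀQ − I‖_F, in plain Python. These are common choices assumed here for illustration; the paper's actual metrics are not stated on this page, and all function names below are hypothetical.

```python
import math


def frobenius_norm(mat):
    """Frobenius norm of a list-of-lists matrix."""
    return math.sqrt(sum(x * x for row in mat for x in row))


def relative_residual(a, u, sigma, v):
    """||A - U diag(sigma) V^T||_F / ||A||_F for a thin SVD."""
    k = len(sigma)
    diff = [
        [
            a[i][j] - sum(u[i][t] * sigma[t] * v[j][t] for t in range(k))
            for j in range(len(a[0]))
        ]
        for i in range(len(a))
    ]
    return frobenius_norm(diff) / frobenius_norm(a)


def orthogonality_error(q):
    """||Q^T Q - I||_F, which is zero when the columns of Q are orthonormal."""
    k = len(q[0])
    gram_minus_identity = [
        [
            sum(row[i] * row[j] for row in q) - (1.0 if i == j else 0.0)
            for j in range(k)
        ]
        for i in range(k)
    ]
    return frobenius_norm(gram_minus_identity)


if __name__ == "__main__":
    # Hand-built factorization: U is a rotation, V is the identity.
    u = [[0.6, -0.8], [0.8, 0.6]]
    sigma = [2.0, 1.0]
    v = [[1.0, 0.0], [0.0, 1.0]]
    a = [[1.2, -0.8], [1.6, 0.6]]
    print(relative_residual(a, u, sigma, v))
    print(orthogonality_error(u), orthogonality_error(v))
```

Reporting both quantities next to the runtime for each matrix shape and precision would make the trade-off requested in the second comment visible in a single table.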

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. We are encouraged that the incremental optimization approach and cross-vendor performance results are viewed as valuable for the community. No major comments were listed in the report, so we have no specific points requiring rebuttal or revision at this stage.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper describes an implementation of a batch SVD solver using the one-sided Jacobi algorithm, followed by a sequence of explicit algorithmic and design optimizations whose effects are measured incrementally via benchmarks. All performance and robustness claims rest on external empirical comparisons against vendor libraries and open-source solvers on NVIDIA and AMD hardware across varied matrix shapes, condition numbers, and precisions. No equations, parameters, or uniqueness claims reduce by construction to fitted inputs or self-citations; the derivation chain consists of standard algorithmic steps whose correctness is verified by direct numerical testing rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the one-sided Jacobi method maps efficiently to GPU parallelism and that the listed optimizations preserve correctness.

axioms (1)
  • domain assumption: the one-sided Jacobi algorithm admits fine-grained parallelism suitable for GPU execution.
    Invoked when the solver design is introduced in the abstract.
