Out-of-core singular value decomposition

Miroslav Ba\v{c}\'ak; Stefan Bordag; Vadim Demchik

arxiv: 1907.06470 · v1 · pith:IRHWLNPGnew · submitted 2019-07-15 · 💻 cs.MS · cs.NA· math.NA

Out-of-core singular value decomposition

Vadim Demchik , Miroslav Ba\v{c}\'ak , Stefan Bordag This is my paper

Pith reviewed 2026-05-24 21:18 UTC · model grok-4.3

classification 💻 cs.MS cs.NAmath.NA

keywords singular value decompositionout-of-core algorithmsrandomized SVDlarge-scale matrix factorizationparallel computingexternal memory algorithms

0 comments

The pith

An out-of-core randomized SVD factors arbitrarily large dense and sparse matrices using limited memory and external storage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an implementation of randomized singular value decomposition that operates out-of-core, allowing it to factor matrices too large to fit in main memory by relying on external storage such as HDDs or SSDs. The solution is designed to be fully scalable and efficiently parallelizable while supporting both dense and sparse matrices. It introduces a partitioning technique that enables memory and I/O planning, automatic load balancing, and the ability to resume interrupted computations without recalculating prior steps. A sympathetic reader would care because standard SVD methods assume fast random access or large intermediate storage, which fails for the very large matrices common in machine learning, data science, and signal processing.

Core claim

The central claim is that an out-of-core randomized SVD solution factors both dense and sparse matrices of arbitrarily large size within arbitrarily small memory limits by using an innovative technique for partitioning matrices that supports out-of-core and parallel processing, memory and I/O use planning, automatic load balancing, performance tuning, and resumption of interrupted operations from persistent external storage.

What carries the argument

The innovative technique for partitioning matrices that supports out-of-core processing, parallelization, and resumption from external storage.

If this is right

Very large matrices in applications like machine learning can be factored on commodity hardware without high-memory requirements.
Interrupted computations on large matrices can resume from the last completed partition without restarting from scratch.
The same partitioning supports automatic load balancing and performance tuning across parallel processes.
The approach applies to both dense and sparse matrices, removing the need for separate in-core algorithms for each type.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The partitioning approach might extend to other randomized linear algebra routines such as QR or eigenvalue decompositions.
Commodity hardware could handle SVD tasks previously requiring distributed clusters if the I/O efficiency holds.
Resumption from disk could change workflows for long-running factorizations in production data pipelines.

Load-bearing premise

The partitioning technique enables efficient out-of-core operations and parallelization without I/O or coordination overhead that would prevent scaling to arbitrarily large matrices in arbitrarily small memory.

What would settle it

Running the method on a matrix whose size exceeds available RAM plus disk capacity and observing that it either fails to complete or requires memory proportional to the full matrix size would falsify the central claim.

Figures

Figures reproduced from arXiv: 1907.06470 by Miroslav Ba\v{c}\'ak, Stefan Bordag, Vadim Demchik.

**Figure 1.** Figure 1: Sparse matrix representation 3.7. Multi-threading. ExB SVD Library uses the POSIX pthreads execution model to provide efficient process threading. Users can determine the number of threads at run-time, or at any stage during processing, including setting the number of threads used for individual operations or matrices. To avoid inefficiencies caused by threading overhead, there is an internal automatic re-… view at source ↗

**Figure 2.** Figure 2: displays the processing times calculating the approximated SVD factorizations of these two matrices, as a function of the memory limit we allocated to it. The experiments were performed on a dedicated system with dual Intel Xeon E5-2640 v4 @2.40GHz CPUs, 512GB RAM as core memory, and a 2TB NVMe-SSD as out-of-core storage. Memory limits were allocated per matrix (see Parameter 1 in Section 3.5) and not as a… view at source ↗

**Figure 3.** Figure 3: Image reconstruction via low rank SVD with and without power iterations. 4.2.1. Power iteration [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Rank 50 image reconstruction with different precisions. All image compression experiments in this section were performed with the memory allocation restricted to 1KB per matrix – less than the memory required for a single row or column of the source image – in order to demonstrate the implementation’s ability to function under very extreme memory restrictions. 5. Conclusions SVD is time-honored and well-e… view at source ↗

read the original abstract

Singular value decomposition (SVD) is a standard matrix factorization technique that produces optimal low-rank approximations of matrices. It has diverse applications, including machine learning, data science and signal processing. However, many common problems involve very large matrices that cannot fit in the main memory of commodity computers, making it impractical to use standard SVD algorithms that assume fast random access or large amounts of space for intermediate calculations. To address this issue, we have implemented an out-of-core (external memory) randomized SVD solution that is fully scalable and efficiently parallelizable. This solution factors both dense and sparse matrices of arbitrarily large size within arbitrarily small memory limits, efficiently using out-of-core storage as needed. It uses an innovative technique for partitioning matrices that lends itself to out-of-core and parallel processing, as well as memory and I/O use planning, automatic load balancing, performance tuning, and makes possible a number of other practical enhancements to the current state-of-the-art. Furthermore, by using persistent external storage (generally HDDs or SSDs), users can resume interrupted operations without having to recalculate previously performed steps, solving a major practical problem in factoring very large matrices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper describes a practical out-of-core randomized SVD implementation with a new partitioning scheme for resumability, but the abstract supplies no benchmarks or accuracy data to support the scalability claims.

read the letter

The main point is that the authors have coded an out-of-core randomized SVD that claims to factor dense or sparse matrices of any size while keeping memory use bounded, using disk as needed and a partitioning method that supports resuming and load balancing. That addresses a real workflow issue for people running SVD on commodity hardware when matrices exceed RAM. The partitioning approach appears to be the actual novelty, extending standard randomized SVD by making external storage and parallelism more straightforward than basic block methods. It also adds resumability, which is a useful engineering touch for long runs. The paper does a reasonable job framing the problem and listing the practical features the technique enables. The central claim holds together on its own terms without obvious circularity or hidden assumptions about matrix structure. The soft spot is the complete lack of evidence in the abstract: no timing results, no accuracy checks against in-memory SVD, no I/O volume measurements, and no discussion of how the partitioning actually performs at scale. Without those, it is impossible to tell whether the method delivers on the arbitrary-size, arbitrary-small-memory promise or whether there are hidden costs in practice. This is an implementation paper aimed at practitioners in machine learning and data analysis who need to handle oversized matrices without buying more RAM. A reader looking for code-level ideas on external-memory linear algebra would get something out of the partitioning details if they are described clearly in the full text. It deserves peer review because the problem is common and the approach is concrete, even though the current draft would need experiments and validation before acceptance.

Referee Report

1 major / 0 minor

Summary. The paper claims to have implemented an out-of-core randomized SVD algorithm that factors dense and sparse matrices of arbitrary size using arbitrarily small main memory by leveraging external storage (HDDs/SSDs), a novel matrix partitioning technique for out-of-core and parallel processing, automatic load balancing, performance tuning, and resumability of interrupted computations without recalculation.

Significance. If the implementation and its claimed properties hold under empirical validation, the work would provide a practical extension of randomized SVD methods to external-memory settings, enabling low-rank approximations for matrices that exceed main memory on commodity hardware and addressing a common bottleneck in machine learning, data science, and signal processing applications.

major comments (1)

[Abstract] Abstract: the central claims of full scalability, efficient parallelizability, arbitrary matrix size within arbitrarily small memory limits, and practical enhancements via the partitioning technique are asserted without any accompanying benchmarks, accuracy measurements, I/O analysis, runtime comparisons, or validation details; this absence makes it impossible to assess whether the implementation actually achieves the stated properties.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and for highlighting the need for clearer linkage between claims and supporting evidence. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claims of full scalability, efficient parallelizability, arbitrary matrix size within arbitrarily small memory limits, and practical enhancements via the partitioning technique are asserted without any accompanying benchmarks, accuracy measurements, I/O analysis, runtime comparisons, or validation details; this absence makes it impossible to assess whether the implementation actually achieves the stated properties.

Authors: The abstract is a concise summary and does not contain the detailed empirical results; those appear in the experimental and implementation sections of the full manuscript, which include benchmarks, accuracy measurements, I/O analysis, runtime comparisons, and validation on both dense and sparse matrices. We agree that the abstract could better signal the existence of this supporting material. We will revise the abstract to reference the location of the empirical validation and, space permitting, include one or two key quantitative highlights so that readers can immediately assess the claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an engineering implementation of an out-of-core randomized SVD using a matrix partitioning technique for external-memory and parallel processing. No derivation chain, equations, or first-principles claims are present that could reduce to self-definition, fitted inputs renamed as predictions, or self-citation load-bearing steps. All performance and scalability properties are stated as direct consequences of the implemented algorithm and storage strategy, with no mathematical reductions or uniqueness theorems invoked that loop back to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the effectiveness of randomized SVD approximations and the unproven efficiency of the new partitioning scheme for out-of-core use; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Randomized SVD yields sufficiently accurate low-rank approximations for the intended applications
The method builds on randomized SVD without deriving or validating its approximation quality in the abstract.

pith-pipeline@v0.9.0 · 5735 in / 1281 out tokens · 28272 ms · 2026-05-24T21:18:00.530091+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 1 internal anchor

[1]

Dongarra, M

J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, and I. Yamazaki , The singular value decomposition: Anatomy of optimizing an algorithm for extreme scale , SIAM Rev., 60 (2018), pp. 808–865

work page 2018
[2]

Kabir, A

K. Kabir, A. Haidar, S. Tomov, A. Bouteiller, and J. Dongarra , A framework for out of memory SVD algo- rithms, in High Performance Computing, J. M. Kunkel, R. Yokota, P. Balaji, and D. Keyes, eds., Cham, 2017, Springer International Publishing, pp. 158–178

work page 2017
[3]

Haidar, K

A. Haidar, K. Kabir, D. F ayad, S. Tomov, and J. Dongarra , Out of memory SVD solver for big data , in 2017 IEEE High Performance Extreme Computing Conference (HPEC), Sept 2017, pp. 1–7

work page 2017
[4]

Rabani and S

E. Rabani and S. Toledo , Out-of-core SVD and QR decompositions , in Proceedings of the 10th SIAM Conference on Parallel Processing for Scientiﬁc Computing, Norfolk, Virginia, March 2001

work page 2001
[5]

Gates, S

M. Gates, S. Tomov and J. Dongarra , Accelerating the SVD two stage bidiagonal reduction and divide and conquer using GPUs, in Parallel Computing, 74 (2018), pp. 3–18

work page 2018
[6]

D. I. Martin, J. C. Martin, M. W. Berry, and M. Browne , Out-of-core SVD performance for document indexing , Appl. Numer. Math., 57 (2007), pp. 1230–1239

work page 2007
[7]

Y. Lu, F. Ino, and Y. Matsushita , High-performance out-of-core block randomized singular value decomposition on GPU, arXiv:1706.07191, (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[8]

Halko, P

N. Halko, P. G. Martinsson, and J. A. Tropp , Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions, SIAM Review, 53 (2011), pp. 217–288

work page 2011
[9]

Dongarra and D

J. Dongarra and D. W alker , The design of linear algebra libraries for high performance computers , tech. rep., United States Department of Energy: Oﬃce of Scientiﬁc and Technical Information, August 1993

work page 1993
[10]

Bouge, P

L. Bouge, P. Fraigniaud, A. Mignotte, and Y. Robert , Eﬃcient block cyclic data redistribution , in Euro-Par’96 Parallel Processing. Euro-Par 1996, vol. 1123 of Lecture Notes in Computer Science, 1996

work page 1996
[11]

E. F. D’Azevedo and J. J. Dongarra , The design and implementation of the parallel out-of-core scalapack lu, qr, and cholesky factorization routines , Concurrency - Practice and Experience, 12 (2000), pp. 1481–1493

work page 2000
[12]

N. P. Halko , Randomized methods for computing low-rank approximations of matrices , ProQuest LLC, Ann Arbor, MI, 2012. Thesis (Ph.D.)–University of Colorado at Boulder

work page 2012
[13]

G. H. Golub and C. F. V an Loan , Matrix Computations, Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins University Press, Baltimore, MD, fourth ed., 2013

work page 2013
[14]

I. S. Duff, A. M. Erisman, and J. K. Reid , Direct Methods for Sparse Matrices , Oxford University Press, 1986

work page 1986
[15]

T. A. Davis and Y. Hu , The University of Florida sparse matrix collection , ACM Trans. Math. Software, 38 (2011), pp. Art. 1, 25

work page 2011
[16]

R. F. Boisvert, R. Pozo, and K. A. Remington , The matrix market exchange formats: Initial design , tech. rep., US Dept of Commerce: National Institute of Standards and Technology, December 1996

work page 1996
[17]

Hooker , Centerfold of Lenna Sj¨ o¨ oblom, Playboy, (November 1972)

D. Hooker , Centerfold of Lenna Sj¨ o¨ oblom, Playboy, (November 1972). Photograph

work page 1972
[18]

Demchik , Scalable out-of-core BLAS library from scratch , In preparation, (2019)

V. Demchik , Scalable out-of-core BLAS library from scratch , In preparation, (2019). ExB Labs, Seeburgstr. 100, 04103 Leipzig, Germany E-mail address: demchik@exb.de E-mail address: bordag@exb.de 5https://www.exb.de/

work page 2019

[1] [1]

Dongarra, M

J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, and I. Yamazaki , The singular value decomposition: Anatomy of optimizing an algorithm for extreme scale , SIAM Rev., 60 (2018), pp. 808–865

work page 2018

[2] [2]

Kabir, A

K. Kabir, A. Haidar, S. Tomov, A. Bouteiller, and J. Dongarra , A framework for out of memory SVD algo- rithms, in High Performance Computing, J. M. Kunkel, R. Yokota, P. Balaji, and D. Keyes, eds., Cham, 2017, Springer International Publishing, pp. 158–178

work page 2017

[3] [3]

Haidar, K

A. Haidar, K. Kabir, D. F ayad, S. Tomov, and J. Dongarra , Out of memory SVD solver for big data , in 2017 IEEE High Performance Extreme Computing Conference (HPEC), Sept 2017, pp. 1–7

work page 2017

[4] [4]

Rabani and S

E. Rabani and S. Toledo , Out-of-core SVD and QR decompositions , in Proceedings of the 10th SIAM Conference on Parallel Processing for Scientiﬁc Computing, Norfolk, Virginia, March 2001

work page 2001

[5] [5]

Gates, S

M. Gates, S. Tomov and J. Dongarra , Accelerating the SVD two stage bidiagonal reduction and divide and conquer using GPUs, in Parallel Computing, 74 (2018), pp. 3–18

work page 2018

[6] [6]

D. I. Martin, J. C. Martin, M. W. Berry, and M. Browne , Out-of-core SVD performance for document indexing , Appl. Numer. Math., 57 (2007), pp. 1230–1239

work page 2007

[7] [7]

Y. Lu, F. Ino, and Y. Matsushita , High-performance out-of-core block randomized singular value decomposition on GPU, arXiv:1706.07191, (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[8] [8]

Halko, P

N. Halko, P. G. Martinsson, and J. A. Tropp , Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions, SIAM Review, 53 (2011), pp. 217–288

work page 2011

[9] [9]

Dongarra and D

J. Dongarra and D. W alker , The design of linear algebra libraries for high performance computers , tech. rep., United States Department of Energy: Oﬃce of Scientiﬁc and Technical Information, August 1993

work page 1993

[10] [10]

Bouge, P

L. Bouge, P. Fraigniaud, A. Mignotte, and Y. Robert , Eﬃcient block cyclic data redistribution , in Euro-Par’96 Parallel Processing. Euro-Par 1996, vol. 1123 of Lecture Notes in Computer Science, 1996

work page 1996

[11] [11]

E. F. D’Azevedo and J. J. Dongarra , The design and implementation of the parallel out-of-core scalapack lu, qr, and cholesky factorization routines , Concurrency - Practice and Experience, 12 (2000), pp. 1481–1493

work page 2000

[12] [12]

N. P. Halko , Randomized methods for computing low-rank approximations of matrices , ProQuest LLC, Ann Arbor, MI, 2012. Thesis (Ph.D.)–University of Colorado at Boulder

work page 2012

[13] [13]

G. H. Golub and C. F. V an Loan , Matrix Computations, Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins University Press, Baltimore, MD, fourth ed., 2013

work page 2013

[14] [14]

I. S. Duff, A. M. Erisman, and J. K. Reid , Direct Methods for Sparse Matrices , Oxford University Press, 1986

work page 1986

[15] [15]

T. A. Davis and Y. Hu , The University of Florida sparse matrix collection , ACM Trans. Math. Software, 38 (2011), pp. Art. 1, 25

work page 2011

[16] [16]

R. F. Boisvert, R. Pozo, and K. A. Remington , The matrix market exchange formats: Initial design , tech. rep., US Dept of Commerce: National Institute of Standards and Technology, December 1996

work page 1996

[17] [17]

Hooker , Centerfold of Lenna Sj¨ o¨ oblom, Playboy, (November 1972)

D. Hooker , Centerfold of Lenna Sj¨ o¨ oblom, Playboy, (November 1972). Photograph

work page 1972

[18] [18]

Demchik , Scalable out-of-core BLAS library from scratch , In preparation, (2019)

V. Demchik , Scalable out-of-core BLAS library from scratch , In preparation, (2019). ExB Labs, Seeburgstr. 100, 04103 Leipzig, Germany E-mail address: demchik@exb.de E-mail address: bordag@exb.de 5https://www.exb.de/

work page 2019