arxiv: 1704.07669 · v1 · pith:T7DRG3WSnew · submitted 2017-04-25 · 💻 cs.DS · cs.LG · math.NA

Single-Pass PCA of Large High-Dimensional Data

classification 💻 cs.DS · cs.LG · math.NA
keywords data · algorithm · high-dimensional · large · single-pass · compute · memory · principal
Principal component analysis (PCA) is a fundamental dimension reduction tool in statistics and machine learning. For large and high-dimensional data, computing the PCA (i.e., the singular vectors corresponding to a number of dominant singular values of the data matrix) becomes a challenging task. In this work, a single-pass randomized algorithm is proposed to compute PCA with only one pass over the data. It is suitable for processing extremely large and high-dimensional data stored in slow memory (hard disk) or the data generated in a streaming fashion. Experiments with synthetic and real data validate the algorithm's accuracy, which has orders of magnitude smaller error than an existing single-pass algorithm. For a set of high-dimensional data stored as a 150 GB file, the proposed algorithm is able to compute the first 50 principal components in just 24 minutes on a typical 24-core computer, with less than 1 GB memory cost.
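
To make the single-pass idea concrete, here is a minimal NumPy sketch of one well-known way to compute a randomized truncated SVD while reading each row block of the data exactly once. It is an illustration under stated assumptions, not necessarily the algorithm proposed in the paper: the function name `single_pass_pca`, the `oversample` parameter, and the row-block iterator interface are hypothetical, and the data is assumed to be already mean-centered. The sketch accumulates G = AΩ and H = AᵀG during the pass, then uses the identity QᵀA = R⁻ᵀHᵀ (where G = QR) so that the data never has to be revisited.

```python
import numpy as np

def single_pass_pca(row_blocks, n_features, k, oversample=10, seed=0):
    """Approximate the top-k singular triplets of A, reading A once.

    row_blocks : iterable of 2-D arrays, each a block of rows of A
                 (assumed mean-centered; an assumption of this sketch).
    n_features : number of columns of A.
    k          : number of principal components to return.
    oversample : extra random directions used to improve accuracy.
    """
    rng = np.random.default_rng(seed)
    l = k + oversample
    omega = rng.standard_normal((n_features, l))  # random test matrix

    g_blocks = []                      # rows of G = A @ omega
    h = np.zeros((n_features, l))      # running sum H = A.T @ G

    for block in row_blocks:           # the only pass over the data
        block = np.asarray(block, dtype=float)
        g = block @ omega
        h += block.T @ g
        g_blocks.append(g)

    G = np.vstack(g_blocks)            # m x l, small compared with A
    Q, R = np.linalg.qr(G)             # G = Q R, Q has orthonormal columns

    # B = Q.T @ A = inv(R.T) @ G.T @ A = inv(R.T) @ H.T, no second pass.
    # This step assumes R is well conditioned (true for generic data
    # whose numerical rank is at least l).
    B = np.linalg.solve(R.T, h.T)      # l x n

    U_hat, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_hat
    return U[:, :k], s[:k], Vt[:k]
```

The rows of the returned `Vt` approximate the principal directions, and `s` approximates the leading singular values. Only G (m × l), H (n × l), and one row block are held in memory at a time, which is what makes this pattern attractive for data that is streamed or stored on disk. Solving with R can lose accuracy when G is nearly rank-deficient, so a production implementation would need a more careful treatment of that step.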