Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

Cameron Musco; Christopher Musco; Madalina Persu; Michael B. Cohen; Sam Elder

arxiv: 1410.6801 · v3 · pith:4AUFBBKBnew · submitted 2014-10-24 · 💻 cs.DS · cs.LG

Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

Michael B. Cohen , Sam Elder , Cameron Musco , Christopher Musco , Madalina Persu This is my paper

classification 💻 cs.DS cs.LG

keywords approximateapproximationepsilonmeansclusteringdataproblemsachieve

0 comments

read the original abstract

We show how to approximate a data matrix $\mathbf{A}$ with a much smaller sketch $\mathbf{\tilde A}$ that can be used to solve a general class of constrained k-rank approximation problems to within $(1+\epsilon)$ error. Importantly, this class of problems includes $k$-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just $O(k)$ dimensions, our methods generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For $k$-means dimensionality reduction, we provide $(1+\epsilon)$ relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only `cover' a good subspace for $\bv{A}$, but can be used directly to compute this subspace. Finally, for $k$-means clustering, we show how to achieve a $(9+\epsilon)$ approximation by Johnson-Lindenstrauss projecting data points to just $O(\log k/\epsilon^2)$ dimensions. This gives the first result that leverages the specific structure of $k$-means to achieve dimension independent of input size and sublinear in $k$.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Tight Sensitivity Bounds For Smaller Coresets
cs.LG 2019-07 unverdicted novelty 7.0

New algorithms compute provably tight sensitivity bounds for matrix rows, yielding smaller coresets for LMS approximation of affine k-subspaces via an iterative exact method and a dimensionality-reduction trick.