pith. sign in

arxiv: 1304.3162 · v4 · pith:4RSXYYUCnew · submitted 2013-04-10 · 💻 cs.DS · cs.DC

Principal Component Analysis and Higher Correlations for Distributed Data

classification 💻 cs.DS cs.DC
keywords datacomputingproblemsalgorithmscommunicationcorrelationsfunctionldots
0
0 comments X
read the original abstract

We consider algorithmic problems in the setting in which the input data has been partitioned arbitrarily on many servers. The goal is to compute a function of all the data, and the bottleneck is the communication used by the algorithm. We present algorithms for two illustrative problems on massive data sets: (1) computing a low-rank approximation of a matrix $A=A^1 + A^2 + \ldots + A^s$, with matrix $A^t$ stored on server $t$ and (2) computing a function of a vector $a_1 + a_2 + \ldots + a_s$, where server $t$ has the vector $a_t$; this includes the well-studied special case of computing frequency moments and separable functions, as well as higher-order correlations such as the number of subgraphs of a specified type occurring in a graph. For both problems we give algorithms with nearly optimal communication, and in particular the only dependence on $n$, the size of the data, is in the number of bits needed to represent indices and words ($O(\log n)$).

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.