Connected Components at Scale via Local Contractions

Jakub Łącki; Michał Włodarczyk; Vahab Mirrokni

arxiv: 1807.10727 · v1 · pith:N4P4QQ4Lnew · submitted 2018-07-27 · 💻 cs.DC · cs.DS

keywords: algorithm, graphs, mapreduce, parallel, problem, running, time, algorithms

Abstract

As a fundamental tool in hierarchical graph clustering, computing connected components has been a central problem in large-scale data mining. While many known algorithms have been developed for this problem, they are either not scalable in practice or lack strong theoretical guarantees on the parallel running time, that is, the number of communication rounds. So far, the best proven guarantee is $O(\log n)$, which matches the running time in the PRAM model. In this paper, we aim to design a distributed algorithm for this problem that works well in theory and practice. In particular, we present a simple algorithm based on contractions and provide a scalable implementation of it in MapReduce. On the theoretical side, in addition to showing $O(\log n)$ convergence for all graphs, we prove an $O(\log \log n)$ parallel running time with high probability for a certain class of random graphs. We work in the MPC model that captures popular parallel computing frameworks, such as MapReduce, Hadoop or Spark. On the practical side, we show that our algorithm outperforms the state-of-the-art MapReduce algorithms. To confirm its scalability, we report empirical results on graphs with several trillions of edges.
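To make the idea of finding components by repeated contraction concrete, here is a minimal sequential sketch in Python. It is an illustration under stated assumptions, not the authors' MapReduce implementation: it assumes that in each round every surviving node draws a random priority, points to the lowest-priority node in its closed neighborhood, and that nodes pointing to the same target are merged. The function name `connected_components`, its parameters, and the returned round count are hypothetical and chosen only for this example.

```python
import random


def connected_components(num_vertices, edges, seed=0):
    """Label components by repeated local contraction (illustrative sketch).

    Returns (comp, rounds): comp[v] is a representative id shared by all
    vertices in v's component; rounds is the number of contraction rounds.
    """
    rng = random.Random(seed)
    comp = list(range(num_vertices))   # original vertex -> current contracted node
    nodes = set(range(num_vertices))   # surviving contracted nodes
    # Undirected edges stored once, self-loops dropped.
    edge_set = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    rounds = 0

    while edge_set:
        rounds += 1
        # Assumption: priorities come from a fresh random permutation each round.
        order = list(nodes)
        rng.shuffle(order)
        priority = {v: i for i, v in enumerate(order)}

        # Each node selects the lowest-priority node in its closed neighborhood.
        label = {v: v for v in nodes}
        for u, v in edge_set:
            if priority[v] < priority[label[u]]:
                label[u] = v
            if priority[u] < priority[label[v]]:
                label[v] = u

        # Contract: nodes with the same label become one node named by that label.
        comp = [label[c] for c in comp]
        nodes = set(label.values())
        edge_set = {
            (min(label[u], label[v]), max(label[u], label[v]))
            for u, v in edge_set
            if label[u] != label[v]
        }

    return comp, rounds
```

Each round maps every edge to an edge (or a single node) of the contracted graph, so connectivity is preserved, and every component with at least two nodes loses at least one node per round, so the loop terminates. In a distributed setting, one round of the loop would correspond to a small number of communication rounds; this sketch makes no attempt to model that cost.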

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Deduplicating Training Data Makes Language Models Better
cs.CL 2021-07 unverdicted novelty 6.0

Deduplicating training datasets reduces language model verbatim memorization by 10x, improves training efficiency, and enables more accurate evaluation by cutting train-test overlap.