A 2D Parallel Triangle Counting Algorithm for Distributed-Memory Architectures
Pith reviewed 2026-05-24 17:44 UTC · model grok-4.3
The pith
A 2D cyclic decomposition balances load and cuts communication to scale triangle counting on distributed-memory systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The algorithm uses a 2D cyclic decomposition to partition the graph, arranges its communication and computation phases to lower memory overhead, and adds optimizations that take advantage of sparsity. On the tested graphs this yields an average relative speedup between 3.24 and 7.22 out of 10.56 when scaling from 16 to 169 MPI ranks and an average 10.2 times improvement over prior distributed-memory triangle counting algorithms.
What carries the argument
The 2D cyclic decomposition that partitions the adjacency matrix to balance computation and reduce inter-processor communication.
If this is right
- Triangle counting becomes practical on clusters with hundreds of nodes for graphs that exceed single-node memory.
- The same decomposition and communication ordering can be reused for other sparse matrix operations common in graph mining.
- Memory footprint stays lower than earlier distributed approaches because communication steps are structured to avoid full graph replication.
- Speedups hold across both synthetic power-law graphs and real-world networks, suggesting the method is not limited to one graph family.
Where Pith is reading between the lines
- The same partitioning idea could be tested on other graph primitives such as k-clique counting or motif enumeration.
- If the load-balancing property generalizes, the approach might extend to GPU clusters or heterogeneous nodes with modest code changes.
- A direct comparison against newer shared-memory or hybrid algorithms on the same hardware would clarify the relative gain of the 2D layout.
- The sparsity optimizations might translate to dynamic graphs where edges are added or removed over time.
Load-bearing premise
The 2D cyclic decomposition evenly spreads work and limits data movement for the synthetic and real-world graphs used in the tests.
What would settle it
Execute the algorithm on a new graph whose degree distribution causes severe load imbalance under the 2D cyclic partition, then measure whether the reported speedups disappear.
Figures
read the original abstract
Triangle counting is a fundamental graph analytic operation that is used extensively in network science and graph mining. As the size of the graphs that needs to be analyzed continues to grow, there is a requirement in developing scalable algorithms for distributed-memory parallel systems. To this end, we present a distributed-memory triangle counting algorithm, which uses a 2D cyclic decomposition to balance the computations and reduce the communication overheads. The algorithm structures its communication and computational steps such that it reduces its memory overhead and includes key optimizations that leverage the sparsity of the graph and the way the computations are structured. Experiments on synthetic and real-world graphs show that our algorithm obtains an average relative speedup that range between 3.24 and 7.22 out of 10.56 across the datasets using 169 MPI ranks over the performance achieved by 16 MPI ranks. Moreover, we obtain an average speedup of 10.2 times on comparison with previously developed distributed-memory parallel algorithms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a 2D parallel triangle counting algorithm for distributed-memory architectures that utilizes a 2D cyclic decomposition to balance computations and reduce communication overheads. The key empirical claims are an average relative speedup ranging from 3.24 to 7.22 (out of an ideal 10.56) when scaling from 16 to 169 MPI ranks across datasets, and an average 10.2× speedup compared to previously developed distributed-memory parallel algorithms.
Significance. Should the performance results be reproducible and the methodology sound, this work would offer a practical improvement in the scalability of triangle counting, a core operation in graph mining, for large-scale distributed systems. The approach's focus on sparsity and structured computation could inform future algorithm designs in the field.
major comments (1)
- [Abstract] Abstract: The abstract reports specific numerical speedups without accompanying details on the experimental methodology, such as the sizes and types of the synthetic and real-world graphs, the hardware platform, the exact implementation of the baseline algorithms, or any measures of variability in the timings. This omission makes it difficult to fully evaluate the central performance claims.
minor comments (1)
- [Abstract] Abstract: The phrase 'an average relative speedup that range between' contains a grammatical error and should be corrected to 'ranges'.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We address the major comment below.
read point-by-point responses
-
Referee: The abstract reports specific numerical speedups without accompanying details on the experimental methodology, such as the sizes and types of the synthetic and real-world graphs, the hardware platform, the exact implementation of the baseline algorithms, or any measures of variability in the timings. This omission makes it difficult to fully evaluate the central performance claims.
Authors: We agree that the abstract, due to length constraints, does not include full methodological details. The body of the manuscript provides these in Sections 4.1 (graph datasets with sizes and types), 4.2 (hardware platform), and 4.3 (baseline implementations and timing methodology). To address the concern, we will revise the abstract to briefly note the experimental setup, including the use of synthetic and real-world graphs and scaling from 16 to 169 MPI ranks. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper presents a distributed-memory triangle counting algorithm using 2D cyclic decomposition, with central claims consisting of measured empirical speedups (3.24–7.22× from 16 to 169 MPI ranks, 10.2× vs. prior algorithms) on synthetic and real-world graphs. These rest on runtime benchmarks rather than any mathematical derivation, prediction, or fitted parameter that reduces to the inputs by construction. No self-definitional steps, self-citation load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the algorithm description or results. The work is self-contained against external timing measurements.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Graphs are sparse and can be partitioned without excessive replication in a 2D cyclic manner
- standard math Standard MPI point-to-point and collective communication costs apply
Reference graph
Works this paper leans on
-
[1]
Shaikh Arifuzzaman, Maleq Khan, and Madhav Marathe. 2017. Distributed- Memory Parallel Algorithms for Counting and Listing Triangles in Big Graphs. arXiv preprint arXiv:1706.05151 (2017)
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[2]
Mauro Bisson and Massimiliano Fatica. 2018. Update on Static Graph Challenge on GPU. In 2018 IEEE High Performance extreme Computing Conference (HPEC) . IEEE, 1–8
work page 2018
-
[3]
Aydın Buluç and John R Gilbert. 2010. Highly parallel sparse matrix-matrix multiplication. arXiv preprint arXiv:1006.2183 (2010)
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[4]
Lynn E Cannon. 1969. A CELLULAR COMPUTER TO IMPLEMENT THE KALMAN FILTER ALGORITHM. Technical Report. MONTANA STATE UNIV BOZEMAN ENGINEERING RESEARCH LABS
work page 1969
-
[5]
Michelle Girvan and Mark EJ Newman. 2002. Community structure in social and biological networks. Proceedings of the national academy of sciences 99, 12 (2002), 7821–7826
work page 2002
-
[6]
Graph500. 2018. graph500. https://graph500.org
work page 2018
-
[7]
Oded Green, Pavan Yalamanchili, and Lluís-Miquel Munguía. 2014. Fast triangle counting on the GPU. In Proceedings of the 4th Workshop on Irregular Applications: Architectures and Algorithms. IEEE Press, 1–8
work page 2014
-
[8]
Yang Hu, Hang Liu, and H Howie Huang. 2018. High-Performance Triangle Counting on GPUs. In 2018 IEEE High Performance extreme Computing Conference (HPEC). IEEE, 1–5
work page 2018
-
[9]
Minnesota Supercomputing Institute. 2018. Mesabi Description. https://www. msi.umn.edu/content/mesabi
work page 2018
-
[10]
Thejaka Amila Kanewala, Marcin Zalewski, and Andrew Lumsdaine. 2018. Dis- tributed, Shared-Memory Parallel Triangle Counting. In Proceedings of the Plat- form for Advanced Scientific Computing Conference . ACM, 5
work page 2018
-
[11]
Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media?. InProceedings of the 19th international conference on World wide web. ACM, 591–600
work page 2010
-
[12]
Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. 2014. Grappa: A latency-tolerant runtime for large-scale irregular applications. In International Workshop on Rack-Scale Computing (WRSC w/EuroSys)
work page 2014
-
[13]
Sindhuja Parimalarangan, George M Slota, and Kamesh Madduri. 2017. Fast Parallel Triad Census and Triangle Listing on Shared-Memory Platforms. In Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2017 IEEE International. IEEE
work page 2017
-
[14]
Roger Pearce. 2017. Triangle counting for scale-free graphs at scale in distributed memory. In 2017 IEEE High Performance Extreme Computing Conference (HPEC) . IEEE, 1–4
work page 2017
-
[15]
R. Pearce, M. Gokhale, and N. M. Amato. 2013. Scaling Techniques for Mas- sive Scale-Free Graphs in Distributed (External) Memory. In 2013 IEEE 27th International Symposium on Parallel and Distributed Processing . 825–836. https: //doi.org/10.1109/IPDPS.2013.72
-
[16]
Adam Polak. 2016. Counting triangles in large graphs on GPU. In Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International . IEEE, 740– 746
work page 2016
-
[17]
Siddharth Samsi, Vijay Gadepally, Michael Hurley, Michael Jones, Edward Kao, Sanjeev Mohindra, Paul Monticciolo, Albert Reuther, Steven Smith, William Song, Diane Staheli, and Jeremy Kepner. 2017. Static Graph Challenge: Subgraph Isomorphism. IEEE HPEC (2017)
work page 2017
-
[18]
Nisheeth Shrivastava, Anirban Majumder, and Rajeev Rastogi. 2008. Mining (social) network graphs to detect random link attacks. In Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on . IEEE, 486–495
work page 2008
-
[19]
Julian Shun and Kanat Tangwongsan. 2015. Multicore triangle computations with- out tuning. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on. IEEE, 149–160
work page 2015
-
[20]
Shaden Smith, Xing Liu, Nesreen K Ahmed, Ancy Sarah Tom, Fabrizio Petrini, and George Karypis. 2017. Truss decomposition on shared-memory parallel systems. In High Performance Extreme Computing Conference (HPEC) . IEEE, 1–6
work page 2017
-
[21]
Ancy Sarah Tom, Narayanan Sundaram, Nesreen K Ahmed, Shaden Smith, Stijn Eyerman, Midhunchandra Kodiyath, Ibrahim Hur, Fabrizio Petrini, and George Karypis. 2017. Exploring optimizations on shared-memory platforms for parallel triangle counting algorithms. In High Performance Extreme Computing Conference (HPEC), 2017 IEEE. IEEE, 1–7
work page 2017
-
[22]
Robert A Van De Geijn and Jerrell Watts. 1997. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience 9, 4 (1997), 255–274
work page 1997
-
[23]
Leyuan Wang, Yangzihao Wang, Carl Yang, and John D Owens. 2016. A compar- ative study on exact triangle counting algorithms on the GPU. In Proceedings of the ACM Workshop on High Performance Graph Processing . ACM, 1–8
work page 2016
-
[24]
Duncan J Watts and Steven H Strogatz. 1998. Collective dynamics of small-world networks. nature 393, 6684 (1998), 440–442
work page 1998
-
[25]
Abdurrahman Yaşar, Sivasankaran Rajamanickam, Michael Wolf, Jonathan Berry, and Ümit V Çatalyürek. 2018. Fast Triangle Counting Using Cilk. In 2018 IEEE High Performance extreme Computing Conference (HPEC) . IEEE, 1–7
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.