A 2D Parallel Triangle Counting Algorithm for Distributed-Memory Architectures

Ancy Sarah Tom; George Karypis

arxiv: 1907.09575 · v1 · pith:XSKUKNVTnew · submitted 2019-07-22 · 💻 cs.DC

A 2D Parallel Triangle Counting Algorithm for Distributed-Memory Architectures

Ancy Sarah Tom , George Karypis This is my paper

Pith reviewed 2026-05-24 17:44 UTC · model grok-4.3

classification 💻 cs.DC

keywords triangle countingdistributed memoryparallel graph algorithms2D decompositionMPIgraph miningscalabilitynetwork analysis

0 comments

The pith

A 2D cyclic decomposition balances load and cuts communication to scale triangle counting on distributed-memory systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a new parallel algorithm for counting triangles in large graphs using distributed memory. It partitions the graph with a 2D cyclic decomposition that spreads work evenly across processors while limiting data transfers. The design also trims memory use by ordering steps to exploit graph sparsity. Tests on synthetic and real graphs report average speedups from 3.24 to 7.22 times when moving from 16 to 169 MPI ranks, plus a 10.2 times gain over earlier distributed methods. This matters for network analysis tasks that must handle graphs too big for single machines.

Core claim

The algorithm uses a 2D cyclic decomposition to partition the graph, arranges its communication and computation phases to lower memory overhead, and adds optimizations that take advantage of sparsity. On the tested graphs this yields an average relative speedup between 3.24 and 7.22 out of 10.56 when scaling from 16 to 169 MPI ranks and an average 10.2 times improvement over prior distributed-memory triangle counting algorithms.

What carries the argument

The 2D cyclic decomposition that partitions the adjacency matrix to balance computation and reduce inter-processor communication.

If this is right

Triangle counting becomes practical on clusters with hundreds of nodes for graphs that exceed single-node memory.
The same decomposition and communication ordering can be reused for other sparse matrix operations common in graph mining.
Memory footprint stays lower than earlier distributed approaches because communication steps are structured to avoid full graph replication.
Speedups hold across both synthetic power-law graphs and real-world networks, suggesting the method is not limited to one graph family.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same partitioning idea could be tested on other graph primitives such as k-clique counting or motif enumeration.
If the load-balancing property generalizes, the approach might extend to GPU clusters or heterogeneous nodes with modest code changes.
A direct comparison against newer shared-memory or hybrid algorithms on the same hardware would clarify the relative gain of the 2D layout.
The sparsity optimizations might translate to dynamic graphs where edges are added or removed over time.

Load-bearing premise

The 2D cyclic decomposition evenly spreads work and limits data movement for the synthetic and real-world graphs used in the tests.

What would settle it

Execute the algorithm on a new graph whose degree distribution causes severe load imbalance under the 2D cyclic partition, then measure whether the reported speedups disappear.

Figures

Figures reproduced from arXiv: 1907.09575 by Ancy Sarah Tom, George Karypis.

**Figure 1.** Figure 1: Efficiency achieved by our algorithm in the preprocessing step (labeled as “ppt”), the triangle counting step (labeled as [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗

**Figure 2.** Figure 2: This plot corresponds to the average operation rate [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: This plot corresponds to the fraction of time (in [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

read the original abstract

Triangle counting is a fundamental graph analytic operation that is used extensively in network science and graph mining. As the size of the graphs that needs to be analyzed continues to grow, there is a requirement in developing scalable algorithms for distributed-memory parallel systems. To this end, we present a distributed-memory triangle counting algorithm, which uses a 2D cyclic decomposition to balance the computations and reduce the communication overheads. The algorithm structures its communication and computational steps such that it reduces its memory overhead and includes key optimizations that leverage the sparsity of the graph and the way the computations are structured. Experiments on synthetic and real-world graphs show that our algorithm obtains an average relative speedup that range between 3.24 and 7.22 out of 10.56 across the datasets using 169 MPI ranks over the performance achieved by 16 MPI ranks. Moreover, we obtain an average speedup of 10.2 times on comparison with previously developed distributed-memory parallel algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents a 2D parallel triangle counting algorithm for distributed-memory architectures that utilizes a 2D cyclic decomposition to balance computations and reduce communication overheads. The key empirical claims are an average relative speedup ranging from 3.24 to 7.22 (out of an ideal 10.56) when scaling from 16 to 169 MPI ranks across datasets, and an average 10.2× speedup compared to previously developed distributed-memory parallel algorithms.

Significance. Should the performance results be reproducible and the methodology sound, this work would offer a practical improvement in the scalability of triangle counting, a core operation in graph mining, for large-scale distributed systems. The approach's focus on sparsity and structured computation could inform future algorithm designs in the field.

major comments (1)

[Abstract] Abstract: The abstract reports specific numerical speedups without accompanying details on the experimental methodology, such as the sizes and types of the synthetic and real-world graphs, the hardware platform, the exact implementation of the baseline algorithms, or any measures of variability in the timings. This omission makes it difficult to fully evaluate the central performance claims.

minor comments (1)

[Abstract] Abstract: The phrase 'an average relative speedup that range between' contains a grammatical error and should be corrected to 'ranges'.

Simulated Author's Rebuttal

1 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address the major comment below.

read point-by-point responses

Referee: The abstract reports specific numerical speedups without accompanying details on the experimental methodology, such as the sizes and types of the synthetic and real-world graphs, the hardware platform, the exact implementation of the baseline algorithms, or any measures of variability in the timings. This omission makes it difficult to fully evaluate the central performance claims.

Authors: We agree that the abstract, due to length constraints, does not include full methodological details. The body of the manuscript provides these in Sections 4.1 (graph datasets with sizes and types), 4.2 (hardware platform), and 4.3 (baseline implementations and timing methodology). To address the concern, we will revise the abstract to briefly note the experimental setup, including the use of synthetic and real-world graphs and scaling from 16 to 169 MPI ranks. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents a distributed-memory triangle counting algorithm using 2D cyclic decomposition, with central claims consisting of measured empirical speedups (3.24–7.22× from 16 to 169 MPI ranks, 10.2× vs. prior algorithms) on synthetic and real-world graphs. These rest on runtime benchmarks rather than any mathematical derivation, prediction, or fitted parameter that reduces to the inputs by construction. No self-definitional steps, self-citation load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the algorithm description or results. The work is self-contained against external timing measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The contribution is an algorithmic design; no free parameters, new entities, or non-standard axioms are introduced. Relies on established graph sparsity and MPI communication models.

axioms (2)

domain assumption Graphs are sparse and can be partitioned without excessive replication in a 2D cyclic manner
Invoked in the description of how the decomposition balances load and reduces memory overhead.
standard math Standard MPI point-to-point and collective communication costs apply
Underlying the claim that the structured communication steps reduce overhead.

pith-pipeline@v0.9.0 · 5692 in / 1339 out tokens · 27536 ms · 2026-05-24T17:44:33.952193+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 2 internal anchors

[1]

Shaikh Arifuzzaman, Maleq Khan, and Madhav Marathe. 2017. Distributed- Memory Parallel Algorithms for Counting and Listing Triangles in Big Graphs. arXiv preprint arXiv:1706.05151 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[2]

Mauro Bisson and Massimiliano Fatica. 2018. Update on Static Graph Challenge on GPU. In 2018 IEEE High Performance extreme Computing Conference (HPEC) . IEEE, 1–8

work page 2018
[3]

Aydın Buluç and John R Gilbert. 2010. Highly parallel sparse matrix-matrix multiplication. arXiv preprint arXiv:1006.2183 (2010)

work page internal anchor Pith review Pith/arXiv arXiv 2010
[4]

Lynn E Cannon. 1969. A CELLULAR COMPUTER TO IMPLEMENT THE KALMAN FILTER ALGORITHM. Technical Report. MONTANA STATE UNIV BOZEMAN ENGINEERING RESEARCH LABS

work page 1969
[5]

Michelle Girvan and Mark EJ Newman. 2002. Community structure in social and biological networks. Proceedings of the national academy of sciences 99, 12 (2002), 7821–7826

work page 2002
[6]

Graph500. 2018. graph500. https://graph500.org

work page 2018
[7]

Oded Green, Pavan Yalamanchili, and Lluís-Miquel Munguía. 2014. Fast triangle counting on the GPU. In Proceedings of the 4th Workshop on Irregular Applications: Architectures and Algorithms. IEEE Press, 1–8

work page 2014
[8]

Yang Hu, Hang Liu, and H Howie Huang. 2018. High-Performance Triangle Counting on GPUs. In 2018 IEEE High Performance extreme Computing Conference (HPEC). IEEE, 1–5

work page 2018
[9]

Minnesota Supercomputing Institute. 2018. Mesabi Description. https://www. msi.umn.edu/content/mesabi

work page 2018
[10]

Thejaka Amila Kanewala, Marcin Zalewski, and Andrew Lumsdaine. 2018. Dis- tributed, Shared-Memory Parallel Triangle Counting. In Proceedings of the Plat- form for Advanced Scientific Computing Conference . ACM, 5

work page 2018
[11]

Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media?. InProceedings of the 19th international conference on World wide web. ACM, 591–600

work page 2010
[12]

Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. 2014. Grappa: A latency-tolerant runtime for large-scale irregular applications. In International Workshop on Rack-Scale Computing (WRSC w/EuroSys)

work page 2014
[13]

Sindhuja Parimalarangan, George M Slota, and Kamesh Madduri. 2017. Fast Parallel Triad Census and Triangle Listing on Shared-Memory Platforms. In Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2017 IEEE International. IEEE

work page 2017
[14]

Roger Pearce. 2017. Triangle counting for scale-free graphs at scale in distributed memory. In 2017 IEEE High Performance Extreme Computing Conference (HPEC) . IEEE, 1–4

work page 2017
[15]

Pearce, M

R. Pearce, M. Gokhale, and N. M. Amato. 2013. Scaling Techniques for Mas- sive Scale-Free Graphs in Distributed (External) Memory. In 2013 IEEE 27th International Symposium on Parallel and Distributed Processing . 825–836. https: //doi.org/10.1109/IPDPS.2013.72

work page doi:10.1109/ipdps.2013.72 2013
[16]

Adam Polak. 2016. Counting triangles in large graphs on GPU. In Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International . IEEE, 740– 746

work page 2016
[17]

Siddharth Samsi, Vijay Gadepally, Michael Hurley, Michael Jones, Edward Kao, Sanjeev Mohindra, Paul Monticciolo, Albert Reuther, Steven Smith, William Song, Diane Staheli, and Jeremy Kepner. 2017. Static Graph Challenge: Subgraph Isomorphism. IEEE HPEC (2017)

work page 2017
[18]

Nisheeth Shrivastava, Anirban Majumder, and Rajeev Rastogi. 2008. Mining (social) network graphs to detect random link attacks. In Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on . IEEE, 486–495

work page 2008
[19]

Julian Shun and Kanat Tangwongsan. 2015. Multicore triangle computations with- out tuning. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on. IEEE, 149–160

work page 2015
[20]

Shaden Smith, Xing Liu, Nesreen K Ahmed, Ancy Sarah Tom, Fabrizio Petrini, and George Karypis. 2017. Truss decomposition on shared-memory parallel systems. In High Performance Extreme Computing Conference (HPEC) . IEEE, 1–6

work page 2017
[21]

Ancy Sarah Tom, Narayanan Sundaram, Nesreen K Ahmed, Shaden Smith, Stijn Eyerman, Midhunchandra Kodiyath, Ibrahim Hur, Fabrizio Petrini, and George Karypis. 2017. Exploring optimizations on shared-memory platforms for parallel triangle counting algorithms. In High Performance Extreme Computing Conference (HPEC), 2017 IEEE. IEEE, 1–7

work page 2017
[22]

Robert A Van De Geijn and Jerrell Watts. 1997. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience 9, 4 (1997), 255–274

work page 1997
[23]

Leyuan Wang, Yangzihao Wang, Carl Yang, and John D Owens. 2016. A compar- ative study on exact triangle counting algorithms on the GPU. In Proceedings of the ACM Workshop on High Performance Graph Processing . ACM, 1–8

work page 2016
[24]

Duncan J Watts and Steven H Strogatz. 1998. Collective dynamics of small-world networks. nature 393, 6684 (1998), 440–442

work page 1998
[25]

Abdurrahman Yaşar, Sivasankaran Rajamanickam, Michael Wolf, Jonathan Berry, and Ümit V Çatalyürek. 2018. Fast Triangle Counting Using Cilk. In 2018 IEEE High Performance extreme Computing Conference (HPEC) . IEEE, 1–7

work page 2018

[1] [1]

Shaikh Arifuzzaman, Maleq Khan, and Madhav Marathe. 2017. Distributed- Memory Parallel Algorithms for Counting and Listing Triangles in Big Graphs. arXiv preprint arXiv:1706.05151 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[2] [2]

Mauro Bisson and Massimiliano Fatica. 2018. Update on Static Graph Challenge on GPU. In 2018 IEEE High Performance extreme Computing Conference (HPEC) . IEEE, 1–8

work page 2018

[3] [3]

Aydın Buluç and John R Gilbert. 2010. Highly parallel sparse matrix-matrix multiplication. arXiv preprint arXiv:1006.2183 (2010)

work page internal anchor Pith review Pith/arXiv arXiv 2010

[4] [4]

Lynn E Cannon. 1969. A CELLULAR COMPUTER TO IMPLEMENT THE KALMAN FILTER ALGORITHM. Technical Report. MONTANA STATE UNIV BOZEMAN ENGINEERING RESEARCH LABS

work page 1969

[5] [5]

Michelle Girvan and Mark EJ Newman. 2002. Community structure in social and biological networks. Proceedings of the national academy of sciences 99, 12 (2002), 7821–7826

work page 2002

[6] [6]

Graph500. 2018. graph500. https://graph500.org

work page 2018

[7] [7]

Oded Green, Pavan Yalamanchili, and Lluís-Miquel Munguía. 2014. Fast triangle counting on the GPU. In Proceedings of the 4th Workshop on Irregular Applications: Architectures and Algorithms. IEEE Press, 1–8

work page 2014

[8] [8]

Yang Hu, Hang Liu, and H Howie Huang. 2018. High-Performance Triangle Counting on GPUs. In 2018 IEEE High Performance extreme Computing Conference (HPEC). IEEE, 1–5

work page 2018

[9] [9]

Minnesota Supercomputing Institute. 2018. Mesabi Description. https://www. msi.umn.edu/content/mesabi

work page 2018

[10] [10]

Thejaka Amila Kanewala, Marcin Zalewski, and Andrew Lumsdaine. 2018. Dis- tributed, Shared-Memory Parallel Triangle Counting. In Proceedings of the Plat- form for Advanced Scientific Computing Conference . ACM, 5

work page 2018

[11] [11]

Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media?. InProceedings of the 19th international conference on World wide web. ACM, 591–600

work page 2010

[12] [12]

Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. 2014. Grappa: A latency-tolerant runtime for large-scale irregular applications. In International Workshop on Rack-Scale Computing (WRSC w/EuroSys)

work page 2014

[13] [13]

Sindhuja Parimalarangan, George M Slota, and Kamesh Madduri. 2017. Fast Parallel Triad Census and Triangle Listing on Shared-Memory Platforms. In Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2017 IEEE International. IEEE

work page 2017

[14] [14]

Roger Pearce. 2017. Triangle counting for scale-free graphs at scale in distributed memory. In 2017 IEEE High Performance Extreme Computing Conference (HPEC) . IEEE, 1–4

work page 2017

[15] [15]

Pearce, M

R. Pearce, M. Gokhale, and N. M. Amato. 2013. Scaling Techniques for Mas- sive Scale-Free Graphs in Distributed (External) Memory. In 2013 IEEE 27th International Symposium on Parallel and Distributed Processing . 825–836. https: //doi.org/10.1109/IPDPS.2013.72

work page doi:10.1109/ipdps.2013.72 2013

[16] [16]

Adam Polak. 2016. Counting triangles in large graphs on GPU. In Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International . IEEE, 740– 746

work page 2016

[17] [17]

Siddharth Samsi, Vijay Gadepally, Michael Hurley, Michael Jones, Edward Kao, Sanjeev Mohindra, Paul Monticciolo, Albert Reuther, Steven Smith, William Song, Diane Staheli, and Jeremy Kepner. 2017. Static Graph Challenge: Subgraph Isomorphism. IEEE HPEC (2017)

work page 2017

[18] [18]

Nisheeth Shrivastava, Anirban Majumder, and Rajeev Rastogi. 2008. Mining (social) network graphs to detect random link attacks. In Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on . IEEE, 486–495

work page 2008

[19] [19]

Julian Shun and Kanat Tangwongsan. 2015. Multicore triangle computations with- out tuning. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on. IEEE, 149–160

work page 2015

[20] [20]

Shaden Smith, Xing Liu, Nesreen K Ahmed, Ancy Sarah Tom, Fabrizio Petrini, and George Karypis. 2017. Truss decomposition on shared-memory parallel systems. In High Performance Extreme Computing Conference (HPEC) . IEEE, 1–6

work page 2017

[21] [21]

Ancy Sarah Tom, Narayanan Sundaram, Nesreen K Ahmed, Shaden Smith, Stijn Eyerman, Midhunchandra Kodiyath, Ibrahim Hur, Fabrizio Petrini, and George Karypis. 2017. Exploring optimizations on shared-memory platforms for parallel triangle counting algorithms. In High Performance Extreme Computing Conference (HPEC), 2017 IEEE. IEEE, 1–7

work page 2017

[22] [22]

Robert A Van De Geijn and Jerrell Watts. 1997. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience 9, 4 (1997), 255–274

work page 1997

[23] [23]

Leyuan Wang, Yangzihao Wang, Carl Yang, and John D Owens. 2016. A compar- ative study on exact triangle counting algorithms on the GPU. In Proceedings of the ACM Workshop on High Performance Graph Processing . ACM, 1–8

work page 2016

[24] [24]

Duncan J Watts and Steven H Strogatz. 1998. Collective dynamics of small-world networks. nature 393, 6684 (1998), 440–442

work page 1998

[25] [25]

Abdurrahman Yaşar, Sivasankaran Rajamanickam, Michael Wolf, Jonathan Berry, and Ümit V Çatalyürek. 2018. Fast Triangle Counting Using Cilk. In 2018 IEEE High Performance extreme Computing Conference (HPEC) . IEEE, 1–7

work page 2018