SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication

Benjamin Brock; Chen Zhuang; Du Wu; Lingqi Zhang; Mohamed Wahib; Peng Chen; Satoshi Matsuoka; Toshio Endo

arxiv: 2512.20178 · v2 · pith:7NH7VC5Inew · submitted 2025-12-23 · 💻 cs.DC · cs.PF

SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication

Chen Zhuang , Lingqi Zhang , Benjamin Brock , Du Wu , Peng Chen , Toshio Endo , Satoshi Matsuoka , Mohamed Wahib This is my paper

Pith reviewed 2026-05-16 20:35 UTC · model grok-4.3

classification 💻 cs.DC cs.PF

keywords SpMMdistributed sparse matrix multiplicationcommunication optimizationGPU clusterssparsity-aware schedulinghierarchical networksscalability

0 comments

The pith

SHIRO reduces distributed SpMM communication overhead by sending only the data needed for non-zero multiplications and prioritizing fast intra-node GPU links.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard distributed sparse matrix multiplication wastes bandwidth on transfers of data that will multiply by zero, and also fails to keep most traffic on fast links inside GPU nodes. It introduces two targeted fixes: a fine-grained plan that calculates exactly which matrix entries must cross the network, and a hierarchical scheduler that routes as much of that traffic as possible over quick intra-node connections. If these changes work as described, total communication volume drops sharply and the operation scales to far larger GPU counts before network time dominates. Readers care because SpMM is a core kernel in scientific simulations and neural network training, where communication already limits performance at scale.

Core claim

SHIRO introduces a fine-grained, sparsity-aware communication strategy that exploits the sparsity pattern of the input matrix to eliminate redundant data transfers, together with a hierarchical communication strategy that maps the resulting schedule onto two-tier GPU networks so that most movement stays on faster intra-node links. When implemented and tested on real-world sparse datasets, the combined approach yields strong scalability up to 128 GPUs, with geometric-mean speedups of 221.5× over CAGNET, 56.0× over SPA, 23.4× over BCL, and 8.8× over CoLa.

What carries the argument

Fine-grained sparsity-aware communication strategy that computes only the transfers required by actual non-zero entries, mapped onto a two-tier GPU network to favor intra-node links.

If this is right

Communication volume falls because only entries participating in non-zero multiplications are moved.
Intra-node links carry the majority of traffic, cutting exposure to slower inter-node bandwidth.
Runtime remains dominated by computation rather than communication up to at least 128 GPUs.
The same speedups appear consistently across four different existing SpMM implementations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sparsity-driven planning could be applied to other distributed sparse kernels such as SpMV.
On networks without a clear two-tier speed difference the hierarchical component would contribute little.
For extremely irregular sparsity the one-time cost of building the communication schedule might become noticeable.

Load-bearing premise

The real-world matrices tested have sparsity patterns that let the fine-grained planner remove most redundant transfers, and the hardware has substantially faster links inside nodes than between nodes.

What would settle it

Running the same code on a single-tier flat network or on matrices whose sparsity offers no exploitable structure would eliminate or reverse the reported speedups.

Figures

Figures reproduced from arXiv: 2512.20178 by Benjamin Brock, Chen Zhuang, Du Wu, Lingqi Zhang, Mohamed Wahib, Peng Chen, Satoshi Matsuoka, Toshio Endo.

**Figure 2.** Figure 2: Redundant data transfers over slow inter-group links. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of our proposed optimization strategies. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Optimal communication volume via minimum weighted [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Different sparsity patterns and communication volume [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Overview of hierarchical communication strategy. For clarity, only Group0-to-Group1 transfers are shown (reverse [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Runtime comparison (Y-axis log-scale) with baselines and strong scaling for different datasets. Due to out-of-memory (OOM) and runtime errors, some data could not be collected. The dense columns number (N) is 32. TABLE II: Sparse Matrices Datasets Matrix #rows/cols #nonzeros Density com-Youtube (com-YT) 1.1M 6.0M 4.64e-06 soc-Pokec (Pokec) 1.6M 30.6M 1.15e-05 sx-stackoverflow (sx-SO) 2.6M 36.2M 5.35e-06 so… view at source ↗

**Figure 8.** Figure 8: Global communication volume reduction by joint row-col based communication and inter-node communication volume [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: Inter-process communication patterns before and after applying joint row-column strategy across datasets: achieving [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

**Figure 11.** Figure 11: Performance of different numbers of dense columns [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗

read the original abstract

Distributed Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental operation in high-performance computing and deep learning applications. The major performance bottleneck in distributed SpMM lies in substantial communication overhead, which limits both performance and scalability. In this paper, we identify two key sources of communication inefficiency in distributed SpMM: redundant data transfer due to sparsity unawareness, and suboptimal utilization of hierarchical network topology. To address these, we propose (1) a fine-grained, sparsity-aware communication strategy that reduces communication overhead by exploiting the sparsity pattern of the sparse matrix, and (2) a hierarchical communication strategy that maps the sparsity-aware strategy onto two-tier GPU network architectures, minimizing redundant data movement across slower inter-node links. We implement these optimizations in a comprehensive distributed SpMM framework, \method{}. Extensive evaluations on real-world datasets show that \method{} demonstrates strong scalability up to 128 GPUs, achieving geometric mean speedups of 221.5$\times$, 56.0$\times$, 23.4$\times$, and 8.8$\times$ in SpMM over four state-of-the-art baselines (CAGNET, SPA, BCL, and CoLa, respectively) at this scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SHIRO reports big empirical speedups on 128 GPUs for distributed SpMM by combining sparsity-aware fine-grained transfers with two-tier hierarchical mapping, but the abstract gives almost no communication-volume data or ablations to show the gains are robust.

read the letter

The main takeaway is that this paper gets large reported speedups—221x down to 8.8x geometric mean over CAGNET, SPA, BCL, and CoLa—at 128 GPUs by avoiding redundant data movement in sparse matrix multiplication and by routing the remaining traffic to stay inside fast intra-node links. The integration of those two ideas looks like the actual new piece: prior work had sparsity hints or topology awareness separately, but the explicit fine-grained strategy plus the two-tier mapping on real GPU clusters is the contribution that shows up in the scaling numbers. The empirical results on actual datasets are the part that works best; they demonstrate concrete scaling where many distributed SpMM papers stop at smaller node counts. If the full text includes the communication volume tables and the ablation runs that separate the sparsity effect from the topology effect, that evidence would be solid enough to stand on. The soft spots sit exactly where the stress-test note points. The abstract supplies no measured communication volumes, no lower-bound comparisons, and no runs on uniform-random matrices or flat networks, so it is impossible to tell how much of the speedup collapses when the sparsity pattern does not cooperate or when intra-node links are not dramatically faster. Those omissions make the central performance claims hard to evaluate without the full manuscript. This work is aimed at people who already tune distributed sparse linear algebra on GPU clusters; a reader who needs practical speedups for large-scale SpMM will find the framework and the scaling curves useful even if they later have to re-verify the numbers. It is worth sending to peer review because the claims are specific, the baselines are standard, and the approach is concrete enough that referees can check the missing measurements directly.

Referee Report

2 major / 1 minor

Summary. The manuscript presents SHIRO, a distributed SpMM framework that introduces a fine-grained sparsity-aware communication strategy to reduce redundant transfers by exploiting matrix sparsity patterns and a hierarchical mapping strategy to exploit two-tier GPU network topologies by minimizing inter-node data movement. It reports empirical results on real-world datasets showing strong scalability to 128 GPUs with geometric mean speedups of 221.5×, 56.0×, 23.4×, and 8.8× over CAGNET, SPA, BCL, and CoLa respectively.

Significance. If the reported speedups are reproducible and generalizable, the work would represent a meaningful advance in communication-efficient distributed sparse linear algebra, directly addressing bottlenecks relevant to HPC and large-scale machine learning. The emphasis on sparsity awareness and topology hierarchy is well-motivated, though the magnitude of gains appears tightly coupled to the evaluated sparsity patterns and network assumptions.

major comments (2)

[Abstract and Evaluation] Abstract and Evaluation section: the headline geometric-mean speedups (221.5×–8.8× at 128 GPUs) are presented without accompanying communication-volume tables, lower-bound comparisons, error bars, or ablation results on uniform-random sparsity or flat networks; this leaves the central claim dependent on unverified conditions that the fine-grained strategy eliminates most redundant transfers and that intra-node links are substantially faster than inter-node links.
[§3.2] §3.2 (Hierarchical Communication Strategy): the mapping assumes a two-tier topology with fast intra-node interconnects, yet no topology-ablated experiments or results on single-tier networks are provided to quantify how much of the reported gain collapses when this assumption is removed.

minor comments (1)

[Abstract] Abstract: the four baseline acronyms (CAGNET, SPA, BCL, CoLa) are not expanded on first use, which reduces readability for readers outside the immediate subfield.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested supporting data and experiments.

read point-by-point responses

Referee: [Abstract and Evaluation] Abstract and Evaluation section: the headline geometric-mean speedups (221.5×–8.8× at 128 GPUs) are presented without accompanying communication-volume tables, lower-bound comparisons, error bars, or ablation results on uniform-random sparsity or flat networks; this leaves the central claim dependent on unverified conditions that the fine-grained strategy eliminates most redundant transfers and that intra-node links are substantially faster than inter-node links.

Authors: We agree that these elements would strengthen the presentation. In the revised manuscript we will add communication-volume tables for all baselines, lower-bound comparisons derived from the sparsity patterns, error bars from repeated runs, and ablation results on uniform-random sparsity matrices as well as on flat single-tier networks. These additions will directly verify the conditions under which the fine-grained and hierarchical strategies deliver the reported gains. revision: yes
Referee: [§3.2] §3.2 (Hierarchical Communication Strategy): the mapping assumes a two-tier topology with fast intra-node interconnects, yet no topology-ablated experiments or results on single-tier networks are provided to quantify how much of the reported gain collapses when this assumption is removed.

Authors: We acknowledge that the current evaluation focuses on two-tier topologies. To quantify the dependence, we will add new experiments on single-tier networks (uniform link speeds) in the revised version. These results will show the performance contribution of the hierarchical mapping and the extent to which gains diminish when the intra-node versus inter-node speed difference is removed. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on benchmarks, not derivations

full rationale

The paper contains no equations, derivations, or mathematical claims that could reduce to self-definition or fitted inputs. Its central results are reported geometric-mean speedups from direct runtime measurements on 128-GPU runs against four external baselines; these numbers are not obtained by fitting parameters to a subset of the same data and then relabeling the fit as a prediction. No self-citation is invoked to justify a uniqueness theorem or to smuggle an ansatz; the two proposed strategies are described at the algorithmic level and then evaluated experimentally. Because the derivation chain is empty and the performance claims are externally falsifiable via replication on the same hardware and matrices, the analysis registers no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all performance claims are presented as empirical outcomes of the proposed strategies.

pith-pipeline@v0.9.0 · 5533 in / 1141 out tokens · 22872 ms · 2026-05-16T20:35:27.282934+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

joint row-column communication strategy... minimum weighted vertex cover problem... bipartite graph G=(R∪C,E)
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

hierarchical communication strategy... two-tier GPU network... inter-group links

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 3 internal anchors

[1]

All-pairs shortest paths computation in the bsp model,

A. Tiskin, “All-pairs shortest paths computation in the bsp model,” in International Colloquium on Automata, Languages, and Programming. Springer, 2001, pp. 178–189

work page 2001
[2]

Rdma-based algorithms for sparse matrix multiplication on gpus,

B. Brock, A. Bulu c ¸, and K. Yelick, “Rdma-based algorithms for sparse matrix multiplication on gpus,” inProceedings of the 38th ACM International Conference on Supercomputing, 2024, pp. 225–235

work page 2024
[3]

The block conjugate gradient algorithm and related methods,

D. P. O’Leary, “The block conjugate gradient algorithm and related methods,”Linear algebra and its applications, vol. 29, pp. 293–322, 1980

work page 1980
[4]

A shifted block lanczos algorithm for solving sparse symmetric generalized eigenproblems,

R. G. Grimes, J. G. Lewis, and H. D. Simon, “A shifted block lanczos algorithm for solving sparse symmetric generalized eigenproblems,”SIAM Journal on Matrix Analysis and Applications, vol. 15, no. 1, pp. 228–272, 1994

work page 1994
[5]

A block arnoldi-chebyshev method for computing the leading eigenpairs of large sparse unsymmetric matrices,

M. Sadkane, “A block arnoldi-chebyshev method for computing the leading eigenpairs of large sparse unsymmetric matrices,”Numerische mathematik, vol. 64, no. 1, pp. 181–193, 1993

work page 1993
[6]

Block krylov space methods for linear systems with multiple right-hand sides: an introduction,

M. H. Gutknecht, “Block krylov space methods for linear systems with multiple right-hand sides: an introduction,” inModern mathematical models, methods and algorithms for real world systems. Anshan, 2007, pp. 420–447

work page 2007
[7]

Neural message passing for quantum chemistry,

J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” inProceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y . W. Teh, Eds., vol. 70. PMLR, 2017, pp. 1263–1272

work page 2017
[8]

Fast Graph Representation Learning with PyTorch Geometric

M. Fey and J. E. Lenssen, “Fast graph representation learning with PyTorch Geometric,”arXiv preprint arXiv:1903.02428, 2019, iCLR Workshop on Representation Learning on Graphs and Manifolds

work page internal anchor Pith review Pith/arXiv arXiv 1903
[9]

Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks

M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y . Gai, T. Xiao, T. He, G. Karypis, J. Li, and Z. Zhang, “Deep graph library: A graph-centric, highly-performant package for graph neural networks,”arXiv preprint arXiv:1909.01315, 2019

work page internal anchor Pith review arXiv 1909
[10]

Neural collab- orative filtering,

X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, “Neural collab- orative filtering,” inProceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 173–182

work page 2017
[11]

Knowledge graph embedding: A survey of approaches and applications,

Q. Wang, Z. Mao, B. Wang, and L. Guo, “Knowledge graph embedding: A survey of approaches and applications,”IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 12, pp. 2724–2743, 2017

work page 2017
[12]

Sparse gpu kernels for deep learning,

T. Gale, M. Zaharia, C. Young, and E. Elsen, “Sparse gpu kernels for deep learning,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–14

work page 2020
[13]

Sparsetir: Composable abstractions for sparse compilation in deep learning,

Z. Ye, R. Lai, J. Shao, T. Chen, and L. Ceze, “Sparsetir: Composable abstractions for sparse compilation in deep learning,” inProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2023, pp. 660–678

work page 2023
[14]

Dtc-spmm: Bridging the gap in accelerating general sparse matrix multiplication with tensor cores,

R. Fan, W. Wang, and X. Chu, “Dtc-spmm: Bridging the gap in accelerating general sparse matrix multiplication with tensor cores,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024, pp. 253–267

work page 2024
[15]

{TC-GNN}: Bridging sparse {GNN} computation and dense tensor cores on {GPUs},

Y . Wang, B. Feng, Z. Wang, G. Huang, and Y . Ding, “ {TC-GNN}: Bridging sparse {GNN} computation and dense tensor cores on {GPUs},” in2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 149–164

work page 2023
[16]

A row decomposition-based approach for sparse matrix multiplication on gpus,

M. Pang, X. Fei, P. Qu, Y . Zhang, and Z. Li, “A row decomposition-based approach for sparse matrix multiplication on gpus,” inProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024, pp. 377–389

work page 2024
[17]

Solving unsymmetric sparse systems of linear equations with pardiso,

O. Schenk and K. G ¨artner, “Solving unsymmetric sparse systems of linear equations with pardiso,”Future Generation Computer Systems, vol. 20, no. 3, pp. 475–487, 2004

work page 2004
[18]

Open graph benchmark: Datasets for machine learning on graphs

W. Hu, M. Fey, M. Zitnik, Y . Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec, “Open graph benchmark: Datasets for machine learning on graphs,”arXiv preprint arXiv:2005.00687, 2020

work page arXiv 2005
[19]

SNAP Datasets: Stanford large network dataset collection,

J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection,” http://snap.stanford.edu/data, June 2014

work page 2014
[20]

Snap: A general-purpose network analysis and graph-mining library,

J. Leskovec and R. Sosi ˇc, “Snap: A general-purpose network analysis and graph-mining library,”ACM Transactions on Intelligent Systems and Technology, vol. 8, no. 1, pp. 1–20, 2016

work page 2016
[21]

Communication-avoiding parallel sparse-dense matrix-matrix multiplication,

P. Koanantakool, A. Azad, A. Buluc ¸, D. Morozov, S.-Y . Oh, L. Oliker, and K. Yelick, “Communication-avoiding parallel sparse-dense matrix-matrix multiplication,” in2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2016, pp. 842–853

work page 2016
[22]

Two- face: Combining collective and one-sided communication for efficient distributed spmm,

C. Block, G. Gerogiannis, C. Mendis, A. Azad, and J. Torrellas, “Two- face: Combining collective and one-sided communication for efficient distributed spmm,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, pp. 1200–1217

work page 2024
[23]

Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems,

S. Acer, O. Selvitopi, and C. Aykanat, “Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems,” Parallel Computing, vol. 59, pp. 71–96, 2016

work page 2016
[24]

Spcomm3d: A framework for en- abling sparse communication in 3d sparse kernels,

N. Abubaker and T. Hoefler, “Spcomm3d: A framework for en- abling sparse communication in 3d sparse kernels,”arXiv preprint arXiv:2404.19638, 2024

work page arXiv 2024
[25]

Reducing communication in graph neural network training,

A. Tripathy, K. Yelick, and A. Buluc ¸, “Reducing communication in graph neural network training,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–14

work page 2020
[26]

Sparsity-aware communication for distributed graph neural network training,

U. Mukhopadhyay, A. Tripathy, O. Selvitopi, K. Yelick, and A. Buluc, “Sparsity-aware communication for distributed graph neural network training,” inProceedings of the 53rd International Conference on Parallel Processing, 2024, pp. 117–126

work page 2024
[27]

Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication,

O. Selvitopi, B. Brock, I. Nisa, A. Tripathy, K. Yelick, and A. Bulu c ¸, “Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication,” inProceedings of the 35th ACM International Conference on Supercomputing, 2021, pp. 431–442

work page 2021
[28]

Cola: Towards communication-efficient distributed sparse matrix-matrix multiplication on gpus,

L. Zhang, Y . Shao, and S. Li, “Cola: Towards communication-efficient distributed sparse matrix-matrix multiplication on gpus,” inProceedings of the 39th ACM International Conference on Supercomputing, 2025, pp. 73–87

work page 2025
[29]

Distributed-memory sparse kernels for machine learning,

V . Bharadwaj, A. Buluc ¸, and J. Demmel, “Distributed-memory sparse kernels for machine learning,” in2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2022, pp. 47–58

work page 2022
[30]

Mg-gcn: a scalable multi- gpu gcn training framework,

M. F. Balin, K. Sancak, and U. V . Catalyurek, “Mg-gcn: a scalable multi- gpu gcn training framework,” inProceedings of the 51st International Conference on Parallel Processing, 2022, pp. 1–11

work page 2022
[31]

Tsubame4.0 user’s guide,

Institute of Science Tokyo, “Tsubame4.0 user’s guide,” https://www.t4. cii.isct.ac.jp/docs/handbook.en/, 2024

work page 2024
[32]

NVIDIA H100 Tensor Core GPU,

NVIDIA Corporation, “NVIDIA H100 Tensor Core GPU,” https://www. nvidia.com/en-us/data-center/h100/, 2024, accessed: 2024

work page 2024
[33]

Introducing 200G HDR Infini- Band Solutions,

Mellanox Technologies, “Introducing 200G HDR Infini- Band Solutions,” https://network.nvidia.com/files/doc-2020/ wp-introducing-200g-hdr-infiniband-solutions.pdf, 2019, accessed: 2024

work page 2020
[34]

Graphok ´es matrixok,

D. K ¨onig, “Graphok ´es matrixok,”Matematikai ´es Fizikai Lapok, vol. 38, pp. 116–119, 1931

work page 1931
[35]

Schrijver,Combinatorial optimization: polyhedra and efficiency

A. Schrijver,Combinatorial optimization: polyhedra and efficiency. Springer Science & Business Media, 2003, vol. 24

work page 2003
[36]

Matrixok kombinatorius tulajdons ´agair´ol,

J. Egerv ´ary, “Matrixok kombinatorius tulajdons ´agair´ol,”Matematikai ´es Fizikai Lapok, vol. 38, no. 1931, pp. 16–28, 1931. 11

work page 1931
[37]

Maximal flow through a network,

L. R. Ford Jr and D. R. Fulkerson, “Maximal flow through a network,” Canadian Journal of Mathematics, vol. 8, pp. 399–404, 1956

work page 1956
[38]

Algorithm for solution of a problem of maximum flow in networks with power estimation,

E. A. Dinic, “Algorithm for solution of a problem of maximum flow in networks with power estimation,”Soviet Mathematics Doklady, vol. 11, pp. 1277–1280, 1970

work page 1970
[39]

The university of florida sparse matrix collection,

T. A. Davis and Y . Hu, “The university of florida sparse matrix collection,” ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1–25, 2011

work page 2011
[40]

Pytorch: An imperative style, high-performance deep learning library,

A. Paszke, S. Gross, F. Massaet al., “Pytorch: An imperative style, high-performance deep learning library,”Advances in Neural Information Processing Systems, vol. 32, 2019

work page 2019
[41]

cusparse library,

NVIDIA Corporation, “cusparse library,” https://developer.nvidia.com/ cusparse, 2024

work page 2024
[42]

Pytorch distributed,

PyTorch Team, “Pytorch distributed,” https://docs.pytorch.org/docs/stable/ distributed.html, 2024

work page 2024
[43]

Nccl: Nvidia collective communication library,

NVIDIA Corporation, “Nccl: Nvidia collective communication library,” https://developer.nvidia.com/nccl, 2024

work page 2024
[44]

Nvshmem,

——, “Nvshmem,” https://developer.nvidia.com/nvshmem, 2024

work page 2024
[45]

Hilfer fractional advection-diffusion equations with power-law initial condition; a Numerical study using variational iteration method

Y . Hong and A. Buluc ¸, “A sparsity-aware distributed-memory algorithm for sparse-sparse matrix multiplication,” inProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ser. SC ’24. IEEE Press, 2024. [Online]. Available: https://doi.org/10.1109/SC41406.2024.00053

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/sc41406.2024.00053 2024
[46]

Arrow matrix decomposition: A novel approach for communication-efficient sparse matrix multiplication,

L. Gianinazzi, A. N. Ziogas, L. Huang, P. Luczynski, S. Ashkboosh, F. Scheidl, A. Carigiet, C. Ge, N. Abubaker, M. Bestaet al., “Arrow matrix decomposition: A novel approach for communication-efficient sparse matrix multiplication,” inProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024, pp. 404–416

work page 2024
[47]

Scaling large-scale gnn training to thousands of processors on cpu-based supercomputers,

C. Zhuang, L. Zhang, D. Wu, P. Chen, J. Huang, X. Liu, R. Yokota, N. Dryden, T. Endo, S. Matsuokaet al., “Scaling large-scale gnn training to thousands of processors on cpu-based supercomputers,” inProceedings of the 39th ACM International Conference on Supercomputing, 2025, pp. 57–72

work page 2025
[48]

Bns-gcn: Efficient full- graph training of graph convolutional networks with partition-parallelism and random boundary node sampling,

C. Wan, Y . Li, A. Li, N. S. Kim, and Y . Lin, “Bns-gcn: Efficient full- graph training of graph convolutional networks with partition-parallelism and random boundary node sampling,”Proceedings of Machine Learning and Systems, vol. 4, pp. 673–693, 2022

work page 2022
[49]

Adaptive message quantization and parallelization for distributed full-graph gnn training,

B. Wan, J. Zhao, and C. Wu, “Adaptive message quantization and parallelization for distributed full-graph gnn training,”Proceedings of Machine Learning and Systems, vol. 5, pp. 203–218, 2023. 12

work page 2023

[1] [1]

All-pairs shortest paths computation in the bsp model,

A. Tiskin, “All-pairs shortest paths computation in the bsp model,” in International Colloquium on Automata, Languages, and Programming. Springer, 2001, pp. 178–189

work page 2001

[2] [2]

Rdma-based algorithms for sparse matrix multiplication on gpus,

B. Brock, A. Bulu c ¸, and K. Yelick, “Rdma-based algorithms for sparse matrix multiplication on gpus,” inProceedings of the 38th ACM International Conference on Supercomputing, 2024, pp. 225–235

work page 2024

[3] [3]

The block conjugate gradient algorithm and related methods,

D. P. O’Leary, “The block conjugate gradient algorithm and related methods,”Linear algebra and its applications, vol. 29, pp. 293–322, 1980

work page 1980

[4] [4]

A shifted block lanczos algorithm for solving sparse symmetric generalized eigenproblems,

R. G. Grimes, J. G. Lewis, and H. D. Simon, “A shifted block lanczos algorithm for solving sparse symmetric generalized eigenproblems,”SIAM Journal on Matrix Analysis and Applications, vol. 15, no. 1, pp. 228–272, 1994

work page 1994

[5] [5]

A block arnoldi-chebyshev method for computing the leading eigenpairs of large sparse unsymmetric matrices,

M. Sadkane, “A block arnoldi-chebyshev method for computing the leading eigenpairs of large sparse unsymmetric matrices,”Numerische mathematik, vol. 64, no. 1, pp. 181–193, 1993

work page 1993

[6] [6]

Block krylov space methods for linear systems with multiple right-hand sides: an introduction,

M. H. Gutknecht, “Block krylov space methods for linear systems with multiple right-hand sides: an introduction,” inModern mathematical models, methods and algorithms for real world systems. Anshan, 2007, pp. 420–447

work page 2007

[7] [7]

Neural message passing for quantum chemistry,

J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” inProceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y . W. Teh, Eds., vol. 70. PMLR, 2017, pp. 1263–1272

work page 2017

[8] [8]

Fast Graph Representation Learning with PyTorch Geometric

M. Fey and J. E. Lenssen, “Fast graph representation learning with PyTorch Geometric,”arXiv preprint arXiv:1903.02428, 2019, iCLR Workshop on Representation Learning on Graphs and Manifolds

work page internal anchor Pith review Pith/arXiv arXiv 1903

[9] [9]

Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks

M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y . Gai, T. Xiao, T. He, G. Karypis, J. Li, and Z. Zhang, “Deep graph library: A graph-centric, highly-performant package for graph neural networks,”arXiv preprint arXiv:1909.01315, 2019

work page internal anchor Pith review arXiv 1909

[10] [10]

Neural collab- orative filtering,

X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, “Neural collab- orative filtering,” inProceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 173–182

work page 2017

[11] [11]

Knowledge graph embedding: A survey of approaches and applications,

Q. Wang, Z. Mao, B. Wang, and L. Guo, “Knowledge graph embedding: A survey of approaches and applications,”IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 12, pp. 2724–2743, 2017

work page 2017

[12] [12]

Sparse gpu kernels for deep learning,

T. Gale, M. Zaharia, C. Young, and E. Elsen, “Sparse gpu kernels for deep learning,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–14

work page 2020

[13] [13]

Sparsetir: Composable abstractions for sparse compilation in deep learning,

Z. Ye, R. Lai, J. Shao, T. Chen, and L. Ceze, “Sparsetir: Composable abstractions for sparse compilation in deep learning,” inProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2023, pp. 660–678

work page 2023

[14] [14]

Dtc-spmm: Bridging the gap in accelerating general sparse matrix multiplication with tensor cores,

R. Fan, W. Wang, and X. Chu, “Dtc-spmm: Bridging the gap in accelerating general sparse matrix multiplication with tensor cores,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024, pp. 253–267

work page 2024

[15] [15]

{TC-GNN}: Bridging sparse {GNN} computation and dense tensor cores on {GPUs},

Y . Wang, B. Feng, Z. Wang, G. Huang, and Y . Ding, “ {TC-GNN}: Bridging sparse {GNN} computation and dense tensor cores on {GPUs},” in2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 149–164

work page 2023

[16] [16]

A row decomposition-based approach for sparse matrix multiplication on gpus,

M. Pang, X. Fei, P. Qu, Y . Zhang, and Z. Li, “A row decomposition-based approach for sparse matrix multiplication on gpus,” inProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024, pp. 377–389

work page 2024

[17] [17]

Solving unsymmetric sparse systems of linear equations with pardiso,

O. Schenk and K. G ¨artner, “Solving unsymmetric sparse systems of linear equations with pardiso,”Future Generation Computer Systems, vol. 20, no. 3, pp. 475–487, 2004

work page 2004

[18] [18]

Open graph benchmark: Datasets for machine learning on graphs

W. Hu, M. Fey, M. Zitnik, Y . Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec, “Open graph benchmark: Datasets for machine learning on graphs,”arXiv preprint arXiv:2005.00687, 2020

work page arXiv 2005

[19] [19]

SNAP Datasets: Stanford large network dataset collection,

J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection,” http://snap.stanford.edu/data, June 2014

work page 2014

[20] [20]

Snap: A general-purpose network analysis and graph-mining library,

J. Leskovec and R. Sosi ˇc, “Snap: A general-purpose network analysis and graph-mining library,”ACM Transactions on Intelligent Systems and Technology, vol. 8, no. 1, pp. 1–20, 2016

work page 2016

[21] [21]

Communication-avoiding parallel sparse-dense matrix-matrix multiplication,

P. Koanantakool, A. Azad, A. Buluc ¸, D. Morozov, S.-Y . Oh, L. Oliker, and K. Yelick, “Communication-avoiding parallel sparse-dense matrix-matrix multiplication,” in2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2016, pp. 842–853

work page 2016

[22] [22]

Two- face: Combining collective and one-sided communication for efficient distributed spmm,

C. Block, G. Gerogiannis, C. Mendis, A. Azad, and J. Torrellas, “Two- face: Combining collective and one-sided communication for efficient distributed spmm,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, pp. 1200–1217

work page 2024

[23] [23]

Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems,

S. Acer, O. Selvitopi, and C. Aykanat, “Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems,” Parallel Computing, vol. 59, pp. 71–96, 2016

work page 2016

[24] [24]

Spcomm3d: A framework for en- abling sparse communication in 3d sparse kernels,

N. Abubaker and T. Hoefler, “Spcomm3d: A framework for en- abling sparse communication in 3d sparse kernels,”arXiv preprint arXiv:2404.19638, 2024

work page arXiv 2024

[25] [25]

Reducing communication in graph neural network training,

A. Tripathy, K. Yelick, and A. Buluc ¸, “Reducing communication in graph neural network training,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–14

work page 2020

[26] [26]

Sparsity-aware communication for distributed graph neural network training,

U. Mukhopadhyay, A. Tripathy, O. Selvitopi, K. Yelick, and A. Buluc, “Sparsity-aware communication for distributed graph neural network training,” inProceedings of the 53rd International Conference on Parallel Processing, 2024, pp. 117–126

work page 2024

[27] [27]

Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication,

O. Selvitopi, B. Brock, I. Nisa, A. Tripathy, K. Yelick, and A. Bulu c ¸, “Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication,” inProceedings of the 35th ACM International Conference on Supercomputing, 2021, pp. 431–442

work page 2021

[28] [28]

Cola: Towards communication-efficient distributed sparse matrix-matrix multiplication on gpus,

L. Zhang, Y . Shao, and S. Li, “Cola: Towards communication-efficient distributed sparse matrix-matrix multiplication on gpus,” inProceedings of the 39th ACM International Conference on Supercomputing, 2025, pp. 73–87

work page 2025

[29] [29]

Distributed-memory sparse kernels for machine learning,

V . Bharadwaj, A. Buluc ¸, and J. Demmel, “Distributed-memory sparse kernels for machine learning,” in2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2022, pp. 47–58

work page 2022

[30] [30]

Mg-gcn: a scalable multi- gpu gcn training framework,

M. F. Balin, K. Sancak, and U. V . Catalyurek, “Mg-gcn: a scalable multi- gpu gcn training framework,” inProceedings of the 51st International Conference on Parallel Processing, 2022, pp. 1–11

work page 2022

[31] [31]

Tsubame4.0 user’s guide,

Institute of Science Tokyo, “Tsubame4.0 user’s guide,” https://www.t4. cii.isct.ac.jp/docs/handbook.en/, 2024

work page 2024

[32] [32]

NVIDIA H100 Tensor Core GPU,

NVIDIA Corporation, “NVIDIA H100 Tensor Core GPU,” https://www. nvidia.com/en-us/data-center/h100/, 2024, accessed: 2024

work page 2024

[33] [33]

Introducing 200G HDR Infini- Band Solutions,

Mellanox Technologies, “Introducing 200G HDR Infini- Band Solutions,” https://network.nvidia.com/files/doc-2020/ wp-introducing-200g-hdr-infiniband-solutions.pdf, 2019, accessed: 2024

work page 2020

[34] [34]

Graphok ´es matrixok,

D. K ¨onig, “Graphok ´es matrixok,”Matematikai ´es Fizikai Lapok, vol. 38, pp. 116–119, 1931

work page 1931

[35] [35]

Schrijver,Combinatorial optimization: polyhedra and efficiency

A. Schrijver,Combinatorial optimization: polyhedra and efficiency. Springer Science & Business Media, 2003, vol. 24

work page 2003

[36] [36]

Matrixok kombinatorius tulajdons ´agair´ol,

J. Egerv ´ary, “Matrixok kombinatorius tulajdons ´agair´ol,”Matematikai ´es Fizikai Lapok, vol. 38, no. 1931, pp. 16–28, 1931. 11

work page 1931

[37] [37]

Maximal flow through a network,

L. R. Ford Jr and D. R. Fulkerson, “Maximal flow through a network,” Canadian Journal of Mathematics, vol. 8, pp. 399–404, 1956

work page 1956

[38] [38]

Algorithm for solution of a problem of maximum flow in networks with power estimation,

E. A. Dinic, “Algorithm for solution of a problem of maximum flow in networks with power estimation,”Soviet Mathematics Doklady, vol. 11, pp. 1277–1280, 1970

work page 1970

[39] [39]

The university of florida sparse matrix collection,

T. A. Davis and Y . Hu, “The university of florida sparse matrix collection,” ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1–25, 2011

work page 2011

[40] [40]

Pytorch: An imperative style, high-performance deep learning library,

A. Paszke, S. Gross, F. Massaet al., “Pytorch: An imperative style, high-performance deep learning library,”Advances in Neural Information Processing Systems, vol. 32, 2019

work page 2019

[41] [41]

cusparse library,

NVIDIA Corporation, “cusparse library,” https://developer.nvidia.com/ cusparse, 2024

work page 2024

[42] [42]

Pytorch distributed,

PyTorch Team, “Pytorch distributed,” https://docs.pytorch.org/docs/stable/ distributed.html, 2024

work page 2024

[43] [43]

Nccl: Nvidia collective communication library,

NVIDIA Corporation, “Nccl: Nvidia collective communication library,” https://developer.nvidia.com/nccl, 2024

work page 2024

[44] [44]

Nvshmem,

——, “Nvshmem,” https://developer.nvidia.com/nvshmem, 2024

work page 2024

[45] [45]

Hilfer fractional advection-diffusion equations with power-law initial condition; a Numerical study using variational iteration method

Y . Hong and A. Buluc ¸, “A sparsity-aware distributed-memory algorithm for sparse-sparse matrix multiplication,” inProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ser. SC ’24. IEEE Press, 2024. [Online]. Available: https://doi.org/10.1109/SC41406.2024.00053

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/sc41406.2024.00053 2024

[46] [46]

Arrow matrix decomposition: A novel approach for communication-efficient sparse matrix multiplication,

L. Gianinazzi, A. N. Ziogas, L. Huang, P. Luczynski, S. Ashkboosh, F. Scheidl, A. Carigiet, C. Ge, N. Abubaker, M. Bestaet al., “Arrow matrix decomposition: A novel approach for communication-efficient sparse matrix multiplication,” inProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024, pp. 404–416

work page 2024

[47] [47]

Scaling large-scale gnn training to thousands of processors on cpu-based supercomputers,

C. Zhuang, L. Zhang, D. Wu, P. Chen, J. Huang, X. Liu, R. Yokota, N. Dryden, T. Endo, S. Matsuokaet al., “Scaling large-scale gnn training to thousands of processors on cpu-based supercomputers,” inProceedings of the 39th ACM International Conference on Supercomputing, 2025, pp. 57–72

work page 2025

[48] [48]

Bns-gcn: Efficient full- graph training of graph convolutional networks with partition-parallelism and random boundary node sampling,

C. Wan, Y . Li, A. Li, N. S. Kim, and Y . Lin, “Bns-gcn: Efficient full- graph training of graph convolutional networks with partition-parallelism and random boundary node sampling,”Proceedings of Machine Learning and Systems, vol. 4, pp. 673–693, 2022

work page 2022

[49] [49]

Adaptive message quantization and parallelization for distributed full-graph gnn training,

B. Wan, J. Zhao, and C. Wu, “Adaptive message quantization and parallelization for distributed full-graph gnn training,”Proceedings of Machine Learning and Systems, vol. 5, pp. 203–218, 2023. 12

work page 2023