SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication
Pith reviewed 2026-05-16 20:35 UTC · model grok-4.3
The pith
SHIRO reduces distributed SpMM communication overhead by sending only the data needed for non-zero multiplications and prioritizing fast intra-node GPU links.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SHIRO introduces a fine-grained, sparsity-aware communication strategy that exploits the sparsity pattern of the input matrix to eliminate redundant data transfers, together with a hierarchical communication strategy that maps the resulting schedule onto two-tier GPU networks so that most movement stays on faster intra-node links. When implemented and tested on real-world sparse datasets, the combined approach yields strong scalability up to 128 GPUs, with geometric-mean speedups of 221.5× over CAGNET, 56.0× over SPA, 23.4× over BCL, and 8.8× over CoLa.
What carries the argument
Fine-grained sparsity-aware communication strategy that computes only the transfers required by actual non-zero entries, mapped onto a two-tier GPU network to favor intra-node links.
If this is right
- Communication volume falls because only entries participating in non-zero multiplications are moved.
- Intra-node links carry the majority of traffic, cutting exposure to slower inter-node bandwidth.
- Runtime remains dominated by computation rather than communication up to at least 128 GPUs.
- The same speedups appear consistently across four different existing SpMM implementations.
Where Pith is reading between the lines
- The same sparsity-driven planning could be applied to other distributed sparse kernels such as SpMV.
- On networks without a clear two-tier speed difference the hierarchical component would contribute little.
- For extremely irregular sparsity the one-time cost of building the communication schedule might become noticeable.
Load-bearing premise
The real-world matrices tested have sparsity patterns that let the fine-grained planner remove most redundant transfers, and the hardware has substantially faster links inside nodes than between nodes.
What would settle it
Running the same code on a single-tier flat network or on matrices whose sparsity offers no exploitable structure would eliminate or reverse the reported speedups.
Figures
read the original abstract
Distributed Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental operation in high-performance computing and deep learning applications. The major performance bottleneck in distributed SpMM lies in substantial communication overhead, which limits both performance and scalability. In this paper, we identify two key sources of communication inefficiency in distributed SpMM: redundant data transfer due to sparsity unawareness, and suboptimal utilization of hierarchical network topology. To address these, we propose (1) a fine-grained, sparsity-aware communication strategy that reduces communication overhead by exploiting the sparsity pattern of the sparse matrix, and (2) a hierarchical communication strategy that maps the sparsity-aware strategy onto two-tier GPU network architectures, minimizing redundant data movement across slower inter-node links. We implement these optimizations in a comprehensive distributed SpMM framework, \method{}. Extensive evaluations on real-world datasets show that \method{} demonstrates strong scalability up to 128 GPUs, achieving geometric mean speedups of 221.5$\times$, 56.0$\times$, 23.4$\times$, and 8.8$\times$ in SpMM over four state-of-the-art baselines (CAGNET, SPA, BCL, and CoLa, respectively) at this scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SHIRO, a distributed SpMM framework that introduces a fine-grained sparsity-aware communication strategy to reduce redundant transfers by exploiting matrix sparsity patterns and a hierarchical mapping strategy to exploit two-tier GPU network topologies by minimizing inter-node data movement. It reports empirical results on real-world datasets showing strong scalability to 128 GPUs with geometric mean speedups of 221.5×, 56.0×, 23.4×, and 8.8× over CAGNET, SPA, BCL, and CoLa respectively.
Significance. If the reported speedups are reproducible and generalizable, the work would represent a meaningful advance in communication-efficient distributed sparse linear algebra, directly addressing bottlenecks relevant to HPC and large-scale machine learning. The emphasis on sparsity awareness and topology hierarchy is well-motivated, though the magnitude of gains appears tightly coupled to the evaluated sparsity patterns and network assumptions.
major comments (2)
- [Abstract and Evaluation] Abstract and Evaluation section: the headline geometric-mean speedups (221.5×–8.8× at 128 GPUs) are presented without accompanying communication-volume tables, lower-bound comparisons, error bars, or ablation results on uniform-random sparsity or flat networks; this leaves the central claim dependent on unverified conditions that the fine-grained strategy eliminates most redundant transfers and that intra-node links are substantially faster than inter-node links.
- [§3.2] §3.2 (Hierarchical Communication Strategy): the mapping assumes a two-tier topology with fast intra-node interconnects, yet no topology-ablated experiments or results on single-tier networks are provided to quantify how much of the reported gain collapses when this assumption is removed.
minor comments (1)
- [Abstract] Abstract: the four baseline acronyms (CAGNET, SPA, BCL, CoLa) are not expanded on first use, which reduces readability for readers outside the immediate subfield.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested supporting data and experiments.
read point-by-point responses
-
Referee: [Abstract and Evaluation] Abstract and Evaluation section: the headline geometric-mean speedups (221.5×–8.8× at 128 GPUs) are presented without accompanying communication-volume tables, lower-bound comparisons, error bars, or ablation results on uniform-random sparsity or flat networks; this leaves the central claim dependent on unverified conditions that the fine-grained strategy eliminates most redundant transfers and that intra-node links are substantially faster than inter-node links.
Authors: We agree that these elements would strengthen the presentation. In the revised manuscript we will add communication-volume tables for all baselines, lower-bound comparisons derived from the sparsity patterns, error bars from repeated runs, and ablation results on uniform-random sparsity matrices as well as on flat single-tier networks. These additions will directly verify the conditions under which the fine-grained and hierarchical strategies deliver the reported gains. revision: yes
-
Referee: [§3.2] §3.2 (Hierarchical Communication Strategy): the mapping assumes a two-tier topology with fast intra-node interconnects, yet no topology-ablated experiments or results on single-tier networks are provided to quantify how much of the reported gain collapses when this assumption is removed.
Authors: We acknowledge that the current evaluation focuses on two-tier topologies. To quantify the dependence, we will add new experiments on single-tier networks (uniform link speeds) in the revised version. These results will show the performance contribution of the hierarchical mapping and the extent to which gains diminish when the intra-node versus inter-node speed difference is removed. revision: yes
Circularity Check
No circularity: empirical claims rest on benchmarks, not derivations
full rationale
The paper contains no equations, derivations, or mathematical claims that could reduce to self-definition or fitted inputs. Its central results are reported geometric-mean speedups from direct runtime measurements on 128-GPU runs against four external baselines; these numbers are not obtained by fitting parameters to a subset of the same data and then relabeling the fit as a prediction. No self-citation is invoked to justify a uniqueness theorem or to smuggle an ansatz; the two proposed strategies are described at the algorithmic level and then evaluated experimentally. Because the derivation chain is empty and the performance claims are externally falsifiable via replication on the same hardware and matrices, the analysis registers no circular steps.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
joint row-column communication strategy... minimum weighted vertex cover problem... bipartite graph G=(R∪C,E)
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
hierarchical communication strategy... two-tier GPU network... inter-group links
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
All-pairs shortest paths computation in the bsp model,
A. Tiskin, “All-pairs shortest paths computation in the bsp model,” in International Colloquium on Automata, Languages, and Programming. Springer, 2001, pp. 178–189
work page 2001
-
[2]
Rdma-based algorithms for sparse matrix multiplication on gpus,
B. Brock, A. Bulu c ¸, and K. Yelick, “Rdma-based algorithms for sparse matrix multiplication on gpus,” inProceedings of the 38th ACM International Conference on Supercomputing, 2024, pp. 225–235
work page 2024
-
[3]
The block conjugate gradient algorithm and related methods,
D. P. O’Leary, “The block conjugate gradient algorithm and related methods,”Linear algebra and its applications, vol. 29, pp. 293–322, 1980
work page 1980
-
[4]
A shifted block lanczos algorithm for solving sparse symmetric generalized eigenproblems,
R. G. Grimes, J. G. Lewis, and H. D. Simon, “A shifted block lanczos algorithm for solving sparse symmetric generalized eigenproblems,”SIAM Journal on Matrix Analysis and Applications, vol. 15, no. 1, pp. 228–272, 1994
work page 1994
-
[5]
M. Sadkane, “A block arnoldi-chebyshev method for computing the leading eigenpairs of large sparse unsymmetric matrices,”Numerische mathematik, vol. 64, no. 1, pp. 181–193, 1993
work page 1993
-
[6]
Block krylov space methods for linear systems with multiple right-hand sides: an introduction,
M. H. Gutknecht, “Block krylov space methods for linear systems with multiple right-hand sides: an introduction,” inModern mathematical models, methods and algorithms for real world systems. Anshan, 2007, pp. 420–447
work page 2007
-
[7]
Neural message passing for quantum chemistry,
J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” inProceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y . W. Teh, Eds., vol. 70. PMLR, 2017, pp. 1263–1272
work page 2017
-
[8]
Fast Graph Representation Learning with PyTorch Geometric
M. Fey and J. E. Lenssen, “Fast graph representation learning with PyTorch Geometric,”arXiv preprint arXiv:1903.02428, 2019, iCLR Workshop on Representation Learning on Graphs and Manifolds
work page internal anchor Pith review Pith/arXiv arXiv 1903
-
[9]
Deep Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks
M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y . Gai, T. Xiao, T. He, G. Karypis, J. Li, and Z. Zhang, “Deep graph library: A graph-centric, highly-performant package for graph neural networks,”arXiv preprint arXiv:1909.01315, 2019
work page internal anchor Pith review arXiv 1909
-
[10]
Neural collab- orative filtering,
X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, “Neural collab- orative filtering,” inProceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017, pp. 173–182
work page 2017
-
[11]
Knowledge graph embedding: A survey of approaches and applications,
Q. Wang, Z. Mao, B. Wang, and L. Guo, “Knowledge graph embedding: A survey of approaches and applications,”IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 12, pp. 2724–2743, 2017
work page 2017
-
[12]
Sparse gpu kernels for deep learning,
T. Gale, M. Zaharia, C. Young, and E. Elsen, “Sparse gpu kernels for deep learning,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–14
work page 2020
-
[13]
Sparsetir: Composable abstractions for sparse compilation in deep learning,
Z. Ye, R. Lai, J. Shao, T. Chen, and L. Ceze, “Sparsetir: Composable abstractions for sparse compilation in deep learning,” inProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2023, pp. 660–678
work page 2023
-
[14]
Dtc-spmm: Bridging the gap in accelerating general sparse matrix multiplication with tensor cores,
R. Fan, W. Wang, and X. Chu, “Dtc-spmm: Bridging the gap in accelerating general sparse matrix multiplication with tensor cores,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024, pp. 253–267
work page 2024
-
[15]
{TC-GNN}: Bridging sparse {GNN} computation and dense tensor cores on {GPUs},
Y . Wang, B. Feng, Z. Wang, G. Huang, and Y . Ding, “ {TC-GNN}: Bridging sparse {GNN} computation and dense tensor cores on {GPUs},” in2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 149–164
work page 2023
-
[16]
A row decomposition-based approach for sparse matrix multiplication on gpus,
M. Pang, X. Fei, P. Qu, Y . Zhang, and Z. Li, “A row decomposition-based approach for sparse matrix multiplication on gpus,” inProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024, pp. 377–389
work page 2024
-
[17]
Solving unsymmetric sparse systems of linear equations with pardiso,
O. Schenk and K. G ¨artner, “Solving unsymmetric sparse systems of linear equations with pardiso,”Future Generation Computer Systems, vol. 20, no. 3, pp. 475–487, 2004
work page 2004
-
[18]
Open graph benchmark: Datasets for machine learning on graphs
W. Hu, M. Fey, M. Zitnik, Y . Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec, “Open graph benchmark: Datasets for machine learning on graphs,”arXiv preprint arXiv:2005.00687, 2020
-
[19]
SNAP Datasets: Stanford large network dataset collection,
J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection,” http://snap.stanford.edu/data, June 2014
work page 2014
-
[20]
Snap: A general-purpose network analysis and graph-mining library,
J. Leskovec and R. Sosi ˇc, “Snap: A general-purpose network analysis and graph-mining library,”ACM Transactions on Intelligent Systems and Technology, vol. 8, no. 1, pp. 1–20, 2016
work page 2016
-
[21]
Communication-avoiding parallel sparse-dense matrix-matrix multiplication,
P. Koanantakool, A. Azad, A. Buluc ¸, D. Morozov, S.-Y . Oh, L. Oliker, and K. Yelick, “Communication-avoiding parallel sparse-dense matrix-matrix multiplication,” in2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2016, pp. 842–853
work page 2016
-
[22]
Two- face: Combining collective and one-sided communication for efficient distributed spmm,
C. Block, G. Gerogiannis, C. Mendis, A. Azad, and J. Torrellas, “Two- face: Combining collective and one-sided communication for efficient distributed spmm,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, pp. 1200–1217
work page 2024
-
[23]
Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems,
S. Acer, O. Selvitopi, and C. Aykanat, “Improving performance of sparse matrix dense matrix multiplication on large-scale parallel systems,” Parallel Computing, vol. 59, pp. 71–96, 2016
work page 2016
-
[24]
Spcomm3d: A framework for en- abling sparse communication in 3d sparse kernels,
N. Abubaker and T. Hoefler, “Spcomm3d: A framework for en- abling sparse communication in 3d sparse kernels,”arXiv preprint arXiv:2404.19638, 2024
-
[25]
Reducing communication in graph neural network training,
A. Tripathy, K. Yelick, and A. Buluc ¸, “Reducing communication in graph neural network training,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–14
work page 2020
-
[26]
Sparsity-aware communication for distributed graph neural network training,
U. Mukhopadhyay, A. Tripathy, O. Selvitopi, K. Yelick, and A. Buluc, “Sparsity-aware communication for distributed graph neural network training,” inProceedings of the 53rd International Conference on Parallel Processing, 2024, pp. 117–126
work page 2024
-
[27]
Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication,
O. Selvitopi, B. Brock, I. Nisa, A. Tripathy, K. Yelick, and A. Bulu c ¸, “Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication,” inProceedings of the 35th ACM International Conference on Supercomputing, 2021, pp. 431–442
work page 2021
-
[28]
Cola: Towards communication-efficient distributed sparse matrix-matrix multiplication on gpus,
L. Zhang, Y . Shao, and S. Li, “Cola: Towards communication-efficient distributed sparse matrix-matrix multiplication on gpus,” inProceedings of the 39th ACM International Conference on Supercomputing, 2025, pp. 73–87
work page 2025
-
[29]
Distributed-memory sparse kernels for machine learning,
V . Bharadwaj, A. Buluc ¸, and J. Demmel, “Distributed-memory sparse kernels for machine learning,” in2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2022, pp. 47–58
work page 2022
-
[30]
Mg-gcn: a scalable multi- gpu gcn training framework,
M. F. Balin, K. Sancak, and U. V . Catalyurek, “Mg-gcn: a scalable multi- gpu gcn training framework,” inProceedings of the 51st International Conference on Parallel Processing, 2022, pp. 1–11
work page 2022
-
[31]
Institute of Science Tokyo, “Tsubame4.0 user’s guide,” https://www.t4. cii.isct.ac.jp/docs/handbook.en/, 2024
work page 2024
-
[32]
NVIDIA Corporation, “NVIDIA H100 Tensor Core GPU,” https://www. nvidia.com/en-us/data-center/h100/, 2024, accessed: 2024
work page 2024
-
[33]
Introducing 200G HDR Infini- Band Solutions,
Mellanox Technologies, “Introducing 200G HDR Infini- Band Solutions,” https://network.nvidia.com/files/doc-2020/ wp-introducing-200g-hdr-infiniband-solutions.pdf, 2019, accessed: 2024
work page 2020
-
[34]
D. K ¨onig, “Graphok ´es matrixok,”Matematikai ´es Fizikai Lapok, vol. 38, pp. 116–119, 1931
work page 1931
-
[35]
Schrijver,Combinatorial optimization: polyhedra and efficiency
A. Schrijver,Combinatorial optimization: polyhedra and efficiency. Springer Science & Business Media, 2003, vol. 24
work page 2003
-
[36]
Matrixok kombinatorius tulajdons ´agair´ol,
J. Egerv ´ary, “Matrixok kombinatorius tulajdons ´agair´ol,”Matematikai ´es Fizikai Lapok, vol. 38, no. 1931, pp. 16–28, 1931. 11
work page 1931
-
[37]
Maximal flow through a network,
L. R. Ford Jr and D. R. Fulkerson, “Maximal flow through a network,” Canadian Journal of Mathematics, vol. 8, pp. 399–404, 1956
work page 1956
-
[38]
Algorithm for solution of a problem of maximum flow in networks with power estimation,
E. A. Dinic, “Algorithm for solution of a problem of maximum flow in networks with power estimation,”Soviet Mathematics Doklady, vol. 11, pp. 1277–1280, 1970
work page 1970
-
[39]
The university of florida sparse matrix collection,
T. A. Davis and Y . Hu, “The university of florida sparse matrix collection,” ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1–25, 2011
work page 2011
-
[40]
Pytorch: An imperative style, high-performance deep learning library,
A. Paszke, S. Gross, F. Massaet al., “Pytorch: An imperative style, high-performance deep learning library,”Advances in Neural Information Processing Systems, vol. 32, 2019
work page 2019
-
[41]
NVIDIA Corporation, “cusparse library,” https://developer.nvidia.com/ cusparse, 2024
work page 2024
-
[42]
PyTorch Team, “Pytorch distributed,” https://docs.pytorch.org/docs/stable/ distributed.html, 2024
work page 2024
-
[43]
Nccl: Nvidia collective communication library,
NVIDIA Corporation, “Nccl: Nvidia collective communication library,” https://developer.nvidia.com/nccl, 2024
work page 2024
- [44]
-
[45]
Y . Hong and A. Buluc ¸, “A sparsity-aware distributed-memory algorithm for sparse-sparse matrix multiplication,” inProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ser. SC ’24. IEEE Press, 2024. [Online]. Available: https://doi.org/10.1109/SC41406.2024.00053
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/sc41406.2024.00053 2024
-
[46]
L. Gianinazzi, A. N. Ziogas, L. Huang, P. Luczynski, S. Ashkboosh, F. Scheidl, A. Carigiet, C. Ge, N. Abubaker, M. Bestaet al., “Arrow matrix decomposition: A novel approach for communication-efficient sparse matrix multiplication,” inProceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024, pp. 404–416
work page 2024
-
[47]
Scaling large-scale gnn training to thousands of processors on cpu-based supercomputers,
C. Zhuang, L. Zhang, D. Wu, P. Chen, J. Huang, X. Liu, R. Yokota, N. Dryden, T. Endo, S. Matsuokaet al., “Scaling large-scale gnn training to thousands of processors on cpu-based supercomputers,” inProceedings of the 39th ACM International Conference on Supercomputing, 2025, pp. 57–72
work page 2025
-
[48]
C. Wan, Y . Li, A. Li, N. S. Kim, and Y . Lin, “Bns-gcn: Efficient full- graph training of graph convolutional networks with partition-parallelism and random boundary node sampling,”Proceedings of Machine Learning and Systems, vol. 4, pp. 673–693, 2022
work page 2022
-
[49]
Adaptive message quantization and parallelization for distributed full-graph gnn training,
B. Wan, J. Zhao, and C. Wu, “Adaptive message quantization and parallelization for distributed full-graph gnn training,”Proceedings of Machine Learning and Systems, vol. 5, pp. 203–218, 2023. 12
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.