Sparsity-Aware Roofline Models for Sparse Matrix-Matrix Multiplication
Pith reviewed 2026-05-10 18:26 UTC · model grok-4.3
The pith
Sparse matrix-matrix multiplication requires sparsity-aware roofline models because matrix structure alters arithmetic intensity in ways a single model cannot capture.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that accurate roofline-based performance analysis of SpMM requires sparsity-aware modeling that incorporates memory traffic, cache locality, and blocking behavior for each sparsity pattern. Experiments on an AMD node with CSR, CSB, and MKL implementations demonstrate that a single model is insufficient across block-structured, banded, scale-free, and uniformly random matrices, as structured sparsity and blocking significantly change effective arithmetic intensity. Data layout and blocking strategies therefore need evaluation in the context of the matrix structure.
What carries the argument
Sparsity-aware roofline models that adjust arithmetic intensity and performance bounds according to matrix-specific memory traffic, cache effects, and blocking.
Load-bearing premise
That the four sparsity pattern groups and three implementations tested on one AMD node are representative enough to require sparsity-aware models in general.
What would settle it
A set of measurements showing that one unified roofline model matches observed SpMM performance across all four sparsity patterns on both AMD and Intel nodes would falsify the necessity of sparsity-aware modeling.
Figures
read the original abstract
Sparse matrix-dense matrix multiplication (SpMM) is a critical kernel in scientific computing, graph analytics, and machine learning, whose performance is often constrained by memory bandwidth. In this work, we investigate the applicability and limitations of roofline modeling for SpMM by explicitly accounting for the impact of matrix sparsity structure on arithmetic intensity and attainable performance. We evaluate three SpMM implementations: Compressed Sparse Row (CSR), Compressed Sparse Blocks (CSB), and Intel's Math Kernel Library (MKL). Each implementation was tested using large-scale matrices from the SuiteSparse collection and grouped by sparsity pattern, including block-structured, banded (diagonal), scale-free, and uniformly random matrices. We derive sparsity-aware roofline models that incorporate memory traffic, cache locality, and blocking behavior, and demonstrate that a single model is insufficient to accurately predict performance across diverse structures. Experiments were conducted on an AMD-based Perlmutter compute node with a varying number of columns in the dense matrix. In particular, blocking and structured sparsity significantly alter effective arithmetic intensity. The results show that accurate roofline-based performance analysis of SpMM requires sparsity-aware modeling, and that data layout and blocking strategies must be evaluated in the context of matrix structure rather than through a single unified model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard roofline models are inadequate for predicting SpMM performance because they fail to account for the effects of different sparsity structures on arithmetic intensity, memory traffic, cache locality, and blocking. It derives sparsity-aware variants and evaluates them empirically using CSR, CSB, and MKL implementations on SuiteSparse matrices grouped into block-structured, banded, scale-free, and uniformly random patterns, run on a single AMD Perlmutter node with varying numbers of dense-matrix columns. The central conclusion is that a single unified model is insufficient and that data layout and blocking must be assessed relative to matrix structure.
Significance. If the sparsity-aware models are shown to be quantitatively accurate and the necessity claim holds beyond the tested cases, the work would provide a useful refinement to roofline analysis for sparse kernels common in scientific computing, graph analytics, and ML. The use of public matrices, multiple formats, and real hardware is a positive empirical foundation.
major comments (2)
- [Experimental evaluation and results] The experimental evaluation (described in the abstract and results) is confined to four sparsity pattern groups drawn from SuiteSparse, three implementations, and a single AMD node. This scope is insufficient to support the general claim that 'accurate roofline-based performance analysis of SpMM requires sparsity-aware modeling' and that 'a single model is insufficient... across diverse structures,' because the paper provides no evidence that these groups exhaust the space of realistic sparsity (e.g., hierarchical, multi-scale, or controlled power-law patterns) or that the observed deviations would persist on other architectures or with different blocking parameters.
- [Abstract and model derivation] The abstract states that sparsity-aware roofline models are derived to incorporate memory traffic, cache locality, and blocking behavior, yet supplies no explicit equations, arithmetic-intensity formulas, or quantitative error analysis (e.g., predicted vs. measured performance or R² values). Without these, it is impossible to assess whether the models are parameter-free, how they differ from a baseline unified roofline, or whether they actually improve predictive accuracy enough to justify the 'requires' claim.
minor comments (1)
- [Abstract] The abstract contains no numerical results, error metrics, or concrete performance numbers, which weakens the ability of readers to immediately gauge the practical impact of the proposed models.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and describe the revisions planned to strengthen the manuscript.
read point-by-point responses
-
Referee: [Experimental evaluation and results] The experimental evaluation (described in the abstract and results) is confined to four sparsity pattern groups drawn from SuiteSparse, three implementations, and a single AMD node. This scope is insufficient to support the general claim that 'accurate roofline-based performance analysis of SpMM requires sparsity-aware modeling' and that 'a single model is insufficient... across diverse structures,' because the paper provides no evidence that these groups exhaust the space of realistic sparsity (e.g., hierarchical, multi-scale, or controlled power-law patterns) or that the observed deviations would persist on other architectures or with different blocking parameters.
Authors: We agree that the experimental scope is limited to the four sparsity pattern groups, three implementations, and a single AMD Perlmutter node. These patterns were chosen to represent common structures in the target domains, but they do not exhaust the space of realistic sparsity (e.g., hierarchical or multi-scale patterns) and we provide no evidence regarding other architectures or blocking parameters. We will revise the abstract, introduction, and conclusions to qualify all general claims, explicitly limiting them to the evaluated patterns and hardware. A new Limitations and Future Work section will be added to discuss the boundaries of the study and suggest extensions to additional patterns, architectures, and blocking strategies. revision: yes
-
Referee: [Abstract and model derivation] The abstract states that sparsity-aware roofline models are derived to incorporate memory traffic, cache locality, and blocking behavior, yet supplies no explicit equations, arithmetic-intensity formulas, or quantitative error analysis (e.g., predicted vs. measured performance or R² values). Without these, it is impossible to assess whether the models are parameter-free, how they differ from a baseline unified roofline, or whether they actually improve predictive accuracy enough to justify the 'requires' claim.
Authors: The abstract is a concise summary and cannot accommodate full equations or derivations due to length limits. The sparsity-aware roofline models, including explicit arithmetic-intensity formulas that adjust for sparsity-dependent memory traffic, cache locality, and blocking, are derived in the model section of the manuscript. Quantitative error analysis comparing predicted and measured performance (including metrics such as R²) appears in the results section. To improve clarity, we will revise the abstract to briefly describe the key modifications to the standard roofline model and reference the quantitative validation, allowing readers to better assess the differences and accuracy improvements. revision: yes
Circularity Check
No circularity: empirical benchmarking with independent model adaptation
full rationale
The paper performs empirical evaluation of three SpMM implementations (CSR, CSB, MKL) on SuiteSparse matrices grouped into four sparsity patterns, measured on one AMD node. It adapts roofline models to incorporate observed memory traffic, cache locality, and blocking effects, then concludes that a single unified model is insufficient based on measured deviations. No equations reduce fitted parameters to predictions by construction, no self-citations are invoked as load-bearing uniqueness theorems, and the central claim rests on direct experimental comparison rather than definitional equivalence or renaming of known results. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard roofline assumptions remain valid when adjusted for sparsity-induced changes in memory traffic and cache behavior
Reference graph
Works this paper leans on
-
[1]
Math- ematical foundations of the graphblas,
J. Kepner, P. Aaltonen, D. Bader, A. Buluc ¸, F. Franchetti, J. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine, H. Meyerhenkeet al., “Math- ematical foundations of the graphblas,” in2016 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2016, pp. 1–9
work page 2016
-
[2]
Com- binatorial blas 2.0: Scaling combinatorial algorithms on distributed- memory systems,
A. Azad, O. Selvitopi, M. T. Hussain, J. R. Gilbert, and A. Buluc ¸, “Com- binatorial blas 2.0: Scaling combinatorial algorithms on distributed- memory systems,”IEEE Transactions on Parallel and Distributed Sys- tems, vol. 33, no. 4, pp. 989–1001, 2021
work page 2021
-
[3]
Ge-spmm: General-purpose sparse matrix-matrix multiplication on gpus for graph neural networks,
G. Huang, G. Dai, Y . Wang, and H. Yang, “Ge-spmm: General-purpose sparse matrix-matrix multiplication on gpus for graph neural networks,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–12
work page 2020
-
[4]
Force2vec: Parallel force- directed graph embedding,
M. K. Rahman, M. H. Sujon, and A. Azad, “Force2vec: Parallel force- directed graph embedding,” in2020 IEEE international conference on data mining (ICDM). IEEE, 2020, pp. 442–451
work page 2020
-
[5]
Batchlayout: A batch-parallel force-directed graph layout algo- rithm in shared memory,
——, “Batchlayout: A batch-parallel force-directed graph layout algo- rithm in shared memory,” in2020 IEEE Pacific Visualization Symposium (PacificVis). IEEE, 2020, pp. 16–25
work page 2020
-
[6]
Massively parallel algorithms for personalized pagerank,
G. Hou, X. Chen, S. Wang, and Z. Wei, “Massively parallel algorithms for personalized pagerank,”Proceedings of the VLDB Endowment, vol. 14, no. 9, pp. 1668–1680, 2021
work page 2021
-
[7]
Block krylov space methods for linear systems with multiple right-hand sides: an introduction,
M. H. Gutknecht, “Block krylov space methods for linear systems with multiple right-hand sides: an introduction,” inModern mathematical models, methods and algorithms for real world systems. Anshan, 2007, pp. 420–447
work page 2007
-
[8]
Linear scaling electronic structure methods,
S. Goedecker, “Linear scaling electronic structure methods,”Reviews of Modern Physics, vol. 71, no. 4, p. 1085, 1999
work page 1999
-
[9]
Generalsparse: Bridging the gap in spmm for pruned large language model inference on gpus,
Y . Wang, X. Guo, J. Xiao, D. Chen, and G. Tan, “Generalsparse: Bridging the gap in spmm for pruned large language model inference on gpus,” in2025 USENIX Annual Technical Conference (USENIX ATC 25), 2025, pp. 417–432
work page 2025
-
[10]
Sparse gpu kernels for deep learning,
T. Gale, M. Zaharia, C. Young, and E. Elsen, “Sparse gpu kernels for deep learning,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–14
work page 2020
-
[11]
Roofline: an insightful visual performance model for multicore architectures,
S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,”Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009
work page 2009
-
[12]
H. M. Aktulga, A. Buluc ¸, S. Williams, and C. Yang, “Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations,” in2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 2014, pp. 1213–1222
work page 2014
-
[13]
The sparsity roofline: Understanding the hardware limits of sparse neural networks,
C. Shinn, C. McCarthy, S. Muralidharan, M. Osama, and J. D. Owens, “The sparsity roofline: Understanding the hardware limits of sparse neural networks,”arXiv preprint arXiv:2310.00496, 2023
-
[14]
A. Buluc ¸, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, “Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks,” inProceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, 2009, pp. 233–244
work page 2009
-
[15]
Cache-aware roofline model: Upgrading the loft,
A. Ilic, F. Pratas, and L. Sousa, “Cache-aware roofline model: Upgrading the loft,”IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 21–24, 2013
work page 2013
-
[16]
Optimization of sparse matrix-vector multiplication on emerging mul- ticore platforms,
S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, “Optimization of sparse matrix-vector multiplication on emerging mul- ticore platforms,” inProceedings of the 2007 ACM/IEEE Conference on Supercomputing, 2007, pp. 1–12
work page 2007
-
[17]
Z. Gu, J. Moreira, D. Edelsohn, and A. Azad, “Bandwidth-optimized parallel algorithms for sparse matrix-matrix multiplication using propa- gation blocking,” inSPAA, 2020, pp. 293–303
work page 2020
-
[18]
FusedMM: A unified sddmm-spmm kernel for graph embedding and graph neural networks,
M. K. Rahman, M. H. Sujon, and A. Azad, “FusedMM: A unified sddmm-spmm kernel for graph embedding and graph neural networks,” in2021 IEEE International Parallel and Distributed Processing Sympo- sium (IPDPS). IEEE, 2021, pp. 256–266
work page 2021
-
[19]
iSpLib: A library for accelerating graph neural networks using auto-tuned sparse operations,
M. S. Hoque Anik, P. Badhe, R. Gampa, and A. Azad, “iSpLib: A library for accelerating graph neural networks using auto-tuned sparse operations,” inCompanion Proceedings of the ACM Web Conference 2024, 2024, pp. 778–781
work page 2024
-
[20]
Beyond Human-level Accuracy: Computational Challenges in Deep Learning , Url =
C. Hong, A. Sukumaran-Rajam, I. Nisa, K. Singh, and P. Sadayappan, “Adaptive sparse tiling for sparse matrix multiplication,” inProceedings of the 24th Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’19. New York, NY , USA: Association for Computing Machinery, 2019, p. 300–314. [Online]. Available: https://doi.org/10.1145/3293883.3295712
-
[21]
Efficient quantized sparse matrix operations on tensor cores,
S. Li, K. Osawa, and T. Hoefler, “Efficient quantized sparse matrix operations on tensor cores,” inSC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 2022, pp. 1–15
work page 2022
-
[22]
Distributed-memory parallel algorithms for sparse times tall-skinny- dense matrix multiplication,
O. Selvitopi, B. Brock, I. Nisa, A. Tripathy, K. Yelick, and A. Buluc ¸, “Distributed-memory parallel algorithms for sparse times tall-skinny- dense matrix multiplication,” inACM International Conference on Supercomputing (ICS), 2021
work page 2021
-
[23]
Two- face: Combining collective and one-sided communication for efficient distributed spmm,
C. Block, G. Gerogiannis, C. Mendis, A. Azad, and J. Torrellas, “Two- face: Combining collective and one-sided communication for efficient distributed spmm,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, pp. 1200–1217
work page 2024
-
[24]
Communication-avoiding parallel sparse-dense matrix- matrix multiplication,
P. Koanantakool, A. Azad, A. Buluc ¸, D. Morozov, S.-Y . Oh, L. Oliker, and K. Yelick, “Communication-avoiding parallel sparse-dense matrix- matrix multiplication,” in2016 IEEE International Parallel and Dis- tributed Processing Symposium (IPDPS). IEEE, 2016, pp. 842–853
work page 2016
-
[25]
Wise: Predicting the performance of sparse matrix vector multiplication with machine learning,
S. Yesil, A. Heidarshenas, A. Morrison, and J. Torrellas, “Wise: Predicting the performance of sparse matrix vector multiplication with machine learning,” inProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 329–341. [Onl...
-
[26]
S. Yesil, J. E. Moreira, and J. Torrellas, “Dense dynamic blocks: Optimizing spmm for processors with vector and matrix units using machine learning techniques,” inProceedings of the 36th ACM International Conference on Supercomputing, ser. ICS ’22. New York, NY , USA: Association for Computing Machinery, 2022. [Online]. Available: https://doi.org/10.1145...
-
[27]
M. Mitzenmacher and E. Upfal,Probability and computing: Random- ization and probabilistic techniques in algorithms and data analysis. Cambridge university press, 2017
work page 2017
-
[28]
Power-law distributions in empirical data,
A. Clauset, C. R. Shalizi, and M. E. Newman, “Power-law distributions in empirical data,”SIAM review, vol. 51, no. 4, pp. 661–703, 2009
work page 2009
-
[29]
The university of florida sparse matrix collec- tion,
T. A. Davis and Y . Hu, “The university of florida sparse matrix collec- tion,”ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, pp. 1–25, 2011
work page 2011
-
[30]
Memory bandwidth and machine balance in current high performance computers,
J. D. McCalpin, “Memory bandwidth and machine balance in current high performance computers,”IEEE Computer Society Technical Com- mittee on Computer Architecture (TCCA) Newsletter, pp. 19–25, 1995. APPENDIX Derivation of the hub edge fraction for scale-free graphs Assume that the degree distributionp(k)of a graph follows a power law, andk min is the minim...
work page 1995
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.