arxiv: 2604.06637 · v1 · submitted 2026-04-08 · 💻 cs.DC

Sparsity-Aware Roofline Models for Sparse Matrix-Matrix Multiplication

Matthew Qian , Yahia Ramadan , Suhita Anubha , Ariful Azad This is my paper

Pith reviewed 2026-05-10 18:26 UTC · model grok-4.3

classification 💻 cs.DC

keywords SpMMroofline modelsparsity patternsperformance modelingsparse matrix multiplicationCSRblocking strategiesarithmetic intensity

0 comments p. Extension

The pith

Sparse matrix-matrix multiplication requires sparsity-aware roofline models because matrix structure alters arithmetic intensity in ways a single model cannot capture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that roofline models for sparse matrix-dense matrix multiplication must incorporate the specific sparsity patterns of input matrices to predict performance reliably. Standard approaches overlook how blocking, cache locality, and memory traffic shift across block-structured, banded, scale-free, and uniformly random matrices. The authors test CSR, CSB, and MKL implementations on large SuiteSparse matrices on an AMD node, showing that effective arithmetic intensity changes with sparsity and dense matrix width. If the claim holds, then performance analysis and choices of data layout or blocking cannot rely on one universal model.

Core claim

The paper claims that accurate roofline-based performance analysis of SpMM requires sparsity-aware modeling that incorporates memory traffic, cache locality, and blocking behavior for each sparsity pattern. Experiments on an AMD node with CSR, CSB, and MKL implementations demonstrate that a single model is insufficient across block-structured, banded, scale-free, and uniformly random matrices, as structured sparsity and blocking significantly change effective arithmetic intensity. Data layout and blocking strategies therefore need evaluation in the context of the matrix structure.

What carries the argument

Sparsity-aware roofline models that adjust arithmetic intensity and performance bounds according to matrix-specific memory traffic, cache effects, and blocking.

Load-bearing premise

That the four sparsity pattern groups and three implementations tested on one AMD node are representative enough to require sparsity-aware models in general.

What would settle it

A set of measurements showing that one unified roofline model matches observed SpMM performance across all four sparsity patterns on both AMD and Intel nodes would falsify the necessity of sparsity-aware modeling.

Figures

Figures reproduced from arXiv: 2604.06637 by Ariful Azad, Matthew Qian, Suhita Anubha, Yahia Ramadan.

**Figure 1.** Figure 1: SpMM Performance (GFLOP/s) vs Number of Columns (d) for various sparsity patterns. Insets are visualizations of the specific sparsity pattern . Matrices used are (a) er 22 1, (b) rajat31, (c) road usa, and (d) com-LiveJournal. C. Performance of SpMM implementations We first evaluate the performance of three SpMM implementations based on CSR and CSC data structures, as well as Intel MKL, for all matrices … view at source ↗

**Figure 2.** Figure 2: Comparison of attained performance with our sparsity-aware Roofline model for four representative matrices, one from [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

read the original abstract

Sparse matrix-dense matrix multiplication (SpMM) is a critical kernel in scientific computing, graph analytics, and machine learning, whose performance is often constrained by memory bandwidth. In this work, we investigate the applicability and limitations of roofline modeling for SpMM by explicitly accounting for the impact of matrix sparsity structure on arithmetic intensity and attainable performance. We evaluate three SpMM implementations: Compressed Sparse Row (CSR), Compressed Sparse Blocks (CSB), and Intel's Math Kernel Library (MKL). Each implementation was tested using large-scale matrices from the SuiteSparse collection and grouped by sparsity pattern, including block-structured, banded (diagonal), scale-free, and uniformly random matrices. We derive sparsity-aware roofline models that incorporate memory traffic, cache locality, and blocking behavior, and demonstrate that a single model is insufficient to accurately predict performance across diverse structures. Experiments were conducted on an AMD-based Perlmutter compute node with a varying number of columns in the dense matrix. In particular, blocking and structured sparsity significantly alter effective arithmetic intensity. The results show that accurate roofline-based performance analysis of SpMM requires sparsity-aware modeling, and that data layout and blocking strategies must be evaluated in the context of matrix structure rather than through a single unified model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Sparsity-aware rooflines are a useful practical tweak for SpMM modeling but rest on experiments from one node and four matrix groups.

read the letter

This paper's main message is that roofline analysis for sparse matrix-matrix multiplication needs to account for sparsity patterns explicitly, because a generic model misses the mark on arithmetic intensity for different matrix types. They make this case by classifying SuiteSparse matrices into four groups—block-structured, banded, scale-free, and uniformly random—and then creating tailored roofline models for each. These models factor in memory traffic, cache locality, and how blocking works in the implementations. They compare CSR, CSB, and MKL on an AMD Perlmutter node while varying the number of columns in the dense matrix. The results indicate that blocking and structured sparsity change the effective arithmetic intensity enough that one model cannot cover all cases. This is a practical contribution. Using real hardware and public matrices gives the findings some weight, and showing the impact on three different codes is helpful for people who actually implement these kernels. The soft spot is the limited experimental base. Everything comes from one AMD node and those four pattern classes. The concern that this does not prove sparsity-aware models are generally necessary is fair, because other sparsity types or different architectures might behave differently. The paper would be stronger with quantitative error numbers for the models and tests on more hardware. Readers who tune SpMM for scientific computing, graph analytics, or machine learning will find the modeling approach useful. It is not a major theoretical advance but a useful refinement of an existing technique. I would recommend sending this to peer review. The work is focused and the experiments are reproducible enough that referees can evaluate the model derivations and the scope of the conclusions.

Referee Report

2 major / 1 minor

Summary. The paper claims that standard roofline models are inadequate for predicting SpMM performance because they fail to account for the effects of different sparsity structures on arithmetic intensity, memory traffic, cache locality, and blocking. It derives sparsity-aware variants and evaluates them empirically using CSR, CSB, and MKL implementations on SuiteSparse matrices grouped into block-structured, banded, scale-free, and uniformly random patterns, run on a single AMD Perlmutter node with varying numbers of dense-matrix columns. The central conclusion is that a single unified model is insufficient and that data layout and blocking must be assessed relative to matrix structure.

Significance. If the sparsity-aware models are shown to be quantitatively accurate and the necessity claim holds beyond the tested cases, the work would provide a useful refinement to roofline analysis for sparse kernels common in scientific computing, graph analytics, and ML. The use of public matrices, multiple formats, and real hardware is a positive empirical foundation.

major comments (2)

[Experimental evaluation and results] The experimental evaluation (described in the abstract and results) is confined to four sparsity pattern groups drawn from SuiteSparse, three implementations, and a single AMD node. This scope is insufficient to support the general claim that 'accurate roofline-based performance analysis of SpMM requires sparsity-aware modeling' and that 'a single model is insufficient... across diverse structures,' because the paper provides no evidence that these groups exhaust the space of realistic sparsity (e.g., hierarchical, multi-scale, or controlled power-law patterns) or that the observed deviations would persist on other architectures or with different blocking parameters.
[Abstract and model derivation] The abstract states that sparsity-aware roofline models are derived to incorporate memory traffic, cache locality, and blocking behavior, yet supplies no explicit equations, arithmetic-intensity formulas, or quantitative error analysis (e.g., predicted vs. measured performance or R² values). Without these, it is impossible to assess whether the models are parameter-free, how they differ from a baseline unified roofline, or whether they actually improve predictive accuracy enough to justify the 'requires' claim.

minor comments (1)

[Abstract] The abstract contains no numerical results, error metrics, or concrete performance numbers, which weakens the ability of readers to immediately gauge the practical impact of the proposed models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions planned to strengthen the manuscript.

read point-by-point responses

Referee: [Experimental evaluation and results] The experimental evaluation (described in the abstract and results) is confined to four sparsity pattern groups drawn from SuiteSparse, three implementations, and a single AMD node. This scope is insufficient to support the general claim that 'accurate roofline-based performance analysis of SpMM requires sparsity-aware modeling' and that 'a single model is insufficient... across diverse structures,' because the paper provides no evidence that these groups exhaust the space of realistic sparsity (e.g., hierarchical, multi-scale, or controlled power-law patterns) or that the observed deviations would persist on other architectures or with different blocking parameters.

Authors: We agree that the experimental scope is limited to the four sparsity pattern groups, three implementations, and a single AMD Perlmutter node. These patterns were chosen to represent common structures in the target domains, but they do not exhaust the space of realistic sparsity (e.g., hierarchical or multi-scale patterns) and we provide no evidence regarding other architectures or blocking parameters. We will revise the abstract, introduction, and conclusions to qualify all general claims, explicitly limiting them to the evaluated patterns and hardware. A new Limitations and Future Work section will be added to discuss the boundaries of the study and suggest extensions to additional patterns, architectures, and blocking strategies. revision: yes
Referee: [Abstract and model derivation] The abstract states that sparsity-aware roofline models are derived to incorporate memory traffic, cache locality, and blocking behavior, yet supplies no explicit equations, arithmetic-intensity formulas, or quantitative error analysis (e.g., predicted vs. measured performance or R² values). Without these, it is impossible to assess whether the models are parameter-free, how they differ from a baseline unified roofline, or whether they actually improve predictive accuracy enough to justify the 'requires' claim.

Authors: The abstract is a concise summary and cannot accommodate full equations or derivations due to length limits. The sparsity-aware roofline models, including explicit arithmetic-intensity formulas that adjust for sparsity-dependent memory traffic, cache locality, and blocking, are derived in the model section of the manuscript. Quantitative error analysis comparing predicted and measured performance (including metrics such as R²) appears in the results section. To improve clarity, we will revise the abstract to briefly describe the key modifications to the standard roofline model and reference the quantitative validation, allowing readers to better assess the differences and accuracy improvements. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with independent model adaptation

full rationale

The paper performs empirical evaluation of three SpMM implementations (CSR, CSB, MKL) on SuiteSparse matrices grouped into four sparsity patterns, measured on one AMD node. It adapts roofline models to incorporate observed memory traffic, cache locality, and blocking effects, then concludes that a single unified model is insufficient based on measured deviations. No equations reduce fitted parameters to predictions by construction, no self-citations are invoked as load-bearing uniqueness theorems, and the central claim rests on direct experimental comparison rather than definitional equivalence or renaming of known results. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work adapts the standard roofline framework without introducing new free parameters, axioms beyond domain assumptions about memory traffic, or invented entities.

axioms (1)

domain assumption Standard roofline assumptions remain valid when adjusted for sparsity-induced changes in memory traffic and cache behavior
Invoked when deriving models that incorporate blocking and structure effects.

pith-pipeline@v0.9.0 · 5525 in / 1049 out tokens · 63903 ms · 2026-05-10T18:26:01.624611+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

[1]

Math- ematical foundations of the graphblas,

J. Kepner, P. Aaltonen, D. Bader, A. Buluc ¸, F. Franchetti, J. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine, H. Meyerhenkeet al., “Math- ematical foundations of the graphblas,” in2016 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2016, pp. 1–9

work page 2016
[2]

Com- binatorial blas 2.0: Scaling combinatorial algorithms on distributed- memory systems,

A. Azad, O. Selvitopi, M. T. Hussain, J. R. Gilbert, and A. Buluc ¸, “Com- binatorial blas 2.0: Scaling combinatorial algorithms on distributed- memory systems,”IEEE Transactions on Parallel and Distributed Sys- tems, vol. 33, no. 4, pp. 989–1001, 2021

work page 2021
[3]

Ge-spmm: General-purpose sparse matrix-matrix multiplication on gpus for graph neural networks,

G. Huang, G. Dai, Y . Wang, and H. Yang, “Ge-spmm: General-purpose sparse matrix-matrix multiplication on gpus for graph neural networks,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–12

work page 2020
[4]

Force2vec: Parallel force- directed graph embedding,

M. K. Rahman, M. H. Sujon, and A. Azad, “Force2vec: Parallel force- directed graph embedding,” in2020 IEEE international conference on data mining (ICDM). IEEE, 2020, pp. 442–451

work page 2020
[5]

Batchlayout: A batch-parallel force-directed graph layout algo- rithm in shared memory,

——, “Batchlayout: A batch-parallel force-directed graph layout algo- rithm in shared memory,” in2020 IEEE Pacific Visualization Symposium (PacificVis). IEEE, 2020, pp. 16–25

work page 2020
[6]

Massively parallel algorithms for personalized pagerank,

G. Hou, X. Chen, S. Wang, and Z. Wei, “Massively parallel algorithms for personalized pagerank,”Proceedings of the VLDB Endowment, vol. 14, no. 9, pp. 1668–1680, 2021

work page 2021
[7]

Block krylov space methods for linear systems with multiple right-hand sides: an introduction,

M. H. Gutknecht, “Block krylov space methods for linear systems with multiple right-hand sides: an introduction,” inModern mathematical models, methods and algorithms for real world systems. Anshan, 2007, pp. 420–447

work page 2007
[8]

Linear scaling electronic structure methods,

S. Goedecker, “Linear scaling electronic structure methods,”Reviews of Modern Physics, vol. 71, no. 4, p. 1085, 1999

work page 1999
[9]

Generalsparse: Bridging the gap in spmm for pruned large language model inference on gpus,

Y . Wang, X. Guo, J. Xiao, D. Chen, and G. Tan, “Generalsparse: Bridging the gap in spmm for pruned large language model inference on gpus,” in2025 USENIX Annual Technical Conference (USENIX ATC 25), 2025, pp. 417–432

work page 2025
[10]

Sparse gpu kernels for deep learning,

T. Gale, M. Zaharia, C. Young, and E. Elsen, “Sparse gpu kernels for deep learning,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–14

work page 2020
[11]

Roofline: an insightful visual performance model for multicore architectures,

S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,”Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009

work page 2009
[12]

Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations,

H. M. Aktulga, A. Buluc ¸, S. Williams, and C. Yang, “Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations,” in2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 2014, pp. 1213–1222

work page 2014
[13]

The sparsity roofline: Understanding the hardware limits of sparse neural networks,

C. Shinn, C. McCarthy, S. Muralidharan, M. Osama, and J. D. Owens, “The sparsity roofline: Understanding the hardware limits of sparse neural networks,”arXiv preprint arXiv:2310.00496, 2023

work page arXiv 2023
[14]

Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks,

A. Buluc ¸, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, “Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks,” inProceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, 2009, pp. 233–244

work page 2009
[15]

Cache-aware roofline model: Upgrading the loft,

A. Ilic, F. Pratas, and L. Sousa, “Cache-aware roofline model: Upgrading the loft,”IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 21–24, 2013

work page 2013
[16]

Optimization of sparse matrix-vector multiplication on emerging mul- ticore platforms,

S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, “Optimization of sparse matrix-vector multiplication on emerging mul- ticore platforms,” inProceedings of the 2007 ACM/IEEE Conference on Supercomputing, 2007, pp. 1–12

work page 2007
[17]

Bandwidth-optimized parallel algorithms for sparse matrix-matrix multiplication using propa- gation blocking,

Z. Gu, J. Moreira, D. Edelsohn, and A. Azad, “Bandwidth-optimized parallel algorithms for sparse matrix-matrix multiplication using propa- gation blocking,” inSPAA, 2020, pp. 293–303

work page 2020
[18]

FusedMM: A unified sddmm-spmm kernel for graph embedding and graph neural networks,

M. K. Rahman, M. H. Sujon, and A. Azad, “FusedMM: A unified sddmm-spmm kernel for graph embedding and graph neural networks,” in2021 IEEE International Parallel and Distributed Processing Sympo- sium (IPDPS). IEEE, 2021, pp. 256–266

work page 2021
[19]

iSpLib: A library for accelerating graph neural networks using auto-tuned sparse operations,

M. S. Hoque Anik, P. Badhe, R. Gampa, and A. Azad, “iSpLib: A library for accelerating graph neural networks using auto-tuned sparse operations,” inCompanion Proceedings of the ACM Web Conference 2024, 2024, pp. 778–781

work page 2024
[20]

Beyond Human-level Accuracy: Computational Challenges in Deep Learning , Url =

C. Hong, A. Sukumaran-Rajam, I. Nisa, K. Singh, and P. Sadayappan, “Adaptive sparse tiling for sparse matrix multiplication,” inProceedings of the 24th Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’19. New York, NY , USA: Association for Computing Machinery, 2019, p. 300–314. [Online]. Available: https://doi.org/10.1145/3293883.3295712

work page doi:10.1145/3293883.3295712 2019
[21]

Efficient quantized sparse matrix operations on tensor cores,

S. Li, K. Osawa, and T. Hoefler, “Efficient quantized sparse matrix operations on tensor cores,” inSC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 2022, pp. 1–15

work page 2022
[22]

Distributed-memory parallel algorithms for sparse times tall-skinny- dense matrix multiplication,

O. Selvitopi, B. Brock, I. Nisa, A. Tripathy, K. Yelick, and A. Buluc ¸, “Distributed-memory parallel algorithms for sparse times tall-skinny- dense matrix multiplication,” inACM International Conference on Supercomputing (ICS), 2021

work page 2021
[23]

Two- face: Combining collective and one-sided communication for efficient distributed spmm,

C. Block, G. Gerogiannis, C. Mendis, A. Azad, and J. Torrellas, “Two- face: Combining collective and one-sided communication for efficient distributed spmm,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, pp. 1200–1217

work page 2024
[24]

Communication-avoiding parallel sparse-dense matrix- matrix multiplication,

P. Koanantakool, A. Azad, A. Buluc ¸, D. Morozov, S.-Y . Oh, L. Oliker, and K. Yelick, “Communication-avoiding parallel sparse-dense matrix- matrix multiplication,” in2016 IEEE International Parallel and Dis- tributed Processing Symposium (IPDPS). IEEE, 2016, pp. 842–853

work page 2016
[25]

Wise: Predicting the performance of sparse matrix vector multiplication with machine learning,

S. Yesil, A. Heidarshenas, A. Morrison, and J. Torrellas, “Wise: Predicting the performance of sparse matrix vector multiplication with machine learning,” inProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 329–341. [Onl...

work page arXiv 2023
[26]

Dense dynamic blocks: Optimizing spmm for processors with vector and matrix units using machine learning techniques,

S. Yesil, J. E. Moreira, and J. Torrellas, “Dense dynamic blocks: Optimizing spmm for processors with vector and matrix units using machine learning techniques,” inProceedings of the 36th ACM International Conference on Supercomputing, ser. ICS ’22. New York, NY , USA: Association for Computing Machinery, 2022. [Online]. Available: https://doi.org/10.1145...

work page doi:10.1145/3524059.3532369 2022
[27]

Mitzenmacher and E

M. Mitzenmacher and E. Upfal,Probability and computing: Random- ization and probabilistic techniques in algorithms and data analysis. Cambridge university press, 2017

work page 2017
[28]

Power-law distributions in empirical data,

A. Clauset, C. R. Shalizi, and M. E. Newman, “Power-law distributions in empirical data,”SIAM review, vol. 51, no. 4, pp. 661–703, 2009

work page 2009
[29]

The university of florida sparse matrix collec- tion,

T. A. Davis and Y . Hu, “The university of florida sparse matrix collec- tion,”ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, pp. 1–25, 2011

work page 2011
[30]

Memory bandwidth and machine balance in current high performance computers,

J. D. McCalpin, “Memory bandwidth and machine balance in current high performance computers,”IEEE Computer Society Technical Com- mittee on Computer Architecture (TCCA) Newsletter, pp. 19–25, 1995. APPENDIX Derivation of the hub edge fraction for scale-free graphs Assume that the degree distributionp(k)of a graph follows a power law, andk min is the minim...

work page 1995