pith. the verified trust layer for science. sign in

arxiv: 2604.06637 · v1 · submitted 2026-04-08 · 💻 cs.DC

Sparsity-Aware Roofline Models for Sparse Matrix-Matrix Multiplication

Pith reviewed 2026-05-10 18:26 UTC · model grok-4.3

classification 💻 cs.DC
keywords SpMMroofline modelsparsity patternsperformance modelingsparse matrix multiplicationCSRblocking strategiesarithmetic intensity
0
0 comments X p. Extension

The pith

Sparse matrix-matrix multiplication requires sparsity-aware roofline models because matrix structure alters arithmetic intensity in ways a single model cannot capture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that roofline models for sparse matrix-dense matrix multiplication must incorporate the specific sparsity patterns of input matrices to predict performance reliably. Standard approaches overlook how blocking, cache locality, and memory traffic shift across block-structured, banded, scale-free, and uniformly random matrices. The authors test CSR, CSB, and MKL implementations on large SuiteSparse matrices on an AMD node, showing that effective arithmetic intensity changes with sparsity and dense matrix width. If the claim holds, then performance analysis and choices of data layout or blocking cannot rely on one universal model.

Core claim

The paper claims that accurate roofline-based performance analysis of SpMM requires sparsity-aware modeling that incorporates memory traffic, cache locality, and blocking behavior for each sparsity pattern. Experiments on an AMD node with CSR, CSB, and MKL implementations demonstrate that a single model is insufficient across block-structured, banded, scale-free, and uniformly random matrices, as structured sparsity and blocking significantly change effective arithmetic intensity. Data layout and blocking strategies therefore need evaluation in the context of the matrix structure.

What carries the argument

Sparsity-aware roofline models that adjust arithmetic intensity and performance bounds according to matrix-specific memory traffic, cache effects, and blocking.

Load-bearing premise

That the four sparsity pattern groups and three implementations tested on one AMD node are representative enough to require sparsity-aware models in general.

What would settle it

A set of measurements showing that one unified roofline model matches observed SpMM performance across all four sparsity patterns on both AMD and Intel nodes would falsify the necessity of sparsity-aware modeling.

Figures

Figures reproduced from arXiv: 2604.06637 by Ariful Azad, Matthew Qian, Suhita Anubha, Yahia Ramadan.

Figure 1
Figure 1. Figure 1: SpMM Performance (GFLOP/s) vs Number of Columns (d) for various sparsity patterns. Insets are visual￾izations of the specific sparsity pattern . Matrices used are (a) er 22 1, (b) rajat31, (c) road usa, and (d) com-LiveJournal. C. Performance of SpMM implementations We first evaluate the performance of three SpMM imple￾mentations based on CSR and CSC data structures, as well as Intel MKL, for all matrices … view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of attained performance with our sparsity-aware Roofline model for four representative matrices, one from [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
read the original abstract

Sparse matrix-dense matrix multiplication (SpMM) is a critical kernel in scientific computing, graph analytics, and machine learning, whose performance is often constrained by memory bandwidth. In this work, we investigate the applicability and limitations of roofline modeling for SpMM by explicitly accounting for the impact of matrix sparsity structure on arithmetic intensity and attainable performance. We evaluate three SpMM implementations: Compressed Sparse Row (CSR), Compressed Sparse Blocks (CSB), and Intel's Math Kernel Library (MKL). Each implementation was tested using large-scale matrices from the SuiteSparse collection and grouped by sparsity pattern, including block-structured, banded (diagonal), scale-free, and uniformly random matrices. We derive sparsity-aware roofline models that incorporate memory traffic, cache locality, and blocking behavior, and demonstrate that a single model is insufficient to accurately predict performance across diverse structures. Experiments were conducted on an AMD-based Perlmutter compute node with a varying number of columns in the dense matrix. In particular, blocking and structured sparsity significantly alter effective arithmetic intensity. The results show that accurate roofline-based performance analysis of SpMM requires sparsity-aware modeling, and that data layout and blocking strategies must be evaluated in the context of matrix structure rather than through a single unified model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that standard roofline models are inadequate for predicting SpMM performance because they fail to account for the effects of different sparsity structures on arithmetic intensity, memory traffic, cache locality, and blocking. It derives sparsity-aware variants and evaluates them empirically using CSR, CSB, and MKL implementations on SuiteSparse matrices grouped into block-structured, banded, scale-free, and uniformly random patterns, run on a single AMD Perlmutter node with varying numbers of dense-matrix columns. The central conclusion is that a single unified model is insufficient and that data layout and blocking must be assessed relative to matrix structure.

Significance. If the sparsity-aware models are shown to be quantitatively accurate and the necessity claim holds beyond the tested cases, the work would provide a useful refinement to roofline analysis for sparse kernels common in scientific computing, graph analytics, and ML. The use of public matrices, multiple formats, and real hardware is a positive empirical foundation.

major comments (2)
  1. [Experimental evaluation and results] The experimental evaluation (described in the abstract and results) is confined to four sparsity pattern groups drawn from SuiteSparse, three implementations, and a single AMD node. This scope is insufficient to support the general claim that 'accurate roofline-based performance analysis of SpMM requires sparsity-aware modeling' and that 'a single model is insufficient... across diverse structures,' because the paper provides no evidence that these groups exhaust the space of realistic sparsity (e.g., hierarchical, multi-scale, or controlled power-law patterns) or that the observed deviations would persist on other architectures or with different blocking parameters.
  2. [Abstract and model derivation] The abstract states that sparsity-aware roofline models are derived to incorporate memory traffic, cache locality, and blocking behavior, yet supplies no explicit equations, arithmetic-intensity formulas, or quantitative error analysis (e.g., predicted vs. measured performance or R² values). Without these, it is impossible to assess whether the models are parameter-free, how they differ from a baseline unified roofline, or whether they actually improve predictive accuracy enough to justify the 'requires' claim.
minor comments (1)
  1. [Abstract] The abstract contains no numerical results, error metrics, or concrete performance numbers, which weakens the ability of readers to immediately gauge the practical impact of the proposed models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions planned to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experimental evaluation and results] The experimental evaluation (described in the abstract and results) is confined to four sparsity pattern groups drawn from SuiteSparse, three implementations, and a single AMD node. This scope is insufficient to support the general claim that 'accurate roofline-based performance analysis of SpMM requires sparsity-aware modeling' and that 'a single model is insufficient... across diverse structures,' because the paper provides no evidence that these groups exhaust the space of realistic sparsity (e.g., hierarchical, multi-scale, or controlled power-law patterns) or that the observed deviations would persist on other architectures or with different blocking parameters.

    Authors: We agree that the experimental scope is limited to the four sparsity pattern groups, three implementations, and a single AMD Perlmutter node. These patterns were chosen to represent common structures in the target domains, but they do not exhaust the space of realistic sparsity (e.g., hierarchical or multi-scale patterns) and we provide no evidence regarding other architectures or blocking parameters. We will revise the abstract, introduction, and conclusions to qualify all general claims, explicitly limiting them to the evaluated patterns and hardware. A new Limitations and Future Work section will be added to discuss the boundaries of the study and suggest extensions to additional patterns, architectures, and blocking strategies. revision: yes

  2. Referee: [Abstract and model derivation] The abstract states that sparsity-aware roofline models are derived to incorporate memory traffic, cache locality, and blocking behavior, yet supplies no explicit equations, arithmetic-intensity formulas, or quantitative error analysis (e.g., predicted vs. measured performance or R² values). Without these, it is impossible to assess whether the models are parameter-free, how they differ from a baseline unified roofline, or whether they actually improve predictive accuracy enough to justify the 'requires' claim.

    Authors: The abstract is a concise summary and cannot accommodate full equations or derivations due to length limits. The sparsity-aware roofline models, including explicit arithmetic-intensity formulas that adjust for sparsity-dependent memory traffic, cache locality, and blocking, are derived in the model section of the manuscript. Quantitative error analysis comparing predicted and measured performance (including metrics such as R²) appears in the results section. To improve clarity, we will revise the abstract to briefly describe the key modifications to the standard roofline model and reference the quantitative validation, allowing readers to better assess the differences and accuracy improvements. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with independent model adaptation

full rationale

The paper performs empirical evaluation of three SpMM implementations (CSR, CSB, MKL) on SuiteSparse matrices grouped into four sparsity patterns, measured on one AMD node. It adapts roofline models to incorporate observed memory traffic, cache locality, and blocking effects, then concludes that a single unified model is insufficient based on measured deviations. No equations reduce fitted parameters to predictions by construction, no self-citations are invoked as load-bearing uniqueness theorems, and the central claim rests on direct experimental comparison rather than definitional equivalence or renaming of known results. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work adapts the standard roofline framework without introducing new free parameters, axioms beyond domain assumptions about memory traffic, or invented entities.

axioms (1)
  • domain assumption Standard roofline assumptions remain valid when adjusted for sparsity-induced changes in memory traffic and cache behavior
    Invoked when deriving models that incorporate blocking and structure effects.

pith-pipeline@v0.9.0 · 5525 in / 1049 out tokens · 63903 ms · 2026-05-10T18:26:01.624611+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. [1]

    Math- ematical foundations of the graphblas,

    J. Kepner, P. Aaltonen, D. Bader, A. Buluc ¸, F. Franchetti, J. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine, H. Meyerhenkeet al., “Math- ematical foundations of the graphblas,” in2016 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2016, pp. 1–9

  2. [2]

    Com- binatorial blas 2.0: Scaling combinatorial algorithms on distributed- memory systems,

    A. Azad, O. Selvitopi, M. T. Hussain, J. R. Gilbert, and A. Buluc ¸, “Com- binatorial blas 2.0: Scaling combinatorial algorithms on distributed- memory systems,”IEEE Transactions on Parallel and Distributed Sys- tems, vol. 33, no. 4, pp. 989–1001, 2021

  3. [3]

    Ge-spmm: General-purpose sparse matrix-matrix multiplication on gpus for graph neural networks,

    G. Huang, G. Dai, Y . Wang, and H. Yang, “Ge-spmm: General-purpose sparse matrix-matrix multiplication on gpus for graph neural networks,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–12

  4. [4]

    Force2vec: Parallel force- directed graph embedding,

    M. K. Rahman, M. H. Sujon, and A. Azad, “Force2vec: Parallel force- directed graph embedding,” in2020 IEEE international conference on data mining (ICDM). IEEE, 2020, pp. 442–451

  5. [5]

    Batchlayout: A batch-parallel force-directed graph layout algo- rithm in shared memory,

    ——, “Batchlayout: A batch-parallel force-directed graph layout algo- rithm in shared memory,” in2020 IEEE Pacific Visualization Symposium (PacificVis). IEEE, 2020, pp. 16–25

  6. [6]

    Massively parallel algorithms for personalized pagerank,

    G. Hou, X. Chen, S. Wang, and Z. Wei, “Massively parallel algorithms for personalized pagerank,”Proceedings of the VLDB Endowment, vol. 14, no. 9, pp. 1668–1680, 2021

  7. [7]

    Block krylov space methods for linear systems with multiple right-hand sides: an introduction,

    M. H. Gutknecht, “Block krylov space methods for linear systems with multiple right-hand sides: an introduction,” inModern mathematical models, methods and algorithms for real world systems. Anshan, 2007, pp. 420–447

  8. [8]

    Linear scaling electronic structure methods,

    S. Goedecker, “Linear scaling electronic structure methods,”Reviews of Modern Physics, vol. 71, no. 4, p. 1085, 1999

  9. [9]

    Generalsparse: Bridging the gap in spmm for pruned large language model inference on gpus,

    Y . Wang, X. Guo, J. Xiao, D. Chen, and G. Tan, “Generalsparse: Bridging the gap in spmm for pruned large language model inference on gpus,” in2025 USENIX Annual Technical Conference (USENIX ATC 25), 2025, pp. 417–432

  10. [10]

    Sparse gpu kernels for deep learning,

    T. Gale, M. Zaharia, C. Young, and E. Elsen, “Sparse gpu kernels for deep learning,” inSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–14

  11. [11]

    Roofline: an insightful visual performance model for multicore architectures,

    S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,”Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009

  12. [12]

    Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations,

    H. M. Aktulga, A. Buluc ¸, S. Williams, and C. Yang, “Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations,” in2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 2014, pp. 1213–1222

  13. [13]

    The sparsity roofline: Understanding the hardware limits of sparse neural networks,

    C. Shinn, C. McCarthy, S. Muralidharan, M. Osama, and J. D. Owens, “The sparsity roofline: Understanding the hardware limits of sparse neural networks,”arXiv preprint arXiv:2310.00496, 2023

  14. [14]

    Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks,

    A. Buluc ¸, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, “Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks,” inProceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, 2009, pp. 233–244

  15. [15]

    Cache-aware roofline model: Upgrading the loft,

    A. Ilic, F. Pratas, and L. Sousa, “Cache-aware roofline model: Upgrading the loft,”IEEE Computer Architecture Letters, vol. 13, no. 1, pp. 21–24, 2013

  16. [16]

    Optimization of sparse matrix-vector multiplication on emerging mul- ticore platforms,

    S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, “Optimization of sparse matrix-vector multiplication on emerging mul- ticore platforms,” inProceedings of the 2007 ACM/IEEE Conference on Supercomputing, 2007, pp. 1–12

  17. [17]

    Bandwidth-optimized parallel algorithms for sparse matrix-matrix multiplication using propa- gation blocking,

    Z. Gu, J. Moreira, D. Edelsohn, and A. Azad, “Bandwidth-optimized parallel algorithms for sparse matrix-matrix multiplication using propa- gation blocking,” inSPAA, 2020, pp. 293–303

  18. [18]

    FusedMM: A unified sddmm-spmm kernel for graph embedding and graph neural networks,

    M. K. Rahman, M. H. Sujon, and A. Azad, “FusedMM: A unified sddmm-spmm kernel for graph embedding and graph neural networks,” in2021 IEEE International Parallel and Distributed Processing Sympo- sium (IPDPS). IEEE, 2021, pp. 256–266

  19. [19]

    iSpLib: A library for accelerating graph neural networks using auto-tuned sparse operations,

    M. S. Hoque Anik, P. Badhe, R. Gampa, and A. Azad, “iSpLib: A library for accelerating graph neural networks using auto-tuned sparse operations,” inCompanion Proceedings of the ACM Web Conference 2024, 2024, pp. 778–781

  20. [20]

    Beyond Human-level Accuracy: Computational Challenges in Deep Learning , Url =

    C. Hong, A. Sukumaran-Rajam, I. Nisa, K. Singh, and P. Sadayappan, “Adaptive sparse tiling for sparse matrix multiplication,” inProceedings of the 24th Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’19. New York, NY , USA: Association for Computing Machinery, 2019, p. 300–314. [Online]. Available: https://doi.org/10.1145/3293883.3295712

  21. [21]

    Efficient quantized sparse matrix operations on tensor cores,

    S. Li, K. Osawa, and T. Hoefler, “Efficient quantized sparse matrix operations on tensor cores,” inSC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 2022, pp. 1–15

  22. [22]

    Distributed-memory parallel algorithms for sparse times tall-skinny- dense matrix multiplication,

    O. Selvitopi, B. Brock, I. Nisa, A. Tripathy, K. Yelick, and A. Buluc ¸, “Distributed-memory parallel algorithms for sparse times tall-skinny- dense matrix multiplication,” inACM International Conference on Supercomputing (ICS), 2021

  23. [23]

    Two- face: Combining collective and one-sided communication for efficient distributed spmm,

    C. Block, G. Gerogiannis, C. Mendis, A. Azad, and J. Torrellas, “Two- face: Combining collective and one-sided communication for efficient distributed spmm,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2024, pp. 1200–1217

  24. [24]

    Communication-avoiding parallel sparse-dense matrix- matrix multiplication,

    P. Koanantakool, A. Azad, A. Buluc ¸, D. Morozov, S.-Y . Oh, L. Oliker, and K. Yelick, “Communication-avoiding parallel sparse-dense matrix- matrix multiplication,” in2016 IEEE International Parallel and Dis- tributed Processing Symposium (IPDPS). IEEE, 2016, pp. 842–853

  25. [25]

    Wise: Predicting the performance of sparse matrix vector multiplication with machine learning,

    S. Yesil, A. Heidarshenas, A. Morrison, and J. Torrellas, “Wise: Predicting the performance of sparse matrix vector multiplication with machine learning,” inProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 329–341. [Onl...

  26. [26]

    Dense dynamic blocks: Optimizing spmm for processors with vector and matrix units using machine learning techniques,

    S. Yesil, J. E. Moreira, and J. Torrellas, “Dense dynamic blocks: Optimizing spmm for processors with vector and matrix units using machine learning techniques,” inProceedings of the 36th ACM International Conference on Supercomputing, ser. ICS ’22. New York, NY , USA: Association for Computing Machinery, 2022. [Online]. Available: https://doi.org/10.1145...

  27. [27]

    Mitzenmacher and E

    M. Mitzenmacher and E. Upfal,Probability and computing: Random- ization and probabilistic techniques in algorithms and data analysis. Cambridge university press, 2017

  28. [28]

    Power-law distributions in empirical data,

    A. Clauset, C. R. Shalizi, and M. E. Newman, “Power-law distributions in empirical data,”SIAM review, vol. 51, no. 4, pp. 661–703, 2009

  29. [29]

    The university of florida sparse matrix collec- tion,

    T. A. Davis and Y . Hu, “The university of florida sparse matrix collec- tion,”ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, pp. 1–25, 2011

  30. [30]

    Memory bandwidth and machine balance in current high performance computers,

    J. D. McCalpin, “Memory bandwidth and machine balance in current high performance computers,”IEEE Computer Society Technical Com- mittee on Computer Architecture (TCCA) Newsletter, pp. 19–25, 1995. APPENDIX Derivation of the hub edge fraction for scale-free graphs Assume that the degree distributionp(k)of a graph follows a power law, andk min is the minim...