ParamSpMM: Adaptive and Efficient Sparse Matrix-Matrix Multiplication on GPUs for GNNs
Pith reviewed 2026-05-19 19:32 UTC · model grok-4.3
The pith
ParamSpMM adapts SpMM on GPUs to graph inputs via a flexible data structure and ML predictor for better GNN efficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ParamSpMM is a parametric SpMM design built around the Parameterized Compressed Sparse Row (PCSR) format that can embed and combine prior optimization techniques, together with an ML-based SpMM-decider that selects the best configuration from a small set of input features; evaluations show this combination delivers an average 1.92x speedup over Nvidia cuSPARSE across GNN workloads.
What carries the argument
The Parameterized Compressed Sparse Row (PCSR) data structure that encodes existing SpMM optimizations as configurable parameters so they can be mixed and matched per input graph.
If this is right
- SpMM kernels can be made input-aware without rewriting low-level code for each new graph family.
- GNN training throughput increases by roughly a factor of two on average when the adaptive choice replaces a fixed library call.
- Multiple existing SpMM optimizations become complementary rather than mutually exclusive once they are parameterized together.
- Feature-based prediction can replace hand-tuned or exhaustive search for kernel configuration at runtime.
Where Pith is reading between the lines
- The same parametric-plus-predictor pattern could be applied to other sparse kernels such as SpMV or sampled dense-dense multiplication inside GNN layers.
- If the feature set proves insufficient on new graph distributions, adding cheap structural summaries like degree histograms or clustering coefficients would be a direct next step.
- Hardware vendors could embed similar deciders inside future cuSPARSE or cuBLAS releases so users obtain the benefit without changing application code.
Load-bearing premise
The ML-based SpMM-decider can reliably predict optimal configurations from the crafted input features across diverse graph characteristics.
What would settle it
A set of previously unseen graphs where the ML decider selects a configuration whose measured SpMM time is consistently worse than the best static or cuSPARSE baseline on the same hardware.
Figures
read the original abstract
Fueled by the ability to mine real-world graph data, GNN applications have experienced phenomenal growth. Sparse Matrix-Matrix Multiplication (SpMM) is a critical operator in GNNs. However, existing SpMM designs for GNNs struggle to adapt to diverse input characteristics. In this paper, we first conduct a comprehensive analysis of existing SpMM optimizations, revealing their limitations through statistical and empirical evidence. Based on this analysis, we introduce ParamSpMM, a parametric approach for highly adaptive and efficient SpMM computation in GNNs. It incorporates a new data structure, the Parameterized Compressed Sparse Row (PCSR), to flexibly integrate existing optimization techniques. ParamSpMM enables the configuration of these optimization techniques according to various input characteristics. Furthermore, we complement ParamSpMM with an ML-based SpMM-decider that predicts optimal configurations based on carefully crafted input features. Our evaluations demonstrate that ParamSpMM outperforms Nvidia cuSPARSE with an average speedup of 1.92x, significantly enhancing GNN training efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ParamSpMM, a parametric approach for adaptive and efficient SpMM on GPUs targeted at GNN workloads. It introduces the Parameterized Compressed Sparse Row (PCSR) data structure to flexibly integrate existing optimizations and pairs it with an ML-based SpMM-decider that selects configurations using crafted input features. The central empirical claim is an average 1.92x speedup over Nvidia cuSPARSE.
Significance. If the performance results and the reliability of the ML decider hold across diverse graphs, the work would offer a practical advance in GPU SpMM kernels for GNN training by addressing the limitations of fixed optimizations through adaptive parameterization.
major comments (1)
- The 1.92x speedup claim depends on the ML-based SpMM-decider reliably mapping input features to optimal PCSR configurations. No accuracy metrics, cross-validation results, or fallback analysis for prediction errors on graphs with unseen degree distributions or sparsity patterns are reported, which is load-bearing for the adaptive advantage over cuSPARSE.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address the major comment point by point below and outline the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: The 1.92x speedup claim depends on the ML-based SpMM-decider reliably mapping input features to optimal PCSR configurations. No accuracy metrics, cross-validation results, or fallback analysis for prediction errors on graphs with unseen degree distributions or sparsity patterns are reported, which is load-bearing for the adaptive advantage over cuSPARSE.
Authors: We agree that the reliability of the SpMM-decider is central to the claimed adaptive benefits and that the original submission would be strengthened by additional validation details. The reported 1.92x average speedup is measured end-to-end using the decider on the evaluated GNN workloads and graphs. However, we did not include explicit accuracy metrics, cross-validation statistics, or dedicated fallback analysis in the initial manuscript. In the revised version we will add a new subsection under the SpMM-decider description that reports (1) prediction accuracy on held-out test graphs, (2) k-fold cross-validation results across multiple graph collections, and (3) an analysis of fallback behavior (defaulting to a conservative PCSR configuration) when the model predicts sub-optimally. We will also include a small set of additional experiments on graphs whose degree distributions and sparsity patterns were not present in the training data to demonstrate generalization. revision: yes
Circularity Check
No circularity: empirical system evaluation with measured speedups
full rationale
The paper proposes ParamSpMM using a new PCSR data structure to combine optimizations and an ML-based decider trained on crafted features to select configurations. All performance claims (e.g., 1.92x average speedup over cuSPARSE) are presented as direct empirical measurements from evaluations on GNN workloads. No mathematical derivations, equations, or predictions are shown that reduce by construction to fitted inputs, self-citations, or ansatzes. The ML decider's role is a design choice whose reliability is assessed via runtime results rather than assumed tautologically.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Arai, J., Shiokawa, H., Yamamuro, T., Onizuka, M., Iwamura, S.: Rabbit order: Just-in-time parallel reordering for fast graph analysis. In: (IPDPS). pp. 22–31 (2016)
work page 2016
-
[2]
Bader, D.A., Meyerhenke, H., Sanders, P., Wagner, D.: 10th dimacs implementa- tion challenge-graph partitioning and graph clustering (2011)
work page 2011
-
[3]
Chandrasekaran, B.: Survey of network traffic models. Waschington University in St. Louis CSE567(2009)
work page 2009
-
[4]
Dai, G., Huang, G., Yang, S., Yu, Z., Zhang, H., Ding, Y., Xie, Y., Yang, H., Wang, Y.: Heuristic adaptability to input dynamics for spmm on gpus. p. 595–600. DAC ’22 (2022)
work page 2022
-
[5]
Fan, R., Wang, W., Chu, X.: Fast sparse gpu kernels for accelerated training of graph neural networks. In: (IPDPS). pp. 501–511 (2023)
work page 2023
-
[6]
Fey, M., Lenssen, J.E.: Fast graph representation learning with pytorch geometric (2019)
work page 2019
-
[7]
Physics Reports486(3–5), 75–174 (Feb 2010)
Fortunato, S.: Community detection in graphs. Physics Reports486(3–5), 75–174 (Feb 2010)
work page 2010
-
[8]
Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. p. 1263–1272. ICML’17 (2017) 16 Lixing et al
work page 2017
-
[9]
AAAI’19/IAAI’19/EAAI’19 (2019)
Guo, S., Lin, Y., Feng, N., Song, C., Wan, H.: Attention based spatial- temporal graph convolutional networks for traffic flow forecasting. AAAI’19/IAAI’19/EAAI’19 (2019)
work page 2019
-
[10]
Hoefler, T., Alistarh, D., Ben-Nun, T., Dryden, N., Peste, A.: Sparsity in deep learning: pruning and growth for efficient inference and training in neural networks. J. Mach. Learn. Res.22(1) (jan 2021)
work page 2021
-
[11]
Hong, C., Sukumaran-Rajam, A., Nisa, I., Singh, K., Sadayappan, P.: Adaptive sparse tiling for sparse matrix multiplication. p. 300–314. PPoPP ’19 (2019)
work page 2019
-
[12]
Advances in neural information processing systems33, 22118–22133 (2020)
Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., Leskovec, J.: Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems33, 22118–22133 (2020)
work page 2020
- [13]
-
[14]
Huang, K., Zhai, J., Zheng, Z., Yi, Y., Shen, X.: Understanding and bridging the gaps in current gnn performance optimizations. p. 119–132. PPoPP ’21 (2021)
work page 2021
-
[15]
Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks (2017)
work page 2017
-
[16]
Leskovec, J., Krevl, A.: Snap datasets: Stanford large network dataset collection (2014)
work page 2014
-
[17]
Nature Biomedical Engineering6(12), 1353–1369 (2022)
Li, M.M., Huang, K., Zitnik, M.: Graph representation learning in biomedicine and healthcare. Nature Biomedical Engineering6(12), 1353–1369 (2022)
work page 2022
-
[18]
Li, S., Osawa, K., Hoefler, T.: Efficient quantized sparse matrix operations on tensor cores. SC ’22 (2022)
work page 2022
-
[19]
Merrill, D., Garland, M.: Merge-based sparse matrix-vector multiplication (spmv) using the csr storage format. PPoPP ’16 (2016)
work page 2016
-
[20]
In: GPU Technology Conference (2010)
Naumov, M., Chien, L., Vandermersch, P., Kapasi, U.: Cusparse library. In: GPU Technology Conference (2010)
work page 2010
-
[21]
Sala, A., Zheng, H., Zhao, B.Y., Gaito, S., Rossi, G.P.: Brief announcement: re- visiting the power-law degree distribution for social graph analysis. p. 400–401. PODC ’10 (2010)
work page 2010
-
[22]
Shao, Y., Li, H., Gu, X., Yin, H., Li, Y., Miao, X., Zhang, W., Cui, B., Chen, L.: Distributed graph neural network training: A survey. ACM Comput. Surv.56(8) (Apr 2024).https://doi.org/10.1145/3648358,https://doi.org/ 10.1145/3648358
-
[23]
Volkov, V., Demmel, J.W.: Benchmarking gpus to tune dense linear algebra. In: SC ’08. pp. 1–11 (2008)
work page 2008
-
[24]
Wang, M., Zheng, D., Ye, Z., Gan, Q., Li, M., Song, X., Zhou, J., Ma, C., Yu, L., Gai, Y., Xiao, T., He, T., Karypis, G., Li, J., Zhang, Z.: Deep graph library: A graph-centric, highly-performant package for graph neural networks (2020)
work page 2020
-
[25]
Wang, Y., Feng, B., Li, G., Li, S., Deng, L., Xie, Y., Ding, Y.: Gnnadvisor: An adaptive and efficient runtime system for gnn acceleration on gpus. In: (OSDI 21). pp. 515–531 (2021)
work page 2021
-
[26]
Wei, H., Yu, J.X., Lu, C., Lin, X.: Speedup graph processing by graph ordering. p. 1813–1828. SIGMOD ’16 (2016)
work page 2016
-
[27]
Wu, S., Sun, F., Zhang, W., Xie, X., Cui, B.: Graph neural networks in recom- mender systems: A survey. ACM Comput. Surv.55(5) (dec 2022)
work page 2022
-
[28]
Xia, H., Zheng, Z., Li, Y., Zhuang, D., Zhou, Z., Qiu, X., Li, Y., Lin, W., Song, S.L.: Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity. arXiv preprint arXiv:2309.10285 (2023)
-
[29]
Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks? (2019) ParamSpMM 17
work page 2019
-
[30]
Yang,C.,Buluç,A.,Owens,J.D.:Designprinciplesforsparsematrixmultiplication on the gpu. p. 672–687. Euro-Par ’18 (2018)
work page 2018
-
[31]
Yang, X., Parthasarathy, S., Sadayappan, P.: Fast sparse matrix-vector multipli- cation on gpus: Implications for graph mining. Proc. VLDB Endow.4(4), 231–242 (jan 2011)
work page 2011
-
[32]
ICS ’22, New York, NY, USA (2022)
Yesil, S., Moreira, J.E., Torrellas, J.: Dense dynamic blocks: optimizing spmm for processors with vector and matrix units using machine learning techniques. ICS ’22, New York, NY, USA (2022)
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.