pith. sign in

arxiv: 2605.15695 · v1 · pith:QZYS7BZ6new · submitted 2026-05-15 · 💻 cs.DC

ParamSpMM: Adaptive and Efficient Sparse Matrix-Matrix Multiplication on GPUs for GNNs

Pith reviewed 2026-05-19 19:32 UTC · model grok-4.3

classification 💻 cs.DC
keywords sparse matrix-matrix multiplicationgraph neural networksGPU accelerationadaptive algorithmsmachine learning for systemsperformance optimizationcompressed sparse row
0
0 comments X

The pith

ParamSpMM adapts SpMM on GPUs to graph inputs via a flexible data structure and ML predictor for better GNN efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that existing SpMM implementations on GPUs fail to adapt well to the wide variety of graph structures encountered in GNN workloads. It establishes a parametric method that encodes multiple optimizations into one framework so the right combination can be chosen for each input. The central mechanism is a new data layout that makes those choices explicit and an ML model that picks the layout and settings from observable graph features. If the approach holds, GNN training runs faster without changing the underlying graph algorithms or hardware. This matters because SpMM dominates runtime in many GNN pipelines and small constant-factor gains compound across large models and datasets.

Core claim

ParamSpMM is a parametric SpMM design built around the Parameterized Compressed Sparse Row (PCSR) format that can embed and combine prior optimization techniques, together with an ML-based SpMM-decider that selects the best configuration from a small set of input features; evaluations show this combination delivers an average 1.92x speedup over Nvidia cuSPARSE across GNN workloads.

What carries the argument

The Parameterized Compressed Sparse Row (PCSR) data structure that encodes existing SpMM optimizations as configurable parameters so they can be mixed and matched per input graph.

If this is right

  • SpMM kernels can be made input-aware without rewriting low-level code for each new graph family.
  • GNN training throughput increases by roughly a factor of two on average when the adaptive choice replaces a fixed library call.
  • Multiple existing SpMM optimizations become complementary rather than mutually exclusive once they are parameterized together.
  • Feature-based prediction can replace hand-tuned or exhaustive search for kernel configuration at runtime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same parametric-plus-predictor pattern could be applied to other sparse kernels such as SpMV or sampled dense-dense multiplication inside GNN layers.
  • If the feature set proves insufficient on new graph distributions, adding cheap structural summaries like degree histograms or clustering coefficients would be a direct next step.
  • Hardware vendors could embed similar deciders inside future cuSPARSE or cuBLAS releases so users obtain the benefit without changing application code.

Load-bearing premise

The ML-based SpMM-decider can reliably predict optimal configurations from the crafted input features across diverse graph characteristics.

What would settle it

A set of previously unseen graphs where the ML decider selects a configuration whose measured SpMM time is consistently worse than the best static or cuSPARSE baseline on the same hardware.

Figures

Figures reproduced from arXiv: 2605.15695 by Guanhua Ye, Hongzheng Li, Lixing Zhang, Shigang Li, Yingxia Shao.

Figure 1
Figure 1. Figure 1: Throughputs of SpMM with or without workload balancing [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of ParamSpMM: a three-phase workflow for SpMM in GNNs [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: PCSR under four different configurations. TW: Thread Warp. Data Representation of PCSR. PCSR represents a sparse matrix via four arrays: rowP tr, colIdx, val, and T Row, arranging matrix elements into V × 1 nonzero vectors. rowP tr defines the traversal range of nonzero vectors and indi￾cates each thread warp’s workload l. val and colIdx store the values and column indices of nonzero vectors. T Row is for … view at source ↗
Figure 4
Figure 4. Figure 4: The performance of SpMM across various dim. The reported speedup is nor￾malized to cuSPARSE. ParamSpMM configurations based on the crafted input features, enabling sys￾tematic and data-driven SpMM optimization. With the sparse matrix features listed in [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: GCN and GIN performance speedup normalized to DGL. 7 Related Work In this section, we discuss the related works for SpMM optimization. Blocking. Blocking techniques [11,18,28,32] exploit data locality. FlashLLM [28] and Magicube [18] utilize the high throughput tensor core units to facilitate SpMM in sparse DNN training [10], where the sparse matrices are relatively dense and uniform. Other studies [11, 31… view at source ↗
read the original abstract

Fueled by the ability to mine real-world graph data, GNN applications have experienced phenomenal growth. Sparse Matrix-Matrix Multiplication (SpMM) is a critical operator in GNNs. However, existing SpMM designs for GNNs struggle to adapt to diverse input characteristics. In this paper, we first conduct a comprehensive analysis of existing SpMM optimizations, revealing their limitations through statistical and empirical evidence. Based on this analysis, we introduce ParamSpMM, a parametric approach for highly adaptive and efficient SpMM computation in GNNs. It incorporates a new data structure, the Parameterized Compressed Sparse Row (PCSR), to flexibly integrate existing optimization techniques. ParamSpMM enables the configuration of these optimization techniques according to various input characteristics. Furthermore, we complement ParamSpMM with an ML-based SpMM-decider that predicts optimal configurations based on carefully crafted input features. Our evaluations demonstrate that ParamSpMM outperforms Nvidia cuSPARSE with an average speedup of 1.92x, significantly enhancing GNN training efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes ParamSpMM, a parametric approach for adaptive and efficient SpMM on GPUs targeted at GNN workloads. It introduces the Parameterized Compressed Sparse Row (PCSR) data structure to flexibly integrate existing optimizations and pairs it with an ML-based SpMM-decider that selects configurations using crafted input features. The central empirical claim is an average 1.92x speedup over Nvidia cuSPARSE.

Significance. If the performance results and the reliability of the ML decider hold across diverse graphs, the work would offer a practical advance in GPU SpMM kernels for GNN training by addressing the limitations of fixed optimizations through adaptive parameterization.

major comments (1)
  1. The 1.92x speedup claim depends on the ML-based SpMM-decider reliably mapping input features to optimal PCSR configurations. No accuracy metrics, cross-validation results, or fallback analysis for prediction errors on graphs with unseen degree distributions or sparsity patterns are reported, which is load-bearing for the adaptive advantage over cuSPARSE.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the major comment point by point below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: The 1.92x speedup claim depends on the ML-based SpMM-decider reliably mapping input features to optimal PCSR configurations. No accuracy metrics, cross-validation results, or fallback analysis for prediction errors on graphs with unseen degree distributions or sparsity patterns are reported, which is load-bearing for the adaptive advantage over cuSPARSE.

    Authors: We agree that the reliability of the SpMM-decider is central to the claimed adaptive benefits and that the original submission would be strengthened by additional validation details. The reported 1.92x average speedup is measured end-to-end using the decider on the evaluated GNN workloads and graphs. However, we did not include explicit accuracy metrics, cross-validation statistics, or dedicated fallback analysis in the initial manuscript. In the revised version we will add a new subsection under the SpMM-decider description that reports (1) prediction accuracy on held-out test graphs, (2) k-fold cross-validation results across multiple graph collections, and (3) an analysis of fallback behavior (defaulting to a conservative PCSR configuration) when the model predicts sub-optimally. We will also include a small set of additional experiments on graphs whose degree distributions and sparsity patterns were not present in the training data to demonstrate generalization. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system evaluation with measured speedups

full rationale

The paper proposes ParamSpMM using a new PCSR data structure to combine optimizations and an ML-based decider trained on crafted features to select configurations. All performance claims (e.g., 1.92x average speedup over cuSPARSE) are presented as direct empirical measurements from evaluations on GNN workloads. No mathematical derivations, equations, or predictions are shown that reduce by construction to fitted inputs, self-citations, or ansatzes. The ML decider's role is a design choice whose reliability is assessed via runtime results rather than assumed tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review conducted from abstract only; no explicit free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.0 · 5722 in / 973 out tokens · 29656 ms · 2026-05-19T19:32:01.448336+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

  1. [1]

    In: (IPDPS)

    Arai, J., Shiokawa, H., Yamamuro, T., Onizuka, M., Iwamura, S.: Rabbit order: Just-in-time parallel reordering for fast graph analysis. In: (IPDPS). pp. 22–31 (2016)

  2. [2]

    Bader, D.A., Meyerhenke, H., Sanders, P., Wagner, D.: 10th dimacs implementa- tion challenge-graph partitioning and graph clustering (2011)

  3. [3]

    Waschington University in St

    Chandrasekaran, B.: Survey of network traffic models. Waschington University in St. Louis CSE567(2009)

  4. [4]

    Dai, G., Huang, G., Yang, S., Yu, Z., Zhang, H., Ding, Y., Xie, Y., Yang, H., Wang, Y.: Heuristic adaptability to input dynamics for spmm on gpus. p. 595–600. DAC ’22 (2022)

  5. [5]

    In: (IPDPS)

    Fan, R., Wang, W., Chu, X.: Fast sparse gpu kernels for accelerated training of graph neural networks. In: (IPDPS). pp. 501–511 (2023)

  6. [6]

    Fey, M., Lenssen, J.E.: Fast graph representation learning with pytorch geometric (2019)

  7. [7]

    Physics Reports486(3–5), 75–174 (Feb 2010)

    Fortunato, S.: Community detection in graphs. Physics Reports486(3–5), 75–174 (Feb 2010)

  8. [8]

    Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. p. 1263–1272. ICML’17 (2017) 16 Lixing et al

  9. [9]

    AAAI’19/IAAI’19/EAAI’19 (2019)

    Guo, S., Lin, Y., Feng, N., Song, C., Wan, H.: Attention based spatial- temporal graph convolutional networks for traffic flow forecasting. AAAI’19/IAAI’19/EAAI’19 (2019)

  10. [10]

    Hoefler, T., Alistarh, D., Ben-Nun, T., Dryden, N., Peste, A.: Sparsity in deep learning: pruning and growth for efficient inference and training in neural networks. J. Mach. Learn. Res.22(1) (jan 2021)

  11. [11]

    Hong, C., Sukumaran-Rajam, A., Nisa, I., Singh, K., Sadayappan, P.: Adaptive sparse tiling for sparse matrix multiplication. p. 300–314. PPoPP ’19 (2019)

  12. [12]

    Advances in neural information processing systems33, 22118–22133 (2020)

    Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., Leskovec, J.: Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems33, 22118–22133 (2020)

  13. [13]

    In: SC20

    Huang, G., Dai, G., Wang, Y., Yang, H.: Ge-spmm: General-purpose sparse matrix- matrix multiplication on gpus for graph neural networks. In: SC20. pp. 1–12 (2020)

  14. [14]

    Huang, K., Zhai, J., Zheng, Z., Yi, Y., Shen, X.: Understanding and bridging the gaps in current gnn performance optimizations. p. 119–132. PPoPP ’21 (2021)

  15. [15]

    Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks (2017)

  16. [16]

    Leskovec, J., Krevl, A.: Snap datasets: Stanford large network dataset collection (2014)

  17. [17]

    Nature Biomedical Engineering6(12), 1353–1369 (2022)

    Li, M.M., Huang, K., Zitnik, M.: Graph representation learning in biomedicine and healthcare. Nature Biomedical Engineering6(12), 1353–1369 (2022)

  18. [18]

    SC ’22 (2022)

    Li, S., Osawa, K., Hoefler, T.: Efficient quantized sparse matrix operations on tensor cores. SC ’22 (2022)

  19. [19]

    PPoPP ’16 (2016)

    Merrill, D., Garland, M.: Merge-based sparse matrix-vector multiplication (spmv) using the csr storage format. PPoPP ’16 (2016)

  20. [20]

    In: GPU Technology Conference (2010)

    Naumov, M., Chien, L., Vandermersch, P., Kapasi, U.: Cusparse library. In: GPU Technology Conference (2010)

  21. [21]

    Sala, A., Zheng, H., Zhao, B.Y., Gaito, S., Rossi, G.P.: Brief announcement: re- visiting the power-law degree distribution for social graph analysis. p. 400–401. PODC ’10 (2010)

  22. [22]

    ACM Comput

    Shao, Y., Li, H., Gu, X., Yin, H., Li, Y., Miao, X., Zhang, W., Cui, B., Chen, L.: Distributed graph neural network training: A survey. ACM Comput. Surv.56(8) (Apr 2024).https://doi.org/10.1145/3648358,https://doi.org/ 10.1145/3648358

  23. [23]

    In: SC ’08

    Volkov, V., Demmel, J.W.: Benchmarking gpus to tune dense linear algebra. In: SC ’08. pp. 1–11 (2008)

  24. [24]

    Wang, M., Zheng, D., Ye, Z., Gan, Q., Li, M., Song, X., Zhou, J., Ma, C., Yu, L., Gai, Y., Xiao, T., He, T., Karypis, G., Li, J., Zhang, Z.: Deep graph library: A graph-centric, highly-performant package for graph neural networks (2020)

  25. [25]

    In: (OSDI 21)

    Wang, Y., Feng, B., Li, G., Li, S., Deng, L., Xie, Y., Ding, Y.: Gnnadvisor: An adaptive and efficient runtime system for gnn acceleration on gpus. In: (OSDI 21). pp. 515–531 (2021)

  26. [26]

    Wei, H., Yu, J.X., Lu, C., Lin, X.: Speedup graph processing by graph ordering. p. 1813–1828. SIGMOD ’16 (2016)

  27. [27]

    ACM Comput

    Wu, S., Sun, F., Zhang, W., Xie, X., Cui, B.: Graph neural networks in recom- mender systems: A survey. ACM Comput. Surv.55(5) (dec 2022)

  28. [28]

    Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity

    Xia, H., Zheng, Z., Li, Y., Zhuang, D., Zhou, Z., Qiu, X., Li, Y., Lin, W., Song, S.L.: Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity. arXiv preprint arXiv:2309.10285 (2023)

  29. [29]

    Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks? (2019) ParamSpMM 17

  30. [30]

    Yang,C.,Buluç,A.,Owens,J.D.:Designprinciplesforsparsematrixmultiplication on the gpu. p. 672–687. Euro-Par ’18 (2018)

  31. [31]

    Yang, X., Parthasarathy, S., Sadayappan, P.: Fast sparse matrix-vector multipli- cation on gpus: Implications for graph mining. Proc. VLDB Endow.4(4), 231–242 (jan 2011)

  32. [32]

    ICS ’22, New York, NY, USA (2022)

    Yesil, S., Moreira, J.E., Torrellas, J.: Dense dynamic blocks: optimizing spmm for processors with vector and matrix units using machine learning techniques. ICS ’22, New York, NY, USA (2022)