ParamSpMM: Adaptive and Efficient Sparse Matrix-Matrix Multiplication on GPUs for GNNs

Guanhua Ye; Hongzheng Li; Lixing Zhang; Shigang Li; Yingxia Shao

arxiv: 2605.15695 · v1 · pith:QZYS7BZ6new · submitted 2026-05-15 · 💻 cs.DC

ParamSpMM: Adaptive and Efficient Sparse Matrix-Matrix Multiplication on GPUs for GNNs

Lixing Zhang , Guanhua Ye , Hongzheng Li , Shigang Li , Yingxia Shao This is my paper

Pith reviewed 2026-05-19 19:32 UTC · model grok-4.3

classification 💻 cs.DC

keywords sparse matrix-matrix multiplicationgraph neural networksGPU accelerationadaptive algorithmsmachine learning for systemsperformance optimizationcompressed sparse row

0 comments

The pith

ParamSpMM adapts SpMM on GPUs to graph inputs via a flexible data structure and ML predictor for better GNN efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that existing SpMM implementations on GPUs fail to adapt well to the wide variety of graph structures encountered in GNN workloads. It establishes a parametric method that encodes multiple optimizations into one framework so the right combination can be chosen for each input. The central mechanism is a new data layout that makes those choices explicit and an ML model that picks the layout and settings from observable graph features. If the approach holds, GNN training runs faster without changing the underlying graph algorithms or hardware. This matters because SpMM dominates runtime in many GNN pipelines and small constant-factor gains compound across large models and datasets.

Core claim

ParamSpMM is a parametric SpMM design built around the Parameterized Compressed Sparse Row (PCSR) format that can embed and combine prior optimization techniques, together with an ML-based SpMM-decider that selects the best configuration from a small set of input features; evaluations show this combination delivers an average 1.92x speedup over Nvidia cuSPARSE across GNN workloads.

What carries the argument

The Parameterized Compressed Sparse Row (PCSR) data structure that encodes existing SpMM optimizations as configurable parameters so they can be mixed and matched per input graph.

If this is right

SpMM kernels can be made input-aware without rewriting low-level code for each new graph family.
GNN training throughput increases by roughly a factor of two on average when the adaptive choice replaces a fixed library call.
Multiple existing SpMM optimizations become complementary rather than mutually exclusive once they are parameterized together.
Feature-based prediction can replace hand-tuned or exhaustive search for kernel configuration at runtime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same parametric-plus-predictor pattern could be applied to other sparse kernels such as SpMV or sampled dense-dense multiplication inside GNN layers.
If the feature set proves insufficient on new graph distributions, adding cheap structural summaries like degree histograms or clustering coefficients would be a direct next step.
Hardware vendors could embed similar deciders inside future cuSPARSE or cuBLAS releases so users obtain the benefit without changing application code.

Load-bearing premise

The ML-based SpMM-decider can reliably predict optimal configurations from the crafted input features across diverse graph characteristics.

What would settle it

A set of previously unseen graphs where the ML decider selects a configuration whose measured SpMM time is consistently worse than the best static or cuSPARSE baseline on the same hardware.

Figures

Figures reproduced from arXiv: 2605.15695 by Guanhua Ye, Hongzheng Li, Lixing Zhang, Shigang Li, Yingxia Shao.

**Figure 2.** Figure 2: Overview of ParamSpMM: a three-phase workflow for SpMM in GNNs [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: PCSR under four different configurations. TW: Thread Warp. Data Representation of PCSR. PCSR represents a sparse matrix via four arrays: rowP tr, colIdx, val, and T Row, arranging matrix elements into V × 1 nonzero vectors. rowP tr defines the traversal range of nonzero vectors and indicates each thread warp’s workload l. val and colIdx store the values and column indices of nonzero vectors. T Row is for … view at source ↗

**Figure 4.** Figure 4: The performance of SpMM across various dim. The reported speedup is normalized to cuSPARSE. ParamSpMM configurations based on the crafted input features, enabling systematic and data-driven SpMM optimization. With the sparse matrix features listed in [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

**Figure 5.** Figure 5: GCN and GIN performance speedup normalized to DGL. 7 Related Work In this section, we discuss the related works for SpMM optimization. Blocking. Blocking techniques [11,18,28,32] exploit data locality. FlashLLM [28] and Magicube [18] utilize the high throughput tensor core units to facilitate SpMM in sparse DNN training [10], where the sparse matrices are relatively dense and uniform. Other studies [11, 31… view at source ↗

read the original abstract

Fueled by the ability to mine real-world graph data, GNN applications have experienced phenomenal growth. Sparse Matrix-Matrix Multiplication (SpMM) is a critical operator in GNNs. However, existing SpMM designs for GNNs struggle to adapt to diverse input characteristics. In this paper, we first conduct a comprehensive analysis of existing SpMM optimizations, revealing their limitations through statistical and empirical evidence. Based on this analysis, we introduce ParamSpMM, a parametric approach for highly adaptive and efficient SpMM computation in GNNs. It incorporates a new data structure, the Parameterized Compressed Sparse Row (PCSR), to flexibly integrate existing optimization techniques. ParamSpMM enables the configuration of these optimization techniques according to various input characteristics. Furthermore, we complement ParamSpMM with an ML-based SpMM-decider that predicts optimal configurations based on carefully crafted input features. Our evaluations demonstrate that ParamSpMM outperforms Nvidia cuSPARSE with an average speedup of 1.92x, significantly enhancing GNN training efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ParamSpMM introduces a PCSR format and ML decider to adapt SpMM optimizations to graph variety in GNNs, but the 1.92x speedup claim hinges on untested prediction reliability.

read the letter

The core of this paper is a parametric SpMM that uses a new PCSR data structure to turn standard optimizations on or off depending on the input matrix, paired with an ML model that picks the settings from a few hand-crafted features. They back the idea with an analysis showing where existing GPU SpMM kernels lose out on the range of sparsity and degree patterns that show up in GNN workloads. That analysis is the part that lands cleanly; it gives concrete statistical and timing evidence for why a one-size-fits-all kernel is not enough. The parametric wrapper itself is a straightforward engineering move that avoids having to maintain separate code paths for each optimization combination. Credit there for making the integration explicit rather than burying it in heuristics. The ML decider is the part that still needs work. The reported 1.92x average over cuSPARSE only holds if the predictor maps the features to the right configuration most of the time. The abstract gives no accuracy numbers, no cross-validation results, and no fallback behavior when the model is wrong on graphs with different scale or degree distributions. Without those, the performance number is hard to trust as a general result. Experiments also lack any mention of variance across runs or how many distinct graph families were tested, which makes it difficult to judge whether the average is stable. This paper is aimed at people who implement or tune GNN training stacks on GPUs and want a more flexible SpMM primitive. A systems reader who already works on sparse kernels will see the value in the PCSR parameterization even if they end up replacing the ML component. It is worth sending to peer review so the authors can supply the missing predictor metrics and a fuller experimental breakdown; the underlying idea is concrete enough to be worth checking.

Referee Report

1 major / 0 minor

Summary. The paper proposes ParamSpMM, a parametric approach for adaptive and efficient SpMM on GPUs targeted at GNN workloads. It introduces the Parameterized Compressed Sparse Row (PCSR) data structure to flexibly integrate existing optimizations and pairs it with an ML-based SpMM-decider that selects configurations using crafted input features. The central empirical claim is an average 1.92x speedup over Nvidia cuSPARSE.

Significance. If the performance results and the reliability of the ML decider hold across diverse graphs, the work would offer a practical advance in GPU SpMM kernels for GNN training by addressing the limitations of fixed optimizations through adaptive parameterization.

major comments (1)

The 1.92x speedup claim depends on the ML-based SpMM-decider reliably mapping input features to optimal PCSR configurations. No accuracy metrics, cross-validation results, or fallback analysis for prediction errors on graphs with unseen degree distributions or sparsity patterns are reported, which is load-bearing for the adaptive advantage over cuSPARSE.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the major comment point by point below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: The 1.92x speedup claim depends on the ML-based SpMM-decider reliably mapping input features to optimal PCSR configurations. No accuracy metrics, cross-validation results, or fallback analysis for prediction errors on graphs with unseen degree distributions or sparsity patterns are reported, which is load-bearing for the adaptive advantage over cuSPARSE.

Authors: We agree that the reliability of the SpMM-decider is central to the claimed adaptive benefits and that the original submission would be strengthened by additional validation details. The reported 1.92x average speedup is measured end-to-end using the decider on the evaluated GNN workloads and graphs. However, we did not include explicit accuracy metrics, cross-validation statistics, or dedicated fallback analysis in the initial manuscript. In the revised version we will add a new subsection under the SpMM-decider description that reports (1) prediction accuracy on held-out test graphs, (2) k-fold cross-validation results across multiple graph collections, and (3) an analysis of fallback behavior (defaulting to a conservative PCSR configuration) when the model predicts sub-optimally. We will also include a small set of additional experiments on graphs whose degree distributions and sparsity patterns were not present in the training data to demonstrate generalization. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system evaluation with measured speedups

full rationale

The paper proposes ParamSpMM using a new PCSR data structure to combine optimizations and an ML-based decider trained on crafted features to select configurations. All performance claims (e.g., 1.92x average speedup over cuSPARSE) are presented as direct empirical measurements from evaluations on GNN workloads. No mathematical derivations, equations, or predictions are shown that reduce by construction to fitted inputs, self-citations, or ansatzes. The ML decider's role is a design choice whose reliability is assessed via runtime results rather than assumed tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review conducted from abstract only; no explicit free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.0 · 5722 in / 973 out tokens · 29656 ms · 2026-05-19T19:32:01.448336+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

[1]

In: (IPDPS)

Arai, J., Shiokawa, H., Yamamuro, T., Onizuka, M., Iwamura, S.: Rabbit order: Just-in-time parallel reordering for fast graph analysis. In: (IPDPS). pp. 22–31 (2016)

work page 2016
[2]

Bader, D.A., Meyerhenke, H., Sanders, P., Wagner, D.: 10th dimacs implementa- tion challenge-graph partitioning and graph clustering (2011)

work page 2011
[3]

Waschington University in St

Chandrasekaran, B.: Survey of network traffic models. Waschington University in St. Louis CSE567(2009)

work page 2009
[4]

Dai, G., Huang, G., Yang, S., Yu, Z., Zhang, H., Ding, Y., Xie, Y., Yang, H., Wang, Y.: Heuristic adaptability to input dynamics for spmm on gpus. p. 595–600. DAC ’22 (2022)

work page 2022
[5]

In: (IPDPS)

Fan, R., Wang, W., Chu, X.: Fast sparse gpu kernels for accelerated training of graph neural networks. In: (IPDPS). pp. 501–511 (2023)

work page 2023
[6]

Fey, M., Lenssen, J.E.: Fast graph representation learning with pytorch geometric (2019)

work page 2019
[7]

Physics Reports486(3–5), 75–174 (Feb 2010)

Fortunato, S.: Community detection in graphs. Physics Reports486(3–5), 75–174 (Feb 2010)

work page 2010
[8]

Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. p. 1263–1272. ICML’17 (2017) 16 Lixing et al

work page 2017
[9]

AAAI’19/IAAI’19/EAAI’19 (2019)

Guo, S., Lin, Y., Feng, N., Song, C., Wan, H.: Attention based spatial- temporal graph convolutional networks for traffic flow forecasting. AAAI’19/IAAI’19/EAAI’19 (2019)

work page 2019
[10]

Hoefler, T., Alistarh, D., Ben-Nun, T., Dryden, N., Peste, A.: Sparsity in deep learning: pruning and growth for efficient inference and training in neural networks. J. Mach. Learn. Res.22(1) (jan 2021)

work page 2021
[11]

Hong, C., Sukumaran-Rajam, A., Nisa, I., Singh, K., Sadayappan, P.: Adaptive sparse tiling for sparse matrix multiplication. p. 300–314. PPoPP ’19 (2019)

work page 2019
[12]

Advances in neural information processing systems33, 22118–22133 (2020)

Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., Leskovec, J.: Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems33, 22118–22133 (2020)

work page 2020
[13]

In: SC20

Huang, G., Dai, G., Wang, Y., Yang, H.: Ge-spmm: General-purpose sparse matrix- matrix multiplication on gpus for graph neural networks. In: SC20. pp. 1–12 (2020)

work page 2020
[14]

Huang, K., Zhai, J., Zheng, Z., Yi, Y., Shen, X.: Understanding and bridging the gaps in current gnn performance optimizations. p. 119–132. PPoPP ’21 (2021)

work page 2021
[15]

Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks (2017)

work page 2017
[16]

Leskovec, J., Krevl, A.: Snap datasets: Stanford large network dataset collection (2014)

work page 2014
[17]

Nature Biomedical Engineering6(12), 1353–1369 (2022)

Li, M.M., Huang, K., Zitnik, M.: Graph representation learning in biomedicine and healthcare. Nature Biomedical Engineering6(12), 1353–1369 (2022)

work page 2022
[18]

SC ’22 (2022)

Li, S., Osawa, K., Hoefler, T.: Efficient quantized sparse matrix operations on tensor cores. SC ’22 (2022)

work page 2022
[19]

PPoPP ’16 (2016)

Merrill, D., Garland, M.: Merge-based sparse matrix-vector multiplication (spmv) using the csr storage format. PPoPP ’16 (2016)

work page 2016
[20]

In: GPU Technology Conference (2010)

Naumov, M., Chien, L., Vandermersch, P., Kapasi, U.: Cusparse library. In: GPU Technology Conference (2010)

work page 2010
[21]

Sala, A., Zheng, H., Zhao, B.Y., Gaito, S., Rossi, G.P.: Brief announcement: re- visiting the power-law degree distribution for social graph analysis. p. 400–401. PODC ’10 (2010)

work page 2010
[22]

ACM Comput

Shao, Y., Li, H., Gu, X., Yin, H., Li, Y., Miao, X., Zhang, W., Cui, B., Chen, L.: Distributed graph neural network training: A survey. ACM Comput. Surv.56(8) (Apr 2024).https://doi.org/10.1145/3648358,https://doi.org/ 10.1145/3648358

work page doi:10.1145/3648358 2024
[23]

In: SC ’08

Volkov, V., Demmel, J.W.: Benchmarking gpus to tune dense linear algebra. In: SC ’08. pp. 1–11 (2008)

work page 2008
[24]

Wang, M., Zheng, D., Ye, Z., Gan, Q., Li, M., Song, X., Zhou, J., Ma, C., Yu, L., Gai, Y., Xiao, T., He, T., Karypis, G., Li, J., Zhang, Z.: Deep graph library: A graph-centric, highly-performant package for graph neural networks (2020)

work page 2020
[25]

In: (OSDI 21)

Wang, Y., Feng, B., Li, G., Li, S., Deng, L., Xie, Y., Ding, Y.: Gnnadvisor: An adaptive and efficient runtime system for gnn acceleration on gpus. In: (OSDI 21). pp. 515–531 (2021)

work page 2021
[26]

Wei, H., Yu, J.X., Lu, C., Lin, X.: Speedup graph processing by graph ordering. p. 1813–1828. SIGMOD ’16 (2016)

work page 2016
[27]

ACM Comput

Wu, S., Sun, F., Zhang, W., Xie, X., Cui, B.: Graph neural networks in recom- mender systems: A survey. ACM Comput. Surv.55(5) (dec 2022)

work page 2022
[28]

Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity

Xia, H., Zheng, Z., Li, Y., Zhuang, D., Zhou, Z., Qiu, X., Li, Y., Lin, W., Song, S.L.: Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity. arXiv preprint arXiv:2309.10285 (2023)

work page arXiv 2023
[29]

Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks? (2019) ParamSpMM 17

work page 2019
[30]

Yang,C.,Buluç,A.,Owens,J.D.:Designprinciplesforsparsematrixmultiplication on the gpu. p. 672–687. Euro-Par ’18 (2018)

work page 2018
[31]

Yang, X., Parthasarathy, S., Sadayappan, P.: Fast sparse matrix-vector multipli- cation on gpus: Implications for graph mining. Proc. VLDB Endow.4(4), 231–242 (jan 2011)

work page 2011
[32]

ICS ’22, New York, NY, USA (2022)

Yesil, S., Moreira, J.E., Torrellas, J.: Dense dynamic blocks: optimizing spmm for processors with vector and matrix units using machine learning techniques. ICS ’22, New York, NY, USA (2022)

work page 2022

[1] [1]

In: (IPDPS)

Arai, J., Shiokawa, H., Yamamuro, T., Onizuka, M., Iwamura, S.: Rabbit order: Just-in-time parallel reordering for fast graph analysis. In: (IPDPS). pp. 22–31 (2016)

work page 2016

[2] [2]

Bader, D.A., Meyerhenke, H., Sanders, P., Wagner, D.: 10th dimacs implementa- tion challenge-graph partitioning and graph clustering (2011)

work page 2011

[3] [3]

Waschington University in St

Chandrasekaran, B.: Survey of network traffic models. Waschington University in St. Louis CSE567(2009)

work page 2009

[4] [4]

Dai, G., Huang, G., Yang, S., Yu, Z., Zhang, H., Ding, Y., Xie, Y., Yang, H., Wang, Y.: Heuristic adaptability to input dynamics for spmm on gpus. p. 595–600. DAC ’22 (2022)

work page 2022

[5] [5]

In: (IPDPS)

Fan, R., Wang, W., Chu, X.: Fast sparse gpu kernels for accelerated training of graph neural networks. In: (IPDPS). pp. 501–511 (2023)

work page 2023

[6] [6]

Fey, M., Lenssen, J.E.: Fast graph representation learning with pytorch geometric (2019)

work page 2019

[7] [7]

Physics Reports486(3–5), 75–174 (Feb 2010)

Fortunato, S.: Community detection in graphs. Physics Reports486(3–5), 75–174 (Feb 2010)

work page 2010

[8] [8]

Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. p. 1263–1272. ICML’17 (2017) 16 Lixing et al

work page 2017

[9] [9]

AAAI’19/IAAI’19/EAAI’19 (2019)

Guo, S., Lin, Y., Feng, N., Song, C., Wan, H.: Attention based spatial- temporal graph convolutional networks for traffic flow forecasting. AAAI’19/IAAI’19/EAAI’19 (2019)

work page 2019

[10] [10]

Hoefler, T., Alistarh, D., Ben-Nun, T., Dryden, N., Peste, A.: Sparsity in deep learning: pruning and growth for efficient inference and training in neural networks. J. Mach. Learn. Res.22(1) (jan 2021)

work page 2021

[11] [11]

Hong, C., Sukumaran-Rajam, A., Nisa, I., Singh, K., Sadayappan, P.: Adaptive sparse tiling for sparse matrix multiplication. p. 300–314. PPoPP ’19 (2019)

work page 2019

[12] [12]

Advances in neural information processing systems33, 22118–22133 (2020)

Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., Leskovec, J.: Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems33, 22118–22133 (2020)

work page 2020

[13] [13]

In: SC20

Huang, G., Dai, G., Wang, Y., Yang, H.: Ge-spmm: General-purpose sparse matrix- matrix multiplication on gpus for graph neural networks. In: SC20. pp. 1–12 (2020)

work page 2020

[14] [14]

Huang, K., Zhai, J., Zheng, Z., Yi, Y., Shen, X.: Understanding and bridging the gaps in current gnn performance optimizations. p. 119–132. PPoPP ’21 (2021)

work page 2021

[15] [15]

Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks (2017)

work page 2017

[16] [16]

Leskovec, J., Krevl, A.: Snap datasets: Stanford large network dataset collection (2014)

work page 2014

[17] [17]

Nature Biomedical Engineering6(12), 1353–1369 (2022)

Li, M.M., Huang, K., Zitnik, M.: Graph representation learning in biomedicine and healthcare. Nature Biomedical Engineering6(12), 1353–1369 (2022)

work page 2022

[18] [18]

SC ’22 (2022)

Li, S., Osawa, K., Hoefler, T.: Efficient quantized sparse matrix operations on tensor cores. SC ’22 (2022)

work page 2022

[19] [19]

PPoPP ’16 (2016)

Merrill, D., Garland, M.: Merge-based sparse matrix-vector multiplication (spmv) using the csr storage format. PPoPP ’16 (2016)

work page 2016

[20] [20]

In: GPU Technology Conference (2010)

Naumov, M., Chien, L., Vandermersch, P., Kapasi, U.: Cusparse library. In: GPU Technology Conference (2010)

work page 2010

[21] [21]

Sala, A., Zheng, H., Zhao, B.Y., Gaito, S., Rossi, G.P.: Brief announcement: re- visiting the power-law degree distribution for social graph analysis. p. 400–401. PODC ’10 (2010)

work page 2010

[22] [22]

ACM Comput

Shao, Y., Li, H., Gu, X., Yin, H., Li, Y., Miao, X., Zhang, W., Cui, B., Chen, L.: Distributed graph neural network training: A survey. ACM Comput. Surv.56(8) (Apr 2024).https://doi.org/10.1145/3648358,https://doi.org/ 10.1145/3648358

work page doi:10.1145/3648358 2024

[23] [23]

In: SC ’08

Volkov, V., Demmel, J.W.: Benchmarking gpus to tune dense linear algebra. In: SC ’08. pp. 1–11 (2008)

work page 2008

[24] [24]

Wang, M., Zheng, D., Ye, Z., Gan, Q., Li, M., Song, X., Zhou, J., Ma, C., Yu, L., Gai, Y., Xiao, T., He, T., Karypis, G., Li, J., Zhang, Z.: Deep graph library: A graph-centric, highly-performant package for graph neural networks (2020)

work page 2020

[25] [25]

In: (OSDI 21)

Wang, Y., Feng, B., Li, G., Li, S., Deng, L., Xie, Y., Ding, Y.: Gnnadvisor: An adaptive and efficient runtime system for gnn acceleration on gpus. In: (OSDI 21). pp. 515–531 (2021)

work page 2021

[26] [26]

Wei, H., Yu, J.X., Lu, C., Lin, X.: Speedup graph processing by graph ordering. p. 1813–1828. SIGMOD ’16 (2016)

work page 2016

[27] [27]

ACM Comput

Wu, S., Sun, F., Zhang, W., Xie, X., Cui, B.: Graph neural networks in recom- mender systems: A survey. ACM Comput. Surv.55(5) (dec 2022)

work page 2022

[28] [28]

Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity

Xia, H., Zheng, Z., Li, Y., Zhuang, D., Zhou, Z., Qiu, X., Li, Y., Lin, W., Song, S.L.: Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity. arXiv preprint arXiv:2309.10285 (2023)

work page arXiv 2023

[29] [29]

Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks? (2019) ParamSpMM 17

work page 2019

[30] [30]

Yang,C.,Buluç,A.,Owens,J.D.:Designprinciplesforsparsematrixmultiplication on the gpu. p. 672–687. Euro-Par ’18 (2018)

work page 2018

[31] [31]

Yang, X., Parthasarathy, S., Sadayappan, P.: Fast sparse matrix-vector multipli- cation on gpus: Implications for graph mining. Proc. VLDB Endow.4(4), 231–242 (jan 2011)

work page 2011

[32] [32]

ICS ’22, New York, NY, USA (2022)

Yesil, S., Moreira, J.E., Torrellas, J.: Dense dynamic blocks: optimizing spmm for processors with vector and matrix units using machine learning techniques. ICS ’22, New York, NY, USA (2022)

work page 2022