pith. sign in

arxiv: 2605.18515 · v1 · pith:JEWWJRSZnew · submitted 2026-05-18 · 💻 cs.DC

CB-SpMV:A Data Aggregating and Balance Algorithm for Cache-Friendly Block-Based SpMV on GPUs

Pith reviewed 2026-05-20 08:15 UTC · model grok-4.3

classification 💻 cs.DC
keywords sparse matrix-vector multiplicationGPU optimizationcache localityload balancing2D blockingdata aggregationperformance acceleration
0
0 comments X

The pith

A 2D blocking structure with virtual pointer aggregation and load balancing raises cache hit rates and delivers up to 3.95x speedup for SpMV on GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CB-SpMV as a way to reorganize sparse matrices for better performance during matrix-vector multiplication on graphics processors. It splits the matrix into independent sub-blocks, then uses virtual pointers to collect related data inside each block so that fast cache memory holds it longer. Block-aware column grouping and format choices adapt the computation to different sparsity patterns, while a separate balancing step spreads work evenly across thread blocks. If these steps work as described, many repeated SpMV calls in science and machine learning would finish in less time on current NVIDIA hardware without changes to the underlying matrix data.

Core claim

The paper claims that dividing the input matrix into independent sub-blocks, aggregating intra-block data of different types through virtual pointers to raise cache-level locality, applying block-aware column aggregation together with per-sub-block format selection to improve hardware use, and adding an inter-block load-balancing algorithm produces markedly higher cache hit rates and average speedups reaching 3.95 times over existing libraries when measured across 2,843 matrices from the SuiteSparse Collection on A100 and RTX 4090 GPUs.

What carries the argument

An adaptable 2D blocking structure that supports virtual pointer aggregation of intra-block data and enables block-aware column aggregation plus inter-block load balancing.

If this is right

  • Higher cache hit rates directly shorten the time spent fetching matrix and vector entries from slower memory levels.
  • Per-block format selection lets the same code handle both very sparse and denser local regions without manual retuning.
  • Even distribution of work across thread blocks reduces idle time on the GPU during each multiplication.
  • The resulting faster SpMV kernels accelerate any larger computation that repeatedly calls matrix-vector multiplication.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same aggregation and balancing ideas could be tested on other memory-intensive GPU routines such as sparse matrix-matrix multiplication.
  • Library developers might adopt similar virtual-pointer layouts as a default storage option for sparse data on GPUs.
  • Gains are likely largest for matrices from graph or finite-element applications whose local structure matches the block assumptions.
  • The open code release makes it straightforward to measure whether the speedups hold on future GPU generations with different cache sizes.

Load-bearing premise

The 2D blocking, virtual pointer aggregation, column strategies, and balancing steps will improve cache use and hardware efficiency for most sparse matrix patterns without adding offsetting overheads that erase the gains.

What would settle it

Running CB-SpMV on a fresh collection of matrices whose sparsity patterns differ markedly from the SuiteSparse set and recording no rise in cache hit rate or no net reduction in runtime compared with current libraries.

Figures

Figures reproduced from arXiv: 2605.18515 by Chenhao Xie*, Depei Qian, Fukai Sun, Xing Cong, Yifan Chen, Yi Liu.

Figure 1
Figure 1. Figure 1: (a) illustrates a 16×16 example matrix divided into 4×4 sub-blocks, with 13 non-zero sub-blocks highlighted. (b) depicts the data layout of the matrix in GPU global memory across four sparse storage formats (CSR, COO, BSR, TileSpMV), along with the memory access patterns for retrieving the third non-zero element located at (0,4) in the example matrix. case of highly sparse sub-blocks. For instance, when a … view at source ↗
Figure 2
Figure 2. Figure 2: An example of an SpMV that multiplies a 6-by-6 sparse matrix 𝐴 by a vector 𝑥 to get a vector 𝑦. 2 Background, Motivation and Challenge 2.1 Sparse Matrix Storage Format Sparse matrices, characterized by a few non-zero elements per row, require specialized storage formats to improve access efficiency. The COO format is widely used for its simplicity and compatibility with data storage, stores non-zero elemen… view at source ↗
Figure 4
Figure 4. Figure 4: Standard deviation and distribution of non-zero elements per thread block: (a) Standard deviation across 2000 matrices; (b) Distribution for a specific matrix (TSC_OPF_1047.mtx). that 59.36% of sub-blocks have a warp-level thread utilization below 75%, while 20.35% falls below 50%, leading to significant efficiency losses. Although TileSpMV[39] partially addresses this by consoli￾dating sparse sub-blocks a… view at source ↗
Figure 6
Figure 6. Figure 6: The storage format and memory organization of CB-SpMV. (a) illustrates a 16-by-16 example matrix, where different colors indicate that sub-blocks adopt distinct storage formats. (b) depicts the matrix after column aggregation. (c) shows the 2D metadata of CB-SpMV. In comparison, CB-SpMV aggregates different types of data within the sub-block for better data locality. As shown in Fig.5, the overview flow ch… view at source ↗
Figure 7
Figure 7. Figure 7: Sub-block Data Transfer and GPU Memory Access: (a) Packaging of sub-block data on the CPU, transfer to GPU global memory, and access via L1, L2 caches, and shared memory; (b) Mem￾ory alignment issue and its resolution. specify the storage format of each sub-block. The low-level struc￾ture defaults to the COO format for rapid conversion from the input data to the CB-SpMV structure. To treat the different ty… view at source ↗
Figure 8
Figure 8. Figure 8: Block level load balancing policy and high-level structure after balancing 3.3.2 Dense Sub-blocks. For excessively dense sub-blocks, CB￾SpMV employs a format selection strategy, storing sub-blocks in one of three formats: COO for low-density blocks(the number of non-zero elements in a block is less than 𝑡ℎ1), Dense for high￾density blocks(the number of non-zero elements in a block is more than 𝑡ℎ2), and CS… view at source ↗
Figure 9
Figure 9. Figure 9: Performance comparison between CB-SpMV and the SOTA SpMV algorithm on RTX4090 and A100 GPUs [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Performance comparison of 15 typical sparse matrices on RTX 4090 GPU and L1 & L2 Cache hit ratio lts__t_request_hit_rate for L2 Cache hit rate. These metrics were used to evaluate the Gflops and average L1 and L2 cache hit rates for the three methods mentioned earlier when applied to these 15 matrices. As shown in Fig.10, CB-SpMV achieves L1 cache hit rates 14.4× and 1.93× higher than BSR and TileSpMV, re… view at source ↗
Figure 11
Figure 11. Figure 11: Performance comparison of 15 typical sparse matrices using different optimization measures using CB-SpMV on RTX 4090 GPUs [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Comparison of memory usage and preprocessing time across different SpMV storage formats. This demonstrates that CB-SpMV effectively balances storage effi￾ciency and computational adaptability. 4.4.2 preprocessing time overhead. The preprocessing time re￾quired to convert COO matrices into the formats used by various algorithms is shown in Fig.12(b). CB-SpMV consistently outper￾forms TileSpMV and cuSPARSE-… view at source ↗
read the original abstract

Sparse matrix-vector multiplication (SpMV) is crucial in computational science, engineering, and machine learning. Despite substantial efforts to improve SpMV performance on GPUs through various techniques, issues related to data locality, hardware utilization, and load balancing persist, leaving room for further optimization. This paper presents CB-SpMV, a cache-friendly SpMV optimization algorithm, using a novel data convergent and adaptable 2D blocking structure. The matrix in CB-SpMV is divided into independent sub-blocks, with virtual pointers aggregating different types of intra-block data for better cache-level data locality. To enhance hardware utilization, a block-aware column aggregation strategy and the selection of sub-block formats are proposed to accelerate computation and adapt to varying sparse matrices. Finally, an inter-block load-balancing algorithm is designed to ensure efficient workload distribution across thread blocks. Experimental evaluations on 2,843 matrices from the SuiteSparse Collection show that CB-SpMV significantly improves cache hit rates and achieves average speedups of up to 3.95x over state-of-the-art methods like cuSPARSE-BSR, TileSpMV, and DASP on NVIDIA A100 and RTX 4090 GPUs. The implementation is available at: \url{https://github.com/xing-cong/CB-Sparse}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents CB-SpMV, a GPU SpMV algorithm that partitions the matrix into independent sub-blocks using a novel 2D blocking structure, aggregates intra-block data via virtual pointers to improve cache locality, applies block-aware column aggregation and sub-block format selection to boost hardware utilization, and incorporates an inter-block load-balancing algorithm. Experiments on 2,843 SuiteSparse matrices report improved cache hit rates and average speedups of up to 3.95× over cuSPARSE-BSR, TileSpMV, and DASP on NVIDIA A100 and RTX 4090 GPUs, with the implementation released on GitHub.

Significance. If the performance claims are substantiated, the work would represent a useful incremental advance in cache- and load-aware SpMV kernels for GPUs, a core primitive in scientific computing and machine learning. The open-source release supports reproducibility and follow-on work. The significance hinges on demonstrating that the proposed aggregation and blocking mechanisms deliver net gains without material offsetting costs across diverse sparsity patterns.

major comments (2)
  1. [Experimental Evaluation] Experimental Evaluation section: The reported speedups on 2,843 SuiteSparse matrices lack any description of selection criteria, post-hoc exclusions, variance or error-bar computation, or confirmation that baseline libraries (cuSPARSE-BSR, TileSpMV, DASP) were compiled and preprocessed under identical conditions. These omissions make it impossible to assess whether the 3.95× average speedup is robust or sensitive to matrix subset choice.
  2. [Experimental Evaluation] Experimental Evaluation section: No ablation studies or micro-benchmarks isolate the incremental cache-hit and runtime contributions (or overheads) of virtual-pointer aggregation, block-aware column aggregation, and sub-block format selection versus a plain blocked baseline or the inter-block load balancer alone. Without such isolation, the central claim that the cache-friendly 2D structures are the primary driver of the observed gains cannot be verified and may be confounded by the load balancer.
minor comments (2)
  1. [Abstract] Abstract: The phrasing 'average speedups of up to 3.95x' is ambiguous; clarify whether this is the mean of per-matrix speedups or the maximum observed average across baselines.
  2. [Algorithm Description] Notation throughout: Define the precise meaning of 'virtual pointer' and 'sub-block format selection' on first use, and ensure consistent terminology between the algorithmic description and the experimental figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the Experimental Evaluation section to improve transparency and strengthen the supporting evidence for our claims.

read point-by-point responses
  1. Referee: Experimental Evaluation section: The reported speedups on 2,843 SuiteSparse matrices lack any description of selection criteria, post-hoc exclusions, variance or error-bar computation, or confirmation that baseline libraries (cuSPARSE-BSR, TileSpMV, DASP) were compiled and preprocessed under identical conditions. These omissions make it impossible to assess whether the 3.95× average speedup is robust or sensitive to matrix subset choice.

    Authors: We appreciate the referee's emphasis on rigorous experimental reporting. The 2,843 matrices comprise the full subset of SuiteSparse matrices with at least 1,000 non-zeros that fit in GPU memory on the evaluated platforms; no post-hoc exclusions were applied. All timing results are deterministic for a given matrix, hardware, and kernel launch configuration, which is why per-matrix variance was not originally reported. We confirm that cuSPARSE-BSR, TileSpMV, and DASP were compiled with identical CUDA toolchains, optimization flags, and preprocessing pipelines. In the revised manuscript we have added an explicit subsection describing the selection criteria, the use of geometric means for aggregation, and per-category standard deviations to allow readers to assess robustness across sparsity patterns. revision: yes

  2. Referee: Experimental Evaluation section: No ablation studies or micro-benchmarks isolate the incremental cache-hit and runtime contributions (or overheads) of virtual-pointer aggregation, block-aware column aggregation, and sub-block format selection versus a plain blocked baseline or the inter-block load balancer alone. Without such isolation, the central claim that the cache-friendly 2D structures are the primary driver of the observed gains cannot be verified and may be confounded by the load balancer.

    Authors: The referee correctly notes that component-wise isolation would further substantiate our central claim. While the original results already link the observed cache-hit-rate improvements directly to the 2D blocking and aggregation mechanisms, we agree that explicit ablations reduce the possibility of confounding. We have therefore added a new set of micro-benchmark experiments in the revised manuscript. These compare (i) a plain blocked baseline, (ii) the inter-block load balancer alone, and (iii) incremental addition of virtual-pointer aggregation, block-aware column aggregation, and sub-block format selection. The new data show that the cache-friendly 2D structures account for the majority of the performance and locality gains, with the load balancer providing complementary but smaller additional benefits. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic construction with independent empirical benchmarks

full rationale

The paper introduces a 2D blocking structure, virtual pointer aggregation, block-aware column aggregation, sub-block format selection, and inter-block load balancing as explicit algorithmic design choices for cache-friendly SpMV. These mechanisms are described directly in the manuscript without reference to fitted parameters, self-citations that bear the central claim, or uniqueness theorems imported from prior author work. Performance results are reported as measured speedups and cache-hit improvements on 2,843 SuiteSparse matrices against external baselines (cuSPARSE-BSR, TileSpMV, DASP), with no equations equating outputs to quantities defined by the authors' own inputs. The derivation chain consists of engineering decisions followed by external validation and is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces no new physical constants or mathematical axioms; it relies on standard GPU memory hierarchy assumptions and the existence of the SuiteSparse matrix collection. No free parameters are explicitly fitted in the abstract description.

axioms (1)
  • domain assumption GPU cache behavior and thread-block scheduling follow the documented NVIDIA architecture rules for the A100 and RTX 4090.
    Invoked implicitly when claiming improved cache hit rates and load balance.

pith-pipeline@v0.9.0 · 5770 in / 1366 out tokens · 55800 ms · 2026-05-20T08:15:33.062718+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages

  1. [1]

    Hartwig Anzt, Terry Cojean, Chen Yen-Chen, Jack Dongarra, Goran Flegar, Pratik Nayak, Stanimire Tomov, Yuhsiang M Tsai, and Weichung Wang. 2020. Load- balancing sparse matrix vector product kernels on gpus.ACM Transactions on Parallel Computing (TOPC)7, 1 (2020), 1–26

  2. [2]

    Arash Ashari, Naser Sedaghati, John Eisenlohr, Srinivasan Parthasarath, and P Sadayappan. 2014. Fast sparse matrix-vector multiplication on GPUs for graph applications. InSC’14: Proceedings of the International Conference for High Perfor- mance Computing, Networking, Storage and Analysis. IEEE, 781–792

  3. [3]

    Arash Ashari, Naser Sedaghati, John Eisenlohr, and P Sadayappan. 2014. An effi- cient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs. InProceedings of the 28th ACM international conference on Supercom- puting. 273–282

  4. [4]

    Nathan Bell and Michael Garland. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. InProceedings of the conference on high performance computing networking, storage and analysis. 1–11

  5. [5]

    Deshun Bi, Xiaowen Tian, Shengguo Li, and Dezun Dong. 2023. Efficiently Running SpMV on Multi-Core DSPs for Block Sparse Matrix. In2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS). IEEE, 1912– 1919

  6. [6]

    Haodong Bian, Jianqiang Huang, Lingbin Liu, Dongqiang Huang, and Xiaoying Wang. 2021. Albus: A method for efficiently processing spmv using simd and load balancing.Future Generation Computer Systems116 (2021), 371–392

  7. [7]

    Urban Borštnik, Joost VandeVondele, Valéry Weber, and Jürg Hutter. 2014. Sparse matrix multiplication: The distributed block-compressed sparse row library.Par- allel Comput.40, 5-6 (2014), 47–58

  8. [8]

    Aydin Buluç, Jeremy T Fineman, Matteo Frigo, John R Gilbert, and Charles E Leiserson. 2009. Parallel sparse matrix-vector and matrix-transpose-vector multi- plication using compressed sparse blocks. InProceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures. 233–244

  9. [9]

    Aydin Buluc and John R Gilbert. 2008. On the representation and multiplication of hypersparse matrices. In2008 IEEE International Symposium on Parallel and Distributed Processing. IEEE, 1–11. ICS ’25, June 8–11, 2025, Salt Lake City, UT, USA Xing Cong, Fukai Sun, Yifan Chen, Chenhao Xie, Yi Liu, and Depei Qian

  10. [10]

    Aydin Buluç and John R Gilbert. 2012. Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments.SIAM Journal on Scientific Computing34, 4 (2012), C170–C191

  11. [11]

    Aydin Buluç, Samuel Williams, Leonid Oliker, and James Demmel. 2011. Reduced- bandwidth multithreaded algorithms for sparse matrix-vector multiplication. In2011 IEEE International Parallel & Distributed Processing Symposium. IEEE, 721–733

  12. [12]

    Genshen Chu, Yuanjie He, Lingyu Dong, Zhezhao Ding, Dandan Chen, He Bai, Xuesong Wang, and Changjun Hu. 2023. Efficient Algorithm Design of Opti- mizing SpMV on GPU. InProceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing. 115–128

  13. [13]

    2012.CUDA programming: a developer’s guide to parallel computing with GPUs

    Shane Cook. 2012.CUDA programming: a developer’s guide to parallel computing with GPUs. Newnes

  14. [14]

    NVIDIA Corporation. 2024. cuSPARSE: GPU-Accelerated Sparse Matrix Library. https://developer.nvidia.com/cusparse. Version 12.4

  15. [15]

    Timothy A Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection.ACM Transactions on Mathematical Software (TOMS)38, 1 (2011), 1–25

  16. [16]

    Zhen Du, Jiajia Li, Yinshan Wang, Xueqi Li, Guangming Tan, and Ninghui Sun

  17. [17]

    InSC22: International Conference for High Performance Computing, Networking, Storage and Analysis

    Alphasparse: Generating high performance spmv codes directly from sparse matrices. InSC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–15

  18. [18]

    Ryan Eberhardt and Mark Hoemmen. 2016. Optimization of block sparse matrix- vector multiplication on shared-memory parallel architectures. In2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 663–672

  19. [19]

    Ruibo Fan, Wei Wang, and Xiaowen Chu. 2024. DTC-SpMM: Bridging the Gap in Accelerating General Sparse Matrix Multiplication with Tensor Cores. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 253–267

  20. [21]

    Sparse matrix-vector multiplication on GPGPUs.ACM Transactions on Mathematical Software (TOMS)43, 4 (2017), 1–49

  21. [22]

    Salvatore Filippone, Valeria Cardellini, Davide Barbieri, and Alessandro Fanfarillo

  22. [23]

    Sparse Matrix-Vector Multiplication on GPGPUs.ACM Trans. Math. Softw. 43, 4, Article 30 (Jan. 2017), 49 pages. https://doi.org/10.1145/3017994

  23. [24]

    Jianhua Gao, Weixing Ji, Zhaonian Tan, Yizhuo Wang, and Feng Shi. 2022. Taichi: A hybrid compression format for binary sparse matrix-vector multiplication on gpu.IEEE Transactions on Parallel and Distributed Systems33, 12 (2022), 3732–3745

  24. [25]

    Greathouse and Mayank Daga

    Joseph L. Greathouse and Mayank Daga. 2014. Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format. InSC ’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 769–780. https://doi.org/10.1109/SC.2014.68

  25. [26]

    Jihu Guo, Rui Xia, Jie Liu, Xiaoxiong Zhu, and Xiang Zhang. 2024. CAMLB-SpMV: An Efficient Cache-Aware Memory Load-Balancing SpMV on CPU. InProceedings of the 53rd International Conference on Parallel Processing. 640–649

  26. [27]

    Haonan Ji, Huimin Song, Shibo Lu, Zhou Jin, Guangming Tan, and Weifeng Liu

  27. [28]

    InProceedings of the 51st International Conference on Parallel Processing

    Tilespmspv: A tiled algorithm for sparse matrix-sparse vector multiplication on gpus. InProceedings of the 51st International Conference on Parallel Processing. 1–11

  28. [29]

    Kwangrae Kim and Ki-Seok Chung. 2024. CAMPuS: Concurrent Acceleration of Memory Access and Parallel Processing in Near-Memory SpMV Architecture. IEEE Access(2024)

  29. [30]

    Kornilios Kourtis, Vasileios Karakasis, Georgios Goumas, and Nectarios Koziris

  30. [31]

    ACM SIGPLAN Notices46, 8 (2011), 247–256

    CSX: an extended compression format for spmv on shared memory systems. ACM SIGPLAN Notices46, 8 (2011), 247–256

  31. [32]

    Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R Bishop. 2014. A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units.SIAM Journal on Scientific Computing36, 5 (2014), C401–C423

  32. [33]

    Paolo Sylos Labini, Massimo Bernaschi, Werner Nutt, Francesco Silvestri, and Flavio Vella. 2022. Blocking Sparse Matrices to Leverage Dense-Specific Multipli- cation. In2022 IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms (IA3). IEEE, 19–24

  33. [34]

    Jiajia Li, Jimeng Sun, and Richard Vuduc. 2018. HiCOO: Hierarchical storage of sparse tensors. InSC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 238–252

  34. [35]

    Wenxuan Li, Helin Cheng, Zhengyang Lu, Yuechen Lu, and Weifeng Liu. 2023. Haspmv: Heterogeneity-aware sparse matrix-vector multiplication on modern asymmetric multicore processors. In2023 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 209–220

  35. [36]

    Yun Liang, Wai Teng Tang, Ruizhe Zhao, Mian Lu, Huynh Phung Huynh, and Rick Siow Mong Goh. 2017. Scale-free sparse matrix-vector multiplication on many- core architectures.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems36, 12 (2017), 2106–2119

  36. [37]

    Lifeng Liu, Meilin Liu, Chongjun Wang, and Jun Wang. 2015. LSRB-CSR: A low overhead storage format for SpMV on the GPU systems. In2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS). IEEE, 733– 741

  37. [38]

    Weifeng Liu and Brian Vinter. 2015. CSR5: An efficient storage format for cross- platform sparse matrix-vector multiplication. InProceedings of the 29th ACM on International Conference on Supercomputing. 339–350

  38. [39]

    Yuechen Lu and Weifeng Liu. 2023. DASP: Specific Dense Matrix Multiply- Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14

  39. [40]

    Zhengyang Lu, Yuyao Niu, and Weifeng Liu. 2020. Efficient block algorithms for parallel sparse triangular solve. InProceedings of the 49th International Conference on Parallel Processing. 1–11

  40. [41]

    Duane Merrill and Michael Garland. 2016. Merge-based parallel sparse matrix- vector multiplication. InSC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 678–689

  41. [42]

    Hongli Mi, Xiangrui Yu, Xiaosong Yu, Shuangyuan Wu, and Weifeng Liu. 2023. Balancing computation and communication in distributed sparse matrix-vector multiplication. In2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, 535–544

  42. [43]

    Juan J Navarro, Elena García-Diego, Josep-L Larriba-Pey, and Toni Juan. 1996. Block algorithms for sparse matrix computations on high performance work- stations. InProceedings of the 10th international conference on Supercomputing. 301–308

  43. [44]

    Yuyao Niu, Zhengyang Lu, Meichen Dong, Zhou Jin, Weifeng Liu, and Guangming Tan. 2021. Tilespmv: A tiled algorithm for sparse matrix-vector multiplication on gpus. In2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 68–78

  44. [45]

    Yuyao Niu, Zhengyang Lu, Haonan Ji, Shuhui Song, Zhou Jin, and Weifeng Liu

  45. [46]

    InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

    TileSpGEMM: A tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs. InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 90–106

  46. [47]

    Muhammad Osama, Serban D Porumbescu, and John D Owens. 2023. A program- ming model for GPU load balancing. InProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. 79–91

  47. [48]

    James O’Neil and Daniel B Szyld. 1990. A block ordering method for sparse matrices.SIAM J. Sci. Statist. Comput.11, 5 (1990), 811–823

  48. [49]

    Markus Steinberger, Rhaleb Zayer, and Hans-Peter Seidel. 2017. Globally homo- geneous, locally adaptive sparse matrix-vector multiplication on the GPU. In Proceedings of the International Conference on Supercomputing. 1–11

  49. [50]

    Abdul Rehman Tareen, Marius Meyer, Christian Plessl, and Tobias Kenter. 2024. HiHiSpMV: Sparse Matrix Vector Multiplication with Hierarchical Row Reduc- tions on FPGAs with High Bandwidth Memory. In2024 IEEE 32nd Annual Inter- national Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 32–42

  50. [51]

    Richard W Vuduc and Hyun-Jin Moon. 2005. Fast sparse matrix-vector multi- plication by exploiting variable block structure. InHigh Performance Computing and Communications: First International Conference, HPCC 2005, Sorrento, Italy, September 21-23, 2005. Proceedings 1. Springer, 807–816

  51. [52]

    Chenhao Xie, Jieyang Chen, Jesun Firoz, Jiajia Li, Shuaiwen Leon Song, Kevin Barker, Mark Raugas, and Ang Li. 2021. Fast and scalable sparse triangular solver for multi-gpu based hpc architectures. InProceedings of the 50th International Conference on Parallel Processing. 1–11

  52. [53]

    Shengen Yan, Chao Li, Yunquan Zhang, and Huiyang Zhou. 2014. yaSpMV: Yet another SpMV framework on GPUs.Acm Sigplan Notices49, 8 (2014), 107–118

  53. [54]

    Wangdong Yang, Kenli Li, and Keqin Li. 2018. A parallel computing method using blocked format with optimal partitioning for SpMV on GPU.Journal of computer and system sciences92 (2018), 152–170

  54. [55]

    Serif Yesil, Azin Heidarshenas, Adam Morrison, and Josep Torrellas. 2020. Speed- ing up SpMV for power-law graph analytics by enhancing locality & vectorization. InSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–15

  55. [56]

    Yichen Zhang, Shengguo Li, Fan Yuan, Dezun Dong, Xiaojian Yang, Tiejun Li, and Zheng Wang. 2023. Memory-aware optimization for sequences of sparse matrix-vector multiplications. In2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 379–389