CB-SpMV:A Data Aggregating and Balance Algorithm for Cache-Friendly Block-Based SpMV on GPUs

Chenhao Xie*; Depei Qian; Fukai Sun; Xing Cong; Yifan Chen; Yi Liu

arxiv: 2605.18515 · v1 · pith:JEWWJRSZnew · submitted 2026-05-18 · 💻 cs.DC

CB-SpMV:A Data Aggregating and Balance Algorithm for Cache-Friendly Block-Based SpMV on GPUs

Xing Cong , Fukai Sun , Yifan Chen , Chenhao Xie* , Yi Liu , Depei Qian This is my paper

Pith reviewed 2026-05-20 08:15 UTC · model grok-4.3

classification 💻 cs.DC

keywords sparse matrix-vector multiplicationGPU optimizationcache localityload balancing2D blockingdata aggregationperformance acceleration

0 comments

The pith

A 2D blocking structure with virtual pointer aggregation and load balancing raises cache hit rates and delivers up to 3.95x speedup for SpMV on GPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CB-SpMV as a way to reorganize sparse matrices for better performance during matrix-vector multiplication on graphics processors. It splits the matrix into independent sub-blocks, then uses virtual pointers to collect related data inside each block so that fast cache memory holds it longer. Block-aware column grouping and format choices adapt the computation to different sparsity patterns, while a separate balancing step spreads work evenly across thread blocks. If these steps work as described, many repeated SpMV calls in science and machine learning would finish in less time on current NVIDIA hardware without changes to the underlying matrix data.

Core claim

The paper claims that dividing the input matrix into independent sub-blocks, aggregating intra-block data of different types through virtual pointers to raise cache-level locality, applying block-aware column aggregation together with per-sub-block format selection to improve hardware use, and adding an inter-block load-balancing algorithm produces markedly higher cache hit rates and average speedups reaching 3.95 times over existing libraries when measured across 2,843 matrices from the SuiteSparse Collection on A100 and RTX 4090 GPUs.

What carries the argument

An adaptable 2D blocking structure that supports virtual pointer aggregation of intra-block data and enables block-aware column aggregation plus inter-block load balancing.

If this is right

Higher cache hit rates directly shorten the time spent fetching matrix and vector entries from slower memory levels.
Per-block format selection lets the same code handle both very sparse and denser local regions without manual retuning.
Even distribution of work across thread blocks reduces idle time on the GPU during each multiplication.
The resulting faster SpMV kernels accelerate any larger computation that repeatedly calls matrix-vector multiplication.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same aggregation and balancing ideas could be tested on other memory-intensive GPU routines such as sparse matrix-matrix multiplication.
Library developers might adopt similar virtual-pointer layouts as a default storage option for sparse data on GPUs.
Gains are likely largest for matrices from graph or finite-element applications whose local structure matches the block assumptions.
The open code release makes it straightforward to measure whether the speedups hold on future GPU generations with different cache sizes.

Load-bearing premise

The 2D blocking, virtual pointer aggregation, column strategies, and balancing steps will improve cache use and hardware efficiency for most sparse matrix patterns without adding offsetting overheads that erase the gains.

What would settle it

Running CB-SpMV on a fresh collection of matrices whose sparsity patterns differ markedly from the SuiteSparse set and recording no rise in cache hit rate or no net reduction in runtime compared with current libraries.

Figures

Figures reproduced from arXiv: 2605.18515 by Chenhao Xie*, Depei Qian, Fukai Sun, Xing Cong, Yifan Chen, Yi Liu.

**Figure 1.** Figure 1: (a) illustrates a 16×16 example matrix divided into 4×4 sub-blocks, with 13 non-zero sub-blocks highlighted. (b) depicts the data layout of the matrix in GPU global memory across four sparse storage formats (CSR, COO, BSR, TileSpMV), along with the memory access patterns for retrieving the third non-zero element located at (0,4) in the example matrix. case of highly sparse sub-blocks. For instance, when a … view at source ↗

**Figure 2.** Figure 2: An example of an SpMV that multiplies a 6-by-6 sparse matrix 𝐴 by a vector 𝑥 to get a vector 𝑦. 2 Background, Motivation and Challenge 2.1 Sparse Matrix Storage Format Sparse matrices, characterized by a few non-zero elements per row, require specialized storage formats to improve access efficiency. The COO format is widely used for its simplicity and compatibility with data storage, stores non-zero elemen… view at source ↗

**Figure 4.** Figure 4: Standard deviation and distribution of non-zero elements per thread block: (a) Standard deviation across 2000 matrices; (b) Distribution for a specific matrix (TSC_OPF_1047.mtx). that 59.36% of sub-blocks have a warp-level thread utilization below 75%, while 20.35% falls below 50%, leading to significant efficiency losses. Although TileSpMV[39] partially addresses this by consolidating sparse sub-blocks a… view at source ↗

**Figure 6.** Figure 6: The storage format and memory organization of CB-SpMV. (a) illustrates a 16-by-16 example matrix, where different colors indicate that sub-blocks adopt distinct storage formats. (b) depicts the matrix after column aggregation. (c) shows the 2D metadata of CB-SpMV. In comparison, CB-SpMV aggregates different types of data within the sub-block for better data locality. As shown in Fig.5, the overview flow ch… view at source ↗

**Figure 7.** Figure 7: Sub-block Data Transfer and GPU Memory Access: (a) Packaging of sub-block data on the CPU, transfer to GPU global memory, and access via L1, L2 caches, and shared memory; (b) Memory alignment issue and its resolution. specify the storage format of each sub-block. The low-level structure defaults to the COO format for rapid conversion from the input data to the CB-SpMV structure. To treat the different ty… view at source ↗

**Figure 8.** Figure 8: Block level load balancing policy and high-level structure after balancing 3.3.2 Dense Sub-blocks. For excessively dense sub-blocks, CBSpMV employs a format selection strategy, storing sub-blocks in one of three formats: COO for low-density blocks(the number of non-zero elements in a block is less than 𝑡ℎ1), Dense for highdensity blocks(the number of non-zero elements in a block is more than 𝑡ℎ2), and CS… view at source ↗

**Figure 9.** Figure 9: Performance comparison between CB-SpMV and the SOTA SpMV algorithm on RTX4090 and A100 GPUs [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗

**Figure 10.** Figure 10: Performance comparison of 15 typical sparse matrices on RTX 4090 GPU and L1 & L2 Cache hit ratio lts__t_request_hit_rate for L2 Cache hit rate. These metrics were used to evaluate the Gflops and average L1 and L2 cache hit rates for the three methods mentioned earlier when applied to these 15 matrices. As shown in Fig.10, CB-SpMV achieves L1 cache hit rates 14.4× and 1.93× higher than BSR and TileSpMV, re… view at source ↗

**Figure 11.** Figure 11: Performance comparison of 15 typical sparse matrices using different optimization measures using CB-SpMV on RTX 4090 GPUs [PITH_FULL_IMAGE:figures/full_fig_p010_11.png] view at source ↗

**Figure 12.** Figure 12: Comparison of memory usage and preprocessing time across different SpMV storage formats. This demonstrates that CB-SpMV effectively balances storage efficiency and computational adaptability. 4.4.2 preprocessing time overhead. The preprocessing time required to convert COO matrices into the formats used by various algorithms is shown in Fig.12(b). CB-SpMV consistently outperforms TileSpMV and cuSPARSE-… view at source ↗

read the original abstract

Sparse matrix-vector multiplication (SpMV) is crucial in computational science, engineering, and machine learning. Despite substantial efforts to improve SpMV performance on GPUs through various techniques, issues related to data locality, hardware utilization, and load balancing persist, leaving room for further optimization. This paper presents CB-SpMV, a cache-friendly SpMV optimization algorithm, using a novel data convergent and adaptable 2D blocking structure. The matrix in CB-SpMV is divided into independent sub-blocks, with virtual pointers aggregating different types of intra-block data for better cache-level data locality. To enhance hardware utilization, a block-aware column aggregation strategy and the selection of sub-block formats are proposed to accelerate computation and adapt to varying sparse matrices. Finally, an inter-block load-balancing algorithm is designed to ensure efficient workload distribution across thread blocks. Experimental evaluations on 2,843 matrices from the SuiteSparse Collection show that CB-SpMV significantly improves cache hit rates and achieves average speedups of up to 3.95x over state-of-the-art methods like cuSPARSE-BSR, TileSpMV, and DASP on NVIDIA A100 and RTX 4090 GPUs. The implementation is available at: \url{https://github.com/xing-cong/CB-Sparse}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CB-SpMV adds a practical 2D blocking scheme with virtual pointers for SpMV cache locality plus load balancing, and the large SuiteSparse evaluation shows real speedups, but the gains need better separation from the balancer component.

read the letter

The core of this paper is a 2D blocking method that uses virtual pointers to pull intra-block data together for better cache behavior, combined with block-aware column aggregation and an inter-block load balancer. They report average speedups up to 3.95x over cuSPARSE-BSR, TileSpMV, and DASP on A100 and RTX 4090 hardware across 2,843 SuiteSparse matrices, along with improved cache hit rates. The code release on GitHub helps with checking the details later. The adaptable blocking and pointer aggregation look like a sensible engineering synthesis that goes a step past plain BSR or simple tiling in prior work, and the scale of the matrix collection is a clear strength for an optimization paper. The inter-block balancer is described clearly enough to understand its role in workload distribution. The main gaps sit in the experiments. The write-up does not break out how matrices were chosen or whether any were dropped after initial runs, and there are no ablation results that isolate the contribution of the virtual pointers and column aggregation from the load balancer. Without those, it is hard to tell whether the cache improvements are the main driver or whether the balancer is carrying most of the observed gains on irregular matrices. Overhead from the extra indirection or format selection logic is also not quantified separately. This work is aimed at people who implement or tune sparse kernels on GPUs for scientific or ML codes. A reader who needs a faster SpMV drop-in on NVIDIA hardware would get concrete ideas worth testing. The evaluation size and the concrete algorithmic description are enough to justify sending it to referees, though the review should ask for the missing ablation data and clearer selection criteria. I would route it to review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript presents CB-SpMV, a GPU SpMV algorithm that partitions the matrix into independent sub-blocks using a novel 2D blocking structure, aggregates intra-block data via virtual pointers to improve cache locality, applies block-aware column aggregation and sub-block format selection to boost hardware utilization, and incorporates an inter-block load-balancing algorithm. Experiments on 2,843 SuiteSparse matrices report improved cache hit rates and average speedups of up to 3.95× over cuSPARSE-BSR, TileSpMV, and DASP on NVIDIA A100 and RTX 4090 GPUs, with the implementation released on GitHub.

Significance. If the performance claims are substantiated, the work would represent a useful incremental advance in cache- and load-aware SpMV kernels for GPUs, a core primitive in scientific computing and machine learning. The open-source release supports reproducibility and follow-on work. The significance hinges on demonstrating that the proposed aggregation and blocking mechanisms deliver net gains without material offsetting costs across diverse sparsity patterns.

major comments (2)

[Experimental Evaluation] Experimental Evaluation section: The reported speedups on 2,843 SuiteSparse matrices lack any description of selection criteria, post-hoc exclusions, variance or error-bar computation, or confirmation that baseline libraries (cuSPARSE-BSR, TileSpMV, DASP) were compiled and preprocessed under identical conditions. These omissions make it impossible to assess whether the 3.95× average speedup is robust or sensitive to matrix subset choice.
[Experimental Evaluation] Experimental Evaluation section: No ablation studies or micro-benchmarks isolate the incremental cache-hit and runtime contributions (or overheads) of virtual-pointer aggregation, block-aware column aggregation, and sub-block format selection versus a plain blocked baseline or the inter-block load balancer alone. Without such isolation, the central claim that the cache-friendly 2D structures are the primary driver of the observed gains cannot be verified and may be confounded by the load balancer.

minor comments (2)

[Abstract] Abstract: The phrasing 'average speedups of up to 3.95x' is ambiguous; clarify whether this is the mean of per-matrix speedups or the maximum observed average across baselines.
[Algorithm Description] Notation throughout: Define the precise meaning of 'virtual pointer' and 'sub-block format selection' on first use, and ensure consistent terminology between the algorithmic description and the experimental figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the Experimental Evaluation section to improve transparency and strengthen the supporting evidence for our claims.

read point-by-point responses

Referee: Experimental Evaluation section: The reported speedups on 2,843 SuiteSparse matrices lack any description of selection criteria, post-hoc exclusions, variance or error-bar computation, or confirmation that baseline libraries (cuSPARSE-BSR, TileSpMV, DASP) were compiled and preprocessed under identical conditions. These omissions make it impossible to assess whether the 3.95× average speedup is robust or sensitive to matrix subset choice.

Authors: We appreciate the referee's emphasis on rigorous experimental reporting. The 2,843 matrices comprise the full subset of SuiteSparse matrices with at least 1,000 non-zeros that fit in GPU memory on the evaluated platforms; no post-hoc exclusions were applied. All timing results are deterministic for a given matrix, hardware, and kernel launch configuration, which is why per-matrix variance was not originally reported. We confirm that cuSPARSE-BSR, TileSpMV, and DASP were compiled with identical CUDA toolchains, optimization flags, and preprocessing pipelines. In the revised manuscript we have added an explicit subsection describing the selection criteria, the use of geometric means for aggregation, and per-category standard deviations to allow readers to assess robustness across sparsity patterns. revision: yes
Referee: Experimental Evaluation section: No ablation studies or micro-benchmarks isolate the incremental cache-hit and runtime contributions (or overheads) of virtual-pointer aggregation, block-aware column aggregation, and sub-block format selection versus a plain blocked baseline or the inter-block load balancer alone. Without such isolation, the central claim that the cache-friendly 2D structures are the primary driver of the observed gains cannot be verified and may be confounded by the load balancer.

Authors: The referee correctly notes that component-wise isolation would further substantiate our central claim. While the original results already link the observed cache-hit-rate improvements directly to the 2D blocking and aggregation mechanisms, we agree that explicit ablations reduce the possibility of confounding. We have therefore added a new set of micro-benchmark experiments in the revised manuscript. These compare (i) a plain blocked baseline, (ii) the inter-block load balancer alone, and (iii) incremental addition of virtual-pointer aggregation, block-aware column aggregation, and sub-block format selection. The new data show that the cache-friendly 2D structures account for the majority of the performance and locality gains, with the load balancer providing complementary but smaller additional benefits. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic construction with independent empirical benchmarks

full rationale

The paper introduces a 2D blocking structure, virtual pointer aggregation, block-aware column aggregation, sub-block format selection, and inter-block load balancing as explicit algorithmic design choices for cache-friendly SpMV. These mechanisms are described directly in the manuscript without reference to fitted parameters, self-citations that bear the central claim, or uniqueness theorems imported from prior author work. Performance results are reported as measured speedups and cache-hit improvements on 2,843 SuiteSparse matrices against external baselines (cuSPARSE-BSR, TileSpMV, DASP), with no equations equating outputs to quantities defined by the authors' own inputs. The derivation chain consists of engineering decisions followed by external validation and is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces no new physical constants or mathematical axioms; it relies on standard GPU memory hierarchy assumptions and the existence of the SuiteSparse matrix collection. No free parameters are explicitly fitted in the abstract description.

axioms (1)

domain assumption GPU cache behavior and thread-block scheduling follow the documented NVIDIA architecture rules for the A100 and RTX 4090.
Invoked implicitly when claiming improved cache hit rates and load balance.

pith-pipeline@v0.9.0 · 5770 in / 1366 out tokens · 55800 ms · 2026-05-20T08:15:33.062718+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

novel data convergent and adaptable 2D blocking structure... virtual pointers aggregating different types of intra-block data... block-aware column aggregation strategy and the selection of sub-block formats... inter-block load-balancing algorithm
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Experimental evaluations on 2,843 matrices... average speedups of up to 3.95× over cuSPARSE-BSR, TileSpMV, and DASP

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages

[1]

Hartwig Anzt, Terry Cojean, Chen Yen-Chen, Jack Dongarra, Goran Flegar, Pratik Nayak, Stanimire Tomov, Yuhsiang M Tsai, and Weichung Wang. 2020. Load- balancing sparse matrix vector product kernels on gpus.ACM Transactions on Parallel Computing (TOPC)7, 1 (2020), 1–26

work page 2020
[2]

Arash Ashari, Naser Sedaghati, John Eisenlohr, Srinivasan Parthasarath, and P Sadayappan. 2014. Fast sparse matrix-vector multiplication on GPUs for graph applications. InSC’14: Proceedings of the International Conference for High Perfor- mance Computing, Networking, Storage and Analysis. IEEE, 781–792

work page 2014
[3]

Arash Ashari, Naser Sedaghati, John Eisenlohr, and P Sadayappan. 2014. An effi- cient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs. InProceedings of the 28th ACM international conference on Supercom- puting. 273–282

work page 2014
[4]

Nathan Bell and Michael Garland. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. InProceedings of the conference on high performance computing networking, storage and analysis. 1–11

work page 2009
[5]

Deshun Bi, Xiaowen Tian, Shengguo Li, and Dezun Dong. 2023. Efficiently Running SpMV on Multi-Core DSPs for Block Sparse Matrix. In2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS). IEEE, 1912– 1919

work page 2023
[6]

Haodong Bian, Jianqiang Huang, Lingbin Liu, Dongqiang Huang, and Xiaoying Wang. 2021. Albus: A method for efficiently processing spmv using simd and load balancing.Future Generation Computer Systems116 (2021), 371–392

work page 2021
[7]

Urban Borštnik, Joost VandeVondele, Valéry Weber, and Jürg Hutter. 2014. Sparse matrix multiplication: The distributed block-compressed sparse row library.Par- allel Comput.40, 5-6 (2014), 47–58

work page 2014
[8]

Aydin Buluç, Jeremy T Fineman, Matteo Frigo, John R Gilbert, and Charles E Leiserson. 2009. Parallel sparse matrix-vector and matrix-transpose-vector multi- plication using compressed sparse blocks. InProceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures. 233–244

work page 2009
[9]

Aydin Buluc and John R Gilbert. 2008. On the representation and multiplication of hypersparse matrices. In2008 IEEE International Symposium on Parallel and Distributed Processing. IEEE, 1–11. ICS ’25, June 8–11, 2025, Salt Lake City, UT, USA Xing Cong, Fukai Sun, Yifan Chen, Chenhao Xie, Yi Liu, and Depei Qian

work page 2008
[10]

Aydin Buluç and John R Gilbert. 2012. Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments.SIAM Journal on Scientific Computing34, 4 (2012), C170–C191

work page 2012
[11]

Aydin Buluç, Samuel Williams, Leonid Oliker, and James Demmel. 2011. Reduced- bandwidth multithreaded algorithms for sparse matrix-vector multiplication. In2011 IEEE International Parallel & Distributed Processing Symposium. IEEE, 721–733

work page 2011
[12]

Genshen Chu, Yuanjie He, Lingyu Dong, Zhezhao Ding, Dandan Chen, He Bai, Xuesong Wang, and Changjun Hu. 2023. Efficient Algorithm Design of Opti- mizing SpMV on GPU. InProceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing. 115–128

work page 2023
[13]

2012.CUDA programming: a developer’s guide to parallel computing with GPUs

Shane Cook. 2012.CUDA programming: a developer’s guide to parallel computing with GPUs. Newnes

work page 2012
[14]

NVIDIA Corporation. 2024. cuSPARSE: GPU-Accelerated Sparse Matrix Library. https://developer.nvidia.com/cusparse. Version 12.4

work page 2024
[15]

Timothy A Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection.ACM Transactions on Mathematical Software (TOMS)38, 1 (2011), 1–25

work page 2011
[16]

Zhen Du, Jiajia Li, Yinshan Wang, Xueqi Li, Guangming Tan, and Ninghui Sun

work page
[17]

InSC22: International Conference for High Performance Computing, Networking, Storage and Analysis

Alphasparse: Generating high performance spmv codes directly from sparse matrices. InSC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–15

work page
[18]

Ryan Eberhardt and Mark Hoemmen. 2016. Optimization of block sparse matrix- vector multiplication on shared-memory parallel architectures. In2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 663–672

work page 2016
[19]

Ruibo Fan, Wei Wang, and Xiaowen Chu. 2024. DTC-SpMM: Bridging the Gap in Accelerating General Sparse Matrix Multiplication with Tensor Cores. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 253–267

work page 2024
[21]

Sparse matrix-vector multiplication on GPGPUs.ACM Transactions on Mathematical Software (TOMS)43, 4 (2017), 1–49

work page 2017
[22]

Salvatore Filippone, Valeria Cardellini, Davide Barbieri, and Alessandro Fanfarillo

work page
[23]

Sparse Matrix-Vector Multiplication on GPGPUs.ACM Trans. Math. Softw. 43, 4, Article 30 (Jan. 2017), 49 pages. https://doi.org/10.1145/3017994

work page doi:10.1145/3017994 2017
[24]

Jianhua Gao, Weixing Ji, Zhaonian Tan, Yizhuo Wang, and Feng Shi. 2022. Taichi: A hybrid compression format for binary sparse matrix-vector multiplication on gpu.IEEE Transactions on Parallel and Distributed Systems33, 12 (2022), 3732–3745

work page 2022
[25]

Greathouse and Mayank Daga

Joseph L. Greathouse and Mayank Daga. 2014. Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format. InSC ’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 769–780. https://doi.org/10.1109/SC.2014.68

work page doi:10.1109/sc.2014.68 2014
[26]

Jihu Guo, Rui Xia, Jie Liu, Xiaoxiong Zhu, and Xiang Zhang. 2024. CAMLB-SpMV: An Efficient Cache-Aware Memory Load-Balancing SpMV on CPU. InProceedings of the 53rd International Conference on Parallel Processing. 640–649

work page 2024
[27]

Haonan Ji, Huimin Song, Shibo Lu, Zhou Jin, Guangming Tan, and Weifeng Liu

work page
[28]

InProceedings of the 51st International Conference on Parallel Processing

Tilespmspv: A tiled algorithm for sparse matrix-sparse vector multiplication on gpus. InProceedings of the 51st International Conference on Parallel Processing. 1–11

work page
[29]

Kwangrae Kim and Ki-Seok Chung. 2024. CAMPuS: Concurrent Acceleration of Memory Access and Parallel Processing in Near-Memory SpMV Architecture. IEEE Access(2024)

work page 2024
[30]

Kornilios Kourtis, Vasileios Karakasis, Georgios Goumas, and Nectarios Koziris

work page
[31]

ACM SIGPLAN Notices46, 8 (2011), 247–256

CSX: an extended compression format for spmv on shared memory systems. ACM SIGPLAN Notices46, 8 (2011), 247–256

work page 2011
[32]

Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R Bishop. 2014. A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units.SIAM Journal on Scientific Computing36, 5 (2014), C401–C423

work page 2014
[33]

Paolo Sylos Labini, Massimo Bernaschi, Werner Nutt, Francesco Silvestri, and Flavio Vella. 2022. Blocking Sparse Matrices to Leverage Dense-Specific Multipli- cation. In2022 IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms (IA3). IEEE, 19–24

work page 2022
[34]

Jiajia Li, Jimeng Sun, and Richard Vuduc. 2018. HiCOO: Hierarchical storage of sparse tensors. InSC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 238–252

work page 2018
[35]

Wenxuan Li, Helin Cheng, Zhengyang Lu, Yuechen Lu, and Weifeng Liu. 2023. Haspmv: Heterogeneity-aware sparse matrix-vector multiplication on modern asymmetric multicore processors. In2023 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 209–220

work page 2023
[36]

Yun Liang, Wai Teng Tang, Ruizhe Zhao, Mian Lu, Huynh Phung Huynh, and Rick Siow Mong Goh. 2017. Scale-free sparse matrix-vector multiplication on many- core architectures.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems36, 12 (2017), 2106–2119

work page 2017
[37]

Lifeng Liu, Meilin Liu, Chongjun Wang, and Jun Wang. 2015. LSRB-CSR: A low overhead storage format for SpMV on the GPU systems. In2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS). IEEE, 733– 741

work page 2015
[38]

Weifeng Liu and Brian Vinter. 2015. CSR5: An efficient storage format for cross- platform sparse matrix-vector multiplication. InProceedings of the 29th ACM on International Conference on Supercomputing. 339–350

work page 2015
[39]

Yuechen Lu and Weifeng Liu. 2023. DASP: Specific Dense Matrix Multiply- Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14

work page 2023
[40]

Zhengyang Lu, Yuyao Niu, and Weifeng Liu. 2020. Efficient block algorithms for parallel sparse triangular solve. InProceedings of the 49th International Conference on Parallel Processing. 1–11

work page 2020
[41]

Duane Merrill and Michael Garland. 2016. Merge-based parallel sparse matrix- vector multiplication. InSC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 678–689

work page 2016
[42]

Hongli Mi, Xiangrui Yu, Xiaosong Yu, Shuangyuan Wu, and Weifeng Liu. 2023. Balancing computation and communication in distributed sparse matrix-vector multiplication. In2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, 535–544

work page 2023
[43]

Juan J Navarro, Elena García-Diego, Josep-L Larriba-Pey, and Toni Juan. 1996. Block algorithms for sparse matrix computations on high performance work- stations. InProceedings of the 10th international conference on Supercomputing. 301–308

work page 1996
[44]

Yuyao Niu, Zhengyang Lu, Meichen Dong, Zhou Jin, Weifeng Liu, and Guangming Tan. 2021. Tilespmv: A tiled algorithm for sparse matrix-vector multiplication on gpus. In2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 68–78

work page 2021
[45]

Yuyao Niu, Zhengyang Lu, Haonan Ji, Shuhui Song, Zhou Jin, and Weifeng Liu

work page
[46]

InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

TileSpGEMM: A tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs. InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 90–106

work page
[47]

Muhammad Osama, Serban D Porumbescu, and John D Owens. 2023. A program- ming model for GPU load balancing. InProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. 79–91

work page 2023
[48]

James O’Neil and Daniel B Szyld. 1990. A block ordering method for sparse matrices.SIAM J. Sci. Statist. Comput.11, 5 (1990), 811–823

work page 1990
[49]

Markus Steinberger, Rhaleb Zayer, and Hans-Peter Seidel. 2017. Globally homo- geneous, locally adaptive sparse matrix-vector multiplication on the GPU. In Proceedings of the International Conference on Supercomputing. 1–11

work page 2017
[50]

Abdul Rehman Tareen, Marius Meyer, Christian Plessl, and Tobias Kenter. 2024. HiHiSpMV: Sparse Matrix Vector Multiplication with Hierarchical Row Reduc- tions on FPGAs with High Bandwidth Memory. In2024 IEEE 32nd Annual Inter- national Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 32–42

work page 2024
[51]

Richard W Vuduc and Hyun-Jin Moon. 2005. Fast sparse matrix-vector multi- plication by exploiting variable block structure. InHigh Performance Computing and Communications: First International Conference, HPCC 2005, Sorrento, Italy, September 21-23, 2005. Proceedings 1. Springer, 807–816

work page 2005
[52]

Chenhao Xie, Jieyang Chen, Jesun Firoz, Jiajia Li, Shuaiwen Leon Song, Kevin Barker, Mark Raugas, and Ang Li. 2021. Fast and scalable sparse triangular solver for multi-gpu based hpc architectures. InProceedings of the 50th International Conference on Parallel Processing. 1–11

work page 2021
[53]

Shengen Yan, Chao Li, Yunquan Zhang, and Huiyang Zhou. 2014. yaSpMV: Yet another SpMV framework on GPUs.Acm Sigplan Notices49, 8 (2014), 107–118

work page 2014
[54]

Wangdong Yang, Kenli Li, and Keqin Li. 2018. A parallel computing method using blocked format with optimal partitioning for SpMV on GPU.Journal of computer and system sciences92 (2018), 152–170

work page 2018
[55]

Serif Yesil, Azin Heidarshenas, Adam Morrison, and Josep Torrellas. 2020. Speed- ing up SpMV for power-law graph analytics by enhancing locality & vectorization. InSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–15

work page 2020
[56]

Yichen Zhang, Shengguo Li, Fan Yuan, Dezun Dong, Xiaojian Yang, Tiejun Li, and Zheng Wang. 2023. Memory-aware optimization for sequences of sparse matrix-vector multiplications. In2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 379–389

work page 2023

[1] [1]

Hartwig Anzt, Terry Cojean, Chen Yen-Chen, Jack Dongarra, Goran Flegar, Pratik Nayak, Stanimire Tomov, Yuhsiang M Tsai, and Weichung Wang. 2020. Load- balancing sparse matrix vector product kernels on gpus.ACM Transactions on Parallel Computing (TOPC)7, 1 (2020), 1–26

work page 2020

[2] [2]

Arash Ashari, Naser Sedaghati, John Eisenlohr, Srinivasan Parthasarath, and P Sadayappan. 2014. Fast sparse matrix-vector multiplication on GPUs for graph applications. InSC’14: Proceedings of the International Conference for High Perfor- mance Computing, Networking, Storage and Analysis. IEEE, 781–792

work page 2014

[3] [3]

Arash Ashari, Naser Sedaghati, John Eisenlohr, and P Sadayappan. 2014. An effi- cient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs. InProceedings of the 28th ACM international conference on Supercom- puting. 273–282

work page 2014

[4] [4]

Nathan Bell and Michael Garland. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. InProceedings of the conference on high performance computing networking, storage and analysis. 1–11

work page 2009

[5] [5]

Deshun Bi, Xiaowen Tian, Shengguo Li, and Dezun Dong. 2023. Efficiently Running SpMV on Multi-Core DSPs for Block Sparse Matrix. In2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS). IEEE, 1912– 1919

work page 2023

[6] [6]

Haodong Bian, Jianqiang Huang, Lingbin Liu, Dongqiang Huang, and Xiaoying Wang. 2021. Albus: A method for efficiently processing spmv using simd and load balancing.Future Generation Computer Systems116 (2021), 371–392

work page 2021

[7] [7]

Urban Borštnik, Joost VandeVondele, Valéry Weber, and Jürg Hutter. 2014. Sparse matrix multiplication: The distributed block-compressed sparse row library.Par- allel Comput.40, 5-6 (2014), 47–58

work page 2014

[8] [8]

Aydin Buluç, Jeremy T Fineman, Matteo Frigo, John R Gilbert, and Charles E Leiserson. 2009. Parallel sparse matrix-vector and matrix-transpose-vector multi- plication using compressed sparse blocks. InProceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures. 233–244

work page 2009

[9] [9]

Aydin Buluc and John R Gilbert. 2008. On the representation and multiplication of hypersparse matrices. In2008 IEEE International Symposium on Parallel and Distributed Processing. IEEE, 1–11. ICS ’25, June 8–11, 2025, Salt Lake City, UT, USA Xing Cong, Fukai Sun, Yifan Chen, Chenhao Xie, Yi Liu, and Depei Qian

work page 2008

[10] [10]

Aydin Buluç and John R Gilbert. 2012. Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments.SIAM Journal on Scientific Computing34, 4 (2012), C170–C191

work page 2012

[11] [11]

Aydin Buluç, Samuel Williams, Leonid Oliker, and James Demmel. 2011. Reduced- bandwidth multithreaded algorithms for sparse matrix-vector multiplication. In2011 IEEE International Parallel & Distributed Processing Symposium. IEEE, 721–733

work page 2011

[12] [12]

Genshen Chu, Yuanjie He, Lingyu Dong, Zhezhao Ding, Dandan Chen, He Bai, Xuesong Wang, and Changjun Hu. 2023. Efficient Algorithm Design of Opti- mizing SpMV on GPU. InProceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing. 115–128

work page 2023

[13] [13]

2012.CUDA programming: a developer’s guide to parallel computing with GPUs

Shane Cook. 2012.CUDA programming: a developer’s guide to parallel computing with GPUs. Newnes

work page 2012

[14] [14]

NVIDIA Corporation. 2024. cuSPARSE: GPU-Accelerated Sparse Matrix Library. https://developer.nvidia.com/cusparse. Version 12.4

work page 2024

[15] [15]

Timothy A Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection.ACM Transactions on Mathematical Software (TOMS)38, 1 (2011), 1–25

work page 2011

[16] [16]

Zhen Du, Jiajia Li, Yinshan Wang, Xueqi Li, Guangming Tan, and Ninghui Sun

work page

[17] [17]

InSC22: International Conference for High Performance Computing, Networking, Storage and Analysis

Alphasparse: Generating high performance spmv codes directly from sparse matrices. InSC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–15

work page

[18] [18]

Ryan Eberhardt and Mark Hoemmen. 2016. Optimization of block sparse matrix- vector multiplication on shared-memory parallel architectures. In2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 663–672

work page 2016

[19] [19]

Ruibo Fan, Wei Wang, and Xiaowen Chu. 2024. DTC-SpMM: Bridging the Gap in Accelerating General Sparse Matrix Multiplication with Tensor Cores. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 253–267

work page 2024

[20] [21]

Sparse matrix-vector multiplication on GPGPUs.ACM Transactions on Mathematical Software (TOMS)43, 4 (2017), 1–49

work page 2017

[21] [22]

Salvatore Filippone, Valeria Cardellini, Davide Barbieri, and Alessandro Fanfarillo

work page

[22] [23]

Sparse Matrix-Vector Multiplication on GPGPUs.ACM Trans. Math. Softw. 43, 4, Article 30 (Jan. 2017), 49 pages. https://doi.org/10.1145/3017994

work page doi:10.1145/3017994 2017

[23] [24]

Jianhua Gao, Weixing Ji, Zhaonian Tan, Yizhuo Wang, and Feng Shi. 2022. Taichi: A hybrid compression format for binary sparse matrix-vector multiplication on gpu.IEEE Transactions on Parallel and Distributed Systems33, 12 (2022), 3732–3745

work page 2022

[24] [25]

Greathouse and Mayank Daga

Joseph L. Greathouse and Mayank Daga. 2014. Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format. InSC ’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 769–780. https://doi.org/10.1109/SC.2014.68

work page doi:10.1109/sc.2014.68 2014

[25] [26]

Jihu Guo, Rui Xia, Jie Liu, Xiaoxiong Zhu, and Xiang Zhang. 2024. CAMLB-SpMV: An Efficient Cache-Aware Memory Load-Balancing SpMV on CPU. InProceedings of the 53rd International Conference on Parallel Processing. 640–649

work page 2024

[26] [27]

Haonan Ji, Huimin Song, Shibo Lu, Zhou Jin, Guangming Tan, and Weifeng Liu

work page

[27] [28]

InProceedings of the 51st International Conference on Parallel Processing

Tilespmspv: A tiled algorithm for sparse matrix-sparse vector multiplication on gpus. InProceedings of the 51st International Conference on Parallel Processing. 1–11

work page

[28] [29]

Kwangrae Kim and Ki-Seok Chung. 2024. CAMPuS: Concurrent Acceleration of Memory Access and Parallel Processing in Near-Memory SpMV Architecture. IEEE Access(2024)

work page 2024

[29] [30]

Kornilios Kourtis, Vasileios Karakasis, Georgios Goumas, and Nectarios Koziris

work page

[30] [31]

ACM SIGPLAN Notices46, 8 (2011), 247–256

CSX: an extended compression format for spmv on shared memory systems. ACM SIGPLAN Notices46, 8 (2011), 247–256

work page 2011

[31] [32]

Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R Bishop. 2014. A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units.SIAM Journal on Scientific Computing36, 5 (2014), C401–C423

work page 2014

[32] [33]

Paolo Sylos Labini, Massimo Bernaschi, Werner Nutt, Francesco Silvestri, and Flavio Vella. 2022. Blocking Sparse Matrices to Leverage Dense-Specific Multipli- cation. In2022 IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms (IA3). IEEE, 19–24

work page 2022

[33] [34]

Jiajia Li, Jimeng Sun, and Richard Vuduc. 2018. HiCOO: Hierarchical storage of sparse tensors. InSC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 238–252

work page 2018

[34] [35]

Wenxuan Li, Helin Cheng, Zhengyang Lu, Yuechen Lu, and Weifeng Liu. 2023. Haspmv: Heterogeneity-aware sparse matrix-vector multiplication on modern asymmetric multicore processors. In2023 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 209–220

work page 2023

[35] [36]

Yun Liang, Wai Teng Tang, Ruizhe Zhao, Mian Lu, Huynh Phung Huynh, and Rick Siow Mong Goh. 2017. Scale-free sparse matrix-vector multiplication on many- core architectures.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems36, 12 (2017), 2106–2119

work page 2017

[36] [37]

Lifeng Liu, Meilin Liu, Chongjun Wang, and Jun Wang. 2015. LSRB-CSR: A low overhead storage format for SpMV on the GPU systems. In2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS). IEEE, 733– 741

work page 2015

[37] [38]

Weifeng Liu and Brian Vinter. 2015. CSR5: An efficient storage format for cross- platform sparse matrix-vector multiplication. InProceedings of the 29th ACM on International Conference on Supercomputing. 339–350

work page 2015

[38] [39]

Yuechen Lu and Weifeng Liu. 2023. DASP: Specific Dense Matrix Multiply- Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–14

work page 2023

[39] [40]

Zhengyang Lu, Yuyao Niu, and Weifeng Liu. 2020. Efficient block algorithms for parallel sparse triangular solve. InProceedings of the 49th International Conference on Parallel Processing. 1–11

work page 2020

[40] [41]

Duane Merrill and Michael Garland. 2016. Merge-based parallel sparse matrix- vector multiplication. InSC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 678–689

work page 2016

[41] [42]

Hongli Mi, Xiangrui Yu, Xiaosong Yu, Shuangyuan Wu, and Weifeng Liu. 2023. Balancing computation and communication in distributed sparse matrix-vector multiplication. In2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, 535–544

work page 2023

[42] [43]

Juan J Navarro, Elena García-Diego, Josep-L Larriba-Pey, and Toni Juan. 1996. Block algorithms for sparse matrix computations on high performance work- stations. InProceedings of the 10th international conference on Supercomputing. 301–308

work page 1996

[43] [44]

Yuyao Niu, Zhengyang Lu, Meichen Dong, Zhou Jin, Weifeng Liu, and Guangming Tan. 2021. Tilespmv: A tiled algorithm for sparse matrix-vector multiplication on gpus. In2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 68–78

work page 2021

[44] [45]

Yuyao Niu, Zhengyang Lu, Haonan Ji, Shuhui Song, Zhou Jin, and Weifeng Liu

work page

[45] [46]

InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

TileSpGEMM: A tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs. InProceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 90–106

work page

[46] [47]

Muhammad Osama, Serban D Porumbescu, and John D Owens. 2023. A program- ming model for GPU load balancing. InProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. 79–91

work page 2023

[47] [48]

James O’Neil and Daniel B Szyld. 1990. A block ordering method for sparse matrices.SIAM J. Sci. Statist. Comput.11, 5 (1990), 811–823

work page 1990

[48] [49]

Markus Steinberger, Rhaleb Zayer, and Hans-Peter Seidel. 2017. Globally homo- geneous, locally adaptive sparse matrix-vector multiplication on the GPU. In Proceedings of the International Conference on Supercomputing. 1–11

work page 2017

[49] [50]

Abdul Rehman Tareen, Marius Meyer, Christian Plessl, and Tobias Kenter. 2024. HiHiSpMV: Sparse Matrix Vector Multiplication with Hierarchical Row Reduc- tions on FPGAs with High Bandwidth Memory. In2024 IEEE 32nd Annual Inter- national Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 32–42

work page 2024

[50] [51]

Richard W Vuduc and Hyun-Jin Moon. 2005. Fast sparse matrix-vector multi- plication by exploiting variable block structure. InHigh Performance Computing and Communications: First International Conference, HPCC 2005, Sorrento, Italy, September 21-23, 2005. Proceedings 1. Springer, 807–816

work page 2005

[51] [52]

Chenhao Xie, Jieyang Chen, Jesun Firoz, Jiajia Li, Shuaiwen Leon Song, Kevin Barker, Mark Raugas, and Ang Li. 2021. Fast and scalable sparse triangular solver for multi-gpu based hpc architectures. InProceedings of the 50th International Conference on Parallel Processing. 1–11

work page 2021

[52] [53]

Shengen Yan, Chao Li, Yunquan Zhang, and Huiyang Zhou. 2014. yaSpMV: Yet another SpMV framework on GPUs.Acm Sigplan Notices49, 8 (2014), 107–118

work page 2014

[53] [54]

Wangdong Yang, Kenli Li, and Keqin Li. 2018. A parallel computing method using blocked format with optimal partitioning for SpMV on GPU.Journal of computer and system sciences92 (2018), 152–170

work page 2018

[54] [55]

Serif Yesil, Azin Heidarshenas, Adam Morrison, and Josep Torrellas. 2020. Speed- ing up SpMV for power-law graph analytics by enhancing locality & vectorization. InSC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–15

work page 2020

[55] [56]

Yichen Zhang, Shengguo Li, Fan Yuan, Dezun Dong, Xiaojian Yang, Tiejun Li, and Zheng Wang. 2023. Memory-aware optimization for sequences of sparse matrix-vector multiplications. In2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 379–389

work page 2023