CB-SpMV: A Data Aggregating and Balance Algorithm for Cache-Friendly Block-Based SpMV on GPUs
Pith reviewed 2026-05-20 08:15 UTC · model grok-4.3
The pith
A 2D blocking structure with virtual pointer aggregation and load balancing raises cache hit rates and delivers up to 3.95x speedup for SpMV on GPUs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that four techniques together produce markedly higher cache hit rates and average speedups of up to 3.95x over existing libraries: dividing the input matrix into independent sub-blocks; aggregating intra-block data of different types through virtual pointers to raise cache-level locality; applying block-aware column aggregation together with per-sub-block format selection to improve hardware use; and adding an inter-block load-balancing algorithm. The measurements cover 2,843 matrices from the SuiteSparse Collection on A100 and RTX 4090 GPUs.
What carries the argument
An adaptable 2D blocking structure that supports virtual pointer aggregation of intra-block data and enables block-aware column aggregation plus inter-block load balancing.
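To make the blocking idea concrete, here is a minimal sketch, not the paper's implementation. It assumes that a "virtual pointer" can be modeled as an offset into one contiguous per-matrix buffer that holds each sub-block's row indices, column indices, and values side by side. The names `build_blocks`, `spmv_blocked`, and `directory` are hypothetical, and the code runs on the CPU purely to show the layout.

```python
from collections import defaultdict

def build_blocks(coo, block_size):
    """Group COO entries (row, col, value) into 2D sub-blocks.

    Returns (buffer, directory). `buffer` is one flat list holding, for each
    block in turn, its local row indices, local column indices, and values.
    `directory` stores (block_row, block_col, base_offset, nnz) per block;
    the offsets stand in for "virtual pointers" in this sketch (an assumption).
    """
    groups = defaultdict(list)
    for r, c, v in coo:
        groups[(r // block_size, c // block_size)].append(
            (r % block_size, c % block_size, v))

    buffer, directory = [], []
    for (br, bc), entries in sorted(groups.items()):
        base = len(buffer)
        buffer.extend(e[0] for e in entries)  # local row indices
        buffer.extend(e[1] for e in entries)  # local column indices
        buffer.extend(e[2] for e in entries)  # values
        directory.append((br, bc, base, len(entries)))
    return buffer, directory

def spmv_blocked(buffer, directory, block_size, x, n_rows):
    """Compute y = A @ x by walking the blocks one at a time."""
    y = [0.0] * n_rows
    for br, bc, base, nnz in directory:
        rows, cols, vals = base, base + nnz, base + 2 * nnz
        for k in range(nnz):
            r = br * block_size + buffer[rows + k]
            c = bc * block_size + buffer[cols + k]
            y[r] += buffer[vals + k] * x[c]
    return y

# 4x4 example matrix split into 2x2 sub-blocks.
coo = [(0, 0, 1.0), (0, 3, 2.0), (1, 1, 3.0),
       (2, 0, 4.0), (3, 2, 5.0), (3, 3, 6.0)]
x = [1.0, 2.0, 3.0, 4.0]
buffer, directory = build_blocks(coo, block_size=2)
y = spmv_blocked(buffer, directory, 2, x, n_rows=4)
```

Keeping all three arrays of a block adjacent means one block's working set sits in a single contiguous range, which is the locality effect the page attributes to aggregation. A GPU version would use a typed byte buffer rather than a mixed Python list.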
If this is right
- Higher cache hit rates directly shorten the time spent fetching matrix and vector entries from slower memory levels.
- Per-block format selection lets the same code handle both very sparse and denser local regions without manual retuning.
- Even distribution of work across thread blocks reduces idle time on the GPU during each multiplication.
- The resulting faster SpMV kernels accelerate any larger computation that repeatedly calls matrix-vector multiplication.
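The point about even work distribution can be illustrated with a standard heuristic. The sketch below uses greedy longest-processing-time assignment, which is a swapped-in textbook technique and not necessarily the paper's inter-block balancing algorithm. It assumes that a block's cost is proportional to its non-zero count; `balance_blocks` and the sample counts are hypothetical.

```python
import heapq

def balance_blocks(block_nnz, n_workers):
    """Assign blocks to workers, heaviest block first, always choosing
    the worker with the smallest load so far (greedy LPT heuristic)."""
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]
    order = sorted(range(len(block_nnz)), key=lambda b: -block_nnz[b])
    for b in order:
        load, w = heapq.heappop(heap)
        assignment[w].append(b)
        heapq.heappush(heap, (load + block_nnz[b], w))
    loads = [sum(block_nnz[b] for b in blocks) for blocks in assignment]
    return assignment, loads

# Illustrative non-zero counts for eight sub-blocks, three workers.
block_nnz = [9, 1, 1, 1, 4, 4, 2, 2]
assignment, loads = balance_blocks(block_nnz, 3)
```

For these counts the greedy assignment yields loads of 9, 8, and 7, whereas splitting the blocks contiguously as [9, 1, 1], [1, 4, 4], [2, 2] would give 11, 9, and 4. In this analogy each worker plays the role of a thread block.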
Where Pith is reading between the lines
- The same aggregation and balancing ideas could be tested on other memory-intensive GPU routines such as sparse matrix-matrix multiplication.
- Library developers might adopt similar virtual-pointer layouts as a default storage option for sparse data on GPUs.
- Gains are likely largest for matrices from graph or finite-element applications whose local structure matches the block assumptions.
- The open code release makes it straightforward to measure whether the speedups hold on future GPU generations with different cache sizes.
Load-bearing premise
The 2D blocking, virtual pointer aggregation, column strategies, and balancing steps will improve cache use and hardware efficiency for most sparse matrix patterns without adding offsetting overheads that erase the gains.
What would settle it
Running CB-SpMV on a fresh collection of matrices whose sparsity patterns differ markedly from the SuiteSparse set and recording no rise in cache hit rate or no net reduction in runtime compared with current libraries.
Original abstract
Sparse matrix-vector multiplication (SpMV) is crucial in computational science, engineering, and machine learning. Despite substantial efforts to improve SpMV performance on GPUs through various techniques, issues related to data locality, hardware utilization, and load balancing persist, leaving room for further optimization. This paper presents CB-SpMV, a cache-friendly SpMV optimization algorithm, using a novel data convergent and adaptable 2D blocking structure. The matrix in CB-SpMV is divided into independent sub-blocks, with virtual pointers aggregating different types of intra-block data for better cache-level data locality. To enhance hardware utilization, a block-aware column aggregation strategy and the selection of sub-block formats are proposed to accelerate computation and adapt to varying sparse matrices. Finally, an inter-block load-balancing algorithm is designed to ensure efficient workload distribution across thread blocks. Experimental evaluations on 2,843 matrices from the SuiteSparse Collection show that CB-SpMV significantly improves cache hit rates and achieves average speedups of up to 3.95x over state-of-the-art methods like cuSPARSE-BSR, TileSpMV, and DASP on NVIDIA A100 and RTX 4090 GPUs. The implementation is available at https://github.com/xing-cong/CB-Sparse.
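For readers new to the operation, the following is a reference SpMV over the common CSR (compressed sparse row) layout. It is a plain sequential baseline for orientation only and says nothing about how CB-SpMV stores or traverses the matrix; the function name `spmv_csr` is an assumption of this sketch.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i + 1] delimits the non-zeros of row i inside
    col_idx (column indices) and values (the matching entries).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # irregular read of x
        y[i] = acc
    return y

# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)
```

The indirect access `x[col_idx[k]]` is where data locality becomes an issue, and rows with very different non-zero counts are the source of the load imbalance the abstract mentions.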
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents CB-SpMV, a GPU SpMV algorithm that partitions the matrix into independent sub-blocks using a novel 2D blocking structure, aggregates intra-block data via virtual pointers to improve cache locality, applies block-aware column aggregation and sub-block format selection to boost hardware utilization, and incorporates an inter-block load-balancing algorithm. Experiments on 2,843 SuiteSparse matrices report improved cache hit rates and average speedups of up to 3.95× over cuSPARSE-BSR, TileSpMV, and DASP on NVIDIA A100 and RTX 4090 GPUs, with the implementation released on GitHub.
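The "sub-block format selection" named in the summary can be pictured as a small decision rule applied to each sub-block. The sketch below is an illustration under assumptions: the page does not state the paper's candidate formats, its criteria, or its thresholds, so the three formats ("dense", "coo", "csr"), the function `choose_format`, and both threshold values are hypothetical.

```python
def choose_format(nnz, block_rows, block_cols,
                  dense_threshold=0.5, coo_max_nnz_per_row=1.0):
    """Pick a storage format for one sub-block from its non-zero count.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    density = nnz / (block_rows * block_cols)
    if density >= dense_threshold:
        return "dense"  # mostly full: storing indices would cost more than zeros
    if nnz / block_rows <= coo_max_nnz_per_row:
        return "coo"    # very sparse: a row-pointer array would be mostly empty
    return "csr"        # in between: compress row indices

formats = [choose_format(nnz, 16, 16) for nnz in (200, 10, 64)]
```

A rule of this shape is what would let one kernel family handle both dense and very sparse local regions. Whether the selection cost is recovered at run time is the question the referee's ablation request is aimed at.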
Significance. If the performance claims are substantiated, the work would represent a useful incremental advance in cache- and load-aware SpMV kernels for GPUs, a core primitive in scientific computing and machine learning. The open-source release supports reproducibility and follow-on work. The significance hinges on demonstrating that the proposed aggregation and blocking mechanisms deliver net gains without material offsetting costs across diverse sparsity patterns.
major comments (2)
- [Experimental Evaluation] Experimental Evaluation section: The reported speedups on 2,843 SuiteSparse matrices lack any description of selection criteria, post-hoc exclusions, variance or error-bar computation, or confirmation that baseline libraries (cuSPARSE-BSR, TileSpMV, DASP) were compiled and preprocessed under identical conditions. These omissions make it impossible to assess whether the 3.95× average speedup is robust or sensitive to matrix subset choice.
- [Experimental Evaluation] Experimental Evaluation section: No ablation studies or micro-benchmarks isolate the incremental cache-hit and runtime contributions (or overheads) of virtual-pointer aggregation, block-aware column aggregation, and sub-block format selection versus a plain blocked baseline or the inter-block load balancer alone. Without such isolation, the central claim that the cache-friendly 2D structures are the primary driver of the observed gains cannot be verified and may be confounded by the load balancer.
minor comments (2)
- [Abstract] Abstract: The phrasing 'average speedups of up to 3.95x' is ambiguous; clarify whether this is the mean of per-matrix speedups or the maximum observed average across baselines.
- [Algorithm Description] Notation throughout: Define the precise meaning of 'virtual pointer' and 'sub-block format selection' on first use, and ensure consistent terminology between the algorithmic description and the experimental figures.
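The ambiguity raised in the first minor comment is easy to see with numbers. The sketch below uses invented speedup values and hypothetical baseline names to show three different statistics that the phrase "average speedups of up to" could plausibly describe.

```python
from statistics import mean

# Invented per-matrix speedups against two hypothetical baselines.
speedups = {
    "baseline_A": [1.5, 4.0, 2.0],
    "baseline_B": [1.0, 1.5, 2.0],
}

# Reading 1: average per baseline, then report the largest average.
per_baseline_mean = {name: mean(v) for name, v in speedups.items()}
max_of_means = max(per_baseline_mean.values())

# Reading 2: a single average over every measurement.
overall_mean = mean(s for v in speedups.values() for s in v)

# Reading 3: the largest single speedup observed anywhere.
max_single = max(s for v in speedups.values() for s in v)
```

With these toy values the three readings give 2.5, 2.0, and 4.0 respectively, so stating which one the 3.95x figure reports changes how the headline number should be interpreted.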
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the Experimental Evaluation section to improve transparency and strengthen the supporting evidence for our claims.
- Referee: Experimental Evaluation section: The reported speedups on 2,843 SuiteSparse matrices lack any description of selection criteria, post-hoc exclusions, variance or error-bar computation, or confirmation that baseline libraries (cuSPARSE-BSR, TileSpMV, DASP) were compiled and preprocessed under identical conditions. These omissions make it impossible to assess whether the 3.95× average speedup is robust or sensitive to matrix subset choice.
  Authors: We appreciate the referee's emphasis on rigorous experimental reporting. The 2,843 matrices comprise the full subset of SuiteSparse matrices with at least 1,000 non-zeros that fit in GPU memory on the evaluated platforms; no post-hoc exclusions were applied. All timing results are deterministic for a given matrix, hardware, and kernel launch configuration, which is why per-matrix variance was not originally reported. We confirm that cuSPARSE-BSR, TileSpMV, and DASP were compiled with identical CUDA toolchains, optimization flags, and preprocessing pipelines. In the revised manuscript we have added an explicit subsection describing the selection criteria, the use of geometric means for aggregation, and per-category standard deviations to allow readers to assess robustness across sparsity patterns.
  Revision: yes
- Referee: Experimental Evaluation section: No ablation studies or micro-benchmarks isolate the incremental cache-hit and runtime contributions (or overheads) of virtual-pointer aggregation, block-aware column aggregation, and sub-block format selection versus a plain blocked baseline or the inter-block load balancer alone. Without such isolation, the central claim that the cache-friendly 2D structures are the primary driver of the observed gains cannot be verified and may be confounded by the load balancer.
  Authors: The referee correctly notes that component-wise isolation would further substantiate our central claim. While the original results already link the observed cache-hit-rate improvements directly to the 2D blocking and aggregation mechanisms, we agree that explicit ablations reduce the possibility of confounding. We have therefore added a new set of micro-benchmark experiments in the revised manuscript. These compare (i) a plain blocked baseline, (ii) the inter-block load balancer alone, and (iii) incremental addition of virtual-pointer aggregation, block-aware column aggregation, and sub-block format selection. The new data show that the cache-friendly 2D structures account for the majority of the performance and locality gains, with the load balancer providing complementary but smaller additional benefits.
  Revision: yes
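The first response refers to geometric means for aggregation and per-category standard deviations. The sketch below shows what such a summary could look like. The category names and speedup values are invented for illustration, and `geometric_mean` is written out by hand so the formula is visible.

```python
from math import exp, log
from statistics import mean, stdev

def geometric_mean(values):
    """exp of the mean log; the n-th root of the product of n values."""
    return exp(sum(log(v) for v in values) / len(values))

# Invented per-matrix speedups grouped by a hypothetical matrix category.
by_category = {
    "graph": [2.0, 8.0],
    "fem": [1.0, 4.0, 16.0],
}

summary = {
    name: {
        "geomean": geometric_mean(v),
        "arith_mean": mean(v),
        "stdev": stdev(v),  # sample standard deviation
    }
    for name, v in by_category.items()
}
```

For the "fem" values the arithmetic mean is 7.0 while the geometric mean is about 4.0, which shows why the choice matters for ratios such as speedups: a single large outlier pulls the arithmetic mean up much more than the geometric mean.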
Circularity Check
No circularity: algorithmic construction with independent empirical benchmarks
The paper introduces a 2D blocking structure, virtual pointer aggregation, block-aware column aggregation, sub-block format selection, and inter-block load balancing as explicit algorithmic design choices for cache-friendly SpMV. These mechanisms are described directly in the manuscript without reference to fitted parameters, self-citations that bear the central claim, or uniqueness theorems imported from prior author work. Performance results are reported as measured speedups and cache-hit improvements on 2,843 SuiteSparse matrices against external baselines (cuSPARSE-BSR, TileSpMV, DASP), with no equations equating outputs to quantities defined by the authors' own inputs. The derivation chain consists of engineering decisions followed by external validation and is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: GPU cache behavior and thread-block scheduling follow the documented NVIDIA architecture rules for the A100 and RTX 4090.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: novel data convergent and adaptable 2D blocking structure... virtual pointers aggregating different types of intra-block data... block-aware column aggregation strategy and the selection of sub-block formats... inter-block load-balancing algorithm
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Experimental evaluations on 2,843 matrices... average speedups of up to 3.95× over cuSPARSE-BSR, TileSpMV, and DASP
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Paper venue: ICS '25, June 8–11, 2025, Salt Lake City, UT, USA. Authors: Xing Cong, Fukai Sun, Yifan Chen, Chenhao Xie, Yi Liu, and Depei Qian.