AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures
Pith reviewed 2026-05-10 04:31 UTC · model grok-4.3
The pith
Asynchronous GPU features power new SpMM kernels that beat existing libraries by up to 6x.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that co-designed BCSR and WCSR kernels, which exploit TMA asynchronous data movement and warp specialization to hide latency, outperform prior SpMM implementations. WCSR achieves a 1.47x speedup over AccSpMM and a 6.24x speedup over cuSPARSE across SuiteSparse matrices, while BCSR yields a combined 2.66x end-to-end speedup on Qwen2.5-7B prefill at 90% block sparsity with 64K tokens relative to cuDNN and cuBLAS.
What carries the argument
Warp-specialized producer-consumer pipeline that overlaps TMA data transfers with WGMMA computation, using BCSR format for structured sparsity and WCSR format for windowed irregular sparsity with cross-block row splitting.
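To make the pipeline shape concrete, the sketch below shows warp-specialized producer-consumer staging over one BCSR block row in CUDA C++. It is a hedged illustration, not the paper's kernel: the paper uses TMA bulk copies and WGMMA on Hopper, which this sketch approximates with cuda::memcpy_async and a scalar inner product, and the kernel name, tile size, and thread mapping are all invented.

// Hedged CUDA C++ sketch of warp-specialized producer-consumer staging over
// one BCSR block row. TMA is approximated by cuda::memcpy_async and WGMMA by
// a scalar inner product; kernel name, tile size, and mapping are invented.
#include <cooperative_groups.h>
#include <cuda/pipeline>
namespace cg = cooperative_groups;

constexpr int TILE   = 16;  // toy BCSR block edge (assumed)
constexpr int STAGES = 2;   // double-buffered shared-memory stages

__global__ void bcsr_spmm_sketch(const float* A_vals,  // packed TILE*TILE blocks
                                 const int* A_cols,    // block-column indices
                                 const int* A_rowptr,  // block-row pointers
                                 const float* B,       // dense, row-major, ld = n
                                 float* C, int n) {
  __shared__ float sA[STAGES][TILE][TILE];
  __shared__ float sB[STAGES][TILE][TILE];
  __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, STAGES> state;

  auto block = cg::this_thread_block();
  const bool producer = threadIdx.x < 32;   // warp 0 only moves data
  auto pipe = cuda::make_pipeline(block, &state,
      producer ? cuda::pipeline_role::producer : cuda::pipeline_role::consumer);

  const int brow = blockIdx.x;              // one block row per thread block
  const int begin = A_rowptr[brow], end = A_rowptr[brow + 1];

  if (producer) {
    for (int k = begin; k < end; ++k) {
      const int s = (k - begin) % STAGES;
      pipe.producer_acquire();              // wait for a free stage
      for (int r = threadIdx.x; r < TILE; r += 32) {
        cuda::memcpy_async(&sA[s][r][0], A_vals + ((size_t)k * TILE + r) * TILE,
                           sizeof(float) * TILE, pipe);
        cuda::memcpy_async(&sB[s][r][0], B + ((size_t)A_cols[k] * TILE + r) * n,
                           sizeof(float) * TILE, pipe);
      }
      pipe.producer_commit();               // stage k is in flight
    }
  } else {
    const int tid = threadIdx.x - 32;       // consumers: one C element each
    const int r = tid / TILE, c = tid % TILE;
    float acc = 0.f;
    for (int k = begin; k < end; ++k) {
      const int s = (k - begin) % STAGES;
      pipe.consumer_wait();                 // block k is resident
      for (int t = 0; t < TILE; ++t)        // WGMMA stand-in
        acc += sA[s][r][t] * sB[s][t][c];
      pipe.consumer_release();              // hand the stage back
    }
    C[((size_t)brow * TILE + r) * n + c] += acc;  // toy: first TILE columns only
  }
}
// Launch (toy): bcsr_spmm_sketch<<<num_block_rows, 32 + TILE * TILE>>>(...);

The structural point carried by the paper's design is visible here: the producer loop contains no arithmetic and the consumer loop issues no global-memory loads, so staging block k can overlap computation on block k-1.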
If this is right
- Block-sparse LLM inference at high sparsity can run more than twice as fast end-to-end compared with dense library baselines.
- Irregular sparse matrices from standard collections can be multiplied 1.47 times faster than the previous best kernel and over six times faster than cuSPARSE.
- Asynchronous memory features become a primary optimization target for sparse linear-algebra kernels on current GPUs.
- Load balancing via window splitting across thread blocks improves throughput on non-uniform sparse data (a partitioning sketch follows this list).
- New sparse storage formats aligned with hardware async primitives enable better overlap of data movement and computation.
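As flagged in the load-balancing point above, a host-side partitioner for window splitting might look like the following. This is a guess at the mechanism, not code from the paper; the work-item layout, the per-block nonzero budget, and all names are assumptions.

// Hypothetical host-side partitioner: windows whose nonzero count exceeds a
// per-thread-block budget are split into several work items of bounded size,
// so the launch grid receives roughly equal work per block.
#include <algorithm>
#include <cstdint>
#include <vector>

struct WorkItem {              // one thread block processes one item
  int32_t window;              // row-window index
  int32_t nnz_begin, nnz_end;  // half-open nonzero slice within that window
};

std::vector<WorkItem> split_windows(const std::vector<int32_t>& win_ptr,
                                    int32_t max_nnz_per_block) {
  std::vector<WorkItem> items;
  for (int32_t w = 0; w + 1 < (int32_t)win_ptr.size(); ++w)
    for (int32_t b = win_ptr[w]; b < win_ptr[w + 1]; b += max_nnz_per_block)
      items.push_back({w, b, std::min(b + max_nnz_per_block, win_ptr[w + 1])});
  return items;                // grid size = items.size()
}

Thread blocks that share a split window must then combine partial results, for example with atomic adds into C, which is the usual cost of cross-block row splitting.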
Where Pith is reading between the lines
- Similar producer-consumer pipelines could accelerate other sparse kernels such as SpMV or sparse convolutions on the same hardware.
- Adoption of these kernels might lower overall energy use in data centers that run sparse machine-learning workloads.
- Future GPU designs may benefit from exposing even more flexible asynchronous primitives once their value for sparse codes is shown.
- Real-world models may increasingly adopt block or window sparsity patterns to exploit these performance gains.
Load-bearing premise
That practical sparsity patterns will match the block-structured or windowed irregular forms assumed by BCSR and WCSR, and that the asynchronous overlap adds negligible overhead on target hardware.
What would settle it
Measuring performance on a collection of matrices or model weights whose nonzero pattern is uniformly random at the same density, with no block or window structure, to determine whether the reported speedups remain or reverse.
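Concretely, such a test could synthesize CSR patterns whose nonzeros are placed independently and uniformly at a matched density, destroying any block or window structure. A minimal host-side sketch, with all names invented:

// Hypothetical generator for the falsification test: per-row nonzero counts
// are Binomial(cols, density) and column positions are a uniform sample, which
// is exactly an i.i.d. Bernoulli(density) pattern with no block structure.
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <random>
#include <vector>

struct Csr { std::vector<int32_t> rowptr, col; };   // pattern only

Csr random_pattern(int rows, int cols, double density, uint64_t seed) {
  std::mt19937_64 rng(seed);
  std::binomial_distribution<int> per_row(cols, density);
  std::vector<int32_t> universe(cols);
  for (int c = 0; c < cols; ++c) universe[c] = c;
  Csr m;
  m.rowptr.push_back(0);
  for (int r = 0; r < rows; ++r) {
    int k = per_row(rng);
    // std::sample preserves order, so columns come out sorted as CSR needs.
    std::sample(universe.begin(), universe.end(),
                std::back_inserter(m.col), k, rng);
    m.rowptr.push_back((int32_t)m.col.size());
  }
  return m;
}

Running the WCSR kernel and the baselines over such matrices at the paper's densities would show whether the reported gap survives without exploitable structure.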
Original abstract
Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel across scientific computing and machine learning. While prior work accelerates SpMM using Tensor Cores, no existing sparse kernel exploits the asynchronous features of modern GPU architectures, such as NVIDIA's Tensor Memory Accelerator (TMA) and warp specialization. This work systematically studies how these features impact SpMM performance and introduces two co-designed kernels. For structured sparsity, we optimize a warp-specialized producer-consumer pipeline overlapping TMA data transfer with WGMMA computation using Block Compressed Sparse Row (BCSR) format. For irregular sparsity, we design a Window Compressed Sparse Row (WCSR) kernel that loads the sparse operand via TMA and splits large row-windows across thread blocks for load balancing. Our WCSR kernel outperforms all prior SpMM kernels on SuiteSparse matrices (1.47x over AccSpMM, 6.24x over cuSPARSE). Our BCSR kernel achieves a combined 2.66x end-to-end speedup on Qwen2.5-7B prefill at 90% block sparsity with 64K tokens over cuDNN/cuBLAS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AsyncSparse, a set of SpMM kernels that exploit NVIDIA GPU asynchronous features (TMA transfers overlapped with WGMMA via warp specialization). For block-structured sparsity it uses a BCSR format with a producer-consumer pipeline; for irregular sparsity it uses a WCSR format that splits row windows across thread blocks. On SuiteSparse matrices the WCSR kernel is reported to deliver 1.47× over AccSpMM and 6.24× over cuSPARSE; the BCSR kernel yields a 2.66× end-to-end speedup on Qwen2.5-7B prefill at 90% block sparsity with 64K tokens.
Significance. If the performance claims are substantiated, the work provides concrete evidence that modern asynchronous GPU primitives can be profitably co-designed with sparse formats for both regular and irregular SpMM, which is relevant to large-scale ML inference and scientific computing workloads. The explicit use of TMA-WGMMA overlap and warp specialization is a timely contribution given the evolution of Hopper and later architectures.
Major comments (3)
- [Evaluation / Kernel Design sections] The central claim attributes the reported speedups primarily to the exploitation of asynchronous TMA-WGMMA overlap and warp specialization. However, both kernels also introduce new compressed formats (BCSR and WCSR). No ablation is presented that holds the sparse format fixed while disabling the asynchronous components (e.g., replacing TMA with synchronous loads or removing producer-consumer pipelining). This omission makes it impossible to determine how much of the 1.47×/6.24× and 2.66× gains are due to the async features versus the format changes themselves.
- [Abstract and Results] The abstract and results sections state concrete speedups (1.47×, 6.24×, 2.66×) but provide no details on measurement methodology, number of runs, variance, hardware configuration, or full baseline library versions and compilation flags. Without these, the reproducibility and robustness of the performance numbers cannot be assessed.
- [End-to-end evaluation] The BCSR kernel's 2.66× end-to-end claim is conditioned on 90% block sparsity with 64K tokens. The paper does not discuss how this sparsity level is obtained in practice for Qwen2.5-7B or whether the reported gains degrade under more realistic or lower sparsity patterns.
Minor comments (2)
- [Introduction / Kernel Design] Notation for BCSR and WCSR should be introduced with a small diagram or pseudocode in the first section where they appear, rather than only in the kernel description.
- [Experimental setup] The paper should cite the exact cuSPARSE, cuDNN, and cuBLAS versions used for all baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below and commit to revisions that will strengthen the manuscript's claims and reproducibility.
Point-by-point responses
Referee: [Evaluation / Kernel Design sections] The central claim attributes the reported speedups primarily to the exploitation of asynchronous TMA-WGMMA overlap and warp specialization. However, both kernels also introduce new compressed formats (BCSR and WCSR). No ablation is presented that holds the sparse format fixed while disabling the asynchronous components (e.g., replacing TMA with synchronous loads or removing producer-consumer pipelining). This omission makes it impossible to determine how much of the 1.47×/6.24× and 2.66× gains are due to the async features versus the format changes themselves.
Authors: We agree that an ablation isolating the asynchronous primitives from the format changes would provide stronger evidence. The BCSR and WCSR formats were co-designed specifically to enable efficient TMA usage and warp-specialized pipelining. In the revision we will add an ablation study that keeps the sparse formats fixed while replacing TMA loads with synchronous equivalents and disabling the producer-consumer pipeline, allowing direct quantification of the async contribution on the same SuiteSparse and model workloads. revision: yes
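For illustration, the synchronous comparator that this ablation calls for could keep the tiles and thread mapping of the pipelined sketch given earlier on this page while replacing the pipeline with plain loads and barriers. A hedged sketch with invented names and sizes, not the authors' code:

// Hedged sketch of the synchronous comparator: identical BCSR tiling to the
// pipelined sketch, but ordinary loads plus barriers, so staging block k
// serializes with computing block k-1.
constexpr int TILE_S = 16;     // toy block edge, matching the earlier sketch

__global__ void bcsr_spmm_sync(const float* A_vals, const int* A_cols,
                               const int* A_rowptr, const float* B,
                               float* C, int n) {
  __shared__ float sA[TILE_S][TILE_S], sB[TILE_S][TILE_S];
  const int brow = blockIdx.x;
  const int r = threadIdx.x / TILE_S, c = threadIdx.x % TILE_S;
  float acc = 0.f;
  for (int k = A_rowptr[brow]; k < A_rowptr[brow + 1]; ++k) {
    // Every thread helps stage; there is no producer/consumer split.
    sA[r][c] = A_vals[((size_t)k * TILE_S + r) * TILE_S + c];
    sB[r][c] = B[((size_t)A_cols[k] * TILE_S + r) * n + c];
    __syncthreads();                      // tiles resident before any math
    for (int t = 0; t < TILE_S; ++t)
      acc += sA[r][t] * sB[t][c];
    __syncthreads();                      // done reading before next overwrite
  }
  C[((size_t)brow * TILE_S + r) * n + c] += acc;  // toy: first TILE_S columns
}
// Launch with TILE_S * TILE_S threads per block, one block row per block.

The gap between this kernel and the pipelined one, at a fixed format and tiling, is what isolates the asynchronous contribution.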
Referee: [Abstract and Results] The abstract and results sections state concrete speedups (1.47×, 6.24×, 2.66×) but provide no details on measurement methodology, number of runs, variance, hardware configuration, or full baseline library versions and compilation flags. Without these, the reproducibility and robustness of the performance numbers cannot be assessed.
Authors: We acknowledge that the current manuscript lacks sufficient methodological detail. The revised version will include an expanded evaluation subsection that reports: the exact GPU (NVIDIA H100), CUDA 12.4, number of runs (median of 10), standard deviation across runs, precise library versions (cuSPARSE 12.4, cuDNN 9.0, cuBLAS), and compilation flags (-O3 with sm_90a). Kernel timings will be explicitly defined as device-side execution time. revision: yes
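A device-side timing harness matching that description might look like the sketch below: CUDA events around each launch, warm-up runs excluded, median of N samples reported. Generic code, not the authors' harness.

// Generic CUDA event timing harness: warm-up, then N timed launches, report
// the median of device-side kernel times. Sketch only; not the paper's code.
#include <algorithm>
#include <cuda_runtime.h>
#include <vector>

template <typename Launch>
float median_kernel_ms(Launch launch, int warmup = 3, int runs = 10) {
  cudaEvent_t beg, end;
  cudaEventCreate(&beg);
  cudaEventCreate(&end);
  for (int i = 0; i < warmup; ++i) launch();      // exclude cold-start effects
  std::vector<float> ms(runs);
  for (int i = 0; i < runs; ++i) {
    cudaEventRecord(beg);
    launch();                                     // the kernel under test
    cudaEventRecord(end);
    cudaEventSynchronize(end);                    // wait for device completion
    cudaEventElapsedTime(&ms[i], beg, end);       // device-side elapsed time
  }
  cudaEventDestroy(beg);
  cudaEventDestroy(end);
  std::sort(ms.begin(), ms.end());
  return ms[runs / 2];                            // median of `runs` samples
}
// Usage: float t = median_kernel_ms([&]{ my_kernel<<<grid, block>>>(args); });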
Referee: [End-to-end evaluation] The BCSR kernel's 2.66× end-to-end claim is conditioned on 90% block sparsity with 64K tokens. The paper does not discuss how this sparsity level is obtained in practice for Qwen2.5-7B or whether the reported gains degrade under more realistic or lower sparsity patterns.
Authors: The 90% block sparsity is produced by applying block-wise structured pruning to the Qwen2.5-7B weights. We will revise the end-to-end section to describe the pruning procedure and add results at 50% and 70% block sparsity (same 64K-token prefill) to illustrate how speedups vary with sparsity level, thereby addressing robustness under more realistic patterns. revision: yes
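Since the pruning criterion is not specified in the abstract, here is one plausible form of block-wise structured pruning: rank B x B blocks by Frobenius norm and zero the weakest fraction. The criterion, function name, and layout are assumptions for illustration only.

// Hypothetical block-wise magnitude pruning at a target block sparsity.
// W is a row-major rows x cols weight matrix; the (sparsity)-fraction of
// B x B blocks with the smallest Frobenius norm is zeroed. Assumes B divides
// both dimensions.
#include <algorithm>
#include <utility>
#include <vector>

void block_prune(std::vector<float>& W, int rows, int cols, int B,
                 double sparsity) {
  const int br = rows / B, bc = cols / B;
  std::vector<std::pair<float, int>> score(br * bc);
  for (int i = 0; i < br; ++i)
    for (int j = 0; j < bc; ++j) {
      float s = 0.f;                              // squared Frobenius norm
      for (int r = 0; r < B; ++r)
        for (int c = 0; c < B; ++c) {
          float v = W[(size_t)(i * B + r) * cols + j * B + c];
          s += v * v;
        }
      score[i * bc + j] = {s, i * bc + j};
    }
  std::sort(score.begin(), score.end());          // weakest blocks first
  const int kill = (int)(sparsity * br * bc);
  for (int k = 0; k < kill; ++k) {
    int i = score[k].second / bc, j = score[k].second % bc;
    for (int r = 0; r < B; ++r)
      for (int c = 0; c < B; ++c)
        W[(size_t)(i * B + r) * cols + j * B + c] = 0.f;
  }
}

Any such criterion yields the block-structured zero pattern that BCSR stores; the promised 50% and 70% settings correspond to sparsity = 0.5 and 0.7 here.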
Circularity Check
No circularity: the empirical kernel benchmarks are grounded in comparisons against external baselines
Full rationale
The paper reports measured speedups from two new SpMM kernels (BCSR for structured block sparsity and WCSR for irregular cases) that exploit TMA/WGMMA asynchronous overlap on modern GPUs. All central claims are direct empirical comparisons to external libraries (cuSPARSE, cuDNN, AccSpMM) on SuiteSparse matrices and an end-to-end LLM prefill workload. No equations, fitted parameters, or predictions are presented that reduce by construction to the paper's own inputs or self-citations. The derivation chain consists of implementation choices and benchmarking, which remain falsifiable against independent hardware runs and do not invoke load-bearing self-citations or ansatzes.