AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures
Pith reviewed 2026-05-10 04:31 UTC · model grok-4.3
The pith
Asynchronous GPU features power new SpMM kernels that beat existing libraries by up to 6x.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that co-designed BCSR and WCSR kernels, which exploit TMA asynchronous data movement and warp specialization to hide latency, outperform prior SpMM implementations. WCSR achieves a 1.47x speedup over AccSpMM and a 6.24x speedup over cuSPARSE across SuiteSparse matrices, while BCSR yields a combined 2.66x end-to-end speedup on Qwen2.5-7B prefill at 90% block sparsity with 64K tokens relative to cuDNN and cuBLAS.
What carries the argument
Warp-specialized producer-consumer pipeline that overlaps TMA data transfers with WGMMA computation, using BCSR format for structured sparsity and WCSR format for windowed irregular sparsity with cross-block row splitting.
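To make the pipeline shape concrete, the sketch below shows warp-specialized producer-consumer staging over one BCSR block row in CUDA C++. It is a hedged illustration, not the paper's kernel: the paper uses TMA bulk copies and WGMMA on Hopper, which this sketch approximates with cuda::memcpy_async and a scalar inner product, and the kernel name, tile size, and thread mapping are all invented.

// Hedged CUDA C++ sketch of warp-specialized producer-consumer staging over
// one BCSR block row. TMA is approximated by cuda::memcpy_async and WGMMA by
// a scalar inner product; kernel name, tile size, and mapping are invented.
#include <cooperative_groups.h>
#include <cuda/pipeline>
namespace cg = cooperative_groups;

constexpr int TILE   = 16;  // toy BCSR block edge (assumed)
constexpr int STAGES = 2;   // double-buffered shared-memory stages

__global__ void bcsr_spmm_sketch(const float* A_vals,  // packed TILE*TILE blocks
                                 const int* A_cols,    // block-column indices
                                 const int* A_rowptr,  // block-row pointers
                                 const float* B,       // dense, row-major, ld = n
                                 float* C, int n) {
  __shared__ float sA[STAGES][TILE][TILE];
  __shared__ float sB[STAGES][TILE][TILE];
  __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, STAGES> state;

  auto block = cg::this_thread_block();
  const bool producer = threadIdx.x < 32;   // warp 0 only moves data
  auto pipe = cuda::make_pipeline(block, &state,
      producer ? cuda::pipeline_role::producer : cuda::pipeline_role::consumer);

  const int brow = blockIdx.x;              // one block row per thread block
  const int begin = A_rowptr[brow], end = A_rowptr[brow + 1];

  if (producer) {
    for (int k = begin; k < end; ++k) {
      const int s = (k - begin) % STAGES;
      pipe.producer_acquire();              // wait for a free stage
      for (int r = threadIdx.x; r < TILE; r += 32) {
        cuda::memcpy_async(&sA[s][r][0], A_vals + ((size_t)k * TILE + r) * TILE,
                           sizeof(float) * TILE, pipe);
        cuda::memcpy_async(&sB[s][r][0], B + ((size_t)A_cols[k] * TILE + r) * n,
                           sizeof(float) * TILE, pipe);
      }
      pipe.producer_commit();               // stage k is in flight
    }
  } else {
    const int tid = threadIdx.x - 32;       // consumers: one C element each
    const int r = tid / TILE, c = tid % TILE;
    float acc = 0.f;
    for (int k = begin; k < end; ++k) {
      const int s = (k - begin) % STAGES;
      pipe.consumer_wait();                 // block k is resident
      for (int t = 0; t < TILE; ++t)        // WGMMA stand-in
        acc += sA[s][r][t] * sB[s][t][c];
      pipe.consumer_release();              // hand the stage back
    }
    C[((size_t)brow * TILE + r) * n + c] += acc;  // toy: first TILE columns only
  }
}
// Launch (toy): bcsr_spmm_sketch<<<num_block_rows, 32 + TILE * TILE>>>(...);

The structural point carried by the paper's design is visible here: the producer loop contains no arithmetic and the consumer loop issues no global-memory loads, so staging block k can overlap computation on block k-1.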
If this is right
- Block-sparse LLM inference at high sparsity can run more than twice as fast end-to-end compared with dense library baselines.
- Irregular sparse matrices from standard collections can be multiplied 1.47 times faster than the previous best kernel and over six times faster than cuSPARSE.
- Asynchronous memory features become a primary optimization target for sparse linear-algebra kernels on current GPUs.
- Load balancing via window splitting across thread blocks improves throughput on non-uniform sparse data (a partitioning sketch follows this list).
- New sparse storage formats aligned with hardware async primitives enable better overlap of data movement and computation.
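As flagged in the load-balancing point above, a host-side partitioner for window splitting might look like the following. This is a guess at the mechanism, not code from the paper; the work-item layout, the per-block nonzero budget, and all names are assumptions.

// Hypothetical host-side partitioner: windows whose nonzero count exceeds a
// per-thread-block budget are split into several work items of bounded size,
// so the launch grid receives roughly equal work per block.
#include <algorithm>
#include <cstdint>
#include <vector>

struct WorkItem {              // one thread block processes one item
  int32_t window;              // row-window index
  int32_t nnz_begin, nnz_end;  // half-open nonzero slice within that window
};

std::vector<WorkItem> split_windows(const std::vector<int32_t>& win_ptr,
                                    int32_t max_nnz_per_block) {
  std::vector<WorkItem> items;
  for (int32_t w = 0; w + 1 < (int32_t)win_ptr.size(); ++w)
    for (int32_t b = win_ptr[w]; b < win_ptr[w + 1]; b += max_nnz_per_block)
      items.push_back({w, b, std::min(b + max_nnz_per_block, win_ptr[w + 1])});
  return items;                // grid size = items.size()
}

Thread blocks that share a split window must then combine partial results, for example with atomic adds into C, which is the usual cost of cross-block row splitting.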
Where Pith is reading between the lines
- Similar producer-consumer pipelines could accelerate other sparse kernels such as SpMV or sparse convolutions on the same hardware.
- Adoption of these kernels might lower overall energy use in data centers that run sparse machine-learning workloads.
- Future GPU designs may benefit from exposing even more flexible asynchronous primitives once their value for sparse codes is shown.
- Real-world models may increasingly adopt block or window sparsity patterns to exploit these performance gains.
Load-bearing premise
That practical sparsity patterns will match the block-structured or windowed irregular forms assumed by BCSR and WCSR, and that the asynchronous overlap adds negligible overhead on target hardware.
What would settle it
Measuring performance on a collection of matrices or model weights whose nonzero pattern is uniformly random at the same density, with no block or window structure, to determine whether the reported speedups remain or reverse.
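Concretely, such a test could synthesize CSR patterns whose nonzeros are placed independently and uniformly at a matched density, destroying any block or window structure. A minimal host-side sketch, with all names invented:

// Hypothetical generator for the falsification test: per-row nonzero counts
// are Binomial(cols, density) and column positions are a uniform sample, which
// is exactly an i.i.d. Bernoulli(density) pattern with no block structure.
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <random>
#include <vector>

struct Csr { std::vector<int32_t> rowptr, col; };   // pattern only

Csr random_pattern(int rows, int cols, double density, uint64_t seed) {
  std::mt19937_64 rng(seed);
  std::binomial_distribution<int> per_row(cols, density);
  std::vector<int32_t> universe(cols);
  for (int c = 0; c < cols; ++c) universe[c] = c;
  Csr m;
  m.rowptr.push_back(0);
  for (int r = 0; r < rows; ++r) {
    int k = per_row(rng);
    // std::sample preserves order, so columns come out sorted as CSR needs.
    std::sample(universe.begin(), universe.end(),
                std::back_inserter(m.col), k, rng);
    m.rowptr.push_back((int32_t)m.col.size());
  }
  return m;
}

Running the WCSR kernel and the baselines over such matrices at the paper's densities would show whether the reported gap survives without exploitable structure.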
Original abstract
Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel across scientific computing and machine learning. While prior work accelerates SpMM using Tensor Cores, no existing sparse kernel exploits the asynchronous features of modern GPU architectures, such as NVIDIA's Tensor Memory Accelerator (TMA) and warp specialization. This work systematically studies how these features impact SpMM performance and introduces two co-designed kernels. For structured sparsity, we optimize a warp-specialized producer-consumer pipeline overlapping TMA data transfer with WGMMA computation using Block Compressed Sparse Row (BCSR) format. For irregular sparsity, we design a Window Compressed Sparse Row (WCSR) kernel that loads the sparse operand via TMA and splits large row-windows across thread blocks for load balancing. Our WCSR kernel outperforms all prior SpMM kernels on SuiteSparse matrices (1.47x over AccSpMM, 6.24x over cuSPARSE). Our BCSR kernel achieves a combined 2.66x end-to-end speedup on Qwen2.5-7B prefill at 90% block sparsity with 64K tokens over cuDNN/cuBLAS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AsyncSparse, a set of SpMM kernels that exploit NVIDIA GPU asynchronous features (TMA transfers overlapped with WGMMA via warp specialization). For block-structured sparsity it uses a BCSR format with a producer-consumer pipeline; for irregular sparsity it uses a WCSR format that splits row windows across thread blocks. On SuiteSparse matrices the WCSR kernel is reported to deliver 1.47× over AccSpMM and 6.24× over cuSPARSE; the BCSR kernel yields a 2.66× end-to-end speedup on Qwen2.5-7B prefill at 90% block sparsity with 64K tokens.
Significance. If the performance claims are substantiated, the work provides concrete evidence that modern asynchronous GPU primitives can be profitably co-designed with sparse formats for both regular and irregular SpMM, which is relevant to large-scale ML inference and scientific computing workloads. The explicit use of TMA-WGMMA overlap and warp specialization is a timely contribution given the evolution of Hopper and later architectures.
Major comments (3)
- [Evaluation / Kernel Design sections] The central claim attributes the reported speedups primarily to the exploitation of asynchronous TMA-WGMMA overlap and warp specialization. However, both kernels also introduce new compressed formats (BCSR and WCSR). No ablation is presented that holds the sparse format fixed while disabling the asynchronous components (e.g., replacing TMA with synchronous loads or removing producer-consumer pipelining). This omission makes it impossible to determine how much of the 1.47×/6.24× and 2.66× gains are due to the async features versus the format changes themselves.
- [Abstract and Results] The abstract and results sections state concrete speedups (1.47×, 6.24×, 2.66×) but provide no details on measurement methodology, number of runs, variance, hardware configuration, or full baseline library versions and compilation flags. Without these, the reproducibility and robustness of the performance numbers cannot be assessed.
- [End-to-end evaluation] The BCSR kernel's 2.66× end-to-end claim is conditioned on 90% block sparsity with 64K tokens. The paper does not discuss how this sparsity level is obtained in practice for Qwen2.5-7B or whether the reported gains degrade under more realistic or lower sparsity patterns.
Minor comments (2)
- [Introduction / Kernel Design] Notation for BCSR and WCSR should be introduced with a small diagram or pseudocode in the first section where they appear, rather than only in the kernel description.
- [Experimental setup] The paper should cite the exact cuSPARSE, cuDNN, and cuBLAS versions used for all baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below and commit to revisions that will strengthen the manuscript's claims and reproducibility.
Point-by-point responses
Referee: [Evaluation / Kernel Design sections] The central claim attributes the reported speedups primarily to the exploitation of asynchronous TMA-WGMMA overlap and warp specialization. However, both kernels also introduce new compressed formats (BCSR and WCSR). No ablation is presented that holds the sparse format fixed while disabling the asynchronous components (e.g., replacing TMA with synchronous loads or removing producer-consumer pipelining). This omission makes it impossible to determine how much of the 1.47×/6.24× and 2.66× gains are due to the async features versus the format changes themselves.
Authors: We agree that an ablation isolating the asynchronous primitives from the format changes would provide stronger evidence. The BCSR and WCSR formats were co-designed specifically to enable efficient TMA usage and warp-specialized pipelining. In the revision we will add an ablation study that keeps the sparse formats fixed while replacing TMA loads with synchronous equivalents and disabling the producer-consumer pipeline, allowing direct quantification of the async contribution on the same SuiteSparse and model workloads. revision: yes
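For illustration, the synchronous comparator that this ablation calls for could keep the tiles and thread mapping of the pipelined sketch given earlier on this page while replacing the pipeline with plain loads and barriers. A hedged sketch with invented names and sizes, not the authors' code:

// Hedged sketch of the synchronous comparator: identical BCSR tiling to the
// pipelined sketch, but ordinary loads plus barriers, so staging block k
// serializes with computing block k-1.
constexpr int TILE_S = 16;     // toy block edge, matching the earlier sketch

__global__ void bcsr_spmm_sync(const float* A_vals, const int* A_cols,
                               const int* A_rowptr, const float* B,
                               float* C, int n) {
  __shared__ float sA[TILE_S][TILE_S], sB[TILE_S][TILE_S];
  const int brow = blockIdx.x;
  const int r = threadIdx.x / TILE_S, c = threadIdx.x % TILE_S;
  float acc = 0.f;
  for (int k = A_rowptr[brow]; k < A_rowptr[brow + 1]; ++k) {
    // Every thread helps stage; there is no producer/consumer split.
    sA[r][c] = A_vals[((size_t)k * TILE_S + r) * TILE_S + c];
    sB[r][c] = B[((size_t)A_cols[k] * TILE_S + r) * n + c];
    __syncthreads();                      // tiles resident before any math
    for (int t = 0; t < TILE_S; ++t)
      acc += sA[r][t] * sB[t][c];
    __syncthreads();                      // done reading before next overwrite
  }
  C[((size_t)brow * TILE_S + r) * n + c] += acc;  // toy: first TILE_S columns
}
// Launch with TILE_S * TILE_S threads per block, one block row per block.

The gap between this kernel and the pipelined one, at a fixed format and tiling, is what isolates the asynchronous contribution.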
Referee: [Abstract and Results] The abstract and results sections state concrete speedups (1.47×, 6.24×, 2.66×) but provide no details on measurement methodology, number of runs, variance, hardware configuration, or full baseline library versions and compilation flags. Without these, the reproducibility and robustness of the performance numbers cannot be assessed.
Authors: We acknowledge that the current manuscript lacks sufficient methodological detail. The revised version will include an expanded evaluation subsection that reports: the exact GPU (NVIDIA H100), CUDA 12.4, number of runs (median of 10), standard deviation across runs, precise library versions (cuSPARSE 12.4, cuDNN 9.0, cuBLAS), and compilation flags (-O3 with sm_90a). Kernel timings will be explicitly defined as device-side execution time. revision: yes
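A device-side timing harness matching that description might look like the sketch below: CUDA events around each launch, warm-up runs excluded, median of N samples reported. Generic code, not the authors' harness.

// Generic CUDA event timing harness: warm-up, then N timed launches, report
// the median of device-side kernel times. Sketch only; not the paper's code.
#include <algorithm>
#include <cuda_runtime.h>
#include <vector>

template <typename Launch>
float median_kernel_ms(Launch launch, int warmup = 3, int runs = 10) {
  cudaEvent_t beg, end;
  cudaEventCreate(&beg);
  cudaEventCreate(&end);
  for (int i = 0; i < warmup; ++i) launch();      // exclude cold-start effects
  std::vector<float> ms(runs);
  for (int i = 0; i < runs; ++i) {
    cudaEventRecord(beg);
    launch();                                     // the kernel under test
    cudaEventRecord(end);
    cudaEventSynchronize(end);                    // wait for device completion
    cudaEventElapsedTime(&ms[i], beg, end);       // device-side elapsed time
  }
  cudaEventDestroy(beg);
  cudaEventDestroy(end);
  std::sort(ms.begin(), ms.end());
  return ms[runs / 2];                            // median of `runs` samples
}
// Usage: float t = median_kernel_ms([&]{ my_kernel<<<grid, block>>>(args); });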
Referee: [End-to-end evaluation] The BCSR kernel's 2.66× end-to-end claim is conditioned on 90% block sparsity with 64K tokens. The paper does not discuss how this sparsity level is obtained in practice for Qwen2.5-7B or whether the reported gains degrade under more realistic or lower sparsity patterns.
Authors: The 90% block sparsity is produced by applying block-wise structured pruning to the Qwen2.5-7B weights. We will revise the end-to-end section to describe the pruning procedure and add results at 50% and 70% block sparsity (same 64K-token prefill) to illustrate how speedups vary with sparsity level, thereby addressing robustness under more realistic patterns. revision: yes
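Since the pruning criterion is not specified in the abstract, here is one plausible form of block-wise structured pruning: rank B x B blocks by Frobenius norm and zero the weakest fraction. The criterion, function name, and layout are assumptions for illustration only.

// Hypothetical block-wise magnitude pruning at a target block sparsity.
// W is a row-major rows x cols weight matrix; the (sparsity)-fraction of
// B x B blocks with the smallest Frobenius norm is zeroed. Assumes B divides
// both dimensions.
#include <algorithm>
#include <utility>
#include <vector>

void block_prune(std::vector<float>& W, int rows, int cols, int B,
                 double sparsity) {
  const int br = rows / B, bc = cols / B;
  std::vector<std::pair<float, int>> score(br * bc);
  for (int i = 0; i < br; ++i)
    for (int j = 0; j < bc; ++j) {
      float s = 0.f;                              // squared Frobenius norm
      for (int r = 0; r < B; ++r)
        for (int c = 0; c < B; ++c) {
          float v = W[(size_t)(i * B + r) * cols + j * B + c];
          s += v * v;
        }
      score[i * bc + j] = {s, i * bc + j};
    }
  std::sort(score.begin(), score.end());          // weakest blocks first
  const int kill = (int)(sparsity * br * bc);
  for (int k = 0; k < kill; ++k) {
    int i = score[k].second / bc, j = score[k].second % bc;
    for (int r = 0; r < B; ++r)
      for (int c = 0; c < B; ++c)
        W[(size_t)(i * B + r) * cols + j * B + c] = 0.f;
  }
}

Any such criterion yields the block-structured zero pattern that BCSR stores; the promised 50% and 70% settings correspond to sparsity = 0.5 and 0.7 here.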
Circularity Check
No circularity: the empirical kernel benchmarks are grounded in comparisons against external baselines
Full rationale
The paper reports measured speedups from two new SpMM kernels (BCSR for structured block sparsity and WCSR for irregular cases) that exploit TMA/WGMMA asynchronous overlap on modern GPUs. All central claims are direct empirical comparisons to external libraries (cuSPARSE, cuDNN, AccSpMM) on SuiteSparse matrices and an end-to-end LLM prefill workload. No equations, fitted parameters, or predictions are presented that reduce by construction to the paper's own inputs or self-citations. The derivation chain consists of implementation choices and benchmarking, which remain falsifiable against independent hardware runs and do not invoke load-bearing self-citations or ansatzes.