pith. machine review for the scientific record.

arxiv: 2604.16864 · v1 · submitted 2026-04-18 · 💻 cs.DC · cs.AR

Recognition: unknown

HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 07:09 UTC · model grok-4.3

classification 💻 cs.DC · cs.AR
keywords: KV cache compression · semi-structured sparsity · hierarchical pruning · LLM attention acceleration · sparse tensor cores · prefill/decode optimization · long-context inference

The pith

HieraSparse applies hierarchical semi-structured sparsity to KV caches for faster attention and better compression in long-context LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents HieraSparse as a framework that compresses the key-value cache using a layered semi-structured sparsity pattern and supplies custom kernels to run the resulting sparse attention on GPU tensor cores. The design supports different sparsity levels for a quality-efficiency tradeoff and applies to both the initial input processing stage (prefill) and the token-by-token generation stage (decode). At the same sparsity, it reports 1.2 times higher compression and 4.57 times faster attention than the prior best unstructured-sparsity decode method, plus up to 1.85 times prefill speedup when the same pruning is used early in the sequence. With a basic magnitude-based pruning rule the method still delivers 1.37 times prefill and 1.77 times decode speedups while keeping generation quality close to the unpruned baseline. A reader cares because long-context models are limited by KV-cache memory and attention compute, and this approach turns sparsity into concrete wall-clock and memory savings on existing hardware.
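As a rough sense of scale, the sketch below sizes a dense KV cache and shows how a compression ratio translates into memory headroom. The model shape (32 layers, 8 KV heads, head dimension 128, fp16) and the 1.5× ratio are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope KV-cache sizing; a minimal sketch. The model shape below
# is an assumed Llama-3-8B-like configuration, not one taken from the paper.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Factor of 2 covers keys and values; bytes_per_elem=2 assumes fp16 storage.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

dense = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=128_000, batch=1)
print(f"dense KV cache at 128k tokens: {dense / 2**30:.1f} GiB")   # ~15.6 GiB

# Any compression ratio translates directly into headroom for longer contexts
# or larger batches; 1.5x here is an arbitrary illustrative value.
print(f"at 1.5x compression:           {dense / 1.5 / 2**30:.1f} GiB")
```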

Core claim

HieraSparse is a hierarchical KV cache compression framework that uses semi-structured sparsity patterns together with GPU sparse-tensor-core kernels to accelerate attention for both prefill and decode. At equivalent sparsity it obtains a 1.2 times higher KV compression ratio and 4.57 times attention speedup over the previous state-of-the-art unstructured-sparsity decode method. The same semi-structured pruning extends to the prefill stage, yielding up to 1.85 times attention speedup at the highest sparsity level. When magnitude-based pruning is applied, the framework achieves 1.37 times prefill speedup and 1.77 times decode speedup without significant quality loss.

What carries the argument

The hierarchical semi-structured sparsity pattern applied to the KV cache, which organizes pruning across multiple levels to enable flexible sparsity-quality trade-offs and direct mapping onto sparse tensor core operations.
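The pattern is easiest to see in miniature. The sketch below applies a 2:4 magnitude rule to a toy key block, the kind of semi-structured layout GPU sparse tensor cores consume. The block shape and the 2:4 choice are assumptions for illustration; the paper's hierarchical scheme layers a coarser block-level selection on top of a pattern of this kind rather than using this exact rule.

```python
# A minimal sketch of 2:4 semi-structured magnitude pruning of one KV block.
# Shapes and the 2:4 choice are illustrative assumptions.
import torch

def prune_2_of_4(block: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Keep the 2 largest-magnitude values in every group of 4 along the last dim."""
    *lead, d = block.shape
    assert d % 4 == 0
    groups = block.reshape(*lead, d // 4, 4)
    # Indices of the top-2 magnitudes per group become the sparsity metadata.
    top2 = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, top2, True)
    pruned = torch.where(mask, groups, torch.zeros_like(groups)).reshape(block.shape)
    return pruned, mask.reshape(block.shape)

k_block = torch.randn(16, 128)          # toy key block: 16 tokens x head_dim 128
k_sparse, k_mask = prune_2_of_4(k_block)
print(k_mask.float().mean().item())     # exactly 0.5 density by construction
```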

If this is right

  • Higher KV cache compression reduces memory footprint and permits longer contexts or larger batches on the same hardware.
  • The reported speedups apply to both prefill and decode phases, shortening end-to-end latency for interactive use.
  • Simple magnitude-based pruning already yields practical speedups while keeping quality drop small.
  • Custom kernels convert the sparsity directly into measurable gains on current sparse-tensor-core GPUs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The hierarchical pattern could be combined with other cache-reduction techniques such as token eviction or quantization to compound memory savings.
  • The same semi-structured layout might improve utilization on future GPU generations that add more native sparse support.
  • Testing across a wider range of model sizes and context lengths would show whether the speedup ratios remain stable.
  • Adapting the pruning criterion to task-specific signals instead of magnitude alone could further reduce quality impact at high sparsity.

Load-bearing premise

That the chosen semi-structured sparsity patterns and hierarchical compression preserve the essential information in the attention computation so that generation quality remains acceptable.
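This premise can be probed cheaply, if crudely. The sketch below compares one attention output computed on a dense KV cache against the same computation on a magnitude-pruned cache and reports the relative error; the shapes, the global pruning rule, and the error metric are stand-ins, not the paper's protocol.

```python
# A minimal check of the premise, not the paper's evaluation: how much does one
# attention output move when the KV cache is magnitude-pruned to 50% density?
import torch

torch.manual_seed(0)
seq, d = 1024, 128
q = torch.randn(1, d)
k = torch.randn(seq, d)
v = torch.randn(seq, d)

def attn(q, k, v):
    scores = (q @ k.T) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def magnitude_prune(x, keep=0.5):
    # Zero out the smallest-magnitude entries globally; a crude stand-in for
    # the paper's structured, block-wise rule.
    thresh = x.abs().flatten().kthvalue(int(x.numel() * (1 - keep))).values
    return torch.where(x.abs() >= thresh, x, torch.zeros_like(x))

dense_out = attn(q, k, v)
sparse_out = attn(q, magnitude_prune(k), magnitude_prune(v))
rel_err = (dense_out - sparse_out).norm() / dense_out.norm()
print(f"relative output error at 50% sparsity: {rel_err.item():.3f}")
```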

What would settle it

A side-by-side run of long-context generation benchmarks at the reported sparsity levels that shows substantially higher perplexity or lower downstream task accuracy for HieraSparse than for the unstructured-sparsity baseline.

Figures

Figures reproduced from arXiv: 2604.16864 by Chen Wang, Haoxuan Wang.

Figure 1. The latency breakdown of the prefill and decode phases.

Figure 3. The overall workflow of HieraSparse. Given the KV cache that is divided into sparse and dense regions, the caches are further split into blocks. Dense blocks are directly stored in the dense cache memory pool; sparse blocks are further pruned and compressed into non-zero data and metadata, then stored in the respective memory pools. A block index mapping is created accordingly.

Figure 4. The performance gain of different optimizations for …

Figure 5. The illustration of Pᵀ fragment re-layout. The source layout consists of multiple 16×8 D-matrix atoms, and the destination layout consists of multiple 32×8 B-matrix atoms, both in row-major. They are both partitioned into 8×8 atoms, and multiple movmatrix instructions are issued to perform the re-layout without shared memory access.

Figure 6. The quality evaluation of HieraSparse when extended to the prefill stage.

Figure 7. Comparison of attention kernel latency, including …

Figure 8. The efficiency evaluation of HieraSparse under different sparsity.

Figure 9. Per-layer latency breakdown for prefill and decode.
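Read as a data structure, the Figure 3 workflow amounts to a block index that routes each KV block either to a dense memory pool or, after pruning, to a sparse pool holding non-zero data plus metadata. The sketch below is a bookkeeping illustration under assumed details (block size, the dense/sparse split rule, how metadata is packed); the paper's kernels operate on GPU memory pools, not Python lists.

```python
# A bookkeeping reading of the Figure 3 workflow; block size, the dense-block
# choice, and the per-block pruning rule are all assumptions for illustration.
import torch

BLOCK = 64  # tokens per block (assumed)

def compress_kv(cache: torch.Tensor, dense_block_ids: set[int]):
    dense_pool, sparse_pool, block_index = [], [], []
    for i, block in enumerate(cache.split(BLOCK, dim=0)):
        if i in dense_block_ids:
            # Dense blocks go straight to the dense memory pool.
            block_index.append(("dense", len(dense_pool)))
            dense_pool.append(block)
        else:
            # Sparse blocks keep only non-zero data plus positional metadata.
            mask = block.abs() >= block.abs().median()
            block_index.append(("sparse", len(sparse_pool)))
            sparse_pool.append((block[mask], mask))
    return dense_pool, sparse_pool, block_index

cache = torch.randn(512, 128)          # toy per-head key cache: 512 tokens
dense_ids = {0, 7}                     # e.g. attention-sink and most recent blocks
_, _, block_index = compress_kv(cache, dense_ids)
print(block_index[:4])                 # block index mapping for the first blocks
```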
Original abstract

The deployment of long-context Large Language Models (LLMs) poses significant challenges due to the intense computational cost of self-attention and the substantial memory overhead of the Key-Value Cache (KV Cache). In this paper, we introduce HieraSparse, a hierarchical KV Cache compression framework with acceleration kernels that leverage GPU sparse tensor cores to speed up semi-structured KV Cache attention for both the prefill and decode phases. With the hierarchical design, our method allows for a flexible quality-sparsity trade-off and successfully converts sparsity into efficiency. Compared to the state-of-the-art decode method that utilizes unstructured sparsity, HieraSparse achieves $\mathbf{1.2\times}$ KV compression ratio and $\mathbf{4.57\times}$ attention speedup at the same sparsity level. Furthermore, we extended the semi-structured KV Cache pruning to the prefill stage, which demonstrated up to $\mathbf{1.85\times}$ attention speedup at the highest sparsity. Lastly, we evaluate the generation quality of HieraSparse with a simple magnitude-based pruning method, and the results show that $\mathbf{1.37\times}$ prefill speedup and $\mathbf{1.77\times}$ decode speedup can be achieved without significant quality drop. The codebase can be found at https://github.com/psl-ntu/HieraSparse.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces HieraSparse, a hierarchical semi-structured sparse KV Cache compression framework for long-context LLMs, accompanied by custom GPU kernels that exploit sparse tensor cores to accelerate attention computation in both prefill and decode phases. It reports concrete empirical gains over prior unstructured-sparsity decode methods (1.2× KV compression and 4.57× attention speedup at matched sparsity), extends semi-structured pruning to prefill (up to 1.85× speedup), and shows that simple magnitude-based pruning yields 1.37× prefill and 1.77× decode speedups with acceptable quality retention. A public codebase is provided.

Significance. If the performance and quality results hold under rigorous scrutiny, the work offers a practical route to convert sparsity into measurable inference efficiency and memory reduction for long-context models. The public codebase is a clear strength that directly supports verification of kernel correctness, overheads, and benchmark numbers, increasing the likelihood of adoption and follow-on research in sparse attention systems.

major comments (3)
  1. [§4] §4 (Experimental results): The headline speedups (4.57× decode, 1.85× prefill, 1.37×/1.77× with magnitude pruning) are reported as single-point measurements without error bars, number of runs, hardware details, or statistical significance tests. Because these numbers are the central empirical claim, the absence of variability assessment undermines confidence in the reported gains.
  2. [§3] §3 (Hierarchical design and kernel implementation): The paper provides no ablation isolating the contribution of the hierarchical levels versus the semi-structured pattern itself, nor any analysis of how the chosen sparsity masks interact with the attention matrix multiplication. This is load-bearing for the claim that quality is preserved while speedups are realized.
  3. [Table 2] Table 2 / Figure 4 (baseline comparisons): The evaluation compares only against one unstructured-sparsity decode method; missing are direct head-to-head results against other recent semi-structured or hierarchical KV pruning techniques at identical sparsity ratios and model scales, which is required to substantiate the “state-of-the-art” claim.
minor comments (2)
  1. [§3.2] Notation for the hierarchical block sizes and pruning thresholds is introduced without a compact mathematical definition; a single equation summarizing the mask construction would improve clarity.
  2. [Abstract] The abstract states “without significant quality drop” but the main text does not define the threshold used for this judgment (e.g., perplexity delta, downstream task accuracy).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the experimental reporting can be strengthened with variability measures, that additional ablations would better isolate design contributions, and that expanded baselines would provide fuller context. We will incorporate these changes in a revised manuscript. Our point-by-point responses follow.

Point-by-point responses
  1. Referee: §4 (Experimental results): The headline speedups (4.57× decode, 1.85× prefill, 1.37×/1.77× with magnitude pruning) are reported as single-point measurements without error bars, number of runs, hardware details, or statistical significance tests. Because these numbers are the central empirical claim, the absence of variability assessment undermines confidence in the reported gains.

    Authors: We agree that single-point measurements limit confidence in the reported gains. In the revision we will rerun all timing experiments on the same hardware (NVIDIA H100 GPUs) for a minimum of five independent trials, report means with standard-deviation error bars, and include paired t-tests to assess statistical significance of the speedups. The public codebase already exposes the exact benchmark scripts, so these additional runs can be reproduced directly. revision: yes

  2. Referee: §3 (Hierarchical design and kernel implementation): The paper provides no ablation isolating the contribution of the hierarchical levels versus the semi-structured pattern itself, nor any analysis of how the chosen sparsity masks interact with the attention matrix multiplication. This is load-bearing for the claim that quality is preserved while speedups are realized.

    Authors: We acknowledge the absence of these isolating experiments. We will add a dedicated ablation subsection that compares the full hierarchical mask against a non-hierarchical (flat) semi-structured mask at identical sparsity ratios, measuring both quality (perplexity, downstream task scores) and kernel throughput. We will also include a brief analysis, supported by a new figure, of how the hierarchical block structure alters the sparsity pattern seen by the sparse tensor-core matmul and why this preserves attention quality. All new results will be generated from the released implementation. revision: yes

  3. Referee: Table 2 / Figure 4 (baseline comparisons): The evaluation compares only against one unstructured-sparsity decode method; missing are direct head-to-head results against other recent semi-structured or hierarchical KV pruning techniques at identical sparsity ratios and model scales, which is required to substantiate the “state-of-the-art” claim.

    Authors: The primary baseline was chosen because it is the strongest published unstructured-sparsity decode method operating at the same sparsity level and KV-cache setting. To address the gap, we will extend Table 2 and Figure 4 with head-to-head numbers against the most relevant recent semi-structured and hierarchical KV-pruning works, using identical sparsity ratios and the same model scales wherever the original implementations or sufficient details are available. Where direct reproduction is not feasible we will clearly state the differences in experimental conditions and sparsity definitions. revision: yes
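Response 1's commitment to repeated trials and significance testing maps onto a small amount of harness code. The sketch below shows one way to report means, standard deviations, and a paired t-test over repeated timings; the two timing stubs are hypothetical placeholders, not the benchmark scripts in the released codebase.

```python
# A sketch of the variability reporting committed to in response 1: repeated
# kernel timings, mean +/- standard deviation, and a paired t-test.
import random
import statistics
from scipy.stats import ttest_rel

def run_baseline_attention():       # placeholder: one trial's latency in ms (assumed)
    return random.gauss(10.0, 0.2)

def run_hierasparse_attention():    # placeholder: one trial's latency in ms (assumed)
    return random.gauss(2.2, 0.1)

def time_kernel(run_once, trials=5):
    samples = [run_once() for _ in range(trials)]
    return samples, statistics.mean(samples), statistics.stdev(samples)

base_ms, base_mean, base_std = time_kernel(run_baseline_attention)
hiera_ms, hiera_mean, hiera_std = time_kernel(run_hierasparse_attention)

result = ttest_rel(base_ms, hiera_ms)
print(f"baseline    {base_mean:.2f} ± {base_std:.2f} ms")
print(f"HieraSparse {hiera_mean:.2f} ± {hiera_std:.2f} ms")
print(f"speedup {base_mean / hiera_mean:.2f}x, paired t-test p = {result.pvalue:.3g}")
```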

Circularity Check

0 steps flagged

No significant circularity; purely empirical systems contribution

Full rationale

The paper introduces HieraSparse as a hierarchical KV cache compression method using semi-structured sparsity and custom sparse-tensor-core kernels for prefill and decode phases. All reported results (1.2× compression, speedups up to 4.57×, quality under magnitude pruning) are direct empirical measurements of runtime and accuracy on LLMs. No equations, derivations, fitted parameters, or predictions appear in the provided text. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The public codebase further enables independent verification of kernels and benchmarks, confirming the work is checked against external measurements rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities are introduced; the work is an engineering optimization relying on standard attention mechanics and GPU hardware capabilities.

pith-pipeline@v0.9.0 · 5526 in / 1260 out tokens · 75650 ms · 2026-05-10T07:09:59.541147+00:00 · methodology

