pith. machine review for the scientific record.

arxiv: 2604.16864 · v1 · submitted 2026-04-18 · 💻 cs.DC · cs.AR

Recognition: unknown

HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 07:09 UTC · model grok-4.3

classification 💻 cs.DC · cs.AR
keywords: KV cache compression · semi-structured sparsity · hierarchical pruning · LLM attention acceleration · sparse tensor cores · prefill/decode optimization · long-context inference

The pith

HieraSparse applies hierarchical semi-structured sparsity to KV caches for faster attention and better compression in long-context LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents HieraSparse as a framework that compresses the key-value cache using a layered semi-structured sparsity pattern and supplies custom kernels to run the resulting sparse attention on GPU tensor cores. The design supports different sparsity levels for a quality-efficiency tradeoff and applies to both the initial input processing stage (prefill) and the token-by-token generation stage (decode). At the same sparsity, it reports 1.2 times higher compression and 4.57 times faster attention than the prior best unstructured-sparsity decode method, plus up to 1.85 times prefill speedup when the same pruning is used early in the sequence. With a basic magnitude-based pruning rule the method still delivers 1.37 times prefill and 1.77 times decode speedups while keeping generation quality close to the unpruned baseline. A reader cares because long-context models are limited by KV-cache memory and attention compute, and this approach turns sparsity into concrete wall-clock and memory savings on existing hardware.
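As a rough sense of scale, the sketch below sizes a dense KV cache and shows how a compression ratio translates into memory headroom. The model shape (32 layers, 8 KV heads, head dimension 128, fp16) and the 1.5× ratio are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope KV-cache sizing; a minimal sketch. The model shape below
# is an assumed Llama-3-8B-like configuration, not one taken from the paper.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Factor of 2 covers keys and values; bytes_per_elem=2 assumes fp16 storage.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

dense = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=128_000, batch=1)
print(f"dense KV cache at 128k tokens: {dense / 2**30:.1f} GiB")   # ~15.6 GiB

# Any compression ratio translates directly into headroom for longer contexts
# or larger batches; 1.5x here is an arbitrary illustrative value.
print(f"at 1.5x compression:           {dense / 1.5 / 2**30:.1f} GiB")
```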

Core claim

HieraSparse is a hierarchical KV cache compression framework that uses semi-structured sparsity patterns together with GPU sparse-tensor-core kernels to accelerate attention for both prefill and decode. At equivalent sparsity it obtains a 1.2 times higher KV compression ratio and 4.57 times attention speedup over the previous state-of-the-art unstructured-sparsity decode method. The same semi-structured pruning extends to the prefill stage, yielding up to 1.85 times attention speedup at the highest sparsity level. When magnitude-based pruning is applied, the framework achieves 1.37 times prefill speedup and 1.77 times decode speedup without significant quality loss.

What carries the argument

The hierarchical semi-structured sparsity pattern applied to the KV cache, which organizes pruning across multiple levels to enable flexible sparsity-quality trade-offs and direct mapping onto sparse tensor core operations.
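The pattern is easiest to see in miniature. The sketch below applies a 2:4 magnitude rule to a toy key block, the kind of semi-structured layout GPU sparse tensor cores consume. The block shape and the 2:4 choice are assumptions for illustration; the paper's hierarchical scheme layers a coarser block-level selection on top of a pattern of this kind rather than using this exact rule.

```python
# A minimal sketch of 2:4 semi-structured magnitude pruning of one KV block.
# Shapes and the 2:4 choice are illustrative assumptions.
import torch

def prune_2_of_4(block: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Keep the 2 largest-magnitude values in every group of 4 along the last dim."""
    *lead, d = block.shape
    assert d % 4 == 0
    groups = block.reshape(*lead, d // 4, 4)
    # Indices of the top-2 magnitudes per group become the sparsity metadata.
    top2 = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, top2, True)
    pruned = torch.where(mask, groups, torch.zeros_like(groups)).reshape(block.shape)
    return pruned, mask.reshape(block.shape)

k_block = torch.randn(16, 128)          # toy key block: 16 tokens x head_dim 128
k_sparse, k_mask = prune_2_of_4(k_block)
print(k_mask.float().mean().item())     # exactly 0.5 density by construction
```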

If this is right

  • Higher KV cache compression reduces memory footprint and permits longer contexts or larger batches on the same hardware.
  • The reported speedups apply to both prefill and decode phases, shortening end-to-end latency for interactive use.
  • Simple magnitude-based pruning already yields practical speedups while keeping quality drop small.
  • Custom kernels convert the sparsity directly into measurable gains on current sparse-tensor-core GPUs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The hierarchical pattern could be combined with other cache-reduction techniques such as token eviction or quantization to compound memory savings.
  • The same semi-structured layout might improve utilization on future GPU generations that add more native sparse support.
  • Testing across a wider range of model sizes and context lengths would show whether the speedup ratios remain stable.
  • Adapting the pruning criterion to task-specific signals instead of magnitude alone could further reduce quality impact at high sparsity.

Load-bearing premise

That the chosen semi-structured sparsity patterns and hierarchical compression preserve the essential information in the attention computation so that generation quality remains acceptable.
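This premise can be probed cheaply, if crudely. The sketch below compares one attention output computed on a dense KV cache against the same computation on a magnitude-pruned cache and reports the relative error; the shapes, the global pruning rule, and the error metric are stand-ins, not the paper's protocol.

```python
# A minimal check of the premise, not the paper's evaluation: how much does one
# attention output move when the KV cache is magnitude-pruned to 50% density?
import torch

torch.manual_seed(0)
seq, d = 1024, 128
q = torch.randn(1, d)
k = torch.randn(seq, d)
v = torch.randn(seq, d)

def attn(q, k, v):
    scores = (q @ k.T) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def magnitude_prune(x, keep=0.5):
    # Zero out the smallest-magnitude entries globally; a crude stand-in for
    # the paper's structured, block-wise rule.
    thresh = x.abs().flatten().kthvalue(int(x.numel() * (1 - keep))).values
    return torch.where(x.abs() >= thresh, x, torch.zeros_like(x))

dense_out = attn(q, k, v)
sparse_out = attn(q, magnitude_prune(k), magnitude_prune(v))
rel_err = (dense_out - sparse_out).norm() / dense_out.norm()
print(f"relative output error at 50% sparsity: {rel_err.item():.3f}")
```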

What would settle it

A side-by-side run of long-context generation benchmarks at the reported sparsity levels that shows substantially higher perplexity or lower downstream task accuracy for HieraSparse than for the unstructured-sparsity baseline.

Figures

Figures reproduced from arXiv: 2604.16864 by Chen Wang, Haoxuan Wang.

Figure 1. The latency breakdown of the prefill and decode phases.

Figure 3. The overall workflow of HieraSparse. Given the KV cache that is divided into sparse and dense regions, the caches are further split into blocks. Dense blocks are directly stored in the dense cache memory pool; sparse blocks are further pruned and compressed into non-zero data and metadata, then stored in the respective memory pools. A block index mapping is created accordingly.

Figure 4. The performance gain of different optimizations for …

Figure 5. The illustration of Pᵀ fragment re-layout. The source layout consists of multiple 16×8 D-matrix atoms, and the destination layout consists of multiple 32×8 B-matrix atoms, both in row-major. They are both partitioned into 8×8 atoms, and multiple movmatrix instructions are issued to perform the re-layout without shared memory access.

Figure 6. The quality evaluation of HieraSparse when extended to the prefill stage.

Figure 7. Comparison of attention kernel latency, including …

Figure 8. The efficiency evaluation of HieraSparse under different sparsity.

Figure 9. Per-layer latency breakdown for prefill and decode.
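Read as a data structure, the Figure 3 workflow amounts to a block index that routes each KV block either to a dense memory pool or, after pruning, to a sparse pool holding non-zero data plus metadata. The sketch below is a bookkeeping illustration under assumed details (block size, the dense/sparse split rule, how metadata is packed); the paper's kernels operate on GPU memory pools, not Python lists.

```python
# A bookkeeping reading of the Figure 3 workflow; block size, the dense-block
# choice, and the per-block pruning rule are all assumptions for illustration.
import torch

BLOCK = 64  # tokens per block (assumed)

def compress_kv(cache: torch.Tensor, dense_block_ids: set[int]):
    dense_pool, sparse_pool, block_index = [], [], []
    for i, block in enumerate(cache.split(BLOCK, dim=0)):
        if i in dense_block_ids:
            # Dense blocks go straight to the dense memory pool.
            block_index.append(("dense", len(dense_pool)))
            dense_pool.append(block)
        else:
            # Sparse blocks keep only non-zero data plus positional metadata.
            mask = block.abs() >= block.abs().median()
            block_index.append(("sparse", len(sparse_pool)))
            sparse_pool.append((block[mask], mask))
    return dense_pool, sparse_pool, block_index

cache = torch.randn(512, 128)          # toy per-head key cache: 512 tokens
dense_ids = {0, 7}                     # e.g. attention-sink and most recent blocks
_, _, block_index = compress_kv(cache, dense_ids)
print(block_index[:4])                 # block index mapping for the first blocks
```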
Original abstract

The deployment of long-context Large Language Models (LLMs) poses significant challenges due to the intense computational cost of self-attention and the substantial memory overhead of the Key-Value Cache (KV Cache). In this paper, we introduce HieraSparse, a hierarchical KV Cache compression framework with acceleration kernels that leverage GPU sparse tensor cores to speed up semi-structured KV Cache attention for both the prefill and decode phases. With the hierarchical design, our method allows for a flexible quality-sparsity trade-off and successfully converts sparsity into efficiency. Compared to the state-of-the-art decode method that utilizes unstructured sparsity, HieraSparse achieves $\mathbf{1.2\times}$ KV compression ratio and $\mathbf{4.57\times}$ attention speedup at the same sparsity level. Furthermore, we extended the semi-structured KV Cache pruning to the prefill stage, which demonstrated up to $\mathbf{1.85\times}$ attention speedup at the highest sparsity. Lastly, we evaluate the generation quality of HieraSparse with a simple magnitude-based pruning method, and the results show that $\mathbf{1.37\times}$ prefill speedup and $\mathbf{1.77\times}$ decode speedup can be achieved without significant quality drop. The codebase can be found at https://github.com/psl-ntu/HieraSparse.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces HieraSparse, a hierarchical semi-structured sparse KV Cache compression framework for long-context LLMs, accompanied by custom GPU kernels that exploit sparse tensor cores to accelerate attention computation in both prefill and decode phases. It reports concrete empirical gains over prior unstructured-sparsity decode methods (1.2× KV compression and 4.57× attention speedup at matched sparsity), extends semi-structured pruning to prefill (up to 1.85× speedup), and shows that simple magnitude-based pruning yields 1.37× prefill and 1.77× decode speedups with acceptable quality retention. A public codebase is provided.

Significance. If the performance and quality results hold under rigorous scrutiny, the work offers a practical route to convert sparsity into measurable inference efficiency and memory reduction for long-context models. The public codebase is a clear strength that directly supports verification of kernel correctness, overheads, and benchmark numbers, increasing the likelihood of adoption and follow-on research in sparse attention systems.

major comments (3)
  1. [§4] §4 (Experimental results): The headline speedups (4.57× decode, 1.85× prefill, 1.37×/1.77× with magnitude pruning) are reported as single-point measurements without error bars, number of runs, hardware details, or statistical significance tests. Because these numbers are the central empirical claim, the absence of variability assessment undermines confidence in the reported gains.
  2. [§3] §3 (Hierarchical design and kernel implementation): The paper provides no ablation isolating the contribution of the hierarchical levels versus the semi-structured pattern itself, nor any analysis of how the chosen sparsity masks interact with the attention matrix multiplication. This is load-bearing for the claim that quality is preserved while speedups are realized.
  3. [Table 2] Table 2 / Figure 4 (baseline comparisons): The evaluation compares only against one unstructured-sparsity decode method; missing are direct head-to-head results against other recent semi-structured or hierarchical KV pruning techniques at identical sparsity ratios and model scales, which is required to substantiate the “state-of-the-art” claim.
minor comments (2)
  1. [§3.2] Notation for the hierarchical block sizes and pruning thresholds is introduced without a compact mathematical definition; a single equation summarizing the mask construction would improve clarity.
  2. [Abstract] The abstract states “without significant quality drop” but the main text does not define the threshold used for this judgment (e.g., perplexity delta, downstream task accuracy).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the experimental reporting can be strengthened with variability measures, that additional ablations would better isolate design contributions, and that expanded baselines would provide fuller context. We will incorporate these changes in a revised manuscript. Our point-by-point responses follow.

Point-by-point responses
  1. Referee: §4 (Experimental results): The headline speedups (4.57× decode, 1.85× prefill, 1.37×/1.77× with magnitude pruning) are reported as single-point measurements without error bars, number of runs, hardware details, or statistical significance tests. Because these numbers are the central empirical claim, the absence of variability assessment undermines confidence in the reported gains.

    Authors: We agree that single-point measurements limit confidence in the reported gains. In the revision we will rerun all timing experiments on the same hardware (NVIDIA H100 GPUs) for a minimum of five independent trials, report means with standard-deviation error bars, and include paired t-tests to assess statistical significance of the speedups. The public codebase already exposes the exact benchmark scripts, so these additional runs can be reproduced directly. revision: yes

  2. Referee: §3 (Hierarchical design and kernel implementation): The paper provides no ablation isolating the contribution of the hierarchical levels versus the semi-structured pattern itself, nor any analysis of how the chosen sparsity masks interact with the attention matrix multiplication. This is load-bearing for the claim that quality is preserved while speedups are realized.

    Authors: We acknowledge the absence of these isolating experiments. We will add a dedicated ablation subsection that compares the full hierarchical mask against a non-hierarchical (flat) semi-structured mask at identical sparsity ratios, measuring both quality (perplexity, downstream task scores) and kernel throughput. We will also include a brief analysis, supported by a new figure, of how the hierarchical block structure alters the sparsity pattern seen by the sparse tensor-core matmul and why this preserves attention quality. All new results will be generated from the released implementation. revision: yes

  3. Referee: Table 2 / Figure 4 (baseline comparisons): The evaluation compares only against one unstructured-sparsity decode method; missing are direct head-to-head results against other recent semi-structured or hierarchical KV pruning techniques at identical sparsity ratios and model scales, which is required to substantiate the “state-of-the-art” claim.

    Authors: The primary baseline was chosen because it is the strongest published unstructured-sparsity decode method operating at the same sparsity level and KV-cache setting. To address the gap, we will extend Table 2 and Figure 4 with head-to-head numbers against the most relevant recent semi-structured and hierarchical KV-pruning works, using identical sparsity ratios and the same model scales wherever the original implementations or sufficient details are available. Where direct reproduction is not feasible we will clearly state the differences in experimental conditions and sparsity definitions. revision: yes
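Response 1's commitment to repeated trials and significance testing maps onto a small amount of harness code. The sketch below shows one way to report means, standard deviations, and a paired t-test over repeated timings; the two timing stubs are hypothetical placeholders, not the benchmark scripts in the released codebase.

```python
# A sketch of the variability reporting committed to in response 1: repeated
# kernel timings, mean +/- standard deviation, and a paired t-test.
import random
import statistics
from scipy.stats import ttest_rel

def run_baseline_attention():       # placeholder: one trial's latency in ms (assumed)
    return random.gauss(10.0, 0.2)

def run_hierasparse_attention():    # placeholder: one trial's latency in ms (assumed)
    return random.gauss(2.2, 0.1)

def time_kernel(run_once, trials=5):
    samples = [run_once() for _ in range(trials)]
    return samples, statistics.mean(samples), statistics.stdev(samples)

base_ms, base_mean, base_std = time_kernel(run_baseline_attention)
hiera_ms, hiera_mean, hiera_std = time_kernel(run_hierasparse_attention)

result = ttest_rel(base_ms, hiera_ms)
print(f"baseline    {base_mean:.2f} ± {base_std:.2f} ms")
print(f"HieraSparse {hiera_mean:.2f} ± {hiera_std:.2f} ms")
print(f"speedup {base_mean / hiera_mean:.2f}x, paired t-test p = {result.pvalue:.3g}")
```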

Circularity Check

0 steps flagged

No significant circularity; purely empirical systems contribution

Full rationale

The paper introduces HieraSparse as a hierarchical KV cache compression method using semi-structured sparsity and custom sparse-tensor-core kernels for prefill and decode phases. All reported results (1.2× compression, speedups up to 4.57×, quality under magnitude pruning) are direct empirical measurements of runtime and accuracy on LLMs. No equations, derivations, fitted parameters, or predictions appear in the provided text. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The public codebase further enables independent verification of kernels and benchmarks, confirming the work is checked against external measurements rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities are introduced; the work is an engineering optimization relying on standard attention mechanics and GPU hardware capabilities.

pith-pipeline@v0.9.0 · 5526 in / 1260 out tokens · 75650 ms · 2026-05-10T07:09:59.541147+00:00 · methodology

