Recognition: unknown
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
Pith reviewed 2026-05-10 07:09 UTC · model grok-4.3
The pith
HieraSparse applies hierarchical semi-structured sparsity to KV caches for faster attention and better compression in long-context LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HieraSparse is a hierarchical KV cache compression framework that uses semi-structured sparsity patterns together with GPU sparse-tensor-core kernels to accelerate attention for both prefill and decode. At equivalent sparsity it obtains a 1.2 times higher KV compression ratio and 4.57 times attention speedup over the previous state-of-the-art unstructured-sparsity decode method. The same semi-structured pruning extends to the prefill stage, yielding up to 1.85 times attention speedup at the highest sparsity level. When magnitude-based pruning is applied, the framework achieves 1.37 times prefill speedup and 1.77 times decode speedup without significant quality loss.
What carries the argument
The hierarchical semi-structured sparsity pattern applied to the KV cache, which organizes pruning across multiple levels to enable flexible sparsity-quality trade-offs and direct mapping onto sparse tensor core operations.
If this is right
- Higher KV cache compression reduces memory footprint and permits longer contexts or larger batches on the same hardware.
- The reported speedups apply to both prefill and decode phases, shortening end-to-end latency for interactive use.
- Simple magnitude-based pruning already yields practical speedups while keeping quality drop small.
- Custom kernels convert the sparsity directly into measurable gains on current sparse-tensor-core GPUs.
Where Pith is reading between the lines
- The hierarchical pattern could be combined with other cache-reduction techniques such as token eviction or quantization to compound memory savings.
- The same semi-structured layout might improve utilization on future GPU generations that add more native sparse support.
- Testing across a wider range of model sizes and context lengths would show whether the speedup ratios remain stable.
- Adapting the pruning criterion to task-specific signals instead of magnitude alone could further reduce quality impact at high sparsity.
Load-bearing premise
That the chosen semi-structured sparsity patterns and hierarchical compression preserve the essential information in the attention computation so that generation quality remains acceptable.
What would settle it
A side-by-side run of long-context generation benchmarks at the reported sparsity levels that shows substantially higher perplexity or lower downstream task accuracy for HieraSparse than for the unstructured-sparsity baseline.
Figures
read the original abstract
The deployment of long-context Large Language Models (LLMs) poses significant challenges due to the intense computational cost of self-attention and the substantial memory overhead of the Key-Value Cache (KV Cache). In this paper, we introduce HieraSparse, a hierarchical KV Cache compression framework with acceleration kernels that leverage GPU sparse tensor cores to speed up semi-structured KV Cache attention for both the prefill and decode phases. With the hierarchical design, our method allows for a flexible quality-sparsity trade-off and successfully converts sparsity into efficiency. Compared to the state-of-the-art decode method that utilizes unstructured sparsity, HieraSparse achieves $\mathbf{1.2\times}$ KV compression ratio and $\mathbf{4.57\times}$ attention speedup at the same sparsity level. Furthermore, we extended the semi-structured KV Cache pruning to the prefill stage, which demonstrated up to $\mathbf{1.85\times}$ attention speedup at the highest sparsity. Lastly, we evaluate the generation quality of HieraSparse with a simple magnitude-based pruning method, and the results show that $\mathbf{1.37\times}$ prefill speedup and $\mathbf{1.77\times}$ decode speedup can be achieved without significant quality drop. The codebase can be found at https://github.com/psl-ntu/HieraSparse.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HieraSparse, a hierarchical semi-structured sparse KV Cache compression framework for long-context LLMs, accompanied by custom GPU kernels that exploit sparse tensor cores to accelerate attention computation in both prefill and decode phases. It reports concrete empirical gains over prior unstructured-sparsity decode methods (1.2× KV compression and 4.57× attention speedup at matched sparsity), extends semi-structured pruning to prefill (up to 1.85× speedup), and shows that simple magnitude-based pruning yields 1.37× prefill and 1.77× decode speedups with acceptable quality retention. A public codebase is provided.
Significance. If the performance and quality results hold under rigorous scrutiny, the work offers a practical route to convert sparsity into measurable inference efficiency and memory reduction for long-context models. The public codebase is a clear strength that directly supports verification of kernel correctness, overheads, and benchmark numbers, increasing the likelihood of adoption and follow-on research in sparse attention systems.
major comments (3)
- [§4] §4 (Experimental results): The headline speedups (4.57× decode, 1.85× prefill, 1.37×/1.77× with magnitude pruning) are reported as single-point measurements without error bars, number of runs, hardware details, or statistical significance tests. Because these numbers are the central empirical claim, the absence of variability assessment undermines confidence in the reported gains.
- [§3] §3 (Hierarchical design and kernel implementation): The paper provides no ablation isolating the contribution of the hierarchical levels versus the semi-structured pattern itself, nor any analysis of how the chosen sparsity masks interact with the attention matrix multiplication. This is load-bearing for the claim that quality is preserved while speedups are realized.
- [Table 2] Table 2 / Figure 4 (baseline comparisons): The evaluation compares only against one unstructured-sparsity decode method; missing are direct head-to-head results against other recent semi-structured or hierarchical KV pruning techniques at identical sparsity ratios and model scales, which is required to substantiate the “state-of-the-art” claim.
minor comments (2)
- [§3.2] Notation for the hierarchical block sizes and pruning thresholds is introduced without a compact mathematical definition; a single equation summarizing the mask construction would improve clarity.
- [Abstract] The abstract states “without significant quality drop” but the main text does not define the threshold used for this judgment (e.g., perplexity delta, downstream task accuracy).
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the experimental reporting can be strengthened with variability measures, that additional ablations would better isolate design contributions, and that expanded baselines would provide fuller context. We will incorporate these changes in a revised manuscript. Our point-by-point responses follow.
read point-by-point responses
-
Referee: §4 (Experimental results): The headline speedups (4.57× decode, 1.85× prefill, 1.37×/1.77× with magnitude pruning) are reported as single-point measurements without error bars, number of runs, hardware details, or statistical significance tests. Because these numbers are the central empirical claim, the absence of variability assessment undermines confidence in the reported gains.
Authors: We agree that single-point measurements limit confidence in the reported gains. In the revision we will rerun all timing experiments on the same hardware (NVIDIA H100 GPUs) for a minimum of five independent trials, report means with standard-deviation error bars, and include paired t-tests to assess statistical significance of the speedups. The public codebase already exposes the exact benchmark scripts, so these additional runs can be reproduced directly. revision: yes
-
Referee: §3 (Hierarchical design and kernel implementation): The paper provides no ablation isolating the contribution of the hierarchical levels versus the semi-structured pattern itself, nor any analysis of how the chosen sparsity masks interact with the attention matrix multiplication. This is load-bearing for the claim that quality is preserved while speedups are realized.
Authors: We acknowledge the absence of these isolating experiments. We will add a dedicated ablation subsection that compares the full hierarchical mask against a non-hierarchical (flat) semi-structured mask at identical sparsity ratios, measuring both quality (perplexity, downstream task scores) and kernel throughput. We will also include a brief analysis, supported by a new figure, of how the hierarchical block structure alters the sparsity pattern seen by the sparse tensor-core matmul and why this preserves attention quality. All new results will be generated from the released implementation. revision: yes
-
Referee: Table 2 / Figure 4 (baseline comparisons): The evaluation compares only against one unstructured-sparsity decode method; missing are direct head-to-head results against other recent semi-structured or hierarchical KV pruning techniques at identical sparsity ratios and model scales, which is required to substantiate the “state-of-the-art” claim.
Authors: The primary baseline was chosen because it is the strongest published unstructured-sparsity decode method operating at the same sparsity level and KV-cache setting. To address the gap, we will extend Table 2 and Figure 4 with head-to-head numbers against the most relevant recent semi-structured and hierarchical KV-pruning works, using identical sparsity ratios and the same model scales wherever the original implementations or sufficient details are available. Where direct reproduction is not feasible we will clearly state the differences in experimental conditions and sparsity definitions. revision: yes
Circularity Check
No significant circularity; purely empirical systems contribution
full rationale
The paper introduces HieraSparse as a hierarchical KV cache compression method using semi-structured sparsity and custom sparse-tensor-core kernels for prefill and decode phases. All reported results (1.2× compression, speedups up to 4.57×, quality under magnitude pruning) are direct empirical measurements of runtime and accuracy on LLMs. No equations, derivations, fitted parameters, or predictions appear in the provided text. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The public codebase further enables independent verification of kernels and benchmarks, confirming the work is self-contained against external measurements rather than reducing to its own inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
OpenAI, “GPT-4 Technical Report,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2303.08774
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2024
-
[2]
MetaAI, “The Llama 3 Herd of Models,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2407.21783
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.21783 2024
-
[3]
Qwen3 Technical Report,
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...
-
[4]
[Online]. Available: https://doi.org/10.48550/arXiv.2505.09388
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388
-
[5]
A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.06825
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06825 2023
-
[6]
A Survey on Large Language Models for Code Generation
J. Jiang, F. Wang, J. Shen, S. Kim, and S. Kim, “A survey on large language models for code generation,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2406.00515
work page internal anchor Pith review doi:10.48550/arxiv.2406.00515 2024
-
[7]
Benchmarking Large Language Models for News Summarization,
T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. McKeown, and T. B. Hashimoto, “Benchmarking Large Language Models for News Summarization,” 2023. [Online]. Available: https://doi.org/10.48550/ arXiv.2301.13848
-
[8]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2201.11903
work page internal anchor Pith review doi:10.48550/arxiv.2201.11903 2023
-
[9]
P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W. tau Yih, T. Rockt ¨aschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive nlp tasks,” 2021. [Online]. Available: https://doi.org/10.48550/arXiv.2005. 11401
-
[10]
Many-shot in-context learning,
R. Agarwal, A. Singh, L. M. Zhang, B. Bohnet, L. Rosias, S. Chan, B. Zhang, A. Anand, Z. Abbaset al., “Many-shot in-context learning,”
-
[11]
arXiv preprint arXiv:2404.11018
[Online]. Available: https://doi.org/10.48550/arXiv.2404.11018
-
[12]
MemGPT: Towards LLMs as Operating Systems
C. Packer, S. Wooders, K. Lin, V . Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez, “Memgpt: Towards llms as operating systems,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.08560
work page internal anchor Pith review doi:10.48550/arxiv.2310.08560 2023
-
[13]
RoFormer: Enhanced Transformer with Rotary Position Embedding
J. Su, Y . Lu, S. Pan, A. Murtadha, B. Wen, and Y . Liu, “Roformer: Enhanced transformer with rotary position embedding,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2104.09864
work page internal anchor Pith review doi:10.48550/arxiv.2104.09864 2023
-
[14]
Effective long-context scaling of foundation models
W. Xiong, J. Liu, I. Molybog, H. Zhang, P. Bhargava, R. Hou, L. Martin, R. Rungta, K. A. Sankararaman, B. Oguz, M. Khabsa, H. Fang, Y . Mehdad, S. Narang, K. Malik, A. Fan, S. Bhosale, S. Edunov, M. Lewis, S. Wang, and H. Ma, “Effective long- context scaling of foundation models,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2309.16039
-
[15]
Llama 3 gradient: A series of long context model
L. Pekelis, M. Feil, F. Moret, M. Huang, and T. Peng, “Llama 3 gradient: A series of long context model.” [Online]. Available: https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradient-1048k
-
[16]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.1706.03762
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1706.03762 2023
-
[17]
H 2o: Heavy-hitter oracle for efficient generative inference of large language models,
Z. Zhang, Y . Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y . Tian, C. R´e, C. Barrett, Z. Wang, and B. Chen, “H 2o: Heavy-hitter oracle for efficient generative inference of large language models,”
-
[18]
arXiv preprint arXiv:2306.14048 , year=
[Online]. Available: https://doi.org/10.48550/arXiv.2306.14048
-
[19]
SnapKV: LLM Knows What You are Looking for Before Generation
Y . Li, Y . Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen, “Snapkv: Llm knows what you are looking for before generation,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2404.14469
work page internal anchor Pith review doi:10.48550/arxiv.2404.14469 2024
-
[20]
Think: Thinner key cache by query-driven pruning,
Y . Xu, Z. Jie, H. Dong, L. Wang, X. Lu, A. Zhou, A. Saha, C. Xiong, and D. Sahoo, “Think: Thinner key cache by query-driven pruning,”
-
[21]
[Online]. Available: https://doi.org/10.48550/arXiv.2407.21018
-
[22]
Leank: Learnable k cache channel pruning for efficient decoding,
Y . Zhang, Z. He, H. Jiang, C. Zhang, Y . Yang, J. Wang, and L. Qiu, “Leank: Learnable k cache channel pruning for efficient decoding,”
-
[23]
Available: https://doi.org/10.48550/arXiv.2508.02215
[Online]. Available: https://doi.org/10.48550/arXiv.2508.02215
-
[24]
Y . Fu, Z. Cai, A. Asi, W. Xiong, Y . Dong, and W. Xiao, “Not all heads matter: A head-level kv cache compression method with integrated retrieval and reasoning,” 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2410.19258
-
[25]
Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference,
Y . Feng, J. Lv, Y . Cao, X. Xie, and S. K. Zhou, “Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference,”
-
[26]
arXiv preprint arXiv:2407.11550 , year =
[Online]. Available: https://doi.org/10.48550/arXiv.2407.11550
-
[27]
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
Z. Cai, Y . Zhang, B. Gao, Y . Liu, Y . Li, T. Liu, K. Lu, W. Xiong, Y . Dong, J. Hu, and W. Xiao, “Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling,” 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2406.02069
work page internal anchor Pith review doi:10.48550/arxiv.2406.02069 2025
-
[28]
Dynamickv: Task-aware adaptive kv cache compression for long context llms
X. Zhou, W. Wang, M. Zeng, J. Guo, X. Liu, L. Shen, M. Zhang, and L. Ding, “Dynamickv: Task-aware adaptive kv cache compression for long context llms,” 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2412.14838
-
[29]
Duoattention: Efficient long-context LLM inference with retrieval and streaming heads
G. Xiao, J. Tang, J. Zuo, J. Guo, S. Yang, H. Tang, Y . Fu, and S. Han, “Duoattention: Efficient long-context llm inference with retrieval and streaming heads,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2410.10819
-
[30]
Mustafar: Promoting unstructured sparsity for kv cache pruning in llm inference.arXiv.org, 2025
D. Joo, H. Hosseini, R. Hadidi, and B. Asgari, “Mustafar: Promoting unstructured sparsity for kv cache pruning in llm inference,” 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2505.22913
-
[31]
Efficient Streaming Language Models with Attention Sinks
G. Xiao, Y . Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2309.17453
work page internal anchor Pith review doi:10.48550/arxiv.2309.17453 2024
-
[32]
P. Singhania, S. Singh, S. He, S. Feizi, and A. Bhatele, “Loki: Low-rank keys for efficient sparse attention,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2406.02542
-
[33]
Post-training sparse attention with double sparsity.arXiv preprint arXiv:2408.07092, 2024
S. Yang, Y . Sheng, J. E. Gonzalez, I. Stoica, and L. Zheng, “Post-training sparse attention with double sparsity,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2408.07092
-
[34]
Spatten: Efficient sparse attention architecture with cascade token and head pruning,
H. Wang, Z. Zhang, and S. Han, “Spatten: Efficient sparse attention architecture with cascade token and head pruning,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021, pp. 97–110. [Online]. Available: https: //doi.org/10.48550/arXiv.2012.09852
-
[35]
SparTA: Deep-Learning model sparsity via Tensor-with-Sparsity-Attribute,
N. Zheng, B. Lin, Q. Zhang, L. Ma, Y . Yang, F. Yang, Y . Wang, M. Yang, and L. Zhou, “SparTA: Deep-Learning model sparsity via Tensor-with-Sparsity-Attribute,” in16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). Carlsbad, CA: USENIX Association, Jul. 2022, pp. 213–232. [Online]. Available: https://www.usenix.org/conference/o...
2022
-
[36]
D. Joo, H. Hosseini, R. Hadidi, and B. Asgari,Coruscant: Co- Designing GPU Kernel and Sparse Tensor Core to Advocate Unstructured Sparsity in Efficient LLM Inference. New York, NY , USA: Association for Computing Machinery, 2025, p. 232–245. [Online]. Available: https://doi.org/10.1145/3725843.3756065
-
[37]
H. Xia, Z. Zheng, Y . Li, D. Zhuang, Z. Zhou, X. Qiu, Y . Li, W. Lin, and S. L. Song, “Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity,”Proc. VLDB Endow., vol. 17, no. 2, p. 211–224, Oct. 2023. [Online]. Available: https://doi.org/10.14778/3626292.3626303
-
[38]
Spinfer: Leveraging low-level sparsity for efficient large language model inference on gpus,
R. Fan, X. Yu, P. Dong, Z. Li, G. Gong, Q. Wang, W. Wang, and X. Chu, “Spinfer: Leveraging low-level sparsity for efficient large language model inference on gpus,” inProceedings of the Twentieth European Conference on Computer Systems, ser. EuroSys ’25. New York, NY , USA: Association for Computing Machinery, 2025, p. 243–260. [Online]. Available: https:...
-
[39]
Accelerating sparse deep neural networks,
A. Mishra, J. A. Latorre, J. Pool, D. Stosic, D. Stosic, G. Venkatesh, C. Yu, and P. Micikevicius, “Accelerating sparse deep neural networks,”
-
[40]
Accelerating sparse deep neural networks.arXiv preprint arXiv:2104.08378,
[Online]. Available: https://doi.org/10.48550/arXiv.2104.08378
-
[41]
Ampere tensor core,
Nvidia, “Ampere tensor core,” https://www.nvidia.com/en-us/ data-center/ampere-architecture/, 2023
2023
-
[42]
Amd matrix core,
Amd, “Amd matrix core,” https://www.amd.com/content/dam/amd/en/ documents/instinct-tech-docs/product-briefs/instinct-mi325x-datasheet. pdf, 2023
2023
-
[43]
Longformer: The Long-Document Transformer
I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long- document transformer,” 2020. [Online]. Available: https://doi.org/10. 48550/arXiv.2004.05150
work page internal anchor Pith review arXiv 2020
-
[44]
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
T. Dao, D. Y . Fu, S. Ermon, A. Rudra, and C. R ´e, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2205.14135
work page internal anchor Pith review doi:10.48550/arxiv.2205.14135 2022
-
[45]
Self-attention does not needo(n 2)memory,
M. N. Rabe and C. Staats, “Self-attention does not needo(n 2)memory,”
-
[46]
Self-attention does not needo(n 2)memory.arXiv preprint arXiv:2112.05682,
[Online]. Available: https://doi.org/10.48550/arXiv.2112.05682
-
[47]
In: Proceedings of the 29th Symposium on Operating Systems Principles
W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” inProceedings of the 29th Symposium on Operating Systems Principles, ser. SOSP ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 611–626. [Online]. Availabl...
-
[48]
Sglang: Efficient execution of structured language model programs,
L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, C. Barrett, and Y . Sheng, “Sglang: Efficient execution of structured language model programs,”
-
[49]
SGLang: Efficient Execution of Structured Language Model Programs
[Online]. Available: https://doi.org/10.48550/arXiv.2312.07104
work page internal anchor Pith review doi:10.48550/arxiv.2312.07104
-
[50]
CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
J. Yao, H. Li, Y . Liu, S. Ray, Y . Cheng, Q. Zhang, K. Du, S. Lu, and J. Jiang, “Cacheblend: Fast large language model serving for rag with cached knowledge fusion,” 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2405.16444
-
[51]
Prompt cache: Modular attention reuse for low-latency inference,
I. Gim, G. Chen, S. seob Lee, N. Sarda, A. Khandelwal, and L. Zhong, “Prompt cache: Modular attention reuse for low-latency inference,”
-
[52]
Prompt cache: Modular attention reuse for low-latency inference
[Online]. Available: https://doi.org/10.48550/arXiv.2311.04934
-
[53]
J. Yang, B. Hou, W. Wei, Y . Bao, and S. Chang, “Kvlink: Accelerating large language models via efficient kv cache reuse,” 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2502.16002
-
[54]
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Z. Liu, J. Yuan, H. Jin, S. H. Zhong, Z. Xu, V . Braverman, B. Chen, and X. Hu, “Kivi: a tuning-free asymmetric 2bit quantization for kv cache,” inProceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.02750
work page internal anchor Pith review doi:10.48550/arxiv.2402.02750 2024
-
[55]
T. Zhang, J. Yi, Z. Xu, and A. Shrivastava, “Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization,” 2024. [Online]. Available: https://doi.org/10.48550/arXiv. 2405.03917
work page internal anchor Pith review doi:10.48550/arxiv 2024
-
[56]
Kvzip: Query-agnostic kv cache compression with context reconstruction,
J.-H. Kim, J. Kim, S. Kwon, J. W. Lee, S. Yun, and H. O. Song, “Kvzip: Query-agnostic kv cache compression with context reconstruction,”
-
[57]
arXiv preprint arXiv:2505.23416 , year =
[Online]. Available: https://doi.org/10.48550/arXiv.2505.23416
-
[58]
L. Wang, Y . Cheng, Y . Shi, Z. Tang, Z. Mo, W. Xie, L. Ma, Y . Xia, J. Xue, F. Yang, and Z. Yang, “Tilelang: A composable tiled programming model for ai systems,” 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2504.17577
-
[59]
PyTorch: An Imperative Style, High-Performance Deep Learning Library
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. K ¨opf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” 2019. [Online]. Available: https://doi.org/1...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1912.01703 2019
-
[60]
Fast transformer decoding: One write-head is all you need,
N. Shazeer, “Fast transformer decoding: One write-head is all you need,”
-
[61]
Fast Transformer Decoding: One Write-Head is All You Need
[Online]. Available: https://doi.org/10.48550/arXiv.1911.02150
work page internal anchor Pith review doi:10.48550/arxiv.1911.02150 1911
-
[62]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
J. Ainslie, J. Lee-Thorp, M. de Jong, Y . Zemlyanskiy, F. Lebr ´on, and S. Sanghai, “Gqa: Training generalized multi-query transformer models from multi-head checkpoints,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.13245
work page internal anchor Pith review doi:10.48550/arxiv.2305.13245 2023
-
[63]
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
Y . Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y . Dong, J. Tang, and J. Li, “LongBench: A bilingual, multitask benchmark for long context understanding,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V . Srikumar, Eds. Bang...
work page internal anchor Pith review doi:10.48550/arxiv.2308.14508 2024
-
[64]
Enabling unstructured sparse acceleration on structured sparse accelerators,
G. Jeong, P.-A. Tsai, A. R. Bambhaniya, S. W. Keckler, and T. Krishna, “Enabling unstructured sparse acceleration on structured sparse accelerators,” 2025. [Online]. Available: https://doi.org/10.48550/ arXiv.2403.07953
-
[65]
Venom: A vectorized n:m format for unleashing the power of sparse tensor cores,
R. L. Castro, A. Ivanov, D. Andrade, T. Ben-Nun, B. B. Fraguela, and T. Hoefler, “Venom: A vectorized n:m format for unleashing the power of sparse tensor cores,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’23. New York, NY , USA: Association for Computing Machinery,
-
[66]
Rogers, Evan Schneider, Jean-Luc Vay, and P
[Online]. Available: https://doi.org/10.1145/3581784.3607087
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.