Kara: Efficient Reasoning LLM Serving via Sliding-Window KV Cache Compression

Shen Han; Yuyang Wu

arxiv: 2607.01237 · v1 · pith:IQ3HSRYAnew · submitted 2026-05-01 · 💻 cs.CL · cs.AI

Pith reviewed 2026-07-04 01:21 UTC · model grok-4.3

keywords KV cache compression · reasoning LLMs · chain-of-thought · sliding window · bidirectional attention · PagedAttention · inference optimization

The pith

Kara uses sliding-window compression on recent context to reduce KV cache size in reasoning LLMs while maintaining performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reasoning language models produce long chain-of-thought sequences that cause the KV cache to grow large, increasing memory use and slowing down generation. Existing compression methods have issues with when and how they compress, sometimes hurting throughput or losing key information from whole blocks. Kara fixes this by compressing only the latest part of the context in a sliding window, using bidirectional attention to identify useful key-value pairs and a Token2Chunk module to keep them in adaptable groups. It integrates with PagedAttention in the KvLLM system to lower memory needs and raise output speed. Experiments show this leads to better results across different setups.

Core claim

Kara is a sliding-window KV cache compression method that performs decoding-time compression only on the recently generated context. It leverages bidirectional attention to score and select informative KV pairs in the window and uses a Token2Chunk module to expand selected pairs into flexible chunks. Adapted to PagedAttention, it is implemented in KvLLM to reduce KV cache memory usage and improve output throughput for reasoning models with long CoT.
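The decoding-time behaviour described above can be pictured with a small simulation. This is a sketch under stated assumptions, not the authors' implementation: the function names, the `window_size` and `keep` parameters, the toy score, and the plain top-k selection are all hypothetical, and the sketch uses non-overlapping windows and ignores the buffer that the paper's Figure 6 settings mention.

```python
from typing import Callable, List, Sequence


def compress_window(window: Sequence[int],
                    score: Callable[[int], float],
                    keep: int) -> List[int]:
    """Keep the `keep` highest-scoring token positions, in original order."""
    ranked = sorted(window, key=score, reverse=True)[:keep]
    return sorted(ranked)


def decode_with_sliding_window(num_steps: int,
                               window_size: int,
                               keep: int,
                               score: Callable[[int], float]) -> List[int]:
    """Simulate which token positions stay cached after `num_steps` decode steps."""
    retained: List[int] = []  # already compressed; never revisited (assumption)
    window: List[int] = []    # recently generated, still uncompressed
    for pos in range(num_steps):
        window.append(pos)
        if len(window) == window_size:
            # Compress only the recent window, then start a fresh one.
            retained.extend(compress_window(window, score, keep))
            window = []
    return retained + window


def toy_score(pos: int) -> float:
    """Hypothetical importance score: positions divisible by 3 matter most."""
    return 1.0 if pos % 3 == 0 else 0.0


cache = decode_with_sliding_window(num_steps=10, window_size=4, keep=2, score=toy_score)
print(cache)
```

Because compression fires every `window_size` steps rather than when a global budget is hit, the cache in this toy grows by a bounded amount per window, which is the property the core claim relies on.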

What carries the argument

Sliding-window KV cache compression that scores pairs with bidirectional attention and expands them via the Token2Chunk module into flexible chunks.
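One way to picture the chunk-expansion step is the sketch below. The page does not give the paper's expansion criterion, so the rule used here (grow a selected position toward its stronger neighbour while that neighbour's score is at least `min_score`, up to `gamma` tokens) is an assumption for illustration only; `token2chunk_sketch`, `min_score`, and the toy scores are hypothetical names and values.

```python
from typing import List, Sequence, Set


def token2chunk_sketch(scores: Sequence[float],
                       selected: Sequence[int],
                       gamma: int,
                       min_score: float) -> List[int]:
    """Grow each selected position into a contiguous chunk of at most `gamma` tokens.

    A neighbour joins the chunk only if its score is at least `min_score`
    (an assumed rule, not the paper's stated criterion).
    """
    kept: Set[int] = set()
    n = len(scores)
    for center in selected:
        left = right = center
        while right - left + 1 < gamma:
            can_left = left - 1 >= 0 and scores[left - 1] >= min_score
            can_right = right + 1 < n and scores[right + 1] >= min_score
            if not (can_left or can_right):
                break
            # Prefer the stronger neighbour; ties go to the right.
            if can_left and (not can_right or scores[left - 1] > scores[right + 1]):
                left -= 1
            else:
                right += 1
        kept.update(range(left, right + 1))
    return sorted(kept)


scores = [0.1, 0.6, 0.9, 0.5, 0.05, 0.02, 0.7, 0.4]
chunks = token2chunk_sketch(scores, selected=[2, 6], gamma=3, min_score=0.3)
print(chunks)
```

In this toy the two selected positions end up in chunks of different lengths, which is the sense in which the retained chunks are flexible rather than fixed-size.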

If this is right

Reduces memory overhead from massive KV caches in long decoding sequences.
Improves decoding throughput without the limitations of threshold-triggered policies.
Preserves important flexible-sized semantic chunks at arbitrary positions.
Avoids fully eliminating KV pairs from certain sequence blocks.
Adapts to existing PagedAttention frameworks for practical deployment.
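The memory point in the list above follows from standard KV cache accounting: each cached token stores one key and one value vector per layer and per KV head. The sketch below applies that formula to an assumed, illustrative configuration (32 layers, 8 KV heads, head dimension 128, 2-byte elements); the sequence lengths are arbitrary and are not results from the paper.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 accounts for storing both a key and a value per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed configuration for illustration only.
full = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)
compressed = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8_192)
print(full / 2**30, "GiB vs", compressed / 2**30, "GiB")
```

Under these assumptions a single 32,768-token trace occupies 4 GiB, so retaining a quarter of the tokens frees 3 GiB per sequence that could go to additional concurrent sequences.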

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

May enable serving larger batch sizes or longer contexts on the same hardware.
Could reduce energy consumption in large-scale LLM inference deployments.
Potential to extend to other attention-based models beyond standard transformers.
Testing on a wider range of reasoning benchmarks might reveal task-specific benefits.

Load-bearing premise

Compressing only the recently generated context in a sliding window avoids significant information loss for future decoding steps.

What would settle it

Running Kara on a long CoT reasoning task and observing a substantial drop in final answer accuracy compared to the full KV cache baseline.

Figures

Figures reproduced from arXiv: 2607.01237 by Shen Han, Yuyang Wu.

**Figure 1.** (a) Average throughput under different batch sizes, where batch size denotes the predefined maximum decoding sequences. We observe that vLLM with SnapKV achieves lower throughput than vanilla vLLM as the batch size increases. (b) The actual number of decoding sequences varies with decoding steps. The decoding step denotes the number of global decoding iterations and predefined maximum decoding sequences i…

**Figure 3.** (a) Distribution comparison between causal attention and bidirectional attention of a specific token. (b) Average causal attention weight versus bidirectional attention percentile. We first compute the bidirectional attention weights for all token pairs (xi, xj) with j > i, using xi as the query and xj as the key. We then sort these weights in ascending order and group the pairs by percentiles. Finally,…
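The difference between the two attention patterns in the caption can be made concrete with a toy computation. This is a minimal sketch, assuming scaled dot-product attention and made-up two-dimensional vectors; it only shows that removing the causal mask lets query xi place weight on later keys xj with j > i, and it does not reproduce the paper's scoring procedure.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q: List[List[float]], k: List[List[float]],
                      causal: bool) -> List[List[float]]:
    """Row i holds the weights query i puts on each key; masked keys get 0."""
    n, d = len(q), len(q[0])
    rows = []
    for i in range(n):
        # Causal: query i sees keys 0..i. Bidirectional: it sees every key.
        visible = range(i + 1) if causal else range(n)
        logits = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in visible]
        probs = softmax(logits)
        rows.append(probs + [0.0] * (n - len(probs)))
    return rows


# Toy vectors (hypothetical); queries and keys share the same values here.
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
causal = attention_weights(vecs, vecs, causal=True)
bidir = attention_weights(vecs, vecs, causal=False)
future_mass = sum(bidir[0][1:])  # weight query 0 puts on keys with j > 0
print(causal[0], bidir[0], future_mass)
```

Under the causal mask the first token can only attend to itself, so any signal about which later tokens relate to it is visible only in the bidirectional weights.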

**Figure 5.** Performance of different KV cache compression methods across varying retention levels. The dashed line represents the accuracy of the vanilla LLM without compression.

**Figure 6.** Needle-In-A-Haystack (NIAH) performance. The x-axis denotes the input context length and the y-axis denotes the needle insertion depth. Each cell reports the retrieval score for the corresponding (length, depth) setting, and we also report the mean accuracy averaged over all cells. |W| ∈ {256, 384, 512} and set the buffer length |U| ∈ {32, 64}. For Token2Chunk, we use a fixed maximum chunk size γ of 8 and …

Original abstract

Reasoning language models often generate long chain-of-thought (CoT), which accumulates a massive KV cache during the decoding phase and incurs high decoding latency and limited throughput. To address these issues, KV cache compression has emerged as a promising technique for reducing memory overhead by selectively removing unimportant KV pairs while preserving useful ones for subsequent decoding. Nevertheless, we identify two key limitations in existing KV cache compression methods: 1) their threshold-triggered compression policy may provide limited throughput improvement or even reduce throughput, and may fully eliminate KV pairs from certain blocks of the sequence, potentially worsening information loss. 2) they typically retain either isolated KV pairs or fixed-size chunks with rigid boundaries, failing to preserve important flexible-sized chunks at arbitrary token positions. To overcome these limitations, we propose Kara, a sliding-window KV cache compression method that performs decoding-time compression by operating only on the recently generated context. Kara leverages bidirectional attention to score and select informative KV pairs in the window. To enable flexible preservation of important semantic information, we design a Token2Chunk module to expand a subset of selected KV pairs into chunks. Furthermore, we adapt Kara to PagedAttention and develop KvLLM, an inference framework built upon vLLM, which reduces KV cache memory usage and effectively improves output throughput. Extensive experiments demonstrate consistent performance improvements of proposed Kara and KvLLM.
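PagedAttention stores the KV cache in fixed-size blocks, which is why a compression method has to be adapted to it before memory is actually returned to the allocator. The sketch below is hypothetical block accounting, not KvLLM's scheme: the block size of 16 and the idea of compacting retained pairs into as few blocks as possible are assumptions made for illustration.

```python
import math
from typing import Iterable


def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Blocks required to hold `num_tokens` densely packed KV pairs."""
    return math.ceil(num_tokens / block_size)


def blocks_still_occupied(kept_positions: Iterable[int], block_size: int = 16) -> int:
    """Blocks that stay allocated if retained pairs are left where they are."""
    return len({pos // block_size for pos in kept_positions})


def blocks_freed(tokens_before: int, tokens_after: int, block_size: int = 16) -> int:
    """Blocks released if retained pairs are compacted into dense blocks."""
    return blocks_needed(tokens_before, block_size) - blocks_needed(tokens_after, block_size)


# Keep one token out of every 16 in a 512-token sequence (toy pattern).
kept = list(range(0, 512, 16))
print(blocks_still_occupied(kept), blocks_needed(len(kept)), blocks_freed(512, 200))
```

In the toy pattern, 32 scattered survivors pin all 32 original blocks, while the same 32 pairs fit in 2 blocks once compacted, so evicting pairs alone does not guarantee a smaller footprint.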

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Kara applies sliding-window compression only to recent tokens with bidirectional scoring and a Token2Chunk step to keep flexible chunks, which directly targets the threshold and rigidity problems called out in prior KV work, but the performance claims rest on experiments not shown in the abstract.

The main takeaway is that Kara limits compression to a sliding window over newly generated tokens, scores them with bidirectional attention, and uses Token2Chunk to turn selected pairs into variable-length chunks before integrating the whole thing with PagedAttention inside KvLLM on vLLM. This setup is presented as a way around the throughput penalties and block-level drops that come with threshold-triggered policies, plus the rigid chunk boundaries in earlier methods.

What is actually new is the specific combination of the window restriction, the bidirectional scoring inside it, and the chunk-expansion module, along with the concrete adaptation to an existing serving stack. The paper does a clean job stating the two limitations it wants to fix and showing how each design choice maps to one of them.

The soft spot is that the abstract asserts consistent improvements from extensive experiments yet supplies no numbers, baselines, memory savings, or latency figures. Without those, it is difficult to judge whether the sliding-window restriction really prevents cumulative loss across long CoT traces or whether the gains are large enough to matter in practice. The stress-test worry about irreversible early losses is reasonable on the face of it; the paper would need clear ablations showing that tokens outside the current window do not need further selection.

This is a paper for people who build or tune LLM inference systems, especially those dealing with reasoning models that produce long outputs. A reader already working on KV cache or PagedAttention would find the method description and the KvLLM implementation useful even if the quantitative results need more detail.

I would send it to peer review. The idea is coherent, the implementation angle is practical, and the targeted limitations are real; the work is worth referee time to check the experiments and confirm the assumptions hold.

Referee Report

3 major / 2 minor

Summary. The paper identifies two limitations in prior KV cache compression for long-CoT reasoning LLMs (threshold-triggered policies that can hurt throughput or cause block-level KV elimination, and rigid isolated-pair or fixed-chunk retention) and proposes Kara: a sliding-window method that compresses only the recently generated context using bidirectional attention for KV scoring, a Token2Chunk module to expand selected pairs into flexible semantic chunks, and an adaptation to PagedAttention inside the KvLLM framework built on vLLM. It claims that this yields reduced KV memory usage and consistent performance improvements.

Significance. If the empirical claims hold, Kara would provide a practical, low-overhead route to higher throughput and lower memory for serving reasoning models whose CoT traces exceed typical context windows, directly addressing a growing deployment bottleneck.

major comments (3)

[Abstract] Abstract: the central claim of 'consistent performance improvements' and 'effectively improves output throughput' is asserted without any quantitative results, baselines, error bars, or experimental protocol; this absence makes it impossible to assess whether the sliding-window restriction actually supports the claim or whether the identified limitations are resolved.
[Abstract / method description] The design rests on the unexamined assumption that irreversible compression decisions made inside one sliding window on recent tokens will not produce cumulative information loss for later decoding steps that depend on earlier windows; no analysis, ablation, or long-CoT experiment tests cross-window dependency preservation (cf. the weakest assumption in the stress-test note).
[Abstract] No equations, pseudocode, or complexity analysis are supplied for the bidirectional scoring, Token2Chunk expansion, or the PagedAttention integration, leaving open whether the claimed throughput gains are parameter-free or require additional tuning.

minor comments (2)

[Abstract] The two limitations are stated clearly but the manuscript never returns to them with a direct head-to-head comparison showing how Kara avoids each failure mode.
[Abstract] Terminology such as 'Token2Chunk module' and 'KvLLM' is introduced without a forward reference to the section that defines their implementation details.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, indicating the revisions we will incorporate.

Referee: [Abstract] Abstract: the central claim of 'consistent performance improvements' and 'effectively improves output throughput' is asserted without any quantitative results, baselines, error bars, or experimental protocol; this absence makes it impossible to assess whether the sliding-window restriction actually supports the claim or whether the identified limitations are resolved.

Authors: We agree that the abstract would benefit from quantitative support. In the revised version, we will add specific metrics from our experiments (e.g., throughput gains and KV cache reductions relative to baselines), along with a brief reference to the evaluation protocol and models used.

Revision: yes

Referee: [Abstract / method description] The design rests on the unexamined assumption that irreversible compression decisions made inside one sliding window on recent tokens will not produce cumulative information loss for later decoding steps that depend on earlier windows; no analysis, ablation, or long-CoT experiment tests cross-window dependency preservation (cf. the weakest assumption in the stress-test note).

Authors: The sliding-window design with Token2Chunk is intended to limit cumulative loss by preserving semantic chunks from recent context. We will add a dedicated discussion of cross-window dependency preservation in Section 3 and include an ablation on long-CoT traces spanning multiple windows to empirically test this aspect.

Revision: yes

Referee: [Abstract] No equations, pseudocode, or complexity analysis are supplied for the bidirectional scoring, Token2Chunk expansion, or the PagedAttention integration, leaving open whether the claimed throughput gains are parameter-free or require additional tuning.

Authors: The full equations, pseudocode, and complexity analysis appear in Section 3 and the appendix. We will revise the abstract to briefly reference these components and state that the method operates without additional hyperparameters beyond the window size.

Revision: yes

Circularity Check

0 steps flagged

No circularity; purely algorithmic proposal without derivations or self-referential reductions

The paper describes an engineering method (sliding-window KV cache compression with bidirectional scoring and Token2Chunk) to address stated limitations in prior KV compression techniques. No equations, parameter fits, uniqueness theorems, or self-citations appear in the provided text that would reduce any claim to its own inputs by construction. The central contribution is a new procedure whose performance is evaluated externally rather than derived from fitted values or prior author work invoked as axiomatic. This is the common case of a self-contained algorithmic paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Insufficient information in the abstract to identify free parameters, axioms, or invented entities; no mathematical derivations or modeling assumptions are detailed.

Reference graph

Works this paper leans on

33 extracted references · 8 canonical work pages · 5 internal anchors

[1] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[2] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

[3] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[4] Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, and Lei Chen. A survey on large language model acceleration based on KV cache management. arXiv preprint arXiv:2412.19442, 2024.

[5] Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, et al. LLM inference unveiled: Survey and roofline model insights. arXiv preprint arXiv:2402.16363, 2024.

[6] Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, and Junjie Hu. R-KV: Redundancy-aware KV cache compression for reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

[7] Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

[8] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[9] Jiayi Yuan, Hongyi Liu, Shaochen Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, and Xia Hu. KV cache compression, but what must we give in return? A comprehensive benchmark of long context capable approaches. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Associ..., 2024.

[10] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024.

[11] Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. Beyond text-visual attention: Exploiting visual cues for effective token pruning in VLMs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20857–20867, 2025.

[12] Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, and Edoardo Ponti. Inference-time hyper-scaling with KV cache compression. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

[13] Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, and Hyun Oh Song. KVzip: Query-agnostic KV cache compression with context reconstruction. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

[14] Alessio Devoto, Maximilian Jeblick, and Simon Jégou. Expected attention: KV cache compression by estimating attention from future queries distribution. arXiv preprint arXiv:2510.00636, 2025.

[15] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

[16] Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo, Haowei Liu, and Yujiu Yang. CriticBench: Benchmarking LLMs for critique-correct reasoning. In Findings of the Association for Computational Linguistics: ACL 2024, pages 1552–1587, 2024.

[17] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[18] Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Liuyue, Bo Li, Xuming Hu, and Xiaowen Chu. ChunkKV: Semantic-preserving KV cache compression for efficient long-context LLM inference. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

[19] Suzanna Sia, David Mueller, and Kevin Duh. Where does in-context learning happen in large language models? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[20] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[21] Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Li Qing, and Lei Chen. A survey on large language model acceleration based on KV cache management. Transactions on Machine Learning Research, 2025.

[22] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. NIPS ’24, Red Hook, NY, USA. Curran Associates Inc.

[24] Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, and Christopher Lott. KeyDiff: Key similarity-based KV cache eviction for long-context LLM inference in resource-constrained environments. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

[25] Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, and Tushar Krishna. ThinKV: Thought-adaptive KV cache compression for efficient reasoning models. In The Fourteenth International Conference on Learning Representations, 2026.

[26] Yuan Feng, Haoyu Guo, Junlin Lv, S Kevin Zhou, and Xike Xie. DefensiveKV: Taming the fragility of KV cache eviction in LLM inference. In The Fourteenth International Conference on Learning Representations, 2026.

[27] Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, and Jianguo Li. CAKE: Cascading and adaptive KV cache eviction with layer preferences. In The Thirteenth International Conference on Learning Representations, 2025.

[28] Yifan Zhang and Team Math-AI. American Invitational Mathematics Examination (AIME) 2024, 2024.

[29] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024.

[30] Dalton Jones, Junyoung Park, Matthew J Morse, Mingu Lee, Matthew Harper Langston, and Christopher Lott. QuoKA: Query-oriented KV selection for efficient LLM prefill. In The Fourteenth International Conference on Learning Representations, 2026.

[31] Yuzhen Mao, Qitong Wang, Martin Ester, and Ke Li. IceCache: Memory-efficient KV-cache management for long-sequence LLMs. In The Fourteenth International Conference on Learning Representations, 2026.

[32] Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, and Rex Ying. Cache what lasts: Token retention for memory-bounded KV cache in LLMs. In The Fourteenth International Conference on Learning Representations, 2026.

[33] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence, 2026.

A Detailed Experimental Settings for Figure 1 and 2

This section provides detailed settings for the simple experiment shown in Figure 1 and 2. We run all experiments on the MATH-500 dataset using DeepSeek-R1-Distill-LLaMA-8B. In Figure 1, we vary the maximum nu...

[1] [1]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 9

2022

[3] [3]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

A survey on large language model acceleration based on kv cache management.arXiv preprint arXiv:2412.19442, 2024

Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, and Lei Chen. A survey on large language model acceleration based on kv cache management.arXiv preprint arXiv:2412.19442, 2024

work page arXiv 2024

[5] [5]

Llm inference unveiled: Survey and roofline model insights.arXiv preprint arXiv:2402.16363, 2024

Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, et al. Llm inference unveiled: Survey and roofline model insights.arXiv preprint arXiv:2402.16363, 2024

work page arXiv 2024

[6] [6]

R-KV: Redundancy-aware KV cache compression for reasoning models

Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, and Junjie Hu. R-KV: Redundancy-aware KV cache compression for reasoning models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026

[7] [7]

Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026

[8] [8]

SnapKV: LLM knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

2024

[9] [9]

KV cache compression, but what must we give in return? a comprehensive benchmark of long context capable approaches

Jiayi Yuan, Hongyi Liu, Shaochen Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, and Xia Hu. KV cache compression, but what must we give in return? a comprehensive benchmark of long context capable approaches. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Associ...

2024

[10] [10]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InThe Twelfth International Conference on Learning Representations, 2024

2024

[11] [11]

Beyond text-visual attention: Exploiting visual cues for effective token pruning in vlms

Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. Beyond text-visual attention: Exploiting visual cues for effective token pruning in vlms. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20857–20867, 2025

2025

[12] [12]

Inference-time hyper-scaling with KV cache compression

Adrian Ła ´ncucki, Konrad Staniszewski, Piotr Nawrot, and Edoardo Ponti. Inference-time hyper-scaling with KV cache compression. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026

[13] [13]

Lee, Sangdoo Yun, and Hyun Oh Song

Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W. Lee, Sangdoo Yun, and Hyun Oh Song. KVzip: Query-agnostic KV cache compression with context reconstruction. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026

[14] [14]

Expected attention: Kv cache compres- sion by estimating attention from future queries distribution.arXiv preprint arXiv:2510.00636, 2025

Alessio Devoto, Maximilian Jeblick, and Simon Jégou. Expected attention: Kv cache compres- sion by estimating attention from future queries distribution.arXiv preprint arXiv:2510.00636, 2025

work page arXiv 2025

[15] [15]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large lan- guage model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

2023

[16] [16]

Criticbench: Benchmarking llms for critique-correct reasoning

Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo, Haowei Liu, and Yujiu Yang. Criticbench: Benchmarking llms for critique-correct reasoning. InFindings of the Association for Computa- tional Linguistics: ACL 2024, pages 1552–1587, 2024. 10

2024

[17] [17]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Liuyue, Bo Li, Xuming Hu, and Xiaowen Chu. ChunkKV: Semantic-preserving KV cache compression for efficient long-context LLM inference. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

[19] Suzanna Sia, David Mueller, and Kevin Duh. Where does in-context learning happen in large language models? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[20] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[21] Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Li Qing, and Lei Chen. A survey on large language model acceleration based on KV cache management. Transactions on Machine Learning Research, 2025.

[22] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. NIPS '24, Red Hook, NY, USA. Curran Associates Inc.

[24] Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, and Christopher Lott. KeyDiff: Key similarity-based KV cache eviction for long-context LLM inference in resource-constrained environments. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

[25] Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, and Tushar Krishna. ThinKV: Thought-adaptive KV cache compression for efficient reasoning models. In The Fourteenth International Conference on Learning Representations, 2026.

[26] Yuan Feng, Haoyu Guo, Junlin Lv, S Kevin Zhou, and Xike Xie. DefensiveKV: Taming the fragility of KV cache eviction in LLM inference. In The Fourteenth International Conference on Learning Representations, 2026.

[27] Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, and Jianguo Li. CAKE: Cascading and adaptive KV cache eviction with layer preferences. In The Thirteenth International Conference on Learning Representations, 2025.

[28] Yifan Zhang and Team Math-AI. American Invitational Mathematics Examination (AIME) 2024, 2024.

[29] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024.

[30] Dalton Jones, Junyoung Park, Matthew J Morse, Mingu Lee, Matthew Harper Langston, and Christopher Lott. QuoKA: Query-oriented KV selection for efficient LLM prefill. In The Fourteenth International Conference on Learning Representations, 2026.

[31] Yuzhen Mao, Qitong Wang, Martin Ester, and Ke Li. IceCache: Memory-efficient KV-cache management for long-sequence LLMs. In The Fourteenth International Conference on Learning Representations, 2026.

[32] Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, and Rex Ying. Cache what lasts: Token retention for memory-bounded KV cache in LLMs. In The Fourteenth International Conference on Learning Representations, 2026.

[33] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence, 2026.

A Detailed Experimental Settings for Figures 1 and 2

This section provides detailed settings for the simple experiments shown in Figures 1 and 2. We run all experiments on the MATH-500 dataset using DeepSeek-R1-Distill-LLaMA-8B. In Figure 1, we vary the maximum nu...