
arxiv: 2607.01237 · v1 · pith:IQ3HSRYAnew · submitted 2026-05-01 · 💻 cs.CL · cs.AI

Kara: Efficient Reasoning LLM Serving via Sliding-Window KV Cache Compression

Pith reviewed 2026-07-04 01:21 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords KV cache compression · reasoning LLMs · chain-of-thought · sliding window · bidirectional attention · PagedAttention · inference optimization

The pith

Kara uses sliding-window compression on recent context to reduce KV cache size in reasoning LLMs while maintaining performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reasoning language models produce long chain-of-thought (CoT) sequences, so the KV cache grows large during decoding, which raises memory use and slows generation. Existing compression methods have two problems: their threshold-triggered policies can yield little throughput gain or even reduce throughput and can eliminate every KV pair in some blocks of the sequence, and they retain only isolated pairs or fixed-size chunks. Kara addresses both by compressing only the most recently generated context in a sliding window, using bidirectional attention to score and select useful key-value pairs and a Token2Chunk module to expand some of them into flexible-size chunks. Adapted to PagedAttention and implemented in KvLLM, a framework built on vLLM, it lowers KV cache memory use and raises output throughput. The abstract reports consistent improvements for Kara and KvLLM but gives no figures.
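To make the memory pressure concrete, the standard per-token KV cache size is 2 (key and value) × layers × KV heads × head dimension × bytes per element. The sketch below applies that formula. The model configuration, sequence length, batch size, and retention ratio are illustrative assumptions, not numbers from the paper.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # The factor 2 accounts for storing both a key and a value vector per head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(num_tokens, batch_size, **cfg):
    # Total cache size in GiB for `batch_size` sequences of `num_tokens` each.
    return num_tokens * batch_size * kv_bytes_per_token(**cfg) / 2**30


# Hypothetical 8B-class configuration with grouped-query attention and fp16.
CFG = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

per_token = kv_bytes_per_token(**CFG)          # 131072 bytes (128 KiB)
full_cache = kv_cache_gib(16384, 32, **CFG)    # 64.0 GiB
retained = full_cache * 0.25                   # assumed 25% retention: 16.0 GiB

print(per_token, full_cache, retained)
```

Under these assumed numbers, 32 concurrent 16K-token traces need 64 GiB of KV cache alone, which is why shrinking the cache translates into room for more concurrent sequences.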

Core claim

Kara is a sliding-window KV cache compression method that performs decoding-time compression only on the recently generated context. It leverages bidirectional attention to score and select informative KV pairs in the window and uses a Token2Chunk module to expand selected pairs into flexible chunks. Adapted to PagedAttention, it is implemented in KvLLM to reduce KV cache memory usage and improve output throughput for reasoning models with long CoT.
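The scoring step can be sketched as follows. This page does not reproduce the paper's scoring formula, so the rule here is an assumption: each key in the window is scored by the average softmax attention it receives from every query in the window, with no causal mask, and the top-scoring positions are kept. The function names (`bidirectional_scores`, `select_top_k`, `compress_recent`) are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bidirectional_scores(queries, keys):
    """Mean attention each key receives from all window queries (no causal mask)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    scores = [0.0] * len(keys)
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        for j, w in enumerate(weights):
            scores[j] += w / len(queries)
    return scores


def select_top_k(scores, k):
    """Indices of the k highest-scoring keys, returned in position order."""
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


def compress_recent(queries, keys, window, keep):
    """Positions to retain: everything before the window is left untouched
    (an assumption), plus the top `keep` positions inside the window."""
    start = max(0, len(keys) - window)
    scores = bidirectional_scores(queries[start:], keys[start:])
    return list(range(start)) + [start + j for j in select_top_k(scores, keep)]


# Toy vectors, made up for illustration.
queries = [[0, 0], [1, 0], [1, 0], [1, 0], [0, 1]]
keys = [[0, 0], [4, 0], [0, 0], [0, 0], [0, 4]]
print(compress_recent(queries, keys, window=4, keep=2))  # [0, 1, 4]
```

Because the mask is dropped, a key late in the window can still be scored by queries that precede it, which a causal score cannot do.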

What carries the argument

Sliding-window KV cache compression that scores pairs with bidirectional attention and expands them via the Token2Chunk module into flexible chunks.
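One way to picture the chunk expansion is the sketch below. The page does not describe the Token2Chunk rule, so this is a hypothetical stand-in: each selected seed position grows toward its higher-scoring neighbour until the chunk reaches a maximum size or the best neighbour falls below a fraction of the seed's score. The default `max_chunk=8` mirrors the maximum chunk size γ of 8 mentioned in the experimental settings. `rel_threshold` and the function name `token2chunk` are invented for illustration.

```python
def token2chunk(scores, seeds, max_chunk=8, rel_threshold=0.5):
    """Expand each seed index into a contiguous chunk of at most `max_chunk` tokens.

    Hypothetical rule: keep absorbing the better-scoring adjacent token while
    its score is at least `rel_threshold` times the seed's score.
    Scores are assumed to be non-negative.
    """
    n = len(scores)
    kept = set()
    for s in seeds:
        lo = hi = s
        floor = rel_threshold * scores[s]
        while hi - lo + 1 < max_chunk:
            left = scores[lo - 1] if lo > 0 else float("-inf")
            right = scores[hi + 1] if hi + 1 < n else float("-inf")
            if max(left, right) < floor:
                break  # neither neighbour is informative enough
            if left >= right:
                lo -= 1
            else:
                hi += 1
        kept.update(range(lo, hi + 1))
    return sorted(kept)


# Toy scores, made up for illustration.
scores = [0.01, 0.30, 0.40, 0.25, 0.02, 0.01, 0.50, 0.01]
print(token2chunk(scores, seeds=[2, 6], max_chunk=3))  # [1, 2, 3, 6]
```

In this toy run the seed at position 2 grows into a three-token chunk while the seed at position 6 stays isolated, so chunk sizes and boundaries vary with the local scores rather than following a fixed grid.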

If this is right

  • Reduces memory overhead from massive KV caches in long decoding sequences.
  • Improves decoding throughput without the limitations of threshold-triggered policies.
  • Preserves important flexible-sized semantic chunks at arbitrary positions.
  • Avoids fully eliminating KV pairs from certain sequence blocks.
  • Adapts to existing PagedAttention frameworks for practical deployment.
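The link between compression and PagedAttention-style memory savings can be illustrated with a toy block table. PagedAttention stores the KV cache in fixed-size blocks and maps each sequence's logical blocks to physical ones. The class below is a simplified model of that bookkeeping, not vLLM's or KvLLM's actual data structures. The name `PagedSequence` and its methods are hypothetical, and token ids stand in for KV entries.

```python
class PagedSequence:
    """Toy model of one sequence's paged KV cache."""

    def __init__(self, block_size, free_blocks):
        self.block_size = block_size
        self.free_blocks = free_blocks  # shared pool of physical block ids
        self.block_table = []           # logical block index -> physical block id
        self.tokens = []                # stand-ins for stored KV pairs

    def append(self, token):
        # Allocate a new physical block whenever the last one is full.
        if len(self.tokens) % self.block_size == 0:
            self.block_table.append(self.free_blocks.pop())
        self.tokens.append(token)

    def compact(self, keep_positions):
        """Keep only the given positions and return surplus blocks to the pool."""
        self.tokens = [self.tokens[p] for p in sorted(keep_positions)]
        needed = -(-len(self.tokens) // self.block_size)  # ceiling division
        while len(self.block_table) > needed:
            self.free_blocks.append(self.block_table.pop())
        return needed


pool = list(range(8))
seq = PagedSequence(block_size=4, free_blocks=pool)
for t in range(10):
    seq.append(t)            # 10 tokens occupy 3 blocks, 5 remain free

seq.compact([0, 3, 4, 9])    # 4 retained tokens fit in 1 block
print(len(seq.block_table), len(pool))  # 1 7
```

Memory is only reclaimed once retained entries are packed densely enough to free whole blocks. A real system would also have to move the retained key and value tensors into the surviving blocks, which this sketch omits.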

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • May enable serving larger batch sizes or longer contexts on the same hardware.
  • Could reduce energy consumption in large-scale LLM inference deployments.
  • Potential to extend to other attention-based models beyond standard transformers.
  • Testing on a wider range of reasoning benchmarks might reveal task-specific benefits.

Load-bearing premise

Compressing only the recently generated context in a sliding window avoids significant information loss for future decoding steps.

What would settle it

Running Kara on a long CoT reasoning task and observing a substantial drop in final answer accuracy compared to the full KV cache baseline.
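A minimal harness for that comparison might look like the sketch below. It assumes per-problem correctness flags are already available for the full-cache baseline and for the compressed run on the same problems. The function name `compare_runs` and the toy flags are placeholders, not results from the paper.

```python
def accuracy(flags):
    return sum(flags) / len(flags)


def compare_runs(baseline_correct, compressed_correct):
    """Paired comparison of two runs over the same problems."""
    if len(baseline_correct) != len(compressed_correct):
        raise ValueError("runs must cover the same problems")
    pairs = list(zip(baseline_correct, compressed_correct))
    return {
        "baseline_acc": accuracy(baseline_correct),
        "compressed_acc": accuracy(compressed_correct),
        "drop": accuracy(baseline_correct) - accuracy(compressed_correct),
        # Problems solved with the full cache but not with compression.
        "lost": sum(b and not c for b, c in pairs),
        # Problems solved only with compression (noise or regularisation).
        "gained": sum(c and not b for b, c in pairs),
    }


# Placeholder flags, made up for illustration.
baseline = [True, True, True, False, True, True, False, True]
compressed = [True, False, True, False, True, True, True, False]
report = compare_runs(baseline, compressed)
print(report["drop"], report["lost"], report["gained"])  # 0.125 2 1
```

Reporting the lost and gained counts alongside the aggregate drop helps separate a systematic loss of information from run-to-run sampling noise.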

Figures

Figures reproduced from arXiv: 2607.01237 by Shen Han, Yuyang Wu.

Figure 1: (a) Average throughput under different batch sizes, where batch size denotes the predefined maximum decoding sequences. We observe that vLLM with SnapKV achieves lower throughput than vanilla vLLM as the batch size increases. (b) The actual number of decoding sequences varies with decoding steps. The decoding step denotes the number of global decoding iterations and predefined maximum decoding sequences i…
Figure 3: (a) Distribution comparison between causal attention and bidirectional attention of a specific token. (b) Average causal attention weight versus bidirectional attention percentile. We first compute the bidirectional attention weights for all token pairs (x_i, x_j) with j > i, using x_i as the query and x_j as the key. We then sort these weights in ascending order and group the pairs by percentiles. Finally,…
Figure 5: Performance of different KV cache compression methods across varying retention levels. The dashed line represents the accuracy of the vanilla LLM without compression.
Figure 6: Needle-In-A-Haystack (NIAH) performance. The x-axis denotes the input context length and the y-axis denotes the needle insertion depth. Each cell reports the retrieval score for the corresponding (length, depth) setting, and we also report the mean accuracy averaged over all cells.
… |W| ∈ {256, 384, 512} and set the buffer length |U| ∈ {32, 64}. For Token2Chunk, we use a fixed maximum chunk size γ of 8 and …
Original abstract

Reasoning language models often generate long chain-of-thought (CoT), which accumulates a massive KV cache during the decoding phase and incurs high decoding latency and limited throughput. To address these issues, KV cache compression has emerged as a promising technique for reducing memory overhead by selectively removing unimportant KV pairs while preserving useful ones for subsequent decoding. Nevertheless, we identify two key limitations in existing KV cache compression methods: 1) their threshold-triggered compression policy may provide limited throughput improvement or even reduce throughput, and may fully eliminate KV pairs from certain blocks of the sequence, potentially worsening information loss. 2) they typically retain either isolated KV pairs or fixed-size chunks with rigid boundaries, failing to preserve important flexible-sized chunks at arbitrary token positions. To overcome these limitations, we propose Kara, a sliding-window KV cache compression method that performs decoding-time compression by operating only on the recently generated context. Kara leverages bidirectional attention to score and select informative KV pairs in the window. To enable flexible preservation of important semantic information, we design a Token2Chunk module to expand a subset of selected KV pairs into chunks. Furthermore, we adapt Kara to PagedAttention and develop KvLLM, an inference framework built upon vLLM, which reduces KV cache memory usage and effectively improves output throughput. Extensive experiments demonstrate consistent performance improvements of proposed Kara and KvLLM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper identifies two limitations in prior KV cache compression for long-CoT reasoning LLMs (threshold-triggered policies that can hurt throughput or cause block-level KV elimination, and rigid isolated-pair or fixed-chunk retention) and proposes Kara: a sliding-window method that compresses only the recently generated context using bidirectional attention for KV scoring, a Token2Chunk module to expand selected pairs into flexible semantic chunks, and an adaptation to PagedAttention inside the KvLLM framework built on vLLM. It claims that this yields reduced KV memory usage and consistent performance improvements.

Significance. If the empirical claims hold, Kara would provide a practical, low-overhead route to higher throughput and lower memory for serving reasoning models whose CoT traces exceed typical context windows, directly addressing a growing deployment bottleneck.

major comments (3)
  1. [Abstract] The central claim of 'consistent performance improvements' and 'effectively improves output throughput' is asserted without any quantitative results, baselines, error bars, or experimental protocol; this absence makes it impossible to assess whether the sliding-window restriction actually supports the claim or whether the identified limitations are resolved.
  2. [Abstract / method description] The design rests on the unexamined assumption that irreversible compression decisions made inside one sliding window on recent tokens will not produce cumulative information loss for later decoding steps that depend on earlier windows; no analysis, ablation, or long-CoT experiment tests cross-window dependency preservation (cf. the weakest assumption in the stress-test note).
  3. [Abstract] No equations, pseudocode, or complexity analysis are supplied for the bidirectional scoring, Token2Chunk expansion, or the PagedAttention integration, leaving open whether the claimed throughput gains are parameter-free or require additional tuning.
minor comments (2)
  1. [Abstract] The two limitations are stated clearly but the manuscript never returns to them with a direct head-to-head comparison showing how Kara avoids each failure mode.
  2. [Abstract] Terminology such as 'Token2Chunk module' and 'KvLLM' is introduced without a forward reference to the section that defines their implementation details.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, indicating the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Abstract] The central claim of 'consistent performance improvements' and 'effectively improves output throughput' is asserted without any quantitative results, baselines, error bars, or experimental protocol; this absence makes it impossible to assess whether the sliding-window restriction actually supports the claim or whether the identified limitations are resolved.

    Authors: We agree that the abstract would benefit from quantitative support. In the revised version, we will add specific metrics from our experiments (e.g., throughput gains and KV cache reductions relative to baselines), along with a brief reference to the evaluation protocol and models used. revision: yes

  2. Referee: [Abstract / method description] The design rests on the unexamined assumption that irreversible compression decisions made inside one sliding window on recent tokens will not produce cumulative information loss for later decoding steps that depend on earlier windows; no analysis, ablation, or long-CoT experiment tests cross-window dependency preservation (cf. the weakest assumption in the stress-test note).

    Authors: The sliding-window design with Token2Chunk is intended to limit cumulative loss by preserving semantic chunks from recent context. We will add a dedicated discussion of cross-window dependency preservation in Section 3 and include an ablation on long-CoT traces spanning multiple windows to empirically test this aspect. revision: yes

  3. Referee: [Abstract] No equations, pseudocode, or complexity analysis are supplied for the bidirectional scoring, Token2Chunk expansion, or the PagedAttention integration, leaving open whether the claimed throughput gains are parameter-free or require additional tuning.

    Authors: The full equations, pseudocode, and complexity analysis appear in Section 3 and the appendix. We will revise the abstract to briefly reference these components and state that the method operates without additional hyperparameters beyond the window size. revision: yes

Circularity Check

0 steps flagged

No circularity; purely algorithmic proposal without derivations or self-referential reductions

full rationale

The paper describes an engineering method (sliding-window KV cache compression with bidirectional scoring and Token2Chunk) to address stated limitations in prior KV compression techniques. No equations, parameter fits, uniqueness theorems, or self-citations appear in the provided text that would reduce any claim to its own inputs by construction. The central contribution is a new procedure whose performance is evaluated externally rather than derived from fitted values or prior author work invoked as axiomatic. This is the common case of a self-contained algorithmic paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Insufficient information in the abstract to identify free parameters, axioms, or invented entities; no mathematical derivations or modeling assumptions are detailed.


A Detailed Experimental Settings for Figure 1 and 2

This section provides detailed settings for the simple experiment shown in Figure 1 and 2. We run all experiments on the MATH-500 dataset using DeepSeek-R1-Distill-LLaMA-8B. In Figure 1, we vary the maximum nu...