arxiv: 2605.23200 · v1 · pith:XXJXG3KJnew · submitted 2026-05-22 · 💻 cs.LG · cs.AI

Adaptive Mass-Segmented KV Compression for Long-Context Reasoning

Pith reviewed 2026-05-25 05:20 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords KV cache compression · long-context reasoning · attention mass segmentation · region-aware quota allocation · token eviction · LLM inference · structural fragmentation · plug-and-play compression

The pith

Adaptive Mass-Segmented KV compression gives guaranteed memory quotas to attention-rich regions instead of letting global Top-k evict whole reasoning blocks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard KV compression picks the globally highest-scoring tokens and thereby wipes out entire contiguous segments that carry logical structure. AMS partitions the cache according to the spatial layout of attention mass and assigns each resulting region a protected quota. An EMA smoother keeps the boundaries stable across decoding steps. The method works as a plug-in layer on top of existing scorers and adds no steady-state attention cost inside paged serving systems. Experiments on math, code, QA and retrieval tasks show consistent gains once the region wipe-out problem is removed.
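The contrast between global Top-k and per-region quotas can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: the function names `global_topk` and `quota_topk`, the fixed segment boundaries, and the equal quotas are all assumptions made for illustration.

```python
def global_topk(scores, k):
    """Keep the k highest-scoring positions, wherever they sit in the cache."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


def quota_topk(scores, segments, quotas):
    """Keep the top-q positions inside each (start, end) segment (end exclusive)."""
    kept = []
    for (start, end), q in zip(segments, quotas):
        local = sorted(range(start, end), key=lambda i: (-scores[i], i))
        kept.extend(local[:q])
    return sorted(kept)


# Toy cache of 12 positions. The middle block (positions 4..7) scores lower
# than the prompt head and the recent tail, as a mid-context reasoning step might.
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.35, 0.3, 0.25, 0.6, 0.7, 0.8, 0.9]
segments = [(0, 4), (4, 8), (8, 12)]  # hypothetical region boundaries
quotas = [2, 2, 2]                     # hypothetical budgets, summing to k = 6

kept_global = global_topk(scores, 6)                # [0, 1, 2, 9, 10, 11]
kept_quota = quota_topk(scores, segments, quotas)   # [0, 1, 4, 5, 10, 11]
```

With the same budget of six tokens, the global selection keeps nothing from positions 4 to 7, while the quota-based selection retains two tokens from that block.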

Core claim

By replacing token-level global Top-k selection with region-aware quota allocation driven by the spatial distribution of attention mass, AMS prevents the eviction of structurally vital reasoning segments. It adds EMA-based boundary smoothing for stable iterative decoding, stays orthogonal to the underlying importance scorer, and is compatible with paged-KV frameworks.

What carries the argument

Adaptive Mass-Segmented (AMS) KV Compression framework that partitions the KV cache according to the spatial distribution of attention mass and enforces guaranteed per-region memory quotas.
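One way to picture mass-driven partitioning is an equal-mass rule: close a segment each time the cumulative mass crosses the next quantile. The rule below is an assumption chosen for illustration; this page does not state the paper's exact segmentation procedure, and `mass_segments` is a hypothetical name.

```python
def mass_segments(mass, num_segments):
    """Split positions into contiguous segments holding roughly equal mass.

    Returns a list of (start, end) pairs with end exclusive. Fewer than
    num_segments pairs come back when one position spans several quantiles.
    """
    total = sum(mass)
    cuts = [0]
    running = 0
    next_q = 1
    for t, m in enumerate(mass):
        running += m
        # Cross-multiplied comparison of running / total against next_q / num_segments.
        while next_q < num_segments and running * num_segments >= next_q * total:
            if t + 1 > cuts[-1]:
                cuts.append(t + 1)
            next_q += 1
    if cuts[-1] != len(mass):
        cuts.append(len(mass))
    return list(zip(cuts[:-1], cuts[1:]))


# Unnormalized toy mass with a heavy middle region (positions 4..7).
toy_mass = [1, 1, 1, 1, 8, 8, 8, 8, 1, 1, 1, 1]
toy_segments = mass_segments(toy_mass, 4)  # [(0, 5), (5, 6), (6, 8), (8, 12)]
```

Under this rule the heavy middle region is cut into shorter segments than the light edges, which matches the qualitative behavior described for high-mass regions in the Figure 3 caption.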

If this is right

  • Preserves logical coherence by protecting contiguous reasoning blocks from eviction
  • Raises accuracy on MATH500, AIME, GSM8K, code completion, open-domain QA and sparse retrieval
  • Integrates without modification into TOVA, Expected Attention, KeyDiff, R-KV and TriAttention
  • Runs inside vLLM-style paged-KV serving with gather-and-compact execution and zero added attention overhead
  • Remains stable across iterative decoding steps through EMA boundary smoothing
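The boundary-smoothing idea in the last bullet can be sketched as an exponential moving average over boundary positions. This is an assumption about the mechanism: the page does not say whether the paper smooths boundaries directly or the underlying mass signal, and the helper `ema_smooth`, its `alpha` parameter, and the premise that boundary indices are comparable across steps are all hypothetical.

```python
def ema_smooth(prev_boundaries, new_boundaries, alpha=0.5):
    """Blend the previous boundaries with freshly computed ones.

    alpha close to 0 trusts the history (stable boundaries), alpha close
    to 1 trusts the newest estimate (responsive boundaries).
    """
    return [
        round((1 - alpha) * prev + alpha * new)
        for prev, new in zip(prev_boundaries, new_boundaries)
    ]


prev = [100, 200, 300]
new = [110, 180, 300]
smoothed = ema_smooth(prev, new, alpha=0.3)  # [103, 194, 300]
```

Because each output is a convex combination of two sorted lists, the smoothed boundaries stay in non-decreasing order.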

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mass-based segmentation could be applied to activation compression or weight pruning where contiguous structure also matters
  • Ablating the EMA smoother on very long sequences would test whether boundary jitter becomes the next bottleneck
  • Extending the quota mechanism to multi-turn dialogues might reveal whether attention mass still tracks evolving logical units
  • Comparing AMS against purely length-based segmentation would isolate how much the attention-mass signal contributes beyond simple locality

Load-bearing premise

The spatial distribution of attention mass reliably identifies structurally vital reasoning segments that deserve guaranteed memory quotas.

What would settle it

Run the same long-context reasoning task twice: once with attention mass left as computed by the model and once with attention mass randomly reassigned across segments; if AMS stops improving accuracy in the randomized case, the mass-to-importance correlation is the load-bearing assumption.
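The randomized control could be set up roughly as follows. This is a sketch under stated assumptions: segment totals are permuted across segments and spread uniformly inside each one, which is only one possible reading of "randomly reassigned"; `reassign_segment_mass` is a hypothetical helper and the segments are assumed to cover every position.

```python
import random


def reassign_segment_mass(mass, segments, seed=0):
    """Permute per-segment mass totals and spread each uniformly in its new segment."""
    totals = [sum(mass[start:end]) for start, end in segments]
    shuffled = totals[:]
    random.Random(seed).shuffle(shuffled)
    out = [0.0] * len(mass)
    for (start, end), total in zip(segments, shuffled):
        for t in range(start, end):
            out[t] = total / (end - start)
    return out


control_mass = [0.1, 0.1, 0.3, 0.3, 0.1, 0.1]
control_segments = [(0, 2), (2, 4), (4, 6)]
randomized = reassign_segment_mass(control_mass, control_segments, seed=1)
```

The total mass and the set of segment totals are preserved, so any accuracy change between the two runs comes from where the mass sits rather than how much there is.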

Figures

Figures reproduced from arXiv: 2605.23200 by Junzhe Yang, Xiaoyu Shen.

Figure 1: Overview of AMS for decoding-time KV compression.
Figure 2: Motivating dropped-token burst example. Purple pixels denote dropped tokens. The full-sequence view highlights a local region where TOVA forms a dense contiguous dropped-token burst. The zoomed view shows the same token window across compression rounds: AMS-TOVA fragments the dropped positions in the selected round, illustrating a local failure mode that motivates adaptive segment-wise allocation.
Figure 3: Quality mass and adaptive segmentation. The solid curve shows normalized quality mass over current KV-cache positions, not absolute generation positions. Shaded bands and dashed lines denote adaptive segments, and teal ticks mark retained KV positions. High-mass regions form finer segments under a fixed $T_{\text{keep}}$.
Consider a single KV head with mass vector $m \in \mathbb{R}^{T}$ satisfying $\sum_{t=1}^{T} m_t = 1$. We first compute the…
Figure 4: Mechanistic insights on MATH500. (a) TOVA under-retains the middle portion of the reasoning context, while AMS improves middle-context coverage through segment-wise quotas. (b) Repetition collapse increases with problem difficulty under token-wise eviction; AMS suppresses this degradation.
Figure 5: Temporal stability of retained context. AMS consistently achieves higher temporal retained-set IoU than TOVA across transformer layers and mathematical sub-tasks. For the token-wise TOVA baseline, consecutive retained tokens are grouped as proxy segments for direct comparison.
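A temporal retained-set IoU of the kind plotted in Figure 5 could be computed along these lines. The exact definition used in the paper is not given on this page, so this sketch assumes token-level sets of original position indices compared across consecutive compression rounds; `retained_iou` and `mean_temporal_iou` are hypothetical names.

```python
def retained_iou(prev_kept, curr_kept):
    """Intersection-over-union of two retained-position sets."""
    a, b = set(prev_kept), set(curr_kept)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_temporal_iou(kept_per_round):
    """Average IoU between each pair of consecutive compression rounds."""
    pairs = list(zip(kept_per_round, kept_per_round[1:]))
    return sum(retained_iou(p, c) for p, c in pairs) / len(pairs)


rounds = [[0, 1, 2, 3], [0, 1, 2, 4], [0, 1, 4, 5]]
stability = mean_temporal_iou(rounds)  # each consecutive pair shares 3 of 5 positions
```

A higher value means the retained set changes less from one round to the next.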
Original abstract

The linear growth of the Key-Value (KV) cache is a critical bottleneck in long-form LLM inference. Existing KV compression methods mitigate this by evicting tokens based on importance scores. However, we show that their reliance on global Top-k selection triggers Region Wipe-out: the severe eviction of contiguous reasoning blocks that derails logical coherence. To address this, we propose Adaptive Mass-Segmented (AMS) KV Compression, a framework that shifts the paradigm from token-level competition to region-aware quota allocation. AMS adaptively partitions the KV cache based on the spatial distribution of attention mass, ensuring structurally vital reasoning segments receive guaranteed memory quotas. To ensure stability during iterative decoding, an EMA-based smoothing mechanism is incorporated to prevent jitter in segment boundaries. Crucially, AMS is a universal plug-and-play layer that is orthogonal to existing scorers. It can be seamlessly integrated into representative methods such as TOVA, Expected Attention, KeyDiff, R-KV and TriAttention. AMS is also system-compatible with modern paged-KV serving frameworks such as vLLM, supporting efficient gather-and-compact KV execution without introducing additional steady-state attention overhead. Extensive experiments across a diverse suite of tasks, including mathematical reasoning (MATH500, AIME, GSM8K), code completion, open-domain QA, and sparse retrieval, demonstrate that AMS consistently mitigates structural fragmentation and boosts model performance.
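The gather-and-compact step mentioned in the abstract can be pictured with a toy cache that keeps the physical KV length separate from the logical decoding position. This is an illustrative sketch only: `ToyKVCache` is a hypothetical class using plain Python lists, and it does not reflect vLLM's block tables or the authors' GPU copy kernel.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCache:
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)
    logical_pos: int = 0  # position id the next generated token will receive

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        self.logical_pos += 1

    def compact(self, keep_indices):
        """Gather the kept entries into a dense cache, preserving order."""
        kept = sorted(keep_indices)
        self.keys = [self.keys[i] for i in kept]
        self.values = [self.values[i] for i in kept]
        # logical_pos is deliberately left alone: eviction shortens storage,
        # not the autoregressive position of the sequence.


cache = ToyKVCache()
for t in range(8):
    cache.append(f"k{t}", f"v{t}")
cache.compact([0, 1, 6, 7])   # physical length drops from 8 to 4
cache.append("k8", "v8")      # the new token still takes position 8
```

After compaction the cache holds five entries while `logical_pos` is 9, so later tokens continue from their original positions.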

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that existing token-level Top-k KV eviction methods suffer from Region Wipe-out, where contiguous reasoning blocks are evicted and logical coherence is lost. It proposes Adaptive Mass-Segmented (AMS) KV Compression, which partitions the KV cache according to the spatial distribution of attention mass to allocate guaranteed quotas to structurally vital segments, adds EMA-based smoothing to stabilize segment boundaries during decoding, and is presented as a plug-and-play, orthogonal layer compatible with scorers such as TOVA, Expected Attention, KeyDiff, R-KV and TriAttention as well as paged-KV systems like vLLM. Experiments on mathematical reasoning (MATH500, AIME, GSM8K), code completion, open-domain QA and sparse retrieval are stated to show consistent mitigation of fragmentation and performance gains.

Significance. If the empirical results hold, AMS could offer a practical, low-overhead way to preserve structural coherence in long-context reasoning without replacing existing importance scorers, with the claimed system compatibility providing an additional deployment advantage.

major comments (1)
  1. [Abstract, framework description paragraph] The central claim that attention-mass spatial distribution reliably identifies structurally vital reasoning segments deserving guaranteed quotas is load-bearing, yet the manuscript provides no direct validation (e.g., correlation with logical importance, ablation against positional/recency biases, or counter-example analysis). If the proxy is weak or task-dependent, the region-aware allocation reduces to a smoothed variant of prior scorers and the claimed structural protection does not follow.
minor comments (1)
  1. [Abstract] The statement that 'extensive experiments demonstrate consistent mitigation and performance gains' is not accompanied by any quantitative results, error bars, baseline tables, or statistical details, making the strength of the empirical support difficult to assess from the provided text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment regarding validation of the attention-mass proxy. We address the concern directly below and commit to strengthening the manuscript with additional analysis.

point-by-point responses
  1. Referee: [Abstract, framework description paragraph] The central claim that attention-mass spatial distribution reliably identifies structurally vital reasoning segments deserving guaranteed quotas is load-bearing, yet the manuscript provides no direct validation (e.g., correlation with logical importance, ablation against positional/recency biases, or counter-example analysis). If the proxy is weak or task-dependent, the region-aware allocation reduces to a smoothed variant of prior scorers and the claimed structural protection does not follow.

    Authors: We agree that direct validation of the attention-mass spatial distribution as a proxy for structurally vital segments would strengthen the central claim. The current manuscript relies on indirect evidence: consistent performance improvements when AMS is combined with multiple independent scorers (TOVA, Expected Attention, KeyDiff, R-KV, TriAttention) across mathematical reasoning, code, and QA tasks, together with the orthogonality results showing gains beyond any single scorer. These outcomes are difficult to explain if AMS were merely a smoothed Top-k variant. Nevertheless, we acknowledge the absence of explicit correlation studies or bias ablations. In the revision we will add (i) a quantitative correlation between detected segment boundaries and logical step transitions in MATH problems, (ii) an ablation replacing mass-based partitioning with positional or recency-based alternatives, and (iii) selected counter-example traces. These additions will clarify the proxy's reliability and rule out reduction to prior smoothing techniques. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical plug-and-play method with no self-referential derivations

full rationale

The paper describes AMS as an empirical framework that partitions the KV cache by attention mass distribution and integrates orthogonally with existing scorers, validated on external benchmarks such as MATH500 and GSM8K. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the text. The central claims rest on experimental results rather than on any derivation that reduces to its own inputs by construction. Because those results are measured against external tasks, no step of the argument is circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that attention mass distribution can be used to identify and protect reasoning segments; no free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: The spatial distribution of attention mass identifies structurally vital reasoning segments that merit guaranteed memory quotas
    This premise is invoked to justify the shift from global Top-k to region-aware allocation and is required for the performance claims to follow.

pith-pipeline@v0.9.0 · 5772 in / 1224 out tokens · 21275 ms · 2026-05-25T05:20:42.306367+00:00 · methodology

