Adaptive Mass-Segmented KV Compression for Long-Context Reasoning

Junzhe Yang; Xiaoyu Shen

arxiv: 2605.23200 · v1 · pith:XXJXG3KJnew · submitted 2026-05-22 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-25 05:20 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords KV cache compression · long-context reasoning · attention mass segmentation · region-aware quota allocation · token eviction · LLM inference · structural fragmentation · plug-and-play compression

The pith

Adaptive Mass-Segmented KV compression gives guaranteed memory quotas to attention-rich regions instead of letting global Top-k evict whole reasoning blocks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard KV compression picks the globally highest-scoring tokens and thereby wipes out entire contiguous segments that carry logical structure. AMS partitions the cache according to the spatial layout of attention mass and assigns each resulting region a protected quota. An EMA smoother keeps the boundaries stable across decoding steps. The method works as a plug-in layer on top of existing scorers and adds no steady-state attention cost inside paged serving systems. Experiments on math, code, QA and retrieval tasks show consistent gains once the region wipe-out problem is removed.
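
The contrast between global Top-k and per-region quotas can be illustrated with a toy example. This is a minimal sketch, not the paper's algorithm: the function names, the fixed region boundaries, and the equal quotas are all hypothetical, and the scores stand in for whatever importance scorer sits underneath.

```python
from typing import List


def global_topk(scores: List[float], budget: int) -> List[int]:
    """Keep the `budget` highest-scoring positions, ignoring where they sit."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:budget])


def region_quota_keep(
    scores: List[float], boundaries: List[int], quotas: List[int]
) -> List[int]:
    """Keep the top `quotas[r]` positions inside each region
    [boundaries[r], boundaries[r + 1]). Boundaries and quotas are assumed inputs."""
    kept: List[int] = []
    for r, quota in enumerate(quotas):
        start, end = boundaries[r], boundaries[r + 1]
        region = sorted(range(start, end), key=lambda i: scores[i], reverse=True)
        kept.extend(region[:quota])
    return sorted(kept)


# Toy scores: the middle region (positions 3..5) scores low everywhere.
scores = [0.9, 0.8, 0.85, 0.1, 0.2, 0.15, 0.7, 0.95, 0.75]
boundaries = [0, 3, 6, 9]

print(global_topk(scores, 6))                            # [0, 1, 2, 6, 7, 8]
print(region_quota_keep(scores, boundaries, [2, 2, 2]))  # [0, 2, 4, 5, 7, 8]
```

With the same budget of six tokens, the global selection drops the whole middle region, while the quota version keeps two tokens from it. Whether those middle tokens matter is exactly the premise the page questions further down.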

Core claim

By replacing token-level global Top-k selection with region-aware quota allocation driven by the spatial distribution of attention mass, AMS prevents the eviction of structurally vital reasoning segments, incorporates EMA-based boundary smoothing for stable iterative decoding, and remains orthogonal to any underlying importance scorer while remaining compatible with paged-KV frameworks.

What carries the argument

Adaptive Mass-Segmented (AMS) KV Compression framework that partitions the KV cache according to the spatial distribution of attention mass and enforces guaranteed per-region memory quotas.

If this is right

Preserves logical coherence by protecting contiguous reasoning blocks from eviction
Raises accuracy on MATH500, AIME, GSM8K, code completion, open-domain QA and sparse retrieval
Integrates without modification into TOVA, Expected Attention, KeyDiff, R-KV and TriAttention
Runs inside vLLM-style paged-KV serving with gather-and-compact execution and zero added attention overhead
Remains stable across iterative decoding steps through EMA boundary smoothing
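
The EMA boundary smoothing named in the last point can be sketched as follows. The abstract only says that an EMA-based mechanism prevents jitter in segment boundaries; applying the average directly to boundary positions, the `alpha` value, and the function name are assumptions made for illustration.

```python
from typing import List


def ema_smooth_boundaries(
    prev: List[float], new: List[float], alpha: float = 0.5
) -> List[float]:
    """Blend freshly computed boundaries into the previous ones:
    smoothed = (1 - alpha) * prev + alpha * new, element by element."""
    if len(prev) != len(new):
        raise ValueError("boundary lists must have the same length")
    return [(1.0 - alpha) * p + alpha * n for p, n in zip(prev, new)]


prev_boundaries = [0.0, 100.0, 200.0]
new_boundaries = [0.0, 120.0, 190.0]
print(ema_smooth_boundaries(prev_boundaries, new_boundaries))  # [0.0, 110.0, 195.0]
```

A smaller `alpha` makes boundaries move more slowly between compression rounds. A real implementation would also need to round the smoothed values back to integer cache positions and handle a changing number of segments, which this sketch leaves out.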

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same mass-based segmentation could be applied to activation compression or weight pruning where contiguous structure also matters
Ablating the EMA smoother on very long sequences would test whether boundary jitter becomes the next bottleneck
Extending the quota mechanism to multi-turn dialogues might reveal whether attention mass still tracks evolving logical units
Comparing AMS against purely length-based segmentation would isolate how much the attention-mass signal contributes beyond simple locality

Load-bearing premise

The spatial distribution of attention mass reliably identifies structurally vital reasoning segments that deserve guaranteed memory quotas.

What would settle it

Run the same long-context reasoning task twice: once with attention mass left as computed by the model and once with attention mass randomly reassigned across segments; if AMS stops improving accuracy in the randomized case, the mass-to-importance correlation is the load-bearing assumption.
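
One way to set up the randomized control described above is to sum the mass per segment and then permute those sums across segments before quotas are derived. This is a hedged sketch: the helper names and the choice to shuffle whole-segment masses (rather than per-token mass) are assumptions, not a protocol taken from the paper.

```python
import random
from typing import List


def segment_masses(mass: List[float], boundaries: List[int]) -> List[float]:
    """Total mass inside each segment [boundaries[r], boundaries[r + 1])."""
    return [
        sum(mass[boundaries[r]:boundaries[r + 1]])
        for r in range(len(boundaries) - 1)
    ]


def shuffle_segment_masses(masses: List[float], seed: int) -> List[float]:
    """Randomly reassign segment masses; the multiset of values is unchanged."""
    rng = random.Random(seed)
    shuffled = list(masses)
    rng.shuffle(shuffled)
    return shuffled


mass = [0.1, 0.1, 0.3, 0.3, 0.1, 0.1]
boundaries = [0, 2, 4, 6]
true_masses = segment_masses(mass, boundaries)
control_masses = shuffle_segment_masses(true_masses, seed=0)
print(true_masses, control_masses)
```

Because the shuffle preserves the set of mass values, any accuracy difference between the two runs would come from where the mass sits, not from how much of it there is.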

Figures

Figures reproduced from arXiv: 2605.23200 by Junzhe Yang, Xiaoyu Shen.

**Figure 2.** Motivating dropped-token burst example. Purple pixels denote dropped tokens. The full-sequence view highlights a local region where TOVA forms a dense contiguous dropped-token burst. The zoomed view shows the same token window across compression rounds: AMS-TOVA fragments the dropped positions in the selected round, illustrating a local failure mode that motivates adaptive segment-wise allocation.
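
A dropped-token burst like the one in this figure can be quantified as the longest run of consecutive evicted positions. The sketch below is illustrative only; the function name and the use of maximum run length as the metric are assumptions, not the paper's measurement.

```python
from typing import Iterable


def longest_dropped_burst(kept: Iterable[int], total: int) -> int:
    """Length of the longest contiguous run of positions in range(total)
    that are absent from `kept`."""
    kept_set = set(kept)
    longest = current = 0
    for pos in range(total):
        if pos in kept_set:
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


# Same budget, different layouts of the retained positions.
print(longest_dropped_burst([0, 1, 2, 6, 7, 8], total=9))  # 3
print(longest_dropped_burst([0, 2, 4, 5, 7, 8], total=9))  # 1
```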

**Figure 3.** Quality mass and adaptive segmentation. The solid curve shows normalized quality mass over current KV-cache positions, not absolute generation positions. Shaded bands and dashed lines denote adaptive segments, and teal ticks mark retained KV positions. High-mass regions form finer segments under a fixed $T_{\text{keep}}$.

Consider a single KV head with mass vector $m \in \mathbb{R}^{T}$ satisfying $\sum_{t=1}^{T} m_t = 1$. We first compute the…
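
The caption's remark that high-mass regions form finer segments is what an equal-mass partition would produce. The paper's actual segmentation rule is cut off in the extracted text, so the following is only a sketch under that assumption: boundaries are placed wherever the cumulative mass crosses the next quantile, and the function name is hypothetical.

```python
from typing import List


def equal_mass_boundaries(mass: List[float], num_segments: int) -> List[int]:
    """Split positions 0..T-1 so each segment holds roughly 1/num_segments of
    the total mass. Returns boundaries b with segments [b[i], b[i + 1])."""
    total = sum(mass)
    boundaries = [0]
    cumulative = 0.0
    next_cut = 1
    for t, m in enumerate(mass):
        cumulative += m
        # Close a segment each time cumulative mass crosses the next quantile.
        while (
            next_cut < num_segments
            and cumulative >= total * next_cut / num_segments - 1e-12
        ):
            if boundaries[-1] != t + 1:  # avoid empty segments
                boundaries.append(t + 1)
            next_cut += 1
    if boundaries[-1] != len(mass):
        boundaries.append(len(mass))
    return boundaries


# Low mass on the first four positions, high mass on the last four.
mass = [0.05, 0.05, 0.05, 0.05, 0.2, 0.2, 0.2, 0.2]
print(equal_mass_boundaries(mass, num_segments=4))  # [0, 5, 6, 7, 8]
```

Here the low-mass prefix ends up in one long segment while the high-mass suffix is cut into single-token segments, which matches the qualitative behaviour the caption describes.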

**Figure 4.** Mechanistic insights on MATH500. (a) TOVA under-retains the middle portion of the reasoning context, while AMS improves middle-context coverage through segment-wise quotas. (b) Repetition collapse increases with problem difficulty under token-wise eviction; AMS suppresses this degradation.

[Plot text extracted alongside this caption: x-axis "Transformer Layer"; y-axis "Temporal IoU (Higher is more stable)"; series "TOVA (Token-wise)" and "AMS …".]

**Figure 5.** Temporal stability of retained context. AMS consistently achieves higher temporal retained-set IoU than TOVA across transformer layers and mathematical sub-tasks. For the token-wise TOVA baseline, consecutive retained tokens are grouped as proxy segments for direct comparison.
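
Temporal retained-set IoU can be read as the Jaccard overlap between the sets of positions kept in consecutive compression rounds. The sketch below assumes positions are expressed in one shared coordinate system (for example absolute token indices) and averages over adjacent rounds; both choices and the function names are assumptions, since the paper's exact definition is not reproduced on this page.

```python
from typing import Iterable, List, Sequence


def retained_iou(a: Iterable[int], b: Iterable[int]) -> float:
    """Intersection-over-union of two retained-position sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty sets are treated as identical
    return len(set_a & set_b) / len(union)


def mean_temporal_iou(rounds: Sequence[List[int]]) -> float:
    """Average IoU between each pair of consecutive rounds."""
    pairs = list(zip(rounds, rounds[1:]))
    if not pairs:
        raise ValueError("need at least two rounds")
    return sum(retained_iou(a, b) for a, b in pairs) / len(pairs)


rounds = [[0, 1, 2, 3], [0, 1, 2, 3], [2, 3, 4, 5]]
print(mean_temporal_iou(rounds))  # (1.0 + 1/3) / 2, about 0.667
```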

Original abstract

The linear growth of the Key-Value (KV) cache is a critical bottleneck in long-form LLM inference. Existing KV compression methods mitigate this by evicting tokens based on importance scores. However, we show that their reliance on global Top-k selection triggers Region Wipe-out: the severe eviction of contiguous reasoning blocks that derails logical coherence. To address this, we propose Adaptive Mass-Segmented (AMS) KV Compression, a framework that shifts the paradigm from token-level competition to region-aware quota allocation. AMS adaptively partitions the KV cache based on the spatial distribution of attention mass, ensuring structurally vital reasoning segments receive guaranteed memory quotas. To ensure stability during iterative decoding, an EMA-based smoothing mechanism is incorporated to prevent jitter in segment boundaries. Crucially, AMS is a universal plug-and-play layer that is orthogonal to existing scorers. It can be seamlessly integrated into representative methods such as TOVA, Expected Attention, KeyDiff, R-KV and TriAttention. AMS is also system-compatible with modern paged-KV serving frameworks such as vLLM, supporting efficient gather-and-compact KV execution without introducing additional steady-state attention overhead. Extensive experiments across a diverse suite of tasks, including mathematical reasoning (MATH500, AIME, GSM8K), code completion, open-domain QA, and sparse retrieval, demonstrate that AMS consistently mitigates structural fragmentation and boosts model performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AMS proposes region-aware KV quota allocation via attention mass to avoid wiping out reasoning blocks, but the abstract shows no numbers so the gains and the proxy assumption stay untested.

The main point is that this paper identifies Region Wipe-out as a failure mode in global top-k KV eviction and offers Adaptive Mass-Segmented compression to fix it by partitioning the cache according to attention mass distribution and giving guaranteed quotas to those segments, plus EMA smoothing to keep boundaries stable across decoding steps. It positions the approach as a universal layer that sits on top of existing scorers like TOVA or Expected Attention and works with vLLM-style paged serving. That framing is useful because it turns a token-level competition problem into a region-level allocation one, which could matter for preserving logical structure in math, code, and QA tasks. The paper does a clean job naming the problem and claiming orthogonality plus system compatibility without adding steady-state overhead.

The soft spots are straightforward. The abstract states that experiments show consistent mitigation and performance lifts across MATH500, AIME, GSM8K, code completion, open QA, and retrieval, yet it gives zero quantitative results, no baseline deltas, no error bars, and no details on how the tasks were run or what was excluded. Without those, the central claim cannot be evaluated. The assumption that attention mass spatial distribution reliably flags structurally vital reasoning segments is taken as given; if that correlation is weak or task-dependent, the method reduces to a smoothed variant of prior scorers and the claimed structural protection disappears.

This work is for people already working on long-context inference efficiency and KV cache management. A reader in that sub-area would get value from the partitioning idea and the compatibility claims even if the results need verification. It deserves peer review because the problem is real, the proposed fix is concrete and integrable, and the full manuscript can supply the missing evidence.

Referee Report

1 major / 1 minor

Summary. The paper claims that existing token-level Top-k KV eviction methods suffer from Region Wipe-out, where contiguous reasoning blocks are evicted and logical coherence is lost. It proposes Adaptive Mass-Segmented (AMS) KV Compression, which partitions the KV cache according to the spatial distribution of attention mass to allocate guaranteed quotas to structurally vital segments, adds EMA-based smoothing to stabilize segment boundaries during decoding, and is presented as a plug-and-play, orthogonal layer compatible with scorers such as TOVA, Expected Attention, KeyDiff, R-KV and TriAttention as well as paged-KV systems like vLLM. Experiments on mathematical reasoning (MATH500, AIME, GSM8K), code completion, open-domain QA and sparse retrieval are stated to show consistent mitigation of fragmentation and performance gains.

Significance. If the empirical results hold, AMS could offer a practical, low-overhead way to preserve structural coherence in long-context reasoning without replacing existing importance scorers, with the claimed system compatibility providing an additional deployment advantage.

major comments (1)

[Abstract, framework description paragraph] The central claim that attention-mass spatial distribution reliably identifies structurally vital reasoning segments deserving guaranteed quotas is load-bearing, yet the manuscript provides no direct validation (e.g., correlation with logical importance, ablation against positional/recency biases, or counter-example analysis). If the proxy is weak or task-dependent, the region-aware allocation reduces to a smoothed variant of prior scorers and the claimed structural protection does not follow.

minor comments (1)

[Abstract] The statement that 'extensive experiments demonstrate consistent mitigation and performance gains' is not accompanied by any quantitative results, error bars, baseline tables, or statistical details, making the strength of the empirical support difficult to assess from the provided text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment regarding validation of the attention-mass proxy. We address the concern directly below and commit to strengthening the manuscript with additional analysis.

Referee: [Abstract, framework description paragraph] The central claim that attention-mass spatial distribution reliably identifies structurally vital reasoning segments deserving guaranteed quotas is load-bearing, yet the manuscript provides no direct validation (e.g., correlation with logical importance, ablation against positional/recency biases, or counter-example analysis). If the proxy is weak or task-dependent, the region-aware allocation reduces to a smoothed variant of prior scorers and the claimed structural protection does not follow.

Authors: We agree that direct validation of the attention-mass spatial distribution as a proxy for structurally vital segments would strengthen the central claim. The current manuscript relies on indirect evidence: consistent performance improvements when AMS is combined with multiple independent scorers (TOVA, Expected Attention, KeyDiff, R-KV, TriAttention) across mathematical reasoning, code, and QA tasks, together with the orthogonality results showing gains beyond any single scorer. These outcomes are difficult to explain if AMS were merely a smoothed Top-k variant. Nevertheless, we acknowledge the absence of explicit correlation studies or bias ablations. In the revision we will add (i) a quantitative correlation between detected segment boundaries and logical step transitions in MATH problems, (ii) an ablation replacing mass-based partitioning with positional or recency-based alternatives, and (iii) selected counter-example traces. These additions will clarify the proxy's reliability and rule out reduction to prior smoothing techniques.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical plug-and-play method with no self-referential derivations

The paper describes AMS as an empirical framework that partitions KV cache by attention mass distribution and integrates orthogonally with existing scorers, validated on external benchmarks like MATH500 and GSM8K. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the text. The central claims rest on experimental results rather than any derivation that reduces to its own inputs by construction. This is self-contained against external tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that attention mass distribution can be used to identify and protect reasoning segments; no free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: the spatial distribution of attention mass identifies structurally vital reasoning segments that merit guaranteed memory quotas.
This premise is invoked to justify the shift from global Top-k to region-aware allocation and is required for the performance claims to follow.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 14 internal anchors

[1]

H2O: Heavy-hitter oracle for efficient generative inference of large language models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. InAdvances in Neural Information Processing Systems, 2023. URLhttps://openreview.net/forum?id=RkRrPp7GKO

work page 2023
[2]

Efficient Memory Management for Large Language Model Serving with PagedAttention , booktitle =

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th ACM Symposium on Operating Systems Principles, pages 611–626, 2023. doi: 10.1145/3600006.3613165. URLhttps://arxiv.org/abs...

work page doi:10.1145/3600006.3613165 2023
[3]

Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. InAdvances in Neural Information Processing Systems, 2024. URL https: //arxiv.org/abs/2401.18079

work page arXiv 2024
[4]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems, 2022. URLhttps://arxiv.org/abs/2201.11903

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

R-KV: Redundancy-aware KV cache compression for reasoning models.arXiv preprint arXiv:2505.24133, 2025

Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li- Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, and Junjie Hu. R-KV: Redundancy-aware KV cache compression for reasoning models.arXiv preprint arXiv:2505.24133, 2025. doi: 10.48550/arXiv.2505.24133. URLhttps://arxiv.org/abs/2505.24133

work page doi:10.48550/arxiv.2505.24133 2025
[6]

Reasoning path compression: Compressing generation trajectories for efficient LLM reasoning

Jiwon Song, Dongwon Jo, Yulhwa Kim, and Jae-Joon Kim. Reasoning path compression: Compressing generation trajectories for efficient LLM reasoning. InAdvances in Neural Information Processing Systems,

work page
[7]

NeurIPS 2025 Poster

URLhttps://openreview.net/forum?id=894Yo61h1P. NeurIPS 2025 Poster

work page 2025
[8]

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, and Yukang Chen. Tri- attention: Efficient long reasoning with trigonometric KV compression.arXiv preprint arXiv:2604.04921,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

doi: 10.48550/arXiv.2604.04921. URLhttps://arxiv.org/abs/2604.04921

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.04921
[10]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory- efficient exact attention with IO-awareness. InAdvances in Neural Information Processing Systems, 2022. doi: 10.48550/arXiv.2205.14135. URLhttps://arxiv.org/abs/2205.14135

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2205.14135 2022
[11]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InInternational Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2309.17453. ICLR 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Transformers are multi-state RNNs

Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, and Roy Schwartz. Transformers are multi-state RNNs. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18724–18741, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi...

work page doi:10.18653/v1/2024.emnlp-main.10 2024
[13]

URLhttps://aclanthology.org/2024.emnlp-main.1043/

work page 2024
[14]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. Pyramidkv: Dynamic KV cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069, 2024. doi: 10.48550/arXiv.2406.02069. URL https://arxiv.org/abs/2406.02069

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2406.02069 2024
[15]

Omnikv: Dynamic context selection for efficient long-context LLMs

Jitai Hao, Yuke Zhu, Tian Wang, Jun Yu, Xin Xin, Bo Zheng, Zhaochun Ren, and Sheng Guo. Omnikv: Dynamic context selection for efficient long-context LLMs. InInternational Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=ulCAPXYXfa. ICLR 2025

work page 2025
[16]

Lacache: Ladder-shaped KV caching for efficient long-context modeling of large language models

Dachuan Shi, Yonggan Fu, Xiangchi Yuan, Zhongzhi Yu, Haoran You, Sixu Li, Xin Dong, Jan Kautz, Pavlo Molchanov, and Yingyan Celine Lin. Lacache: Ladder-shaped KV caching for efficient long-context modeling of large language models. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research...

work page 2025
[17]

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S. Kevin Zhou. Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference.arXiv preprint arXiv:2407.11550, 2024. doi: 10.48550/arXiv.2407.11550. URLhttps://arxiv.org/abs/2407.11550. 10

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.11550 2024
[18]

Duoattention: Efficient long-context LLM inference with retrieval and streaming heads

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. Duoattention: Efficient long-context LLM inference with retrieval and streaming heads. In International Conference on Learning Representations, 2025. URL https://proceedings.iclr.cc/ paper_files/paper/2025/file/5c1ddd2e59df46fd2aa85c833b1b36ed-Paper-Con...

work page 2025
[19]

Not all heads matter: A head-level KV cache compression method with integrated retrieval and reasoning.arXiv preprint arXiv:2410.19258, 2024

Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, and Wen Xiao. Not all heads matter: A head-level KV cache compression method with integrated retrieval and reasoning.arXiv preprint arXiv:2410.19258, 2024. doi: 10.48550/arXiv.2410.19258. URL https://arxiv.org/abs/2410.192 58

work page doi:10.48550/arxiv.2410.19258 2024
[20]

Razorattention: Efficient kv cache compression through retrieval heads

Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Danning Ke, Shikuan Hong, Yiwu Yao, and Gongyi Wang. Razorattention: Efficient kv cache compression through retrieval heads. InInternational Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=tkiZQlL04w

work page 2025
[21]

SnapKV: LLM Knows What You are Looking for Before Generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: LLM knows what you are looking for before generation.arXiv preprint arXiv:2404.14469, 2024. doi: 10.48550/arXiv.2404.14469. URL https://arxiv.org/abs/24 04.14469

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.14469 2024
[22]

Sablock: Semantic-aware KV cache eviction with adaptive compression block size.arXiv preprint arXiv:2510.22556, 2025

Jinhan Chen, Jianchun Liu, Hongli Xu, Xianjun Gao, and Shilong Wang. Sablock: Semantic-aware KV cache eviction with adaptive compression block size.arXiv preprint arXiv:2510.22556, 2025. doi: 10.48550/arXiv.2510.22556. URLhttps://arxiv.org/abs/2510.22556

work page doi:10.48550/arxiv.2510.22556 2025
[23]

Clusterkv: Manipulating LLM KV cache in semantic space for recallable compression.arXiv preprint arXiv:2412.03213, 2024

Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, and Minyi Guo. Clusterkv: Manipulating LLM KV cache in semantic space for recallable compression.arXiv preprint arXiv:2412.03213, 2024. doi: 10.48550/arXiv.2412.03213. URLhttps://arxiv.org/abs/2412.03213

work page doi:10.48550/arxiv.2412.03213 2024
[24]

Protokv: Long-context knowledges are already well-organized before your query

Zhiyuan Yu, Shijian Xiao, Zhangyue Yin, Xiaoran Liu, Lekai Xing, Wenzhong Li, Cam-Tu Nguyen, and Sanglu Lu. Protokv: Long-context knowledges are already well-organized before your query. In International Conference on Learning Representations, 2026. URL https://openreview.net/forum ?id=kXhPkDaFbJ. ICLR 2026 Poster

work page 2026
[25]

Treekv: Smooth key-value cache compression with tree structures.arXiv preprint arXiv:2501.04987, 2025

Ziwei He, Jian Yuan, Haoli Bai, Jingwen Leng, and Bo Jiang. Treekv: Smooth key-value cache compression with tree structures.arXiv preprint arXiv:2501.04987, 2025. doi: 10.48550/arXiv.2501.04987. URL https://arxiv.org/abs/2501.04987

work page doi:10.48550/arxiv.2501.04987 2025
[26]

HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference

Zhiyuan Shi, Qibo Qiu, Feng Xue, Zhonglin Jiang, Li Yu, Jian Jiang, Xiaofei He, and Wenxiao Wang. Heterocache: A dynamic retrieval approach to heterogeneous KV cache compression for long-context LLM inference.arXiv preprint arXiv:2601.13684, 2026. doi: 10.48550/arXiv.2601.13684. URL https://arxiv.org/abs/2601.13684

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.13684 2026
[27]

Expected attention: KV cache compression by estimating attention from future queries distribution.arXiv preprint arXiv:2510.00636, 2025

Alessio Devoto, Maximilian Jeblick, and Simon Jégou. Expected attention: KV cache compression by estimating attention from future queries distribution.arXiv preprint arXiv:2510.00636, 2025. doi: 10.48550/arXiv.2510.00636. URLhttps://arxiv.org/abs/2510.00636

work page doi:10.48550/arxiv.2510.00636 2025
[28]

Chunkkv: Semantic-preserving KV cache compression for efficient long-context LLM inference.arXiv preprint arXiv:2502.00299, 2025

Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Yue Liu, Bo Li, Xuming Hu, and Xiaowen Chu. Chunkkv: Semantic-preserving KV cache compression for efficient long-context LLM inference.arXiv preprint arXiv:2502.00299, 2025. doi: 10.48550/arXiv.2502.00299. URL https://arxiv.org/abs/25 02.00299. NeurIPS 2025

work page doi:10.48550/arxiv.2502.00299 2025
[29]

ReST-KV: Robust KV cache eviction with layer-wise output reconstruction and spatial-temporal smoothing

Yongqi An, Chang Lu, Kuan Zhu, Tao Yu, Chaoyang Zhao, Hong Wu, Ming Tang, and Jinqiao Wang. ReST-KV: Robust KV cache eviction with layer-wise output reconstruction and spatial-temporal smoothing. InInternational Conference on Learning Representations, 2026. URL https://openreview.net/for um?id=PhEHuo7oMm. ICLR 2026 Poster

work page 2026
[30]

Kediff: Key similarity-based KV cache eviction for long-context LLM inference in resource-constrained environments

Junyoung Park, Dalton Jones, Matt Morse, Raghavv Goel, Mingu Lee, and Chris Lott. Kediff: Key similarity-based KV cache eviction for long-context LLM inference in resource-constrained environments. arXiv preprint arXiv:2504.15364, 2025. doi: 10.48550/arXiv.2504.15364. URL https://arxiv.org/ abs/2504.15364

work page doi:10.48550/arxiv.2504.15364 2025
[31]

SCOPE: Optimizing key-value cache compression in long-context generation.arXiv preprint arXiv:2412.13649, 2024

Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, and Deyu Zhou. SCOPE: Optimizing key-value cache compression in long-context generation.arXiv preprint arXiv:2412.13649, 2024. doi: 10.48550/arXiv.2412.13649. URLhttps://arxiv.org/abs/2412.13649

work page doi:10.48550/arxiv.2412.13649 2024
[32]

G-KV: Decoding-time KV cache eviction with global attention.arXiv preprint arXiv:2512.00504, 2025

Mengqi Liao, Lu Wang, Chaoyun Zhang, Zekai Shen, Xiaowei Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Huaiyu Wan. G-KV: Decoding-time KV cache eviction with global attention.arXiv preprint arXiv:2512.00504, 2025. doi: 10.48550/arXiv.2512.00504. URL https: //arxiv.org/abs/2512.00504. 11

work page doi:10.48550/arxiv.2512.00504 2025
[33]

Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. InAdvances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=JZfg6wGi6g

work page 2023
[34]

KV-compress: Paged KV-cache compression with variable compression rates per attention head.arXiv preprint arXiv:2410.00161, 2024

Isaac Rehg. KV-compress: Paged KV-cache compression with variable compression rates per attention head.arXiv preprint arXiv:2410.00161, 2024. doi: 10.48550/arXiv.2410.00161. URL https: //arxiv.org/abs/2410.00161

work page doi:10.48550/arxiv.2410.00161 2024
[35]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024. doi: 10.1162/tacl_a_00638. URL h t t p s : //aclanthology.org/2024.tacl-1.9/

work page doi:10.1162/tacl_a_00638 2024
[36]

The curious case of neural text degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. InInternational Conference on Learning Representations, 2020. URL https://openrevi ew.net/forum?id=rygGQyrFvH. ICLR 2020

work page 2020
[37]

Kevin Zhou, and Xike Xie

Yuan Feng, Haoyu Guo, Junlin Lv, S. Kevin Zhou, and Xike Xie. Taming the fragility of KV cache eviction in LLM inference.arXiv preprint arXiv:2510.13334, 2025. doi: 10.48550/arXiv.2510.13334. URL https://arxiv.org/abs/2510.13334

work page doi:10.48550/arxiv.2510.13334 2025
[38]

LongFlow: Efficient KV Cache Compression for Reasoning Models

Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, and Min Zhang. Longflow: Efficient kv cache compression for reasoning models.arXiv preprint arXiv:2603.11504, 2026. doi: 10.48550/arXiv.2603.11

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2603.11 2026
[39]

URLhttps://arxiv.org/abs/2603.11504

work page internal anchor Pith review Pith/arXiv arXiv
[40]

Lethe: Layer- and time-adaptive kv cache pruning for reasoning-intensive llm serving.arXiv preprint arXiv:2511.06029, 2025

Hui Zeng, Daming Zhao, Pengfei Yang, WenXuan Hou, Tianyang Zheng, Hui Li, Weiye Ji, and Jidong Zhai. Lethe: Layer- and time-adaptive kv cache pruning for reasoning-intensive llm serving.arXiv preprint arXiv:2511.06029, 2025. doi: 10.48550/arXiv.2511.06029. URL https://arxiv.org/abs/2511.060 29

work page doi:10.48550/arxiv.2511.06029 2025
[41]

ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, and Tushar Krishna. Thinkv: Thought-adaptive kv cache compression for efficient reasoning models.arXiv preprint arXiv:2510.01290, 2025. doi: 10.48550/arXiv.2510.01290. URL https://arxiv.org/abs/25 10.01290

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2510.01290 2025
[42]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset.arXiv preprint arXiv:2103.03874, 2021. doi: 10.48550/arXiv.2103.03874. URL https://arxiv.org/abs/2103.038 74

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2103.03874 2021
[43]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations, 2024. URL https://proceedings.iclr.cc/paper_files/paper/2024/hash/aca97732e30bcf1303bc22ac3924fd16-Abstract-Conference.html. ICLR 2024

[44]

TIGER-Lab. Aime25. Hugging Face dataset repository, 2025. URL https://huggingface.co/datasets/TIGER-Lab/AIME25. Accessed 2026-05-05

[45]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. doi: 10.48550/arXiv.2110.14168. URL https://arxiv.org/abs/2110.14168

[46]

Deepseek-r1-distill-qwen-7b

deepseek-ai. Deepseek-r1-distill-qwen-7b. Hugging Face model repository, 2025. URL https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B. Accessed 2026-05-05

[47]

Deepseek-r1-distill-qwen-32b

DeepSeek-AI. Deepseek-r1-distill-qwen-32b. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, 2025. Distilled from DeepSeek-R1 and based on Qwen2.5-32B

[48]

Openthinker3-7b

open-thoughts. Openthinker3-7b. Hugging Face model repository, 2025. URL https://huggingface.co/open-thoughts/OpenThinker3-7B. Accessed 2026-05-05.

Appendix A Limitations and Impact Statement

Limitations. AMS is a training-free decoding-time allocation layer that uses attention-derived mass for adaptive segmentation. This design keeps the method lightw...

Pass@1 is computed from metric_main / num_samples in the csv

W    Δ      L_min  L_max  q_min  keep_last  n_sink  Pass@1 (%)  sec/sample  peak GB
16   0.005  32     1024   32     16         4       50.0        55.9        14.48
16   0.005  64     1024   8      32         8       50.0        54.4        14.48
16   0.005  128    512    0      16         0       50.0        54.6        14.48
16   0.010  16     4096   16     128        8       50.0        58.6        14.48
16   0.010  32     1024   32     128        4       50.0        55.9        14.48
16   0...

- materialize the current per-request KV view needed by the AMS selector;
- call the AMS/KVPress selector to obtain head-wise keep indices $I \in \mathbb{N}^{B \times H_{kv} \times T_{keep}}$;
- allocate compact replacement blocks from the paged KV block pool;
- launch a layout-aware GPU copy kernel that performs the per-head KV movement above for every attention layer;
- replace the request’s block-table row with the compact block IDs and free the old blocks after the copy completes; and
- maintain separate bookkeeping for the logical decoding position and the compact physical KV length.

The last item is important because the compact cache length becomes $T_{keep}$, while the next generated token should still follow the original autoregressive position.

Current implementation status. The supplementary code implements this policy–layout contract i...
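The compaction steps listed above (allocate compact blocks, move the kept entries per head, swap the block-table row, and track logical position separately from physical length) can be illustrated with a small CPU-only sketch. This is not the authors' implementation: `PagedKV`, `Request`, `compact`, and `BLOCK_SIZE` are hypothetical names introduced here, the store models a single attention layer, the keep indices are passed in directly instead of coming from the AMS/KVPress selector, and a plain Python loop stands in for the layout-aware GPU copy kernel.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per physical block (illustrative value, an assumption)


@dataclass
class PagedKV:
    """Toy single-layer paged store: storage[block_id][slot][head] -> payload."""
    num_blocks: int
    num_heads: int
    storage: dict = field(default_factory=dict)
    free: list = field(default_factory=list)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def alloc(self, n_tokens):
        n_blocks = -(-n_tokens // BLOCK_SIZE)  # ceiling division
        ids = [self.free.pop(0) for _ in range(n_blocks)]
        for b in ids:
            self.storage[b] = [[None] * self.num_heads for _ in range(BLOCK_SIZE)]
        return ids

    def release(self, ids):
        for b in ids:
            del self.storage[b]
            self.free.append(b)

    def read(self, table, pos, head):
        return self.storage[table[pos // BLOCK_SIZE]][pos % BLOCK_SIZE][head]

    def write(self, table, pos, head, value):
        self.storage[table[pos // BLOCK_SIZE]][pos % BLOCK_SIZE][head] = value


@dataclass
class Request:
    block_table: list   # physical block IDs in logical order
    physical_len: int   # number of KV slots currently stored
    logical_pos: int    # autoregressive position of the next token


def compact(kv, req, keep):
    """keep[h] lists the old positions that head h retains (same count per head)."""
    t_keep = len(keep[0])
    assert all(len(idx) == t_keep for idx in keep)
    new_table = kv.alloc(t_keep)                  # allocate compact blocks
    for h, idx in enumerate(keep):                # per-head KV movement
        for j, old in enumerate(idx):
            kv.write(new_table, j, h, kv.read(req.block_table, old, h))
    old_table, req.block_table = req.block_table, new_table  # swap table row
    kv.release(old_table)                         # free old blocks after the copy
    req.physical_len = t_keep                     # physical length shrinks
    # req.logical_pos is deliberately left unchanged.


# Usage: 8 cached tokens, 2 KV heads, each head keeps 4 positions.
kv = PagedKV(num_blocks=8, num_heads=2)
req = Request(block_table=kv.alloc(8), physical_len=8, logical_pos=8)
for pos in range(8):
    for h in range(2):
        kv.write(req.block_table, pos, h, (h, pos))  # payload tags its origin
compact(kv, req, keep=[[0, 1, 6, 7], [0, 3, 4, 7]])
```

After the call, `req.physical_len` is 4 while `req.logical_pos` stays at 8, which mirrors the point that the next generated token keeps its original autoregressive position even though the cache is shorter. Each head may keep a different set of positions, so a compact slot `j` can hold different source tokens for different heads.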
