Adaptive Mass-Segmented KV Compression for Long-Context Reasoning
Pith reviewed 2026-05-25 05:20 UTC · model grok-4.3
The pith
Adaptive Mass-Segmented KV compression gives guaranteed memory quotas to attention-rich regions instead of letting global Top-k evict whole reasoning blocks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By replacing token-level global Top-k selection with region-aware quota allocation driven by the spatial distribution of attention mass, AMS prevents the eviction of structurally vital reasoning segments. It adds EMA-based boundary smoothing for stable iterative decoding, stays orthogonal to the underlying importance scorer, and is compatible with paged-KV frameworks.
What carries the argument
The Adaptive Mass-Segmented (AMS) KV Compression framework, which partitions the KV cache according to the spatial distribution of attention mass and enforces guaranteed per-region memory quotas.
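To make the contrast with global Top-k concrete, here is a minimal sketch of region-aware quota allocation. It is an illustration under stated assumptions, not the paper's algorithm: the equal-mass cut rule, the `floor` parameter, and all function names are hypothetical, and the scores and mass values are made up.

```python
from typing import List, Tuple


def segment_by_mass(mass: List[float], num_segments: int) -> List[Tuple[int, int]]:
    """Cut the sequence into contiguous [start, end) spans of roughly equal attention mass.

    Assumption: equal-mass cut points stand in for the paper's partitioning rule.
    """
    total = sum(mass)
    segments: List[Tuple[int, int]] = []
    start, running, next_cut = 0, 0.0, 1
    for i, m in enumerate(mass):
        running += m
        if next_cut < num_segments and running >= total * next_cut / num_segments:
            segments.append((start, i + 1))
            start = i + 1
            # Skip any further cut points this same token already crossed.
            while next_cut < num_segments and running >= total * next_cut / num_segments:
                next_cut += 1
    if start < len(mass):
        segments.append((start, len(mass)))
    return segments


def allocate_quotas(mass: List[float], segments: List[Tuple[int, int]],
                    budget: int, floor: int = 1) -> List[int]:
    """Guarantee each segment `floor` slots, then share the rest in proportion to mass."""
    quotas = [min(floor, end - start) for start, end in segments]
    spare = budget - sum(quotas)
    if spare < 0:
        raise ValueError("budget is smaller than the guaranteed floors")
    seg_mass = [sum(mass[start:end]) for start, end in segments]
    total = sum(seg_mass) or 1.0
    for idx, (start, end) in enumerate(segments):
        extra = int(spare * seg_mass[idx] / total)  # rounding down may leave slots unused
        quotas[idx] = min(quotas[idx] + extra, end - start)
    return quotas


def select_with_quotas(scores: List[float], segments: List[Tuple[int, int]],
                       quotas: List[int]) -> List[int]:
    """Inside each segment, keep the best-scoring tokens up to that segment's quota."""
    keep: List[int] = []
    for (start, end), quota in zip(segments, quotas):
        ranked = sorted(range(start, end), key=lambda i: scores[i], reverse=True)
        keep.extend(ranked[:quota])
    return sorted(keep)


def global_top_k(scores: List[float], k: int) -> List[int]:
    """Baseline: keep the k best-scoring tokens regardless of where they sit."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Made-up example: the middle block (positions 4..7) scores low under the scorer.
scores = [0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.15, 0.05, 0.95, 0.85, 0.75, 0.65]
mass = [1.0] * 12
segments = segment_by_mass(mass, num_segments=3)
quotas = allocate_quotas(mass, segments, budget=6)
print(global_top_k(scores, 6))
print(select_with_quotas(scores, segments, quotas))
```

In this toy case global Top-k keeps nothing from positions 4 to 7, while the quota version keeps two tokens from each segment. The scorer is passed in as a plain list, which mirrors the claim that the allocation layer does not depend on which importance scorer produced the values.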
If this is right
- Preserves logical coherence by protecting contiguous reasoning blocks from eviction
- Raises accuracy on MATH500, AIME, GSM8K, code completion, open-domain QA and sparse retrieval
- Integrates without modification into TOVA, Expected Attention, KeyDiff, R-KV and TriAttention
- Runs inside vLLM-style paged-KV serving with gather-and-compact execution and zero added attention overhead
- Remains stable across iterative decoding steps through EMA boundary smoothing
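The EMA boundary smoothing mentioned in the last point can be sketched as follows. This is a hedged illustration: the update rule is a standard exponential moving average, but the smoothing weight `alpha`, the reset-on-mismatch behaviour, and the function name are assumptions, since the review text does not give the paper's exact formulation.

```python
from typing import List, Optional


def ema_boundaries(prev: Optional[List[float]], detected: List[int],
                   alpha: float = 0.3) -> List[float]:
    """Blend newly detected boundaries into the running estimate.

    `alpha` is the weight on the new detection (hypothetical default).
    """
    if prev is None or len(prev) != len(detected):
        # No compatible history: start over from the raw detection.
        return [float(b) for b in detected]
    return [(1.0 - alpha) * p + alpha * d for p, d in zip(prev, detected)]


# Made-up trace: the first boundary jitters between 100 and 120 across decoding steps.
raw_steps = [[100, 200], [120, 200], [100, 200], [120, 200]]
state: Optional[List[float]] = None
history = []
for detected in raw_steps:
    state = ema_boundaries(state, detected, alpha=0.25)
    history.append([round(b) for b in state])
print(history)
```

With `alpha=0.25` the raw boundary swings by 20 positions per step while the smoothed one moves by at most 5, which is the kind of damping the smoothing is meant to provide.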
Where Pith is reading between the lines
- The same mass-based segmentation could be applied to activation compression or weight pruning where contiguous structure also matters
- Ablating the EMA smoother on very long sequences would test whether boundary jitter becomes the next bottleneck
- Extending the quota mechanism to multi-turn dialogues might reveal whether attention mass still tracks evolving logical units
- Comparing AMS against purely length-based segmentation would isolate how much the attention-mass signal contributes beyond simple locality
Load-bearing premise
The spatial distribution of attention mass reliably identifies structurally vital reasoning segments that deserve guaranteed memory quotas.
What would settle it
Run the same long-context reasoning task twice: once with attention mass as computed by the model, and once with attention mass randomly reassigned across segments. If AMS stops improving accuracy in the randomized case, the gains depend on the mass-to-importance correlation, which confirms it as the load-bearing assumption. If the gains persist, they come from quota enforcement itself rather than from the attention-mass signal.
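The randomized control in this test could be set up along the following lines. The sketch only covers the reassignment step. The helper names, the proportional quota rule, and the example masses are hypothetical, and the evaluation harness that would measure accuracy under each condition is not shown.

```python
import random
from typing import List


def randomized_control(seg_mass: List[float], seed: int = 0) -> List[float]:
    """Return the same per-segment masses in a random order.

    Total mass is preserved, but the link between a segment's content
    and its mass is broken.
    """
    rng = random.Random(seed)  # seeded so the control run is reproducible
    shuffled = list(seg_mass)
    rng.shuffle(shuffled)
    return shuffled


def proportional_quotas(seg_mass: List[float], budget: int) -> List[int]:
    """Hypothetical quota rule: split the budget in proportion to segment mass."""
    total = sum(seg_mass)
    return [int(budget * m / total) for m in seg_mass]


seg_mass = [50, 10, 40]                      # made-up per-segment attention mass
control_mass = randomized_control(seg_mass, seed=0)
real_quotas = proportional_quotas(seg_mass, budget=20)
control_quotas = proportional_quotas(control_mass, budget=20)
print(real_quotas, control_quotas)
```

Both conditions spend the same total budget, so any accuracy gap between them can be attributed to where the quotas land rather than to how much memory is kept.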
Original abstract
The linear growth of the Key-Value (KV) cache is a critical bottleneck in long-form LLM inference. Existing KV compression methods mitigate this by evicting tokens based on importance scores. However, we show that their reliance on global Top-k selection triggers Region Wipe-out: the severe eviction of contiguous reasoning blocks that derails logical coherence. To address this, we propose Adaptive Mass-Segmented (AMS) KV Compression, a framework that shifts the paradigm from token-level competition to region-aware quota allocation. AMS adaptively partitions the KV cache based on the spatial distribution of attention mass, ensuring structurally vital reasoning segments receive guaranteed memory quotas. To ensure stability during iterative decoding, an EMA-based smoothing mechanism is incorporated to prevent jitter in segment boundaries. Crucially, AMS is a universal plug-and-play layer that is orthogonal to existing scorers. It can be seamlessly integrated into representative methods such as TOVA, Expected Attention, KeyDiff, R-KV and TriAttention. AMS is also system-compatible with modern paged-KV serving frameworks such as vLLM, supporting efficient gather-and-compact KV execution without introducing additional steady-state attention overhead. Extensive experiments across a diverse suite of tasks, including mathematical reasoning (MATH500, AIME, GSM8K), code completion, open-domain QA, and sparse retrieval, demonstrate that AMS consistently mitigates structural fragmentation and boosts model performance.
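One simple way to quantify the Region Wipe-out effect described in the abstract is to measure the longest contiguous run of evicted positions. This diagnostic is an assumption introduced here for illustration; the abstract does not say how the paper measures wipe-out, and the keep sets below are made up.

```python
from typing import List


def longest_evicted_run(keep: List[int], seq_len: int) -> int:
    """Length of the longest contiguous span of positions absent from `keep`."""
    kept = set(keep)
    longest = current = 0
    for pos in range(seq_len):
        current = 0 if pos in kept else current + 1
        longest = max(longest, current)
    return longest


# Made-up keep sets over a 12-token cache with a budget of 6.
global_keep = [0, 1, 2, 8, 9, 10]   # a global Top-k style selection
quota_keep = [0, 1, 5, 6, 8, 9]     # a per-region quota style selection
print(longest_evicted_run(global_keep, 12), longest_evicted_run(quota_keep, 12))
```

A large value relative to the typical length of a reasoning step would suggest that whole steps are being dropped, although mapping positions to logical steps would need annotations this sketch does not have.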
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that existing token-level Top-k KV eviction methods suffer from Region Wipe-out, where contiguous reasoning blocks are evicted and logical coherence is lost. It proposes Adaptive Mass-Segmented (AMS) KV Compression, which partitions the KV cache according to the spatial distribution of attention mass to allocate guaranteed quotas to structurally vital segments, adds EMA-based smoothing to stabilize segment boundaries during decoding, and is presented as a plug-and-play, orthogonal layer compatible with scorers such as TOVA, Expected Attention, KeyDiff, R-KV and TriAttention as well as paged-KV systems like vLLM. Experiments on mathematical reasoning (MATH500, AIME, GSM8K), code completion, open-domain QA and sparse retrieval are stated to show consistent mitigation of fragmentation and performance gains.
Significance. If the empirical results hold, AMS could offer a practical, low-overhead way to preserve structural coherence in long-context reasoning without replacing existing importance scorers, with the claimed system compatibility providing an additional deployment advantage.
major comments (1)
- [Abstract, framework description paragraph] The central claim that attention-mass spatial distribution reliably identifies structurally vital reasoning segments deserving guaranteed quotas is load-bearing, yet the manuscript provides no direct validation (e.g., correlation with logical importance, ablation against positional/recency biases, or counter-example analysis). If the proxy is weak or task-dependent, the region-aware allocation reduces to a smoothed variant of prior scorers and the claimed structural protection does not follow.
minor comments (1)
- [Abstract] The statement that 'extensive experiments demonstrate consistent mitigation and performance gains' is not accompanied by any quantitative results, error bars, baseline tables, or statistical details, making the strength of the empirical support difficult to assess from the provided text.
Simulated Author's Rebuttal
We thank the referee for the constructive comment regarding validation of the attention-mass proxy. We address the concern directly below and commit to strengthening the manuscript with additional analysis.
- Referee: [Abstract, framework description paragraph] The central claim that attention-mass spatial distribution reliably identifies structurally vital reasoning segments deserving guaranteed quotas is load-bearing, yet the manuscript provides no direct validation (e.g., correlation with logical importance, ablation against positional/recency biases, or counter-example analysis). If the proxy is weak or task-dependent, the region-aware allocation reduces to a smoothed variant of prior scorers and the claimed structural protection does not follow.
Authors: We agree that direct validation of the attention-mass spatial distribution as a proxy for structurally vital segments would strengthen the central claim. The current manuscript relies on indirect evidence: consistent performance improvements when AMS is combined with multiple independent scorers (TOVA, Expected Attention, KeyDiff, R-KV, TriAttention) across mathematical reasoning, code, and QA tasks, together with the orthogonality results showing gains beyond any single scorer. These outcomes are difficult to explain if AMS were merely a smoothed Top-k variant. Nevertheless, we acknowledge the absence of explicit correlation studies or bias ablations. In the revision we will add (i) a quantitative correlation between detected segment boundaries and logical step transitions in MATH problems, (ii) an ablation replacing mass-based partitioning with positional or recency-based alternatives, and (iii) selected counter-example traces. These additions will clarify the proxy's reliability and rule out reduction to prior smoothing techniques.
Revision: yes
Circularity Check
No circularity: empirical plug-and-play method with no self-referential derivations
The paper describes AMS as an empirical framework that partitions the KV cache by attention-mass distribution and integrates orthogonally with existing scorers, validated on external benchmarks such as MATH500 and GSM8K. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the text. The central claims rest on experimental results rather than on any derivation that reduces to its own inputs by construction. The evidence is therefore self-contained and measured against external tasks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the spatial distribution of attention mass identifies structurally vital reasoning segments that merit guaranteed memory quotas
Reference graph
Works this paper leans on
-
[1]
H2O: Heavy-hitter oracle for efficient generative inference of large language models
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=RkRrPp7GKO
-
[2]
Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th ACM Symposium on Operating Systems Principles, pages 611–626, 2023. doi: 10.1145/3600006.3613165. URL https://arxiv.org/abs...
-
[3]
KVQuant: Towards 10 million context length LLM inference with KV cache quantization
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. In Advances in Neural Information Processing Systems, 2024. URL https://arxiv.org/abs/2401.18079
-
[4]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022. URL https://arxiv.org/abs/2201.11903
-
[5]
Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, and Junjie Hu. R-KV: Redundancy-aware KV cache compression for reasoning models. arXiv preprint arXiv:2505.24133, 2025. doi: 10.48550/arXiv.2505.24133. URL https://arxiv.org/abs/2505.24133
-
[6]
Reasoning path compression: Compressing generation trajectories for efficient LLM reasoning
Jiwon Song, Dongwon Jo, Yulhwa Kim, and Jae-Joon Kim. Reasoning path compression: Compressing generation trajectories for efficient LLM reasoning. In Advances in Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=894Yo61h1P. NeurIPS 2025 Poster
-
[8]
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang, Song Han, and Yukang Chen. Triattention: Efficient long reasoning with trigonometric KV compression. arXiv preprint arXiv:2604.04921, 2026. doi: 10.48550/arXiv.2604.04921. URL https://arxiv.org/abs/2604.04921
-
[10]
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022. doi: 10.48550/arXiv.2205.14135. URL https://arxiv.org/abs/2205.14135
-
[11]
Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2309.17453. ICLR 2024
-
[12]
Transformers are multi-state RNNs
Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, and Roy Schwartz. Transformers are multi-state RNNs. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18724–18741, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi... URL https://aclanthology.org/2024.emnlp-main.1043/
-
[14]
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. Pyramidkv: Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024. doi: 10.48550/arXiv.2406.02069. URL https://arxiv.org/abs/2406.02069
-
[15]
Omnikv: Dynamic context selection for efficient long-context LLMs
Jitai Hao, Yuke Zhu, Tian Wang, Jun Yu, Xin Xin, Bo Zheng, Zhaochun Ren, and Sheng Guo. Omnikv: Dynamic context selection for efficient long-context LLMs. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=ulCAPXYXfa. ICLR 2025
-
[16]
Lacache: Ladder-shaped KV caching for efficient long-context modeling of large language models
Dachuan Shi, Yonggan Fu, Xiangchi Yuan, Zhongzhi Yu, Haoran You, Sixu Li, Xin Dong, Jan Kautz, Pavlo Molchanov, and Yingyan Celine Lin. Lacache: Ladder-shaped KV caching for efficient long-context modeling of large language models. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research...
-
[17]
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S. Kevin Zhou. Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference. arXiv preprint arXiv:2407.11550, 2024. doi: 10.48550/arXiv.2407.11550. URL https://arxiv.org/abs/2407.11550.
-
[18]
Duoattention: Efficient long-context LLM inference with retrieval and streaming heads
Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. Duoattention: Efficient long-context LLM inference with retrieval and streaming heads. In International Conference on Learning Representations, 2025. URL https://proceedings.iclr.cc/paper_files/paper/2025/file/5c1ddd2e59df46fd2aa85c833b1b36ed-Paper-Con...
-
[19]
Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, and Wen Xiao. Not all heads matter: A head-level KV cache compression method with integrated retrieval and reasoning. arXiv preprint arXiv:2410.19258, 2024. doi: 10.48550/arXiv.2410.19258. URL https://arxiv.org/abs/2410.19258
-
[20]
Razorattention: Efficient kv cache compression through retrieval heads
Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Danning Ke, Shikuan Hong, Yiwu Yao, and Gongyi Wang. Razorattention: Efficient kv cache compression through retrieval heads. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=tkiZQlL04w
-
[21]
SnapKV: LLM Knows What You are Looking for Before Generation
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024. doi: 10.48550/arXiv.2404.14469. URL https://arxiv.org/abs/2404.14469
-
[22]
Jinhan Chen, Jianchun Liu, Hongli Xu, Xianjun Gao, and Shilong Wang. Sablock: Semantic-aware KV cache eviction with adaptive compression block size. arXiv preprint arXiv:2510.22556, 2025. doi: 10.48550/arXiv.2510.22556. URL https://arxiv.org/abs/2510.22556
-
[23]
Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, and Minyi Guo. Clusterkv: Manipulating LLM KV cache in semantic space for recallable compression. arXiv preprint arXiv:2412.03213, 2024. doi: 10.48550/arXiv.2412.03213. URL https://arxiv.org/abs/2412.03213
-
[24]
Protokv: Long-context knowledges are already well-organized before your query
Zhiyuan Yu, Shijian Xiao, Zhangyue Yin, Xiaoran Liu, Lekai Xing, Wenzhong Li, Cam-Tu Nguyen, and Sanglu Lu. Protokv: Long-context knowledges are already well-organized before your query. In International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=kXhPkDaFbJ. ICLR 2026 Poster
-
[25]
Ziwei He, Jian Yuan, Haoli Bai, Jingwen Leng, and Bo Jiang. Treekv: Smooth key-value cache compression with tree structures. arXiv preprint arXiv:2501.04987, 2025. doi: 10.48550/arXiv.2501.04987. URL https://arxiv.org/abs/2501.04987
-
[26]
Zhiyuan Shi, Qibo Qiu, Feng Xue, Zhonglin Jiang, Li Yu, Jian Jiang, Xiaofei He, and Wenxiao Wang. Heterocache: A dynamic retrieval approach to heterogeneous KV cache compression for long-context LLM inference. arXiv preprint arXiv:2601.13684, 2026. doi: 10.48550/arXiv.2601.13684. URL https://arxiv.org/abs/2601.13684
-
[27]
Alessio Devoto, Maximilian Jeblick, and Simon Jégou. Expected attention: KV cache compression by estimating attention from future queries distribution. arXiv preprint arXiv:2510.00636, 2025. doi: 10.48550/arXiv.2510.00636. URL https://arxiv.org/abs/2510.00636
-
[28]
Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Yue Liu, Bo Li, Xuming Hu, and Xiaowen Chu. Chunkkv: Semantic-preserving KV cache compression for efficient long-context LLM inference. arXiv preprint arXiv:2502.00299, 2025. doi: 10.48550/arXiv.2502.00299. URL https://arxiv.org/abs/2502.00299. NeurIPS 2025
-
[29]
Yongqi An, Chang Lu, Kuan Zhu, Tao Yu, Chaoyang Zhao, Hong Wu, Ming Tang, and Jinqiao Wang. ReST-KV: Robust KV cache eviction with layer-wise output reconstruction and spatial-temporal smoothing. In International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=PhEHuo7oMm. ICLR 2026 Poster
-
[30]
Junyoung Park, Dalton Jones, Matt Morse, Raghavv Goel, Mingu Lee, and Chris Lott. Keydiff: Key similarity-based KV cache eviction for long-context LLM inference in resource-constrained environments. arXiv preprint arXiv:2504.15364, 2025. doi: 10.48550/arXiv.2504.15364. URL https://arxiv.org/abs/2504.15364
-
[31]
Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, and Deyu Zhou. SCOPE: Optimizing key-value cache compression in long-context generation. arXiv preprint arXiv:2412.13649, 2024. doi: 10.48550/arXiv.2412.13649. URL https://arxiv.org/abs/2412.13649
-
[32]
G-KV: Decoding-time KV cache eviction with global attention
Mengqi Liao, Lu Wang, Chaoyun Zhang, Zekai Shen, Xiaowei Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Huaiyu Wan. G-KV: Decoding-time KV cache eviction with global attention. arXiv preprint arXiv:2512.00504, 2025. doi: 10.48550/arXiv.2512.00504. URL https://arxiv.org/abs/2512.00504.
-
[33]
Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. In Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=JZfg6wGi6g
-
[34]
Isaac Rehg. KV-compress: Paged KV-cache compression with variable compression rates per attention head. arXiv preprint arXiv:2410.00161, 2024. doi: 10.48550/arXiv.2410.00161. URL https://arxiv.org/abs/2410.00161
-
[35]
Lost in the middle: How language models use long contexts
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024. doi: 10.1162/tacl_a_00638. URL https://aclanthology.org/2024.tacl-1.9/
-
[36]
The curious case of neural text degeneration
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rygGQyrFvH. ICLR 2020
-
[37]
Yuan Feng, Haoyu Guo, Junlin Lv, S. Kevin Zhou, and Xike Xie. Taming the fragility of KV cache eviction in LLM inference. arXiv preprint arXiv:2510.13334, 2025. doi: 10.48550/arXiv.2510.13334. URL https://arxiv.org/abs/2510.13334
-
[38]
LongFlow: Efficient KV Cache Compression for Reasoning Models
Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, and Min Zhang. Longflow: Efficient kv cache compression for reasoning models. arXiv preprint arXiv:2603.11504, 2026. doi: 10.48550/arXiv.2603.11504. URL https://arxiv.org/abs/2603.11504
-
[40]
Hui Zeng, Daming Zhao, Pengfei Yang, WenXuan Hou, Tianyang Zheng, Hui Li, Weiye Ji, and Jidong Zhai. Lethe: Layer- and time-adaptive kv cache pruning for reasoning-intensive llm serving. arXiv preprint arXiv:2511.06029, 2025. doi: 10.48550/arXiv.2511.06029. URL https://arxiv.org/abs/2511.06029
-
[41]
ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models
Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, and Tushar Krishna. Thinkv: Thought-adaptive kv cache compression for efficient reasoning models. arXiv preprint arXiv:2510.01290, 2025. doi: 10.48550/arXiv.2510.01290. URL https://arxiv.org/abs/2510.01290
-
[42]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021. doi: 10.48550/arXiv.2103.03874. URL https://arxiv.org/abs/2103.03874
-
[43]
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations, 2024. URL https://proceedings.iclr.cc/paper_files/paper/2024/hash/aca97732e30bcf1303bc22ac3924fd16-Abstract-Conference.html. ICLR 2024
-
[44]
TIGER-Lab. Aime25. Hugging Face dataset repository, 2025. URL https://huggingface.co/datasets/TIGER-Lab/AIME25. Accessed 2026-05-05
-
[45]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. doi: 10.48550/arXiv.2110.14168. URL https://arxiv.org/abs/2110.14168
-
[46]
deepseek-ai. Deepseek-r1-distill-qwen-7b. Hugging Face model repository, 2025. URL https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B. Accessed 2026-05-05
-
[47]
DeepSeek-AI. Deepseek-r1-distill-qwen-32b. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, 2025. Distilled from DeepSeek-R1 and based on Qwen2.5-32B
-
[48]
open-thoughts. Openthinker3-7b. Hugging Face model repository, 2025. URL https://huggingface.co/open-thoughts/OpenThinker3-7B. Accessed 2026-05-05.
Appendix A Limitations and Impact Statement
Limitations. AMS is a training-free decoding-time allocation layer that uses attention-derived mass for adaptive segmentation. This design keeps the method lightw...
Simplify (u+4)(u−1)−(u+4)(u−1)
Pass@1 is computed from metric_main / num_samples in the csv
W   ∆      L_min  L_max  q_min  keep_last  n_sink  Pass@1 (%)  sec/sample  peak GB
16  0.005  32     1024   32     16         4       50.0        55.9        14.48
16  0.005  64     1024   8      32         8       50.0        54.4        14.48
16  0.005  128    512    0      16         0       50.0        54.6        14.48
16  0.010  16     4096   16     128        8       50.0        58.6        14.48
16  0.010  32     1024   32     128        4       50.0        55.9        14.48
16  0...
1. materialize the current per-request KV view needed by the AMS selector;
2. call the AMS/KVPress selector to obtain head-wise keep indices I ∈ N^{B×H_kv×T_keep};
3. allocate compact replacement blocks from the paged KV block pool;
4. launch a layout-aware GPU copy kernel that performs the per-head KV movement above for every attention layer;
5. replace the request’s block-table row with the compact block IDs and free the old blocks after the copy completes; and
6. maintain separate bookkeeping for the logical decoding position and the compact physical KV length.
The last item is important because the compact cache length becomes T_keep, while the next generated token should still follow the original autoregressive position.
Current implementation status. The supplementary code implements this policy–layout contract i...
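The gather-and-compact steps and the position bookkeeping described above can be sketched in plain Python. This is a CPU-side illustration under assumptions, not the supplementary code: `RequestState`, `gather_compact`, and `compress` are hypothetical names, strings stand in for KV tensors, and the paged block pool, block table, and GPU copy kernel are left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RequestState:
    logical_pos: int   # position index the next generated token should use
    physical_len: int  # number of KV entries actually stored per head


def gather_compact(kv_per_head: List[List[str]],
                   keep_per_head: List[List[int]]) -> List[List[str]]:
    """Per head, copy the kept entries into a dense list in original order."""
    return [[kv[i] for i in sorted(keep)]
            for kv, keep in zip(kv_per_head, keep_per_head)]


def compress(state: RequestState, kv_per_head: List[List[str]],
             keep_per_head: List[List[int]]) -> Tuple[RequestState, List[List[str]]]:
    """Compact the cache while leaving the logical decoding position untouched."""
    t_keep = len(keep_per_head[0])
    if any(len(keep) != t_keep for keep in keep_per_head):
        raise ValueError("every head must keep the same number of entries (T_keep)")
    compact = gather_compact(kv_per_head, keep_per_head)
    # Physical length shrinks to T_keep; the logical position is carried over as is.
    return RequestState(logical_pos=state.logical_pos, physical_len=t_keep), compact


# Made-up request: 2 KV heads, 8 cached tokens, each head keeps 3 entries.
state = RequestState(logical_pos=8, physical_len=8)
kv = [[f"h{h}t{t}" for t in range(8)] for h in range(2)]
keep = [[7, 0, 3], [1, 2, 7]]
new_state, compact = compress(state, kv, keep)
print(new_state, compact)
```

Keeping `logical_pos` and `physical_len` as separate fields is the point of the last step: after compaction the cache holds T_keep entries per head, but the next token is still decoded at the original autoregressive position.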