Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression
Pith reviewed 2026-05-23 04:11 UTC · model grok-4.3
The pith
KV cache compression breaks chain-of-thought reasoning unless few-shot examples are preserved as indivisible units.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that aggressive KV cache compression causes severe, task-dependent degradation on high-density reasoning because it disrupts the chain-of-thought links inside few-shot examples. Those examples must therefore be treated as indivisible semantic units. ShotKV restores performance by explicitly separating the prefill phase from the decoding phase so that semantic integrity is maintained throughout.
What carries the argument
ShotKV, a KV cache method that separates the prefill phase from the decoding phase to keep few-shot examples intact as indivisible semantic units.
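The idea of keeping few-shot examples whole can be illustrated with a minimal sketch. It is not the paper's implementation: the per-token scores, the mean-score ranking, and the names `Shot` and `select_whole_shots` are assumptions made purely for illustration. Given the token span of each few-shot example and a per-token importance score (for instance, aggregated attention), the sketch keeps or drops each example as a unit under a token budget.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Shot:
    """Hypothetical container: one few-shot example as a half-open token span."""
    start: int
    end: int  # exclusive

    @property
    def length(self) -> int:
        return self.end - self.start


def select_whole_shots(scores: Sequence[float], shots: List[Shot], budget: int) -> List[int]:
    """Return sorted token indices to retain, never splitting a shot.

    Shots are ranked by mean per-token score (an assumed heuristic) and
    added greedily while they still fit within the token budget.
    """
    ranked = sorted(
        shots,
        key=lambda s: sum(scores[s.start:s.end]) / s.length,
        reverse=True,
    )
    kept: List[int] = []
    used = 0
    for shot in ranked:
        if used + shot.length <= budget:
            kept.extend(range(shot.start, shot.end))
            used += shot.length
    return sorted(kept)


# Toy prompt: three shots of 4, 3, and 5 tokens with made-up scores.
toy_scores = [0.1, 0.2, 0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.3, 0.4, 0.3, 0.2]
toy_shots = [Shot(0, 4), Shot(4, 7), Shot(7, 12)]
retained = select_whole_shots(toy_scores, toy_shots, budget=8)
print(retained)
```

In the toy run, the second and third shots fit the budget together and are retained in full, while the lowest-scoring shot is evicted entirely rather than trimmed.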
If this is right
- Retrieval tasks remain robust under aggressive compression while reasoning tasks exhibit severe degradation due to broken CoT links.
- ShotKV yields 9%-18% accuracy gains on long-context generation tasks.
- The gains generalize to document QA tasks.
- ShotKV reduces latency by 11% relative to full-cache inference.
Where Pith is reading between the lines
- The phase-separation tactic could be applied to other KV compression algorithms that currently mix prefill and decode steps.
- Attention patterns observed in DeepSeek-R1 suggest that model-specific fragility of reasoning chains may require tailored semantic-unit rules.
- Success on document QA indicates the method may extend to other multi-step language tasks that depend on long prompt coherence.
Load-bearing premise
The assumption that the observed degradation in reasoning tasks is caused specifically by the breaking of indivisible semantic units within few-shot examples.
What would settle it
An experiment in which few-shot examples are kept whole during compression yet reasoning accuracy still falls by the same amount as in standard compression.
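One way to set up such an experiment is to compare two retention masks that keep the same number of tokens: one that retains whole few-shot examples and one that picks tokens individually. The sketch below is a hypothetical illustration with made-up scores; `fragmented_top_k` and `split_shot_count` are names introduced here, not functions from the paper. It builds a token-level baseline at a matched budget and counts how many examples each condition leaves partially evicted.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open (start, end) token span of one few-shot example


def fragmented_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Token-level baseline: keep the k highest-scoring tokens, ignoring shot boundaries."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def split_shot_count(kept: Sequence[int], spans: Sequence[Span]) -> int:
    """Count shots that are only partially retained (some tokens kept, some evicted)."""
    kept_set = set(kept)
    split = 0
    for start, end in spans:
        n_kept = sum(1 for i in range(start, end) if i in kept_set)
        if 0 < n_kept < end - start:
            split += 1
    return split


# Toy data (made up): three shots; the atomic condition keeps the last two whole.
scores = [0.1, 0.9, 0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.3, 0.4, 0.3, 0.2]
spans = [(0, 4), (4, 7), (7, 12)]
atomic = list(range(4, 12))                         # 8 tokens, whole shots only
fragmented = fragmented_top_k(scores, len(atomic))  # same budget, token-level

assert len(atomic) == len(fragmented)  # budgets are matched by construction
print(split_shot_count(atomic, spans), split_shot_count(fragmented, spans))
```

Because both masks retain the same number of tokens, any accuracy gap between the two conditions could be attributed to fragmentation rather than to budget size, which is the comparison the premise depends on.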
Original abstract
While Key-Value (KV) cache compression is essential for efficient LLM inference, current evaluations disproportionately focus on sparse retrieval tasks, potentially masking the degradation of High-Density Reasoning where Chain-of-Thought (CoT) coherence is critical. We introduce KVFundaBench to systematically evaluate this gap, revealing a sharp dichotomy: while retrieval tasks remain robust, reasoning tasks exhibit severe Task-Dependent Degradation under aggressive compression due to disrupted CoT links. Extending our analysis to the DeepSeek-R1 model, we uncover that its specialized attention patterns offer unique insights into the fragility of reasoning chains. Guided by these findings -- specifically the necessity of preserving few-shot examples as indivisible Semantic Units -- we propose ShotKV. This approach explicitly separates prefill and decoding phases to prioritize semantic integrity. Empirical results demonstrate that ShotKV achieves 9%-18% accuracy improvements on long-context generation tasks and effectively generalizes to document QA, all while delivering an 11% latency reduction compared to full cache inference.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces KVFundaBench to benchmark KV cache compression methods, finding that retrieval tasks remain robust while reasoning tasks exhibit severe Task-Dependent Degradation under aggressive compression due to disrupted CoT links. It extends analysis to DeepSeek-R1's attention patterns and proposes ShotKV, which separates prefill and decoding phases to treat few-shot examples as indivisible Semantic Units, claiming 9%-18% accuracy gains on long-context generation tasks, generalization to document QA, and an 11% latency reduction versus full cache inference.
Significance. If the results hold, the work would highlight an important gap in KV cache compression evaluations that have focused on sparse retrieval and would offer a practical engineering approach (ShotKV) for preserving reasoning coherence. The introduction of a dedicated benchmark for high-density reasoning is a constructive contribution to the field.
major comments (2)
- [Findings guiding ShotKV design] The section on findings guiding ShotKV design: the attribution of reasoning degradation specifically to disrupted CoT links (and the consequent necessity of preserving few-shot examples as indivisible Semantic Units) is not supported by direct measurements such as step-wise entailment scores, attention-link tracing, or ablations that hold total retained tokens fixed while varying atomic versus fragmented treatment of few-shot content.
- [Empirical results] The empirical results paragraph: the reported 9%-18% accuracy improvements and 11% latency reduction are presented without accompanying information on experimental setup, baselines, statistical controls, dataset construction, or error bars, rendering it impossible to evaluate whether the central performance claims are supported by the data.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for stronger empirical grounding in our design rationale and clearer reporting of experimental details. We address each major comment point by point below.
- Referee: [Findings guiding ShotKV design] The section on findings guiding ShotKV design: the attribution of reasoning degradation specifically to disrupted CoT links (and the consequent necessity of preserving few-shot examples as indivisible Semantic Units) is not supported by direct measurements such as step-wise entailment scores, attention-link tracing, or ablations that hold total retained tokens fixed while varying atomic versus fragmented treatment of few-shot content.
  Authors: Our attribution rests on the core empirical observation from KVFundaBench that retrieval tasks remain robust under aggressive compression while reasoning tasks exhibit sharp Task-Dependent Degradation, combined with the attention pattern analysis on DeepSeek-R1 that reveals fragility in long reasoning chains. These results motivate treating few-shot examples as indivisible Semantic Units. We did not perform step-wise entailment scoring or explicit attention-link tracing, nor the specific token-fixed ablation contrasting atomic versus fragmented few-shot treatment. We agree this constitutes a gap and will add the requested ablation study (holding total retained tokens fixed) in the revision to directly test the indivisible-unit hypothesis.
  Revision: partial
- Referee: [Empirical results] The empirical results paragraph: the reported 9%-18% accuracy improvements and 11% latency reduction are presented without accompanying information on experimental setup, baselines, statistical controls, dataset construction, or error bars, rendering it impossible to evaluate whether the central performance claims are supported by the data.
  Authors: The full manuscript contains an Experiments section describing the setup, baselines (standard KV compression methods and full-cache inference), KVFundaBench dataset construction, and evaluation protocol. However, the presentation of the 9%-18% accuracy and 11% latency figures lacks error bars, statistical significance reporting, and explicit controls. We will revise the results section and add a dedicated experimental details subsection with these elements to make the claims fully reproducible and evaluable.
  Revision: yes
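The error bars requested above could be produced in several ways. As one hedged illustration, the sketch below computes a paired percentile-bootstrap confidence interval for the accuracy difference between two methods evaluated on the same prompts. The function name `paired_bootstrap_ci`, the resample count, and the toy correctness vectors are all assumptions; nothing here reflects the paper's actual evaluation protocol or data.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on the same examples."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs: List[float] = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping A/B pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Made-up per-example correctness (1 = correct) for two methods on ten prompts.
method_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
method_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
low, high = paired_bootstrap_ci(method_a, method_b)
print(f"95% CI for accuracy difference: [{low:.2f}, {high:.2f}]")
```

Resampling paired indices keeps each prompt's two outcomes together, so the interval reflects the per-prompt difference rather than treating the two methods as independent samples. An interval that excludes zero would support a claimed improvement; one that includes it would not.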
Circularity Check
No significant circularity; empirical benchmarking and engineering proposal
The paper presents an empirical benchmark (KVFundaBench) and an engineering method (ShotKV) motivated by observed task-dependent performance patterns under KV compression. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The design choice to treat few-shot examples as indivisible units follows directly from reported accuracy measurements rather than reducing to a definitional or fitted tautology. The work is self-contained against external benchmarks and contains no load-bearing steps that equate outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: few-shot examples function as indivisible semantic units whose preservation is required to maintain CoT coherence under compression.
Forward citations
Cited by 2 Pith papers
- Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
  LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
- The Pitfalls of KV Cache Compression
  KV cache compression causes certain instructions to degrade rapidly and be ignored in multi-instruction prompting, with system prompt leakage worsened by method choice, instruction order, and eviction bias; simple pol...