
arxiv: 2502.01941 · v4 · submitted 2025-02-04 · 💻 cs.CL · cs.AI

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

Pith reviewed 2026-05-23 04:11 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords KV cache compression · chain-of-thought reasoning · semantic integrity · few-shot examples · long-context generation · LLM inference · benchmark evaluation

The pith

KV cache compression breaks chain-of-thought reasoning unless few-shot examples are preserved as indivisible units.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current KV cache compression methods perform well on retrieval tasks but cause sharp drops on reasoning tasks that rely on coherent chain-of-thought steps. The authors introduce a benchmark that isolates this difference and links the failures to compression that splits apart the semantic links inside few-shot examples. Guided by that observation they introduce ShotKV, which keeps those examples whole by handling the initial prompt loading and the later token generation in separate phases. The result is higher accuracy on long-context generation and document QA together with lower latency than full-cache inference.

Core claim

The central claim is that aggressive KV cache compression produces severe task-dependent degradation on high-density reasoning because it disrupts the coherence of chain-of-thought links inside few-shot examples, which must therefore be treated as indivisible semantic units; ShotKV restores performance by explicitly separating the prefill phase from the decoding phase so that semantic integrity is maintained throughout.

What carries the argument

ShotKV, a KV cache method that separates the prefill phase from the decoding phase to keep few-shot examples intact as indivisible semantic units.
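
To make the idea of an indivisible semantic unit concrete, here is a minimal sketch of shot-level selection. It is not the authors' implementation: the `Shot` type, the `select_whole_shots` function, the mean-score ranking, and the greedy fill are all assumptions chosen for illustration. The only property taken from the paper's description is that a few-shot example is either kept whole or dropped whole.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Shot:
    """Hypothetical container: one few-shot example spanning tokens [start, end)."""
    start: int
    end: int


def select_whole_shots(
    token_scores: Sequence[float],
    shots: Sequence[Shot],
    budget: int,
) -> List[int]:
    """Return sorted token indices to retain without ever splitting a shot.

    Assumption: shots are non-empty and are ranked by mean per-token score,
    then added greedily while they still fit in the token budget.
    """
    ranked = sorted(
        shots,
        key=lambda s: sum(token_scores[s.start:s.end]) / (s.end - s.start),
        reverse=True,
    )
    kept: List[int] = []
    used = 0
    for shot in ranked:
        length = shot.end - shot.start
        if used + length <= budget:
            kept.extend(range(shot.start, shot.end))
            used += length
    return sorted(kept)


# Three shots of lengths 3, 3 and 2; a budget of 5 tokens.
scores = [0.1, 0.1, 0.1, 0.9, 0.8, 0.7, 0.2, 0.3]
shots = [Shot(0, 3), Shot(3, 6), Shot(6, 8)]
print(select_whole_shots(scores, shots, budget=5))  # [3, 4, 5, 6, 7]
```

A token-level top-k selector given the same budget could retain fragments of all three shots; the shot-level rule trades some score mass for intact examples.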

If this is right

  • Retrieval tasks remain robust under aggressive compression while reasoning tasks exhibit severe degradation due to broken CoT links.
  • ShotKV produces 9-18% accuracy gains on long-context generation tasks.
  • The gains generalize to document QA tasks.
  • ShotKV reduces latency by 11% relative to full-cache inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The phase-separation tactic could be applied to other KV compression algorithms that currently mix prefill and decode steps.
  • Attention patterns observed in DeepSeek-R1 suggest that model-specific fragility of reasoning chains may require tailored semantic-unit rules.
  • Success on document QA indicates the method may extend to other multi-step language tasks that depend on long prompt coherence.
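
As an illustration of the first extension above, the sketch below wraps an arbitrary token selector so that the prompt is compressed once at prefill and generated tokens are evicted under a separate budget. The class name `PhaseSeparatedCache`, the `top_k` selector, and the two-budget design are hypothetical; the paper states only that ShotKV separates the prefill and decoding phases, not how the budgets are set.

```python
from typing import Callable, List, Sequence

Selector = Callable[[Sequence[float], int], List[int]]


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Token-level selector: indices of the k highest scores, in position order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


class PhaseSeparatedCache:
    """Tracks retained positions with independent prefill and decode policies."""

    def __init__(
        self,
        prefill_selector: Selector,
        decode_selector: Selector,
        prefill_budget: int,
        decode_budget: int,
    ) -> None:
        self.prefill_selector = prefill_selector
        self.decode_selector = decode_selector
        self.prefill_budget = prefill_budget
        self.decode_budget = decode_budget
        self.prompt_len = 0
        self.n_generated = 0
        self.prefill_kept: List[int] = []
        self.decode_positions: List[int] = []
        self.decode_scores: List[float] = []

    def prefill(self, scores: Sequence[float]) -> None:
        """Compress the prompt once; later decode steps never revisit it."""
        self.prompt_len = len(scores)
        self.prefill_kept = self.prefill_selector(scores, self.prefill_budget)

    def append(self, score: float) -> None:
        """Record one generated token and evict inside the decode region only."""
        self.decode_positions.append(self.prompt_len + self.n_generated)
        self.decode_scores.append(score)
        self.n_generated += 1
        if len(self.decode_positions) > self.decode_budget:
            keep = self.decode_selector(self.decode_scores, self.decode_budget)
            self.decode_positions = [self.decode_positions[i] for i in keep]
            self.decode_scores = [self.decode_scores[i] for i in keep]

    def kept(self) -> List[int]:
        """All retained positions: the frozen prompt subset, then decode tokens."""
        return self.prefill_kept + self.decode_positions
```

In this arrangement a shot-aware selector could be passed as `prefill_selector` while `decode_selector` stays token-level, so eviction pressure from long generations never reaches the retained few-shot examples.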

Load-bearing premise

The assumption that the observed degradation in reasoning tasks is caused specifically by the breaking of indivisible semantic units within few-shot examples.

What would settle it

An experiment in which few-shot examples are kept whole during compression yet reasoning accuracy still falls by the same amount as in standard compression.
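
One way to set up such an experiment is to build two retention sets with exactly the same number of tokens, one that keeps shots whole and one that ignores shot boundaries. The sketch below is an assumption-laden illustration, not a protocol from the paper: the function names, the mean-score ranking, and the use of `(start, end)` tuples for shots are all hypothetical.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # hypothetical shot boundary: tokens [start, end)


def atomic_keep(scores: Sequence[float], shots: Sequence[Span], budget: int) -> List[int]:
    """Greedily keep whole shots, highest mean score first, within the budget."""
    ranked = sorted(
        shots,
        key=lambda s: sum(scores[s[0]:s[1]]) / (s[1] - s[0]),
        reverse=True,
    )
    kept: List[int] = []
    for start, end in ranked:
        if len(kept) + (end - start) <= budget:
            kept.extend(range(start, end))
    return sorted(kept)


def fragmented_keep(scores: Sequence[float], n_tokens: int) -> List[int]:
    """Keep the n_tokens highest-scoring tokens, ignoring shot boundaries."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:n_tokens])


def matched_conditions(
    scores: Sequence[float], shots: Sequence[Span], budget: int
) -> Tuple[List[int], List[int]]:
    """Build both conditions so they retain the same number of tokens."""
    atomic = atomic_keep(scores, shots, budget)
    fragmented = fragmented_keep(scores, len(atomic))
    return atomic, fragmented


def split_shot_count(kept: Sequence[int], shots: Sequence[Span]) -> int:
    """Number of shots that are only partially retained."""
    kept_set = set(kept)
    count = 0
    for start, end in shots:
        hits = sum(1 for i in range(start, end) if i in kept_set)
        if 0 < hits < end - start:
            count += 1
    return count


scores = [0.9, 0.1, 0.1, 0.6, 0.6, 0.6, 0.7, 0.1]
shots = [(0, 3), (3, 6), (6, 8)]
atomic, fragmented = matched_conditions(scores, shots, budget=5)
print(atomic, split_shot_count(atomic, shots))          # [3, 4, 5, 6, 7] 0
print(fragmented, split_shot_count(fragmented, shots))  # [0, 3, 4, 5, 6] 2
```

If reasoning accuracy under the two retention sets were indistinguishable, the indivisible-unit premise would be in doubt; a clear gap in favor of the atomic condition would support it.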

Figures

Figures reproduced from arXiv: 2502.01941 by Bo Li, Hong Chen, Peijie Dong, Xiang Liu, Xiaowen Chu, Xiuze Zhou, Xuming Hu, Zeyu Li, Zhenheng Tang.

Figure 1: KV cache compression methods on long-context and arithmetic benchmarks. (a) Arithmetic
Figure 2: Attention heatmap on different tasks. Models: We conduct experiments on a series of LLMs, including LLaMA-3.1-8B, LLaMA-3.1-8B-Instruct [39], Mistral-7B-Instruct [40], and multi-step reasoning LLM DeepSeek-R1-Distill-Llama-8B [41]. KV Cache Compression Methods: To thoroughly investigate the potential impact on KV cache compression methods, we select the following methods: StreamingLLM [10], SnapKV [22], H2…
Figure 3: Cumulative attention score distribution for Long-Context and Arithmetic. (a) Overall
Figure 4: Sensitivity Analysis of Different Bench
Figure 5: Performance Comparison of KV Cache Compression Methods on KVFundaBench. Results
Figure 7: Performance Comparison of KV Cache Compression Methods on different training dynamics on Arithmetic Reasoning. Plot: accuracy versus compression ratio (%) for different shot numbers (8-shot, 6-shot, 4-shot, 2-shot, 1-shot) against a baseline.
Figure 6: Many-shot scenario on KV cache compression. Observation 3. Prompt Length Vulnerability: Shorter prompts are more vulnerable to KV cache compression. As illustrated in
Figure 9: Performance Comparison of KV Cache Compression Methods Across Tasks with Mistral
Original abstract

While Key-Value (KV) cache compression is essential for efficient LLM inference, current evaluations disproportionately focus on sparse retrieval tasks, potentially masking the degradation of High-Density Reasoning where Chain-of-Thought (CoT) coherence is critical. We introduce KVFundaBench to systematically evaluate this gap, revealing a sharp dichotomy: while retrieval tasks remain robust, reasoning tasks exhibit severe Task-Dependent Degradation under aggressive compression due to disrupted CoT links. Extending our analysis to the DeepSeek-R1 model, we uncover that its specialized attention patterns offer unique insights into the fragility of reasoning chains. Guided by these findings -- specifically the necessity of preserving few-shot examples as indivisible Semantic Units -- we propose ShotKV. This approach explicitly separates prefill and decoding phases to prioritize semantic integrity. Empirical results demonstrate that ShotKV achieves 9%-18% accuracy improvements on long-context generation tasks and effectively generalizes to document QA, all while delivering an 11% latency reduction compared to full cache inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces KVFundaBench to benchmark KV cache compression methods, finding that retrieval tasks remain robust while reasoning tasks exhibit severe Task-Dependent Degradation under aggressive compression due to disrupted CoT links. It extends analysis to DeepSeek-R1's attention patterns and proposes ShotKV, which separates prefill and decoding phases to treat few-shot examples as indivisible Semantic Units, claiming 9%-18% accuracy gains on long-context generation tasks, generalization to document QA, and an 11% latency reduction versus full cache inference.

Significance. If the results hold, the work would highlight an important gap in KV cache compression evaluations that have focused on sparse retrieval and would offer a practical engineering approach (ShotKV) for preserving reasoning coherence. The introduction of a dedicated benchmark for high-density reasoning is a constructive contribution to the field.

major comments (2)
  1. [Findings guiding ShotKV design] The section on findings guiding ShotKV design: the attribution of reasoning degradation specifically to disrupted CoT links (and the consequent necessity of preserving few-shot examples as indivisible Semantic Units) is not supported by direct measurements such as step-wise entailment scores, attention-link tracing, or ablations that hold total retained tokens fixed while varying atomic versus fragmented treatment of few-shot content.
  2. [Empirical results] The empirical results paragraph: the reported 9%-18% accuracy improvements and 11% latency reduction are presented without accompanying information on experimental setup, baselines, statistical controls, dataset construction, or error bars, rendering it impossible to evaluate whether the central performance claims are supported by the data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical grounding in our design rationale and clearer reporting of experimental details. We address each major comment point by point below.

Point-by-point responses
  1. Referee: [Findings guiding ShotKV design] The section on findings guiding ShotKV design: the attribution of reasoning degradation specifically to disrupted CoT links (and the consequent necessity of preserving few-shot examples as indivisible Semantic Units) is not supported by direct measurements such as step-wise entailment scores, attention-link tracing, or ablations that hold total retained tokens fixed while varying atomic versus fragmented treatment of few-shot content.

    Authors: Our attribution rests on the core empirical observation from KVFundaBench that retrieval tasks remain robust under aggressive compression while reasoning tasks exhibit sharp Task-Dependent Degradation, combined with the attention pattern analysis on DeepSeek-R1 that reveals fragility in long reasoning chains. These results motivate treating few-shot examples as indivisible Semantic Units. We did not perform step-wise entailment scoring or explicit attention-link tracing, nor the specific token-fixed ablation contrasting atomic versus fragmented few-shot treatment. We agree this constitutes a gap and will add the requested ablation study (holding total retained tokens fixed) in the revision to directly test the indivisible-unit hypothesis. revision: partial

  2. Referee: [Empirical results] The empirical results paragraph: the reported 9%-18% accuracy improvements and 11% latency reduction are presented without accompanying information on experimental setup, baselines, statistical controls, dataset construction, or error bars, rendering it impossible to evaluate whether the central performance claims are supported by the data.

    Authors: The full manuscript contains an Experiments section describing the setup, baselines (standard KV compression methods and full-cache inference), KVFundaBench dataset construction, and evaluation protocol. However, the presentation of the 9%-18% accuracy and 11% latency figures lacks error bars, statistical significance reporting, and explicit controls. We will revise the results section and add a dedicated experimental details subsection with these elements to make the claims fully reproducible and evaluable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmarking and engineering proposal

Full rationale

The paper presents an empirical benchmark (KVFundaBench) and an engineering method (ShotKV) motivated by observed task-dependent performance patterns under KV compression. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The design choice to treat few-shot examples as indivisible units follows directly from reported accuracy measurements rather than reducing to a definitional or fitted tautology. The work is self-contained against external benchmarks and contains no load-bearing steps that equate outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central proposal rests on the domain assumption that few-shot examples function as indivisible semantic units whose preservation is both necessary and sufficient for the reported gains.

axioms (1)
  • domain assumption: Few-shot examples function as indivisible semantic units whose preservation is required to maintain CoT coherence under compression
    This premise directly guides the design of ShotKV and the interpretation of the benchmark results.

pith-pipeline@v0.9.0 · 5725 in / 1315 out tokens · 71740 ms · 2026-05-23T04:11:13.633412+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

    cs.CE 2026-05 unverdicted novelty 5.0

    LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.

  2. The Pitfalls of KV Cache Compression

    cs.LG 2025-09 conditional novelty 5.0

    KV cache compression causes certain instructions to degrade rapidly and be ignored in multi-instruction prompting, with system prompt leakage worsened by method choice, instruction order, and eviction bias; simple pol...

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 2 Pith papers · 28 internal anchors

  1. [1]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html

  2. [2]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  3. [3]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. ArXiv preprint, abs/2204.02311, 2022. URL https://arxiv.org/abs/2204.02311

  4. [4]

    Unifying language learning paradigms

    Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, and Donald Metzler. Unifying language learning paradigms. ArXiv preprint, abs/2205.05131, 2022. URL https://arxiv.org/abs/2205.05131

  5. [5]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. ArXiv preprint, abs/2302.13971, 2023. URL https://arxiv.org/abs/2302.13971

  6. [6]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023. URL https://arxiv.org/abs/2307.09288

  7. [7]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022

  8. [8]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

  9. [9]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs et al. DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence Transformer models. ArXiv preprint, abs/2309.14509, 2023. URL https://arxiv.org/abs/2309.14509

  10. [10]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NG7sS51zVF

  11. [12]

    URL https://arxiv.org/abs/2306.15595

  12. [13]

    Effective long-context scaling of foundation models

    Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation models. In ...

  13. [14]

    URL https://aclanthology.org/2024

    Association for Computational Linguistics. URL https://aclanthology.org/2024.naacl-long.260

  14. [15]

    Longlora: Efficient fine-tuning of long-context large language models

    Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Longlora: Efficient fine-tuning of long-context large language models. In The Twelfth International Conference on Learning Representations, 2023

  15. [16]

    YaRN: Efficient context window extension of large language models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=wHBfxhZu1u

  16. [17]

    Introducing jamba: Ai21’s groundbreaking ssm-transformer model, 2024

    AI21. Introducing jamba: Ai21’s groundbreaking ssm-transformer model, 2024. URL https://www.ai21.com/blog/announcing-jamba

  17. [18]

    Announcing grok-1.5, 2024

    X.AI. Announcing grok-1.5, 2024. URL https://x.ai/blog/grok-1.5

  18. [19]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. ArXiv preprint, abs/2403.05530, 2024. URL https://arxiv.org/abs/2403.05530

  19. [20]

    Introducing the next generation of claude, 2024

    Anthropic. Introducing the next generation of claude, 2024. URL https://www.anthropic.com/news/claude-3-family

  20. [21]

    Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024

    DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024

  21. [22]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  22. [23]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023

  23. [24]

    SnapKV: LLM Knows What You are Looking for Before Generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. ArXiv preprint, abs/2404.14469, 2024. URL https://arxiv.org/abs/2404.14469

  24. [26]

    URL https://arxiv.org/abs/2310.01801

  25. [27]

    PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024

  26. [28]

    LazyLLM: Dynamic token pruning for efficient long context LLM inference

    Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, and Mahyar Najibi. LazyLLM: Dynamic token pruning for efficient long context LLM inference. In Workshop on Efficient Systems for Foundation Models II @ ICML2024, 2024. URL https://openreview.net/forum?id=gGZD1dsJqZ

  27. [29]

    Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference

    Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference. arXiv preprint arXiv:2405.12532, 2024

  28. [30]

    Keyformer: Kv cache reduction through key tokens selection for efficient generative inference

    Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant Nair, Ilya Soloveychik, and Purushotham Kamath. Keyformer: Kv cache reduction through key tokens selection for efficient generative inference. Proceedings of Machine Learning and Systems, 6:114–127, 2024

  29. [31]

    Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time

    Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024

  30. [33]

    URL https://arxiv.org/abs/2406.10774

  31. [34]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023

  32. [35]

    LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks, 2025. URL https://arxiv.org/abs/2412.15204

  33. [36]

    Needle In A Haystack - pressure testing LLMs

    Gregory Kamradt. Needle In A Haystack - pressure testing LLMs. Github, 2023. URL https://github.com/gkamradt/LLMTest_NeedleInAHaystack/tree/main

  34. [37]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  35. [38]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  36. [39]

    doi: 10.18653/v1/n19-1421

    Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volu...

  37. [40]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  38. [41]

    Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks

    Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=GC4mXVfquq

  39. [42]

    LongGenBench: Long-context generation benchmark

    Xiang Liu, Peijie Dong, Xuming Hu, and Xiaowen Chu. LongGenBench: Long-context generation benchmark. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 865–883, Miami, Florida, USA, November 2024. Association for Computational Linguistics. URL https://aclanthology.or...

  40. [43]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  41. [44]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

  42. [45]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  43. [46]

    Chunkkv: Semantic-preserving kv cache compression for efficient long-context llm inference,

    Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Bo Li, Xuming Hu, and Xiaowen Chu. Chunkkv: Semantic-preserving kv cache compression for efficient long-context llm inference,

  44. [47]

    URL https://arxiv.org/abs/2502.00299

  45. [48]

    A framework for few-shot language model evaluation, 12 2023

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...

  46. [49]

    Many-shot in-context learning

    Rishabh Agarwal, Avi Singh, Lei M Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, et al. Many-shot in-context learning. arXiv preprint arXiv:2404.11018, 2024

  47. [50]

    Efficiently scaling transformer inference

    Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5:606–624, 2023

  48. [51]

    Attention score is not all you need for token importance indicator in kv cache reduction: Value also matters

    Zhiyu Guo, Hidetaka Kamigaito, and Taro Watanabe. Attention score is not all you need for token importance indicator in kv cache reduction: Value also matters. arXiv preprint arXiv:2406.12335, 2024

  49. [52]

    Cam: Cache merging for memory-efficient llms inference

    Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, and Rongrong Ji. Cam: Cache merging for memory-efficient llms inference. In Forty-first International Conference on Machine Learning, 2024.

  50. [53]

    Cacheblend: Fast large language model serving with cached knowledge fusion

    Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. Cacheblend: Fast large language model serving with cached knowledge fusion. arXiv preprint arXiv:2405.16444, 2024

  51. [54]

    Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

    Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. arXiv preprint arXiv:2407.11550, 2024

  52. [55]

    Scope: Optimizing key-value cache compression in long-context generation, 2024

    Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, and Deyu Zhou. Scope: Optimizing key-value cache compression in long-context generation, 2024. URL https://arxiv.org/abs/2412.13649

  53. [56]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

  54. [57]

    Layer-condensed kv cache for efficient inference of large language models, 2024

    Haoyi Wu and Kewei Tu. Layer-condensed kv cache for efficient inference of large language models, 2024. URL https://arxiv.org/abs/2405.10637

  55. [58]

    You only cache once: Decoder-decoder architectures for language models

    Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, and Furu Wei. You only cache once: Decoder-decoder architectures for language models. arXiv preprint arXiv:2405.05254, 2024

  56. [59]

    Reducing transformer key-value cache size with cross-layer attention

    William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, and Jonathan Ragan Kelly. Reducing transformer key-value cache size with cross-layer attention. arXiv preprint arXiv:2405.12981, 2024

  57. [60]

    Minicache: Kv cache compression in depth dimension for large language models

    Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. Minicache: Kv cache compression in depth dimension for large language models. arXiv preprint arXiv:2405.14366, 2024

  58. [61]

    Prompt compression and contrastive conditioning for controllability and toxicity reduction in language models

    David Wingate, Mohammad Shoeybi, and Taylor Sorensen. Prompt compression and contrastive conditioning for controllability and toxicity reduction in language models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5621–5634, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18...

  59. [63]

    Recurrentgpt: Interactive generation of (arbitrarily) long text, 2023

    Wangchunshu Zhou, Yuchen Eleanor Jiang, Peng Cui, Tiannan Wang, Zhenxin Xiao, Yifan Hou, Ryan Cotterell, and Mrinmaya Sachan. Recurrentgpt: Interactive generation of (arbitrarily) long text, 2023

  60. [64]

    Recursively summarizing enables long-term dialogue memory in large language models

    Qingyue Wang, Liang Ding, Yanan Cao, Zhiliang Tian, Shi Wang, Dacheng Tao, and Li Guo. Recursively summarizing enables long-term dialogue memory in large language models. arXiv preprint arXiv:2308.15022, 2023

  61. [65]

    LLMLingua: Compressing prompts for accelerated inference of large language models

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Compressing prompts for accelerated inference of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358–13376, Singapore, December 2023. Association...

  62. [66]

    LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression

    Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long P...

  63. [67]

    Extending context window of large language models via semantic compression

    Weizhi Fei, Xueyan Niu, Pingyi Zhou, Lu Hou, Bo Bai, Lei Deng, and Wei Han. Extending context window of large language models via semantic compression. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics ACL 2024, pages 5169–5181, Bangkok, Thailand and virtual meeting, August 2024. Associatio...

  64. [68]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021

  65. [69]

    MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

    Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319, 2019

  66. [70]

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022

  67. [71]

    KV cache compression, but what must we give in return? A comprehensive benchmark of long context capable approaches

    Jiayi Yuan, Hongyi Liu, Shaochen Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, et al. KV cache compression, but what must we give in return? A comprehensive benchmark of long context capable approaches. arXiv preprint arXiv:2407.01527, 2024

  68. [72]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021

  69. [73]

    ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509, 2022

  70. [74]

    Towards understanding and mitigating social biases in language models

    Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. Towards understanding and mitigating social biases in language models. In International Conference on Machine Learning, pages 6565–6576. PMLR, 2021

  71. [75]

    Promptbench: Towards evaluating the robustness of large language models on adversarial prompts

    Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Zhenqiang Gong, et al. Promptbench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv e-prints, pages arXiv–2306, 2023

  72. [76]

    “Do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. “Do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 1671–1685, 2024

  73. [77]

    Multilingual jailbreak challenges in large language models

    Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474, 2023

  74. [78]

    Should we really edit language models? On the evaluation of edited language models

    Qi Li, Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Xinglin Pan, and Xiaowen Chu. Should we really edit language models? On the evaluation of edited language models. arXiv preprint arXiv:2410.18785, 2024

  75. [79]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  76. [80]

    ∞-bench: Extending long context evaluation beyond 100k tokens

    Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, et al. ∞-bench: Extending long context evaluation beyond 100k tokens. ArXiv preprint, abs/2402.13718, 2024. URL https://arxiv.org/abs/2402.13718

  77. [81]

    ZeroSCROLLS: A zero-shot benchmark for long text understanding

    Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, and Omer Levy. ZeroSCROLLS: A zero-shot benchmark for long text understanding. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7977–7989, Singapore, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023...

  78. [82]

    L-Eval: Instituting standardized evaluation for long context language models

    Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-Eval: Instituting standardized evaluation for long context language models. ArXiv preprint, abs/2307.11088, 2023. URL https://arxiv.org/abs/2307.11088

  79. [83]

    Landmark attention: Random-access infinite context length for transformers

    Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for transformers. ArXiv preprint, abs/2305.16300, 2023. URL https://arxiv.org/abs/2305.16300

  80. [84]

    How long can open-source LLMs truly promise on context length?

    Dacheng Li, Rulin Shao, et al. How long can open-source LLMs truly promise on context length?, 2023. URL https://lmsys.org/blog/2023-06-29-longchat
