Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

Bo Li; Hong Chen; Peijie Dong; Xiang Liu; Xiaowen Chu; Xiuze Zhou; Xuming Hu; Zeyu Li; Zhenheng Tang

arxiv: 2502.01941 · v4 · submitted 2025-02-04 · 💻 cs.CL · cs.AI

Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu

Pith reviewed 2026-05-23 04:11 UTC · model grok-4.3

keywords: KV cache compression · chain-of-thought reasoning · semantic integrity · few-shot examples · long-context generation · LLM inference · benchmark evaluation

The pith

KV cache compression breaks chain-of-thought reasoning unless few-shot examples are preserved as indivisible units.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current KV cache compression methods perform well on retrieval tasks but cause sharp drops on reasoning tasks that rely on coherent chain-of-thought steps. The authors introduce a benchmark that isolates this difference and links the failures to compression that splits apart the semantic links inside few-shot examples. Guided by that observation they introduce ShotKV, which keeps those examples whole by handling the initial prompt loading and the later token generation in separate phases. The result is higher accuracy on long-context generation and document QA together with lower latency than full-cache inference.

Core claim

The central claim is that aggressive KV cache compression produces severe task-dependent degradation on high-density reasoning because it disrupts the coherence of chain-of-thought links inside few-shot examples, which must therefore be treated as indivisible semantic units; ShotKV restores performance by explicitly separating the prefill phase from the decoding phase so that semantic integrity is maintained throughout.

What carries the argument

ShotKV, a KV cache method that separates the prefill phase from the decoding phase to keep few-shot examples intact as indivisible semantic units.
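To make the idea of an indivisible semantic unit concrete, here is a minimal sketch of shot-level selection during prefill. It is an illustration under stated assumptions, not the authors' implementation: the function name `select_whole_shots`, the mean-score ranking rule, and the greedy fill are all hypothetical, and the decoding phase is not modelled at all.

```python
from typing import List, Sequence, Tuple


def _mean(xs: Sequence[float]) -> float:
    """Arithmetic mean, defined as 0.0 for an empty span."""
    return sum(xs) / len(xs) if xs else 0.0


def select_whole_shots(
    shot_spans: Sequence[Tuple[int, int]],
    token_scores: Sequence[float],
    budget: int,
) -> List[int]:
    """Return sorted prompt-token indices to keep, choosing whole shots only.

    shot_spans:   half-open (start, end) token ranges, one per few-shot example.
    token_scores: one importance score per prompt token (assumed to be
                  something like accumulated attention mass).
    budget:       maximum number of prompt tokens to retain.
    """
    # Hypothetical ranking rule: highest mean score first, earlier shot on ties.
    ranked = sorted(
        range(len(shot_spans)),
        key=lambda i: (-_mean(token_scores[shot_spans[i][0]:shot_spans[i][1]]), i),
    )
    kept: List[int] = []
    used = 0
    for i in ranked:
        start, end = shot_spans[i]
        length = end - start
        # A shot is either kept in full or dropped in full, never split.
        if used + length <= budget:
            kept.extend(range(start, end))
            used += length
    return sorted(kept)


if __name__ == "__main__":
    spans = [(0, 3), (3, 6), (6, 10)]
    scores = [0.1, 0.1, 0.1, 0.9, 0.8, 0.7, 0.5, 0.5, 0.5, 0.5]
    print(select_whole_shots(spans, scores, budget=7))
```

The point of the sketch is the all-or-nothing step inside the loop: a token-level method would be free to keep the high-scoring half of a worked example and evict the rest, which is the behaviour the paper's core claim blames for broken reasoning chains.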

If this is right

Retrieval tasks remain robust under aggressive compression while reasoning tasks exhibit severe degradation due to broken CoT links.
ShotKV produces 9-18% accuracy gains on long-context generation tasks.
The gains generalize to document QA tasks.
ShotKV reduces latency by 11% relative to full-cache inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The phase-separation tactic could be applied to other KV compression algorithms that currently mix prefill and decode steps.
Attention patterns observed in DeepSeek-R1 suggest that model-specific fragility of reasoning chains may require tailored semantic-unit rules.
Success on document QA indicates the method may extend to other multi-step language tasks that depend on long prompt coherence.

Load-bearing premise

The assumption that the observed degradation in reasoning tasks is caused specifically by the breaking of indivisible semantic units within few-shot examples.

What would settle it

An experiment in which few-shot examples are kept whole during compression yet reasoning accuracy still falls by the same amount as in standard compression.
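One way such an experiment could be set up is sketched below: build two retention conditions that keep exactly the same number of prompt tokens, one that keeps whole shots and one that keeps the top-scoring tokens individually, then compare downstream accuracy. Every name here (`matched_budget_conditions`, `intact_shots`) and the scoring rule are assumptions made for illustration; the paper's actual protocol may differ.

```python
from typing import Dict, List, Sequence, Tuple

Span = Tuple[int, int]  # half-open (start, end) token range of one shot


def intact_shots(shot_spans: Sequence[Span], kept: Sequence[int]) -> List[int]:
    """Indices of shots whose tokens were all retained."""
    kept_set = set(kept)
    return [
        i for i, (s, e) in enumerate(shot_spans)
        if all(t in kept_set for t in range(s, e))
    ]


def matched_budget_conditions(
    shot_spans: Sequence[Span],
    token_scores: Sequence[float],
    budget: int,
) -> Dict[str, List[int]]:
    """Build 'atomic' and 'fragmented' keep-sets of identical size."""
    # Atomic condition: greedily keep whole shots, best mean score first.
    def mean_score(i: int) -> float:
        s, e = shot_spans[i]
        return sum(token_scores[s:e]) / (e - s)

    atomic: List[int] = []
    for i in sorted(range(len(shot_spans)), key=lambda i: (-mean_score(i), i)):
        s, e = shot_spans[i]
        if len(atomic) + (e - s) <= budget:
            atomic.extend(range(s, e))

    # Fragmented condition: keep the same NUMBER of tokens, but pick them
    # one by one by score, ignoring shot boundaries.
    n = len(atomic)
    by_score = sorted(range(len(token_scores)), key=lambda t: (-token_scores[t], t))
    fragmented = by_score[:n]

    return {"atomic": sorted(atomic), "fragmented": sorted(fragmented)}


if __name__ == "__main__":
    spans = [(0, 3), (3, 6), (6, 10)]
    scores = [0.9, 0.1, 0.1, 0.8, 0.1, 0.1, 0.7, 0.6, 0.1, 0.1]
    cond = matched_budget_conditions(spans, scores, budget=6)
    for name, kept in cond.items():
        print(name, kept, "intact shots:", intact_shots(spans, kept))
```

Because both keep-sets have the same size, any accuracy gap between the two conditions cannot be explained by one of them simply retaining more context. If accuracy fell equally in both, the indivisible-unit premise would be in trouble.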

Figures

Figures reproduced from arXiv: 2502.01941 by Bo Li, Hong Chen, Peijie Dong, Xiang Liu, Xiaowen Chu, Xiuze Zhou, Xuming Hu, Zeyu Li, Zhenheng Tang.

**Figure 2.** Attention heatmap on different tasks. Models: We conduct experiments on a series of LLMs, including LLaMA-3.1-8B, LLaMA-3.1-8B-Instruct [39], Mistral-7B-Instruct [40], and the multi-step reasoning LLM DeepSeek-R1-Distill-Llama-8B [41]. KV cache compression methods: To thoroughly investigate the potential impact on KV cache compression methods, we select the following methods: StreamingLLM [10], SnapKV [22], H2…

**Figure 3.** Cumulative attention score distribution for Long-Context and Arithmetic. (a) Overall

**Figure 4.** Sensitivity Analysis of Different Bench

**Figure 5.** Performance Comparison of KV Cache Compression Methods on KVFundaBench. Results

**Figure 7.** Performance Comparison of KV Cache Compression Methods on different training dynamics on Arithmetic Reasoning. Plot labels: Baseline; Compression Ratio (%); Accuracy; Different Shot Numbers (8-shot, 6-shot, 4-shot, 2-shot, 1-shot).

**Figure 6.** Many-shot scenario on KV cache compression. Observation 3. Prompt Length Vulnerability: Shorter prompts are more vulnerable to KV cache compression.

**Figure 9.** Performance Comparison of KV Cache Compression Methods Across Tasks with Mistral

Original abstract

While Key-Value (KV) cache compression is essential for efficient LLM inference, current evaluations disproportionately focus on sparse retrieval tasks, potentially masking the degradation of High-Density Reasoning where Chain-of-Thought (CoT) coherence is critical. We introduce KVFundaBench to systematically evaluate this gap, revealing a sharp dichotomy: while retrieval tasks remain robust, reasoning tasks exhibit severe Task-Dependent Degradation under aggressive compression due to disrupted CoT links. Extending our analysis to the DeepSeek-R1 model, we uncover that its specialized attention patterns offer unique insights into the fragility of reasoning chains. Guided by these findings -- specifically the necessity of preserving few-shot examples as indivisible Semantic Units -- we propose ShotKV. This approach explicitly separates prefill and decoding phases to prioritize semantic integrity. Empirical results demonstrate that ShotKV achieves 9%-18% accuracy improvements on long-context generation tasks and effectively generalizes to document QA, all while delivering an 11% latency reduction compared to full cache inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags a real evaluation gap in KV compression for reasoning tasks and proposes a phase-separated method, but the causal attribution to CoT disruption lacks isolating evidence.

The main thing to know is that this work shows retrieval tasks hold up under aggressive KV cache compression while reasoning tasks degrade sharply, then introduces KVFundaBench and ShotKV to address it by treating few-shot examples as indivisible units and separating prefill from decoding. That split and the semantic-unit rule are the concrete new pieces not already in the prior compression literature referenced in the abstract.

Referee Report

2 major / 0 minor

Summary. The paper introduces KVFundaBench to benchmark KV cache compression methods, finding that retrieval tasks remain robust while reasoning tasks exhibit severe Task-Dependent Degradation under aggressive compression due to disrupted CoT links. It extends analysis to DeepSeek-R1's attention patterns and proposes ShotKV, which separates prefill and decoding phases to treat few-shot examples as indivisible Semantic Units, claiming 9%-18% accuracy gains on long-context generation tasks, generalization to document QA, and an 11% latency reduction versus full cache inference.

Significance. If the results hold, the work would highlight an important gap in KV cache compression evaluations that have focused on sparse retrieval and would offer a practical engineering approach (ShotKV) for preserving reasoning coherence. The introduction of a dedicated benchmark for high-density reasoning is a constructive contribution to the field.

major comments (2)

[Findings guiding ShotKV design] The section on findings guiding ShotKV design: the attribution of reasoning degradation specifically to disrupted CoT links (and the consequent necessity of preserving few-shot examples as indivisible Semantic Units) is not supported by direct measurements such as step-wise entailment scores, attention-link tracing, or ablations that hold total retained tokens fixed while varying atomic versus fragmented treatment of few-shot content.
[Empirical results] The empirical results paragraph: the reported 9%-18% accuracy improvements and 11% latency reduction are presented without accompanying information on experimental setup, baselines, statistical controls, dataset construction, or error bars, rendering it impossible to evaluate whether the central performance claims are supported by the data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical grounding in our design rationale and clearer reporting of experimental details. We address each major comment point by point below.

Referee: [Findings guiding ShotKV design] The section on findings guiding ShotKV design: the attribution of reasoning degradation specifically to disrupted CoT links (and the consequent necessity of preserving few-shot examples as indivisible Semantic Units) is not supported by direct measurements such as step-wise entailment scores, attention-link tracing, or ablations that hold total retained tokens fixed while varying atomic versus fragmented treatment of few-shot content.

Authors: Our attribution rests on the core empirical observation from KVFundaBench that retrieval tasks remain robust under aggressive compression while reasoning tasks exhibit sharp Task-Dependent Degradation, combined with the attention pattern analysis on DeepSeek-R1 that reveals fragility in long reasoning chains. These results motivate treating few-shot examples as indivisible Semantic Units. We did not perform step-wise entailment scoring or explicit attention-link tracing, nor the specific token-fixed ablation contrasting atomic versus fragmented few-shot treatment. We agree this constitutes a gap and will add the requested ablation study (holding total retained tokens fixed) in the revision to directly test the indivisible-unit hypothesis.
Revision: partial
Referee: [Empirical results] The empirical results paragraph: the reported 9%-18% accuracy improvements and 11% latency reduction are presented without accompanying information on experimental setup, baselines, statistical controls, dataset construction, or error bars, rendering it impossible to evaluate whether the central performance claims are supported by the data.

Authors: The full manuscript contains an Experiments section describing the setup, baselines (standard KV compression methods and full-cache inference), KVFundaBench dataset construction, and evaluation protocol. However, the presentation of the 9%-18% accuracy and 11% latency figures lacks error bars, statistical significance reporting, and explicit controls. We will revise the results section and add a dedicated experimental details subsection with these elements to make the claims fully reproducible and evaluable.
Revision: yes
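
For readers wondering what the missing error bars might look like in practice, a paired bootstrap over per-example correctness is one common option. The sketch below is a generic illustration, not the authors' planned analysis; the function name `paired_bootstrap_ci`, its defaults, and the toy data in the usage lines are all hypothetical.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Accuracy difference (A minus B) with a percentile bootstrap interval.

    correct_a / correct_b hold 1 (correct) or 0 (wrong) for the SAME test
    examples under two methods, e.g. a compressed cache vs. the full cache.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and the same length")

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()

    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    point = (sum(correct_a) - sum(correct_b)) / n
    return point, lower, upper


if __name__ == "__main__":
    # Toy data only: 100 examples, method A right on 80, method B on 65.
    a = [1] * 80 + [0] * 20
    b = [1] * 65 + [0] * 35
    print(paired_bootstrap_ci(a, b))
```

Pairing matters here because both methods are scored on the same prompts; resampling the examples jointly keeps that correlation, which usually gives a tighter and more honest interval than treating the two accuracy figures as independent.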

Circularity Check

0 steps flagged

No significant circularity; empirical benchmarking and engineering proposal

The paper presents an empirical benchmark (KVFundaBench) and an engineering method (ShotKV) motivated by observed task-dependent performance patterns under KV compression. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The design choice to treat few-shot examples as indivisible units follows directly from reported accuracy measurements rather than reducing to a definitional or fitted tautology. The work is self-contained against external benchmarks and contains no load-bearing steps that equate outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central proposal rests on the domain assumption that few-shot examples function as indivisible semantic units whose preservation is both necessary and sufficient for the reported gains.

axioms (1)

Domain assumption: Few-shot examples function as indivisible semantic units whose preservation is required to maintain CoT coherence under compression.
This premise directly guides the design of ShotKV and the interpretation of the benchmark results.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
cs.CE · 2026-05 · unverdicted · novelty 5.0

LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.

The Pitfalls of KV Cache Compression
cs.LG · 2025-09 · conditional · novelty 5.0

KV cache compression causes certain instructions to degrade rapidly and be ignored in multi-instruction prompting, with system prompt leakage worsened by method choice, instruction order, and eviction bias; simple pol...

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 2 Pith papers · 28 internal anchors

[1] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html

[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

[3] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. ArXiv preprint, abs/2204.02311, 2022. URL https://arxiv.org/abs/2204.02311

[4] Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, and Donald Metzler. Unifying language learning paradigms. ArXiv preprint, abs/2205.05131, 2022. URL https://arxiv.org/abs/2205.05131

[5] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. ArXiv preprint, abs/2302.13971, 2023. URL https://arxiv.org/abs/2302.13971

[6] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023. URL https://arxiv.org/abs/2307.09288

[7] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022

[8] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

[9] Sam Ade Jacobs et al. DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence Transformer models. ArXiv preprint, abs/2309.14509, 2023. URL https://arxiv.org/abs/2309.14509

[10] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NG7sS51zVF

[12] URL https://arxiv.org/abs/2306.15595

[13] Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation models. In ...

[14] Association for Computational Linguistics. URL https://aclanthology.org/2024.naacl-long.260

[15] Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. LongLoRA: Efficient fine-tuning of long-context large language models. In The Twelfth International Conference on Learning Representations, 2023

[16] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=wHBfxhZu1u

[17] AI21. Introducing Jamba: AI21's groundbreaking SSM-Transformer model, 2024. URL https://www.ai21.com/blog/announcing-jamba

[18] X.AI. Announcing Grok-1.5, 2024. URL https://x.ai/blog/grok-1.5

[19] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. ArXiv preprint, abs/2403.05530, 2024. URL https://arxiv.org/abs/2403.05530

[20] Anthropic. Introducing the next generation of Claude, 2024. URL https://www.anthropic.com/news/claude-3-family

[21] DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model, 2024

[22] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024

[23] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023

[24] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. ArXiv preprint, abs/2404.14469, 2024. URL https://arxiv.org/abs/2404.14469

[26] URL https://arxiv.org/abs/2310.01801

[27] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, et al. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024

[28] Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, and Mahyar Najibi. LazyLLM: Dynamic token pruning for efficient long context LLM inference. In Workshop on Efficient Systems for Foundation Models II @ ICML2024, 2024. URL https://openreview.net/forum?id=gGZD1dsJqZ

[29] Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. PyramidInfer: Pyramid KV cache compression for high-throughput LLM inference. arXiv preprint arXiv:2405.12532, 2024

[30] Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant Nair, Ilya Soloveychik, and Purushotham Kamath. Keyformer: KV cache reduction through key tokens selection for efficient generative inference. Proceedings of Machine Learning and Systems, 6:114–127, 2024

[31] Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024

[33] URL https://arxiv.org/abs/2406.10774

[34] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023

[35] Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks, 2025. URL https://arxiv.org/abs/2412.15204

[36] Gregory Kamradt. Needle In A Haystack - pressure testing LLMs. Github, 2023. URL https://github.com/gkamradt/LLMTest_NeedleInAHaystack/tree/main

[37] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

[38] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

[39] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volu... doi: 10.18653/v1/n19-1421

[40] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

[41] Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. JailbreakV: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=GC4mXVfquq

[42] Xiang Liu, Peijie Dong, Xuming Hu, and Xiaowen Chu. LongGenBench: Long-context generation benchmark. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 865–883, Miami, Florida, USA, November 2024. Association for Computational Linguistics. URL https://aclanthology.or...

[43] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[44] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023

[45] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

[46] Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Bo Li, Xuming Hu, and Xiaowen Chu. ChunkKV: Semantic-preserving KV cache compression for efficient long-context LLM inference,

[47] URL https://arxiv.org/abs/2502.00299

[48] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 12 2023

[49] Rishabh Agarwal, Avi Singh, Lei M Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, et al. Many-shot in-context learning. arXiv preprint arXiv:2404.11018, 2024

[50] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5:606–624, 2023

[51] Zhiyu Guo, Hidetaka Kamigaito, and Taro Watanabe. Attention score is not all you need for token importance indicator in KV cache reduction: Value also matters. arXiv preprint arXiv:2406.12335, 2024

[52] Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, and Rongrong Ji. CaM: Cache merging for memory-efficient LLMs inference. In Forty-first International Conference on Machine Learning, 2024

[53] Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. CacheBlend: Fast large language model serving with cached knowledge fusion. arXiv preprint arXiv:2405.16444, 2024

[54] Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference. arXiv preprint arXiv:2407.11550, 2024

[55]

Scope: Optimizing key-value cache compression in long-context generation, 2024

Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, and Deyu Zhou. Scope: Optimizing key-value cache compression in long-context generation, 2024. URL https: //arxiv.org/abs/2412.13649

work page arXiv 2024
[56]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Sch¨arli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big- bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[57]

Layer-condensed kv cache for efficient inference of large language models, 2024

Haoyi Wu and Kewei Tu. Layer-condensed kv cache for efficient inference of large language models, 2024. URL https://arxiv.org/abs/2405.10637

work page arXiv 2024
[58]

You only cache once: Decoder-decoder architectures for language models

Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, and Furu Wei. You only cache once: Decoder-decoder architectures for language models. arXiv preprint arXiv:2405.05254, 2024

work page arXiv 2024
[59]

Reducing transformer key-value cache size with cross-layer attention

William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, and Jonathan Ragan Kelly. Reducing transformer key-value cache size with cross-layer attention. arXiv preprint arXiv:2405.12981, 2024

[60]

MiniCache: Kv cache compression in depth dimension for large language models

Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. MiniCache: Kv cache compression in depth dimension for large language models. arXiv preprint arXiv:2405.14366, 2024

[61]

Prompt compression and contrastive conditioning for controllability and toxicity reduction in language models

David Wingate, Mohammad Shoeybi, and Taylor Sorensen. Prompt compression and contrastive conditioning for controllability and toxicity reduction in language models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5621–5634, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18...

[63]

Recurrentgpt: Interactive generation of (arbitrarily) long text, 2023

Wangchunshu Zhou, Yuchen Eleanor Jiang, Peng Cui, Tiannan Wang, Zhenxin Xiao, Yifan Hou, Ryan Cotterell, and Mrinmaya Sachan. Recurrentgpt: Interactive generation of (arbitrarily) long text, 2023

[64]

Recursively summarizing enables long-term dialogue memory in large language models

Qingyue Wang, Liang Ding, Yanan Cao, Zhiliang Tian, Shi Wang, Dacheng Tao, and Li Guo. Recursively summarizing enables long-term dialogue memory in large language models. arXiv preprint arXiv:2308.15022, 2023

[65]

LLMLingua: Compressing prompts for accelerated inference of large language models

Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Compressing prompts for accelerated inference of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358–13376, Singapore, December 2023. Association...

doi: 10.18653/v1/2023.emnlp-main.825
[66]

LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long P...

[67]

Extending context window of large language models via semantic compression

Weizhi Fei, Xueyan Niu, Pingyi Zhou, Lu Hou, Bo Bai, Lei Deng, and Wei Han. Extending context window of large language models via semantic compression. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics ACL 2024, pages 5169–5181, Bangkok, Thailand and virtual meeting, August 2024. Associatio...

doi: 10.18653/v1/2024.findings-acl.306
[68]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

[69]

MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319, 2019

[70]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022

[71]

Kv cache compression, but what must we give in return? a comprehensive benchmark of long context capable approaches

Jiayi Yuan, Hongyi Liu, Shaochen Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, et al. Kv cache compression, but what must we give in return? a comprehensive benchmark of long context capable approaches. arXiv preprint arXiv:2407.01527, 2024

[72]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021

[73]

Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509, 2022

[74]

Towards understanding and mitigating social biases in language models

Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. Towards understanding and mitigating social biases in language models. In International Conference on Machine Learning, pages 6565–6576. PMLR, 2021

[75]

Promptbench: Towards evaluating the robustness of large language models on adversarial prompts

Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Zhenqiang Gong, et al. Promptbench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv e-prints, pages arXiv–2306, 2023

[76]

“Do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. “Do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 1671–1685, 2024

[77]

Multilingual jailbreak challenges in large language models

Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474, 2023

[78]

Should we really edit language models? on the evaluation of edited language models

Qi Li, Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Xinglin Pan, and Xiaowen Chu. Should we really edit language models? on the evaluation of edited language models. arXiv preprint arXiv:2410.18785, 2024

[79]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

[80]

∞-bench: Extending long context evaluation beyond 100k tokens

Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, et al. ∞-bench: Extending long context evaluation beyond 100k tokens. ArXiv preprint, abs/2402.13718, 2024. URL https://arxiv.org/abs/2402.13718

[81]

ZeroSCROLLS: A zero-shot benchmark for long text understanding

Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, and Omer Levy. ZeroSCROLLS: A zero-shot benchmark for long text understanding. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7977–7989, Singapore, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023...

[82]

L-eval: Instituting standardized evaluation for long context language models

Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-eval: Instituting standardized evaluation for long context language models. ArXiv preprint, abs/2307.11088, 2023. URL https://arxiv.org/abs/2307.11088

[83]

Landmark attention: Random-access infinite context length for transformers

Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for transformers. ArXiv preprint, abs/2305.16300, 2023. URL https://arxiv.org/abs/2305.16300

[84]

How long can open-source LLMs truly promise on context length?, 2023

Dacheng Li, Rulin Shao, et al. How long can open-source LLMs truly promise on context length?, 2023. URL https://lmsys.org/blog/2023-06-29-longchat


Showing first 80 references.

[50] [53]

Cacheblend: Fast large language model serving with cached knowledge fusion

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. Cacheblend: Fast large language model serving with cached knowledge fusion. arXiv preprint arXiv:2405.16444, 2024

work page arXiv 2024

[51] [54]

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. arXiv preprint arXiv:2407.11550, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[52] [55]

Scope: Optimizing key-value cache compression in long-context generation, 2024

Jialong Wu, Zhenglin Wang, Linhai Zhang, Yilong Lai, Yulan He, and Deyu Zhou. Scope: Optimizing key-value cache compression in long-context generation, 2024. URL https: //arxiv.org/abs/2412.13649

work page arXiv 2024

[53] [56]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Sch¨arli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big- bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[54] [57]

Layer-condensed kv cache for efficient inference of large language models, 2024

Haoyi Wu and Kewei Tu. Layer-condensed kv cache for efficient inference of large language models, 2024. URL https://arxiv.org/abs/2405.10637

work page arXiv 2024

[55] [58]

You only cache once: Decoder-decoder architectures for language models

Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, and Furu Wei. You only cache once: Decoder-decoder architectures for language models. arXiv preprint arXiv:2405.05254, 2024

work page arXiv 2024

[56] [59]

Reducing transformer key-value cache size with cross-layer attention

William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, and Jonathan Ragan Kelly. Reducing transformer key-value cache size with cross-layer attention. arXiv preprint arXiv:2405.12981, 2024

work page arXiv 2024

[57] [60]

Mini- cache: Kv cache compression in depth dimension for large language models

Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. Mini- cache: Kv cache compression in depth dimension for large language models. arXiv preprint arXiv:2405.14366, 2024

work page arXiv 2024

[58] [61]

Prompt compression and contrastive conditioning for controllability and toxicity reduction in language models

David Wingate, Mohammad Shoeybi, and Taylor Sorensen. Prompt compression and contrastive conditioning for controllability and toxicity reduction in language models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5621–5634, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18...

work page 2022

[59] [63]

Recurrentgpt: Interactive generation of (arbitrarily) long text, 2023

Wangchunshu Zhou, Yuchen Eleanor Jiang, Peng Cui, Tiannan Wang, Zhenxin Xiao, Yifan Hou, Ryan Cotterell, and Mrinmaya Sachan. Recurrentgpt: Interactive generation of (arbitrarily) long text, 2023

work page 2023

[60] [64]

Recursively summarizing enables long-term dialogue memory in large language models

Qingyue Wang, Liang Ding, Yanan Cao, Zhiliang Tian, Shi Wang, Dacheng Tao, and Li Guo. Recursively summarizing enables long-term dialogue memory in large language models. arXiv preprint arXiv:2308.15022, 2023

work page arXiv 2023

[61] [65]

LLMLingua: Com- pressing prompts for accelerated inference of large language models

Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Com- pressing prompts for accelerated inference of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Meth- ods in Natural Language Processing, pages 13358–13376, Singapore, December 2023. As- sociation...

work page doi:10.18653/v1/2023.emnlp-main.825 2023

[62] [66]

LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression

Huiqiang Jiang, Qianhui Wu, , Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long P...

work page 2024

[63] [67]

Extending context window of large language models via semantic compression

Weizhi Fei, Xueyan Niu, Pingyi Zhou, Lu Hou, Bo Bai, Lei Deng, and Wei Han. Extending context window of large language models via semantic compression. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics ACL 2024, pages 5169–5181, Bangkok, Thailand and virtual meeting, August 2024. Associatio...

work page doi:10.18653/v1/2024.findings-acl.306 2024

[64] [68]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[65] [69]

MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv preprint arXiv:1905.13319, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905

[66] [70]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adri `a Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[67] [71]

Kv cache compression, but what must we give in return? a compre- hensive benchmark of long context capable approaches

Jiayi Yuan, Hongyi Liu, Shaochen Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, et al. Kv cache compression, but what must we give in return? a comprehensive benchmark of long context capable approaches. arXiv preprint arXiv:2407.01527, 2024

work page arXiv 2024

[68] [72]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021

[69] [73]

Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509, 2022

[70] [74]

Towards understanding and mitigating social biases in language models

Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. Towards understanding and mitigating social biases in language models. In International Conference on Machine Learning, pages 6565–6576. PMLR, 2021

[71] [75]

Promptbench: Towards evaluating the robustness of large language models on adversarial prompts

Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Zhenqiang Gong, et al. Promptbench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv e-prints, pages arXiv–2306, 2023

[72] [76]

“Do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models

Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. “Do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 1671–1685, 2024

[73] [77]

Multilingual jailbreak challenges in large language models

Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474, 2023

[74] [78]

Should we really edit language models? On the evaluation of edited language models

Qi Li, Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Xinglin Pan, and Xiaowen Chu. Should we really edit language models? On the evaluation of edited language models. arXiv preprint arXiv:2410.18785, 2024

[75] [79]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

[76] [80]

∞-bench: Extending long context evaluation beyond 100k tokens

Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, et al. ∞-bench: Extending long context evaluation beyond 100k tokens. ArXiv preprint, abs/2402.13718, 2024. URL https://arxiv.org/abs/2402.13718

[77] [81]

ZeroSCROLLS: A zero-shot benchmark for long text understanding

Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, and Omer Levy. ZeroSCROLLS: A zero-shot benchmark for long text understanding. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7977–7989, Singapore, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023...

[78] [82]

L-Eval: Instituting standardized evaluation for long context language models

Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-eval: Instituting standardized evaluation for long context language models. ArXiv preprint, abs/2307.11088, 2023. URL https://arxiv.org/abs/2307.11088

[79] [83]

Landmark attention: Random-access infinite context length for transformers

Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for transformers. ArXiv preprint, abs/2305.16300, 2023. URL https://arxiv.org/abs/2305.16300

[80] [84]

How long can open-source LLMs truly promise on context length?

Dacheng Li, Rulin Shao, et al. How long can open-source LLMs truly promise on context length?, 2023. URL https://lmsys.org/blog/2023-06-29-longchat
