Recognition: 2 theorem links
Training Transformers for KV Cache Compressibility
Pith reviewed 2026-05-13 05:56 UTC · model grok-4.3
The pith
Training transformers with KV masking produces representations that compress far more effectively after the fact.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Almost any sequence-to-vector function admits both highly compressible and inherently non-compressible transformer implementations, and a train-time KV sparsification policy can steer the model toward the compressible regime without degrading its core capabilities.
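Read formally (a reconstruction from the theorem fragments that survive in this page's extraction of the paper's appendix; M is the trained transformer, M_{C,a} denotes running M with compression policy C applied to the cached prefix a, r(n) is the policy's slot budget for a length-n prefix, and the paper's exact statement may differ):
\[ \text{Approximation: } \forall\, a=(a_1,\dots,a_n)\in A^n,\ n\le N:\quad \lVert f(a)-M(a)\rVert<\varepsilon. \]
\[ \text{Compressible realization: } \exists\, C \text{ with } r(n)\equiv 1 \text{ such that } \lVert M([a,b])-M_{C,a}(b)\rVert<\varepsilon \ \text{ for all } a\in A^n,\ b\in A^k,\ n+k\le N. \]
\[ \text{Non-compressible realization: } \forall\, C \text{ with } r(n)<n,\ \exists\, b\in A^k,\ n+k\le N:\quad \lVert M([a,b])-M_{C,a}(b)\rVert>c_0, \]
where the appendix writes the lower bound as a fixed constant (denoted C there; written c_0 here only to avoid clashing with the policy symbol).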
What carries the argument
KV-Compression Aware Training (KV-CAT), a continued pretraining procedure that applies a KV masking policy during training to force the model to rely on fewer KV slots.
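The page carries no code, so the following is a minimal NumPy sketch of what a train-time KV sparsification policy can look like inside one attention step: a random subset of KV slots is hidden, so the model must route information through the surviving slots. The uniform-random mask and the keep_prob knob are assumptions for illustration; the paper's actual policy may be learned or structured.

```python
import numpy as np

def masked_attention(q, K, V, keep_prob=0.5, rng=None):
    """Single-head attention over a KV cache with a random train-time slot mask.

    q: (d,) query; K, V: (n, d) cached keys and values.
    keep_prob plays the role of the KV-CAT masking-rate knob (a free
    parameter of the method); the real policy may differ.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = K.shape
    keep = rng.random(n) < keep_prob          # which KV slots stay visible this step
    keep[-1] = True                           # guarantee at least one visible slot
    scores = K @ q / np.sqrt(d)               # (n,) attention logits
    scores = np.where(keep, scores, -np.inf)  # masked slots receive zero attention
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                        # (d,) output computed from surviving slots only

# Toy usage: 8 cached slots, 16-dim head, roughly half the slots masked.
rng = np.random.default_rng(1)
q = rng.normal(size=16)
K = rng.normal(size=(8, 16))
V = rng.normal(size=(8, 16))
print(masked_attention(q, K, V, keep_prob=0.5, rng=rng).shape)  # (16,)
```

Applied throughout continued pretraining, this kind of masking is what the paper credits with pushing representations toward the compressible regime.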
If this is right
- Existing KV compression algorithms achieve better quality for any given memory budget on retrieval and long-context QA.
- Perplexity on tasks that continue from a compressed prefix improves relative to models trained without the masking policy.
- The same model weights remain competitive on standard short-context benchmarks while becoming easier to compress at inference time.
Where Pith is reading between the lines
- The same masking idea could be applied during fine-tuning rather than only continued pretraining to adapt existing models.
- If compressibility can be trained in, it may become a standard training objective alongside next-token prediction for any long-context architecture.
- The proof that both compressible and non-compressible realizations exist implies that architecture search or regularization choices can inadvertently lock models into the harder-to-compress regime.
Load-bearing premise
Masking KV slots at training time will cause compressible yet still useful representations to emerge without requiring heavy hyperparameter search or harming the model's original performance.
What would settle it
If, after running KV-CAT, standard post-hoc KV compression methods show no improvement in the quality-versus-budget curve, or the model shows measurable accuracy drops on long-context retrieval and QA benchmarks, the central claim fails.
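A hedged sketch of how that quality-versus-budget comparison might be run; the compress_kv and evaluate callables are placeholders for a post-hoc compressor and a benchmark harness, not APIs from the paper.

```python
def quality_budget_curve(model, eval_set, compress_kv, evaluate,
                         budgets=(1.0, 0.5, 0.25, 0.1)):
    """Sweep KV budgets and record task quality at each one.

    compress_kv(model, budget) is assumed to return a model whose cached
    prefix is compressed to `budget` times its original size; evaluate
    returns a scalar quality metric. budget=1.0 is the uncompressed control.
    """
    return [(b, evaluate(compress_kv(model, b), eval_set)) for b in budgets]

def dominates(kvcat_curve, baseline_curve):
    """True only if the KV-CAT model is at least as good at every matched budget."""
    return all(q1 >= q0 for (_, q1), (_, q0) in zip(kvcat_curve, baseline_curve))
```

The claim is settled by comparing curves: if the KV-CAT model's curve does not dominate the baseline's at matched budgets, or its budget=1.0 point degrades, the training intervention did not buy compressibility for free.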
read the original abstract
Long-context language modeling is increasingly constrained by the Key-Value (KV) cache, whose memory and decode-time access costs scale linearly with the prefix length. This bottleneck has motivated a range of context-compression methods, from token-level summarization to recent optimization-based KV compression methods. These post-hoc methods operate on the KV cache of a fixed pretrained model, so their effectiveness is fundamentally limited by how well the model's internal representations can be compressed. In this work, we formalize the notion of KV compressibility and show that it is a property of the learned representations, rather than of the context alone. We prove that almost any sequence-to-vector function admits both highly compressible and inherently non-compressible transformer implementations, highlighting the need to guide transformers toward compressible representations during training. Motivated by this, we propose KV-Compression Aware Training (KV-CAT), a continued pretraining procedure that incentivizes the emergence of compressible representations. We introduce a train-time KV sparsification policy that masks KV slots during training. This forces the model to use fewer KV slots and encourages it to learn representations amenable to post-hoc compression. Empirically, we show that KV-CAT improves the quality-budget tradeoff of downstream compression methods across retrieval, long-context question answering, and perplexity-based evaluation of compressed-prefix continuation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that KV cache compressibility is a property of learned transformer representations rather than the input context. It proves that nearly any sequence-to-vector function admits both highly compressible and inherently non-compressible transformer implementations. Motivated by this, it introduces KV-Compression Aware Training (KV-CAT), a continued pretraining procedure that applies a train-time KV slot masking policy to encourage compressible representations. Experiments demonstrate improved quality-budget tradeoffs for post-hoc compression methods on retrieval, long-context QA, and perplexity-based continuation tasks.
Significance. If the central claims hold, the work supplies a useful theoretical lens on representation non-uniqueness in transformers together with a concrete training intervention that can improve downstream compression efficiency. The existence proof for compressible versus non-compressible realizations is a clear strength, as is the empirical evaluation across multiple compression techniques and task types. Successful adoption could meaningfully reduce KV cache memory and latency costs in long-context inference without requiring changes to inference-time compressors.
major comments (3)
- [Section 3] The existence proof (Section 3) shows that compressible implementations are possible for almost any sequence-to-vector function but supplies no analysis or guarantee that gradient descent under the specific KV-masking policy will converge to the compressible basin rather than a non-compressible or capability-degraded one. This link is load-bearing for the motivation of KV-CAT.
- [Section 4] The KV-CAT description (Section 4) introduces the masking rate and sparsification policy as free parameters without ablations demonstrating that the induced representations remain useful for the original task distribution while becoming amenable to unrelated post-hoc compressors (optimization-based or summarization-based).
- [Section 5] The empirical results (Section 5) report gains on compressed-prefix tasks but do not include explicit controls confirming that uncompressed perplexity and retrieval accuracy are preserved; without these, it is unclear whether the observed improvements reflect genuine compressibility gains or hidden capability tradeoffs.
minor comments (2)
- [Section 2] Notation for the KV masking policy and compressibility metric could be introduced earlier and used consistently across the proof and experimental sections.
- [Figure 3] Figure captions for the quality-budget curves should explicitly state the post-hoc compression methods being compared and the exact masking rate used during KV-CAT.
Simulated Author's Rebuttal
Thank you for the detailed and constructive review. We appreciate the recognition of the theoretical non-uniqueness result and the potential practical value of KV-CAT. We address each major comment below and indicate the revisions we will make.
read point-by-point responses
-
Referee: [Section 3] The existence proof (Section 3) shows that compressible implementations are possible for almost any sequence-to-vector function but supplies no analysis or guarantee that gradient descent under the specific KV-masking policy will converge to the compressible basin rather than a non-compressible or capability-degraded one. This link is load-bearing for the motivation of KV-CAT.
Authors: We agree that the existence proof is purely existential and provides no convergence guarantee for gradient descent under the KV-masking policy. Proving such a guarantee is difficult given the non-convex optimization landscape of transformers. Our strongest defense is empirical: across retrieval, long-context QA, and perplexity tasks, KV-CAT consistently improves post-hoc compression quality while preserving base-model performance, indicating that the masking policy reliably steers optimization toward compressible representations in practice. In the revision we will add an explicit limitations paragraph acknowledging the lack of theoretical convergence analysis. revision: partial
-
Referee: [Section 4] The KV-CAT description (Section 4) introduces the masking rate and sparsification policy as free parameters without ablations demonstrating that the induced representations remain useful for the original task distribution while becoming amenable to unrelated post-hoc compressors (optimization-based or summarization-based).
Authors: We thank the referee for this observation. The initial submission used a single masking rate selected via limited tuning and did not present systematic ablations. In the revised manuscript we will add a new subsection with ablations over masking rates (0.1, 0.3, 0.5) and two sparsification policies, reporting both uncompressed task performance and downstream quality under optimization-based and summarization-based compressors to confirm that the representations remain useful while becoming more compressible. revision: yes
-
Referee: [Section 5] The empirical results (Section 5) report gains on compressed-prefix tasks but do not include explicit controls confirming that uncompressed perplexity and retrieval accuracy are preserved; without these, it is unclear whether the observed improvements reflect genuine compressibility gains or hidden capability tradeoffs.
Authors: The manuscript states that KV-CAT models retain competitive uncompressed performance, but we acknowledge that side-by-side controls could be presented more explicitly. In the revision we will insert a dedicated table comparing uncompressed perplexity, retrieval accuracy, and long-context QA scores for the original pretrained model, the KV-CAT model, and relevant baselines, thereby making the absence of capability tradeoffs fully transparent. revision: yes
- Outstanding after the rebuttal: no theoretical analysis or guarantee that gradient descent will converge to the compressible basin under the KV-masking policy.
Circularity Check
No significant circularity; proof and intervention are independent of target metric
full rationale
The paper's core mathematical claim is a proof that almost any sequence-to-vector function admits both compressible and non-compressible transformer realizations; this existence result is stated as a first-principles theorem and does not reduce to the KV-CAT masking procedure or to any fitted quantity. The KV-CAT method is introduced as an explicit train-time sparsification policy (masking KV slots) whose downstream effect on post-hoc compression is measured empirically rather than defined by construction. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the derivation chain. The central claim is therefore tested against external benchmarks rather than against quantities of its own construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- KV masking rate / sparsification policy
axioms (1)
- standard math: Transformers are universal approximators for sequence-to-vector functions
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Theorem 3.1: almost any sequence-to-vector function admits both highly compressible (r(n)=1) and inherently non-compressible transformer implementations.
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
KV-CAT training objective with L_budget maintaining a target retention rate via router masks (a sketched form of this budget term follows the tag glossary below).
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
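On the second theorem link above: the quoted passage mentions a budget term L_budget that holds the router masks near a target retention rate. The paper's exact form is not visible on this page; one common shape for such a regularizer, offered purely as an assumed illustration, is
\[ \mathcal{L} \;=\; \mathcal{L}_{\mathrm{LM}} \;+\; \lambda\,\mathcal{L}_{\mathrm{budget}}, \qquad \mathcal{L}_{\mathrm{budget}} \;=\; \Big(\tfrac{1}{n}\sum_{i=1}^{n} m_i \;-\; r_{\mathrm{target}}\Big)^{2}, \]
where m_i in [0, 1] are router mask values over the n KV slots, r_target is the target retention rate, and lambda trades the budget penalty off against the language-modeling loss; every symbol beyond L_budget itself is an assumption of this sketch.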
Reference graph
Works this paper leans on
- [1] Simran Arora and Christopher Ré. Can foundation models help us achieve perfect secrecy? arXiv preprint arXiv:2205.13722, 2022.
- [2] Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3639–3664, 2025.
- [3] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- [4] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020. doi: 10.1609/aaai.v34i05.6239. URL https://ojs.aaai.org/index.php/AAAI/article/view/6239
- [5] Aydar Bulatov, Yuri Kuratov, and Mikhail S. Burtsev. Recurrent memory transformer, 2022.
- [6] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling, 2025.
- [7]
- [8] Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567, 2025.
- [9] Tong Chen, Hao Fang, Patrick Xia, Xiaodong Liu, Benjamin Van Durme, Luke Zettlemoyer, Jianfeng Gao, and Hao Cheng. Generative adapter: Contextualizing language models in parameters with a single forward pass. arXiv preprint arXiv:2411.05877, 2024.
- [10] Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts, 2023.
- [11] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with Performers. arXiv preprint arXiv:2009.14794, 2020.
- [12] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018. URL https://arxiv.org/abs/1803.05457
- [13] Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018.
- [14] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
- [15] Yam Eitan. The centered convex body whose marginals have the heaviest tails. arXiv preprint arXiv:2110.14382, 2021.
- [16] Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, and Christopher Ré. Cartridges: Lightweight and general-purpose long context representations via self-study, 2025.
- [17] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- [18] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
- [19] Nathan Habib, Clémentine Fourrier, Hynek Kydlíček, Thomas Wolf, and Lewis Tunstall. LightEval: A lightweight framework for LLM evaluation. https://github.com/huggingface/lighteval, 2023. GitHub repository.
- [20] Alex Horn, Ali Kheradmand, and Mukul Prasad. Delta-net: Real-time network verification using atoms. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 735–749, 2017.
- [21] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
- [22] Sukjun Hwang, Brandon Wang, and Albert Gu. Dynamic chunking for end-to-end hierarchical sequence modeling. arXiv preprint arXiv:2507.07955, 2025.
- [23] Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. FinanceBench: A new benchmark for financial question answering. arXiv preprint arXiv:2311.11944, 2023.
- [24] Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Compressing prompts for accelerated inference of large language models, 2023.
- [25] Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression, 2024.
- [26] Samuel Karlin and William J Studden. Optimal experimental designs. The Annals of Mathematical Statistics, 37(4):783–815, 1966.
- [27] Samuel Karlin and William J Studden. Tchebycheff systems: With applications in analysis and statistics, 1966.
- [28] Samuel Karlin and Zvi Ziegler. Chebyshevian spline functions. SIAM Journal on Numerical Analysis, 3(3):514–543, 1966.
- [29] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
- [30] Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W Lee, Sangdoo Yun, and Hyun Oh Song. KVzip: Query-agnostic KV cache compression with context reconstruction. arXiv preprint arXiv:2505.23416, 2025.
- [31] Junhyuck Kim, Jongho Park, Jaewoong Cho, and Dimitris Papailiopoulos. Lexico: Extreme KV cache compression via sparse coding over universal dictionaries. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 30672–30687, 2025.
- [32] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020.
- [33] Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6342–6353, 2023. doi: 10.18653/v1/2023.emnlp-main.391
- [34] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation, 2024.
- [35] Zhuoling Li, Xiaogang Xu, Zhenhua Xu, SerNam Lim, and Hengshuang Zhao. LARM: Large auto-regressive model for long-horizon embodied intelligence. arXiv preprint arXiv:2405.17424, 2024.
- [36] Yewei Liu, Xiyuan Wang, Yansheng Mao, Yoav Gelbery, Haggai Maron, and Muhan Zhang. Shine: A scalable in-context hypernetwork for mapping context to LoRA in a single pass. arXiv preprint arXiv:2602.06358, 2026.
- [37] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016.
- [38] CA Micchelli and Allan Pinkus. Moment theory for weak Chebyshev systems with applications to monosplines, quadrature formulae and best one-sided L^1-approximation by spline functions with fixed knots. SIAM Journal on Mathematical Analysis, 8(2):206–230, 1977.
- [39] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1260
- [40] Jesse Mu, Xiang Lisa Li, and Noah Goodman. Learning to compress prompts with gist tokens, 2023.
- [41] Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. Using an LLM to help with code understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024.
- [42] Emre Okcular. Context engineering: Short-term memory management with Sessions from OpenAI Agents SDK, September 2025.
- [43] Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, and Roy Schwartz. Transformers are multi-state RNNs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18724–18741, 2024. doi: 10.18653/v1/2024.emnlp-main.1043
- [44] Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf, et al. The FineWeb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024.
- [45] Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
- [46] Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield. Effective context engineering for AI agents, September 2025.
- [47] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8732–8740, 2020. doi: 10.1609/aaai.v34i05.6399. URL https://ojs.aaai.org/index.php/AAAI/article/view/6399
- [48] Clayton Sanford, Daniel J Hsu, and Matus Telgarsky. Representational strengths and limitations of transformers. Advances in Neural Information Processing Systems, 36:36677–36707, 2023.
- [49] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473, Hong Kong, China, 2019.
- [50] Social IQa: Commonsense reasoning about social interactions. Association for Computational Linguistics. doi: 10.18653/v1/D19-1454. URL https://aclanthology.org/D19-1454/
- [51] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. QUEST: Query-aware sparsity for efficient long-context LLM inference. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 47901–47911, 2024.
- [52] Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/
- [53] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks, 2024.
- [54] Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads. In International Conference on Learning Representations, 2025. URL https://arxiv.org/abs/2410.10819
- [55] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024.
- [56] Gilad Yehudai, Haim Kaplan, Guy Dar, Royi Rassin, Asma Ghandeharioun, Mor Geva, and Amir Globerson. When can transformers count to n? arXiv preprint arXiv:2407.15160, 2024.
- [57] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. Advances in Neural Information Processing Systems, 30, 2017.
- [58] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
- [59] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1472
- [60] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models, 2023.
- [61] Junhao Zheng, Chengming Shi, Xidi Cai, Qiuke Li, Duzhen Zhang, Chenxing Li, Dong Yu, and Qianli Ma. Lifelong learning of large language model based agents: A roadmap. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026.
- [62] Adam Zweiger, Xinghong Fu, Han Guo, and Yoon Kim. Fast KV compaction via attention matching, 2026.
discussion (0)