Recognition: 2 theorem links
Training Transformers for KV Cache Compressibility
Pith reviewed 2026-05-13 05:56 UTC · model grok-4.3
The pith
Training transformers with KV masking produces representations that compress far more effectively after the fact.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Almost any sequence-to-vector function admits both highly compressible and inherently non-compressible transformer implementations, and a train-time KV sparsification policy can steer the model toward the compressible regime without degrading its core capabilities.
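Read formally (a reconstruction from the theorem fragments that survive in this page's extraction of the paper's appendix; M is the trained transformer, M_{C,a} denotes running M with compression policy C applied to the cached prefix a, r(n) is the policy's slot budget for a length-n prefix, and the paper's exact statement may differ):
\[ \text{Approximation: } \forall\, a=(a_1,\dots,a_n)\in A^n,\ n\le N:\quad \lVert f(a)-M(a)\rVert<\varepsilon. \]
\[ \text{Compressible realization: } \exists\, C \text{ with } r(n)\equiv 1 \text{ such that } \lVert M([a,b])-M_{C,a}(b)\rVert<\varepsilon \ \text{ for all } a\in A^n,\ b\in A^k,\ n+k\le N. \]
\[ \text{Non-compressible realization: } \forall\, C \text{ with } r(n)<n,\ \exists\, b\in A^k,\ n+k\le N:\quad \lVert M([a,b])-M_{C,a}(b)\rVert>c_0, \]
where the appendix writes the lower bound as a fixed constant (denoted C there; written c_0 here only to avoid clashing with the policy symbol).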
What carries the argument
KV-Compression Aware Training (KV-CAT), a continued pretraining procedure that applies a KV masking policy during training to force the model to rely on fewer KV slots.
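The page carries no code, so the following is a minimal NumPy sketch of what a train-time KV sparsification policy can look like inside one attention step: a random subset of KV slots is hidden, so the model must route information through the surviving slots. The uniform-random mask and the keep_prob knob are assumptions for illustration; the paper's actual policy may be learned or structured.

```python
import numpy as np

def masked_attention(q, K, V, keep_prob=0.5, rng=None):
    """Single-head attention over a KV cache with a random train-time slot mask.

    q: (d,) query; K, V: (n, d) cached keys and values.
    keep_prob plays the role of the KV-CAT masking-rate knob (a free
    parameter of the method); the real policy may differ.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = K.shape
    keep = rng.random(n) < keep_prob          # which KV slots stay visible this step
    keep[-1] = True                           # guarantee at least one visible slot
    scores = K @ q / np.sqrt(d)               # (n,) attention logits
    scores = np.where(keep, scores, -np.inf)  # masked slots receive zero attention
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                        # (d,) output computed from surviving slots only

# Toy usage: 8 cached slots, 16-dim head, roughly half the slots masked.
rng = np.random.default_rng(1)
q = rng.normal(size=16)
K = rng.normal(size=(8, 16))
V = rng.normal(size=(8, 16))
print(masked_attention(q, K, V, keep_prob=0.5, rng=rng).shape)  # (16,)
```

Applied throughout continued pretraining, this kind of masking is what the paper credits with pushing representations toward the compressible regime.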
If this is right
- Existing KV compression algorithms achieve better quality for any given memory budget on retrieval and long-context QA.
- Perplexity on tasks that continue from a compressed prefix improves relative to models trained without the masking policy.
- The same model weights remain competitive on standard short-context benchmarks while becoming easier to compress at inference time.
Where Pith is reading between the lines
- The same masking idea could be applied during fine-tuning rather than only continued pretraining to adapt existing models.
- If compressibility can be trained in, it may become a standard training objective alongside next-token prediction for any long-context architecture.
- The proof that both compressible and non-compressible realizations exist implies that architecture search or regularization choices can inadvertently lock models into the harder-to-compress regime.
Load-bearing premise
Masking KV slots at training time will cause compressible yet still useful representations to emerge without requiring heavy hyperparameter search or harming the model's original performance.
What would settle it
If, after running KV-CAT, standard post-hoc KV compression methods show no improvement in the quality-versus-budget curve, or the model shows measurable accuracy drops on long-context retrieval and QA benchmarks, the central claim fails.
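A hedged sketch of how that quality-versus-budget comparison might be run; the compress_kv and evaluate callables are placeholders for a post-hoc compressor and a benchmark harness, not APIs from the paper.

```python
def quality_budget_curve(model, eval_set, compress_kv, evaluate,
                         budgets=(1.0, 0.5, 0.25, 0.1)):
    """Sweep KV budgets and record task quality at each one.

    compress_kv(model, budget) is assumed to return a model whose cached
    prefix is compressed to `budget` times its original size; evaluate
    returns a scalar quality metric. budget=1.0 is the uncompressed control.
    """
    return [(b, evaluate(compress_kv(model, b), eval_set)) for b in budgets]

def dominates(kvcat_curve, baseline_curve):
    """True only if the KV-CAT model is at least as good at every matched budget."""
    return all(q1 >= q0 for (_, q1), (_, q0) in zip(kvcat_curve, baseline_curve))
```

The claim is settled by comparing curves: if the KV-CAT model's curve does not dominate the baseline's at matched budgets, or its budget=1.0 point degrades, the training intervention did not buy compressibility for free.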
read the original abstract
Long-context language modeling is increasingly constrained by the Key-Value (KV) cache, whose memory and decode-time access costs scale linearly with the prefix length. This bottleneck has motivated a range of context-compression methods, from token-level summarization to recent optimization-based KV compression methods. These post-hoc methods operate on the KV cache of a fixed pretrained model, so their effectiveness is fundamentally limited by how well the model's internal representations can be compressed. In this work, we formalize the notion of KV compressibility and show that it is a property of the learned representations, rather than of the context alone. We prove that almost any sequence-to-vector function admits both highly compressible and inherently non-compressible transformer implementations, highlighting the need to guide transformers toward compressible representations during training. Motivated by this, we propose KV-Compression Aware Training (KV-CAT), a continued pretraining procedure that incentivizes the emergence of compressible representations. We introduce a train-time KV sparsification policy that masks KV slots during training. This forces the model to use fewer KV slots and encourages it to learn representations amenable to post-hoc compression. Empirically, we show that KV-CAT improves the quality-budget tradeoff of downstream compression methods across retrieval, long-context question answering, and perplexity-based evaluation of compressed-prefix continuation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that KV cache compressibility is a property of learned transformer representations rather than the input context. It proves that nearly any sequence-to-vector function admits both highly compressible and inherently non-compressible transformer implementations. Motivated by this, it introduces KV-Compression Aware Training (KV-CAT), a continued pretraining procedure that applies a train-time KV slot masking policy to encourage compressible representations. Experiments demonstrate improved quality-budget tradeoffs for post-hoc compression methods on retrieval, long-context QA, and perplexity-based continuation tasks.
Significance. If the central claims hold, the work supplies a useful theoretical lens on representation non-uniqueness in transformers together with a concrete training intervention that can improve downstream compression efficiency. The existence proof for compressible versus non-compressible realizations is a clear strength, as is the empirical evaluation across multiple compression techniques and task types. Successful adoption could meaningfully reduce KV cache memory and latency costs in long-context inference without requiring changes to inference-time compressors.
major comments (3)
- [Section 3] The existence proof (Section 3) shows that compressible implementations are possible for almost any sequence-to-vector function but supplies no analysis or guarantee that gradient descent under the specific KV-masking policy will converge to the compressible basin rather than a non-compressible or capability-degraded one. This link is load-bearing for the motivation of KV-CAT.
- [Section 4] The KV-CAT description (Section 4) introduces the masking rate and sparsification policy as free parameters without ablations demonstrating that the induced representations remain useful for the original task distribution while becoming amenable to unrelated post-hoc compressors (optimization-based or summarization-based).
- [Section 5] The empirical results (Section 5) report gains on compressed-prefix tasks but do not include explicit controls confirming that uncompressed perplexity and retrieval accuracy are preserved; without these, it is unclear whether the observed improvements reflect genuine compressibility gains or hidden capability tradeoffs.
minor comments (2)
- [Section 2] Notation for the KV masking policy and compressibility metric could be introduced earlier and used consistently across the proof and experimental sections.
- [Figure 3] Figure captions for the quality-budget curves should explicitly state the post-hoc compression methods being compared and the exact masking rate used during KV-CAT.
Simulated Author's Rebuttal
Thank you for the detailed and constructive review. We appreciate the recognition of the theoretical non-uniqueness result and the potential practical value of KV-CAT. We address each major comment below and indicate the revisions we will make.
read point-by-point responses
-
Referee: [Section 3] The existence proof (Section 3) shows that compressible implementations are possible for almost any sequence-to-vector function but supplies no analysis or guarantee that gradient descent under the specific KV-masking policy will converge to the compressible basin rather than a non-compressible or capability-degraded one. This link is load-bearing for the motivation of KV-CAT.
Authors: We agree that the existence proof is purely existential and provides no convergence guarantee for gradient descent under the KV-masking policy. Proving such a guarantee is difficult given the non-convex optimization landscape of transformers. Our strongest defense is empirical: across retrieval, long-context QA, and perplexity tasks, KV-CAT consistently improves post-hoc compression quality while preserving base-model performance, indicating that the masking policy reliably steers optimization toward compressible representations in practice. In the revision we will add an explicit limitations paragraph acknowledging the lack of theoretical convergence analysis. revision: partial
-
Referee: [Section 4] The KV-CAT description (Section 4) introduces the masking rate and sparsification policy as free parameters without ablations demonstrating that the induced representations remain useful for the original task distribution while becoming amenable to unrelated post-hoc compressors (optimization-based or summarization-based).
Authors: We thank the referee for this observation. The initial submission used a single masking rate selected via limited tuning and did not present systematic ablations. In the revised manuscript we will add a new subsection with ablations over masking rates (0.1, 0.3, 0.5) and two sparsification policies, reporting both uncompressed task performance and downstream quality under optimization-based and summarization-based compressors to confirm that the representations remain useful while becoming more compressible. revision: yes
-
Referee: [Section 5] The empirical results (Section 5) report gains on compressed-prefix tasks but do not include explicit controls confirming that uncompressed perplexity and retrieval accuracy are preserved; without these, it is unclear whether the observed improvements reflect genuine compressibility gains or hidden capability tradeoffs.
Authors: The manuscript states that KV-CAT models retain competitive uncompressed performance, but we acknowledge that side-by-side controls could be presented more explicitly. In the revision we will insert a dedicated table comparing uncompressed perplexity, retrieval accuracy, and long-context QA scores for the original pretrained model, the KV-CAT model, and relevant baselines, thereby making the absence of capability tradeoffs fully transparent. revision: yes
- Outstanding after the rebuttal: no theoretical analysis or guarantee that gradient descent will converge to the compressible basin under the KV-masking policy.
Circularity Check
No significant circularity; proof and intervention are independent of target metric
full rationale
The paper's core mathematical claim is a proof that almost any sequence-to-vector function admits both compressible and non-compressible transformer realizations; this existence result is stated as a first-principles theorem and does not reduce to the KV-CAT masking procedure or to any fitted quantity. The KV-CAT method is introduced as an explicit train-time sparsification policy (masking KV slots) whose downstream effect on post-hoc compression is measured empirically rather than defined by construction. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the derivation chain. The central claim is therefore tested against external benchmarks rather than against quantities of its own construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- KV masking rate / sparsification policy
axioms (1)
- standard math: Transformers are universal approximators for sequence-to-vector functions
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Theorem 3.1: almost any sequence-to-vector function admits both highly compressible (r(n)=1) and inherently non-compressible transformer implementations.
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
KV-CAT training objective with L_budget maintaining a target retention rate via router masks (a sketched form of this budget term follows the tag glossary below).
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
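On the second theorem link above: the quoted passage mentions a budget term L_budget that holds the router masks near a target retention rate. The paper's exact form is not visible on this page; one common shape for such a regularizer, offered purely as an assumed illustration, is
\[ \mathcal{L} \;=\; \mathcal{L}_{\mathrm{LM}} \;+\; \lambda\,\mathcal{L}_{\mathrm{budget}}, \qquad \mathcal{L}_{\mathrm{budget}} \;=\; \Big(\tfrac{1}{n}\sum_{i=1}^{n} m_i \;-\; r_{\mathrm{target}}\Big)^{2}, \]
where m_i in [0, 1] are router mask values over the n KV slots, r_target is the target retention rate, and lambda trades the budget penalty off against the language-modeling loss; every symbol beyond L_budget itself is an assumption of this sketch.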
Reference graph
Works this paper leans on
- [1] Simran Arora and Christopher Ré. Can foundation models help us achieve perfect secrecy? arXiv preprint arXiv:2205.13722, 2022.
- [2] Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, et al. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3639–3664, 2025.
- [3] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- [4] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020. doi: 10.1609/aaai.v34i05.6239. URL https://ojs.aaai.org/index.php/AAAI/article/view/6239
- [5] Aydar Bulatov, Yuri Kuratov, and Mikhail S. Burtsev. Recurrent memory transformer, 2022.
- [6] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling, 2025.
- [7]
- [8] Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567, 2025.
- [9] Tong Chen, Hao Fang, Patrick Xia, Xiaodong Liu, Benjamin Van Durme, Luke Zettlemoyer, Jianfeng Gao, and Hao Cheng. Generative adapter: Contextualizing language models in parameters with a single forward pass. arXiv preprint arXiv:2411.05877, 2024.
- [10] Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts, 2023.
- [11] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with Performers. arXiv preprint arXiv:2009.14794, 2020.
- [12] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018. URL https://arxiv.org/abs/1803.05457
- [13] Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 2018.
- [14] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
- [15] Yam Eitan. The centered convex body whose marginals have the heaviest tails. arXiv preprint arXiv:2110.14382, 2021.
- [16] Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, and Christopher Ré. Cartridges: Lightweight and general-purpose long context representations via self-study, 2025.
- [17] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- [18] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
- [19] Nathan Habib, Clémentine Fourrier, Hynek Kydlíček, Thomas Wolf, and Lewis Tunstall. LightEval: A lightweight framework for LLM evaluation. https://github.com/huggingface/lighteval, 2023. GitHub repository.
- [20] Alex Horn, Ali Kheradmand, and Mukul Prasad. Delta-net: Real-time network verification using atoms. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 735–749, 2017.
- [21] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
- [22] Sukjun Hwang, Brandon Wang, and Albert Gu. Dynamic chunking for end-to-end hierarchical sequence modeling. arXiv preprint arXiv:2507.07955, 2025.
- [23] Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen. FinanceBench: A new benchmark for financial question answering. arXiv preprint arXiv:2311.11944, 2023.
- [24] Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Compressing prompts for accelerated inference of large language models, 2023.
- [25] Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression, 2024.
- [26] Samuel Karlin and William J Studden. Optimal experimental designs. The Annals of Mathematical Statistics, 37(4):783–815, 1966.
- [27] Samuel Karlin and William J Studden. Tchebycheff systems: With applications in analysis and statistics, 1966.
- [28] Samuel Karlin and Zvi Ziegler. Chebyshevian spline functions. SIAM Journal on Numerical Analysis, 3(3):514–543, 1966.
- [29] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
- [30] Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W Lee, Sangdoo Yun, and Hyun Oh Song. KVzip: Query-agnostic KV cache compression with context reconstruction. arXiv preprint arXiv:2505.23416, 2025.
- [31] Junhyuck Kim, Jongho Park, Jaewoong Cho, and Dimitris Papailiopoulos. Lexico: Extreme KV cache compression via sparse coding over universal dictionaries. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 30672–30687, 2025.
- [32] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020.
- [33] Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6342–6353, 2023. doi: 10.18653/v1/2023.emnlp-main.391
- [34] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation, 2024.
- [35] Zhuoling Li, Xiaogang Xu, Zhenhua Xu, SerNam Lim, and Hengshuang Zhao. LARM: Large auto-regressive model for long-horizon embodied intelligence. arXiv preprint arXiv:2405.17424, 2024.
- [36] Yewei Liu, Xiyuan Wang, Yansheng Mao, Yoav Gelbery, Haggai Maron, and Muhan Zhang. Shine: A scalable in-context hypernetwork for mapping context to LoRA in a single pass. arXiv preprint arXiv:2602.06358, 2026.
- [37] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016.
- [38] CA Micchelli and Allan Pinkus. Moment theory for weak Chebyshev systems with applications to monosplines, quadrature formulae and best one-sided L^1-approximation by spline functions with fixed knots. SIAM Journal on Mathematical Analysis, 8(2):206–230, 1977.
- [39] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1260
- [40] Jesse Mu, Xiang Lisa Li, and Noah Goodman. Learning to compress prompts with gist tokens, 2023.
- [41] Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. Using an LLM to help with code understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024.
- [42] Emre Okcular. Context engineering: Short-term memory management with Sessions from OpenAI Agents SDK, September 2025.
- [43] Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, and Roy Schwartz. Transformers are multi-state RNNs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18724–18741, 2024. doi: 10.18653/v1/2024.emnlp-main.1043
- [44] Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf, et al. The FineWeb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024.
- [45] Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
- [46] Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield. Effective context engineering for AI agents, September 2025.
- [47] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8732–8740, 2020. doi: 10.1609/aaai.v34i05.6399. URL https://ojs.aaai.org/index.php/AAAI/article/view/6399
- [48] Clayton Sanford, Daniel J Hsu, and Matus Telgarsky. Representational strengths and limitations of transformers. Advances in Neural Information Processing Systems, 36:36677–36707, 2023.
- [49] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473, Hong Kong, China, 2019.
- [50] Social IQa: Commonsense reasoning about social interactions. Association for Computational Linguistics. doi: 10.18653/v1/D19-1454. URL https://aclanthology.org/D19-1454/
- [51] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. QUEST: Query-aware sparsity for efficient long-context LLM inference. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 47901–47911, 2024.
- [52] Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/
- [53] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks, 2024.
- [54] Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads. In International Conference on Learning Representations, 2025. URL https://arxiv.org/abs/2410.10819
- [55] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024.
- [56] Gilad Yehudai, Haim Kaplan, Guy Dar, Royi Rassin, Asma Ghandeharioun, Mor Geva, and Amir Globerson. When can transformers count to n? arXiv preprint arXiv:2407.15160, 2024.
- [57] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. Advances in Neural Information Processing Systems, 30, 2017.
- [58] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
- [59] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1472
- [60] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models, 2023.
- [61] Junhao Zheng, Chengming Shi, Xidi Cai, Qiuke Li, Duzhen Zhang, Chenxing Li, Dong Yu, and Qianli Ma. Lifelong learning of large language model based agents: A roadmap. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026.
- [62] Adam Zweiger, Xinghong Fu, Han Guo, and Yoon Kim. Fast KV compaction via attention matching, 2026.
discussion (0)