Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

Huanyu Qu; Jiang Cai; Mingkun Xu; Songchen Ma; Wei Luo; Yi Huang

arxiv: 2605.22337 · v1 · pith:RF5AMAPLnew · submitted 2026-05-21 · 💻 cs.AI

Pith reviewed 2026-05-22 05:07 UTC · model grok-4.3

keywords KV cache compression · meta-tokens · Gumbel-Softmax selector · attention-flow integration · long-context LLMs · dynamic eviction · context preservation · soft token synthesis

The pith

Meta-Soft dynamically composes prompt-specific meta-tokens from a learnable basis to compress KV caches while redistributing semantic information from evicted tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models suffer memory and speed problems as their KV caches grow linearly with context length. Prior eviction methods use fixed soft tokens that cannot adapt to each new prompt and permanently discard information when they remove token pairs. The paper introduces Meta-Soft, which maintains a meta-library of orthogonal vectors and employs a selector network with Gumbel-Softmax to produce sparse weights that synthesize a small set of targeted soft tokens for any given input. These tokens are appended to probe the sequence, after which an attention-flow mechanism moves the semantic content of dropped tokens into the retained ones. Experiments across several datasets show the resulting compressed cache outperforms prior eviction techniques.
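To make the linear growth concrete, here is a rough sizing sketch for a decoder-only transformer. The model dimensions are illustrative assumptions and are not taken from the paper; `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2, batch=1):
    """Bytes needed to store keys and values for `seq_len` cached tokens."""
    # Factor 2: every layer stores one key tensor and one value tensor.
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configuration, chosen only for illustration (16-bit elements).
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(seq_len, **cfg) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.2f} GiB")
```

Under these assumed dimensions each cached token costs 128 KiB, so doubling the context doubles the cache. Eviction methods such as the one summarized here aim to cap `seq_len` at a fixed budget.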

Core claim

By constructing a meta-library as a learnable orthogonal basis and using a selector network with Gumbel-Softmax to generate differentiable sparse combination weights, the method synthesizes the most relevant soft tokens from prompt features; an attention-flow integration step then redistributes the information of removed KV pairs into the kept tokens, preventing irreversible context loss and enabling more effective dynamic compression than static approaches.
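The synthesis step can be sketched as a weighted combination of basis vectors, with weights drawn by Gumbel-Softmax. This is a minimal illustration under stated assumptions, not the paper's implementation: the selector network is replaced by fixed logits, one weight vector is drawn per soft token, and the names `gumbel_softmax` and `synthesize_soft_tokens` are hypothetical.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed one-hot sample over len(logits) categories."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)       # clamp away from 0 before taking logs
        g = -math.log(-math.log(u))        # Gumbel(0, 1) noise
        noisy.append((logit + g) / tau)
    peak = max(noisy)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def synthesize_soft_tokens(basis, selector_logits, tau=0.5, seed=0):
    """basis: m vectors of dimension d; selector_logits: k rows of m logits."""
    rng = random.Random(seed)
    dim = len(basis[0])
    tokens = []
    for logits in selector_logits:
        w = gumbel_softmax(logits, tau, rng)
        # Each soft token is a convex combination of the basis vectors.
        tokens.append([sum(w[i] * basis[i][j] for i in range(len(basis)))
                       for j in range(dim)])
    return tokens


# Assumption: a toy orthonormal basis (identity) and hand-written selector logits.
basis = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
selector_logits = [[4.0, 0.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]
tokens = synthesize_soft_tokens(basis, selector_logits)
print(tokens)
```

Lower values of `tau` push each weight vector closer to one-hot, which is how the relaxation yields near-sparse selections while staying differentiable.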

What carries the argument

A learnable orthogonal basis matrix that serves as a meta-library, paired with a Gumbel-Softmax selector that produces sparse weights for synthesizing prompt-specific soft tokens and an attention-flow mechanism that redistributes semantic content from evicted tokens.
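The summary does not say how orthogonality of the basis is maintained during training. One common option, shown here purely as an assumption, is a soft penalty that measures how far the Gram matrix of the basis rows is from the identity. The function name `orthogonality_penalty` is hypothetical.

```python
def orthogonality_penalty(basis):
    """Squared Frobenius norm of (B B^T - I) for the row vectors in `basis`."""
    m = len(basis)
    penalty = 0.0
    for i in range(m):
        for j in range(m):
            dot = sum(a * b for a, b in zip(basis[i], basis[j]))
            target = 1.0 if i == j else 0.0
            penalty += (dot - target) ** 2
    return penalty


orthonormal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
correlated = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0]]

print(orthogonality_penalty(orthonormal))  # 0.0
print(orthogonality_penalty(correlated))   # positive: the two rows overlap
```

A term like this could be added to a training loss so that basis vectors stay close to orthonormal; a hard constraint via re-orthogonalization would be another possibility.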

If this is right

LLMs can handle longer input sequences under the same memory budget.
Eviction decisions adapt automatically to each prompt instead of relying on a fixed query.
Context breaks are reduced because semantic content of dropped tokens is moved rather than erased.
Decoding efficiency improves while accuracy on the tested tasks stays above that of prior eviction methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same composable-token idea could be applied to compress other internal state structures such as activation caches in non-transformer architectures.
Training the meta-library once and freezing it might allow the selector to be reused across multiple models without retraining the entire system.
The attention-flow redistribution could be measured directly by comparing hidden-state similarity before and after eviction to quantify information retention.
Extending the approach to streaming inputs might let the system continuously update the soft-token set as new tokens arrive.
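The hidden-state comparison suggested above could be sketched as an average cosine similarity between states computed with the full cache and states computed with the compressed cache. The metric choice and the names `cosine` and `mean_retention` are assumptions for illustration; the paper summary does not define a retention metric.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_retention(full_states, compressed_states):
    """Average cosine similarity between matching hidden states."""
    sims = [cosine(u, v) for u, v in zip(full_states, compressed_states)]
    return sum(sims) / len(sims)


# Toy hidden states: the second compressed state has drifted from the original.
full = [[1.0, 0.0], [0.0, 2.0]]
compressed = [[1.0, 0.0], [1.0, 1.0]]
print(round(mean_retention(full, compressed), 3))
```

Values near 1 would indicate that eviction left the decoder's hidden states largely unchanged; comparing the score with and without the redistribution step would isolate its contribution.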

Load-bearing premise

The selector network's combination weights correctly identify changing task relevance in the prompt, and the attention-flow step transfers all necessary semantic information from removed tokens into retained ones without permanent loss.

What would settle it

Running the method on a dataset containing prompts with rapidly shifting relevance and finding either no accuracy gain over static soft-token baselines or measurable degradation traceable to information discarded during eviction.

Figures

Figures reproduced from arXiv: 2605.22337 by Huanyu Qu, Jiang Cai, Mingkun Xu, Songchen Ma, Wei Luo, Yi Huang.

**Figure 1.** Motivation and overview of Meta-Soft. Left: Existing KV-cache compression often relies on static queries for eviction, which fail to adapt across diverse tasks and may cause cross-task mismatch; moreover, hard eviction permanently deletes KV entries, leading to irreversible information loss and broken context. Right: Meta-Soft uses input-dependent dynamic soft tokens synthesized from a meta-library to prob…

**Figure 2.** Meta-Soft framework overview. Meta-Soft trains a Meta-Library and selector offline with Ground-Truth Attention supervision and compresses the KV cache online by generating prompt-conditioned soft tokens to probe, partition, and consolidate context for decoding.

Cache Partitioning. Based on $A_{\text{soft}}$ and budget $B$, we partition the cache:

$$I_{\text{keep}} = \mathrm{TopK}(A_{\text{soft}}, B), \qquad I_{\text{drop}} = \{1, \dots, L\} \setminus I_{\text{keep}} \tag{6}$$

This yields t…
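The partition rule quoted in the Figure 2 excerpt (Eq. 6) can be sketched in a few lines. This assumes the soft-token attention has already been aggregated into one score per cached position, which the excerpt does not spell out, and it uses 0-based indices instead of the paper's 1..L. `partition_cache` is a hypothetical name.

```python
def partition_cache(a_soft, budget):
    """Split positions into kept and dropped index lists by attention score."""
    # Rank positions from highest to lowest score; ties keep their original order.
    order = sorted(range(len(a_soft)), key=lambda i: a_soft[i], reverse=True)
    keep = sorted(order[:budget])
    keep_set = set(keep)
    drop = [i for i in range(len(a_soft)) if i not in keep_set]
    return keep, drop


scores = [0.05, 0.30, 0.02, 0.25, 0.08, 0.30]
keep, drop = partition_cache(scores, budget=3)
print(keep, drop)  # [1, 3, 5] [0, 2, 4]
```

The dropped indices are the ones whose content the attention-flow step is then meant to fold back into the kept entries.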

Original abstract

The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. Current KV Cache eviction has become an important research direction; however, existing methods based on fixed Soft Tokens (e.g., Judge Q) rely on a static parameter set as the query to evaluate the importance of KV pairs, so they cannot adapt dynamically to different input prompts, and they cannot precisely capture complex and changing task relevance. Also, evicted KV pairs are discarded permanently, so this causes irreversible information loss and context breaks. To address this problem, we propose Meta-Soft, a dynamic compression framework based on probe-driven context integration. Specifically, we build a meta-library with a learnable orthogonal basis matrix $\mathcal{L}$, and we use a selector network with Gumbel-Softmax to produce differentiable sparse combination weights, so we dynamically synthesize the most targeted $k$ Soft Tokens from the input prompt features. We append these Soft Tokens to the end of the input sequence to probe key information. We also introduce an attention-flow based integration mechanism, which redistributes the semantic information of removed tokens into retained tokens, and this keeps the dropped context information effectively. Experiments on multiple datasets show that our method outperforms existing state-of-the-art eviction methods and provides a new solution for KV Cache compression.
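The abstract does not give the formula behind the attention-flow integration. As a hedged illustration of what redistributing dropped tokens into retained ones could look like, the sketch below folds each dropped value vector into the kept ones using a softmax over key dot products. This weighting scheme is an assumption, not the paper's formulation, and `redistribute` is a hypothetical name.

```python
import math


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def redistribute(keys, values, keep, drop):
    """Fold each dropped value vector into the kept ones, weighted by key similarity."""
    merged = {i: list(values[i]) for i in keep}
    for j in drop:
        # Affinity of dropped token j to every kept token (dot product of keys).
        affinities = [sum(a * b for a, b in zip(keys[j], keys[i])) for i in keep]
        for w, i in zip(softmax(affinities), keep):
            merged[i] = [old + w * v for old, v in zip(merged[i], values[j])]
    return merged


# Toy cache: tokens 1 and 3 are dropped, tokens 0 and 2 are kept.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
merged = redistribute(keys, values, keep=[0, 2], drop=[1, 3])
print(merged)
```

Because the weights for each dropped token sum to 1, the total of all value vectors is preserved in this toy version. Whether such a merge preserves what the model actually needs is exactly the question the referee report below raises.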

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Dynamic meta-token synthesis for KV cache compression looks promising on paper but lacks supporting numbers.

Hi colleague,

The punchline on this one is that Meta-Soft introduces a way to dynamically create soft tokens for compressing the KV cache in large language models by using a learnable orthogonal basis in a meta-library and Gumbel-Softmax to select combinations based on the prompt. They pair this with an attention-flow mechanism meant to preserve information from tokens that get evicted.

What the paper does well is identify the shortcomings of static soft token methods, which don't adapt to different inputs and cause permanent loss when evicting KV pairs. The proposal to synthesize targeted soft tokens on the fly and then integrate the semantics of removed ones into retained tokens through attention redistribution is a reasonable extension. It tries to keep the context intact in a more flexible manner than previous eviction techniques.

On the soft spots, the main issue is the lack of concrete evidence in the description. The abstract claims better performance on multiple datasets compared to state-of-the-art eviction methods, but there are no numbers, no error bars, no details on the datasets or tasks, and no ablations showing the contribution of the meta-library or the attention-flow step. This makes it hard to evaluate the central assumption that the attention-flow integration avoids irreversible loss. As the stress-test points out, in long contexts where attention can be sparse or focused on few positions, re-weighting based on those scores might still drop important but low-attention information. Without experiments isolating that part, the no-loss claim feels unanchored.

This paper is for researchers and engineers working on making long-context inference more memory-efficient in LLMs. A reader interested in KV cache optimization techniques would get some value from the framework description, even if they have to wait for the full results to see if it delivers.

It deserves a serious referee because the problem it tackles is important for practical deployment, and the technical components like the orthogonal basis and differentiable selection show some thought. A review process could help verify the experiments and strengthen the evaluation of the integration mechanism.

Recommendation: Send it for peer review to get the necessary details and checks.

Referee Report

1 major / 1 minor

Summary. The paper proposes Meta-Soft, a dynamic KV cache compression framework for LLMs. It builds a meta-library from a learnable orthogonal basis matrix L, uses a selector network with Gumbel-Softmax to synthesize k targeted soft tokens from input prompt features, appends these tokens to probe key information, and applies an attention-flow integration mechanism to redistribute semantic information from evicted tokens into retained ones, thereby avoiding irreversible context loss. The work claims that this approach outperforms existing state-of-the-art eviction methods on multiple datasets.

Significance. If the central claims hold, the method could meaningfully improve long-context LLM efficiency by enabling prompt-adaptive compression that preserves more context than static soft-token baselines. The composable meta-tokens and attention-flow redistribution constitute a distinct technical contribution relative to prior fixed-parameter eviction techniques.

major comments (1)

[Abstract (attention-flow integration mechanism)] The attention-flow based integration mechanism is asserted to redistribute semantic information of removed tokens into retained tokens without irreversible loss (Abstract). However, no quantitative bound, ablation study isolating this step, or analysis of behavior under sparse/diluted attention patterns is supplied. This assumption is load-bearing for the context-preservation claim; failure in long-context regimes where attention concentrates on few positions would undermine the no-loss guarantee.

minor comments (1)

[Abstract] The abstract states that experiments were run 'on multiple datasets' and 'outperform existing state-of-the-art' but supplies neither dataset names, quantitative metrics, error bars, nor ablation results; adding these details would strengthen the presentation.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their constructive comments. The concern regarding the attention-flow integration mechanism is well-taken, and we address it directly below while revising the manuscript to strengthen the supporting evidence for our context-preservation claims.

Referee: [Abstract (attention-flow integration mechanism)] The attention-flow based integration mechanism is asserted to redistribute semantic information of removed tokens into retained tokens without irreversible loss (Abstract). However, no quantitative bound, ablation study isolating this step, or analysis of behavior under sparse/diluted attention patterns is supplied. This assumption is load-bearing for the context-preservation claim; failure in long-context regimes where attention concentrates on few positions would undermine the no-loss guarantee.

Authors: We agree that the attention-flow integration mechanism is central to the context-preservation claim. In the revised manuscript we add an ablation study that isolates this component by comparing full Meta-Soft against a variant that performs eviction without the redistribution step; the results show a consistent drop in long-context task accuracy when the mechanism is removed. We also include a new analysis section that examines attention patterns on long-context benchmarks, including regimes with sparse and concentrated attention. These experiments indicate that the redistribution step continues to improve retention metrics even when attention focuses on a small number of positions. A strict theoretical quantitative bound on information preservation is not supplied, as deriving one would require assumptions on attention distributions that do not hold across all models and tasks; we instead rely on the empirical evidence from the ablations and pattern analysis.

Revision: yes

standing simulated objections not resolved

Deriving a rigorous quantitative theoretical bound on semantic information preservation under the attention-flow mechanism.

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent mechanisms

The paper proposes a meta-library with learnable orthogonal basis matrix L, a selector network using Gumbel-Softmax for dynamic sparse combination weights to synthesize soft tokens from prompt features, and an attention-flow integration mechanism to redistribute semantics of removed tokens. These are presented as new constructions without any reduction of the central claims to fitted inputs by construction, self-definitional loops, or load-bearing self-citations. The abstract and described framework remain self-contained against external benchmarks, with no quoted equations or steps that equate predictions directly to their own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The framework depends on the learnability of the orthogonal basis matrix and the effectiveness of Gumbel-Softmax selection plus attention redistribution, both introduced without external benchmarks in the abstract.

free parameters (2)

learnable orthogonal basis matrix L
Core of the meta-library; parameters are trained to enable synthesis of targeted soft tokens.
number k of synthesized soft tokens
Chosen hyperparameter controlling compression ratio and probe strength.

axioms (1)

Gumbel-Softmax enables differentiable sparse selection (standard math)
Invoked to allow end-to-end training of the selector network from prompt features.
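For reference, the standard Gumbel-Softmax relaxation can be written as follows. The symbols are assumptions introduced here for illustration: $\pi_i$ is the selector's probability for basis vector $i$, $m$ is the size of the meta-library, $\tau$ is the temperature, and $w_i$ is the resulting combination weight.

```latex
g_i = -\log(-\log u_i), \qquad u_i \sim \mathrm{Uniform}(0, 1)

w_i = \frac{\exp\big((\log \pi_i + g_i)/\tau\big)}
           {\sum_{j=1}^{m} \exp\big((\log \pi_j + g_j)/\tau\big)},
\qquad i = 1, \dots, m
```

As $\tau \to 0$ the weights approach a one-hot vector, while for $\tau > 0$ they remain differentiable with respect to $\pi$, which is what permits end-to-end training of the selector.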

invented entities (2)

meta-library with learnable orthogonal basis (no independent evidence)
purpose: Provides basis vectors for dynamic synthesis of prompt-specific soft tokens
New construct introduced to overcome limitations of fixed soft tokens.
attention-flow based integration mechanism (no independent evidence)
purpose: Redistributes semantic information from evicted KV pairs into retained tokens
New mechanism claimed to prevent irreversible context loss.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 11 internal anchors

[1]

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

[Bai et al., 2023] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
[2]

Token Merging: Your ViT But Faster

[Bolya et al., 2022] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. arXiv preprint arXiv:2210.09461. Accepted at ICLR 2023 (Oral), per arXiv.
[3]

LoCoCo: Dropping In Convolutions for Long Context Compression

[Cai et al., 2024a] Ruisi Cai, Yuandong Tian, Zhangyang Wang, and Beidi Chen. LoCoCo: Dropping in convolutions for long context compression. arXiv preprint arXiv:2406.05317.
[4]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

[Cai et al., 2024b] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069.
[5]

Adapting Language Models to Compress Contexts

[Chevalier et al., 2023] Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. arXiv preprint arXiv:2305.14788.
[6]

Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

[Chiang et al., 2023] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March
[7]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

[Dao et al., 2022] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems.
[8]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

[Gao et al., 2020] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
[9]

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

[Ge et al., 2023] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive KV cache compression for LLMs. arXiv preprint arXiv:2310.01801.
[10]

Accurate kv cache eviction via anchor direction projection for efficient llm inference

[Geng et al., 2025] Zijie Geng, Jie Wang, Ziqi Liu, Feng Ju, Yiming Li, Xing Li, Mingxuan Yuan, Jianye Hao, Defu Lian, Enhong Chen, and Feng Wu. Accurate KV cache eviction via anchor direction projection for efficient LLM inference. In Advances in Neural Information Processing Systems. NeurIPS 2025 (OpenReview).
[11]

The Llama 3 Herd of Models

[Grattafiori et al., 2024] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
[12]

ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

[He et al., 2024] Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. ZipCache: Accurate and efficient KV cache quantization with salient token identification. arXiv preprint arXiv:2405.14256.
[13]

RULER: What's the Real Context Size of Your Long-Context Language Models?

[Hsieh et al., 2024] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654. COLM 2024 (per arXiv).
[14]

[Jiang et al., 2023] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and Willi...
[15]

Efficient Memory Management for Large Language Model Serving with PagedAttention

[Kwon et al., 2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP '23).
[16]

SnapKV: LLM Knows What You are Looking for Before Generation

[Li et al., 2024] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469.
[17]

Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time

[Liu et al., 2023] Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. arXiv preprint arXiv:2305.17118.
[18]

Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction

[Liu et al., 2025b] Yijun Liu, Yixuan Wang, Yuzhuang Xu, Shiyu Ji, Yang Xu, Qingfu Zhu, and Wanxiang Che. Judge Q: Trainable queries for optimized information retention in KV cache eviction. arXiv preprint arXiv:2509.10798.
[19]

Learning to compress prompts with gist tokens

[Mu et al., 2023] Jesse Mu, Xiang Lisa Li, and Noah Goodman. Learning to compress prompts with gist tokens. arXiv preprint arXiv:2304.08467. NeurIPS 2023 (per arXiv).
[20]

Efficiently Scaling Transformer Inference

[Pope et al., 2023] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. In Proceedings of Machine Learning and Systems, volume 5.
[21]

Compressive Transformers for Long-Range Sequence Modelling

[Rae et al., 2019] Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507.
[22]

On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference

[Ren and Zhu, 2024] Siyu Ren and Kenny Q. Zhu. On the efficacy of eviction policy for key-value constrained generative language model inference. arXiv preprint arXiv:2402.06262.
[23]

SlimPajama-DC: Understanding Data Combinations for LLM Training

[Shen et al., 2024] Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Zhengzhong Liu, Hongyi Wang, Bowen Tan, Joel Hestness, Natalia Vassilieva, Daria Soboleva, and Eric Xing. SlimPajama-DC: Understanding data combinations for LLM training.
[24]

D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models

[Wan et al., 2024] Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, and Mi Zhang. D2O: Dynamic discriminative operations for efficient long-context inference of large language models. arXiv preprint arXiv:2406.13035. ICLR 2025 (per arXiv).
[25]

Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query

[Wang et al., 2025] Yixuan Wang, Shiyu Ji, Yijun Liu, Yuzhuang Xu, Yang Xu, Qingfu Zhu, and Wanxiang Che. Lookahead Q-Cache: Achieving more consistent KV cache eviction via pseudo query. arXiv preprint arXiv:2505.20334. Accepted by EMNLP 2025 Main (per arXiv).
[26]

Efficient Streaming Language Models with Attention Sinks

[Xiao et al., 2023] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.
[27]

No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization

[Yang et al., 2024] June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, and Dongsoo Lee. No token left behind: Reliable KV cache compression via importance-aware mixed precision quantization. arXiv preprint arXiv:2402.18096.
[28]

H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

[Zhang et al., 2023] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H$_2$O: Heavy-hitter oracle for efficient generative inference of large language models. arXiv preprint arXiv:2306.14048.
[29]

CaM: Cache merging for memory-efficient LLMs inference

[Zhang et al., 2024] Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, and Rongrong Ji. CaM: Cache merging for memory-efficient LLMs inference. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on M...

[1] [1]

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

[Baiet al., 2023 ] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding.arXiv preprint arXiv:2308.14508,

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Token Merging: Your ViT But Faster

[Bolyaet al., 2022 ] Daniel Bolya, Cheng-Yang Fu, Xiao- liang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv preprint arXiv:2210.09461,

work page internal anchor Pith review Pith/arXiv arXiv 2022

[3] [3]

[Caiet al., 2024a ] Ruisi Cai, Yuandong Tian, Zhangyang Wang, and Beidi Chen

Accepted at ICLR 2023 (Oral), per arXiv. [Caiet al., 2024a ] Ruisi Cai, Yuandong Tian, Zhangyang Wang, and Beidi Chen. Lococo: Dropping in convo- lutions for long context compression.arXiv preprint arXiv:2406.05317,

work page arXiv 2023

[4] [4]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

[Caiet al., 2024b ] Zefan Cai, Yichi Zhang, Bofei Gao, Yu- liang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. Pyramidkv: Dynamic kv cache compression based on pyramidal infor- mation funneling.arXiv preprint arXiv:2406.02069,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Adapting lan- guage models to compress contexts.arXiv preprint arXiv:2305.14788,

[Chevalieret al., 2023 ] Alexis Chevalier, Alexander Wet- tig, Anirudh Ajith, and Danqi Chen. Adapting lan- guage models to compress contexts.arXiv preprint arXiv:2305.14788,

work page arXiv 2023

[6] [6]

Gonzalez, Ion Stoica, and Eric P

[Chianget al., 2023 ] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chat- bot impressing gpt-4 with 90%* chatgpt quality, March

work page 2023

[7] [7]

Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e

[Daoet al., 2022 ] Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e. Flashattention: Fast and memory-efficient exact attention with io-awareness.Ad- vances in Neural Information Processing Systems,

work page 2022

[8] [8]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

[Gaoet al., 2020 ] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027,

work page internal anchor Pith review Pith/arXiv arXiv 2020

[9] [9]

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

[Geet al., 2023 ] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801,

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

Accurate kv cache eviction via anchor direction projection for efficient llm inference

[Genget al., 2025 ] Zijie Geng, Jie Wang, Ziqi Liu, Feng Ju, Yiming Li, Xing Li, Mingxuan Yuan, Jianye Hao, Defu Lian, Enhong Chen, and Feng Wu. Accurate kv cache eviction via anchor direction projection for efficient llm inference. InAdvances in Neural Information Processing Systems,

work page 2025

[11] [11]

The Llama 3 Herd of Models

NeurIPS 2025 (OpenReview). [Grattafioriet al., 2024 ] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Zipcache: Accu- rate and efficient kv cache quantization with salient token identification.arXiv preprint arXiv:2405.14256,

[Heet al., 2024 ] Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. Zipcache: Accu- rate and efficient kv cache quantization with salient token identification.arXiv preprint arXiv:2405.14256,

work page arXiv 2024

[13] [13]

RULER: What's the Real Context Size of Your Long-Context Language Models?

[Hsiehet al., 2024 ] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654,

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

[Jianget al., 2023 ] Albert Q

COLM 2024 (per arXiv). [Jianget al., 2023 ] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, L´elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth ´ee Lacroix, and Willi...

work page 2024

[15] [15]

Gonzalez, Hao Zhang, and Ion Stoica

[Kwonet al., 2023 ] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Effi- cient memory management for large language model serv- ing with pagedattention. InProceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP ’23),

work page 2023

[16] [16]

SnapKV: LLM Knows What You are Looking for Before Generation

[Liet al., 2024 ] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation.arXiv preprint arXiv:2404.14469,

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [Liu et al., 2023] Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. arXiv preprint arXiv:2305.17118, 2023.

[18] [Liu et al., 2025b] Yijun Liu, Yixuan Wang, Yuzhuang Xu, Shiyu Ji, Yang Xu, Qingfu Zhu, and Wanxiang Che. Judge Q: Trainable queries for optimized information retention in KV cache eviction. arXiv preprint arXiv:2509.10798, 2025.

[19] [Mu et al., 2023] Jesse Mu, Xiang Lisa Li, and Noah Goodman. Learning to compress prompts with gist tokens. arXiv preprint arXiv:2304.08467, 2023.

[20] [Pope et al., 2023] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. In Proceedings of Machine Learning and Systems, volume 5, 2023.

[21] [Rae et al., 2019] Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.

[22] [Ren and Zhu, 2024] Siyu Ren and Kenny Q. Zhu. On the efficacy of eviction policy for key-value constrained generative language model inference. arXiv preprint arXiv:2402.06262, 2024.

[23] [Shen et al., 2024] Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Zhengzhong Liu, Hongyi Wang, Bowen Tan, Joel Hestness, Natalia Vassilieva, Daria Soboleva, and Eric Xing. SlimPajama-DC: Understanding data combinations for LLM training, 2024.

[24] [Wan et al., 2024] Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, and Mi Zhang. D2O: Dynamic discriminative operations for efficient long-context inference of large language models. arXiv preprint arXiv:2406.13035, 2024.

[25] [Wang et al., 2025] Yixuan Wang, Shiyu Ji, Yijun Liu, Yuzhuang Xu, Yang Xu, Qingfu Zhu, and Wanxiang Che. Lookahead Q-Cache: Achieving more consistent KV cache eviction via pseudo query. arXiv preprint arXiv:2505.20334, 2025.

[26] [Xiao et al., 2023] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.

[27] [Yang et al., 2024] June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho Park, Eunho Yang, Se Jung Kwon, and Dongsoo Lee. No token left behind: Reliable KV cache compression via importance-aware mixed precision quantization. arXiv preprint arXiv:2402.18096, 2024.

[28] [Zhang et al., 2023] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H$_2$O: Heavy-hitter oracle for efficient generative inference of large language models. arXiv preprint arXiv:2306.14048, 2023.

[29] [Zhang et al., 2024] Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, and Rongrong Ji. CaM: Cache merging for memory-efficient LLMs inference. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on M...