
arxiv: 2605.22337 · v1 · pith:RF5AMAPLnew · submitted 2026-05-21 · 💻 cs.AI

Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

Pith reviewed 2026-05-22 05:07 UTC · model grok-4.3

classification 💻 cs.AI
keywords KV cache compression · meta-tokens · Gumbel-Softmax selector · attention-flow integration · long-context LLMs · dynamic eviction · context preservation · soft token synthesis

The pith

Meta-Soft dynamically composes prompt-specific meta-tokens from a learnable basis to compress KV caches while redistributing semantic information from evicted tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models suffer memory and speed problems as their KV caches grow linearly with context length. Prior eviction methods use fixed soft tokens that cannot adapt to each new prompt, and they permanently discard information when they remove KV pairs. The paper introduces Meta-Soft, which maintains a meta-library of orthogonal vectors and employs a selector network with Gumbel-Softmax to produce sparse weights that synthesize a small set of targeted soft tokens for any given input. These tokens are appended to probe the sequence, after which an attention-flow mechanism moves the semantic content of dropped tokens into the retained ones. Experiments across several datasets show the resulting compressed cache outperforms prior eviction techniques.

Core claim

By constructing a meta-library as a learnable orthogonal basis and using a selector network with Gumbel-Softmax to generate differentiable sparse combination weights, the method synthesizes the most relevant soft tokens from prompt features; an attention-flow integration step then redistributes the information of removed KV pairs into the kept tokens, preventing irreversible context loss and enabling more effective dynamic compression than static approaches.
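
As an editorial illustration of the composition step, the sketch below draws Gumbel-Softmax weights over a tiny basis and forms one soft token as their weighted sum. The basis values, the logits, the temperature, and the names `gumbel_softmax` and `compose_soft_token` are assumptions made for the example; the abstract does not specify the selector architecture or how the weights are made sparse.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Sample relaxed one-hot weights: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise by inverse transform: -log(-log(U)), U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def compose_soft_token(basis, logits, temperature=0.5, rng=None):
    """Return (soft_token, weights) with soft_token = sum_i weights[i] * basis[i]."""
    rng = rng or random.Random(0)
    weights = gumbel_softmax(logits, temperature, rng)
    dim = len(basis[0])
    token = [sum(w * row[d] for w, row in zip(weights, basis)) for d in range(dim)]
    return token, weights


# Hypothetical meta-library: three orthonormal vectors in three dimensions.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical selector logits; in the paper these are derived from prompt features.
logits = [2.0, 0.1, -1.0]
token, weights = compose_soft_token(basis, logits)
```

Lower temperatures push the sampled weights toward a one-hot choice, which is one way a relaxed selector can approximate sparse selection while remaining differentiable.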

What carries the argument

A learnable orthogonal basis matrix that serves as a meta-library, paired with a Gumbel-Softmax selector that produces sparse weights for synthesizing prompt-specific soft tokens and an attention-flow mechanism that redistributes semantic content from evicted tokens.
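
The abstract does not give the attention-flow update rule, so the following is only a hypothetical stand-in showing what "moving rather than erasing" can mean: each dropped value vector is split across the kept positions in proportion to a softmax over key dot products. The weighting rule, the toy keys and values, and the name `redistribute` are assumptions, not the authors' mechanism.

```python
import math


def softmax(xs):
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def redistribute(keys, values, keep, drop):
    """Fold each dropped value vector into the kept ones.

    Dropped token j spreads its value over kept tokens i with weights
    softmax_i(keys[j] . keys[i]). This weighting is an illustrative assumption.
    """
    merged = {i: list(values[i]) for i in keep}
    for j in drop:
        flow = softmax([dot(keys[j], keys[i]) for i in keep])
        for w, i in zip(flow, keep):
            for d, v in enumerate(values[j]):
                merged[i][d] += w * v
    return merged


# Toy cache: tokens 1 and 3 are dropped, tokens 0 and 2 are kept.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
merged = redistribute(keys, values, keep=[0, 2], drop=[1, 3])
```

Because each dropped token's flow weights sum to one, the total of the value vectors is conserved across the merge in this sketch; whether the paper's mechanism has the same property is not stated.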

If this is right

  • LLMs can handle longer input sequences under the same memory budget.
  • Eviction decisions adapt automatically to each prompt instead of relying on a fixed query.
  • Context breaks are reduced because semantic content of dropped tokens is moved rather than erased.
  • Overall decoding efficiency improves while accuracy on the tested tasks exceeds that of prior eviction methods.
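
To make the memory-budget point concrete, the arithmetic below uses the standard KV cache size formula (one key and one value vector per token, per layer, per KV head). The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit elements) and the 4,096-token budget are hypothetical values chosen for the example, not figures from the paper.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_elem=2, batch=1):
    """Bytes held by keys and values across all layers (the factor 2 is K plus V)."""
    return 2 * batch * layers * kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, used only for illustration.
CONFIG = dict(layers=32, kv_heads=8, head_dim=128)

full = kv_cache_bytes(32768, **CONFIG)        # uncompressed 32K-token context
compressed = kv_cache_bytes(4096, **CONFIG)   # cache trimmed to a 4K-token budget
print(full / 2**30, compressed / 2**30)  # 4.0 0.5
```

Under these assumed numbers the cache shrinks from 4 GiB to 0.5 GiB per sequence; the handful of appended soft tokens would add a small amount on top.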

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same composable-token idea could be applied to compress other internal state structures such as activation caches in non-transformer architectures.
  • Training the meta-library once and freezing it might allow the selector to be reused across multiple models without retraining the entire system.
  • The attention-flow redistribution could be measured directly by comparing hidden-state similarity before and after eviction to quantify information retention.
  • Extending the approach to streaming inputs might let the system continuously update the soft-token set as new tokens arrive.
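
The retention measurement suggested in the list above could be prototyped as follows. This is a minimal sketch that uses a single attention head's output as a stand-in for a hidden state and compares it before and after eviction with cosine similarity; the toy query, keys, values, and kept indices are all made up for the example.

```python
import math


def attention_output(query, keys, values):
    """Scaled dot-product attention output for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[d] for e, v in zip(exps, values)) for d in range(dim)]


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


# Toy single-head cache with four positions; positions 0 and 2 survive eviction.
query = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.2]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, -0.3]]
keep = [0, 2]

before = attention_output(query, keys, values)
after = attention_output(query, [keys[i] for i in keep], [values[i] for i in keep])
retention = cosine(before, after)  # closer to 1.0 means less drift after eviction
```

Running the same comparison with and without a redistribution step would give a direct, if coarse, estimate of how much information that step preserves.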

Load-bearing premise

The selector network's combination weights correctly identify changing task relevance in the prompt, and the attention-flow step transfers all necessary semantic information from removed tokens into retained ones without permanent loss.

What would settle it

Running the method on a dataset containing prompts with rapidly shifting relevance and finding either no accuracy gain over static soft-token baselines or measurable degradation traceable to information discarded during eviction.

Figures

Figures reproduced from arXiv: 2605.22337 by Huanyu Qu, Jiang Cai, Mingkun Xu, Songchen Ma, Wei Luo, Yi Huang.

Figure 1: Motivation and overview of Meta-Soft. Left: Existing KV-cache compression often relies on static queries for eviction, which fail to adapt across diverse tasks and may cause cross-task mismatch; moreover, hard eviction permanently deletes KV entries, leading to irreversible information loss and broken context. Right: Meta-Soft uses input-dependent dynamic soft tokens synthesized from a meta-library to prob…
Figure 2: Meta-Soft framework overview. Meta-Soft trains a Meta-Library and selector offline with Ground-Truth Attention supervision and compresses the KV cache online by generating prompt-conditioned soft tokens to probe, partition, and consolidate context for decoding. Cache Partitioning: Based on $A_{\text{soft}}$ and budget $B$, we partition the cache: $I_{\text{keep}} = \operatorname{TopK}(A_{\text{soft}}, B)$, $I_{\text{drop}} = \{1, \dots, L\} \setminus I_{\text{keep}}$ (6). This yields t…
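
The cache partitioning rule quoted in the Figure 2 excerpt (Eq. 6) can be sketched in a few lines. The function name `partition_cache` and the example scores are hypothetical, the excerpt does not say how the soft-token scores are aggregated into one value per position, and the code uses 0-based positions where the equation uses 1 through L.

```python
def partition_cache(soft_scores, budget):
    """Split positions into (keep, drop) by top-`budget` score, following Eq. (6)."""
    if not 0 <= budget <= len(soft_scores):
        raise ValueError("budget must be between 0 and the sequence length")
    # Rank positions from highest to lowest score; ties keep their original order.
    ranked = sorted(range(len(soft_scores)), key=lambda i: soft_scores[i], reverse=True)
    keep = sorted(ranked[:budget])  # restore positional order for the kept set
    keep_set = set(keep)
    drop = [i for i in range(len(soft_scores)) if i not in keep_set]
    return keep, drop


# Hypothetical per-position scores for a five-token prompt, with a budget of two.
keep, drop = partition_cache([0.05, 0.40, 0.10, 0.30, 0.15], budget=2)
print(keep, drop)  # [1, 3] [0, 2, 4]
```
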
Original abstract

The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficiency when they process long contexts. Current KV Cache eviction has become an important research direction; however, existing methods based on fixed Soft Tokens (e.g., Judge Q) rely on a static parameter set as the query to evaluate the importance of KV pairs, so they cannot adapt dynamically to different input prompts, and they cannot precisely capture complex and changing task relevance. Also, evicted KV pairs are discarded permanently, so this causes irreversible information loss and context breaks. To address this problem, we propose Meta-Soft, a dynamic compression framework based on probe-driven context integration. Specifically, we build a meta-library with a learnable orthogonal basis matrix $\mathcal{L}$, and we use a selector network with Gumbel-Softmax to produce differentiable sparse combination weights, so we dynamically synthesize the most targeted $k$ Soft Tokens from the input prompt features. We append these Soft Tokens to the end of the input sequence to probe key information. We also introduce an attention-flow based integration mechanism, which redistributes the semantic information of removed tokens into retained tokens, and this keeps the dropped context information effectively. Experiments on multiple datasets show that our method outperforms existing state-of-the-art eviction methods and provides a new solution for KV Cache compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes Meta-Soft, a dynamic KV cache compression framework for LLMs. It builds a meta-library from a learnable orthogonal basis matrix L, uses a selector network with Gumbel-Softmax to synthesize k targeted soft tokens from input prompt features, appends these tokens to probe key information, and applies an attention-flow integration mechanism to redistribute semantic information from evicted tokens into retained ones, thereby avoiding irreversible context loss. The work claims that this approach outperforms existing state-of-the-art eviction methods on multiple datasets.

Significance. If the central claims hold, the method could meaningfully improve long-context LLM efficiency by enabling prompt-adaptive compression that preserves more context than static soft-token baselines. The composable meta-tokens and attention-flow redistribution constitute a distinct technical contribution relative to prior fixed-parameter eviction techniques.

major comments (1)
  1. [Abstract (attention-flow integration mechanism)] The attention-flow based integration mechanism is asserted to redistribute semantic information of removed tokens into retained tokens without irreversible loss (Abstract). However, no quantitative bound, ablation study isolating this step, or analysis of behavior under sparse/diluted attention patterns is supplied. This assumption is load-bearing for the context-preservation claim; failure in long-context regimes where attention concentrates on few positions would undermine the no-loss guarantee.
minor comments (1)
  1. [Abstract] The abstract states that experiments were run 'on multiple datasets' and 'outperform existing state-of-the-art' but supplies neither dataset names, quantitative metrics, error bars, nor ablation results; adding these details would strengthen the presentation.
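
The major comment turns on how the method behaves when attention is concentrated or diluted. One simple diagnostic for sorting test cases into those regimes is the normalized Shannon entropy of an attention row, sketched below; the example rows and the name `normalized_entropy` are assumptions for illustration and do not come from the paper or the report.

```python
import math


def normalized_entropy(weights):
    """Shannon entropy of an attention row divided by log(n): 0 is one-hot, 1 is uniform."""
    n = len(weights)
    if n < 2:
        return 0.0
    entropy = -sum(w * math.log(w) for w in weights if w > 0.0)
    return entropy / math.log(n)


# Hypothetical attention rows over four positions.
concentrated = [0.97, 0.01, 0.01, 0.01]
diluted = [0.25, 0.25, 0.25, 0.25]
scores = (normalized_entropy(concentrated), normalized_entropy(diluted))
```

Reporting accuracy separately for low-entropy and high-entropy prompts would show whether the redistribution step holds up in the regime the referee is worried about.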

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their constructive comments. The concern regarding the attention-flow integration mechanism is well-taken, and we address it directly below while revising the manuscript to strengthen the supporting evidence for our context-preservation claims.

Point-by-point responses
  1. Referee: [Abstract (attention-flow integration mechanism)] The attention-flow based integration mechanism is asserted to redistribute semantic information of removed tokens into retained tokens without irreversible loss (Abstract). However, no quantitative bound, ablation study isolating this step, or analysis of behavior under sparse/diluted attention patterns is supplied. This assumption is load-bearing for the context-preservation claim; failure in long-context regimes where attention concentrates on few positions would undermine the no-loss guarantee.

    Authors: We agree that the attention-flow integration mechanism is central to the context-preservation claim. In the revised manuscript we add an ablation study that isolates this component by comparing full Meta-Soft against a variant that performs eviction without the redistribution step; the results show a consistent drop in long-context task accuracy when the mechanism is removed. We also include a new analysis section that examines attention patterns on long-context benchmarks, including regimes with sparse and concentrated attention. These experiments indicate that the redistribution step continues to improve retention metrics even when attention focuses on a small number of positions. A strict theoretical quantitative bound on information preservation is not supplied, as deriving one would require assumptions on attention distributions that do not hold across all models and tasks; we instead rely on the empirical evidence from the ablations and pattern analysis.
    revision: yes

standing simulated objections not resolved
  • Deriving a rigorous quantitative theoretical bound on semantic information preservation under the attention-flow mechanism.

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent mechanisms

full rationale

The paper proposes a meta-library with learnable orthogonal basis matrix L, a selector network using Gumbel-Softmax for dynamic sparse combination weights to synthesize soft tokens from prompt features, and an attention-flow integration mechanism to redistribute semantics of removed tokens. These are presented as new constructions without any reduction of the central claims to fitted inputs by construction, self-definitional loops, or load-bearing self-citations. The abstract and described framework remain self-contained against external benchmarks, with no quoted equations or steps that equate predictions directly to their own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The framework depends on the learnability of the orthogonal basis matrix and the effectiveness of Gumbel-Softmax selection plus attention redistribution, both introduced without external benchmarks in the abstract.

free parameters (2)
  • learnable orthogonal basis matrix L
    Core of the meta-library; parameters are trained to enable synthesis of targeted soft tokens.
  • number k of synthesized soft tokens
    Chosen hyperparameter controlling compression ratio and probe strength.
axioms (1)
  • Gumbel-Softmax enables differentiable sparse selection (standard math)
    Invoked to allow end-to-end training of the selector network from prompt features.
invented entities (2)
  • meta-library with learnable orthogonal basis (no independent evidence)
    purpose: Provides basis vectors for dynamic synthesis of prompt-specific soft tokens
    New construct introduced to overcome limitations of fixed soft tokens.
  • attention-flow based integration mechanism (no independent evidence)
    purpose: Redistributes semantic information from evicted KV pairs into retained tokens
    New mechanism claimed to prevent irreversible context loss.
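
The ledger lists the orthogonal basis as a learned parameter, but the abstract does not say how orthogonality is maintained during training. A common option, shown here purely as an assumption, is to penalize the squared Frobenius norm of the Gram matrix's deviation from the identity; the function name and the toy matrices are hypothetical.

```python
def orthogonality_penalty(basis):
    """Squared Frobenius norm of (B B^T - I) for a basis B stored row by row."""
    penalty = 0.0
    for i, row_i in enumerate(basis):
        for j, row_j in enumerate(basis):
            gram = sum(a * b for a, b in zip(row_i, row_j))
            target = 1.0 if i == j else 0.0
            penalty += (gram - target) ** 2
    return penalty


orthonormal = [[1.0, 0.0], [0.0, 1.0]]  # rows are orthonormal: no penalty
skewed = [[1.0, 0.0], [1.0, 0.0]]       # duplicated row: off-diagonal entries penalized
print(orthogonality_penalty(orthonormal), orthogonality_penalty(skewed))  # 0.0 2.0
```

Other choices, such as re-orthogonalizing after each update or using an orthogonal parameterization, would serve the same purpose; the source does not indicate which, if any, the authors use.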

pith-pipeline@v0.9.0 · 5783 in / 1400 out tokens · 45945 ms · 2026-05-22T05:07:21.660568+00:00 · methodology

