pith. sign in

arxiv: 2606.03075 · v1 · pith:SWTIWW5Wnew · submitted 2026-06-02 · 💻 cs.CV

TGV-KV: Text-Grounded KV Eviction for Vision-Language Models

Pith reviewed 2026-06-28 11:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords KV cache evictionvision-language modelstext-grounded evictionmutual informationattention weightinginference accelerationvisual question answeringcache compression
0
0 comments X

The pith

TGV-KV evicts visual KV cache in vision-language models by grounding decisions in text, preserving 99.2 percent accuracy at 5 percent retention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models store keys and values for all tokens during generation, but visual tokens carry substantial redundancy that language-only eviction methods overlook. The paper argues that visual importance must be determined by interaction with the accompanying text to avoid performance drops. It implements this idea through three modules that allocate layer budgets via mutual information, rank visual tokens using text-weighted attention, and protect all text tokens from eviction. The result is a large reduction in memory use and faster inference while task accuracy stays nearly unchanged. This matters for running multimodal models on longer inputs or constrained hardware without retraining.

Core claim

The central claim is that the modality gap in VLMs requires text to guide visual KV eviction. TGV-KV realizes this via Text-Vision Budgeting that sets per-layer retention using mutual information, Text-Weighted Ranking that scores visual importance from text-image attention weights, and Text-Prioritised Retention that keeps every text KV entry. Across five models the method keeps 99.2 percent of full-KV accuracy on VizWiz-VQA with LLaVA-NeXT at a 5 percent retention budget and raises end-to-end throughput by 52.6 percent.

What carries the argument

TGV-KV's three submodules—Text-Vision Budgeting via mutual information, Text-Weighted Ranking via weighted text-image attention, and Text-Prioritised Retention—that use textual guidance to decide which visual KV entries to keep.

If this is right

  • KV cache memory for VLMs grows much more slowly with increasing context length.
  • End-to-end throughput rises by more than 50 percent at very low retention budgets.
  • Visual question answering accuracy remains within 1 percent of the full-cache baseline.
  • The approach works across VLMs of different sizes and architectures.
  • Text KV entries are never evicted, preventing sudden loss of language context.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same text-grounding idea could reduce cache pressure in video or audio-language models.
  • Longer multimodal conversations become feasible on edge devices with limited memory.
  • Making the per-layer budgets change during generation might yield further speed gains.
  • Direct comparison to purely vision-based eviction on identical tasks would quantify the benefit of text guidance.

Load-bearing premise

Mutual information between text and vision plus weighted attention scores can reliably identify which visual KV entries can be evicted without losing task-critical information.

What would settle it

If accuracy on VizWiz-VQA with LLaVA-NeXT falls below 95 percent of the full-KV baseline when TGV-KV is applied at the 5 percent retention budget, the claim that text-grounded selection preserves necessary visual details would not hold.

Figures

Figures reproduced from arXiv: 2606.03075 by Jizhihui Liu, Miao Zhang, Rui Shao, Ruizi Han, Weili Guan, Xuebo Liu, Yaowei Wang.

Figure 1
Figure 1. Figure 1: Visualization and Consequence of Modality Gap. (a) The average cosine similarity of vision and text tokens with k neighbours. (b) The layer 2 attention map (in log scale) in LLaVA, the intra-modality part (text-text, vision-vision) is much more intense than inter-modality (text-vision). (c) Accumulated attention score across all the layers, the eviction is highly uneven and most text KV pairs are evicted b… view at source ↗
Figure 2
Figure 2. Figure 2: Attention Map of VLMs. We average all the heads and plot the attention map in LLaVA-1.5-7B and Qwen3-VL-8B. Each zoomed-in area denotes the text-text attention in a multimodal input. We label a few dominant text token in orange box. More visualizations can be found in Appendix C.1 [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of TGV-KV framework. Our method consists of three submodules or policies: (1) Text-Vision Budgeting adaptively allocates layer-wise KV budgets via text-image attention summation; (2) Text-Weighted Ranking determines vision importance score by re-weighting text-image attention with averaged text attention, and text importance by attention summation; (3) Text-Prioritised Retention selectively evicts… view at source ↗
Figure 4
Figure 4. Figure 4: Accuracy Results on Video-TT (Zhang et al., 2025b) using Qwen3-VL-4B. Vanilla refers to the original model with the full computation budget. Detailed breakdowns on subtasks and full numerical results are provided in Appendix B.2 [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of Retained KV. We plot the retained KV of one sample from POPE (Li et al., 2023). The visualization is from LLaVA (32 layers) with an overall retention of 3%. The patches most related to the text instruction are circled in violet. Evicted KVs are masked or labelled in gray. More visualizations are in Appendix C.2 [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Accuracy Results on Different Lengths. We count the model accuracy with different text instruction lengths in ScienceQA (Lu et al., 2022). The results stay smooth and stable under different lengths, indicating removing all visual KV makes a small influence. B. Extended Experiment Results B.1. Text-Dominant Results We provide a comprehensive evaluation of text-dominant tasks in [PITH_FULL_IMAGE:figures/ful… view at source ↗
Figure 7
Figure 7. Figure 7: Visualization of Attention Maps. The image is displayed in log scale [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of Retained KV. The patches most related to the text instruction are circled in violet. Evicted KVs are masked or labelled in gray. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
read the original abstract

Vision-Language Models (VLMs) inherit the auto-regressive generation paradigm and cache the keys and values (KV) of all previous tokens to accelerate inference, resulting in memory consumption that scales linearly with context length. This issue is particularly pronounced in VLMs due to substantial redundancy in the visual modality. Although KV cache eviction approaches can effectively reduce inference memory, they often incur significant performance degradation in VLMs, as most are designed for language models and overlook the inherent gap between text and vision. By systematically analyzing the modality gap in VLMs in this work, we argue that the importance of visual information should be grounded in textual guidance and accordingly propose a Text-Grounded KV Eviction method for VLMs (TGV-KV). TGV-KV comprises three submodules: (1) Text-Vision Budgeting (TVB) assigns budget to each layer based on the mutual information interaction. (2) Text-Weighted Ranking (TWR) assesses the priority of text and ranks vision importance based on weighted text-image attention. (3) Text-Prioritised Retention (TPR) policy strategically preserves text KV to avoid acute information loss. We evaluate TGV-KV across five models with different sizes and architectures, showing that TGV-KV preserves 99.2% full-KV accuracy on the VizWiz-VQA task with LLaVA-NeXT and boosts end-to-end throughput by 52.6% with an extreme retention budget of 5%. Code is available at https://github.com/Danielement321/TGV-KV.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes TGV-KV, a KV-cache eviction method for vision-language models that grounds visual token importance in textual guidance. It introduces three components—Text-Vision Budgeting (TVB) via mutual information, Text-Weighted Ranking (TWR) via weighted text-image attention, and Text-Prioritised Retention (TPR)—and reports that the method retains 99.2% of full-KV accuracy on VizWiz-VQA using LLaVA-NeXT while achieving a 52.6% end-to-end throughput gain at a 5% retention budget. The approach is evaluated across five models of varying sizes and architectures, with code released publicly.

Significance. If the empirical claims hold under scrutiny, the work offers a practical, text-aware engineering solution to the memory scaling problem in VLMs that language-model eviction methods overlook. The public code release is a clear strength that enables direct reproducibility and extension.

major comments (3)
  1. [TWR description] TWR submodule: the ranking of vision KV entries by weighted text-image attention is presented as identifying task-critical tokens, yet no analysis or ablation demonstrates that these scores correlate with downstream VQA answer correctness rather than superficial modality alignment or early-layer features. This assumption is load-bearing for the 99.2% retention result at the 5% budget.
  2. [TVB description] TVB submodule: mutual-information-based budget allocation across layers is claimed to respect the text-vision gap, but the manuscript supplies no verification that the resulting per-layer budgets preserve answer-relevant visual evidence instead of modality-level statistics.
  3. [Experiments] Experimental section: the abstract states strong numerical results on VizWiz-VQA and throughput, yet the manuscript contains no ablation studies isolating TVB/TWR/TPR contributions, no error bars, and no comparison against attention-only or random-eviction baselines under identical 5% budgets.
minor comments (1)
  1. The abstract claims evaluation on five models but does not name the models or additional tasks beyond VizWiz-VQA.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments below and indicate the revisions we will make to strengthen the paper.

read point-by-point responses
  1. Referee: [TWR description] TWR submodule: the ranking of vision KV entries by weighted text-image attention is presented as identifying task-critical tokens, yet no analysis or ablation demonstrates that these scores correlate with downstream VQA answer correctness rather than superficial modality alignment or early-layer features. This assumption is load-bearing for the 99.2% retention result at the 5% budget.

    Authors: We acknowledge that a direct analysis linking TWR scores to answer correctness is not present in the current manuscript. The method is grounded in the principle that text provides guidance for visual token importance in VQA tasks. To address this, we will add an ablation study in the revised manuscript that examines the correlation between TWR rankings and token impact on VQA performance, including comparisons to early-layer attention features. revision: yes

  2. Referee: [TVB description] TVB submodule: mutual-information-based budget allocation across layers is claimed to respect the text-vision gap, but the manuscript supplies no verification that the resulting per-layer budgets preserve answer-relevant visual evidence instead of modality-level statistics.

    Authors: We agree that explicit verification of the per-layer budgets' effectiveness in preserving answer-relevant evidence would be valuable. The TVB uses mutual information to capture the text-vision interaction, but we will include additional analysis in the revision, such as layer-wise performance contributions or visualizations, to demonstrate that the allocation goes beyond modality-level statistics. revision: yes

  3. Referee: [Experiments] Experimental section: the abstract states strong numerical results on VizWiz-VQA and throughput, yet the manuscript contains no ablation studies isolating TVB/TWR/TPR contributions, no error bars, and no comparison against attention-only or random-eviction baselines under identical 5% budgets.

    Authors: We recognize the importance of these experimental controls. The current results demonstrate the overall effectiveness across multiple models, but we will revise the experimental section to include: (1) ablations isolating the contribution of each submodule (TVB, TWR, TPR), (2) error bars from repeated runs, and (3) direct comparisons to attention-only and random-eviction baselines at the 5% budget on VizWiz-VQA and other tasks. revision: yes

Circularity Check

0 steps flagged

No circularity: method is an independent engineering proposal with external evaluation

full rationale

The paper introduces TGV-KV via three submodules (TVB using mutual information, TWR using text-weighted attention, TPR for text retention) and reports empirical results on VizWiz-VQA and throughput. No equations, derivations, or self-citations are shown that reduce the claimed accuracy retention or speedup to quantities fitted inside the paper. The central claims rest on benchmark measurements rather than any self-referential definition or prediction-by-construction. This is the normal case of a non-circular applied-methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review is based solely on the abstract; no explicit free parameters, axioms, or invented entities are stated beyond the high-level design choices.

axioms (1)
  • domain assumption Visual modality in VLMs contains substantial redundancy that can be identified via text guidance
    Abstract states this as the motivation for the approach.

pith-pipeline@v0.9.1-grok · 5834 in / 1206 out tokens · 42310 ms · 2026-06-28T11:01:18.930337+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 5 canonical work pages

  1. [1]

    arXiv preprint arXiv:2511.21631 , year=

    Qwen3-VL Technical Report , author=. arXiv preprint arXiv:2511.21631 , year=

  2. [2]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Improved baselines with visual instruction tuning , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  3. [3]

    arXiv preprint arXiv:2407.21783 , year=

    The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

  4. [4]

    Advances in neural information processing systems , volume=

    Attention is all you need , author=. Advances in neural information processing systems , volume=

  5. [5]

    Advances in neural information processing systems , volume=

    Visual instruction tuning , author=. Advances in neural information processing systems , volume=

  6. [6]

    arXiv preprint arXiv:2505.09388 , year=

    Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

  7. [7]

    International conference on machine learning , pages=

    Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

  8. [8]

    arXiv preprint arXiv:2407.11550 , year=

    Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference , author=. arXiv preprint arXiv:2407.11550 , year=

  9. [9]

    Ao Wang and Hui Chen and Jianchao Tan and Kefeng Zhang and Xunliang Cai and Zijia Lin and Jungong Han and Guiguang Ding , booktitle=. Prefix. 2025 , url=

  10. [10]

    Advances in Neural Information Processing Systems , volume=

    H2o: Heavy-hitter oracle for efficient generative inference of large language models , author=. Advances in Neural Information Processing Systems , volume=

  11. [11]

    Proceedings of the IEEE/CVF international conference on computer vision , year=

    SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs , author=. Proceedings of the IEEE/CVF international conference on computer vision , year=

  12. [12]

    Advances in Neural Information Processing Systems , volume=

    Snapkv: Llm knows what you are looking for before generation , author=. Advances in Neural Information Processing Systems , volume=

  13. [13]

    arXiv preprint arXiv:2509.10798 , year=

    Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction , author=. arXiv preprint arXiv:2509.10798 , year=

  14. [14]

    A Simple and Effective L\_2 Norm-Based Strategy for KV Cache Compression

    Devoto, Alessio and Zhao, Yu and Scardapane, Simone and Minervini, Pasquale. A Simple and Effective L\_2 Norm-Based Strategy for KV Cache Compression. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.1027

  15. [15]

    arXiv preprint arXiv:2508.00553 , year=

    Hiprune: Training-free visual token pruning via hierarchical attention in vision-language models , author=. arXiv preprint arXiv:2508.00553 , year=

  16. [16]

    European Conference on Computer Vision , pages=

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  17. [17]

    arXiv preprint arXiv:2303.08774 , year=

    Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

  18. [18]

    ICLR , year=

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. ICLR , year=

  19. [19]

    Proceedings of the IEEE/CVF international conference on computer vision , pages=

    Sigmoid loss for language image pre-training , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

  20. [20]

    Advances in neural information processing systems , volume=

    Flamingo: a visual language model for few-shot learning , author=. Advances in neural information processing systems , volume=

  21. [21]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  22. [22]

    arXiv preprint arXiv:2312.11805 , year=

    Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=

  23. [23]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Visionzip: Longer is better but not necessary in vision language models , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  24. [24]

    arXiv preprint arXiv:2506.10967 , year=

    Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs , author=. arXiv preprint arXiv:2506.10967 , year=

  25. [25]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Beyond text-visual attention: Exploiting visual cues for effective token pruning in vlms , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  26. [26]

    arXiv preprint arXiv:2406.02069 , year=

    Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling , author=. arXiv preprint arXiv:2406.02069 , year=

  27. [27]

    2025 , url=

    Ziran Qin and Yuchen Cao and Mingbao Lin and Wen Hu and Shixuan Fan and Ke Cheng and Weiyao Lin and Jianguo Li , booktitle=. 2025 , url=

  28. [28]

    Efficient Streaming Language Models with Attention Sinks , volume =

    Xiao, Guangxuan and Tian, Yuandong and Chen, Beidi and Han, Song and Lewis, Mike , booktitle =. Efficient Streaming Language Models with Attention Sinks , volume =

  29. [29]

    International Conference on Machine Learning , year=

    SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference , author=. International Conference on Machine Learning , year=

  30. [30]

    arXiv preprint arXiv:2408.03326 , year=

    Llava-onevision: Easy visual task transfer , author=. arXiv preprint arXiv:2408.03326 , year=

  31. [31]

    2024 , eprint=

    LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models , author=. 2024 , eprint=

  32. [32]

    Why 1 + 1

    Yangfu Li and Hongjian Zhan and Tianyi Chen and Qi Liu and Yu-Jie Xiong and Yue Lu , booktitle=. Why 1 + 1. 2025 , url=

  33. [33]

    C hart QA : A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

    Masry, Ahmed and Long, Do Xuan and Tan, Jia Qing and Joty, Shafiq and Hoque, Enamul. C hart QA : A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. Findings of the Association for Computational Linguistics: ACL 2022. 2022. doi:10.18653/v1/2022.findings-acl.177

  34. [34]

    Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=

    Docvqa: A dataset for vqa on document images , author=. Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=

  35. [35]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Towards vqa models that can read , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  36. [36]

    doi: 10.1007/ s11432-024-4235-6

    Liu, Yuliang and Li, Zhang and Huang, Mingxin and Yang, Biao and Yu, Wenwen and Li, Chunyuan and Yin, Xu-Cheng and Liu, Cheng-Lin and Jin, Lianwen and Bai, Xiang , year=. OCRBench: on the hidden mystery of OCR in large multimodal models , volume=. Science China Information Sciences , publisher=. doi:10.1007/s11432-024-4235-6 , number=

  37. [37]

    European conference on computer vision , pages=

    Mmbench: Is your multi-modal model an all-around player? , author=. European conference on computer vision , pages=. 2024 , organization=

  38. [38]

    European conference on computer vision , pages=

    Textcaps: a dataset for image captioning with reading comprehension , author=. European conference on computer vision , pages=. 2020 , organization=

  39. [39]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Vizwiz grand challenge: Answering visual questions from blind people , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  40. [40]

    2015 , eprint=

    Microsoft COCO: Common Objects in Context , author=. 2015 , eprint=

  41. [41]

    Efficient Inference of Vision Instruction-Following Models with Elastic Cache

    Liu, Zuyan and Liu, Benlin and Wang, Jiahui and Dong, Yuhao and Chen, Guangyi and Rao, Yongming and Krishna, Ranjay and Lu, Jiwen. Efficient Inference of Vision Instruction-Following Models with Elastic Cache. Computer Vision -- ECCV 2024. 2025

  42. [42]

    Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

    Towards video thinking test: A holistic benchmark for advanced video reasoning and understanding , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

  43. [43]

    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages =

    Li, Yifan and Du, Yifan and Zhou, Kun and Wang, Jinpeng and Zhao, Xin and Wen, Ji-Rong. Evaluating Object Hallucination in Large Vision-Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.20

  44. [44]

    Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) , month =

    Xing, Long and Huang, Qidong and Dong, Xiaoyi and Lu, Jiajie and Zhang, Pan and Zang, Yuhang and Cao, Yuhang and He, Conghui and Wang, Jiaqi and Wu, Feng and Lin, Dahua , title =. Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) , month =. 2025 , pages =

  45. [45]

    ArXiv , year=

    Qwen2.5-VL Technical Report , author=. ArXiv , year=

  46. [46]

    Towards Interpreting Visual Information Processing in Vision-Language Models , volume =

    Neo, Clement and Ong, Luke and Torr, Philip and Geva, Mor and Krueger, David and Barez, Fazl , booktitle =. Towards Interpreting Visual Information Processing in Vision-Language Models , volume =

  47. [47]

    The 36th Conference on Neural Information Processing Systems (NeurIPS) , year=

    Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering , author=. The 36th Conference on Neural Information Processing Systems (NeurIPS) , year=

  48. [48]

    arXiv preprint arXiv:2503.23956 , year=

    AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference , author=. arXiv preprint arXiv:2503.23956 , year=

  49. [49]

    Dao, Tri , booktitle=. Flash

  50. [50]

    Video-LLaV A: Learning United Visual Representation by Alignment Before Projection

    Lin, Bin and Ye, Yang and Zhu, Bin and Cui, Jiaxi and Ning, Munan and Jin, Peng and Yuan, Li. Video- LL a VA : Learning United Visual Representation by Alignment Before Projection. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.342