KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy

Deming Chen; Steven K. Reinhardt; Tharun Adithya Srikrishnan; Yingbing Huang

arxiv: 2605.16439 · v1 · pith:YZFSFS5Mnew · submitted 2026-05-14 · 💻 cs.CV · cs.AI

Yingbing Huang, Tharun Adithya Srikrishnan, Steven K. Reinhardt, Deming Chen

Pith reviewed 2026-05-20 19:50 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords KV cache compression · Vision-Language Models · multimodal inference · sequential compression · asymmetric redundancy · efficient decoding · frozen backbone


The pith

KVCapsule compresses the KV cache of vision-language models by 60 percent using lightweight add-on modules while keeping the backbone frozen.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that vision tokens exhibit structured and redundant attention patterns distinct from text tokens, allowing a new compression approach to reduce memory overhead in VLMs. This matters because image inputs create long token sequences that amplify the KV cache bottleneck during autoregressive decoding, and standard LLM compression methods fail to handle the spatial nature of vision data. KVCapsule adds lightweight compression and reconstruction components to existing models without retraining or changes to attention computation. Experiments across multiple VLMs and tasks show up to 2x gains in tokens per second and 2.4x memory reduction at 60 percent compression with negligible quality loss.

Core claim

KVCapsule is a KV cache compression framework for vision tokens that keeps the pretrained VLM backbone frozen, requires no modification to the attention computation modules, and integrates into existing VLMs through lightweight compression and reconstruction components. It exploits the asymmetric redundancy and structured attention patterns of vision tokens to reach a 60 percent compression ratio, yielding up to 2x improvement in tokens per second and 2.4x reduction in KV cache memory while maintaining negligible degradation in accuracy or response quality on benchmark tasks.
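To make the memory claim concrete, here is a back-of-the-envelope sketch of how KV cache size scales with token count. It is not the paper's measurement: the model dimensions and token counts are made up for illustration, the helper names are hypothetical, and it assumes that a 60 percent compression ratio means 60 percent of the vision-token entries are no longer stored. It also ignores any memory used by the compression and reconstruction modules themselves.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # The factor 2 covers one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def compressed_cache_bytes(vision_tokens, text_tokens, compression_ratio, **dims):
    """Cache size when only the vision-token entries are compressed."""
    kept_vision = round(vision_tokens * (1.0 - compression_ratio))
    return kv_cache_bytes(kept_vision + text_tokens, **dims)


# Hypothetical configuration, NOT taken from the paper.
dims = dict(num_layers=36, num_kv_heads=8, head_dim=128, bytes_per_value=2)
vision_tokens, text_tokens = 10_000, 200

full = kv_cache_bytes(vision_tokens + text_tokens, **dims)
small = compressed_cache_bytes(vision_tokens, text_tokens, 0.60, **dims)

print(f"full cache:       {full / 2**20:.1f} MiB")
print(f"compressed cache: {small / 2**20:.1f} MiB")
print(f"reduction:        {full / small:.2f}x")
```

Because text tokens are left uncompressed in this sketch, the achievable reduction depends on how much of the sequence is vision tokens; it approaches 2.5x only when vision tokens dominate.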

What carries the argument

Lightweight sequential compression and reconstruction components that target the structured attention patterns unique to vision tokens without altering the model's attention computation or backbone.

If this is right

VLMs can operate under tighter memory budgets without requiring model retraining or architecture changes.
Inference throughput increases up to twofold through reduced KV cache size during autoregressive generation.
The method applies directly to existing pretrained VLMs via simple integration of the compression modules.
Response quality across diverse multimodal tasks stays close to the uncompressed baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same principle of modality-specific redundancy could guide cache compression in other multimodal systems that mix text with audio or video tokens.
KVCapsule might combine with quantization or speculative decoding to produce additional efficiency gains in constrained environments.
Extending the approach to variable compression ratios per layer could further optimize memory use for different vision tasks.

Load-bearing premise

Vision tokens possess sufficiently structured and redundant attention patterns distinct from text that allow 60 percent compression through lightweight add-on modules without retraining or attention changes while still preserving response quality.

What would settle it

A substantial drop in accuracy or response quality on standard vision-language benchmarks such as VQA or image captioning when KVCapsule is applied at the 60 percent compression ratio would show the central claim does not hold.

Figures

Figures reproduced from arXiv: 2605.16439 by Deming Chen, Steven K. Reinhardt, Tharun Adithya Srikrishnan, Yingbing Huang.

**Figure 1.** Overview of the KVCapsule framework. The gray region denotes the original process of VLMs, while the yellow region shows the process of KVCapsule.

**Figure 3.** Vision-token attention dynamics. Attention shifts across decoding steps and layers; brighter colors indicate higher attention over vision tokens.

**Figure 5.** Efficiency analysis of fused KVCapsule. (a) Fused KVCapsule improves decoding throughput over the full-cache baseline. (b) KVCapsule reduces persistent KV memory after the break-even point at L ≈ 626 tokens. (c) For batch size 1, KVCapsule reduces per-token latency across 9K-15K inputs, with relative reductions reported in the last row.

**Figure 6.** Hidden-dimension PCA energy for visual KV states. We compute cumulative PCA energy over feature channels for visual keys and values. Values require nearly the full hidden dimension to preserve 95% variance across layers, suggesting that hidden-dimension compression is inefficient for visual values.
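The 95% variance criterion in the Figure 6 caption can be illustrated with a small sketch. It assumes the PCA eigenvalues (per-component variances) are already available; the function names and the two example spectra are hypothetical and were not measured from any model.

```python
def cumulative_energy(eigenvalues):
    """Cumulative fraction of total variance, largest components first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running, fractions = 0.0, []
    for value in ordered:
        running += value
        fractions.append(running / total)
    return fractions


def components_needed(eigenvalues, threshold=0.95):
    """Smallest number of principal components whose energy reaches threshold."""
    for count, fraction in enumerate(cumulative_energy(eigenvalues), start=1):
        if fraction >= threshold:
            return count
    return len(eigenvalues)


# Hypothetical spectra, NOT measured from any VLM.
fast_decay = [0.5 ** i for i in range(16)]  # energy concentrated in a few components
flat = [1.0] * 16                           # energy spread evenly over all components

print(components_needed(fast_decay))  # few components suffice
print(components_needed(flat))        # nearly every component is needed
```

A flat spectrum is the situation the caption describes for visual values: almost the full hidden dimension is required, so truncating channels saves little.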

**Figure 7.** Comparison of Key and Value Reconstruction Designs. This figure illustrates the effectiveness of a hybrid MLP–PCA architecture compared to baseline methods across 36 layers of a neural network. Performance is measured by the cosine similarity between the original and reconstructed vectors (higher is better).
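The reconstruction metric named in the Figure 7 caption, cosine similarity between original and reconstructed vectors, can be sketched as follows. The per-layer averaging over tokens is an assumption about how such a score might be aggregated, and all names and values here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_cosine_similarity(original, reconstructed):
    """Average per-token similarity between two lists of vectors (one layer)."""
    scores = [cosine_similarity(o, r) for o, r in zip(original, reconstructed)]
    return sum(scores) / len(scores)


# Hypothetical key vectors for two tokens in one layer.
original = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.0]]
reconstructed = [[0.9, 0.1, 2.1], [0.5, 0.4, 0.1]]

print(f"layer score: {mean_cosine_similarity(original, reconstructed):.4f}")
```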

**Figure 8.** KV Cache Reconstruction Accuracy vs. Compression Ratio. The plot illustrates the relationship between compression ratio (mask ratio) and reconstruction cosine similarity across various model layers.

Original abstract

Vision-Language Models (VLMs) have emerged as a critical and fast-growing extension of Large Language Models (LLMs) that enable multimodal reasoning through both text and image inputs. Although VLMs enrich the capabilities of language models, they also inherit and amplify key computational bottlenecks: the memory overhead caused by the large key-value (KV) cache during autoregressive decoding. This challenge is particularly severe in VLMs, where images produce longer token sequences and denser feature representations compared to text. Moreover, the spatial and information-rich nature of vision tokens introduces structured attention patterns that make many LLM-oriented KV cache compression techniques ineffective when applied directly to VLMs. In this work, we conduct a detailed empirical analysis of the behavior of vision tokens, highlighting the critical differences from purely text-based models. Based on these insights, we propose KVCapsule, a novel KV cache compression framework for vision tokens. KVCapsule keeps the pretrained VLM backbone frozen, requires no modification to the attention computation modules, and can be integrated into existing VLMs through lightweight compression and reconstruction components. We evaluate KVCapsule on multiple VLMs and benchmark tasks, demonstrating up to 2x improvement in TPS and 2.4x reduction in KV cache memory at a 60% compression ratio, with negligible degradation in accuracy or response quality. Our findings offer practical pathways to scale VLM inference under constrained memory budgets and inspire further research into structure-aware cache compression for multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

KVCapsule adds lightweight plug-in compression for vision KV caches in frozen VLMs and shows practical speed and memory gains at 60% ratio, but reconstruction robustness across layers is the main open question.


The one thing to know is that KVCapsule introduces simple add-on compression and reconstruction modules that target vision tokens specifically, leaving the VLM backbone and attention code unchanged. It reports up to 2x tokens-per-second gains and 2.4x KV cache memory reduction at 60% compression with little accuracy loss on several models and tasks. That combination of no-retraining integration and concrete numbers is the practical hook.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces KVCapsule, a KV cache compression framework targeted at vision tokens in VLMs. It begins with an empirical analysis of vision-token attention patterns and their differences from text-only models, then proposes lightweight compression and reconstruction modules that leave the pretrained backbone frozen and require no changes to attention code. The method is evaluated on multiple VLMs across standard benchmarks, reporting up to 2× TPS improvement and 2.4× KV-cache memory reduction at a 60 % compression ratio while claiming negligible degradation in accuracy or response quality.

Significance. If the central performance claims hold under closer scrutiny, the work offers a practical, training-free route to easing memory pressure during VLM autoregressive decoding. The plug-in design and preservation of the original attention implementation are engineering strengths that ease adoption. The focus on asymmetric redundancy between vision and text tokens also supplies a useful empirical foundation for future structure-aware compression research.

major comments (3)

§4 (Evaluation): the reported accuracy and quality metrics at the 60% compression ratio are presented without error bars, standard deviations, or statistical significance tests across the multiple VLMs and tasks; this omission weakens the claim that degradation is negligible and makes it difficult to judge robustness.
§3.2 (Reconstruction module): no quantitative analysis or bound is supplied on how reconstruction error in compressed vision-token KV entries propagates through unchanged cross-attention layers when those tokens later interact with newly generated text tokens; because the central claim rests on preserved response quality, this missing propagation check is load-bearing.
§2 (Empirical analysis): while vision-token redundancy is highlighted, the paper does not quantify how the observed attention patterns differ in a way that would guarantee the 60% compression ratio remains safe once reconstruction is applied; a direct comparison of attention-score cosine similarity before and after compression would strengthen the justification.
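The comparison requested in the §2 comment could be prototyped along these lines: compute the attention distribution of one query over the original keys and over the reconstructed keys, then compare the two distributions by cosine similarity. This is a minimal sketch, not the paper's evaluation protocol; the vectors are toy values, and the single-head scaled dot-product setup is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = 1.0 / math.sqrt(len(query))
    return softmax([scale * sum(q * k for q, k in zip(query, key)) for key in keys])


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Toy values, NOT taken from any model: one text query and three vision keys.
query = [1.0, 0.0, 0.5, -0.5]
original_keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
reconstructed_keys = [[0.95, 0.05, 0.0, 0.0], [0.0, 0.9, 0.1, 0.0], [0.5, 0.45, 0.55, 0.5]]

before = attention_weights(query, original_keys)
after = attention_weights(query, reconstructed_keys)
print(f"attention-score cosine similarity: {cosine_similarity(before, after):.4f}")
```

Repeating this per layer and per decoding step would show whether small key reconstruction errors stay small after the softmax, which is the propagation question raised in the §3.2 comment as well.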

minor comments (2)

Figure 2: the pipeline diagram would benefit from explicit labels indicating which components are frozen versus trainable.
§3.1 (Notation): the symbols for the compression and reconstruction functions are introduced without a compact table of definitions, which slightly reduces readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below and have prepared revisions to strengthen the paper accordingly.


Referee: §4 (Evaluation): the reported accuracy and quality metrics at the 60% compression ratio are presented without error bars, standard deviations, or statistical significance tests across the multiple VLMs and tasks; this omission weakens the claim that degradation is negligible and makes it difficult to judge robustness.

Authors: We agree with the referee that providing error bars, standard deviations, and statistical significance tests would enhance the robustness of our claims regarding negligible degradation. In the revised manuscript, we will include these statistical measures across the multiple VLMs and tasks evaluated, to better demonstrate the consistency and reliability of the results at the 60% compression ratio.
Revision: yes

Referee: §3.2 (Reconstruction module): no quantitative analysis or bound is supplied on how reconstruction error in compressed vision-token KV entries propagates through unchanged cross-attention layers when those tokens later interact with newly generated text tokens; because the central claim rests on preserved response quality, this missing propagation check is load-bearing.

Authors: We recognize that a quantitative analysis of reconstruction error propagation through the cross-attention layers is important for supporting the claim of preserved response quality. Although our comprehensive end-to-end evaluations on various benchmarks indicate that the overall performance remains largely unaffected, we will add a dedicated analysis in the revised version. This will include measuring the reconstruction error and assessing its propagation effects on attention computations involving newly generated text tokens.
Revision: yes

Referee: §2 (Empirical analysis): while vision-token redundancy is highlighted, the paper does not quantify how the observed attention patterns differ in a way that would guarantee the 60% compression ratio remains safe once reconstruction is applied; a direct comparison of attention-score cosine similarity before and after compression would strengthen the justification.

Authors: We appreciate this suggestion to strengthen the empirical justification. We will incorporate a direct comparison of attention-score cosine similarities before and after compression and reconstruction in the revised empirical analysis section. This will help quantify the differences in attention patterns and support the safety of the 60% compression ratio.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical engineering contribution without load-bearing self-referential steps


The paper presents KVCapsule as a practical, plug-in KV cache compression method for VLMs. It begins with an empirical analysis of vision-token attention patterns (distinct from text), then introduces lightweight compression and reconstruction modules that leave the pretrained backbone and attention computation unchanged. Performance claims (2x TPS, 2.4x memory reduction at 60% compression) rest on benchmark evaluations across multiple VLMs and tasks rather than any closed-form derivation. No equations, fitted parameters, or predictions reduce to inputs by construction; the central results are externally falsifiable via accuracy and quality metrics on held-out tasks. Any self-citations are incidental and non-load-bearing for the empirical claims, satisfying the criteria for a self-contained engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the approach relies on empirical observations of vision-token attention patterns and the premise that lightweight modules can be inserted without retraining; no explicit free parameters, axioms, or invented entities are described.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: KVCapsule keeps the pretrained VLM backbone frozen, requires no modification to the attention computation modules, and can be integrated into existing VLMs through lightweight compression and reconstruction components

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: asymmetric compression for keys (selective retention + MLP reconstruction) and values (sequence-level PCA)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35: 23716–23736, 2022

work page 2022

[2] [2]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga- Alonso

Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, and Mohamed S Abdelfattah. xkv: Cross-layer svd for kv-cache compression.arXiv preprint arXiv:2503.18893, 2025

work page arXiv 2025

[4] [4]

Palu: Kv- cache compression with low-rank projection

Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S Abdelfattah, and Kai-Chiang Wu. Palu: Kv- cache compression with low-rank projection. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[5] Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang Zhu, Ziyan Jiang, Bohan Lyu, Dongfu Jiang, Xuan He, Yuan Liu, Hexiang Hu, Xiang Yue, and Wenhu Chen. Mega-bench: Scaling multimodal evaluation to over 500 real-world tasks. 2025.

[6] Wen Cheng, Shichen Dong, Jiayu Qin, and Wei Wang. Qaq: Quality adaptive quantization for llm kv cache. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2542–2550, 2025.

[7] Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024.

[8] Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. Advances in Neural Information Processing Systems, 37:89098–89124, 2024.

[9] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In CVPR, 2025.

[10] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pag..., 2024.

[11] Insu Han, Zeliang Zhang, Zhiyuan Wang, Yifan Zhu, Susan Liang, Jiani Liu, Haiting Lin, Mingjie Zhao, Chenliang Xu, Kun Wan, et al. Calibquant: 1-bit kv cache quantization for multimodal llms. arXiv preprint arXiv:2502.14882, 2025.

[12] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems, 37:1270–1303, 2024.

[13] Kai Huang, Hao Zou, Bochen Wang, Ye Xi, Zhen Xie, and Hao Wang. Aircache: Activating inter-modal relevancy kv cache compression for efficient large vision-language model inference. arXiv preprint arXiv:2503.23956, 2025.

[14] Zhonghua Jiang, Kunxi Li, Yiyun Zhou, Sihao Liu, Zhaode Wang, Shengyu Zhang, et al. Purekv: Plug-and-play kv cache optimization with spatial-temporal sparse attention for vision-language large models. arXiv preprint arXiv:2510.25600, 2025.

[15] Kunxi Li, Zhonghua Jiang, Zhouzhou Shen, Zhaode Wang, Chengfei Lv, Shengyu Zhang, Fan Wu, and Fei Wu. Madakv: Adaptive modality-perception kv cache eviction for efficient multimodal long-context inference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13306–13318, 2025.

[16] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024.

[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common objects in context, 2015. URL https://arxiv.org/abs/1405.0312

[18] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.

[19] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023.

[20] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/

[21] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024.

[22] Junlin Mu, Hantao Huang, Jihang Zhang, Minghui Yu, Tao Wang, and Yidong Li. Sals: Sparse attention in latent space for kv cache compression. arXiv preprint arXiv:2510.24273, 2025.

[23] Xiaohuan Pei, Tao Huang, and Chang Xu. Cross-self kv cache pruning for efficient vision-language inference. arXiv preprint arXiv:2412.04652, 2024.

[24] Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, and Kaushik Roy. Eigen attention: Attention in low-rank space for kv cache compression. arXiv preprint arXiv:2408.05646, 2024.

[25] Zunhai Su, Wang Shen, Linge Li, Zhe Chen, Hanyu Wei, Huangqi Yu, and Kehong Yuan. Akvq-vl: Attention-aware kv cache adaptive 2-bit quantization for vision-language models. arXiv preprint arXiv:2501.15021, 2025.

[26] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 6418–6428, 2019.

[27] Dezhan Tu, Danylo Vashchilenko, Yuzhe Lu, and Panpan Xu. Vl-cache: Sparsity and modality-aware kv cache compression for vision-language model inference acceleration. arXiv preprint arXiv:2410.23317, 2024.

[28] Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, and Li Yuan. Look-m: Look-once optimization in kv cache for efficient multimodal long-context inference. arXiv preprint arXiv:2406.18139, 2024.

[29] Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, et al. Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19803–19813, 2025.

[30] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. ... 2024.

[31] Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.

[32] Rongzhi Zhang, Kuang Wang, Liyuan Liu, Shuohang Wang, Hao Cheng, Chao Zhang, and Yelong Shen. Lorc: Low-rank compression for llms kv cache with a progressive compression strategy. arXiv preprint arXiv:2410.03111, 2024.

[33] Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. MME-RealWorld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257, 2024.

[34] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.