arxiv: 2605.16439 · v1 · pith:YZFSFS5Mnew · submitted 2026-05-14 · 💻 cs.CV · cs.AI

KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy

Pith reviewed 2026-05-20 19:50 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: KV cache compression, Vision-Language Models, multimodal inference, sequential compression, asymmetric redundancy, efficient decoding, frozen backbone

The pith

KVCapsule compresses the KV cache of vision-language models by 60 percent using lightweight add-on modules while keeping the backbone frozen.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that vision tokens exhibit structured and redundant attention patterns distinct from text tokens, allowing a new compression approach to reduce memory overhead in VLMs. This matters because image inputs create long token sequences that amplify the KV cache bottleneck during autoregressive decoding, and standard LLM compression methods fail to handle the spatial nature of vision data. KVCapsule adds lightweight compression and reconstruction components to existing models without retraining or changes to attention computation. Experiments across multiple VLMs and tasks show up to 2x gains in tokens per second and 2.4x memory reduction at 60 percent compression with negligible quality loss.

Core claim

KVCapsule is a KV cache compression framework for vision tokens that keeps the pretrained VLM backbone frozen, requires no modification to the attention computation modules, and integrates into existing VLMs through lightweight compression and reconstruction components. It exploits the asymmetric redundancy and structured attention patterns of vision tokens to reach a 60 percent compression ratio, yielding up to 2x improvement in tokens per second and 2.4x reduction in KV cache memory while maintaining negligible degradation in accuracy or response quality on benchmark tasks.
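
The relationship between the compression ratio and the memory figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes that "60 percent compression" means 60 percent of vision-token KV entries are removed from the cache (the Figure 8 caption equates compression ratio with mask ratio) and that text tokens stay uncompressed. The model shape, the sequence lengths, and the function names `kv_cache_bytes` and `memory_reduction` are hypothetical, and the overhead of the compression and reconstruction modules is ignored.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys plus values for `num_tokens` tokens across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def memory_reduction(vision_tokens, text_tokens, mask_ratio, **shape):
    """Full-cache size divided by the size after dropping `mask_ratio` of the vision tokens."""
    full = kv_cache_bytes(vision_tokens + text_tokens, **shape)
    kept_vision = round(vision_tokens * (1.0 - mask_ratio))
    compressed = kv_cache_bytes(kept_vision + text_tokens, **shape)
    return full / compressed


# Hypothetical model shape and sequence lengths, chosen only for illustration.
shape = dict(num_layers=36, num_kv_heads=8, head_dim=128)
ratio = memory_reduction(vision_tokens=10_000, text_tokens=200, mask_ratio=0.6, **shape)
print(f"reduction: {ratio:.2f}x")
```

Under these assumptions the reduction is bounded above by 1 / (1 - 0.6) = 2.5x and approaches that bound as vision tokens dominate the sequence, which is at least consistent with the reported 2.4x figure.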

What carries the argument

Lightweight sequential compression and reconstruction components that target the structured attention patterns unique to vision tokens without altering the model's attention computation or backbone.

If this is right

  • VLMs can operate under tighter memory budgets without requiring model retraining or architecture changes.
  • Inference throughput increases up to twofold through reduced KV cache size during autoregressive generation.
  • The method applies directly to existing pretrained VLMs via simple integration of the compression modules.
  • Response quality across diverse multimodal tasks stays close to the uncompressed baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same principle of modality-specific redundancy could guide cache compression in other multimodal systems that mix text with audio or video tokens.
  • KVCapsule might combine with quantization or speculative decoding to produce additional efficiency gains in constrained environments.
  • Extending the approach to variable compression ratios per layer could further optimize memory use for different vision tasks.

Load-bearing premise

Vision tokens have attention patterns that are structured, redundant, and distinct enough from those of text tokens to allow 60 percent compression through lightweight add-on modules, without retraining or attention changes, while still preserving response quality.

What would settle it

A substantial drop in accuracy or response quality on standard vision-language benchmarks such as VQA or image captioning when KVCapsule is applied at the 60 percent compression ratio would show the central claim does not hold.

Figures

Figures reproduced from arXiv: 2605.16439 by Deming Chen, Steven K. Reinhardt, Tharun Adithya Srikrishnan, Yingbing Huang.

Figure 1: Overview of the KVCapsule framework. The gray region denotes the original process of VLMs, while the yellow region shows the process of KVCapsule.
Figure 3: Vision-token attention dynamics. Attention shifts across decoding steps and layers; brighter colors indicate higher attention over vision tokens.
Figure 5: Efficiency analysis of fused KVCapsule. (a) Fused KVCapsule improves decoding throughput over the full-cache baseline. (b) KVCapsule reduces persistent KV memory after the break-even point at L ≈ 626 tokens. (c) For batch size 1, KVCapsule reduces per-token latency across 9K-15K inputs, with relative reductions reported in the last row.
Figure 6: Hidden-dimension PCA energy for visual KV states. We compute cumulative PCA energy over feature channels for visual keys and values. Values require nearly the full hidden dimension to preserve 95% variance across layers, suggesting that hidden-dimension compression is inefficient for visual values.
Figure 7: Comparison of Key and Value Reconstruction Designs. This figure illustrates the effectiveness of a hybrid MLP–PCA architecture compared to baseline methods across 36 layers of a neural network. Performance is measured by the cosine similarity between the original and reconstructed vectors (higher is better).
Figure 8: KV Cache Reconstruction Accuracy vs. Compression Ratio. The plot illustrates the relationship between compression ratio (mask ratio) and reconstruction cosine similarity across various model layers.
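
The cumulative PCA energy described in the Figure 6 caption can be illustrated with a short NumPy sketch. It runs on synthetic matrices, not on real KV states, and the helper names `cumulative_pca_energy` and `components_for` are illustrative rather than taken from the paper. It contrasts a nearly low-rank matrix with an isotropic one to show what "requires nearly the full hidden dimension to preserve 95% variance" looks like numerically.

```python
import numpy as np


def cumulative_pca_energy(states):
    """Cumulative fraction of variance captured by the leading principal components.

    `states` has shape (num_tokens, hidden_dim); PCA runs over the feature channels.
    """
    centered = states - states.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional to the PCA variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    energy = singular_values ** 2
    return np.cumsum(energy) / energy.sum()


def components_for(states, threshold=0.95):
    """Smallest number of components whose cumulative energy reaches `threshold`."""
    cumulative = cumulative_pca_energy(states)
    return int(np.searchsorted(cumulative, threshold) + 1)


rng = np.random.default_rng(0)
tokens, hidden = 512, 64
# Synthetic stand-ins (an assumption for illustration): rank-8 structure plus small noise,
# versus isotropic Gaussian noise with no low-rank structure.
low_rank = rng.normal(size=(tokens, 8)) @ rng.normal(size=(8, hidden))
low_rank += 0.01 * rng.normal(size=(tokens, hidden))
isotropic = rng.normal(size=(tokens, hidden))

print("low-rank:", components_for(low_rank), "of", hidden)
print("isotropic:", components_for(isotropic), "of", hidden)
```

A matrix that behaves like the isotropic case leaves little room for truncating the hidden dimension, which is the situation the caption reports for visual values.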
Original abstract

Vision-Language Models (VLMs) have emerged as a critical and fast-growing extension of Large Language Models (LLMs) that enable multimodal reasoning through both text and image inputs. Although VLMs enrich the capabilities of language models, they also inherit and amplify key computational bottlenecks: the memory overhead caused by the large key-value (KV) cache during autoregressive decoding. This challenge is particularly severe in VLMs, where images produce longer token sequences and denser feature representations compared to text. Moreover, the spatial and information-rich nature of vision tokens introduces structured attention patterns that make many LLM-oriented KV cache compression techniques ineffective when applied directly to VLMs. In this work, we conduct a detailed empirical analysis of the behavior of vision tokens, highlighting the critical differences from purely text-based models. Based on these insights, we propose KVCapsule, a novel KV cache compression framework for vision tokens. KVCapsule keeps the pretrained VLM backbone frozen, requires no modification to the attention computation modules, and can be integrated into existing VLMs through lightweight compression and reconstruction components. We evaluate KVCapsule on multiple VLMs and benchmark tasks, demonstrating up to 2x improvement in TPS and 2.4x reduction in KV cache memory at a 60% compression ratio, with negligible degradation in accuracy or response quality. Our findings offer practical pathways to scale VLM inference under constrained memory budgets and inspire further research into structure-aware cache compression for multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces KVCapsule, a KV cache compression framework targeted at vision tokens in VLMs. It begins with an empirical analysis of vision-token attention patterns and their differences from text-only models, then proposes lightweight compression and reconstruction modules that leave the pretrained backbone frozen and require no changes to attention code. The method is evaluated on multiple VLMs across standard benchmarks, reporting up to 2x TPS improvement and 2.4x KV-cache memory reduction at a 60% compression ratio while claiming negligible degradation in accuracy or response quality.

Significance. If the central performance claims hold under closer scrutiny, the work offers a practical, training-free route to easing memory pressure during VLM autoregressive decoding. The plug-in design and preservation of the original attention implementation are engineering strengths that ease adoption. The focus on asymmetric redundancy between vision and text tokens also supplies a useful empirical foundation for future structure-aware compression research.

major comments (3)
  1. §4 (Evaluation): the reported accuracy and quality metrics at the 60% compression ratio are presented without error bars, standard deviations, or statistical significance tests across the multiple VLMs and tasks; this omission weakens the claim that degradation is negligible and makes it difficult to judge robustness.
  2. §3.2 (Reconstruction module): no quantitative analysis or bound is supplied on how reconstruction error in compressed vision-token KV entries propagates through unchanged cross-attention layers when those tokens later interact with newly generated text tokens; because the central claim rests on preserved response quality, this missing propagation check is load-bearing.
  3. §2 (Empirical analysis): while vision-token redundancy is highlighted, the paper does not quantify how the observed attention patterns differ in a way that would guarantee the 60% compression ratio remains safe once reconstruction is applied; a direct comparison of attention-score cosine similarity before and after compression would strengthen the justification.
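
The comparison requested in the third major comment could be set up roughly as follows. This is a minimal sketch under stated assumptions: it uses single-head scaled dot-product attention on random tensors, and it stands in for the compress-then-reconstruct round trip with additive Gaussian noise on the keys, which is NOT the paper's reconstruction module. The names `attention_weights` and `attention_similarity` are illustrative.

```python
import numpy as np


def attention_weights(query, keys):
    """Softmax attention of one query over cached keys; shapes (d,) and (n, d)."""
    scores = keys @ query / np.sqrt(keys.shape[1])
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def attention_similarity(query, keys, reconstructed_keys):
    """Cosine similarity between attention distributions before and after reconstruction."""
    return cosine(attention_weights(query, keys),
                  attention_weights(query, reconstructed_keys))


rng = np.random.default_rng(0)
keys = rng.normal(size=(256, 64))
query = rng.normal(size=64)
# Additive noise is a placeholder for reconstruction error (an assumption for illustration).
for noise in (0.0, 0.1, 0.5):
    reconstructed = keys + noise * rng.normal(size=keys.shape)
    print(f"noise={noise}: similarity={attention_similarity(query, keys, reconstructed):.4f}")
```

In a real check, `reconstructed_keys` would come from the actual compression and reconstruction modules, and the similarity would be reported per layer and per decoding step.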
minor comments (2)
  1. Figure 2: the pipeline diagram would benefit from explicit labels indicating which components are frozen versus trainable.
  2. §3.1 (Notation): the symbols for the compression and reconstruction functions are introduced without a compact table of definitions, which slightly reduces readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below and have prepared revisions to strengthen the paper accordingly.

Point-by-point responses
  1. Referee: §4 (Evaluation): the reported accuracy and quality metrics at the 60% compression ratio are presented without error bars, standard deviations, or statistical significance tests across the multiple VLMs and tasks; this omission weakens the claim that degradation is negligible and makes it difficult to judge robustness.

    Authors: We agree with the referee that providing error bars, standard deviations, and statistical significance tests would enhance the robustness of our claims regarding negligible degradation. In the revised manuscript, we will include these statistical measures across the multiple VLMs and tasks evaluated, to better demonstrate the consistency and reliability of the results at the 60% compression ratio.
    Revision: yes

  2. Referee: §3.2 (Reconstruction module): no quantitative analysis or bound is supplied on how reconstruction error in compressed vision-token KV entries propagates through unchanged cross-attention layers when those tokens later interact with newly generated text tokens; because the central claim rests on preserved response quality, this missing propagation check is load-bearing.

    Authors: We recognize that a quantitative analysis of reconstruction error propagation through the cross-attention layers is important for supporting the claim of preserved response quality. Although our comprehensive end-to-end evaluations on various benchmarks indicate that the overall performance remains largely unaffected, we will add a dedicated analysis in the revised version. This will include measuring the reconstruction error and assessing its propagation effects on attention computations involving newly generated text tokens.
    Revision: yes

  3. Referee: §2 (Empirical analysis): while vision-token redundancy is highlighted, the paper does not quantify how the observed attention patterns differ in a way that would guarantee the 60% compression ratio remains safe once reconstruction is applied; a direct comparison of attention-score cosine similarity before and after compression would strengthen the justification.

    Authors: We appreciate this suggestion to strengthen the empirical justification. We will incorporate a direct comparison of attention-score cosine similarities before and after compression and reconstruction in the revised empirical analysis section. This will help quantify the differences in attention patterns and support the safety of the 60% compression ratio.
    Revision: yes
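
One way the error bars promised in the first response could be produced is a paired percentile bootstrap over per-example correctness. The sketch below is an assumption about how such a check might look, not the authors' procedure; the data are synthetic and the function name `paired_bootstrap_ci` is illustrative.

```python
import numpy as np


def paired_bootstrap_ci(baseline_correct, compressed_correct,
                        num_resamples=2000, alpha=0.05, seed=0):
    """Mean accuracy change (compressed minus baseline) with a percentile bootstrap interval.

    Inputs are 0/1 sequences scored on the same evaluation examples, in the same order.
    """
    baseline = np.asarray(baseline_correct, dtype=float)
    compressed = np.asarray(compressed_correct, dtype=float)
    if baseline.shape != compressed.shape:
        raise ValueError("both runs must score the same examples")
    diffs = compressed - baseline
    rng = np.random.default_rng(seed)
    # Resample example indices with replacement, keeping each baseline/compressed pair together.
    idx = rng.integers(0, diffs.size, size=(num_resamples, diffs.size))
    resampled_means = diffs[idx].mean(axis=1)
    low, high = np.quantile(resampled_means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(low), float(high)


# Synthetic per-example correctness, NOT results from the paper.
rng = np.random.default_rng(1)
baseline = rng.random(1000) < 0.70
flipped = rng.random(1000) < 0.02  # flip about 2% of outcomes to mimic compression effects
compressed = np.where(flipped, ~baseline, baseline)
mean_diff, low, high = paired_bootstrap_ci(baseline, compressed)
print(f"accuracy change: {mean_diff:+.3f} (95% CI {low:+.3f} to {high:+.3f})")
```

An interval that excludes zero would indicate a detectable accuracy change at the chosen level, while a narrow interval containing zero gives a bound on how large any degradation could plausibly be. Pairing matters here because both runs answer the same questions, so most per-example noise cancels.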

Circularity Check

0 steps flagged

No significant circularity: empirical engineering contribution without load-bearing self-referential steps

The paper presents KVCapsule as a practical, plug-in KV cache compression method for VLMs. It begins with an empirical analysis of vision-token attention patterns (distinct from text), then introduces lightweight compression and reconstruction modules that leave the pretrained backbone and attention computation unchanged. Performance claims (2x TPS, 2.4x memory reduction at 60% compression) rest on benchmark evaluations across multiple VLMs and tasks rather than any closed-form derivation. No equations, fitted parameters, or predictions reduce to inputs by construction; the central results are externally falsifiable via accuracy and quality metrics on held-out tasks. Any self-citations are incidental and non-load-bearing for the empirical claims, satisfying the criteria for a self-contained engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the approach relies on empirical observations of vision-token attention patterns and the premise that lightweight modules can be inserted without retraining; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5814 in / 1139 out tokens · 28721 ms · 2026-05-20T19:50:56.255859+00:00 · methodology
