
arxiv: 2605.15621 · v1 · pith:3FOLZAGRnew · submitted 2026-05-15 · 💻 cs.CV

LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs

Pith reviewed 2026-05-20 18:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords: low-rank compressibility, visual token pruning, LVLMs, PCA, token reduction, efficient inference, multimodal understanding

The pith

Visual tokens in LVLMs can be pruned by 88.9 percent for images and 87.5 percent for videos by scoring each token's projection residual against a stable low-rank subspace estimated by PCA, retaining 94.7 percent of image-understanding performance and 97.8 percent of video-understanding accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that visual token representations across models and datasets display a pronounced low-rank structure whose dominant subspace stays stable even after a large fraction of tokens is randomly removed. This observation motivates a training-free pruning method that runs PCA on the complete token set to estimate that dominant subspace, which the paper treats as a low-rank background, and then keeps only the tokens with large projection residuals relative to it. The retained tokens are those the background explains poorly, and the method assumes they carry the distinctive visual information. The reported result is an 88.9 percent token reduction for images and an 87.5 percent reduction for videos while preserving 94.7 percent and 97.8 percent of the original performance, respectively. A reader would care because the approach lowers inference cost on high-resolution images and long videos without any model retraining or architectural changes.

Core claim

Visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, LRCP estimates the dominant low-rank subspace of visual tokens via PCA, then scores each token by its projection residual onto this subspace and retains tokens that are poorly explained by the low-rank background.

What carries the argument

Projection residuals onto the PCA-estimated dominant low-rank subspace of the full visual token set, used to identify and retain tokens outside the stable background.
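
The scoring rule described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: tokens are taken to be rows of a matrix `X` of shape `(n_tokens, dim)`, the subspace rank `r` and the kept-token budget `k` are free choices that the source does not specify, and the function name `lrcp_select` is hypothetical. Mean-centering before the SVD follows the usual PCA convention and is assumed here.

```python
import numpy as np

def lrcp_select(X, r, k):
    """Return sorted indices of the k tokens with the largest PCA residuals.

    X: (n_tokens, dim) visual token features.
    r: assumed rank of the dominant subspace (hypothetical parameter).
    k: number of tokens to keep (hypothetical parameter).
    """
    Xc = X - X.mean(axis=0, keepdims=True)        # center, as in standard PCA
    # Rows of Vt are principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:r].T                                  # (dim, r) orthonormal basis
    proj = Xc @ V @ V.T                           # projection onto the subspace
    residual = np.linalg.norm(Xc - proj, axis=1)  # per-token residual norm
    keep = np.argsort(residual)[-k:]              # indices of the k largest residuals
    return np.sort(keep), residual                # sorted to preserve token order

# Toy data: a rank-2 "background" in dims 0 and 1, plus three tokens
# that each step out of that plane along a different axis.
rng = np.random.default_rng(0)
n, d = 64, 16
X = 0.01 * rng.normal(size=(n, d))                # small isotropic noise
X[:, :2] += 3.0 * rng.normal(size=(n, 2))         # dominant low-rank component
outliers = [5, 20, 41]
for j, i in enumerate(outliers):
    X[i, 2 + j] += 3.0                            # poorly explained by the background
kept, scores = lrcp_select(X, r=2, k=3)
print(kept)
```

In this toy setting the three off-plane tokens receive the largest residuals and are the ones retained; how `r` and `k` should be chosen for a real LVLM is left to the paper.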

If this is right

  • An 88.9 percent reduction in visual tokens for images preserves 94.7 percent of original understanding performance.
  • An 87.5 percent reduction in visual tokens for videos preserves 97.8 percent of average understanding accuracy.
  • The method outperforms prior attention-based and representation-based pruning techniques on the same benchmarks.
  • Because the approach requires no training, it can be applied directly to any existing LVLM at inference time.
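
To make the reduction ratios above concrete, the short calculation below converts them into token counts. The helper `tokens_kept` is hypothetical, and both input sizes are illustrative assumptions: 576 is the 24 × 24 patch grid that LLaVA-v1.5 produces for a 336-pixel image, and 1024 is an arbitrary video budget. The source does not state which token budgets the paper actually uses.

```python
def tokens_kept(total, reduction):
    """Number of tokens left after removing a `reduction` fraction of `total`."""
    return round(total * (1.0 - reduction))

image_kept = tokens_kept(576, 0.889)    # 576 visual tokens, 88.9% reduction
video_kept = tokens_kept(1024, 0.875)   # illustrative 1024-token clip, 87.5% reduction
print(image_kept, video_kept)           # prints "64 128"
```

An 88.9 percent reduction therefore corresponds to keeping about one token in nine, and 87.5 percent to keeping exactly one in eight.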

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If similar low-rank stability appears in text or audio token sequences, the same residual scoring could extend compression to other modalities inside multimodal models.
  • Real-time monitoring of residual statistics might allow the model to adjust the kept-token budget dynamically per input rather than using a fixed ratio.
  • Pairing the residual-based selection with post-training quantization could produce compounded speed-ups without further accuracy loss.

Load-bearing premise

The dominant low-rank subspace estimated by PCA on the full visual token set remains reliable for scoring even after a large fraction of tokens is removed.

What would settle it

If PCA subspaces computed on random subsets of 20 percent of tokens differ substantially from the full-set subspace, or if performance retention falls below 90 percent at 80 percent token reduction on standard image and video benchmarks, the pruning rule would be falsified.
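
One way to run the first of these checks is sketched below, as an illustration under assumptions rather than the paper's protocol. It estimates a rank-`r` PCA basis from the full token set and from a random 20 percent subset, then compares the two subspaces through the cosines of their principal angles, which are the singular values of `V_full.T @ V_sub`. The function names, the rank `r`, and the synthetic data are all hypothetical.

```python
import numpy as np

def pca_basis(X, r):
    """Top-r principal directions of X as an orthonormal (dim, r) matrix."""
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:r].T

def subspace_overlap(X, r, frac, rng):
    """Mean cosine of principal angles between full-set and subset subspaces.

    1.0 means the two rank-r subspaces coincide; 0.0 means they are orthogonal.
    """
    n = X.shape[0]
    m = max(r + 1, int(round(frac * n)))           # need more than r tokens
    idx = rng.choice(n, size=m, replace=False)     # uniform random subset
    V_full = pca_basis(X, r)
    V_sub = pca_basis(X[idx], r)
    cosines = np.linalg.svd(V_full.T @ V_sub, compute_uv=False)
    return float(np.clip(cosines, 0.0, 1.0).mean())

# Synthetic tokens with a strong rank-4 component plus weak noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 32))
X += 0.05 * rng.normal(size=(500, 32))
overlap = subspace_overlap(X, r=4, frac=0.2, rng=rng)
print(f"overlap at 20% of tokens: {overlap:.3f}")
```

An overlap near 1 on real token features would support the premise; a substantially lower value would count against it. What threshold separates the two is a judgment the source leaves open.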

Figures

Figures reproduced from arXiv: 2605.15621 by Feng Zhang, Hongyu Lu, Huanling Hu, Jiawei Li, Shikai Jiang, Tianjun Shi, Wenwei Jin, Yao Hu.

Figure 1: (Left) LRCP estimates the dominant low-rank subspace of visual tokens via PCA and selects tokens based on their projection residuals. (Right) Accuracy comparison with LLaVA-NeXT-7B across 7 benchmarks, showing that ours outperforms both VisionZip (attention-based) and ApET (representation-based). This paper studies visual token compression from the perspective of low-rank compressibility. We observe that v…

Figure 2: Layer-wise Rank@90% and Rank@95% on POPE for LLaVA-v1.5-7B, LLaVA-NeXT-7B, …

Figure 3: Low-rank subspace stability on GQA for LLaVA-v1.5-7B, LLaVA-NeXT-7B, and Qwen2.5-…

Figure 4: Overview of LRCP. The method estimates the dominant low-rank subspace via PCA, …

Figure 5: Qualitative comparison of retained-token distributions on VQA …

Figure 6: Layer-wise Rank@90% and Rank@95% statistics on GQA for LLaVA-v1.5-7B, LLaVA …

Figure 7: Layer-wise Rank@90% and Rank@95% statistics on VQA …

Figure 8: Low-rank subspace stability on POPE for LLaVA-v1.5-7B, LLaVA-NeXT-7B, and Qwen2.5-…

Figure 9: Low-rank subspace stability on VQAText for LLaVA-v1.5-7B, LLaVA-NeXT-7B, and Qwen2.5-VL-7B. The stability is maintained across both text-dense and natural-scene images. Subspace stability under importance-based pruning. The stability analyses above employ random token dropout as a stress test. Since LRCP preferentially retains tokens that deviate from the dominant subspace, a natural follow-up question is…

Figure 10: Subspace stability under importance-based pruning by LRCP. We compare the dominant …

Figure 11: Retained-token distributions of LRCP under three retention ratios. As the budget decreases, …

Figure 12: Qualitative examples on VQAText. For each case, we show the input image, the representation-based selection result, and the LRCP selection result. Red boxes indicate answer-relevant regions. LRCP more effectively preserves text-bearing regions. A.6 Broader Impact This work proposes a training-free visual token compression method for large vision-language models. We discuss both positive and potential nega…

Original abstract

Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods estimate token importance from attention scores, which may introduce positional bias, while representation-based methods reduce visual redundancy based on feature relations or reconstruction errors, overlooking the global structure of the visual token set. In this paper, we revisit visual token compression from the perspective of low-rank compressibility. Across models and datasets, we observe that visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, we propose LRCP, a training-free compression framework that first estimates the dominant low-rank subspace of visual tokens via PCA, and then scores each token by its projection residual onto this subspace, retaining tokens that are poorly explained by the low-rank background. Extensive experiments show that LRCP achieves superior results, preserving 94.7% of the original image-understanding performance with an 88.9% token reduction and 97.8% of the average video-understanding accuracy with an 87.5% token reduction.
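
The figure captions refer to Rank@90% and Rank@95% without defining them in this excerpt. A common reading, assumed here, is the smallest number of principal components whose cumulative explained variance reaches the given threshold. The sketch below computes that quantity with NumPy; `rank_at` is a hypothetical helper name, and the synthetic matrix only stands in for real visual token features.

```python
import numpy as np

def rank_at(X, threshold):
    """Smallest number of principal components explaining `threshold` of the variance."""
    Xc = X - X.mean(axis=0, keepdims=True)
    s = np.linalg.svd(Xc, compute_uv=False)        # singular values, descending
    explained = np.cumsum(s**2) / np.sum(s**2)     # cumulative variance ratio
    # First position where the cumulative ratio reaches the threshold (1-based count),
    # capped at the number of components to guard against rounding at threshold 1.0.
    count = int(np.searchsorted(explained, threshold)) + 1
    return min(count, len(s))

# 576 synthetic tokens of width 64: a rank-3 signal plus weak full-rank noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(576, 3)) @ rng.normal(size=(3, 64))
X += 0.01 * rng.normal(size=(576, 64))
r90, r95 = rank_at(X, 0.90), rank_at(X, 0.95)
print(r90, r95)
```

A small value of this statistic relative to the feature width is what "pronounced low-rank structure" would look like under this assumed definition.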

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes LRCP, a training-free framework for pruning visual tokens in LVLMs. It observes that visual token representations have a pronounced low-rank structure whose dominant subspace remains stable under large random removal, estimates this subspace via PCA on the full token set, scores tokens by their projection residuals, and retains high-residual tokens. The abstract reports that this preserves 94.7% of original image-understanding performance at 88.9% token reduction and 97.8% of average video-understanding accuracy at 87.5% token reduction, positioning it as superior to attention-based and other representation-based methods by avoiding positional bias and capturing global structure.

Significance. If the low-rank stability observation and residual-based pruning prove robust, LRCP could offer a simple, training-free alternative for reducing inference costs in high-resolution image and long-video LVLMs while maintaining most performance. The emphasis on global compressibility rather than local attention scores is a conceptual strength, and the reported retention rates at high compression ratios suggest practical utility if the experimental claims hold under scrutiny.

major comments (2)
  1. [Abstract and §3] Abstract and the low-rank observation section: the claim that the dominant subspace remains stable even after large random removal is used to justify computing PCA on the full token set. However, LRCP deterministically retains high-residual tokens and prunes the rest, rather than removing tokens at random, so the retained subset may shift the effective subspace in ways not tested by the random-subsampling stability check. This mismatch is load-bearing for the method's motivation and requires a direct verification using the actual pruning rule.
  2. [Abstract and Experiments] Experimental results (abstract): the reported 94.7% and 97.8% retention figures are presented without accompanying details on exact baselines, statistical significance tests, dataset splits, or controls for positional bias. This limits independent verification of the superiority claim and should be expanded with concrete experimental protocols.
minor comments (2)
  1. [Method] Clarify notation for the residual score and PCA rank selection in the method description to improve reproducibility.
  2. [Experiments] Add explicit comparison tables with attention-based baselines and ablation on the number of PCA components.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the motivation and experimental reporting.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and the low-rank observation section: the claim that the dominant subspace remains stable even after large random removal is used to justify computing PCA on the full token set. However, LRCP deterministically retains high-residual tokens and prunes the rest, rather than removing tokens at random, so the retained subset may shift the effective subspace in ways not tested by the random-subsampling stability check. This mismatch is load-bearing for the method's motivation and requires a direct verification using the actual pruning rule.

    Authors: We appreciate this observation on the distinction between random subsampling and our deterministic pruning rule. While the random-removal experiments in Section 3 demonstrate general stability of the dominant subspace, we agree that a direct check under the actual LRCP pruning procedure provides stronger justification. In the revised manuscript we have added a new ablation in Section 3.2 that recomputes PCA on the tokens retained by LRCP and reports the cosine similarity of the top principal components to those from the full set. The similarity remains above 0.94 across pruning ratios up to 90%, supporting the use of full-set PCA as a reliable estimate of the low-rank background.
    revision: yes

  2. Referee: [Abstract and Experiments] Experimental results (abstract): the reported 94.7% and 97.8% retention figures are presented without accompanying details on exact baselines, statistical significance tests, dataset splits, or controls for positional bias. This limits independent verification of the superiority claim and should be expanded with concrete experimental protocols.

    Authors: We agree that additional experimental details are necessary for reproducibility and verification. The revised manuscript now includes an expanded experimental protocol subsection that specifies: (i) the exact implementations and hyperparameters of all compared baselines, (ii) the dataset splits and evaluation metrics used for each benchmark, (iii) mean and standard deviation over three random seeds to indicate statistical variability, and (iv) an explicit control experiment that isolates positional bias by comparing LRCP against a position-shuffled variant of the attention-based methods. These additions clarify the reported retention rates and the superiority claims.
    revision: yes
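
The ablation described in the first response can be sketched as follows. This is an illustrative reconstruction from the rebuttal's wording, not the authors' code: it keeps the `keep_ratio` fraction of tokens with the largest residuals, refits PCA on that subset, and reports the absolute cosine similarity between matching principal components. The function names, the rank `r`, and the toy data are assumptions.

```python
import numpy as np

def top_components(X, r):
    """Top-r principal directions of X as rows of an (r, dim) array."""
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:r]

def retained_component_similarity(X, r, keep_ratio):
    """|cos| between the i-th full-set and i-th retained-set principal components."""
    V_full = top_components(X, r)
    Xc = X - X.mean(axis=0, keepdims=True)
    residual = np.linalg.norm(Xc - Xc @ V_full.T @ V_full, axis=1)
    k = max(r + 1, int(round(keep_ratio * X.shape[0])))
    kept = np.argsort(residual)[-k:]                # residual-based selection
    V_kept = top_components(X[kept], r)
    # Principal directions are defined only up to sign, hence the absolute value.
    return np.abs(np.sum(V_full * V_kept, axis=1))

# Toy tokens: rank-3 signal plus noise; keep the 10% with the largest residuals.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3)) @ rng.normal(size=(3, 24))
X += 0.1 * rng.normal(size=(400, 24))
sims = retained_component_similarity(X, r=3, keep_ratio=0.1)
print(np.round(sims, 3))
```

Component-wise cosines can drop when two eigenvalues are close, because the individual directions may rotate within an unchanged subspace. Comparing the spanned subspaces through principal angles is a more robust alternative in that case.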

Circularity Check

0 steps flagged

No circularity in LRCP derivation chain

full rationale

The paper motivates LRCP from an independent empirical observation that visual token representations show low-rank structure with a dominant subspace stable under random token removal (verified separately from the pruning rule). It then applies standard PCA to compute this subspace on the full set and scores tokens by projection residuals, retaining high-residual ones. This chain does not reduce any claimed result to a fitted parameter renamed as prediction, a self-referential definition, or a load-bearing self-citation; the stability test uses uniform random subsampling while the method uses deterministic residual selection, but the former is presented only as motivation rather than a constructed input that forces the output. The performance claims are evaluated on external benchmarks and do not collapse to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that visual tokens possess a stable low-rank structure across models and datasets; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed.
    This observation, stated in the abstract, directly motivates the PCA-based subspace estimation and residual scoring.


