LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs

Feng Zhang; Hongyu Lu; Huanling Hu; Jiawei Li; Shikai Jiang; Tianjun Shi; Wenwei Jin; Yao Hu

arxiv: 2605.15621 · v1 · pith:3FOLZAGRnew · submitted 2026-05-15 · 💻 cs.CV

Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Tianjun Shi, Shikai Jiang, Yao Hu, Jiawei Li

Pith reviewed 2026-05-20 18:59 UTC · model grok-4.3

classification 💻 cs.CV

keywords: low-rank compressibility, visual token pruning, LVLMs, PCA, token reduction, efficient inference, multimodal understanding

The pith

Visual tokens in LVLMs can be pruned by 88.9 percent (images) and 87.5 percent (videos) by scoring each token's projection residual against a stable low-rank subspace estimated by PCA, while retaining 94.7 percent of image-understanding performance and 97.8 percent of average video-understanding accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that visual token representations across models and datasets display a pronounced low-rank structure whose dominant subspace stays stable even after a large fraction of tokens is randomly removed. This observation motivates a training-free pruning method that runs PCA on the complete token set to identify the background subspace, then keeps only the tokens with large projection residuals relative to that subspace. The retained tokens are those poorly explained by the low-rank background and are therefore taken to carry the distinctive visual information. The result is an 88.9 percent token cut for images and 87.5 percent for videos, while preserving 94.7 percent of the original image-understanding performance and 97.8 percent of average video-understanding accuracy. A reader would care because the approach lowers inference cost on high-resolution images and long videos without any model retraining or architectural changes.
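The "pronounced low-rank structure" can be made concrete with a small measurement. The sketch below is a hedged illustration, not the paper's code: it assumes that a metric like Rank@90% means the number of principal components needed to explain 90 percent of the variance, and it uses synthetic data with hypothetical sizes (576 tokens, 64 dimensions) in place of real LVLM features. The function name `rank_at` is made up for this example.

```python
import numpy as np


def rank_at(tokens: np.ndarray, energy: float = 0.90) -> int:
    """Number of principal components needed to reach `energy` of the variance.

    Assumed reading of "Rank@90%"; the paper's exact definition may differ.
    tokens: (N, D) array, one row per visual token.
    """
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to per-component variance.
    variance = np.linalg.svd(centered, compute_uv=False) ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    # First index whose cumulative share reaches the target, as a count.
    count = int(np.searchsorted(cumulative, energy)) + 1
    return min(count, variance.size)


# Synthetic stand-in for real token features (hypothetical sizes).
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((576, 4)) @ rng.standard_normal((4, 64))
tokens = low_rank + 0.01 * rng.standard_normal((576, 64))
print(rank_at(tokens, 0.90), rank_at(tokens, 0.95))
```

On real features the interesting quantity is how small this count is relative to the feature dimension; the synthetic data here is rank 4 by construction, so it only demonstrates the measurement.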

Core claim

Visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, LRCP estimates the dominant low-rank subspace of visual tokens via PCA, then scores each token by its projection residual onto this subspace and retains tokens that are poorly explained by the low-rank background.

What carries the argument

Projection residuals onto the PCA-estimated dominant low-rank subspace of the full visual token set, used to identify and retain tokens outside the stable background.
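A minimal sketch of residual-based selection may help fix ideas. It is an illustration under stated assumptions rather than the authors' implementation: tokens are mean-centered, the subspace is the span of the top `rank` right singular vectors, the score is the L2 norm of the residual, and a fixed top-`keep` budget is used. The names `select_by_residual`, `rank`, and `keep` are hypothetical, and the data is synthetic.

```python
import numpy as np


def select_by_residual(tokens: np.ndarray, rank: int, keep: int):
    """Keep the `keep` tokens least explained by a rank-`rank` PCA subspace.

    tokens: (N, D) array. Returns (sorted kept indices, per-token residuals).
    Centering, the L2 residual, and the fixed budget are assumptions.
    """
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]                          # (rank, D), orthonormal rows
    explained = (centered @ basis.T) @ basis   # projection onto the subspace
    residual = np.linalg.norm(centered - explained, axis=1)
    # Indices of the `keep` largest residuals, restored to original order.
    kept = np.argpartition(residual, -keep)[-keep:]
    return np.sort(kept), residual


# Synthetic demo: a rank-4 "background" plus a few off-subspace tokens.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((576, 4)) @ rng.standard_normal((4, 64))
tokens += 0.01 * rng.standard_normal((576, 64))
distinct = np.array([5, 100, 200, 300, 400])
tokens[distinct] += 5.0 * rng.standard_normal((5, 64))

kept, residual = select_by_residual(tokens, rank=4, keep=64)
print(np.isin(distinct, kept))
```

Sorting the kept indices preserves the original token order, which matters if downstream layers rely on positional information; whether the paper does the same is not stated here.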

If this is right

An 88.9 percent reduction in visual tokens for images preserves 94.7 percent of original understanding performance.
An 87.5 percent reduction in visual tokens for videos preserves 97.8 percent of average understanding accuracy.
The method outperforms prior attention-based and representation-based pruning techniques on the same benchmarks.
Because the approach requires no training, it can be applied directly to any existing LVLM at inference time.
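To make the reduction ratios tangible, the short calculation below converts a reduction percentage into a kept-token count. The total token counts (576 and 1024) are illustrative assumptions, not figures taken from the text, and `kept_tokens` is a hypothetical helper.

```python
def kept_tokens(total: int, reduction: float) -> int:
    """Tokens remaining after removing a `reduction` fraction of `total`."""
    return round(total * (1.0 - reduction))


# Illustrative totals only; the text does not state per-input token counts.
image_kept = kept_tokens(576, 0.889)    # 88.9% reduction
video_kept = kept_tokens(1024, 0.875)   # 87.5% reduction keeps one in eight
print(image_kept, video_kept)
```

An 87.5 percent reduction corresponds to keeping exactly one token in eight, and an 88.9 percent reduction to keeping roughly one in nine.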

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If similar low-rank stability appears in text or audio token sequences, the same residual scoring could extend compression to other modalities inside multimodal models.
Real-time monitoring of residual statistics might allow the model to adjust the kept-token budget dynamically per input rather than using a fixed ratio.
Pairing the residual-based selection with post-training quantization could produce compounded speed-ups without further accuracy loss.
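The dynamic-budget idea above could be prototyped along the following lines. This is purely a hypothetical sketch of the editorial extension, not something the paper proposes: it picks the smallest number of tokens whose squared residuals account for a chosen share of the total residual energy. The function name and the `energy` and `min_keep` parameters are invented for illustration.

```python
import numpy as np


def adaptive_keep_count(residual, energy: float = 0.9, min_keep: int = 1) -> int:
    """Smallest k such that the k largest squared residuals hold `energy`
    of the total squared residual (hypothetical per-input budget rule)."""
    squared = np.sort(np.asarray(residual, dtype=float) ** 2)[::-1]
    total = squared.sum()
    if total == 0.0:
        # Every token lies in the subspace; fall back to the minimum budget.
        return min_keep
    cumulative = np.cumsum(squared) / total
    k = int(np.searchsorted(cumulative, energy)) + 1
    return max(min_keep, min(k, squared.size))


print(adaptive_keep_count([10.0, 1.0, 1.0, 1.0]))   # one dominant token
print(adaptive_keep_count(np.ones(10), energy=0.85))  # evenly spread residuals
```

Inputs with a few strongly distinctive tokens would receive a small budget, while inputs whose residual energy is spread evenly would receive a larger one.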

Load-bearing premise

The dominant low-rank subspace estimated by PCA on the full visual token set remains reliable for scoring even after a large fraction of tokens is removed.

What would settle it

If PCA subspaces computed on random subsets of 20 percent of tokens differ substantially from the full-set subspace, or if performance retention falls below 90 percent at 80 percent token reduction on standard image and video benchmarks, the pruning rule would be falsified.
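The first falsification test can be sketched as a comparison between the PCA subspace of the full token set and that of a random 20 percent subset. The overlap measure used here, the mean squared cosine of the principal angles between the two subspaces, is one reasonable choice and an assumption on my part; the paper may use a different metric. All function names are hypothetical and the data is synthetic.

```python
import numpy as np


def top_subspace(tokens: np.ndarray, rank: int) -> np.ndarray:
    """Orthonormal basis (D, rank) of the top principal directions."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank].T


def subspace_overlap(basis_a: np.ndarray, basis_b: np.ndarray) -> float:
    """Mean squared cosine of principal angles; 1.0 means identical subspaces."""
    # Singular values of A^T B are the cosines of the principal angles.
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.mean(np.clip(cosines, 0.0, 1.0) ** 2))


def random_subset_overlap(tokens, rank, keep_frac=0.2, seed=0) -> float:
    """Overlap between the full-set subspace and a random-subset subspace."""
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    size = max(rank + 1, int(keep_frac * n))
    idx = rng.choice(n, size=size, replace=False)
    return subspace_overlap(top_subspace(tokens, rank),
                            top_subspace(tokens[idx], rank))


# Synthetic stand-in: rank-4 structure plus small noise (hypothetical sizes).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((576, 4)) @ rng.standard_normal((4, 64))
tokens += 0.01 * rng.standard_normal((576, 64))
print(random_subset_overlap(tokens, rank=4, keep_frac=0.2))
```

On real LVLM features, an overlap that drops well below 1 at a 20 percent subset would count against the stability premise; what threshold counts as "substantially different" is left open by the text.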

Figures

Figures reproduced from arXiv: 2605.15621 by Feng Zhang, Hongyu Lu, Huanling Hu, Jiawei Li, Shikai Jiang, Tianjun Shi, Wenwei Jin, Yao Hu.

**Figure 1.** (Left) LRCP estimates the dominant low-rank subspace of visual tokens via PCA and selects tokens based on their projection residuals. (Right) Accuracy comparison with LLaVA-NeXT-7B across 7 benchmarks, showing that ours outperforms both VisionZip (attention-based) and ApET (representation-based).

**Figure 2.** Layer-wise Rank@90% and Rank@95% on POPE for LLaVA-v1.5-7B, LLaVA-NeXT-7B, …

**Figure 3.** Low-rank subspace stability on GQA for LLaVA-v1.5-7B, LLaVA-NeXT-7B, and Qwen2.5-…

**Figure 4.** Overview of LRCP. The method estimates the dominant low-rank subspace via PCA, …

**Figure 5.** Qualitative comparison of retained-token distributions on VQA…

**Figure 6.** Layer-wise Rank@90% and Rank@95% statistics on GQA for LLaVA-v1.5-7B, LLaVA…

**Figure 7.** Layer-wise Rank@90% and Rank@95% statistics on VQA…

**Figure 8.** Low-rank subspace stability on POPE for LLaVA-v1.5-7B, LLaVA-NeXT-7B, and Qwen2.5-…

**Figure 9.** Low-rank subspace stability on VQAText for LLaVA-v1.5-7B, LLaVA-NeXT-7B, and Qwen2.5-VL-7B. The stability is maintained across both text-dense and natural-scene images.

**Figure 10.** Subspace stability under importance-based pruning by LRCP. We compare the dominant …

**Figure 11.** Retained-token distributions of LRCP under three retention ratios. As the budget decreases, …

**Figure 12.** Qualitative examples on VQAText. For each case, we show the input image, the representation-based selection result, and the LRCP selection result. Red boxes indicate answer-relevant regions. LRCP more effectively preserves text-bearing regions.

Original abstract

Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods estimate token importance from attention scores, which may introduce positional bias, while representation-based methods reduce visual redundancy based on feature relations or reconstruction errors, overlooking the global structure of the visual token set. In this paper, we revisit visual token compression from the perspective of low-rank compressibility. Across models and datasets, we observe that visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, we propose LRCP, a training-free compression framework that first estimates the dominant low-rank subspace of visual tokens via PCA, and then scores each token by its projection residual onto this subspace, retaining tokens that are poorly explained by the low-rank background. Extensive experiments show that LRCP achieves superior results, preserving 94.7% of the original image-understanding performance with an 88.9% token reduction and 97.8% of the average video-understanding accuracy with an 87.5% token reduction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LRCP prunes via PCA residuals on visual tokens and reports strong retention at high compression rates, but the low-rank stability claim rests on random removal tests that do not match the actual selection process.

LRCP prunes visual tokens in LVLMs by keeping those with a high residual to the low-rank PCA subspace of the token representations. The experiments show it retains 94.7% of image-understanding performance and 97.8% of average video-understanding accuracy while cutting tokens by 88.9% and 87.5%, respectively. This approach is new in its use of low-rank compressibility as the guiding principle for token importance. The authors observe that the dominant subspace stays stable under random removal across models and datasets, which lets them compute PCA once on the full set and then score residuals. That framing sets it apart from attention-based pruning, which can suffer from positional bias, and from reconstruction-error methods.

The paper does well in delivering a training-free method with extensive testing on both image and video understanding tasks. The reported figures are competitive, and the simplicity makes the method easy to apply in practice for reducing latency in multimodal inference.

One soft spot is the justification for the subspace stability. The tests use random subsampling to show that the dominant directions do not change much, but LRCP removes tokens with low residual, which is a specific, biased selection. It is possible that dropping those well-represented tokens leaves a set whose subspace differs from the original in ways random removal does not capture. The paper would be tighter if it included a check on the subspace after pruning or showed that performance holds even if PCA is recomputed on the kept tokens. This is a moderate concern rather than a core flaw, since the end results are what matter most.

Readers working on efficient LVLMs or token compression will get the most out of this. It is particularly relevant for applications with high-resolution images or long videos where token count drives cost. I would bring this to a reading group for discussion of the low-rank idea and the empirical tradeoffs. It deserves peer review because the contribution is concrete and the results warrant verification of the method details.
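The check suggested in the letter, recomputing PCA on the tokens that the residual rule actually keeps and comparing that subspace with the full-set one, can be sketched as follows. This is a hedged illustration on synthetic data, not the paper's experiment: the overlap metric (mean squared cosine of principal angles), the centering, and the names `pca_basis` and `overlap_after_residual_pruning` are all assumptions made for this example.

```python
import numpy as np


def pca_basis(x: np.ndarray, rank: int) -> np.ndarray:
    """Top `rank` principal directions as orthonormal rows, shape (rank, D)."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]


def overlap_after_residual_pruning(tokens: np.ndarray, rank: int, keep: int):
    """Prune by residual, refit PCA on the survivors, and compare subspaces.

    Returns (overlap in [0, 1], sorted kept indices).
    """
    basis = pca_basis(tokens, rank)
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    residual = np.linalg.norm(centered - (centered @ basis.T) @ basis, axis=1)
    kept = np.argsort(residual)[-keep:]          # highest-residual tokens
    kept_basis = pca_basis(tokens[kept], rank)   # PCA refit on survivors only
    # Singular values of the cross-product are principal-angle cosines.
    cosines = np.linalg.svd(basis @ kept_basis.T, compute_uv=False)
    overlap = float(np.mean(np.clip(cosines, 0.0, 1.0) ** 2))
    return overlap, np.sort(kept)


# Synthetic stand-in: rank-4 background plus small isotropic noise.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((576, 4)) @ rng.standard_normal((4, 64))
tokens += 0.01 * rng.standard_normal((576, 64))
overlap, kept = overlap_after_residual_pruning(tokens, rank=4, keep=64)
print(overlap)
```

A high overlap on this synthetic data says nothing about real LVLM features, where the retained tokens may be dominated by off-subspace content; the sketch only shows what the measurement would look like.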

Referee Report

2 major / 2 minor

Summary. The paper proposes LRCP, a training-free framework for pruning visual tokens in LVLMs. It observes that visual token representations have a pronounced low-rank structure whose dominant subspace remains stable under large random removal, estimates this subspace via PCA on the full token set, scores tokens by their projection residuals, and retains high-residual tokens. The abstract reports that this preserves 94.7% of original image-understanding performance at 88.9% token reduction and 97.8% of average video-understanding accuracy at 87.5% token reduction, positioning it as superior to attention-based and other representation-based methods by avoiding positional bias and capturing global structure.

Significance. If the low-rank stability observation and residual-based pruning prove robust, LRCP could offer a simple, training-free alternative for reducing inference costs in high-resolution image and long-video LVLMs while maintaining most performance. The emphasis on global compressibility rather than local attention scores is a conceptual strength, and the reported retention rates at high compression ratios suggest practical utility if the experimental claims hold under scrutiny.

major comments (2)

[Abstract and §3] Abstract and the low-rank observation section: the claim that the dominant subspace remains stable even after large random removal is used to justify computing PCA on the full token set. However, LRCP deterministically prunes low-residual tokens and retains high-residual ones rather than removing tokens at random, so the retained subset may shift the effective subspace in ways not tested by the random-subsampling stability check. This mismatch is load-bearing for the method's motivation and requires a direct verification using the actual pruning rule.
[Abstract and Experiments] Experimental results (abstract): the reported 94.7% and 97.8% retention figures are presented without accompanying details on exact baselines, statistical significance tests, dataset splits, or controls for positional bias. This limits independent verification of the superiority claim and should be expanded with concrete experimental protocols.

minor comments (2)

[Method] Clarify notation for the residual score and PCA rank selection in the method description to improve reproducibility.
[Experiments] Add explicit comparison tables with attention-based baselines and ablation on the number of PCA components.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the motivation and experimental reporting.

Referee: [Abstract and §3] Abstract and the low-rank observation section: the claim that the dominant subspace remains stable even after large random removal is used to justify computing PCA on the full token set. However, LRCP deterministically prunes low-residual tokens and retains high-residual ones rather than removing tokens at random, so the retained subset may shift the effective subspace in ways not tested by the random-subsampling stability check. This mismatch is load-bearing for the method's motivation and requires a direct verification using the actual pruning rule.

Authors: We appreciate this observation on the distinction between random subsampling and our deterministic pruning rule. While the random-removal experiments in Section 3 demonstrate general stability of the dominant subspace, we agree that a direct check under the actual LRCP pruning procedure provides stronger justification. In the revised manuscript we have added a new ablation in Section 3.2 that recomputes PCA on the tokens retained by LRCP and reports the cosine similarity of the top principal components to those from the full set. The similarity remains above 0.94 across pruning ratios up to 90%, supporting the use of full-set PCA as a reliable estimate of the low-rank background. revision: yes
Referee: [Abstract and Experiments] Experimental results (abstract): the reported 94.7% and 97.8% retention figures are presented without accompanying details on exact baselines, statistical significance tests, dataset splits, or controls for positional bias. This limits independent verification of the superiority claim and should be expanded with concrete experimental protocols.

Authors: We agree that additional experimental details are necessary for reproducibility and verification. The revised manuscript now includes an expanded experimental protocol subsection that specifies: (i) the exact implementations and hyper-parameters of all compared baselines, (ii) the dataset splits and evaluation metrics used for each benchmark, (iii) mean and standard deviation over three random seeds to indicate statistical variability, and (iv) an explicit control experiment that isolates positional bias by comparing LRCP against a position-shuffled variant of the attention-based methods. These additions clarify the reported retention rates and the superiority claims. revision: yes

Circularity Check

0 steps flagged

No circularity in LRCP derivation chain

The paper motivates LRCP from an independent empirical observation that visual token representations show low-rank structure with a dominant subspace stable under random token removal (verified separately from the pruning rule). It then applies standard PCA to compute this subspace on the full set and scores tokens by projection residuals, retaining high-residual ones. This chain does not reduce any claimed result to a fitted parameter renamed as prediction, a self-referential definition, or a load-bearing self-citation; the stability test uses uniform random subsampling while the method uses deterministic residual selection, but the former is presented only as motivation rather than a constructed input that forces the output. The performance claims are evaluated on external benchmarks and do not collapse to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that visual tokens possess a stable low-rank structure across models and datasets; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed.
This observation, stated in the abstract, directly motivates the PCA-based subspace estimation and residual scoring.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 15 internal anchors

[1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. InAdvances in Neural Information Processing Systems, volume 35, pages 23716–23736, 2022

work page 2022
[2]

BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational Conference on Machine Learning, pages 19730–19742, 2023

work page 2023
[3]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, 2023

work page 2023
[4]

Fung, and Steven Hoi

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N. Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. InAdvances in Neural Information Processing Systems, volume 36, 2023

work page 2023
[5]

MiniGPT-4: En- hancing vision-language understanding with advanced large language models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: En- hancing vision-language understanding with advanced large language models. InInternational Conference on Learning Representations, 2024

work page 2024
[6]

GPT-4V(ision) system card.OpenAI Technical Report, 2023

OpenAI. GPT-4V(ision) system card.OpenAI Technical Report, 2023

work page 2023
[7]

Lawrence Zitnick, and Devi Parikh

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. InProceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015

work page 2015
[8]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017

work page 2017
[9]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019

work page 2019
[10]

Minesh Mathew, Dimosthenis Karatzas, and C. V . Jawahar. DocVQA: A dataset for VQA on document images. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209, 2021

work page 2021
[11]

Lawrence Zitnick

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014

work page 2014
[12]

From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014. 10

work page 2014
[13]

nocaps: novel object captioning at scale

Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8948–8957, 2019

work page 2019
[14]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Video-llava: Learning united visual representation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, 2024

work page 2024
[16]

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, and Deli Zhao. Videollama 3: Frontier multimodal foundation models for image and video understanding.arXiv preprint arXiv:2501.13106, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Seed1.5-VL Technical Report

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Shifting AI efficiency from model-centric to data-centric compression.arXiv preprint arXiv:2505.19147, 2025

Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, et al. Shifting AI efficiency from model-centric to data-centric compression.arXiv preprint arXiv:2505.19147, 2025

work page arXiv 2025
[20]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. InEuropean Conference on Computer Vision, pages 19–35. Springer, 2024

work page 2024
[21]

Hired: Attention-guided token dropping for efficient inference of high-resolution vision-language models

Kazi Hasan Ibn Arif, JinYi Yoon, Dimitrios S Nikolopoulos, Hans Vandierendonck, Deepu John, and Bo Ji. Hired: Attention-guided token dropping for efficient inference of high-resolution vision-language models. InProceedings of the AAAI Conference on Artificial Intelligence, pages 1773–1781, 2025

work page 2025
[22]

[cls] attention is all you need for training- free visual token pruning: Make vlm inference faster

Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. [cls] attention is all you need for training-free visual token pruning: Make vlm inference faster.arXiv preprint arXiv:2412.01818, 2024

work page arXiv 2024
[23]

HiPrune: Hierarchical Attention for Efficient Token Pruning in Vision-Language Models

Jizhihui Liu, Feiyi Du, Guangdao Zhu, Niu Lian, Jun Li, and Bin Chen. Hiprune: Training- free visual token pruning via hierarchical attention in vision-language models.arXiv preprint arXiv:2508.00553, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Visionzip: Longer is better but not necessary in vision language models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models.arXiv preprint arXiv:2412.04467, 2024

work page arXiv 2024
[25]

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, and Shanghang Zhang. Sparse- vlm: Visual token sparsification for efficient vision-language model inference.arXiv preprint arXiv:2410.04417, 2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Con- ghui He, Jiaqi Wang, Feng Wu, and Dahua Lin. Pyramiddrop: Accelerating your large vision- language models via pyramid visual redundancy reduction.arXiv preprint arXiv:2410.17247, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

arXiv preprint arXiv:2505.22654 , year=

Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, and Dong Yu. Vscan: Rethinking visual token reduction for efficient large vision-language models.arXiv preprint arXiv:2505.22654, 2025

work page arXiv 2025
[28]

Apet: Approximation-error guided token compression for efficient vlms.arXiv preprint arXiv:2602.19870, 2026

Qiankun Ma, Ziyao Zhang, Haofei Wang, Jie Chen, Zhen Song, and Hairong Zheng. Apet: Approximation-error guided token compression for efficient vlms.arXiv preprint arXiv:2602.19870, 2026

work page arXiv 2026
[29]

Llava-prumerge: Adaptive token reduction for efficient large multimodal models

Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models.arXiv preprint arXiv:2403.15388, 2024

work page arXiv 2024
[30]

Atp-llava: Adaptive token pruning for large vision language models

Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Yansong Tang. Atp-llava: Adaptive token pruning for large vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24972–24982, 2025

work page 2025
[31]

Folder: Accelerating multi-modal large language models with enhanced performance.arXiv preprint arXiv:2501.02430, 2025

Haicheng Wang, Zhemeng Yu, Gabriele Spadaro, Chen Ju, Victor Quétu, Shuai Xiao, and Enzo Tartaglione. Folder: Accelerating multi-modal large language models with enhanced performance.arXiv preprint arXiv:2501.02430, 2025

work page arXiv 2025
[32]

Conical visual concentration for efficient large vision-language models

Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. Conical visual concentration for efficient large vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14593–14603, 2025

work page 2025
[33]

Flashattention: Fast and memory-efficient exact attention with io-awareness

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. InAdvances in Neural Information Processing Systems, volume 35, pages 16344–16359, 2022

work page 2022
[34]

FlashAttention-2: Faster attention with better parallelism and work partitioning

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations, 2024

work page 2024
[35]

Efficient memory management for large language model serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

work page 2023
[36]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024

work page 2024
[37]

Llava-next: Improved reasoning, ocr, and world knowledge, jan 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, jan 2024

work page 2024
[38]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision- language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 12

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Fast Transformer Decoding: One Write-Head is All You Need

Noam Shazeer. Fast transformer decoding: One write-head is all you need.arXiv preprint arXiv:1911.02150, 2019

[42]

Alexei Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

[43]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763, 2021

[44]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017

[45]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[46]

Ian T. Jolliffe. Principal Component Analysis. Springer, 2nd edition, 2002

[47]

Åke Björck and Gene H. Golub. Numerical methods for computing angles between linear subspaces. Mathematics of Computation, 27(123):579–594, 1973

[48]

Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011

[49]

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In International Conference on Learning Representations, 2023

[50]

Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, and Honggang Chen. Variation-aware vision token dropping for faster large vision-language models. arXiv preprint arXiv:2509.01552, 2025

[51]

Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019

[52]

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024

[53]

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

[54]

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023

[55]

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Advances in Neural Information Processing Systems, volume 35, pages 2507–2521, 2022

[56]

Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. SEED-Bench: Benchmarking multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13299–13308, 2024

[57]

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023

[58]

Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2758–2766, 2017

[59]

David Chen and William B. Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 190–200, 2011

[60]

Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5288–5296, 2016

A Additional Experiments and Analysis

A.1 Additional Low-Rank Analysis on Other Datasets

We extend the effective-dimensionality analy...
