
arxiv: 2503.14075 · v3 · submitted 2025-03-18 · 💻 cs.CV · cs.CL

Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models

Pith reviewed 2026-05-22 23:55 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords vision-language models · token pruning · model acceleration · speculative decoding · distillation · reinforcement learning · visual token reduction

The pith

A twig module added to an early VLM layer supplies pruning signals that outperform early attention maps, enabling 88.9 percent token reduction with 96 percent performance retained plus 154 percent speedup on long responses via self-speculative decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TwigVLM as a way to accelerate large vision-language models by growing a small additional module called a twig on top of an early transformer layer. This twig is trained to produce more reliable signals for deciding which visual tokens can be dropped without harming the final answer. The same module also supports self-speculative decoding that re-uses early computations to shorten generation time even when the output contains dozens of tokens. TwigVLM++ extends the idea to multiple pruning heads and trains them first by distillation then by reinforcement learning that directly rewards higher accuracy after aggressive pruning. If the approach holds, VLMs could run substantially faster on the same hardware while keeping most of their multimodal capability.

Core claim

Growing a lightweight twig module on an early layer of a base VLM allows twig-guided token pruning that keeps 96 percent of original accuracy after removing 88.9 percent of visual tokens, and it simultaneously enables self-speculative decoding that yields a 154 percent speedup on long responses. The multi-head TwigVLM++ variant further improves pruning quality through two-stage training (distillation followed by pruning-oriented reinforcement learning) and further accelerates inference through tree-based speculative decoding.

What carries the argument

The twig module, a lightweight addition placed on an early layer of the base VLM that learns to output pruning decisions and speculative tokens.
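To make the pruning step concrete, here is a minimal sketch of score-based visual token selection. It is not the paper's implementation: the function name `prune_visual_tokens`, the random scores standing in for twig outputs, and the choice of 576 visual tokens with 64 kept (which happens to give roughly the 88.9 percent pruning ratio quoted above) are all illustrative assumptions.

```python
import numpy as np

def prune_visual_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring visual tokens, preserving their original order.

    `scores` is a hypothetical per-token importance vector; in a twig-style setup
    it would come from the trained module rather than from raw attention maps.
    """
    if not 0 < keep <= len(scores):
        raise ValueError("keep must be between 1 and the number of tokens")
    # argpartition finds the indices of the top-`keep` scores without a full sort
    top = np.argpartition(scores, -keep)[-keep:]
    kept_idx = np.sort(top)  # restore positional order of the surviving tokens
    return tokens[kept_idx], kept_idx

rng = np.random.default_rng(0)
num_tokens, dim = 576, 8                     # assumed token count, toy feature size
tokens = rng.normal(size=(num_tokens, dim))  # stand-in visual token features
scores = rng.random(num_tokens)              # stand-in importance scores
kept, idx = prune_visual_tokens(tokens, scores, keep=64)
pruned_fraction = 1 - len(idx) / num_tokens
print(f"kept {len(idx)} of {num_tokens} tokens, pruned {pruned_fraction:.1%}")
```

The only thing that changes between attention-guided and twig-guided pruning in this sketch is where `scores` comes from; the selection step itself is the same.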

If this is right

  • Pruning 88.9 percent of visual tokens preserves 96 percent of the base model's performance on standard VLM benchmarks.
  • Self-speculative decoding delivers 154 percent higher throughput when generating long answers compared with standard decoding.
  • The multi-head twig trained by distillation then reinforcement learning produces higher-quality pruning decisions than single-head or attention-only baselines.
  • Tree-based speculative decoding in TwigVLM++ further increases generation speed beyond the single-path SSD strategy.
  • The method outperforms prior token-pruning approaches on both accuracy and speed when applied to LLaVA-1.5-7B.
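The self-speculative decoding idea in the bullets above can be sketched as a draft-then-verify loop. This is a toy under stated assumptions, not the paper's SSD or tree-based variant: `full_next` and `draft_next` are hypothetical next-token functions standing in for the full VLM and the shallow twig model, decoding is greedy, and the verification that a real system would do in one batched forward pass is simulated with sequential calls.

```python
def greedy_decode(full_next, prompt, max_new):
    """Baseline: one full-model call per generated token."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(full_next(seq))
    return seq

def speculative_decode(full_next, draft_next, prompt, max_new, gamma=4):
    """Draft `gamma` tokens cheaply, then let the full model verify them."""
    seq = list(prompt)
    target_len = len(prompt) + max_new
    full_passes = 0
    while len(seq) < target_len:
        # 1) The cheap draft model proposes gamma tokens autoregressively.
        draft = []
        for _ in range(gamma):
            draft.append(draft_next(seq + draft))
        # 2) One verification pass of the full model (simulated sequentially here).
        full_passes += 1
        accepted = []
        for tok in draft:
            expected = full_next(seq + accepted)
            accepted.append(expected)  # always keep the full model's token
            if expected != tok:
                break                  # first mismatch: discard the remaining drafts
        else:
            # Every draft matched, so the same pass also yields one extra token.
            accepted.append(full_next(seq + accepted))
        seq.extend(accepted)
    return seq[:target_len], full_passes

# Toy models: the draft agrees with the full model except at every fifth position.
def full_next(seq):
    return (seq[-1] * 3 + len(seq)) % 11

def draft_next(seq):
    tok = full_next(seq)
    return (tok + 1) % 11 if len(seq) % 5 == 0 else tok

prompt, max_new = [1, 2], 30
baseline = greedy_decode(full_next, prompt, max_new)
spec, passes = speculative_decode(full_next, draft_next, prompt, max_new)
print(spec == baseline, passes, "verification passes vs", max_new, "baseline calls")
```

Because every accepted token is the full model's own greedy choice, the output matches plain greedy decoding; the saving comes from needing fewer full-model passes when the draft is usually right, which is why the benefit grows with response length.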

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same twig architecture could be attached to other VLM families beyond LLaVA without retraining the entire model.
  • Because the twig is small and trained once, the acceleration benefit scales to many downstream tasks that reuse the same base VLM.
  • Combining the pruning head with other compression techniques such as quantization might compound the speed gains.
  • The reinforcement-learning stage that directly optimizes post-pruning accuracy could be adapted to reward other metrics such as latency or energy use.

Load-bearing premise

The signals produced by the trained twig remain more accurate than the base model's early-layer attention maps across different VLMs and response lengths.

What would settle it

A test on a new VLM, or on responses longer than those evaluated, that measures whether twig-guided pruning falls below the accuracy obtained by simply using early-layer attention maps at the same pruning ratio.

Figures

Figures reproduced from arXiv: 2503.14075 by Hongyuan Zhang, Jun Yu, Mingyang Wang, Tao Wei, Weijun Zhang, Wenwen Pan, Yan Yang, Zhenwei Shao, Zhou Yu.

Figure 1. Our TwigVLM overview. (a) Given a base deep VLM, the TwigVLM is obtained by freezing the base VLM while training a shallow twig block upon its early layer. (b) Compared to the VLM acceleration methods based on visual token pruning (e.g., FastV [9]), TwigVLM not only achieves better accuracy retention but also yields higher generation speed. Both FastV and TwigVLM take LLaVA-1.5-7B [32] as the base VLM (r…
Figure 3. Prefilling and decoding time costs. (a) Prefilling time (gray dotted line) and decoding time for LLaVA-1.5-7B [32] with different response lengths. (b) Prefilling (P) and decoding (D) time comparisons of LLaVA-1.5-7B and its FastV-based variant.
…son of different token pruning methods, the average number of retained tokens R̄ is used and defined as follows: R̄ = [M × K + R × (L − K)] / L (4)
Figure 4. Training and two-stage inference of TwigVLM. (a) The twig block is initialized from the base VLM and is coupled with the first K layers of base VLM to form a shallow model, which can be trained efficiently. (b) In the prefilling stage, different from previous approaches that perform token selection based on the attention maps from the base VLM, TwigVLM introduces a twig-guided token pruning (TTP) strategy …
Figure 5. Visualized attention map comparisons of TwigVLM and two typical token pruning methods. The visualized attention maps show that TwigVLM identifies accurate visual tokens to the prompt and predicts the right answer, while both counterparts fail to do that. More examples are provided in the supplementary.
… consistently surpasses all the state-of-the-art counterparts on most VideoQA benchmarks and achieves 3.1%…
Figure 6. Generation speed comparisons on two benchmarks: (a) TextVQA with short responses and (b) MM-Vet with long responses. S̄ denotes the average number of generated tokens on the whole benchmark. The RelSpd of each bar is highlighted in red.
Generation speed comparisons. As mentioned above, TwigVLM can effectively accelerate the generation. To validate this, we conduct intensive experiments on two typical be…
Figure 7. Performance comparisons of TwigVLM models trained …
Figure 8. Visualization of attention maps and predictions for FastV […
Figure 9. Examples of the generated responses using the self-speculative decoding (SSD) on MM-Vet […
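The average-retained-tokens formula captured alongside the Figure 3 caption, R̄ = [M × K + R × (L − K)] / L, can be evaluated directly. The sketch below is only an illustration: the page does not define the symbols, so the reading used here (M tokens kept in each of the first K layers, R tokens kept in each of the remaining L − K layers, L layers in total) is an assumption, and the example values are hypothetical rather than taken from the paper.

```python
def average_retained_tokens(M, K, R, L):
    """Layer-averaged number of visual tokens, R_bar = [M*K + R*(L-K)] / L.

    Assumed meanings (not defined on this page): M tokens before pruning,
    K layers run before pruning, R tokens after pruning, L total layers.
    """
    if not 0 <= K <= L or L <= 0:
        raise ValueError("need 0 <= K <= L and L > 0")
    return (M * K + R * (L - K)) / L

# Hypothetical configuration: 576 tokens for 2 layers, then 64 tokens for 30 layers.
r_bar = average_retained_tokens(M=576, K=2, R=64, L=32)
print(r_bar)  # 96.0
```

Under this reading, the unpruned early layers pull the average above R, so two methods that keep the same R but prune at different depths would report different R̄ values.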
Original abstract

Large vision-language models (VLMs) have demonstrated remarkable capabilities in open-world multimodal understanding, yet their high computational overheads pose great challenges for practical deployment. Some recent works have proposed methods to accelerate VLMs by pruning redundant visual tokens guided by the attention maps of VLM's early layers. Despite the success of these token pruning methods, they still suffer from two major shortcomings: (i) considerable accuracy drop due to insensitive attention signals in early layers, and (ii) limited speedup when generating long responses (e.g., 30 tokens). To address the limitations above, we present TwigVLM -- a simple and general architecture by growing a lightweight module, named twig, upon an early layer of the base VLM. Compared with most existing VLM acceleration methods purely based on visual token pruning, our TwigVLM not only achieves better accuracy retention by employing a twig-guided token pruning (TTP) strategy, but also yields higher generation speed by utilizing a self-speculative decoding (SSD) strategy. Taking LLaVA-1.5-7B as the base VLM, experimental results show that TwigVLM preserves 96% of the original performance after pruning 88.9% of the visual tokens and achieves 154% speedup in generating long responses, delivering significantly better performance in terms of both accuracy and speed over the state-of-the-art VLM acceleration methods. Moreover, we extend TwigVLM to an improved TwigVLM++ variant by introducing a novel multi-head twig architecture with a specialized pruning head. TwigVLM++ improves pruning quality via a two-stage training paradigm combining a distillation learning stage and a pruning-oriented reinforcement learning stage, and further accelerates inference via a tree-based SSD strategy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces TwigVLM, a lightweight 'twig' module attached to an early layer of a base VLM (e.g., LLaVA-1.5-7B) that enables twig-guided token pruning (TTP) and self-speculative decoding (SSD) to accelerate inference. It claims to retain 96% of original performance after pruning 88.9% of visual tokens while delivering 154% speedup on long responses, outperforming prior attention-map-based pruning methods. TwigVLM++ extends this with a multi-head twig trained via a two-stage process (distillation followed by pruning-oriented reinforcement learning) and a tree-based SSD strategy.

Significance. If the empirical results hold, the work provides a practical architecture for improving the accuracy-speed trade-off in VLM acceleration by learning pruning signals rather than relying solely on early-layer attention. The two-stage training paradigm combining distillation and RL, along with the SSD extension, represents a concrete advance over existing token-pruning baselines if the twig's superiority is shown to generalize.

major comments (2)
  1. [Abstract] The central empirical claims (96% performance retention at 88.9% pruning; 154% speedup on long responses) are presented without error bars, dataset specifications, or ablation controls, which are load-bearing for verifying that twig-derived signals are reliably superior to early-layer attention maps.
  2. [Abstract] The claim that the multi-head twig with two-stage (distillation + RL) training improves pruning quality over the single-head version rests on the unverified assumption that the RL stage avoids new failure modes across response lengths and base VLMs; no direct comparison or robustness test is referenced.
minor comments (1)
  1. [Abstract] The phrasing '154% speedup' is ambiguous (could mean 1.54x or 2.54x); standard multiplicative notation should be used for clarity.
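The ambiguity raised in the minor comment is easy to quantify. The snippet below is a small illustration, with the hypothetical helper `speedup_factor` and a made-up 1000 ms baseline, of how the two readings of "154% speedup" translate into different multiplicative factors and latencies; it does not say which reading the paper intends.

```python
def speedup_factor(percent, reading):
    """Convert a percentage speedup into a multiplicative factor.

    reading="ratio":    new speed is `percent` % of the old speed (154 -> 1.54x)
    reading="increase": new speed is old speed plus `percent` % (154 -> 2.54x)
    """
    if reading == "ratio":
        return percent / 100
    if reading == "increase":
        return 1 + percent / 100
    raise ValueError("reading must be 'ratio' or 'increase'")

baseline_ms = 1000.0  # hypothetical baseline latency, for illustration only
for reading in ("ratio", "increase"):
    factor = speedup_factor(154, reading)
    print(f"{reading}: {factor:.2f}x, {baseline_ms / factor:.0f} ms")
```

Reporting the factor itself (for example "1.54x") removes the need to guess between the two.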

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the two major comments below and will incorporate clarifications where appropriate in a revised manuscript.

  1. Referee: [Abstract] The central empirical claims (96% performance retention at 88.9% pruning; 154% speedup on long responses) are presented without error bars, dataset specifications, or ablation controls, which are load-bearing for verifying that twig-derived signals are reliably superior to early-layer attention maps.

    Authors: We agree the abstract is concise and omits these details. The full manuscript reports error bars across multiple runs in the main results tables and figures, evaluates on standard benchmarks (VQA-v2, GQA, POPE, MME, TextVQA), and provides ablations in Section 4.3 directly comparing twig-guided signals to early-layer attention maps. We will revise the abstract to name the primary datasets and note that supporting error bars and ablations appear in the experiments section.
    Revision: yes

  2. Referee: [Abstract] The claim that the multi-head twig with two-stage (distillation + RL) training improves pruning quality over the single-head version rests on the unverified assumption that the RL stage avoids new failure modes across response lengths and base VLMs; no direct comparison or robustness test is referenced.

    Authors: The manuscript includes direct head-to-head comparisons between single-head TwigVLM and multi-head TwigVLM++ (with the two-stage distillation+RL pipeline) in Tables 1–4 and Section 4.4, along with results across varying response lengths and multiple base VLMs. These experiments show the RL stage improves pruning quality without introducing the failure modes raised. We will update the abstract to explicitly reference these comparisons.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks


The paper introduces TwigVLM as an architectural addition (lightweight twig module grown on an early VLM layer) plus two-stage training (distillation then RL) for TTP and SSD strategies. All reported outcomes—96% performance retention after 88.9% token pruning and 154% speedup—are presented as results of direct experiments on LLaVA-1.5-7B and comparisons to prior acceleration methods. No equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citation chains are used to derive the performance numbers; the central claims remain falsifiable against independent test sets and baselines.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

Review performed on abstract only; the ledger therefore records only the high-level assumptions visible in the abstract. The central claim rests on the existence of a trainable lightweight module that can outperform early-layer attention for pruning and on standard supervised-plus-RL training dynamics.

free parameters (1)
  • twig architecture hyperparameters
    Number of heads, hidden size, and placement layer of the twig module are chosen to make the pruning and decoding strategies work; these are not derived from first principles.
axioms (2)
  • domain assumption: Early-layer signals in the base VLM can be improved upon by a separately trained lightweight module for token importance.
    Invoked to justify replacing attention-map pruning with twig-guided pruning.
  • domain assumption: Standard distillation followed by reinforcement learning on a pruning reward produces a module that generalizes across prompts and response lengths.
    Required for the two-stage training paradigm of TwigVLM++ to deliver the claimed accuracy retention.
invented entities (1)
  • twig module (no independent evidence)
    Purpose: Lightweight add-on that supplies pruning decisions and enables self-speculative decoding.
    New module introduced by the paper; no independent evidence outside the reported experiments is supplied.


Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 18 internal anchors

  1. [1]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 1

  2. [2]

    Token merging: Your vit but faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. In The Eleventh International Conference on Learning Representations, 2023. 8

  3. [3]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023. 1

  4. [4]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, S...

  5. [5]

    Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024. 8

  6. [6]

    Auroracap: Efficient, performant video detailed captioning and a new benchmark

    Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Saining Xie, and Christopher D Manning. Auroracap: Efficient, performant video detailed captioning and a new benchmark. arXiv preprint arXiv:2410.03051, 2024. 8

  7. [7]

    Collecting highly parallel data for paraphrase evaluation

    David L. Chen and William B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, Portland, OR,

  8. [8]

    Llavolta: Efficient multi-modal models via stage-wise visual context compression

    Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, and Alan Yuille. Llavolta: Efficient multi-modal models via stage-wise visual context compression. arXiv preprint arXiv:2406.20092, 2024. 8

  9. [9]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19–35. Springer, 2024. 1, 2, 3, 5, 6, 7, 8, 10, 12, 13

  10. [10]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101,

  11. [11]

    Efficient transformer inference with statically structured sparse attention

    Steve Dai, Hasan Genc, Rangharajan Venkatesan, and Brucek Khailany. Efficient transformer inference with statically structured sparse attention. In 2023 60th ACM/IEEE Design Automation Conference (DAC), pages 1–6. IEEE,

  12. [12]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022. 8

  13. [13]

    Layerskip: Enabling early exit inference and self-speculative decoding

    Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, et al. Layerskip: Enabling early exit inference and self-speculative decoding. arXiv preprint arXiv:2404.16710, 2024. 2, 4, 5, 8, 11

  14. [14]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 6, 12

  15. [15]

    On speculative decoding for multimodal large language models

    Mukul Gagrani, Raghavv Goel, Wonseok Jeon, Junyoung Park, Mingu Lee, and Christopher Lott. On speculative decoding for multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8285–8289, 2024. 8

  16. [16]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017. 6, 15

  17. [17]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024. 8

  18. [18]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 1

  19. [19]

    Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale, 2024

    Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xiang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale, 2024. 11

  20. [20]

    Rethinking token reduction in mllms: Towards a unified paradigm for training-free acceleration

    Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, and Siteng Huang. Rethinking token reduction in mllms: Towards a unified paradigm for training-free acceleration. arXiv preprint arXiv:2411.17686, 2024. 8

  21. [21]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. 3, 6, 12, 13

  22. [22]

    Tgif-qa: Toward spatio-temporal reasoning in visual question answering

    Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2758–2766, 2017. 6, 15

  23. [23]

    Llmlingua: Compressing prompts for accelerated inference of large language models

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358–13376, 2023. 8

  24. [24]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024. 1

  25. [25]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023. 8, 11

  26. [26]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv:2305.10355, 2023. 15

  27. [27]

    Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

    Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814,

  28. [28]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 5, 9

  29. [29]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 8

  30. [30]

    Kangaroo: Lossless self-speculative decoding for accelerating llms via double early exiting

    Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Duyu Tang, Kai Han, and Yunhe Wang. Kangaroo: Lossless self-speculative decoding for accelerating llms via double early exiting. Advances in Neural Information Processing Systems, 37:11946–11965, 2025. 5, 8, 11

  31. [31]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, pages 34892–34916, 2023. 1

  32. [32]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 1, 3, 4, 5, 9

  33. [33]

    Llava-next: Improved reasoning, ocr, and world knowledge, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024. 5

  34. [34]

    Multi-stage vision token dropping: Towards efficient multimodal large language model

    Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, and Linfeng Zhang. Multi-stage vision token dropping: Towards efficient multimodal large language model. arXiv preprint arXiv:2411.10803, 2024. 2, 3, 6, 8, 10

  35. [35]

    Mmbench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2025. 6, 12

  36. [36]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521,

  37. [37]

    Layoutllm: Layout instruction tuning with large language models for document understanding

    Chuwei Luo, Yufan Shen, Zhaoqing Zhu, Qi Zheng, Zhi Yu, and Cong Yao. Layoutllm: Layout instruction tuning with large language models for document understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15630–15640, 2024. 1

  38. [38]

    Groma: Localized visual tokenization for grounding multimodal large language models

    Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. In European Conference on Computer Vision, pages 417–435. Springer, 2024. 1

  39. [39]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024. 9

  40. [40]

    Learning to compress prompts with gist tokens

    Jesse Mu, Xiang Li, and Noah Goodman. Learning to compress prompts with gist tokens. Advances in Neural Information Processing Systems, 36:19327–19352, 2023. 8

  41. [41]

    Gpt-4v(ision) system card

    OpenAI. Gpt-4v(ision) system card. Technical report, OpenAI, 2023. 1

  42. [42]

    Grounding multimodal large language models to the world

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Qixiang Ye, and Furu Wei. Grounding multimodal large language models to the world. In The Twelfth International Conference on Learning Representations, 2024. 1

  43. [43]

    Llava-prumerge: Adaptive token reduction for efficient large multimodal models

    Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. arXiv preprint arXiv:2403.15388,

  44. [44]

    Imp: Highly capable large multimodal models for mobile devices

    Zhenwei Shao, Zhou Yu, Jun Yu, Xuecheng Ouyang, Lihao Zheng, Zhenbiao Gai, Mingyang Wang, Zhenzhong Kuang, and Jiajun Ding. Imp: Highly capable large multimodal models for mobile devices. IEEE Transactions on Multimedia, 2025. 1

  45. [45]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019. 3, 6, 7, 13, 15

  46. [46]

    You only cache once: Decoder-decoder architectures for language models

    Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, and Furu Wei. You only cache once: Decoder-decoder architectures for language models. Advances in Neural Information Processing Systems, 37:7339–7361, 2025. 8

  47. [47]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 1

  48. [48]

    Fastvlm: Efficient vision encoding for vision language models

    Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, et al. Fastvlm: Efficient vision encoding for vision language models. arXiv preprint arXiv:2412.13303, 2024. 8

  49. [49]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017. 1

  50. [50]

    Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

    Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024. 1

  51. [51]

    Accelerating multimodal large language models via dynamic visual-token exit and the empirical findings

    Qiong Wu, Wenhao Lin, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, and Rongrong Ji. Accelerating multimodal large language models via dynamic visual-token exit and the empirical findings. arXiv preprint arXiv:2411.19628, 2024. 8

  52. [52]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024. 8

  53. [53]

    PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

    Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. arXiv preprint arXiv:2410.17247, 2024. 5, 6, 8, 10

  54. [54]

    Video question answering via gradually refined attention over appearance and motion

    Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the ACM international conference on Multimedia, pages 1645–1653, 2017. 6, 15

  55. [55]

    Visionzip: Longer is better but not necessary in vision language models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. arXiv preprint arXiv:2412.04467, 2024. 2, 5, 6, 8, 10, 12, 13

  56. [56]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 1

  57. [57]

    mplug-docowl: Modularized multimodal large language model for document understanding

    Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, et al. mplug-docowl: Modularized multimodal large language model for document understanding. arXiv preprint arXiv:2307.02499, 2023. 1

  58. [58]

    Fit and prune: Fast and training-free visual token pruning for multi-modal large language models

    Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. arXiv preprint arXiv:2409.10197, 2024. 8

  59. [59]

    Atp-llava: Adaptive token pruning for large vision language models

    Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Yansong Tang. Atp-llava: Adaptive token pruning for large vision language models. arXiv preprint arXiv:2412.00447,

  60. [60]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 7, 14, 15

  61. [61]

    Activitynet-qa: A dataset for understanding complex web videos via question answering

    Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, pages 9127–9134, 2019. 6, 15

  62. [62]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024. 15

  63. [63]

    AppAgent: Multimodal Agents as Smartphone Users

    Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023. 1

  64. [64]

    [cls] attention is all you need for training-free visual token pruning: Make vlm inference faster

    Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. [cls] attention is all you need for training-free visual token pruning: Make vlm inference faster. arXiv preprint arXiv:2412.01818, 2024. 2, 6, 8, 10

  65. [65]

    SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

    Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024. 2, 3, 5, 6, 8, 10

  66. [66]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023. 8

  67. [67]

    Treat visual tokens as text? but your mllm only needs fewer efforts to see

    Zeliang Zhang, Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Jianing Zhou, Daniel Miranda, Ajinkya Kale, and Chenliang Xu. Treat visual tokens as text? but your mllm only needs fewer efforts to see. arXiv preprint arXiv:2410.06169,

  68. [68]

    Cross-modal information flow in multimodal large language models

    Zhi Zhang, Srishti Yadav, Fengze Han, and Ekaterina Shutova. Cross-modal information flow in multimodal large language models. arXiv preprint arXiv:2411.18620, 2024. 5, 8

  69. [69]

    A stitch in time saves nine: Small vlm is a precise guidance for accelerating large vlms

    Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, Yibing Song, Kai Wang, Zhangyang Wang, and Yang You. A stitch in time saves nine: Small vlm is a precise guidance for accelerating large vlms. arXiv preprint arXiv:2412.03324,

  70. [70]

    A Survey on Efficient Inference for Large Language Models

    Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294, 2024. 2