Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models
Pith reviewed 2026-05-22 23:55 UTC · model grok-4.3
The pith
A twig module added to an early VLM layer supplies pruning signals that outperform early attention maps, enabling 88.9 percent token reduction with 96 percent performance retained plus 154 percent speedup on long responses via self-specul
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Growing a lightweight twig module on an early layer of a base VLM allows twig-guided token pruning that keeps 96 percent of original accuracy after removing 88.9 percent of visual tokens and simultaneously enables self-speculative decoding that yields 154 percent speedup on long responses; the multi-head TwigVLM++ variant further improves pruning quality through a two-stage process of distillation followed by pruning-oriented reinforcement learning and tree-based speculative decoding.
What carries the argument
The twig module, a lightweight addition placed on an early layer of the base VLM that learns to output pruning decisions and speculative tokens.
If this is right
- Pruning 88.9 percent of visual tokens preserves 96 percent of the base model's performance on standard VLM benchmarks.
- Self-speculative decoding delivers 154 percent higher throughput when generating long answers compared with standard decoding.
- The multi-head twig trained by distillation then reinforcement learning produces higher-quality pruning decisions than single-head or attention-only baselines.
- Tree-based speculative decoding in TwigVLM++ further increases generation speed beyond the single-path SSD strategy.
- The method outperforms prior token-pruning approaches on both accuracy and speed when applied to LLaVA-1.5-7B.
Where Pith is reading between the lines
- The same twig architecture could be attached to other VLM families beyond LLaVA without retraining the entire model.
- Because the twig is small and trained once, the acceleration benefit scales to many downstream tasks that reuse the same base VLM.
- Combining the pruning head with other compression techniques such as quantization might compound the speed gains.
- The reinforcement-learning stage that directly optimizes post-pruning accuracy could be adapted to reward other metrics such as latency or energy use.
Load-bearing premise
The signals produced by the trained twig remain more accurate than the base model's early-layer attention maps across different VLMs and response lengths.
What would settle it
On a new VLM or on responses longer than those tested, measuring whether twig-guided pruning accuracy drops below the accuracy obtained by simply using early-layer attention maps for the same pruning ratio.
Figures
read the original abstract
Large vision-language models (VLMs) have demonstrated remarkable capabilities in open-world multimodal understanding, yet their high computational overheads pose great challenges for practical deployment. Some recent works have proposed methods to accelerate VLMs by pruning redundant visual tokens guided by the attention maps of VLM's early layers. Despite the success of these token pruning methods, they still suffer from two major shortcomings: (i) considerable accuracy drop due to insensitive attention signals in early layers, and (ii) limited speedup when generating long responses (e.g., 30 tokens). To address the limitations above, we present TwigVLM -- a simple and general architecture by growing a lightweight module, named twig, upon an early layer of the base VLM. Compared with most existing VLM acceleration methods purely based on visual token pruning, our TwigVLM not only achieves better accuracy retention by employing a twig-guided token pruning (TTP) strategy, but also yields higher generation speed by utilizing a self-speculative decoding (SSD) strategy. Taking LLaVA-1.5-7B as the base VLM, experimental results show that TwigVLM preserves 96% of the original performance after pruning 88.9% of the visual tokens and achieves 154% speedup in generating long responses, delivering significantly better performance in terms of both accuracy and speed over the state-of-the-art VLM acceleration methods. Moreover, we extend TwigVLM to an improved TwigVLM++ variant by introducing a novel multi-head twig architecture with a specialized pruning head. TwigVLM++ improves pruning quality via a two-stage training paradigm combining a distillation learning stage and a pruning-oriented reinforcement learning stage, and further accelerates inference via a tree-based SSD strategy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TwigVLM, a lightweight 'twig' module attached to an early layer of a base VLM (e.g., LLaVA-1.5-7B) that enables twig-guided token pruning (TTP) and self-speculative decoding (SSD) to accelerate inference. It claims to retain 96% of original performance after pruning 88.9% of visual tokens while delivering 154% speedup on long responses, outperforming prior attention-map-based pruning methods. TwigVLM++ extends this with a multi-head twig trained via a two-stage process (distillation followed by pruning-oriented reinforcement learning) and a tree-based SSD strategy.
Significance. If the empirical results hold, the work provides a practical architecture for improving the accuracy-speed trade-off in VLM acceleration by learning pruning signals rather than relying solely on early-layer attention. The two-stage training paradigm combining distillation and RL, along with the SSD extension, represents a concrete advance over existing token-pruning baselines if the twig's superiority is shown to generalize.
major comments (2)
- [Abstract] Abstract: the central empirical claims (96% performance retention at 88.9% pruning; 154% speedup on long responses) are presented without error bars, dataset specifications, or ablation controls, which are load-bearing for verifying that twig-derived signals are reliably superior to early-layer attention maps.
- [Abstract] Abstract: the claim that the multi-head twig with two-stage (distillation + RL) training improves pruning quality over the single-head version rests on the unverified assumption that the RL stage avoids new failure modes across response lengths and base VLMs; no direct comparison or robustness test is referenced.
minor comments (1)
- [Abstract] Abstract: the phrasing '154% speedup' is ambiguous (could mean 1.54x or 2.54x); standard multiplicative notation should be used for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address the two major comments below and will incorporate clarifications where appropriate in a revised manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central empirical claims (96% performance retention at 88.9% pruning; 154% speedup on long responses) are presented without error bars, dataset specifications, or ablation controls, which are load-bearing for verifying that twig-derived signals are reliably superior to early-layer attention maps.
Authors: We agree the abstract is concise and omits these details. The full manuscript reports error bars across multiple runs in the main results tables and figures, evaluates on standard benchmarks (VQA-v2, GQA, POPE, MME, TextVQA), and provides ablations in Section 4.3 directly comparing twig-guided signals to early-layer attention maps. We will revise the abstract to name the primary datasets and note that supporting error bars and ablations appear in the experiments section. revision: yes
-
Referee: [Abstract] Abstract: the claim that the multi-head twig with two-stage (distillation + RL) training improves pruning quality over the single-head version rests on the unverified assumption that the RL stage avoids new failure modes across response lengths and base VLMs; no direct comparison or robustness test is referenced.
Authors: The manuscript includes direct head-to-head comparisons between single-head TwigVLM and multi-head TwigVLM++ (with the two-stage distillation+RL pipeline) in Tables 1–4 and Section 4.4, along with results across varying response lengths and multiple base VLMs. These experiments show the RL stage improves pruning quality without introducing the failure modes raised. We will update the abstract to explicitly reference these comparisons. revision: yes
Circularity Check
No significant circularity; empirical claims rest on external benchmarks
full rationale
The paper introduces TwigVLM as an architectural addition (lightweight twig module grown on an early VLM layer) plus two-stage training (distillation then RL) for TTP and SSD strategies. All reported outcomes—96% performance retention after 88.9% token pruning and 154% speedup—are presented as results of direct experiments on LLaVA-1.5-7B and comparisons to prior acceleration methods. No equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citation chains are used to derive the performance numbers; the central claims remain falsifiable against independent test sets and baselines.
Axiom & Free-Parameter Ledger
free parameters (1)
- twig architecture hyperparameters
axioms (2)
- domain assumption Early-layer signals in the base VLM can be improved upon by a separately trained lightweight module for token importance
- domain assumption Standard distillation followed by reinforcement learning on a pruning reward produces a module that generalizes across prompts and response lengths
invented entities (1)
-
twig module
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Token merging: Your vit but faster
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. In The Eleventh International Conference on Learning Representations, 2023. 8
work page 2023
-
[3]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, Sand- hini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, S...
work page 1901
-
[5]
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Ja- son D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decod- ing heads. arXiv preprint arXiv:2401.10774, 2024. 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
Auroracap: Efficient, performant video detailed captioning and a new benchmark
Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Sain- ing Xie, and Christopher D Manning. Auroracap: Efficient, performant video detailed captioning and a new benchmark. arXiv preprint arXiv:2410.03051, 2024. 8
-
[7]
David L. Chen and William B. Dolan. Collecting highly par- allel data for paraphrase evaluation. In ACL, Portland, OR,
-
[8]
Llavolta: Efficient multi-modal models via stage-wise visual context compression
Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, and Alan Yuille. Llavolta: Efficient multi-modal models via stage-wise visual context compression. arXiv preprint arXiv:2406.20092, 2024. 8
-
[9]
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19–35. Springer, 2024. 1, 2, 3, 5, 6, 7, 8, 10, 12, 13
work page 2024
-
[10]
How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101,
-
[11]
Efficient transformer inference with stati- cally structured sparse attention
Steve Dai, Hasan Genc, Rangharajan Venkatesan, and Brucek Khailany. Efficient transformer inference with stati- cally structured sparse attention. In 2023 60th ACM/IEEE Design Automation Conference (DAC) , pages 1–6. IEEE,
work page 2023
-
[12]
Fu, Stefano Ermon, Atri Rudra, and Christopher R´e
Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R´e. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022. 8
work page 2022
-
[13]
Layer- skip: Enabling early exit inference and self-speculative de- coding
Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, et al. Layer- skip: Enabling early exit inference and self-speculative de- coding. arXiv preprint arXiv:2404.16710, 2024. 2, 4, 5, 8, 11
-
[14]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 6, 12
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
On speculative de- coding for multimodal large language models
Mukul Gagrani, Raghavv Goel, Wonseok Jeon, Junyoung Park, Mingu Lee, and Christopher Lott. On speculative de- coding for multimodal large language models. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8285–8289, 2024. 8
work page 2024
-
[16]
Making the v in vqa matter: Ele- vating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Ele- vating the role of image understanding in visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 6904– 6913, 2017. 6, 15
work page 2017
-
[17]
Mamba: Linear-time sequence mod- eling with selective state spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence mod- eling with selective state spaces. InFirst Conference on Lan- guage Modeling, 2024. 8
work page 2024
-
[18]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale, 2024
Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xi- ang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale, 2024. 11
work page 2024
-
[20]
Re- thinking token reduction in mllms: Towards a unified paradigm for training-free acceleration
Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, and Siteng Huang. Re- thinking token reduction in mllms: Towards a unified paradigm for training-free acceleration. arXiv preprint arXiv:2411.17686, 2024. 8
-
[21]
Gqa: A new dataset for real-world visual reasoning and compositional question answering
Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2019. 3, 6, 12, 13
work page 2019
-
[22]
Tgif-qa: Toward spatio-temporal reasoning in visual question answering
Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 2758–2766, 2017. 6, 15
work page 2017
-
[23]
Llmlingua: Compressing prompts for acceler- ated inference of large language models
Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for acceler- ated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Lan- guage Processing, pages 13358–13376, 2023. 8
work page 2023
-
[24]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024. 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Fast in- ference from transformers via speculative decoding
Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast in- ference from transformers via speculative decoding. In In- ternational Conference on Machine Learning, pages 19274– 19286. PMLR, 2023. 8, 11
work page 2023
-
[26]
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv:2305.10355, 2023. 15
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814,
work page internal anchor Pith review Pith/arXiv arXiv
-
[28]
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual rep- resentation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 5, 9
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[29]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 8
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
Kangaroo: Loss- less self-speculative decoding for accelerating llms via dou- ble early exiting
Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Duyu Tang, Kai Han, and Yunhe Wang. Kangaroo: Loss- less self-speculative decoding for accelerating llms via dou- ble early exiting. Advances in Neural Information Process- ing Systems, 37:11946–11965, 2025. 5, 8, 11
work page 2025
-
[31]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, pages 34892–34916, 2023. 1
work page 2023
-
[32]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 1, 3, 4, 5, 9
work page 2024
-
[33]
Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024. 5
work page 2024
-
[34]
Multi-stage vision token dropping: Towards efficient multimodal large language model
Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, and Linfeng Zhang. Multi-stage vision token dropping: Towards efficient multimodal large language model. arXiv preprint arXiv:2411.10803, 2024. 2, 3, 6, 8, 10
-
[35]
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2025. 6, 12
work page 2025
-
[36]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems , 35:2507–2521,
-
[37]
Layoutllm: Layout instruction tuning with large language models for document understanding
Chuwei Luo, Yufan Shen, Zhaoqing Zhu, Qi Zheng, Zhi Yu, and Cong Yao. Layoutllm: Layout instruction tuning with large language models for document understanding. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15630–15640, 2024. 1
work page 2024
-
[38]
Groma: Localized visual tokenization for grounding multimodal large language models
Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiao- juan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. In European Conference on Computer Vision, pages 417–435. Springer, 2024. 1
work page 2024
-
[39]
Video-chatgpt: Towards detailed video understanding via large vision and language models
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fa- had Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Pro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024. 9
work page 2024
-
[40]
Learning to com- press prompts with gist tokens
Jesse Mu, Xiang Li, and Noah Goodman. Learning to com- press prompts with gist tokens. Advances in Neural Informa- tion Processing Systems, 36:19327–19352, 2023. 8
work page 2023
-
[41]
OpenAI. Gpt-4v(ision) system card. Technical report, Ope- nAI, 2023. 1
work page 2023
-
[42]
Ground- ing multimodal large language models to the world
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Qixiang Ye, and Furu Wei. Ground- ing multimodal large language models to the world. In The Twelfth International Conference on Learning Representa- tions, 2024. 1
work page 2024
-
[43]
Llava-prumerge: Adaptive token reduction for efficient large multimodal models
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. arXiv preprint arXiv:2403.15388,
-
[44]
Imp: Highly capable large multimodal models for mobile devices
Zhenwei Shao, Zhou Yu, Jun Yu, Xuecheng Ouyang, Lihao Zheng, Zhenbiao Gai, Mingyang Wang, Zhenzhong Kuang, and Jiajun Ding. Imp: Highly capable large multimodal models for mobile devices. IEEE Transactions on Multime- dia, 2025. 1
work page 2025
-
[45]
Towards vqa models that can read
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019. 3, 6, 7, 13, 15
work page 2019
-
[46]
You only cache once: Decoder-decoder architectures for lan- guage models
Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, and Furu Wei. You only cache once: Decoder-decoder architectures for lan- guage models. Advances in Neural Information Processing Systems, 37:7339–7361, 2025. 8
work page 2025
-
[47]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[48]
Fastvlm: Efficient vision encoding for vision language models
Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, et al. Fastvlm: Efficient vision encoding for vision language models. arXiv preprint arXiv:2412.13303, 2024. 8
-
[49]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017. 1
work page 2017
-
[50]
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024. 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[51]
Qiong Wu, Wenhao Lin, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, and Rongrong Ji. Accelerating multimodal large lan- guage models via dynamic visual-token exit and the empiri- cal findings. arXiv preprint arXiv:2411.19628, 2024. 8
-
[52]
Efficient streaming language models with attention sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024. 8
work page 2024
-
[53]
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy re- duction. arXiv preprint arXiv:2410.17247 , 2024. 5, 6, 8, 10
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[54]
Video question answer- ing via gradually refined attention over appearance and mo- tion
Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answer- ing via gradually refined attention over appearance and mo- tion. In Proceedings of the ACM international conference on Multimedia, pages 1645–1653, 2017. 6, 15
work page 2017
-
[55]
Visionzip: Longer is better but not necessary in vision language models
Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. arXiv preprint arXiv:2412.04467, 2024. 2, 5, 6, 8, 10, 12, 13
-
[56]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[57]
mplug-docowl: Modularized multimodal large language model for document understanding
Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Jun- feng Tian, et al. mplug-docowl: Modularized multimodal large language model for document understanding. arXiv preprint arXiv:2307.02499, 2023. 1
-
[58]
Fit and prune: Fast and training-free visual token pruning for multi-modal large language models
Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. arXiv preprint arXiv:2409.10197, 2024. 8
-
[59]
Atp-llava: Adaptive token pruning for large vision language models
Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Yansong Tang. Atp-llava: Adaptive token pruning for large vision language models. arXiv preprint arXiv:2412.00447,
-
[60]
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 7, 14, 15
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[61]
Activitynet-qa: A dataset for understanding complex web videos via question answering
Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, pages 9127–9134, 2019. 6, 15
work page 2019
-
[62]
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for ex- pert agi
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for ex- pert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 9556– 9567, 2024. 15
work page 2024
-
[63]
AppAgent: Multimodal Agents as Smartphone Users
Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[64]
[cls] attention is all you need for training- free visual token pruning: Make vlm inference faster
Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, and Shang- hang Zhang. [cls] attention is all you need for training- free visual token pruning: Make vlm inference faster. arXiv preprint arXiv:2412.01818, 2024. 2, 6, 8, 10
-
[65]
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Vi- sual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024. 2, 3, 5, 6, 8, 10
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[66]
H2o: Heavy-hitter ora- cle for efficient generative inference of large language mod- els
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher R´e, Clark Barrett, et al. H2o: Heavy-hitter ora- cle for efficient generative inference of large language mod- els. Advances in Neural Information Processing Systems, 36: 34661–34710, 2023. 8
work page 2023
-
[67]
Treat visual tokens as text? but your mllm only needs fewer efforts to see
Zeliang Zhang, Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Jianing Zhou, Daniel Miranda, Ajinkya Kale, and Chen- liang Xu. Treat visual tokens as text? but your mllm only needs fewer efforts to see. arXiv preprint arXiv:2410.06169,
-
[68]
Cross-modal information flow in multimodal large language models
Zhi Zhang, Srishti Yadav, Fengze Han, and Ekaterina Shutova. Cross-modal information flow in multimodal large language models. arXiv preprint arXiv:2411.18620 , 2024. 5, 8
-
[69]
A stitch in time saves nine: Small vlm is a precise guidance for accelerating large vlms
Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, Yib- ing Song, Kai Wang, Zhangyang Wang, and Yang You. A stitch in time saves nine: Small vlm is a precise guidance for accelerating large vlms. arXiv preprint arXiv:2412.03324 ,
-
[70]
A Survey on Efficient Inference for Large Language Models
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.