Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models

Hongyuan Zhang; Jun Yu; Mingyang Wang; Tao Wei; Weijun Zhang; Wenwen Pan; Yan Yang; Zhenwei Shao; Zhou Yu

arxiv: 2503.14075 · v3 · submitted 2025-03-18 · 💻 cs.CV · cs.CL

Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models

Zhenwei Shao , Mingyang Wang , Weijun Zhang , Zhou Yu , Wenwen Pan , Yan Yang , Tao Wei , Hongyuan Zhang

show 1 more author

Jun Yu

This is my paper

Pith reviewed 2026-05-22 23:55 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords vision-language modelstoken pruningmodel accelerationspeculative decodingdistillationreinforcement learningvisual token reduction

0 comments

The pith

A twig module added to an early VLM layer supplies pruning signals that outperform early attention maps, enabling 88.9 percent token reduction with 96 percent performance retained plus 154 percent speedup on long responses via self-specul

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TwigVLM as a way to accelerate large vision-language models by growing a small additional module called a twig on top of an early transformer layer. This twig is trained to produce more reliable signals for deciding which visual tokens can be dropped without harming the final answer. The same module also supports self-speculative decoding that re-uses early computations to shorten generation time even when the output contains dozens of tokens. TwigVLM++ extends the idea to multiple pruning heads and trains them first by distillation then by reinforcement learning that directly rewards higher accuracy after aggressive pruning. If the approach holds, VLMs could run substantially faster on the same hardware while keeping most of their multimodal capability.

Core claim

Growing a lightweight twig module on an early layer of a base VLM allows twig-guided token pruning that keeps 96 percent of original accuracy after removing 88.9 percent of visual tokens and simultaneously enables self-speculative decoding that yields 154 percent speedup on long responses; the multi-head TwigVLM++ variant further improves pruning quality through a two-stage process of distillation followed by pruning-oriented reinforcement learning and tree-based speculative decoding.

What carries the argument

The twig module, a lightweight addition placed on an early layer of the base VLM that learns to output pruning decisions and speculative tokens.

If this is right

Pruning 88.9 percent of visual tokens preserves 96 percent of the base model's performance on standard VLM benchmarks.
Self-speculative decoding delivers 154 percent higher throughput when generating long answers compared with standard decoding.
The multi-head twig trained by distillation then reinforcement learning produces higher-quality pruning decisions than single-head or attention-only baselines.
Tree-based speculative decoding in TwigVLM++ further increases generation speed beyond the single-path SSD strategy.
The method outperforms prior token-pruning approaches on both accuracy and speed when applied to LLaVA-1.5-7B.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same twig architecture could be attached to other VLM families beyond LLaVA without retraining the entire model.
Because the twig is small and trained once, the acceleration benefit scales to many downstream tasks that reuse the same base VLM.
Combining the pruning head with other compression techniques such as quantization might compound the speed gains.
The reinforcement-learning stage that directly optimizes post-pruning accuracy could be adapted to reward other metrics such as latency or energy use.

Load-bearing premise

The signals produced by the trained twig remain more accurate than the base model's early-layer attention maps across different VLMs and response lengths.

What would settle it

On a new VLM or on responses longer than those tested, measuring whether twig-guided pruning accuracy drops below the accuracy obtained by simply using early-layer attention maps for the same pruning ratio.

Figures

Figures reproduced from arXiv: 2503.14075 by Hongyuan Zhang, Jun Yu, Mingyang Wang, Tao Wei, Weijun Zhang, Wenwen Pan, Yan Yang, Zhenwei Shao, Zhou Yu.

**Figure 1.** Figure 1: Our TwigVLM overview. (a) Given a base deep VLM, the TwigVLM is obtained by freezing the base VLM while training a shallow twig block upon its early layer. (b) Compared to the VLM acceleration methods based on visual token pruning (e.g., FastV [9]), TwigVLM not only achieves better accuracy retention but also yields higher generation speed. Both FastV and TwigVLM take LLaVA-1.5-7B [32] as the base VLM (r… view at source ↗

**Figure 3.** Figure 3: Prefilling and decoding time costs. (a) Prefilling time (gray dotted line) and decoding time for LLaVA-1.5-7B [32] with different response lengths. (b) Prefilling (P) and decoding (D) time comparisons of LLaVA-1.5-7B and its FastV-based variant. son of different token pruning methods, the average number of retained tokens R¯ is used and defined as follows1 : R¯ = [M × K + R × (L − K)]/L (4) 2.2. Study 1: A… view at source ↗

**Figure 4.** Figure 4: Training and two-stage inference of TwigVLM. (a) The twig block is initialized from the base VLM and is coupled with the first K layers of base VLM to form a shallow model, which can be trained efficiently. (b) In the prefilling stage, different from previous approaches that perform token selection based on the attention maps from the base VLM, TwigVLM introduces a twig-guided token pruning (TTP) strategy … view at source ↗

**Figure 5.** Figure 5: Visualized attention map comparisons of TwigVLM and two typical token pruning methods. The visualized attention maps show that TwigVLM identifies accurate visual tokens to the prompt and predicts the right answer, while both counterparts fail to do that. More examples are provided in the supplementary. consistently surpasses all the state-of-the-art counterparts on most VideoQA benchmarks and achieves 3.1%… view at source ↗

**Figure 6.** Figure 6: Generation speed comparisons on two benchmarks: (a) TextVQA with short responses and (b) MM-Vet with long responses. S¯ denotes the average number of generated tokens on the whole benchmark. The RelSpd of each bar is highlighted in red. Generation speed comparisons. As mentioned above, TwigVLM can effectively accelerate the generation. To validate this, we conduct intensive experiments on two typical be… view at source ↗

**Figure 7.** Figure 7: Performance comparisons of TwigVLM models trained [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: Visualization of attention maps and predictions for FastV [ [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: Examples of the generated responses using the self-speculative decoding (SSD) on MM-Vet [ [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

read the original abstract

Large vision-language models (VLMs) have demonstrated remarkable capabilities in open-world multimodal understanding, yet their high computational overheads pose great challenges for practical deployment. Some recent works have proposed methods to accelerate VLMs by pruning redundant visual tokens guided by the attention maps of VLM's early layers. Despite the success of these token pruning methods, they still suffer from two major shortcomings: (i) considerable accuracy drop due to insensitive attention signals in early layers, and (ii) limited speedup when generating long responses (e.g., 30 tokens). To address the limitations above, we present TwigVLM -- a simple and general architecture by growing a lightweight module, named twig, upon an early layer of the base VLM. Compared with most existing VLM acceleration methods purely based on visual token pruning, our TwigVLM not only achieves better accuracy retention by employing a twig-guided token pruning (TTP) strategy, but also yields higher generation speed by utilizing a self-speculative decoding (SSD) strategy. Taking LLaVA-1.5-7B as the base VLM, experimental results show that TwigVLM preserves 96% of the original performance after pruning 88.9% of the visual tokens and achieves 154% speedup in generating long responses, delivering significantly better performance in terms of both accuracy and speed over the state-of-the-art VLM acceleration methods. Moreover, we extend TwigVLM to an improved TwigVLM++ variant by introducing a novel multi-head twig architecture with a specialized pruning head. TwigVLM++ improves pruning quality via a two-stage training paradigm combining a distillation learning stage and a pruning-oriented reinforcement learning stage, and further accelerates inference via a tree-based SSD strategy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TwigVLM grows a small trainable module on early VLM layers to combine token pruning with self-speculative decoding, reporting 96% accuracy retention at 88.9% pruning plus 154% long-response speedup on LLaVA-1.5-7B.

read the letter

The paper introduces TwigVLM, a lightweight twig module attached to an early layer of a base VLM. It uses the twig for guided token pruning (TTP) and self-speculative decoding (SSD) to tackle both accuracy loss from crude early attention and slow generation on longer outputs. TwigVLM++ adds a multi-head version trained first by distillation then by pruning-oriented RL, plus tree-based SSD. The headline numbers on LLaVA-1.5-7B look practical for deployment: 96% performance kept after heavy pruning and a clear speed gain over prior pruning-only methods. The two-stage training and multi-head design are concrete engineering moves that extend existing token-pruning and speculative-decoding work without claiming a new paradigm. The abstract is clear on the claimed gains and the motivation. The main soft spot is that the results are presented without visible ablations on the twig versus early attention, error bars, or controls for the RL stage, so it is hard to judge how reliably the twig signals beat the baseline or whether they hold across models and response lengths. The claims rest on empirical comparisons rather than any closed-form derivation. This is for readers focused on efficient VLM inference and hardware deployment. If the full experiments include proper controls and the numbers hold, the work is worth referee time for the practical angle even if the advance is incremental.

Referee Report

2 major / 1 minor

Summary. The paper introduces TwigVLM, a lightweight 'twig' module attached to an early layer of a base VLM (e.g., LLaVA-1.5-7B) that enables twig-guided token pruning (TTP) and self-speculative decoding (SSD) to accelerate inference. It claims to retain 96% of original performance after pruning 88.9% of visual tokens while delivering 154% speedup on long responses, outperforming prior attention-map-based pruning methods. TwigVLM++ extends this with a multi-head twig trained via a two-stage process (distillation followed by pruning-oriented reinforcement learning) and a tree-based SSD strategy.

Significance. If the empirical results hold, the work provides a practical architecture for improving the accuracy-speed trade-off in VLM acceleration by learning pruning signals rather than relying solely on early-layer attention. The two-stage training paradigm combining distillation and RL, along with the SSD extension, represents a concrete advance over existing token-pruning baselines if the twig's superiority is shown to generalize.

major comments (2)

[Abstract] Abstract: the central empirical claims (96% performance retention at 88.9% pruning; 154% speedup on long responses) are presented without error bars, dataset specifications, or ablation controls, which are load-bearing for verifying that twig-derived signals are reliably superior to early-layer attention maps.
[Abstract] Abstract: the claim that the multi-head twig with two-stage (distillation + RL) training improves pruning quality over the single-head version rests on the unverified assumption that the RL stage avoids new failure modes across response lengths and base VLMs; no direct comparison or robustness test is referenced.

minor comments (1)

[Abstract] Abstract: the phrasing '154% speedup' is ambiguous (could mean 1.54x or 2.54x); standard multiplicative notation should be used for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the two major comments below and will incorporate clarifications where appropriate in a revised manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the central empirical claims (96% performance retention at 88.9% pruning; 154% speedup on long responses) are presented without error bars, dataset specifications, or ablation controls, which are load-bearing for verifying that twig-derived signals are reliably superior to early-layer attention maps.

Authors: We agree the abstract is concise and omits these details. The full manuscript reports error bars across multiple runs in the main results tables and figures, evaluates on standard benchmarks (VQA-v2, GQA, POPE, MME, TextVQA), and provides ablations in Section 4.3 directly comparing twig-guided signals to early-layer attention maps. We will revise the abstract to name the primary datasets and note that supporting error bars and ablations appear in the experiments section. revision: yes
Referee: [Abstract] Abstract: the claim that the multi-head twig with two-stage (distillation + RL) training improves pruning quality over the single-head version rests on the unverified assumption that the RL stage avoids new failure modes across response lengths and base VLMs; no direct comparison or robustness test is referenced.

Authors: The manuscript includes direct head-to-head comparisons between single-head TwigVLM and multi-head TwigVLM++ (with the two-stage distillation+RL pipeline) in Tables 1–4 and Section 4.4, along with results across varying response lengths and multiple base VLMs. These experiments show the RL stage improves pruning quality without introducing the failure modes raised. We will update the abstract to explicitly reference these comparisons. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

full rationale

The paper introduces TwigVLM as an architectural addition (lightweight twig module grown on an early VLM layer) plus two-stage training (distillation then RL) for TTP and SSD strategies. All reported outcomes—96% performance retention after 88.9% token pruning and 154% speedup—are presented as results of direct experiments on LLaVA-1.5-7B and comparisons to prior acceleration methods. No equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citation chains are used to derive the performance numbers; the central claims remain falsifiable against independent test sets and baselines.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 1 invented entities

Review performed on abstract only; the ledger therefore records only the high-level assumptions visible in the abstract. The central claim rests on the existence of a trainable lightweight module that can outperform early-layer attention for pruning and on standard supervised-plus-RL training dynamics.

free parameters (1)

twig architecture hyperparameters
Number of heads, hidden size, and placement layer of the twig module are chosen to make the pruning and decoding strategies work; these are not derived from first principles.

axioms (2)

domain assumption Early-layer signals in the base VLM can be improved upon by a separately trained lightweight module for token importance
Invoked to justify replacing attention-map pruning with twig-guided pruning.
domain assumption Standard distillation followed by reinforcement learning on a pruning reward produces a module that generalizes across prompts and response lengths
Required for the two-stage training paradigm of TwigVLM++ to deliver the claimed accuracy retention.

invented entities (1)

twig module no independent evidence
purpose: Lightweight add-on that supplies pruning decisions and enables self-speculative decoding
New module introduced by the paper; no independent evidence outside the reported experiments is supplied.

pith-pipeline@v0.9.0 · 5871 in / 1632 out tokens · 29059 ms · 2026-05-22T23:55:56.563899+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 18 internal anchors

[1]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Token merging: Your vit but faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. In The Eleventh International Conference on Learning Representations, 2023. 8

work page 2023
[3]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, Sand- hini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, S...

work page 1901
[5]

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Ja- son D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decod- ing heads. arXiv preprint arXiv:2401.10774, 2024. 8

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Auroracap: Efficient, performant video detailed captioning and a new benchmark

Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Sain- ing Xie, and Christopher D Manning. Auroracap: Efficient, performant video detailed captioning and a new benchmark. arXiv preprint arXiv:2410.03051, 2024. 8

work page arXiv 2024
[7]

Chen and William B

David L. Chen and William B. Dolan. Collecting highly par- allel data for paraphrase evaluation. In ACL, Portland, OR,

work page
[8]

Llavolta: Efficient multi-modal models via stage-wise visual context compression

Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, and Alan Yuille. Llavolta: Efficient multi-modal models via stage-wise visual context compression. arXiv preprint arXiv:2406.20092, 2024. 8

work page arXiv 2024
[9]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19–35. Springer, 2024. 1, 2, 3, 5, 6, 7, 8, 10, 12, 13

work page 2024
[10]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101,

work page
[11]

Efficient transformer inference with stati- cally structured sparse attention

Steve Dai, Hasan Genc, Rangharajan Venkatesan, and Brucek Khailany. Efficient transformer inference with stati- cally structured sparse attention. In 2023 60th ACM/IEEE Design Automation Conference (DAC) , pages 1–6. IEEE,

work page 2023
[12]

Fu, Stefano Ermon, Atri Rudra, and Christopher R´e

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R´e. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022. 8

work page 2022
[13]

Layer- skip: Enabling early exit inference and self-speculative de- coding

Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, et al. Layer- skip: Enabling early exit inference and self-speculative de- coding. arXiv preprint arXiv:2404.16710, 2024. 2, 4, 5, 8, 11

work page arXiv 2024
[14]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 6, 12

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

On speculative de- coding for multimodal large language models

Mukul Gagrani, Raghavv Goel, Wonseok Jeon, Junyoung Park, Mingu Lee, and Christopher Lott. On speculative de- coding for multimodal large language models. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8285–8289, 2024. 8

work page 2024
[16]

Making the v in vqa matter: Ele- vating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Ele- vating the role of image understanding in visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 6904– 6913, 2017. 6, 15

work page 2017
[17]

Mamba: Linear-time sequence mod- eling with selective state spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence mod- eling with selective state spaces. InFirst Conference on Lan- guage Modeling, 2024. 8

work page 2024
[18]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale, 2024

Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xi- ang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale, 2024. 11

work page 2024
[20]

Re- thinking token reduction in mllms: Towards a unified paradigm for training-free acceleration

Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, and Siteng Huang. Re- thinking token reduction in mllms: Towards a unified paradigm for training-free acceleration. arXiv preprint arXiv:2411.17686, 2024. 8

work page arXiv 2024
[21]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2019. 3, 6, 12, 13

work page 2019
[22]

Tgif-qa: Toward spatio-temporal reasoning in visual question answering

Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 2758–2766, 2017. 6, 15

work page 2017
[23]

Llmlingua: Compressing prompts for acceler- ated inference of large language models

Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for acceler- ated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Lan- guage Processing, pages 13358–13376, 2023. 8

work page 2023
[24]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Fast in- ference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast in- ference from transformers via speculative decoding. In In- ternational Conference on Machine Learning, pages 19274– 19286. PMLR, 2023. 8, 11

work page 2023
[26]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv:2305.10355, 2023. 15

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual rep- resentation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 5, 9

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 8

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Kangaroo: Loss- less self-speculative decoding for accelerating llms via dou- ble early exiting

Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Duyu Tang, Kai Han, and Yunhe Wang. Kangaroo: Loss- less self-speculative decoding for accelerating llms via dou- ble early exiting. Advances in Neural Information Process- ing Systems, 37:11946–11965, 2025. 5, 8, 11

work page 2025
[31]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, pages 34892–34916, 2023. 1

work page 2023
[32]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 1, 3, 4, 5, 9

work page 2024
[33]

Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024. 5

work page 2024
[34]

Multi-stage vision token dropping: Towards efficient multimodal large language model

Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, and Linfeng Zhang. Multi-stage vision token dropping: Towards efficient multimodal large language model. arXiv preprint arXiv:2411.10803, 2024. 2, 3, 6, 8, 10

work page arXiv 2024
[35]

Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2025. 6, 12

work page 2025
[36]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems , 35:2507–2521,

work page
[37]

Layoutllm: Layout instruction tuning with large language models for document understanding

Chuwei Luo, Yufan Shen, Zhaoqing Zhu, Qi Zheng, Zhi Yu, and Cong Yao. Layoutllm: Layout instruction tuning with large language models for document understanding. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15630–15640, 2024. 1

work page 2024
[38]

Groma: Localized visual tokenization for grounding multimodal large language models

Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiao- juan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. In European Conference on Computer Vision, pages 417–435. Springer, 2024. 1

work page 2024
[39]

Video-chatgpt: Towards detailed video understanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fa- had Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Pro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024. 9

work page 2024
[40]

Learning to com- press prompts with gist tokens

Jesse Mu, Xiang Li, and Noah Goodman. Learning to com- press prompts with gist tokens. Advances in Neural Informa- tion Processing Systems, 36:19327–19352, 2023. 8

work page 2023
[41]

Gpt-4v(ision) system card

OpenAI. Gpt-4v(ision) system card. Technical report, Ope- nAI, 2023. 1

work page 2023
[42]

Ground- ing multimodal large language models to the world

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Qixiang Ye, and Furu Wei. Ground- ing multimodal large language models to the world. In The Twelfth International Conference on Learning Representa- tions, 2024. 1

work page 2024
[43]

Llava-prumerge: Adaptive token reduction for efficient large multimodal models

Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. arXiv preprint arXiv:2403.15388,

work page arXiv
[44]

Imp: Highly capable large multimodal models for mobile devices

Zhenwei Shao, Zhou Yu, Jun Yu, Xuecheng Ouyang, Lihao Zheng, Zhenbiao Gai, Mingyang Wang, Zhenzhong Kuang, and Jiajun Ding. Imp: Highly capable large multimodal models for mobile devices. IEEE Transactions on Multime- dia, 2025. 1

work page 2025
[45]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019. 3, 6, 7, 13, 15

work page 2019
[46]

You only cache once: Decoder-decoder architectures for lan- guage models

Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, and Furu Wei. You only cache once: Decoder-decoder architectures for lan- guage models. Advances in Neural Information Processing Systems, 37:7339–7361, 2025. 8

work page 2025
[47]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Fastvlm: Efficient vision encoding for vision language models

Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, et al. Fastvlm: Efficient vision encoding for vision language models. arXiv preprint arXiv:2412.13303, 2024. 8

work page arXiv 2024
[49]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017. 1

work page 2017
[50]

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

Accelerating multimodal large lan- guage models via dynamic visual-token exit and the empiri- cal findings

Qiong Wu, Wenhao Lin, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, and Rongrong Ji. Accelerating multimodal large lan- guage models via dynamic visual-token exit and the empiri- cal findings. arXiv preprint arXiv:2411.19628, 2024. 8

work page arXiv 2024
[52]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024. 8

work page 2024
[53]

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy re- duction. arXiv preprint arXiv:2410.17247 , 2024. 5, 6, 8, 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

Video question answer- ing via gradually refined attention over appearance and mo- tion

Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answer- ing via gradually refined attention over appearance and mo- tion. In Proceedings of the ACM international conference on Multimedia, pages 1645–1653, 2017. 6, 15

work page 2017
[55]

Visionzip: Longer is better but not necessary in vision language models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. arXiv preprint arXiv:2412.04467, 2024. 2, 5, 6, 8, 10, 12, 13

work page arXiv 2024
[56]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[57]

mplug-docowl: Modularized multimodal large language model for document understanding

Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Jun- feng Tian, et al. mplug-docowl: Modularized multimodal large language model for document understanding. arXiv preprint arXiv:2307.02499, 2023. 1

work page arXiv 2023
[58]

Fit and prune: Fast and training-free visual token pruning for multi-modal large language models

Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. arXiv preprint arXiv:2409.10197, 2024. 8

work page arXiv 2024
[59]

Atp-llava: Adaptive token pruning for large vision language models

Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Yansong Tang. Atp-llava: Adaptive token pruning for large vision language models. arXiv preprint arXiv:2412.00447,

work page arXiv
[60]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 7, 14, 15

work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

Activitynet-qa: A dataset for understanding complex web videos via question answering

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, pages 9127–9134, 2019. 6, 15

work page 2019
[62]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for ex- pert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for ex- pert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 9556– 9567, 2024. 15

work page 2024
[63]

AppAgent: Multimodal Agents as Smartphone Users

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[64]

[cls] attention is all you need for training- free visual token pruning: Make vlm inference faster

Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, and Shang- hang Zhang. [cls] attention is all you need for training- free visual token pruning: Make vlm inference faster. arXiv preprint arXiv:2412.01818, 2024. 2, 6, 8, 10

work page arXiv 2024
[65]

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Vi- sual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024. 2, 3, 5, 6, 8, 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[66]

H2o: Heavy-hitter ora- cle for efficient generative inference of large language mod- els

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher R´e, Clark Barrett, et al. H2o: Heavy-hitter ora- cle for efficient generative inference of large language mod- els. Advances in Neural Information Processing Systems, 36: 34661–34710, 2023. 8

work page 2023
[67]

Treat visual tokens as text? but your mllm only needs fewer efforts to see

Zeliang Zhang, Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Jianing Zhou, Daniel Miranda, Ajinkya Kale, and Chen- liang Xu. Treat visual tokens as text? but your mllm only needs fewer efforts to see. arXiv preprint arXiv:2410.06169,

work page arXiv
[68]

Cross-modal information flow in multimodal large language models

Zhi Zhang, Srishti Yadav, Fengze Han, and Ekaterina Shutova. Cross-modal information flow in multimodal large language models. arXiv preprint arXiv:2411.18620 , 2024. 5, 8

work page arXiv 2024
[69]

A stitch in time saves nine: Small vlm is a precise guidance for accelerating large vlms

Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, Yib- ing Song, Kai Wang, Zhangyang Wang, and Yang You. A stitch in time saves nine: Small vlm is a precise guidance for accelerating large vlms. arXiv preprint arXiv:2412.03324 ,

work page arXiv
[70]

A Survey on Efficient Inference for Large Language Models

Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[1] [1]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Token merging: Your vit but faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. In The Eleventh International Conference on Learning Representations, 2023. 8

work page 2023

[3] [3]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, Sand- hini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, S...

work page 1901

[5] [5]

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Ja- son D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decod- ing heads. arXiv preprint arXiv:2401.10774, 2024. 8

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Auroracap: Efficient, performant video detailed captioning and a new benchmark

Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jenq-Neng Hwang, Sain- ing Xie, and Christopher D Manning. Auroracap: Efficient, performant video detailed captioning and a new benchmark. arXiv preprint arXiv:2410.03051, 2024. 8

work page arXiv 2024

[7] [7]

Chen and William B

David L. Chen and William B. Dolan. Collecting highly par- allel data for paraphrase evaluation. In ACL, Portland, OR,

work page

[8] [8]

Llavolta: Efficient multi-modal models via stage-wise visual context compression

Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, and Alan Yuille. Llavolta: Efficient multi-modal models via stage-wise visual context compression. arXiv preprint arXiv:2406.20092, 2024. 8

work page arXiv 2024

[9] [9]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19–35. Springer, 2024. 1, 2, 3, 5, 6, 7, 8, 10, 12, 13

work page 2024

[10] [10]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101,

work page

[11] [11]

Efficient transformer inference with stati- cally structured sparse attention

Steve Dai, Hasan Genc, Rangharajan Venkatesan, and Brucek Khailany. Efficient transformer inference with stati- cally structured sparse attention. In 2023 60th ACM/IEEE Design Automation Conference (DAC) , pages 1–6. IEEE,

work page 2023

[12] [12]

Fu, Stefano Ermon, Atri Rudra, and Christopher R´e

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R´e. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022. 8

work page 2022

[13] [13]

Layer- skip: Enabling early exit inference and self-speculative de- coding

Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, et al. Layer- skip: Enabling early exit inference and self-speculative de- coding. arXiv preprint arXiv:2404.16710, 2024. 2, 4, 5, 8, 11

work page arXiv 2024

[14] [14]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 6, 12

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

On speculative de- coding for multimodal large language models

Mukul Gagrani, Raghavv Goel, Wonseok Jeon, Junyoung Park, Mingu Lee, and Christopher Lott. On speculative de- coding for multimodal large language models. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8285–8289, 2024. 8

work page 2024

[16] [16]

Making the v in vqa matter: Ele- vating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Ele- vating the role of image understanding in visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 6904– 6913, 2017. 6, 15

work page 2017

[17] [17]

Mamba: Linear-time sequence mod- eling with selective state spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence mod- eling with selective state spaces. InFirst Conference on Lan- guage Modeling, 2024. 8

work page 2024

[18] [18]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale, 2024

Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, and Xi- ang Yue. Mammoth-vl: Eliciting multimodal reasoning with instruction tuning at scale, 2024. 11

work page 2024

[20] [20]

Re- thinking token reduction in mllms: Towards a unified paradigm for training-free acceleration

Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, and Siteng Huang. Re- thinking token reduction in mllms: Towards a unified paradigm for training-free acceleration. arXiv preprint arXiv:2411.17686, 2024. 8

work page arXiv 2024

[21] [21]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2019. 3, 6, 12, 13

work page 2019

[22] [22]

Tgif-qa: Toward spatio-temporal reasoning in visual question answering

Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 2758–2766, 2017. 6, 15

work page 2017

[23] [23]

Llmlingua: Compressing prompts for acceler- ated inference of large language models

Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for acceler- ated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Lan- guage Processing, pages 13358–13376, 2023. 8

work page 2023

[24] [24]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[25] [25]

Fast in- ference from transformers via speculative decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast in- ference from transformers via speculative decoding. In In- ternational Conference on Machine Learning, pages 19274– 19286. PMLR, 2023. 8, 11

work page 2023

[26] [26]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv:2305.10355, 2023. 15

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [27]

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814,

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual rep- resentation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 5, 9

work page internal anchor Pith review Pith/arXiv arXiv 2023

[29] [29]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 8

work page internal anchor Pith review Pith/arXiv arXiv 2024

[30] [30]

Kangaroo: Loss- less self-speculative decoding for accelerating llms via dou- ble early exiting

Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Duyu Tang, Kai Han, and Yunhe Wang. Kangaroo: Loss- less self-speculative decoding for accelerating llms via dou- ble early exiting. Advances in Neural Information Process- ing Systems, 37:11946–11965, 2025. 5, 8, 11

work page 2025

[31] [31]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, pages 34892–34916, 2023. 1

work page 2023

[32] [32]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 1, 3, 4, 5, 9

work page 2024

[33] [33]

Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- proved reasoning, ocr, and world knowledge, 2024. 5

work page 2024

[34] [34]

Multi-stage vision token dropping: Towards efficient multimodal large language model

Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, and Linfeng Zhang. Multi-stage vision token dropping: Towards efficient multimodal large language model. arXiv preprint arXiv:2411.10803, 2024. 2, 3, 6, 8, 10

work page arXiv 2024

[35] [35]

Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2025. 6, 12

work page 2025

[36] [36]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems , 35:2507–2521,

work page

[37] [37]

Layoutllm: Layout instruction tuning with large language models for document understanding

Chuwei Luo, Yufan Shen, Zhaoqing Zhu, Qi Zheng, Zhi Yu, and Cong Yao. Layoutllm: Layout instruction tuning with large language models for document understanding. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15630–15640, 2024. 1

work page 2024

[38] [38]

Groma: Localized visual tokenization for grounding multimodal large language models

Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiao- juan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. In European Conference on Computer Vision, pages 417–435. Springer, 2024. 1

work page 2024

[39] [39]

Video-chatgpt: Towards detailed video understanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fa- had Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Pro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024. 9

work page 2024

[40] [40]

Learning to com- press prompts with gist tokens

Jesse Mu, Xiang Li, and Noah Goodman. Learning to com- press prompts with gist tokens. Advances in Neural Informa- tion Processing Systems, 36:19327–19352, 2023. 8

work page 2023

[41] [41]

Gpt-4v(ision) system card

OpenAI. Gpt-4v(ision) system card. Technical report, Ope- nAI, 2023. 1

work page 2023

[42] [42]

Ground- ing multimodal large language models to the world

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, Qixiang Ye, and Furu Wei. Ground- ing multimodal large language models to the world. In The Twelfth International Conference on Learning Representa- tions, 2024. 1

work page 2024

[43] [43]

Llava-prumerge: Adaptive token reduction for efficient large multimodal models

Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. arXiv preprint arXiv:2403.15388,

work page arXiv

[44] [44]

Imp: Highly capable large multimodal models for mobile devices

Zhenwei Shao, Zhou Yu, Jun Yu, Xuecheng Ouyang, Lihao Zheng, Zhenbiao Gai, Mingyang Wang, Zhenzhong Kuang, and Jiajun Ding. Imp: Highly capable large multimodal models for mobile devices. IEEE Transactions on Multime- dia, 2025. 1

work page 2025

[45] [45]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019. 3, 6, 7, 13, 15

work page 2019

[46] [46]

You only cache once: Decoder-decoder architectures for lan- guage models

Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, and Furu Wei. You only cache once: Decoder-decoder architectures for lan- guage models. Advances in Neural Information Processing Systems, 37:7339–7361, 2025. 8

work page 2025

[47] [47]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[48] [48]

Fastvlm: Efficient vision encoding for vision language models

Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokul Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, et al. Fastvlm: Efficient vision encoding for vision language models. arXiv preprint arXiv:2412.13303, 2024. 8

work page arXiv 2024

[49] [49]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017. 1

work page 2017

[50] [50]

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[51] [51]

Accelerating multimodal large lan- guage models via dynamic visual-token exit and the empiri- cal findings

Qiong Wu, Wenhao Lin, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, and Rongrong Ji. Accelerating multimodal large lan- guage models via dynamic visual-token exit and the empiri- cal findings. arXiv preprint arXiv:2411.19628, 2024. 8

work page arXiv 2024

[52] [52]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024. 8

work page 2024

[53] [53]

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy re- duction. arXiv preprint arXiv:2410.17247 , 2024. 5, 6, 8, 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[54] [54]

Video question answer- ing via gradually refined attention over appearance and mo- tion

Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answer- ing via gradually refined attention over appearance and mo- tion. In Proceedings of the ACM international conference on Multimedia, pages 1645–1653, 2017. 6, 15

work page 2017

[55] [55]

Visionzip: Longer is better but not necessary in vision language models

Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. arXiv preprint arXiv:2412.04467, 2024. 2, 5, 6, 8, 10, 12, 13

work page arXiv 2024

[56] [56]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[57] [57]

mplug-docowl: Modularized multimodal large language model for document understanding

Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Jun- feng Tian, et al. mplug-docowl: Modularized multimodal large language model for document understanding. arXiv preprint arXiv:2307.02499, 2023. 1

work page arXiv 2023

[58] [58]

Fit and prune: Fast and training-free visual token pruning for multi-modal large language models

Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. arXiv preprint arXiv:2409.10197, 2024. 8

work page arXiv 2024

[59] [59]

Atp-llava: Adaptive token pruning for large vision language models

Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, and Yansong Tang. Atp-llava: Adaptive token pruning for large vision language models. arXiv preprint arXiv:2412.00447,

work page arXiv

[60] [60]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 7, 14, 15

work page internal anchor Pith review Pith/arXiv arXiv 2023

[61] [61]

Activitynet-qa: A dataset for understanding complex web videos via question answering

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, pages 9127–9134, 2019. 6, 15

work page 2019

[62] [62]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for ex- pert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for ex- pert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 9556– 9567, 2024. 15

work page 2024

[63] [63]

AppAgent: Multimodal Agents as Smartphone Users

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023

[64] [64]

[cls] attention is all you need for training- free visual token pruning: Make vlm inference faster

Qizhe Zhang, Aosong Cheng, Ming Lu, Zhiyong Zhuo, Minqi Wang, Jiajun Cao, Shaobo Guo, Qi She, and Shang- hang Zhang. [cls] attention is all you need for training- free visual token pruning: Make vlm inference faster. arXiv preprint arXiv:2412.01818, 2024. 2, 6, 8, 10

work page arXiv 2024

[65] [65]

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Vi- sual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024. 2, 3, 5, 6, 8, 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[66] [66]

H2o: Heavy-hitter ora- cle for efficient generative inference of large language mod- els

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher R´e, Clark Barrett, et al. H2o: Heavy-hitter ora- cle for efficient generative inference of large language mod- els. Advances in Neural Information Processing Systems, 36: 34661–34710, 2023. 8

work page 2023

[67] [67]

Treat visual tokens as text? but your mllm only needs fewer efforts to see

Zeliang Zhang, Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Jianing Zhou, Daniel Miranda, Ajinkya Kale, and Chen- liang Xu. Treat visual tokens as text? but your mllm only needs fewer efforts to see. arXiv preprint arXiv:2410.06169,

work page arXiv

[68] [68]

Cross-modal information flow in multimodal large language models

Zhi Zhang, Srishti Yadav, Fengze Han, and Ekaterina Shutova. Cross-modal information flow in multimodal large language models. arXiv preprint arXiv:2411.18620 , 2024. 5, 8

work page arXiv 2024

[69] [69]

A stitch in time saves nine: Small vlm is a precise guidance for accelerating large vlms

Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, Yib- ing Song, Kai Wang, Zhangyang Wang, and Yang You. A stitch in time saves nine: Small vlm is a precise guidance for accelerating large vlms. arXiv preprint arXiv:2412.03324 ,

work page arXiv

[70] [70]

A Survey on Efficient Inference for Large Language Models

Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024