Improving MLLM Training Efficiency via Stage-Aware Sparsity

Baobao Chang; Haozhe Zhao; Kean Shi; Liang Chen

arxiv: 2509.18150 · v2 · pith:CMW2D6KRnew · submitted 2025-09-16 · 💻 cs.LG · cs.AI

Kean Shi, Liang Chen, Haozhe Zhao, Baobao Chang

Pith reviewed 2026-05-21 21:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords multimodal large language models · training efficiency · sparse training · stage-aware · visual tokens · layer skipping · modality alignment · instruction tuning

The pith

A stage-aware sparsity scheme trains MLLMs more efficiently by using different compression techniques at different training phases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that redundancy in MLLM training varies by stage, so a uniform sparsity approach is suboptimal. Instead, it proposes the Sparse Training Scheme that applies visual token compression during initial modality alignment to reduce input length and dynamic layer skipping during later instruction tuning to avoid unnecessary computations. This matters for practitioners because full training of these models requires significant compute resources, and stage-specific adjustments could lower those costs while preserving accuracy. Evaluations on standard benchmarks support that the method works across different model architectures. If correct, it suggests that training processes can be optimized by observing and adapting to phase-specific inefficiencies rather than applying blanket reductions.

Core claim

The Sparse Training Scheme (STS) adapts sparsity to different sources of redundancy in MLLM training: the Visual Token Compressor reduces visual token load during modality alignment, and the Layer Dynamic Skipper dynamically skips layers during instruction tuning, leading to improved training efficiency without substantial loss in performance as verified on multiple benchmarks.

What carries the argument

The stage-aware design of the Sparse Training Scheme (STS), which switches between visual token compression and dynamic layer skipping depending on the training stage to target varying redundancies.
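One way to picture the stage-aware switch is as a small dispatch table. The sketch below is illustrative only: the stage names, `SparsityPlan`, and `plan_for_stage` are hypothetical and are not taken from the paper, which this page summarizes without implementation detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparsityPlan:
    """Which sparsity component is active in a training stage (hypothetical)."""
    compress_visual_tokens: bool  # Visual Token Compressor on/off
    skip_layers: bool             # Layer Dynamic Skipper on/off


def plan_for_stage(stage: str) -> SparsityPlan:
    """Map a training stage to the component the summary assigns to it."""
    if stage == "modality_alignment":
        # Early stage: shorten the visual input sequence.
        return SparsityPlan(compress_visual_tokens=True, skip_layers=False)
    if stage == "instruction_tuning":
        # Later stage: skip layers instead of compressing tokens.
        return SparsityPlan(compress_visual_tokens=False, skip_layers=True)
    raise ValueError(f"unknown training stage: {stage!r}")
```

Keeping the mapping in one place makes the separability premise explicit: each stage enables exactly one component, which is the assumption the referee report questions.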

If this is right

Lowers the computational burden of processing long multimodal sequences in early training.
Reduces inter-layer computation overhead in later training stages.
Applies broadly to various MLLM architectures.
Preserves model performance across evaluated benchmarks.
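To get a feel for the first implication, the sketch below estimates how per-layer cost shrinks when fewer visual tokens are kept. It assumes a commonly used rough estimate for one transformer layer's forward pass, 24·n·d² + 4·n²·d for sequence length n and hidden size d; that estimate, the function names, and the token counts in the usage note are assumptions for illustration, not figures from the paper.

```python
def layer_flops(seq_len: int, hidden: int) -> int:
    """Rough forward FLOPs for one transformer layer (assumed estimate)."""
    linear_term = 24 * seq_len * hidden ** 2     # projections and MLP, linear in n
    attention_term = 4 * seq_len ** 2 * hidden   # attention matmuls, quadratic in n
    return linear_term + attention_term


def relative_cost(n_text: int, n_visual: int, hidden: int, keep_ratio: float) -> float:
    """Per-layer cost after keeping `keep_ratio` of the visual tokens,
    relative to the uncompressed sequence."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    kept_visual = max(1, round(n_visual * keep_ratio))
    full = layer_flops(n_text + n_visual, hidden)
    compressed = layer_flops(n_text + kept_visual, hidden)
    return compressed / full
```

For example, `relative_cost(64, 576, 4096, 0.25)` compares a placeholder sequence of 64 text and 576 visual tokens with one that keeps a quarter of the visual tokens. Because the attention term is quadratic in sequence length, the saving grows faster than the token reduction alone as sequences get longer.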

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could encourage development of adaptive training schedulers that monitor redundancy in real time.
Similar stage-aware sparsity might apply to training other large models like standard LLMs or vision transformers.
Future work could explore combining these components or automating the stage transitions.

Load-bearing premise

That the redundancy in training can be divided cleanly into separate stages where each sparsity technique can be applied independently without harming the model's final capabilities or slowing convergence.

What would settle it

Running full training versus STS on the same MLLM architecture and dataset, then comparing final benchmark scores and total training time; if scores drop significantly while time savings are minimal, the claim would be weakened.
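The comparison described here reduces to two ratios. The following sketch is a hypothetical helper, not part of the paper: the function names and the thresholds `max_drop` and `min_saving` are assumptions chosen only to make "drops significantly" and "minimal savings" concrete.

```python
def compare_runs(baseline_score: float, sts_score: float,
                 baseline_hours: float, sts_hours: float) -> tuple[float, float]:
    """Return (relative score drop, relative time saving) of STS vs. full training."""
    score_drop = (baseline_score - sts_score) / baseline_score
    time_saving = (baseline_hours - sts_hours) / baseline_hours
    return score_drop, time_saving


def claim_weakened(score_drop: float, time_saving: float,
                   max_drop: float = 0.02, min_saving: float = 0.10) -> bool:
    """True when accuracy falls past `max_drop` while the saving stays under
    `min_saving`. Both thresholds are illustrative assumptions."""
    return score_drop > max_drop and time_saving < min_saving
```

Any real test would need the same architecture, dataset, and hardware for both runs, and the thresholds would have to be fixed before looking at the results.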

Original abstract

Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance across a variety of domains. However, training MLLMs is often inefficient, as much of the computation is redundant due to the long input sequences from multimodal data and underutilized inter-layer operations. Notably, such redundancy is not static but varies across different stages of training. Building on this observation, we shift the focus to the training process itself and propose a training-efficient framework based on sparse representations, termed the Sparse Training Scheme (STS). Instead of applying a uniform sparsity strategy, STS adopts a stage-aware design that adapts to different sources of redundancy during training. Specifically, the framework consists of two complementary components: the Visual Token Compressor, which reduces the information load by compressing visual tokens during modality alignment, and the Layer Dynamic Skipper, which mitigates computational overhead by dynamically skipping unnecessary layers during instruction tuning. Our approach is broadly applicable to diverse MLLM architectures and has been extensively evaluated on multiple benchmarks, demonstrating its effectiveness and efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Stage-aware sparsity for MLLM training is a sensible engineering split but the evidence that the two redundancy sources stay cleanly separated is still thin.

The main thing to know is that this paper splits sparsity into two phases for MLLM training: a Visual Token Compressor during modality alignment and a Layer Dynamic Skipper during instruction tuning. The claim is that this stage-aware approach cuts compute more effectively than a single uniform strategy because the sources of redundancy change over training. That pairing is the concrete new piece, even if the underlying sparsity tricks themselves are not novel.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Sparse Training Scheme (STS) for improving training efficiency of Multimodal Large Language Models (MLLMs). It identifies that computational redundancy varies across training stages and introduces a stage-aware framework with two components: the Visual Token Compressor, applied during modality alignment to reduce visual token information load, and the Layer Dynamic Skipper, used during instruction tuning to dynamically skip underutilized layers. The approach is described as architecture-agnostic and is asserted to have been extensively evaluated on multiple benchmarks, showing gains in efficiency while preserving effectiveness.

Significance. If the stage-aware sparsity components deliver substantial compute reductions without degrading final model quality or convergence speed, the work would offer a practical contribution to efficient MLLM training. Targeting distinct redundancy sources at different stages could reduce the resource barrier for scaling multimodal models, provided the claimed orthogonality holds under rigorous testing.

major comments (2)

[Abstract] Abstract: The central efficiency claim rests on 'extensive evaluation on multiple benchmarks demonstrating its effectiveness and efficiency,' yet the abstract (and evaluation summary) supplies no quantitative metrics, per-stage redundancy statistics, ablation results, or baseline comparisons. Without these, the magnitude of claimed savings and any performance trade-offs cannot be assessed.
[Method (STS components)] STS framework description: The design assumes visual-token redundancy dominates only in modality alignment and layer redundancy only in instruction tuning, with the two components remaining non-interfering. No per-stage redundancy profiles, cross-stage ablation experiments, or analysis of whether early compression alters subsequent layer-utilization statistics are provided to substantiate this separability assumption.
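The cross-stage ablation the second comment asks for can be laid out as a small grid. The sketch below only enumerates the configurations; the key names are hypothetical, and the source does not say which of these runs, if any, the authors performed.

```python
from itertools import product


def ablation_grid() -> list[dict[str, bool]]:
    """Enumerate every on/off combination of the two STS components."""
    configs = []
    for vtc_on, lds_on in product((False, True), repeat=2):
        configs.append({
            "vtc_in_alignment": vtc_on,           # Visual Token Compressor in stage 1
            "lds_in_instruction_tuning": lds_on,  # Layer Dynamic Skipper in stage 2
        })
    return configs
```

The four entries cover the full-training baseline, each component alone, and the combined scheme. Comparing layer-utilization statistics between the runs with and without early compression would speak directly to the separability assumption.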

minor comments (2)

[Visual Token Compressor] Clarify the precise compression mechanism in the Visual Token Compressor (e.g., which token features are retained and the exact reduction ratio).
[Layer Dynamic Skipper] Define the decision criterion and threshold for the Layer Dynamic Skipper more explicitly, including how 'unnecessary' layers are identified at runtime.
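To make the first minor comment concrete, here is one possible compression mechanism: mean pooling over consecutive groups of visual tokens. This is a stand-in chosen for simplicity; the source does not say how the Visual Token Compressor works, and `compress_visual_tokens` and its `ratio` parameter are hypothetical.

```python
def compress_visual_tokens(tokens: list[list[float]], ratio: int) -> list[list[float]]:
    """Average-pool consecutive groups of `ratio` token vectors.

    Assumed mechanism for illustration, not the paper's compressor.
    A trailing group shorter than `ratio` is averaged over its own length.
    """
    if ratio < 1:
        raise ValueError("ratio must be a positive integer")
    pooled = []
    for start in range(0, len(tokens), ratio):
        group = tokens[start:start + ratio]
        dim = len(group[0])
        # Element-wise mean across the vectors in this group.
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled
```

With `ratio=2`, four token vectors become two. A real answer to the referee's question would state both the pooling rule and the reduction ratio actually used.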

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the presentation of our results and assumptions.

Referee: [Abstract] Abstract: The central efficiency claim rests on 'extensive evaluation on multiple benchmarks demonstrating its effectiveness and efficiency,' yet the abstract (and evaluation summary) supplies no quantitative metrics, per-stage redundancy statistics, ablation results, or baseline comparisons. Without these, the magnitude of claimed savings and any performance trade-offs cannot be assessed.

Authors: We agree that the abstract would be strengthened by including specific quantitative metrics. The full manuscript reports detailed efficiency gains (e.g., training FLOPs and wall-clock time reductions) and performance preservation across benchmarks, along with ablations and baseline comparisons in the experimental section. In the revised manuscript we will update the abstract to explicitly state key results, such as the observed compute savings and accuracy retention, while directing readers to the corresponding tables and figures.
revision: yes
Referee: [Method (STS components)] STS framework description: The design assumes visual-token redundancy dominates only in modality alignment and layer redundancy only in instruction tuning, with the two components remaining non-interfering. No per-stage redundancy profiles, cross-stage ablation experiments, or analysis of whether early compression alters subsequent layer-utilization statistics are provided to substantiate this separability assumption.

Authors: The referee correctly identifies that the manuscript would benefit from more direct evidence for the stage-specific redundancy assumptions and their non-interference. While our main results demonstrate that each component is most effective in its assigned stage and that joint application produces additive improvements, we did not include explicit per-stage redundancy profiles or cross-stage interaction analyses. We will add a dedicated subsection presenting observed redundancy statistics across training stages and additional ablation experiments that measure the effect of early visual-token compression on subsequent layer-utilization patterns.
revision: yes
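The rebuttal's claim that joint application is additive can be checked with a standard two-factor interaction term. The sketch below assumes a single cost metric measured in four runs (baseline, each component alone, both together); the function name and the numbers in the usage note are placeholders, not results from the paper.

```python
def interaction_term(baseline: float, vtc_only: float,
                     lds_only: float, both: float) -> float:
    """Two-factor interaction on an additive scale.

    Zero means the saving from using both components equals the sum of the
    individual savings; a nonzero value means the components interfere
    with or reinforce each other.
    """
    return both - vtc_only - lds_only + baseline
```

For instance, with placeholder costs of 10, 8, 7, and 5 units, the individual savings are 2 and 3, the joint saving is 5, and the interaction term is 0, which is what "additive" would look like.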

Circularity Check

0 steps flagged

No significant circularity: empirical engineering framework without self-referential derivations or fitted predictions

The paper proposes the Sparse Training Scheme (STS) as a practical, stage-aware sparsity framework with two components (Visual Token Compressor and Layer Dynamic Skipper) motivated by the observation that redundancy sources differ across training phases. No equations, first-principles derivations, or quantitative predictions are presented that reduce by construction to the method's own inputs, fitted parameters, or prior self-citations. The design is justified by engineering observations and evaluated empirically on benchmarks rather than through any closed theoretical loop, uniqueness theorem, or ansatz imported from the authors' own prior work. This is the most common honest finding for applied efficiency papers that do not claim mathematical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The approach rests on the domain assumption that redundancy is stage-dependent and separable, plus two newly introduced components whose independent evidence is not supplied in the abstract.

axioms (1)

domain assumption: Redundancy in computation is not static but varies across different stages of training.
Explicitly stated as the observation that motivates the stage-aware design.

invented entities (2)

Visual Token Compressor (no independent evidence)
purpose: Reduces information load by compressing visual tokens during modality alignment.
New module introduced to address early-stage redundancy.

Layer Dynamic Skipper (no independent evidence)
purpose: Mitigates computational overhead by dynamically skipping unnecessary layers during instruction tuning.
New module introduced to address later-stage redundancy.

pith-pipeline@v0.9.0 · 5710 in / 1268 out tokens · 29679 ms · 2026-05-21T21:57:29.118039+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.

STS adopts a stage-aware design that adapts to different sources of redundancy during training, consisting of the Visual Token Compressor during modality alignment and the Layer Dynamic Skipper during instruction tuning

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear

p_l(e) = α((E−e)/E)² · (1 + ϵ·2^{l−(L−1)}/(L−1))
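The quoted expression can be evaluated directly. The sketch below transcribes it term by term under several assumptions that this page does not confirm: p is read as a per-layer skip probability, e as the current training step out of E, l as a 0-based layer index among L layers, and α and ϵ as hyperparameters. The function and parameter names are hypothetical.

```python
import random


def skip_probability(layer: int, step: int, total_steps: int,
                     num_layers: int, alpha: float, eps: float) -> float:
    """Evaluate p_l(e) = alpha * ((E - e) / E)^2 * (1 + eps * 2^(l - (L - 1)) / (L - 1))."""
    if num_layers < 2:
        raise ValueError("num_layers must be at least 2")
    if total_steps <= 0 or not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    decay = ((total_steps - step) / total_steps) ** 2  # shrinks to 0 as training ends
    depth = 1.0 + eps * 2.0 ** (layer - (num_layers - 1)) / (num_layers - 1)
    return alpha * decay * depth


def should_skip(layer: int, step: int, total_steps: int, num_layers: int,
                alpha: float, eps: float, rng: random.Random) -> bool:
    """Sample a skip decision; the clamp to [0, 1] is an added safeguard."""
    p = skip_probability(layer, step, total_steps, num_layers, alpha, eps)
    return rng.random() < min(1.0, max(0.0, p))
```

Under this reading, the value decays quadratically to zero over training and is slightly larger for deeper layers, so every layer is always executed by the final step.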

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 2 internal anchors

[1]

INTRODUCTION Large Language Models (LLMs) [1, 2, 3, 4] have recently achieved a series of remarkable breakthroughs. The emergence of Multimodal Large Language Models (MLLMs) [5, 6, 7] such as Flamingo [8], GPT-4 [9], and LLaVA [10] signifies a clear shift of language models toward multimodal capabilities. However, the training and deployment of MLLMs f...

work page
[2]

Improving MLLM Training Efficiency via Stage-Aware Sparsity

METHOD In this section, we first offer a brief introduction of naive training scheme in Sec 2.1. As shown in Figure 1, our proposed STS includes two key components: the VTC and the LDS. They are detailed in Sec 2.2 and Sec 2.3 respectively. Finally, we demonstrate the application of STS in the practical training of MLLMs in Section 2.4. 2.1. Prelimi...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

We also illustrate the ablation studies in Sec

EXPERIMENT In this section, we first declare the implementation details of our experiments in Sec 3.1, followed by discussing the main results across various MLLMs in Sec 3.2. We also illustrate the ablation studies in Sec. 3.3 and computational efficiency analysis in Sec. 3.4. 3.1. Implementation Details We evaluate our STS on LLaVA [10], Mipha [24], ...

work page 2000
[4]

STS reduces the computational overhead by using two components: VTC and LDS, which can compress redundant visual inputs and dynamically skip unnecessary decoder layers respectively

CONCLUSION We presented the STS, a general framework that improves the training efficiency of multimodal large language models. STS reduces the computational overhead by using two components: VTC and LDS, which can compress redundant visual inputs and dynamically skip unnecessary decoder layers respectively. Experiments across multiple benchmarks demo...

work page
[5]

Glm: General language model pretraining with autoregressive blank infilling,

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang, “Glm: General language model pretraining with autoregressive blank infilling,” 2022

work page 2022
[6]

Language models are few-shot learners,

Tom B. Brown et al, “Language models are few-shot learners,” 2020

work page 2020
[7]

Llama: Open and efficient foundation language models,

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample, “Llama: Open and efficient foundation language models,” 2023

work page 2023
[8]

Qwen2.5 technical report,

An Yang et al, “Qwen2.5 technical report,” 2025

work page 2025
[9]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” 2023

work page 2023
[10]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai, “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” 2024

work page 2024
[11]

Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond,

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou, “Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond,” 2023

work page 2023
[12]

Flamingo: a visual language model for few-shot learning,

Jean-Baptiste Alayrac et al, “Flamingo: a visual language model for few-shot learning,” 2022

work page 2022
[13]

Gpt-4 technical report,

OpenAI, “Gpt-4 technical report,” 2024

work page 2024
[14]

Visual instruction tuning,

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee, “Visual instruction tuning,” 2023

work page 2023
[15]

Leveraging visual tokens for extended text contexts in multi-modal learning,

Alex Jinpeng Wang, Linjie Li, Yiqi Lin, Min Li, Lijuan Wang, and Mike Zheng Shou, “Leveraging visual tokens for extended text contexts in multi-modal learning,” arXiv preprint arXiv:2406.02547, 2024

work page arXiv 2024
[16]

Qwen2.5-vl technical report,

Shuai Bai et al, “Qwen2.5-vl technical report,” 2025

work page 2025
[17]

An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models,

Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang, “An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models,” 2024

work page 2024
[18]

Feather the throttle: Revisiting visual token pruning for vision-language model acceleration,

Mark Endo, Xiaohan Wang, and Serena Yeung-Levy, “Feather the throttle: Revisiting visual token pruning for vision-language model acceleration,” 2025

work page 2025
[19]

Sparsevlm: Visual token sparsification for efficient vision-language model inference,

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, and Shanghang Zhang, “Sparsevlm: Visual token sparsification for efficient vision-language model inference,” 2025

work page 2025
[20]

Advancing multimodal large language models with quantization-aware scale learning for efficient adaptation,

Jingjing Xie, Yuxin Zhang, Mingbao Lin, Liujuan Cao, and Rongrong Ji, “Advancing multimodal large language models with quantization-aware scale learning for efficient adaptation,” 2024

work page 2024
[21]

Vlmq: Efficient post-training quantization for large vision-language models via hessian augmentation,

Yufei Xue, Yushi Huang, Jiawei Shao, and Jun Zhang, “Vlmq: Efficient post-training quantization for large vision-language models via hessian augmentation,” arXiv preprint arXiv:2508.03351, 2025

work page arXiv 2025
[22]

A multi-level framework for accelerating training transformer models,

Longwei Zou, Han Zhang, and Yangdong Deng, “A multi-level framework for accelerating training transformer models,” 2024

work page 2024
[23]

Staged training for transformer language models,

Sheng Shen, Pete Walsh, Kurt Keutzer, Jesse Dodge, Matthew Peters, and Iz Beltagy, “Staged training for transformer language models,” in International Conference on Machine Learning. PMLR, 2022, pp. 19893–19908

work page 2022
[24]

No train no gain: Revisiting efficient training algorithms for transformer-based language models,

Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, and Matt J. Kusner, “No train no gain: Revisiting efficient training algorithms for transformer-based language models,” 2023

work page 2023
[25]

Efficient large multi-modal models via visual context compression,

Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, and Alan Yuille, “Efficient large multi-modal models via visual context compression,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[26]

Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction,

Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, and Dahua Lin, “Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction,” 2025

work page 2025
[27]

Exploring activation patterns of parameters in language models,

Yudong Wang, Damai Dai, Zhe Yang, Jingyuan Ma, and Zhifang Sui, “Exploring activation patterns of parameters in language models,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025, vol. 39, pp. 25416–25424

work page 2025
[28]

Mipha: A comprehensive overhaul of multimodal assistant with small language models,

Minjie Zhu, Yichen Zhu, Xin Liu, Ning Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng, Zhicai Ou, Feifei Feng, and Jian Tang, “Mipha: A comprehensive overhaul of multimodal assistant with small language models,” 2024

work page 2024
[29]

Moe-llava: Mixture of experts for large vision-language models,

Bin Lin, Zhenyu Tang, Yang Ye, Jinfa Huang, Junwu Zhang, Yatian Pang, Peng Jin, Munan Ning, Jiebo Luo, and Li Yuan, “Moe-llava: Mixture of experts for large vision-language models,” 2024

work page 2024
[30]

Gqa: A new dataset for real-world visual reasoning and compositional question answering,

Drew A. Hudson and Christopher D. Manning, “Gqa: A new dataset for real-world visual reasoning and compositional question answering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

work page 2019
[31]

Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering,

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh, “Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017

work page 2017
[32]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan, “Learn to explain: Multimodal reasoning via thought chains for science question answering,” arXiv preprint arXiv:2209.09513, 2022

work page arXiv 2022
[33]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji, “Mme: A comprehensive evaluation benchmark for multimodal large language models,” arXiv preprint arXiv:2306.13394, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Evaluating object hallucination in large vision-language models,

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen, “Evaluating object hallucination in large vision-language models,” in The 2023 Conference on Empirical Methods in Natural Language Processing, 2023

work page 2023
