Dynamic Token Reduction during Generation for Vision Language Models

Chaofeng Guan; Haoji Hu; Huan Wang; Huiyao Chen; Jiaying Lu; Xiaoyu Liang

arxiv: 2501.14204 · v1 · pith:7HKGAEI7new · submitted 2025-01-24 · 💻 cs.CV · cs.AI

Xiaoyu Liang, Chaofeng Guan, Jiaying Lu, Huiyao Chen, Huan Wang, Haoji Hu

keywords generation, attention, tokens, distribution, pruning, rate, visual, achieved

Vision-Language Models (VLMs) have achieved notable success in multimodal tasks but face practical limitations due to the quadratic complexity of decoder attention mechanisms and autoregressive generation. Existing methods such as FASTV and VTW reduce redundant visual tokens effectively, but they prune tokens in a single forward pass without systematically analyzing the redundancy of visual tokens throughout the entire generation process. In this paper, we introduce a dynamic pruning strategy tailored for VLMs, named Dynamic Rate (DyRate), which progressively adjusts the compression rate during generation. Our analysis of the attention distribution reveals that the importance of visual tokens decreases throughout the generation process, which motivates a more aggressive compression rate. A lightweight predictor based on the attention distribution enables flexible adjustment of the pruning rate. Our experimental results demonstrate that our method not only reduces computational demands but also maintains the quality of responses.
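
The idea in the abstract, pruning more visual tokens as generation proceeds, can be illustrated with a minimal sketch. This is not the authors' DyRate implementation: the abstract describes a lightweight predictor that sets the rate from the attention distribution, whereas the hypothetical `pruning_rate` below swaps in a fixed linear schedule purely for illustration. The function names, the `start` and `end` rates, and the sample attention scores are all assumptions.

```python
def pruning_rate(step, total_steps, start=0.2, end=0.8):
    """Hypothetical linear schedule (an assumption, not the paper's predictor):
    the fraction of visual tokens to drop grows with the decoding step."""
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac


def prune_visual_tokens(attention, rate):
    """Drop the lowest-attention fraction `rate` of visual tokens.

    `attention` holds one score per visual token. Returns the indices of the
    kept tokens in their original order; at least one token is always kept.
    """
    n = len(attention)
    keep = max(1, n - int(n * rate))
    ranked = sorted(range(n), key=lambda i: attention[i], reverse=True)
    return sorted(ranked[:keep])


# Made-up attention scores over 8 visual tokens.
attention = [0.30, 0.05, 0.20, 0.01, 0.25, 0.02, 0.10, 0.07]

kept_early = prune_visual_tokens(attention, pruning_rate(0, 10))  # first step
kept_late = prune_visual_tokens(attention, pruning_rate(9, 10))   # last step

print(kept_early)
print(kept_late)
```

With these sample scores, the first decoding step keeps seven of the eight tokens and the last step keeps only the two most attended ones. A learned predictor would replace the fixed schedule with a rate chosen from the observed attention distribution.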

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Spectral Evolution-Guided Token Pruning in Multimodal Large Language Models
cs.CV 2026-06 unverdicted novelty 6.0

CLSE prunes tokens in MLLMs by quantifying cross-layer spectral redistribution in the frequency domain to preserve semantically active tokens and reduce compute.

Accelerating Multimodal Large Language Models with Prior-Corrected Token Reduction
cs.CV 2026-06 unverdicted novelty 6.0

PriorTR estimates model-induced prior attention via a null token in one forward pass and contrasts it with task-conditioned attention to improve visual token pruning accuracy-efficiency trade-offs in MLLMs.

Beyond Surrogate Gradients: Fully Differentiable Token Pruning for Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

DiffPrune reformulates visual token pruning as continuous control of token information using an Information Throttler with importance-conditioned variance-preserving noise, enabling fully differentiable learning of sc...

Toward Native Multimodal Modeling: A Roadmap
cs.CV 2026-05 unverdicted novelty 3.0

A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-...