Accelerating multimodal large lan- guage models via dynamic visual-token exit and the empiri- cal findings

Qiong Wu, Wenhao Lin, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji · 2024 · arXiv 2411.19628

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

Counting to Four is still a Chore for VLMs

cs.CV · 2026-04-11 · unverdicted · novelty 6.0

VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.

ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling

cs.CV · 2026-03-24 · unverdicted · novelty 6.0

ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.

Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models

cs.CV · 2025-03-18 · unverdicted · novelty 5.0

TwigVLM adds a twig module to VLMs for twig-guided token pruning and self-speculative decoding, retaining 96% performance after pruning 88.9% visual tokens and delivering 154% speedup on long responses for LLaVA-1.5-7B.

citing papers explorer

Showing 3 of 3 citing papers.

Counting to Four is still a Chore for VLMs cs.CV · 2026-04-11 · unverdicted · none · ref 21
VLMs fail at counting because visual evidence degrades in later language layers, and a lightweight Modality Attention Share intervention can encourage better use of image information during answer generation.
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling cs.CV · 2026-03-24 · unverdicted · none · ref 53
ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models cs.CV · 2025-03-18 · unverdicted · none · ref 51
TwigVLM adds a twig module to VLMs for twig-guided token pruning and self-speculative decoding, retaining 96% performance after pruning 88.9% visual tokens and delivering 154% speedup on long responses for LLaVA-1.5-7B.

Accelerating multimodal large lan- guage models via dynamic visual-token exit and the empiri- cal findings

fields

years

verdicts

representative citing papers

citing papers explorer