Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.
Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
dataset 1
citation-polarity summary
years
2026 2verdicts
UNVERDICTED 2roles
dataset 1polarities
use dataset 1representative citing papers
ERASE prunes 85% of vision tokens in Qwen2.5-VL-7B while retaining 89.46% accuracy, outperforming prior methods that retain only 78.1%.
citing papers explorer
-
Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs
Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.
-
ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning
ERASE prunes 85% of vision tokens in Qwen2.5-VL-7B while retaining 89.46% accuracy, outperforming prior methods that retain only 78.1%.