Don't Blink: Evidence Collapse during Multimodal Reasoning

· 2026 · cs.AI · arXiv 2604.04207

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Reasoning VLMs can become more accurate while progressively losing visual grounding as they think. This creates task-conditional danger zones where low-entropy predictions are confident but ungrounded, a failure mode text-only monitoring cannot detect. Evaluating three reasoning VLMs on MathVista, HallusionBench, and MMMU_Pro, we find a pervasive evidence-collapse phenomenon: attention to annotated evidence regions drops substantially, often losing over half of evidence mass, as reasoning unfolds. Full-response entropy is the most reliable text-only uncertainty signal under cross-dataset transfer, yet adding vision features with a single global linear rule is brittle and often degrades transfer. An entropy-vision interaction model reveals a task-conditional regime: low-entropy, visually disengaged predictions are hazardous on sustained visual-reference tasks but benign on symbolic tasks. Using this structure, a targeted vision veto reduces selective risk by up to 1.9 percentage points at 90% coverage, while avoiding degradations where disengagement is expected. The results support task-aware multimodal monitoring for safe deployment under distribution shift.
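To make "selective risk at 90% coverage" and the "vision veto" concrete, here is a minimal sketch of how such a metric and veto could be computed. It is not the paper's implementation: the abstract does not specify the veto rule, so the function names, the thresholds, the `penalty` constant, the per-example `visual_task` flag, and the toy data are all assumptions made for illustration.

```python
from typing import List, Sequence


def selective_risk(uncertainty: Sequence[float],
                   correct: Sequence[bool],
                   coverage: float = 0.9) -> float:
    """Error rate among the `coverage` fraction of examples with the lowest uncertainty."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    n_keep = max(1, round(coverage * len(uncertainty)))
    # Rank examples from most to least confident and keep the top n_keep.
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i])
    kept = order[:n_keep]
    return sum(1 for i in kept if not correct[i]) / n_keep


def apply_vision_veto(entropy: Sequence[float],
                      evidence_mass: Sequence[float],
                      visual_task: Sequence[bool],
                      entropy_threshold: float,
                      mass_threshold: float,
                      penalty: float = 1e6) -> List[float]:
    """Hypothetical veto: on visual-reference tasks only, push predictions that are
    both low-entropy and visually disengaged to the end of the confidence ranking."""
    scores = []
    for h, m, is_visual in zip(entropy, evidence_mass, visual_task):
        vetoed = is_visual and h < entropy_threshold and m < mass_threshold
        scores.append(h + penalty if vetoed else h)
    return scores


# Toy data (invented): example 0 is confident, wrong, and barely attends to evidence.
entropy = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
evidence_mass = [0.05, 0.6, 0.5, 0.7, 0.4, 0.6, 0.5, 0.3, 0.4, 0.5]
correct = [False, True, True, True, True, True, True, True, True, True]
visual_task = [True] * 10

baseline = selective_risk(entropy, correct, coverage=0.9)
vetoed_scores = apply_vision_veto(entropy, evidence_mass, visual_task,
                                  entropy_threshold=0.25, mass_threshold=0.1)
vetoed = selective_risk(vetoed_scores, correct, coverage=0.9)
print(f"baseline risk: {baseline:.3f}, with veto: {vetoed:.3f}")
```

In this toy setup the entropy-only ranking keeps the confident but ungrounded error inside the 90% covered set, while the veto moves it into the abstained 10%. Setting `visual_task` to `False` for symbolic examples leaves their ranking untouched, which mirrors the abstract's point that disengagement is benign there.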

representative citing papers

V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

cs.CV · 2026-06-24 · unverdicted · novelty 5.0

V-Zero trains MLLMs for visual reasoning without answer labels by gating on-policy distillation trajectories using contrastive evidence from relevant versus negative image crops.

citing papers explorer

Showing 1 of 1 citing paper.

V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

cs.CV · 2026-06-24 · unverdicted · none · ref 31 · internal anchor
