Training-free uncertainty guidance for complex visual tasks with MLLMs. arXiv preprint arXiv:2510.00705

Sanghwan Kim, Rui Xiao, Stephan Alaniz, Yongqin Xian, Zeynep Akata · 2025 · cs.CV · arXiv 2510.00705

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Multimodal Large Language Models (MLLMs) often struggle with fine-grained perception, such as identifying small objects in high-resolution images or detecting key moments in long videos. Existing methods typically rely on complex, task-specific fine-tuning, which reduces generalizability and increases system complexity. In this work, we propose an effective, training-free framework that uses an MLLM's intrinsic uncertainty as proactive guidance. Our core insight is that a model's uncertainty decreases when provided with relevant visual information. We introduce a unified mechanism that scores candidate visual inputs by response uncertainty, enabling the model to autonomously focus on the most informative data. We apply this simple principle to three challenging visual tasks: Visual Search, Long Video Understanding, and Temporal Grounding, allowing off-the-shelf MLLMs to achieve performance competitive with specialized, fine-tuned systems. Our results demonstrate that leveraging intrinsic uncertainty is a powerful strategy for improving fine-grained multimodal performance.
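
The selection principle in the abstract (score each candidate visual input by the uncertainty of the model's response, then prefer the least uncertain one) can be illustrated with a minimal sketch. The abstract does not say how uncertainty is measured, so the mean per-token Shannon entropy used here is an assumption, not necessarily the paper's scoring function. The names `mean_token_entropy`, `select_least_uncertain`, and `get_token_dists` are hypothetical; `get_token_dists` stands in for whatever call returns an MLLM's per-token output probabilities for a given candidate crop or video segment.

```python
import math
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def mean_token_entropy(token_dists: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability distributions.

    ASSUMPTION: entropy is used as the uncertainty measure for illustration only.
    """
    if not token_dists:
        raise ValueError("need at least one token distribution")
    total = 0.0
    for dist in token_dists:
        # Zero-probability entries contribute nothing to the entropy.
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_dists)


def select_least_uncertain(
    candidates: Sequence[T],
    get_token_dists: Callable[[T], Sequence[Sequence[float]]],
) -> Tuple[T, float]:
    """Return the candidate whose response has the lowest uncertainty score.

    `get_token_dists` is a hypothetical hook: given one candidate visual input,
    it returns the model's probability distribution for each generated token.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [mean_token_entropy(get_token_dists(c)) for c in candidates]
    best_index = min(range(len(scores)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


if __name__ == "__main__":
    # Toy stand-in for model outputs: one answer token per candidate crop.
    fake_outputs = {
        "crop_a": [[0.5, 0.5]],
        "crop_b": [[0.9, 0.1]],
        "crop_c": [[1 / 3, 1 / 3, 1 / 3]],
    }
    best, score = select_least_uncertain(list(fake_outputs), fake_outputs.__getitem__)
    print(best, round(score, 4))
```

In the toy data, `crop_b` has the most peaked answer distribution and is therefore selected. With a real MLLM, the same loop would run once per candidate region or video segment, which is what makes the approach training-free but inference-heavy.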

representative citing papers

Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

Zoom consistency provides a geometric, cross-model confidence signal in zoom-in grounding pipelines that correlates with prediction correctness and enables modest gains in specialist-generalist routing.

DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

cs.AI · 2026-05-27 · unverdicted · novelty 4.0

DenoiseRL optimizes recovery from noisy prefixes in weak-model reasoning failures to improve performance and self-correction on math and general reasoning benchmarks without external supervision.

citing papers explorer

Showing 2 of 2 citing papers.

Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines

cs.CV · 2026-04-15 · unverdicted · none · ref 3 · internal anchor

DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

cs.AI · 2026-05-27 · unverdicted · none · ref 14 · internal anchor
