LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models

Chengjun Xie; Hailong Huang; Haoxuan Che; Jie Zhang; Man Zhou; Xuanhua He; Xueheng Li; Yuxiang Shen; Zhenkun Gao

arxiv: 2603.00171 · v3 · pith:R4RO2FJKnew · submitted 2026-02-26 · 💻 cs.CV · cs.AI

Yuxiang Shen, Hailong Huang, Zhenkun Gao, Xueheng Li, Man Zhou, Chengjun Xie, Haoxuan Che, Xuanhua He, Jie Zhang

keywords: lookwise, visual, fine-grained, look, reasoning, language, large, localization

Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. Large-scale training for this capability is effective but computationally expensive, which has spurred growing interest in lightweight, training-free solutions. However, existing training-free methods suffer from two flaws: perceptual redundancy from indiscriminate cropping, which increases computational cost and introduces noise; and a drift between semantic intent and spatial attention, which prevents accurate localization of user-focused regions. To address these challenges, we propose LookWise, a framework for adaptive visual reasoning. LookWise follows a two-stage pipeline: a confidence-based module decides when to look more carefully, and a semantic-guided localization module determines where to look. This design enables MLLMs to adaptively acquire fine-grained visual evidence without additional training. Experiments on fine-grained and high-resolution visual reasoning benchmarks show that LookWise consistently improves accuracy over strong baselines, achieves an approximately $4.0\times$ inference speedup over the search-based method ZoomEye, and generalizes robustly across models.
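The two-stage idea in the abstract (first decide when to look, then decide where) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how confidence or localization is computed, so the max-softmax confidence gate, the grid of relevance scores, the 0.7 threshold, and every function name below are assumptions made purely for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def should_look_closer(answer_logits: Sequence[float], threshold: float = 0.7) -> bool:
    """Hypothetical 'when' gate: request a closer look only if the
    top answer probability from the first pass is below the threshold."""
    return max(softmax(answer_logits)) < threshold


def pick_region(relevance: Sequence[Sequence[float]], image_size: Tuple[int, int]) -> Box:
    """Hypothetical 'where' step: return the pixel box of the grid cell
    with the highest question-relevance score (scores are assumed given)."""
    rows, cols = len(relevance), len(relevance[0])
    width, height = image_size
    row, col = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: relevance[rc[0]][rc[1]],
    )
    cell_w, cell_h = width // cols, height // rows
    return (col * cell_w, row * cell_h, (col + 1) * cell_w, (row + 1) * cell_h)


def region_to_inspect(
    answer_logits: Sequence[float],
    relevance: Sequence[Sequence[float]],
    image_size: Tuple[int, int],
    threshold: float = 0.7,
) -> Optional[Box]:
    """Return a crop box when the first pass is uncertain, else None."""
    if not should_look_closer(answer_logits, threshold):
        return None  # confident answer: skip the extra crop entirely
    return pick_region(relevance, image_size)
```

With a confident first pass such as `region_to_inspect([2.0, 0.0, 0.0], [[0.1, 0.2], [0.9, 0.3]], (200, 100))`, the sketch returns `None` and no crop is made; with near-uniform logits such as `[0.5, 0.4, 0.3]` it returns the box of the highest-scoring cell. Skipping the crop for confident answers is the kind of saving the abstract attributes to avoiding indiscriminate cropping.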

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
cs.CV 2026-05 unverdicted novelty 7.0

CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...