CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

· 2025 · cs.CV · arXiv 2511.19820

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

open full Pith review browse 3 citing papers arXiv PDF

abstract

Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM as an external low-cost method for boosting performance, enabling VLMs to dynamically ''zoom in'' on relevant image regions, enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal, and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably for benchmarks that are out-of-domain for the target VLM, without modifying or fine-tuning the VLM, thus avoiding catastrophic forgetting.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

Visual CoT agents exhibit tool-use collapse where tool usage declines but task accuracy rises, and adding entropy regularization for rollout diversity produces the strongest performance.

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.

Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy at low token budgets.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents cs.CV · 2026-05-25 · unverdicted · none · ref 12 · internal anchor
Visual CoT agents exhibit tool-use collapse where tool usage declines but task accuracy rises, and adding entropy regularization for rollout diversity produces the strongest performance.
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating cs.CV · 2026-05-12 · unverdicted · none · ref 7 · 2 links · internal anchor
CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.
Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models cs.CV · 2026-04-22 · unverdicted · none · ref 4 · internal anchor
Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy at low token budgets.

CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer