CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception
Pith reviewed 2026-05-17 05:15 UTC · model grok-4.3
The pith
CropVLM trains an RL policy to pick image crops that let existing vision-language models handle fine details without any retraining or forgetting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CropVLM is an external low-cost module trained with reinforcement learning to select and zoom into relevant image regions, which raises the performance of target vision-language models on fine-grained high-resolution tasks, especially out-of-domain benchmarks, without any modification or fine-tuning of the VLM itself.
What carries the argument
CropVLM, a reinforcement-learning policy that chooses which image regions to crop and present to the VLM for finer perception.
If this is right
- The same CropVLM policy works with both open-source and proprietary VLMs.
- Performance gains appear on tasks that need high-resolution detail, including out-of-domain cases.
- The base VLM requires no changes, so its earlier capabilities remain intact.
- Training avoids human bounding-box labels and expensive synthetic evaluations.
Where Pith is reading between the lines
- If the policy learns general cues for useful detail, it may transfer across many VLMs without retraining.
- This style of external zooming could let smaller or cheaper VLMs reach the accuracy of larger ones on detail-heavy work.
- One could test whether task-specific policies or a hierarchy of zoom levels would give further gains.
Load-bearing premise
A single policy trained once with reinforcement learning will keep selecting helpful crops for any target VLM on any out-of-domain fine-grained task without needing further adaptation.
What would settle it
Pair the trained CropVLM policy with a previously unseen VLM on a fine-grained benchmark and measure whether accuracy rises, stays flat, or drops.
Figures
read the original abstract
Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM as an external low-cost method for boosting performance, enabling VLMs to dynamically ''zoom in'' on relevant image regions, enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal, and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably for benchmarks that are out-of-domain for the target VLM, without modifying or fine-tuning the VLM, thus avoiding catastrophic forgetting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CropVLM, a reinforcement learning-based external module that learns to select image crops for dynamic zooming to enhance fine-grained perception in vision-language models (VLMs). Trained once without human-labeled bounding boxes or expensive synthetic evaluations, the policy is intended to pair with arbitrary open-source or proprietary VLMs, delivering improvements on high-resolution tasks and out-of-domain benchmarks while avoiding any modification or fine-tuning of the target VLM.
Significance. If the central empirical claims hold, the work offers a low-cost, generalizable plug-in for boosting VLM performance on detail-oriented tasks such as scene-text recognition and document analysis. The design choice to avoid VLM fine-tuning (and thus catastrophic forgetting) and the single-training paradigm are potentially valuable for practical deployment across model families.
major comments (2)
- [Abstract, §4] Abstract and §4: the central claim of 'significant improvements' on out-of-domain benchmarks for arbitrary target VLMs rests on quantitative results that are not visible in the abstract and whose robustness (baselines, error bars, ablation controls) is not summarized; this directly affects evaluation of the reported gains.
- [§3.1–3.2] §3.1–3.2: the reward formulation and training task distribution used to train the single RL policy are not specified in sufficient detail to establish that the learned cropping strategy is VLM-agnostic rather than tuned to the particular VLMs or tasks supplying the reward signal; this is load-bearing for the no-adaptation, cross-VLM generalization claim.
minor comments (2)
- [Figure 2, §4.3] Figure 2 and §4.3: crop visualization panels would benefit from explicit overlay of the selected region coordinates and the downstream VLM output to allow direct inspection of the zooming effect.
- [§3] Notation in §3: the distinction between the policy network input (full image features) and the crop selection output could be clarified with an explicit diagram or equation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and have revised the manuscript to improve clarity and support for the central claims.
read point-by-point responses
-
Referee: [Abstract, §4] Abstract and §4: the central claim of 'significant improvements' on out-of-domain benchmarks for arbitrary target VLMs rests on quantitative results that are not visible in the abstract and whose robustness (baselines, error bars, ablation controls) is not summarized; this directly affects evaluation of the reported gains.
Authors: We agree that the abstract would benefit from explicitly summarizing the key quantitative results and evaluation robustness to better support the central claims. In the revised manuscript, we will update the abstract to report specific performance gains (e.g., accuracy improvements on out-of-domain high-resolution tasks) and reference the use of multiple baselines and controls. For §4, we will add a concise summary highlighting error bars from repeated evaluations and the main ablation controls to facilitate assessment of the reported gains. revision: yes
-
Referee: [§3.1–3.2] §3.1–3.2: the reward formulation and training task distribution used to train the single RL policy are not specified in sufficient detail to establish that the learned cropping strategy is VLM-agnostic rather than tuned to the particular VLMs or tasks supplying the reward signal; this is load-bearing for the no-adaptation, cross-VLM generalization claim.
Authors: We acknowledge that greater detail on these elements is needed to substantiate the VLM-agnostic generalization claim. The reward is derived from the target VLM's own output quality using a task-agnostic metric on a broad, diverse set of high-resolution images drawn from multiple domains, with no overlap to evaluation benchmarks and without VLM-specific fine-tuning. To address the concern directly, we will expand §3.1–3.2 with the exact reward formulation (including the metric for response quality) and the composition of the training task distribution (e.g., domain breakdown and sampling strategy). This will clarify that the policy learns general zooming behaviors applicable across VLMs. revision: yes
Circularity Check
No load-bearing circularity; RL policy trained once and evaluated externally
full rationale
The paper presents CropVLM as an RL-trained external module that selects crops to improve VLM performance on fine-grained tasks. It is trained once without bounding-box labels or synthetic evaluations and then paired with arbitrary VLMs on out-of-domain benchmarks. No equations, fitted parameters renamed as predictions, or self-citation chains are described in the abstract or central claims that would reduce the reported gains to quantities defined by the method's own inputs. The derivation chain relies on external benchmark evaluations rather than internal self-reference, making the approach self-contained against external benchmarks. Minor self-citation risk exists in any RL literature but is not load-bearing here.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Reinforcement learning reward signals can guide selection of image crops that measurably improve downstream VLM accuracy on fine-grained tasks
invented entities (1)
-
CropVLM
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes... GRPO... reward formulations: Accuracy-Based Reward... Likelihood-Based Reward R(Io,Ic,q,a*) = sum log p(a*_t | Io,Ic,q,a*<t)
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The model is trained once and can be paired with both open-source and proprietary VLMs... without modifying or fine-tuning the VLM
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...
-
Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...
Reference graph
Works this paper leans on
-
[1]
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a Visual Language Model for Few-Shot Learning.Advances in Neural Information Processing Systems, 35:23716–23736,
-
[2]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 4
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Scene text visual question answering
Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marc ¸al Rusinol, Ernest Valveny, CV Jawahar, and Dimos- thenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4291–4301, 2019. 4
work page 2019
-
[4]
Mu Cai, Jianwei Yang, Jianfeng Gao, and Yong Jae Lee. Matryoshka multimodal models. InWorkshop on Video- Language Models@ NeurIPS, 2024. 1, 2
work page 2024
-
[5]
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024. 1
work page 2024
-
[6]
Visrl: Intention-driven visual perception via reinforced reasoning
Zhangquan Chen, Xufang Luo, and Dongsheng Li. Visrl: Intention-driven visual perception via reinforced reasoning. arXiv preprint arXiv:2503.07523, 2025. 2, 3, 4, 7
-
[7]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao. Flashattention-2: Faster attention with bet- ter parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023. 8
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-Rank Adaptation of Large Language Models. Proceedings of the International Conference on Learning Representations, 2022. 5
work page 2022
-
[9]
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning pat- terns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025. 2, 3
work page internal anchor Pith review arXiv 2025
-
[10]
Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, and Muhao Chen. Semantic-clipping: Efficient vision-language model- ing with semantic-guidedd visual selection.arXiv preprint arXiv:2503.11794, 2025. 2
-
[11]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, pages 34892–34916. Curran Associates, Inc., 2023. 1
work page 2023
-
[12]
Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China Information Sciences, 67(12):220102, 2024. 1, 2
work page 2024
-
[13]
Visual-RFT: Visual Reinforcement Fine-Tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual- rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
SmolVLM: Redefining small and efficient multimodal models
Andr ´es Marafioti, Orr Zohar, Miquel Farr ´e, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tun- stall, Leandro von Werra, and Thomas Wolf. SmolVLM: Redefining small and efficient multimodal models.arXiv preprint arXiv:2504.05299...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
DocVQA: A Dataset for VQA on Document Images
Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A Dataset for VQA on Document Images. In Proceedings of the IEEE/CVF Winter Conference on Appli- cations of Computer Vision, pages 2200–2209, 2021. 4
work page 2021
-
[16]
Minesh Mathew, Viraj Bagal, Rub `en Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. InProceedings of the IEEE/CVF Winter Conference on Ap- plications of Computer Vision, pages 1697–1706, 2022. 4
work page 2022
-
[17]
Training lan- guage models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training lan- guage models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, pages 27730–27744, 2022. 3
work page 2022
-
[18]
Learn- ing transferable visual models from natural language super- vision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InProceedings of the International Conference on Machine Learning, pages 8748–8763. PmLR, 2021. 1
work page 2021
-
[19]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Rad- ford, and Oleg Klimov. Proximal policy optimization algo- rithms.arXiv preprint arXiv:1707.06347, 2017. 3
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[20]
Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuo- fan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a com- prehensive dataset and benchmark for chain-of-thought rea- soning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024. 1, 2, 3
work page 2024
-
[21]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathe- matical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[22]
When do we not need larger vision models? In European Conference on Computer Vision, pages 444–462
Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, and Trevor Darrell. When do we not need larger vision models? In European Conference on Computer Vision, pages 444–462. Springer, 2024. 1, 2
work page 2024
-
[23]
Scaling vision pre-training to 4k resolution
Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, et al. Scaling vision pre-training to 4k resolu- tion.arXiv preprint arXiv:2503.19903, 2025. 1, 2
-
[24]
Towards VQA Models That Can Read
Amanpreet Singh, Vivek Natarjan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, and Marcus Rohrbach. Towards VQA Models That Can Read. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019. 4
work page 2019
-
[25]
Eyes Wide Shut? Exploring the Vi- sual Shortcomings of Multimodal LLMs
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes Wide Shut? Exploring the Vi- sual Shortcomings of Multimodal LLMs. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024. 2
work page 2024
-
[26]
Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 7907–7915, 2025. 6
work page 2025
-
[27]
V*: Guided visual search as a core mechanism in multimodal llms
Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024. 2, 6
work page 2024
-
[28]
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs.CoRR, 2024
An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jian- wei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian J McAuley, Jianfeng Gao, et al. List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs.CoRR, 2024. 2
work page 2024
-
[29]
Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. Mllms know where to look: Training-free per- ception of small visual details with multimodal llms.arXiv preprint arXiv:2502.17422, 2025. 2, 3, 7
-
[30]
Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs
Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xi- aowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl.arXiv preprint arXiv:2505.15436, 2025. 2, 3
work page internal anchor Pith review arXiv 2025
-
[31]
Unsupervised visual chain-of-thought reasoning via preference optimization
Kesen Zhao, Beier Zhu, Qianru Sun, and Hanwang Zhang. Unsupervised visual chain-of-thought reasoning via prefer- ence optimization.arXiv preprint arXiv:2504.18397, 2025. 1, 2, 3, 4, 7
-
[32]
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deep- eyes: Incentivizing” thinking with images” via reinforce- ment learning.arXiv preprint arXiv:2505.14362, 2025. 2, 3 A. Limitations and Ethical Considerations The research reported in this paper aims to refine the ca- pabilities of VLMs by enabling detail...
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.