
arXiv: 2511.19820 · v2 · submitted 2025-11-25 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

Pith reviewed 2026-05-17 05:15 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords CropVLM · vision-language models · reinforcement learning · image cropping · fine-grained perception · zooming · high-resolution understanding · out-of-domain benchmarks

The pith

CropVLM trains an RL policy to pick image crops that let existing vision-language models handle fine details without any retraining or forgetting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language models can be made better at fine-grained tasks such as reading scene text or analyzing documents by adding an external module that learns to zoom into useful image regions. This module, called CropVLM, is trained once with reinforcement learning and needs no labeled boxes or costly synthetic tests. The key point is that the same trained policy can be attached to many different VLMs, including ones that never saw the target data, and it raises accuracy on those tasks while leaving the original model untouched.

Core claim

CropVLM is an external low-cost module trained with reinforcement learning to select and zoom into relevant image regions, which raises the performance of target vision-language models on fine-grained high-resolution tasks, especially out-of-domain benchmarks, without any modification or fine-tuning of the VLM itself.

What carries the argument

CropVLM, a reinforcement-learning policy that chooses which image regions to crop and present to the VLM for finer perception.
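
The zoom step can be illustrated with a minimal sketch. It assumes the policy emits a normalized box `(x0, y0, x1, y1)` in `[0, 1]` and that the frozen VLM then receives the crop alongside the full image. The box format, the helper names `to_pixel_box` and `crop`, and the nested-list toy image are illustrative assumptions, not the paper's interface.

```python
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image as rows of pixel values (assumption)
Box = Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1); hypothetical format


def to_pixel_box(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Clamp a normalized box to [0, 1] and convert it to pixel coordinates."""
    x0, y0, x1, y1 = (min(max(v, 0.0), 1.0) for v in box)
    # Order the corners so the box stays well-formed even if the policy swaps them.
    x0, x1 = sorted((x0, x1))
    y0, y1 = sorted((y0, y1))
    left = min(int(x0 * width), width - 1)
    top = min(int(y0 * height), height - 1)
    # Guarantee at least one pixel in each dimension without leaving the image.
    right = max(left + 1, min(width, round(x1 * width)))
    bottom = max(top + 1, min(height, round(y1 * height)))
    return left, top, right, bottom


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image selected by a normalized box."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = to_pixel_box(box, width, height)
    return [row[left:right] for row in image[top:bottom]]
```

Clamping and corner ordering are defensive choices for a learned policy whose raw output may fall outside the image; a real pipeline would typically also upscale the crop before handing it to the VLM.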

If this is right

  • The same CropVLM policy works with both open-source and proprietary VLMs.
  • Performance gains appear on tasks that need high-resolution detail, including out-of-domain cases.
  • The base VLM requires no changes, so its earlier capabilities remain intact.
  • Training avoids human bounding-box labels and expensive synthetic evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the policy learns general cues for useful detail, it may transfer across many VLMs without retraining.
  • This style of external zooming could let smaller or cheaper VLMs reach the accuracy of larger ones on detail-heavy work.
  • One could test whether task-specific policies or a hierarchy of zoom levels would give further gains.

Load-bearing premise

A single policy trained once with reinforcement learning will keep selecting helpful crops for any target VLM on any out-of-domain fine-grained task without needing further adaptation.

What would settle it

Pair the trained CropVLM policy with a previously unseen VLM on a fine-grained benchmark and measure whether accuracy rises, stays flat, or drops.
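
The test described above amounts to comparing accuracy with and without the crop for the same frozen VLM. A minimal sketch follows, assuming a hypothetical callable `answer_fn(question, use_crop)` that wraps the VLM and exact-match scoring; benchmarks such as TextVQA or DocVQA define their own metrics, so exact match is a simplification here.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (question, gold answer); hypothetical record format
# Hypothetical VLM wrapper: answer_fn(question, use_crop) -> answer string.
AnswerFn = Callable[[str, bool], str]


def accuracy(answer_fn: AnswerFn, examples: Sequence[Example], use_crop: bool) -> float:
    """Exact-match accuracy, case- and whitespace-insensitive."""
    if not examples:
        raise ValueError("examples must be non-empty")
    hits = sum(
        answer_fn(question, use_crop).strip().lower() == gold.strip().lower()
        for question, gold in examples
    )
    return hits / len(examples)


def crop_gain(answer_fn: AnswerFn, examples: Sequence[Example]) -> float:
    """Accuracy with the crop minus accuracy without it; negative means the crop hurt."""
    return accuracy(answer_fn, examples, True) - accuracy(answer_fn, examples, False)
```

A positive `crop_gain` on a VLM that played no part in training the policy would support the transfer claim; a value near zero or below would count against it.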

Figures

Figures reproduced from arXiv: 2511.19820 by Bruno Martins, Helder Dias, Miguel Carvalho.

Figure 1
Figure 1: Overview of CropVLM paired with LLaVA. CropVLM dynamically selects informative image regions to boost fine-grained perception while keeping the target VLM frozen. Cai et al. [4] demonstrate that even advanced models use only a small number of image tokens to answer most requests, suggesting that uniform high-resolution processing is inefficient and unnecessary. Alternative approaches have attempted to ad…
Figure 2
Figure 2: The overall CropVLM training procedure. The orange and purple lines represent training with an accuracy-based reward or with …
Figure 3
Figure 3: TextVQA performance across multiple bounding box …
Figure 4
Figure 4: Qualitative examples from the V* Benchmark, where the first 6 cases are successful and the last 2 are failures. Next to each …
Figure 5
Figure 5: Qualitative examples from TextVQA, where the first 6 cases are successful and the last 2 are failures. Next to each image, we …
original abstract

Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM as an external low-cost method for boosting performance, enabling VLMs to dynamically "zoom in" on relevant image regions, enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal, and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably for benchmarks that are out-of-domain for the target VLM, without modifying or fine-tuning the VLM, thus avoiding catastrophic forgetting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CropVLM, a reinforcement learning-based external module that learns to select image crops for dynamic zooming to enhance fine-grained perception in vision-language models (VLMs). Trained once without human-labeled bounding boxes or expensive synthetic evaluations, the policy is intended to pair with arbitrary open-source or proprietary VLMs, delivering improvements on high-resolution tasks and out-of-domain benchmarks while avoiding any modification or fine-tuning of the target VLM.

Significance. If the central empirical claims hold, the work offers a low-cost, generalizable plug-in for boosting VLM performance on detail-oriented tasks such as scene-text recognition and document analysis. The design choice to avoid VLM fine-tuning (and thus catastrophic forgetting) and the single-training paradigm are potentially valuable for practical deployment across model families.

major comments (2)
  1. [Abstract, §4] Abstract and §4: the central claim of 'significant improvements' on out-of-domain benchmarks for arbitrary target VLMs rests on quantitative results that are not visible in the abstract and whose robustness (baselines, error bars, ablation controls) is not summarized; this directly affects evaluation of the reported gains.
  2. [§3.1–3.2] §3.1–3.2: the reward formulation and training task distribution used to train the single RL policy are not specified in sufficient detail to establish that the learned cropping strategy is VLM-agnostic rather than tuned to the particular VLMs or tasks supplying the reward signal; this is load-bearing for the no-adaptation, cross-VLM generalization claim.
minor comments (2)
  1. [Figure 2, §4.3] Figure 2 and §4.3: crop visualization panels would benefit from explicit overlay of the selected region coordinates and the downstream VLM output to allow direct inspection of the zooming effect.
  2. [§3] Notation in §3: the distinction between the policy network input (full image features) and the crop selection output could be clarified with an explicit diagram or equation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and have revised the manuscript to improve clarity and support for the central claims.

point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4: the central claim of 'significant improvements' on out-of-domain benchmarks for arbitrary target VLMs rests on quantitative results that are not visible in the abstract and whose robustness (baselines, error bars, ablation controls) is not summarized; this directly affects evaluation of the reported gains.

    Authors: We agree that the abstract would benefit from explicitly summarizing the key quantitative results and evaluation robustness to better support the central claims. In the revised manuscript, we will update the abstract to report specific performance gains (e.g., accuracy improvements on out-of-domain high-resolution tasks) and reference the use of multiple baselines and controls. For §4, we will add a concise summary highlighting error bars from repeated evaluations and the main ablation controls to facilitate assessment of the reported gains. revision: yes

  2. Referee: [§3.1–3.2] §3.1–3.2: the reward formulation and training task distribution used to train the single RL policy are not specified in sufficient detail to establish that the learned cropping strategy is VLM-agnostic rather than tuned to the particular VLMs or tasks supplying the reward signal; this is load-bearing for the no-adaptation, cross-VLM generalization claim.

    Authors: We acknowledge that greater detail on these elements is needed to substantiate the VLM-agnostic generalization claim. The reward is derived from the target VLM's own output quality using a task-agnostic metric on a broad, diverse set of high-resolution images drawn from multiple domains, with no overlap to evaluation benchmarks and without VLM-specific fine-tuning. To address the concern directly, we will expand §3.1–3.2 with the exact reward formulation (including the metric for response quality) and the composition of the training task distribution (e.g., domain breakdown and sampling strategy). This will clarify that the policy learns general zooming behaviors applicable across VLMs. revision: yes
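
To make the exchange on reward design concrete, here is a minimal sketch of one plausible setup: a binary accuracy reward for each sampled crop, followed by group-relative normalization in the style of DeepSeekMath [21]. Neither this summary nor the rebuttal confirms that CropVLM uses exactly this formulation; the function names, the exact-match criterion, and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def accuracy_reward(prediction: str, gold: str) -> float:
    """Binary reward: 1.0 if the VLM answer matches the gold answer, else 0.0."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of several crops sampled for the same question.

    Each crop's advantage is its reward minus the group mean, divided by the
    group standard deviation (eps avoids division by zero when all rewards tie).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under this kind of scheme the reward depends on whichever VLM scores the crops during training, which is the dependency the referee asks the authors to document.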

Circularity Check

0 steps flagged

No load-bearing circularity; RL policy trained once and evaluated externally

full rationale

The paper presents CropVLM as an RL-trained external module that selects crops to improve VLM performance on fine-grained tasks. It is trained once without bounding-box labels or synthetic evaluations and then paired with arbitrary VLMs on out-of-domain benchmarks. No equations, fitted parameters renamed as predictions, or self-citation chains are described in the abstract or central claims that would reduce the reported gains to quantities defined by the method's own inputs. The derivation chain relies on external benchmark evaluations rather than internal self-reference, making the approach self-contained against external benchmarks. Minor self-citation risk exists in any RL literature but is not load-bearing here.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that RL can discover generally useful cropping policies from reward signals alone and that those policies transfer across different VLMs and out-of-domain benchmarks.

axioms (1)
  • domain assumption: Reinforcement learning reward signals can guide selection of image crops that measurably improve downstream VLM accuracy on fine-grained tasks.
    Invoked by the claim that training succeeds without human-labeled boxes or synthetic evaluations.
invented entities (1)
  • CropVLM (no independent evidence)
    purpose: External module that learns to zoom into relevant image regions for any target VLM
    Newly introduced model whose policy is the load-bearing component of the method.

pith-pipeline@v0.9.0 · 5446 in / 1212 out tokens · 34490 ms · 2026-05-17T05:15:16.196092+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

    cs.CV 2026-05 unverdicted novelty 7.0

    CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...

  2. Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 2 Pith papers · 9 internal anchors

  1. [1]

    Flamingo: a Visual Language Model for Few-Shot Learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a Visual Language Model for Few-Shot Learning. Advances in Neural Information Processing Systems, 35:23716–23736,

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  3. [3]

    Scene text visual question answering

    Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4291–4301, 2019.

  4. [4]

    Matryoshka multimodal models

    Mu Cai, Jianwei Yang, Jianfeng Gao, and Yong Jae Lee. Matryoshka multimodal models. In Workshop on Video-Language Models @ NeurIPS, 2024.

  5. [5]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.

  6. [6]

    Visrl: Intention-driven visual perception via reinforced reasoning

    Zhangquan Chen, Xufang Luo, and Dongsheng Li. Visrl: Intention-driven visual perception via reinforced reasoning. arXiv preprint arXiv:2503.07523, 2025.

  7. [7]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.

  8. [8]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-Rank Adaptation of Large Language Models. Proceedings of the International Conference on Learning Representations, 2022.

  9. [9]

    Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

    Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search. arXiv preprint arXiv:2509.07969, 2025.

  10. [10]

    Semantic-clipping: Efficient vision-language modeling with semantic-guidedd visual selection

    Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, and Muhao Chen. Semantic-clipping: Efficient vision-language modeling with semantic-guidedd visual selection. arXiv preprint arXiv:2503.11794, 2025.

  11. [11]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, pages 34892–34916. Curran Associates, Inc., 2023.

  12. [12]

    Ocrbench: on the hidden mystery of ocr in large multimodal models

    Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences, 67(12):220102, 2024.

  13. [13]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.

  14. [14]

    SmolVLM: Redefining small and efficient multimodal models

    Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, and Thomas Wolf. SmolVLM: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299...

  15. [15]

    DocVQA: A Dataset for VQA on Document Images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A Dataset for VQA on Document Images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209, 2021.

  16. [16]

    Infographicvqa

    Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022.

  17. [17]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, pages 27730–27744, 2022.

  18. [18]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  19. [19]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  20. [20]

    Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024.

  21. [21]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  22. [22]

    When do we not need larger vision models?

    Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, and Trevor Darrell. When do we not need larger vision models? In European Conference on Computer Vision, pages 444–462. Springer, 2024.

  23. [23]

    Scaling vision pre-training to 4k resolution

    Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, et al. Scaling vision pre-training to 4k resolution. arXiv preprint arXiv:2503.19903, 2025.

  24. [24]

    Towards VQA Models That Can Read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, and Marcus Rohrbach. Towards VQA Models That Can Read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.

  25. [25]

    Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.

  26. [26]

    Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models

    Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7907–7915, 2025.

  27. [27]

    V*: Guided visual search as a core mechanism in multimodal llms

    Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024.

  28. [28]

    List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

    An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian J McAuley, Jianfeng Gao, et al. List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs. CoRR, 2024.

  29. [29]

    Mllms know where to look: Training-free perception of small visual details with multimodal llms

    Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. Mllms know where to look: Training-free perception of small visual details with multimodal llms. arXiv preprint arXiv:2502.17422, 2025.

  30. [30]

    Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

    Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl. arXiv preprint arXiv:2505.15436, 2025.

  31. [31]

    Unsupervised visual chain-of-thought reasoning via preference optimization

    Kesen Zhao, Beier Zhu, Qianru Sun, and Hanwang Zhang. Unsupervised visual chain-of-thought reasoning via preference optimization. arXiv preprint arXiv:2504.18397, 2025.

  32. [32]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.

A. Limitations and Ethical Considerations

The research reported in this paper aims to refine the capabilities of VLMs by enabling detail...