CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

Bruno Martins; Helder Dias; Miguel Carvalho

arXiv: 2511.19820 · v2 · submitted 2025-11-25 · cs.CV · cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-17 05:15 UTC · model grok-4.3

keywords: CropVLM, vision-language models, reinforcement learning, image cropping, fine-grained perception, zooming, high-resolution understanding, out-of-domain benchmarks

The pith

CropVLM trains an RL policy to pick image crops that let existing vision-language models handle fine details without any retraining or forgetting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language models can be made better at fine-grained tasks such as reading scene text or analyzing documents by adding an external module that learns to zoom into useful image regions. This module, called CropVLM, is trained once with reinforcement learning and needs no labeled boxes or costly synthetic tests. The key point is that the same trained policy can be attached to many different VLMs, including ones that never saw the target data, and it raises accuracy on those tasks while leaving the original model untouched.

Core claim

CropVLM is an external low-cost module trained with reinforcement learning to select and zoom into relevant image regions, which raises the performance of target vision-language models on fine-grained high-resolution tasks, especially out-of-domain benchmarks, without any modification or fine-tuning of the VLM itself.

What carries the argument

CropVLM, a reinforcement-learning policy that chooses which image regions to crop and present to the VLM for finer perception.
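
The crop-and-zoom step can be illustrated with a minimal sketch. It assumes the policy outputs a normalized `(x0, y0, x1, y1)` box and that the frozen VLM then receives both the original image and the enlarged crop. The toy grayscale image type and the names `crop_region`, `zoom`, and `build_vlm_inputs` are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# Toy stand-in for an image: rows of grayscale pixel values (an assumption for illustration).
Image = List[List[int]]


def crop_region(image: Image, box: Tuple[float, float, float, float]) -> Image:
    """Crop a normalized (x0, y0, x1, y1) box whose coordinates lie in [0, 1]."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    # Convert to pixel indices, clamp to the image, and keep at least one pixel.
    left = max(0, min(width - 1, int(x0 * width)))
    top = max(0, min(height - 1, int(y0 * height)))
    right = max(left + 1, min(width, round(x1 * width)))
    bottom = max(top + 1, min(height, round(y1 * height)))
    return [row[left:right] for row in image[top:bottom]]


def zoom(image: Image, factor: int) -> Image:
    """Enlarge an image by an integer factor with nearest-neighbour repetition."""
    zoomed: Image = []
    for row in image:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        zoomed.extend(list(wide_row) for _ in range(factor))
    return zoomed


def build_vlm_inputs(image: Image, box: Tuple[float, float, float, float],
                     factor: int = 2) -> Tuple[Image, Image]:
    """Return the pair a frozen VLM would see: the original image and the zoomed crop."""
    return image, zoom(crop_region(image, box), factor)
```

In this sketch the target VLM is never touched: only its inputs change, which is the property the page attributes to CropVLM.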

If this is right

The same CropVLM policy works with both open-source and proprietary VLMs.
Performance gains appear on tasks that need high-resolution detail, including out-of-domain cases.
The base VLM requires no changes, so its earlier capabilities remain intact.
Training avoids human bounding-box labels and expensive synthetic evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the policy learns general cues for useful detail, it may transfer across many VLMs without retraining.
This style of external zooming could let smaller or cheaper VLMs reach the accuracy of larger ones on detail-heavy work.
One could test whether task-specific policies or a hierarchy of zoom levels would give further gains.

Load-bearing premise

A single policy trained once with reinforcement learning will keep selecting helpful crops for any target VLM on any out-of-domain fine-grained task without needing further adaptation.

What would settle it

Pair the trained CropVLM policy with a previously unseen VLM on a fine-grained benchmark and measure whether accuracy rises, stays flat, or drops.
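
The test described above amounts to a paired comparison, sketched below under stated assumptions. Exact-match scoring is used only for illustration, since real benchmarks define their own metrics, and `accuracy`, `transfer_verdict`, and the `tolerance` parameter are hypothetical names.

```python
from typing import Sequence, Tuple


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Exact-match accuracy after trimming whitespace and lowercasing (illustrative metric)."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and the same length")
    hits = sum(p.strip().lower() == a.strip().lower() for p, a in zip(predictions, answers))
    return hits / len(answers)


def transfer_verdict(baseline_preds: Sequence[str], cropped_preds: Sequence[str],
                     answers: Sequence[str], tolerance: float = 0.0) -> Tuple[float, float, str]:
    """Compare the same VLM with and without crops on the same questions."""
    base = accuracy(baseline_preds, answers)
    with_crops = accuracy(cropped_preds, answers)
    delta = with_crops - base
    if delta > tolerance:
        verdict = "rises"
    elif delta < -tolerance:
        verdict = "drops"
    else:
        verdict = "flat"
    return base, with_crops, verdict
```

Setting `tolerance` above zero treats small differences as "flat", which matters when the benchmark is small or the VLM's decoding is not deterministic.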

Figures

Figures reproduced from arXiv: 2511.19820 by Bruno Martins, Helder Dias, Miguel Carvalho.

**Figure 1.** Overview of CropVLM paired with LLaVA. CropVLM dynamically selects informative image regions to boost fine-grained perception while keeping the target VLM frozen.

**Figure 2.** The overall CropVLM training procedure. The orange and purple lines represent training with an accuracy-based reward or with …

**Figure 3.** TextVQA performance across multiple bounding box …

**Figure 4.** Qualitative examples from the V* Benchmark, where the first 6 cases are successful and the last 2 are failures. Next to each …

**Figure 5.** Qualitative examples from TextVQA, where the first 6 cases are successful and the last 2 are failures. Next to each image, we …

Original abstract

Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM as an external low-cost method for boosting performance, enabling VLMs to dynamically "zoom in" on relevant image regions, enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal, and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably for benchmarks that are out-of-domain for the target VLM, without modifying or fine-tuning the VLM, thus avoiding catastrophic forgetting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CropVLM trains one external RL cropper to improve fine-grained VLM tasks without labels or model changes, but the transfer claims rest on thin visible evidence.

The main point is that this paper trains a single RL policy to crop images for better detail capture in VLMs, then pairs that policy with unmodified models on fine-grained tasks like scene text or documents. It avoids human boxes and any VLM fine-tuning, which keeps the original model intact and sidesteps forgetting issues. That setup is the clearest new element compared to earlier cropping work that often relied on supervision or task-specific tuning. The paper handles the motivation cleanly by framing the cropper as a low-cost add-on usable with both open and closed VLMs. Credit is due for that framing and for keeping training cheap without synthetic rollouts.

The experiments presumably back the out-of-domain gains, but the abstract gives no numbers, no baselines, and no error bars, so the central performance claim stays hard to judge from the summary alone. The weakest link is exactly what the generalization stress test flags: nothing in the provided text shows that the learned policy works for arbitrary target VLMs on truly unseen distributions without further adaptation or reward redesign. If the full paper only tests on the VLMs used to shape the reward, the VLM-agnostic claim weakens. Readers who build practical VLM pipelines or look for modular perception fixes would get the most from this if the transfer results hold. It is coherent enough on its own terms to merit referee time rather than a desk reject, though any review would need to press hard on the cross-model experiments and reward details. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CropVLM, a reinforcement learning-based external module that learns to select image crops for dynamic zooming to enhance fine-grained perception in vision-language models (VLMs). Trained once without human-labeled bounding boxes or expensive synthetic evaluations, the policy is intended to pair with arbitrary open-source or proprietary VLMs, delivering improvements on high-resolution tasks and out-of-domain benchmarks while avoiding any modification or fine-tuning of the target VLM.

Significance. If the central empirical claims hold, the work offers a low-cost, generalizable plug-in for boosting VLM performance on detail-oriented tasks such as scene-text recognition and document analysis. The design choice to avoid VLM fine-tuning (and thus catastrophic forgetting) and the single-training paradigm are potentially valuable for practical deployment across model families.

major comments (2)

[Abstract, §4] Abstract and §4: the central claim of 'significant improvements' on out-of-domain benchmarks for arbitrary target VLMs rests on quantitative results that are not visible in the abstract and whose robustness (baselines, error bars, ablation controls) is not summarized; this directly affects evaluation of the reported gains.
[§3.1–3.2] §3.1–3.2: the reward formulation and training task distribution used to train the single RL policy are not specified in sufficient detail to establish that the learned cropping strategy is VLM-agnostic rather than tuned to the particular VLMs or tasks supplying the reward signal; this is load-bearing for the no-adaptation, cross-VLM generalization claim.

minor comments (2)

[Figure 2, §4.3] Figure 2 and §4.3: crop visualization panels would benefit from explicit overlay of the selected region coordinates and the downstream VLM output to allow direct inspection of the zooming effect.
[§3] Notation in §3: the distinction between the policy network input (full image features) and the crop selection output could be clarified with an explicit diagram or equation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and have revised the manuscript to improve clarity and support for the central claims.

Referee: [Abstract, §4] Abstract and §4: the central claim of 'significant improvements' on out-of-domain benchmarks for arbitrary target VLMs rests on quantitative results that are not visible in the abstract and whose robustness (baselines, error bars, ablation controls) is not summarized; this directly affects evaluation of the reported gains.

Authors: We agree that the abstract would benefit from explicitly summarizing the key quantitative results and evaluation robustness to better support the central claims. In the revised manuscript, we will update the abstract to report specific performance gains (e.g., accuracy improvements on out-of-domain high-resolution tasks) and reference the use of multiple baselines and controls. For §4, we will add a concise summary highlighting error bars from repeated evaluations and the main ablation controls to facilitate assessment of the reported gains.
revision: yes

Referee: [§3.1–3.2] §3.1–3.2: the reward formulation and training task distribution used to train the single RL policy are not specified in sufficient detail to establish that the learned cropping strategy is VLM-agnostic rather than tuned to the particular VLMs or tasks supplying the reward signal; this is load-bearing for the no-adaptation, cross-VLM generalization claim.

Authors: We acknowledge that greater detail on these elements is needed to substantiate the VLM-agnostic generalization claim. The reward is derived from the target VLM's own output quality using a task-agnostic metric on a broad, diverse set of high-resolution images drawn from multiple domains, with no overlap to evaluation benchmarks and without VLM-specific fine-tuning. To address the concern directly, we will expand §3.1–3.2 with the exact reward formulation (including the metric for response quality) and the composition of the training task distribution (e.g., domain breakdown and sampling strategy). This will clarify that the policy learns general zooming behaviors applicable across VLMs.
revision: yes

Circularity Check

0 steps flagged

No load-bearing circularity; RL policy trained once and evaluated externally

The paper presents CropVLM as an RL-trained external module that selects crops to improve VLM performance on fine-grained tasks. It is trained once without bounding-box labels or synthetic evaluations and then paired with arbitrary VLMs on out-of-domain benchmarks. No equations, fitted parameters renamed as predictions, or self-citation chains are described in the abstract or central claims that would reduce the reported gains to quantities defined by the method's own inputs. The derivation chain relies on external benchmark evaluations rather than internal self-reference, making the approach self-contained against external benchmarks. Minor self-citation risk exists in any RL literature but is not load-bearing here.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that RL can discover generally useful cropping policies from reward signals alone and that those policies transfer across different VLMs and out-of-domain benchmarks.

axioms (1)

Domain assumption: Reinforcement learning reward signals can guide selection of image crops that measurably improve downstream VLM accuracy on fine-grained tasks.
Invoked by the claim that training succeeds without human-labeled boxes or synthetic evaluations.

invented entities (1)

CropVLM (no independent evidence)
purpose: External module that learns to zoom into relevant image regions for any target VLM
Newly introduced model whose policy is the load-bearing component of the method.

pith-pipeline@v0.9.0 · 5446 in / 1212 out tokens · 34490 ms · 2026-05-17T05:15:16.196092+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes... GRPO... reward formulations: Accuracy-Based Reward... Likelihood-Based Reward $R(I_o, I_c, q, a^*) = \sum_t \log p(a^*_t \mid I_o, I_c, q, a^*_{<t})$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: The model is trained once and can be paired with both open-source and proprietary VLMs... without modifying or fine-tuning the VLM
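
The likelihood-based reward quoted in the first passage, together with the GRPO reference, can be made concrete with a small sketch. It assumes the frozen VLM exposes per-token probabilities for the reference answer and that advantages are standardized within a group of sampled crops, as in common GRPO implementations. The function names, the population standard deviation, and the `eps` term are illustrative choices, not details confirmed by the paper.

```python
import math
from typing import List, Sequence


def likelihood_reward(token_probs: Sequence[float]) -> float:
    """Sum of log p(a*_t | I_o, I_c, q, a*_<t) over the reference-answer tokens.

    token_probs holds the probability the frozen VLM assigns to each reference
    token, given the original image, the crop, the question, and earlier tokens.
    """
    if not token_probs or any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("token probabilities must be non-empty and lie in (0, 1]")
    return sum(math.log(p) for p in token_probs)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of crops sampled for the same question."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # eps avoids division by zero when every crop in the group earns the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Three hypothetical crops for one question, scored by the reference-answer likelihood.
rewards = [
    likelihood_reward([0.9, 0.8]),  # crop that makes the answer easy to read
    likelihood_reward([0.5, 0.5]),  # partially helpful crop
    likelihood_reward([0.1, 0.2]),  # crop that misses the relevant region
]
advantages = group_relative_advantages(rewards)
```

Because the advantage is relative to the group mean, a crop is reinforced only when it makes the reference answer more likely than the other crops sampled for the same question.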

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
cs.CV · 2026-05 · unverdicted · novelty 7.0

CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...

Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
cs.CV · 2026-04 · unverdicted · novelty 6.0

Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 2 Pith papers · 9 internal anchors

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a Visual Language Model for Few-Shot Learning. Advances in Neural Information Processing Systems, 35:23716–23736.

[2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

[3] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4291–4301, 2019.

[4] Mu Cai, Jianwei Yang, Jianfeng Gao, and Yong Jae Lee. Matryoshka multimodal models. In Workshop on Video-Language Models @ NeurIPS, 2024.

[5] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.

[6] Zhangquan Chen, Xufang Luo, and Dongsheng Li. Visrl: Intention-driven visual perception via reinforced reasoning. arXiv preprint arXiv:2503.07523, 2025.

[7] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.

[8] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-Rank Adaptation of Large Language Models. Proceedings of the International Conference on Learning Representations, 2022.
[9] Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search. arXiv preprint arXiv:2509.07969, 2025.

[10] Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, and Muhao Chen. Semantic-clipping: Efficient vision-language modeling with semantic-guided visual selection. arXiv preprint arXiv:2503.11794, 2025.

[11] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, pages 34892–34916. Curran Associates, Inc., 2023.

[12] Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences, 67(12):220102, 2024.

[13] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.

[14] Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, and Thomas Wolf. SmolVLM: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299...

[15] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A Dataset for VQA on Document Images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209, 2021.

[16] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022.
[17] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, pages 27730–27744, 2022.

[18] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[19] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[20] Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024.

[21] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[22] Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, and Trevor Darrell. When do we not need larger vision models? In European Conference on Computer Vision, pages 444–462. Springer, 2024.

[23] Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, et al. Scaling vision pre-training to 4k resolution. arXiv preprint arXiv:2503.19903, 2025.

[24] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, and Marcus Rohrbach. Towards VQA Models That Can Read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
[25] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.

[26] Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7907–7915, 2025.

[27] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024.

[28] An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian J McAuley, Jianfeng Gao, et al. List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs. CoRR, 2024.

[29] Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. Mllms know where to look: Training-free perception of small visual details with multimodal llms. arXiv preprint arXiv:2502.17422, 2025.

[30] Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl. arXiv preprint arXiv:2505.15436, 2025.

[31] Kesen Zhao, Beier Zhu, Qianru Sun, and Hanwang Zhang. Unsupervised visual chain-of-thought reasoning via preference optimization. arXiv preprint arXiv:2504.18397, 2025.

[32] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.

[1] [1]

Flamingo: a Visual Language Model for Few-Shot Learning.Advances in Neural Information Processing Systems, 35:23716–23736,

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a Visual Language Model for Few-Shot Learning.Advances in Neural Information Processing Systems, 35:23716–23736,

work page

[2] [2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 4

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Scene text visual question answering

Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marc ¸al Rusinol, Ernest Valveny, CV Jawahar, and Dimos- thenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4291–4301, 2019. 4

work page 2019

[4] [4]

Matryoshka multimodal models

Mu Cai, Jianwei Yang, Jianfeng Gao, and Yong Jae Lee. Matryoshka multimodal models. InWorkshop on Video- Language Models@ NeurIPS, 2024. 1, 2

work page 2024

[5] [5]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024. 1

work page 2024

[6] [6]

Visrl: Intention-driven visual perception via reinforced reasoning

Zhangquan Chen, Xufang Luo, and Dongsheng Li. Visrl: Intention-driven visual perception via reinforced reasoning. arXiv preprint arXiv:2503.07523, 2025. 2, 3, 4, 7

work page arXiv 2025

[7] [7]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. Flashattention-2: Faster attention with bet- ter parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023. 8

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-Rank Adaptation of Large Language Models. Proceedings of the International Conference on Learning Representations, 2022. 5

work page 2022

[9] [9]

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning pat- terns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025. 2, 3

work page internal anchor Pith review arXiv 2025

[10] [10]

Semantic-clipping: Efficient vision-language model- ing with semantic-guidedd visual selection.arXiv preprint arXiv:2503.11794, 2025

Bangzheng Li, Fei Wang, Wenxuan Zhou, Nan Xu, Ben Zhou, Sheng Zhang, Hoifung Poon, and Muhao Chen. Semantic-clipping: Efficient vision-language model- ing with semantic-guidedd visual selection.arXiv preprint arXiv:2503.11794, 2025. 2

work page arXiv 2025

[11] [11]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InAdvances in Neural Information Processing Systems, pages 34892–34916. Curran Associates, Inc., 2023. 1

work page 2023

[12] [12]

Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China Information Sciences, 67(12):220102, 2024

Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China Information Sciences, 67(12):220102, 2024. 1, 2

work page 2024

[13] [13]

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual- rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785, 2025. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

SmolVLM: Redefining small and efficient multimodal models

Andr ´es Marafioti, Orr Zohar, Miquel Farr ´e, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tun- stall, Leandro von Werra, and Thomas Wolf. SmolVLM: Redefining small and efficient multimodal models.arXiv preprint arXiv:2504.05299...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

DocVQA: A Dataset for VQA on Document Images

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A Dataset for VQA on Document Images. In Proceedings of the IEEE/CVF Winter Conference on Appli- cations of Computer Vision, pages 2200–2209, 2021. 4


[16]

Infographicvqa

Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022. 4


[17]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, pages 27730–27744, 2022. 3


[18]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 1


[19]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 3


[20]

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024. 1, 2, 3


[21]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 3


[22]

When do we not need larger vision models?

Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, and Trevor Darrell. When do we not need larger vision models? In European Conference on Computer Vision, pages 444–462. Springer, 2024. 1, 2


[23]

Scaling vision pre-training to 4k resolution

Baifeng Shi, Boyi Li, Han Cai, Yao Lu, Sifei Liu, Marco Pavone, Jan Kautz, Song Han, Trevor Darrell, Pavlo Molchanov, et al. Scaling vision pre-training to 4k resolution. arXiv preprint arXiv:2503.19903, 2025. 1, 2


[24]

Towards VQA Models That Can Read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, and Marcus Rohrbach. Towards VQA Models That Can Read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019. 4


[25]

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024. 2


[26]

Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models

Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7907–7915, 2025. 6


[27]

V*: Guided visual search as a core mechanism in multimodal llms

Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024. 2, 6


[28]

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian J McAuley, Jianfeng Gao, et al. List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs. CoRR, 2024. 2


[29]

Mllms know where to look: Training-free perception of small visual details with multimodal llms

Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. Mllms know where to look: Training-free perception of small visual details with multimodal llms. arXiv preprint arXiv:2502.17422, 2025. 2, 3, 7


[30]

Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl. arXiv preprint arXiv:2505.15436, 2025. 2, 3


[31]

Unsupervised visual chain-of-thought reasoning via preference optimization

Kesen Zhao, Beier Zhu, Qianru Sun, and Hanwang Zhang. Unsupervised visual chain-of-thought reasoning via preference optimization. arXiv preprint arXiv:2504.18397, 2025. 1, 2, 3, 4, 7


[32]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025. 2, 3

A. Limitations and Ethical Considerations

The research reported in this paper aims to refine the capabilities of VLMs by enabling detail...
