Recognition: no theorem link
Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
Pith reviewed 2026-05-14 21:31 UTC · model grok-4.3
The pith
A two-stage reinforcement learning framework coarsens the global image to create an information gap, training MLLMs to rely on cropped-region details for high-resolution visual question answering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By deliberately reducing the granularity of the global image input, the reinforcement learning objective creates an information gap that forces the model to extract answers from the details inside the cropped region. A subsequent grounding loss, trained on limited bounding-box annotations, then improves the precision of the cropping decisions themselves. Together these steps produce measurably higher attention to the cropped content and deliver state-of-the-art results on high-resolution visual question-answering tasks.
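The abstract specifies only that the second stage applies "a grounding loss, using a small number of bounding box annotations." The sketch below is a minimal, assumption-laden illustration in PyTorch of one plausible IoU-plus-L1 form of such a loss on model-predicted crop boxes; the exact formulation and weighting used by the authors are not stated.

```python
# Minimal sketch of a stage-two grounding loss on predicted crop boxes.
# The IoU-plus-L1 form below is an assumption, not the authors' exact loss.
import torch

def box_iou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """IoU between predicted and ground-truth boxes in normalized (x1, y1, x2, y2) format."""
    x1 = torch.maximum(pred[:, 0], gt[:, 0])
    y1 = torch.maximum(pred[:, 1], gt[:, 1])
    x2 = torch.minimum(pred[:, 2], gt[:, 2])
    y2 = torch.minimum(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]).clamp(min=0) * (pred[:, 3] - pred[:, 1]).clamp(min=0)
    area_g = (gt[:, 2] - gt[:, 0]).clamp(min=0) * (gt[:, 3] - gt[:, 1]).clamp(min=0)
    return inter / (area_p + area_g - inter + 1e-6)

def grounding_loss(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    """(1 - IoU) plus an L1 term, averaged over the annotated subset only."""
    iou = box_iou(pred_boxes, gt_boxes)
    l1 = torch.abs(pred_boxes - gt_boxes).mean(dim=-1)
    return ((1.0 - iou) + l1).mean()

# Illustrative values for two annotated samples.
pred = torch.tensor([[0.10, 0.20, 0.40, 0.55], [0.50, 0.50, 0.90, 0.95]])
gt = torch.tensor([[0.12, 0.22, 0.38, 0.50], [0.55, 0.48, 0.92, 0.90]])
print(grounding_loss(pred, gt))
```

Only the annotated subset would contribute to such a term, which is what keeps the bounding-box requirement small.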
What carries the argument
The information gap mechanism, created by lowering the resolution or detail level of the global image so that answer accuracy depends on information supplied only by the crop.
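To make the mechanism concrete, here is a minimal sketch of how the granularity reduction and an information-gain-style reward could be wired together. The downsampling target, the PIL-based implementation, and the reward shaping in `info_gap_reward` are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of the granularity reduction behind the "Information Gap":
# the global view is aggressively downsampled while the cropped region keeps
# full resolution. Target size and reward shaping are illustrative assumptions.
from PIL import Image

def build_inputs(image_path: str, crop_box: tuple[int, int, int, int],
                 coarse_side: int = 448) -> tuple[Image.Image, Image.Image]:
    """Return (coarse global view, full-resolution crop) for one rollout step."""
    img = Image.open(image_path).convert("RGB")
    # Coarse global view: fine details (small text, distant objects) are lost here,
    # so a correct answer must draw on the crop instead.
    scale = coarse_side / max(img.size)
    coarse_global = img.resize((max(1, int(img.width * scale)),
                                max(1, int(img.height * scale))), Image.BICUBIC)
    # Full-resolution crop of the model-chosen region of interest.
    crop = img.crop(crop_box)
    return coarse_global, crop

def info_gap_reward(correct_with_crop: bool, correct_global_only: bool) -> float:
    """Reward answers that the coarse global view alone could not support
    (an information-gain-style signal; the exact shaping is an assumption)."""
    if correct_with_crop and not correct_global_only:
        return 1.0          # the answer relies on the cropped details
    if correct_with_crop:
        return 0.5          # correct, but the coarse view already sufficed
    return 0.0
```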
If this is right
- The model exhibits measurably higher attention weights on the cropped regions during inference.
- Performance reaches state-of-the-art levels on high-resolution visual question-answering benchmarks.
- The framework operates without any trajectory-level supervision.
- Only a small number of bounding-box annotations are needed for the second stage.
Where Pith is reading between the lines
- The same gap-creation idea could be tested in other agentic multimodal pipelines where global context currently dominates local tool use.
- Lower-resolution global views might become a general training trick to encourage selective focus without extra labels.
- The method hints that reward shaping through controlled information loss can substitute for expensive dense supervision in visual agents.
Load-bearing premise
Making the global image coarser will reliably push the model to base its answers on the cropped region's details rather than on whatever remains visible globally.
What would settle it
After training, replace the cropped patch with unrelated content while leaving the global image unchanged; if accuracy stays the same, the model is not actually using the crop.
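A minimal sketch of that control, assuming a hypothetical `model.answer(global_image, crop, question)` interface and a pool of unrelated distractor crops:

```python
# Sketch of the proposed control: swap the cropped patch for unrelated content
# while leaving the (coarse) global image unchanged, then compare accuracy.
# `model.answer` is a hypothetical interface; the evaluation loop is illustrative.
import random

def crop_swap_control(model, dataset, distractor_crops) -> tuple[float, float]:
    """Return (accuracy with true crops, accuracy with swapped crops)."""
    hits_true, hits_swap = 0, 0
    for sample in dataset:  # each sample: global image, true crop, question, answer
        pred_true = model.answer(sample["global"], sample["crop"], sample["question"])
        swapped = random.choice(distractor_crops)  # unrelated patch
        pred_swap = model.answer(sample["global"], swapped, sample["question"])
        hits_true += int(pred_true == sample["answer"])
        hits_swap += int(pred_swap == sample["answer"])
    n = len(dataset)
    return hits_true / n, hits_swap / n

# A large drop under swapping indicates genuine reliance on the crop;
# near-identical accuracy would mean the crop is not actually being used.
```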
Original abstract
To enhance the perception and reasoning capabilities of multimodal large language models in complex visual scenes, recent research has introduced agent-based workflows. In these works, MLLMs autonomously utilize image cropping tool to analyze regions of interest for question answering. While existing training strategies, such as those employing supervised fine-tuning and reinforcement learning, have made significant progress, our empirical analysis reveals a key limitation. We demonstrate the model's strong reliance on global input and its weak dependence on the details within the cropped region. To address this issue, we propose a novel two-stage reinforcement learning framework that does not require trajectory supervision. In the first stage, we introduce the ``Information Gap" mechanism by adjusting the granularity of the global image. This mechanism trains the model to answer questions by focusing on cropped key regions, driven by the information gain these regions provide. The second stage further enhances cropping precision by incorporating a grounding loss, using a small number of bounding box annotations. Experiments show that our method significantly enhances the model's attention to cropped regions, enabling it to achieve state-of-the-art performance on high-resolution visual question-answering benchmarks. Our method provides a more efficient approach for perceiving and reasoning fine-grained details in MLLMs. Code is available at: https://github.com/XuanPu-Z/LFPC.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-stage reinforcement learning framework for multimodal large language models (MLLMs) to improve cropping and focus on relevant regions in high-resolution images for visual question answering. The first stage introduces an 'Information Gap' mechanism by reducing the granularity of the global image input to encourage reliance on cropped details via information gain, without trajectory supervision. The second stage adds a grounding loss using a small set of bounding-box annotations to refine cropping precision. The authors report that this addresses the observed over-reliance on global input and achieves state-of-the-art results on high-resolution VQA benchmarks.
Significance. If the empirical gains hold under proper controls, the work offers a practical, low-supervision route to better fine-grained perception in agentic MLLM workflows. The absence of trajectory supervision and the use of only minimal bounding-box labels are notable strengths that could improve scalability over fully supervised cropping methods.
major comments (3)
- [Abstract, §3] The central claim that the Information Gap mechanism (via global granularity reduction) is what shifts the policy toward cropped-region reliance lacks any quantitative ablation or control experiment. No accuracy delta is reported when the global image is ablated post-training, nor is a baseline shown that applies only the RL reward and grounding loss without the gap.
- [Experiments] No ablation numbers, error analysis, or failure-case breakdown are provided to isolate the contribution of each stage. The abstract claims empirical gains and an identified limitation, yet supplies no concrete metrics on how the gap is implemented or on its causal effect.
- [§4] The state-of-the-art claim on high-resolution VQA benchmarks rests on unreported experimental details; without tables showing per-benchmark deltas, baseline comparisons, or statistical significance, it is impossible to assess whether the two-stage pipeline outperforms prior RL or SFT cropping methods for the stated reason.
minor comments (2)
- [Abstract] The abstract mentions 'a small number of bounding box annotations' but does not specify the exact count or how they are sampled; this detail should be added for reproducibility.
- [Figures] Figure captions and method diagrams should explicitly label the granularity adjustment operation and the two-stage training flow to improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas where our presentation can be strengthened. We address each major comment below and will incorporate the suggested additions in the revised manuscript.
Point-by-point responses
Referee: [Abstract, §3] The central claim that the Information Gap mechanism (via global granularity reduction) is what shifts the policy toward cropped-region reliance lacks any quantitative ablation or control experiment. No accuracy delta is reported when the global image is ablated post-training, nor is a baseline shown that applies only the RL reward and grounding loss without the gap.
Authors: We agree that explicit quantitative controls are required to substantiate the causal role of the Information Gap. In the revision we will add a dedicated ablation: a variant trained with the RL reward and grounding loss but without granularity reduction of the global input. We will report accuracy deltas on the high-resolution VQA benchmarks both with and without the gap, as well as a post-training ablation in which the global image is removed, measuring how much of the accuracy is carried by the cropped region alone. Revision: yes.
Referee: [Experiments] No ablation numbers, error analysis, or failure-case breakdown are provided to isolate the contribution of each stage. The abstract claims empirical gains and an identified limitation, yet supplies no concrete metrics on how the gap is implemented or on its causal effect.
Authors: We will expand the Experiments section with (i) stage-wise ablations that isolate the first-stage Information Gap from the second-stage grounding loss, (ii) quantitative metrics describing the granularity-reduction levels and the resulting information-gain values, and (iii) an error analysis together with representative failure cases that illustrate the remaining limitations in cropping precision. Revision: yes.
Referee: [§4] The state-of-the-art claim on high-resolution VQA benchmarks rests on unreported experimental details; without tables showing per-benchmark deltas, baseline comparisons, or statistical significance, it is impossible to assess whether the two-stage pipeline outperforms prior RL or SFT cropping methods for the stated reason.
Authors: We will revise §4 to include comprehensive tables that report per-benchmark absolute scores and deltas relative to the strongest prior RL and SFT cropping baselines, together with statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) on the reported gains. Revision: yes.
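For the promised significance tests, a paired bootstrap over per-sample correctness is one straightforward option. The sketch below is illustrative; the correctness vectors are synthetic placeholders, not the paper's results.

```python
# Minimal sketch of a paired bootstrap for per-benchmark accuracy gains.
# The per-sample correctness arrays are placeholders for real evaluation outputs.
import numpy as np

def bootstrap_gain_ci(ours: np.ndarray, baseline: np.ndarray,
                      n_boot: int = 10_000, alpha: float = 0.05,
                      seed: int = 0) -> tuple[float, float, float]:
    """Mean accuracy gain of `ours` over `baseline` (paired 0/1 arrays) with a 95% CI."""
    rng = np.random.default_rng(seed)
    diffs = ours.astype(float) - baseline.astype(float)
    boots = np.array([rng.choice(diffs, size=diffs.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), lo, hi

# Illustrative use with synthetic 0/1 correctness vectors:
rng = np.random.default_rng(1)
base = rng.binomial(1, 0.62, size=500)
ours = np.clip(base + rng.binomial(1, 0.08, size=500), 0, 1)
print(bootstrap_gain_ci(ours, base))
```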
Circularity Check
No circularity in empirical RL training pipeline
Full rationale
The paper introduces an empirical two-stage reinforcement learning framework for MLLM cropping, using an information-gap mechanism (via global granularity adjustment) in stage one and a grounding loss with external bounding-box annotations in stage two. No equations, derivations, or predictions are defined that reduce by construction to fitted parameters, self-citations, or renamed inputs. Performance claims rest on benchmark experiments rather than internal consistency loops, and the method is presented as a practical training recipe without load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Coarsening global image granularity creates an information gap that can be used as a training signal without trajectory supervision.
invented entities (1)
- Information Gap mechanism (no independent evidence)
Reference graph
Works this paper leans on
- [1] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
- [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report, 2025.
- [3] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.
- [4] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101, 2024.
- [5] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- [6] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
- [7] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024.
- [8] Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, and Cha Zhang. ReFocus: Visual editing as a chain of thought for structured image understanding. arXiv preprint arXiv:2501.05452, 2025.
- [9] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [10] Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
- [11] Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [12] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
- [13] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.
- [14] Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search. arXiv preprint arXiv:2509.07969, 2025.
- [15] Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, and Wayne Zhang. ProxyCLIP: Proxy attention improves CLIP for open-vocabulary segmentation. In European Conference on Computer Vision, pages 70–88. Springer, 2024.
- [16] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
- [17] Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought. arXiv preprint arXiv:2501.07542, 2025.
- [18] Geng Li, Jinglin Xu, Yunzhen Zhao, and Yuxin Peng. DyFo: A training-free dynamic focus visual search for enhancing LMMs in fine-grained visual understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9098–9108, 2025.
- [19] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
- [20] Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, and Wenbing Huang. STAR-R1: Spatial transformation reasoning by reinforcing multimodal LLMs. arXiv preprint arXiv:2505.15804, 2025.
- [21] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- [22] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
- [23] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024.
- [24] Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-Zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520, 2025.
- [25] Zuyan Liu, Yuhao Dong, Yongming Rao, Jie Zhou, and Jiwen Lu. Chain-of-Spot: Interactive reasoning improves large vision-language models. arXiv preprint arXiv:2403.12966, 2024.
- [26] Shunqi Mao, Chaoyi Zhang, and Weidong Cai. Through the magnifying glass: Adaptive perception magnification for hallucination-free VLM decoding. arXiv preprint arXiv:2503.10183, 2025.
- [27] Debjyoti Mondal, Suraj Modi, Subhadarshi Panda, Rituraj Singh, and Godawari Sudhakar Rao. KAM-CoT: Knowledge augmented multimodal chain-of-thoughts reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 18798–18806, 2024.
- [28] Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. SpaceR: Reinforcing MLLMs in video spatial reasoning. arXiv preprint arXiv:2504.01805, 2025.
- [29] Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. LMM-R1: Empowering 3B LMMs with strong reasoning abilities through two-stage rule-based RL. arXiv preprint arXiv:2503.07536, 2025.
- [30] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment Anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
- [31] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [32] Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. VLM-R1: A stable and generalizable R1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025.
- [33] Dianmo Sheng, Dongdong Chen, Zhentao Tan, Qiankun Liu, Qi Chu, Jianmin Bao, Tao Gong, Bin Liu, Shengwei Xu, and Nenghai Yu. Towards more unified in-context visual understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13362–13372, 2024.
- [34] Dianmo Sheng, Dongdong Chen, Zhentao Tan, Qiankun Liu, Qi Chu, Tao Gong, Bin Liu, Jing Han, Wenbin Tu, Shengwei Xu, et al. UniCL-SAM: Uncertainty-driven in-context segmentation with part prototype discovery. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 20201–20211, 2025.
- [35] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
- [36] Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Tong Geng, Ying Nian Wu, Yongfeng Zhang, and Dongfang Liu. Visual agents as fast and slow thinkers. arXiv preprint arXiv:2408.08862, 2024.
- [37] Hao Tang, Chenwei Xie, Haiyang Wang, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, and Liwei Wang. UFO: A unified approach to fine-grained visual perception via open-ended language interface. arXiv preprint arXiv:2503.01342, 2025.
- [38] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [39] Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025.
- [40] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [41] Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7907–7915, 2025.
- [42] Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, and Tiejun Huang. SegGPT: Segmenting everything in context. arXiv preprint arXiv:2304.03284, 2023.
- [43] Yana Wei, Liang Zhao, Kangheng Lin, En Yu, Yuang Peng, Runpei Dong, Jianjian Sun, Haoran Wei, Zheng Ge, Xiangyu Zhang, et al. Perception in reflection. arXiv preprint arXiv:2504.07165, 2025.
- [44] Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965, 2025.
- [45] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024.
- [46] Weijia Wu, Yuzhong Zhao, Hao Chen, Yuchao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, and Chunhua Shen. DatasetDM: Synthesizing data with perception annotations using diffusion models. Advances in Neural Information Processing Systems, 36:54683–54695, 2023.
- [47] Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2945–2954, 2023.
- [48] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.
- [49] Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. MLLMs know where to look: Training-free perception of small visual details with multimodal LLMs. arXiv preprint arXiv:2502.17422, 2025.
- [50] Renrui Zhang, Zhengkai Jiang, Ziyu Guo, Shilin Yan, Junting Pan, Hao Dong, Peng Gao, and Hongsheng Li. Personalize Segment Anything Model with one shot. arXiv preprint arXiv:2305.03048, 2023.
- [51] Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-Focus: Adaptive visual search and zooming for multimodal reasoning via RL. arXiv e-prints, 2025.
- [52] Hanqing Zhao, Dianmo Sheng, Jianmin Bao, Dongdong Chen, Dong Chen, Fang Wen, Lu Yuan, Ce Liu, Wenbo Zhou, Qi Chu, et al. X-Paste: Revisiting scalable copy-paste for instance segmentation using CLIP and StableDiffusion. In International Conference on Machine Learning, pages 42098–42109. PMLR, 2023.
- [53] Xuanpu Zhao, Dianmo Sheng, Zhentao Tan, Zhiwei Zhao, Tao Gong, Qi Chu, Bin Liu, and Nenghai Yu. Training-free open-vocabulary semantic segmentation via diverse prototype construction and sub-region matching. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10474–10482, 2025.
- [54] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. DeepEyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.
- [55] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
Internal anchors (sections of this paper)
- [56] Training Dynamics Analysis: evolution of key metrics during the first and second training stages, comparing the baseline with the Stage-I model with Info Gap (Figure 5).
- [57] More Preliminary Analysis: additional evidence of DeepEyes's insufficient attention to cropped regions, including a GT test on samples where DeepEyes's predicted regions poorly cover the ground truth; all experiments in this section use at most 1,024 visual tokens.
- [58] More Ablation Studies: data-utilization strategies, comparing single-stage 'Mix Data' training (collected data mixed 1:1 with Visual Probe data) against the adopted two-stage 'Stage-II' approach (Table 8).
- [59] Benchmarks and Metrics Details: evaluation on three benchmarks, including HR-Bench 8K (average resolution 7680; Fine-grained Single-instance Perception and Fine-grained Cross-instance Perception sub-tasks) and HR-Bench 4K (8K images cropped around the queried objects).
- [60] Training Details: hyper-parameters (Table 12; train batch size 256, 16 rollouts per sample, PPO mini-batch size 32) and a training-time comparison with DeepEyes (Table 11) showing a shorter total training time.
- [61] Visualization Analysis: inference-process comparison of DeepEyes and Stage-I (Figure 7), including cases where both models crop correctly but only Stage-I answers correctly.
- [62] Limitations and Future Works: the sub-optimal tool invocation that stems from a rigid formalization of the cropping tool, addressed by the information gap mechanism and improved cropping precision.
discussion (0)