pith. machine review for the scientific record

arxiv: 2604.19945 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

Visual Reasoning through Tool-supervised Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords tool-supervised reinforcement learning · visual reasoning · multimodal large language models · curriculum learning · tool use · image manipulation tools · reinforcement learning

The pith

A two-stage reinforcement learning curriculum with direct tool supervision enables multimodal models to master simple visual tools before tackling complex reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how to make multimodal large language models better at using tools to solve hard visual reasoning problems. It introduces ToolsRL, which trains the model in two stages: first using only tool-specific rewards to learn actions like zoom-in, rotate, flip, and drawing points or lines, then adding accuracy rewards while allowing tool calls. This staged approach is meant to prevent the model from having to learn tool use and task success at the same time. A sympathetic reader would care if this separation leads to more reliable visual reasoning without the usual training conflicts.
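To make the staging concrete, here is a minimal runnable sketch of such a two-stage reward schedule. Everything in it is illustrative: the paper does not publish training code, so the names (Episode, tool_reward, accuracy_reward) and the reward shapes are this review's assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field

# Hypothetical episode record; field names are illustrative, not from the paper.
@dataclass
class Episode:
    tool_calls: list = field(default_factory=list)       # e.g. [("rotate", 90)]
    gold_tool_calls: list = field(default_factory=list)  # supervised target calls
    answer: str = ""
    gold_answer: str = ""

def tool_reward(call, gold_calls) -> float:
    """Stage-1 signal: credit a tool call that matches a supervised target."""
    return 1.0 if call in gold_calls else 0.0

def accuracy_reward(answer: str, gold: str) -> float:
    """Stage-2 signal: exact match on the final answer."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def episode_reward(ep: Episode, stage: int) -> float:
    if stage == 1:
        # Tool-specific rewards only; the answer is deliberately ignored.
        return sum(tool_reward(c, ep.gold_tool_calls) for c in ep.tool_calls)
    # Accuracy-targeted reward; tool calls are allowed but not separately scored.
    return accuracy_reward(ep.answer, ep.gold_answer)

# A Stage-1 episode earns full credit for correct tool use even when the
# final answer is wrong; the same episode scores zero under Stage 2.
ep = Episode(tool_calls=[("rotate", 90)], gold_tool_calls=[("rotate", 90)],
             answer="cat", gold_answer="dog")
assert episode_reward(ep, stage=1) == 1.0
assert episode_reward(ep, stage=2) == 0.0
```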

Core claim

The central claim is that direct supervision on a set of simple visual tools, combined with a curriculum that first optimizes tool-specific rewards alone and then adds accuracy rewards while permitting tool calls, yields efficient learning and strong tool-use capabilities for complex visual reasoning tasks.

What carries the argument

The ToolsRL two-stage reinforcement learning curriculum, in which the first stage optimizes exclusively on tool-specific rewards for actions such as zoom-in, rotate, flip, and draw point/line, and the second stage adds task-accuracy rewards with tool access allowed.
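What those per-tool rewards might look like is easy to sketch, with the caveat that the paper's exact reward definitions are not reproduced here; the shapes below (IoU for zoom crops, exact match for orientation operations, a pixel tolerance for drawn points) are plausible stand-ins, not the authors' formulas.

```python
import math

def zoom_reward(pred_box, gold_box) -> float:
    """IoU between a predicted zoom-in crop and a supervised target crop."""
    ax0, ay0, ax1, ay1 = pred_box
    bx0, by0, bx1, by1 = gold_box
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def rotate_flip_reward(pred_op: str, gold_op: str) -> float:
    """Binary reward for choosing the correct orientation correction."""
    return 1.0 if pred_op == gold_op else 0.0

def point_reward(pred_xy, gold_xy, tol: float = 10.0) -> float:
    """Full credit if a drawn point lands within `tol` pixels of the target."""
    return 1.0 if math.dist(pred_xy, gold_xy) <= tol else 0.0

print(zoom_reward((0, 0, 10, 10), (5, 5, 15, 15)))  # 25/175, about 0.143
```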

If this is right

  • Tool calling capability is mastered independently before being applied to complete visual reasoning tasks.
  • Potential optimization conflicts between learning tool use and achieving task accuracy are avoided.
  • Training becomes more efficient while still reaching strong capabilities on complex visual reasoning.
  • A small set of native interpretable tools proves adequate once the model has learned to call them reliably.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The staged supervision method could be tested on non-visual tools or other agent skills where intermediate capabilities conflict with final goals.
  • Similar curricula might improve reliability when deploying models in interactive settings that require repeated tool use over time.
  • Scaling the same separation to larger models or more varied visual tools would reveal whether the efficiency gains hold beyond the reported experiments.

Load-bearing premise

That separating tool mastery into a dedicated first stage with tool-specific rewards avoids optimization conflicts with task accuracy, and that supervision for the chosen simple visual tools is easy to collect and sufficient for complex reasoning.

What would settle it

A single-stage training run that jointly optimizes tool rewards and accuracy rewards would settle it: if it matched or beat the two-stage curriculum on both tool-use performance and task accuracy, the claimed necessity of the separation would be falsified.
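Operationally, that falsifying baseline is a one-line change to the reward: mix both signals from the first step instead of staging them. A hedged sketch, reusing the stub Episode, tool_reward, and accuracy_reward from the sketch above; the mixing weight lam is hypothetical and the paper reports no such run.

```python
def joint_reward(ep: Episode, lam: float = 0.5) -> float:
    """Single-stage baseline: tool and accuracy rewards mixed from step zero.
    If a run trained on this matches the two-stage curriculum on both
    tool-use success and task accuracy, the separation is not necessary."""
    tool_part = sum(tool_reward(c, ep.gold_tool_calls) for c in ep.tool_calls)
    acc_part = accuracy_reward(ep.answer, ep.gold_answer)
    return lam * tool_part + (1.0 - lam) * acc_part
```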

Figures

Figures reproduced from arXiv: 2604.19945 by Davide Modolo, Gozde Sahin, Hao Yang, Pei Wang, Qihua Dong, Robik Shrestha, Zhaowei Cai.

Figure 1
Figure 1: Visual reasoning with ToolsRL. Illustrative examples of tool-supervised RL integrating various visual tools into coherent multi-step reasoning chains for different tasks. view at source ↗
Figure 2
Figure 2: Overview of Tool-supervised Reinforcement Learning (ToolsRL). (a) ToolsRL includes tool-specific rewards that supervise tool usage. (b) Unlike SFT-then-RL and standard RL, ToolsRL injects tool supervision before training on QA tasks. view at source ↗
Figure 3
Figure 3: Case studies of ToolsRL. Left: Visual search on high-resolution benchmarks, where the agent iteratively zooms in to localize the queried region before answering. Red arrow in the last image indicates the target region. Middle: Visual verification on charts, where the agent marks key points to check the presence of peaks on the x-axis. Right: Composite tool use, where the agent combines zoom-in and point-drawing. view at source ↗
Figure 4
Figure 4: Comparison case study of tool usage across different … view at source ↗
Figure 5
Figure 5: Qualitative comparison with DeepEyes. view at source ↗
Figure 6
Figure 6: Visual examples of Stage 1 tool-supervised training data. From left to right: (a) document Rotate/Flip orientation correction; … view at source ↗
Figure 8
Figure 8: Analysis of the Stage-1 rotation/flip training design. view at source ↗
Figure 9
Figure 9: Average tool usage during training across ablation experiments. view at source ↗
read the original abstract

In this paper, we investigate the problem of how to effectively master tool-use to solve complex visual reasoning tasks for Multimodal Large Language Models. To achieve that, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework, with direct tool supervision for more effective tool-use learning. We focus on a series of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, whose tool supervision is easy to collect. A reinforcement learning curriculum is developed, where the first stage is solely optimized by a set of well motivated tool-specific rewards, and the second stage is trained with the accuracy targeted rewards while allowing calling tools. In this way, tool calling capability is mastered before using tools to complete visual reasoning tasks, avoiding the potential optimization conflict among those heterogeneous tasks. Our experiments have shown that the tool-supervised curriculum training is efficient and ToolsRL can achieve strong tool-use capabilities for complex visual reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ToolsRL, a two-stage reinforcement learning framework for multimodal large language models to master tool use in complex visual reasoning. It introduces simple visual tools (zoom-in, rotate, flip, draw point/line) with easily collectible supervision. Stage 1 optimizes solely on tool-specific rewards to master tool calling; Stage 2 adds task-accuracy rewards while permitting tool use. The authors claim this curriculum avoids optimization conflicts between heterogeneous objectives and that experiments demonstrate efficient training and strong tool-use capabilities.

Significance. If the empirical claims hold, the work could offer a practical curriculum-based approach to tool-augmented visual reasoning that separates tool proficiency from final-task optimization. The emphasis on native, interpretable tools with straightforward supervision is a constructive design choice that may generalize beyond the specific tools tested. However, without detailed quantitative results, baselines, or ablations, the magnitude of any advance over standard RL or prompting methods remains difficult to gauge.

major comments (2)
  1. [Methods / Experiments] The central design claim—that the two-stage curriculum avoids optimization conflicts between tool-specific rewards and accuracy rewards—lacks any supporting ablation or control experiment. No joint-optimization baseline, single-stage training run, or reward-weighting comparison is described, leaving the necessity and benefit of the separation as an untested modeling choice rather than a demonstrated requirement (see the curriculum description and experimental claims).
  2. [Abstract / Experiments] The abstract and experimental summary assert that 'the tool-supervised curriculum training is efficient and ToolsRL can achieve strong tool-use capabilities,' yet no quantitative metrics, error bars, baseline comparisons, or ablation tables are referenced in the provided text. This absence directly undermines verification of the efficiency and capability claims that constitute the paper's primary contribution.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy delta or sample efficiency) to ground the efficiency and capability assertions.
  2. [Methods] Notation for the tool-specific reward functions and the transition between stages could be made more explicit (e.g., by defining the reward components in an equation, as in the illustrative sketch below) to improve reproducibility.
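One way such notation could read; the symbols below are this report's sketch, not definitions from the paper.

```latex
% Illustrative notation only; these symbols are the report's, not the paper's.
\[
R_{\text{stage-1}}(\tau) = \sum_{t \in \mathcal{T}(\tau)} r_{\text{tool}}\left(a_t,\, a_t^{\star}\right),
\qquad
R_{\text{stage-2}}(\tau) = r_{\text{acc}}\left(y(\tau),\, y^{\star}\right)
\]
```

Here tau is a rollout, T(tau) indexes its tool calls, a_t* is the supervised target call at step t, and y* is the gold answer; training would switch from maximizing the first objective to the second at the stage boundary.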

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and describe the revisions that will be incorporated into the manuscript.

read point-by-point responses
  1. Referee: [Methods / Experiments] The central design claim—that the two-stage curriculum avoids optimization conflicts between tool-specific rewards and accuracy rewards—lacks any supporting ablation or control experiment. No joint-optimization baseline, single-stage training run, or reward-weighting comparison is described, leaving the necessity and benefit of the separation as an untested modeling choice rather than a demonstrated requirement (see the curriculum description and experimental claims).

    Authors: We agree that the current manuscript does not contain explicit ablation studies comparing the two-stage curriculum to joint optimization or single-stage alternatives. The separation is motivated by the distinct objectives of the reward signals (tool invocation precision versus end-task accuracy), which can create optimization tension when combined from the start. To substantiate this, we will add a dedicated ablation subsection in the revised Experiments section that includes (i) a joint-optimization baseline trained with a combined reward, (ii) a single-stage accuracy-only run, and (iii) quantitative comparisons of tool-calling success rate and final task accuracy. These results will be presented with error bars and statistical significance tests. revision: yes

  2. Referee: [Abstract / Experiments] The abstract and experimental summary assert that 'the tool-supervised curriculum training is efficient and ToolsRL can achieve strong tool-use capabilities,' yet no quantitative metrics, error bars, baseline comparisons, or ablation tables are referenced in the provided text. This absence directly undermines verification of the efficiency and capability claims that constitute the paper's primary contribution.

    Authors: We acknowledge that the submitted abstract and summary statements do not reference specific numerical results. The full manuscript contains quantitative evaluations in Section 4, including tool-use accuracy, task success rates, training efficiency curves, and comparisons against prompting and standard RL baselines. In the revision we will update the abstract to cite the key metrics (e.g., relative improvements in tool-calling success and overall accuracy) and will add explicit cross-references to the corresponding tables and figures so that the claims are directly verifiable from the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: methodological proposal with no derivations or self-referential reductions

full rationale

The paper proposes a ToolsRL framework consisting of a two-stage curriculum (tool-specific rewards first, then accuracy rewards) and simple visual tools. No equations, derivations, or first-principles results are present in the abstract or described text. The curriculum separation is presented as a design choice motivated by avoiding optimization conflicts, not as a derived necessity that reduces to its own inputs. No self-citations, fitted parameters renamed as predictions, or ansatzes appear. Experimental claims rest on empirical results rather than tautological reasoning, making the derivation chain self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the domain assumption that tool supervision for simple visual operations is straightforward to obtain and that the staged curriculum prevents optimization conflicts; no free parameters or invented entities are mentioned.

axioms (2)
  • domain assumption Tool supervision for zoom-in, rotate, flip, and draw operations is easy to collect
    Explicitly stated as the reason for focusing on these native visual tools.
  • domain assumption Separating tool mastery from task accuracy avoids optimization conflicts among heterogeneous objectives
    Core motivation for the two-stage curriculum described in the abstract.

pith-pipeline@v0.9.0 · 5476 in / 1261 out tokens · 62581 ms · 2026-05-10T02:59:45.401492+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 20 canonical work pages · 5 internal anchors

  1. [1] Qwen2.5-VL Technical Report
     Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report.

  2. [2] Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback
     Yang Chen, Yufan Shen, Wenxuan Huang, Sheng Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Jiajun Bu, Botian Shi, and Yu Qiao. arXiv preprint arXiv:2507.20766, 2025.

  3. [3] CoT-Referring: Improving Referring Expression Tasks with Grounded Reasoning
     Qihua Dong, Luis Figueroa, Handong Zhao, Kushal Kafle, Jason Kuen, Zhihong Ding, Scott Cohen, and Yun Fu, 2025.

  4. [4] Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
     Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. arXiv preprint arXiv:2406.09403, 2024.

  5. [5] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
     Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Yao Hu, and Shaohui Lin, 2025.

  6. [6] RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs
     Hangzhan Jin, Sicheng Lv, Sifan Wu, and Mohammad Hamdaqa. arXiv preprint arXiv:2508.16546, 2025.

  7. [7] TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains
     Yoonsik Kim, Moonbin Yim, and Ka Yeon Song, 2024.

  8. [8] Reinforcing VLMs to Use Tools for Detailed Visual Reasoning under Resource Constraints
     Sunil Kumar, Bowen Zhao, Leo Dirac, and Paulina Varshavskaya. arXiv preprint arXiv:2506.14821, 2025.

  9. [9] Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
     Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. arXiv preprint arXiv:2509.07969, 2025.

  10. [10] LLaVA-OneVision: Easy Visual Task Transfer
      Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li, 2024.

  11. [11] Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
      Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14369–14387, Bangkok, Thailand, 2024. Association for Computational Linguistics.

  12. [12] ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
      Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, 2022. Association for Computational Linguistics.

  13. [13] ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering
      Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, Megh Thakkar, Md Rizwan Parvez, Enamul Hoque, and Shafiq Joty. In Findings of the Association for Computational Linguistics.

  14. [14] DocVQA: A Dataset for VQA on Document Images
      Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2200–2209, 2021. arXiv preprint arXiv:2007.00398.

  15. [15] InfographicVQA
      Minesh Mathew, Viraj Bagal, Rubén Pérez Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V. Jawahar. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2582–2591, 2022.

  16. [16] Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
      Minheng Ni, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Lin, Wangmeng Zuo, and Lijuan Wang, 2025.

  17. [17] GPT-4o System Card
      OpenAI. arXiv preprint arXiv:2410.21276, 2024.

  18. [18] OpenAI o3 System Card
      OpenAI. https://openai.com/o3. Accessed: 2024-12-20.

  20. [20] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
      Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. arXiv preprint arXiv:2402.03300, 2024.

  21. [21] ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
      Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, and Jianwei Yin. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6613–6629, Suzhou, China, 2025. Association for Computational Linguistics.

  22. [22] Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
      Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. arXiv preprint arXiv:2505.15966, 2025.

  23. [23] OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
      Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, and Yu Cheng. arXiv preprint arXiv:2505.08617, 2025.

  24. [24] Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning
      Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang.

  25. [25] Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models
      Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7907–7915, 2025.

  26. [26] Simple o3: Towards Interleaved Vision-Language Reasoning
      Ye Wang, Qianglong Chen, Zejun Li, Siyuan Wang, Shijie Guo, Zhirui Zhang, and Zhongyu Wei. arXiv preprint arXiv:2508.12109, 2025.

  27. [27] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
      Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. CoRR, abs/2406.18521, 2024.

  28. [28] Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
      Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. arXiv preprint arXiv:2506.09965, 2025.

  29. [29] VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
      Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, and Klara Nahrstedt. arXiv preprint arXiv:2505.19255, 2025.

  30. [30] V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
      Penghao Wu and Saining Xie. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  31. [31] Look-Back: Implicit Visual Re-Focusing in MLLM Reasoning
      Shuo Yang, Yuwei Niu, Yuyang Liu, Yang Ye, Bin Lin, and Li Yuan. arXiv preprint arXiv:2507.03019, 2025.

  32. [32] R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
      Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, and Jiaxing Huang, 2025.

  33. [33] On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
      Wenhao Zhang, Yuexiang Xie, Yuchang Sun, Yanxi Chen, Guoyin Wang, Yaliang Li, Bolin Ding, and Jingren Zhou. arXiv preprint arXiv:2508.11408, 2025.

  34. [34] Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL
      Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, and Qing Li. arXiv preprint arXiv:2505.15436, 2025.

  35. [35] DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
      Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. CoRR, abs/2505.14362, 2025.

  36. [36] Reinforced Visual Perception with Tools
      Zetong Zhou, Dongping Chen, Zixian Ma, Zhihan Hu, Mingyang Fu, Sinan Wang, Yao Wan, Zhou Zhao, and Ranjay Krishna. arXiv preprint arXiv:2509.01656, 2025.
    Zetong Zhou, Dongping Chen, Zixian Ma, Zhihan Hu, Mingyang Fu, Sinan Wang, Yao Wan, Zhou Zhao, and Ran- jay Krishna. Reinforced visual perception with tools.arXiv preprint arXiv:2509.01656, 2025. 7 10 ToolsRL Given <zoom-in> tool and the <image>, what is the word to the left of "WAY"? w/ Accuracy Reward First, I will zoom in on the sign to the left of "WA...