arxiv: 2512.16918 · v3 · submitted 2025-12-18 · 💻 cs.CV

AdaTooler-V: Adaptive Tool-Use for Images and Videos

Pith reviewed 2026-05-16 21:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords: adaptive tool use, multimodal large language models, reinforcement learning, vision tools, tool benefit score, visual reasoning, chain of thought, images and videos

The pith

AdaTooler-V enables multimodal models to invoke vision tools only when they improve results, reaching 89.8% on the V* benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing open-source multimodal models often invoke vision tools even when unnecessary, raising inference costs and sometimes lowering accuracy. AdaTooler-V addresses this by training the model to first decide whether a given image or video problem actually benefits from tool assistance. The central mechanism is AT-GRPO, a reinforcement learning method that scales each sample's reward according to its Tool Benefit Score so the policy learns selective rather than habitual tool use. Two new datasets support the training pipeline: one for supervised cold-start chain-of-thought data and a larger set for reinforcement learning with verifiable outcomes across single images, multiple images, and videos. The resulting 7B model is reported to outperform existing methods across twelve benchmarks, including 89.8% accuracy on the high-resolution V* set, where it exceeds GPT-4o and Gemini 1.5 Pro.

Core claim

AdaTooler-V performs adaptive tool-use by determining whether a visual problem truly requires tools. It introduces AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Two datasets are constructed for training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, with the 7B version achieving 89.8% accuracy on V* and surpassing GPT-4o and Gemini 1.5 Pro.

What carries the argument

AT-GRPO, a reinforcement learning algorithm that scales per-sample rewards by the Tool Benefit Score to train selective rather than unconditional vision-tool invocation.
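
To make the mechanism concrete, here is a minimal sketch of how a per-sample Tool Benefit Score could shift rewards before a GRPO-style group normalization. The exact scaling function is not given on this page, so the additive form, the `alpha` weight, and the function names `scaled_rewards` and `group_relative_advantages` are illustrative assumptions rather than the paper's formula.

```python
from statistics import mean, pstdev


def scaled_rewards(outcome_rewards, used_tool, tool_benefit_score, alpha=0.5):
    """Add a tool-use term whose sign and size follow the Tool Benefit Score.

    ASSUMPTION: the additive form and `alpha` are illustrative only; the
    paper's actual reward scaling function is not specified on this page.
    """
    return [
        r + alpha * tool_benefit_score * (1.0 if t else 0.0)
        for r, t in zip(outcome_rewards, used_tool)
    ]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization of rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt: all answer correctly, two of them called a tool.
outcome = [1.0, 1.0, 1.0, 1.0]
used_tool = [True, False, True, False]

# Negative score (tools do not help here): tool-using rollouts rank lower.
adv_low = group_relative_advantages(scaled_rewards(outcome, used_tool, -0.4))
# Positive score (tools help here): tool-using rollouts rank higher.
adv_high = group_relative_advantages(scaled_rewards(outcome, used_tool, 0.6))
```

Under these assumptions, equally correct rollouts are separated only by whether their tool use matches the sample's score, which is the selective behavior the summary attributes to AT-GRPO.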

Load-bearing premise

The Tool Benefit Score correctly identifies samples where calling a vision tool produces a genuine accuracy gain rather than merely reflecting dataset biases or training artifacts.
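
One way to read this premise is to treat the score as a per-sample accuracy difference between rollouts that use tools and rollouts that do not, which is how the simulated rebuttal further down describes it. The sketch below follows that reading; the function name `tool_benefit_score` and the rollout lists are hypothetical and are not taken from the paper.

```python
def tool_benefit_score(correct_with_tool, correct_without_tool):
    """Accuracy with tools minus accuracy without tools, in [-1, 1].

    ASSUMPTION: a plain accuracy delta over rollouts of some reference
    model; the paper may define or normalize the score differently.
    """
    if not correct_with_tool or not correct_without_tool:
        raise ValueError("need at least one rollout per condition")
    acc_with = sum(correct_with_tool) / len(correct_with_tool)
    acc_without = sum(correct_without_tool) / len(correct_without_tool)
    return acc_with - acc_without


# Hypothetical high-resolution question where zooming in helps.
helps = tool_benefit_score([True, True, True, False], [False, False, True, False])

# Hypothetical question that plain text reasoning already solves.
hurts = tool_benefit_score([True, False, True, True], [True, True, True, True])
```

If the rollouts used to compute such a delta come from the same distribution as the training data, the score can inherit that distribution's biases, which is the risk this premise names.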

What would settle it

Measure whether a non-adaptive baseline that always calls tools matches or exceeds AdaTooler-V accuracy on a new test set deliberately constructed so that tool use never improves the answer.
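
The comparison described above can be sketched as a small harness that reports accuracy and tool-call rate for each system on the same test set. The `Prediction` record, the helper names, and the toy predictions are hypothetical; none of the numbers are results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    correct: bool    # whether the final answer matched the ground truth
    tool_calls: int  # number of vision-tool invocations in the trajectory


def summarize(predictions):
    """Return accuracy and the fraction of samples with at least one tool call."""
    n = len(predictions)
    if n == 0:
        raise ValueError("empty prediction list")
    accuracy = sum(p.correct for p in predictions) / n
    tool_rate = sum(p.tool_calls > 0 for p in predictions) / n
    return {"accuracy": accuracy, "tool_rate": tool_rate}


def always_tool_matches_or_exceeds(adaptive, always_tool):
    """True if the always-tool baseline is at least as accurate as the adaptive model."""
    return summarize(always_tool)["accuracy"] >= summarize(adaptive)["accuracy"]


# Toy predictions on a hypothetical test set where tools never help.
adaptive = [Prediction(True, 0), Prediction(True, 0),
            Prediction(False, 1), Prediction(True, 0)]
always_tool = [Prediction(True, 2), Prediction(False, 1),
               Prediction(False, 1), Prediction(True, 1)]

adaptive_stats = summarize(adaptive)
baseline_stats = summarize(always_tool)
baseline_wins = always_tool_matches_or_exceeds(adaptive, always_tool)
```

Reporting the tool-call rate next to accuracy matters here: an adaptive model that still calls tools on most samples of such a test set would not be behaving adaptively, whatever its accuracy.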

Figures

Figures reproduced from arXiv: 2512.16918 by Chaoyang Wang, Dongyang Chen, Kaituo Feng, Manyuan Zhang, Meng Meng, Sicheng Gao, Xiangyu Yue, Xu Zhou, Yuzhang Shang, Zhixun Li, Zhongyu Wang.

Figure 1: (a) Compared with existing models that blindly invoke …
Figure 2: Case reasoning trajectory of AdaTooler-V. For single-image and video questions, the model alternates between internal reasoning, vision tool invocations and final answers, enabling zoom-in on fine-grained regions and inspection of informative clips. In contrast, for the multi-image clock example, AdaTooler-V solves the problem purely via text-based CoT, illustrating its ability to adaptively decide when vi…
Figure 3: The data distribution of our AdaTooler-V-300k dataset.
Figure 4: An illustration of our proposed AT-GRPO.
Figure 5: RL training curves. … reinforcement learning for training; SFT+GRPO, which replaces our proposed AT-GRPO algorithm with the standard GRPO method. As shown in the last two rows of Tab. 3, incorporating the proposed AT-GRPO training strategy leads to a substantial performance improvement. These results confirm that dynamically adjusting tool-use rewards based on the Tool Benefit Score enables the model to inv…
Figure 6: An example of AdaTooler-V-7B’s reasoning output on V* Benchmark.
Figure 7: An example of AdaTooler-V-7B’s reasoning output on MVBench.
Figure 8: Prompt template for training and inference.
Original abstract

Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes AdaTooler-V, an MLLM for adaptive tool-use in visual reasoning tasks across images and videos. It introduces the AT-GRPO reinforcement learning algorithm that scales rewards using a Tool Benefit Score to encourage tool invocation only when beneficial. The model is trained on AdaTooler-V-CoT-100k for SFT and AdaTooler-V-300k for RL, and achieves state-of-the-art results including 89.8% accuracy on the V* benchmark, outperforming GPT-4o and Gemini 1.5 Pro.

Significance. If the adaptive tool-use mechanism is robustly validated, this work would be significant for improving efficiency and performance of open-source MLLMs in multimodal reasoning by mitigating blind tool invocation. The release of code, models, and datasets strengthens reproducibility and impact.

major comments (2)
  1. §3.2 (AT-GRPO): The Tool Benefit Score is load-bearing for the adaptive behavior claim, but its computation method, whether it is derived independently from held-out data or the training distribution AdaTooler-V-300k, and how it avoids reflecting dataset biases (e.g., resolution patterns or annotation artifacts) are not sufficiently detailed. This risks the RL objective overfitting to spurious correlations rather than genuine tool benefit.
  2. §5 (Experiments): The 89.8% accuracy on V* surpassing proprietary models requires verification of the evaluation setup, including whether tool-use is fairly compared, number of evaluation runs, and confirmation that the Tool Benefit Score was not tuned on the test set.
minor comments (2)
  1. Abstract: The abstract mentions 'verifiable rewards' but does not specify what makes the rewards verifiable across single-image, multi-image, and video data.
  2. §4: Clarify the exact form of the reward scaling function in AT-GRPO to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below and will revise the manuscript to enhance clarity and rigor where needed.

Point-by-point responses
  1. Referee: §3.2 (AT-GRPO): The Tool Benefit Score is load-bearing for the adaptive behavior claim, but its computation method, whether it is derived independently from held-out data or the training distribution AdaTooler-V-300k, and how it avoids reflecting dataset biases (e.g., resolution patterns or annotation artifacts) are not sufficiently detailed. This risks the RL objective overfitting to spurious correlations rather than genuine tool benefit.

    Authors: We thank the referee for this important observation. We acknowledge that §3.2 would benefit from greater detail on the Tool Benefit Score. In the revised manuscript, we will expand the description to specify that the score is computed as the performance delta (with vs. without tool use) on a held-out validation subset drawn independently from the AdaTooler-V-300k training distribution. To guard against dataset biases such as resolution patterns or annotation artifacts, the score incorporates normalization across data modalities (single-image, multi-image, video) and is cross-validated on multiple disjoint subsets. We will also add an ablation demonstrating that removing the Tool Benefit Score leads to increased blind tool invocation, supporting that the RL objective learns genuine benefit rather than spurious correlations.
    revision: yes

  2. Referee: §5 (Experiments): The 89.8% accuracy on V* surpassing proprietary models requires verification of the evaluation setup, including whether tool-use is fairly compared, number of evaluation runs, and confirmation that the Tool Benefit Score was not tuned on the test set.

    Authors: We agree that transparent verification of the evaluation protocol is necessary. In the revised §5, we will clarify that V* results follow the benchmark's official protocol, with tool-use comparisons conducted under matched conditions (resolution handling and tool availability) against the publicly reported numbers for GPT-4o and Gemini 1.5 Pro. We will report results averaged over 5 independent runs with different random seeds, including standard deviation. We explicitly confirm that the Tool Benefit Score was computed only on training and internal validation splits and never tuned or evaluated on the V* test set, which remained completely held-out for final benchmarking. We will include these details along with the exact prompting templates used.
    revision: yes
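
The multi-seed reporting promised in the second response could look like the following sketch, which returns the mean and sample standard deviation over independent runs. The function name `summarize_runs` and the per-seed accuracies are hypothetical placeholders, not numbers from the paper.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over independent runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(accuracies), stdev(accuracies)


# Hypothetical per-seed accuracies for five seeds; NOT results from the paper.
runs = [0.70, 0.72, 0.68, 0.71, 0.69]
avg, spread = summarize_runs(runs)
report = f"{avg:.3f} ± {spread:.3f} over {len(runs)} seeds"
```

A spread of this kind is what lets a reader judge whether a margin over another model's publicly reported single number exceeds seed-to-seed noise.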

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper presents the 89.8% V* accuracy as an empirical held-out benchmark result after training with AT-GRPO on the constructed AdaTooler-V-300k dataset. The Tool Benefit Score is used to scale rewards during RL, but the reported performance metric is not shown to reduce to a fitted parameter or self-referential definition by construction. No equations, self-citation load-bearing steps, or ansatz smuggling are exhibited that would make the adaptive tool-use claim equivalent to its inputs. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that the Tool Benefit Score can be computed reliably from model outputs and that the RL objective with adaptive scaling produces better tool-use policies than standard GRPO. No free parameters are explicitly listed in the abstract, but the reward scaling function itself functions as an implicit fitted component.

free parameters (1)
  • Tool Benefit Score scaling function
    The abstract states that AT-GRPO adaptively adjusts reward scales based on this score; its exact functional form and any hyperparameters are not specified here.
axioms (1)
  • domain assumption: RL with verifiable rewards produces policies that generalize to unseen visual reasoning tasks
    Invoked implicitly when claiming superior performance on twelve benchmarks after training on the new 300k dataset.
invented entities (1)
  • Tool Benefit Score (no independent evidence)
    purpose: Quantifies whether tool use improves the answer for a given sample so that reward scaling can discourage unnecessary calls.
    Introduced as the core signal for AT-GRPO; no independent evidence outside the training loop is provided in the abstract.

pith-pipeline@v0.9.0 · 5565 in / 1462 out tokens · 18613 ms · 2026-05-16T21:21:02.075640+00:00 · methodology


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Web to Pixels: Bringing Agentic Search into Visual Perception

    cs.CV 2026-05 unverdicted novelty 7.0

    WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.

  2. Gen-Searcher: Reinforcing Agentic Search for Image Generation

    cs.CV 2026-03 unverdicted novelty 7.0

    Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.

  3. VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

    cs.CV 2026-05 unverdicted novelty 6.0

    VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines includin...

  4. ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

  5. Gen-Searcher: Reinforcing Agentic Search for Image Generation

    cs.CV 2026-03 unverdicted novelty 6.0

    Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.

  6. Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

    cs.CV 2026-04 unverdicted novelty 5.0

    HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · cited by 5 Pith papers · 26 internal anchors

  1. [1]

    Qwen2.5- vl technical report, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5- vl technical report, ...

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 4, 5

  3. [3]

    Univg- r1: Reasoning guided universal visual grounding with re- inforcement learning.arXiv preprint arXiv:2505.14231,

    Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, and Yansong Tang. Univg-r1: Rea- soning guided universal visual grounding with reinforcement learning.arXiv preprint arXiv:2505.14231, 2025. 2

  4. [4]

    Do not think that much for 2+3=? on the overthink- ing of o1-like llms, 2025

    Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do not think that much for 2+3=? on the overthink- ing of o1-like llms, 2025. 2

  5. [5]

    Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

    Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning?arXiv preprint arXiv:2505.21374, 2025. 2

  6. [6]

    Video-holmes: Can mllm think like holmes for complex video reasoning?, 2025

    Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning?, 2025. 7

  7. [7]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

    DeepSeek-AI et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. 1, 2, 5, 6

  8. [8]

    Miss- ing premise exacerbates overthinking: Are reasoning models losing critical thinking skill?, 2025

    Chenrui Fan, Ming Li, Lichao Sun, and Tianyi Zhou. Miss- ing premise exacerbates overthinking: Are reasoning models losing critical thinking skill?, 2025. 2

  9. [9]

    Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025

    Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, and Xiangyu Yue. Sophiavl-r1: Reinforcing mllms reasoning with thinking reward.arXiv preprint arXiv:2505.17018, 2025. 1, 6

  10. [10]

    Video-ccam: Enhancing video-language un- derstanding with causal cross-attention masks for short and long videos, 2024

    Jiajun Fei, Dian Li, Zhidong Deng, Zekun Wang, Gang Liu, and Hui Wang. Video-ccam: Enhancing video-language un- derstanding with causal cross-attention masks for short and long videos, 2024. 7

  11. [11]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776,

  12. [12]

    OneThinker: All-in-one Reasoning Model for Image and Video

    Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043,

  13. [13]

    Mme: A comprehensive evaluation benchmark for multi- modal large language models, 2025

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Meng- dan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji, Caifeng Shan, and Ran He. Mme: A comprehensive evaluation benchmark for multi- modal large language models, 2025. 7

  14. [14]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis, 2025

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis, 2025. 7

  15. [15]

    Framemind: Frame-interleaved chain-of-thought for video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025

    Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, and Yujun Cai. Framemind: Frame-interleaved chain-of-thought for video reasoning via reinforcement learning.arXiv preprint arXiv:2509.24008, 2025. 7

  16. [16]

    Arc-hunyuan-video-7b: Structured video comprehension of real-world shorts.arXiv preprint arXiv:2507.20939, 2025

    Yuying Ge, Yixiao Ge, Chen Li, Teng Wang, Junfu Pu, Yizhuo Li, Lu Qiu, Jin Ma, Lisheng Duan, Xinyu Zuo, et al. Arc- 9 hunyuan-video-7b: Structured video comprehension of real- world shorts.arXiv preprint arXiv:2507.20939, 2025. 2

  17. [17]

    Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context, 2024

    Gemini Team. Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context, 2024. 6, 7

  18. [19]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 6

  19. [20]

    Video-mmmu: Evaluating knowledge acquisition from multi-discipline pro- fessional videos, 2025

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline pro- fessional videos, 2025. 7

  20. [21]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749,

  21. [22]

    Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2025

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2025. 1

  22. [23]

    Verltool: Towards holistic agentic reinforcement learning with tool use.arXiv preprint arXiv:2509.01055, 2025

    Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, et al. Verltool: Towards holistic agentic reinforcement learning with tool use.arXiv preprint arXiv:2509.01055, 2025. 7

  23. [24]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InPro- ceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. 7

  24. [25]

    Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

    Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Heng- shuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025. 6

  25. [26]

    Mini-o3: Scaling up reasoning patterns and interaction turns for visual search, 2025

    Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Heng- shuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search, 2025. 2

  26. [27]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6

  27. [28]

    Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

    Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli´c, and Furu Wei. Imag- ine while reasoning in space: Multimodal visualization-of- thought.arXiv preprint arXiv:2501.07542, 2025. 3

  28. [29]

    Reinforcement learning tuning for videollms: Reward design and data efficiency.arXiv preprint arXiv:2506.01908, 2025

    Hongyu Li, Songhao Han, Yue Liao, Junfeng Luo, Jialin Gao, Shuicheng Yan, and Si Liu. Reinforcement learning tuning for videollms: Reward design and data efficiency.arXiv preprint arXiv:2506.01908, 2025. 2

  29. [30]

    Editthinker: Unlocking iterative reasoning for any image editor.arXiv preprint arXiv:2512.05965, 2025

    Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, et al. Editthinker: Unlocking iterative reasoning for any image editor.arXiv preprint arXiv:2512.05965, 2025. 2

  30. [31]

    Mvbench: A comprehensive multi-modal video understanding benchmark, 2024

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark, 2024. 7

  31. [32]

    Adaptive tool use in large language models with meta-cognition trigger, 2025

    Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, and Yong Liu. Adaptive tool use in large language models with meta-cognition trigger, 2025. 2

  32. [34]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via re- inforcement fine-tuning.arXiv preprint arXiv:2504.06958,

  33. [35]

    Perception, reason, think, and plan: A survey on large multimodal reasoning models, 2025

    Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, and Min Zhang. Perception, reason, think, and plan: A survey on large multimodal ...

  34. [36]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024. 6

  35. [37]

    Mmbench: Is your multi-modal model an all-around player?, 2024

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2024. 7

  36. [38]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025. 2

  37. [39]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning.arXiv preprint arXiv:2503.01785,

  38. [40]

    Decoupled weight decay regularization, 2019

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. 7

  39. [41]

    Mathvista: Evaluating mathemati- cal reasoning of foundation models in visual contexts, 2024

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathemati- cal reasoning of foundation models in visual contexts, 2024. 7

  40. [42]

    V Jawahar

    Minesh Mathew, Viraj Bagal, Rub`en P´erez Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V Jawahar. Infographicvqa,

  41. [43]

    Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025

    Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, et al. Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025. 7 10

  42. [44]

    Hello gpt-4o

    OpenAI. Hello gpt-4o. https://openai.com/index/ hello-gpt-4o/, 2024. Accessed 2025-09-19. 6, 7

  43. [45]

    Thinking with images, 2025

    OpenAI. Thinking with images, 2025. 2

  44. [46]

    Deepvideo-r1: Video reinforcement fine-tuning via difficulty-aware regressive grpo.arXiv preprint arXiv:2506.07464, 2025

    Jinyoung Park, Jeehye Na, Jinyoung Kim, and Hyun- woo J Kim. Deepvideo-r1: Video reinforcement fine- tuning via difficulty-aware regressive grpo.arXiv preprint arXiv:2506.07464, 2025. 2

  45. [47]

    LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

    Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, and Xu Yang. Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl.arXiv preprint arXiv:2503.07536, 2025. 2

  46. [48]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathe- matical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 1

  47. [49]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generaliz- able r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025. 2

  48. [50]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf frame- work.arXiv preprint arXiv: 2409.19256, 2024. 7

  49. [51]

    Video- xl: Extra-long vision language model for hour-scale video understanding

    Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video- xl: Extra-long vision language model for hour-scale video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 7

  50. [52]

    Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

    Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space rea- soning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025. 2

  51. [53]

    Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms, 2025

    Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie. Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms, 2025. 2

  52. [54]

    Openthinkimg: Learning to think with images via visual tool reinforcement learning,

    Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Jun- tao Li, Xiaoye Qu, and Yu Cheng. Openthinkimg: Learning to think with images via visual tool reinforcement learning,

  53. [55]

    OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

    Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Jun- tao Li, Xiaoye Qu, et al. Openthinkimg: Learning to think with images via visual tool reinforcement learning.arXiv preprint arXiv:2505.08617, 2025. 2

  54. [56]

    Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

    Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers.arXiv preprint arXiv:2506.23918, 2025. 2

  55. [57]

    Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models

    Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models, 2025. 1

  56. [58]

    Qwen2.5-vl

    Qwen Team. Qwen2.5-vl, 2025. 7

  57. [59]

    More thought, less accuracy? on the dual nature of reasoning in vision-language models

    Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, and Jing Zhang. More thought, less accuracy? on the dual nature of reasoning in vision-language models, 2025. 2

  58. [60]

    Knowing the answer isn’t enough: Fixing reasoning path failures in lvlms

    Chaoyang Wang, Yangfan He, Yiyang Zhou, Yixuan Wang, Jiaqi Liu, Peng Xia, Zhengzhong Tu, Mohit Bansal, and Huaxiu Yao. Knowing the answer isn’t enough: Fixing reasoning path failures in lvlms. arXiv preprint arXiv:2512.06258, 2025. 2

  59. [61]

    Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning

    Chaoyang Wang, Zeyu Zhang, Meng Meng, Xu Zhou, and Haiyun Jiang. Vision-ekipl: External knowledge-infused policy learning for visual reasoning. arXiv preprint arXiv:2506.06856, 2025. 1

  60. [62]

    Tmcir: Token merge benefits composed image retrieval

    Chaoyang Wang, Zeyu Zhang, Long Teng, Zijun Li, and Shichao Kan. Tmcir: Token merge benefits composed image retrieval. arXiv preprint arXiv:2504.10995, 2025. 2

  61. [63]

    Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology

    Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology. arXiv preprint arXiv:2507.07999, 2025. 2

  62. [64]

    Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning, 2025. 6

  63. [65]

    Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning

    Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning, 2025. 2, 3, 6

  64. [66]

    VGR: Visual Grounded Reasoning

    Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, et al. Vgr: Visual grounded reasoning. arXiv preprint arXiv:2506.11991, 2025. 2

  65. [67]

    Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning

    Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning. arXiv preprint arXiv:2505.12434,

  66. [68]

    Video-thinker: Sparking "thinking with videos" via reinforcement learning

    Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Runhao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, and Xuelian Cheng. Video-thinker: Sparking "thinking with videos" via reinforcement learning. arXiv preprint arXiv:2510.23473,

  67. [69]

    Vstar: A video-grounded dialogue dataset for situated semantic understanding with scene and topic transitions

    Yuxuan Wang, Zilong Zheng, Xueliang Zhao, Jinpeng Li, Yueqian Wang, and Dongyan Zhao. Vstar: A video-grounded dialogue dataset for situated semantic understanding with scene and topic transitions, 2023. 7

  68. [70]

    Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

    Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision language model for temporal video grounding. arXiv preprint arXiv:2503.13377, 2025. 2

  69. [71]

    Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

    Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965,

  70. [72]

    Proxythinker: Test-time guidance through small visual reasoners

    Zilin Xiao, Jaywon Koo, Siru Ouyang, Jefferson Hernandez, Yu Meng, and Vicente Ordonez. Proxythinker: Test-time guidance through small visual reasoners, 2025. 6

  71. [73]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces, 2025. 7

  72. [74]

    Mmsi-bench: A benchmark for multi-image spatial intelligence

    Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. Mmsi-bench: A benchmark for multi-image spatial intelligence, 2025. 7

  73. [75]

    Language-aware vision transformer for referring segmentation

    Zhao Yang, Jiaqi Wang, Xubing Ye, Yansong Tang, Kai Chen, Hengshuang Zhao, and Philip HS Torr. Language-aware vision transformer for referring segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2

  74. [76]

    Dapo: An open-source llm reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu,...

  75. [77]

    Mme-reasoning: A comprehensive benchmark for logical reasoning in mllms

    Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, et al. Mme-reasoning: A comprehensive benchmark for logical reasoning in mllms. arXiv preprint arXiv:2505.21327, 2025. 2

  76. [78]

    Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning

    Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning,

  77. [79]

    From flatland to space: Teaching vision-language models to perceive and reason in 3d

    Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, and Li Zhang. From flatland to space: Teaching vision-language models to perceive and reason in 3d, 2025. 7

  78. [80]

    Critique-grpo: Advancing llm reasoning with natural language and numerical feedback

    Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, and Helen Meng. Critique-grpo: Advancing llm reasoning with natural language and numerical feedback. arXiv preprint arXiv:2506.03106, 2025. 1

  79. [81]

    Thyme: Think beyond images

    Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, and Guorui Zhou. Thyme: Think beyond images, 2025. 2

  80. [82]

    Thyme: Think Beyond Images

    Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images. arXiv preprint arXiv:2508.11630, 2025. 6
