AdaTooler-V: Adaptive Tool-Use for Images and Videos
Pith reviewed 2026-05-16 21:21 UTC · model grok-4.3
The pith
AdaTooler-V enables multimodal models to invoke vision tools only when they improve results, reaching 89.8% on the V* benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AdaTooler-V performs adaptive tool-use by determining whether a visual problem truly requires tools. It introduces AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Two datasets are constructed for training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, with the 7B version achieving 89.8% accuracy on V* and surpassing GPT-4o and Gemini 1.5 Pro.
What carries the argument
AT-GRPO, a reinforcement learning algorithm that scales per-sample rewards by the Tool Benefit Score to train selective rather than unconditional vision-tool invocation.
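The page does not give the exact reward-scaling function, so the following is only a minimal sketch of how a per-sample benefit score could modulate rewards before a GRPO-style group normalization. The `Rollout` type, the additive `alpha * tool_benefit` term, and the default `alpha` are all assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    correct: bool    # did the final answer pass the verifiable check?
    used_tool: bool  # did the trajectory call any vision tool?


def scaled_rewards(rollouts, tool_benefit, alpha=0.5):
    """Hypothetical reward: accuracy plus a tool term scaled by the benefit score.

    tool_benefit > 0: tools tended to help on this sample, so tool use earns a bonus.
    tool_benefit < 0: tools tended to hurt, so tool use is penalised.
    """
    rewards = []
    for r in rollouts:
        base = 1.0 if r.correct else 0.0
        tool_term = alpha * tool_benefit if r.used_tool else 0.0
        rewards.append(base + tool_term)
    return rewards


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardise rewards within one sample's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four rollouts for the same (hypothetical) sample.
group = [
    Rollout(correct=True, used_tool=True),
    Rollout(correct=True, used_tool=False),
    Rollout(correct=False, used_tool=True),
    Rollout(correct=False, used_tool=False),
]
adv_when_tools_hurt = group_relative_advantages(scaled_rewards(group, tool_benefit=-0.4))
adv_when_tools_help = group_relative_advantages(scaled_rewards(group, tool_benefit=0.4))
```

Under these assumptions, the correct rollout without a tool call gets the largest advantage when the benefit score is negative, and the correct rollout with a tool call gets the largest advantage when it is positive. That is the selective behavior the core claim describes.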
Load-bearing premise
The Tool Benefit Score correctly identifies samples where calling a vision tool produces a genuine accuracy gain rather than merely reflecting dataset biases or training artifacts.
What would settle it
Measure whether a non-adaptive baseline that always calls tools matches or exceeds AdaTooler-V accuracy on a new test set deliberately constructed so that tool use never improves the answer.
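The comparison described above can be sketched as follows, assuming each policy has been run on the same tool-neutral test set and that per-item results are available as `(is_correct, num_tool_calls)` pairs. The record format, function names, and the toy data are hypothetical and only illustrate the bookkeeping.

```python
def summarize(records):
    """records: list of (is_correct, num_tool_calls) for one policy on one test set."""
    n = len(records)
    accuracy = sum(1 for ok, _ in records if ok) / n
    calls_per_item = sum(calls for _, calls in records) / n
    return accuracy, calls_per_item


def always_tool_matches(adaptive, always_tool, tolerance=0.0):
    """True if the always-call baseline reaches the adaptive model's accuracy."""
    acc_adaptive, _ = summarize(adaptive)
    acc_always, _ = summarize(always_tool)
    return acc_always + tolerance >= acc_adaptive


# Toy placeholder records, not results from the paper.
adaptive_records = [(True, 0), (True, 0), (True, 1), (False, 0)]
always_tool_records = [(True, 2), (False, 1), (True, 1), (False, 3)]

adaptive_summary = summarize(adaptive_records)
always_summary = summarize(always_tool_records)
baseline_matches = always_tool_matches(adaptive_records, always_tool_records)
```

If `baseline_matches` came out true on a set where tools cannot help, the adaptive policy would offer no accuracy advantage there, and any remaining benefit would have to come from the lower tool-call count alone.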
Original abstract
Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes AdaTooler-V, an MLLM for adaptive tool-use in visual reasoning tasks across images and videos. It introduces the AT-GRPO reinforcement learning algorithm that scales rewards using a Tool Benefit Score to encourage tool invocation only when beneficial. The model is trained on AdaTooler-V-CoT-100k for SFT and AdaTooler-V-300k for RL, and achieves state-of-the-art results including 89.8% accuracy on the V* benchmark, outperforming GPT-4o and Gemini 1.5 Pro.
Significance. If the adaptive tool-use mechanism is robustly validated, this work would be significant for improving efficiency and performance of open-source MLLMs in multimodal reasoning by mitigating blind tool invocation. The release of code, models, and datasets strengthens reproducibility and impact.
major comments (2)
- §3.2 (AT-GRPO): The Tool Benefit Score is load-bearing for the adaptive behavior claim, but its computation method, whether it is derived independently from held-out data or from the training distribution AdaTooler-V-300k, and how it avoids reflecting dataset biases (e.g., resolution patterns or annotation artifacts) are not sufficiently detailed. This risks the RL objective overfitting to spurious correlations rather than genuine tool benefit.
- §5 (Experiments): The 89.8% accuracy on V* surpassing proprietary models requires verification of the evaluation setup, including whether tool-use is fairly compared, the number of evaluation runs, and confirmation that the Tool Benefit Score was not tuned on the test set.
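As an illustration of the run-level reporting the second comment asks about, here is a minimal sketch that aggregates per-run accuracies into a mean and a sample standard deviation. The function name and the accuracy values are placeholders chosen for the example; they are not numbers from the paper.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(accuracies), stdev(accuracies)


# Placeholder values for five hypothetical seeds.
placeholder_runs = [0.70, 0.72, 0.68, 0.71, 0.69]
avg, spread = summarize_runs(placeholder_runs)
print(f"{avg:.3f} +/- {spread:.3f} over {len(placeholder_runs)} runs")
```

Reporting the spread alongside the mean makes it possible to judge whether a gap to another model exceeds seed-to-seed variation.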
minor comments (2)
- Abstract: The abstract mentions 'verifiable rewards' but does not specify what makes the rewards verifiable across single-image, multi-image, and video data.
- §4: Clarify the exact form of the reward scaling function in AT-GRPO to allow reproduction.
Simulated Author's Rebuttal
We sincerely thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below and will revise the manuscript to enhance clarity and rigor where needed.
Point-by-point responses
- Referee: §3.2 (AT-GRPO): The Tool Benefit Score is load-bearing for the adaptive behavior claim, but its computation method, whether it is derived independently from held-out data or from the training distribution AdaTooler-V-300k, and how it avoids reflecting dataset biases (e.g., resolution patterns or annotation artifacts) are not sufficiently detailed. This risks the RL objective overfitting to spurious correlations rather than genuine tool benefit.
  Authors: We thank the referee for this important observation. We acknowledge that §3.2 would benefit from greater detail on the Tool Benefit Score. In the revised manuscript, we will expand the description to specify that the score is computed as the performance delta (with vs. without tool use) on a held-out validation subset drawn independently from the AdaTooler-V-300k training distribution. To guard against dataset biases such as resolution patterns or annotation artifacts, the score incorporates normalization across data modalities (single-image, multi-image, video) and is cross-validated on multiple disjoint subsets. We will also add an ablation demonstrating that removing the Tool Benefit Score leads to increased blind tool invocation, supporting that the RL objective learns genuine benefit rather than spurious correlations.
  Revision: yes
- Referee: §5 (Experiments): The 89.8% accuracy on V* surpassing proprietary models requires verification of the evaluation setup, including whether tool-use is fairly compared, number of evaluation runs, and confirmation that the Tool Benefit Score was not tuned on the test set.
  Authors: We agree that transparent verification of the evaluation protocol is necessary. In the revised §5, we will clarify that V* results follow the benchmark's official protocol, with tool-use comparisons conducted under matched conditions (resolution handling and tool availability) against the publicly reported numbers for GPT-4o and Gemini 1.5 Pro. We will report results averaged over 5 independent runs with different random seeds, including standard deviation. We explicitly confirm that the Tool Benefit Score was computed only on training and internal validation splits and never tuned or evaluated on the V* test set, which remained completely held out for final benchmarking. We will include these details along with the exact prompting templates used.
  Revision: yes
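The simulated rebuttal describes the Tool Benefit Score as a with-tool versus without-tool performance delta, normalized across modalities. A minimal sketch of that description follows. The per-sample attempt lists, the dictionary layout, and the choice of mean-centering as the normalization are assumptions for illustration; the page does not state the actual procedure.

```python
from collections import defaultdict
from statistics import mean


def raw_benefit(with_tool_correct, without_tool_correct):
    """Accuracy with tools minus accuracy without, over repeated attempts at one sample.

    Each argument is a list of 0/1 outcomes from separate attempts.
    """
    return mean(with_tool_correct) - mean(without_tool_correct)


def normalize_by_modality(samples):
    """samples: list of dicts with 'id', 'modality', 'delta'. Returns {id: centered delta}.

    Subtracting the per-modality mean is one hypothetical normalisation, so a
    modality where tools help on average does not dominate the score.
    """
    by_modality = defaultdict(list)
    for s in samples:
        by_modality[s["modality"]].append(s["delta"])
    offsets = {m: mean(deltas) for m, deltas in by_modality.items()}
    return {s["id"]: s["delta"] - offsets[s["modality"]] for s in samples}


# Toy placeholder data, not taken from the paper's datasets.
delta_a = raw_benefit([1, 1, 1, 0], [1, 0, 0, 0])
samples = [
    {"id": "a", "modality": "image", "delta": delta_a},
    {"id": "b", "modality": "image", "delta": 0.0},
    {"id": "c", "modality": "video", "delta": -0.25},
]
scores = normalize_by_modality(samples)
```

A score estimated this way depends on which split supplies the attempts, which is why the referee's question about held-out versus training data matters for the adaptive-behavior claim.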
Circularity Check
No significant circularity in the derivation chain
The paper presents the 89.8% V* accuracy as an empirical held-out benchmark result after training with AT-GRPO on the constructed AdaTooler-V-300k dataset. The Tool Benefit Score is used to scale rewards during RL, but the reported performance metric is not shown to reduce to a fitted parameter or self-referential definition by construction. No equations, self-citation load-bearing steps, or ansatz smuggling are exhibited that would make the adaptive tool-use claim equivalent to its inputs. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Tool Benefit Score scaling function
axioms (1)
- domain assumption: RL with verifiable rewards produces policies that generalize to unseen visual reasoning tasks
invented entities (1)
- Tool Benefit Score: no independent evidence
Forward citations
Cited by 6 Pith papers
- From Web to Pixels: Bringing Agentic Search into Visual Perception
  WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
- Gen-Searcher: Reinforcing Agentic Search for Image Generation
  Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.
- VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
  VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines includin...
- ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
  ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
- Gen-Searcher: Reinforcing Agentic Search for Image Generation
  Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.
- Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
  HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.