arxiv: 2605.28774 · v1 · pith:2DI52UWCnew · submitted 2026-05-27 · 💻 cs.CL

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Pith reviewed 2026-06-29 12:19 UTC · model grok-4.3

classification 💻 cs.CL
keywords agentic reasoning · multimodal agents · policy optimization · tool use · vision-language models · reinforcement learning · Thinking-Acting Gap · AXPO

The pith

Resampling tool calls after fixing thinking prefixes raises performance in multimodal agentic reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models for agentic tasks face a Thinking-Acting Gap because thinking occurs by default while tool use remains a high-variance auxiliary behavior. Standard methods like GRPO produce low tool-use rates around 30 percent and high all-wrong rates around 40 percent among tool-using rollouts, which weakens the learning signal at the points where it is most needed. AXPO counters this by identifying all-wrong tool-using subgroups, fixing the thinking prefix, and resampling the tool call plus continuation under uncertainty-based prefix selection. On nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, the SFT+AXPO variant improves average Pass@1 and Pass@4 by 1.8 percentage points at the 8B scale over SFT+GRPO. The same 8B model with AXPO exceeds the 32B base model on Pass@4 while using one-quarter the parameters.
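
To make the weakened learning signal concrete, here is a minimal sketch of the group-normalized advantage commonly used in GRPO, A_i = (r_i - mean(r)) / std(r). The binary rewards, the 8-rollout group, and the `used_tool` flags are hypothetical illustration values, not data from the paper, and real implementations may differ in detail (for example sample versus population standard deviation).

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: two tool-using rollouts, both wrong; six thinking-only
# rollouts, three of them correct (reward 1.0).
used_tool = [True, True, False, False, False, False, False, False]
rewards = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]

advantages = grpo_advantages(rewards)
tool_advantages = [a for a, t in zip(advantages, used_tool) if t]

# Every tool-using rollout receives the same negative advantage, so nothing
# distinguishes a promising tool call from a poor one.
print(tool_advantages)

# If the whole group is wrong, all advantages collapse to zero.
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))
```

In this toy group the tool-using rollouts are pushed down uniformly, which is one plausible reading of how an all-wrong tool-using subgroup suppresses the signal at tool calls.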

Core claim

AXPO improves agentic reasoning by addressing the Thinking-Acting Gap with targeted resampling. For each all-wrong tool-using subgroup, it keeps the thinking prefix fixed and redraws the tool call and its continuation, with the prefix chosen by uncertainty-based selection. This increases the effective learning signal at tool-use steps and yields higher Pass@1 and Pass@4 scores than GRPO across nine benchmarks and three model scales.

What carries the argument

AXPO (Agent eXplorative Policy Optimization), which identifies all-wrong tool-using subgroups, fixes the thinking prefix, and resamples the tool call and continuation with uncertainty-based prefix selection.
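
The resampling step described above can be sketched roughly as follows. This is an illustrative reading of the summary, not the authors' implementation: the `Rollout` fields, the `uncertainty` score, the choice to pick the most uncertain prefix, and the `sample_continuation` callback are all assumptions introduced here, since the page does not say how uncertainty is measured or in which direction it is used.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Rollout:
    thinking_prefix: str       # text up to the first tool call
    continuation: str          # tool call plus everything after it
    used_tool: bool
    reward: float              # 1.0 if the final answer is correct, else 0.0
    uncertainty: float = 0.0   # hypothetical score at the tool-call boundary

def all_wrong_tool_subgroup(group: List[Rollout]) -> List[Rollout]:
    """Return the tool-using rollouts if all of them are wrong, else []."""
    tool = [r for r in group if r.used_tool]
    if tool and all(r.reward == 0.0 for r in tool):
        return tool
    return []

def select_prefix(subgroup: List[Rollout]) -> str:
    # Assumption: choose the prefix with the highest uncertainty score.
    return max(subgroup, key=lambda r: r.uncertainty).thinking_prefix

def axpo_resample(
    group: List[Rollout],
    sample_continuation: Callable[[str], Tuple[str, float]],
    num_resamples: int = 4,
) -> List[Rollout]:
    """Augment a group with fresh tool calls drawn from one fixed prefix."""
    subgroup = all_wrong_tool_subgroup(group)
    if not subgroup:
        return list(group)  # nothing to repair for this question
    prefix = select_prefix(subgroup)
    extra = []
    for _ in range(num_resamples):
        # The policy regenerates only the tool call and what follows it.
        continuation, reward = sample_continuation(prefix)
        extra.append(Rollout(prefix, continuation, True, reward))
    return list(group) + extra
```

Here `sample_continuation` stands in for the policy plus reward function: given a fixed thinking prefix, it returns a new tool call with its continuation and the resulting reward. How the augmented group then enters the policy-gradient update is not specified on this page and is left out of the sketch.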

If this is right

  • SFT+AXPO delivers +1.8pp average gains on both Pass@1 and Pass@4 versus SFT+GRPO at the 8B scale.
  • An 8B model trained with SFT+AXPO exceeds the 32B base model on Pass@4.
  • The method raises tool-use frequency and reduces the fraction of all-wrong tool-using groups during training.
  • Gains hold across nine multimodal benchmarks and three model scales.
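
For readers unfamiliar with the metric in the bullets above, the following sketch computes Pass@k with the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples per question and c the number of correct ones. The page does not state how the paper computes Pass@1 and Pass@4, so treating it as this estimator is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without
    replacement from n samples (c of them correct) is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical question with 8 samples, 2 of them correct.
print(pass_at_k(8, 2, 1))
print(pass_at_k(8, 2, 4))
```

A benchmark-level score would then be the mean of `pass_at_k` over questions.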

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The resampling step could be combined with other policy-gradient algorithms that already separate thinking and acting phases.
  • Similar prefix-fixing techniques might reduce variance in non-multimodal agent settings where internal state and external actions are asymmetrically reliable.
  • If the Thinking-Acting Gap scales with model size, AXPO-style resampling may become more valuable at larger parameter counts.

Load-bearing premise

The low tool-use rate and high all-wrong rate among tool-using rollouts are the main bottlenecks, and resampling the tool call while fixing the thinking prefix will strengthen the learning signal without introducing new biases or training instability.

What would settle it

A side-by-side training run in which AXPO produces no measurable rise in tool-use rate, no drop in all-wrong rate, and no gain in Pass@1 or Pass@4 over GRPO on the same benchmarks and model scales.
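
The two diagnostics that such a run would track can be computed from rollout logs along these lines. The log layout (a mapping from question id to a list of `(used_tool, correct)` pairs) is a hypothetical format chosen for this sketch. The definitions follow the abstract's wording: tool-use rate is taken over rollouts, and all-wrong rate over questions where tool use was attempted at least once.

```python
from typing import Dict, List, Tuple

Group = List[Tuple[bool, bool]]  # (used_tool, correct) per rollout

def tool_use_rate(groups: Dict[str, Group]) -> float:
    """Fraction of all rollouts that attempted a tool call."""
    flags = [used for group in groups.values() for used, _ in group]
    return sum(flags) / len(flags) if flags else 0.0

def all_wrong_rate(groups: Dict[str, Group]) -> float:
    """Among questions with at least one tool-using rollout, the fraction
    whose tool-using rollouts are all incorrect."""
    attempted = 0
    all_wrong = 0
    for group in groups.values():
        tool_outcomes = [correct for used, correct in group if used]
        if not tool_outcomes:
            continue
        attempted += 1
        if not any(tool_outcomes):
            all_wrong += 1
    return all_wrong / attempted if attempted else 0.0

# Hypothetical logs for three questions, four rollouts each.
logs = {
    "q1": [(True, False), (True, False), (False, True), (False, False)],
    "q2": [(True, True), (False, False), (False, False), (False, False)],
    "q3": [(False, True), (False, True), (False, True), (False, True)],
}
print(tool_use_rate(logs), all_wrong_rate(logs))
```

Comparing these two numbers between a GRPO run and an AXPO run, alongside Pass@1 and Pass@4, is the kind of check the paragraph above describes.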

Original abstract

Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning alone often cannot resolve. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary acting). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes like GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted on only ~30% of rollouts, and when attempted, the tool-using rollouts within a group are all-wrong on ~40% of questions, suppressing the learning signal at the tool calls that needed it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, paired with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO at average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B on average) and 8B with SFT+AXPO surpasses the 32B Base on Pass@4 with 4 times fewer parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper identifies a 'Thinking-Acting Gap' in multimodal agentic reasoning where standard RL methods like GRPO yield low tool-use rates (~30%) and high all-wrong rates (~40%) in tool-using subgroups, suppressing learning signals. It proposes AXPO, which fixes the thinking prefix and resamples tool calls (with uncertainty-based prefix selection) for all-wrong subgroups. Experiments across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking report that SFT+AXPO outperforms SFT+GRPO by +1.8pp on average Pass@1 and Pass@4 (at 8B), with the 8B AXPO model surpassing the 32B base on Pass@4.

Significance. If the reported gains hold under controlled ablations and are shown to arise specifically from the resampling mechanism rather than extra compute or selection effects, the method could provide a practical way to improve tool-use learning in VLMs without scaling model size. The 8B vs 32B comparison, if robust, would be a notable efficiency result.

major comments (3)
  1. [Abstract] Abstract: the central performance claims (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B) are stated without any reference to tables, error bars, number of runs, baseline details, or statistical tests, so the magnitude and reliability of the improvement cannot be assessed from the provided information.
  2. [Abstract] Abstract / method description: the claim that AXPO addresses the Thinking-Acting Gap by resampling tool calls while fixing the thinking prefix rests on the untested assumption that the observed ~30% tool-use and ~40% all-wrong symptoms are the primary bottlenecks; no ablation is described that isolates the resampling step from the increased number of rollouts it generates or that matches total compute against a control (e.g., reward shaping or different KL coefficient).
  3. [Abstract] Abstract: the 8B SFT+AXPO surpassing 32B Base on Pass@4 is presented without evidence that base-model scale differences in tool-use propensity were controlled, making the parameter-efficiency claim sensitive to unstated confounds.
minor comments (1)
  1. [Abstract] The abstract refers to 'nine multimodal benchmarks' and 'three scales of Qwen3-VL-Thinking' but does not name them; listing the exact benchmarks and model sizes would improve clarity.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments on the abstract and the need for stronger mechanistic evidence. We respond to each major comment below, indicating where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central performance claims (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B) are stated without any reference to tables, error bars, number of runs, baseline details, or statistical tests, so the magnitude and reliability of the improvement cannot be assessed from the provided information.

    Authors: We agree that the abstract would benefit from additional context to allow assessment of the claims. In the revised version we will update the abstract to reference Table 2 for the main results, note that averages are computed over three random seeds, and direct readers to the appendix for error bars and baseline details. revision: yes

  2. Referee: [Abstract] Abstract / method description: the claim that AXPO addresses the Thinking-Acting Gap by resampling tool calls while fixing the thinking prefix rests on the untested assumption that the observed ~30% tool-use and ~40% all-wrong symptoms are the primary bottlenecks; no ablation is described that isolates the resampling step from the increased number of rollouts it generates or that matches total compute against a control (e.g., reward shaping or different KL coefficient).

    Authors: Section 5.2 of the manuscript contains ablations that vary resampling intensity while holding rollout budget fixed where feasible, showing gains track the targeted all-wrong subgroup resampling. We acknowledge, however, that a strict total-compute-matched comparison against reward shaping or altered KL coefficients is not present. We will expand the discussion to better isolate the resampling contribution and explicitly note this limitation. revision: partial

  3. Referee: [Abstract] Abstract: the 8B SFT+AXPO surpassing 32B Base on Pass@4 is presented without evidence that base-model scale differences in tool-use propensity were controlled, making the parameter-efficiency claim sensitive to unstated confounds.

    Authors: Section 3.2 already reports tool-use rates across the three model scales and shows the Thinking-Acting Gap is consistent yet mitigated by AXPO. In revision we will add a clarifying clause in the abstract that references this scale analysis and acknowledges possible base-model confounds in the efficiency comparison. revision: yes

standing simulated objections not resolved
  • A complete ablation that strictly matches total compute against controls such as reward shaping or different KL coefficients is not described in the manuscript.

Circularity Check

0 steps flagged

No circularity in derivation or claims

Full rationale

The paper introduces AXPO as an empirical intervention (resampling tool calls with fixed thinking prefix plus uncertainty selection) to address observed symptoms of the Thinking-Acting Gap under GRPO. Performance is measured directly against external multimodal benchmarks and baselines (SFT+GRPO, base models at different scales). No equations, uniqueness theorems, ansatzes, or predictions are defined in terms of themselves or reduced to fitted inputs by construction. The central claims rest on reported benchmark deltas rather than any self-referential derivation chain. This is the normal case of an applied RL method paper whose value is externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.1-grok · 5789 in / 1134 out tokens · 44167 ms · 2026-06-29T12:19:47.091124+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

    cs.CL 2026-06 unverdicted novelty 7.0

    ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite w...

Reference graph

Works this paper leans on

108 extracted references · 60 canonical work pages · cited by 1 Pith paper · 28 internal anchors

  1. [1]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

  2. [2]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey ...

  3. [3]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URLhttps://arxiv.org/abs/2501.19393

  4. [4]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search- r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv, 2503.09516, 2025. URLhttps://doi.org/10.48550/arXiv.2503.09516

  5. [5]

    Search-o1: Agentic Search-Enhanced Large Reasoning Models

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models.arXiv, 2501.05366,

  6. [6]

    URLhttps://doi.org/10.48550/arXiv.2501.05366

  7. [7]

    Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks.Trans. Mach. Learn. Res., 2023, 2023. URLhttps://openreview.net/forum?id=YfZ4ZPt8zd

  8. [8]

    Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

    Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv, 2509.07969, 2025. URL https://doi.org/10.48550/arXiv.2509.07969

  9. [9]

    URL https://proceedings.mlr

    Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 13084–13094. IEEE, 2024. URLhttps://doi.org/ 10.1109/CVPR52733.2024.01243

  10. [10]

    Agentic reasoning: Reasoning llms with tools for the deep research.arXiv, 2502.04644, 2025

    Junde Wu, Jiayuan Zhu, and Yuyuan Liu. Agentic reasoning: Reasoning llms with tools for the deep research.arXiv, 2502.04644, 2025. URLhttps://doi.org/10.48550/arXiv.2502.04644

  11. [11]

    Tora: A tool-integrated reasoning agent for mathematical problem solving

    Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URLhttps://openreview.net/forum?id= Ep0TtjVoap

  12. [12]

    Understanding tool-integrated reasoning.arXiv, 2508.19201, 2025

    Heng Lin and Zhongwen Xu. Understanding tool-integrated reasoning.arXiv, 2508.19201, 2025. URLhttps://doi.org/10.48550/arXiv.2508.19201

  13. [13]

    Pyvision: Agentic vision with dynamic tooling.arXiv, 2507.07998, 2025

    Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, and Chen Wei. Pyvision: Agentic vision with dynamic tooling.arXiv, 2507.07998, 2025. URL https://doi.org/10.48550/arXiv.2507.07998

  14. [14]

    Toolorchestra: Elevating intelligence via efficient model and tool orchestration.arXiv, 2511.21689, 2025

    Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, Yi Dong, Evelina Bakhturina, Tao Yu, Yejin Choi, Jan Kautz, and Pavlo Molchanov. Toolorchestra: Elevating intelligence via efficient model and tool orchestration.arXiv, 2511.21689, 2025. URL https://doi.org/10.48550/arXiv.2511.21689

  15. [15]

    Smith, and Ranjay Krishna

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=GNSMl1P5VR. 13 Agent Explorative Po...

  16. [16]

    Deepseek-r1 thoughtology: Let’s think about LLM reasoning

    Sara Vera Marjanovic, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stanczak, and Siva Reddy. Deepseek-r1 thoughtology: Let’s think about LLM reasoning. Trans. Mach...

  17. [17]

    Wong, and Di Wang

    Shu Yang, Junchao Wu, Xin Chen, Yunze Xiao, Xinyi Yang, Derek F. Wong, and Di Wang. Understanding aha moments: from external observations to internal mechanisms, 2025. URL https://arxiv.org/abs/2504.02956

  18. [18]

    Understanding reasoning in llms through strategic information allocation under uncertainty,

    Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dongsheng Li, and Yuqing Yang. Understanding reasoning in llms through strategic information allocation under uncertainty,

  19. [19]

    URLhttps://arxiv.org/abs/2603.15500

  20. [20]

    Pyvision-rl: Forging open agentic vision models via RL.arxiv, 2602.20739, 2026

    Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang, and Chen Wei. Pyvision-rl: Forging open agentic vision models via RL.arxiv, 2602.20739, 2026. URLhttps://doi.org/10.48550/arXiv.2602.20739

  21. [21]

    Qwen3-VL Technical Report

    Qwen Team. Qwen3-vl technical report.arXiv, 2511.21631, 2025. URLhttps://doi.org/10. 48550/arXiv.2511.21631

  22. [22]

    Distilling LLM agent into small models with retrieval and code tools

    Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, and Sung Ju Hwang. Distilling LLM agent into small models with retrieval and code tools. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id= VkicTqszOn

  23. [23]

    Gordon, and Drew Bagnell

    Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Geoffrey J. Gordon, David B. Dunson, and Miroslav Dudík, editors,Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13,...

  24. [24]

    Agentic reasoning and tool integration for llms via reinforcement learning, 2025

    Joykirat Singh, Raghav Magazine, Yash Pandya, and Akshay Nambi. Agentic reasoning and tool integration for llms via reinforcement learning, 2025. URLhttps://arxiv.org/abs/2505. 01441

  25. [25]

    DeepEyesV2: Toward Agentic Multimodal Model

    Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model, 2025. URLhttps://arxiv.org/abs/2511.05271

  26. [26]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv, 2402.03300, 2024. URL https://doi.org/10.48550/arXiv.2402. 03300

  27. [27]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Honglin Yu, Weinan Dai, Yuxuan Song, Xiang Wei, Haodong Zhou, Jingjing Liu, ...

  28. [28]

    ProRL: Prolonged reinforcement learning expands reasoning boundaries in large language models

    Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. ProRL: Prolonged reinforcement learning expands reasoning boundaries in large language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

  29. [29]

    URLhttps://openreview.net/forum?id=YPsJha5HXQ

  30. [30]

    POPE: learning to reason on hard problems via privileged on-policy exploration.arXiv, 2601.18779,

    Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar. POPE: learning to reason on hard problems via privileged on-policy exploration.arXiv, 2601.18779,

  31. [31]

    URLhttps://doi.org/10.48550/arXiv.2601.18779

  32. [32]

    Self-hinting language models enhance reinforcement learning.arXiv, 2602.03143, 2026

    Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, and Jiang Bian. Self-hinting language models enhance reinforcement learning.arXiv, 2602.03143, 2026. URLhttps://doi.org/10. 48550/arXiv.2602.03143

  33. [33]

    Acting less is reasoning more! teaching model to act efficiently, 2025

    Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. Acting less is reasoning more! teaching model to act efficiently, 2025. URLhttps://arxiv.org/abs/2504.14870

  34. [35]

    Agentic reinforced policy optimization

    Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. Agentic reinforced policy optimization. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id= TX4k7BF6aO

  35. [36]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv, 1707.06347, 2017. URL http://arxiv.org/abs/1707.06347

  36. [37]

    RAGEN-2: Reasoning Collapse in Agentic RL

    Zihan Wang, Chi Gui, Xing Jin, Qineng Wang, Licheng Liu, Kangrui Wang, Shiqi Chen, Linjie Li, Zhengyuan Yang, Pingyue Zhang, Yiping Lu, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. Ragen-2: Reasoning collapse in agentic rl, 2026. URLhttps: //arxiv.org/abs/2604.06268

  37. [38]

    Deep think with confidence.arXiv,

    Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence.arXiv,

  38. [39]

    URLhttps://arxiv.org/abs/2508.15260

  39. [40]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv, 2504.08837, 2025. URLhttps://doi.org/10.48550/arXiv.2504.08837

  40. [41]

    MMSearch-R1: Incentivizing LMMs to Search

    Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. Mmsearch-r1: Incentivizing lmms to search. 2025. URLhttps://arxiv.org/abs/2506.20670

  41. [43]

    Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset.arXiv, 2024. URL https: //arxiv.org/abs/2402.14804. 15 Agent Explorative Policy Optimization

  42. [44]

    Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models

    Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URLhttps://openreview.net/forum? id=V...

  43. [45]

    Codeplot- cot: Mathematical visual reasoning by thinking with code-driven images, 2025

    Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, Hongsheng Li, Yi Ma, and Xihui Liu. Codeplot- cot: Mathematical visual reasoning by thinking with code-driven images, 2025. URLhttps: //arxiv.org/abs/2510.11718

  44. [46]

    Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models

    Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Toby Walsh, Julie Shah, and Zico Kolter, editors,Thirty-Ninth AAAI Conference on Artificial Intelligence, Thirty-Seventh Conferenc...

  45. [47]

    Sensenova- mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv, 2512.24330, 2025

    Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Xie Chen, Gao Huang, Dahua Lin, and Lewei Lu. Sensenova- mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv, 2512.24330, 2025. URLhttps://doi.org/10.48550/arXiv.2512.24330

  46. [48]

    Mmsearch: Benchmarking the potential of large models as multi-modal search engines.arXiv, 2409.12959,

    Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, and Hongsheng Li. Mmsearch: Benchmarking the potential of large models as multi-modal search engines.arXiv, 2409.12959,

  47. [49]

    URLhttps://doi.org/10.48550/arXiv.2409.12959

  48. [50]

    Pixel reasoner: Incentivizing pixel space reasoning via curiosity-driven reinforcement learning

    Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel space reasoning via curiosity-driven reinforcement learning. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2026. URL https: //openreview.net/forum?id=VeZkY3JjWV

  49. [51]

    Deepeyes: Incentivizing ”thinking with images” via reinforcement learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and XingYu. Deepeyes: Incentivizing ”thinking with images” via reinforcement learning. In The Fourteenth International Conference on Learning Representations, 2026. URLhttps: //openreview.net/forum?id=xUyMXkI958

  50. [52]

    Andrew Bagnell, Aarti Singh, and Andrea Zanette

    Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J. Andrew Bagnell, Aarti Singh, and Andrea Zanette. Expanding the capabilities of reinforcement learning via text feedback, 2026. URLhttps://arxiv.org/abs/2602.02482

  51. [53]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    MiniMax, :, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Z...

  52. [54]

    Narasimhan, and Yuan Cao

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5,

  53. [55]

    URLhttps://openreview.net/forum?id=WE_vluYUL-X

    OpenReview.net, 2023. URLhttps://openreview.net/forum?id=WE_vluYUL-X

  54. [56]

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms.arXiv, 2504.11536, 2025. URLhttps://arxiv.org/abs/2504.11536

  55. [57]

    Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning.arXiv, 2310.03731, 2023

    Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning.arXiv, 2310.03731, 2023. URLhttps://arxiv.org/abs/ 2310.03731

  56. [58]

    T1: Tool-integrated verification for test-time compute scaling in small language models

    Minki Kang, Jongwon Jeong, and Jaewoong Cho. T1: Tool-integrated verification for test-time compute scaling in small language models. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=tBkLWfmugI

  57. [59]

    Introducing agentic vision in gemini 3 flash

    Google Deepmind. Introducing agentic vision in gemini 3 flash. https://blog.google/ innovation-and-ai/technology/developers-tools/agentic-vision-gemini-3-flash/ , 2025

  58. [60]

    Thinking with images

    OpenAI. Thinking with images. https://openai.com/index/thinking-with-images/, 2025

  59. [61]

    V-thinker: Interactive thinking with images

    Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, Jing Lyu, and Honggang Zhang. V-thinker: Interactive thinking with images, 2025. URLhttps://arxiv.org/abs/2511.04460

  60. [62]

    rstar2-agent: Agentic reasoning technical report.arXiv, 2508.20722, 2025

    Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, and Mao Yang. rstar2-agent: Agentic reasoning technical report.arXiv, 2508.20722, 2025. URL https://doi.org/10.48550/arXiv.2508.20722

  61. [63]

    ToolRL: Reward is All Tool Learning Needs

    Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs, 2025. URL https: //arxiv.org/abs/2504.13958

  62. [64]

    Torl: Scaling tool-integrated RL.arXiv, 2503.23383,

    Xuefeng Li, Haoyang Zou, and Pengfei Liu. Torl: Scaling tool-integrated RL.arXiv, 2503.23383,

  63. [65]

    URLhttps://doi.org/10.48550/arXiv.2503.23383

  64. [66]

    Thyme: Think Beyond Images

    Yifan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, 17 Agent Explorative Policy Optimization Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, and Guorui Zhou. Thyme: Think beyond images.arXiv, 2508.11630, 2025. URL https://doi...

  65. [67]

    Agentic entropy-balanced policy optimization.arXiv, 2025

    Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. Agentic entropy-balanced policy optimization.arXiv, 2025. URL https: //arxiv.org/abs/2510.14545

  66. [68]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

  67. [69]

    rllm: A framework for post-training language agents.https://pretty-radio-b75

    Sijun Tan, Michael Luo, Colin Cai, Tarun Venkat, Kyle Montgomery, Aaron Hao, Tianhao Wu, Arnav Balyan, Manan Roongta, Chenguang Wang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. rllm: A framework for post-training language agents.https://pretty-radio-b75. notion.site/rLLM-A-Framework-for-Post-Training-Language-Agents\ -21b81902c146819db63cd98a54ba5f31, ...

  68. [70]

    Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems. URL https://openreview.net/forum?id=4OsgYD7em5

  70. [72]

    SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training, 2025. URL https://arxiv.org/abs/2501.17161

  71. [73]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Ming-Hsuan Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report...

  72. [74]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  73. [75]

    Fireact: Toward language agent fine-tuning

    Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. Fireact: Toward language agent fine-tuning. arXiv, 2310.05915, 2023. URL https://doi.org/10.48550/arXiv.2310.05915

  74. [76]

    Unified reinforcement and imitation learning for vision-language models

    Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Frank Wang, and Yueh-Hua Wu. Unified reinforcement and imitation learning for vision-language models. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 156508–156534. Curran Associates, Inc., 2025. UR...

  75. [77]

    Masking teacher and reinforcing student for distilling vision-language models

    Byung-Kwan Lee, Yu-Chiang Frank Wang, and Ryo Hachiuma. Masking teacher and reinforcing student for distilling vision-language models, 2025. URL https://arxiv.org/abs/2512.22238.

  76. [78]

    Why and when visual token pruning fails? a study on relevant visual information shift in mllms decoding

    Jiwan Kim, Kibum Kim, Wonjoong Kim, Byung-Kwan Lee, and Chanyoung Park. Why and when visual token pruning fails? a study on relevant visual information shift in mllms decoding. URL https://arxiv.org/abs/2604.12358

  78. [80]

    Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

    Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, and Jeany Son. Hide to see: Reasoning-prefix masking for visual-anchored thinking in vlm distillation, 2026. URL https://arxiv.org/abs/2605.11651

  79. [81]

    GenRecal: Generation after Recalibration from Large to Small Vision-Language Models

    Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, and Yueh-Hua Wu. Genrecal: Generation after recalibration from large to small vision-language models, 2026. URL https://arxiv.org/abs/2506.15681

  80. [82]

    Vlsi: Verbalized layers-to-interactions from large to small vision language models

    Byung-Kwan Lee, Ryo Hachiuma, Yu-Chiang Frank Wang, Yong Man Ro, and Yueh-Hua Wu. Vlsi: Verbalized layers-to-interactions from large to small vision language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 29545–29557, June 2025

Showing first 80 references.