Agent Explorative Policy Optimization for Multimodal Agentic Reasoning
Pith reviewed 2026-06-29 12:19 UTC · model grok-4.3
The pith
Resampling tool calls after fixing thinking prefixes raises performance in multimodal agentic reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AXPO improves agentic reasoning by targeting the Thinking-Acting Gap through resampling. For each all-wrong tool-using subgroup, it keeps the thinking prefix fixed and redraws the tool call and its continuation, pairing this with uncertainty-based prefix selection. The result is a stronger effective learning signal at tool-use steps and higher Pass@1 and Pass@4 scores than GRPO across nine benchmarks and three model scales.
What carries the argument
AXPO (Agent eXplorative Policy Optimization), which identifies all-wrong tool-using subgroups, fixes the thinking prefix, and resamples the tool call and continuation with uncertainty-based prefix selection.
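The mechanism named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Rollout` fields, the `uncertainty` score, the rule of picking the most uncertain prefix, and the choice to append redraws to the group are all assumptions made for illustration, since this page does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rollout:
    thinking_prefix: str       # text generated before the tool call
    tool_call: Optional[str]   # None when the rollout never calls a tool
    reward: float              # 1.0 for a correct final answer, else 0.0
    uncertainty: float = 0.0   # hypothetical score, e.g. entropy at the call


def all_wrong_tool_subgroup(group: List[Rollout]) -> List[Rollout]:
    """Return the tool-using rollouts if every one of them is wrong, else []."""
    tool_users = [r for r in group if r.tool_call is not None]
    if tool_users and all(r.reward == 0.0 for r in tool_users):
        return tool_users
    return []


def select_prefix(subgroup: List[Rollout]) -> Rollout:
    """Assumption: choose the rollout with the highest uncertainty score."""
    return max(subgroup, key=lambda r: r.uncertainty)


def resample_group(
    group: List[Rollout],
    resample_fn: Callable[[str], Rollout],
    n_resamples: int,
) -> List[Rollout]:
    """Keep one thinking prefix fixed and redraw the tool call after it."""
    subgroup = all_wrong_tool_subgroup(group)
    if not subgroup:
        return list(group)  # nothing to fix for this question
    anchor = select_prefix(subgroup)
    redraws = [resample_fn(anchor.thinking_prefix) for _ in range(n_resamples)]
    return list(group) + redraws


def stub_sampler(prefix: str) -> Rollout:
    # Stand-in for the policy: pretends the redrawn tool call succeeds.
    return Rollout(prefix, "crop(image)", 1.0)


group = [
    Rollout("think A", "zoom(image)", 0.0, uncertainty=0.2),
    Rollout("think B", "ocr(image)", 0.0, uncertainty=0.9),
    Rollout("think C", None, 1.0),
]
augmented = resample_group(group, stub_sampler, n_resamples=2)
```

In this toy run the two redraws reuse the prefix "think B", so the tool-using subgroup is no longer all-wrong once they are added.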
If this is right
- SFT+AXPO delivers +1.8pp average gains on both Pass@1 and Pass@4 versus SFT+GRPO at the 8B scale.
- An 8B model trained with SFT+AXPO exceeds the 32B base model on Pass@4.
- The method raises tool-use frequency and reduces the fraction of all-wrong tool-using groups during training.
- Gains hold across nine multimodal benchmarks and three model scales.
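The two training diagnostics mentioned in the list above (tool-use rate and all-wrong rate) could be tracked along the following lines. This is a sketch under assumptions: each rollout is reduced to a hypothetical `(used_tool, correct)` pair, and the all-wrong rate is computed only over questions where at least one rollout attempted a tool, which is one reading of the abstract's wording.

```python
from typing import List, Tuple

# One group = all rollouts sampled for one question.
# Each rollout is a (used_tool, correct) pair; this encoding is illustrative.
Group = List[Tuple[bool, bool]]


def tool_use_rate(groups: List[Group]) -> float:
    """Fraction of all rollouts that attempted a tool call."""
    rollouts = [r for g in groups for r in g]
    return sum(1 for used, _ in rollouts if used) / len(rollouts)


def all_wrong_rate(groups: List[Group]) -> float:
    """Among questions with tool use, fraction whose tool rollouts all fail."""
    tool_outcomes = [[ok for used, ok in g if used] for g in groups]
    tool_outcomes = [outcomes for outcomes in tool_outcomes if outcomes]
    if not tool_outcomes:
        return 0.0
    return sum(1 for o in tool_outcomes if not any(o)) / len(tool_outcomes)


groups = [
    [(True, False), (True, False), (False, True)],  # tool rollouts all wrong
    [(True, True), (False, False)],                 # a tool rollout succeeds
    [(False, True), (False, True)],                 # no tool use at all
]
rate_tool = tool_use_rate(groups)    # 3 of 7 rollouts use a tool
rate_wrong = all_wrong_rate(groups)  # 1 of 2 tool-using questions
```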
Where Pith is reading between the lines
- The resampling step could be combined with other policy-gradient algorithms that already separate thinking and acting phases.
- Similar prefix-fixing techniques might reduce variance in non-multimodal agent settings where internal state and external actions are asymmetrically reliable.
- If the Thinking-Acting Gap scales with model size, AXPO-style resampling may become more valuable at larger parameter counts.
Load-bearing premise
The low tool-use rate and high all-wrong rate among tool-using rollouts are the main bottlenecks, and resampling the tool call while fixing the thinking prefix will strengthen the learning signal without introducing new biases or training instability.
What would settle it
A side-by-side training run in which AXPO produces no measurable rise in tool-use rate, no drop in all-wrong rate, and no gain in Pass@1 or Pass@4 over GRPO on the same benchmarks and model scales.
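For the Pass@1 and Pass@4 comparison described above, one commonly used option is the unbiased pass@k estimator, which estimates the chance that at least one of k samples is correct given n samples with c correct. Whether the paper uses this estimator is not stated on this page, so the sketch below is an assumption about the evaluation protocol.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 8 samples for a question, 2 of them correct.
p1 = pass_at_k(n=8, c=2, k=1)
p4 = pass_at_k(n=8, c=2, k=4)
```

Averaging these per-question values over a benchmark gives the benchmark-level score that would be compared between the two training recipes.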
Original abstract
Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning alone often cannot resolve. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary acting). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes like GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted on only ~30% of rollouts, and when attempted, the tool-using rollouts within a group are all-wrong on ~40% of questions, suppressing the learning signal at the tool calls that needed it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, paired with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B), and 8B with SFT+AXPO surpasses the 32B Base on Pass@4 with 4 times fewer parameters.
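To see why an all-wrong tool-using subgroup weakens the learning signal, it helps to look at group-relative advantages of the kind GRPO uses, where each rollout's reward is normalized by the mean and standard deviation of its group. The sketch below is illustrative only: the population standard deviation, the zero-variance handling, and the toy rewards are assumptions rather than details taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize each reward by its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # Identical rewards carry no preference signal within the group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Toy group: the first two rollouts call a tool and fail,
# the last two answer by thinking alone and succeed.
mixed = group_relative_advantages([0.0, 0.0, 1.0, 1.0])

# Toy group where every rollout fails.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In the mixed group every tool-using rollout receives a negative advantage and no tool call is ever reinforced; in the flat group all advantages vanish. Either way, the update contains no positive example of a tool call for that question.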
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies a 'Thinking-Acting Gap' in multimodal agentic reasoning where standard RL methods like GRPO yield low tool-use rates (~30%) and high all-wrong rates (~40%) in tool-using subgroups, suppressing learning signals. It proposes AXPO, which fixes the thinking prefix and resamples tool calls (with uncertainty-based prefix selection) for all-wrong subgroups. Experiments across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking report that SFT+AXPO outperforms SFT+GRPO by +1.8pp on average Pass@1 and Pass@4 (at 8B), with the 8B AXPO model surpassing the 32B base on Pass@4.
Significance. If the reported gains hold under controlled ablations and are shown to arise specifically from the resampling mechanism rather than extra compute or selection effects, the method could provide a practical way to improve tool-use learning in VLMs without scaling model size. The 8B vs 32B comparison, if robust, would be a notable efficiency result.
major comments (3)
- [Abstract] The central performance claims (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B) are stated without any reference to tables, error bars, number of runs, baseline details, or statistical tests, so the magnitude and reliability of the improvement cannot be assessed from the provided information.
- [Abstract / method description] The claim that AXPO addresses the Thinking-Acting Gap by resampling tool calls while fixing the thinking prefix rests on the untested assumption that the observed ~30% tool-use and ~40% all-wrong symptoms are the primary bottlenecks; no ablation is described that isolates the resampling step from the increased number of rollouts it generates or that matches total compute against a control (e.g., reward shaping or different KL coefficient).
- [Abstract] The 8B SFT+AXPO surpassing 32B Base on Pass@4 is presented without evidence that base-model scale differences in tool-use propensity were controlled, making the parameter-efficiency claim sensitive to unstated confounds.
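The compute-matching concern in the comments above comes down to simple rollout accounting. The sketch below shows one way a matched control could be budgeted; the function names and every number are hypothetical, and rollout count is used as a rough proxy for compute, which ignores differences in sequence length.

```python
def expected_rollouts_per_prompt(
    group_size: int, trigger_rate: float, n_resamples: int
) -> float:
    """Base group plus the extra redraws, weighted by how often they trigger."""
    return group_size + trigger_rate * n_resamples


def matched_baseline_prompts(
    n_prompts: int, group_size: int, trigger_rate: float, n_resamples: int
) -> int:
    """Prompts a plain baseline could process with the same rollout budget."""
    per_prompt = expected_rollouts_per_prompt(group_size, trigger_rate, n_resamples)
    total_rollouts = n_prompts * per_prompt
    return int(total_rollouts // group_size)


# Hypothetical setting: groups of 8, resampling triggered on 25% of prompts,
# 4 redraws each time it triggers.
per_prompt = expected_rollouts_per_prompt(8, 0.25, 4)
baseline_prompts = matched_baseline_prompts(1000, 8, 0.25, 4)
```

Under these made-up numbers the resampling recipe spends 9 rollouts per prompt on average, so a rollout-matched baseline would see 1125 prompts for every 1000 seen by the resampling run.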
minor comments (1)
- [Abstract] The abstract refers to 'nine multimodal benchmarks' and 'three scales of Qwen3-VL-Thinking' but does not name them; listing the exact benchmarks and model sizes would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract and the need for stronger mechanistic evidence. We respond to each major comment below, indicating where revisions will be made to the manuscript.
Point-by-point responses
- Referee: [Abstract] The central performance claims (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B) are stated without any reference to tables, error bars, number of runs, baseline details, or statistical tests, so the magnitude and reliability of the improvement cannot be assessed from the provided information.
  Authors: We agree that the abstract would benefit from additional context to allow assessment of the claims. In the revised version we will update the abstract to reference Table 2 for the main results, note that averages are computed over three random seeds, and direct readers to the appendix for error bars and baseline details.
  Revision: yes
- Referee: [Abstract / method description] The claim that AXPO addresses the Thinking-Acting Gap by resampling tool calls while fixing the thinking prefix rests on the untested assumption that the observed ~30% tool-use and ~40% all-wrong symptoms are the primary bottlenecks; no ablation is described that isolates the resampling step from the increased number of rollouts it generates or that matches total compute against a control (e.g., reward shaping or different KL coefficient).
  Authors: Section 5.2 of the manuscript contains ablations that vary resampling intensity while holding rollout budget fixed where feasible, showing gains track the targeted all-wrong subgroup resampling. We acknowledge, however, that a strict total-compute-matched comparison against reward shaping or altered KL coefficients is not present. We will expand the discussion to better isolate the resampling contribution and explicitly note this limitation.
  Revision: partial
- Referee: [Abstract] The 8B SFT+AXPO surpassing 32B Base on Pass@4 is presented without evidence that base-model scale differences in tool-use propensity were controlled, making the parameter-efficiency claim sensitive to unstated confounds.
  Authors: Section 3.2 already reports tool-use rates across the three model scales and shows the Thinking-Acting Gap is consistent yet mitigated by AXPO. In revision we will add a clarifying clause in the abstract that references this scale analysis and acknowledges possible base-model confounds in the efficiency comparison.
  Revision: yes
- A complete ablation that strictly matches total compute against controls such as reward shaping or different KL coefficients is not described in the manuscript.
Circularity Check
No circularity in derivation or claims
The paper introduces AXPO as an empirical intervention (resampling tool calls with fixed thinking prefix plus uncertainty selection) to address observed symptoms of the Thinking-Acting Gap under GRPO. Performance is measured directly against external multimodal benchmarks and baselines (SFT+GRPO, base models at different scales). No equations, uniqueness theorems, ansatzes, or predictions are defined in terms of themselves or reduced to fitted inputs by construction. The central claims rest on reported benchmark deltas rather than any self-referential derivation chain. This is the normal case of an applied RL method paper whose value is externally falsifiable.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
- Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
  ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite w...