Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Byung-Kwan Lee; Minki Kang; Pavlo Molchanov; Ryo Hachiuma; Shizhe Diao; Sung Ju Hwang; Yu-Chiang Frank Wang

arxiv: 2605.28774 · v1 · pith:2DI52UWCnew · submitted 2026-05-27 · 💻 cs.CL

Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov, Yu-Chiang Frank Wang, Byung-Kwan Lee

Pith reviewed 2026-06-29 12:19 UTC · model grok-4.3

keywords agentic reasoning · multimodal agents · policy optimization · tool use · vision-language models · reinforcement learning · Thinking-Acting Gap · AXPO

The pith

Resampling tool calls after fixing thinking prefixes raises performance in multimodal agentic reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models for agentic tasks face a Thinking-Acting Gap because thinking occurs by default while tool use remains a high-variance auxiliary behavior. Under standard methods like GRPO, tool use is attempted on only about 30 percent of rollouts, and on about 40 percent of questions every tool-using rollout in the group is wrong, which weakens the learning signal at the points where it is most needed. AXPO counters this by identifying all-wrong tool-using subgroups, fixing the thinking prefix, and resampling the tool call plus continuation under uncertainty-based prefix selection. On nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, the SFT+AXPO variant improves average Pass@1 and Pass@4 by 1.8 percentage points each at the 8B scale over SFT+GRPO. The same 8B model with AXPO exceeds the 32B base model on Pass@4 while using one-quarter of the parameters.
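One way to see why an all-wrong tool-using subgroup weakens the signal is to compute group-relative advantages for a single question. The sketch below assumes the common GRPO-style normalization (reward minus group mean, divided by group standard deviation) with binary correctness rewards; the group contents, the `eps` constant, and the use of population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of 8 rollouts for one question.
# uses_tool[i] marks whether rollout i made a tool call; rewards are 0/1 correctness.
uses_tool = [True, True, True, False, False, False, False, False]
rewards = [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]

advantages = group_advantages(rewards)
tool_advantages = [a for a, used in zip(advantages, uses_tool) if used]

# Every tool-using rollout is wrong, so none of them gets a positive advantage:
# no tool call in this group is reinforced.
print(all(a < 0 for a in tool_advantages))

# If the whole group is wrong, every advantage is zero and the group gives no gradient.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))
```

In both cases the tool calls receive no positive credit, which is the situation the resampling step is meant to address.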

Core claim

AXPO improves agentic reasoning by targeting the Thinking-Acting Gap through targeted resampling: for each all-wrong tool-using subgroup it keeps the thinking prefix fixed and redraws the tool call and its continuation, paired with uncertainty-based prefix selection, which increases the effective learning signal at tool-use steps and produces higher Pass@1 and Pass@4 scores than GRPO across nine benchmarks and three model scales.

What carries the argument

AXPO (Agent eXplorative Policy Optimization), which identifies all-wrong tool-using subgroups, fixes the thinking prefix, and resamples the tool call and continuation with uncertainty-based prefix selection.
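The three steps named here can be sketched roughly as follows. This is a reading of the abstract, not the authors' implementation: the `Rollout` fields, the `sample_continuation` callable, the `num_resamples` default, and especially the choice to pick the prefix with the highest uncertainty score are all hypothetical, since the abstract does not say how uncertainty is measured or in which direction it is used.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rollout:
    prefix: str         # thinking text up to the first tool call
    continuation: str   # tool call plus everything after it
    uses_tool: bool
    reward: float       # 1.0 if the final answer is correct, else 0.0
    uncertainty: float  # hypothetical prefix score, e.g. mean token entropy

def resample_all_wrong(group: List[Rollout],
                       sample_continuation: Callable[[str], Rollout],
                       num_resamples: int = 4) -> List[Rollout]:
    """Return extra rollouts for a group whose tool-using rollouts are all wrong."""
    tool_rollouts = [r for r in group if r.uses_tool]
    # Only intervene when tool use was attempted and never succeeded.
    if not tool_rollouts or any(r.reward > 0 for r in tool_rollouts):
        return []
    # Assumption: select the prefix with the highest uncertainty score.
    chosen = max(tool_rollouts, key=lambda r: r.uncertainty)
    # Keep the thinking prefix fixed; redraw the tool call and its continuation.
    return [sample_continuation(chosen.prefix) for _ in range(num_resamples)]

def stub_sampler(prefix: str) -> Rollout:
    # Stand-in for a policy call; a real sampler would generate from the model.
    return Rollout(prefix, "<resampled tool call and continuation>", True, 1.0, 0.0)

group = [
    Rollout("think A", "call 1", True, 0.0, 0.2),
    Rollout("think B", "call 2", True, 0.0, 0.9),
    Rollout("think C", "", False, 1.0, 0.1),
]
extra = resample_all_wrong(group, stub_sampler, num_resamples=2)
print([r.prefix for r in extra])
```

Passing the sampler in as a callable keeps the selection logic testable without a model; how the extra rollouts enter the policy-gradient update is left out because the abstract does not describe it.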

If this is right

SFT+AXPO delivers +1.8pp average gains on both Pass@1 and Pass@4 versus SFT+GRPO at the 8B scale.
An 8B model trained with SFT+AXPO exceeds the 32B base model on Pass@4.
The method raises tool-use frequency and reduces the fraction of all-wrong tool-using groups during training.
Gains hold across nine multimodal benchmarks and three model scales.
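For readers checking the Pass@1 and Pass@4 figures, a widely used way to estimate Pass@k from n sampled answers is the unbiased estimator 1 - C(n-c, k) / C(n, k), where c is the number of correct samples. The page does not say how the paper computes its Pass@k numbers, so the sketch below is an assumption about the metric, and the sample counts in the usage lines are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical question: 8 samples, 2 correct.
print(pass_at_k(8, 2, 1))
print(pass_at_k(8, 2, 4))
```

A benchmark score would then be the mean of this quantity over questions, and a "+1.8pp" gain is a difference between two such means.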

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The resampling step could be combined with other policy-gradient algorithms that already separate thinking and acting phases.
Similar prefix-fixing techniques might reduce variance in non-multimodal agent settings where internal state and external actions are asymmetrically reliable.
If the Thinking-Acting Gap scales with model size, AXPO-style resampling may become more valuable at larger parameter counts.

Load-bearing premise

The low tool-use rate and high all-wrong rate among tool-using rollouts are the main bottlenecks, and resampling the tool call while fixing the thinking prefix will strengthen the learning signal without introducing new biases or training instability.

What would settle it

A side-by-side training run in which AXPO produces no measurable rise in tool-use rate, no drop in all-wrong rate, and no gain in Pass@1 or Pass@4 over GRPO on the same benchmarks and model scales.

Original abstract

Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning alone often cannot resolve. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary acting). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes like GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted on only ~30% of rollouts, and when attempted, the tool-using rollouts within a group are all-wrong on ~40% of questions, suppressing the learning signal at the tool calls that needed it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, paired with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO at average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B on average) and 8B with SFT+AXPO surpasses the 32B Base on Pass@4 with 4 times fewer parameters.
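The two diagnostic symptoms in the abstract are simple to measure on logged rollouts. The sketch below is one plausible reading of the definitions: tool-use rate as the fraction of all rollouts that attempt a tool call, and all-wrong rate as the fraction of questions, among those with at least one tool-using rollout, where every tool-using rollout is wrong. The data layout and the example values are hypothetical.

```python
from typing import Dict, List, Tuple

# Each rollout is (uses_tool, is_correct); rollouts are grouped by question id.
Groups = Dict[str, List[Tuple[bool, bool]]]

def tool_use_rate(groups: Groups) -> float:
    """Fraction of all rollouts that attempt a tool call."""
    rollouts = [r for g in groups.values() for r in g]
    return sum(uses for uses, _ in rollouts) / len(rollouts)

def all_wrong_rate(groups: Groups) -> float:
    """Among questions with tool-using rollouts, fraction where all of them are wrong."""
    per_question = [[ok for uses, ok in g if uses] for g in groups.values()]
    attempted = [oks for oks in per_question if oks]
    if not attempted:
        return 0.0
    return sum(not any(oks) for oks in attempted) / len(attempted)

groups = {
    "q1": [(True, False), (True, False), (False, True), (False, False)],
    "q2": [(True, True), (False, True), (False, False), (False, False)],
    "q3": [(False, False)] * 4,
}
print(tool_use_rate(groups))   # 3 of 12 rollouts use a tool
print(all_wrong_rate(groups))  # q1 is all-wrong, q2 is not, q3 is excluded
```

Tracking both numbers over training steps for GRPO and AXPO runs would show directly whether the method moves the symptoms it targets.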

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AXPO's targeted resampling of tool calls on all-wrong groups is a straightforward idea that reports modest gains over GRPO, but the abstract gives no ablations or error bars to check if the mechanism actually works.

AXPO resamples tool calls while holding the thinking prefix fixed on all-wrong subgroups, plus uncertainty-based prefix selection. The paper says this lifts average Pass@1 and Pass@4 by 1.8 points over GRPO across nine multimodal benchmarks for Qwen3-VL-Thinking models, and lets the 8B version beat the 32B base on Pass@4.

The concrete contribution is the focus on the two symptoms the authors report under standard GRPO: tool use is attempted on only ~30% of rollouts, and on ~40% of questions every tool-using rollout in the group is wrong. Fixing the prefix and redrawing the tool part is a direct response to the claim that the learning signal at tool calls is suppressed. The scale-efficiency result is the part worth testing if the numbers hold.

The main weakness is that nothing in the abstract shows the gains come from the proposed mechanism rather than extra rollouts or selection. There are no ablations that match total compute, no controls that boost tool-use rate another way, and no error bars or variance numbers. The stress-test concern about causality and possible new biases from prefix selection is still open because those checks are missing.

This is for people already running RL on tool-using vision-language agents. A reader who wants a simple tweak to try on their own GRPO runs might get an idea from it, but anyone needing reproducible details or confirmed causality will have to wait for the full experiments.

I would send it to peer review. The idea is narrow and the reported numbers are specific enough that referees can ask for the missing controls and decide if the method is real.

Referee Report

3 major / 1 minor

Summary. The paper identifies a 'Thinking-Acting Gap' in multimodal agentic reasoning where standard RL methods like GRPO yield low tool-use rates (~30%) and high all-wrong rates (~40%) in tool-using subgroups, suppressing learning signals. It proposes AXPO, which fixes the thinking prefix and resamples tool calls (with uncertainty-based prefix selection) for all-wrong subgroups. Experiments across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking report that SFT+AXPO outperforms SFT+GRPO by +1.8pp on average Pass@1 and Pass@4 (at 8B), with the 8B AXPO model surpassing the 32B base on Pass@4.

Significance. If the reported gains hold under controlled ablations and are shown to arise specifically from the resampling mechanism rather than extra compute or selection effects, the method could provide a practical way to improve tool-use learning in VLMs without scaling model size. The 8B vs 32B comparison, if robust, would be a notable efficiency result.

major comments (3)

[Abstract] Abstract: the central performance claims (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B) are stated without any reference to tables, error bars, number of runs, baseline details, or statistical tests, so the magnitude and reliability of the improvement cannot be assessed from the provided information.
[Abstract] Abstract / method description: the claim that AXPO addresses the Thinking-Acting Gap by resampling tool calls while fixing the thinking prefix rests on the untested assumption that the observed ~30% tool-use and ~40% all-wrong symptoms are the primary bottlenecks; no ablation is described that isolates the resampling step from the increased number of rollouts it generates or that matches total compute against a control (e.g., reward shaping or different KL coefficient).
[Abstract] Abstract: the 8B SFT+AXPO surpassing 32B Base on Pass@4 is presented without evidence that base-model scale differences in tool-use propensity were controlled, making the parameter-efficiency claim sensitive to unstated confounds.

minor comments (1)

[Abstract] The abstract refers to 'nine multimodal benchmarks' and 'three scales of Qwen3-VL-Thinking' but does not name them; listing the exact benchmarks and model sizes would improve clarity.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments on the abstract and the need for stronger mechanistic evidence. We respond to each major comment below, indicating where revisions will be made to the manuscript.

Referee: [Abstract] Abstract: the central performance claims (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B) are stated without any reference to tables, error bars, number of runs, baseline details, or statistical tests, so the magnitude and reliability of the improvement cannot be assessed from the provided information.

Authors: We agree that the abstract would benefit from additional context to allow assessment of the claims. In the revised version we will update the abstract to reference Table 2 for the main results, note that averages are computed over three random seeds, and direct readers to the appendix for error bars and baseline details.
Revision: yes
Referee: [Abstract] Abstract / method description: the claim that AXPO addresses the Thinking-Acting Gap by resampling tool calls while fixing the thinking prefix rests on the untested assumption that the observed ~30% tool-use and ~40% all-wrong symptoms are the primary bottlenecks; no ablation is described that isolates the resampling step from the increased number of rollouts it generates or that matches total compute against a control (e.g., reward shaping or different KL coefficient).

Authors: Section 5.2 of the manuscript contains ablations that vary resampling intensity while holding rollout budget fixed where feasible, showing gains track the targeted all-wrong subgroup resampling. We acknowledge, however, that a strict total-compute-matched comparison against reward shaping or altered KL coefficients is not present. We will expand the discussion to better isolate the resampling contribution and explicitly note this limitation.
Revision: partial
Referee: [Abstract] Abstract: the 8B SFT+AXPO surpassing 32B Base on Pass@4 is presented without evidence that base-model scale differences in tool-use propensity were controlled, making the parameter-efficiency claim sensitive to unstated confounds.

Authors: Section 3.2 already reports tool-use rates across the three model scales and shows the Thinking-Acting Gap is consistent yet mitigated by AXPO. In revision we will add a clarifying clause in the abstract that references this scale analysis and acknowledges possible base-model confounds in the efficiency comparison.
Revision: yes

standing simulated objections not resolved

A complete ablation that strictly matches total compute against controls such as reward shaping or different KL coefficients is not described in the manuscript.
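A rough idea of what a rollout-matched control involves: count the expected rollouts per question under resampling, then give the GRPO baseline a group size at least that large. The sketch below counts rollouts only, and all three inputs (group size, trigger rate, resamples per trigger) are hypothetical values, not figures from the paper. Because resampled rollouts reuse a fixed thinking prefix, a token-level match could differ from this count.

```python
import math

def axpo_rollouts_per_question(group_size: int,
                               trigger_rate: float,
                               num_resamples: int) -> float:
    """Expected rollouts per question when a fraction of groups triggers resampling."""
    return group_size + trigger_rate * num_resamples

def matched_group_size(group_size: int,
                       trigger_rate: float,
                       num_resamples: int) -> int:
    """Smallest baseline group size with at least as many rollouts per question."""
    return math.ceil(axpo_rollouts_per_question(group_size, trigger_rate, num_resamples))

# Hypothetical setting: 8 rollouts per question, a quarter of groups trigger
# resampling, and each trigger draws 4 extra rollouts.
print(axpo_rollouts_per_question(8, 0.25, 4))
print(matched_group_size(8, 0.25, 4))
```

Under these made-up numbers, the control would be GRPO with a group size of 9 rather than 8.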

Circularity Check

0 steps flagged

No circularity in derivation or claims

The paper introduces AXPO as an empirical intervention (resampling tool calls with fixed thinking prefix plus uncertainty selection) to address observed symptoms of the Thinking-Acting Gap under GRPO. Performance is measured directly against external multimodal benchmarks and baselines (SFT+GRPO, base models at different scales). No equations, uniqueness theorems, ansatzes, or predictions are defined in terms of themselves or reduced to fitted inputs by construction. The central claims rest on reported benchmark deltas rather than any self-referential derivation chain. This is the normal case of an applied RL method paper whose value is externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
cs.CL 2026-06 unverdicted novelty 7.0

ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite w...

Reference graph

Works this paper leans on

108 extracted references · 60 canonical work pages · cited by 1 Pith paper · 28 internal anchors

[1]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.16720 2024
[3]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URLhttps://arxiv.org/abs/2501.19393

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search- r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv, 2503.09516, 2025. URLhttps://doi.org/10.48550/arXiv.2503.09516

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.09516 2025
[5]

Search-o1: Agentic Search-Enhanced Large Reasoning Models

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models.arXiv, 2501.05366,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

URLhttps://doi.org/10.48550/arXiv.2501.05366

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.05366
[7]

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks.Trans. Mach. Learn. Res., 2023, 2023. URLhttps://openreview.net/forum?id=YfZ4ZPt8zd

2023
[8]

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv, 2509.07969, 2025. URL https://doi.org/10.48550/arXiv.2509.07969

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.07969 2025
[9]

URL https://proceedings.mlr

Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 13084–13094. IEEE, 2024. URLhttps://doi.org/ 10.1109/CVPR52733.2024.01243

work page doi:10.1109/cvpr52733.2024.01243 2024
[10]

Agentic reasoning: Reasoning llms with tools for the deep research.arXiv, 2502.04644, 2025

Junde Wu, Jiayuan Zhu, and Yuyuan Liu. Agentic reasoning: Reasoning llms with tools for the deep research.arXiv, 2502.04644, 2025. URLhttps://doi.org/10.48550/arXiv.2502.04644

work page doi:10.48550/arxiv.2502.04644 2025
[11]

Tora: A tool-integrated reasoning agent for mathematical problem solving

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URLhttps://openreview.net/forum?id= Ep0TtjVoap

2024
[12]

Understanding tool-integrated reasoning.arXiv, 2508.19201, 2025

Heng Lin and Zhongwen Xu. Understanding tool-integrated reasoning.arXiv, 2508.19201, 2025. URLhttps://doi.org/10.48550/arXiv.2508.19201

work page doi:10.48550/arxiv.2508.19201 2025
[13]

Pyvision: Agentic vision with dynamic tooling.arXiv, 2507.07998, 2025

Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, and Chen Wei. Pyvision: Agentic vision with dynamic tooling.arXiv, 2507.07998, 2025. URL https://doi.org/10.48550/arXiv.2507.07998

work page doi:10.48550/arxiv.2507.07998 2025
[14]

Toolorchestra: Elevating intelligence via efficient model and tool orchestration.arXiv, 2511.21689, 2025

Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, Yi Dong, Evelina Bakhturina, Tao Yu, Yejin Choi, Jan Kautz, and Pavlo Molchanov. Toolorchestra: Elevating intelligence via efficient model and tool orchestration.arXiv, 2511.21689, 2025. URL https://doi.org/10.48550/arXiv.2511.21689

work page doi:10.48550/arxiv.2511.21689 2025
[15]

Smith, and Ranjay Krishna

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=GNSMl1P5VR. 13 Agent Explorative Po...

2024
[16]

Deepseek-r1 thoughtology: Let’s think about LLM reasoning

Sara Vera Marjanovic, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stanczak, and Siva Reddy. Deepseek-r1 thoughtology: Let’s think about LLM reasoning. Trans. Mach...

2026
[17]

Wong, and Di Wang

Shu Yang, Junchao Wu, Xin Chen, Yunze Xiao, Xinyi Yang, Derek F. Wong, and Di Wang. Understanding aha moments: from external observations to internal mechanisms, 2025. URL https://arxiv.org/abs/2504.02956

work page arXiv 2025
[18]

Understanding reasoning in llms through strategic information allocation under uncertainty,

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dongsheng Li, and Yuqing Yang. Understanding reasoning in llms through strategic information allocation under uncertainty,
[19]

URLhttps://arxiv.org/abs/2603.15500

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Pyvision-rl: Forging open agentic vision models via RL.arxiv, 2602.20739, 2026

Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang, and Chen Wei. Pyvision-rl: Forging open agentic vision models via RL.arxiv, 2602.20739, 2026. URLhttps://doi.org/10.48550/arXiv.2602.20739

work page doi:10.48550/arxiv.2602.20739 2026
[21]

Qwen3-VL Technical Report

Qwen Team. Qwen3-vl technical report.arXiv, 2511.21631, 2025. URLhttps://doi.org/10. 48550/arXiv.2511.21631

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Distilling LLM agent into small models with retrieval and code tools

Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, and Sung Ju Hwang. Distilling LLM agent into small models with retrieval and code tools. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id= VkicTqszOn

2025
[23]

Gordon, and Drew Bagnell

Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Geoffrey J. Gordon, David B. Dunson, and Miroslav Dudík, editors,Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13,...

2011
[24]

Agentic reasoning and tool integration for llms via reinforcement learning, 2025

Joykirat Singh, Raghav Magazine, Yash Pandya, and Akshay Nambi. Agentic reasoning and tool integration for llms via reinforcement learning, 2025. URLhttps://arxiv.org/abs/2505. 01441

2025
[25]

DeepEyesV2: Toward Agentic Multimodal Model

Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model, 2025. URLhttps://arxiv.org/abs/2511.05271

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv, 2402.03300, 2024. URL https://doi.org/10.48550/arXiv.2402. 03300

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402 2024
[27]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Honglin Yu, Weinan Dai, Yuxuan Song, Xiang Wei, Haodong Zhou, Jingjing Liu, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

ProRL: Prolonged reinforcement learning expands reasoning boundaries in large language models

Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. ProRL: Prolonged reinforcement learning expands reasoning boundaries in large language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,
[29]

URLhttps://openreview.net/forum?id=YPsJha5HXQ
[30]

POPE: learning to reason on hard problems via privileged on-policy exploration.arXiv, 2601.18779,

Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar. POPE: learning to reason on hard problems via privileged on-policy exploration.arXiv, 2601.18779,

work page arXiv
[31]

URLhttps://doi.org/10.48550/arXiv.2601.18779

work page doi:10.48550/arxiv.2601.18779
[32]

Self-hinting language models enhance reinforcement learning.arXiv, 2602.03143, 2026

Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, and Jiang Bian. Self-hinting language models enhance reinforcement learning.arXiv, 2602.03143, 2026. URLhttps://doi.org/10. 48550/arXiv.2602.03143

work page arXiv 2026
[33]

Acting less is reasoning more! teaching model to act efficiently, 2025

Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. Acting less is reasoning more! teaching model to act efficiently, 2025. URLhttps://arxiv.org/abs/2504.14870

work page arXiv 2025
[35]

Agentic reinforced policy optimization

Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. Agentic reinforced policy optimization. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id= TX4k7BF6aO

2026
[36]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv, 1707.06347, 2017. URL http://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017
[37]

RAGEN-2: Reasoning Collapse in Agentic RL

Zihan Wang, Chi Gui, Xing Jin, Qineng Wang, Licheng Liu, Kangrui Wang, Shiqi Chen, Linjie Li, Zhengyuan Yang, Pingyue Zhang, Yiping Lu, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. Ragen-2: Reasoning collapse in agentic rl, 2026. URLhttps: //arxiv.org/abs/2604.06268

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

Deep think with confidence.arXiv,

Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence.arXiv,
[39]

URLhttps://arxiv.org/abs/2508.15260

work page internal anchor Pith review Pith/arXiv arXiv
[40]

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv, 2504.08837, 2025. URLhttps://doi.org/10.48550/arXiv.2504.08837

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.08837 2025
[41]

MMSearch-R1: Incentivizing LMMs to Search

Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. Mmsearch-r1: Incentivizing lmms to search. 2025. URLhttps://arxiv.org/abs/2506.20670

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset.arXiv, 2024. URL https: //arxiv.org/abs/2402.14804. 15 Agent Explorative Policy Optimization

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models

Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URLhttps://openreview.net/forum? id=V...

2025
[45]

Codeplot- cot: Mathematical visual reasoning by thinking with code-driven images, 2025

Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, Hongsheng Li, Yi Ma, and Xihui Liu. Codeplot- cot: Mathematical visual reasoning by thinking with code-driven images, 2025. URLhttps: //arxiv.org/abs/2510.11718

work page arXiv 2025
[46]

Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models

Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Toby Walsh, Julie Shah, and Zico Kolter, editors,Thirty-Ninth AAAI Conference on Artificial Intelligence, Thirty-Seventh Conferenc...

work page doi:10.1609/aaai.v39i8.32852 2025
[47]

Sensenova- mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv, 2512.24330, 2025

Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Xie Chen, Gao Huang, Dahua Lin, and Lewei Lu. Sensenova- mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv, 2512.24330, 2025. URLhttps://doi.org/10.48550/arXiv.2512.24330

work page doi:10.48550/arxiv.2512.24330 2025
[48]

Mmsearch: Benchmarking the potential of large models as multi-modal search engines.arXiv, 2409.12959,

Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, and Hongsheng Li. Mmsearch: Benchmarking the potential of large models as multi-modal search engines.arXiv, 2409.12959,

work page arXiv
[49]

URLhttps://doi.org/10.48550/arXiv.2409.12959

work page doi:10.48550/arxiv.2409.12959
[50]

Pixel reasoner: Incentivizing pixel space reasoning via curiosity-driven reinforcement learning

Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel space reasoning via curiosity-driven reinforcement learning. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2026. URL https: //openreview.net/forum?id=VeZkY3JjWV

2026
[51]

Deepeyes: Incentivizing ”thinking with images” via reinforcement learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and XingYu. Deepeyes: Incentivizing ”thinking with images” via reinforcement learning. In The Fourteenth International Conference on Learning Representations, 2026. URLhttps: //openreview.net/forum?id=xUyMXkI958

2026
[52]

Andrew Bagnell, Aarti Singh, and Andrea Zanette

Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J. Andrew Bagnell, Aarti Singh, and Andrea Zanette. Expanding the capabilities of reinforcement learning via text feedback, 2026. URLhttps://arxiv.org/abs/2602.02482

work page arXiv 2026
[53]

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

MiniMax, :, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Z...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

Narasimhan, and Yuan Cao

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5,

2023
[55]

URLhttps://openreview.net/forum?id=WE_vluYUL-X

OpenReview.net, 2023. URLhttps://openreview.net/forum?id=WE_vluYUL-X

2023
[56]

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms.arXiv, 2504.11536, 2025. URLhttps://arxiv.org/abs/2504.11536

work page internal anchor Pith review Pith/arXiv arXiv 2025
[57] Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. MathCoder: Seamless code integration in LLMs for enhanced mathematical reasoning. arXiv, 2310.03731, 2023. URL https://arxiv.org/abs/2310.03731

[58] Minki Kang, Jongwon Jeong, and Jaewoong Cho. T1: Tool-integrated verification for test-time compute scaling in small language models. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=tBkLWfmugI

[59] Google DeepMind. Introducing agentic vision in Gemini 3 Flash. https://blog.google/innovation-and-ai/technology/developers-tools/agentic-vision-gemini-3-flash/, 2025

[60] OpenAI. Thinking with images. https://openai.com/index/thinking-with-images/, 2025

[61] Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, Jing Lyu, and Honggang Zhang. V-Thinker: Interactive thinking with images, 2025. URL https://arxiv.org/abs/2511.04460

[62] Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, and Mao Yang. rStar2-Agent: Agentic reasoning technical report. arXiv, 2508.20722, 2025. URL https://doi.org/10.48550/arXiv.2508.20722

[63] Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. ToolRL: Reward is all tool learning needs, 2025. URL https://arxiv.org/abs/2504.13958

[64] Xuefeng Li, Haoyang Zou, and Pengfei Liu. ToRL: Scaling tool-integrated RL. arXiv, 2503.23383, 2025. URL https://doi.org/10.48550/arXiv.2503.23383

[66] Yifan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, and Guorui Zhou. Thyme: Think beyond images. arXiv, 2508.11630, 2025. URL https://doi...

[67] Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. Agentic entropy-balanced policy optimization. arXiv, 2025. URL https://arxiv.org/abs/2510.14545

[68] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024

[69] Sijun Tan, Michael Luo, Colin Cai, Tarun Venkat, Kyle Montgomery, Aaron Hao, Tianhao Wu, Arnav Balyan, Manan Roongta, Chenguang Wang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. rLLM: A framework for post-training language agents. https://pretty-radio-b75.notion.site/rLLM-A-Framework-for-Post-Training-Language-Agents-21b81902c146819db63cd98a54ba5f31, ... 2025.

[70] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=4OsgYD7em5

[72] Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, and Yi Ma. SFT memorizes, RL generalizes: A comparative study of foundation model post-training, 2025. URL https://arxiv.org/abs/2501.17161

[73] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Ming-Hsuan Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report... doi:10.48550/arXiv.2502.13923, 2025.

[74] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

[75] Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. FireAct: Toward language agent fine-tuning. arXiv, 2310.05915, 2023. URL https://doi.org/10.48550/arXiv.2310.05915

[76] Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Frank Wang, and Yueh-Hua Wu. Unified reinforcement and imitation learning for vision-language models. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 156508–156534. Curran Associates, Inc., 2025. UR...

[77] Byung-Kwan Lee, Yu-Chiang Frank Wang, and Ryo Hachiuma. Masking teacher and reinforcing student for distilling vision-language models, 2025. URL https://arxiv.org/abs/2512.22238

[78] Jiwan Kim, Kibum Kim, Wonjoong Kim, Byung-Kwan Lee, and Chanyoung Park. Why and when visual token pruning fails? A study on relevant visual information shift in MLLMs decoding, 2026. URL https://arxiv.org/abs/2604.12358

[80] Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, and Jeany Son. Hide to see: Reasoning-prefix masking for visual-anchored thinking in VLM distillation, 2026. URL https://arxiv.org/abs/2605.11651

[81] Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, and Yueh-Hua Wu. GenRecal: Generation after recalibration from large to small vision-language models, 2026. URL https://arxiv.org/abs/2506.15681

[82] Byung-Kwan Lee, Ryo Hachiuma, Yu-Chiang Frank Wang, Yong Man Ro, and Yueh-Hua Wu. VLSI: Verbalized layers-to-interactions from large to small vision language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 29545–29557, June 2025

Showing first 80 references.

[1] [1]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.16720 2024

[3] [3]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URLhttps://arxiv.org/abs/2501.19393

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search- r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv, 2503.09516, 2025. URLhttps://doi.org/10.48550/arXiv.2503.09516

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.09516 2025

[5] [5]

Search-o1: Agentic Search-Enhanced Large Reasoning Models

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models.arXiv, 2501.05366,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

URLhttps://doi.org/10.48550/arXiv.2501.05366

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.05366

[7] [7]

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks.Trans. Mach. Learn. Res., 2023, 2023. URLhttps://openreview.net/forum?id=YfZ4ZPt8zd

2023

[8] [8]

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv, 2509.07969, 2025. URL https://doi.org/10.48550/arXiv.2509.07969

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.07969 2025

[9] [9]

URL https://proceedings.mlr

Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 13084–13094. IEEE, 2024. URLhttps://doi.org/ 10.1109/CVPR52733.2024.01243

work page doi:10.1109/cvpr52733.2024.01243 2024

[10] [10]

Agentic reasoning: Reasoning llms with tools for the deep research.arXiv, 2502.04644, 2025

Junde Wu, Jiayuan Zhu, and Yuyuan Liu. Agentic reasoning: Reasoning llms with tools for the deep research.arXiv, 2502.04644, 2025. URLhttps://doi.org/10.48550/arXiv.2502.04644

work page doi:10.48550/arxiv.2502.04644 2025

[11] [11]

Tora: A tool-integrated reasoning agent for mathematical problem solving

Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URLhttps://openreview.net/forum?id= Ep0TtjVoap

2024

[12] [12]

Understanding tool-integrated reasoning.arXiv, 2508.19201, 2025

Heng Lin and Zhongwen Xu. Understanding tool-integrated reasoning.arXiv, 2508.19201, 2025. URLhttps://doi.org/10.48550/arXiv.2508.19201

work page doi:10.48550/arxiv.2508.19201 2025

[13] [13]

Pyvision: Agentic vision with dynamic tooling.arXiv, 2507.07998, 2025

Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, and Chen Wei. Pyvision: Agentic vision with dynamic tooling.arXiv, 2507.07998, 2025. URL https://doi.org/10.48550/arXiv.2507.07998

work page doi:10.48550/arxiv.2507.07998 2025

[14] [14]

Toolorchestra: Elevating intelligence via efficient model and tool orchestration.arXiv, 2511.21689, 2025

Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, Yi Dong, Evelina Bakhturina, Tao Yu, Yejin Choi, Jan Kautz, and Pavlo Molchanov. Toolorchestra: Elevating intelligence via efficient model and tool orchestration.arXiv, 2511.21689, 2025. URL https://doi.org/10.48550/arXiv.2511.21689

work page doi:10.48550/arxiv.2511.21689 2025

[15] [15]

Smith, and Ranjay Krishna

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=GNSMl1P5VR. 13 Agent Explorative Po...

2024

[16] [16]

Deepseek-r1 thoughtology: Let’s think about LLM reasoning

Sara Vera Marjanovic, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stanczak, and Siva Reddy. Deepseek-r1 thoughtology: Let’s think about LLM reasoning. Trans. Mach...

2026

[17] [17]

Wong, and Di Wang

Shu Yang, Junchao Wu, Xin Chen, Yunze Xiao, Xinyi Yang, Derek F. Wong, and Di Wang. Understanding aha moments: from external observations to internal mechanisms, 2025. URL https://arxiv.org/abs/2504.02956

work page arXiv 2025

[18] [18]

Understanding reasoning in llms through strategic information allocation under uncertainty,

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dongsheng Li, and Yuqing Yang. Understanding reasoning in llms through strategic information allocation under uncertainty,

[19] [19]

URLhttps://arxiv.org/abs/2603.15500

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Pyvision-rl: Forging open agentic vision models via RL.arxiv, 2602.20739, 2026

Shitian Zhao, Shaoheng Lin, Ming Li, Haoquan Zhang, Wenshuo Peng, Kaipeng Zhang, and Chen Wei. Pyvision-rl: Forging open agentic vision models via RL.arxiv, 2602.20739, 2026. URLhttps://doi.org/10.48550/arXiv.2602.20739

work page doi:10.48550/arxiv.2602.20739 2026

[21] [21]

Qwen3-VL Technical Report

Qwen Team. Qwen3-vl technical report.arXiv, 2511.21631, 2025. URLhttps://doi.org/10. 48550/arXiv.2511.21631

work page internal anchor Pith review Pith/arXiv arXiv 2025

[22] [22]

Distilling LLM agent into small models with retrieval and code tools

Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, and Sung Ju Hwang. Distilling LLM agent into small models with retrieval and code tools. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id= VkicTqszOn

2025

[23] [23]

Gordon, and Drew Bagnell

Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Geoffrey J. Gordon, David B. Dunson, and Miroslav Dudík, editors,Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13,...

2011

[24] [24]

Agentic reasoning and tool integration for llms via reinforcement learning, 2025

Joykirat Singh, Raghav Magazine, Yash Pandya, and Akshay Nambi. Agentic reasoning and tool integration for llms via reinforcement learning, 2025. URLhttps://arxiv.org/abs/2505. 01441

2025

[25] [25]

DeepEyesV2: Toward Agentic Multimodal Model

Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model, 2025. URLhttps://arxiv.org/abs/2511.05271

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv, 2402.03300, 2024. URL https://doi.org/10.48550/arXiv.2402. 03300

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402 2024

[27] [27]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Honglin Yu, Weinan Dai, Yuxuan Song, Xiang Wei, Haodong Zhou, Jingjing Liu, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [28]

ProRL: Prolonged reinforcement learning expands reasoning boundaries in large language models

Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. ProRL: Prolonged reinforcement learning expands reasoning boundaries in large language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

[29] [29]

URLhttps://openreview.net/forum?id=YPsJha5HXQ

[30] [30]

POPE: learning to reason on hard problems via privileged on-policy exploration.arXiv, 2601.18779,

Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar. POPE: learning to reason on hard problems via privileged on-policy exploration.arXiv, 2601.18779,

work page arXiv

[31] [31]

URLhttps://doi.org/10.48550/arXiv.2601.18779

work page doi:10.48550/arxiv.2601.18779

[32] [32]

Self-hinting language models enhance reinforcement learning.arXiv, 2602.03143, 2026

Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, and Jiang Bian. Self-hinting language models enhance reinforcement learning.arXiv, 2602.03143, 2026. URLhttps://doi.org/10. 48550/arXiv.2602.03143

work page arXiv 2026

[33] [33]

Acting less is reasoning more! teaching model to act efficiently, 2025

Hongru Wang, Cheng Qian, Wanjun Zhong, Xiusi Chen, Jiahao Qiu, Shijue Huang, Bowen Jin, Mengdi Wang, Kam-Fai Wong, and Heng Ji. Acting less is reasoning more! teaching model to act efficiently, 2025. URLhttps://arxiv.org/abs/2504.14870

work page arXiv 2025

[34] [35]

Agentic reinforced policy optimization

Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. Agentic reinforced policy optimization. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id= TX4k7BF6aO

2026

[35] [36]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv, 1707.06347, 2017. URL http://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017

[36] [37]

RAGEN-2: Reasoning Collapse in Agentic RL

Zihan Wang, Chi Gui, Xing Jin, Qineng Wang, Licheng Liu, Kangrui Wang, Shiqi Chen, Linjie Li, Zhengyuan Yang, Pingyue Zhang, Yiping Lu, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. Ragen-2: Reasoning collapse in agentic rl, 2026. URLhttps: //arxiv.org/abs/2604.06268

work page internal anchor Pith review Pith/arXiv arXiv 2026

[37] [38]

Deep think with confidence.arXiv,

Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence.arXiv,

[38] [39]

URLhttps://arxiv.org/abs/2508.15260

work page internal anchor Pith review Pith/arXiv arXiv

[39] [40]

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv, 2504.08837, 2025. URLhttps://doi.org/10.48550/arXiv.2504.08837

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.08837 2025

[40] [41]

MMSearch-R1: Incentivizing LMMs to Search

Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, and Ziwei Liu. Mmsearch-r1: Incentivizing lmms to search. 2025. URLhttps://arxiv.org/abs/2506.20670

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [43]

Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset

Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset.arXiv, 2024. URL https: //arxiv.org/abs/2402.14804. 15 Agent Explorative Policy Optimization

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [44]

Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models

Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, and Huan Zhang. Dynamath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URLhttps://openreview.net/forum? id=V...

2025

[43] [45]

Codeplot- cot: Mathematical visual reasoning by thinking with code-driven images, 2025

Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, Hongsheng Li, Yi Ma, and Xihui Liu. Codeplot- cot: Mathematical visual reasoning by thinking with code-driven images, 2025. URLhttps: //arxiv.org/abs/2510.11718

work page arXiv 2025

[44] [46]

Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models

Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Toby Walsh, Julie Shah, and Zico Kolter, editors,Thirty-Ninth AAAI Conference on Artificial Intelligence, Thirty-Seventh Conferenc...

work page doi:10.1609/aaai.v39i8.32852 2025

[45] [47]

Sensenova- mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv, 2512.24330, 2025

Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Xie Chen, Gao Huang, Dahua Lin, and Lewei Lu. Sensenova- mars: Empowering multimodal agentic reasoning and search via reinforcement learning.arXiv, 2512.24330, 2025. URLhttps://doi.org/10.48550/arXiv.2512.24330

work page doi:10.48550/arxiv.2512.24330 2025

[46] [48]

Mmsearch: Benchmarking the potential of large models as multi-modal search engines.arXiv, 2409.12959,

Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, and Hongsheng Li. Mmsearch: Benchmarking the potential of large models as multi-modal search engines.arXiv, 2409.12959,

work page arXiv

[47] [49]

URLhttps://doi.org/10.48550/arXiv.2409.12959

work page doi:10.48550/arxiv.2409.12959

[48] [50]

Pixel reasoner: Incentivizing pixel space reasoning via curiosity-driven reinforcement learning

Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel space reasoning via curiosity-driven reinforcement learning. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2026. URL https: //openreview.net/forum?id=VeZkY3JjWV

2026

[49] [51]

Deepeyes: Incentivizing ”thinking with images” via reinforcement learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and XingYu. Deepeyes: Incentivizing ”thinking with images” via reinforcement learning. In The Fourteenth International Conference on Learning Representations, 2026. URLhttps: //openreview.net/forum?id=xUyMXkI958

2026

[50] [52]

Andrew Bagnell, Aarti Singh, and Andrea Zanette

Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J. Andrew Bagnell, Aarti Singh, and Andrea Zanette. Expanding the capabilities of reinforcement learning via text feedback, 2026. URLhttps://arxiv.org/abs/2602.02482

work page arXiv 2026

[51] [53]

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

MiniMax, :, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Z...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[52] [54]

Narasimhan, and Yuan Cao

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5,

2023

[53] [55]

URLhttps://openreview.net/forum?id=WE_vluYUL-X

OpenReview.net, 2023. URLhttps://openreview.net/forum?id=WE_vluYUL-X

2023

[54] [56]

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms.arXiv, 2504.11536, 2025. URLhttps://arxiv.org/abs/2504.11536

work page internal anchor Pith review Pith/arXiv arXiv 2025

[55] [57]

Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning.arXiv, 2310.03731, 2023

Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning.arXiv, 2310.03731, 2023. URLhttps://arxiv.org/abs/ 2310.03731

work page arXiv 2023

[56] [58]

T1: Tool-integrated verification for test-time compute scaling in small language models

Minki Kang, Jongwon Jeong, and Jaewoong Cho. T1: Tool-integrated verification for test-time compute scaling in small language models. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=tBkLWfmugI

2026

[57] [59]

Introducing agentic vision in gemini 3 flash

Google Deepmind. Introducing agentic vision in gemini 3 flash. https://blog.google/ innovation-and-ai/technology/developers-tools/agentic-vision-gemini-3-flash/ , 2025

2025

[58] [60]

Thinking with images

OpenAI. Thinking with images. https://openai.com/index/thinking-with-images/, 2025

2025

[59] [61]

V-thinker: Interactive thinking with images

Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, Jing Lyu, and Honggang Zhang. V-thinker: Interactive thinking with images, 2025. URLhttps://arxiv.org/abs/2511.04460

work page arXiv 2025

[60] [62]

rstar2-agent: Agentic reasoning technical report.arXiv, 2508.20722, 2025

Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, and Mao Yang. rstar2-agent: Agentic reasoning technical report.arXiv, 2508.20722, 2025. URL https://doi.org/10.48550/arXiv.2508.20722

work page doi:10.48550/arxiv.2508.20722 2025

[61] [63]

ToolRL: Reward is All Tool Learning Needs

Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs, 2025. URL https: //arxiv.org/abs/2504.13958

work page internal anchor Pith review Pith/arXiv arXiv 2025

[62] [64]

Torl: Scaling tool-integrated RL.arXiv, 2503.23383,

Xuefeng Li, Haoyang Zou, and Pengfei Liu. Torl: Scaling tool-integrated RL.arXiv, 2503.23383,

work page arXiv

[63] [65]

URLhttps://doi.org/10.48550/arXiv.2503.23383

work page doi:10.48550/arxiv.2503.23383

[64] [66]

Thyme: Think Beyond Images

Yifan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, Haonan Fan, Kaibing Chen, Jiankang Chen, Haojie Ding, 17 Agent Explorative Policy Optimization Kaiyu Tang, Zhang Zhang, Liang Wang, Fan Yang, Tingting Gao, and Guorui Zhou. Thyme: Think beyond images.arXiv, 2508.11630, 2025. URL https://doi...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2025

[65] [67]

Agentic entropy-balanced policy optimization.arXiv, 2025

Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. Agentic entropy-balanced policy optimization.arXiv, 2025. URL https: //arxiv.org/abs/2510.14545

work page arXiv 2025

[66] [68]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[67] [69]

rllm: A framework for post-training language agents.https://pretty-radio-b75

Sijun Tan, Michael Luo, Colin Cai, Tarun Venkat, Kyle Montgomery, Aaron Hao, Tianhao Wu, Arnav Balyan, Manan Roongta, Chenguang Wang, Li Erran Li, Raluca Ada Popa, and Ion Stoica. rllm: A framework for post-training language agents.https://pretty-radio-b75. notion.site/rLLM-A-Framework-for-Post-Training-Language-Agents\ -21b81902c146819db63cd98a54ba5f31, ...

2025

[68] [70]

Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

[69] [71]

URLhttps://openreview.net/forum?id=4OsgYD7em5

[70] [72]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training, 2025. URLhttps://arxiv.org/abs/2501.17161

work page internal anchor Pith review Pith/arXiv arXiv 2025

[71] [73]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Ming-Hsuan Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.13923 2025

[72] [74]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

[73] [75]

Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. FireAct: Toward language agent fine-tuning. arXiv, 2310.05915, 2023. URL https://doi.org/10.48550/arXiv.2310.05915

[74] [76]

Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Frank Wang, and Yueh-Hua Wu. Unified reinforcement and imitation learning for vision-language models. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors, Advances in Neural Information Processing Systems, volume 38, pages 156508–156534. Curran Associates, Inc., 2025. UR...

[75] [77]

Byung-Kwan Lee, Yu-Chiang Frank Wang, and Ryo Hachiuma. Masking teacher and reinforcing student for distilling vision-language models, 2025. URL https://arxiv.org/abs/2512.22238

[76] [78]

Jiwan Kim, Kibum Kim, Wonjoong Kim, Byung-Kwan Lee, and Chanyoung Park. Why and when visual token pruning fails? A study on relevant visual information shift in MLLMs decoding, 2026. URL https://arxiv.org/abs/2604.12358

[78] [80]

Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, and Jeany Son. Hide to see: Reasoning-prefix masking for visual-anchored thinking in VLM distillation, 2026. URL https://arxiv.org/abs/2605.11651

[79] [81]

Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, and Yueh-Hua Wu. GenRecal: Generation after recalibration from large to small vision-language models, 2026. URL https://arxiv.org/abs/2506.15681

[80] [82]

Byung-Kwan Lee, Ryo Hachiuma, Yu-Chiang Frank Wang, Yong Man Ro, and Yueh-Hua Wu. VLSI: Verbalized layers-to-interactions from large to small vision language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 29545–29557, June 2025