pith. machine review for the scientific record.

arxiv: 2604.27859 · v2 · submitted 2026-04-30 · 💻 cs.AI · cs.ET

Recognition: unknown

A Brief Overview: Agentic Reinforcement Learning In Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:08 UTC · model grok-4.3

classification 💻 cs.AI · cs.ET
keywords agentic reinforcement learning · large language models · meta-reasoning · self-reflection · autonomous agents · long-term planning · dynamic strategy adaptation

The pith

Large language models shift reinforcement learning toward autonomous agents that set goals, plan ahead, and reflect on their actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional reinforcement learning optimizes fixed reward functions in restricted settings through episodic trials. This paper maps how LLMs introduce an agentic version that lets agents generate their own objectives, maintain long-term plans, adjust tactics mid-task, and embed self-reflection inside the training loop. A sympathetic reader would care because the change could produce systems that operate in messy, open-ended real-world conditions instead of narrow, pre-defined tasks. The overview examines the underlying concepts, practical designs, open problems, and possible next steps for such systems.
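
To make the contrast concrete, here is a minimal sketch of the two loops. Every name in it (`env`, `policy`, `llm`, and their methods) is a hypothetical placeholder interface, not code from the paper.

```python
def classic_episode(env, policy):
    """Traditional RL: episodic rollout against a fixed, external reward."""
    state, done, total = env.reset(), False, 0.0
    while not done:
        action = policy(state)                   # optimize a predefined reward
        state, reward, done = env.step(action)
        total += reward
    return total                                 # the only learning signal

def agentic_episode(env, llm):
    """Agentic RL: the agent sets goals, plans, and reflects inside the loop."""
    state = env.reset()
    goal = llm.propose_goal(state)               # self-generated objective
    memory, done = [], False
    while not done:
        plan = llm.plan(state, goal, memory)     # long-term, multi-step planning
        state, feedback, done = env.step(plan.next_action)
        critique = llm.reflect(plan, feedback)   # self-evaluation in the loop
        memory.append(critique)                  # carried into later decisions
        if critique.suggests_new_goal:
            goal = llm.propose_goal(state)       # dynamic strategy adaptation
    return memory
```

The difference the paper emphasizes is where the learning signal lives: in the first loop it is entirely the scalar reward; in the second, goal proposals, plans, and reflections are themselves part of what gets trained.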

Core claim

LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop, extending beyond traditional RL that relies on static objectives and episodic interactions.

What carries the argument

The agentic paradigm, in which LLMs supply goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning to reinforcement-learning agents.

If this is right

  • Agents can function without externally supplied static reward functions.
  • Learning processes now include built-in mechanisms for strategy revision and self-evaluation.
  • Systems become viable for complex, open-ended tasks that require ongoing interaction with uncertain environments.
  • Design choices center on embedding cognitive capabilities directly rather than post-hoc wrappers.
  • Key challenges around stability and integration must be solved before broad deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same mechanisms could support safer deployment by letting agents notice and correct their own misaligned actions.
  • Benchmarks that test sustained reflection over changing environments would be needed to measure progress.
  • Multi-agent versions might emerge naturally once single agents demonstrate reliable internal reasoning.

Load-bearing premise

Current LLMs can reliably integrate and sustain meta-reasoning, self-reflection, and dynamic strategy adaptation in uncertain real-world environments without fundamental limitations in model scale or training stability.

What would settle it

An experiment in which LLMs lose coherence or consistency in self-reflection across long sequences of changing conditions would show the approach cannot yet deliver reliable agentic behavior.
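
One way such an experiment could be operationalized: perturb the environment on a schedule and score whether the agent's self-reflections stay consistent with what actually happened. The sketch below is Pith's illustration; `agent`, `env`, and their methods are assumed interfaces, and the metric is deliberately crude.

```python
def reflection_consistency(agent, env, horizon=500, drift_every=50):
    """Fraction of steps where the agent's stated self-assessment
    matches the observed outcome, under scheduled environment drift."""
    state = env.reset()
    hits = 0
    for t in range(horizon):
        if t > 0 and t % drift_every == 0:
            env.perturb()                            # changing conditions
        action = agent.act(state)
        state, outcome = env.step(action)
        reflection = agent.reflect(action, outcome)  # "did that work, and why?"
        hits += int(reflection.claimed_outcome == outcome)
    return hits / horizon                            # sustained coherence score
```

A score that decays as `horizon` grows or `drift_every` shrinks would be the kind of result the test above describes.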

Figures

Figures reproduced from arXiv: 2604.27859 by Cheng Fang, Fangming Cui, Jiahong Li, Ruixiao Zhu, Sunan Li.

Figure 1
Figure 1. Agent. Action-value function: $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k}\right]$. view at source ↗
Figure 2
Figure 2. Demonstration of DPO, PPO, and GRPO. The GSPO loss (applied to the latest Qwen3 models) is defined as
$$\mathcal{J}_{\mathrm{GSPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\ \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\Big( w_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(w_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i \Big) \right], \tag{20}$$
where
$$\hat{A}_i = \frac{r(x, y_i) - \operatorname{mean}\big(\{r(x, y_i)\}_{i=1}^{G}\big)}{\operatorname{std}\big(\{r(x, y_i)\}_{i=1}^{G}\big)}. \tag{21}$$
view at source ↗
Figure 3
Figure 3. Evolution diagram of RL algorithm technology. view at source ↗
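
Equation (21) in Figure 2 is a within-group z-score of sampled-response rewards, and Eq. (20) is a PPO-style clipped surrogate over those advantages. A minimal sketch of that recipe follows; it is Pith's illustration, not code from the paper, and it assumes the importance ratios $w_i(\theta)$ (sequence-level in GSPO's case) are already computed.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Eq. (21): z-score the rewards of the G responses sampled for one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def gspo_surrogate(w: torch.Tensor, adv: torch.Tensor,
                   clip_eps: float = 0.2) -> torch.Tensor:
    """Eq. (20): clipped surrogate averaged over the group;
    `w` holds the importance ratios w_i(theta)."""
    clipped = torch.clamp(w, 1.0 - clip_eps, 1.0 + clip_eps)
    return torch.min(w * adv, clipped * adv).mean()

# Toy usage: G = 4 responses sampled for a single prompt x.
rewards = torch.tensor([0.1, 0.9, 0.4, 0.6])      # r(x, y_i)
ratios = torch.tensor([1.05, 0.97, 1.10, 0.92])   # w_i(theta), assumed given
objective = gspo_surrogate(ratios, group_advantages(rewards))
loss = -objective                                  # maximize J => minimize -J
```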
read the original abstract

Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large Language Models (LLMs) and increasingly complex, open-ended tasks has catalyzed a paradigm shift towards agentic paradigms within RL. This emerging framework extends beyond traditional RL by emphasizing the development of autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain, real-world environments. Unlike conventional approaches that rely heavily on static objectives and episodic interactions, LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. In this paper, we provide a deep insight for looking the conceptual foundations, methodological innovations, and effective designs underlying this trend. Furthermore, we identify critical challenges and outline promising future directions for building LLM-based Agentic RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 4 minor

Summary. The manuscript is a brief overview of Agentic Reinforcement Learning (RL) in Large Language Models (LLMs). It contrasts traditional RL, which optimizes predefined reward functions in narrowly defined environments using static objectives and episodic interactions, with an emerging LLM-based agentic paradigm. This paradigm emphasizes autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain real-world environments, and it incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. The paper reviews conceptual foundations, methodological innovations, and effective designs, while identifying critical challenges and outlining future directions.

Significance. If the synthesis of prior literature is accurate and balanced, the overview could usefully organize an emerging subfield for researchers, framing how LLMs extend RL beyond conventional limits by embedding meta-reasoning and self-reflection. Its primary value is descriptive and organizational rather than in new derivations, data, or proofs; credit is due for explicitly calling out challenges and future directions drawn from existing work. The central narrative is a high-level characterization of trends rather than a testable claim, so impact would be moderate as an accessible entry point.

minor comments (4)
  1. Abstract: the phrasing 'provide a deep insight for looking the conceptual foundations' is grammatically awkward and unclear; rephrase for precision, e.g., 'provides insights into the conceptual foundations'.
  2. Title vs. abstract: the title describes the work as 'A Brief Overview' while the abstract claims to 'provide a deep insight'; align these to set consistent reader expectations about scope and depth.
  3. As a review-style paper, the manuscript would benefit from a summary table (e.g., in the methodological innovations section) listing key cited works, their specific contributions to agentic RL, and how they address goal-setting or self-reflection; this would improve clarity and utility without adding new content.
  4. Ensure all references to prior literature in the challenges and future-directions sections are accompanied by concrete citations so readers can trace the synthesis back to source papers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and constructive review. The referee's summary accurately captures the manuscript's focus as a high-level overview contrasting traditional RL with the emerging LLM-based agentic paradigm, including its emphasis on cognitive capabilities, challenges, and future directions. We appreciate the recognition of the paper's organizational value for researchers entering this subfield.

Circularity Check

0 steps flagged

No significant circularity in this review-style overview

full rationale

This manuscript is a descriptive synthesis of existing literature on LLM-based Agentic RL, summarizing trends such as goal-setting, meta-reasoning, and self-reflection drawn from external sources. It contains no original equations, derivations, predictions, fitted parameters, or formal proofs that could reduce to the paper's own inputs by construction. The central claims are characterizations of an emerging paradigm rather than self-referential results, satisfying the criteria for a self-contained review with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper with no new technical derivations, so the ledger contains no free parameters, axioms, or invented entities introduced by the authors.

pith-pipeline@v0.9.0 · 5460 in / 1179 out tokens · 41355 ms · 2026-05-08T03:08:49.080269+00:00 · methodology


Reference graph

Works this paper leans on

133 extracted references · 104 canonical work pages · 32 internal anchors
