pith. machine review for the scientific record.

arxiv: 2604.27859 · v2 · submitted 2026-04-30 · 💻 cs.AI · cs.ET

Recognition: unknown

A Brief Overview: Agentic Reinforcement Learning In Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:08 UTC · model grok-4.3

classification 💻 cs.AI · cs.ET
keywords agentic reinforcement learning · large language models · meta-reasoning · self-reflection · autonomous agents · long-term planning · dynamic strategy adaptation

The pith

Large language models shift reinforcement learning toward autonomous agents that set goals, plan ahead, and reflect on their actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional reinforcement learning optimizes fixed reward functions in restricted settings through episodic trials. This paper maps how LLMs introduce an agentic version that lets agents generate their own objectives, maintain long-term plans, adjust tactics mid-task, and embed self-reflection inside the training loop. A sympathetic reader would care because the change could produce systems that operate in messy, open-ended real-world conditions instead of narrow, pre-defined tasks. The overview examines the underlying concepts, practical designs, open problems, and possible next steps for such systems.
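
To make the contrast concrete, here is a minimal sketch of the two loops. Every name in it (`env`, `policy`, `llm`, and their methods) is a hypothetical placeholder interface, not code from the paper.

```python
def classic_episode(env, policy):
    """Traditional RL: episodic rollout against a fixed, external reward."""
    state, done, total = env.reset(), False, 0.0
    while not done:
        action = policy(state)                   # optimize a predefined reward
        state, reward, done = env.step(action)
        total += reward
    return total                                 # the only learning signal

def agentic_episode(env, llm):
    """Agentic RL: the agent sets goals, plans, and reflects inside the loop."""
    state = env.reset()
    goal = llm.propose_goal(state)               # self-generated objective
    memory, done = [], False
    while not done:
        plan = llm.plan(state, goal, memory)     # long-term, multi-step planning
        state, feedback, done = env.step(plan.next_action)
        critique = llm.reflect(plan, feedback)   # self-evaluation in the loop
        memory.append(critique)                  # carried into later decisions
        if critique.suggests_new_goal:
            goal = llm.propose_goal(state)       # dynamic strategy adaptation
    return memory
```

The difference the paper emphasizes is where the learning signal lives: in the first loop it is entirely the scalar reward; in the second, goal proposals, plans, and reflections are themselves part of what gets trained.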

Core claim

LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop, extending beyond traditional RL that relies on static objectives and episodic interactions.

What carries the argument

The agentic paradigm, in which LLMs supply goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning to reinforcement-learning agents.

If this is right

  • Agents can function without externally supplied static reward functions.
  • Learning processes now include built-in mechanisms for strategy revision and self-evaluation.
  • Systems become viable for complex, open-ended tasks that require ongoing interaction with uncertain environments.
  • Design choices center on embedding cognitive capabilities directly rather than post-hoc wrappers.
  • Key challenges around stability and integration must be solved before broad deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same mechanisms could support safer deployment by letting agents notice and correct their own misaligned actions.
  • Benchmarks that test sustained reflection over changing environments would be needed to measure progress.
  • Multi-agent versions might emerge naturally once single agents demonstrate reliable internal reasoning.

Load-bearing premise

Current LLMs can reliably integrate and sustain meta-reasoning, self-reflection, and dynamic strategy adaptation in uncertain real-world environments without fundamental limitations in model scale or training stability.

What would settle it

An experiment in which LLMs lose coherence or consistency in self-reflection across long sequences of changing conditions would show the approach cannot yet deliver reliable agentic behavior.
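
One way such an experiment could be operationalized: perturb the environment on a schedule and score whether the agent's self-reflections stay consistent with what actually happened. The sketch below is Pith's illustration; `agent`, `env`, and their methods are assumed interfaces, and the metric is deliberately crude.

```python
def reflection_consistency(agent, env, horizon=500, drift_every=50):
    """Fraction of steps where the agent's stated self-assessment
    matches the observed outcome, under scheduled environment drift."""
    state = env.reset()
    hits = 0
    for t in range(horizon):
        if t > 0 and t % drift_every == 0:
            env.perturb()                            # changing conditions
        action = agent.act(state)
        state, outcome = env.step(action)
        reflection = agent.reflect(action, outcome)  # "did that work, and why?"
        hits += int(reflection.claimed_outcome == outcome)
    return hits / horizon                            # sustained coherence score
```

A score that decays as `horizon` grows or `drift_every` shrinks would be the kind of result the test above describes.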

Figures

Figures reproduced from arXiv: 2604.27859 by Cheng Fang, Fangming Cui, Jiahong Li, Ruixiao Zhu, Sunan Li.

Figure 1
Figure 1. Agent. Action-value function: $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k}\right]$. view at source ↗
Figure 2
Figure 2. Demonstration of DPO, PPO, and GRPO. The GSPO loss (applied to the latest Qwen3 models) is defined as
$$\mathcal{J}_{\mathrm{GSPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\ \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\Big( w_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(w_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i \Big) \right], \tag{20}$$
where
$$\hat{A}_i = \frac{r(x, y_i) - \operatorname{mean}\big(\{r(x, y_i)\}_{i=1}^{G}\big)}{\operatorname{std}\big(\{r(x, y_i)\}_{i=1}^{G}\big)}. \tag{21}$$
view at source ↗
Figure 3
Figure 3. Evolution diagram of RL algorithm technology. view at source ↗
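
Equation (21) in Figure 2 is a within-group z-score of sampled-response rewards, and Eq. (20) is a PPO-style clipped surrogate over those advantages. A minimal sketch of that recipe follows; it is Pith's illustration, not code from the paper, and it assumes the importance ratios $w_i(\theta)$ (sequence-level in GSPO's case) are already computed.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Eq. (21): z-score the rewards of the G responses sampled for one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def gspo_surrogate(w: torch.Tensor, adv: torch.Tensor,
                   clip_eps: float = 0.2) -> torch.Tensor:
    """Eq. (20): clipped surrogate averaged over the group;
    `w` holds the importance ratios w_i(theta)."""
    clipped = torch.clamp(w, 1.0 - clip_eps, 1.0 + clip_eps)
    return torch.min(w * adv, clipped * adv).mean()

# Toy usage: G = 4 responses sampled for a single prompt x.
rewards = torch.tensor([0.1, 0.9, 0.4, 0.6])      # r(x, y_i)
ratios = torch.tensor([1.05, 0.97, 1.10, 0.92])   # w_i(theta), assumed given
objective = gspo_surrogate(ratios, group_advantages(rewards))
loss = -objective                                  # maximize J => minimize -J
```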
read the original abstract

Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large Language Models (LLMs) and increasingly complex, open-ended tasks has catalyzed a paradigm shift towards agentic paradigms within RL. This emerging framework extends beyond traditional RL by emphasizing the development of autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain, real-world environments. Unlike conventional approaches that rely heavily on static objectives and episodic interactions, LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. In this paper, we provide a deep insight for looking the conceptual foundations, methodological innovations, and effective designs underlying this trend. Furthermore, we identify critical challenges and outline promising future directions for building LLM-based Agentic RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 4 minor

Summary. The manuscript is a brief overview of Agentic Reinforcement Learning (RL) in Large Language Models (LLMs). It contrasts traditional RL, which optimizes predefined reward functions in narrowly defined environments using static objectives and episodic interactions, with an emerging LLM-based agentic paradigm. This paradigm emphasizes autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain real-world environments, and it incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. The paper reviews conceptual foundations, methodological innovations, and effective designs, while identifying critical challenges and outlining future directions.

Significance. If the synthesis of prior literature is accurate and balanced, the overview could usefully organize an emerging subfield for researchers, framing how LLMs extend RL beyond conventional limits by embedding meta-reasoning and self-reflection. Its primary value is descriptive and organizational rather than in new derivations, data, or proofs; credit is due for explicitly calling out challenges and future directions drawn from existing work. The central narrative is a high-level characterization of trends rather than a testable claim, so impact would be moderate as an accessible entry point.

minor comments (4)
  1. Abstract: the phrasing 'provide a deep insight for looking the conceptual foundations' is grammatically awkward and unclear; rephrase for precision, e.g., 'provides insights into the conceptual foundations'.
  2. Title vs. abstract: the title describes the work as 'A Brief Overview' while the abstract claims to 'provide a deep insight'; align these to set consistent reader expectations about scope and depth.
  3. As a review-style paper, the manuscript would benefit from a summary table (e.g., in the methodological innovations section) listing key cited works, their specific contributions to agentic RL, and how they address goal-setting or self-reflection; this would improve clarity and utility without adding new content.
  4. Ensure all references to prior literature in the challenges and future-directions sections are accompanied by concrete citations so readers can trace the synthesis back to source papers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and constructive review. The referee's summary accurately captures the manuscript's focus as a high-level overview contrasting traditional RL with the emerging LLM-based agentic paradigm, including its emphasis on cognitive capabilities, challenges, and future directions. We appreciate the recognition of the paper's organizational value for researchers entering this subfield.

Circularity Check

0 steps flagged

No significant circularity in this review-style overview

full rationale

This manuscript is a descriptive synthesis of existing literature on LLM-based Agentic RL, summarizing trends such as goal-setting, meta-reasoning, and self-reflection drawn from external sources. It contains no original equations, derivations, predictions, fitted parameters, or formal proofs that could reduce to the paper's own inputs by construction. The central claims are characterizations of an emerging paradigm rather than self-referential results, satisfying the criteria for a self-contained review with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper with no new technical derivations, so the ledger contains no free parameters, axioms, or invented entities introduced by the authors.

pith-pipeline@v0.9.0 · 5460 in / 1179 out tokens · 41355 ms · 2026-05-08T03:08:49.080269+00:00 · methodology


Reference graph

Works this paper leans on

133 extracted references · 104 canonical work pages · 32 internal anchors
