A Brief Overview: Agentic Reinforcement Learning In Large Language Models
Pith reviewed 2026-05-08 03:08 UTC · model grok-4.3
The pith
Large language models shift reinforcement learning toward autonomous agents that set goals, plan ahead, and reflect on their actions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop, extending beyond traditional RL that relies on static objectives and episodic interactions.
What carries the argument
The agentic paradigm, in which LLMs supply goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning to reinforcement-learning agents.
If this is right
- Agents can function without externally supplied static reward functions.
- Learning processes now include built-in mechanisms for strategy revision and self-evaluation.
- Systems become viable for complex, open-ended tasks that require ongoing interaction with uncertain environments.
- Design choices center on embedding cognitive capabilities directly rather than post-hoc wrappers.
- Key challenges around stability and integration must be solved before broad deployment.
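The contrast behind these implications can be sketched in code. The toy below is purely illustrative — `TraditionalAgent`, `AgenticAgent`, and `reflect` are assumed names, not constructs from the paper: the first agent only tracks an externally supplied static reward, while the second carries its own goal and revises it through a built-in self-evaluation step.

```python
"""Toy contrast: static-reward RL step vs. an 'agentic' step with
built-in self-reflection. All names are illustrative assumptions."""

from dataclasses import dataclass, field


@dataclass
class TraditionalAgent:
    """Optimizes a fixed, externally supplied reward signal."""
    value: float = 0.0

    def step(self, reward: float, lr: float = 0.1) -> None:
        # Classic incremental update toward a static objective.
        self.value += lr * (reward - self.value)


@dataclass
class AgenticAgent:
    """Carries its own goal and revises it by reflecting on outcomes."""
    goal: float = 1.0
    history: list = field(default_factory=list)

    def step(self, outcome: float) -> None:
        self.history.append(outcome)
        self.reflect()

    def reflect(self) -> None:
        # Self-evaluation inside the loop: if recent outcomes keep
        # missing the goal badly, lower the goal rather than wait for
        # an external reward redesign.
        recent = self.history[-3:]
        if len(recent) == 3 and all(o < self.goal * 0.5 for o in recent):
            self.goal *= 0.8


trad = TraditionalAgent()
for r in [1.0, 1.0, 1.0]:
    trad.step(r)

agentic = AgenticAgent(goal=1.0)
for o in [0.1, 0.1, 0.1]:
    agentic.step(o)

print(trad.value)       # value crept toward the static reward
print(agentic.goal)     # goal was revised via self-reflection
```

The design point is the location of the revision logic: in the agentic sketch, strategy revision lives inside the learning loop itself, which is what the paper means by embedding cognitive capabilities directly rather than as post-hoc wrappers.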
Where Pith is reading between the lines
- The same mechanisms could support safer deployment by letting agents notice and correct their own misaligned actions.
- Benchmarks that test sustained reflection over changing environments would be needed to measure progress.
- Multi-agent versions might emerge naturally once single agents demonstrate reliable internal reasoning.
Load-bearing premise
Current LLMs can reliably integrate and sustain meta-reasoning, self-reflection, and dynamic strategy adaptation in uncertain real-world environments without fundamental limitations in model scale or training stability.
What would settle it
An experiment demonstrating that LLMs lose coherence or consistency in self-reflection across long sequences of changing conditions would show that the approach cannot yet deliver reliable agentic behavior.
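One way to operationalize such a test is a consistency probe: elicit the agent's self-reflection repeatedly under each condition and score how often the reflections agree. A minimal sketch, in which `query_agent` is a hypothetical stand-in for an actual LLM call (here a deterministic stub):

```python
"""Sketch of a reflection-consistency probe for the falsification test
described above. `query_agent` is a hypothetical stand-in for an LLM
call; a real harness would sample the model under each condition."""


def query_agent(condition: str, turn: int) -> str:
    # Deterministic stub: a coherent agent restates the same plan for
    # a given condition; a drifting agent would not.
    return f"plan-for-{condition}"


def reflection_consistency(conditions: list[str], repeats: int = 3) -> float:
    """Fraction of conditions where repeated self-reflections agree."""
    consistent = 0
    for cond in conditions:
        answers = {query_agent(cond, t) for t in range(repeats)}
        if len(answers) == 1:  # identical reflection every time
            consistent += 1
    return consistent / len(conditions)


score = reflection_consistency(["sunny", "storm", "fog"])
print(score)
```

Under the stub the score is trivially 1.0; the experiment the review envisions would run this probe over long, shifting condition sequences and look for the score degrading as the sequence grows.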
Original abstract
Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large Language Models (LLMs) and increasingly complex, open-ended tasks has catalyzed a paradigm shift towards agentic paradigms within RL. This emerging framework extends beyond traditional RL by emphasizing the development of autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain, real-world environments. Unlike conventional approaches that rely heavily on static objectives and episodic interactions, LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. In this paper, we provide a deep insight for looking the conceptual foundations, methodological innovations, and effective designs underlying this trend. Furthermore, we identify critical challenges and outline promising future directions for building LLM-based Agentic RL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a brief overview of Agentic Reinforcement Learning (RL) in Large Language Models (LLMs). It contrasts traditional RL, which optimizes predefined reward functions in narrowly defined environments using static objectives and episodic interactions, with an emerging LLM-based agentic paradigm. This paradigm emphasizes autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, interactive reasoning in uncertain real-world environments, and the direct incorporation of cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making into the learning loop. The paper reviews conceptual foundations, methodological innovations, and effective designs, while identifying critical challenges and outlining future directions.
Significance. If the synthesis of prior literature is accurate and balanced, the overview could usefully organize an emerging subfield for researchers, framing how LLMs extend RL beyond conventional limits by embedding meta-reasoning and self-reflection. Its primary value is descriptive and organizational rather than in new derivations, data, or proofs; credit is due for explicitly calling out challenges and future directions drawn from existing work. The central narrative is a high-level characterization of trends rather than a testable claim, so impact would be moderate as an accessible entry point.
Minor comments (4)
- Abstract: the phrasing 'provide a deep insight for looking the conceptual foundations' is grammatically awkward and unclear; rephrase for precision, e.g., 'provides insights into the conceptual foundations'.
- Title vs. abstract: the title describes the work as 'A Brief Overview' while the abstract claims to 'provide a deep insight'; align these to set consistent reader expectations about scope and depth.
- As a review-style paper, the manuscript would benefit from a summary table (e.g., in the methodological innovations section) listing key cited works, their specific contributions to agentic RL, and how they address goal-setting or self-reflection; this would improve clarity and utility without adding new content.
- Ensure all references to prior literature in the challenges and future-directions sections are accompanied by concrete citations so readers can trace the synthesis back to source papers.
Simulated Author's Rebuttal
We thank the referee for the positive and constructive review. The referee's summary accurately captures the manuscript's focus as a high-level overview contrasting traditional RL with the emerging LLM-based agentic paradigm, including its emphasis on cognitive capabilities, challenges, and future directions. We appreciate the recognition of the paper's organizational value for researchers entering this subfield.
Circularity Check
No significant circularity in this review-style overview
Full rationale
This manuscript is a descriptive synthesis of existing literature on LLM-based Agentic RL, summarizing trends such as goal-setting, meta-reasoning, and self-reflection drawn from external sources. It contains no original equations, derivations, predictions, fitted parameters, or formal proofs that could reduce to the paper's own inputs by construction. The central claims are characterizations of an emerging paradigm rather than self-referential results, satisfying the criteria for a self-contained review with no load-bearing circular steps.