Next-Generation Agentic Reinforcement Learning Systems Enable Self-Evolving Agents
Pith reviewed 2026-07-03 18:45 UTC · model grok-4.3
The pith
Self-evolving LLM agents at enterprise scale are blocked by missing agentic RL systems rather than by reinforcement learning algorithms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that next-generation agentic RL systems must be co-designed around three components: a standardized agent trajectory data protocol capable of carrying RL learning signals at step granularity, an enterprise-grade comprehensive data proxy that converts real workloads into governed learning substrates, and a unified agent evolution control plane that automatically decides, based on trajectory statistics, when to update policy weights or evolve the in-context harness. Only then can self-evolving agents move from individual prototypes to large-scale enterprise service. The paper partially instantiates this design in AReaL2.0, which reorganizes existing RL infrastructure into an agent-oriented online RL loop.
What carries the argument
The three gaps the paper identifies in current agentic RL systems: the lack of a standardized trajectory data protocol, of an enterprise-grade data proxy, and of a unified evolution control plane. The paper treats these as the primary blockers preventing continual learning from deployed workloads.
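To make "RL learning signals at step granularity" concrete, here is a minimal Python sketch of what one trajectory record could look like. Every class and field name below (`Step`, `Trajectory`, `agent_paradigm`, `logprob`, and so on) is a hypothetical illustration; this page does not give the paper's actual protocol schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class Step:
    """One agent step. All field names are hypothetical, not from the paper."""
    index: int
    observation: str
    action: str
    reward: Optional[float] = None   # per-step learning signal, if any
    logprob: Optional[float] = None  # behavior-policy log-probability, if logged
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class Trajectory:
    """A sequence of steps tagged with the paradigm that produced it."""
    trajectory_id: str
    agent_paradigm: str              # e.g. "react" or "tool-calling" (assumed labels)
    steps: list[Step] = field(default_factory=list)

    def total_reward(self) -> float:
        # Steps without a reward contribute nothing to the sum.
        return sum(s.reward for s in self.steps if s.reward is not None)

    def to_json(self) -> str:
        # Serialize to a paradigm-neutral wire format.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "Trajectory":
        raw = json.loads(payload)
        steps = [Step(**s) for s in raw.pop("steps")]
        return cls(steps=steps, **raw)
```

Keeping the reward and log-probability on each step, rather than only on the whole episode, is the design choice that lets a downstream trainer assign credit per step. Which signals a real protocol would need is left open here.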
If this is right
- Trajectory statistics from real workloads can automatically trigger policy weight updates without human intervention.
- Heterogeneous agent paradigms can share a common data protocol that preserves step-granularity RL signals.
- Real enterprise workloads can be converted into governed learning substrates via a dedicated data proxy.
- A single control plane can decide both weight updates and in-context harness evolution based on the same trajectory data.
- Existing RL infrastructure can be reorganized into an agent-oriented online loop that learns directly from production traffic.
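The idea of a single control plane choosing between weight updates and harness evolution can be sketched as a small decision function. This is a toy under stated assumptions: the statistics tracked, the thresholds, and the rule "large regression triggers a weight update, small regression triggers harness evolution" are all invented for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    NO_CHANGE = "no_change"
    EVOLVE_HARNESS = "evolve_harness"   # e.g. revise prompts or tool configuration
    UPDATE_WEIGHTS = "update_weights"   # e.g. launch an online RL update


@dataclass(frozen=True)
class TrajectoryStats:
    """Aggregate statistics over recent trajectories (hypothetical fields)."""
    num_trajectories: int
    success_rate: float
    baseline_success_rate: float


def decide(
    stats: TrajectoryStats,
    min_samples: int = 500,      # assumed: too little data means no action
    small_drop: float = 0.02,    # assumed threshold for a harness change
    large_drop: float = 0.10,    # assumed threshold for a weight update
) -> Decision:
    """Map trajectory statistics to one evolution action."""
    if stats.num_trajectories < min_samples:
        return Decision.NO_CHANGE
    drop = stats.baseline_success_rate - stats.success_rate
    if drop >= large_drop:
        return Decision.UPDATE_WEIGHTS
    if drop >= small_drop:
        return Decision.EVOLVE_HARNESS
    return Decision.NO_CHANGE
```

For example, `decide(TrajectoryStats(1000, 0.60, 0.80))` selects a weight update under these assumed thresholds, while the same statistics over only 100 trajectories select no change.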
Where Pith is reading between the lines
- Standardizing the trajectory protocol could also simplify debugging and auditing of agent decisions across different vendors.
- An enterprise data proxy might reduce the need for separate offline data curation teams by turning every production interaction into potential training signal.
- The control plane logic could be extended to handle multi-agent coordination if trajectory data includes inter-agent interactions.
Load-bearing premise
That the three listed system gaps are the main and sufficient blockers for self-evolving agents at enterprise scale, rather than limitations in the underlying RL algorithms themselves.
What would settle it
A production deployment that implements the three pillars yet still requires manual human-curated data loops to improve agent performance across heterogeneous paradigms.
Original abstract
LLM agents are rapidly being deployed in production, including coding assistants, customer-support chatbots, and scientific research assistants, yet they remain fundamentally static in enterprise deployment. The LLM weights, system prompts, tool repertoires, and in-context harnesses are frozen at deployment time, and any improvement requires a manual loop of human-curated data collection, offline fine-tuning, modification of the agentic paradigm, and re-deployment. Recent work on self-evolving agents, such as OpenClaw for individual users, indicates that the next leap in agent capability will come from agents that continually learn from their own experience. In this paper, we argue that this vision for self-evolving agent deployment is being held back for enterprise-level large-scale agentic service not by reinforcement learning (RL) algorithms but by agentic online RL systems. Specifically, current agentic RL systems and the surrounding observability software stack are inadequate along three essential aspects: (i) there is no standardized agent trajectory data protocol capable of carrying RL learning signals at step granularity across heterogeneous agent paradigms; (ii) there is no enterprise-grade comprehensive data proxy that converts real workloads into governed learning substrates; and (iii) there is no unified agent evolution control plane that automatically decides, based on trajectory statistics, when to update policy weights or evolve the in-context harness. The next generation of agentic RL systems must be co-designed around these three pillars, and we sketch concrete architectures, case studies, and counter-arguments. We instantiate one branch through AReaL2.0, reorganizing existing RL infrastructure into an agent-oriented online RL loop for policy weight updates from deployed workloads.
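One way to picture a data proxy that turns raw workloads into "governed learning substrates" is a filter that admits a record into the training pool only after policy checks and redaction. The sketch below is an assumption about what governance might involve (a tenant allow-list, a consent flag, and email redaction); the abstract does not specify these mechanisms, and the function and field names are hypothetical.

```python
import re
from typing import Optional

# Simple email pattern for illustration only; real PII detection is much broader.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def govern(record: dict, allowed_tenants: set[str]) -> Optional[dict]:
    """Return a redacted copy of the record, or None if it may not be used."""
    # Assumed policy 1: only tenants on the allow-list contribute data.
    if record.get("tenant") not in allowed_tenants:
        return None
    # Assumed policy 2: the record must carry an explicit consent flag.
    if not record.get("consent", False):
        return None
    # Assumed policy 3: strip email addresses before the text is stored.
    return {
        "tenant": record["tenant"],
        "text": EMAIL.sub("[REDACTED_EMAIL]", record["text"]),
    }
```

Returning a new dictionary rather than mutating the input keeps the production record untouched, so the proxy can sit beside the serving path without altering it.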
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that enterprise-scale self-evolving LLM agents are limited not by RL algorithms but by three inadequacies in current agentic online RL systems and observability stacks: (i) no standardized trajectory data protocol for RL signals at step granularity across heterogeneous paradigms, (ii) no enterprise-grade data proxy converting workloads into governed learning substrates, and (iii) no unified evolution control plane for automatic policy/harness updates. It proposes co-design around these pillars, sketches architectures and case studies, and instantiates one via AReaL2.0 for online policy updates from deployed workloads.
Significance. If the three gaps are indeed the primary blockers, the work could usefully redirect attention in the agentic systems community from pure algorithmic RL advances toward infrastructure co-design, potentially informing standards for trajectory logging and control planes in production deployments. The forward-looking framing and explicit counter-argument discussion are strengths for a position piece.
major comments (2)
- Abstract: The central claim that the three listed aspects are the 'essential' inadequacies (rather than, e.g., RL sample efficiency, safety constraints, or compute scaling) is asserted without any supporting analysis, literature synthesis, or failure-mode examination of existing systems. This assertion is load-bearing for the entire proposal to co-design around them.
- Abstract and overall manuscript: No empirical data, derivations, error bounds, or even qualitative case-study outcomes are provided to show that resolving the three gaps would enable continual learning from deployed workloads or outperform current manual loops; the argument therefore remains an untested hypothesis rather than a substantiated position.
minor comments (1)
- The manuscript introduces AReaL2.0 and sketches 'concrete architectures', but the abstract provides no expansion of the acronym, no component breakdown, and no explanation of how it specifically addresses the three pillars.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on this position paper. We address the major comments point-by-point below, agreeing where the manuscript requires clarification or expansion, and have planned revisions accordingly.
Point-by-point responses
- Referee: Abstract: The central claim that the three listed aspects are the 'essential' inadequacies (rather than, e.g., RL sample efficiency, safety constraints, or compute scaling) is asserted without any supporting analysis, literature synthesis, or failure-mode examination of existing systems. This assertion is load-bearing for the entire proposal to co-design around them.
  Authors: We acknowledge that the claim would benefit from explicit grounding. As a position paper, the assertion draws from observed production limitations and related literature, but we will revise by expanding the introduction with a literature synthesis on agentic RL systems and adding a subsection analyzing failure modes of current observability stacks (e.g., loss of step-granularity signals in heterogeneous paradigms). This will support why the three pillars warrant co-design attention alongside algorithmic factors.
  Revision: yes
- Referee: Abstract and overall manuscript: No empirical data, derivations, error bounds, or even qualitative case-study outcomes are provided to show that resolving the three gaps would enable continual learning from deployed workloads or outperform current manual loops; the argument therefore remains an untested hypothesis rather than a substantiated position.
  Authors: We agree that the manuscript presents no new empirical data, derivations, or quantitative outcomes, as it is a forward-looking position piece sketching architectures and case studies rather than reporting experiments. We will revise the abstract and add a dedicated 'Limitations and Future Work' section that explicitly frames the claims as a hypothesis, describes the qualitative AReaL2.0 instantiation at a higher level, and outlines potential evaluation approaches for validating continual learning gains versus manual loops.
  Revision: yes
Circularity Check
No significant circularity
Full rationale
The paper is a position document that argues three specific system-level gaps (trajectory data protocol, data proxy, evolution control plane) are the primary blockers for enterprise self-evolving agents, with RL algorithms not being the limit. It sketches co-designed architectures and instantiates one via AReaL2.0. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. The argument consists of stated assessments of current systems and forward-looking proposals without any reduction of claims to self-referential inputs, self-citation chains, or renamings by construction. The central claims remain independent of any internal circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: RL algorithms are not the limiting factor; the bottleneck lies in the infrastructure of agentic online RL systems.
invented entities (1)
- AReaL2.0: no independent evidence