arxiv: 2607.01120 · v2 · pith:SB575SE2new · submitted 2026-07-01 · 💻 cs.DC

Next-Generation Agentic Reinforcement Learning Systems Enable Self-Evolving Agents

Pith reviewed 2026-07-03 18:45 UTC · model grok-4.3

classification 💻 cs.DC
keywords self-evolving agents · agentic reinforcement learning · online RL systems · LLM agents · trajectory data protocol · enterprise deployment · continual learning · agent evolution control plane

The pith

Self-evolving LLM agents at enterprise scale are blocked by missing agentic RL systems rather than by reinforcement learning algorithms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that production LLM agents remain static because any improvement still requires a manual human loop of data collection, fine-tuning, and redeployment. While individual-user self-evolving agents show promise, the authors claim the barrier for large-scale enterprise use lies in three concrete gaps in current agentic online RL systems and their observability stack. These gaps are the absence of a standardized trajectory data protocol that carries step-level RL signals across different agent designs, the lack of an enterprise-grade data proxy that turns real workloads into governed learning data, and the absence of a unified control plane that uses trajectory statistics to decide when to update weights or evolve the agent harness. The paper states that co-designing the next generation of agentic RL systems around these three pillars will allow agents to learn continually from deployed workloads, and it sketches one such architecture in AReaL2.0.

Core claim

The central claim is that next-generation agentic RL systems must be co-designed around three components: a standardized agent trajectory data protocol capable of carrying RL learning signals at step granularity, an enterprise-grade comprehensive data proxy that converts real workloads into governed learning substrates, and a unified agent evolution control plane that automatically decides on policy updates or harness evolution based on trajectory statistics. Only then, the paper argues, can self-evolving agents move from individual prototypes to large-scale enterprise service. AReaL2.0 partially instantiates this design by reorganizing existing RL infrastructure into an agent-oriented online RL loop.
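
To make the first pillar concrete, here is a minimal sketch of what a trajectory record carrying step-level RL signals might look like. It is not the paper's protocol: the abstract specifies no schema, so every class and field name below (`Step`, `Trajectory`, `reward`, `policy_version`, `agent_paradigm`) is a hypothetical choice made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One agent step. All field names are hypothetical, not from the paper."""
    index: int
    observation: str                 # what the agent saw (prompt, tool result)
    action: str                      # what the agent emitted (text, tool call)
    reward: Optional[float] = None   # step-level RL signal; None if unlabeled
    policy_version: str = "unknown"  # which weights produced the action


@dataclass
class Trajectory:
    trajectory_id: str
    agent_paradigm: str              # free-form tag, e.g. "react"
    steps: List[Step] = field(default_factory=list)

    def total_reward(self) -> float:
        # Sum only the steps that actually carry a reward label.
        return sum((s.reward for s in self.steps if s.reward is not None), 0.0)

    def to_json(self) -> str:
        # A plain JSON encoding keeps the record independent of the agent framework.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "Trajectory":
        data = json.loads(payload)
        steps = [Step(**s) for s in data["steps"]]
        return cls(data["trajectory_id"], data["agent_paradigm"], steps)
```

Keeping the reward on each step rather than only on the whole trajectory is what "step granularity" would mean in practice, and the round trip through JSON stands in for the exchange between heterogeneous agent designs.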

What carries the argument

The argument rests on three inadequacies in current agentic RL systems, which the paper identifies as the primary blockers to continual learning from deployed workloads: the lack of a standardized trajectory data protocol, the lack of an enterprise-grade data proxy, and the lack of a unified evolution control plane.

If this is right

  • Trajectory statistics from real workloads can automatically trigger policy weight updates without human intervention.
  • Heterogeneous agent paradigms can share a common data protocol that preserves step-granularity RL signals.
  • Real enterprise workloads can be converted into governed learning substrates via a dedicated data proxy.
  • A single control plane can decide both weight updates and in-context harness evolution based on the same trajectory data.
  • Existing RL infrastructure can be reorganized into an agent-oriented online loop that learns directly from production traffic.
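
The control-plane bullets above can be illustrated with a small decision function. This is a sketch under stated assumptions, not the paper's mechanism: the abstract says only that the decision is based on trajectory statistics, so the statistic used here (success rate), the thresholds, and the names `Decision`, `Thresholds`, and `decide` are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Decision(Enum):
    NO_OP = "no_op"
    EVOLVE_HARNESS = "evolve_harness"   # in-context change: prompts, tools, memory
    UPDATE_WEIGHTS = "update_weights"   # trigger an online RL policy update


@dataclass(frozen=True)
class Thresholds:
    # Illustrative values only; a real deployment would have to tune these.
    min_trajectories: int = 100
    target_success_rate: float = 0.9
    weight_update_gap: float = 0.2


def decide(successes: Sequence[bool], th: Thresholds = Thresholds()) -> Decision:
    """Map per-trajectory success flags to one control-plane action."""
    if len(successes) < th.min_trajectories:
        return Decision.NO_OP           # too little evidence to act on
    rate = sum(successes) / len(successes)
    if rate >= th.target_success_rate:
        return Decision.NO_OP           # agent is meeting its target
    if th.target_success_rate - rate >= th.weight_update_gap:
        return Decision.UPDATE_WEIGHTS  # large shortfall: assume a policy change is needed
    return Decision.EVOLVE_HARNESS      # small shortfall: try the cheaper change first
```

The rule that a small shortfall goes to the harness and a large one to the weights is an assumption of this sketch. The point is only that both actions can be driven from the same trajectory statistics.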

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Standardizing the trajectory protocol could also simplify debugging and auditing of agent decisions across different vendors.
  • An enterprise data proxy might reduce the need for separate offline data curation teams by turning every production interaction into potential training signal.
  • The control plane logic could be extended to handle multi-agent coordination if trajectory data includes inter-agent interactions.
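
As a rough illustration of the data-proxy idea, the sketch below turns raw interaction records into "governed" training records by applying two toy policies: drop records without consent, and redact email addresses. The record layout (`text`, `training_consent`), the function name `govern`, and the redaction pattern are assumptions for this example. The paper's abstract does not define what governance involves, and a production proxy would need far more than a single regular expression.

```python
import re
from typing import Dict, Iterable, Iterator

# Deliberately simple pattern; it is not a complete email matcher.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def govern(records: Iterable[Dict]) -> Iterator[Dict]:
    """Yield only consented records, with email addresses redacted."""
    for rec in records:
        if not rec.get("training_consent", False):
            continue  # never forward data the user did not agree to share
        redacted = EMAIL.sub("[REDACTED_EMAIL]", rec.get("text", ""))
        yield {**rec, "text": redacted}  # copy, so the source record is untouched


raw = [
    {"text": "mail bob@example.com now", "training_consent": True},
    {"text": "internal note", "training_consent": False},
]
governed = list(govern(raw))
```

Here `governed` holds one record whose text reads `mail [REDACTED_EMAIL] now`, and the unconsented record is dropped before it can reach any training pipeline.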

Load-bearing premise

That the three listed system gaps are the main and sufficient blockers for self-evolving agents at enterprise scale, rather than limitations in the underlying RL algorithms themselves.

What would settle it

A production deployment that implements the three pillars yet still requires manual human-curated data loops to improve agent performance across heterogeneous paradigms.

Original abstract

LLM agents are rapidly being deployed in production, including coding assistants, customer-support chatbots, and scientific research assistants, yet they remain fundamentally static in enterprise deployment. The LLM weights, system prompts, tool repertoires, and in-context harnesses are frozen at deployment time, and any improvement requires a manual loop of human-curated data collection, offline fine-tuning, modification of the agentic paradigm, and re-deployment. Recent work on self-evolving agents, such as OpenClaw for individual users, indicates that the next leap in agent capability will come from agents that continually learn from their own experience. In this paper, we argue that this vision for self-evolving agent deployment is being held back for enterprise-level large-scale agentic service not by reinforcement learning (RL) algorithms but by agentic online RL systems. Specifically, current agentic RL systems and the surrounding observability software stack are inadequate along three essential aspects: (i) there is no standardized agent trajectory data protocol capable of carrying RL learning signals at step granularity across heterogeneous agent paradigms; (ii) there is no enterprise-grade comprehensive data proxy that converts real workloads into governed learning substrates; and (iii) there is no unified agent evolution control plane that automatically decides, based on trajectory statistics, when to update policy weights or evolve the in-context harness. The next generation of agentic RL systems must be co-designed around these three pillars, and we sketch concrete architectures, case studies, and counter-arguments. We instantiate one branch through AReaL2.0, reorganizing existing RL infrastructure into an agent-oriented online RL loop for policy weight updates from deployed workloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that enterprise-scale self-evolving LLM agents are limited not by RL algorithms but by three inadequacies in current agentic online RL systems and observability stacks: (i) no standardized trajectory data protocol for RL signals at step granularity across heterogeneous paradigms, (ii) no enterprise-grade data proxy converting workloads into governed learning substrates, and (iii) no unified evolution control plane for automatic policy/harness updates. It proposes co-design around these pillars, sketches architectures and case studies, and instantiates one via AReaL2.0 for online policy updates from deployed workloads.

Significance. If the three gaps are indeed the primary blockers, the work could usefully redirect attention in the agentic systems community from pure algorithmic RL advances toward infrastructure co-design, potentially informing standards for trajectory logging and control planes in production deployments. The forward-looking framing and explicit counter-argument discussion are strengths for a position piece.

major comments (2)
  1. Abstract: The central claim that the three listed aspects are the 'essential' inadequacies (rather than, e.g., RL sample efficiency, safety constraints, or compute scaling) is asserted without any supporting analysis, literature synthesis, or failure-mode examination of existing systems. This assertion is load-bearing for the entire proposal to co-design around them.
  2. Abstract and overall manuscript: No empirical data, derivations, error bounds, or even qualitative case-study outcomes are provided to show that resolving the three gaps would enable continual learning from deployed workloads or outperform current manual loops; the argument therefore remains an untested hypothesis rather than a substantiated position.
minor comments (1)
  1. The manuscript introduces AReaL2.0 and sketches 'concrete architectures' but the abstract provides no expansion of the acronym, component breakdown, or how it specifically addresses the three pillars.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on this position paper. We address the major comments point-by-point below, agreeing where the manuscript requires clarification or expansion, and have planned revisions accordingly.

  1. Referee: Abstract: The central claim that the three listed aspects are the 'essential' inadequacies (rather than, e.g., RL sample efficiency, safety constraints, or compute scaling) is asserted without any supporting analysis, literature synthesis, or failure-mode examination of existing systems. This assertion is load-bearing for the entire proposal to co-design around them.

    Authors: We acknowledge that the claim would benefit from explicit grounding. As a position paper, the assertion draws from observed production limitations and related literature, but we will revise by expanding the introduction with a literature synthesis on agentic RL systems and adding a subsection analyzing failure modes of current observability stacks (e.g., loss of step-granularity signals in heterogeneous paradigms). This will support why the three pillars warrant co-design attention alongside algorithmic factors.
    Revision: yes

  2. Referee: Abstract and overall manuscript: No empirical data, derivations, error bounds, or even qualitative case-study outcomes are provided to show that resolving the three gaps would enable continual learning from deployed workloads or outperform current manual loops; the argument therefore remains an untested hypothesis rather than a substantiated position.

    Authors: We agree that the manuscript presents no new empirical data, derivations, or quantitative outcomes, as it is a forward-looking position piece sketching architectures and case studies rather than reporting experiments. We will revise the abstract and add a dedicated 'Limitations and Future Work' section that explicitly frames the claims as a hypothesis, describes the qualitative AReaL2.0 instantiation at a higher level, and outlines potential evaluation approaches for validating continual learning gains versus manual loops.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity


The paper is a position document that argues three specific system-level gaps (trajectory data protocol, data proxy, evolution control plane) are the primary blockers for enterprise self-evolving agents, with RL algorithms not being the limit. It sketches co-designed architectures and instantiates one via AReaL2.0. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. The argument consists of stated assessments of current systems and forward-looking proposals without any reduction of claims to self-referential inputs, self-citation chains, or renamings by construction. The central claims remain independent of any internal circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that the three listed system deficiencies are the primary obstacles to self-evolving agents at enterprise scale; the paper provides no independent evidence or benchmarks for it.

axioms (1)
  • domain assumption: RL algorithms are not the limiting factor; the bottleneck lies in agentic online RL systems infrastructure.
    Explicitly stated in the abstract as the vision being held back not by RL algorithms but by the systems.
invented entities (1)
  • AReaL2.0 (no independent evidence)
    purpose: An example instantiation reorganizing existing RL infrastructure into an agent-oriented online RL loop.
    Mentioned as one branch of the proposed co-designed systems.

pith-pipeline@v0.9.1-grok · 5915 in / 1421 out tokens · 34126 ms · 2026-07-03T18:45:47.711336+00:00 · methodology

