Next-Generation Agentic Reinforcement Learning Systems Enable Self-Evolving Agents

arxiv: 2607.01120 · v2 · pith:SB575SE2new · submitted 2026-07-01 · 💻 cs.DC

Ran Yan, Wei Fu, Jiale Li, Shusheng Xu, Zhiyu Mei, Jiaxuan Gao, Jiarui Zhang, Wentai Zhang, Hao Dai, Xujie Shen, Chuyi He, Zhen Pu, Jun Mei, Zhiyao Lin, Haitao Wang, Zhiqiang Ding, Jiawei Zhang, Huaijie Wang, Ruida Xu, Honghua Dong, Youhe Jiang, Yi Wu, Tongkai Yang, Binhang Yuan

Pith reviewed 2026-07-03 18:45 UTC · model grok-4.3

classification 💻 cs.DC

keywords: self-evolving agents · agentic reinforcement learning · online RL systems · LLM agents · trajectory data protocol · enterprise deployment · continual learning · agent evolution control plane

The pith

Self-evolving LLM agents at enterprise scale are blocked by missing agentic RL systems rather than by reinforcement learning algorithms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that production LLM agents remain static because any improvement still requires a manual human loop of data collection, fine-tuning, and redeployment. While individual-user self-evolving agents show promise, the authors claim the barrier for large-scale enterprise use lies in three concrete gaps in current agentic online RL systems and their observability stack. These gaps are the absence of a standardized trajectory data protocol that carries step-level RL signals across different agent designs, the lack of an enterprise-grade data proxy that turns real workloads into governed learning data, and the absence of a unified control plane that uses trajectory statistics to decide when to update weights or evolve the agent harness. The paper states that co-designing the next generation of agentic RL systems around these three pillars will allow agents to learn continually from deployed workloads, and it sketches one such architecture in AReaL2.0.

Core claim

The central claim is that next-generation agentic RL systems must be co-designed around three components: a standardized agent trajectory data protocol that carries RL learning signals at step granularity, an enterprise-grade comprehensive data proxy that converts real workloads into governed learning substrates, and a unified agent evolution control plane that automatically decides, based on trajectory statistics, when to update policy weights or evolve the harness. Only then, the paper argues, can self-evolving agents move from individual prototypes to large-scale enterprise service. AReaL2.0 partially instantiates this design by reorganizing existing RL infrastructure into an agent-oriented online RL loop.
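To make "RL learning signals at step granularity" concrete, here is a minimal Python sketch of what a trajectory record could look like. It is an illustration only: the paper's actual protocol is not shown on this page, and every class and field name below (`Step`, `Trajectory`, `agent_paradigm`, `policy_version`) is an assumption made for the example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One agent step. Field names are hypothetical, not taken from the paper."""
    index: int
    observation: str                 # what the agent saw (prompt, tool result, ...)
    action: str                      # what the agent emitted (text, tool call, ...)
    reward: Optional[float] = None   # step-level RL signal, if one is available
    done: bool = False


@dataclass
class Trajectory:
    """An ordered list of steps plus the metadata needed to attribute them."""
    trajectory_id: str
    agent_paradigm: str              # free-form label such as "react"
    policy_version: str              # which weights produced the actions
    steps: List[Step] = field(default_factory=list)

    def total_reward(self) -> float:
        # Steps without a reward contribute nothing to the sum.
        return sum(s.reward for s in self.steps if s.reward is not None)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "Trajectory":
        raw = json.loads(payload)
        steps = [Step(**s) for s in raw.pop("steps")]
        return cls(steps=steps, **raw)


traj = Trajectory("t-001", "react", "policy-v3", [
    Step(0, "user: find the bug", "tool: grep", reward=0.0),
    Step(1, "grep output", "answer: line 42", reward=1.0, done=True),
])
restored = Trajectory.from_json(traj.to_json())
```

The round trip through JSON stands in for the "across heterogeneous agent paradigms" requirement: any agent framework that can emit this shape could feed the same trainer, whatever its internal design.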

What carries the argument

The argument rests on three inadequacies the paper identifies in current agentic RL systems: the lack of a standardized trajectory data protocol, of an enterprise-grade data proxy, and of a unified evolution control plane. The paper treats these as the primary blockers preventing continual learning from deployed workloads.

If this is right

Trajectory statistics from real workloads can automatically trigger policy weight updates without human intervention.
Heterogeneous agent paradigms can share a common data protocol that preserves step-granularity RL signals.
Real enterprise workloads can be converted into governed learning substrates via a dedicated data proxy.
A single control plane can decide both weight updates and in-context harness evolution based on the same trajectory data.
Existing RL infrastructure can be reorganized into an agent-oriented online loop that learns directly from production traffic.
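The idea of a single control plane choosing between a weight update and harness evolution can be sketched as a small decision rule over a window of trajectory statistics. This is a toy illustration under stated assumptions: the statistics tracked, the thresholds, and the rule itself are hypothetical and are not the paper's design.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Decision(Enum):
    NO_CHANGE = "no_change"
    EVOLVE_HARNESS = "evolve_harness"
    UPDATE_WEIGHTS = "update_weights"


@dataclass
class TrajectoryStats:
    success: bool      # did the episode reach its goal?
    tool_error: bool   # did any step fail inside a tool call?


def decide(window: List[TrajectoryStats],
           min_samples: int = 50,
           target_success: float = 0.9,
           tool_error_ceiling: float = 0.2) -> Decision:
    """Toy rule; all thresholds are hypothetical placeholders."""
    if len(window) < min_samples:
        return Decision.NO_CHANGE            # too little evidence to act on
    success_rate = sum(t.success for t in window) / len(window)
    if success_rate >= target_success:
        return Decision.NO_CHANGE            # agent is already meeting the target
    tool_error_rate = sum(t.tool_error for t in window) / len(window)
    if tool_error_rate > tool_error_ceiling:
        # Failures cluster around tools: assume the harness is the cheaper fix.
        return Decision.EVOLVE_HARNESS
    return Decision.UPDATE_WEIGHTS


healthy = [TrajectoryStats(True, False)] * 95 + [TrajectoryStats(False, False)] * 5
flaky = [TrajectoryStats(False, True)] * 30 + [TrajectoryStats(True, False)] * 70
print(decide(healthy), decide(flaky))
```

The sketch treats tool-related failures as a harness problem and everything else as a policy problem. A real control plane would need a far richer signal than two booleans per trajectory.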

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Standardizing the trajectory protocol could also simplify debugging and auditing of agent decisions across different vendors.
An enterprise data proxy might reduce the need for separate offline data curation teams by turning every production interaction into potential training signal.
The control plane logic could be extended to handle multi-agent coordination if trajectory data includes inter-agent interactions.

Load-bearing premise

That the three listed system gaps are the main and sufficient blockers for self-evolving agents at enterprise scale, rather than limitations in the underlying RL algorithms themselves.

What would settle it

A production deployment that implements the three pillars yet still requires manual human-curated data loops to improve agent performance across heterogeneous paradigms.

Original abstract

LLM agents are rapidly being deployed in production, including coding assistants, customer-support chatbots, and scientific research assistants, yet they remain fundamentally static in enterprise deployment. The LLM weights, system prompts, tool repertoires, and in-context harnesses are frozen at deployment time, and any improvement requires a manual loop of human-curated data collection, offline fine-tuning, modification of the agentic paradigm, and re-deployment. Recent work on self-evolving agents, such as OpenClaw for individual users, indicates that the next leap in agent capability will come from agents that continually learn from their own experience. In this paper, we argue that this vision for self-evolving agent deployment is being held back for enterprise-level large-scale agentic service not by reinforcement learning (RL) algorithms but by agentic online RL systems. Specifically, current agentic RL systems and the surrounding observability software stack are inadequate along three essential aspects: (i) there is no standardized agent trajectory data protocol capable of carrying RL learning signals at step granularity across heterogeneous agent paradigms; (ii) there is no enterprise-grade comprehensive data proxy that converts real workloads into governed learning substrates; and (iii) there is no unified agent evolution control plane that automatically decides, based on trajectory statistics, when to update policy weights or evolve the in-context harness. The next generation of agentic RL systems must be co-designed around these three pillars, and we sketch concrete architectures, case studies, and counter-arguments. We instantiate one branch through AReaL2.0, reorganizing existing RL infrastructure into an agent-oriented online RL loop for policy weight updates from deployed workloads.
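As a rough illustration of what "converting real workloads into governed learning substrates" might involve, the Python sketch below filters production interactions by consent and tenant allow-list and redacts e-mail addresses before anything reaches a training store. The governance rules, the regular expression, and all names (`Interaction`, `LearningRecord`, `govern`) are assumptions made for this example, not components described in the abstract.

```python
import re
from dataclasses import dataclass
from typing import Optional, Set

# Deliberately simple e-mail pattern; real PII detection would need much more.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass
class Interaction:
    tenant: str
    consented: bool   # has the customer opted in to learning from this traffic?
    prompt: str
    response: str


@dataclass
class LearningRecord:
    tenant: str
    prompt: str
    response: str


def scrub(text: str) -> str:
    """Replace anything that looks like an e-mail address with a placeholder."""
    return EMAIL.sub("[EMAIL]", text)


def govern(event: Interaction, allowed_tenants: Set[str]) -> Optional[LearningRecord]:
    """Return a redacted record, or None if policy forbids learning from it."""
    if not event.consented or event.tenant not in allowed_tenants:
        return None
    return LearningRecord(event.tenant, scrub(event.prompt), scrub(event.response))


record = govern(
    Interaction("acme", True, "mail bob@example.com please", "sent"),
    allowed_tenants={"acme"},
)
print(record)
```

The point of the sketch is placement: the policy check runs between serving and training, so data that fails governance never enters the learning loop.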

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

This is a position paper arguing that three specific systems gaps, rather than RL algorithms, block self-evolving enterprise agents, but it offers no evidence or tests for that claim.

The main takeaway is that the authors see the path to continual agent improvement at scale as blocked by missing infrastructure rather than by the underlying learning methods. They name three gaps—no standard trajectory data protocol, no enterprise data proxy for real workloads, and no unified evolution control plane—and call for co-design around them, with a sketch of architectures and one branch called AReaL2.0.

What the paper does is organize known production pain points into a short list and tie them to recent work like OpenClaw. That framing can help systems people focus on what to build next for online RL loops from deployed agents.

The limitation is that the argument stays at the level of assertion. There are no measurements from actual deployments, no comparisons showing these three items are the primary blockers, and no demonstration that the proposed fixes would produce the claimed self-evolution. The paper is explicit that it is forward-looking, so the absence of data is not hidden, but it does leave the central claim untested.

This is for researchers already working on agent deployment infrastructure who want a structured way to think about the next layer of tooling. Readers who need empirical results or new algorithms will find little to use directly.

It deserves a serious referee. The topic is timely for production agent systems, and feedback on the proposed pillars could shape follow-on work even if the current version stays speculative.

Referee Report

2 major / 1 minor

Summary. The paper claims that enterprise-scale self-evolving LLM agents are limited not by RL algorithms but by three inadequacies in current agentic online RL systems and observability stacks: (i) no standardized trajectory data protocol for RL signals at step granularity across heterogeneous paradigms, (ii) no enterprise-grade data proxy converting workloads into governed learning substrates, and (iii) no unified evolution control plane for automatic policy/harness updates. It proposes co-design around these pillars, sketches architectures and case studies, and instantiates one via AReaL2.0 for online policy updates from deployed workloads.

Significance. If the three gaps are indeed the primary blockers, the work could usefully redirect attention in the agentic systems community from pure algorithmic RL advances toward infrastructure co-design, potentially informing standards for trajectory logging and control planes in production deployments. The forward-looking framing and explicit counter-argument discussion are strengths for a position piece.

major comments (2)

Abstract: The central claim that the three listed aspects are the 'essential' inadequacies (rather than, e.g., RL sample efficiency, safety constraints, or compute scaling) is asserted without any supporting analysis, literature synthesis, or failure-mode examination of existing systems. This assertion is load-bearing for the entire proposal to co-design around them.
Abstract and overall manuscript: No empirical data, derivations, error bounds, or even qualitative case-study outcomes are provided to show that resolving the three gaps would enable continual learning from deployed workloads or outperform current manual loops; the argument therefore remains an untested hypothesis rather than a substantiated position.

minor comments (1)

The manuscript introduces AReaL2.0 and sketches 'concrete architectures' but the abstract provides no expansion of the acronym, component breakdown, or how it specifically addresses the three pillars.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on this position paper. We address the major comments point-by-point below, agreeing where the manuscript requires clarification or expansion, and have planned revisions accordingly.

Referee: Abstract: The central claim that the three listed aspects are the 'essential' inadequacies (rather than, e.g., RL sample efficiency, safety constraints, or compute scaling) is asserted without any supporting analysis, literature synthesis, or failure-mode examination of existing systems. This assertion is load-bearing for the entire proposal to co-design around them.

Authors: We acknowledge that the claim would benefit from explicit grounding. As a position paper, the assertion draws from observed production limitations and related literature, but we will revise by expanding the introduction with a literature synthesis on agentic RL systems and adding a subsection analyzing failure modes of current observability stacks (e.g., loss of step-granularity signals in heterogeneous paradigms). This will support why the three pillars warrant co-design attention alongside algorithmic factors.

revision: yes

Referee: Abstract and overall manuscript: No empirical data, derivations, error bounds, or even qualitative case-study outcomes are provided to show that resolving the three gaps would enable continual learning from deployed workloads or outperform current manual loops; the argument therefore remains an untested hypothesis rather than a substantiated position.

Authors: We agree that the manuscript presents no new empirical data, derivations, or quantitative outcomes, as it is a forward-looking position piece sketching architectures and case studies rather than reporting experiments. We will revise the abstract and add a dedicated 'Limitations and Future Work' section that explicitly frames the claims as a hypothesis, describes the qualitative AReaL2.0 instantiation at a higher level, and outlines potential evaluation approaches for validating continual learning gains versus manual loops.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is a position document that argues three specific system-level gaps (trajectory data protocol, data proxy, evolution control plane) are the primary blockers for enterprise self-evolving agents, with RL algorithms not being the limit. It sketches co-designed architectures and instantiates one via AReaL2.0. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. The argument consists of stated assessments of current systems and forward-looking proposals without any reduction of claims to self-referential inputs, self-citation chains, or renamings by construction. The central claims remain independent of any internal circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that the three listed systems deficiencies are the primary obstacles to self-evolving agents at enterprise scale, with no independent evidence or benchmarks provided.

axioms (1)

domain assumption: RL algorithms are not the limiting factor; the bottleneck lies in agentic online RL systems infrastructure.
Explicitly stated in the abstract as the vision being held back not by RL algorithms but by the systems.

invented entities (1)

AReaL2.0 · no independent evidence
purpose: An example instantiation reorganizing existing RL infrastructure into an agent-oriented online RL loop.
Mentioned as one branch of the proposed co-designed systems.

pith-pipeline@v0.9.1-grok · 5915 in / 1421 out tokens · 34126 ms · 2026-07-03T18:45:47.711336+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 20 canonical work pages · 10 internal anchors

[1] OpenClaw. Openclaw: The ai that actually does things, 2026.
[2] Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. Openclaw-rl: Train any agent simply by talking. arXiv preprint arXiv:2603.10165, 2026.
[3] Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, et al. Metaclaw: Just talk–an agent that meta-learns and evolves in the wild. arXiv preprint arXiv:2603.17187, 2026.
[4] Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver. arXiv preprint arXiv:2604.08377, 2026.
[5] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634–8652, 2023.
[6] Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, et al. Memento-skills: Let agents design agents. arXiv preprint arXiv:2603.18743, 2026.
[7] Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models. 2026.
[8] Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, WANG JIASHU, et al. Areal: A large-scale asynchronous reinforcement learning system for language reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[9] Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827, 2025.
[10] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
[11] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
[12] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[13] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023.
[14] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[15] Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, et al. The landscape of agentic reinforcement learning for llms: A survey. arXiv preprint arXiv:2509.02547, 2025.
[16] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
[17] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496, 2024.
[18] Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu. Unlocking long-horizon agentic search with large-scale end-to-end rl. In The Fourteenth International Conference on Learning Representations, 2026.
[19] Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, and Yi Wu. Real: Efficient rlhf training of large language models with parameter reallocation. arXiv preprint arXiv:2406.14088, 2024.
[20] Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, et al. Optimizing RLHF training for large language models with stage fusion. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), pages 489–503, 2025.
[21] Junyu Wu, Weiming Chang, Xiaotao Liu, Guanyou He, Haoqiang Hong, Boqi Liu, Hongtao Tian, Tao Yang, Yunsheng Shi, Feng Lin, et al. G-core: A simple, scalable and balanced rlhf trainer. arXiv preprint arXiv:2507.22789, 2025.
[22] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025.
[23] Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, et al. Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation. arXiv preprint arXiv:2504.15930, 2025.
[24] Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, et al. Asyncflow: An asynchronous streaming rl framework for efficient llm post-training. arXiv preprint arXiv:2507.01663, 2025.
[25] Anthropic. Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol, 2024.
[26] Google. Agent2agent (a2a) protocol. https://a2a-protocol.org/, 2025.
[27] Yingxuan Yang, Huacan Chai, Yuanyi Song, Siyuan Qi, Muning Wen, Ning Li, Junwei Liao, Haoyi Hu, Jianghao Lin, Gaowei Chang, et al. A survey of ai agent protocols. arXiv preprint arXiv:2504.16736, 2025.
[28] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
[29] Sabela Ramos, Sertan Girgin, Léonard Hussenot, Damien Vincent, Hanna Yakubovich, Daniel Toyama, Anita Gergely, Piotr Stanczyk, Raphael Marinier, Jeremiah Harmsen, et al. Rlds: an ecosystem to generate, share and use datasets in reinforcement learning. arXiv preprint arXiv:2111.02767, 2021.
[30] Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, et al. Agent data protocol: Unifying datasets for diverse, effective fine-tuning of llm agents. arXiv preprint arXiv:2510.24702, 2025.
[31] LangChain, Inc. LangChain: The agent engineering platform. https://github.com/langchain-ai/langchain, 2025.
[32] LangChain, Inc. LangGraph: Build resilient language agents as graphs. https://github.com/langchain-ai/langgraph, 2025.
[33] crewAI, Inc. CrewAI: Framework for orchestrating role-playing, autonomous AI agents. https://github.com/crewAIInc/crewAI, 2025.
[34] OpenAI. OpenAI Agents SDK: A lightweight, powerful framework for multi-agent workflows. https://github.com/openai/openai-agents-python, 2025.
[35] Anthropic. Claude Agent SDK. https://github.com/anthropics/claude-agent-sdk-python, 2025.
[36] Zhiheng Xi, Chenyang Liao, Guanyu Li, Zhihao Zhang, Wenxiang Chen, Binghai Wang, Senjie Jin, Yuhao Zhou, Jian Guan, Wei Wu, et al. Agentprm: Process reward models for llm agents via step-wise promise and progress. In Proceedings of the ACM Web Conference 2026, pages 4184–4195, 2026.
[37] Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, and Ling Yang. Rlanything: Forge environment, policy, and reward model in completely dynamic rl system. arXiv preprint arXiv:2602.02488, 2026.
[38] Nous Research. Hermes agent: The self-improving ai agent built by nous research. https://github.com/NousResearch/hermes-agent, 2026. Accessed: 2026-06-30.

[1] [1]

Openclaw: The ai that actually does things, 2026

OpenClaw. Openclaw: The ai that actually does things, 2026

2026

[2] [2]

OpenClaw-RL: Train Any Agent Simply by Talking

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. Openclaw-rl: Train any agent simply by talking. arXiv preprint arXiv:2603.10165, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Metaclaw: Just talk–an agent that meta-learns and evolves in the wild

Peng Xia, Jianwen Chen, Xinyu Yang, Haoqin Tu, Jiaqi Liu, Kaiwen Xiong, Siwei Han, Shi Qiu, Haonian Ji, Yuyin Zhou, et al. Metaclaw: Just talk–an agent that meta-learns and evolves in the wild. arXiv preprint arXiv:2603.17187, 2026

work page arXiv 2026

[4] [4]

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver. arXiv preprint arXiv:2604.08377, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634–8652, 2023

2023

[6] [6]

Memento-skills: Let agents design agents

Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, et al. Memento-skills: Let agents design agents. arXiv preprint arXiv:2603.18743, 2026

work page arXiv 2026

[7] [7]

Agentic context engineering: Evolving contexts for self-improving language models

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models. 2026

2026

[8] [8]

Areal: A large-scale asynchronous reinforcement learning system for language reasoning

Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, WANG JIASHU, et al. Areal: A large-scale asynchronous reinforcement learning system for language reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[9] [9]

A Survey of Reinforcement Learning for Large Reasoning Models

Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

2022

[11] [11]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[12] [12]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[13] [13]

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741, 2023

2023

[14] [14]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xi- angyuan Xue, Yijiang Li, et al. The landscape of agentic reinforcement learning for llms: A survey. arXiv preprint arXiv:2509.02547, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[17] [17]

TextGrad: Automatic "Differentiation" via Text

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic" differentiation" via text. arXiv preprint arXiv:2406.07496, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

Unlocking long-horizon agentic search with large-scale end-to-end rl

Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu. Unlocking long-horizon agentic search with large-scale end-to-end rl. In The Fourteenth International Conference on Learning Representations, 2026


[19]

Real: Efficient rlhf training of large language models with parameter reallocation

Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, and Yi Wu. Real: Efficient rlhf training of large language models with parameter reallocation. arXiv preprint arXiv:2406.14088, 2024


[20]

Optimizing RLHF training for large language models with stage fusion

Yinmin Zhong, Zili Zhang, Bingyang Wu, Shengyu Liu, Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, Yibo Zhu, et al. Optimizing RLHF training for large language models with stage fusion. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), pages 489–503, 2025


[21]

G-core: A simple, scalable and balanced rlhf trainer

Junyu Wu, Weiming Chang, Xiaotao Liu, Guanyou He, Haoqiang Hong, Boqi Liu, Hongtao Tian, Tao Yang, Yunsheng Shi, Feng Lin, et al. G-core: A simple, scalable and balanced rlhf trainer. arXiv preprint arXiv:2507.22789, 2025


[22]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025


[23]

Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation

Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, Bingyang Wu, Nuo Chen, Yukun Chen, Yu Zhou, Changyi Wan, et al. Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation. arXiv preprint arXiv:2504.15930, 2025


[24]

Asyncflow: An asynchronous streaming rl framework for efficient llm post-training

Zhenyu Han, Ansheng You, Haibo Wang, Kui Luo, Guang Yang, Wenqi Shi, Menglong Chen, Sicheng Zhang, Zeshun Lan, Chunshi Deng, et al. Asyncflow: An asynchronous streaming rl framework for efficient llm post-training. arXiv preprint arXiv:2507.01663, 2025


[25]

Introducing the Model Context Protocol

Anthropic. Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol, 2024


[26]

Agent2agent (a2a) protocol

Google. Agent2agent (a2a) protocol. https://a2a-protocol.org/, 2025


[27]

A survey of ai agent protocols

Yingxuan Yang, Huacan Chai, Yuanyi Song, Siyuan Qi, Muning Wen, Ning Li, Junwei Liao, Haoyi Hu, Jianghao Lin, Gaowei Chang, et al. A survey of ai agent protocols. arXiv preprint arXiv:2504.16736, 2025


[28]

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020


[29]

Rlds: an ecosystem to generate, share and use datasets in reinforcement learning

Sabela Ramos, Sertan Girgin, Léonard Hussenot, Damien Vincent, Hanna Yakubovich, Daniel Toyama, Anita Gergely, Piotr Stanczyk, Raphael Marinier, Jeremiah Harmsen, et al. Rlds: an ecosystem to generate, share and use datasets in reinforcement learning. arXiv preprint arXiv:2111.02767, 2021


[30]

Agent data protocol: Unifying datasets for diverse, effective fine-tuning of llm agents

Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, et al. Agent data protocol: Unifying datasets for diverse, effective fine-tuning of llm agents. arXiv preprint arXiv:2510.24702, 2025


[31]

LangChain: The agent engineering platform

LangChain, Inc. LangChain: The agent engineering platform. https://github.com/langchain-ai/langchain, 2025


[32]

LangGraph: Build resilient language agents as graphs

LangChain, Inc. LangGraph: Build resilient language agents as graphs. https://github.com/langchain-ai/langgraph, 2025


[33]

CrewAI: Framework for orchestrating role-playing, autonomous AI agents

crewAI, Inc. CrewAI: Framework for orchestrating role-playing, autonomous AI agents. https://github.com/crewAIInc/crewAI, 2025


[34]

OpenAI Agents SDK: A lightweight, powerful framework for multi-agent workflows

OpenAI. OpenAI Agents SDK: A lightweight, powerful framework for multi-agent workflows. https://github.com/openai/openai-agents-python, 2025


[35]

Claude Agent SDK

Anthropic. Claude Agent SDK. https://github.com/anthropics/claude-agent-sdk-python, 2025


[36]

Agentprm: Process reward models for llm agents via step-wise promise and progress

Zhiheng Xi, Chenyang Liao, Guanyu Li, Zhihao Zhang, Wenxiang Chen, Binghai Wang, Senjie Jin, Yuhao Zhou, Jian Guan, Wei Wu, et al. Agentprm: Process reward models for llm agents via step-wise promise and progress. In Proceedings of the ACM Web Conference 2026, pages 4184–4195, 2026


[37]

Rlanything: Forge environment, policy, and reward model in completely dynamic rl system

Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, and Ling Yang. Rlanything: Forge environment, policy, and reward model in completely dynamic rl system. arXiv preprint arXiv:2602.02488, 2026


[38]

Hermes agent: The self-improving ai agent built by nous research

Nous Research. Hermes agent: The self-improving ai agent built by nous research. https://github.com/NousResearch/hermes-agent, 2026. Accessed: 2026-06-30.
