OpenClaw-RL: Train Any Agent Simply by Talking
Pith reviewed 2026-05-15 12:52 UTC · model grok-4.3
The pith
OpenClaw-RL recovers next-state signals from user replies and state changes to drive online agent improvement without added labels or offline data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OpenClaw-RL is the first RL framework to unify real-world agent settings spanning terminal, GUI, SWE, and tool-call environments. It recovers conversational signals from user re-queries, corrections, and explicit feedback so that an agent improves simply by being used. The system extracts evaluative and directive signals from each next state via an asynchronous server, then applies a hybrid RL objective that unifies both in one update while using overlap-guided hint selection and log-probability-difference clipping to stabilize training under teacher-student mismatch.
What carries the argument
Hybrid RL objective that fuses evaluative signals (broadly available) and directive signals (token-level but sparser) from next-state observations, stabilized by overlap-guided hint selection and log-probability-difference clipping.
If this is right
- Personal agents improve continuously from ordinary user interactions such as corrections and re-queries.
- No separate human preference datasets or offline curation steps are required for policy updates.
- The same infrastructure and objective apply without modification to terminal, GUI, software-engineering, and tool-call agents.
- Long-horizon tasks can exploit next-state signals that become available only after many steps.
- Inference latency remains unaffected because signal extraction and optimization run on a separate asynchronous server.
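The last point, that signal extraction does not block inference, can be illustrated with a minimal sketch. It assumes an in-process `asyncio.Queue` standing in for the HTTP streaming path described in the abstract; `serve_turn`, `extract_signals`, the record fields, and the toy reward rule are hypothetical names chosen for illustration, not the OpenClaw-RL API.

```python
import asyncio

async def serve_turn(queue, turn_id, action, next_state):
    """Inference path: reply at once and hand the record off without waiting."""
    queue.put_nowait({"turn": turn_id, "action": action, "next_state": next_state})
    return f"served turn {turn_id}"

async def extract_signals(queue, extracted):
    """Background worker standing in for the separate asynchronous server."""
    while True:
        record = await queue.get()
        await asyncio.sleep(0.01)  # placeholder for slow signal extraction
        # Toy rule (assumption): a reply starting with "no" is negative feedback.
        reward = -1.0 if record["next_state"].lower().startswith("no") else 1.0
        extracted.append((record["turn"], reward))
        queue.task_done()

async def run_demo():
    queue = asyncio.Queue()
    extracted = []
    worker = asyncio.create_task(extract_signals(queue, extracted))
    replies = [
        await serve_turn(queue, 0, "ls", "thanks, that works"),
        await serve_turn(queue, 1, "rm tmp", "no, wrong directory"),
    ]
    done_at_reply_time = len(extracted)  # extraction has not finished anything yet
    await queue.join()  # wait here only so the demo can inspect the results
    worker.cancel()
    return replies, done_at_reply_time, extracted

replies, done_at_reply_time, extracted = asyncio.run(run_demo())
```

Both replies are returned before any signal has been extracted, which is the property the bullet describes; a real deployment would replace the queue with network transport and the toy rule with a learned judge.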
Where Pith is reading between the lines
- Existing chat interfaces could become continuous training loops if every reply is automatically turned into an update.
- Individual users might see faster adaptation than population-level training because signals are collected per conversation.
- The approach might extend to multi-agent settings where one agent's next state is another agent's action output.
- Safety checks would need to be added before online updates, since noisy user feedback could reinforce unwanted behaviors.
Load-bearing premise
Next-state signals contain enough rich, low-noise evaluative and directive information to produce stable online policy improvement without extra human labeling.
What would settle it
Run a controlled multi-turn agent task where the policy is updated only from real user re-queries and state changes; measure whether task success rate stays flat or falls after several hundred interactions.
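One way to make "stays flat or falls" checkable is to compare success rates in an early and a late window of interactions. The sketch below assumes binary per-interaction success labels and uses a two-proportion z statistic; the function name, the window size, and the synthetic log are illustrative assumptions, not part of the paper.

```python
import math

def success_rate_shift(outcomes, window):
    """Compare success rates in the first and last `window` interactions.

    Returns (early_rate, late_rate, z), where z is a two-proportion z statistic.
    A z near or below zero is consistent with "flat or falling".
    """
    if len(outcomes) < 2 * window:
        raise ValueError("need at least 2 * window outcomes")
    early, late = outcomes[:window], outcomes[-window:]
    p_early = sum(early) / window
    p_late = sum(late) / window
    pooled = (sum(early) + sum(late)) / (2 * window)
    se = math.sqrt(2 * pooled * (1 - pooled) / window)
    z = (p_late - p_early) / se if se > 0 else 0.0
    return p_early, p_late, z

# Synthetic log for illustration only (not data from the paper):
# 100 early interactions at 25% success, 100 late interactions at 50%.
outcomes = [1, 0, 0, 0] * 25 + [1, 1, 0, 0] * 25
p_early, p_late, z = success_rate_shift(outcomes, window=100)
```

A clearly positive z on real user traffic would support the premise; a z that stays near zero or goes negative after several hundred interactions would count against it.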
Original abstract
Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework that employs next-state signals to optimize personal agents online through infrastructure and methodology innovations. On the infrastructure side, we extend existing RL systems to a server-client architecture where the RL server hosts the policy behind an inference API and user terminals stream interaction data back over HTTP. From each observed next state, the system extracts two complementary training signals, evaluative and directive, via a separate asynchronous server so that neither signal extraction nor optimization blocks inference. On the methodology side, we introduce a hybrid RL objective that unifies both signal types in a single update: directive signals provide richer, token-level supervision but are sparser, while evaluative signals are more broadly available. To stabilize distillation under teacher-student mismatch, we propose overlap-guided hint selection, which picks the hint whose induced teacher distribution maximally overlaps with the student's top-$k$ tokens, together with a log-probability-difference clip that bounds per-token advantages. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, OpenClaw-RL is the first RL framework to unify real-world agent settings spanning terminal, GUI, SWE, and tool-call environments, where we additionally demonstrate the utility of next-state signals in long-horizon settings.
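The two stabilization steps named in the abstract can be sketched as follows. The abstract does not give the exact overlap measure or advantage definition, so this sketch assumes (a) overlap is the teacher probability mass that lands on the student's top-k tokens and (b) the per-token advantage is the teacher-minus-student log-probability difference. All function names and the toy distributions are hypothetical.

```python
import math

def top_k_tokens(dist, k):
    """Return the set of the k highest-probability tokens in a {token: prob} dict."""
    return set(sorted(dist, key=dist.get, reverse=True)[:k])

def overlap_score(teacher_dist, student_dist, k):
    """Teacher probability mass on the student's top-k tokens (assumed measure)."""
    return sum(teacher_dist.get(tok, 0.0) for tok in top_k_tokens(student_dist, k))

def select_hint(teacher_dists_by_hint, student_dist, k):
    """Pick the hint whose induced teacher distribution overlaps most with the student."""
    return max(
        teacher_dists_by_hint,
        key=lambda hint: overlap_score(teacher_dists_by_hint[hint], student_dist, k),
    )

def clipped_advantage(teacher_logp, student_logp, eps):
    """Log-probability difference clipped to [-eps, eps] (assumed advantage form)."""
    return max(-eps, min(eps, teacher_logp - student_logp))

# Toy next-token distributions for one position (illustrative values only).
student = {"ls": 0.5, "cd": 0.3, "rm": 0.15, "cat": 0.05}
teachers = {
    "hint_a": {"ls": 0.1, "cd": 0.1, "rm": 0.7, "cat": 0.1},
    "hint_b": {"ls": 0.6, "cd": 0.25, "rm": 0.1, "cat": 0.05},
}
best_hint = select_hint(teachers, student, k=2)
adv = clipped_advantage(math.log(0.6), math.log(0.5), eps=0.1)
```

Here `hint_b` is selected because its teacher distribution concentrates on the tokens the student already ranks highest, and the clip caps the advantage for `ls` at `eps` even though the raw log-probability gap is larger.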
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents OpenClaw-RL, a server-client RL framework that recovers next-state signals (user replies, tool outputs, terminal/GUI state changes) from agent interactions to enable online policy improvement without additional labeling. It describes infrastructure for asynchronous signal extraction, a hybrid objective combining sparse directive signals with broadly available evaluative signals, and stabilization methods including overlap-guided hint selection and log-probability-difference clipping. The central claims are that this allows agents to improve simply by being used and that the framework is the first to unify terminal, GUI, SWE, and tool-call environments.
Significance. If the mechanisms are shown to work, the work would be significant for providing a practical path to continuous online RL in real-world agent settings using natural conversational signals. The infrastructure decoupling of inference from optimization and the hybrid objective address key deployment challenges; the stabilization techniques for teacher-student mismatch could have broader applicability. The unification claim, if substantiated, would represent a notable engineering contribution.
major comments (3)
- [Abstract and §3] The hybrid RL objective is described at a high level but no explicit loss function, weighting between directive and evaluative terms, or derivation is given; without this, it is impossible to verify how the unification is achieved or whether it reduces to standard RL objectives.
- [Abstract and §4] The overlap-guided hint selection and log-probability-difference clip are introduced to stabilize distillation, yet the manuscript provides neither the precise computation of overlap (e.g., which divergence or set intersection) nor any analysis showing that the clip bounds advantages without introducing bias; these are load-bearing for the stability claim.
- [Throughout (no experiments section)] No quantitative results, ablations, success rates, or comparisons to baselines are reported for any environment; this directly undermines the central claim that next-state signals yield stable policy improvement and that the framework unifies the listed settings.
minor comments (2)
- [Figure 1] The server-client architecture diagram would be clearer with explicit arrows for the HTTP streaming path and the separate asynchronous server.
- [§3] Notation for teacher and student distributions is introduced without a consistent symbol table, making the description of hint selection harder to follow.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the hybrid objective, stabilization techniques, and empirical validation. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract and §3] The hybrid RL objective is described at a high level but no explicit loss function, weighting between directive and evaluative terms, or derivation is given; without this, it is impossible to verify how the unification is achieved or whether it reduces to standard RL objectives.
Authors: We agree that an explicit formulation is needed for verifiability. In the revised manuscript, §3 will include the full hybrid loss: L = λ L_directive + (1-λ) L_evaluative, where L_directive is the token-level cross-entropy from directive signals and L_evaluative is the clipped advantage-weighted log-prob from evaluative signals. We will derive it as a combination of supervised fine-tuning and advantage-weighted RL, showing reduction to standard objectives under λ=1 or λ=0. Weighting λ will be set to 0.6 based on signal sparsity. revision: yes
- Referee: [Abstract and §4] The overlap-guided hint selection and log-probability-difference clip are introduced to stabilize distillation, yet the manuscript provides neither the precise computation of overlap (e.g., which divergence or set intersection) nor any analysis showing that the clip bounds advantages without introducing bias; these are load-bearing for the stability claim.
Authors: We will expand §4 with the precise overlap computation: the size of the intersection between the teacher's top-k tokens and the student's top-k tokens, normalized by k, or alternatively 1 - JSD, where JSD is the Jensen-Shannon divergence between the two distributions. For the log-probability clip, we will add analysis showing that enforcing |log π_teacher - log π_student| ≤ ε bounds the advantage estimate while preserving its sign and, within the trust region, its relative magnitude, together with a short proof sketch that the expected bias is zero under overlap-guided selection. revision: yes
- Referee: [Throughout (no experiments section)] No quantitative results, ablations, success rates, or comparisons to baselines are reported for any environment; this directly undermines the central claim that next-state signals yield stable policy improvement and that the framework unifies the listed settings.
Authors: The current version emphasizes the infrastructure and objective design, with the unification claim supported by the server-client architecture's applicability across environments. We acknowledge the lack of quantitative results and will add a dedicated Experiments section in the revision, including success rates on terminal, GUI, SWE, and tool-call tasks, ablations on the hybrid objective and stabilization methods, and comparisons to standard RL baselines, using the next-state signals for online improvement. revision: yes
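The hybrid loss proposed in the first response, L = λ L_directive + (1-λ) L_evaluative with λ = 0.6, can be sketched in a few lines. This is a plain-Python illustration under assumptions: the directive term is taken as the mean negative log-probability of the hinted tokens, the evaluative term as a mean advantage-weighted negative log-probability with clipped advantages, and all function names and numeric inputs are hypothetical.

```python
import math

def directive_loss(student_logps_of_hint_tokens):
    """Token-level cross-entropy: mean negative log-prob of the directive targets."""
    n = len(student_logps_of_hint_tokens)
    return -sum(student_logps_of_hint_tokens) / n

def evaluative_loss(student_logps, advantages, clip):
    """Advantage-weighted negative log-prob, with advantages clipped to [-clip, clip]."""
    clipped = [max(-clip, min(clip, a)) for a in advantages]
    return -sum(a * lp for a, lp in zip(clipped, student_logps)) / len(student_logps)

def hybrid_loss(l_directive, l_evaluative, lam=0.6):
    """L = lam * L_directive + (1 - lam) * L_evaluative."""
    return lam * l_directive + (1.0 - lam) * l_evaluative

# Illustrative inputs only (not values from the paper).
l_dir = directive_loss([math.log(0.5), math.log(0.25)])
l_eval = evaluative_loss([math.log(0.5), math.log(0.5)], [2.0, -0.5], clip=1.0)
total = hybrid_loss(l_dir, l_eval)
```

Setting `lam=1.0` recovers the purely supervised (directive) term and `lam=0.0` the purely evaluative term, which mirrors the reduction to standard objectives that the response promises to derive.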
Circularity Check
No circularity: novel components introduced without reduction to fitted inputs or self-citations
Full rationale
The paper describes OpenClaw-RL as introducing a server-client infrastructure extension and new methodology elements (hybrid RL objective unifying evaluative and directive signals from next-state observations, overlap-guided hint selection, and log-probability-difference clipping). These are presented as original proposals to recover conversational signals for online policy improvement. No equations, derivations, or claims in the abstract or description reduce a 'prediction' or result to a parameter fitted from the same data by construction, nor rely on load-bearing self-citations or imported uniqueness theorems. The unification claim across terminal/GUI/SWE/tool-call settings rests on the proposed mechanisms rather than tautological re-use of inputs, making the derivation chain self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: next-state signals contain usable evaluative and directive training information without additional supervision.
Forward citations
Cited by 24 Pith papers
- GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.
- Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
- Continual Harness: Online Adaptation for Self-Improving Foundation Agents
Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and cl...
- Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
- Learning Agentic Policy from Action Guidance
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
- TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning
TimeClaw is an exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.
- TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.
- ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
- Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
- SOD: Step-wise On-policy Distillation for Small Language Model Agents
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
- Multilingual Safety Alignment via Self-Distillation
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
- ClawGym: A Scalable Framework for Building Effective Claw Agents
ClawGym supplies a 13.5K-task synthetic dataset, SFT-plus-RL trained agents, and a 200-instance benchmark to support the full lifecycle of Claw-style personal agent development.
- When Model Editing Meets Service Evolution: A Knowledge-Update Perspective for Service Recommendation
EVOREC integrates locate-then-edit model editing with FA-constrained decoding to improve LLM-based service recommendation under evolution, reporting 25.9% average relative gain in Recall@5 over baselines and 22.3% ove...
- ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...
- GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)
GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.
- $\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...
- ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents
ClawGUI delivers a unified open-source stack for stable RL training of GUI agents, standardized evaluation on 6 benchmarks with 95.8% reproduction, and real-device deployment, yielding a 2B model at 17.1% success rate...
- Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents
Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.
- Multilingual Safety Alignment via Self-Distillation
MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.
- ClawGym: A Scalable Framework for Building Effective Claw Agents
ClawGym is a framework for synthesizing 13.5K training tasks, training Claw-style agents via supervised fine-tuning and reinforcement learning, and evaluating them on a 200-instance benchmark.
- AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent
AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-...
- ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration
ROSClaw is a hierarchical framework that unifies vision-language model control with e-URDF-based sim-to-real mapping and closed-loop data collection to enable semantic-physical collaboration among heterogeneous multi-...
- A Brief Overview: Agentic Reinforcement Learning In Large Language Models
The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...
- A Brief Overview: Agentic Reinforcement Learning In Large Language Models
This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.