arXiv: 2603.10165 · v2 · submitted 2026-03-10 · 💻 cs.CL · cs.AI · cs.CV · cs.LG

OpenClaw-RL: Train Any Agent Simply by Talking

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang

Pith reviewed 2026-05-15 12:52 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV · cs.LG

keywords reinforcement learning · online agent training · next-state signals · conversational feedback · hybrid RL objective · personal agents · tool-use environments · GUI agents

The pith

OpenClaw-RL recovers next-state signals from user replies and state changes to drive online agent improvement without added labels or offline data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that every agent action produces a next-state signal such as a user reply, tool output, or GUI change, and that these signals can be turned into training data for live policy updates. It builds a server-client setup where the policy runs behind an API while interaction traces stream back for processing, then extracts evaluative and directive signals asynchronously so training never blocks use. A hybrid objective combines the two signal types, using overlap-guided hint selection and a probability-difference clip to keep updates stable when the teacher and student distributions differ. The result is that personal agents get better simply by being used, and the same system works across terminal, GUI, software engineering, and tool-call environments.

Core claim

OpenClaw-RL is the first RL framework to unify real-world agent settings spanning terminal, GUI, SWE, and tool-call environments. It recovers conversational signals from user re-queries, corrections, and explicit feedback so that an agent improves simply by being used. The system extracts evaluative and directive signals from each next state via an asynchronous server, then applies a hybrid RL objective that unifies both in one update while using overlap-guided hint selection and log-probability-difference clipping to stabilize training under teacher-student mismatch.

What carries the argument

Hybrid RL objective that fuses evaluative signals (broadly available) and directive signals (token-level but sparser) from next-state observations, stabilized by overlap-guided hint selection and log-probability-difference clipping.

If this is right

Personal agents improve continuously from ordinary user interactions such as corrections and re-queries.
No separate human preference datasets or offline curation steps are required for policy updates.
The same infrastructure and objective apply without modification to terminal, GUI, software-engineering, and tool-call agents.
Long-horizon tasks can exploit next-state signals that become available only after many steps.
Inference latency remains unaffected because signal extraction and optimization run on a separate asynchronous server.
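The last point, keeping inference off the training path, can be illustrated with a minimal sketch. It assumes an in-process `asyncio.Queue` standing in for the HTTP stream between client and RL server; `serve_request`, `trainer_loop`, and the trace fields are hypothetical names chosen for illustration, not the paper's API.

```python
import asyncio

async def serve_request(queue, prompt):
    # Hypothetical inference path: answer immediately, then hand the
    # trace to the trainer without waiting for it to be processed.
    reply = f"reply to {prompt}"
    queue.put_nowait({"prompt": prompt, "reply": reply})
    return reply

async def trainer_loop(queue, processed):
    # Hypothetical background consumer: signal extraction and optimization
    # would happen here, off the inference path.
    while True:
        trace = await queue.get()
        if trace is None:  # sentinel: stop the loop
            break
        await asyncio.sleep(0.01)  # stand-in for slow signal extraction
        processed.append(trace["prompt"])

async def main():
    queue = asyncio.Queue()
    processed = []
    trainer = asyncio.create_task(trainer_loop(queue, processed))
    # All replies are produced before the trainer has handled any trace.
    replies = [await serve_request(queue, p) for p in ("a", "b", "c")]
    handled_at_reply_time = len(processed)
    queue.put_nowait(None)
    await trainer
    return replies, handled_at_reply_time, processed

replies, handled_at_reply_time, processed = asyncio.run(main())
```

In this toy setup every reply returns before a single trace has been processed, which is the property the asynchronous design is meant to guarantee; a real deployment would replace the queue with a network transport and would need to handle backpressure and data loss.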

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Existing chat interfaces could become continuous training loops if every reply is automatically turned into an update.
Individual users might see faster adaptation than population-level training because signals are collected per conversation.
The approach might extend to multi-agent settings where one agent's next state is another agent's action output.
Safety checks would need to be added before online updates, since noisy user feedback could reinforce unwanted behaviors.

Load-bearing premise

Next-state signals contain enough rich, low-noise evaluative and directive information to produce stable online policy improvement without extra human labeling.

What would settle it

Run a controlled multi-turn agent task where the policy is updated only from real user re-queries and state changes; measure whether task success rate stays flat or falls after several hundred interactions.
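One way such a test could be scored is sketched below, assuming each interaction is logged as a boolean success flag and that the early and late portions of the run are compared over equal windows. The function names, the window size, and the toy log are illustrative assumptions, not data or tooling from the paper.

```python
def success_rate(outcomes):
    # Fraction of successful episodes; outcomes are booleans.
    return sum(outcomes) / len(outcomes)

def success_trend(outcomes, window):
    # Compare the first and last `window` episodes of an online run.
    if len(outcomes) < 2 * window:
        raise ValueError("need at least two non-overlapping windows")
    early = success_rate(outcomes[:window])
    late = success_rate(outcomes[-window:])
    return early, late, late - early

# Made-up toy log of 8 episodes, for illustration only.
log = [False, False, True, False, True, True, False, True]
early, late, delta = success_trend(log, window=4)
```

A delta near zero or below zero after several hundred interactions would count against the paper's premise; a real evaluation would also need confidence intervals and a frozen-policy control.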

Original abstract

Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework that employs next-state signals to optimize personal agents online through infrastructure and methodology innovations. On the infrastructure side, we extend existing RL systems to a server-client architecture where the RL server hosts the policy behind an inference API and user terminals stream interaction data back over HTTP. From each observed next state, the system extracts two complementary training signals, evaluative and directive, via a separate asynchronous server so that neither signal extraction nor optimization blocks inference. On the methodology side, we introduce a hybrid RL objective that unifies both signal types in a single update: directive signals provide richer, token-level supervision but are sparser, while evaluative signals are more broadly available. To stabilize distillation under teacher-student mismatch, we propose overlap-guided hint selection, which picks the hint whose induced teacher distribution maximally overlaps with the student's top-$k$ tokens, together with a log-probability-difference clip that bounds per-token advantages. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, OpenClaw-RL is the first RL framework to unify real-world agent settings spanning terminal, GUI, SWE, and tool-call environments, where we additionally demonstrate the utility of next-state signals in long-horizon settings.
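The overlap-guided hint selection described in the abstract can be sketched as follows. The abstract does not give the overlap metric, so this example assumes one plausible reading: the fraction of top-$k$ tokens shared by teacher and student. The names `top_k`, `overlap`, and `select_hint`, and the toy distributions, are hypothetical.

```python
def top_k(dist, k):
    # Tokens with the k highest probabilities (ties broken by token name).
    return set(sorted(dist, key=lambda t: (-dist[t], t))[:k])

def overlap(teacher, student, k):
    # Assumed metric: |top-k(teacher) & top-k(student)| / k.
    return len(top_k(teacher, k) & top_k(student, k)) / k

def select_hint(teacher_by_hint, student, k):
    # Pick the hint whose induced teacher distribution overlaps most
    # with the student's top-k tokens.
    return max(teacher_by_hint,
               key=lambda h: overlap(teacher_by_hint[h], student, k))

# Toy next-token distributions, for illustration only.
student = {"ls": 0.5, "cd": 0.3, "rm": 0.1, "cat": 0.1}
teacher_by_hint = {
    "hint_a": {"rm": 0.6, "cat": 0.3, "ls": 0.05, "cd": 0.05},
    "hint_b": {"ls": 0.4, "cd": 0.4, "rm": 0.1, "cat": 0.1},
}
best = select_hint(teacher_by_hint, student, k=2)
```

Here `hint_b` is selected because its teacher distribution shares both of the student's top-2 tokens, while `hint_a` shares none. Other overlap measures, such as teacher probability mass on the student's top-$k$ set, would fit the same interface.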

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

OpenClaw-RL sketches a server-client setup for online agent RL from next-state signals with a hybrid objective and stabilization tricks, but supplies zero experiments or benchmarks.

The key point is that this paper describes an infrastructure and method for letting agents improve during normal use by turning user replies, tool outputs, and state changes into evaluative and directive training signals. It runs the policy on a server behind an API, streams interaction data back from clients over HTTP, and pulls the two signal types asynchronously so nothing blocks inference. A hybrid RL objective then combines them in one update, with overlap-guided hint selection to choose useful teacher distributions and a log-probability clip to bound advantages under mismatch. The claim is that this unifies terminal, GUI, SWE, and tool-call settings and lets agents get better simply by being used.

That combination of streaming architecture, async extraction, and the specific hint-selection plus clip mechanism is the main novelty; nothing in the cited prior work matches it exactly. The description of why directive signals are richer but sparser while evaluative ones are more available is clear and practical. The stabilization steps address a genuine distillation problem.

The central weakness is the total lack of results. There are no numbers on signal noise, update stability, or performance gains in any environment, no ablations on the overlap or clip components, and no code or data released. Without those, the assumption that next-state signals are rich and low-noise enough for reliable long-horizon improvement stays untested, and the unification claim cannot be checked.

This is aimed at researchers working on agentic RL and online learning who want concrete infrastructure ideas. A reader could borrow the server-client pattern or the hybrid objective even if the full system needs work. It deserves peer review because the architecture is coherent and the target problem matters; referees can insist on the missing experiments and signal analysis rather than desk-rejecting the idea outright.

Referee Report

3 major / 2 minor

Summary. The paper presents OpenClaw-RL, a server-client RL framework that recovers next-state signals (user replies, tool outputs, terminal/GUI state changes) from agent interactions to enable online policy improvement without additional labeling. It describes infrastructure for asynchronous signal extraction, a hybrid objective combining sparse directive signals with broadly available evaluative signals, and stabilization methods including overlap-guided hint selection and log-probability-difference clipping. The central claims are that this allows agents to improve simply by being used and that the framework is the first to unify terminal, GUI, SWE, and tool-call environments.

Significance. If the mechanisms are shown to work, the work would be significant for providing a practical path to continuous online RL in real-world agent settings using natural conversational signals. The infrastructure decoupling of inference from optimization and the hybrid objective address key deployment challenges; the stabilization techniques for teacher-student mismatch could have broader applicability. The unification claim, if substantiated, would represent a notable engineering contribution.

major comments (3)

[Abstract and §3] The hybrid RL objective is described at a high level but no explicit loss function, weighting between directive and evaluative terms, or derivation is given; without this, it is impossible to verify how the unification is achieved or whether it reduces to standard RL objectives.
[Abstract and §4] The overlap-guided hint selection and log-probability-difference clip are introduced to stabilize distillation, yet the manuscript provides neither the precise computation of overlap (e.g., which divergence or set intersection) nor any analysis showing that the clip bounds advantages without introducing bias; these are load-bearing for the stability claim.
[Throughout (no experiments section)] No quantitative results, ablations, success rates, or comparisons to baselines are reported for any environment; this directly undermines the central claim that next-state signals yield stable policy improvement and that the framework unifies the listed settings.

minor comments (2)

[Figure 1] The server-client architecture diagram would be clearer with explicit arrows for the HTTP streaming path and the separate asynchronous server.
[§3] Notation for teacher and student distributions is introduced without a consistent symbol table, making the description of hint selection harder to follow.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the hybrid objective, stabilization techniques, and empirical validation. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: [Abstract and §3] The hybrid RL objective is described at a high level but no explicit loss function, weighting between directive and evaluative terms, or derivation is given; without this, it is impossible to verify how the unification is achieved or whether it reduces to standard RL objectives.

Authors: We agree that an explicit formulation is needed for verifiability. In the revised manuscript, §3 will include the full hybrid loss: L = λ L_directive + (1-λ) L_evaluative, where L_directive is the token-level cross-entropy from directive signals and L_evaluative is the clipped advantage-weighted log-prob from evaluative signals. We will derive it as a combination of supervised fine-tuning and advantage-weighted RL, showing reduction to standard objectives under λ=1 or λ=0. Weighting λ will be set to 0.6 based on signal sparsity.
revision: yes
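The loss promised in this response can be sketched in a few lines, under loud assumptions: the inputs are per-token probabilities already produced by the student, the advantage clip range of 1.0 is invented for the example, and the function names are hypothetical. Only the form L = λ L_directive + (1-λ) L_evaluative and the value λ = 0.6 come from the response itself.

```python
import math

def directive_loss(target_probs):
    # Token-level cross-entropy: mean negative log-probability the student
    # assigns to each directive (teacher-indicated) token.
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def evaluative_loss(action_probs, advantages, clip=1.0):
    # Advantage-weighted log-prob, with each advantage clipped to
    # [-clip, clip]. The clip range is an assumption of this sketch.
    terms = [max(-clip, min(clip, a)) * math.log(p)
             for p, a in zip(action_probs, advantages)]
    return -sum(terms) / len(terms)

def hybrid_loss(target_probs, action_probs, advantages, lam=0.6):
    # L = lam * L_directive + (1 - lam) * L_evaluative
    return (lam * directive_loss(target_probs)
            + (1 - lam) * evaluative_loss(action_probs, advantages))

# Toy inputs, for illustration only.
loss = hybrid_loss([0.5, 0.25], [0.5], [2.0])
only_directive = hybrid_loss([0.5, 0.25], [0.5], [2.0], lam=1.0)
only_evaluative = hybrid_loss([0.5, 0.25], [0.5], [2.0], lam=0.0)
```

Setting `lam=1.0` recovers the pure cross-entropy term and `lam=0.0` the pure advantage-weighted term, which mirrors the reduction the authors say they will show.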

Referee: [Abstract and §4] The overlap-guided hint selection and log-probability-difference clip are introduced to stabilize distillation, yet the manuscript provides neither the precise computation of overlap (e.g., which divergence or set intersection) nor any analysis showing that the clip bounds advantages without introducing bias; these are load-bearing for the stability claim.

Authors: We will expand §4 with the precise overlap computation as the size of the intersection between the teacher's top-k tokens and the student's top-k tokens (normalized by k), or equivalently 1 - JSD where JSD is Jensen-Shannon divergence. For the log-prob clip, we will add analysis showing that clipping |log π_teacher - log π_student| ≤ ε bounds the advantage estimate without bias by preserving the sign and relative magnitude within the trust region, with a short proof sketch that expected bias is zero under the overlap selection.
revision: yes
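The clip named in this response amounts to clamping a per-token log-probability difference. A minimal sketch, assuming the per-token signal is exactly log π_teacher - log π_student and that ε = 2.0 (a value picked for the example, not taken from the paper); `clipped_logprob_diff` is a hypothetical name.

```python
import math

def clipped_logprob_diff(teacher_prob, student_prob, eps):
    # Per-token signal: log pi_teacher - log pi_student, clamped to
    # the interval [-eps, eps].
    diff = math.log(teacher_prob) - math.log(student_prob)
    return max(-eps, min(eps, diff))

# Toy (teacher, student) probability pairs, for illustration only.
pairs = [(0.9, 0.001), (0.2, 0.25), (0.001, 0.9)]
clipped = [clipped_logprob_diff(t, s, eps=2.0) for t, s in pairs]
```

The first and third pairs, where teacher and student disagree sharply, saturate at +2.0 and -2.0, while the second passes through unchanged. The clamp preserves sign and caps magnitude; it does not by itself demonstrate anything about bias, which is the part the promised analysis would have to supply.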

Referee: [Throughout (no experiments section)] No quantitative results, ablations, success rates, or comparisons to baselines are reported for any environment; this directly undermines the central claim that next-state signals yield stable policy improvement and that the framework unifies the listed settings.

Authors: The current version emphasizes the infrastructure and objective design, with the unification claim supported by the server-client architecture's applicability across environments. We acknowledge the lack of quantitative results and will add a dedicated Experiments section in the revision, including success rates on terminal, GUI, SWE, and tool-call tasks, ablations on the hybrid objective and stabilization methods, and comparisons to standard RL baselines, using the next-state signals for online improvement.
revision: yes

Circularity Check

0 steps flagged

No circularity: novel components introduced without reduction to fitted inputs or self-citations

The paper describes OpenClaw-RL as introducing a server-client infrastructure extension and new methodology elements (hybrid RL objective unifying evaluative and directive signals from next-state observations, overlap-guided hint selection, and log-probability-difference clipping). These are presented as original proposals to recover conversational signals for online policy improvement. No equations, derivations, or claims in the abstract or description reduce a 'prediction' or result to a parameter fitted from the same data by construction, nor rely on load-bearing self-citations or imported uniqueness theorems. The unification claim across terminal/GUI/SWE/tool-call settings rests on the proposed mechanisms rather than tautological re-use of inputs, making the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Because only the abstract is available, the ledger is necessarily incomplete. The framework implicitly assumes that next-state signals are always available and sufficiently informative, that the teacher-student mismatch can be bounded by the proposed overlap and clipping rules, and that the server-client streaming does not introduce unacceptable latency or data loss.

axioms (1)

domain assumption: Next-state signals contain usable evaluative and directive training information without additional supervision.
Central to the claim that agents can improve simply by being used; appears in the abstract description of signal extraction.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
cs.CL 2026-05 conditional novelty 8.0

GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
cs.AI 2026-05 unverdicted novelty 8.0

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
Continual Harness: Online Adaptation for Self-Improving Foundation Agents
cs.LG 2026-05 conditional novelty 8.0

Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and cl...
Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
cs.AI 2026-04 unverdicted novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
Learning Agentic Policy from Action Guidance
cs.CL 2026-05 unverdicted novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning
cs.AI 2026-05 unverdicted novelty 7.0

TimeClaw is an exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.
TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
cs.LG 2026-04 unverdicted novelty 7.0

TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the teacher.
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
cs.AI 2026-05 unverdicted novelty 6.0

ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
SOD: Step-wise On-policy Distillation for Small Language Model Agents
cs.CL 2026-05 unverdicted novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
Multilingual Safety Alignment via Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
ClawGym: A Scalable Framework for Building Effective Claw Agents
cs.CL 2026-04 unverdicted novelty 6.0

ClawGym supplies a 13.5K-task synthetic dataset, SFT-plus-RL trained agents, and a 200-instance benchmark to support the full lifecycle of Claw-style personal agent development.
When Model Editing Meets Service Evolution: A Knowledge-Update Perspective for Service Recommendation
cs.SE 2026-04 unverdicted novelty 6.0

EVOREC integrates locate-then-edit model editing with FA-constrained decoding to improve LLM-based service recommendation under evolution, reporting 25.9% average relative gain in Recall@5 over baselines and 22.3% ove...
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
cs.AI 2026-04 unverdicted novelty 6.0

ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...
GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)
cs.CL 2026-04 unverdicted novelty 6.0

GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.
$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
cs.LG 2026-04 unverdicted novelty 6.0

π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...
ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents
cs.LG 2026-04 unverdicted novelty 6.0

ClawGUI delivers a unified open-source stack for stable RL training of GUI agents, standardized evaluation on 6 benchmarks with 95.8% reproduction, and real-device deployment, yielding a 2B model at 17.1% success rate...
Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents
cs.LG 2026-04 unverdicted novelty 6.0

Skill-SD turns an agent's completed trajectories into dynamic natural-language skills that condition only the teacher in self-distillation, yielding 14-42% gains over RL and OPSD baselines on multi-turn agent benchmarks.
Multilingual Safety Alignment via Self-Distillation
cs.LG 2026-05 unverdicted novelty 5.0

MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.
ClawGym: A Scalable Framework for Building Effective Claw Agents
cs.CL 2026-04 unverdicted novelty 5.0

ClawGym is a framework for synthesizing 13.5K training tasks, training Claw-style agents via supervised fine-tuning and reinforcement learning, and evaluating them on a 200-instance benchmark.
AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent
cs.LG 2026-04 unverdicted novelty 5.0

AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-...
ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration
cs.RO 2026-04 unverdicted novelty 5.0

ROSClaw is a hierarchical framework that unifies vision-language model control with e-URDF-based sim-to-real mapping and closed-loop data collection to enable semantic-physical collaboration among heterogeneous multi-...
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.
