Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Boyi Liu; Canwen Xu; Huaxiu Yao; Siwei Han; Yite Wang; Yuxiong He; Zhaoyang Wang; Zhewei Yao

Not yet reviewed by Pith; the record is open.

arXiv 2602.10090 v3 pith:OBNTXM2K submitted 2026-02-10 cs.AI cs.CL cs.LG

Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He

classification cs.AI cs.CL cs.LG

keywords environments, agent, agents, model, reliable, synthetic, fully, learning

Abstract

Recent advances in large language models (LLMs) have empowered autonomous agents to perform multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
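To make the abstract's central idea concrete, the following is a minimal sketch of a code-driven, database-backed environment whose reward is computed from the final database state. It is not taken from the AWM codebase: the `TodoEnv` class, its two tools, the `tasks` table, and the `reward` function are all hypothetical names chosen for illustration, and a real pipeline would presumably generate far richer schemas and toolsets.

```python
import sqlite3


class TodoEnv:
    """Hypothetical toy environment: state lives in SQLite, tools are plain code."""

    def __init__(self):
        # An in-memory database keeps every episode isolated and reproducible.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT, done INTEGER DEFAULT 0)"
        )
        self.db.executemany(
            "INSERT INTO tasks (title) VALUES (?)", [("buy milk",), ("file taxes",)]
        )
        self.db.commit()

    def list_tasks(self):
        # Read-only tool: the observation is derived directly from stored state.
        rows = self.db.execute("SELECT id, title, done FROM tasks ORDER BY id").fetchall()
        return [{"id": r[0], "title": r[1], "done": bool(r[2])} for r in rows]

    def complete_task(self, task_id):
        # Mutating tool: the state transition is a deterministic SQL update.
        cur = self.db.execute("UPDATE tasks SET done = 1 WHERE id = ?", (task_id,))
        self.db.commit()
        return {"ok": cur.rowcount == 1}

    def step(self, tool, **kwargs):
        # Dispatch a tool call by name, as an agent's action would.
        tools = {"list_tasks": self.list_tasks, "complete_task": self.complete_task}
        if tool not in tools:
            return {"error": f"unknown tool: {tool}"}
        return tools[tool](**kwargs)


def reward(env, target_title):
    # State-based reward: inspect the database rather than the agent's transcript.
    row = env.db.execute(
        "SELECT done FROM tasks WHERE title = ?", (target_title,)
    ).fetchone()
    return 1.0 if row is not None and row[0] == 1 else 0.0


# A scripted two-step "trajectory" standing in for an agent policy.
env = TodoEnv()
before = reward(env, "file taxes")
obs = env.step("list_tasks")
target_id = next(t["id"] for t in obs if t["title"] == "file taxes")
result = env.step("complete_task", task_id=target_id)
after = reward(env, "file taxes")
```

Because both the transition and the reward are ordinary queries against stored state, repeating the same actions yields the same outcome, which is the consistency property the abstract contrasts with LLM-simulated environments.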

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

CurateEvo: Data-Curation Evolving for Agentic Post-Training
cs.CL 2026-07 conditional novelty 6.0

CurateEvo evolves executable data-curation code using failed agent trajectories, improving post-training performance by 3.2 and 2.7 points over baselines on labeled and wild data respectively.

PhoneWorld: Scaling Phone-Use Agent Environments
cs.CL 2026-05 unverdicted novelty 6.0

PhoneWorld is a pipeline that converts real mobile trajectories into scalable controllable environments, yielding large gains on four benchmarks when used to supplement training data.

OpenComputer: Verifiable Software Worlds for Computer-Use Agents
cs.AI 2026-05 unverdicted novelty 6.0

OpenComputer introduces a verifier-grounded framework with state verifiers, self-evolving layers, task synthesis, and auditable evaluation for 33 desktop apps and 1000 tasks to support computer-use AI agents.

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
cs.AI 2026-04 unverdicted novelty 6.0

ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
cs.AI 2026-04 unverdicted novelty 6.0

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

Democratizing Tool Learning with Environments Fully Simulated by a Free 8B Language Model
cs.LG 2026-04 unverdicted novelty 6.0

TRUSTEE uses an 8B LM to simulate complete dynamic environments for RL-based tool learning and outperforms baselines that require extra external resources.

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application
cs.CL 2026-06 unverdicted novelty 5.0

This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environm...

Code as Agent Harness
cs.CL 2026-05 accept novelty 5.0

A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed ...

MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments
cs.AI 2026-05 unverdicted novelty 4.0

MCP-Cosmos combines world models with MCP agents via a bring-your-own-world-model strategy and reports gains in tool success rate and parameter accuracy on benchmark tasks.

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
cs.AI 2026-04 conditional novelty 4.0

EVT improves the RMT backbone by using Euclidean-distance attention decay and 1D token grouping, achieving 86.6% top-1 on ImageNet-1K at 384×384 resolution.