pith. sign in

arxiv: 2606.24597 · v1 · pith:VPOPJZPHnew · submitted 2026-06-23 · 💻 cs.CL

Qwen-AgentWorld: Language World Models for General Agents

Pith reviewed 2026-06-25 23:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords language world modelsagentic environment simulationAgentWorldBenchnext-state predictionchain-of-thought reasoningagent reinforcement learningfoundation model warm-upenvironment dynamics
0
0 comments X

The pith

Language models trained as world models simulate agent environments across seven domains, outperform frontier models on a new benchmark, and improve general agents when used as warm-up.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that language models can serve as world models predicting environment dynamics for agents, pushing boundaries in general agent capabilities. It builds two large models on more than 10 million real interaction trajectories from seven domains through a three-stage process of continued pretraining on state transitions, supervised fine-tuning for next-state reasoning, and reinforcement learning with hybrid rewards. These models are evaluated on AgentWorldBench, a benchmark derived from real interactions of frontier models on nine tasks, where they outperform existing models. The work further demonstrates two uses: the models act as decoupled simulators for scalable agent reinforcement learning, and the world-model training itself functions as effective warm-up that boosts performance on seven downstream agent benchmarks.

Core claim

Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B are the first language world models capable of simulating agentic environments covering seven domains via long chain-of-thought reasoning. Built from over 10M environment interaction trajectories through CPT on state transition dynamics, SFT activating next-state-prediction reasoning, and RL sharpening fidelity with hybrid rubric-and-rule rewards, the models significantly outperform existing frontier models on AgentWorldBench. As decoupled simulators they enable scalable RL with gains surpassing real-environment training alone, and as unified foundation models their training serves as highly effective warm-up improving performance across

What carries the argument

Three-stage training pipeline of CPT on state transitions and professional corpora, SFT for next-state reasoning, and RL with hybrid rubric-and-rule rewards that turns base language models into simulators of agent environment dynamics.

If this is right

  • Qwen-AgentWorld supports scalable and controllable simulation of thousands of real-world environments for agentic RL, yielding gains that surpass real-environment training alone.
  • World-model training acts as a highly effective warm-up that improves downstream performance across 7 agentic benchmarks when the model is used as a unified agent foundation model.
  • The models enable simulation of agentic environments covering 7 domains via long chain-of-thought reasoning and outperform frontier models on AgentWorldBench.
  • AgentWorldBench, constructed from real-world interactions of 5 frontier models on 9 established benchmarks, serves as a comprehensive evaluation for language world models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Internal simulation via these world models could reduce the need for costly real-world interactions during agent planning and exploration.
  • The approach might generalize to allow agents to reason about hypothetical or counterfactual environment states not present in the original training data.
  • Combining language world models with other modalities could extend simulation capabilities to visual or multimodal agent environments.

Load-bearing premise

The three-stage training pipeline produces simulations whose fidelity generalizes to new agent scenarios and yields measurable gains when used for RL or as foundation model warm-up.

What would settle it

A controlled experiment showing no performance gain on a held-out agent benchmark when agents are trained with the language world model simulator versus real environments alone would falsify the utility of the decoupled simulator paradigm.

Figures

Figures reproduced from arXiv: 2606.24597 by An Yang, Bowen Yu, Dayiheng Liu, Fei Huang, Haiquan Zhao, Haiyang Xu, Jianhong Tu, Jianxin Yang, Jiayang Cheng, Jingren Zhou, Junyang Wang, Lianghao Deng, Li Sheng, Mingfeng Xue, Ning Ding, Qingfeng Lan, Qin Zhu, Tianyi Bai, Tianyi Tang, Xiaomeng Hu, Yang Fan, Yang Su, Yantao Liu, Yinger Zhang, Yubo Ma, Yucheng Li, Yuxin Zuo, Yuxuan Liu, Zeyu Cui, Zhihai Wang, Zhihui Xie, Zhuorui Ye, Zikai Xiao.

Figure 1
Figure 1. Figure 1: Overview of Qwen-AgentWorld. Top: Qwen-AgentWorld is a unified native language world model across seven domains. Bottom: We explore two complementary strategies for applying world modeling to enhance language agents (mainly using the 35B-A3B model as agent): Decouple and Unify , where the world model serves as the environment simulator and agent foundation model, respectively. 1 arXiv:2606.24597v1 [cs.CL] … view at source ↗
Figure 2
Figure 2. Figure 2: Qwen-AgentWorld unifies seven categories of interactive environment simulation within a [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Anatomy of a Terminal domain LWM RL system prompt, showing the five components defined [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Representative interaction examples from a text-based domain (SWE) and a GUI domain [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Three-stage training pipeline of Qwen-AgentWorld. Stage 1 CPT injects world knowledge; [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Overview of AgentWorldBench composition. [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Main results on AgentWorldBench: five-dimensional rubric mean per domain. Qwen [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Cross-domain generalization when training Stage 3 (RL) on Terminal data alone. (a) Terminal [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Controllable Sim RL vs. Real RL (trained against a live search engine) on WideSearch during [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Environment prediction accu￾racy on Terminal-Bench 2.0 trajectories. Insights: Prediction-Driven Action Refinement. To un￾derstand why LWM warm-up transfers to agentic tasks, we examine the reasoning traces produced during multi-turn agentic evaluation. We find that the RL-trained model sys￾tematically performs mental simulation of environment re￾sponses before executing actions, using its internalized wo… view at source ↗
Figure 11
Figure 11. Figure 11: Case study of prediction-driven action refinement on the [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Representative LWM reasoning patterns from Qwen-AgentWorld-397B-A17B’s thinking traces. [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Micro-level fidelity improvements during RL training. [PITH_FULL_IMAGE:figures/full_fig_p025_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Interaction examples from the GUI domains (continued). OS and Web are shown here; Android [PITH_FULL_IMAGE:figures/full_fig_p035_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Interaction examples from the text-based domains (continued). Terminal, MCP Tool Use, and [PITH_FULL_IMAGE:figures/full_fig_p036_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Training dynamics across 440 RL steps (Qwen [PITH_FULL_IMAGE:figures/full_fig_p037_16.png] view at source ↗
read the original abstract

A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this work, we investigate how world modeling based on language models can further push the boundaries of general agents. (i) We first focus on building foundation models for agentic environment simulation. We introduce Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B, the first language world models capable of simulating agentic environments covering 7 domains via long chain-of-thought reasoning. Leveraging more than 10M environment interaction trajectories of 7 domains in real-world environments, we develop Qwen-AgentWorld through a three-stage training pipeline: CPT injects general-purpose world modeling capabilities from the state transition dynamics and augmented professional corpora, SFT activates next-state-prediction reasoning, and RL sharpens simulation fidelity through a tailored framework with hybrid rubric-and-rule rewards. To evaluate language world models, we present AgentWorldBench, a comprehensive benchmark constructed from real-world interactions of 5 frontier models on 9 established benchmarks. Empirical results demonstrate that Qwen-AgentWorld significantly outperforms existing frontier models. (ii) Beyond foundation models, we further investigate two complementary paradigms through which world modeling enhances general agents. First, as a decoupled environment simulator, Qwen-AgentWorld supports scalable and controllable simulation of thousands of real-world environments for agentic RL, yielding gains that surpass real-environment training alone. Second, as a unified agent foundation model, world-model training acts as a highly effective warm-up that improves downstream performance across 7 agentic benchmarks. Code: https://github.com/QwenLM/Qwen-AgentWorld

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B as the first language world models for simulating agentic environments across 7 domains. These are trained via a three-stage pipeline (CPT on state transitions from >10M trajectories, SFT for next-state reasoning, and RL with hybrid rubric-and-rule rewards) on real-world interaction data. The work also presents AgentWorldBench, built from real-world interactions of 5 frontier models on 9 benchmarks, and claims that the models significantly outperform existing frontier models on this benchmark. It further explores two paradigms: using the models as decoupled simulators for scalable agentic RL and as a warm-up stage that improves performance on 7 downstream agentic benchmarks.

Significance. If the empirical results hold with proper controls for generalization, the work would be significant for the agentic AI community by providing specialized foundation models for environment simulation and demonstrating concrete benefits of world-model training for RL and agent fine-tuning. The release of code at the provided GitHub link is a positive contribution that supports reproducibility.

major comments (2)
  1. Abstract and data collection description: the manuscript asserts that Qwen-AgentWorld 'significantly outperforms existing frontier models' on AgentWorldBench and that world-model training yields gains across 7 benchmarks, yet the abstract supplies no quantitative metrics, baselines, error bars, or evaluation protocol details. Without these, the central empirical claims cannot be assessed for magnitude or robustness.
  2. Training data and AgentWorldBench construction: the paper does not state whether trajectories or environments in AgentWorldBench (derived from 5 frontier models on 9 benchmarks) were held out from the >10M training trajectories collected across 7 domains. Overlap in state spaces or interaction patterns would allow memorization of specific dynamics rather than learning generalizable world models, directly undermining both the benchmark outperformance claim and the downstream RL/warm-up gains.
minor comments (1)
  1. The model naming convention (Qwen-AgentWorld-35B-A3B, Qwen-AgentWorld-397B-A17B) is lengthy and could be clarified or abbreviated on first use for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to improve clarity and rigor.

read point-by-point responses
  1. Referee: [—] Abstract and data collection description: the manuscript asserts that Qwen-AgentWorld 'significantly outperforms existing frontier models' on AgentWorldBench and that world-model training yields gains across 7 benchmarks, yet the abstract supplies no quantitative metrics, baselines, error bars, or evaluation protocol details. Without these, the central empirical claims cannot be assessed for magnitude or robustness.

    Authors: We agree that the abstract would benefit from including key quantitative results. The full manuscript provides detailed results, baselines, error bars, and evaluation protocols in the experimental sections, but the abstract was kept high-level for brevity. We will revise the abstract to incorporate specific metrics (e.g., outperformance margins on AgentWorldBench and gains on the 7 downstream benchmarks) along with a brief note on the evaluation protocol. revision: yes

  2. Referee: [—] Training data and AgentWorldBench construction: the paper does not state whether trajectories or environments in AgentWorldBench (derived from 5 frontier models on 9 benchmarks) were held out from the >10M training trajectories collected across 7 domains. Overlap in state spaces or interaction patterns would allow memorization of specific dynamics rather than learning generalizable world models, directly undermining both the benchmark outperformance claim and the downstream RL/warm-up gains.

    Authors: We confirm that the trajectories and environments in AgentWorldBench were held out from the >10M training trajectories. The benchmark was constructed independently from interactions of 5 frontier models on 9 benchmarks that were not part of the training data collection across the 7 domains, ensuring no overlap in state spaces or interaction patterns. We will add explicit statements clarifying this hold-out procedure, the data separation, and the evaluation protocol in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training pipeline and benchmark evaluation

full rationale

The paper presents an empirical three-stage training procedure (CPT on >10M trajectories, SFT, RL with hybrid rewards) to produce language world models, followed by evaluation on AgentWorldBench constructed from separate frontier-model interactions. No mathematical derivation, equations, or first-principles chain is claimed that reduces to its own inputs by construction. No fitted parameter is renamed as a prediction, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled via prior work. The central claims rest on reported benchmark numbers and downstream gains rather than tautological redefinitions. Potential trajectory overlap is a separate generalization concern outside the circularity criteria, which require explicit quoted reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract provides limited technical detail; the work rests on the premise that LLMs can serve as faithful simulators after the described training, with no free parameters or invented entities explicitly named.

axioms (1)
  • domain assumption Language models can simulate agentic environment dynamics via long chain-of-thought reasoning after CPT-SFT-RL training on interaction trajectories
    This premise underpins the entire claim that the models function as world models and deliver downstream gains.

pith-pipeline@v0.9.1-grok · 5954 in / 1175 out tokens · 30370 ms · 2026-06-25T23:50:42.760779+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

72 extracted references · 27 linked inside Pith

  1. [1]

    World simulation with video foundation models for physical ai

    Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062,

  2. [2]

    Anthropic

    URL https://assets.anthropic.com/m/12f214efcc2 f457a/original/Claude-Sonnet-4-5-System-Card.pdf. Anthropic. System card: Claude opus 4.6, 2026a. URL https://www-cdn.anthropic.com/14e4fb01875d2 a69f646fa5e574dea2b1c0ff7b5.pdf. Anthropic. System card: Claude opus 4.8, 2026b. URL https://www-cdn.anthropic.com/0f0c97ad20d80 05706296bd92aa1c27c6b2f4f61.pdf. An...

  3. [3]

    Webgym: Scaling training environments for visual web agents with realistic tasks.arXiv preprint arXiv:2601.02439,

    Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, and Spencer Whitehead. Webgym: Scaling training environments for visual web agents with realistic tasks.arXiv preprint arXiv:2601.02439,

  4. [4]

    Shihao Cai, Runnan Fang, Jialong Wu, Baixuan Li, Xinyu Wang, Yong Jiang, Liangcai Su, Liwen Zhang, Wenbiao Yin, Zhen Zhang, et al

    URL https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/. Shihao Cai, Runnan Fang, Jialong Wu, Baixuan Li, Xinyu Wang, Yong Jiang, Liangcai Su, Liwen Zhang, Wenbiao Yin, Zhen Zhang, et al. Autoforge: Automated environment synthesis for agentic reinforce- ment learning.arXiv preprint arXiv:2512.22857,

  5. [5]

    Mobiledreamer: Generative sketch world model for gui agent.arXiv preprint arXiv:2601.04035, 2026a

    Yilin Cao, Yufeng Zhong, Zhixiong Zeng, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Wenji Mao, and Wan Guanglu. Mobiledreamer: Generative sketch world model for gui agent.arXiv preprint arXiv:2601.04035, 2026a. Yuan Cao, Dezhi Ran, Mengzhou Wu, Yuzhe Guo, Xin Chen, Ang Li, Gang Cao, Gong Zhi, Hao Yu, Linyi Li, et al. Gui-genesis: Automated synthesis of...

  6. [6]

    Safe and scalable web agent learning via recreated websites.arXiv preprint arXiv:2603.10505,

    Hyungjoo Chae, Jungsoo Park, and Alan Ritter. Safe and scalable web agent learning via recreated websites.arXiv preprint arXiv:2603.10505,

  7. [7]

    Swe-universe: Scale real-world verifiable environments to millions

    Mouxiang Chen, Lei Zhang, Yunlong Feng, Xuwu Wang, Wenting Zhao, Ruisheng Cao, Jiaxi Yang, Jiawei Chen, Mingze Li, Zeyao Ma, et al. Swe-universe: Scale real-world verifiable environments to millions. arXiv preprint arXiv:2602.02361,

  8. [8]

    Internalizing world models via self-play finetuning for agentic rl

    Shiqi Chen, Tongyao Zhu, Zian Wang, Jinghan Zhang, Kangrui Wang, Siyang Gao, Teng Xiao, Yee Whye Teh, Junxian He, and Manling Li. Internalizing world models via self-play finetuning for agentic rl. arXiv preprint arXiv:2510.15047,

  9. [9]

    Webatlas: An llm agent with experience- driven memory and action simulation.arXiv preprint arXiv:2510.22732,

    Jiali Cheng, Anjishnu Kumar, Roshan Lal, Rishi Rajasekaran, Hani Ramezani, Omar Zia Khan, Oleg Rokhlenko, Sunny Chiu-Webster, Gang Hua, and Hadi Amiri. Webatlas: An llm agent with experience- driven memory and action simulation.arXiv preprint arXiv:2510.22732,

  10. [10]

    Agentic world modeling: Foundations, capabilities, laws, and beyond.arXiv preprint arXiv:2604.22748,

    Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, et al. Agentic world modeling: Foundations, capabilities, laws, and beyond.arXiv preprint arXiv:2604.22748,

  11. [11]

    General agents contain world models, even under partial observability and stochas- ticity.arXiv preprint arXiv:2602.03146,

    28 Santiago Cifuentes. General agents contain world models, even under partial observability and stochas- ticity.arXiv preprint arXiv:2602.03146,

  12. [12]

    Cwm: An open-weights llm for research on code generation with world models.arXiv preprint arXiv:2510.02387,

    Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, et al. Cwm: An open-weights llm for research on code generation with world models.arXiv preprint arXiv:2510.02387,

  13. [13]

    Foundation world models for agents that learn, verify, and adapt reliably beyond static environments.arXiv preprint arXiv:2602.23997,

    Florent Delgrange. Foundation world models for agents that learn, verify, and adapt reliably beyond static environments.arXiv preprint arXiv:2602.23997,

  14. [14]

    Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?arXiv preprint arXiv:2509.16941,

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, et al. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks?arXiv preprint arXiv:2509.16941,

  15. [15]

    Dynaweb: Model-based reinforcement learning of web agents.arXiv preprint arXiv:2601.22149,

    Hang Ding, Peidong Liu, Junqiao Wang, Ziwei Ji, Meng Cao, Rongzhao Zhang, Lynn Ai, Eric Yang, Tianyu Shi, and Lei Yu. Dynaweb: Model-based reinforcement learning of web agents.arXiv preprint arXiv:2601.22149,

  16. [16]

    Agent-world: Scaling real-world environment synthesis for evolving general agent intelligence.arXiv preprint arXiv:2604.18292,

    Guanting Dong, Junting Lu, Junjie Huang, Wanjun Zhong, Longxiang Liu, Shijue Huang, Zhenyu Li, Yang Zhao, Xiaoshuai Song, Xiaoxi Li, et al. Agent-world: Scaling real-world environment synthesis for evolving general agent intelligence.arXiv preprint arXiv:2604.18292,

  17. [17]

    Webfactory: Automated compression of foundational language intelligence into grounded web agents.arXiv preprint arXiv:2603.05044,

    Sicheng Fan, Qingyun Shi, Shengze Xu, Shengbo Cai, Tieyong Zeng, Li Ling, Yanyi Shang, and Dehan Kong. Webfactory: Automated compression of foundational language intelligence into grounded web agents.arXiv preprint arXiv:2603.05044,

  18. [18]

    Ssrl: Self-search reinforcement learning.arXiv preprint arXiv:2508.10874,

    Yuchen Fan, Kaiyan Zhang, Heng Zhou, Yuxin Zuo, Yanxu Chen, Yu Fu, Xinwei Long, Xuekai Zhu, Che Jiang, Yuchen Zhang, et al. Ssrl: Self-search reinforcement learning.arXiv preprint arXiv:2508.10874,

  19. [19]

    Towards general agentic intelligence via environment scaling.arXiv preprint arXiv:2509.13311,

    Runnan Fang, Shihao Cai, Baixuan Li, Jialong Wu, Guangyu Li, Wenbiao Yin, Xinyu Wang, Xiaobin Wang, Liangcai Su, Zhen Zhang, et al. Towards general agentic intelligence via environment scaling.arXiv preprint arXiv:2509.13311,

  20. [20]

    davinci-env: Open swe environment synthesis at scale.arXiv preprint arXiv:2603.13023,

    Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang, Jie Sun, Ji Zeng, Mohan Jiang, Lin Zhang, Yukun Li, et al. davinci-env: Open swe environment synthesis at scale.arXiv preprint arXiv:2603.13023,

  21. [21]

    Endless terminals: Scaling rl environments for terminal agents.arXiv preprint arXiv:2601.16443,

    Kanishk Gandhi, Shivam Garg, Noah D Goodman, and Dimitris Papailiopoulos. Endless terminals: Scaling rl environments for terminal agents.arXiv preprint arXiv:2601.16443,

  22. [22]

    Websynthesis: World-model-guided mcts for efficient webui-trajectory synthesis.arXiv preprint arXiv:2507.04370,

    Yifei Gao, Junhong Ye, Jiaqi Wang, and Jitao Sang. Websynthesis: World-model-guided mcts for efficient webui-trajectory synthesis.arXiv preprint arXiv:2507.04370,

  23. [23]

    Is your llm secretly a world model of the internet? model-based planning for web agents.arXiv preprint arXiv:2411.06559,

    Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, et al. Is your llm secretly a world model of the internet? model-based planning for web agents.arXiv preprint arXiv:2411.06559,

  24. [24]

    Computer-using world model.arXiv preprint arXiv:2602.17365,

    Yiming Guan, Rui Yu, John Zhang, Lu Wang, Chaoyun Zhang, Liqun Li, Bo Qiao, Si Qin, He Huang, Fangkai Yang, et al. Computer-using world model.arXiv preprint arXiv:2602.17365,

  25. [25]

    Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746,

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746,

  26. [26]

    Genenv: Difficulty-aligned co-evolution between llm agents and environment simulators.arXiv preprint arXiv:2512.19682, 2025a

    Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang, Xinzhe Juan, Jiahao Qiu, Ke Shen, and Mengdi Wang. Genenv: Difficulty-aligned co-evolution between llm agents and environment simulators.arXiv preprint arXiv:2512.19682, 2025a. Shangmin Guo, Omar Darwiche Domingues, Raphaël Avalos, Aaron Courville, and Florian Strub. World modelling improves la...

  27. [27]

    Training agents inside of scalable world models

    Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527,

  28. [28]

    Ee-mcp: Self-evolving mcp-gui agents via automated environment generation and experience learning.arXiv preprint arXiv:2604.09815,

    Tiantian He, Yihang Chen, Keyue Jiang, Ka Yiu Lee, Kaiwen Zhou, Kun Shao, and Shuai Wang. Ee-mcp: Self-evolving mcp-gui agents via automated environment generation and experience learning.arXiv preprint arXiv:2604.09815,

  29. [29]

    World model for robot learning: A comprehensive survey.arXiv preprint arXiv:2605.00080,

    Bohan Hou, Gen Li, Jindou Jia, Tuo An, Xinying Guo, Sicong Leng, Haoran Geng, Yanjie Ze, Tatsuya Harada, Philip Torr, et al. World model for robot learning: A comprehensive survey.arXiv preprint arXiv:2605.00080,

  30. [30]

    Text2world: Benchmarking large language models for symbolic world model generation

    Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Yao Mu, Hongyuan Zhang, Wenqi Shao, and Ping Luo. Text2world: Benchmarking large language models for symbolic world model generation. InFindings of the Association for Computational Linguistics: ACL 2025, pp. 26043–26066, 2025a. Mengkang Hu, Bowei Xia, Yuran Wu, Ailing Yu, Yude Zou, ...

  31. [31]

    Environment scaling for interactive agentic experience collection: A survey.arXiv preprint arXiv:2511.09586, 2025b

    Yuchen Huang, Sijia Li, Minghao Liu, Wei Liu, Shijue Huang, Zhiyuan Fan, Hou Pong Chan, and Yi R Fung. Environment scaling for interactive agentic experience collection: A survey.arXiv preprint arXiv:2511.09586, 2025b. Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models r...

  32. [32]

    Andrej Karpathy

    URL https://arxiv.org/abs/2605.06761. Andrej Karpathy. autoresearch.https://github.com/karpathy/autoresearch,

  33. [33]

    Generative visual code mobile world models.arXiv preprint arXiv:2602.01576,

    Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, and Jamin Shin. Generative visual code mobile world models.arXiv preprint arXiv:2602.01576,

  34. [34]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun et al. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27.Open Review, 62(1):1–62,

  35. [35]

    Code world models for general game playing.arXiv preprint arXiv:2510.04542,

    Wolfgang Lehrach, Daniel Hennes, Miguel Lazaro-Gredilla, Xinghua Lou, Carter Wendelken, Zun Li, Antoine Dedieu, Jordi Grau-Moya, Marc Lanctot, Atil Iscen, et al. Code world models for general game playing.arXiv preprint arXiv:2510.04542,

  36. [36]

    Worldllm: Improving llms’ world modeling using curiosity-driven theory-making.arXiv preprint arXiv:2506.06725,

    Guillaume Levy, Cédric Colas, Pierre-Yves Oudeyer, Thomas Carta, and Clément Romac. Worldllm: Improving llms’ world modeling using curiosity-driven theory-making.arXiv preprint arXiv:2506.06725,

  37. [37]

    The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution.arXiv preprint arXiv:2510.25726, 2025a

    Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, et al. The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution.arXiv preprint arXiv:2510.25726, 2025a. Wanli Li, Bince Qu, Bo Pan, Jianyu Zhang, Zheng Liu, Pan Zhang, Wei Chen, and Bo Zhang....

  38. [38]

    Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868,

    Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868,

  39. [39]

    Transformers are sample-efficient world models

    Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models. arXiv preprint arXiv:2209.00588,

  40. [40]

    Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez

    Open-source personal AI assistant, version 2026.3.8, accessed 2026-03-09. Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. InForty-second International Conference on Machine Learning,

  41. [41]

    Gdpval: Evaluating ai model performance on real-world economically valuable tasks.arXiv preprint arXiv:2510.04374,

    Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, et al. Gdpval: Evaluating ai model performance on real-world economically valuable tasks.arXiv preprint arXiv:2510.04374,

  42. [42]

    Current agents fail to leverage world model as tool for foresight.arXiv preprint arXiv:2601.03905,

    Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang, Bingxiang He, Qinyu Luo, Dilek Hakkani-Tür, Gokhan Tur, Yunzhu Li, et al. Current agents fail to leverage world model as tool for foresight.arXiv preprint arXiv:2601.03905,

  43. [43]

    Self-improving world modelling with latent actions.arXiv preprint arXiv:2602.06130,

    Yifu Qiu, Zheng Zhao, Waylon Li, Yftah Ziser, Anna Korhonen, Shay B Cohen, and Edoardo M Ponti. Self-improving world modelling with latent actions.arXiv preprint arXiv:2602.06130,

  44. [44]

    Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026a

    Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all, April 2026a. URL https: //qwen.ai/blog?id=qwen3.6-35b-a3b. Qwen Team. Qwen3.6-Max-Preview: Smarter, sharper, still evolving, April 2026b. URL https://qwen.a i/blog?id=qwen3.6-max-preview. Qwen Team. Qwen3.6-Plus: Towards real world agents, April 2026c. URL https://qwen.ai/blog?id=qw en3.6....

  45. [45]

    Aligning agentic world models via knowledgeable experience learning.arXiv preprint arXiv:2601.13247,

    Baochang Ren, Yunzhi Yao, Rui Sun, Shuofei Qiao, Ningyu Zhang, and Huajun Chen. Aligning agentic world models via knowledgeable experience learning.arXiv preprint arXiv:2601.13247,

  46. [46]

    General agents contain world models

    Jonathan Richens, David Abel, Alexis Bellot, and Tom Everitt. General agents contain world models. arXiv preprint arXiv:2506.01622,

  47. [47]

    Rethinking rubric generation for improving llm judge and reward modeling for open-ended tasks.arXiv preprint arXiv:2602.05125, 2026a

    William F Shen, Xinchi Qiu, Chenxi Whitehouse, Lisa Alazraki, Shashwat Goel, Francesco Barbieri, Timon Willi, Akhil Mathur, and Ilias Leontiadis. Rethinking rubric generation for improving llm judge and reward modeling for open-ended tasks.arXiv preprint arXiv:2602.05125, 2026a. Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li, and Shengyu Zh...

  48. [48]

    Envs- caler: Scaling tool-interactive environments for llm agent via programmatic synthesis.arXiv preprint arXiv:2601.05808,

    Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Ji-Rong Wen, and Zhicheng Dou. Envs- caler: Scaling tool-interactive environments for llm agent via programmatic synthesis.arXiv preprint arXiv:2601.05808,

  49. [49]

    Scaling agents via continual pre-training.arXiv preprint arXiv:2509.13310,

    Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, et al. Scaling agents via continual pre-training.arXiv preprint arXiv:2509.13310,

  50. [50]

    Zerosearch: Incentivize the search capability of llms without searching

    Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. Zerosearch: Incentivize the search capability of llms without searching. arXiv preprint arXiv:2505.04588,

  51. [51]

    Swe-world: Building software engineering agents in docker-free environments.arXiv preprint arXiv:2602.03419,

    Shuang Sun, Huatong Song, Lisheng Huang, Jinhao Jiang, Ran Le, Zhihao Lv, Zongchao Chen, Yiwen Hu, Wenyang Luo, Wayne Xin Zhao, et al. Swe-world: Building software engineering agents in docker-free environments.arXiv preprint arXiv:2602.03419,

  52. [52]

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al

    URLhttps://arxiv.org/abs/2604.17739. Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276,

  53. [53]

    Xiaoyu Tian, Haotian Wang, Shuaiting Chen, Hao Zhou, Kaichi Yu, Yudian Zhang, Jade Ouyang, Junxi Yin, Jiong Chen, Baoyan Guo, et al

    URLgithub.com/SKYLENAGE-AI/QwenClawBench. Xiaoyu Tian, Haotian Wang, Shuaiting Chen, Hao Zhou, Kaichi Yu, Yudian Zhang, Jade Ouyang, Junxi Yin, Jiong Chen, Baoyan Guo, et al. Astra: Automated synthesis of agentic trajectories and reinforcement arenas.arXiv preprint arXiv:2601.21558,

  54. [54]

    Scaleenv: Scaling environment synthesis from scratch for generalist interactive tool-use agent training.arXiv preprint arXiv:2602.06820,

    Dunwei Tu, Hongyan Hao, Hansi Yang, Yihao Chen, Yi-Kai Zhang, Zhikang Xia, Yu Yang, Yueqing Sun, Xingchen Liu, Furao Shen, et al. Scaleenv: Scaling environment synthesis from scratch for generalist interactive tool-use agent training.arXiv preprint arXiv:2602.06820,

  55. [55]

    Diffusion models are real-time game engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. InInternational Conference on Learning Representations, volume 2025, pp. 73754–73776,

  56. [56]

    Reward hacking in the era of large models: Mechanisms, emergent misalignment, challenges.arXiv preprint arXiv:2604.13602, 2026b

    Xiaohua Wang, Muzhao Tian, Yuqi Zeng, Zisu Huang, Jiakang Yuan, Bowen Chen, Jingwen Xu, Mingbo Zhou, Wenhao Liu, Muling Wu, et al. Reward hacking in the era of large models: Mechanisms, emergent misalignment, challenges.arXiv preprint arXiv:2604.13602, 2026b. Yiming Wang, Da Yin, Yuedong Cui, Ruichen Zheng, Zhiqian Li, Zongyu Lin, Di Wu, Xueqing Wu, Chenc...

  57. [57]

    Jialong Wu, Shaofeng Yin, Ningya Feng, and Mingsheng Long

    URL https: //www.worldlabs.ai/blog/marble-world-model. Jialong Wu, Shaofeng Yin, Ningya Feng, and Mingsheng Long. Rlvr-world: Training world models with reinforcement learning.Advances in Neural Information Processing Systems, 38:125312–125350, 2026a. Yifan Wu, Yiran Peng, Yiyu Chen, Jianhao Ruan, Zijie Zhuang, Cheng Yang, Jiayi Zhang, Man Chen, Yenchi Ts...

  58. [58]

    Pan: A world model for general, interactable, and long-horizon world simulation.arXiv preprint arXiv:2511.09057, 2025a

    Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, et al. Pan: A world model for general, interactable, and long-horizon world simulation.arXiv preprint arXiv:2511.09057, 2025a. Jiannan Xiang, Yun Zhu, Lei Shu, Maria Wang, Lijun Yu, Gabriel Barcik, James Lyon, Srinivas Sunkara, and Jindong Che...

  59. [59]

    Scaler: Synthetic scalable adaptive learning environment for reasoning.arXiv preprint arXiv:2601.04809, 2026a

    Caijun Xu, Changyi Xiao, Zhongyuan Peng, Xinrun Wang, and Yixin Cao. Scaler: Synthetic scalable adaptive learning environment for reasoning.arXiv preprint arXiv:2601.04809, 2026a. Siyuan Xu, Shiyang Li, Xin Liu, Tianyi Liu, Yixiao Li, Zhan Shi, Zixuan Zhang, Zilong Wang, Qingyu Yin, Jianshu Chen, et al. Controllable and verifiable tool-use data synthesis ...

  60. [60]

    World models as an intermediary between agents and the real world.arXiv preprint arXiv:2602.00785,

    Sherry Yang. World models as an intermediary between agents and the real world.arXiv preprint arXiv:2602.00785,

  61. [61]

    Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114,

    Sherry Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Leslie Kaelbling, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114,

  62. [62]

    Claw-eval: Toward trustworthy evaluation of autonomous agents.arXiv preprint arXiv:2604.06132, 2026a

    Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, et al. Claw-eval: Toward trustworthy evaluation of autonomous agents.arXiv preprint arXiv:2604.06132, 2026a. Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan...

  63. [63]

    Glm-5: from vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763,

    Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: from vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763,

  64. [64]

    Agent learning via early experience.arXiv preprint arXiv:2510.08558,

    33 Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, et al. Agent learning via early experience.arXiv preprint arXiv:2510.08558,

  65. [65]

    Infiniteweb: Scalable web environment synthesis for gui agent training.arXiv preprint arXiv:2601.04126,

    Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li, Bin Li, and Yan Lu. Infiniteweb: Scalable web environment synthesis for gui agent training.arXiv preprint arXiv:2601.04126,

  66. [66]

    Immersion in the github universe: Scaling coding agents to mastery

    Jiale Zhao, Guoxin Chen, Fanzhe Meng, Minghao Li, Jie Chen, Hui Xu, Yongshuai Sun, Wayne Xin Zhao, Ruihua Song, Yuan Zhang, et al. Immersion in the github universe: Scaling coding agents to mastery. arXiv preprint arXiv:2602.09892,

  67. [67]

    Group sequence policy optimization.arXiv preprint arXiv:2507.18071,

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071,

  68. [68]

    Code2world: A gui world model via renderable code generation.arXiv preprint arXiv:2602.09856,

    Yuhao Zheng, Li’an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, and Kevin Qinghong Lin. Code2world: A gui world model via renderable code generation.arXiv preprint arXiv:2602.09856,

  69. [69]

    Synthetic sandbox for training machine learning engineering agents.arXiv preprint arXiv:2604.04872,

    Yuhang Zhou, Lizhu Zhang, Yifan Wu, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao, and Hong Yan. Synthetic sandbox for training machine learning engineering agents.arXiv preprint arXiv:2604.04872,

  70. [70]

    Termigen: High-fidelity environment and robust trajectory synthesis for terminal agents.arXiv preprint arXiv:2602.07274,

    Kaijie Zhu, Yuzhou Nie, Yijiang Li, Yiming Huang, Jialian Wu, Jiang Liu, Ximeng Sun, Zhenfei Yin, Lun Wang, Zicheng Liu, et al. Termigen: High-fidelity environment and robust trajectory synthesis for terminal agents.arXiv preprint arXiv:2602.07274,

  71. [71]

    Planner-r1: Reward shaping enables efficient agentic rl with smaller llms

    Siyu Zhu, Yanbin Jiang, Hejian Sang, Shao Tang, Qingquan Song, Biao He, Rohit Jain, Zhipeng Wang, and Alborz Geramifard. Planner-r1: Reward shaping enables efficient agentic rl with smaller llms. arXiv preprint arXiv:2509.25779,

  72. [72]

    File ▸ Print

    34 A Domain Interaction Examples Figure 14 and 15 show representative interaction examples across all seven domains covered by Qwen- AgentWorld. Each panel shows one agent–environment turn: the action issued by the agent (top) and the simulated observation produced by Qwen-AgentWorld (bottom). OS 12.2 k tokens ↗Qwen-AgentW orld Input (Action) click(target...