pith. machine review for the scientific record.

arxiv: 2601.22149 · v2 · submitted 2026-01-29 · 💻 cs.CL · cs.AI

Recognition: 1 theorem link · Lean Theorem

DynaWeb: Model-Based Reinforcement Learning of Web Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:31 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: web agents · model-based reinforcement learning · world model · LLM agents · WebArena · WebVoyager · reinforcement learning · simulated rollouts

The pith

DynaWeb trains web agents by learning a world model that simulates page responses to actions for efficient reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DynaWeb is a model-based reinforcement learning framework that first trains a world model to predict naturalistic web page representations from agent actions. This model then serves as a synthetic environment in which policies can generate large numbers of rollout trajectories without touching the live internet. Real expert trajectories are randomly interleaved with these on-policy rollouts during training to maintain stability. Experiments on the WebArena and WebVoyager benchmarks show consistent gains for current open-source web agent models. The work establishes that training through imagination is a viable route to scaling online agentic RL.

Core claim

The paper introduces DynaWeb, a novel MBRL framework that trains a web world model to predict naturalistic page representations given agent actions. This model functions as a synthetic web environment in which an agent policy can generate vast quantities of rollout trajectories for efficient online reinforcement learning, with real expert trajectories randomly interleaved to improve stability and sample efficiency, yielding significant performance improvements on WebArena and WebVoyager.

What carries the argument

The web world model that predicts naturalistic page representations to support simulated policy rollouts and imagination-based training.
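
To make the mechanism concrete, the imagination loop described above might look like the following sketch. The WorldModel and Policy interfaces are illustrative stubs assumed for this sketch, not the paper's actual API.

```python
# Minimal sketch of DynaWeb-style imagination training (hypothetical API).
import random

class WorldModel:
    """Stub: predicts the next page representation for a (page, action) pair."""
    def reset(self, task):
        return f"<page for {task}>"
    def step(self, page, action):
        next_page = f"<page after {action}>"
        return next_page, 0.0, action == "stop"   # next state, reward, done

class Policy:
    """Stub: picks an action given the current page representation."""
    def act(self, task, page):
        return random.choice(["click", "type", "scroll", "stop"])

def imagined_rollout(world_model, policy, task, max_steps=15):
    """Roll the policy inside the learned world model: no live web access."""
    page, trajectory = world_model.reset(task), []
    for _ in range(max_steps):
        action = policy.act(task, page)
        next_page, reward, done = world_model.step(page, action)
        trajectory.append((page, action, reward))
        page = next_page
        if done:
            break
    return trajectory

def training_batch(world_model, policy, tasks, expert_trajs, expert_frac=0.2):
    """Mix on-policy imagined rollouts with a slice of real expert trajectories."""
    batch = [imagined_rollout(world_model, policy, t) for t in tasks]
    n_expert = min(len(expert_trajs), max(1, int(expert_frac * len(batch))))
    batch += random.sample(expert_trajs, n_expert)
    random.shuffle(batch)
    return batch  # consumed by the policy-gradient update
```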

If this is right

  • Web agent training requires far fewer live internet interactions, lowering cost and risk.
  • The quantity of training trajectories can be scaled arbitrarily through simulation.
  • Interleaving expert data stabilizes learning and improves sample efficiency.
  • The same framework delivers measurable gains to existing state-of-the-art open-source models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar learned world models could support efficient training in other interactive digital environments such as desktop applications or mobile UIs.
  • Higher-fidelity page prediction might further reduce any remaining sim-to-real gap.
  • The method provides a practical route toward safer, lower-cost development of autonomous web assistants.

Load-bearing premise

The learned world model produces page representations realistic enough that policies trained inside the simulation transfer to real web environments without large distribution shift.

What would settle it

Agents trained using only DynaWeb-generated rollouts perform no better than, or worse than, agents trained exclusively on real trajectories when evaluated on the WebArena or WebVoyager benchmarks.
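
That head-to-head could be scored with a simple paired bootstrap over benchmark tasks. A minimal sketch, assuming per-task binary success records for the two training regimes (all inputs hypothetical):

```python
# Minimal sketch: bootstrap comparison of per-task success between an agent
# trained only on simulated rollouts and one trained only on real trajectories.
import numpy as np

def success_rate_gap(sim_success, real_success, n_boot=10_000, seed=0):
    """Observed success-rate gap plus a 95% bootstrap CI over paired tasks."""
    sim = np.asarray(sim_success, dtype=float)    # 1 = task solved, 0 = failed
    real = np.asarray(real_success, dtype=float)  # same tasks, same order
    rng = np.random.default_rng(seed)
    n = len(sim)
    gaps = [sim[idx].mean() - real[idx].mean()
            for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    lo, hi = np.percentile(gaps, [2.5, 97.5])
    return sim.mean() - real.mean(), (lo, hi)

# A CI at or below zero would indicate no advantage from simulation-only training.
```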

Figures

Figures reproduced from arXiv: 2601.22149 by Eric Yang, Hang Ding, Junqiao Wang, Lei Yu, Lynn Ai, Meng Cao, Peidong Liu, Rongzhao Zhang, Tianyu Shi, Ziwei Ji.

Figure 1: Comparison between traditional web agent training via live web interaction and DynaWeb. By replacing risky and inefficient real-world interaction with a learned web world model, DynaWeb enables imagination-driven training using virtual pages and dreamed trajectories, optionally augmented with real expert data, resulting in safer and more efficient agent optimization. During training, agents may trigger ir…
Figure 2: Overview of DynaWeb. DynaWeb trains web agents via imagination-driven, model-based reinforcement learning. A learned web world model serves as a synthetic environment, enabling the agent to generate multi-step imagined rollouts without interacting with the live web. These imagined trajectories are mixed with a small fraction of real expert trajectories to stabilize learning. The agent policy is optimized u…
Figure 4: Effect of world model training on downstream agent performance. Success rate (%) comparing a supervised task-specific world model (DynaWeb WM) and a frozen general-purpose LLM (GPT-oss-120b).

Benchmark           DynaWeb WM   GPT-oss-120b
WebArena (Sim.)     31.0         20.9
WebVoyager (Live)   35.4         28.6
Figure 5: System prompt used for training and evaluation of the WebArena agent. A.2 World Model System Prompt: We provide the full system prompt used by the web world model. The prompt specifies the input information available to the model, including the user objective, current webpage state, and executed actions, as well as the required output format for predicting web state changes and the resulting next-step acces…
Figure 6: System prompt used to train the web world model for predicting next-step accessibility trees from actions and current observations.
read the original abstract

The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents through interacting with a web world model trained to predict naturalistic web page representations given agent actions. This model serves as a synthetic web environment where an agent policy can dream by generating vast quantities of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on-policy rollouts during training to improve stability and sample efficiency. Experiments conducted on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering a scalable and efficient way to scale up online agentic RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces DynaWeb, a model-based RL framework for web agents in which a learned world model predicts naturalistic page representations from actions to support simulated policy rollouts; these rollouts are randomly interleaved with real expert trajectories during training to stabilize learning, and the resulting agents are shown to outperform prior open-source models on the WebArena and WebVoyager benchmarks.

Significance. If the central empirical claim holds, the work demonstrates a practical route to scaling online RL for web agents by replacing costly live-environment interactions with imagination-based training while mitigating distribution shift through expert-data interleaving. This addresses a key bottleneck in agentic RL and could generalize to other high-cost interaction domains.

major comments (2)
  1. [Abstract] Abstract: the headline claim that DynaWeb 'consistently and significantly improves' performance on WebArena and WebVoyager is presented without any reported error bars, statistical significance tests, or ablation isolating the MBRL component from the expert-trajectory interleaving; this information is load-bearing for the central claim that the world-model rollouts are responsible for the gains.
  2. [World-model section] World-model section (description of training objective and simulation): no quantitative validation of simulation fidelity is supplied, such as next-state prediction error on held-out real trajectories, KL divergence or other distributional metrics between simulated and real page representations, or an ablation of pure-simulated versus mixed versus pure-expert training; without these checks the transfer assumption remains unverified and the reported improvements could be driven primarily by the expert data.
minor comments (1)
  1. [Abstract] The informal phrase 'dream by generating vast quantities of rollout action trajectories' could be replaced by a more precise term such as 'simulate' to maintain technical tone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that additional statistical rigor and world-model validation would strengthen the central claims. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim that DynaWeb 'consistently and significantly improves' performance on WebArena and WebVoyager is presented without any reported error bars, statistical significance tests, or ablation isolating the MBRL component from the expert-trajectory interleaving; this information is load-bearing for the central claim that the world-model rollouts are responsible for the gains.

    Authors: We agree that the abstract claim would benefit from explicit statistical support. In the revised manuscript we will report mean performance with standard deviations across multiple random seeds for all main results, include pairwise statistical significance tests (e.g., paired t-tests or Wilcoxon tests) against baselines, and add a dedicated ablation table that isolates the contribution of model-based rollouts from expert-trajectory interleaving. These additions will be placed in the Experiments section and referenced from the abstract. revision: yes
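
For illustration only, the promised seed-level comparison could be run along these lines; the per-seed success rates below are hypothetical placeholders, not reported results.

```python
# Minimal sketch of paired seed-level significance tests (hypothetical data).
from scipy import stats

dynaweb_runs  = [31.0, 30.2, 31.8, 30.7, 31.4]   # success rate (%) per seed
baseline_runs = [27.1, 26.5, 28.0, 27.4, 26.9]   # same seeds, same tasks

t_stat, t_p = stats.ttest_rel(dynaweb_runs, baseline_runs)   # paired t-test
w_stat, w_p = stats.wilcoxon(dynaweb_runs, baseline_runs)    # non-parametric check
print(f"paired t-test p={t_p:.4f}, Wilcoxon p={w_p:.4f}")
```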

  2. Referee: [World-model section] World-model section (description of training objective and simulation): no quantitative validation of simulation fidelity is supplied, such as next-state prediction error on held-out real trajectories, KL divergence or other distributional metrics between simulated and real page representations, or an ablation of pure-simulated versus mixed versus pure-expert training; without these checks the transfer assumption remains unverified and the reported improvements could be driven primarily by the expert data.

    Authors: We acknowledge that quantitative fidelity metrics were not reported. In the revision we will add (i) next-state prediction error (L2 or cross-entropy) on held-out real trajectories, (ii) distributional metrics including KL divergence between simulated and real page-representation distributions, and (iii) an explicit ablation comparing pure-simulated rollouts, pure-expert trajectories, and the mixed schedule. These results will be presented in a new subsection of the World-Model section to directly verify the transfer assumption. revision: yes
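
These fidelity checks could be prototyped roughly as follows; the embedding arrays and scalar page features are illustrative stand-ins, not the authors' actual representations.

```python
# Minimal sketch of world-model fidelity metrics on held-out transitions:
# (i) next-state prediction error, (ii) KL divergence between feature
# distributions of real vs. simulated page representations.
import numpy as np
from scipy.stats import entropy

def next_state_l2(pred_states, true_states):
    """Mean L2 error between predicted and real next-state embeddings."""
    pred, true = np.asarray(pred_states), np.asarray(true_states)
    return float(np.linalg.norm(pred - true, axis=-1).mean())

def representation_kl(real_features, sim_features, bins=50):
    """KL(real || sim) over smoothed histograms of a scalar page feature."""
    real = np.asarray(real_features, dtype=float)
    sim = np.asarray(sim_features, dtype=float)
    lo, hi = min(real.min(), sim.min()), max(real.max(), sim.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi))
    q, _ = np.histogram(sim, bins=bins, range=(lo, hi))
    p = (p + 1e-8) / (p + 1e-8).sum()             # smooth, then normalize
    q = (q + 1e-8) / (q + 1e-8).sum()
    return float(entropy(p, q))
```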

Circularity Check

0 steps flagged

No significant circularity: performance gains derived from external benchmarks

full rationale

The paper trains a world model to predict page representations from actions, generates simulated rollouts, interleaves them with real expert trajectories for policy optimization, and reports improvements on the held-out WebArena and WebVoyager benchmarks. No equations, definitions, or self-citations reduce the reported gains to quantities fitted from the same evaluation data by construction; the central empirical claim remains independent of the training inputs and does not rely on renaming fitted parameters as predictions or importing uniqueness from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unstated assumption that a learned dynamics model can produce rollouts whose distribution is close enough to real web pages for policy improvement to transfer. No free parameters are named in the abstract; the web world model itself is the single invented entity.

axioms (1)
  • domain assumption: A sufficiently accurate world model of web-page transitions exists and can be learned from limited interaction data.
    Invoked implicitly when the paper states the model enables 'vast quantities of rollout action trajectories' that improve real performance.
invented entities (1)
  • Web world model (no independent evidence)
    purpose: Synthetic environment for policy dreaming and rollouts
    New component introduced to replace live internet interaction; no independent evidence of its fidelity is supplied in the abstract.

pith-pipeline@v0.9.0 · 5539 in / 1321 out tokens · 29950 ms · 2026-05-16T09:31:03.556388+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

    cs.AI · 2026-04 · unverdicted · novelty 5.0

    The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
