pith. sign in

arxiv: 2507.23773 · v3 · pith:7ZI2W2HLnew · submitted 2025-07-31 · 💻 cs.AI · cs.CL· cs.LG· cs.RO

General Agentic Planning Through Simulative Reasoning with World Models

Pith reviewed 2026-05-22 12:33 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LGcs.RO
keywords simulative reasoningworld modelsagentic planningLLM agentsweb navigationcounterfactual evaluationSiRA architecture
0
0 comments X

The pith

Simulative reasoning with world models enables general agentic planning beyond reactive policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that agentic systems currently rely on reactive decision-making that requires task-specific engineering and lacks explicit modeling of future outcomes. It proposes that simulative reasoning, in which an internal world model predicts consequences of candidate actions, supplies a more general-purpose planning mechanism that supports flexible, goal-directed behavior across contexts. To test this, the authors present SiRA, an architecture that maintains natural-language belief states inside an LLM-based world model and uses counterfactual evaluation to choose actions. Evaluations in a web-browser setting across constrained navigation, multi-hop information aggregation, and general instruction following show consistent gains over matched reactive baselines. A sympathetic reader would care because the approach suggests that shared simulative capacity can transfer across tasks rather than demanding re-engineering for each new domain.

Core claim

SiRA instantiates simulative reasoning by using an LLM-based world model with natural-language belief states to predict future outcomes of candidate actions before selecting the next step. Across three qualitatively distinct task categories in a web-browser environment, this produces up to 124% higher task completion rates than a matched reactive baseline and raises constrained navigation success from 0% to 32.2% compared with a representative open-web agent. The persistent advantage across task types indicates that the benefit arises from generalizable counterfactual evaluation rather than task-specific tuning.

What carries the argument

SiRA, the Simulative Reasoning Architecture, which maintains natural-language belief states inside an LLM-based world model to support counterfactual evaluation of candidate actions before execution.

If this is right

  • Agents achieve higher success rates on complex web tasks by grounding each decision in predicted future states rather than immediate pattern matching.
  • The same reasoning capacity transfers across constrained navigation, multi-hop aggregation, and general instruction following without per-task re-engineering.
  • Constrained navigation tasks become solvable where purely reactive methods achieve zero success.
  • Natural-language belief states make the agent's internal predictions inspectable and reusable across goals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If world-model accuracy improves, the same architecture could transfer to non-web environments such as code execution or robotic control.
  • The separation of world-model prediction from action selection suggests a route for combining simulative reasoning with learned policies or search algorithms.
  • Natural-language state representations may allow human users to audit or correct the agent's internal model before execution.

Load-bearing premise

The LLM-based world model produces sufficiently accurate natural-language predictions of future states to support effective counterfactual evaluation in the tested web-browser tasks.

What would settle it

Replacing the world-model predictions with inaccurate or random forecasts and observing that task-completion rates fall to the level of the reactive baseline or below.

Figures

Figures reproduced from arXiv: 2507.23773 by Eric Xing, Jinyu Hou, Mingkai Deng, Zhiting Hu.

Figure 1
Figure 1. Figure 1: Demo of tasks performed using a web-browsing agent built on [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: A possible definition of an optimal agent [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: An agent in real world where groundtruth world state and universe are unavailable to [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Optimal agent architecture design with conditional probability annotation [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: LLM-based implementation of our proposed agent model for web-related tasks (e.g. multi [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Overview of performance comparison between [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Faced with the question “What is the earliest-arriving flight tomorrow from Pittsburgh to [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Illustration of the data generation process for the FlightQA dataset. We first prompt a LLM [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: % correct and % response returned vs. number of constraints for FlightQA. Based on our [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗
read the original abstract

What does it mean to plan? Current agentic systems, whether scaffolded workflows or end-to-end policies, rely on reactive decision-making: selecting the next action via a fixed procedure with at most undifferentiated adaptive computation (e.g., chain-of-thought) lacking explicit modeling of future outcomes. This limits generalizability, as each new task demands re-engineering rather than transfer of shared reasoning capacity. Humans, by contrast, plan by mentally simulating consequences of candidate actions within an internal world model, a capacity known as simulative reasoning (System II) that supports flexible, goal-directed behavior across diverse contexts. We argue that simulative reasoning through a world model provides a general-purpose planning mechanism for agentic systems, improving upon reactive policies (System I) by grounding decisions in predicted future states rather than pattern-matched responses. To verify this, we introduce SiRA (Simulative Reasoning Architecture), a goal-oriented architecture instantiating simulative reasoning using an LLM-based world model with natural-language belief states, while remaining model-agnostic. We evaluate across three qualitatively distinct task categories: constrained navigation, multi-hop information aggregation, and general instruction following, in a web-browser environment. Across all categories, simulative reasoning achieves up to 124% higher task completion rates than a matched reactive baseline, and increases constrained navigation success from 0% to 32.2% compared to a representative open-web agent. The persistent advantage across distinct task types suggests the benefit stems from generalizable counterfactual evaluation rather than task-specific tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SiRA (Simulative Reasoning Architecture), a model-agnostic goal-oriented framework that performs agentic planning via simulative reasoning over an LLM-based world model with natural-language belief states. It contrasts this explicit counterfactual evaluation of future outcomes against reactive (System I) policies that rely on pattern-matched next-action selection without internal simulation. The approach is evaluated on three distinct task categories in a web-browser environment—constrained navigation, multi-hop information aggregation, and general instruction following—reporting up to 124% higher task completion rates than a matched reactive baseline and lifting constrained navigation success from 0% to 32.2% relative to a representative open-web agent. The authors argue that the persistent cross-category gains indicate a generalizable benefit from grounded simulation rather than task-specific tuning.

Significance. If the performance advantages can be causally attributed to accurate simulative reasoning rather than ancillary factors, the work offers a promising general-purpose mechanism for improving transfer and flexibility in agentic systems beyond current scaffolded or end-to-end reactive approaches. The model-agnostic design and evaluation across qualitatively different task types are constructive elements that support broader applicability claims.

major comments (2)
  1. [Experimental results and analysis sections] Experimental results and analysis sections: No direct measurement or error analysis of the LLM world model's natural-language future-state prediction accuracy (e.g., state match rate, trajectory-level fidelity, or held-out prediction error) is reported. This is load-bearing for the central claim that gains arise from reliable counterfactual evaluation, because without evidence that the predicted states are sufficiently faithful to actual dynamic web-browser states, the 124% completion improvement and 0%-to-32.2% navigation lift could instead result from extra inference steps, prompt structure, or implicit search.
  2. [Baseline and evaluation protocol] Baseline and evaluation protocol: The manuscript provides insufficient detail on run counts, variance, exact matching of reactive baselines (including total LLM calls and prompt length), and controls for model scale or compute budget. These omissions undermine interpretation of the reported percentage gains, as simulative reasoning inherently involves additional forward simulation steps whose contribution must be isolated from confounding factors.
minor comments (2)
  1. [Method] Notation for belief states and world-model outputs could be formalized more explicitly (e.g., with a consistent mathematical description) to aid reproducibility.
  2. [Results figures] Figure captions and axis labels in the results figures would benefit from explicit indication of error bars or trial counts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional evidence and transparency can strengthen the attribution of gains to simulative reasoning. We address each major comment below and will incorporate revisions to improve the manuscript.

read point-by-point responses
  1. Referee: Experimental results and analysis sections: No direct measurement or error analysis of the LLM world model's natural-language future-state prediction accuracy (e.g., state match rate, trajectory-level fidelity, or held-out prediction error) is reported. This is load-bearing for the central claim that gains arise from reliable counterfactual evaluation, because without evidence that the predicted states are sufficiently faithful to actual dynamic web-browser states, the 124% completion improvement and 0%-to-32.2% navigation lift could instead result from extra inference steps, prompt structure, or implicit search.

    Authors: We agree that direct metrics on the world model's predictive accuracy would provide stronger causal support for the role of simulative reasoning. While the manuscript prioritizes end-to-end task performance across diverse categories as evidence of generalizability, we recognize this as a valuable addition. In revision, we will add an analysis subsection reporting state prediction accuracy on held-out trajectories, including state match rates, trajectory-level fidelity, and examples of prediction errors. revision: yes

  2. Referee: Baseline and evaluation protocol: The manuscript provides insufficient detail on run counts, variance, exact matching of reactive baselines (including total LLM calls and prompt length), and controls for model scale or compute budget. These omissions undermine interpretation of the reported percentage gains, as simulative reasoning inherently involves additional forward simulation steps whose contribution must be isolated from confounding factors.

    Authors: We appreciate the call for greater experimental detail. The current version reports comparative results but omits full statistical and matching information. We will revise the evaluation section to specify the number of independent runs per task, report variance via standard deviations, confirm equivalent total LLM calls and prompt lengths for the reactive baseline, and explicitly note that all methods share the same underlying LLM to control for scale and compute budget. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical comparison to baselines is independent of fitted parameters or self-referential definitions

full rationale

The paper introduces the SiRA architecture to instantiate simulative reasoning via an LLM-based world model and reports empirical performance gains (up to 124% higher task completion, 0% to 32.2% navigation success) from direct comparison against matched reactive and open-web agent baselines across three task categories. No equations, fitted parameters, or self-citations are described that would reduce the reported advantage to a construction by definition or to a renamed input. The central claim rests on external experimental benchmarks rather than internal redefinition or load-bearing self-reference, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the premise that an LLM can serve as a usable world model for counterfactual simulation and that observed gains are attributable to this mechanism rather than implementation details.

axioms (1)
  • domain assumption Large language models can maintain accurate enough natural-language belief states to support useful simulative reasoning about future outcomes.
    Invoked when the architecture is instantiated with an LLM-based world model.
invented entities (1)
  • SiRA architecture no independent evidence
    purpose: Goal-oriented instantiation of simulative reasoning using LLM world models.
    New proposed system whose effectiveness is demonstrated only within the paper's experiments.

pith-pipeline@v0.9.0 · 5812 in / 1271 out tokens · 68453 ms · 2026-05-22T12:33:21.515999+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

    cs.AI 2026-05 unverdicted novelty 6.0

    SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.

  2. GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

    cs.AI 2026-04 unverdicted novelty 5.0

    The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 2 Pith papers · 3 internal anchors

  1. [1]

    Operator

    OpenAI. Computer -using agent (cua). https://openai.com/index/ computer-using-agent/, January 2025. Research preview of “Operator”, published January 23, 2025

  2. [2]

    Project mariner

    DeepMind. Project mariner. https://deepmind.google/models/project-mariner/,

  3. [3]

    Accessed: 2025-07-16

  4. [4]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023

  5. [5]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024

  6. [6]

    Introducing Deep Research

    OpenAI. Introducing Deep Research. https://openai.com/index/ introducing-deep-research/, February 2025. Deep Research agent release an- nouncement

  7. [7]

    computer use

    Anthropic. Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku. https: //www.anthropic.com/news/3-5-models-and-computer-use , October 22 2024. Public beta “computer use” feature for Claude 3.5 Sonnet

  8. [8]

    Gemini Deep Research: Your Personal Research Assistant

    Google. Gemini Deep Research: Your Personal Research Assistant. https://gemini. google/overview/deep-research/?hl=en, December 2024. Overview of Gemini Deep Research agent feature

  9. [9]

    O’Brien, Carrie J

    Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. 14

  10. [10]

    Cursor: The ai code editor

    Anysphere Inc. Cursor: The ai code editor. https://cursor.com, 2025. Accessed: 2025-07- 16

  11. [11]

    Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai soft...

  12. [12]

    Towards an AI co-scientist

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vi...

  13. [13]

    The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025

    Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025

  14. [14]

    Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems, 2025

    Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, Zhaoyang Yu, Haochen Shi, Boyan Li, Dekun Wu, Fengwei Teng, Xiaojun Jia, Jiawei Xu, Jinyu Xiang, Yizhang Lin, Tianmi...

  15. [15]

    Roy-Chowdhury, and Chengyu Song

    Trishna Chakraborty, Udita Ghosh, Xiaopan Zhang, Fahim Faisal Niloy, Yue Dong, Jiachen Li, Amit K. Roy-Chowdhury, and Chengyu Song. Heal: An empirical study on hallucinations in embodied agents driven by large language models, 2025

  16. [16]

    The general-purpose intelligent agent.Engineering, 6(3):221–226, 2020

    Cewu Lu and Shiquan Wang. The general-purpose intelligent agent.Engineering, 6(3):221–226, 2020

  17. [17]

    Language models as agent models, 2022

    Jacob Andreas. Language models as agent models, 2022

  18. [18]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023

  19. [19]

    A path towards autonomous machine intelligence

    Yann LeCun. A path towards autonomous machine intelligence. OpenReview preprint, June

  20. [20]

    Version 0.9.2, June 27 2022

  21. [21]

    Llm reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models.arXiv preprint arXiv:2404.05221, 2024

    Shibo Hao, Yi Gu, Haotian Luo, Tianyang Liu, Xiyan Shao, Xinyuan Wang, Shuhua Xie, Haodi Ma, Adithya Samavedhi, Qiyue Gao, et al. Llm reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models.arXiv preprint arXiv:2404.05221, 2024

  22. [22]

    Brandon Chiou, Mason Choey, Mingkai Deng, Jinyu Hou, Jackie Wang, Ariel Wu, Frank Xu, Zhiting Hu, Hongxia Jin, Li Erran Li, Graham Neubig, Yilin Shen, and Eric P. Xing. Reasoneragent: A fully open source, ready-to-run agent that does research in a web browser and answers your queries, February 2025

  23. [23]

    Autowebglm: A large language model-based web navigating agent, 2024

    Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, and Jie Tang. Autowebglm: A large language model-based web navigating agent, 2024

  24. [24]

    Agent q: Advanced reasoning and learning for autonomous ai agents, 2024

    Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents, 2024. 15

  25. [25]

    Ui-tars: Pioneering automated gui interaction with native agents, 2025

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya...

  26. [26]

    Agent workflow memory, 2024

    Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory, 2024

  27. [27]

    V oyager: An open-ended embodied agent with large language models, 2023

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models, 2023

  28. [28]

    Web agents with world models: Learning and leveraging environment dynamics in web navigation.arXiv preprint arXiv:2410.13232, 2024

    Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation.arXiv preprint arXiv:2410.13232, 2024

  29. [29]

    Action- conditional video prediction using deep networks in atari games, 2015

    Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard Lewis, and Satinder Singh. Action- conditional video prediction using deep networks in atari games, 2015

  30. [30]

    Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, December 2020

    Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, December 2020

  31. [31]

    When to trust your model: Model-based policy optimization, 2021

    Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization, 2021

  32. [32]

    Temporal difference learning for model predictive control, 2022

    Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control, 2022

  33. [33]

    Reasoning with language model is planning with world model, 2023

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model, 2023

  34. [34]

    Mastering diverse domains through world models, 2024

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models, 2024

  35. [35]

    Is your llm secretly a world model of the internet? model-based planning for web agents, 2025

    Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, and Yu Su. Is your llm secretly a world model of the internet? model-based planning for web agents, 2025

  36. [36]

    Pan Macmillan, 2017

    Lisa Feldman Barrett.How emotions are made: The secret life of the brain. Pan Macmillan, 2017

  37. [37]

    Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An Open Platform for AI Soft...

  38. [38]

    Webvoyager: Building an end-to-end web agent with large multimodal models, 2024

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models, 2024

  39. [39]

    Cogagent: A visual language model for gui agents, 2024

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents, 2024

  40. [40]

    A real-world webagent with planning, long context understanding, and program synthesis, 2024

    Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis, 2024. 16

  41. [41]

    Reinforcement learning on web interfaces using workflow-guided exploration

    Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. InInternational Conference on Learning Representations (ICLR), 2018

  42. [42]

    Mind2web: Towards a generalist agent for the web, 2023

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web, 2023

  43. [43]

    Webshop: Towards scalable real-world web interaction with grounded language agents, 2023

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents, 2023

  44. [44]

    Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation, 2025

    Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Bowei Xia, Tao Sun, Ziyu Ye, Zhaoxuan Jin, Yingru Li, Qiguang Chen, Zeyu Zhang, Yifeng Wang, Qianshuo Ye, Bernard Ghanem, Ping Luo, and Guohao Li. Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation, 2025

  45. [45]

    Long term memory: The foundation of ai self-evolution, 2025

    Xun Jiang, Feng Li, Han Zhao, Jiahao Qiu, Jiaying Wang, Jun Shao, Shihao Xu, Shu Zhang, Weiling Chen, Xavier Tang, Yize Chen, Mengyue Wu, Weizhi Ma, Mengdi Wang, and Tianqiao Chen. Long term memory: The foundation of ai self-evolution, 2025

  46. [46]

    Agentorchestra: A hierarchical multi-agent framework for general-purpose task solving, 2025

    Wentao Zhang, Ce Cui, Yilei Zhao, Rui Hu, Yang Liu, Yahui Zhou, and Bo An. Agentorchestra: A hierarchical multi-agent framework for general-purpose task solving, 2025

  47. [47]

    Magentic-one: A generalist multi-agent system for solving complex tasks, 2024

    Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang, Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. Magentic-one: A generalist multi-agent system for solving complex tasks, 2024

  48. [48]

    Executable code actions elicit better llm agents, 2024

    Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents, 2024

  49. [49]

    ‘smolagents‘: a smol library to build great agentic systems

    Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki. ‘smolagents‘: a smol library to build great agentic systems. https://github. com/huggingface/smolagents, 2025

  50. [50]

    Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution, 2025

    Jiahao Qiu, Xuan Qi, Tongcheng Zhang, Xinzhe Juan, Jiacheng Guo, Yifu Lu, Yimin Wang, Zixin Yao, Qihan Ren, Xun Jiang, Xing Zhou, Dongrui Liu, Ling Yang, Yue Wu, Kaixuan Huang, Shilong Liu, Hongru Wang, and Mengdi Wang. Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution, 2025

  51. [51]

    Critiques of world models.arXiv preprint arXiv:2507.05169, 2025

    Eric Xing, Mingkai Deng, Jinyu Hou, and Zhiting Hu. Critiques of world models.arXiv preprint arXiv:2507.05169, 2025

  52. [52]

    MIT press Cambridge, 1998

    Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998

  53. [53]

    macmillan, 2011

    Daniel Kahneman.Thinking, fast and slow. macmillan, 2011

  54. [54]

    Mas- tering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016

    David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driess- che, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mas- tering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016

  55. [55]

    Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm.arXiv preprint arXiv:1712.01815, 2017

  56. [56]

    Dyna, an integrated architecture for learning, planning, and reacting.ACM Sigart Bulletin, 2(4):160–163, 1991

    Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting.ACM Sigart Bulletin, 2(4):160–163, 1991

  57. [57]

    Language models, agent models, and world models: The law for machine reasoning and planning,

    Zhiting Hu and Tianmin Shu. Language models, agent models, and world models: The law for machine reasoning and planning.arXiv preprint arXiv:2312.05230, 2023. 17

  58. [58]

    Introducing chatgpt search

    OpenAI. Introducing chatgpt search. https://openai.com/index/ introducing-chatgpt-search/, 2024. Accessed: 2024-12-19

  59. [59]

    Getting started with perplexity

    Perplexity. Getting started with perplexity. https://www.perplexity.ai/hub/blog/ getting-started-with-perplexity, 2024. Accessed: 2024-12-19

  60. [60]

    Is your LLM secretly a world model of the internet? model-based planning for web agents.arXiv preprint arXiv:2411.06559, 2024

    Yu Gu, Boyuan Zheng, Boyu Gou, Kai Zhang, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, and Yu Su. Is your llm secretly a world model of the internet? model-based planning for web agents.arXiv preprint arXiv:2411.06559, 2024

  61. [61]

    Tree search for language model agents.arXiv preprint arXiv:2407.01476, 2024

    Jing Yu Koh, Stephen McAleer, Daniel Fried, and Ruslan Salakhutdinov. Tree search for language model agents.arXiv preprint arXiv:2407.01476, 2024

  62. [62]

    Advances in neural information processing systems, 36:11809–11822

    Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks?arXiv preprint arXiv:2407.15711, 2024

  63. [63]

    Compression, transduction, and creation: A unified framework for evaluating natural language generation

    Mingkai Deng, Bowen Tan, Zhengzhong Liu, Eric P Xing, and Zhiting Hu. Compression, transduction, and creation: A unified framework for evaluating natural language generation. arXiv preprint arXiv:2109.06379, 2021

  64. [64]

    G- eval: NLG evaluation using gpt-4 with better human alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G- eval: NLG evaluation using gpt-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, December 2023. Association for Computational...

  65. [65]

    Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, and Alexandre Lacoste

    Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks? In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, ed...

  66. [66]

    FanOutQA: A multi-hop, multi-document question answering benchmark for large language models

    Andrew Zhu, Alyssa Hwang, Liam Dugan, and Chris Callison-Burch. FanOutQA: A multi-hop, multi-document question answering benchmark for large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 2: Short Papers), pages 18–37, Bangkok, Thai...

  67. [67]

    button ’Main menu’, clickable, expanded=False

  68. [68]

    link ’Google’, clickable StaticText ’Skip to main content’ StaticText ’Accessibility feedback’

  69. [69]

    link ’Vacation rentals’

  70. [70]

    button ’Change appearance’, hasPopup=’menu’, expanded=False

  71. [71]

    button ’Google apps’, clickable, expanded=False

  72. [72]

    link ’Sign in’, clickable

  73. [73]

    image ’’ StaticText ’Flights’

  74. [74]

    \u200bRound trip’, live=’polite’, relevant=’additions text’, hasPopup=’listbox’, expanded=False, controls=’i9’

    combobox ’Change ticket type. \u200bRound trip’, live=’polite’, relevant=’additions text’, hasPopup=’listbox’, expanded=False, controls=’i9’

  75. [75]

    button ’1 passenger, change number of passengers.’, hasPopup=’dialog’

  76. [76]

    \u200bEconomy’, live=’polite’, relevant=’additions text’, hasPopup=’listbox’, expanded=False, controls=’i22’

    combobox ’Change seating class. \u200bEconomy’, live=’polite’, relevant=’additions text’, hasPopup=’listbox’, expanded=False, controls=’i22’

  77. [77]

    combobox ’Where from?’ value=’Pittsburgh’, clickable, autocomplete=’inline’, hasPopup=’menu’, expanded=False

  78. [78]

    button ’Swap origin and destination.’, disabled=True

  79. [79]

    combobox ’Where to?’, clickable, focused, autocomplete=’inline’, hasPopup=’menu’, expanded=False

  80. [80]

    image ’’ generic ’’, hidden=True

Showing first 80 references.