General Agentic Planning Through Simulative Reasoning with World Models

Eric Xing; Jinyu Hou; Mingkai Deng; Zhiting Hu

arxiv: 2507.23773 · v3 · pith:7ZI2W2HLnew · submitted 2025-07-31 · 💻 cs.AI · cs.CL · cs.LG · cs.RO

Pith reviewed 2026-05-22 12:33 UTC · model grok-4.3

classification 💻 cs.AI cs.CL cs.LG cs.RO

keywords simulative reasoning, world models, agentic planning, LLM agents, web navigation, counterfactual evaluation, SiRA architecture

The pith

Simulative reasoning with world models enables general agentic planning beyond reactive policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that agentic systems currently rely on reactive decision-making that requires task-specific engineering and lacks explicit modeling of future outcomes. It proposes that simulative reasoning, in which an internal world model predicts consequences of candidate actions, supplies a more general-purpose planning mechanism that supports flexible, goal-directed behavior across contexts. To test this, the authors present SiRA, an architecture that maintains natural-language belief states inside an LLM-based world model and uses counterfactual evaluation to choose actions. Evaluations in a web-browser setting across constrained navigation, multi-hop information aggregation, and general instruction following show consistent gains over matched reactive baselines. A sympathetic reader would care because the approach suggests that shared simulative capacity can transfer across tasks rather than demanding re-engineering for each new domain.

Core claim

SiRA instantiates simulative reasoning by using an LLM-based world model with natural-language belief states to predict future outcomes of candidate actions before selecting the next step. Across three qualitatively distinct task categories in a web-browser environment, this produces up to 124% higher task completion rates than a matched reactive baseline and raises constrained navigation success from 0% to 32.2% compared with a representative open-web agent. The persistent advantage across task types indicates that the benefit arises from generalizable counterfactual evaluation rather than task-specific tuning.
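The two headline figures are different kinds of quantity, which a short calculation makes explicit. The sketch below assumes that "124% higher" is a relative gain over the baseline rate; the source does not spell out the formula, so that reading is an assumption. The function names and the 25% and 56% rates are hypothetical, chosen only to illustrate the arithmetic. A relative gain is undefined when the baseline is 0%, which is presumably why the constrained-navigation result is stated as an absolute move from 0% to 32.2%.

```python
def relative_gain_pct(baseline_rate: float, new_rate: float) -> float:
    """Relative improvement of new_rate over baseline_rate, in percent."""
    if baseline_rate <= 0:
        # A 0% baseline has no meaningful relative gain (division by zero).
        raise ValueError("relative gain is undefined for a zero baseline")
    return (new_rate - baseline_rate) / baseline_rate * 100.0


def absolute_gain_points(baseline_rate: float, new_rate: float) -> float:
    """Absolute improvement in percentage points."""
    return (new_rate - baseline_rate) * 100.0


# Hypothetical rates, NOT taken from the paper: 25% -> 56% is a 124% relative gain.
illustrative_relative = round(relative_gain_pct(0.25, 0.56), 1)

# The reported navigation result can only be expressed in absolute terms.
navigation_points = round(absolute_gain_points(0.0, 0.322), 1)
```

Reporting both forms side by side would make it easier to compare task categories whose baselines differ widely.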

What carries the argument

SiRA, the Simulative Reasoning Architecture, which maintains natural-language belief states inside an LLM-based world model to support counterfactual evaluation of candidate actions before execution.
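A minimal sketch of this mechanism, assuming a one-step lookahead and plain strings for belief states and actions. The class `SimulativePlanner`, its three callables, and the stub functions are hypothetical names introduced for illustration; in a real system each callable would presumably wrap an LLM call, and this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Assumption: belief states and actions are natural-language strings.
BeliefState = str
Action = str


@dataclass
class SimulativePlanner:
    """Toy one-step simulative planner (illustrative only)."""

    propose: Callable[[BeliefState, str], Sequence[Action]]    # candidate actions
    world_model: Callable[[BeliefState, Action], BeliefState]  # predicted next belief state
    critic: Callable[[BeliefState, str], float]                # goal-progress score

    def select_action(self, belief: BeliefState, goal: str) -> Action:
        candidates = list(self.propose(belief, goal))
        if not candidates:
            raise ValueError("no candidate actions proposed")
        # Counterfactual evaluation: simulate each candidate, then score the predicted state.
        scored = [(self.critic(self.world_model(belief, a), goal), a) for a in candidates]
        # Compare on score only, so ties keep the first proposed candidate.
        return max(scored, key=lambda pair: pair[0])[1]


# Stub components standing in for LLM calls (hypothetical behavior).
def propose(belief: BeliefState, goal: str) -> Sequence[Action]:
    return ["click 'Search flights'", "scroll down"]


def world_model(belief: BeliefState, action: Action) -> BeliefState:
    if action.startswith("click"):
        return belief + " The results page lists flights."
    return belief + " More of the same page is visible."


def critic(predicted: BeliefState, goal: str) -> float:
    return 1.0 if "lists flights" in predicted else 0.0


planner = SimulativePlanner(propose, world_model, critic)
chosen = planner.select_action("The flight search form is filled in.", "find a flight")
```

A reactive baseline in this framing would return the first proposed action without consulting `world_model` or `critic`, which is the contrast the paper's comparison turns on.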

If this is right

Agents achieve higher success rates on complex web tasks by grounding each decision in predicted future states rather than immediate pattern matching.
The same reasoning capacity transfers across constrained navigation, multi-hop aggregation, and general instruction following without per-task re-engineering.
Constrained navigation tasks become solvable where purely reactive methods achieve zero success.
Natural-language belief states make the agent's internal predictions inspectable and reusable across goals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If world-model accuracy improves, the same architecture could transfer to non-web environments such as code execution or robotic control.
The separation of world-model prediction from action selection suggests a route for combining simulative reasoning with learned policies or search algorithms.
Natural-language state representations may allow human users to audit or correct the agent's internal model before execution.

Load-bearing premise

The LLM-based world model produces sufficiently accurate natural-language predictions of future states to support effective counterfactual evaluation in the tested web-browser tasks.

What would settle it

Replacing the world-model predictions with inaccurate or random forecasts and observing that task-completion rates fall to the level of the reactive baseline or below.
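One way such an ablation could be wired up is to wrap the world model so that a controlled fraction of its predictions is replaced by unrelated text. The sketch below is an assumption about how the test might be set up, not a procedure from the paper; `corrupt_world_model`, `distractors`, and `corruption_rate` are hypothetical names.

```python
import random
from typing import Callable, Sequence

# Assumption: a world model maps (belief state, action) to a predicted belief state.
WorldModel = Callable[[str, str], str]


def corrupt_world_model(
    world_model: WorldModel,
    distractors: Sequence[str],
    corruption_rate: float,
    rng: random.Random,
) -> WorldModel:
    """Return a world model whose predictions are randomly replaced by distractors."""
    if not 0.0 <= corruption_rate <= 1.0:
        raise ValueError("corruption_rate must be in [0, 1]")
    if not distractors:
        raise ValueError("need at least one distractor prediction")
    pool = list(distractors)

    def corrupted(belief: str, action: str) -> str:
        # rng.random() is in [0, 1), so rate 1.0 always corrupts and rate 0.0 never does.
        if rng.random() < corruption_rate:
            return rng.choice(pool)
        return world_model(belief, action)

    return corrupted
```

Sweeping `corruption_rate` from 0 to 1 and plotting task completion against it would show whether performance degrades toward the reactive baseline as predictions become less faithful.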

Figures

Figures reproduced from arXiv: 2507.23773 by Eric Xing, Jinyu Hou, Mingkai Deng, Zhiting Hu.

**Figure 2.** A possible definition of an optimal agent

**Figure 3.** An agent in real world where groundtruth world state and universe are unavailable to ...

**Figure 4.** Optimal agent architecture design with conditional probability annotation

**Figure 5.** LLM-based implementation of our proposed agent model for web-related tasks (e.g. multi ...

**Figure 6.** Overview of performance comparison between ...

**Figure 7.** Faced with the question “What is the earliest-arriving flight tomorrow from Pittsburgh to ...

**Figure 8.** Illustration of the data generation process for the FlightQA dataset. We first prompt a LLM ...

**Figure 9.** % correct and % response returned vs. number of constraints for FlightQA. Based on our ...

Original abstract

What does it mean to plan? Current agentic systems, whether scaffolded workflows or end-to-end policies, rely on reactive decision-making: selecting the next action via a fixed procedure with at most undifferentiated adaptive computation (e.g., chain-of-thought) lacking explicit modeling of future outcomes. This limits generalizability, as each new task demands re-engineering rather than transfer of shared reasoning capacity. Humans, by contrast, plan by mentally simulating consequences of candidate actions within an internal world model, a capacity known as simulative reasoning (System II) that supports flexible, goal-directed behavior across diverse contexts. We argue that simulative reasoning through a world model provides a general-purpose planning mechanism for agentic systems, improving upon reactive policies (System I) by grounding decisions in predicted future states rather than pattern-matched responses. To verify this, we introduce SiRA (Simulative Reasoning Architecture), a goal-oriented architecture instantiating simulative reasoning using an LLM-based world model with natural-language belief states, while remaining model-agnostic. We evaluate across three qualitatively distinct task categories: constrained navigation, multi-hop information aggregation, and general instruction following, in a web-browser environment. Across all categories, simulative reasoning achieves up to 124% higher task completion rates than a matched reactive baseline, and increases constrained navigation success from 0% to 32.2% compared to a representative open-web agent. The persistent advantage across distinct task types suggests the benefit stems from generalizable counterfactual evaluation rather than task-specific tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SiRA adds LLM-based simulative rollouts with natural-language states to agent planning and reports solid cross-task gains, but the gains rest on an unmeasured assumption about prediction accuracy.

The paper's main contribution is SiRA, a straightforward architecture that inserts an explicit simulation step: an LLM world model predicts future states in natural language, the agent evaluates candidate actions against those predictions, and then it acts. The authors test it on constrained navigation, multi-hop information aggregation, and general instruction following inside a web browser. The numbers show consistent gains: up to 124% higher task completion rates than a matched reactive baseline, and constrained navigation success lifted from 0% to 32.2% relative to a representative open-web agent.

Referee Report

2 major / 2 minor

Summary. The paper introduces SiRA (Simulative Reasoning Architecture), a model-agnostic goal-oriented framework that performs agentic planning via simulative reasoning over an LLM-based world model with natural-language belief states. It contrasts this explicit counterfactual evaluation of future outcomes against reactive (System I) policies that rely on pattern-matched next-action selection without internal simulation. The approach is evaluated on three distinct task categories in a web-browser environment—constrained navigation, multi-hop information aggregation, and general instruction following—reporting up to 124% higher task completion rates than a matched reactive baseline and lifting constrained navigation success from 0% to 32.2% relative to a representative open-web agent. The authors argue that the persistent cross-category gains indicate a generalizable benefit from grounded simulation rather than task-specific tuning.

Significance. If the performance advantages can be causally attributed to accurate simulative reasoning rather than ancillary factors, the work offers a promising general-purpose mechanism for improving transfer and flexibility in agentic systems beyond current scaffolded or end-to-end reactive approaches. The model-agnostic design and evaluation across qualitatively different task types are constructive elements that support broader applicability claims.

major comments (2)

[Experimental results and analysis sections] No direct measurement or error analysis of the LLM world model's natural-language future-state prediction accuracy (e.g., state match rate, trajectory-level fidelity, or held-out prediction error) is reported. This is load-bearing for the central claim that gains arise from reliable counterfactual evaluation, because without evidence that the predicted states are sufficiently faithful to actual dynamic web-browser states, the 124% completion improvement and 0%-to-32.2% navigation lift could instead result from extra inference steps, prompt structure, or implicit search.
[Baseline and evaluation protocol] The manuscript provides insufficient detail on run counts, variance, exact matching of reactive baselines (including total LLM calls and prompt length), and controls for model scale or compute budget. These omissions undermine interpretation of the reported percentage gains, as simulative reasoning inherently involves additional forward simulation steps whose contribution must be isolated from confounding factors.
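The first comment asks for a state match rate without defining one. A minimal sketch of such a metric follows, assuming predicted and observed states are paired natural-language strings and that a match is decided by a pluggable comparison. The names `state_match_rate` and `normalize` are hypothetical, and the default exact-match-after-normalization rule is a deliberately crude stand-in; a real evaluation would more likely use a semantic or LLM-judged comparison.

```python
from typing import Callable, Optional, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def state_match_rate(
    predicted: Sequence[str],
    observed: Sequence[str],
    match: Optional[Callable[[str, str], bool]] = None,
) -> float:
    """Fraction of steps where the predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must be the same length")
    if not predicted:
        raise ValueError("need at least one step to compute a rate")
    if match is None:
        # Assumption: exact match after normalization; swap in a semantic matcher if available.
        match = lambda p, o: normalize(p) == normalize(o)
    hits = sum(1 for p, o in zip(predicted, observed) if match(p, o))
    return hits / len(predicted)


# Illustrative strings only, not data from the paper.
rate = state_match_rate(
    ["Results page  shown", "Cart is empty"],
    ["results page shown", "Cart has one item"],
)
```

Computing the same rate per trajectory rather than per step would give the trajectory-level fidelity the referee also mentions.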

minor comments (2)

[Method] Notation for belief states and world-model outputs could be formalized more explicitly (e.g., with a consistent mathematical description) to aid reproducibility.
[Results figures] Figure captions and axis labels in the results figures would benefit from explicit indication of error bars or trial counts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional evidence and transparency can strengthen the attribution of gains to simulative reasoning. We address each major comment below and will incorporate revisions to improve the manuscript.

Referee: Experimental results and analysis sections: No direct measurement or error analysis of the LLM world model's natural-language future-state prediction accuracy (e.g., state match rate, trajectory-level fidelity, or held-out prediction error) is reported. This is load-bearing for the central claim that gains arise from reliable counterfactual evaluation, because without evidence that the predicted states are sufficiently faithful to actual dynamic web-browser states, the 124% completion improvement and 0%-to-32.2% navigation lift could instead result from extra inference steps, prompt structure, or implicit search.

Authors: We agree that direct metrics on the world model's predictive accuracy would provide stronger causal support for the role of simulative reasoning. While the manuscript prioritizes end-to-end task performance across diverse categories as evidence of generalizability, we recognize this as a valuable addition. In revision, we will add an analysis subsection reporting state prediction accuracy on held-out trajectories, including state match rates, trajectory-level fidelity, and examples of prediction errors.
revision: yes

Referee: Baseline and evaluation protocol: The manuscript provides insufficient detail on run counts, variance, exact matching of reactive baselines (including total LLM calls and prompt length), and controls for model scale or compute budget. These omissions undermine interpretation of the reported percentage gains, as simulative reasoning inherently involves additional forward simulation steps whose contribution must be isolated from confounding factors.

Authors: We appreciate the call for greater experimental detail. The current version reports comparative results but omits full statistical and matching information. We will revise the evaluation section to specify the number of independent runs per task, report variance via standard deviations, confirm equivalent total LLM calls and prompt lengths for the reactive baseline, and explicitly note that all methods share the same underlying LLM to control for scale and compute budget.
revision: yes
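The reporting promised in this response can be sketched in a few lines. The example below assumes each independent run yields one success rate per method; `summarize_runs` is a hypothetical helper, and it uses the sample standard deviation from Python's standard library as one reasonable choice of variance measure.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize_runs(success_rates: Sequence[float]) -> Dict[str, float]:
    """Summarize per-run success rates as run count, mean, and sample standard deviation."""
    if len(success_rates) < 2:
        # stdev needs at least two data points.
        raise ValueError("need at least two runs to estimate variance")
    return {
        "runs": float(len(success_rates)),
        "mean": mean(success_rates),
        "std": stdev(success_rates),
    }


# Illustrative values only, not results from the paper.
summary = summarize_runs([0.2, 0.4, 0.6])
```

Reporting the same summary for the reactive baseline, alongside total LLM calls per task, would let readers judge whether the gap exceeds run-to-run noise at a matched compute budget.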

Circularity Check

0 steps flagged

No significant circularity; empirical comparison to baselines is independent of fitted parameters or self-referential definitions

The paper introduces the SiRA architecture to instantiate simulative reasoning via an LLM-based world model and reports empirical performance gains (up to 124% higher task completion, 0% to 32.2% navigation success) from direct comparison against matched reactive and open-web agent baselines across three task categories. No equations, fitted parameters, or self-citations are described that would reduce the reported advantage to a construction by definition or to a renamed input. The central claim rests on external experimental benchmarks rather than internal redefinition or load-bearing self-reference, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the premise that an LLM can serve as a usable world model for counterfactual simulation and that observed gains are attributable to this mechanism rather than implementation details.

axioms (1)

domain assumption: Large language models can maintain accurate enough natural-language belief states to support useful simulative reasoning about future outcomes.
Invoked when the architecture is instantiated with an LLM-based world model.

invented entities (1)

SiRA architecture (no independent evidence)
purpose: Goal-oriented instantiation of simulative reasoning using LLM world models.
New proposed system whose effectiveness is demonstrated only within the paper's experiments.

pith-pipeline@v0.9.0 · 5812 in / 1271 out tokens · 68453 ms · 2026-05-22T12:33:21.515999+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: SIMURA ... world model f ... predicts the next belief state ŝ_{t+1} ... natural language as a compact but complete representation

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: hierarchical architecture that isolates perception, simulative planning, and action selection

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
cs.AI · 2026-05 · unverdicted · novelty 6.0

SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
cs.AI · 2026-04 · unverdicted · novelty 5.0

The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 2 Pith papers · 3 internal anchors

[1]

Operator

OpenAI. Computer -using agent (cua). https://openai.com/index/ computer-using-agent/, January 2025. Research preview of “Operator”, published January 23, 2025

work page 2025
[2]

Project mariner

DeepMind. Project mariner. https://deepmind.google/models/project-mariner/,

work page
[3]

Accessed: 2025-07-16

work page 2025
[4]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024

work page 2024
[6]

Introducing Deep Research

OpenAI. Introducing Deep Research. https://openai.com/index/ introducing-deep-research/, February 2025. Deep Research agent release an- nouncement

work page 2025
[7]

computer use

Anthropic. Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku. https: //www.anthropic.com/news/3-5-models-and-computer-use , October 22 2024. Public beta “computer use” feature for Claude 3.5 Sonnet

work page 2024
[8]

Gemini Deep Research: Your Personal Research Assistant

Google. Gemini Deep Research: Your Personal Research Assistant. https://gemini. google/overview/deep-research/?hl=en, December 2024. Overview of Gemini Deep Research agent feature

work page 2024
[9]

O’Brien, Carrie J

Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. 14

work page 2023
[10]

Cursor: The ai code editor

Anysphere Inc. Cursor: The ai code editor. https://cursor.com, 2025. Accessed: 2025-07- 16

work page 2025
[11]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai soft...

work page 2025
[12]

Towards an AI co-scientist

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vi...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025

work page 2025
[14]

Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems, 2025

Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, Zhaoyang Yu, Haochen Shi, Boyan Li, Dekun Wu, Fengwei Teng, Xiaojun Jia, Jiawei Xu, Jinyu Xiang, Yizhang Lin, Tianmi...

work page 2025
[15]

Roy-Chowdhury, and Chengyu Song

Trishna Chakraborty, Udita Ghosh, Xiaopan Zhang, Fahim Faisal Niloy, Yue Dong, Jiachen Li, Amit K. Roy-Chowdhury, and Chengyu Song. Heal: An empirical study on hallucinations in embodied agents driven by large language models, 2025

work page 2025
[16]

The general-purpose intelligent agent.Engineering, 6(3):221–226, 2020

Cewu Lu and Shiquan Wang. The general-purpose intelligent agent.Engineering, 6(3):221–226, 2020

work page 2020
[17]

Language models as agent models, 2022

Jacob Andreas. Language models as agent models, 2022

work page 2022
[18]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023
[19]

A path towards autonomous machine intelligence

Yann LeCun. A path towards autonomous machine intelligence. OpenReview preprint, June

work page
[20]

Version 0.9.2, June 27 2022

work page 2022
[21]

Llm reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models.arXiv preprint arXiv:2404.05221, 2024

Shibo Hao, Yi Gu, Haotian Luo, Tianyang Liu, Xiyan Shao, Xinyuan Wang, Shuhua Xie, Haodi Ma, Adithya Samavedhi, Qiyue Gao, et al. Llm reasoners: New evaluation, library, and analysis of step-by-step reasoning with large language models.arXiv preprint arXiv:2404.05221, 2024

work page arXiv 2024
[22]

Brandon Chiou, Mason Choey, Mingkai Deng, Jinyu Hou, Jackie Wang, Ariel Wu, Frank Xu, Zhiting Hu, Hongxia Jin, Li Erran Li, Graham Neubig, Yilin Shen, and Eric P. Xing. Reasoneragent: A fully open source, ready-to-run agent that does research in a web browser and answers your queries, February 2025

work page 2025
[23]

Autowebglm: A large language model-based web navigating agent, 2024

Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, and Jie Tang. Autowebglm: A large language model-based web navigating agent, 2024

work page 2024
[24]

Agent q: Advanced reasoning and learning for autonomous ai agents, 2024

Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents, 2024. 15

work page 2024
[25]

Ui-tars: Pioneering automated gui interaction with native agents, 2025

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya...

work page 2025
[26]

Agent workflow memory, 2024

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory, 2024

work page 2024
[27]

V oyager: An open-ended embodied agent with large language models, 2023

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models, 2023

work page 2023
[28]

Web agents with world models: Learning and leveraging environment dynamics in web navigation.arXiv preprint arXiv:2410.13232, 2024

Hyungjoo Chae, Namyoung Kim, Kai Tzu-iunn Ong, Minju Gwak, Gwanwoo Song, Jihoon Kim, Sunghwan Kim, Dongha Lee, and Jinyoung Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation.arXiv preprint arXiv:2410.13232, 2024

work page arXiv 2024
[29]

Action- conditional video prediction using deep networks in atari games, 2015

Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard Lewis, and Satinder Singh. Action- conditional video prediction using deep networks in atari games, 2015

work page 2015
[30]

Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, December 2020

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering atari, go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, December 2020

work page 2020
[31]

When to trust your model: Model-based policy optimization, 2021

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization, 2021

work page 2021
[32]

Temporal difference learning for model predictive control, 2022

Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control, 2022

work page 2022
[33]

Reasoning with language model is planning with world model, 2023

Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model, 2023

work page 2023
[34]

Mastering diverse domains through world models, 2024

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models, 2024

work page 2024
[35]

Is your llm secretly a world model of the internet? model-based planning for web agents, 2025

Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, and Yu Su. Is your llm secretly a world model of the internet? model-based planning for web agents, 2025

work page 2025
[36]

Pan Macmillan, 2017

Lisa Feldman Barrett.How emotions are made: The secret life of the brain. Pan Macmillan, 2017

work page 2017
[37]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An Open Platform for AI Soft...

work page 2024
[38]

Webvoyager: Building an end-to-end web agent with large multimodal models, 2024

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models, 2024

work page 2024
[39]

Cogagent: A visual language model for gui agents, 2024

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents, 2024

work page 2024
[40]

A real-world webagent with planning, long context understanding, and program synthesis, 2024

Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis, 2024. 16

work page 2024
[41]

Reinforcement learning on web interfaces using workflow-guided exploration

Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. InInternational Conference on Learning Representations (ICLR), 2018

work page 2018
[42]

Mind2web: Towards a generalist agent for the web, 2023

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web, 2023

work page 2023
[43]

Webshop: Towards scalable real-world web interaction with grounded language agents, 2023

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents, 2023

work page 2023
[44]

Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation, 2025

Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Bowei Xia, Tao Sun, Ziyu Ye, Zhaoxuan Jin, Yingru Li, Qiguang Chen, Zeyu Zhang, Yifeng Wang, Qianshuo Ye, Bernard Ghanem, Ping Luo, and Guohao Li. Owl: Optimized workforce learning for general multi-agent assistance in real-world task automation, 2025

work page 2025
[45]

Long term memory: The foundation of ai self-evolution, 2025

Xun Jiang, Feng Li, Han Zhao, Jiahao Qiu, Jiaying Wang, Jun Shao, Shihao Xu, Shu Zhang, Weiling Chen, Xavier Tang, Yize Chen, Mengyue Wu, Weizhi Ma, Mengdi Wang, and Tianqiao Chen. Long term memory: The foundation of ai self-evolution, 2025

work page 2025
[46]

Agentorchestra: A hierarchical multi-agent framework for general-purpose task solving, 2025

Wentao Zhang, Ce Cui, Yilei Zhao, Rui Hu, Yang Liu, Yahui Zhou, and Bo An. Agentorchestra: A hierarchical multi-agent framework for general-purpose task solving, 2025

work page 2025
[47]

Magentic-one: A generalist multi-agent system for solving complex tasks, 2024

Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Erkang, Zhu, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang, Ricky Loynd, Robert West, Victor Dibia, Ahmed Awadallah, Ece Kamar, Rafah Hosn, and Saleema Amershi. Magentic-one: A generalist multi-agent system for solving complex tasks, 2024

work page 2024
[48]

Executable code actions elicit better llm agents, 2024

Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better llm agents, 2024

work page 2024
[49]

‘smolagents‘: a smol library to build great agentic systems

Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki. ‘smolagents‘: a smol library to build great agentic systems. https://github. com/huggingface/smolagents, 2025

work page 2025
[50]

Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution, 2025

Jiahao Qiu, Xuan Qi, Tongcheng Zhang, Xinzhe Juan, Jiacheng Guo, Yifu Lu, Yimin Wang, Zixin Yao, Qihan Ren, Xun Jiang, Xing Zhou, Dongrui Liu, Ling Yang, Yue Wu, Kaixuan Huang, Shilong Liu, Hongru Wang, and Mengdi Wang. Alita: Generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution, 2025

[51] Eric Xing, Mingkai Deng, Jinyu Hou, and Zhiting Hu. Critiques of world models. arXiv preprint arXiv:2507.05169, 2025

[52] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998

[53] Daniel Kahneman. Thinking, fast and slow. Macmillan, 2011

[54] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016

[55] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017

[56] Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin, 2(4):160–163, 1991

[57] Zhiting Hu and Tianmin Shu. Language models, agent models, and world models: The law for machine reasoning and planning. arXiv preprint arXiv:2312.05230, 2023

[58] OpenAI. Introducing chatgpt search. https://openai.com/index/introducing-chatgpt-search/, 2024. Accessed: 2024-12-19

[59] Perplexity. Getting started with perplexity. https://www.perplexity.ai/hub/blog/getting-started-with-perplexity, 2024. Accessed: 2024-12-19

[60] Yu Gu, Boyuan Zheng, Boyu Gou, Kai Zhang, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, and Yu Su. Is your llm secretly a world model of the internet? model-based planning for web agents. arXiv preprint arXiv:2411.06559, 2024

[61] Jing Yu Koh, Stephen McAleer, Daniel Fried, and Ruslan Salakhutdinov. Tree search for language model agents. arXiv preprint arXiv:2407.01476, 2024

[62] Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks? arXiv preprint arXiv:2407.15711, 2024

[63] Mingkai Deng, Bowen Tan, Zhengzhong Liu, Eric P Xing, and Zhiting Hu. Compression, transduction, and creation: A unified framework for evaluating natural language generation. arXiv preprint arXiv:2109.06379, 2021

[64] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, December 2023. Association for Computational...

[65] Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks? In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, ed...

[66] Andrew Zhu, Alyssa Hwang, Liam Dugan, and Chris Callison-Burch. FanOutQA: A multi-hop, multi-document question answering benchmark for large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 18–37, Bangkok, Thai...

