arxiv: 2503.09572 · v3 · pith:63CLJGZZnew · submitted 2025-03-12 · 💻 cs.CL

Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks

Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, Amir Gholami

Pith reviewed 2026-05-17 21:27 UTC · model grok-4.3

classification 💻 cs.CL

keywords Plan-and-Act · LLM agents · long-horizon tasks · synthetic data generation · web navigation · planning and execution · WebArena · WebVoyager

The pith

Plan-and-Act improves LLM agent performance on long-horizon tasks by separating planning from execution and training the planner with synthetic data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that LLM-based agents struggle with complex multi-step tasks because planning is hard for them, and that separating high-level plan generation from low-level action execution lets an agent balance objectives against details. It proposes the Plan-and-Act framework, in which a Planner creates structured plans and an Executor turns them into actions. To improve the Planner, the authors take existing successful trajectories, annotate them with plans, and generate a large, varied set of synthetic training examples. A sympathetic reader would care because this could make AI agents more capable at real-world jobs that require many steps, such as navigating websites to complete bookings or research.

Core claim

The authors claim that explicitly incorporating planning into LLM agents, through a dedicated Planner model trained on ground-truth trajectories annotated with feasible plans and augmented with diverse synthetic examples, yields state-of-the-art web navigation results: a 57.58% success rate on WebArena-Lite and a text-only state-of-the-art 81.36% on WebVoyager.

What carries the argument

The Plan-and-Act framework consisting of a Planner model that generates high-level structured plans and an Executor model that translates plans into environment-specific actions, enabled by a synthetic data generation process for training.
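The division of labor described here can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `Planner`, `Executor`, `Environment`, `plan_and_act`, and the toy stand-ins are hypothetical names, both roles would be LLMs in practice, and the sketch assumes the plan is generated once up front.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces; in the paper both roles are played by LLMs.
Planner = Callable[[str], List[str]]     # user goal -> ordered high-level steps
Executor = Callable[[str, str], str]     # (plan step, observation) -> concrete action
Environment = Callable[[str], str]       # action -> next observation


@dataclass
class Trace:
    plan: List[str]
    actions: List[str]


def plan_and_act(goal: str, planner: Planner, executor: Executor,
                 env: Environment, observation: str = "") -> Trace:
    """Plan once (an assumption of this sketch), then ground each step in turn."""
    plan = planner(goal)
    actions: List[str] = []
    for step in plan:
        action = executor(step, observation)  # environment-specific action
        observation = env(action)             # environment returns a new observation
        actions.append(action)
    return Trace(plan=plan, actions=actions)


# Toy stand-ins so the sketch runs without any model or browser.
def toy_planner(goal: str) -> List[str]:
    return [f"search for {goal}", "open the first result"]


def toy_executor(step: str, observation: str) -> str:
    prefix = "search for "
    if step.startswith(prefix):
        return f"type[{step[len(prefix):]}]"
    return "click[result-1]"


def toy_env(action: str) -> str:
    return f"page after {action}"


trace = plan_and_act("flights to Oslo", toy_planner, toy_executor, toy_env)
print(trace.plan)
print(trace.actions)
```

Because the Planner never sees raw observations in this sketch and the Executor never sees the full goal, each component can be swapped or retrained on its own, which is the property the framework relies on.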

Load-bearing premise

That adding plans to existing trajectories and creating synthetic examples will train a planner that works on new tasks without overfitting to the specific annotation or generation process.
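The premise concerns a two-stage data recipe: annotate real trajectories with plans, then augment them with synthetic variants. A rough sketch of that shape follows. The `annotate` and `augment` hooks are hypothetical placeholders for what would be LLM calls, and the class and function names are illustrative rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Trajectory:
    goal: str
    actions: List[str]


@dataclass
class PlanExample:
    goal: str
    plan: List[str]
    synthetic: bool


# Hypothetical hooks; in practice each would be an LLM call.
Annotator = Callable[[Trajectory], List[str]]
Augmenter = Callable[[PlanExample, int], Iterable[Tuple[str, List[str]]]]


def build_planner_dataset(trajectories: List[Trajectory], annotate: Annotator,
                          augment: Augmenter, per_seed: int) -> List[PlanExample]:
    """Annotate real trajectories with plans, then add synthetic variants of each."""
    seeds = [PlanExample(t.goal, annotate(t), synthetic=False) for t in trajectories]
    extras = [PlanExample(goal, plan, synthetic=True)
              for seed in seeds
              for goal, plan in augment(seed, per_seed)]
    return seeds + extras


# Toy stand-ins so the sketch runs without a model.
def toy_annotate(t: Trajectory) -> List[str]:
    return [f"step {i}: {a}" for i, a in enumerate(t.actions, start=1)]


def toy_augment(seed: PlanExample, n: int):
    for k in range(n):
        yield f"{seed.goal} (variant {k + 1})", list(seed.plan)


real = [Trajectory("book a hotel", ["search hotels", "pick dates", "confirm"])]
dataset = build_planner_dataset(real, toy_annotate, toy_augment, per_seed=2)
print(len(dataset), sum(e.synthetic for e in dataset))
```

Keeping a `synthetic` flag on every example is a choice made for this sketch: it makes it easy to train on the annotated seeds alone and compare against the full set, which is the kind of check the premise invites.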

What would settle it

Running the trained planner on a set of web tasks that differ substantially from the training trajectories and finding success rates no higher than those of agents without explicit planning.
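One way to set up such a test is sketched below, under stated assumptions: tasks carry a hypothetical `site` field so that whole websites can be held out, and an agent is any callable that reports success or failure. The stub agents and task list are placeholders, not results.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Task = Dict[str, str]            # hypothetical shape: {"site": ..., "goal": ...}
Agent = Callable[[Task], bool]   # returns True when the task is completed


def held_out_split(tasks: List[Task], held_out_sites: Set[str]) -> Tuple[List[Task], List[Task]]:
    """Keep every task from the held-out sites out of the training pool."""
    train = [t for t in tasks if t["site"] not in held_out_sites]
    test = [t for t in tasks if t["site"] in held_out_sites]
    return train, test


def success_rate(agent: Agent, tasks: Iterable[Task]) -> float:
    tasks = list(tasks)
    if not tasks:
        raise ValueError("no tasks to evaluate")
    return sum(agent(t) for t in tasks) / len(tasks)


def planning_gain(with_plan: Agent, without_plan: Agent, tasks: List[Task]) -> float:
    """Positive values favor the planning agent; values near zero would undercut the claim."""
    return success_rate(with_plan, tasks) - success_rate(without_plan, tasks)


# Placeholder tasks and stub agents, for illustration only.
tasks = [
    {"site": "shop", "goal": "find shoes"},
    {"site": "shop", "goal": "track order"},
    {"site": "maps", "goal": "find cafe"},
    {"site": "maps", "goal": "route home"},
]
train, test = held_out_split(tasks, {"maps"})
gain = planning_gain(lambda t: True, lambda t: t["goal"].startswith("find"), test)
print(len(train), len(test), gain)
```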

Original abstract

Large language models (LLMs) have shown remarkable advancements in enabling language agents to tackle simple tasks. However, applying them for complex, multi-step, long-horizon tasks remains a challenge. Recent work have found success by separating high-level planning from low-level execution, which enables the model to effectively balance high-level planning objectives and low-level execution details. However, generating accurate plans remains difficult since LLMs are not inherently trained for this task. To address this, we propose Plan-and-Act, a novel framework that incorporates explicit planning into LLM-based agents and introduces a scalable method to enhance plan generation through a novel synthetic data generation method. Plan-and-Act consists of a Planner model which generates structured, high-level plans to achieve user goals, and an Executor model that translates these plans into environment-specific actions. To train the Planner effectively, we introduce a synthetic data generation method that annotates ground-truth trajectories with feasible plans, augmented with diverse and extensive examples to enhance generalization. We evaluate Plan-and-Act using web navigation as a representative long-horizon planning environment, demonstrating a state-of-the-art 57.58% success rate on the WebArena-Lite benchmark as well as a text-only state-of-the-art 81.36% success rate on WebVoyager.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Plan-and-Act gets competitive web navigation numbers by training a planner on trajectory-derived synthetic plans, but the ablations needed to confirm the data method actually drives generalization are missing.

The main point is that this work splits the agent into a planner that outputs high-level steps and an executor that turns them into actions, then trains the planner on plans pulled from successful trajectories plus synthetic additions. That setup delivers the reported 57.58% on WebArena-Lite and 81.36% text-only on WebVoyager, which beats prior numbers on those benchmarks. The concrete new piece is the scalable annotation pipeline that turns ground-truth paths into plan supervision and then augments it to cover more cases. That is a practical way to get planning data without full human labeling, and it fits the long-horizon web task setting they chose. The separation itself is not brand new, but the specific data recipe and the two-model split applied to these benchmarks are what move the needle here. The results look usable for people who need agents that can handle multi-step browsing or automation without constant prompting.

The soft spot is the lack of clear controls on whether the synthetic plans add real diversity or just echo the original trajectory patterns. Without ablations that remove the augmentation step, or checks on how much the generated plans differ from training examples under distribution shift, the success rates could partly reflect memorization rather than better planning. The paper also does not lay out statistical significance or full baseline comparisons in enough detail to pin down the exact source of the gains.

This is the kind of work that matters for labs building deployable LLM agents on interactive environments. Readers who run similar benchmarks will find the framework straightforward to try and the numbers worth testing against their own setups. It is solid enough on the empirical side to deserve peer review, even if the authors need to add the missing ablations and validation steps before it lands in a strong venue. I would send it out rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes Plan-and-Act, a framework separating a Planner model that generates high-level structured plans from an Executor that translates plans into environment actions for LLM agents on long-horizon tasks. It introduces a synthetic data generation method that annotates ground-truth trajectories with feasible plans and augments them with diverse synthetic examples to train the Planner. Evaluated on web navigation, it reports state-of-the-art success rates of 57.58% on WebArena-Lite and 81.36% on WebVoyager.

Significance. If the empirical results hold after addressing evaluation gaps, this work could advance LLM agent capabilities for complex multi-step tasks by providing a scalable synthetic data approach to improve plan generation and generalization. The planner-executor separation is a promising architectural choice, and the reported benchmark numbers indicate potential practical utility if the gains are attributable to the proposed method rather than unexamined factors.

major comments (2)

[Abstract and Evaluation section] The manuscript states SOTA success rates of 57.58% on WebArena-Lite and 81.36% on WebVoyager but supplies no information on the specific baselines, number of evaluation runs, variance or statistical tests, or how the synthetic data was validated against distribution shift. This makes it impossible to determine whether the rates support the central claim of improved planning generalization.
[Synthetic data generation description] The method of annotating ground-truth trajectories with feasible plans and augmenting with diverse synthetic examples lacks any explicit diversity metric, distribution shift analysis, or ablation isolating the augmentation's contribution. This directly bears on the weakest assumption that the approach produces plans that generalize to unseen tasks rather than overfitting to annotation-specific structures in the WebArena/WebVoyager trajectories.

minor comments (2)

[Abstract] The abstract contains a minor grammatical issue: 'Recent work have found success' should read 'Recent work has found success'.
[Method] Clarify the base LLM models and training details for the Planner and Executor to improve reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has identified opportunities to strengthen the clarity and rigor of our empirical claims. We address each major comment below and will incorporate the suggested additions in the revised manuscript.

Referee: [Abstract and Evaluation section] The manuscript states SOTA success rates of 57.58% on WebArena-Lite and 81.36% on WebVoyager but supplies no information on the specific baselines, number of evaluation runs, variance or statistical tests, or how the synthetic data was validated against distribution shift. This makes it impossible to determine whether the rates support the central claim of improved planning generalization.

Authors: We agree that more granular reporting is needed to substantiate the SOTA claims. In the revised Evaluation section we will add: (i) an explicit list of all baselines with citations, (ii) results averaged over three independent runs with standard deviations, (iii) statistical significance via paired t-tests (p-values reported), and (iv) a short appendix subsection validating synthetic data against distribution shift using KL divergence on plan-length and step-type histograms between the generated training distribution and the WebArena/WebVoyager test sets. These details will make the generalization argument directly verifiable.
Revision: yes
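Two of the promised quantities, run-level mean and standard deviation and a KL divergence between plan-length histograms, can be computed with the standard library alone. The sketch below is illustrative: the function names, the additive smoothing, and every input value are assumptions made here, not figures or choices from the paper.

```python
import math
from collections import Counter
from statistics import mean, stdev


def histogram(values, support, smoothing=1.0):
    """Additively smoothed probability distribution over `support`."""
    counts = Counter(values)
    total = len(values) + smoothing * len(support)
    return {k: (counts[k] + smoothing) / total for k in support}


def kl_divergence(p, q):
    """KL(p || q) in nats; p and q must share keys and q must be nonzero."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)


def summarize_runs(success_rates):
    """Mean and sample standard deviation across independent runs."""
    return mean(success_rates), stdev(success_rates)


# Placeholder inputs for illustration only; these are not measurements.
train_lengths = [3, 4, 4, 5, 6]   # plan lengths in the synthetic training set
test_lengths = [3, 4, 5, 5, 7]    # plan lengths observed on test tasks
support = range(1, 9)

p = histogram(train_lengths, support)
q = histogram(test_lengths, support)
shift = kl_divergence(p, q)
mean_rate, std_rate = summarize_runs([0.5, 0.6, 0.7])
print(shift, mean_rate, std_rate)
```

Smoothing is used here so that a plan length seen in only one of the two sets does not make the divergence infinite; with three runs the sample standard deviation is a coarse estimate, so it is best read alongside the per-run values.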

Referee: [Synthetic data generation description] The method of annotating ground-truth trajectories with feasible plans and augmenting with diverse synthetic examples lacks any explicit diversity metric, distribution shift analysis, or ablation isolating the augmentation's contribution. This directly bears on the weakest assumption that the approach produces plans that generalize to unseen tasks rather than overfitting to annotation-specific structures in the WebArena/WebVoyager trajectories.

Authors: We concur that these supporting analyses are currently absent and would strengthen the paper. We will revise the Synthetic Data Generation section to introduce (1) a quantitative diversity metric (average pairwise Levenshtein distance across plan sequences plus entropy over high-level action vocabularies), (2) a distribution-shift comparison (embedding cosine similarity and n-gram overlap between synthetic and held-out real trajectories), and (3) an ablation table in the Experiments section that trains the Planner on annotated trajectories alone versus the full augmented set, isolating the augmentation's contribution to success rate. These additions directly address the generalization concern.
Revision: yes
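The diversity metric in item (1) is simple to prototype. The sketch below treats each plan as a sequence of high-level action labels, which is an assumption about the representation, and the example plans are placeholders rather than data from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of steps)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def mean_pairwise_distance(plans):
    """Average Levenshtein distance over all unordered pairs of plans."""
    pairs = list(combinations(plans, 2))
    if not pairs:
        return 0.0
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)


def action_entropy(plans):
    """Shannon entropy (bits) of the action labels pooled across plans."""
    counts = Counter(step for plan in plans for step in plan)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Placeholder plans for illustration only.
plans = [
    ["search", "filter", "open"],
    ["search", "open"],
    ["login", "search", "open", "submit"],
]
print(mean_pairwise_distance(plans), action_entropy(plans))
```

Raw edit distance grows with plan length, so when plans vary widely in length it may be worth dividing each pairwise distance by the longer plan's length before averaging.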

Circularity Check

0 steps flagged

No significant circularity; empirical framework with independent evaluation

The paper introduces a Plan-and-Act framework separating Planner and Executor models, trained via annotation of ground-truth trajectories plus synthetic augmentation. No equations, fitted parameters, or derivations are present that reduce to their inputs by construction. Claims rest on benchmark success rates (57.58% on WebArena-Lite, 81.36% on WebVoyager) rather than self-referential definitions or load-bearing self-citations. The method is evaluated on external benchmarks, and its predictions do not reduce to the annotation inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the unproven assumption that synthetic plan annotation from trajectories produces generalizable planning ability and that the Planner-Executor split improves performance over joint models.

axioms (1)

domain assumption: LLMs fine-tuned on synthetic plan-annotated trajectories will generate plans that improve downstream execution success.
Invoked when the authors state that the synthetic data method enhances generalization for the Planner model.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
cs.SE · 2026-05 · unverdicted · novelty 7.0

MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.

State-Centric Decision Process
cs.AI · 2026-05 · unverdicted · novelty 7.0

SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.

Learning Agentic Policy from Action Guidance
cs.CL · 2026-05 · unverdicted · novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems
cs.CR · 2026-05 · unverdicted · novelty 7.0

FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.

From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
cs.AI · 2026-04 · unverdicted · novelty 7.0

OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).

ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories
cs.SE · 2026-04 · unverdicted · novelty 7.0

ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 11...

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
cs.CL · 2025-11 · unverdicted · novelty 7.0

Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

Security Considerations for Multi-agent Systems
cs.CR · 2026-03 · unverdicted · novelty 6.0

No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.

HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents
cs.AI · 2026-03 · unverdicted · novelty 6.0

HiMAC decomposes LLM agent tasks into macro planning and micro execution using critic-free hierarchical RL and iterative co-evolution, outperforming baselines on ALFWorld, WebShop, and Sokoban.

KGLAMP: Knowledge Graph-guided Language model for Adaptive Multi-robot Planning and Replanning
cs.RO · 2026-02 · unverdicted · novelty 6.0

KGLAMP uses a dynamically updated knowledge graph to guide LLMs in creating and replanning PDDL specifications for heterogeneous multi-robot teams, reporting at least 25.3% better performance than LLM-only or classica...

AlphaCast: A Human Wisdom-LLM Intelligence Co-Reasoning Framework for Interactive Time Series Forecasting
cs.AI · 2025-11 · conditional · novelty 6.0

AlphaCast is a training-free LLM framework that performs interactive multi-stage reasoning for time series forecasting by integrating feature extraction, knowledge bases, case libraries, and contextual pools.

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
cs.AI · 2026-05 · conditional · novelty 5.0

The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
cs.CL · 2026-05 · unverdicted · novelty 5.0

StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
cs.AI · 2026-04 · unverdicted · novelty 5.0

The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...

From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents
cs.AI · 2026-04 · unverdicted · novelty 5.0

AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.

End-to-end PDDL Planning with Hardcoded and Dynamic Agents
cs.AI · 2025-12 · unverdicted · novelty 5.0

An end-to-end LLM framework refines natural language into valid PDDL domains and problems via hardcoded and dynamic agents, generates plans with standard engines, and returns readable output.

Agentic Reasoning for Large Language Models
cs.AI · 2026-01 · unverdicted · novelty 4.0

The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...

Reference graph

Works this paper leans on

101 extracted references · 101 canonical work pages · cited by 17 Pith papers · 15 internal anchors

[1]

Agent-e: From autonomous web navigation to foundational design principles in agentic systems

Abuelsaad, T., Akkil, D., Dey, P., Jagmohan, A., Vem- paty, A., and Kokku, R. Agent-e: From autonomous web navigation to foundational design principles in agentic systems. arXiv preprint arXiv:2407.13032 , 2024

work page arXiv 2024
[2]

Digirl: Training in-the-wild device- control agents with autonomous reinforcement learn- ing

Bai, H., Zhou, Y ., Cemri, M., Pan, J., Suhr, A., Levine, S., and Kumar, A. Digirl: Training in-the-wild device- control agents with autonomous reinforcement learn- ing. arXiv preprint arXiv:2406.11896, 2024

work page arXiv 2024
[3]

T.-i., Gwak, M., Song, G., Kim, J., Kim, S., Lee, D., and Yeo, J

Chae, H., Kim, N., Ong, K. T.-i., Gwak, M., Song, G., Kim, J., Kim, S., Lee, D., and Yeo, J. Web agents with world models: Learning and leveraging environ- ment dynamics in web navigation. arXiv preprint arXiv:2410.13232, 2024

work page arXiv 2024
[4]

Mind2web: Towards a generalist agent for the web

Deng, X., Gu, Y ., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., and Su, Y . Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[5]

E., Lee, N., Jha, S., Kim, S., Tabrizi, R., Moon, S., Hooper, C., Anumanchipalli, G., Keutzer, K., and Gholami, A

Erdogan, L. E., Lee, N., Jha, S., Kim, S., Tabrizi, R., Moon, S., Hooper, C., Anumanchipalli, G., Keutzer, K., and Gholami, A. Tinyagent: Function calling at the edge. arXiv preprint arXiv:2409.00608, 2024

work page arXiv 2024
[6]

Multimodal web navigation with instruction-finetuned foundation models

Furuta, H., Lee, K.-H., Nachum, O., Matsuo, Y ., Faust, A., Gu, S. S., and Gur, I. Multimodal web navigation with instruction-finetuned foundation models. arXiv preprint arXiv:2305.11854, 2023

work page arXiv 2023
[7]

Exposing limitations of language model agents in sequential- task compositions on the web

Furuta, H., Matsuo, Y ., Faust, A., and Gur, I. Exposing limitations of language model agents in sequential- task compositions on the web. arXiv preprint arXiv:2311.18751, 2023

work page arXiv 2023
[8]

S., Matsuo, Y ., Faust, A., Zen, H., and Gur, I

Furuta, H., Lee, K.-H., Gu, S. S., Matsuo, Y ., Faust, A., Zen, H., and Gur, I. Geometric-averaged preference optimization for soft preference labels. arXiv preprint arXiv:2409.06691, 2024

work page arXiv 2024
[9]

Is your llm secretly a world model of the internet? model-based planning for web agents

Gu, Y ., Zheng, B., Gou, B., Zhang, K., Chang, C., Srivastava, S., Xie, Y ., Qi, P., Sun, H., and Su, Y . Is your llm secretly a world model of the internet? model-based planning for web agents. arXiv preprint arXiv:2411.06559, 2024

work page arXiv 2024
[10]

Gunasekar, S., Zhang, Y ., Aneja, J., Mendes, C. C. T., Del Giorno, A., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek- r1: Incentivizing reasoning capability in llms via rein- forcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Understanding html with large language models

Gur, I., Nachum, O., Miao, Y ., Safdari, M., Huang, A., Chowdhery, A., Narang, S., Fiedel, N., and Faust, A. Understanding html with large language models. arXiv preprint arXiv:2210.03945, 2022

work page arXiv 2022
[13]

A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

Gur, I., Furuta, H., Huang, A., Safdari, M., Matsuo, Y ., Eck, D., and Faust, A. A real-world webagent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

He, H., Yao, W., Ma, K., Yu, W., Dai, Y ., Zhang, H., Lan, Z., and Yu, D. Webvoyager: Building an end-to- end web agent with large multimodal models. arXiv preprint arXiv:2401.13919, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Openwebvoyager: Build- ing multimodal web agents via iterative real-world ex- ploration, feedback and optimization

He, H., Yao, W., Ma, K., Yu, W., Zhang, H., Fang, T., Lan, Z., and Yu, D. Openwebvoyager: Build- ing multimodal web agents via iterative real-world ex- ploration, feedback and optimization. arXiv preprint arXiv:2410.19609, 2024

work page arXiv 2024
[16]

LoRA: Low-Rank Adaptation of Large Language Models

Hu, E. J., Shen, Y ., Wallis, P., Allen-Zhu, Z., Li, Y ., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[17]

S., Venkatesh, V

Kannan, S. S., Venkatesh, V . L., and Min, B.-C. Smart- llm: Smart multi-agent robot task planning using large language models. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pp. 12140–12147. IEEE, 2024

work page 2024
[18]

Language Models can Solve Computer Tasks

Kim, G., Baldi, P., and McAleer, S. Language models can solve computer tasks. arXiv preprint arxiv:2303.17491, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Li, Y ., Gu, Q., Wen, Z., Li, Z., Xing, T., Guo, S., Zheng, T., Zhou, X., Qu, X., Zhou, W., Zhang, Z., Shen, W., Liu, Q., Lin, C., Yang, J., Zhang, G., and Huang, W

Kim, S., Moon, S., Tabrizi, R., Lee, N., Mahoney, M. W., Keutzer, K., and Gholami, A. An llm com- piler for parallel function calling. arXiv preprint arXiv:2312.04511, 2023

work page arXiv 2023
[20]

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Koh, J. Y ., Lo, R., Jang, L., Duvvur, V ., Lim, M. C., Huang, P.-Y ., Neubig, G., Zhou, S., Salakhutdinov, R., and Fried, D. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Y., McAleer, S., Fried, D., and Salakhutdinov, R

Koh, J. Y ., McAleer, S., Fried, D., and Salakhutdinov, R. Tree search for language model agents. arXiv preprint arXiv:2407.01476, 2024

work page arXiv 2024
[22]

S., Reid, M., Matsuo, Y ., and Iwa- sawa, Y

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y ., and Iwa- sawa, Y . Large language models are zero-shot rea- soners. Advances in neural information processing systems, 35:22199–22213, 2022. 11 PLAN -AND -ACT: Improving Planning of Agents for Long-Horizon Tasks

work page 2022
[23]

L., Yao, S., Chen, Y ., Shen, P., Yu, H., Zhang, H., Zhang, X., Dong, Y ., et al

Lai, H., Liu, X., Iong, I. L., Yao, S., Chen, Y ., Shen, P., Yu, H., Zhang, H., Zhang, X., Dong, Y ., et al. Autowe- bglm: A large language model-based web navigating agent. In Proceedings of the 30th ACM SIGKDD Con- ference on Knowledge Discovery and Data Mining, pp. 5295–5306, 2024

work page 2024
[24]

W., Keutzer, K., and Gholami, A

Lee, N., Wattanawong, T., Kim, S., Mangalam, K., Shen, S., Anumanchipalli, G., Mahoney, M. W., Keutzer, K., and Gholami, A. Llm2llm: Boosting llms with novel iterative data enhancement. arXiv preprint arXiv:2403.15042, 2024

work page arXiv 2024
[25]

L., Xu, Y ., Song, X., Zhang, S., Lai, H., Liu, X., Zhao, H., et al

Liu, X., Zhang, T., Gu, Y ., Iong, I. L., Xu, Y ., Song, X., Zhang, S., Lai, H., Liu, X., Zhao, H., et al. Visuala- gentbench: Towards large multimodal models as visual foundation agents. arXiv preprint arXiv:2408.06327, 2024

work page arXiv 2024
[26]

Wilbur: Adaptive in-context learn- ing for robust and accurate web agents

Lutz, M., Bohra, A., Saroyan, M., Harutyunyan, A., and Campagna, G. Wilbur: Adaptive in-context learn- ing for robust and accurate web agents. arXiv preprint arXiv:2404.05902, 2024

work page arXiv 2024
[27]

E., Kim, S., Lim, W., Keutzer, K., and Gholami, A

Moon, S., Jha, S., Erdogan, L. E., Kim, S., Lim, W., Keutzer, K., and Gholami, A. Efficient and scalable es- timation of tool representations in vector space. arXiv preprint arXiv:2409.02141, 2024

work page arXiv 2024
[28]

Murty, S., Bahdanau, D., and Manning, C. D. Nnetscape navigator: Complex demonstrations for web agents without a demonstrator. arXiv preprint arXiv:2410.02907, 2024

work page arXiv 2024
[29]

Long-horizon planning for multi-agent robots in partially observ- able environments

Nayak, S., Morrison Orozco, A., Have, M., Zhang, J., Thirumalai, V ., Chen, D., Kapoor, A., Robinson, E., Gopalakrishnan, K., Harrison, J., et al. Long-horizon planning for multi-agent robots in partially observ- able environments. Advances in Neural Information Processing Systems, 37:67929–67967, 2024

work page 2024
[30]

F., Madaan, A., Liu, J., Lo, R., Srid- har, A., Sengupta, S., Roth, D., Neubig, G., and Zhou, S

Ou, T., Xu, F. F., Madaan, A., Liu, J., Lo, R., Srid- har, A., Sengupta, S., Roth, D., Neubig, G., and Zhou, S. Synatra: Turning indirect knowledge into direct demonstrations for digital agents at scale. arXiv preprint arXiv:2409.15637, 2024

work page arXiv 2024
[31]

Autonomous evaluation and refinement of digital agents

Pan, J., Zhang, Y ., Tomlin, N., Zhou, Y ., Levine, S., and Suhr, A. Autonomous evaluation and refinement of digital agents. arXiv preprint arXiv:2404.06474 , 2024

work page arXiv 2024
[32]

Large language models can self-improve at web agent tasks

Patel, A., Hofmarcher, M., Leoveanu-Condrei, C., Dinu, M.-C., Callison-Burch, C., and Hochreiter, S. Large language models can self-improve at web agent tasks. arXiv preprint arXiv:2405.20309, 2024

work page arXiv 2024
[33]

Tinyclick: Single-turn agent for empowering gui au- tomation

Pawlowski, P., Zawistowski, K., Lapacz, W., Skorupa, M., Wiacek, A., Postansque, S., and Hoscilowicz, J. Tinyclick: Single-turn agent for empowering gui au- tomation. arXiv preprint arXiv:2410.11871, 2024

work page arXiv 2024
[34]

Adapt: As-needed decomposition and planning with language models

Prasad, A., Koller, A., Hartmann, M., Clark, P., Sabhar- wal, A., Bansal, M., and Khot, T. Adapt: As-needed decomposition and planning with language models. arXiv preprint arXiv:2311.05772, 2023

work page arXiv 2023
[35]

L., Lai, H., Sun, X., Yang, X., Sun, J., Yang, Y ., Yao, S., Zhang, T., et al

Qi, Z., Liu, X., Iong, I. L., Lai, H., Sun, X., Yang, X., Sun, J., Yang, Y ., Yao, S., Zhang, T., et al. We- brl: Training llm web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337, 2024

work page arXiv 2024
[36]

D., Ermon, S., and Finn, C

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimiza- tion: Your language model is secretly a reward model. Advances in Neural Information Processing Systems , 36, 2024

work page 2024
[37]

Androidinthewild: A large-scale dataset for android device control

Rawles, C., Li, A., Rodriguez, D., Riva, O., and Lilli- crap, T. Androidinthewild: A large-scale dataset for android device control. Advances in Neural Informa- tion Processing Systems, 36, 2024

work page 2024
[38]

World of bits: An open-domain platform for web-based agents

Shi, T., Karpathy, A., Fan, L., Hernandez, J., and Liang, P. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pp. 3135–3144. PMLR, 2017

work page 2017
[39]

Heap: Hierarchical policies for web actions using llms

Sodhi, P., Branavan, S., and McDonald, R. Heap: Hierarchical policies for web actions using llms. arXiv preprint arXiv:2310.03720, 2023

work page arXiv 2023
[40]

H., Wu, J., Washington, C., Sadler, B

Song, C. H., Wu, J., Washington, C., Sadler, B. M., Chao, W.-L., and Su, Y . Llm-planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2998– 3009, 2023

work page 2023
[41]

F., Zhou, S., and Neubig, G

Song, Y ., Xu, F. F., Zhou, S., and Neubig, G. Beyond browsing: Api-based web agents. 2024

work page 2024
[42]

F., Zhu, H., and Zhou, S

Sridhar, A., Lo, R., Xu, F. F., Zhu, H., and Zhou, S. Hierarchical prompting assists large language model on web navigation. arXiv preprint arXiv:2305.14257, 2023

work page arXiv 2023
[43]

Adaplanner: Adaptive planning from feedback with language models

Sun, H., Zhuang, Y ., Kong, L., Dai, B., and Zhang, C. Adaplanner: Adaptive planning from feedback with language models. Advances in neural information processing systems, 36:58202–58245, 2023

work page 2023
[44]

Sutton, R. S. Reinforcement learning: An introduction. A Bradford Book, 2018. 12 PLAN -AND -ACT: Improving Planning of Agents for Long-Horizon Tasks

work page 2018
[45]

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y ., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following llama model, 2023

work page 2023
[46]

Qwq-32b: Embracing the power of reinforcement learning, March 2025

Team, Q. Qwq-32b: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/

[47]

A survey on data synthesis and augmentation for large language models

Wang, K., Zhu, J., Ren, M., Liu, Z., Li, S., Zhang, Z., Zhang, C., Wu, X., Zhan, Q., Liu, Q., et al. A survey on data synthesis and augmentation for large language models. arXiv preprint arXiv:2410.12896, 2024

[48]

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models

Wang, L., Xu, W., Lan, Y., Hu, Z., Lan, Y., Lee, R. K.-W., and Lim, E.-P. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091, 2023

[49]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022

[50]

Agent Workflow Memory

Wang, Z. Z., Mao, J., Fried, D., and Neubig, G. Agent workflow memory. arXiv preprint arXiv:2409.07429, 2024

[51]

Chain-of-thought prompting elicits reasoning in large language models

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

[52]

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T. J., Cheng, Z., Shin, D., Lei, F., et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972, 2024

[53]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023

[54]

Agentoccam: A simple yet strong baseline for llm-based web agents

Yang, K., Liu, Y., Chaudhary, S., Fakoor, R., Chaudhari, P., Karypis, G., and Rangwala, H. Agentoccam: A simple yet strong baseline for llm-based web agents. arXiv preprint arXiv:2410.13825, 2024

[55]

React meets actre: Autonomous annotations of agent trajectories for contrastive self-training

Yang, Z., Li, P., Yan, M., Zhang, J., Huang, F., and Liu, Y. React meets actre: Autonomous annotations of agent trajectories for contrastive self-training. arXiv preprint arXiv:2403.14589, 2024

[56]

Webshop: Towards scalable real-world web interaction with grounded language agents

Yao, S., Chen, H., Yang, J., and Narasimhan, K. Webshop: Towards scalable real-world web interaction with grounded language agents. arXiv preprint arXiv:2207.01206, 2022

[57]

ReAct: Synergizing Reasoning and Acting in Language Models

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

[58]

AppAgent: Multimodal Agents as Smartphone Users

Zhang, C., Yang, Z., Liu, J., Han, Y., Chen, X., Huang, Z., Fu, B., and Yu, G. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023

[59]

Large language model-brained gui agents: A survey

Zhang, C., He, S., Qian, J., Li, B., Li, L., Qin, S., Kang, Y., Ma, M., Lin, Q., Rajmohan, S., et al. Large language model-brained gui agents: A survey. arXiv preprint arXiv:2411.18279, 2024

[60]

Webpilot: A versatile and autonomous multi-agent system for web task execution with strategic exploration

Zhang, Y., Ma, Z., Ma, Y., Han, Z., Wu, Y., and Tresp, V. Webpilot: A versatile and autonomous multi-agent system for web task execution with strategic exploration. arXiv preprint arXiv:2408.15978, 2024

[61]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.

A. Appendix

A.1. Planner and Executor Output Examples

• Task: ”Fro...

[62]

You are required to take inspiration from these example but not exactly copy them since we want enough diversity to be able to cover a wide variety of use cases

[63]

shopping_admin

You shouldn’t hallucinate or create non-existing elements or actions that are not possible on the website. If you make up something that is not possible on the website, you will be penalized. Your data needs to be grounded on the website and the examples given.
{examples_str}

A.6.2. Synthetic Plan Generator User Message

Use the given examples to generate...

[64]

Read the given user query and the plan carefully

[65]

Identify what this data points is trying to do and what can the planner model learn from being trained on this data point and data points like it

[66]

Provide clear reasoning for your classification decision

[67]

{classification_section_for_website} General guidelines:

Classify the data point into one of the known failure classes for that website or "Other" if no class fits; specifically, you should classify the failure class that this data point will help the planner avoid if it was trained on this data point and data points like it

Below is the set of possible classes for the website: {website.value}.
{classification_...

[68]

Carefully check the user query and plan

[69]

Match them against the class definitions

[70]

If none of the classes apply, label as "Other"

[71]

Class A",

Provide your output in the following format:

## Reasoning
[Explain your thought process and why this example fits the chosen class]

## Classification
[Class label: "Class A", "Class B", "Other", etc.]

Please ensure your output follows this exact format.

Here are the prompts for the fai...

[72]

Show me the name of the customers who have expressed dissatisfaction with Chloe tank

"Show me the name of the customers who have expressed dissatisfaction with Chloe tank"
- Error: Planner used exact "chloe tank" search instead of broader "chloe" search that would have found "chloe plastic tank"

[73]

List the top 3 search terms in my store

"List the top 3 search terms in my store"
- Error: Planner incorrectly included date filtering steps which don’t exist in search terms report
- Solution: Training data showing correct navigation of "search terms" report without date filtering

## Class B: Product Attribute Update Confusion
### Description
The planner confuses high-level status changes with...

[74]

Mark all Hollister shirts on sale

"Mark all Hollister shirts on sale"
- Error: Planner used general status change instead of specific sale attribute update
- Solution: Training data showing how to update sale attributes specifically using 'update attributes' option

[75]

Make all Aeno capri as out of stock

"Make all Aeno capri as out of stock"
- Error: Planner tried using Enable/Disable status instead of stock attribute
- Solution: More examples of updating product attributes vs changing status

## Class C: Review Analysis Navigation Failures
### Description
The planner fails to properly navigate and analyze product reviews:
- Missing steps to access product...

[76]

Tell me the reasons why customers like Circe’s products

"Tell me the reasons why customers like Circe’s products"
- Error: Planner didn’t include steps to access and analyze review content
- Solution: Training data showing how to navigate to and analyze review sections

## Other
Description: If none of the above classes match.

A.7.3. Reddit Failure Classes Prompt

# Reddit Website Classes
## Class A: Content Re...

[77]

Re-post the image of costume contest to funny subreddit

"Re-post the image of costume contest to funny subreddit"
- Error: Planner created new post instead of using existing repost functionality
- Solution: Training data showing correct repost/crosspost workflow

## Other
Description: If none of the above classes match.

A.7.4. GitLab Failure Classes Prompt

# GitLab Website Classes
## Class A: Issue/MR Navigati...

[78]

Open my latest created issue that has homepage content in its title

"Open my latest created issue that has homepage content in its title"
- Error: Planner used global search instead of navigating through Issues tab and filters
- Solution: Training data showing navigation through Issues section with proper filtering

[79]

Checkout merge requests requiring my review

"Checkout merge requests requiring my review"
- Error: Planner attempted repository search instead of using MR section with review filter
- Solution: Examples showing how to access personal merge requests

## Class B: Profile/Project Settings Navigation Errors
### Description
The planner fails to locate correct paths for user/project settings:
- Not identi...

[80]

Set my gitlab status as Enjoying life

"Set my gitlab status as Enjoying life"
- Error: Planner looked for non-existent "Edit status" button instead of profile settings path
- Solution: Training data showing how to update profile settings and status


Showing first 80 references.