Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks
Pith reviewed 2026-05-17 21:27 UTC · model grok-4.3
The pith
Plan-and-Act improves LLM agent performance on long-horizon tasks by separating planning from execution and training the planner with synthetic data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that explicitly incorporating planning into LLM agents, through a dedicated Planner model trained on ground-truth trajectories annotated with feasible plans and augmented with diverse synthetic examples, yields a Plan-and-Act system that reaches a state-of-the-art 57.58% success rate on WebArena-Lite and a text-only state-of-the-art 81.36% success rate on WebVoyager in web navigation tasks.
What carries the argument
The Plan-and-Act framework consisting of a Planner model that generates high-level structured plans and an Executor model that translates plans into environment-specific actions, enabled by a synthetic data generation process for training.
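The Planner/Executor split described here can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: `toy_planner`, `toy_executor`, `run_plan_and_act`, and the action strings are hypothetical stand-ins, not the paper's models, prompts, or action space. The point is only the control flow, in which a plan is produced once from the goal and each high-level step is then translated into low-level actions.

```python
from typing import Callable, List


def run_plan_and_act(
    goal: str,
    planner: Callable[[str], List[str]],
    executor: Callable[[str], List[str]],
) -> List[str]:
    """Generate a high-level plan, then expand each step into actions."""
    plan = planner(goal)
    actions: List[str] = []
    for step in plan:
        # The executor only sees one plan step at a time.
        actions.extend(executor(step))
    return actions


def toy_planner(goal: str) -> List[str]:
    # Hypothetical stand-in for a fine-tuned Planner model.
    return [f"search for {goal}", "open first result"]


def toy_executor(step: str) -> List[str]:
    # Hypothetical stand-in for an Executor model emitting web actions.
    prefix = "search for "
    if step.startswith(prefix):
        query = step[len(prefix):]
        return [f'type("{query}")', 'click("search")']
    return [f'click("{step}")']


actions = run_plan_and_act("blue shoes", toy_planner, toy_executor)
print(actions)
```

In a real system both callables would wrap LLM calls, and the executor would also receive the current page observation; the sketch leaves that out to keep the separation of roles visible.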
Load-bearing premise
That adding plans to existing trajectories and creating synthetic examples will train a planner that works on new tasks without overfitting to the specific annotation or generation process.
What would settle it
Running the trained planner on a set of web tasks that differ substantially from the training trajectories and finding success rates no higher than those of agents without explicit planning.
Original abstract
Large language models (LLMs) have shown remarkable advancements in enabling language agents to tackle simple tasks. However, applying them for complex, multi-step, long-horizon tasks remains a challenge. Recent work have found success by separating high-level planning from low-level execution, which enables the model to effectively balance high-level planning objectives and low-level execution details. However, generating accurate plans remains difficult since LLMs are not inherently trained for this task. To address this, we propose Plan-and-Act, a novel framework that incorporates explicit planning into LLM-based agents and introduces a scalable method to enhance plan generation through a novel synthetic data generation method. Plan-and-Act consists of a Planner model which generates structured, high-level plans to achieve user goals, and an Executor model that translates these plans into environment-specific actions. To train the Planner effectively, we introduce a synthetic data generation method that annotates ground-truth trajectories with feasible plans, augmented with diverse and extensive examples to enhance generalization. We evaluate Plan-and-Act using web navigation as a representative long-horizon planning environment, demonstrating a state-of-the-art 57.58% success rate on the WebArena-Lite benchmark as well as a text-only state-of-the-art 81.36% success rate on WebVoyager.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Plan-and-Act, a framework separating a Planner model that generates high-level structured plans from an Executor that translates plans into environment actions for LLM agents on long-horizon tasks. It introduces a synthetic data generation method that annotates ground-truth trajectories with feasible plans and augments them with diverse synthetic examples to train the Planner. Evaluated on web navigation, it reports state-of-the-art success rates of 57.58% on WebArena-Lite and 81.36% on WebVoyager.
Significance. If the empirical results hold after addressing evaluation gaps, this work could advance LLM agent capabilities for complex multi-step tasks by providing a scalable synthetic data approach to improve plan generation and generalization. The planner-executor separation is a promising architectural choice, and the reported benchmark numbers indicate potential practical utility if the gains are attributable to the proposed method rather than unexamined factors.
major comments (2)
- [Abstract and Evaluation section] The manuscript states SOTA success rates of 57.58% on WebArena-Lite and 81.36% on WebVoyager but supplies no information on the specific baselines, number of evaluation runs, variance or statistical tests, or how the synthetic data was validated against distribution shift. This makes it impossible to determine whether the rates support the central claim of improved planning generalization.
- [Synthetic data generation description] The method of annotating ground-truth trajectories with feasible plans and augmenting with diverse synthetic examples lacks any explicit diversity metric, distribution shift analysis, or ablation isolating the augmentation's contribution. This directly bears on the weakest assumption that the approach produces plans that generalize to unseen tasks rather than overfitting to annotation-specific structures in the WebArena/WebVoyager trajectories.
minor comments (2)
- [Abstract] The abstract contains a minor grammatical issue: 'Recent work have found success' should read 'Recent work has found success'.
- [Method] Clarify the base LLM models and training details for the Planner and Executor to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which has identified opportunities to strengthen the clarity and rigor of our empirical claims. We address each major comment below and will incorporate the suggested additions in the revised manuscript.
Point-by-point responses
-
Referee: [Abstract and Evaluation section] The manuscript states SOTA success rates of 57.58% on WebArena-Lite and 81.36% on WebVoyager but supplies no information on the specific baselines, number of evaluation runs, variance or statistical tests, or how the synthetic data was validated against distribution shift. This makes it impossible to determine whether the rates support the central claim of improved planning generalization.
Authors: We agree that more granular reporting is needed to substantiate the SOTA claims. In the revised Evaluation section we will add: (i) an explicit list of all baselines with citations, (ii) results averaged over three independent runs with standard deviations, (iii) statistical significance via paired t-tests (p-values reported), and (iv) a short appendix subsection validating synthetic data against distribution shift using KL divergence on plan-length and step-type histograms between the generated training distribution and the WebArena/WebVoyager test sets. These details will make the generalization argument directly verifiable.
Revision: yes
-
Referee: [Synthetic data generation description] The method of annotating ground-truth trajectories with feasible plans and augmenting with diverse synthetic examples lacks any explicit diversity metric, distribution shift analysis, or ablation isolating the augmentation's contribution. This directly bears on the weakest assumption that the approach produces plans that generalize to unseen tasks rather than overfitting to annotation-specific structures in the WebArena/WebVoyager trajectories.
Authors: We concur that these supporting analyses are currently absent and would strengthen the paper. We will revise the Synthetic Data Generation section to introduce (1) a quantitative diversity metric (average pairwise Levenshtein distance across plan sequences plus entropy over high-level action vocabularies), (2) a distribution-shift comparison (embedding cosine similarity and n-gram overlap between synthetic and held-out real trajectories), and (3) an ablation table in the Experiments section that trains the Planner on annotated trajectories alone versus the full augmented set, isolating the augmentation's contribution to success rate. These additions directly address the generalization concern.
Revision: yes
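The metrics named in the two simulated responses (average pairwise Levenshtein distance over plan sequences, entropy over the action vocabulary, and KL divergence between plan-length histograms) could be computed along the following lines. This is a sketch under stated assumptions, not the authors' analysis: plans are assumed to be lists of step labels, the toy data is invented, and add-one smoothing for the histograms is a choice made here so that the KL divergence stays finite.

```python
import itertools
import math
from collections import Counter
from typing import Dict, List, Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (strings or lists of steps)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def mean_pairwise_distance(plans: List[List[str]]) -> float:
    """Average Levenshtein distance over all unordered pairs of plans."""
    pairs = list(itertools.combinations(plans, 2))
    if not pairs:
        return 0.0
    return sum(levenshtein(p, q) for p, q in pairs) / len(pairs)


def step_entropy(plans: List[List[str]]) -> float:
    """Shannon entropy (bits) of the step-label distribution."""
    counts = Counter(step for plan in plans for step in plan)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def smoothed_histogram(values: List[int], support: List[int],
                       alpha: float = 1.0) -> Dict[int, float]:
    """Normalized histogram with additive smoothing (assumed, alpha=1)."""
    counts = Counter(values)
    total = len(values) + alpha * len(support)
    return {k: (counts.get(k, 0) + alpha) / total for k in support}


def kl_divergence(p: Dict[int, float], q: Dict[int, float]) -> float:
    """KL(p || q) in nats over a shared support."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)


# Invented toy data: three plans and two sets of plan lengths.
plans = [
    ["search", "click", "extract"],
    ["search", "filter", "click", "extract"],
    ["navigate", "click"],
]
synthetic_lengths = [3, 4, 4, 5, 6]
test_lengths = [3, 4, 5, 5, 7]
support = sorted(set(synthetic_lengths) | set(test_lengths))
p = smoothed_histogram(synthetic_lengths, support)
q = smoothed_histogram(test_lengths, support)

diversity = mean_pairwise_distance(plans)
entropy_bits = step_entropy(plans)
shift = kl_divergence(p, q)
print(diversity, entropy_bits, shift)
```

Note that KL divergence is asymmetric, so the direction (synthetic versus test) has to be fixed and reported alongside the number.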
Circularity Check
No significant circularity; empirical framework with independent evaluation
The paper introduces a Plan-and-Act framework separating Planner and Executor models, trained via annotation of ground-truth trajectories plus synthetic augmentation. No equations, fitted parameters, or derivations are present that reduce to their inputs by construction. Claims rest on benchmark success rates (57.58% on WebArena-Lite, 81.36% on WebVoyager) rather than self-referential definitions or load-bearing self-citations. The method is self-contained against external benchmarks with no reduction of predictions to annotation inputs by design.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs fine-tuned on synthetic plan-annotated trajectories will generate plans that improve downstream execution success.
Forward citations
Cited by 17 Pith papers
-
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.
-
State-Centric Decision Process
SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
-
Learning Agentic Policy from Action Guidance
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
-
FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems
FlowSteer is a prompt-only attack that biases multi-agent LLM workflow planning to propagate malicious signals, raising success rates by up to 55%, with FlowGuard as an input-side defense reducing it by up to 34%.
-
From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).
-
ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories
ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 11...
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
Security Considerations for Multi-agent Systems
No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.
-
HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents
HiMAC decomposes LLM agent tasks into macro planning and micro execution using critic-free hierarchical RL and iterative co-evolution, outperforming baselines on ALFWorld, WebShop, and Sokoban.
-
KGLAMP: Knowledge Graph-guided Language model for Adaptive Multi-robot Planning and Replanning
KGLAMP uses a dynamically updated knowledge graph to guide LLMs in creating and replanning PDDL specifications for heterogeneous multi-robot teams, reporting at least 25.3% better performance than LLM-only or classica...
-
AlphaCast: A Human Wisdom-LLM Intelligence Co-Reasoning Framework for Interactive Time Series Forecasting
AlphaCast is a training-free LLM framework that performs interactive multi-stage reasoning for time series forecasting by integrating feature extraction, knowledge bases, case libraries, and contextual pools.
-
Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.
-
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
-
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
-
From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents
AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.
-
End-to-end PDDL Planning with Hardcoded and Dynamic Agents
An end-to-end LLM framework refines natural language into valid PDDL domains and problems via hardcoded and dynamic agents, generates plans with standard engines, and returns readable output.
-
Agentic Reasoning for Large Language Models
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...
A. Appendix
A.1. Planner and Executor Output Examples
• Task: "Fro...
- You are required to take inspiration from these example but not exactly copy them since we want enough diversity to be able to cover a wide variety of use cases
- You shouldn’t hallucinate or create non-existing elements or actions that are not possible on the website. If you make up something that is not possible on the website, you will be penalized. Your data needs to be grounded on the website and the examples given.
{examples_str}
A.6.2. Synthetic Plan Generator User Message
Use the given examples to generate...
- Read the given user query and the plan carefully
- Identify what this data points is trying to do and what can the planner model learn from being trained on this data point and data points like it
- Provide clear reasoning for your classification decision
- Classify the data point into one of the known failure classes for that website or "Other" if no class fits; specifically, you should classify the failure class that this data point will help the planner avoid if it was trained on this data point and data points like it
Below is the set of possible classes for the website: {website.value}.
{classification_section_for_website}
General guidelines:
- Carefully check the user query and plan
- Match them against the class definitions
- If none of the classes apply, label as "Other"
- Provide your output in the following format:
## Reasoning
[Explain your thought process and why this example fits the chosen class]
## Classification
[Class label: "Class A", "Class B", "Other", etc.]
Please ensure your output follows this exact format.
Here are the prompts for the fai...
- "Show me the name of the customers who have expressed dissatisfaction with Chloe tank"
  - Error: Planner used exact "chloe tank" search instead of broader "chloe" search that would have found "chloe plastic tank"
- "List the top 3 search terms in my store"
  - Error: Planner incorrectly included date filtering steps which don’t exist in search terms report
  - Solution: Training data showing correct navigation of "search terms" report without date filtering
## Class B: Product Attribute Update Confusion
### Description
The planner confuses high-level status changes with...
- "Mark all Hollister shirts on sale"
  - Error: Planner used general status change instead of specific sale attribute update
  - Solution: Training data showing how to update sale attributes specifically using ’update attributes’ option
- "Make all Aeno capri as out of stock"
  - Error: Planner tried using Enable/Disable status instead of stock attribute
  - Solution: More examples of updating product attributes vs changing status
## Class C: Review Analysis Navigation Failures
### Description
The planner fails to properly navigate and analyze product reviews:
- Missing steps to access product...
- "Tell me the reasons why customers like Circe’s products"
  - Error: Planner didn’t include steps to access and analyze review content
  - Solution: Training data showing how to navigate to and analyze review sections
## Other
Description: If none of the above classes match.
A.7.3. Reddit Failure Classes Prompt
# Reddit Website Classes
## Class A: Content Re...
- "Re-post the image of costume contest to funny subreddit"
  - Error: Planner created new post instead of using existing repost functionality
  - Solution: Training data showing correct repost/crosspost workflow
## Other
Description: If none of the above classes match.
A.7.4. GitLab Failure Classes Prompt
# GitLab Website Classes
## Class A: Issue/MR Navigati...
- "Open my latest created issue that has homepage content in its title"
  - Error: Planner used global search instead of navigating through Issues tab and filters
  - Solution: Training data showing navigation through Issues section with proper filtering
- "Checkout merge requests requiring my review"
  - Error: Planner attempted repository search instead of using MR section with review filter
  - Solution: Examples showing how to access personal merge requests
## Class B: Profile/Project Settings Navigation Errors
### Description
The planner fails to locate correct paths for user/project settings:
- Not identi...
- "Set my gitlab status as Enjoying life"
  - Error: Planner looked for non-existent "Edit status" button instead of profile settings path
  - Solution: Training data showing how to update profile settings and status
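The classification prompt excerpted above asks for output under a "## Reasoning" heading followed by a "## Classification" heading. A consumer of that output might parse it roughly as follows. This is a hedged sketch: `parse_classification` and the sample response are hypothetical, and the paper's own parsing code is not shown in this excerpt.

```python
import re
from typing import Tuple


def parse_classification(output: str) -> Tuple[str, str]:
    """Split a response into (reasoning, class label).

    Assumes the two headings appear once each, in the order the
    prompt requests; raises ValueError otherwise.
    """
    match = re.search(
        r"##\s*Reasoning\s*(.*?)\s*##\s*Classification\s*(.*)",
        output,
        flags=re.DOTALL,
    )
    if match is None:
        raise ValueError("response does not follow the requested format")
    reasoning = match.group(1).strip()
    # Models sometimes echo the quotes around the label; drop them.
    label = match.group(2).strip().strip('"')
    return reasoning, label


# Invented sample response for illustration only.
sample = (
    "## Reasoning\n"
    "The plan searches for the exact phrase.\n"
    "## Classification\n"
    '"Class A"'
)
reasoning, label = parse_classification(sample)
print(label)
```

Raising on malformed output, rather than defaulting to "Other", keeps format failures from silently inflating the "Other" class.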