
arxiv: 2402.02716 · v1 · submitted 2024-02-05 · 💻 cs.AI · cs.CL · cs.LG

Understanding the planning of LLM agents: A survey

Pith reviewed 2026-05-13 18:08 UTC · model grok-4.3

keywords LLM agents, planning, survey, task decomposition, plan selection, external module, reflection, memory

The pith

LLM agent planning falls into five categories: task decomposition, plan selection, external modules, reflection, and memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models increasingly act as planners inside autonomous agents, but the ways they generate and refine plans sit scattered across individual papers. This survey collects those approaches and sorts them into a single taxonomy with five parts. It examines the techniques used in each part and notes the challenges that remain. A reader who grasps the structure can see how current methods relate and where further work is needed.

Core claim

The paper establishes that existing research on LLM-based agent planning can be organized into five directions—Task Decomposition, Plan Selection, External Module, Reflection, and Memory—supplies detailed analyses of each direction, and identifies open challenges for the field.

What carries the argument

The taxonomy that divides LLM-agent planning methods into Task Decomposition, Plan Selection, External Module, Reflection, and Memory.

If this is right

  • Methods inside each category become easier to compare directly.
  • New research can target specific gaps identified within one category.
  • Hybrid systems that draw techniques from several categories may improve overall performance.
  • The field gains a shared vocabulary for describing planning steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent builders could test whether adding reflection or memory to existing decomposition methods raises success rates on long tasks.
  • Benchmarks might evaluate agents on each of the five dimensions separately to measure balanced improvement.
  • Pure text-based planning may remain limited until external modules or memory are routinely combined with it.

Load-bearing premise

The five categories capture the full space of LLM-agent planning methods without significant gaps or overlaps.

What would settle it

A new planning method for LLM agents that cannot be placed in any of the five categories would show the taxonomy is incomplete.
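
One way to make this test concrete is a small audit over method-to-category assignments. The sketch below is illustrative only: the `Category` enum mirrors the five directions named in the abstract, while `audit` and the placeholder method names are hypothetical and do not come from the paper.

```python
from enum import Enum


class Category(Enum):
    """The five directions named in the survey's abstract."""
    TASK_DECOMPOSITION = "Task Decomposition"
    PLAN_SELECTION = "Plan Selection"
    EXTERNAL_MODULE = "External Module"
    REFLECTION = "Reflection"
    MEMORY = "Memory"


def audit(assignments: dict[str, set[Category]]) -> tuple[list[str], list[str]]:
    """Return (gaps, overlaps) for a method-to-categories mapping.

    gaps:     methods that fit no category (evidence of incompleteness)
    overlaps: methods that fit more than one category (evidence of overlap)
    """
    gaps = sorted(name for name, cats in assignments.items() if not cats)
    overlaps = sorted(name for name, cats in assignments.items() if len(cats) > 1)
    return gaps, overlaps


# Hypothetical placeholder methods, not real papers.
example = {
    "method_a": {Category.TASK_DECOMPOSITION},
    "method_b": {Category.REFLECTION, Category.MEMORY},
    "method_c": set(),
}
gaps, overlaps = audit(example)
```

A non-empty `gaps` list would correspond to the counterexample described above; a non-empty `overlaps` list would flag methods that the partition does not separate cleanly.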

Original abstract

As Large Language Models (LLMs) have shown significant intelligence, the progress to leverage LLMs as planning modules of autonomous agents has attracted more attention. This survey provides the first systematic view of LLM-based agents planning, covering recent works aiming to improve planning ability. We provide a taxonomy of existing works on LLM-Agent planning, which can be categorized into Task Decomposition, Plan Selection, External Module, Reflection and Memory. Comprehensive analyses are conducted for each direction, and further challenges for the field of research are discussed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper surveys recent literature on planning capabilities in LLM-based autonomous agents. It claims to offer the first systematic overview by proposing a taxonomy that organizes existing works into five categories—Task Decomposition, Plan Selection, External Module, Reflection, and Memory—followed by per-category analyses and a discussion of open challenges.

Significance. If the taxonomy is shown to be both comprehensive and non-overlapping, the survey would provide a useful organizing framework for a fast-moving subfield, helping researchers identify patterns across methods and prioritize future work on LLM agent planning. The absence of original empirical claims or derivations means its contribution rests entirely on the quality and coverage of the categorization and synthesis.

major comments (1)
  1. [Taxonomy] Taxonomy section (implied by abstract and described structure): the five-category partition is presented without explicit criteria or decision rules for assigning a method to one category versus another. This risks overlap (e.g., many reflection techniques rely on memory buffers) and potential omissions; the paper should supply a clear assignment protocol plus a table mapping at least 10 representative cited works to categories to demonstrate exhaustiveness.
minor comments (2)
  1. [Abstract] Abstract: the assertion that the survey is the 'first systematic view' should be supported by a brief comparison to prior LLM-agent surveys in the introduction or related-work section.
  2. [Analyses] The per-category analyses would benefit from a summary table listing key methods, their core mechanisms, and reported performance highlights to improve readability and comparability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address the taxonomy concern below and will revise the manuscript accordingly to strengthen the presentation of the categorization framework.

Point-by-point responses
  1. Referee: [Taxonomy] Taxonomy section (implied by abstract and described structure): the five-category partition is presented without explicit criteria or decision rules for assigning a method to one category versus another. This risks overlap (e.g., many reflection techniques rely on memory buffers) and potential omissions; the paper should supply a clear assignment protocol plus a table mapping at least 10 representative cited works to categories to demonstrate exhaustiveness.

    Authors: We agree that the manuscript would benefit from explicit assignment criteria to minimize ambiguity around category boundaries. In the revised version, we will add a dedicated subsection in the Taxonomy section that defines an assignment protocol: a method is placed in the category corresponding to its primary planning mechanism (e.g., Reflection for iterative self-critique loops even if memory buffers are used secondarily; Memory for explicit storage/retrieval architectures). This protocol will be illustrated with decision rules and edge-case examples. We will also insert a new table mapping 15 representative works (selected for diversity across the five categories) to their assigned categories, with brief justification for each assignment. These additions directly address the risk of overlap and demonstrate coverage without altering the underlying taxonomy. revision: yes
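
The primary-mechanism rule described in this response can be sketched in a few lines. This is a hypothetical illustration, not the authors' protocol: the `assign` function, the "primary"/"secondary" role labels, and the example input are assumptions introduced here.

```python
from enum import Enum


class Category(Enum):
    """The five directions of the taxonomy."""
    TASK_DECOMPOSITION = "Task Decomposition"
    PLAN_SELECTION = "Plan Selection"
    EXTERNAL_MODULE = "External Module"
    REFLECTION = "Reflection"
    MEMORY = "Memory"


def assign(mechanisms: dict[Category, str]) -> Category:
    """Assign a method to the category of its single primary mechanism.

    `mechanisms` maps each category a method touches to a role label,
    either "primary" or "secondary" (labels are an assumption of this sketch).
    """
    primaries = [cat for cat, role in mechanisms.items() if role == "primary"]
    if len(primaries) != 1:
        # Zero or several primary mechanisms: the rule cannot decide.
        raise ValueError(f"expected exactly one primary mechanism, got {len(primaries)}")
    return primaries[0]


# A self-critique loop that also keeps a memory buffer (hypothetical method).
result = assign({Category.REFLECTION: "primary", Category.MEMORY: "secondary"})
```

Raising an error when no single primary mechanism exists keeps ambiguous cases visible instead of silently picking a category, which is where the promised edge-case examples would matter.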

Circularity Check

0 steps flagged

No significant circularity: descriptive survey taxonomy

Full rationale

The paper is a literature survey proposing a five-category taxonomy (Task Decomposition, Plan Selection, External Module, Reflection, Memory) for LLM-agent planning research. It contains no equations, derivations, fitted parameters, predictions, or self-referential definitions. The taxonomy is presented as an organizational framework for existing works rather than a derived result; no load-bearing steps reduce to self-citation chains or by-construction equivalences. The central claim of providing a 'first systematic view' is supported by citation of prior literature without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey the paper introduces no free parameters, axioms, or invented entities; it relies entirely on the cited prior literature.

pith-pipeline@v0.9.0 · 5396 in / 916 out tokens · 42655 ms · 2026-05-13T18:08:37.632590+00:00 · methodology


Forward citations

Cited by 54 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

    cs.CL 2026-05 unverdicted novelty 8.0

    A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

  2. Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.

  3. EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

    cs.CV 2026-05 unverdicted novelty 7.0

    EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.

  4. Uncertainty Propagation in LLM-Based Systems

    cs.SE 2026-04 unverdicted novelty 7.0

    This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insight...

  5. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  6. From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

    cs.AI 2026-04 unverdicted novelty 7.0

    OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).

  7. Evaluating Plan Compliance in Autonomous Programming Agents

    cs.SE 2026-04 unverdicted novelty 7.0

    Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade i...

  8. User Simulator-Guided Multi-Turn Preference Optimization for Reasoning LLM-based Conversational Recommendation

    cs.IR 2026-04 unverdicted novelty 7.0

    SMTPO uses multi-task SFT to improve simulator feedback quality and RL with fine-grained rewards to optimize multi-turn preference reasoning in LLM-based conversational recommendation.

  9. VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

    cs.CV 2026-01 unverdicted novelty 7.0

    VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.

  10. GenCellAgent: Generalizable, Training-Free Cellular Image Segmentation via Large Language Model Agents

    q-bio.QM 2025-10 unverdicted novelty 7.0

    GenCellAgent deploys a planner-executor-evaluator LLM agent loop to automatically select, adapt, and refine segmentation tools for diverse cellular microscopy images, matching or exceeding specialist performance on 4,...

  11. The Challenge and Reward of Fair Play in Narrative: A Computational Approach

    cs.CL 2025-07 unverdicted novelty 7.0

    Develops an information-theoretic framework showing surprise and coherence trade off in single reader models but coexist via pre- and post-revelation modes, operationalized as reference-less LLM metrics for fair play ...

  12. FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

    cs.CV 2025-06 unverdicted novelty 7.0

    FaSTA* combines LLM fast planning with A* search and inductive subroutine mining to create an efficient agent for multi-turn image editing tasks.

  13. How to Steer Your Multi-Agent System: Human-LLM Collaborative Planning

    cs.MA 2026-05 unverdicted novelty 6.0

    Formalizes design space for human-LLM collaborative planning along mode, scope, and level axes; evaluates AMBIPOM prototype via user study and benchmark revealing hybrid workflows and trade-offs.

  14. BLAgent: Agentic RAG for File-Level Bug Localization

    cs.SE 2026-05 unverdicted novelty 6.0

    BLAgent achieves over 78% Top-1 accuracy on SWE-bench Lite for file-level bug localization using agentic RAG, at 18x lower cost than baselines, and boosts end-to-end APR success by over 20%.

  15. PULSE: Agentic Investigation with Passive Sensing for Proactive Intervention in Cancer Survivorship

    cs.HC 2026-05 unverdicted novelty 6.0

    PULSE demonstrates that agentic LLM-based investigation of passive smartphone sensing data achieves balanced accuracies of 0.743 (with diary) and 0.713 (sensing-only) for predicting emotion regulation desire and inter...

  16. Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...

  17. From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

    cs.AI 2026-05 unverdicted novelty 6.0

    A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.

  18. FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

    cs.AI 2026-05 unverdicted novelty 6.0

    FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.

  19. A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

    cs.LG 2026-04 unverdicted novelty 6.0

    Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...

  20. QuantClaw: Precision Where It Matters for OpenClaw

    cs.AI 2026-04 unverdicted novelty 6.0

    QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

  21. SpecSyn: LLM-based Synthesis and Refinement of Formal Specifications for Real-world Program Verification

    cs.SE 2026-04 unverdicted novelty 6.0

    SpecSyn generates formal specifications with over 90% precision and 75% recall, successfully verifying 1071 out of 1365 target properties on open-source programs.

  22. From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration

    cs.MA 2026-03 unverdicted novelty 6.0

    A graph-based propagation model for error cascades in LLM multi-agent systems plus a genealogy-graph governance plugin that prevents final infection in at least 89% of runs across tested frameworks.

  23. HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents

    cs.AI 2026-03 unverdicted novelty 6.0

    HiMAC decomposes LLM agent tasks into macro planning and micro execution using critic-free hierarchical RL and iterative co-evolution, outperforming baselines on ALFWorld, WebShop, and Sokoban.

  24. SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

    cs.CR 2026-02 unverdicted novelty 6.0

    The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.

  25. When Should Users Check? Modeling Confirmation Frequency in Multi-Step Agentic AI Tasks

    cs.HC 2025-10 conditional novelty 6.0

    A decision-theoretic model based on the observed Confirmation-Diagnosis-Correction-Redo user pattern places intermediate confirmations in AI agent tasks, yielding 81% user preference and 13.54% faster completion versu...

  26. VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

    cs.CL 2025-09 unverdicted novelty 6.0

    VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserv...

  27. The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

    cs.AI 2025-09 accept novelty 6.0

    Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

  28. Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training

    cs.AI 2025-06 unverdicted novelty 6.0

    Mobile-R1 introduces a hierarchical three-stage curriculum that combines format alignment, verifiable action feedback, and multi-turn environment training to improve exploration and self-correction in VLM-based mobile...

  29. InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

    cs.AI 2025-04 unverdicted novelty 6.0

    InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding a...

  30. Retrieval-Augmented Generation for Natural Language Processing: A Survey

    cs.CL 2024-07 accept novelty 6.0

    The survey organizes RAG methods via a taxonomy of query-based, logits-based, latent, and parametric fusion with comparisons on accessibility, efficiency, applications, and challenges.

  31. Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

    cs.AI 2026-05 conditional novelty 5.0

    The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.

  32. Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    SLIM dynamically optimizes the active external skill set in agentic RL via leave-one-skill-out marginal contribution estimates and lifecycle operations, delivering a 7.1% average gain over baselines on ALFWorld and Se...

  33. Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling

    cs.CL 2026-05 unverdicted novelty 5.0

    Full-horizon planning with on-demand replanning achieves accuracy parity with single-step planning in tool-calling agents for knowledge base and multi-hop question answering while consuming 2-3 times fewer tokens.

  34. From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

    cs.AI 2026-05 conditional novelty 5.0

    Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.

  35. Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning

    cs.AI 2026-05 unverdicted novelty 5.0

    Novelty estimation via LLM prompts enables pruning in Tree-of-Thought search, reducing overall token usage on language planning benchmarks.

  36. A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

    cs.LG 2026-04 unverdicted novelty 5.0

    Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...

  37. Lightweight LLM Agent Memory with Small Language Models

    cs.AI 2026-04 unverdicted novelty 5.0

    LightMem uses SLMs to modularize agent memory into STM, MTM, and LTM with two-stage vector-plus-semantic retrieval online and incremental consolidation offline, reporting 2.5 F1 gains and low latency over A-MEM on LoCoMo.

  38. A Task Decomposition and Planning Framework for Efficient LLM Inference in AI-Enabled WiFi-Offload Networks

    cs.DC 2026-04 unverdicted novelty 4.0

    An LLM planner for task decomposition and a decomposition-aware scheduler in multi-user WiFi networks reduce average latency by 20% and improve overall reward by 80% versus local-only and nearest-edge baselines.

  39. Competition and Cooperation of LLM Agents in Games

    cs.MA 2026-04 unverdicted novelty 4.0

    LLM agents cooperate in two standard games due to fairness reasoning instead of converging to Nash equilibria under multi-round prompts.

  40. Toward a Safe Internet of Agents

    cs.MA 2025-11 unverdicted novelty 4.0

    The paper proposes a bottom-up framework for safe agentic AI systems that treats each component as a dual-use interface where added capabilities also expand attack surfaces across single agents, multi-agent systems, a...

  41. From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

    cs.AI 2025-04 accept novelty 4.0

    A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

  42. Multi-Agent Collaboration Mechanisms: A Survey of LLMs

    cs.AI 2025-01 unverdicted novelty 4.0

    The survey organizes LLM-based multi-agent collaboration mechanisms into a framework with dimensions of actors, types, structures, strategies, and coordination protocols, reviews applications across domains, and ident...

  43. Large Language Model-Brained GUI Agents: A Survey

    cs.AI 2024-11 unverdicted novelty 4.0

    A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.

  44. From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap

    cs.SE 2024-10 unverdicted novelty 4.0

    A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.

  45. Large Language Model-Based Agents for Software Engineering: A Survey

    cs.SE 2024-09 unverdicted novelty 4.0

    A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.

  46. Rethinking Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 3.0

    The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.

  47. Flowr -- Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains

    cs.AI 2026-04 unverdicted novelty 3.0

    Flowr is an agentic AI framework that decomposes retail supply chain workflows into coordinated LLM-based agents with human-in-the-loop oversight to automate operations in large supermarket chains.

  48. Large Language Model Agent: A Survey on Methodology, Applications and Challenges

    cs.CL 2025-03 accept novelty 3.0

    A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.

  49. A Survey on the Memory Mechanism of Large Language Model based Agents

    cs.AI 2024-04 accept novelty 3.0

    A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.

  50. The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey

    cs.AI 2024-04 unverdicted novelty 3.0

    A survey of emerging AI agent architectures that organizes single and multi-agent designs around reasoning, planning, tool use, communication, and reflection phases.

  51. Rethinking Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.

  52. Rethinking Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...

  53. LLM-Powered AI Agent Systems and Their Applications in Industry

    cs.AI 2025-05 unverdicted novelty 2.0

    A survey categorizing LLM-powered agent systems into software-based, physical, and hybrid types, covering industrial applications and challenges such as latency and security.

  54. Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

    cs.AI 2025-03 unverdicted novelty 2.0

    This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 49 Pith papers · 27 internal anchors

  1. [1]

    Pddl— the planning domain definition language

    [Aeronautiques et al., 1998] Constructions Aeronautiques, Adele Howe, et al. Pddl— the planning domain definition language. Technical Report, Tech. Rep.,

  2. [2]

    Learning from mistakes makes llm better reasoner

    [An et al., 2023] Shengnan An, Zexiong Ma, et al. Learning from mistakes makes llm better reasoner. arXiv preprint arXiv:2310.20689,

  3. [3]

    Graph of Thoughts: Solving Elaborate Problems with Large Language Models

    [Besta et al., 2023] Maciej Besta, Nils Blach, et al. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687,

  4. [4]

    Recent advances in retrieval-augmented text generation

    [Cai et al., 2022] Deng Cai, Yan Wang, Lemao Liu, and Shuming Shi. Recent advances in retrieval-augmented text generation. In SIGIR, pages 3417–3419,

  5. [5]

    Evaluating Large Language Models Trained on Code

    [Chen et al., 2021b] Mark Chen, Jerry Tworek, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

  6. [6]

    Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    [Chen et al., 2022] Wenhu Chen, Xueguang Ma, et al. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588,

  7. [7]

    Dynamic planning with a llm

    [Dagan et al., 2023] Gautier Dagan, Frank Keller, and Alex Lascarides. Dynamic planning with a llm. arXiv preprint arXiv:2308.06391,

  8. [8]

    Mind2Web: Towards a Generalist Agent for the Web

    [Deng et al., 2023] Xiang Deng, Yu Gu, et al. Mind2web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070,

  9. [9]

    Pal: Program-aided language models

    [Gao et al., 2023] Luyu Gao, Aman Madaan, et al. Pal: Program-aided language models. In ICML, pages 10764–10799,

  10. [10]

    Lpg: A planner based on local search for planning graphs with action costs

    [Gerevini and Serina, 2002] Alfonso Gerevini and Ivan Serina. Lpg: A planner based on local search for planning graphs with action costs. In AIPS, volume 2, pages 281–290,

  11. [11]

    Automated Planning: theory and practice

    [Ghallab et al., 2004] Malik Ghallab, Dana Nau, et al. Automated Planning: theory and practice. Elsevier,

  12. [12]

    CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

    [Gou et al., 2023] Zhibin Gou, Zhihong Shao, et al. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738,

  13. [13]

    Leveraging pre-trained large language models to construct and utilize world models for model-based task planning

    [Guan et al., 2023] Lin Guan, Karthik Valmeekam, et al. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. arXiv preprint arXiv:2305.14909,

  14. [14]

    Reasoning with Language Model is Planning with World Model

    [Hao et al., 2023] Shibo Hao, Yi Gu, et al. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992,

  15. [15]

    An introduction to the planning domain definition language, volume

    [Haslum et al., 2019] Patrik Haslum, Nir Lipovetzky, et al. An introduction to the planning domain definition language, volume

  16. [16]

    Deep reinforcement learning with a natural language action space

    [He et al., 2015] Ji He, Jianshu Chen, et al. Deep reinforcement learning with a natural language action space. arXiv preprint arXiv:1511.04636,

  17. [17]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    [Huang et al., 2023a] Lei Huang, Yu Weijiang, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232,

  18. [18]

    Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations

    [Huang et al., 2023b] Xu Huang, Jianxun Lian, et al. Recommender ai agent: Integrating large language models for interactive recommendations. arXiv preprint arXiv:2308.16505,

  19. [19]

    Billion-scale similarity search with GPUs

    [Johnson et al., 2019] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547,

  20. [20]

    Language Models can Solve Computer Tasks

    [Kim and others, 2023] Geunwoo Kim et al. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491,

  21. [21]

    Large language models are zero-shot reasoners

    [Kojima et al., 2022] Takeshi Kojima, Shixiang Shane Gu, et al. Large language models are zero-shot reasoners. NeurIPS, 35:22199–22213,

  22. [22]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    [Lewis et al., 2020] Patrick Lewis, Ethan Perez, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. NeurIPS, 33:9459–9474,

  23. [23]

    SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks

    [Lin et al., 2023] Bill Yuchen Lin, Yicheng Fu, et al. Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks. arXiv preprint arXiv:2305.17390,

  24. [24]

    Width and inference based planners: Siw, bfs (f), and probe

    [Lipovetzky et al., 2014] Nir Lipovetzky, Miquel Ramirez, et al. Width and inference based planners: Siw, bfs (f), and probe. IPC, page 43,

  25. [25]

    LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    [Liu et al., 2023a] Bo Liu, Yuqian Jiang, et al. Llm+p: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477,

  26. [26]

    Think-in-Memory: Recalling and Post-thinking Enable LLMs with Long-Term Memory

    [Liu et al., 2023b] Lei Liu, Xiaoyan Yang, et al. Think-in-memory: Recalling and post-thinking enable llms with long-term memory. arXiv preprint arXiv:2311.08719,

  27. [27]

    AgentBench: Evaluating LLMs as Agents

    [Liu et al., 2023c] Xiao Liu, Hao Yu, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688,

  28. [28]

    Self-Refine: Iterative Refinement with Self-Feedback

    [Madaan et al., 2023] Aman Madaan, Niket Tandon, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651,

  29. [29]

    Generation-augmented retrieval for open-domain question answering

    [Mao et al., 2020] Yuning Mao, Pengcheng He, Xiaodong Liu, et al. Generation-augmented retrieval for open-domain question answering. arXiv preprint arXiv:2009.08553,

  30. [30]

    MemGPT: Towards LLMs as Operating Systems

    [Packer et al., 2023] Charles Packer, Vivian Fang, et al. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560,

  31. [31]

    Unifying large language models and knowledge graphs: A roadmap

    [Pan et al., 2024] Shirui Pan, Linhao Luo, et al. Unifying large language models and knowledge graphs: A roadmap. TKDE,

  32. [32]

    Generative agents: Interactive simulacra of human behavior

    [Park et al., 2023] Joon Sung Park, Joseph O’Brien, et al. Generative agents: Interactive simulacra of human behavior. In UIST, pages 1–22,

  33. [33]

    Tool Learning with Foundation Models

    [Qin et al., 2023] Yujia Qin, Shengding Hu, et al. Tool learning with foundation models. arXiv preprint arXiv:2304.08354,

  34. [34]

    Cognitive task analysis

    [Schraagen et al., 2000] Jan Maarten Schraagen, Susan F Chipman, et al. Cognitive task analysis. Psychology Press,

  35. [35]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    [Shen et al., 2023] Yongliang Shen, Kaitao Song, et al. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580,

  36. [36]

    Reflexion: Language agents with verbal reinforcement learning

    [Shinn et al., 2023] Noah Shinn, Federico Cassano, et al. Reflexion: Language agents with verbal reinforcement learning. In NeurIPS,

  37. [37]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    [Shridhar et al., 2020] Mohit Shridhar, Xingdi Yuan, et al. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768,

  38. [38]

    ProgPrompt: Generating situated robot task plans using large language models

    [Singh et al., 2023] Ishika Singh, Valts Blukis, et al. Progprompt: Generating situated robot task plans using large language models. In ICRA 2023, pages 11523–11530. IEEE,

  39. [39]

    A survey of reasoning with foundation models

    [Sun et al., 2023] Jiankai Sun, Chuanyang Zheng, et al. A survey of reasoning with foundation models. arXiv preprint arXiv:2312.11562,

  40. [40]

    FEVER: a large-scale dataset for Fact Extraction and VERification

    [Thorne et al., 2018] James Thorne, Andreas Vlachos, et al. Fever: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355,

  41. [41]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    [Touvron et al., 2023] Hugo Touvron, Louis Martin, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

  42. [42]

    ScienceWorld: Is your Agent Smarter than a 5th Grader?

    [Wang et al., 2022a] Ruoyao Wang, Peter Jansen, et al. Scienceworld: Is your agent smarter than a 5th grader? arXiv preprint arXiv:2203.07540,

  43. [43]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    [Wang et al., 2022b] Xuezhi Wang, Jason Wei, et al. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171,

  44. [44]

    A Survey on Large Language Model based Autonomous Agents

    [Wang et al., 2023a] Lei Wang, Chen Ma, et al. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432,

  45. [45]

    Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models

    [Wang et al., 2023b] Lei Wang, Wanyu Xu, et al. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091,

  46. [46]

    RecMind: Large Language Model Powered Agent for Recommendation

    [Wang et al., 2023c] Yancheng Wang, Ziyan Jiang, et al. Recmind: Large language model powered agent for recommendation. arXiv preprint arXiv:2308.14296,

  47. [47]

    Chain-of-thought prompting elicits reasoning in large language models

    [Wei et al., 2022] Jason Wei, Xuezhi Wang, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 35:24824–24837,

  48. [48]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    [Wu et al., 2023] Chenfei Wu, Shengming Yin, et al. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671,

  49. [49]

    C-Pack: Packaged resources to advance general Chinese embedding

    [Xiao and others, 2023] Shitao Xiao et al. C-pack: Packaged resources to advance general chinese embedding,

  50. [50]

    Llm a*: Human in the loop large language models enabled a* search for robotics

    [Xiao and Wang, 2023] Hengjia Xiao and Peng Wang. Llm a*: Human in the loop large language models enabled a* search for robotics. arXiv preprint arXiv:2312.01797,

  51. [51]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    [Yang et al., 2018] Zhilin Yang, Peng Qi, et al. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600,

  52. [52]

    Foundation models for decision making: Problems, methods, and opportunities

    [Yang et al., 2023a] Sherry Yang, Ofir Nachum, et al. Foundation models for decision making: Problems, methods, and opportunities. arXiv preprint arXiv:2303.04129,

  53. [53]

    Coupling large language models with logic programming for robust and general reasoning from text

    [Yang et al., 2023b] Zhun Yang, Adam Ishay, and Joohyung Lee. Coupling large language models with logic programming for robust and general reasoning from text. arXiv preprint arXiv:2307.07696,

  54. [55]

    Keep calm and explore: Language models for action generation in text-based games

    [Yao et al., 2020b] Shunyu Yao, Rohan Rao, et al. Keep calm and explore: Language models for action generation in text-based games. arXiv preprint arXiv:2010.02903,

  55. [56]

    ReAct: Synergizing Reasoning and Acting in Language Models

    [Yao et al., 2022] Shunyu Yao, Jeffrey Zhao, et al. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629,

  56. [57]

    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    [Yao et al., 2023] Shunyu Yao, Dian Yu, et al. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601,

  57. [58]

    Agenttuning: Enabling generalized agent abilities for llms

    [Zeng et al., 2023] Aohan Zeng, Mingdao Liu, et al. Agenttuning: Enabling generalized agent abilities for llms. arXiv preprint arXiv:2310.12823,

  58. [59]

    Large language model is semi-parametric reinforcement learning agent

    [Zhang et al., 2023a] Danyang Zhang, Lu Chen, et al. Large language model is semi-parametric reinforcement learning agent. arXiv preprint arXiv:2306.07929,

  59. [60]

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    [Zhang et al., 2023b] Yue Zhang, Yafu Li, et al. Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219,

  60. [61]

    A Survey of Large Language Models

    [Zhao et al., 2023a] Wayne Xin Zhao, Kun Zhou, et al. A survey of large language models. arXiv preprint arXiv:2303.18223,

  61. [62]

    Large language models as commonsense knowledge for large-scale task planning

    [Zhao et al., 2023b] Zirui Zhao, Wee Sun Lee, and David Hsu. Large language models as commonsense knowledge for large-scale task planning. arXiv preprint arXiv:2305.14078,

  62. [63]

    MemoryBank: Enhancing Large Language Models with Long-Term Memory

    [Zhong et al., 2023] Wanjun Zhong, Lianghong Guo, Qiqi Gao, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. arXiv preprint arXiv:2305.10250,

  63. [64]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    [Zhou et al., 2023] Shuyan Zhou, Frank F Xu, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023