arXiv: 2304.11477 · v3 · submitted 2023-04-22 · cs.AI · cs.RO

LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, Peter Stone

Pith reviewed 2026-05-14 18:30 UTC · model grok-4.3

keywords: large language models, planning, PDDL, classical planners, optimal planning, hybrid systems, natural language to formal language

The pith

LLM+P lets language models generate optimal plans by routing problems through classical planners via PDDL translation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLM+P as a framework that takes natural-language planning problems, converts them into PDDL files, hands the files to fast classical planners for correct or optimal solutions, and translates the resulting plans back into natural language. Pure LLMs are shown to fail at producing even feasible plans on most long-horizon tasks, while the hybrid method succeeds on the majority of tested benchmarks drawn from everyday planning domains. The work matters to a sympathetic reader because it offers a concrete way to combine the flexible language understanding of LLMs with the guaranteed efficiency and optimality of search-based planners. Experiments on the new benchmark suite confirm that the translation steps preserve enough information for the planner to operate correctly in most cases. If the approach holds, it points toward practical AI systems that can handle real planning tasks such as logistics, robotics, or scheduling without requiring hand-crafted problem encodings.

Core claim

LLM+P takes a natural language description of a planning problem, translates it into a syntactically and semantically correct PDDL file, invokes a classical planner to compute a correct or optimal plan, and translates that plan back into natural language; on the introduced benchmark problems this yields optimal solutions for most instances while standalone LLMs produce no feasible plan for most instances.

What carries the argument

The bidirectional LLM-to-PDDL-to-LLM translation pipeline that lets an LLM describe the problem and interpret the solution while delegating the actual search to a classical planner.
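The three-stage pipeline can be pictured as plain function composition. The sketch below is an illustration only, not the authors' code: `llm_plus_p`, `PipelineResult`, and the three stage callables are hypothetical names, and the `fake_*` stages stand in for real LLM calls and a real planner so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PipelineResult:
    pddl_problem: str          # PDDL text produced by the first LLM call
    plan: Optional[List[str]]  # planner actions, or None if no plan was found
    answer: Optional[str]      # natural-language rendering of the plan


def llm_plus_p(
    description: str,
    to_pddl: Callable[[str], str],
    solve: Callable[[str], Optional[List[str]]],
    to_text: Callable[[List[str]], str],
) -> PipelineResult:
    """Run the three stages in order; all three callables are hypothetical."""
    pddl_problem = to_pddl(description)   # LLM: natural language -> PDDL
    plan = solve(pddl_problem)            # classical planner: PDDL -> plan
    if plan is None:
        # The planner found nothing, so there is no plan to translate back.
        return PipelineResult(pddl_problem, None, None)
    return PipelineResult(pddl_problem, plan, to_text(plan))  # LLM: plan -> text


# Stand-in stages so the sketch runs without an LLM or a planner installed.
def fake_to_pddl(description: str) -> str:
    return "(define (problem demo) (:goal (on a b)))"


def fake_solve(pddl_problem: str) -> Optional[List[str]]:
    return ["(pick-up a)", "(stack a b)"] if "(on a b)" in pddl_problem else None


def fake_to_text(plan: List[str]) -> str:
    return "; then ".join(step.strip("()") for step in plan)


result = llm_plus_p("Put block a on block b.", fake_to_pddl, fake_solve, fake_to_text)
print(result.answer)  # pick-up a; then stack a b
```

Keeping the stages as injected callables makes the point of the design visible: only the middle stage carries any correctness or optimality guarantee, and that guarantee applies to whatever problem the first stage actually wrote down.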

If this is right

Classical planners become usable inside LLM pipelines without requiring users to write PDDL themselves.
LLMs can be restricted to the easier sub-task of problem description and solution interpretation while optimality is guaranteed by search.
The same translation pattern can be applied to other structured reasoning domains that already have efficient solvers.
Benchmark results indicate that feasibility and optimality rates rise sharply once the planner is inserted between the two LLM calls.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be tested on problems whose optimal solutions are already known from independent solvers to measure exact translation error rates.
If PDDL translation remains the bottleneck, fine-tuning the LLM specifically on paired natural-language and PDDL examples might further improve reliability.
The framework naturally extends to any domain that possesses both a natural-language interface and an existing classical solver, such as scheduling or verification tasks.
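The first extension above, checking pipeline output against independently known optima, could be tallied roughly as follows. This is a sketch under stated assumptions: the record layout, the label names, and the functions `classify` and `per_domain_rates` are all hypothetical, and the `toy` records are placeholders for illustration, not results from the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

# (domain, cost of the plan returned by the pipeline or None, known optimal cost)
Record = Tuple[str, Optional[int], int]


def classify(plan_cost: Optional[int], optimal_cost: int) -> str:
    if plan_cost is None:
        return "no_plan"
    if plan_cost == optimal_cost:
        return "matches_optimal"
    # Cheaper than the true optimum is impossible for the intended problem,
    # so it signals that the PDDL encoded a different (easier) problem.
    return "suspect_translation" if plan_cost < optimal_cost else "suboptimal"


def per_domain_rates(records: Iterable[Record]) -> Dict[str, Dict[str, float]]:
    counts: Dict[str, Counter] = defaultdict(Counter)
    for domain, plan_cost, optimal_cost in records:
        counts[domain][classify(plan_cost, optimal_cost)] += 1
    return {
        domain: {label: n / sum(c.values()) for label, n in c.items()}
        for domain, c in counts.items()
    }


# Placeholder records for illustration only; these are not results from the paper.
toy = [("domain_x", 4, 4), ("domain_x", 3, 4), ("domain_y", None, 6), ("domain_y", 6, 6)]
print(per_domain_rates(toy))
```

A plan whose cost matches the known optimum is still only weak evidence of a faithful translation, since a different problem can happen to have the same optimal cost; the tally narrows down where to look rather than certifying fidelity.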

Load-bearing premise

Large language models can produce PDDL encodings that are accurate enough for the classical planner to return valid and optimal plans rather than invalid or empty ones.

What would settle it

A collection of natural-language planning problems on which the LLM repeatedly emits PDDL that is either syntactically malformed or semantically inconsistent with the original description, causing the planner to return no solution or an incorrect one.

Original abstract

Large language models (LLMs) have demonstrated remarkable zero-shot generalization abilities: state-of-the-art chatbots can provide plausible answers to many common questions that arise in daily life. However, so far, LLMs cannot reliably solve long-horizon planning problems. By contrast, classical planners, once a problem is given in a formatted way, can use efficient search algorithms to quickly identify correct, or even optimal, plans. In an effort to get the best of both worlds, this paper introduces LLM+P, the first framework that incorporates the strengths of classical planners into LLMs. LLM+P takes in a natural language description of a planning problem, then returns a correct (or optimal) plan for solving that problem in natural language. LLM+P does so by first converting the language description into a file written in the planning domain definition language (PDDL), then leveraging classical planners to quickly find a solution, and then translating the found solution back into natural language. Along with LLM+P, we define a diverse set of different benchmark problems taken from common planning scenarios. Via a comprehensive set of experiments on these benchmark problems, we find that LLM+P is able to provide optimal solutions for most problems, while LLMs fail to provide even feasible plans for most problems. The code and results are publicly available at https://github.com/Cranial-XIX/llm-pddl.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LLM+P, a hybrid framework in which an LLM first translates a natural-language planning problem into PDDL, a classical planner then computes an optimal (or correct) solution, and the LLM finally renders the plan back into natural language. The authors define a collection of benchmark problems drawn from common planning domains, and report that LLM+P produces optimal solutions on most instances while pure LLMs fail to produce even feasible plans on most instances. Code and results are released publicly.

Significance. If the central empirical claim holds, the work supplies a concrete, reproducible demonstration that classical planners can be grafted onto LLMs to obtain optimality guarantees that current LLMs lack. The public release of code strengthens the result. The significance is limited by the absence of direct evidence that the LLM-generated PDDL faithfully encodes the original natural-language specification; downstream planner success alone does not certify semantic fidelity.

major comments (2)

[§4] §4 (Experiments) and the associated tables: success is measured solely by whether the classical planner returns a plan; no separate human or automated audit of the generated PDDL files is reported. Consequently the headline claim that LLM+P solves the intended problems optimally rests on an unverified assumption that the LLM translation step preserves preconditions, fluents, and goal conditions exactly.
[§3.1] §3.1 (PDDL Generation): the prompt templates and few-shot examples used to elicit PDDL are not accompanied by any quantitative measure of syntactic or semantic error rates. Because classical planners will solve any well-formed PDDL they receive, end-to-end success rates do not isolate whether the LLM step is reliable or merely lucky on the chosen benchmarks.

minor comments (2)

[§4.1] The abstract states that benchmarks are 'taken from common planning scenarios' but §4.1 provides only high-level descriptions; an explicit list of domains, instance counts, and difficulty parameters would improve reproducibility.
[Figure 2] Figure 2 (or equivalent) comparing LLM-only versus LLM+P trajectories would benefit from error bars or per-domain breakdowns rather than aggregate percentages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of our evaluation methodology. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses

Referee: §4 (Experiments) and the associated tables: success is measured solely by whether the classical planner returns a plan; no separate human or automated audit of the generated PDDL files is reported. Consequently the headline claim that LLM+P solves the intended problems optimally rests on an unverified assumption that the LLM translation step preserves preconditions, fluents, and goal conditions exactly.

Authors: We acknowledge that our evaluation does not include an independent audit of the generated PDDL against the natural-language specifications. While the public code release permits inspection of the PDDL outputs, we agree that this leaves the semantic fidelity of the translation step unverified in the paper. In the revised manuscript we will add a new subsection under Experiments that reports a human audit of a random sample of generated PDDL files (at least 20% of instances per domain), checking that all preconditions, effects, and goal conditions match the original problem statement. We will also report the fraction of cases where the generated PDDL is syntactically invalid.

Revision: yes

Referee: §3.1 (PDDL Generation): the prompt templates and few-shot examples used to elicit PDDL are not accompanied by any quantitative measure of syntactic or semantic error rates. Because classical planners will solve any well-formed PDDL they receive, end-to-end success rates do not isolate whether the LLM step is reliable or merely lucky on the chosen benchmarks.

Authors: We agree that isolating the reliability of the PDDL-generation step is valuable. In the revision we will augment §3.1 with quantitative error analysis: (1) syntactic validity rate measured by attempting to parse every generated PDDL file with a standard PDDL parser, and (2) semantic fidelity measured on the subset of domains for which we possess ground-truth PDDL (Blocksworld, Logistics, etc.) by comparing generated predicates, actions, and goals against the reference encodings. These metrics will be reported alongside the existing end-to-end success rates.

Revision: yes
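The two measurements promised in the second response can be sketched in a few lines. The code below is an assumption-laden illustration rather than the authors' tooling: `parse_sexpr`, `goal_atoms`, and `is_well_formed` are hypothetical helpers, the syntactic check covers only parenthesis structure (it ignores PDDL `;` comments and does no type or requirement checking), and the goal comparison handles only a single atom or a flat conjunction of positive atoms.

```python
from typing import List, Set, Tuple, Union

SExpr = Union[str, List["SExpr"]]


def parse_sexpr(text: str) -> SExpr:
    """Parse one parenthesised expression; raise ValueError if malformed."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> SExpr:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        token = tokens[pos]
        pos += 1
        if token == ")":
            raise ValueError("unexpected ')'")
        if token != "(":
            return token.lower()  # normalise case so comparisons ignore it
        items: List[SExpr] = []
        while pos < len(tokens) and tokens[pos] != ")":
            items.append(read())
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume ")"
        return items

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after expression")
    return tree


def is_well_formed(text: str) -> bool:
    try:
        parse_sexpr(text)
        return True
    except ValueError:
        return False


def goal_atoms(problem: SExpr) -> Set[Tuple[str, ...]]:
    """Collect atoms from a (:goal ...) that is an atom or a flat conjunction."""
    for section in problem:
        if isinstance(section, list) and section and section[0] == ":goal":
            body = section[1]
            parts = body[1:] if body[0] == "and" else [body]
            return {tuple(atom) for atom in parts}
    raise ValueError("no :goal section")


# Illustrative inputs only: a reference encoding, a reordered but equivalent
# generated encoding, and a truncated file with unbalanced parentheses.
reference = "(define (problem p1) (:domain blocksworld) (:goal (and (on a b) (on b c))))"
generated = "(define (problem p1) (:domain blocksworld) (:goal (and (on b c) (on a b))))"
truncated = "(define (problem p1) (:goal (and (on a b))"

same_goal = goal_atoms(parse_sexpr(generated)) == goal_atoms(parse_sexpr(reference))
print(is_well_formed(truncated), same_goal)  # False True
```

Comparing goals as sets rather than strings avoids counting a harmless reordering as a translation error. A real audit would also need to compare the `:init` facts and object declarations, which this sketch leaves out.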

Circularity Check

0 steps flagged

No significant circularity; empirical hybrid framework

The paper presents LLM+P as an empirical system: LLM translates NL problem descriptions to PDDL, a classical planner computes a solution (optimal or feasible), and the plan is translated back to NL. Central claims rest on experiments over author-defined benchmarks where end-to-end success rates are measured against ground-truth solvability. No equations, parameter fits, or derivations appear; no self-citation chain supports a uniqueness theorem or ansatz; success is externally validated by planner output rather than by construction from the LLM step itself. This matches the default non-circular case for a systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the domain assumption that LLMs can perform accurate bidirectional translation between natural language and PDDL; no free parameters or invented entities are introduced.

axioms (2)

Domain assumption: LLMs can accurately convert natural-language planning problems into correct PDDL.
This translation step is required for the planner to receive valid input and is not independently verified in the abstract.
Standard math: Classical planners run in an optimal configuration return minimum-cost plans for valid, solvable PDDL input.
Standard result in the automated planning literature.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Self-Improvement for Fast, High-Quality Plan Generation
cs.AI 2026-05 unverdicted novelty 7.0

Self-improvement of a decoder-only transformer yields plans averaging 30% shorter than a source symbolic planner, over 80% optimal where known, with sub-exponential latency scaling.

LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning
cs.AI 2026-04 unverdicted novelty 7.0

LLM+ASP framework enables task-agnostic nonmonotonic reasoning by having LLMs generate and self-correct ASP programs using solver feedback, outperforming SMT alternatives on diverse benchmarks.

LLM-Flax: Generalizable Robotic Task Planning via Neuro-Symbolic Approaches with Large Language Models
cs.RO 2026-04 unverdicted novelty 7.0

LLM-Flax automates neuro-symbolic robotic task planning with three LLM stages for rule generation, failure recovery, and zero-shot scoring, outperforming manual baselines on MazeNamo grids.

ANCHOR: A Physically Grounded Closed-Loop Framework for Robust Home-Service Mobile Manipulation
cs.RO 2026-04 conditional novelty 7.0

ANCHOR raises mobile manipulation success from 53.3% to 71.7% in unseen homes by binding plans to observable geometry, ensuring operable navigation endpoints, and using layered local recovery instead of global replans.

Using large language models for embodied planning introduces systematic safety risks
cs.AI 2026-04 unverdicted novelty 7.0

LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.

Self-Correcting RAG: Enhancing Faithfulness via MMKP Context Selection and NLI-Guided MCTS
cs.CL 2026-04 unverdicted novelty 7.0

Self-Correcting RAG formalizes retrieval as MMKP to maximize information density under token limits and uses NLI-guided MCTS to validate faithfulness, raising accuracy and cutting hallucinations on six multi-hop QA an...

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
cs.LG 2024-10 accept novelty 7.0

LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
cs.RO 2023-07 unverdicted novelty 7.0

VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.

CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations
cs.RO 2026-05 unverdicted novelty 6.0

CSR with ASR enables infinite-horizon real-time LLM policies via stable KV-cache properties and background eviction, delivering 26x lower latency and SOTA recall on embodied benchmarks.

Decoupled Travel Planning with Behavior Forest
cs.LG 2026-04 unverdicted novelty 6.0

Behavior Forest decouples multi-constraint travel planning into parallel behavior trees with LLM nodes and global coordination, yielding 6.67% and 11.82% gains over prior methods on two benchmarks.

Mind the Prompt: Self-adaptive Generation of Task Plan Explanations via LLMs
cs.AI 2026-04 unverdicted novelty 6.0

COMPASS formalizes prompt engineering as a POMDP-based cognitive decision process for self-adaptive generation of task plan explanations via LLMs.

SYMBOLIZER: Symbolic Model-free Task Planning with VLMs
cs.RO 2026-04 unverdicted novelty 6.0

SYMBOLIZER grounds symbolic states from images via VLMs using only lifted predicates and solves long-horizon tasks with goal-count and width-based heuristic search, outperforming direct VLM planning and matching VLM-h...

Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents
cs.CL 2026-04 conditional novelty 6.0

A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.

A Survey on Large Language Model based Autonomous Agents
cs.AI 2023-08 accept novelty 6.0

A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
cs.AI 2026-05 conditional novelty 5.0

The survey proposes the LIFE framework to unify fragmented research on collaboration, failure attribution, and self-evolution in LLM multi-agent systems into a progression toward self-organizing intelligence.

Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning
cs.AI 2026-05 unverdicted novelty 5.0

Novelty estimation via LLM prompts enables pruning in Tree-of-Thought search, reducing overall token usage on language planning benchmarks.

Bridging Values and Behavior: A Hierarchical Framework for Proactive Embodied Agents
cs.AI 2026-04 unverdicted novelty 5.0

ValuePlanner is a hierarchical architecture that uses LLMs to generate value-based subgoals and PDDL planners to produce executable actions, enabling self-directed behavior in embodied agents.

From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents
cs.AI 2026-04 unverdicted novelty 5.0

AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.

AssemPlanner: A Multi-Agent Based Task Planning Framework for Flexible Assembly System
cs.RO 2026-05 unverdicted novelty 4.0

AssemPlanner is a ReAct-based multi-agent system that autonomously generates production plans from natural language inputs by integrating scheduling, knowledge, line balancing, and scene graph feedback.

Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation
cs.SE 2026-04 unverdicted novelty 4.0

Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.

Understanding the planning of LLM agents: A survey
cs.AI 2024-02 accept novelty 4.0

A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.

The Rise and Potential of Large Language Model Based Agents: A Survey
cs.AI 2023-09 accept novelty 4.0

The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding
cs.AI 2026-05 unverdicted novelty 3.0

Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 23 Pith papers · 14 internal anchors

[1]

Eliza—a computer program for the study of natural language communication between man and machine,

J. Weizenbaum, “Eliza—a computer program for the study of natural language communication between man and machine,” Communica- tions of the ACM , vol. 9, no. 1, pp. 36–45, 1966

work page 1966
[2]

Gpt-4 technical report,

OpenAI, “Gpt-4 technical report,” 2023

work page 2023
[3]

Chatgpt for robotics: Design principles and model abilities,

S. Vemprala, R. Bonatti, A. Bucker, and A. Kapoor, “Chatgpt for robotics: Design principles and model abilities,” Microsoft, Tech. Rep. MSR-TR-2023-8, February 2023. [Online]. Available: https://www.microsoft.com/en-us/research/publication/ chatgpt-for-robotics-design-principles-and-model-abilities/

work page 2023
[4]

Dissociating language and thought in large language models: a cognitive perspective,

K. Mahowald, A. A. Ivanova, I. A. Blank, N. Kanwisher, J. B. Tenenbaum, and E. Fedorenko, “Dissociating language and thought in large language models: a cognitive perspective,” arXiv preprint arXiv:2301.06627, 2023

work page arXiv 2023
[5]

Mixout: Effective regularization to finetune large-scale pretrained language models,

C. Lee, K. Cho, and W. Kang, “Mixout: Effective regularization to finetune large-scale pretrained language models,” arXiv preprint arXiv:1909.11299, 2019

work page arXiv 1909
[6]

Finetuned Language Models Are Zero-Shot Learners

J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652 , 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Pddl-the planning domain definition language,

D. McDermott, M. Ghallab, A. Howe, C. Knoblock, A. Ram, M. Veloso, D. Weld, and D. Wilkins, “Pddl-the planning domain definition language,” 1998

work page 1998
[8]

An introduc- tion to the planning domain definition language,

P. Haslum, N. Lipovetzky, D. Magazzeni, and C. Muise, “An introduc- tion to the planning domain definition language,” Synthesis Lectures on Artificial Intelligence and Machine Learning , vol. 13, no. 2, pp. 1–187, 2019

work page 2019
[9]

Large language models still can’t plan (A benchmark for llms on planning and reasoning about change).CoRR, abs/2206.10498, 2022

K. Valmeekam, A. Olmo, S. Sreedharan, and S. Kambhampati, “Large language models still can’t plan (a benchmark for llms on planning and reasoning about change),” arXiv preprint arXiv:2206.10498, 2022

work page arXiv 2022
[10]

Language models are few-shot learners,

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. , “Language models are few-shot learners,” Advances in neural information pro- cessing systems, vol. 33, pp. 1877–1901, 2020

work page 1901
[11]

The fast downward planning system,

M. Helmert, “The fast downward planning system,” Journal of Artifi- cial Intelligence Research , vol. 26, pp. 191–246, 2006

work page 2006
[12]

The computational complexity of propositional STRIPS planning,

T. Bylander, “The computational complexity of propositional STRIPS planning,” Artificial Intelligence, vol. 69, no. 1-2, pp. 165–204, 1994

work page 1994
[13]

Situations, actions, and causal laws,

J. McCarthy, “Situations, actions, and causal laws,” Stanford Univer- sity Technical Report, Tech. Rep., 1963

work page 1963
[14]

Strips: A new approach to the appli- cation of theorem proving to problem solving,

R. E. Fikes and N. J. Nilsson, “Strips: A new approach to the appli- cation of theorem proving to problem solving,” Artificial intelligence, vol. 2, no. 3-4, pp. 189–208, 1971

work page 1971
[15]

Shakey the robot,

N. J. Nilsson et al., “Shakey the robot,” 1984

work page 1984
[16]

Prodigy: An integrated architecture for planning and learning,

J. Carbonell, O. Etzioni, Y . Gil, R. Joseph, C. Knoblock, S. Minton, and M. Veloso, “Prodigy: An integrated architecture for planning and learning,” ACM SIGART Bulletin , vol. 2, no. 4, pp. 51–55, 1991

work page 1991
[17]

Shop2: An htn planning system,

D. S. Nau, T.-C. Au, O. Ilghami, U. Kuter, J. W. Murdock, D. Wu, and F. Yaman, “Shop2: An htn planning system,” Journal of artificial intelligence research, 2003

work page 2003
[18]

Task planning in robotics: an empirical comparison of pddl-and asp-based sys- tems,

Y .-q. Jiang, S.-q. Zhang, P. Khandelwal, and P. Stone, “Task planning in robotics: an empirical comparison of pddl-and asp-based sys- tems,” Frontiers of Information Technology & Electronic Engineering, vol. 20, pp. 363–373, 2019

work page 2019
[19]

Answer set programming at a glance,

G. Brewka, T. Eiter, and M. Truszczy ´nski, “Answer set programming at a glance,” Communications of the ACM, vol. 54, no. 12, pp. 92–103, 2011

work page 2011
[20]

Answer set programming and plan generation,

V . Lifschitz, “Answer set programming and plan generation,” Artificial Intelligence, vol. 138, no. 1-2, pp. 39–54, 2002

work page 2002
[21]

Pddl2. 1: An extension to pddl for expressing temporal planning domains,

M. Fox and D. Long, “Pddl2. 1: An extension to pddl for expressing temporal planning domains,”Journal of artificial intelligence research, vol. 20, pp. 61–124, 2003

work page 2003
[22]

Mobile robot planning using action language bc with an abstraction hierarchy,

S. Zhang, F. Yang, P. Khandelwal, and P. Stone, “Mobile robot planning using action language bc with an abstraction hierarchy,” in International Conference on Logic Programming and Nonmonotonic Reasoning. Springer, 2015, pp. 502–516

work page 2015
[23]

Task-motion planning for safe and efficient urban driving,

Y . Ding, X. Zhang, X. Zhan, and S. Zhang, “Task-motion planning for safe and efficient urban driving,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , 2020

work page 2020
[24]

Multi-robot planning with conflicts and synergies,

Y . Jiang, H. Yedidsion, S. Zhang, G. Sharon, and P. Stone, “Multi-robot planning with conflicts and synergies,” Autonomous Robots , vol. 43, no. 8, pp. 2011–2032, 2019

work page 2011
[25]

Platform-independent benchmarks for task and motion planning,

F. Lagriffoul, N. T. Dantam, C. Garrett, A. Akbari, S. Srivastava, and L. E. Kavraki, “Platform-independent benchmarks for task and motion planning,” IEEE Robotics and Automation Letters , vol. 3, no. 4, pp. 3765–3772, 2018

work page 2018
[26]

Integrated task and motion planning in belief space,

L. P. Kaelbling and T. Lozano-P ´erez, “Integrated task and motion planning in belief space,” The International Journal of Robotics Research, vol. 32, no. 9-10, pp. 1194–1227, 2013

work page 2013
[27]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[28]

Evaluating Large Language Models Trained on Code

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Ka- plan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. , “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[29]

OPT: Open Pre-trained Transformer Language Models

S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. De- wan, M. Diab, X. Li, X. V . Lin, et al., “Opt: Open pre-trained trans- former language models,” arXiv preprint arXiv:2205.01068 , 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[30]

Chatgpt,

OpenAI, “Chatgpt,” Accessed: 2023-02-08, 2023, cit. on pp. 1, 16. [Online]. Available: https://openai.com/blog/chatgpt/

work page 2023
[31]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi`ere, N. Goyal, E. Hambro, F. Azhar, et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. , “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

PaLM: Scaling Language Modeling with Pathways

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al., “Palm: Scaling language modeling with pathways,” arXiv preprint arXiv:2204.02311, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

M. Ahn, A. Brohan, N. Brown, Y . Chebotar, O. Cortes, B. David, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, et al., “Do as i can, not as i say: Grounding language in robotic affordances,” arXiv preprint arXiv:2204.01691, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

Task and motion planning with large language models for object rearrangement,

Y . Ding, X. Zhang, C. Paxton, and S. Zhang, “Task and motion planning with large language models for object rearrangement,” 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023

work page 2023
[36]

PaLM-E: An Embodied Multimodal Language Model

D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu,et al., “Palm-e: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Inner Monologue: Embodied Reasoning through Planning with Language Models

W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y . Chebotar, et al. , “Inner monologue: Embodied reasoning through planning with language models,” arXiv preprint arXiv:2207.05608, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[38]

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,

W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” in International Conference on Machine Learning . PMLR, 2022, pp. 9118–9147

work page 2022
[39]

Housekeep: Tidying virtual households using commonsense reasoning,

Y . Kant, A. Ramachandran, S. Yenamandra, I. Gilitschenski, D. Batra, A. Szot, and H. Agrawal, “Housekeep: Tidying virtual households using commonsense reasoning,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIX. Springer, 2022, pp. 355–373

work page 2022
[40]

Singh, V

I. Singh, V . Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg, “Progprompt: Generating situ- ated robot task plans using large language models,” arXiv preprint arXiv:2209.11302, 2022

work page arXiv 2022
[41]

Text2motion: From natu- ral language instructions to feasible plans

K. Lin, C. Agia, T. Migimatsu, M. Pavone, and J. Bohg, “Text2motion: From natural language instructions to feasible plans,” arXiv preprint arXiv:2303.12153, 2023

work page arXiv 2023
[42]

Automaton-based representations of task knowledge from generative language models,

Y . Yang, J.-R. Gaglione, C. Neary, and U. Topcu, “Automaton-based representations of task knowledge from generative language models,” arXiv preprint arXiv:2212.01944 , 2023

work page arXiv 2023
[43]

Integrating action knowledge and llms for task planning and situation handling in open worlds,

Y . Ding, X. Zhang, S. Amiri, N. Cao, H. Yang, A. Kaminski, C. Esselink, and S. Zhang, “Integrating action knowledge and llms for task planning and situation handling in open worlds,” arXiv preprint arXiv:2305.17590, 2023

work page arXiv 2023
[44]

Robots that ask for help: Uncertainty alignment for large language model planners,

A. Z. Ren, A. Dixit, A. Bodrova, S. Singh, S. Tu, N. Brown, P. Xu, L. Takayama, F. Xia, J. Varley, et al. , “Robots that ask for help: Uncertainty alignment for large language model planners,” arXiv preprint arXiv:2307.01928, 2023

work page arXiv 2023
[45]

Autotamp: Autoregressive task and motion planning with llms as translators and checkers,

Y . Chen, J. Arkin, Y . Zhang, N. Roy, and C. Fan, “Autotamp: Autoregressive task and motion planning with llms as translators and checkers,” arXiv preprint arXiv:2306.06531 , 2023

work page arXiv 2023
[46]

On the planning abilities of large language models (a critical investigation with a proposed benchmark),

K. Valmeekam, S. Sreedharan, M. Marquez, A. Olmo, and S. Kamb- hampati, “On the planning abilities of large language models (a critical investigation with a proposed benchmark),” arXiv preprint arXiv:2302.06706, 2023

[47]

PDDL planning with pretrained large language models

T. Silver, V. Hariprasad, R. S. Shuttleworth, N. Kumar, T. Lozano-Pérez, and L. P. Kaelbling, “PDDL planning with pretrained large language models,” in NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022. [Online]. Available: https://openreview.net/forum?id=1QMMUB4zfl

[48]

Plansformer: Generating symbolic plans using transformers

V. Pallagani, B. Muppasani, K. Murugesan, F. Rossi, L. Horesh, B. Srivastava, F. Fabiano, and A. Loreggia, “Plansformer: Generating symbolic plans using transformers,” arXiv preprint arXiv:2212.08681, 2022

[49]

Learning and leveraging verifiers to improve planning capabilities of pre-trained language models

D. Arora and S. Kambhampati, “Learning and leveraging verifiers to improve planning capabilities of pre-trained language models,” arXiv preprint arXiv:2305.17077, 2023

[50]

Leveraging pre-trained large language models to construct and utilize world models for model-based task planning

L. Guan, K. Valmeekam, S. Sreedharan, and S. Kambhampati, “Leveraging pre-trained large language models to construct and utilize world models for model-based task planning,” arXiv preprint arXiv:2305.14909, 2023

[51]

Generalized planning in pddl domains with pretrained large language models

T. Silver, S. Dan, K. Srinivas, J. B. Tenenbaum, L. P. Kaelbling, and M. Katz, “Generalized planning in pddl domains with pretrained large language models,” arXiv preprint arXiv:2305.11014, 2023

[52]

Understanding the capabilities of large language models for automated planning

V. Pallagani, B. Muppasani, K. Murugesan, F. Rossi, B. Srivastava, L. Horesh, F. Fabiano, and A. Loreggia, “Understanding the capabilities of large language models for automated planning,” arXiv preprint arXiv:2305.16151, 2023

[53]

On the planning abilities of large language models–a critical investigation

K. Valmeekam, M. Marquez, S. Sreedharan, and S. Kambhampati, “On the planning abilities of large language models–a critical investigation,” arXiv preprint arXiv:2305.15771, 2023

[54]

Translating natural language to planning goals with large-language models

Y. Xie, C. Yu, T. Zhu, J. Bai, Z. Gong, and H. Soh, “Translating natural language to planning goals with large-language models,” arXiv preprint arXiv:2302.05128, 2023

[55]

Saycanpay: Heuristic planning with large language models using learnable domain knowledge

R. Hazra, P. Z. D. Martires, and L. De Raedt, “Saycanpay: Heuristic planning with large language models using learnable domain knowledge,” arXiv preprint arXiv:2308.12682, 2023

[56]

Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning

K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, and N. Suenderhauf, “Sayplan: Grounding large language models using 3d scene graphs for scalable task planning,” arXiv preprint arXiv:2307.06135, 2023

[57]

Isr-llm: Iterative self-refined large language model for long-horizon sequential task planning

Z. Zhou, J. Song, K. Yao, Z. Shu, and L. Ma, “Isr-llm: Iterative self-refined large language model for long-horizon sequential task planning,” arXiv preprint arXiv:2308.13724, 2023

[58]

Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents

Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang, “Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents,” arXiv preprint arXiv:2302.01560, 2023

[59]

WebGPT: Browser-assisted question-answering with human feedback

R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al., “Webgpt: Browser-assisted question-answering with human feedback,” arXiv preprint arXiv:2112.09332, 2021

[60]

Internet-augmented language models through few-shot prompting for open-domain question answering

A. Lazaridou, E. Gribovskaya, W. Stokowiec, and N. Grigorev, “Internet-augmented language models through few-shot prompting for open-domain question answering,” arXiv preprint arXiv:2203.05115, 2022

[61]

Memory-assisted prompt editing to improve gpt-3 after deployment

A. Madaan, N. Tandon, P. Clark, and Y. Yang, “Memory-assisted prompt editing to improve gpt-3 after deployment,” 2023

[62]

Replug: Retrieval-augmented black-box language models

W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, and W.-t. Yih, “Replug: Retrieval-augmented black-box language models,” arXiv preprint arXiv:2301.12652, 2023

[63]

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,” arXiv preprint arXiv:2211.12588, 2022

[64]

PAL: Program-aided Language Models

L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig, “Pal: Program-aided language models,” arXiv preprint arXiv:2211.10435, 2022

[65]

Toolformer: Language Models Can Teach Themselves to Use Tools

T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” arXiv preprint arXiv:2302.04761, 2023

[66]

Faithful chain-of-thought reasoning

Q. Lyu, S. Havaldar, A. Stein, L. Zhang, D. Rao, E. Wong, M. Apidianaki, and C. Callison-Burch, “Faithful chain-of-thought reasoning,” arXiv preprint arXiv:2301.13379, 2023

[67]

PDDL generators

J. Seipp, Á. Torralba, and J. Hoffmann, “PDDL generators,” https://doi.org/10.5281/zenodo.6382173, 2022

[68]

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” arXiv preprint arXiv:2305.10601, 2023
