Frontier Large Language Models Rival State-of-the-Art Planners
Pith reviewed 2026-05-21 19:34 UTC · model grok-4.3
The pith
Frontier large language models solve planning tasks as well as or better than state-of-the-art classical planners.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On standard task descriptions from the latest International Planning Competition, Gemini 3.1 Pro outperforms the strongest planner baseline by solving 245 versus 234 tasks out of 360, while GPT-5 achieves comparable performance; performance remains competitive even with obfuscated descriptions lacking semantic information.
What carries the argument
Rigorous evaluation on freshly created International Planning Competition tasks verified by a validation tool, with direct head-to-head comparison against state-of-the-art classical planners on both standard and semantically obfuscated versions.
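To make concrete what "verified by a validation tool" involves, here is a minimal sketch of a plan checker for ground STRIPS-style actions. It is not the validation tool used in the paper; the `Action` class, the `validate_plan` function, and the two-block example are all hypothetical and exist only to illustrate the idea that a plan counts as solved only if every step is applicable and the goal holds at the end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A ground STRIPS-style action (illustrative names, not from the paper)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def validate_plan(initial_state, goal, plan):
    """Return (True, None) if the plan reaches the goal, else (False, reason)."""
    state = set(initial_state)
    for step, action in enumerate(plan):
        # Every precondition must hold in the current state.
        missing = action.preconditions - state
        if missing:
            return False, f"step {step} ({action.name}): unmet {sorted(missing)}"
        # Apply the action: remove delete effects, then add add effects.
        state = (state - action.delete_effects) | action.add_effects
    unmet_goals = set(goal) - state
    if unmet_goals:
        return False, f"goal not reached: {sorted(unmet_goals)}"
    return True, None


# Hypothetical two-block example.
unstack_a_b = Action(
    name="unstack a b",
    preconditions=frozenset({"on a b", "clear a", "handempty"}),
    add_effects=frozenset({"holding a", "clear b"}),
    delete_effects=frozenset({"on a b", "clear a", "handempty"}),
)
putdown_a = Action(
    name="putdown a",
    preconditions=frozenset({"holding a"}),
    add_effects=frozenset({"ontable a", "clear a", "handempty"}),
    delete_effects=frozenset({"holding a"}),
)

initial = {"on a b", "ontable b", "clear a", "handempty"}
goal = {"ontable a", "ontable b"}

ok, reason = validate_plan(initial, goal, [unstack_a_b, putdown_a])
bad_ok, bad_reason = validate_plan(initial, goal, [putdown_a])
```

The second call omits the unstack step, so the checker rejects the plan at its first action. A checker of this kind is what lets solved-task counts for LLMs and classical planners be compared on equal terms, since a plausible-looking but inapplicable plan is counted as a failure.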
If this is right
- Frontier LLMs could serve as practical planners in domains where classical systems are currently used.
- Planning ability in LLMs has advanced rapidly across successive model generations.
- Some degree of symbolic reasoning persists even when natural-language cues are removed from task descriptions.
- The open question is how far LLM planning performance will continue to scale with future models.
Where Pith is reading between the lines
- Hybrid planners that combine LLM flexibility with classical search guarantees might outperform either approach alone on complex real-world problems.
- If contamination concerns are addressed through repeated fresh benchmarks, LLMs could become a default starting point for automated planning rather than a last resort.
- Domains such as robotics or logistics that require both symbolic correctness and adaptation to vague goals become newly accessible to LLM-based methods.
Load-bearing premise
The tasks are genuinely novel and free of training-data contamination, and the validation tool correctly classifies every plan as valid or invalid without systematic errors.
What would settle it
Creating a fresh set of planning tasks from the next International Planning Competition and re-testing the same models, or uncovering systematic misclassifications by the validation tool that artificially favor LLM solutions.
Original abstract
A series of influential studies established that large language models cannot reliably solve even simple planning tasks. We show that the latest generation of frontier models overturns this conclusion. We evaluate three families of frontier LLMs on a challenging set of planning tasks based on the most recent International Planning Competition following rigorous evaluation guidelines: solutions are verified with a validation tool, tasks are freshly created to avoid data contamination, and performance is compared against state-of-the-art classical planners. On standard task descriptions, Gemini 3.1 Pro outperforms the strongest planner baseline (245 vs. 234 solved tasks out of 360), while GPT-5 achieves comparable performance to the baselines. When all semantic information is obfuscated from the descriptions to test for pure symbolic planning, performance degrades but Gemini 3.1 Pro remains competitive with the strongest baselines. A longitudinal comparison across model generations -- from GPT-3.5, which solves zero tasks, to GPT-5 -- reveals a striking upward trajectory. Frontier LLMs might finally be able to plan; the question now is how far this capability will extend.
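The obfuscation condition can be pictured with a small sketch. The paper's actual obfuscation scheme is not described on this page, so everything below is an assumption: a hypothetical `obfuscate` function that keeps a partial list of PDDL keywords intact and consistently replaces every other name (actions, predicates, objects, variables) with an opaque symbol.

```python
import re

# PDDL keywords that must survive renaming. This is a partial, illustrative
# list; a real tool would need the full PDDL grammar.
KEYWORDS = {
    "define", "domain", "problem", "and", "not",
    ":requirements", ":strips", ":predicates", ":action", ":parameters",
    ":precondition", ":effect", ":objects", ":init", ":goal", ":domain",
}

# A token is either a single parenthesis or a run of non-space, non-paren chars.
TOKEN = re.compile(r"[()]|[^\s()]+")


def obfuscate(pddl_text, mapping=None):
    """Rename every non-keyword identifier to an opaque symbol.

    Returns the rewritten text and the mapping, so the same mapping can be
    reused for a domain file and its problem files.
    """
    mapping = {} if mapping is None else mapping
    out = []
    for token in TOKEN.findall(pddl_text.lower()):
        if token in ("(", ")") or token in KEYWORDS:
            out.append(token)
            continue
        # Keep the "?" marker on variables, rename only the name itself.
        prefix = "?" if token.startswith("?") else ""
        name = token[len(prefix):]
        if name not in mapping:
            mapping[name] = f"sym{len(mapping)}"
        out.append(prefix + mapping[name])
    return " ".join(out), mapping


snippet = (
    "(:action pick-up :parameters (?b) "
    ":precondition (and (clear ?b) (ontable ?b)) "
    ":effect (holding ?b))"
)
obfuscated, mapping = obfuscate(snippet)
```

After renaming, words such as `clear` or `holding` no longer appear, but the logical structure of the action is unchanged. A model that still solves the task therefore cannot be relying on the everyday meaning of the names.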
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that frontier LLMs overturn prior findings on their inability to solve planning tasks. On 360 tasks derived from the most recent IPC, Gemini 3.1 Pro solves 245 tasks (vs. 234 for the strongest classical planner baseline) under standard descriptions and remains competitive under semantic obfuscation; GPT-5 matches baseline performance, while earlier models like GPT-3.5 solve zero tasks. The evaluation uses verified solutions, fresh task creation to avoid contamination, and direct comparison to SOTA planners.
Significance. If the results hold, this would mark a substantial advance in LLM planning capabilities and challenge the view that LLMs are unreliable for planning. The empirical design—solution verification, direct SOTA comparisons, and longitudinal model tracking—provides a strong basis for the claims. The obfuscation experiment and fresh-task protocol are notable strengths for isolating planning ability.
major comments (1)
- [Methods - Task Creation] Task creation description (Methods section): The statement that tasks were 'freshly created' from the most recent IPC 'following rigorous evaluation guidelines' to avoid contamination lacks a concrete account of the generation process. It is unclear whether new predicates, objects, or goal structures were introduced versus minor reparameterizations of known IPC domains. This detail is load-bearing for distinguishing emergent planning from possible structural familiarity with IPC-style problems in pre-training data.
minor comments (2)
- [Abstract] Abstract: Explicitly name the three families of frontier LLMs and the number of domains covered to improve immediate readability.
- [Results] Results presentation: Report variance or multiple runs for the LLM results, as the 245 vs. 234 margin is narrow and single-run figures limit confidence in the outperformance claim.
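The concern about the narrow margin can be illustrated with a rough calculation. The sketch below computes 95% Wilson score intervals for the two reported solve rates. It assumes that each task is an independent trial and treats the two systems as unpaired samples, which is a simplification: both were run on the same 360 tasks, so a paired analysis on per-task outcomes (not available on this page) would be more appropriate. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    )
    return centre - half_width, centre + half_width


TOTAL = 360  # tasks in the benchmark, as reported
results = {"Gemini 3.1 Pro": 245, "strongest planner baseline": 234}

intervals = {name: wilson_interval(k, TOTAL) for name, k in results.items()}
for name, (low, high) in intervals.items():
    rate = results[name] / TOTAL
    print(f"{name}: {rate:.3f} [{low:.3f}, {high:.3f}]")
```

Under these assumptions the two intervals overlap, which is consistent with the referee's point that an 11-task difference from single runs gives limited confidence in a strict outperformance claim.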
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work's significance and for highlighting the importance of methodological clarity. We address the single major comment below and will revise the manuscript to incorporate additional detail on task creation.
- Referee: Task creation description (Methods section): The statement that tasks were 'freshly created' from the most recent IPC 'following rigorous evaluation guidelines' to avoid contamination lacks a concrete account of the generation process. It is unclear whether new predicates, objects, or goal structures were introduced versus minor reparameterizations of known IPC domains. This detail is load-bearing for distinguishing emergent planning from possible structural familiarity with IPC-style problems in pre-training data.
  Authors: We agree that a more concrete description of the task generation process is warranted to strengthen the claim that the evaluation isolates planning ability from potential pre-training familiarity. In the revised manuscript, we will expand the Methods section with a detailed account of the instance generation procedure. This will specify the exact mechanisms used to create novel instances (e.g., systematic variation in object counts, introduction of new initial-state predicate combinations and goal conjuncts drawn from the domain axioms, and explicit checks against published IPC instance sets) while preserving the original domain semantics. These additions will clarify that the tasks involve non-trivial structural novelty rather than superficial reparameterization.
  Revision: yes
Circularity Check
No circularity: direct empirical comparison to external baselines
The paper reports empirical counts of solved tasks (e.g., 245/360 for Gemini 3.1 Pro vs. 234 for the strongest classical planner) obtained by running frontier LLMs on freshly generated IPC-derived instances and validating outputs with an independent tool. These results rest on external benchmarks and classical planners rather than any derivation, fitted parameter, or self-referential definition. The assertion that tasks were 'freshly created to avoid data contamination' is a methodological claim about the experimental setup, not a quantity that reduces to the paper's own inputs by construction. No equations, predictions, or uniqueness theorems appear that would trigger self-definitional, fitted-input, or self-citation circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Tasks drawn from recent International Planning Competition instances, when freshly generated, provide a reliable test of general planning capability.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We evaluate three families of frontier LLMs on a challenging set of planning tasks based on the most recent International Planning Competition... solutions are verified with a validation tool, tasks are freshly created to avoid data contamination"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- Property-Guided LLM Program Synthesis for Planning
  Property-guided LLM program synthesis with counterexample feedback creates direct heuristics for PDDL planning domains that require far fewer generations and less evaluation cost than score-based baselines.
- Zero-Shot Goal Recognition with Large Language Models
  Frontier LLMs show uneven zero-shot performance on goal recognition in PDDL domains: some scale with accumulating evidence toward landmark-based accuracy while others stay anchored to world-knowledge priors.
- Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning
  A multi-agent LLM framework enables interactive explanations for planning problems and is evaluated against template-based interfaces in a user study on goal conflicts.