
arxiv: 2511.09378 · v2 · pith:RZQPDAUOnew · submitted 2025-11-12 · 💻 cs.AI · cs.LG

Frontier Large Language Models Rival State-of-the-Art Planners

Pith reviewed 2026-05-21 19:34 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords large language models · automated planning · international planning competition · planning benchmarks · gemini · gpt models · classical planners · symbolic reasoning

The pith

Frontier large language models solve planning tasks as well as or better than state-of-the-art classical planners.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the long-standing view that large language models cannot reliably handle planning by testing the newest frontier models on a demanding benchmark drawn from the most recent International Planning Competition. Tasks were freshly generated to prevent data contamination and solutions were checked with an independent validation tool before comparing results against leading classical planners. Gemini 3.1 Pro solved 245 of 360 problems compared with 234 for the strongest baseline, while GPT-5 performed at a comparable level. Even after semantic details were stripped from the task descriptions to test pure symbolic reasoning, Gemini 3.1 Pro stayed competitive with the top baselines. Performance has risen sharply from earlier models such as GPT-3.5, which solved none of the tasks.

Core claim

On standard task descriptions from the latest International Planning Competition, Gemini 3.1 Pro outperforms the strongest planner baseline by solving 245 versus 234 tasks out of 360, while GPT-5 achieves comparable performance; performance remains competitive even with obfuscated descriptions lacking semantic information.
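
To make the size of the headline margin concrete, the reported counts can be turned into coverage fractions. The snippet below is only arithmetic on the numbers quoted above; the variable and function names are illustrative and do not come from the paper.

```python
# Solved-task counts as reported in the summary above.
TOTAL_TASKS = 360
solved = {
    "Gemini 3.1 Pro": 245,
    "strongest planner baseline": 234,
}


def coverage(n_solved, total=TOTAL_TASKS):
    """Fraction of benchmark tasks solved."""
    return n_solved / total


margin = solved["Gemini 3.1 Pro"] - solved["strongest planner baseline"]

for name, n_solved in solved.items():
    print(f"{name}: {n_solved}/{TOTAL_TASKS} = {coverage(n_solved):.1%}")
print(f"margin: {margin} tasks ({margin / TOTAL_TASKS:.1%} of the benchmark)")
```

The gap is 11 tasks, roughly three percentage points of coverage, which is why the referee report below asks how stable the ordering is.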

What carries the argument

Rigorous evaluation on freshly created International Planning Competition tasks verified by a validation tool, with direct head-to-head comparison against state-of-the-art classical planners on both standard and semantically obfuscated versions.
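
The page does not describe how the obfuscated versions were produced. One common approach, assumed here and not confirmed as the authors' scheme, is to rename every domain-specific symbol consistently while leaving the PDDL syntax intact. The following is a minimal sketch of that idea; the keyword list is deliberately incomplete, and the function name `obfuscate` and the `symN` naming are hypothetical.

```python
import re

# Non-exhaustive set of PDDL words that must keep their spelling (assumption).
PDDL_KEYWORDS = {
    "define", "domain", "problem", "and", "not", "or", "imply",
    "forall", "exists", "when", "either", "object",
}

# A name starts with a letter and may contain letters, digits, '_' and '-'.
NAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def obfuscate(text, mapping=None):
    """Rename domain-specific symbols to sym0, sym1, ... consistently.

    Names that directly follow ':' (section keywords such as :action) or
    '?' (variables) are left untouched. Passing the same mapping to several
    calls keeps the domain and problem files consistent with each other.
    """
    mapping = {} if mapping is None else mapping

    def rename(match):
        token = match.group(0)
        start = match.start()
        if start > 0 and text[start - 1] in ":?":
            return token
        key = token.lower()
        if key in PDDL_KEYWORDS:
            return token
        if key not in mapping:
            mapping[key] = f"sym{len(mapping)}"
        return mapping[key]

    return NAME.sub(rename, text), mapping


snippet = "(:action pick-up :parameters (?b - block) :precondition (and (clear ?b) (handempty)))"
obfuscated, table = obfuscate(snippet)
print(obfuscated)
print(table)
```

After such a renaming, names like `pick-up` or `clear` no longer hint at what an action does, so a solver has to rely on the logical structure alone.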

If this is right

  • Frontier LLMs could serve as practical planners in domains where classical systems are currently used.
  • Planning ability in LLMs has advanced rapidly across successive model generations.
  • Some degree of symbolic reasoning persists even when natural-language cues are removed from task descriptions.
  • The open question is how far LLM planning performance will continue to scale with future models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid planners that combine LLM flexibility with classical search guarantees might outperform either approach alone on complex real-world problems.
  • If contamination concerns are addressed through repeated fresh benchmarks, LLMs could become a default starting point for automated planning rather than a last resort.
  • Domains such as robotics or logistics that require both symbolic correctness and adaptation to vague goals become newly accessible to LLM-based methods.

Load-bearing premise

The tasks are genuinely novel and free of training-data contamination, and the validation tool correctly classifies every plan as valid or invalid without systematic errors.
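
What "valid" means for a plan can be illustrated with a toy checker. This is a minimal sketch for ground STRIPS-style actions only, not the validation tool used in the paper; the `Action` structure, the `validate_plan` function, and the one-action robot task are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A ground STRIPS action (hypothetical structure for illustration)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset


def validate_plan(initial_state, goal, plan):
    """Simulate the plan step by step and report whether it reaches the goal."""
    state = set(initial_state)
    for step, action in enumerate(plan):
        # Every precondition must hold in the current state.
        if not action.preconditions <= state:
            return False, f"step {step}: preconditions of '{action.name}' not satisfied"
        # Apply the action: remove delete effects, then add add effects.
        state = (state - action.delete_effects) | action.add_effects
    if not frozenset(goal) <= state:
        return False, "goal not satisfied after the final action"
    return True, "plan is valid"


# Toy task (invented for this example): move a robot from room a to room b.
move_a_b = Action(
    name="move a b",
    preconditions=frozenset({"at a"}),
    add_effects=frozenset({"at b"}),
    delete_effects=frozenset({"at a"}),
)
initial = {"at a"}
goal = {"at b"}

print(validate_plan(initial, goal, [move_a_b]))
print(validate_plan(initial, goal, [move_a_b, move_a_b]))
```

A systematic error in either of the two checks (precondition test or goal test) would bias the solved-task counts, which is why the premise above treats validator correctness as load-bearing.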

What would settle it

Creating a fresh set of planning tasks from the next International Planning Competition and re-testing the same models, or uncovering systematic misclassifications by the validation tool that artificially favor LLM solutions.

Figures

Figures reproduced from arXiv: 2511.09378 by André G. Pereira, Augusto B. Corrêa, Jendrik Seipp.

Figure 1: End-to-end planning performance of frontier LLMs and a planner (LAMA) on standard ...
Figure 2: A comparison of plan lengths and reasoning effort. (a) The distribution of plan lengths for ...

Original abstract

A series of influential studies established that large language models cannot reliably solve even simple planning tasks. We show that the latest generation of frontier models overturns this conclusion. We evaluate three families of frontier LLMs on a challenging set of planning tasks based on the most recent International Planning Competition following rigorous evaluation guidelines: solutions are verified with a validation tool, tasks are freshly created to avoid data contamination, and performance is compared against state-of-the-art classical planners. On standard task descriptions, Gemini 3.1 Pro outperforms the strongest planner baseline (245 vs. 234 solved tasks out of 360), while GPT-5 achieves comparable performance to the baselines. When all semantic information is obfuscated from the descriptions to test for pure symbolic planning, performance degrades but Gemini 3.1 Pro remains competitive with the strongest baselines. A longitudinal comparison across model generations -- from GPT-3.5, which solves zero tasks, to GPT-5 -- reveals a striking upward trajectory. Frontier LLMs might finally be able to plan; the question now is how far this capability will extend.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that frontier LLMs overturn prior findings on their inability to solve planning tasks. On 360 tasks derived from the most recent IPC, Gemini 3.1 Pro solves 245 tasks (vs. 234 for the strongest classical planner baseline) under standard descriptions and remains competitive under semantic obfuscation; GPT-5 matches baseline performance, while earlier models like GPT-3.5 solve zero tasks. The evaluation uses verified solutions, fresh task creation to avoid contamination, and direct comparison to SOTA planners.

Significance. If the results hold, this would mark a substantial advance in LLM planning capabilities and challenge the view that LLMs are unreliable for planning. The empirical design—solution verification, direct SOTA comparisons, and longitudinal model tracking—provides a strong basis for the claims. The obfuscation experiment and fresh-task protocol are notable strengths for isolating planning ability.

major comments (1)
  1. [Methods - Task Creation] The statement that tasks were 'freshly created' from the most recent IPC 'following rigorous evaluation guidelines' to avoid contamination lacks a concrete account of the generation process. It is unclear whether new predicates, objects, or goal structures were introduced or whether the tasks are minor reparameterizations of known IPC domains. This detail is load-bearing for distinguishing emergent planning from possible structural familiarity with IPC-style problems in pre-training data.
minor comments (2)
  1. [Abstract] Abstract: Explicitly name the three families of frontier LLMs and the number of domains covered to improve immediate readability.
  2. [Results] Results presentation: Report variance or multiple runs for the LLM results, as the 245 vs. 234 margin is narrow and single-run figures limit confidence in the outperformance claim.
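
To illustrate why the referee calls the 245 vs. 234 margin narrow, one can attach a rough uncertainty band to each coverage figure. The sketch below uses a Wilson score interval, which treats each task as an independent trial. That binomial model is an assumption made here, not an analysis from the paper, and it measures sampling uncertainty over tasks, not the run-to-run variance the referee asks for.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


TOTAL_TASKS = 360
gemini_low, gemini_high = wilson_interval(245, TOTAL_TASKS)
planner_low, planner_high = wilson_interval(234, TOTAL_TASKS)

print(f"Gemini 3.1 Pro:   [{gemini_low:.3f}, {gemini_high:.3f}]")
print(f"planner baseline: [{planner_low:.3f}, {planner_high:.3f}]")
print("intervals overlap:", gemini_low <= planner_high and planner_low <= gemini_high)
```

Under this model the two intervals overlap. Overlap is not a formal significance test, and both systems were run on the same tasks, so a paired comparison with per-task results would be more informative than this sketch.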

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of the work's significance and for highlighting the importance of methodological clarity. We address the single major comment below and will revise the manuscript to incorporate additional detail on task creation.

Point-by-point responses
  1. Referee: Task creation description (Methods section): The statement that tasks were 'freshly created' from the most recent IPC 'following rigorous evaluation guidelines' to avoid contamination lacks a concrete account of the generation process. It is unclear whether new predicates, objects, or goal structures were introduced versus minor reparameterizations of known IPC domains. This detail is load-bearing for distinguishing emergent planning from possible structural familiarity with IPC-style problems in pre-training data.

    Authors: We agree that a more concrete description of the task generation process is warranted to strengthen the claim that the evaluation isolates planning ability from potential pre-training familiarity. In the revised manuscript, we will expand the Methods section with a detailed account of the instance generation procedure. This will specify the exact mechanisms used to create novel instances (e.g., systematic variation in object counts, introduction of new initial-state predicate combinations and goal conjuncts drawn from the domain axioms, and explicit checks against published IPC instance sets) while preserving the original domain semantics. These additions will clarify that the tasks involve non-trivial structural novelty rather than superficial reparameterization.

    revision: yes
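
The "explicit checks against published IPC instance sets" mentioned in the simulated rebuttal are not specified. One simple form such a check could take, shown purely as a hypothetical sketch, is to fingerprint each instance after stripping comments, case, and whitespace, and then flag exact matches. All function names and the toy instances are invented for illustration.

```python
import hashlib
import re


def normalize_pddl(text):
    """Drop ';' comments, lowercase, and collapse all whitespace."""
    without_comments = re.sub(r";[^\n]*", "", text)
    return " ".join(without_comments.lower().split())


def fingerprint(text):
    """SHA-256 digest of the normalized instance text."""
    return hashlib.sha256(normalize_pddl(text).encode("utf-8")).hexdigest()


def find_duplicates(new_instances, published_instances):
    """Return names of new instances whose fingerprint matches a published one."""
    published = {fingerprint(text) for text in published_instances.values()}
    return [name for name, text in new_instances.items() if fingerprint(text) in published]


# Toy instance texts (invented for this example).
published = {"p01": "(define (problem P1)\n ; original instance\n (:objects a b))"}
fresh = {
    "n01": "(define (problem p1) (:objects a b))",
    "n02": "(define (problem p2) (:objects a b c))",
}
print(find_duplicates(fresh, published))
```

A check like this only catches verbatim reuse. It would not detect an instance that differs from a published one merely by renamed objects, so it addresses only part of the referee's concern about structural familiarity.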

Circularity Check

0 steps flagged

No circularity: direct empirical comparison to external baselines

Full rationale

The paper reports empirical counts of solved tasks (e.g., 245/360 for Gemini 3.1 Pro vs. 234 for the strongest classical planner) obtained by running frontier LLMs on freshly generated IPC-derived instances and validating outputs with an independent tool. These results rest on external benchmarks and classical planners rather than any derivation, fitted parameter, or self-referential definition. The assertion that tasks were 'freshly created to avoid data contamination' is a methodological claim about the experimental setup, not a quantity that reduces to the paper's own inputs by construction. No equations, predictions, or uniqueness theorems appear that would trigger self-definitional, fitted-input, or self-citation circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical benchmark results rather than derivations, so the ledger contains only a domain assumption about the validity of the chosen planning tasks and verification procedure.

axioms (1)
  • domain assumption Tasks drawn from recent International Planning Competition instances, when freshly generated, provide a reliable test of general planning capability.
    The paper's comparison depends on these benchmarks being representative and free of contamination.

pith-pipeline@v0.9.0 · 5726 in / 1195 out tokens · 82816 ms · 2026-05-21T19:34:22.913941+00:00 · methodology



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Property-Guided LLM Program Synthesis for Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    Property-guided LLM program synthesis with counterexample feedback creates direct heuristics for PDDL planning domains that require far fewer generations and less evaluation cost than score-based baselines.

  2. Zero-Shot Goal Recognition with Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    Frontier LLMs show uneven zero-shot performance on goal recognition in PDDL domains: some scale with accumulating evidence toward landmark-based accuracy while others stay anchored to world-knowledge priors.

  3. Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning

    cs.AI 2026-03 unverdicted novelty 5.0

    A multi-agent LLM framework enables interactive explanations for planning problems and is evaluated against template-based interfaces in a user study on goal conflicts.
