Frontier Large Language Models Rival State-of-the-Art Planners
A series of influential studies established that large language models cannot reliably solve even simple planning tasks. We show that the latest generation of frontier models overturns this conclusion. We evaluate three families of frontier LLMs on a challenging set of planning tasks based on the most recent International Planning Competition, following rigorous evaluation guidelines: solutions are verified with a validation tool, tasks are freshly created to avoid data contamination, and performance is compared against state-of-the-art classical planners. On standard task descriptions, Gemini 3.1 Pro outperforms the strongest planner baseline (245 vs. 234 solved tasks out of 360), while GPT-5 achieves comparable performance to the baselines. When all semantic information is obfuscated from the descriptions to test for pure symbolic planning, performance degrades but Gemini 3.1 Pro remains competitive with the strongest baselines. A longitudinal comparison across model generations -- from GPT-3.5, which solves zero tasks, to GPT-5 -- reveals a striking upward trajectory. Frontier LLMs might finally be able to plan; the question now is how far this capability will extend.
Forward citations
Cited by 3 Pith papers
- Property-Guided LLM Program Synthesis for Planning
  Property-guided LLM program synthesis with counterexample feedback creates direct heuristics for PDDL planning domains that require far fewer generations and less evaluation cost than score-based baselines.
- Zero-Shot Goal Recognition with Large Language Models
  Frontier LLMs show uneven zero-shot performance on goal recognition in PDDL domains: some scale with accumulating evidence toward landmark-based accuracy while others stay anchored to world-knowledge priors.
- Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning
  A multi-agent LLM framework enables interactive explanations for planning problems and is evaluated against template-based interfaces in a user study on goal conflicts.