Frontier Large Language Models Rival State-of-the-Art Planners

André G. Pereira; Augusto B. Corrêa; Jendrik Seipp

arxiv: 2511.09378 · v2 · pith:RZQPDAUOnew · submitted 2025-11-12 · 💻 cs.AI · cs.LG

keywords: tasks, frontier, planning, models, performance, baselines, descriptions, gemini

A series of influential studies established that large language models cannot reliably solve even simple planning tasks. We show that the latest generation of frontier models overturns this conclusion. We evaluate three families of frontier LLMs on a challenging set of planning tasks based on the most recent International Planning Competition, following rigorous evaluation guidelines: solutions are verified with a validation tool, tasks are freshly created to avoid data contamination, and performance is compared against state-of-the-art classical planners. On standard task descriptions, Gemini 3.1 Pro outperforms the strongest planner baseline (245 vs. 234 solved tasks out of 360), while GPT-5 achieves comparable performance to the baselines. When all semantic information is obfuscated from the descriptions to test for pure symbolic planning, performance degrades but Gemini 3.1 Pro remains competitive with the strongest baselines. A longitudinal comparison across model generations (from GPT-3.5, which solves zero tasks, to GPT-5) reveals a striking upward trajectory. Frontier LLMs might finally be able to plan; the question now is how far this capability will extend.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Property-Guided LLM Program Synthesis for Planning
cs.AI 2026-05 unverdicted novelty 7.0

Property-guided LLM program synthesis with counterexample feedback creates direct heuristics for PDDL planning domains that require far fewer generations and less evaluation cost than score-based baselines.

Zero-Shot Goal Recognition with Large Language Models
cs.AI 2026-05 unverdicted novelty 7.0

Frontier LLMs show uneven zero-shot performance on goal recognition in PDDL domains: some scale with accumulating evidence toward landmark-based accuracy while others stay anchored to world-knowledge priors.

Exploring Plan Space through Conversation: An Agentic Framework for LLM-Mediated Explanations in Planning
cs.AI 2026-03 unverdicted novelty 5.0

A multi-agent LLM framework enables interactive explanations for planning problems and is evaluated against template-based interfaces in a user study on goal conflicts.