Analysis of Optimality of Large Language Models on Planning Problems

Aaron Parisi; Bernd Bohnet; Kathleen Kenealy; Kevin Swersky; Michael C. Mozer; Noah Fiedel; Wil Cunningham

arxiv: 2604.02910 · v1 · submitted 2026-04-03 · 💻 cs.AI · cs.CL

Analysis of Optimality of Large Language Models on Planning Problems

Bernd Bohnet , Michael C. Mozer , Kevin Swersky , Wil Cunningham , Aaron Parisi , Kathleen Kenealy , Noah Fiedel This is my paper

Pith reviewed 2026-05-13 19:43 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords large language modelsplanning problemsblocksworldpath-star graphoptimalitytopological reasoningalgorithmic simulationgeometric memory

0 comments

The pith

Frontier large language models achieve near-perfect optimality on planning tasks even when stripped of semantic information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how well large language models perform on classic planning problems such as rearranging blocks in Blocksworld to match goal configurations. It introduces an equivalent but semantically stripped Path-Star graph task to test if models are doing true topological reasoning. Results show that reasoning-enhanced LLMs track the theoretical optimal plan lengths with high precision as problems get deeper, wider, and more compositional, unlike classical search methods that scale poorly. This matters because it suggests LLMs are not limited to heuristic shortcuts but can approach optimal solutions in combinatorial domains.

Core claim

Although classical search algorithms hit a wall as the search space expands, LLMs track theoretical optimality limits with near-perfect precision, even when domain-specific semantic hints are stripped away. Experiments on Blocksworld and the generalized Path-Star graph demonstrate that models outperform traditional planners like LAMA in complex multi-goal setups, supported by evidence for algorithmic simulation through reasoning tokens and a geometric memory that represents the problem topology as a navigable global geometry.

What carries the argument

The generalized Path-Star (P*) graph, formally equivalent to Blocksworld, used to isolate topological reasoning from semantic priors, along with the two explanatory hypotheses of algorithmic simulation and geometric memory.

If this is right

Reasoning-enhanced LLMs significantly outperform traditional satisficing planners in complex, multi-goal configurations.
LLM performance remains close to theoretical optimality as problem depth, width, and compositionality increase.
Classical search methods degrade with expanding search space while LLMs do not.
The findings support the possibility of active algorithmic simulation executed via reasoning tokens.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models may be maintaining an internal representation of configuration spaces as continuous geometries rather than discrete search trees.
This capability could transfer to other planning domains like logistics or scheduling if the geometric memory generalizes.
Future work might test if increasing the number of reasoning tokens further improves adherence to optimality bounds on larger instances.

Load-bearing premise

The Path-Star graph presentation truly isolates topological reasoning rather than allowing the model to exploit any remaining statistical patterns from training or prompt structure.

What would settle it

Run the same experiments on Path-Star graphs with significantly larger depths where the optimal solution length exceeds what could be guessed from training data patterns, and check whether precision to optimality is maintained.

read the original abstract

Classic AI planning problems have been revisited in the Large Language Model (LLM) era, with a focus of recent benchmarks on success rates rather than plan efficiency. We examine the degree to which frontier models reason optimally versus relying on simple, heuristic, and possibly inefficient strategies. We focus on the Blocksworld domain involving towers of labeled blocks which have to be moved from an initial to a goal configuration via a set of primitive actions. We also study a formally equivalent task, the generalized Path-Star ($P^*$) graph, in order to isolate true topological reasoning from semantic priors. We systematically manipulate problem depth (the height of block towers), width (the number of towers), and compositionality (the number of goal blocks). Reasoning-enhanced LLMs significantly outperform traditional satisficing planners (e.g., LAMA) in complex, multi-goal configurations. Although classical search algorithms hit a wall as the search space expands, LLMs track theoretical optimality limits with near-perfect precision, even when domain-specific semantic hints are stripped away. To explain these surprising findings, we consider (and find evidence to support) two hypotheses: an active Algorithmic Simulation executed via reasoning tokens and a Geometric Memory that allows models to represent the $P^*$ topology as a navigable global geometry, effectively bypassing exponential combinatorial complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LLMs match optimal plan lengths on Blocksworld and its stripped Path-Star graph version, but the encoding needs close checks to confirm it isolates topology rather than pattern matching.

read the letter

The key point is that reasoning-enhanced LLMs produce plans whose lengths stay close to theoretical optima on both the standard Blocksworld domain and a generalized Path-Star graph that removes block semantics. They also beat satisficing planners like LAMA once the problems get wider or more compositional, while classical search runs into scaling limits. This is measured directly against independently computed optimal lengths rather than just success rates. The controlled Path-Star setup is the clearest new element. It lets them test whether performance holds when language priors about towers and blocks are taken away, which goes beyond most existing LLM planning benchmarks that stay inside the original domain. The two hypotheses they float—an active simulation through reasoning tokens and some form of geometric memory for the graph—give a concrete way to interpret the results. The abstract reports that they find supporting evidence for both. The main soft spot is the prompt encoding for Path-Star. Without details on node labels, edge phrasing, or how the graph is presented, it is hard to rule out that models are exploiting residual statistical patterns from training data instead of computing paths or plans from scratch. The stress-test note raises this exact issue, and the abstract alone does not resolve it. Minor gaps include the lack of error breakdowns by problem size and any mention of how optimality was verified at scale. This paper is for researchers who build or evaluate LLM planners and hybrid systems. It gives them a cleaner testbed than pure success-rate suites, so people working on reasoning benchmarks or automated planning would find the setup useful even if they end up questioning the strength of the optimality claims. I would send it to peer review. The direct optimality comparison on a controlled graph problem is worth referee time, though the review will probably focus on prompt controls and verification of the optimality calculations.

Referee Report

2 major / 2 minor

Summary. The paper claims that reasoning-enhanced LLMs achieve near-optimal performance on Blocksworld planning problems and their semantically equivalent generalized Path-Star (P*) graph variant, significantly outperforming traditional satisficing planners such as LAMA in complex multi-goal settings. LLMs are reported to track theoretical optimality limits with near-perfect precision even after stripping domain-specific semantic hints, with the results explained by two hypotheses: active algorithmic simulation executed via reasoning tokens and a geometric memory that represents P* topology as navigable global geometry bypassing combinatorial explosion.

Significance. If the empirical results hold after detailed verification, the work would provide evidence that LLMs can perform topological reasoning at near-optimal levels on planning tasks without relying on semantic priors or exhaustive search, potentially reframing discussions of LLM capabilities in combinatorial domains and offering new directions for hybrid planning systems. The use of a formally equivalent stripped task (P*) to isolate reasoning is a strength, as is the direct comparison against independently computed optimal plan lengths.

major comments (2)

[P* task definition (likely §3)] P* task definition (likely §3): the description of node labeling, edge representation, and prompt formatting for the generalized Path-Star graph is insufficient to establish isolation of topological reasoning. If nodes use sequential integers or motifs common in pretraining data, residual n-gram or traversal patterns could explain the reported near-perfect optimality tracking rather than algorithmic simulation or geometric memory, directly weakening the central claim that performance persists after semantic stripping.
[Results and evaluation (§4)] Results and evaluation (§4): the claim of 'near-perfect precision' in tracking optimality limits lacks accompanying error analysis, variance across problem depth/width/compositionality, or explicit data-split details. Without these, it is impossible to rule out selective reporting of problem sizes or prompt-engineering effects, which is load-bearing for the outperformance assertion over LAMA.

minor comments (2)

[Hypotheses] The hypotheses section introduces 'Algorithmic Simulation via reasoning tokens' and 'Geometric Memory' without a concise operational definition or reference to how they are measured in the experiments; adding one sentence linking each to specific metrics would improve clarity.
[Figures] Figure captions for P* examples should explicitly state the prompt template used so readers can assess encoding leakage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us strengthen the manuscript. We address each major comment point by point below. Revisions have been made to improve clarity and rigor where the concerns are valid.

read point-by-point responses

Referee: P* task definition (likely §3): the description of node labeling, edge representation, and prompt formatting for the generalized Path-Star graph is insufficient to establish isolation of topological reasoning. If nodes use sequential integers or motifs common in pretraining data, residual n-gram or traversal patterns could explain the reported near-perfect optimality tracking rather than algorithmic simulation or geometric memory, directly weakening the central claim that performance persists after semantic stripping.

Authors: We agree that the original description in §3 could be more explicit to rule out pretraining artifacts. In the revised manuscript we have expanded this section with a formal definition of the P* construction: nodes receive unique abstract alphanumeric labels drawn from a fixed non-sequential vocabulary explicitly chosen to avoid common n-gram patterns; edges are represented solely as unordered adjacency pairs with no semantic annotations; and prompts follow a rigid template that supplies only the adjacency list, start node, and goal set. We have added an appendix containing the exact labeling algorithm, sample prompts, and a small controlled experiment confirming that performance does not degrade when labels are further randomized. These changes directly support the claim that the observed optimality tracking arises from algorithmic simulation and geometric memory rather than residual linguistic patterns. revision: yes
Referee: Results and evaluation (§4): the claim of 'near-perfect precision' in tracking optimality limits lacks accompanying error analysis, variance across problem depth/width/compositionality, or explicit data-split details. Without these, it is impossible to rule out selective reporting of problem sizes or prompt-engineering effects, which is load-bearing for the outperformance assertion over LAMA.

Authors: We accept that the original presentation of results was insufficiently detailed. The revised §4 now includes per-condition standard deviations and 95% confidence intervals for optimality-gap metrics, broken down by depth, width, and number of goal blocks. We have also documented the full problem-generation procedure, the train/test split (ensuring no overlap between evaluation instances and any prompt-tuning data), and an ablation across three distinct prompt phrasings. These additions show that the near-perfect tracking and outperformance versus LAMA hold consistently across the tested range and are not artifacts of selective reporting or prompt engineering. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on independent empirical benchmarks

full rationale

The paper's core assertions compare LLM outputs to externally computed optimal plan lengths in Blocksworld and the P* graph. These optima are derived from the problem definitions themselves rather than from any fitted parameters or self-citations within the work. The P* construction is introduced explicitly to control for semantic leakage, and the reported tracking of optimality limits is presented as a measurement result, not a definitional identity. No load-bearing step reduces by construction to the inputs via self-definition, renaming, or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The paper relies on standard graph theory for defining optimality in the Path-Star domain and on the assumption that plan length is a faithful proxy for optimality. No free parameters are introduced to fit the main result. The two explanatory hypotheses (Algorithmic Simulation and Geometric Memory) are post-hoc interpretations rather than axioms.

axioms (1)

domain assumption Plan length equals optimality in the Blocksworld and Path-Star domains
Invoked when claiming LLMs track theoretical optimality limits

invented entities (2)

Algorithmic Simulation via reasoning tokens no independent evidence
purpose: Explains how LLMs produce optimal plans without explicit search
Post-hoc hypothesis with no independent falsifiable prediction stated in the abstract
Geometric Memory for P* topology no independent evidence
purpose: Explains bypassing of combinatorial complexity
Post-hoc hypothesis with no independent falsifiable prediction stated in the abstract

pith-pipeline@v0.9.0 · 5546 in / 1395 out tokens · 35993 ms · 2026-05-13T19:43:33.275341+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We also study a formally equivalent task, the generalized Path-Star (P*) graph, in order to isolate true topological reasoning from semantic priors.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

LLMs track theoretical optimality limits with near-perfect precision

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

[1]

Clear ‘b3 ‘. 2. Clear ‘ b16 ‘. 3. ‘ pick - up b3 ‘. 4. ‘ stack b3 b16 ‘. Okay , now that I have everything , let ’ s detail the steps . To clear ‘b3 ‘: First , ‘ unstack b37 b14 ‘ , then ‘ put - down b37 ‘ , and so on . Co nt in ue with ‘ b14 ‘ , ‘ b15 ‘ , ‘ b46 ‘ , ‘ b50 ‘ , ‘ b26 ‘ , and finally ‘ b34 ‘. At each step , I put - down , then unstack a laye...

work page
[2]

‘( unstack b37 b14 ) ‘

work page
[3]

‘( put - down b37 ) ‘

work page
[4]

‘( unstack b14 b15 ) ‘

work page
[5]

‘( put - down b14 ) ‘

work page
[6]

‘( unstack b15 b46 ) ‘

work page
[7]

‘( put - down b15 ) ‘

work page
[8]

‘( unstack b46 b50 ) ‘

work page
[9]

‘( put - down b46 ) ‘

work page
[10]

‘( unstack b50 b26 ) ‘

work page
[11]

‘( put - down b50 ) ‘ 15 Analysis of Optimality of Large Language Models on Planning Problems

work page
[12]

‘( unstack b26 b34 ) ‘

work page
[13]

‘( put - down b26 ) ‘

work page
[14]

‘( unstack b34 b3 ) ‘

work page
[15]

‘( put - down b34 ) ‘

work page
[16]

‘( unstack b2 b49 ) ‘

work page
[17]

‘( put - down b2 ) ‘

work page
[18]

‘( unstack b49 b21 ) ‘

work page
[19]

‘( put - down b49 ) ‘

work page
[20]

‘( unstack b21 b16 ) ‘

work page
[21]

Graph problem evaluation prompts The following prompt, is the rewritten Blocksworld problem

‘( put - down b2 ) ‘ A.3. Graph problem evaluation prompts The following prompt, is the rewritten Blocksworld problem. SYSTEM PROMPT : GRAPH REWRITE SOLVER I . THE DOMAIN You are a Graph R e w r i t i n g Engine . You operate on a d ir ec te d graph r e p r e s e n t i n g a h i e r a r c h i c a l data s t r u c t u r e . D e f i n i t i o n s : Nodes ( ...

work page
[22]

( C o n s t r a i n t : Do NOT use this to detach from the Root R )

D E T A C H _ N O D E ( child , parent ) S e m a n t i c s : D et ac he s a leaf node from its current parent . ( C o n s t r a i n t : Do NOT use this to detach from the Root R ) . P r e c o n d i t i o n s : - Edge parent -> child exists . - child is a Leaf ( Out - degree = 0) . - T ( Tra ns fe r Node ) is empty ( Out - degree = 0) . - parent is not R ....

work page
[23]

( C o n s t r a i n t : Do NOT use this to attach to the Root R )

A T T A C H _ N O D E ( child , target ) S e m a n t i c s : A tt ac he s the node c u r r e n t l y in the T ra ns fe r Node to a new target leaf . ( C o n s t r a i n t : Do NOT use this to attach to the Root R ) . P r e c o n d i t i o n s : - Edge T -> child exists . - target is a Leaf node ( Out - degree = 0) . - child != target . - target is not R ....

work page
[24]

P r e c o n d i t i o n s : - Edge T -> child exists

A T T A C H _ T O _ R O O T ( child ) S e m a n t i c s : A tt ac he s the node c u r r e n t l y in the T ra ns fe r Node to the Root R . P r e c o n d i t i o n s : - Edge T -> child exists . Effect : Delete edge T -> child . Add edge R -> child

work page
[25]

P r e c o n d i t i o n s : - Edge R -> child exists

D E T A C H _ F R O M _ R O O T ( child ) S e m a n t i c s : D et ac he s a leaf node that is c u r r e n t l y c o n n e c t e d di re ct ly to the Root R . P r e c o n d i t i o n s : - Edge R -> child exists . - child is a Leaf ( Out - degree = 0) . - T ( Tr an sfe r Node ) is empty ( Out - degree = 0) . Effect : Delete edge R -> child . Add edge T ->...

work page 2011

[1] [1]

Clear ‘b3 ‘. 2. Clear ‘ b16 ‘. 3. ‘ pick - up b3 ‘. 4. ‘ stack b3 b16 ‘. Okay , now that I have everything , let ’ s detail the steps . To clear ‘b3 ‘: First , ‘ unstack b37 b14 ‘ , then ‘ put - down b37 ‘ , and so on . Co nt in ue with ‘ b14 ‘ , ‘ b15 ‘ , ‘ b46 ‘ , ‘ b50 ‘ , ‘ b26 ‘ , and finally ‘ b34 ‘. At each step , I put - down , then unstack a laye...

work page

[2] [2]

‘( unstack b37 b14 ) ‘

work page

[3] [3]

‘( put - down b37 ) ‘

work page

[4] [4]

‘( unstack b14 b15 ) ‘

work page

[5] [5]

‘( put - down b14 ) ‘

work page

[6] [6]

‘( unstack b15 b46 ) ‘

work page

[7] [7]

‘( put - down b15 ) ‘

work page

[8] [8]

‘( unstack b46 b50 ) ‘

work page

[9] [9]

‘( put - down b46 ) ‘

work page

[10] [10]

‘( unstack b50 b26 ) ‘

work page

[11] [11]

‘( put - down b50 ) ‘ 15 Analysis of Optimality of Large Language Models on Planning Problems

work page

[12] [12]

‘( unstack b26 b34 ) ‘

work page

[13] [13]

‘( put - down b26 ) ‘

work page

[14] [14]

‘( unstack b34 b3 ) ‘

work page

[15] [15]

‘( put - down b34 ) ‘

work page

[16] [16]

‘( unstack b2 b49 ) ‘

work page

[17] [17]

‘( put - down b2 ) ‘

work page

[18] [18]

‘( unstack b49 b21 ) ‘

work page

[19] [19]

‘( put - down b49 ) ‘

work page

[20] [20]

‘( unstack b21 b16 ) ‘

work page

[21] [21]

Graph problem evaluation prompts The following prompt, is the rewritten Blocksworld problem

‘( put - down b2 ) ‘ A.3. Graph problem evaluation prompts The following prompt, is the rewritten Blocksworld problem. SYSTEM PROMPT : GRAPH REWRITE SOLVER I . THE DOMAIN You are a Graph R e w r i t i n g Engine . You operate on a d ir ec te d graph r e p r e s e n t i n g a h i e r a r c h i c a l data s t r u c t u r e . D e f i n i t i o n s : Nodes ( ...

work page

[22] [22]

( C o n s t r a i n t : Do NOT use this to detach from the Root R )

D E T A C H _ N O D E ( child , parent ) S e m a n t i c s : D et ac he s a leaf node from its current parent . ( C o n s t r a i n t : Do NOT use this to detach from the Root R ) . P r e c o n d i t i o n s : - Edge parent -> child exists . - child is a Leaf ( Out - degree = 0) . - T ( Tra ns fe r Node ) is empty ( Out - degree = 0) . - parent is not R ....

work page

[23] [23]

( C o n s t r a i n t : Do NOT use this to attach to the Root R )

A T T A C H _ N O D E ( child , target ) S e m a n t i c s : A tt ac he s the node c u r r e n t l y in the T ra ns fe r Node to a new target leaf . ( C o n s t r a i n t : Do NOT use this to attach to the Root R ) . P r e c o n d i t i o n s : - Edge T -> child exists . - target is a Leaf node ( Out - degree = 0) . - child != target . - target is not R ....

work page

[24] [24]

P r e c o n d i t i o n s : - Edge T -> child exists

A T T A C H _ T O _ R O O T ( child ) S e m a n t i c s : A tt ac he s the node c u r r e n t l y in the T ra ns fe r Node to the Root R . P r e c o n d i t i o n s : - Edge T -> child exists . Effect : Delete edge T -> child . Add edge R -> child

work page

[25] [25]

P r e c o n d i t i o n s : - Edge R -> child exists

D E T A C H _ F R O M _ R O O T ( child ) S e m a n t i c s : D et ac he s a leaf node that is c u r r e n t l y c o n n e c t e d di re ct ly to the Root R . P r e c o n d i t i o n s : - Edge R -> child exists . - child is a Leaf ( Out - degree = 0) . - T ( Tr an sfe r Node ) is empty ( Out - degree = 0) . Effect : Delete edge R -> child . Add edge T ->...

work page 2011