Do LLMs Build Spatial World Models? Evidence from Grid-World Maze Tasks

Parijat Dube; Rajarshi Das; Weijiang Li; Yilin Zhu

arxiv: 2604.10690 · v1 · submitted 2026-04-12 · 💻 cs.AI

Weijiang Li, Yilin Zhu, Rajarshi Das, Parijat Dube

Pith reviewed 2026-05-10 15:06 UTC · model grok-4.3

classification 💻 cs.AI

keywords large language models · spatial reasoning · world models · maze navigation · grid-world tasks · chain-of-thought prompting · representation dependence

The pith

Large language models do not build robust spatial world models but instead depend on narrow, representation-specific cues for maze tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs develop an internal spatial world model by running them on grid-world mazes that require multi-step planning. Models achieve high accuracy only with certain tokenized adjacency inputs and chain-of-thought prompting, but accuracy drops sharply with visual grid formats. Sequential questions about proximity and distance show that models do not carry spatial information forward from one query to the next. A sympathetic reader would care because many proposed uses of LLMs in planning, navigation, or robotics assume exactly the kind of stable spatial representation these results call into question.

Core claim

Experiments cover Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat on 5x5 to 7x7 mazes. Under chain-of-thought prompting, Gemini reaches 80-86% accuracy with tokenized adjacency representations, yet falls to 16-34% with visual grid formats. Reasoning traces reach 96-99% semantic coverage, but models treat each proximity or distance question independently and show no cumulative improvement, indicating representation-dependent rather than format-invariant spatial reasoning.
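The abstract describes this drop as a 2-5x difference. A quick check of that figure from the endpoints of the two ranges, assuming the extremes can be paired freely (the review does not say which maze size produced which number):

```python
# Accuracy ranges quoted in the core claim, in percent.
adjacency = (80, 86)    # tokenized adjacency with chain-of-thought
visual_grid = (16, 34)  # visual grid format

# Smallest gap: lowest adjacency score over highest grid score.
min_ratio = adjacency[0] / visual_grid[1]
# Largest gap: highest adjacency score over lowest grid score.
max_ratio = adjacency[1] / visual_grid[0]

print(f"{min_ratio:.1f}x to {max_ratio:.1f}x")  # prints "2.4x to 5.4x"
```

The endpoints give roughly 2.4x to 5.4x, which is consistent with the rounded 2-5x stated in the abstract.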

What carries the argument

Maze tasks that vary input format (tokenized adjacency versus visual grid) and present sequential spatial questions to check whether accuracy improves as the model accumulates information.
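To make the format contrast concrete, the sketch below renders one small maze in two ways. Both formats are assumptions for illustration: this review does not reproduce the paper's exact token vocabulary or grid rendering, and `adjacency_text`, `grid_text`, and the 2x2 maze are hypothetical.

```python
from typing import FrozenSet, Set, Tuple

Cell = Tuple[int, int]        # (row, col)
Passage = FrozenSet[Cell]     # unordered pair of adjacent cells with no wall between them


def adjacency_text(n: int, passages: Set[Passage]) -> str:
    """One line per cell, listing the neighbours reachable without crossing a wall."""
    lines = []
    for r in range(n):
        for c in range(n):
            nbrs = sorted(
                (r + dr, c + dc)
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if frozenset({(r, c), (r + dr, c + dc)}) in passages
            )
            lines.append(f"({r},{c}) <-> " + " ".join(f"({a},{b})" for a, b in nbrs))
    return "\n".join(lines)


def grid_text(n: int, passages: Set[Passage]) -> str:
    """ASCII picture: '#' is wall, ' ' is open. Cell (r, c) sits at (2r+1, 2c+1)."""
    size = 2 * n + 1
    canvas = [["#"] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            canvas[2 * r + 1][2 * c + 1] = " "
    for passage in passages:
        (r1, c1), (r2, c2) = tuple(passage)
        canvas[r1 + r2 + 1][c1 + c2 + 1] = " "  # midpoint between the two cells
    return "\n".join("".join(row) for row in canvas)


# Hypothetical 2x2 maze: a single corridor (0,0) -> (0,1) -> (1,1) -> (1,0).
passages = {
    frozenset({(0, 0), (0, 1)}),
    frozenset({(0, 1), (1, 1)}),
    frozenset({(1, 1), (1, 0)}),
}
print(adjacency_text(2, passages))
print(grid_text(2, passages))
```

Both strings encode the same graph, so any accuracy difference between them reflects how the model reads the input rather than a change in the underlying task.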

If this is right

LLMs cannot be assumed to maintain a consistent internal map when input presentation changes.
Spatial planning applications will require explicit external representations or additional modules rather than relying on the model's implicit reasoning.
Chain-of-thought prompting improves surface performance only within narrow representation windows.
Models process each spatial query as an isolated language task even when the underlying maze is the same.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectural additions that enforce persistent spatial state may be needed before LLMs can handle dynamic environments reliably.
Testing on larger or partially observed mazes could reveal whether the format dependence scales or eventually disappears.
Hybrid systems that pair language models with dedicated mapping modules might bypass the observed limitations.

Load-bearing premise

That large differences in accuracy by input format and the absence of improvement across sequential questions indicate the lack of an internal spatial world model instead of prompt sensitivity or training gaps.

What would settle it

An experiment in which a model maintains near-identical high accuracy on both tokenized and visual maze formats while showing measurably better answers on later questions in a sequence because it reuses information from earlier answers.
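The second half of that test, better answers on later questions, could be scored as in the sketch below. It assumes each logged answer is tagged with its position in the question sequence; `accuracy_by_position`, `trend_slope`, and the toy records are hypothetical and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_position(records: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """records holds (position of the question in its sequence, answered correctly)."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for position, correct in records:
        totals[position] += 1
        hits[position] += int(correct)
    return {p: hits[p] / totals[p] for p in sorted(totals)}


def trend_slope(acc: Dict[int, float]) -> float:
    """Least-squares slope of accuracy against question position (needs 2+ positions)."""
    xs = list(acc)
    ys = [acc[x] for x in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Made-up toy log, only to show the call pattern.
toy_records = [(1, True), (1, False), (2, True), (2, True), (3, True), (3, True)]
acc = accuracy_by_position(toy_records)
slope = trend_slope(acc)
```

A slope near zero across many mazes would match the paper's reading that questions are answered independently; a clearly positive slope would point toward accumulated spatial knowledge.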

Figures

Figures reproduced from arXiv: 2604.10690 by Parijat Dube, Rajarshi Das, Weijiang Li, Yilin Zhu.

**Figure 2.** An overview of model performance across all tasks and maze sizes, with each spoke rep…

Original abstract

Foundation models have shown remarkable performance across diverse tasks, yet their ability to construct internal spatial world models for reasoning and planning remains unclear. We systematically evaluate the spatial understanding of large language models through maze tasks, a controlled testing context requiring multi-step planning and spatial abstraction. Across comprehensive experiments with Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat, we uncover significant discrepancies in spatial reasoning that challenge assumptions about LLM planning capabilities. Using chain-of-thought prompting, Gemini achieves 80-86% accuracy on smaller mazes (5x5 to 7x7 grids) with tokenized adjacency representations, but performance collapses to 16-34% with visual grid formats, which is a 2-5x difference, suggesting representation-dependent rather than format-invariant spatial reasoning. We further probe spatial understanding through sequential proximity questions and compositional distance comparisons. Despite achieving 96-99% semantic coverage in reasoning traces, models fail to leverage this understanding for consistent spatial computations, indicating that they treat each question independently rather than building cumulative spatial knowledge. Our findings based on the maze-solving tasks suggest that LLMs do not develop robust spatial world models, but rather exhibit representation-specific and prompting-dependent reasoning that succeeds only under narrow conditions. These results have critical implications for deploying foundation models in applications requiring spatial abstraction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The experiments document clear accuracy drops from adjacency lists to visual grids plus non-cumulative answers on follow-up questions, but the data do not yet rule out tokenization or prompting effects as the cause.

The paper's main result is that these LLMs handle the maze tasks quite differently depending on whether the input is given as a tokenized adjacency list or as a visual grid. Accuracy stays high at 80-86 percent for the lists on smaller mazes but falls sharply to 16-34 percent for the grids. On top of that, when the authors ask follow-up questions about proximity and distances, the models give semantically reasonable chain-of-thought traces most of the time but do not show consistent or improving spatial answers, suggesting they treat each query separately.

The new part is the side-by-side testing of four frontier models with these two representations plus the sequential probing. It is straightforward and the differences come through clearly. That kind of comparative data is helpful for understanding where current systems stand on spatial abstraction.

Where it gets soft is in moving from the observed gaps to the conclusion that LLMs do not develop robust spatial world models. The drop with visual grids could be explained by differences in how the tokenizers handle grid layouts versus lists, or by the models not being trained to extract spatial structure from that kind of input. The non-cumulative performance across questions might just mean the prompt does not force the model to keep an internal state updated. The paper does not report ablations that keep the representation the same but vary context length or question chaining, nor does it benchmark against a simple graph-based planner that would show what good cumulative reasoning looks like. Without those, the data are compatible with models that have some latent spatial structure but do not activate it under these conditions.

Readers who work on LLM planning, robotics interfaces, or training objectives for spatial tasks will get the most out of this. The numbers are concrete and the setup is controlled enough to spark useful discussion. I think it deserves a serious referee because the empirical gaps are large and the topic is relevant, even if the interpretation could be tightened.
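For reference, the kind of simple graph-based planner mentioned in the letter might look like the sketch below. It is not from the paper: `MazeOracle` and the toy corridor are hypothetical, and the maze is assumed to be connected. The point is that the distance table is built once and every later question reuses it, which is the cumulative behaviour the sequential probes look for.

```python
from collections import deque
from typing import Dict, List, Tuple

Cell = Tuple[int, int]
Graph = Dict[Cell, List[Cell]]


def bfs_distances(graph: Graph, source: Cell) -> Dict[Cell, int]:
    """Shortest path length, in steps, from source to every reachable cell."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        cell = queue.popleft()
        for nxt in graph[cell]:
            if nxt not in dist:
                dist[nxt] = dist[cell] + 1
                queue.append(nxt)
    return dist


class MazeOracle:
    """Computes all-pairs path distances once, then answers queries by lookup."""

    def __init__(self, graph: Graph) -> None:
        self.table = {cell: bfs_distances(graph, cell) for cell in graph}

    def is_closer(self, origin: Cell, a: Cell, b: Cell) -> bool:
        """True if the path from origin to a is strictly shorter than the path to b."""
        return self.table[origin][a] < self.table[origin][b]


# Toy corridor (0,0) - (0,1) - (1,1) - (1,0): (1,0) is adjacent to (0,0)
# on the grid but three steps away through the maze.
corridor: Graph = {
    (0, 0): [(0, 1)],
    (0, 1): [(0, 0), (1, 1)],
    (1, 1): [(0, 1), (1, 0)],
    (1, 0): [(1, 1)],
}
oracle = MazeOracle(corridor)
print(oracle.is_closer((0, 0), (0, 1), (1, 0)))  # prints True
```

A baseline like this gives a ceiling of perfect, order-independent answers against which model accuracy on each question in a sequence can be compared.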

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates whether LLMs construct internal spatial world models by testing Gemini-2.5-Flash, GPT-5-mini, Claude-Haiku-4.5, and DeepSeek-Chat on grid-world maze navigation tasks. Using chain-of-thought prompting, it reports 80-86% accuracy for Gemini with tokenized adjacency representations on 5x5–7x7 mazes but a collapse to 16-34% with visual grid formats (a 2-5x gap). It further finds that models achieve 96-99% semantic coverage in reasoning traces yet fail to maintain consistency or improve across sequential proximity and distance questions, concluding that LLMs exhibit only representation-specific and prompting-dependent reasoning rather than robust spatial world models.

Significance. If the central negative claim holds after addressing controls, the work would usefully document format sensitivity and non-cumulative reasoning in current LLMs, with direct relevance to planning and navigation applications. The purely empirical design on controlled maze tasks is a strength, as are the direct comparisons across input formats and the reporting of semantic coverage metrics. However, the absence of ablations isolating tokenization, context maintenance, and alternative prompting regimes limits how strongly the data support the interpretation that no latent spatial structure exists.

major comments (3)

[Results (maze accuracy comparisons)] The abstract and results sections present the 2-5x accuracy drop from adjacency (80-86%) to visual-grid (16-34%) formats as evidence of representation-dependent rather than robust spatial reasoning, yet no ablation holds the underlying maze structure fixed while varying only tokenization, grid rendering, or prompt length; without this, the gap remains compatible with tokenization or parsing artifacts rather than absence of an internal model.
[Sequential questioning experiments] The sequential proximity/distance probing (abstract and §4) reports failure to accumulate spatial knowledge despite 96-99% semantic coverage, but includes no control condition in which prior answers or explicit state are appended to the prompt or in which context length is systematically varied; this leaves open whether the per-query independence reflects lack of a world model or simply the tested prompting regime.
[Abstract and Results] Numerical claims (80-86%, 16-34%, 96-99%) are reported without sample sizes, number of distinct mazes per condition, run-to-run variance, or statistical tests, which is load-bearing for the strong conclusion that LLMs 'do not develop robust spatial world models'.
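The control condition requested in the second major comment could be set up along the lines of the sketch below. The prompt wording and the `build_prompt` helper are hypothetical; the paper's actual prompts are not reproduced in this review.

```python
from typing import List, Tuple


def build_prompt(
    maze_text: str,
    question: str,
    history: List[Tuple[str, str]],
    carry_state: bool,
) -> str:
    """Assemble one query; with carry_state, earlier Q/A pairs are restated explicitly."""
    parts = ["Maze:", maze_text]
    if carry_state and history:
        parts.append("Previously established facts:")
        parts.extend(f"- Q: {q} A: {a}" for q, a in history)
    parts.append(f"Question: {question}")
    return "\n".join(parts)


# Same maze and question, with and without the earlier answer appended.
history = [("Is (0,0) closer to (2,2) than to (4,0)?", "True")]
baseline = build_prompt("<maze>", "Is (4,4) closer to (2,2) than to (0,4)?", history, carry_state=False)
control = build_prompt("<maze>", "Is (4,4) closer to (2,2) than to (0,4)?", history, carry_state=True)
```

Running both variants on the same mazes would separate two explanations: if accuracy rises only in the `control` arm, the per-query independence is tied to the prompting regime; if it stays flat in both, the result says more about the model.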

minor comments (2)

[Methods] Clarify exact model versions and API parameters used (e.g., temperature, max tokens) in the methods section to support reproducibility.
[Evaluation metrics] Add explicit definitions or examples of 'semantic coverage' metric and how it was computed from CoT traces.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Referee: The abstract and results sections present the 2-5x accuracy drop from adjacency (80-86%) to visual-grid (16-34%) formats as evidence of representation-dependent rather than robust spatial reasoning, yet no ablation holds the underlying maze structure fixed while varying only tokenization, grid rendering, or prompt length; without this, the gap remains compatible with tokenization or parsing artifacts rather than absence of an internal model.

Authors: We appreciate this observation. Our experiments generated mazes using the same procedure and then rendered them in the two formats, but we did not perform a dedicated ablation that isolates tokenization, rendering, or prompt length while holding every other variable constant. We will add such an ablation study in the revised Results section to more cleanly separate representation effects from potential parsing artifacts.

revision: yes

Referee: The sequential proximity/distance probing (abstract and §4) reports failure to accumulate spatial knowledge despite 96-99% semantic coverage, but includes no control condition in which prior answers or explicit state are appended to the prompt or in which context length is systematically varied; this leaves open whether the per-query independence reflects lack of a world model or simply the tested prompting regime.

Authors: This is a fair critique of the experimental design. The sequential probes were intended to test maintenance of spatial information without explicit state carry-over, but we agree that controls with appended prior answers and systematic context-length variation are needed to strengthen the claim. We will incorporate these control conditions into the revised §4 and show that performance remains limited even when state is explicitly provided.

revision: yes

Referee: Numerical claims (80-86%, 16-34%, 96-99%) are reported without sample sizes, number of distinct mazes per condition, run-to-run variance, or statistical tests, which is load-bearing for the strong conclusion that LLMs 'do not develop robust spatial world models'.

Authors: We agree that these details are essential for supporting the quantitative claims. We will revise the Abstract, Results, and figure captions to report the exact number of mazes and runs per condition, include measures of variance (standard deviation or error bars), and add appropriate statistical tests (e.g., paired comparisons) to substantiate the reported accuracy differences.

revision: yes
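One paired comparison that fits this setting is an exact McNemar test, sketched below under the assumption that each maze is solved once in each format and scored correct or incorrect. The choice of test and the helper names are illustrative; the rebuttal does not say which test the authors plan to use.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(first: Sequence[bool], second: Sequence[bool]) -> Tuple[int, int]:
    """Count mazes solved in only one of the two formats (same maze order in both)."""
    b = sum(1 for x, y in zip(first, second) if x and not y)
    c = sum(1 for x, y in zip(first, second) if y and not x)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact p-value under the null that discordant pairs split 50/50."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy outcomes for five mazes (made up): solved with adjacency input, failed with grid input.
adjacency_ok = [True, True, True, True, True]
grid_ok = [False, False, False, False, False]
b, c = discordant_counts(adjacency_ok, grid_ok)
p_value = mcnemar_exact(b, c)  # 2 * (1/32) = 0.0625
```

Because the test uses only the mazes where the two formats disagree, it needs the per-maze outcomes that the referee asks the authors to report, not just the aggregate percentages.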

Circularity Check

0 steps flagged

No significant circularity: purely empirical evaluation

The paper conducts controlled experiments on LLMs solving maze tasks under varied input representations (tokenized adjacency vs. visual grids) and question sequences, reporting raw accuracy differences and CoT coverage statistics. No equations, parameters, or derivations appear; the central claim follows directly from observed performance gaps without any self-definition, fitted-input renaming, or load-bearing self-citation. The evaluation is self-contained against external benchmarks (model outputs on fixed prompts) and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical evaluation of existing models on custom tasks and introduces no free parameters, mathematical axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5550 in / 1109 out tokens · 84878 ms · 2026-05-10T15:06:01.156691+00:00 · methodology
