From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning
Pith reviewed 2026-05-15 00:05 UTC · model grok-4.3
The pith
Multimodal models solve maze images by converting them to text grids and enumerating paths token by token rather than through visual planning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models achieve high accuracy on maze images by first converting them into textual grid representations and subsequently performing step-by-step path enumeration in token space, equivalent to breadth-first search conducted in prose, rather than engaging in genuine visual or spatial planning.
What carries the argument
The two-stage image-to-grid translation followed by token-level path enumeration that functions as breadth-first search in generated text.
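For reference, the second stage can be compared against textbook breadth-first search on a text grid. The sketch below is not the paper's code and not a description of any model's internals; it only shows the algorithm the traces are said to resemble. It assumes the ASCII convention mentioned in the authors' rebuttal ('#' wall, '.' open cell, 'S' start, 'G' goal). The function name, the U/D/L/R move encoding, and the 3 by 3 example maze are hypothetical.

```python
from collections import deque

# Hypothetical move encoding: letter -> (row delta, column delta).
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def bfs_shortest_path(grid_text):
    """Return a shortest list of moves from 'S' to 'G', or None if unreachable."""
    rows = grid_text.strip().splitlines()
    cells = {(r, c): ch for r, row in enumerate(rows) for c, ch in enumerate(row)}
    start = next(pos for pos, ch in cells.items() if ch == "S")
    goal = next(pos for pos, ch in cells.items() if ch == "G")

    queue = deque([start])
    came_from = {start: None}  # cell -> (previous cell, move taken to reach it)
    while queue:
        cur = queue.popleft()
        if cur == goal:
            # Walk the predecessor links back to the start, then reverse.
            path = []
            while came_from[cur] is not None:
                cur, move = came_from[cur]
                path.append(move)
            return path[::-1]
        for move, (dr, dc) in MOVES.items():
            nxt = (cur[0] + dr, cur[1] + dc)
            # Cells outside the grid are treated as walls.
            if cells.get(nxt, "#") != "#" and nxt not in came_from:
                came_from[nxt] = (cur, move)
                queue.append(nxt)
    return None


MAZE = """
S.#
.##
..G
"""
print(bfs_shortest_path(MAZE))  # prints ['D', 'D', 'R', 'R']
```

Every cell is expanded at most once, which is cheap in code. Written out as prose, each expansion costs tokens, which is consistent with the large per-solve token counts the paper reports.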
If this is right
- Without added reasoning budgets, all tested configurations score only 2-12 percent.
- On 20 by 20 ultra-hard mazes, models hit token limits and fail completely.
- Explicit instructions not to construct grids or perform graph search still produce the same enumeration strategy.
- Providing the correct text grid raises Claude Sonnet 4.6 from 6 percent on images to 80 percent.
Where Pith is reading between the lines
- The same text-proxy pattern may appear in other visual reasoning tasks where models lack direct spatial mechanisms.
- Future benchmarks for planning should include explicit controls that detect language-based enumeration workarounds.
- Architectures with stronger native visual-spatial modules could reduce reliance on text intermediaries.
Load-bearing premise
That the two-stage image-to-grid translation plus token enumeration strategy is the dominant mechanism driving performance rather than an artifact of the tested prompting or model configurations.
What would settle it
A model that solves the mazes using few tokens and without producing any grid-like structures or step-by-step enumerations in its reasoning traces would falsify the claim.
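One way to make this test operational is to scan reasoning traces for grid-like blocks and coordinate enumeration. The sketch below is a rough heuristic under stated assumptions, not a method from the paper: the grid alphabet ('#', '.', 'S', 'G'), the three-row threshold, and the function name `trace_signals` are all hypothetical, and token usage would still have to come from whatever accounting the evaluation harness records.

```python
import re

# Assumed grid alphabet: a line made only of these characters looks like a grid row.
GRID_ROW = re.compile(r"^[#.SG]{3,}$")
# Coordinate pairs such as (3, 4), a common sign of step-by-step enumeration.
COORD = re.compile(r"\(\s*\d+\s*,\s*\d+\s*\)")


def trace_signals(trace, min_rows=3):
    """Count grid-like rows and coordinate mentions in a reasoning trace."""
    longest = run = 0
    prev_width = None
    for line in (ln.strip() for ln in trace.splitlines()):
        if GRID_ROW.match(line):
            # Extend the run only when consecutive rows have equal width.
            run = run + 1 if prev_width in (None, len(line)) else 1
            prev_width = len(line)
        else:
            run = 0
            prev_width = None
        longest = max(longest, run)
    return {
        "grid_rows": longest,
        "coordinate_mentions": len(COORD.findall(trace)),
        "has_grid": longest >= min_rows,
    }


example = "I will map the maze:\nS.#\n.##\n..G\nStart at (0,0), move to (1,0), then (2,0)."
print(trace_signals(example))
```

A trace that scores zero on both signals and also uses few tokens would be a candidate counterexample. A real check would need to handle other grid encodings (digits, emoji, tables) that this pattern misses.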
Original abstract
How do multimodal models solve visual spatial tasks: through genuine planning, or through brute-force search in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710–22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2–12%; on 20×20 ultra-hard mazes, they hit token limits and fail. Qualitative traces reveal a common two-stage strategy: image-to-grid translation followed by token-level search, effectively BFS in prose. A text-grid ablation shows Claude Sonnet 4.6 rising from 6% on images to 80% when given the correct grid, isolating weak visual extraction from downstream search. When explicitly instructed not to construct a grid or perform graph search, models still revert to the same enumeration strategy. MazeBench therefore shows that high accuracy on visual planning tasks does not imply human-like spatial understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluates 16 multimodal model configurations from OpenAI, Anthropic, Google, and Alibaba. It claims that high reported accuracies (e.g., GPT-5.4 at 91%, Gemini 3.1 Pro at 79%) do not reflect human-like visual planning; instead, models follow a two-stage strategy of translating the image into a text grid followed by step-by-step token enumeration (BFS in prose), as shown by high token counts (1,710–22,818), sharp drops to 2–12% without added reasoning budgets, failure on 20×20 mazes due to token limits, qualitative traces, and a text-grid ablation where Claude Sonnet 4.6 improves from 6% to 80%. Models revert to enumeration even when explicitly instructed against grid construction or graph search.
Significance. If the central empirical findings hold, the work is significant for demonstrating that strong performance on visual spatial tasks in current multimodal models often stems from token-space search rather than genuine spatial understanding. The controlled procedural groups, token-usage measurements, and targeted ablations (image vs. text-grid) provide concrete evidence distinguishing visual extraction failures from downstream search capabilities. This has implications for interpreting benchmark results in vision-language planning and motivates more robust evaluation protocols that penalize enumeration strategies.
major comments (3)
- [Methods / Benchmark Construction] The procedural generation rules for the nine groups, exact maze sizes, and full definition of accuracy metrics (including any statistical tests or error bars) are insufficiently specified. This is load-bearing because the claim that high accuracy does not imply planning depends on confirming that the mazes genuinely require spatial reasoning rather than exploitable patterns.
- [Results / Ablations] The text-grid ablation (Claude Sonnet 4.6: 6% on images to 80% on grids) isolates visual extraction but lacks detail on how the 'correct grid' is generated, verified for fidelity to the original image, and whether it matches the exact representation models would produce. This weakens the isolation of the two-stage mechanism.
- [Results / Prompting Experiments] The finding that models revert to enumeration despite explicit instructions against grid construction or graph search is tied to the 16 tested configurations and prompting regime. To rule out elicitation artifacts, additional systematic prompt variants (e.g., stronger penalties on grid construction or few-shot examples of direct visual path tracing) are needed, as the current evidence does not fully establish generality.
minor comments (2)
- [Abstract] The abstract's token range (1,710--22,818) would be clearer if reported with per-model averages or breakdowns rather than a single aggregate range.
- [Figures / Qualitative Analysis] Figure captions and qualitative trace examples should explicitly label the model, prompt variant, and maze group to improve traceability of the two-stage strategy.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point-by-point below.
- Referee: [Methods / Benchmark Construction] The procedural generation rules for the nine groups, exact maze sizes, and full definition of accuracy metrics (including any statistical tests or error bars) are insufficiently specified. This is load-bearing because the claim that high accuracy does not imply planning depends on confirming that the mazes genuinely require spatial reasoning rather than exploitable patterns.
  Authors: We fully agree that detailed specification of the benchmark is essential. In the revised manuscript, we have added a comprehensive description of the procedural generation process for all nine groups, including the specific rules for maze creation (e.g., random wall placement with connectivity constraints, varying densities from 20% to 50%), exact dimensions (primarily 10×10 with subsets at 5×5 and 20×20), and the accuracy metric defined as the percentage of correctly solved mazes where the output path is valid and reaches the goal without errors. We now include error bars representing standard deviation across three independent evaluation runs per model. No additional statistical tests were applied beyond descriptive statistics, as the focus is on qualitative differences in strategy rather than hypothesis testing. These details confirm the mazes demand spatial reasoning.
  Revision: yes
- Referee: [Results / Ablations] The text-grid ablation (Claude Sonnet 4.6: 6% on images to 80% on grids) isolates visual extraction but lacks detail on how the 'correct grid' is generated, verified for fidelity to the original image, and whether it matches the exact representation models would produce. This weakens the isolation of the two-stage mechanism.
  Authors: We appreciate this observation. The correct grids were produced by rendering the procedural maze parameters into the identical text format used in the model prompts (ASCII-based grid with '#' for walls, '.' for paths, 'S' for start, 'G' for goal). Fidelity was verified by regenerating the image from the grid and comparing it to the original for all 110 mazes, ensuring pixel-perfect alignment in structure. We have now included this generation and verification procedure in the revised Methods and Results sections to better isolate the visual extraction stage from the search stage.
  Revision: yes
- Referee: [Results / Prompting Experiments] The finding that models revert to enumeration despite explicit instructions against grid construction or graph search is tied to the 16 tested configurations and prompting regime. To rule out elicitation artifacts, additional systematic prompt variants (e.g., stronger penalties on grid construction or few-shot examples of direct visual path tracing) are needed, as the current evidence does not fully establish generality.
  Authors: We acknowledge the value of testing additional prompt variants to strengthen claims of generality. Our original prompts included direct prohibitions such as 'Solve the maze by visually tracing the path without creating any text grid or performing step-by-step search.' Despite this, models reverted. In the revision, we have incorporated results from two additional prompt variants: one with stronger penalties (e.g., 'Any grid construction will result in incorrect scoring') and one with few-shot examples demonstrating direct visual path tracing. These experiments, detailed in the new Appendix C, show consistent reversion to enumeration, supporting the robustness of our findings. While not exhaustive, this addresses the concern for elicitation artifacts.
  Revision: yes
Circularity Check
No circularity: purely empirical benchmark with independent ablations and observations
The paper introduces MazeBench, evaluates 16 model configurations on 110 images, performs text-grid ablations, measures token usage, and inspects qualitative traces. No equations, fitted parameters, derivations, or self-citations are invoked to support the central claim. The observation that models revert to enumeration even under explicit instructions is a direct experimental result, not a reduction to prior inputs. All load-bearing evidence (accuracy drops, token counts, ablation gains) is generated within the reported experiments and is externally falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Observed token enumeration in model outputs constitutes the primary solving strategy rather than incidental behavior.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "models typically translate images into text grids and then enumerate paths step by step, consuming 1,710–22,818 tokens per solve"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "A text-grid ablation shows Claude Sonnet 4.6 rising from 6% on images to 80% when given the correct grid"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.