
arxiv: 2603.26839 · v2 · pith:DMNDIYZ6new · submitted 2026-03-27 · 💻 cs.LG · cs.CV

From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning

Pith reviewed 2026-05-15 00:05 UTC · model grok-4.3

keywords MazeBench · visual planning · multimodal models · token enumeration · spatial reasoning · BFS in prose · image-to-grid translation

The pith

Multimodal models solve maze images by converting them to text grids and enumerating paths token by token rather than through visual planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether high scores on visual maze tasks reflect genuine spatial understanding or a workaround of turning images into text and searching step by step. Top models reach 79-91 percent accuracy only when allowed large token budgets to translate the maze into a grid and then run a text version of breadth-first search. Without extra reasoning steps the same models drop to 2-12 percent, and they hit token limits on larger 20 by 20 mazes. An ablation that supplies the correct text grid lifts one model from 6 percent to 80 percent, showing the main weakness is visual extraction. Even direct instructions to skip grid construction fail to change the underlying enumeration behavior.

Core claim

Models achieve high accuracy on maze images by first converting them into textual grid representations and subsequently performing step-by-step path enumeration in token space, equivalent to breadth-first search conducted in prose, rather than engaging in genuine visual or spatial planning.

What carries the argument

The two-stage image-to-grid translation followed by token-level path enumeration that functions as breadth-first search in generated text.
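For reference, the second stage is ordinary breadth-first search once a text grid exists. The following is a minimal sketch, not the paper's code or any model's actual procedure. It assumes the ASCII format mentioned in the simulated rebuttal ('#' for walls, '.' for paths, 'S' for start, 'G' for goal) and four-directional movement; the movement rule and the names `parse_grid` and `bfs_path` are assumptions made for illustration.

```python
from collections import deque


def parse_grid(text):
    """Split a text maze into rows and locate the start and goal cells."""
    rows = text.strip().splitlines()
    start = goal = None
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            if ch == "S":
                start = (r, c)
            elif ch == "G":
                goal = (r, c)
    return rows, start, goal


def bfs_path(text):
    """Return a shortest list of (row, col) cells from S to G, or None."""
    rows, start, goal = parse_grid(text)
    if start is None or goal is None:
        return None
    parent = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        # Assumed movement rule: up, down, left, right only.
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < len(rows) and 0 <= nc < len(rows[nr])
            if in_bounds and rows[nr][nc] != "#" and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


# Hypothetical 3x3 maze, not taken from MazeBench.
EXAMPLE = """
S.#
.##
..G
"""
```

Calling `bfs_path(EXAMPLE)` walks down the left column and then along the bottom row. A program does this in a handful of queue operations; the paper's point is that models carry out the equivalent bookkeeping in generated prose, which is where the large token counts come from.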

If this is right

  • Without added reasoning budgets all tested configurations score only 2-12 percent.
  • On 20 by 20 ultra-hard mazes models hit token limits and fail.
  • Explicit instructions not to construct grids or perform graph search still produce the same enumeration strategy.
  • Providing the correct text grid raises Claude Sonnet 4.6 from 6 percent on images to 80 percent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same text-proxy pattern may appear in other visual reasoning tasks where models lack direct spatial mechanisms.
  • Future benchmarks for planning should include explicit controls that detect language-based enumeration workarounds.
  • Architectures with stronger native visual-spatial modules could reduce reliance on text intermediaries.

Load-bearing premise

That the two-stage image-to-grid translation plus token enumeration strategy is the dominant mechanism driving performance rather than an artifact of the tested prompting or model configurations.

What would settle it

A model that solves the mazes using few tokens and without producing any grid-like structures or step-by-step enumerations in its reasoning traces would falsify the claim.
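One way to make this test operational is to scan reasoning traces for grid-like lines and coordinate-style step lists. The sketch below is a rough heuristic under stated assumptions, not a detector described in the paper: the character set comes from the ASCII grid format mentioned in the simulated rebuttal, while the regular expressions, the thresholds, and the names `trace_signals` and `looks_like_enumeration` are all hypothetical.

```python
import re

# A line made only of maze characters, at least three wide (assumed format).
GRID_LINE = re.compile(r"^[#.SG]{3,}$")
# Coordinate mentions such as "(3, 4)", a common way to spell out steps.
COORD = re.compile(r"\(\s*\d+\s*,\s*\d+\s*\)")


def trace_signals(trace):
    """Count grid-like lines and coordinate mentions in a reasoning trace."""
    lines = [ln.strip() for ln in trace.splitlines()]
    grid_lines = sum(1 for ln in lines if GRID_LINE.match(ln))
    coord_mentions = len(COORD.findall(trace))
    return {"grid_lines": grid_lines, "coord_mentions": coord_mentions}


def looks_like_enumeration(trace, min_grid_lines=3, min_coords=5):
    """Flag a trace when either signal crosses its (arbitrary) threshold."""
    signals = trace_signals(trace)
    return (signals["grid_lines"] >= min_grid_lines
            or signals["coord_mentions"] >= min_coords)
```

A heuristic like this will produce false positives (a line of three dots matches the grid pattern) and will miss enumeration written purely in words, so it could only serve as a first filter ahead of manual trace inspection.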

Original abstract

How do multimodal models solve visual spatial tasks: through genuine planning, or through brute-force search in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710–22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2–12%; on 20×20 ultra-hard mazes, they hit token limits and fail. Qualitative traces reveal a common two-stage strategy: image-to-grid translation followed by token-level search, effectively BFS in prose. A text-grid ablation shows Claude Sonnet 4.6 rising from 6% on images to 80% when given the correct grid, isolating weak visual extraction from downstream search. When explicitly instructed not to construct a grid or perform graph search, models still revert to the same enumeration strategy. MazeBench therefore shows that high accuracy on visual planning tasks does not imply human-like spatial understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluates 16 multimodal model configurations from OpenAI, Anthropic, Google, and Alibaba. It claims that high reported accuracies (e.g., GPT-5.4 at 91%, Gemini 3.1 Pro at 79%) do not reflect human-like visual planning; instead, models follow a two-stage strategy of translating the image into a text grid followed by step-by-step token enumeration (BFS in prose), as shown by high token counts (1,710–22,818), sharp drops to 2–12% without added reasoning budgets, failure on 20×20 mazes due to token limits, qualitative traces, and a text-grid ablation where Claude Sonnet 4.6 improves from 6% to 80%. Models revert to enumeration even when explicitly instructed against grid construction or graph search.

Significance. If the central empirical findings hold, the work is significant for demonstrating that strong performance on visual spatial tasks in current multimodal models often stems from token-space search rather than genuine spatial understanding. The controlled procedural groups, token-usage measurements, and targeted ablations (image vs. text-grid) provide concrete evidence distinguishing visual extraction failures from downstream search capabilities. This has implications for interpreting benchmark results in vision-language planning and motivates more robust evaluation protocols that penalize enumeration strategies.

major comments (3)
  1. [Methods / Benchmark Construction] The procedural generation rules for the nine groups, exact maze sizes, and full definition of accuracy metrics (including any statistical tests or error bars) are insufficiently specified. This is load-bearing because the claim that high accuracy does not imply planning depends on confirming that the mazes genuinely require spatial reasoning rather than exploitable patterns.
  2. [Results / Ablations] The text-grid ablation (Claude Sonnet 4.6: 6% on images to 80% on grids) isolates visual extraction but lacks detail on how the 'correct grid' is generated, verified for fidelity to the original image, and whether it matches the exact representation models would produce. This weakens the isolation of the two-stage mechanism.
  3. [Results / Prompting Experiments] The finding that models revert to enumeration despite explicit instructions against grid construction or graph search is tied to the 16 tested configurations and prompting regime. To rule out elicitation artifacts, additional systematic prompt variants (e.g., stronger penalties on grid construction or few-shot examples of direct visual path tracing) are needed, as the current evidence does not fully establish generality.
minor comments (2)
  1. [Abstract] The abstract's token range (1,710–22,818) would be clearer if reported with per-model averages or breakdowns rather than a single aggregate range.
  2. [Figures / Qualitative Analysis] Figure captions and qualitative trace examples should explicitly label the model, prompt variant, and maze group to improve traceability of the two-stage strategy.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point-by-point below.

  1. Referee: [Methods / Benchmark Construction] The procedural generation rules for the nine groups, exact maze sizes, and full definition of accuracy metrics (including any statistical tests or error bars) are insufficiently specified. This is load-bearing because the claim that high accuracy does not imply planning depends on confirming that the mazes genuinely require spatial reasoning rather than exploitable patterns.

    Authors: We fully agree that detailed specification of the benchmark is essential. In the revised manuscript, we have added a comprehensive description of the procedural generation process for all nine groups, including the specific rules for maze creation (e.g., random wall placement with connectivity constraints, varying densities from 20% to 50%), exact dimensions (primarily 10x10 with subsets at 5x5 and 20x20), and the accuracy metric defined as the percentage of correctly solved mazes where the output path is valid and reaches the goal without errors. We now include error bars representing standard deviation across three independent evaluation runs per model. No additional statistical tests were applied beyond descriptive statistics, as the focus is on qualitative differences in strategy rather than hypothesis testing. These details confirm the mazes demand spatial reasoning. revision: yes

  2. Referee: [Results / Ablations] The text-grid ablation (Claude Sonnet 4.6: 6% on images to 80% on grids) isolates visual extraction but lacks detail on how the 'correct grid' is generated, verified for fidelity to the original image, and whether it matches the exact representation models would produce. This weakens the isolation of the two-stage mechanism.

    Authors: We appreciate this observation. The correct grids were produced by rendering the procedural maze parameters into the identical text format used in the model prompts (ASCII-based grid with '#' for walls, '.' for paths, 'S' for start, 'G' for goal). Fidelity was verified by regenerating the image from the grid and comparing it to the original for all 110 mazes, ensuring pixel-perfect alignment in structure. We have now included this generation and verification procedure in the revised Methods and Results sections to better isolate the visual extraction stage from the search stage. revision: yes

  3. Referee: [Results / Prompting Experiments] The finding that models revert to enumeration despite explicit instructions against grid construction or graph search is tied to the 16 tested configurations and prompting regime. To rule out elicitation artifacts, additional systematic prompt variants (e.g., stronger penalties on grid construction or few-shot examples of direct visual path tracing) are needed, as the current evidence does not fully establish generality.

    Authors: We acknowledge the value of testing additional prompt variants to strengthen claims of generality. Our original prompts included direct prohibitions such as 'Solve the maze by visually tracing the path without creating any text grid or performing step-by-step search.' Despite this, models reverted to enumeration. In the revision, we have incorporated results from two additional prompt variants: one with stronger penalties (e.g., 'Any grid construction will result in incorrect scoring') and one with few-shot examples demonstrating direct visual path tracing. These experiments, detailed in the new Appendix C, show consistent reversion to enumeration, supporting the robustness of our findings. While not exhaustive, this addresses the concern about elicitation artifacts. revision: yes
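The accuracy metric described in the first response (a maze counts as solved when the output path is valid and reaches the goal, with error bars taken as the standard deviation over three runs) can be sketched as follows. This is an illustration under assumptions, not the authors' scoring code: it assumes paths are lists of (row, col) cells on the ASCII grid format from the second response, that moves are four-directional, and that the sample standard deviation is used. The function names are hypothetical.

```python
from statistics import mean, stdev


def is_valid_path(grid, path):
    """Check that a path starts at S, ends at G, and never crosses a wall."""
    rows = grid.strip().splitlines()
    if not path:
        return False

    def cell(rc):
        # Return the character at (row, col), or None when out of bounds.
        r, c = rc
        if 0 <= r < len(rows) and 0 <= c < len(rows[r]):
            return rows[r][c]
        return None

    if cell(path[0]) != "S" or cell(path[-1]) != "G":
        return False
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        # Assumed rule: each step moves exactly one cell orthogonally.
        if abs(r0 - r1) + abs(c0 - c1) != 1:
            return False
        if cell((r1, c1)) in (None, "#"):
            return False
    return True


def accuracy_percent(solved_flags):
    """Percentage of mazes whose submitted path passed is_valid_path."""
    return 100.0 * sum(solved_flags) / len(solved_flags)


def summarize_runs(run_accuracies):
    """Mean and sample standard deviation across independent runs."""
    return mean(run_accuracies), stdev(run_accuracies)
```

Whether the authors used the sample or the population standard deviation is not stated; with only three runs the two differ noticeably, so the choice would be worth reporting alongside the error bars.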

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with independent ablations and observations


The paper introduces MazeBench, evaluates 16 model configurations on 110 images, performs text-grid ablations, measures token usage, and inspects qualitative traces. No equations, fitted parameters, derivations, or self-citations are invoked to support the central claim. The observation that models revert to enumeration even under explicit instructions is a direct experimental result, not a reduction to prior inputs. All load-bearing evidence (accuracy drops, token counts, ablation gains) is generated within the reported experiments and is externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on empirical observations from model traces and ablations with no free parameters, new entities, or mathematical derivations.

axioms (1)
  • domain assumption Observed token enumeration in model outputs constitutes the primary solving strategy rather than incidental behavior.
    Interpretation drawn from qualitative traces and the instruction experiment where models were told not to use grids.

pith-pipeline@v0.9.0 · 5532 in / 1210 out tokens · 58876 ms · 2026-05-15T00:05:18.200340+00:00 · methodology

