arxiv: 2604.09338 · v1 · submitted 2026-04-10 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links

· Lean Theorem

Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym

Lars Benedikt Kaesberg , Tianyu Yang , Niklas Bauer , Terry Ruas , Jan Philip Wahle , Bela Gipp

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:22 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords spatial reasoningpathfindingAI agentschain-of-thoughtvision-language modelssequential decision makingreinforcement learninggymnasium

0 comments

The pith

Top AI models solve spatial pathfinding tasks at 16 percent success, against a 98 percent human baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates an interactive benchmark called Spatial-Gym that turns 2D grid pathfinding into a sequential decision task where agents must navigate step by step and may backtrack. It evaluates eight models across one-shot, step-by-step, and backtracking-enabled settings on 500 episodes. The strongest model reaches only 16 percent solve rate while humans reach 98 percent. Models do not increase their reasoning effort as puzzles grow harder, vision inputs cut performance by 73 percent, and longer chain-of-thought still delivers a 3-5 times accuracy gain even when answers must be produced incrementally. The work shows that current agents cannot reliably turn spatial understanding into correct actions under realistic interaction constraints.

Core claim

Spatial-Gym isolates spatial constraint reasoning by framing pathfinding in 2D grids as a sequential decision process with optional backtracking. The best model, GPT-OSS 120B, achieves a 16.0 percent solve rate compared with the human baseline of 98.0 percent. Models fail to scale reasoning effort with difficulty, vision models receiving images reduce solve rate by 73 percent, and extended chain-of-thought reasoning keeps a 3-5x accuracy advantage over standard inference even in the step-by-step setting.

What carries the argument

Spatial-Gym, a Gymnasium environment that converts 2D-grid pathfinding into a sequential decision task with optional backtracking to isolate spatial constraint reasoning.

If this is right

Step-by-step interaction removes formatting errors for weaker models but limits global planning in stronger models.
Backtracking raises episode completion rates yet improves final solve rate only for weaker models.
Spatial-Gym supplies a training environment for reinforcement learning aimed at closing the gap between spatial reasoning and correct actions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real robotics tasks involve continuous space and sensor noise, so the observed 82-point gap may widen further outside the discrete grid.
One-shot benchmarks likely overestimate deployed performance because they allow full planning before any action occurs.
Specialized training on sequential spatial decisions could be tested by fine-tuning models directly inside Spatial-Gym episodes.

Load-bearing premise

Performance gaps observed on these simplified 2D grid puzzles reflect the spatial reasoning demands agents would face in real navigation and robotics.

What would settle it

A model that reaches above 80 percent solve rate on Spatial-Gym but still fails to navigate real-world robot environments with comparable accuracy would show the benchmark does not capture the full difficulty.

Figures

Figures reproduced from arXiv: 2604.09338 by Bela Gipp, Jan Philip Wahle, Lars Benedikt Kaesberg, Niklas Bauer, Terry Ruas, Tianyu Yang.

**Figure 2.** Figure 2: Accuracy (%) of all models on Spatial-Gym with no backtracking. We run all eight models on Spatial-Gym without backtracking on the full 500-puzzle test set. Each model receives the system prompt with rules, the current board state, and legal actions at each step (see Appendix B for the full prompts and Appendix D for a complete example with GPT-OSS 120B). We plot the human baseline from Kaesberg et al. (20… view at source ↗

**Figure 3.** Figure 3: ∆ Accuracy (%) for each model relative to two settings: (a) Gym accuracy minus baseline accuracy; (b) Gym w/ backtracking accuracy minus w/o backtracking accuracy. Bars above zero show improvement under new conditions; bars below zero indicate collapse. Two findings stand out. First, three reasoning-trained 32B models (OLMo 3.1, Qwen 3, Nemotron) cluster between 10.6% and 11.4%, while R1 Distill 32B reache… view at source ↗

**Figure 4.** Figure 4: Completion rates (%) in Gym w/o backtracking (left) and Gym w/ backtracking [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Backtracking ratio (steps/path edges) per model in Spatial-Gym. Lines show median, boxes show IQR, and whiskers extend to 1.5× IQR. Backtracking behavior. We measure the backtracking frequency as the ratio of the total number of steps to the final path length. On average, across models, every third action is a backtrack (shown in [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Path length (in edges) per difficulty score (0–5) for: (a) ground-truth solutions, (b) [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Rule-specific accuracy (%) across all rule types averaged over all models under each of the three evaluation settings. Rule types differ in difficulty, with Gaps being the easiest and Ylops the hardest. We compute solve rates per rule type across all three settings [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Accuracy (%) and completion rate (%) for random walk, A*, Qwen 3 0.6B, and [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Vision vs. text input for the Qwen3 family. Accuracy (%) for Qwen3- 32B, Qwen3-VL-32B with text input, and Qwen3-VL-32B with puzzle images. Previous work found that multimodal prompting did not improve performance compared to text inputs for one-shot visual puzzles Kaesberg et al. (2025b). We test whether vision-language models can benefit from visual input in the Gym setting. We compare three conditions w… view at source ↗

**Figure 10.** Figure 10: Accuracy (%) and estimated compute (FLOPs, log scale) for four Qwen 3 model sizes (0.6B, 4B, 14B, 32B) for the baseline and Spatial-Gym. We evaluate four Qwen 3 model sizes (0.6B, 4B, 14B, 32B) to test whether spatial reasoning scales with model size and whether the two formats produce different scaling trajectories [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗

**Figure 11.** Figure 11: Rule-specific accuracy (%) across all seven rule types for GPT-OSS 120B under each of the three evaluation settings: SPaRC, Spatial-Gym, and Gym with backtracking [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗

**Figure 12.** Figure 12: Total steps taken versus final path length (in edges) for all puzzle attempts in the Gym with backtracking setting. The dashed line denotes the no-backtracking baseline (slope = 1); the orange line shows the linear fit (slope = 1.66) [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗

**Figure 13.** Figure 13: Accuracy (%) broken down by difficulty level (1–5) for Qwen3-32B (text), Qwen3-VL-32B (text), and Qwen3-VL-32B (vision) [PITH_FULL_IMAGE:figures/full_fig_p025_13.png] view at source ↗

**Figure 14.** Figure 14: Reasoning versus non-reasoning mode ablation for Qwen 3 14B and Qwen 3 [PITH_FULL_IMAGE:figures/full_fig_p026_14.png] view at source ↗

**Figure 15.** Figure 15: Inter-model agreement matrices for the Spatial-Gym setting. (left) Jaccard [PITH_FULL_IMAGE:figures/full_fig_p027_15.png] view at source ↗

**Figure 16.** Figure 16: Total number of puzzles solved (blue) versus puzzles solved exclusively by that model and no other (orange) in the Spatial-Gym setting, for each of the eight evaluated models [PITH_FULL_IMAGE:figures/full_fig_p027_16.png] view at source ↗

**Figure 17.** Figure 17: Accuracy (%) by difficulty level (1–5) for all eight models across the three [PITH_FULL_IMAGE:figures/full_fig_p028_17.png] view at source ↗

**Figure 18.** Figure 18: Accuracy (%) versus average token count per puzzle (K, log scale) for all eight [PITH_FULL_IMAGE:figures/full_fig_p029_18.png] view at source ↗

**Figure 19.** Figure 19: Average token count per puzzle (in thousands) across difficulty levels 1–5 for [PITH_FULL_IMAGE:figures/full_fig_p029_19.png] view at source ↗

read the original abstract

Spatial reasoning is central to navigation and robotics, yet measuring model capabilities on these tasks remains difficult. Existing benchmarks evaluate models in a one-shot setting, requiring full solution generation in a single response, unlike humans, who work in interactive environments step-by-step. We introduce Spatial-Gym, a Gymnasium environment that isolates spatial constraint reasoning by testing pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking. We evaluate eight models in three settings (one-shot, step-by-step, step-by-step with backtracking) against human, random, and A* baselines on 500 episodes. The best model, GPT-OSS 120B, achieves a solve rate of 16.0%, 82 points below the human baseline (98.0%). Step-by-step format helps weaker models (up to +5.4%) by removing formatting errors, but hurts stronger models (up to 5.6%) by constraining global planning. Backtracking improves episode completion, but increases solve rate only for weaker models; stronger models rarely backtrack and do not benefit from it. Our experiments have three key findings: (1) models fail to scale reasoning effort with difficulty, (2) vision models receiving images of the spatial environment reduce solve rate by 73%, and (3) extended chain-of-thought reasoning retains a 3-5x accuracy advantage over standard inference even in the step-by-step setting. Spatial-Gym enables diagnosis of model limitations and provides a framework for improving spatial reasoning through reinforcement learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Spatial-Gym gives a concrete new benchmark for step-by-step grid pathfinding that exposes big model-human gaps and format effects, but the 2D setup may not capture the spatial constraints that matter for actual robotics.

read the letter

The main takeaway is that this paper ships a new Gymnasium environment for sequential spatial decisions and runs clean comparisons across one-shot, step-by-step, and backtracking regimes. The strongest model reaches only 16% solve rate against a 98% human baseline, with step-by-step helping weaker models but sometimes hurting stronger ones, and vision inputs tanking performance by 73% in the reported tests. Backtracking boosts completion but rarely improves final solves for the better models. Those numbers and the three listed findings on scaling, vision, and chain-of-thought are the usable parts.

Referee Report

4 major / 2 minor

Summary. The paper introduces Spatial-Gym, a Gymnasium environment for 2D-grid pathfinding puzzles evaluated as sequential decision tasks with optional backtracking. It reports that eight LLMs achieve at most 16% solve rate (GPT-OSS 120B) versus 98% for humans across 500 episodes in one-shot, step-by-step, and backtracking settings, with three findings: models do not scale reasoning effort with difficulty, vision models suffer a 73% solve-rate drop, and extended CoT retains a 3-5x advantage even step-by-step. The work positions the benchmark as a diagnostic tool for spatial reasoning gaps relevant to navigation and robotics.

Significance. If the 2D puzzles validly isolate spatial constraint reasoning without confounds from discretization or interface, the large performance gap and format-specific effects would highlight actionable limitations in current LLMs for embodied tasks and motivate RL-based training on interactive environments. The explicit baselines (human, random, A*) and multi-setting comparison are strengths that enable direct diagnosis.

major comments (4)

[Abstract, §3] Abstract and §3 (Spatial-Gym description): the claim that the environment 'isolates spatial constraint reasoning' for navigation/robotics is load-bearing for all three key findings, yet no details are given on puzzle generation parameters (grid size distribution, obstacle density, start/goal placement, or backtracking necessity). Without these, it is impossible to verify whether the 82-point gap and the scaling/vision/CoT effects reflect integrated spatial understanding or exhaustive search/local heuristics.
[Experiments] Experiments section (results on 500 episodes): solve rates (e.g., 16.0% for GPT-OSS 120B, +5.4%/-5.6% format effects, 73% vision drop) are reported as point estimates with no error bars, number of independent runs, or statistical tests. This directly undermines confidence in the three key findings and the comparison to A* and human baselines.
[§4.2] §4.2 (vision models): the 73% solve-rate reduction is presented as evidence of spatial reasoning deficits, but the manuscript provides no information on image resolution, observation encoding, or prompting template used for vision inputs versus text. This leaves open whether the drop is due to spatial perception or unrelated interface differences.
[Abstract, §4] Step-by-step setting definition (abstract and §4): the finding that step-by-step helps weaker models but hurts stronger ones by 'constraining global planning' is central, yet the exact per-step observation/action interface and termination conditions are not specified. This makes it difficult to assess whether the reported format effects generalize beyond the specific protocol.

minor comments (2)

[Abstract] The model name 'GPT-OSS 120B' appears without citation or link to its exact variant/release; add a reference or appendix entry.
[Results tables/figures] Table or figure captions for the 500-episode results should explicitly state the number of puzzles per difficulty level and whether episodes are independent.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for their constructive comments, which identify key areas requiring greater detail and statistical rigor to support the claims about Spatial-Gym. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3 (Spatial-Gym description): the claim that the environment 'isolates spatial constraint reasoning' for navigation/robotics is load-bearing for all three key findings, yet no details are given on puzzle generation parameters (grid size distribution, obstacle density, start/goal placement, or backtracking necessity). Without these, it is impossible to verify whether the 82-point gap and the scaling/vision/CoT effects reflect integrated spatial understanding or exhaustive search/local heuristics.

Authors: We agree that explicit generation parameters are necessary to substantiate the isolation claim and enable verification of the reported gaps. In the revised manuscript, §3 will be expanded with a new subsection detailing the procedural generation: grid sizes uniformly sampled from 5×5 to 15×15, obstacle densities from 0–35% with guaranteed solvability via A* pre-check, start/goal positions placed at least 3 steps apart, and backtracking necessity triggered only on dead-end paths exceeding 20% of optimal length. These parameters ensure the benchmark focuses on spatial constraint satisfaction rather than pure search. revision: yes
Referee: [Experiments] Experiments section (results on 500 episodes): solve rates (e.g., 16.0% for GPT-OSS 120B, +5.4%/-5.6% format effects, 73% vision drop) are reported as point estimates with no error bars, number of independent runs, or statistical tests. This directly undermines confidence in the three key findings and the comparison to A* and human baselines.

Authors: The referee is correct that point estimates alone limit confidence. The 500 episodes were generated once with a fixed seed for cross-model comparability, but model outputs contain stochasticity. In revision we will rerun the full evaluation over 5 independent episode sets (different seeds), report mean solve rates with standard error bars, and add two-tailed t-tests for the format and vision comparisons against the A* and human baselines. revision: yes
Referee: [§4.2] §4.2 (vision models): the 73% solve-rate reduction is presented as evidence of spatial reasoning deficits, but the manuscript provides no information on image resolution, observation encoding, or prompting template used for vision inputs versus text. This leaves open whether the drop is due to spatial perception or unrelated interface differences.

Authors: We acknowledge the omission of implementation specifics for the vision condition. The revised §4.2 will specify: 224×224 RGB images rendered via matplotlib grid visualization, observation encoding as base64-encoded PNG passed to the vision encoder, and the exact prompt template (identical text instructions plus “The image shows the current grid state” appended). This protocol isolates the perceptual component, confirming the drop stems from spatial understanding rather than interface mismatch. revision: yes
Referee: [Abstract, §4] Step-by-step setting definition (abstract and §4): the finding that step-by-step helps weaker models but hurts stronger ones by 'constraining global planning' is central, yet the exact per-step observation/action interface and termination conditions are not specified. This makes it difficult to assess whether the reported format effects generalize beyond the specific protocol.

Authors: We agree that the precise interface must be documented. The revised §4 and abstract will describe: per-step observation as a text tuple (current position, goal coordinates, visible 5×5 local grid, remaining steps), action space {up, down, left, right, backtrack, submit}, and termination when the agent issues “submit” at the goal (success) or exceeds 100 steps or produces an invalid move (failure). These details will allow readers to evaluate generalizability of the format effects. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with external baselines

full rationale

The paper introduces Spatial-Gym as a new Gymnasium environment and reports direct empirical solve rates for models against independent human, random, and A* baselines on 500 episodes. No equations, derivations, or fitted parameters are presented as predictions; the three key findings are observational comparisons. No self-citations are load-bearing for the central claims, and the work is self-contained against external benchmarks without reducing reported results to author-defined quantities by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on the domain assumption that 2D grid pathfinding isolates transferable spatial reasoning and on the new benchmark itself as the primary contribution.

axioms (1)

domain assumption The 2D-grid pathfinding tasks with optional backtracking isolate spatial constraint reasoning relevant to navigation and robotics.
Invoked in the abstract as the justification for using these puzzles to measure model capabilities.

invented entities (1)

Spatial-Gym environment no independent evidence
purpose: To enable step-by-step sequential decision evaluation of spatial pathfinding with backtracking
Newly introduced benchmark; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5599 in / 1487 out tokens · 98189 ms · 2026-05-10T18:22:13.612952+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce Spatial-Gym, a Gymnasium environment that isolates spatial constraint reasoning by testing pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling
cs.AI 2026-05 unverdicted novelty 5.0

Multi-agent debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points respectively at equal compute budgets on MMLU-Pro and BBH, with advantages that continue at higher scales while sel...

Reference graph

Works this paper leans on

14 extracted references · 3 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

arXiv preprint arXiv:2402.01817 , year=

URLhttps://aclanthology.org/2025.emnlp-main.526/. Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy. Llms can’t plan, but can help planning in llm-modulo frameworks.ArXiv preprint, abs/2402.01817, 2024. URL https://arxiv.org/abs/2402.01817. Michael Katz, Harsha Kokel, and Sarath S...

work page arXiv 2025
[2]

Seemingly Simple Planning Problems are Computationally Challenging: The Countdown Game

URLhttps://arxiv.org/abs/2508.02900. Doyoung Kim, Jongwon Lee, Jinho Park, and Minjoon Seo. How language models extrapolate outside the training data: A case study in textualized gridworld.ArXiv preprint, abs/2406.15275, 2024. URLhttps://arxiv.org/abs/2406.15275. Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Cl...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2021.naacl-main.364 2024
[3]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov

URLhttps://arxiv.org/abs/2504.07052. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.ArXiv preprint, abs/1707.06347, 2017a. URL https: //arxiv.org/abs/1707.06347. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.ArXi...

work page doi:10.1109/jcdl57899.2023.00060 2024
[4]

Nodes CAN NOT be revisited

Path Constraints: Path connects adjacent nodes (horizontal/vertical moves only). Nodes CAN NOT be revisited. You cannot visit a cell twice. Path MUST pass through all Dot cells. Path CANNOT pass through any Gap cells
[5]

Squares of different colors MUST be separated into different regions by the path

Region-Based Rules (Apply to areas enclosed by the path): Squares: All squares within a single region MUST be the same color. Squares of different colors MUST be separated into different regions by the path. Stars: Within a single region, each star symbol MUST be paired with exactly ONE other element of the same color. Other colors within the region are i...
[6]

(1): Path touches EXACTLY 1 edge of the triangle's cell

Path-Based Rules (Edge Touching): Triangles: The path MUST touch a specific number of edges of the cell containing the triangle symbol. (1): Path touches EXACTLY 1 edge of the triangle's cell. (2): Path touches EXACTLY 2 edges of the triangle's cell. (3): Path touches EXACTLY 3 edges of the triangle's cell. (4): Path touches EXACTLY 4 edges (fully surroun...
[7]

Nodes CAN be revisited

Path Constraints: Path connects adjacent nodes (horizontal/vertical moves only). Nodes CAN be revisited. But only if you traceback to the last cell you occupied (and from there again and again ...). Otherwise you CANNOT cross your own path. Path MUST pass through all Dot cells. Path CANNOT pass through any Gap cells. B.4 Spatial-Gym Visual The visual prom...
[8]

An image of the puzzle showing the current state, including your path progress
[9]

## Visual Appearance of the Puzzle Image The image shows a Witness puzzle grid with a dark teal/green background

A text message with your current position coordinates and available legal moves Use both together -- the image helps with spatial reasoning and visual pattern recognition, while the text provides precise position and action information. ## Visual Appearance of the Puzzle Image The image shows a Witness puzzle grid with a dark teal/green background. Here i...
[10]

Draw a continuous line from START to END without visiting the same node twice
[11]

The line can only be placed on valid path cells (not on rule cells)
[12]

The line acts as a boundary, dividing the grid into regions
[13]

each star symbol MUST be paired with exactly ONE other element of the same color

All rule symbols must be satisfied: - Dots (black hexagons on grid lines): The line MUST pass through each dot - Colored Squares (filled rounded rectangles in cells): All squares in a single region must be the same color. Different colors MUST be separated into different regions - Colored Stars (8-pointed stars in cells): Each star must be paired with EXA...

2016
[14]

This convergence indicates that the most complex spatial constraints exceed current model capabilities regardless of the provided input modality. Notably, the performance of the two text- based evaluations (the standard Qwen3- 32B model and the vision-language variant Qwen3-VL-32B) tracks almost identically across the entire difficulty spectrum. Both mode...

2024