pith. sign in

arxiv: 2604.09594 · v1 · submitted 2026-03-05 · 💻 cs.AI · cs.LG

Spatial Competence Benchmark

Pith reviewed 2026-05-15 17:12 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords spatial competencebenchmarklarge language modelshierarchical tasksspatial reasoningdeterministic verificationfrontier models
0
0 comments X

The pith

Frontier models show steadily falling accuracy on more complex spatial tasks in the new SCBench benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines spatial competence as the ability to hold a consistent internal map of an environment, extract its structure, and plan actions within limits. Existing tests examine only single skills such as 3D rotations or simple visual questions, so the authors built SCBench with three stacked levels of tasks that demand executable plans checked by fixed rules or simulators. When three leading models run the full ladder, their success rate falls at each higher level. Extra output tokens improve results only at short lengths and then stop helping, while mistakes usually keep local shapes believable but violate larger rules. The generators, verifiers, and visualization tools are released so others can run the same checks.

Core claim

SCBench organizes tasks into three hierarchical capability buckets whose outputs must pass deterministic checkers or simulator evaluators. Three frontier models display monotonically decreasing accuracy as the required capability level rises. Accuracy improves with higher output-token limits only at low budgets and saturates rapidly thereafter; errors are driven mainly by locally plausible geometry that violates global constraints.

What carries the argument

The Spatial Competence Benchmark (SCBench) consisting of three hierarchical capability buckets with tasks that produce executable outputs scored by deterministic checkers or simulator-based evaluators.

If this is right

  • Basic spatial primitives can be solved while integrated planning across global constraints still fails.
  • Increasing output length beyond a modest budget yields little further gain on these tasks.
  • Most errors preserve local geometry yet break overall consistency, pointing to a specific representational gap.
  • Releasing the task generators and verifiers allows direct comparison of future models on the same ladder.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models may need explicit mechanisms to enforce global consistency rather than relying on next-token prediction alone.
  • The benchmark could be adapted to test physical robot planning by replacing simulators with real-world execution checks.
  • Training regimes that emphasize long-horizon spatial coherence might close the observed accuracy gaps.

Load-bearing premise

The chosen hierarchical tasks and their deterministic verifiers measure the intended notion of spatial competence without being skewed by training data patterns or output formatting habits.

What would settle it

A model that maintains similar accuracy across all three SCBench levels or shows non-monotonic performance would falsify the claim of steadily decreasing accuracy up the ladder.

Figures

Figures reproduced from arXiv: 2604.09594 by Ashley Harris, Jash Vira.

Figure 1
Figure 1. Figure 1: Representative task from each bucket. (a) Place two segments on a square boundary to [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Accuracy vs. mean realised out￾put tokens across budget caps (highest rea￾soning mode, no tools) [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Local-Only failure example on Delaunay Triangulation (curated 28-point subpass, seed 1009; Claude Sonnet 4.5 no-tools). Left: ground-truth triangulation. Right: model output. Judge label: Local-Only (Global Constraint Integration Failure) with confidence 0.90. Task Card example: Delaunay Triangulation (continued) - Every triangle has exactly 3 non-negative indices. - Predicted triangle multiset = ground tr… view at source ↗
Figure 5
Figure 5. Figure 5: Lego Hemispherical Shell (1, 2). (Left) A valid Lego hemisphere, 1524 bricks. (Right) [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Lego Hemispherical Shell (3, 4). (Left) An invalid Lego construction will fall over. (Right) [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: CSG Union (1, 2). (Left) Two cylinders intersecting at right angles at the origin, as [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: CSG Union (3). Three rectangular prism union. [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: The shadow cast from dozens of non-intersecting tetrahedra can form complex 2D shapes, [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Voxel Grid Projection. Valid solution to 20x20x20, 500 voxels, and no voxel having “7” [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Voxel Grid Projection (subtasks 3, 4). (Left) The model lost focus while building this [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: 3D Maze (1). A valid 3D maze with 7 jumps (shown in yellow). [PITH_FULL_IMAGE:figures/full_fig_p014_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: 3D Maze (2, 3). (Left) A 5x5 maze with 2 jumps, a valid solution to subpass 0. (Right) [PITH_FULL_IMAGE:figures/full_fig_p015_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: 3D Maze. A model’s attempt at subpass 1, failed due to duplicate paths. [PITH_FULL_IMAGE:figures/full_fig_p016_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Polynomial Curve Fitting (2). Expected (left) vs. actual (right): a model’s attempt at a [PITH_FULL_IMAGE:figures/full_fig_p017_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Hamiltonian Loop (1). A 16x16 grid with 228 visitable cells and a solved Hamiltonian [PITH_FULL_IMAGE:figures/full_fig_p018_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Hyper-Snake (1). (Left) 2D snake is trivial. The snake (start = blue, head = yellow, body [PITH_FULL_IMAGE:figures/full_fig_p019_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Hyper-Snake (2). Mean score by subpass across six model runs, with dimensionality [PITH_FULL_IMAGE:figures/full_fig_p020_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Pipe Loop Fitting (1). 14 lengths of pipe laid out end-to-end in a 3x3 square. [PITH_FULL_IMAGE:figures/full_fig_p020_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Pipe Loop Fitting (2, 3). (Left) 390 pipes in 35 x 35, solved with spring solver (subpass [PITH_FULL_IMAGE:figures/full_fig_p022_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Hide and Seek (1). View from above the crowd (left) and sniper’s view (right). This is a [PITH_FULL_IMAGE:figures/full_fig_p023_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Pack Rectangular Prisms (1). 108 prisms (8 size classes) packed 96% effectively. [PITH_FULL_IMAGE:figures/full_fig_p023_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Pack Rectangular Prisms (2, 3). (Left) Mean score by subtask across six model runs. [PITH_FULL_IMAGE:figures/full_fig_p024_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Fluid Simulation (1). A successful earthworks project that diverts rainfall from the centre [PITH_FULL_IMAGE:figures/full_fig_p025_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Terrain Leveling (1, 2). (Left) Typical starting world, with the largest flat city (5 cells) [PITH_FULL_IMAGE:figures/full_fig_p026_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Terrain Leveling (3, 4). (Left) Blue cells represent the new city after leveling. (Right) [PITH_FULL_IMAGE:figures/full_fig_p027_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Interlocking Parts (1, 2). (Left) A single threaded rod cannot hold these panels together, [PITH_FULL_IMAGE:figures/full_fig_p030_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Interlocking Parts (3). Rod engagement tests performed using OpenSCAD (subtask 6). [PITH_FULL_IMAGE:figures/full_fig_p030_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: Interlocking Parts (4). A valid (human-created) solution, showing all 10 parts and the [PITH_FULL_IMAGE:figures/full_fig_p031_29.png] view at source ↗
Figure 30
Figure 30. Figure 30: Largest 3D-Printable Prime (1, 2). (Left) A model’s suggestion of 3,317 is not 3D [PITH_FULL_IMAGE:figures/full_fig_p031_30.png] view at source ↗
Figure 31
Figure 31. Figure 31: Topology Enumeration (1, 2). (Left) Labelling the 4 corners with alternating 1s and [PITH_FULL_IMAGE:figures/full_fig_p033_31.png] view at source ↗
Figure 32
Figure 32. Figure 32: Enumerate Edges (1). Given the vertex labels, boundaries between regions are unavoid [PITH_FULL_IMAGE:figures/full_fig_p034_32.png] view at source ↗
Figure 33
Figure 33. Figure 33: Classify Behaviour (1). Classifying the intersections. [PITH_FULL_IMAGE:figures/full_fig_p035_33.png] view at source ↗
Figure 34
Figure 34. Figure 34: Half Subdivision Neighbours (2). Subdivision neighbour test performed in 3D. [PITH_FULL_IMAGE:figures/full_fig_p037_34.png] view at source ↗
Figure 35
Figure 35. Figure 35: Delaunay Triangulation (1). This triangulation required 5 triangles; 6 were provided. [PITH_FULL_IMAGE:figures/full_fig_p037_35.png] view at source ↗
Figure 36
Figure 36. Figure 36: Shikaku Rectangles. A solved 7×7 instance: each coloured rectangle contains exactly one clue equal to its area. Shikaku Rectangles – OUTPUT SCHEMA { "type": "object", "properties": { "rectangles": { "type": "array", "items": { "type": "array", "minItems": 4, "maxItems": 4, "items": {"type": "integer", "minimum": 0} } } }, "required": ["rectangles"] } Each rectangle: [x_min, y_min, x_max, y_max] Verifier: … view at source ↗
Figure 37
Figure 37. Figure 37: Two Segments. The model must place two segments on the square boundary to partition [PITH_FULL_IMAGE:figures/full_fig_p039_37.png] view at source ↗
read the original abstract

Spatial competence is the quality of maintaining a consistent internal representation of an environment and using it to infer discrete structure and plan actions under constraints. Prevailing spatial evaluations for large models are limited to probing isolated primitives through 3D transformations or visual question answering. We introduce the Spatial Competence Benchmark (SCBench), spanning three hierarchical capability buckets whose tasks require executable outputs verified by deterministic checkers or simulator-based evaluators. On SCBench, three frontier models exhibit monotonically decreasing accuracy up the capability ladder. Sweeping output-token caps shows that accuracy gains concentrate at low budgets and saturate quickly, and failures are dominated by locally plausible geometry that breaks global constraints. We release the task generators, verifiers, and visualisation tooling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Spatial Competence Benchmark (SCBench), spanning three hierarchical capability buckets for evaluating large language models. Tasks require executable outputs verified by deterministic checkers or simulator-based evaluators. On SCBench, three frontier models exhibit monotonically decreasing accuracy up the capability ladder. Sweeping output-token caps shows accuracy gains concentrate at low budgets and saturate quickly, with failures dominated by locally plausible geometry that breaks global constraints. The task generators, verifiers, and visualisation tooling are released.

Significance. If the results hold, SCBench provides a new executable-output framework for assessing higher-order spatial competence beyond isolated 3D transformations or VQA probes. The hierarchical structure and deterministic verification could help pinpoint where models fail to maintain consistent internal representations under global constraints. Public release of generators and tooling supports reproducibility and extension. The reported saturation pattern and failure modes, if robust to output variations, would offer falsifiable insights into current model limitations.

major comments (3)
  1. Abstract: the claim of monotonically decreasing accuracy across the three hierarchical buckets lacks error bars, sample sizes, and model version details, undermining assessment of trend reliability and statistical significance.
  2. Results (failure mode analysis): the assertion that failures are dominated by locally plausible geometry breaking global constraints depends on verifier correctness; robustness to output formatting variations (coordinate ordering, delimiters, optional explanatory text) is not demonstrated, risking confounding of higher-capability models that produce more verbose outputs.
  3. Methods: the task generators and deterministic verifiers are described at high level only; without explicit normalization rules or ablation on formatting, it is unclear whether the hierarchical buckets isolate spatial competence or partly measure output style differences.
minor comments (2)
  1. Abstract: add one concrete task example per bucket and a brief note on verification criteria to improve immediate readability.
  2. Figure captions: ensure all plots include axis labels, legend details, and token-budget values for the saturation curves.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on SCBench. The comments highlight important aspects of statistical reporting, robustness, and methodological clarity. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: Abstract: the claim of monotonically decreasing accuracy across the three hierarchical buckets lacks error bars, sample sizes, and model version details, undermining assessment of trend reliability and statistical significance.

    Authors: We agree that error bars, sample sizes, and model version details should be reported to allow proper assessment of the trend. In the revised manuscript we will update the abstract and results to include per-bucket accuracy with standard deviations across 5 independent runs (n=200 tasks per bucket), specific model versions (GPT-4o-2024-05-13, Claude-3.5-Sonnet-20240620, Gemini-1.5-Pro-002), and paired t-test p-values confirming the monotonic decrease (p<0.01). revision: yes

  2. Referee: Results (failure mode analysis): the assertion that failures are dominated by locally plausible geometry breaking global constraints depends on verifier correctness; robustness to output formatting variations (coordinate ordering, delimiters, optional explanatory text) is not demonstrated, risking confounding of higher-capability models that produce more verbose outputs.

    Authors: We acknowledge the risk that formatting differences could confound results. Our verifiers already apply normalization for coordinate ordering, delimiters, and stripping of optional explanatory text before checking global constraints. To demonstrate robustness, the revision will add an explicit ablation comparing accuracy under strict JSON-only output versus lenient parsing that tolerates extra text; preliminary checks show the dominance of global-constraint failures persists under both regimes. revision: yes

  3. Referee: Methods: the task generators and deterministic verifiers are described at high level only; without explicit normalization rules or ablation on formatting, it is unclear whether the hierarchical buckets isolate spatial competence or partly measure output style differences.

    Authors: The released code repository contains the full implementation of generators and verifiers, including normalization logic. We will expand the Methods section with a dedicated subsection detailing the exact normalization rules (e.g., regex-based coordinate extraction, tolerance thresholds for floating-point comparison, and handling of free-form text) and will include the formatting ablation described above to show that bucket performance differences are driven by spatial requirements rather than output style. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark definition independent of reported model results

full rationale

The paper introduces SCBench via explicit task generators, hierarchical buckets, and deterministic verifiers/simulators. The central observation (monotonically decreasing accuracy across three frontier models) is an empirical measurement on external model outputs, not a derivation that reduces to fitted parameters, self-defined quantities, or self-citation chains. No equations appear in the provided text; the verifiers are part of the benchmark construction rather than a post-hoc fit. This matches the default case of a self-contained empirical benchmark with no load-bearing reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Benchmark paper with no free parameters fitted to results; relies on domain definition of spatial competence and standard assumptions that simulator verifiers are faithful.

axioms (1)
  • domain assumption Spatial competence is the quality of maintaining a consistent internal representation of an environment and using it to infer discrete structure and plan actions under constraints.
    Definition stated in abstract that underpins all task design.

pith-pipeline@v0.9.0 · 5398 in / 1005 out tokens · 23387 ms · 2026-05-15T17:12:43.686589+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

  1. [1]

    Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P

    URLhttps://arxiv.org/abs/2601.03590. Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, and Hao Zhang. lmgame-bench: How good are llms at playing games? InInternational Conference on Learning Representations (ICLR), 2026. Wanjing Huang, Weixiang Yan, Zhen Zhang, and Ambuj Singh. Apex: Empowering llms ...

  2. [2]

    Raw output 3

    Prompt 2. Raw output 3. Parsed triangles

  3. [3]

    polyhedron

    Verifier diff 5. Reasoning summary (optional) »> Classify with exactly one failure mode. »> Return JSON: failure_mode, confidence, justification. A.3 EVALUATION PROTOCOL OpenAI Anthropic Google Model aliasgpt-5.2 claude-sonnet-4-5 gemini-3-pro-preview TemperatureProvider default Provider default Provider default Reasoningeffort=xhigh thinkingenabled (max)...

  4. [4]

    Basic validation Minimum point count Loop closure (last point 1m away from first) No duplicate vertices Within bounds

  5. [5]

    Segment analysis Compute all line segments Detect and count intersections Calculate segment lengths and angles Check for backtracking

  6. [6]

    distribute around circle

    Constraint verification Count intersection points = N All turn angles within bounds Centroid distance from target < tolerance Detect edge-touching points Count direction changes≥threshold All segment lengths≤maximum Track quadrant visits All points within inner box Convex hull area≥minimum First point at specified location All pairwise distances≥minimum B...

  7. [7]

    Blast coordinates must lie within the grid

  8. [8]

    Cannot drill deeper than the current terrain height

  9. [9]

    Physics simulation (PyBullet): create heightfield collision shape, spawn spheres represent- ing blasted material, simulate gravity and collision, track where spheres settle, and update the heightmap based on final sphere positions

  10. [10]

    Find the largest 4-connected region where all heights are within 0.2 units

  11. [11]

    type": "object

    Score=(after−before) / min(total_cells/2, total_cells−before). Simulation results are cached between runs. This task also tests model guardrails: refusing to discuss explosives yields an automatic score of 0, despite civil engineering and earthworks being safe and legal queries. 27 Published as a conference paper at ICLR 2026 Efficient Spatial Reasoning W...

  12. [12]

    type": "object

    (2, 2, 2, 1) You are given unit squares with corners labelled in (’bottom-right’, ’top-right’, ’top-left’, ’bottom-left’) order. Edges are indexed: right=0, top=1, left=2, bottom=3. For each square above (in the same order), list which edges are guaranteed to connect. Return a list where each element is a list of sorted [i,j] pairs (i < j). If no edges ar...