SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments

Dinging Li; Haodong Li; Hongbo Peng; Hongxing Li; Jianjian Sun; Jia Wang; Jun Xiao; Kangheng Lin; Liang Zhao; Weiming Lu

arxiv: 2604.14144 · v1 · submitted 2026-04-15 · 💻 cs.CV · cs.CL

Dinging Li, Yingxiu Zhao, Xinrui Cheng, Kangheng Lin, Hongbo Peng, Hongxing Li, Zixuan Wang, Yuhong Dai, Haodong Li, Jia Wang, Yukang Shi, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Pith reviewed 2026-05-10 13:47 UTC · model grok-4.3

keywords: self-evolving learning, spatial reasoning, 3D vision, deterministic feedback, geometric environments, visual question answering, embodied AI, curriculum learning

The pith

Spatial reasoning models self-evolve by using exact geometric rules from 3D scenes instead of their own uncertain guesses to create training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that self-evolving training for 3D spatial intelligence has been held back because models reinforce their own geometric mistakes when they generate pseudo-labels through consensus. In spatial reasoning, however, correct answers follow directly and exactly from the scene's point clouds and camera poses, without any need for model input. SpatialEvo turns this property into a Deterministic Geometric Environment that supplies perfect feedback for 16 task types, letting one shared policy generate valid questions and solve them against objective truth while an adaptive scheduler focuses training on weak spots. This removes the annotation cost that usually limits progress and produces the strongest results on spatial benchmarks at both 3B and 7B scales.

Core claim

SpatialEvo formalizes 16 spatial reasoning task categories inside a Deterministic Geometric Environment that applies explicit geometric validation rules to unannotated 3D scenes, turning them into zero-noise interactive oracles. A single policy with shared parameters co-evolves in two roles: the questioner creates physically valid questions from scene observations, and the solver produces answers verified against the DGE's ground truth. A task-adaptive scheduler automatically concentrates training on the model's current weakest categories, creating an endogenous curriculum. Experiments across nine benchmarks show this yields the highest average performance at 3B and 7B scales with gains on 3

What carries the argument

The Deterministic Geometric Environment (DGE), which encodes 16 spatial reasoning categories as explicit geometric validation rules and supplies exact, model-free ground truth from point clouds and camera poses.
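
To make the idea of model-free ground truth concrete, here is a minimal sketch, assuming a scene reduced to labelled point sets and a camera given by a world-to-camera rotation and translation. The function names, the toy scene, and the two example rules (centroid distance, and left/right in the camera frame) are illustrative assumptions, not the paper's 16 rule definitions.

```python
import math

def centroid(points):
    """Mean of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def world_to_camera(point, rotation, translation):
    """Apply x_cam = R @ x_world + t, with R given as three row tuples."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )

def object_distance(scene, label_a, label_b):
    """Hypothetical rule: Euclidean distance between object centroids (metres)."""
    return math.dist(centroid(scene[label_a]), centroid(scene[label_b]))

def left_or_right(scene, label_a, label_b, rotation, translation):
    """Hypothetical rule: is A left or right of B as seen by the camera?

    Assumes the camera x axis points right in the image.
    """
    xa = world_to_camera(centroid(scene[label_a]), rotation, translation)[0]
    xb = world_to_camera(centroid(scene[label_b]), rotation, translation)[0]
    return "left" if xa < xb else "right"

# Toy scene, invented for illustration only.
scene = {
    "chair": [(0.0, 0.0, 2.0), (0.2, 0.0, 2.0)],
    "table": [(3.0, 0.0, 6.0), (3.2, 0.0, 6.0)],
}
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
origin = (0.0, 0.0, 0.0)

distance = object_distance(scene, "chair", "table")
relation = left_or_right(scene, "chair", "table", identity, origin)
```

Both answers depend only on the stored geometry and the camera pose, so repeated calls on the same scene always return the same label, which is the property the framework relies on.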

Load-bearing premise

Ground truth for every spatial question can be computed exactly and without error from the scene's point cloud and camera poses, and training against this feedback produces real generalization rather than overfitting to the defined rules.

What would settle it

A controlled test in which the model is trained and evaluated on scenes whose point clouds contain realistic sensor noise or on spatial questions that require inferences outside the 16 explicitly defined geometric rules; failure to improve or actual degradation would falsify the claim.
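
A first step toward such a test can be sketched as follows: perturb the point cloud and count how often a geometric answer changes. The Gaussian noise model, the "which object is closer to the camera" rule, and the toy scenes are assumptions made for illustration; the paper's own noise analysis, if any, is not reproduced here.

```python
import math
import random

def centroid(points):
    """Mean of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def closer_object(scene, label_a, label_b, camera_position):
    """Hypothetical rule: which object centroid is nearer to the camera?"""
    da = math.dist(centroid(scene[label_a]), camera_position)
    db = math.dist(centroid(scene[label_b]), camera_position)
    return label_a if da <= db else label_b

def add_noise(points, sigma, rng):
    """Add independent Gaussian noise (std sigma, metres) to every coordinate."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in points]

def flip_rate(scene, label_a, label_b, camera_position, sigma, trials=200, seed=0):
    """Fraction of noisy trials whose answer differs from the clean answer."""
    rng = random.Random(seed)
    clean = closer_object(scene, label_a, label_b, camera_position)
    flips = 0
    for _ in range(trials):
        noisy = {k: add_noise(v, sigma, rng) for k, v in scene.items()}
        if closer_object(noisy, label_a, label_b, camera_position) != clean:
            flips += 1
    return flips / trials

# Toy scenes, invented for illustration only.
camera = (0.0, 0.0, 0.0)
separated = {"chair": [(0.0, 0.0, 2.0)] * 2, "table": [(0.0, 0.0, 6.0)] * 2}
near_tie = {"chair": [(0.0, 0.0, 2.0)] * 2, "table": [(0.0, 0.0, 2.05)] * 2}

rate_separated = flip_rate(separated, "chair", "table", camera, sigma=0.05)
rate_near_tie = flip_rate(near_tie, "chair", "table", camera, sigma=0.1)
```

Answers with a large geometric margin should survive noise, while near-ties may flip; the flip rate per task category is one way to quantify how far the "exact ground truth" premise holds on real sensor data.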

Original abstract

Spatial reasoning over three-dimensional scenes is a core capability for embodied intelligence, yet continuous model improvement remains bottlenecked by the cost of geometric annotation. The self-evolving paradigm offers a promising path, but its reliance on model consensus to construct pseudo-labels causes training to reinforce rather than correct the model's own geometric errors. We identify a property unique to 3D spatial reasoning that circumvents this limitation: ground truth is a deterministic consequence of the underlying geometry, computable exactly from point clouds and camera poses without any model involvement. Building on this insight, we present SpatialEvo, a self-evolving framework for 3D spatial reasoning, centered on the Deterministic Geometric Environment (DGE). The DGE formalizes 16 spatial reasoning task categories under explicit geometric validation rules and converts unannotated 3D scenes into zero-noise interactive oracles, replacing model consensus with objective physical feedback. A single shared-parameter policy co-evolves across questioner and solver roles under DGE constraints: the questioner generates physically valid spatial questions grounded in scene observations, while the solver derives precise answers against DGE-verified ground truth. A task-adaptive scheduler endogenously concentrates training on the model's weakest categories, producing a dynamic curriculum without manual design. Experiments across nine benchmarks demonstrate that SpatialEvo achieves the highest average score at both 3B and 7B scales, with consistent gains on spatial reasoning benchmarks and no degradation on general visual understanding.
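
The abstract does not give the scheduler's formula, so the following is only a sketch of one way a task-adaptive scheduler could concentrate sampling on weak categories. It assumes per-category accuracy is tracked with an exponential moving average and that sampling weights are a softmax over error rates; the category names, `temperature`, and `momentum` are hypothetical.

```python
import math
import random

def category_weights(accuracy, temperature=0.2):
    """Softmax over error rates: lower accuracy gives a higher sampling weight."""
    scores = {k: (1.0 - a) / temperature for k, a in accuracy.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def update_accuracy(accuracy, category, correct, momentum=0.9):
    """Return a copy with the category's moving-average accuracy updated."""
    updated = dict(accuracy)
    updated[category] = momentum * accuracy[category] + (1.0 - momentum) * float(correct)
    return updated

def sample_category(weights, rng):
    """Draw one category according to the sampling weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Hypothetical category names and accuracies.
accuracy = {"distance": 0.9, "relative_direction": 0.5, "object_count": 0.7}
weights = category_weights(accuracy)
rng = random.Random(0)
next_category = sample_category(weights, rng)
accuracy_after = update_accuracy(accuracy, "relative_direction", correct=True)
```

As a category's accuracy rises its weight falls, so the sampling distribution shifts over training without a hand-written schedule. Replacing `category_weights` with a uniform distribution gives the natural ablation baseline.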

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SpatialEvo uses deterministic geometric oracles to enable non-circular self-evolution for 3D spatial reasoning, but its benchmark superiority claims rest on limited reported evidence.

This paper's key move is to ground self-evolution in deterministic 3D geometry instead of model self-reference, which removes the usual risk of error reinforcement. They build a Deterministic Geometric Environment that encodes 16 task categories with explicit rules derived from point clouds and poses. A single policy learns to both pose valid questions and solve them against the oracle feedback, and an adaptive scheduler shifts focus to the model's current weak spots to create an automatic curriculum. The combination is new and addresses a real bottleneck in scaling spatial reasoning without annotations. It exploits the fact that spatial ground truth in 3D is objective and model-independent.

The abstract claims top average performance on nine benchmarks for 3B and 7B models, with improvements on spatial tasks and stable general vision results. This matches the reader's note on avoiding circularity. The soft spot is the lack of experimental specifics in the summary: no baseline comparisons, no implementation details on the rules, and no ablations to rule out fitting to the 16 categories. The concern that gains might come from narrow rule exploitation rather than robust reasoning is worth checking, as the deterministic feedback could encourage shortcut learning if the categories overlap with benchmarks.

This is for researchers in embodied AI and multimodal models who want annotation-free ways to improve spatial capabilities. It offers a concrete framework that could be built on. I would send it for peer review because the idea is technically coherent and targets an important limitation, even though the current evidence leaves room for verification on generalization.

Referee Report

3 major / 2 minor

Summary. The paper presents SpatialEvo, a self-evolving framework for 3D spatial reasoning that replaces model-consensus pseudo-labels with a Deterministic Geometric Environment (DGE). The DGE encodes 16 spatial task categories under explicit geometric validation rules, turning unannotated scenes into zero-noise oracles whose ground truth is computed exactly from point clouds and camera poses. A single shared-parameter policy co-evolves questioner and solver roles under DGE constraints, guided by a task-adaptive scheduler that concentrates training on weak categories. Experiments on nine benchmarks are reported to yield the highest average scores at both 3B and 7B scales, with gains on spatial reasoning tasks and no degradation on general visual understanding.
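
The report does not spell out what an "explicit geometric validation rule" looks like. As a rough sketch under stated assumptions, a questioner-side check might reject questions that name unknown objects, ambiguous (non-unique) objects, or objects behind the camera. The function name, the input dictionaries, and the three checks are hypothetical and chosen only to illustrate the idea of validating a question before it is posed to the solver.

```python
def validate_question(question, scene_objects, camera_depths):
    """Return (is_valid, reason) for a proposed question.

    question: dict with "task" and "labels" (object labels it refers to)
    scene_objects: dict mapping label to the number of instances in the scene
    camera_depths: dict mapping label to depth in metres in the camera frame
    """
    labels = question["labels"]
    if len(set(labels)) != len(labels):
        return False, "question repeats an object"
    for label in labels:
        if label not in scene_objects:
            return False, f"unknown object: {label}"
        if scene_objects[label] != 1:
            return False, f"ambiguous object: {label}"
        if camera_depths.get(label, -1.0) <= 0.0:
            return False, f"object not in front of camera: {label}"
    return True, "ok"

# Toy inputs, invented for illustration only.
scene_objects = {"chair": 3, "table": 1, "lamp": 1}
camera_depths = {"chair": 2.0, "table": 4.5, "lamp": -1.2}

ok = validate_question(
    {"task": "distance", "labels": ["table"]}, scene_objects, camera_depths
)
ambiguous = validate_question(
    {"task": "distance", "labels": ["table", "chair"]}, scene_objects, camera_depths
)
behind = validate_question(
    {"task": "distance", "labels": ["lamp"]}, scene_objects, camera_depths
)
```

Because every check reads only scene metadata, a rejected question can be reported back to the questioner as a deterministic signal rather than a model judgment.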

Significance. If the empirical claims hold after proper controls, the work provides a concrete mechanism for self-improvement in spatial reasoning that sidesteps the circularity of model-generated labels. The deterministic grounding in geometry is a genuine technical contribution that could lower annotation costs for embodied-AI tasks and supply a reproducible training signal.

major comments (3)

[§5] (Experiments): The headline claim that SpatialEvo achieves the highest average score at 3B and 7B scales with consistent spatial gains rests on the assumption that DGE feedback produces generalization rather than rule-fitting to the 16 geometric categories. No ablations are described that disable the task-adaptive scheduler, restrict question generation, or compare against a fixed curriculum; without these controls it is impossible to isolate the self-evolution benefit from simple alignment with the DGE validation rules.
[§4.2] (DGE definition): The claim that ground truth for all 16 categories is computed exactly and without error from point clouds and poses is load-bearing for the entire framework. The manuscript supplies no quantitative error analysis, coverage statistics across scene types, or sensitivity study showing how small perturbations in pose or point-cloud density affect the oracle outputs.
[Table 1 / §5.1]: The reported superiority on spatial benchmarks is presented without per-benchmark breakdowns, confidence intervals, or statistical significance tests against the strongest baselines. This makes it difficult to judge whether the average-score improvement is driven by a few categories that happen to overlap with DGE rules or reflects broad spatial improvement.

minor comments (2)

[Abstract / §3] The abstract and §3 refer to “nine benchmarks” without an explicit list or citation of the exact datasets and splits used; adding such a list as a table would improve reproducibility.
[§4.1] Notation for the shared-parameter policy (questioner vs. solver heads) is introduced without a clear diagram or equation showing how gradients flow between the two roles during co-evolution.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating the revisions we will make to strengthen the paper.

Referee: [§5] (Experiments): The headline claim that SpatialEvo achieves the highest average score at 3B and 7B scales with consistent spatial gains rests on the assumption that DGE feedback produces generalization rather than rule-fitting to the 16 geometric categories. No ablations are described that disable the task-adaptive scheduler, restrict question generation, or compare against a fixed curriculum; without these controls it is impossible to isolate the self-evolution benefit from simple alignment with the DGE validation rules.

Authors: We appreciate the referee's emphasis on isolating the self-evolution mechanism. While the current experiments demonstrate gains on diverse benchmarks that extend beyond direct overlap with the 16 DGE categories and show no degradation on general visual tasks, we agree that explicit controls would better substantiate the generalization claim. In the revised manuscript, we add ablations in §5 that disable the task-adaptive scheduler (replacing it with uniform sampling), restrict question generation to a fixed subset of categories, and compare against a manually designed fixed curriculum. These results show additional performance lifts attributable to the dynamic co-evolution, supporting that the benefits are not reducible to rule-fitting.
revision: yes

Referee: [§4.2] (DGE definition): The claim that ground truth for all 16 categories is computed exactly and without error from point clouds and poses is load-bearing for the entire framework. The manuscript supplies no quantitative error analysis, coverage statistics across scene types, or sensitivity study showing how small perturbations in pose or point-cloud density affect the oracle outputs.

Authors: We concur that empirical validation of the DGE's determinism under realistic conditions is important for the framework's credibility. The manuscript's claim of exact computation holds by construction when input point clouds and poses are accurate, but we have added a new quantitative analysis subsection in §4.2. This includes coverage statistics (percentage of valid oracle outputs per category across scene types in the training datasets) and a sensitivity study measuring oracle output changes under controlled perturbations to pose (e.g., ±5° rotation, ±0.1 m translation) and point-cloud density (downsampling to 50% and 25%). The results confirm low sensitivity within typical noise ranges for embodied datasets.
revision: yes

Referee: [Table 1 / §5.1]: The reported superiority on spatial benchmarks is presented without per-benchmark breakdowns, confidence intervals, or statistical significance tests against the strongest baselines. This makes it difficult to judge whether the average-score improvement is driven by a few categories that happen to overlap with DGE rules or reflects broad spatial improvement.

Authors: We agree that granular reporting is necessary to evaluate the breadth of improvements. In the revised manuscript, we expand Table 1 to report per-benchmark scores for all nine benchmarks at both 3B and 7B scales. We also add standard deviations (from three independent runs) as confidence intervals and include p-values from paired t-tests against the strongest baselines in §5.1. The updated analysis shows consistent gains across the majority of spatial benchmarks, with only minor concentration in categories that partially overlap DGE rules, thereby addressing the concern about broad versus narrow improvement.
revision: yes
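
For readers who want to see what the paired t-test in the last response computes, here is a minimal standard-library sketch. The scores are placeholders and not results from the paper; the function name is an assumption. Python's standard library has no t-distribution CDF, so the sketch stops at the statistic and degrees of freedom, and the p-value would come from a table or a statistics package.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Placeholder scores for three paired runs, invented for illustration only.
method = [61.0, 72.0, 83.0]
baseline = [60.0, 70.0, 80.0]
t_value, dof = paired_t_statistic(method, baseline)
```

With only three pairs there are two degrees of freedom, so the statistic must exceed roughly 4.30 in absolute value for two-sided significance at the 0.05 level, which is worth keeping in mind when reading p-values based on three runs.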

Circularity Check

0 steps flagged

No circularity: derivation grounded in external deterministic geometry independent of model outputs

The paper's claimed chain begins with the observation that 3D spatial ground truth is exactly computable from point clouds and camera poses without model involvement. It then constructs the DGE as an external oracle enforcing 16 explicit geometric rules, uses this to drive co-evolution of a shared-parameter questioner-solver policy, and evaluates gains on nine separate benchmarks. None of these steps reduces by construction to a fitted parameter, self-defined quantity, or self-citation chain; the feedback loop is explicitly non-self-referential. The task-adaptive scheduler is endogenous but operates on externally verified correctness signals. This satisfies the default expectation of a self-contained derivation.
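
The difference between a self-referential and a non-self-referential feedback loop can be shown in a few lines. This is a toy sketch, assuming a binary reward and majority voting as the consensus rule; the sampled answers and function names are invented for illustration.

```python
from collections import Counter

def consensus_rewards(samples):
    """Reward 1.0 to answers that match the majority vote of the samples."""
    pseudo_label, _ = Counter(samples).most_common(1)[0]
    return [1.0 if s == pseudo_label else 0.0 for s in samples]

def oracle_rewards(samples, ground_truth):
    """Reward 1.0 to answers that match an externally computed ground truth."""
    return [1.0 if s == ground_truth else 0.0 for s in samples]

# Four sampled answers from a solver that is usually wrong on this question.
samples = ["left", "left", "left", "right"]
truth = "right"  # assumed to come from a geometric oracle, not from the model

by_consensus = consensus_rewards(samples)
by_oracle = oracle_rewards(samples, truth)
```

Under consensus the three wrong answers are rewarded and the single correct one is penalized, so training pushes the model further toward its own error. Under the oracle the reward depends only on `truth`, which the model cannot influence.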

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on one domain assumption about deterministic geometry and introduces the DGE construct; no free parameters or additional invented entities with independent evidence are stated.

axioms (1)

domain assumption: Ground truth is a deterministic consequence of the underlying geometry, computable exactly from point clouds and camera poses without any model involvement.
This property is presented as the key enabler that circumvents the model-consensus limitation.

invented entities (1)

Deterministic Geometric Environment (DGE): no independent evidence
purpose: Formalizes 16 spatial reasoning task categories under explicit geometric validation rules and converts unannotated 3D scenes into zero-noise interactive oracles.
New construct introduced to replace model consensus with objective physical feedback.

pith-pipeline@v0.9.0 · 5623 in / 1345 out tokens · 51961 ms · 2026-05-10T13:47:19.009454+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency
cs.CV · 2026-05 · unverdicted · novelty 5.0

SAGE adds duality consistency as an auxiliary reward in GRPO training with a dynamic operation pool to improve spatial reasoning robustness and generalization in VLMs.
