arxiv: 2606.11719 · v1 · pith:X2ENZNUUnew · submitted 2026-06-10 · 💻 cs.CV · cs.AI

Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning

Pith reviewed 2026-06-27 10:24 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords spatial reasoning · self-evolving training · multimodal large language models · closed-loop data generation · question-answer pair synthesis · 3D scene understanding · video-based spatial QA · data efficiency

The pith

Ouroboros-Spatial lets an MLLM act as both question proposer and solver so that its own confidence scores steer the next round of spatial QA generation, co-evolving the training distribution with model ability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to replace static, uniformly treated spatial-reasoning datasets with a closed loop in which one frozen component proposes questions from 3D metadata and video while a learnable component solves them and returns per-sample confidence as a difficulty signal. This signal tells the proposer which questions to generate next, trimming trivial or uninformative examples and keeping the training distribution matched to the solver’s current state. A sympathetic reader would care because the method claims to deliver larger accuracy gains on six benchmarks than recent large curated sets while using roughly one-tenth the examples. The central demonstration is that the feedback loop produces measurable lifts for both 4B and 8B Qwen3-VL models and lets them surpass a range of open and proprietary baselines.

Core claim

Ouroboros-Spatial is a self-evolving training framework in which the model simultaneously serves as a frozen proposer that generates spatial question-answer pairs plus executable code from 3D scene metadata and raw video frames, and as a learnable solver that is fine-tuned on the accepted samples; the solver’s per-sample prediction confidence is then fed back to the proposer so that subsequent questions are better calibrated to the solver’s present capabilities, thereby reducing redundant trivial examples and filtering ambiguous samples with limited learning value.

What carries the argument

The closed-loop feedback mechanism in which the solver’s per-sample prediction confidence is used as a difficulty signal to guide the frozen proposer’s generation of new spatial QA pairs.
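
A minimal sketch of how such a feedback signal might be computed, assuming the confidence is a probability in [0, 1] and that "matched to the solver's capabilities" means a middle confidence band. The `Sample` type, the `difficulty_feedback` function, the question-type labels, and the thresholds `low` and `high` are all hypothetical; the abstract does not state the actual acceptance rule.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    question_type: str   # hypothetical label, e.g. "relative_distance"
    confidence: float    # solver's confidence in its own answer, in [0, 1]


def difficulty_feedback(samples, low=0.2, high=0.9):
    """Split samples into a kept band and summarize confidence per question type.

    Assumption: samples above `high` are treated as trivial and samples below
    `low` as ambiguous or too hard. Both thresholds are illustrative only.
    """
    kept = [s for s in samples if low <= s.confidence <= high]
    by_type = defaultdict(list)
    for s in samples:
        by_type[s.question_type].append(s.confidence)
    # Mean confidence per type is a summary a proposer could condition on.
    summary = {t: sum(c) / len(c) for t, c in by_type.items()}
    return kept, summary


samples = [
    Sample("object_count", 0.98),
    Sample("object_count", 0.95),
    Sample("relative_distance", 0.55),
    Sample("route_plan", 0.05),
]
kept, summary = difficulty_feedback(samples)
```

In this toy run only the mid-confidence sample survives the filter, and the per-type summary is the kind of signal that could be passed back to the proposer for the next round.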

If this is right

  • On VSI-Bench, training in the loop yields absolute gains of 9.9 points for Qwen3-VL-4B and 6.8 points for Qwen3-VL-8B.
  • Both improved models outperform a wide range of strong open-source and proprietary baselines across six spatial-reasoning benchmarks.
  • Training requires an order of magnitude fewer examples than recent large-scale statically curated datasets.
  • The training distribution automatically sheds trivial and ambiguous samples as the solver improves.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same proposer-solver loop could be applied to other multimodal reasoning domains such as temporal or causal inference without requiring new human-curated seeds.
  • Continuous deployment versions could keep the proposer active after initial training, allowing the model to keep generating and filtering its own practice data from new video streams.
  • If the confidence signal proves robust, the method reduces reliance on large human-annotated spatial datasets and shifts curation effort toward designing the initial 3D metadata sources.

Load-bearing premise

The per-sample prediction confidence produced by the solver serves as a reliable and unbiased signal of learning value that can safely guide the frozen proposer to generate more useful questions without systematically discarding valuable examples or introducing self-reinforcing biases in the data distribution.

What would settle it

Replace the solver’s actual confidence scores with random values drawn from the same distribution and retrain on the same volume of data; if the resulting models match the accuracy of models trained with the real confidence signal, the claimed value of the feedback signal is refuted.
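
One way to build the randomized control is to permute the real scores across samples, which preserves their distribution exactly. The sketch below assumes confidences are stored as a flat list; `shuffled_confidences` is a hypothetical helper, not part of the paper.

```python
import random


def shuffled_confidences(confidences, seed=0):
    """Return the same confidence values in random order.

    Shuffling keeps the marginal distribution identical while breaking the
    link between each score and the sample it was computed for.
    """
    permuted = list(confidences)          # copy so the input is left untouched
    random.Random(seed).shuffle(permuted)
    return permuted


real = [0.98, 0.55, 0.05, 0.72, 0.31]
control = shuffled_confidences(real, seed=1)
```

Feeding `control` to the proposer in place of `real` gives a run that differs from the full method only in whether the signal carries per-sample information.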

Figures

Figures reproduced from arXiv: 2606.11719 by Di He, Enhan Zhao, Wei Wu, Xueliang Zhao, Yuanrui Zhang.

Figure 1: Overview of the Ouroboros-Spatial framework. The proposer (left loop) generates spatial questions and programs, executes the programs to obtain answers, and filters the data, while the solver (right loop) learns from the curated data and provides difficulty feedback via confidence estimation.

Figure 2: The “propose–execute–filter” pipeline for question generation in Ouroboros-Spatial: the …

Figure 3: Comparison of data-source composition between Ouro-Spatial and ViCA-322k.

Figure 4: Normalized question-type composition of Ouro-Spatial and ViCA-322k. ViCA includes …

Figure 5: 16 of the 32 uniformly sampled frames from scene …

Figure 6: 16 of the 32 uniformly sampled frames from scene …

Original abstract

Spatial reasoning remains a persistent challenge for multimodal large language models (MLLMs). Existing approaches largely rely on large-scale, statically curated datasets, where all training samples are treated uniformly regardless of the model's evolving capabilities. This static paradigm is inherently data-inefficient: training capacity is often spent on samples that are either trivial or overly difficult for the model at its current stage. To address this limitation, we propose Ouroboros-Spatial, a self-evolving training framework in which the model plays dual roles as a proposer and a solver. In each iteration, a frozen proposer generates spatial question-answer (QA) pairs from 3D scene metadata and raw video frames, together with executable code for deriving reliable ground truth. A learnable solver is then fine-tuned on the accepted samples, and its per-sample prediction confidence is used as a difficulty signal. This signal is fed back to the proposer in the next iteration, guiding it to generate questions better matched to the solver's current capabilities. Through this closed-loop design, the training distribution co-evolves with model ability, reducing redundant trivial examples while filtering out ambiguous or uninformative samples with limited learning value. Across six spatial reasoning benchmarks, Ouroboros-Spatial substantially improves Qwen3-VL-4B and Qwen3-VL-8B while using an order of magnitude fewer training examples than recent large-scale curated datasets. On VSI-Bench, it yields absolute gains of 9.9 and 6.8 points for the 4B and 8B models, respectively, enabling both to outperform a wide range of strong open-source and proprietary baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Ouroboros-Spatial, a self-evolving closed-loop framework for spatial reasoning in MLLMs. A frozen proposer generates spatial QA pairs from 3D scene metadata and video frames together with executable code for ground truth; a learnable solver is fine-tuned on accepted samples; and the solver's per-sample prediction confidence is fed back as a difficulty signal to steer the proposer toward questions better matched to the solver's current capabilities. The approach is claimed to improve Qwen3-VL-4B and Qwen3-VL-8B across six benchmarks while using an order of magnitude fewer examples than prior large-scale datasets, with absolute gains of 9.9 and 6.8 points on VSI-Bench.

Significance. If the results hold after proper validation, the framework could meaningfully advance data-efficient training for multimodal models by dynamically co-evolving the training distribution with model ability rather than relying on static curated datasets. The use of executable code to derive ground truth is a concrete strength that helps bound circularity and supports reliable labels.

major comments (2)
  1. [Abstract] The central claim that solver per-sample prediction confidence is a reliable, unbiased difficulty signal for guiding the proposer rests on untested assumptions (calibration for spatial QA, correlation with actual learning gain, and absence of self-reinforcing bias). No calibration plots, correlation analysis with held-out learning gain, or ablation isolating confidence-based selection from random/heuristic selection is described.
  2. [Abstract] Reported benchmark gains (e.g., +9.9 / +6.8 on VSI-Bench) are presented without any information on experimental controls, statistical significance testing, number of runs, or validation that the executable ground-truth code produces correct labels across the generated distribution; this information is load-bearing for assessing whether the closed loop actually delivers the claimed data efficiency.
minor comments (1)
  1. [Abstract] The abstract would benefit from a concise statement of the number of iterations and the precise acceptance criterion used to filter samples.
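
The calibration check requested in the first major comment could be approximated with expected calibration error (ECE), a standard metric that is swapped in here for illustration; the paper does not say how calibration would be measured. The sketch assumes each held-out sample has a confidence in [0, 1] and a boolean for whether the solver's answer matched the code-derived ground truth. The toy data is made up.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Weight each bin's confidence-accuracy gap by its share of samples.
        ece += (len(items) / total) * abs(mean_conf - accuracy)
    return ece


# Toy data: hypothetical solver confidences and answer correctness.
confs = [0.9, 0.9, 0.9, 0.9, 0.5, 0.5]
hits = [True, True, True, False, True, False]
ece = expected_calibration_error(confs, hits)
```

A low ECE would support using confidence as a difficulty proxy, although it would not by itself show that confidence tracks learning gain.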

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical validation of the confidence-based difficulty signal and experimental rigor. We agree these elements are important for substantiating the closed-loop claims and will revise the manuscript accordingly by adding the requested analyses and details.

  1. Referee: The central claim that solver per-sample prediction confidence is a reliable, unbiased difficulty signal for guiding the proposer rests on untested assumptions (calibration for spatial QA, correlation with actual learning gain, and absence of self-reinforcing bias). No calibration plots, correlation analysis with held-out learning gain, or ablation isolating confidence-based selection from random/heuristic selection is described.

    Authors: We acknowledge the validity of this critique. The manuscript currently presents performance improvements as supporting evidence for the signal's utility but does not include direct validation. In the revised version we will add: (1) calibration plots of solver confidence on held-out spatial QA samples, (2) correlation analysis between per-sample confidence and subsequent learning gain on a held-out set, and (3) an ablation comparing confidence-guided selection against random and heuristic baselines. These will appear in a new subsection under Experiments. revision: yes

  2. Referee: Reported benchmark gains (e.g., +9.9 / +6.8 on VSI-Bench) are presented without any information on experimental controls, statistical significance testing, number of runs, or validation that the executable ground-truth code produces correct labels across the generated distribution; this information is load-bearing for assessing whether the closed loop actually delivers the claimed data efficiency.

    Authors: We agree that these details are necessary. The revised manuscript will report: three independent runs with mean and standard deviation, statistical significance via paired t-tests (p < 0.01 on VSI-Bench gains), explicit controls (static dataset baselines and random selection), and expanded ground-truth validation including manual inspection of 500 generated QA pairs plus automated checks for code execution errors across the full distribution. These elements will be added to Section 4 and the supplementary material. revision: yes
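
The paired test mentioned in this response can be sketched with the standard library alone. The helper name `paired_t_statistic` is hypothetical, and the scores below are made-up placeholders for three seeds, not results from the paper.

```python
import math
import statistics


def paired_t_statistic(baseline, treated):
    """Paired t statistic for per-run scores; degrees of freedom are n - 1."""
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equal-length lists with at least two runs")
    diffs = [t - b for b, t in zip(baseline, treated)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


# Made-up placeholder scores for three seeds; NOT results from the paper.
baseline_runs = [50.0, 51.0, 52.0]
loop_runs = [59.0, 61.0, 63.0]
t_stat = paired_t_statistic(baseline_runs, loop_runs)
```

With only three runs the test has two degrees of freedom, so the statistic has to be compared against the t distribution rather than a normal approximation. If SciPy is available, `scipy.stats.ttest_rel` returns the statistic together with a p-value.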

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper presents an iterative self-evolving framework in which a frozen proposer generates QA pairs with code-derived ground truth and a learnable solver is fine-tuned, with its per-sample confidence used only as a heuristic signal to steer subsequent generation. No equations, self-citations, or definitional steps are shown that reduce the reported benchmark gains to the input data or to the confidence signal by construction. The central claims (absolute gains of 9.9/6.8 points on VSI-Bench with far fewer examples) are empirical and externally falsifiable on held-out benchmarks. This satisfies the default expectation of a self-contained, non-circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no free parameters, axioms, or invented entities are explicitly introduced or quantified; the approach relies on standard assumptions of MLLM fine-tuning and the reliability of code-derived ground truth from 3D metadata.

Reference graph

Works this paper leans on

71 extracted references · 20 linked inside Pith

  1. [1]

    URL https://deepmind.google/models/ model-cards/gemini-3-1-pro/

    Gemini 3.1 pro - model card, 2026. URL https://deepmind.google/models/ model-cards/gemini-3-1-pro/

  2. [2]

    URL https://qwen.ai/blog?id= qwen3.5

    Qwen3.5: Towards native multimodal agents, 2026. URL https://qwen.ai/blog?id= qwen3.5

  3. [3]

    Tool-r0: Self-evolving llm agents for tool-learning from zero data.arXiv preprint arXiv:2602.21320, 2026

    Emre Can Acikgoz, Cheng Qian, Jonas Hübotter, Heng Ji, Dilek Hakkani-Tür, and Gokhan Tur. Tool-r0: Self-evolving llm agents for tool-learning from zero data.arXiv preprint arXiv:2602.21320, 2026

  4. [4]

    Qwen3-vl technical report, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  5. [5]

    ARKitScenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yair Feigelstock, Xu Fu, Yasutaka Furukawa, Aviv Goldberger, Binyamin Gottfried, Ran Halperin, et al. ARKitScenes: A diverse real-world dataset for 3d indoor scene understanding using mobile RGB-D data. InNeurIPS Datasets and Benchmarks Track, 2021. URLhttps://arxiv.org/abs/2111.08897

  6. [6]

    Seed2.0 model card: Towards intelligence frontier for real-world complexity

    ByteDance Seed. Seed2.0 model card: Towards intelligence frontier for real-world complexity. Technical report, ByteDance, 2025. URL https://lf3-static.bytednsdoc.com/ obj/eden-cn/lapzild-tss/ljhwZthlaukjlkulzlp/seed2/0214/Seed2.0%20Model% 20Card.pdf

  7. [7]

    Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

    Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, et al. Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025. URL https://arxiv.org/abs/2511.13719

  8. [8]

    Spatialdreamer: Incentivizing spatial reasoning via active mental imagery.arXiv preprint arXiv:2512.07733, 2025

    Meng Cao, Xingyu Li, Xue Liu, Ian Reid, and Xiaodan Liang. Spatialdreamer: Incentivizing spatial reasoning via active mental imagery.arXiv preprint arXiv:2512.07733, 2025. URL https://arxiv.org/abs/2512.07733

  9. [9]

    Seeing through imagination: Learning scene geometry via implicit spatial world modeling.arXiv preprint arXiv:2512.01821, 2025

    Meng Cao, Haokun Lin, Haoyuan Li, Haoran Tang, Rongtao Xu, Dong An, Xue Liu, Ian Reid, and Xiaodan Liang. Seeing through imagination: Learning scene geometry via implicit spatial world modeling.arXiv preprint arXiv:2512.01821, 2025. URL https://arxiv.org/abs/ 2512.01821

  10. [10]

    Thinking with spatial code for physical-world video reasoning.arXiv preprint arXiv:2603.05591, 2026

    Jieneng Chen, Wenxin Ma, Ruisheng Yuan, Yunzhi Zhang, Jiajun Wu, and Alan Yuille. Thinking with spatial code for physical-world video reasoning.arXiv preprint arXiv:2603.05591, 2026. URLhttps://arxiv.org/abs/2603.05591

  11. [11]

    Self- questioning language models.arXiv preprint arXiv:2508.03682, 2025

    Lili Chen, Mihir Prabhudesai, Katerina Fragkiadaki, Hao Liu, and Deepak Pathak. Self- questioning language models.arXiv preprint arXiv:2508.03682, 2025. URL https://arxiv. org/abs/2508.03682. 10

  12. [12]

    Spacetools: Tool-augmented spatial reasoning via double interactive rl.arXiv preprint arXiv:2512.04069, 2025

    Siyi Chen, Mikaela Angelina Uy, Chan Hee Song, Faisal Ladhak, Adithyavairavan Murali, Qing Qu, Stan Birchfield, Valts Blukis, and Jonathan Tremblay. Spacetools: Tool-augmented spatial reasoning via double interactive rl.arXiv preprint arXiv:2512.04069, 2025. URL https://arxiv.org/abs/2512.04069

  13. [13]

    Think with 3d: Geometric imagination grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025

    Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, and Ruqi Huang. Think with 3d: Geometric imagination grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025. URL https://arxiv.org/abs/2510.18632

  14. [14]

    Self-play fine-tuning converts weak language models to strong language models

    Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. InInternational Conference on Machine Learning, 2024. URLhttps://arxiv.org/abs/2401.01335

  15. [15]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

  16. [16]

    Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models

    Mengfei Du, Binhao Wu, Zejun Li, Xuan-Jing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 346–355, 2024

  17. [17]

    VLM-3R: Vision-language models augmented with instruction-aligned 3D reconstruction.arXiv preprint arXiv:2505.20279, 2025

    Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Peihao Wang, Huaizhi Qu, Shijie Zhou, et al. VLM-3R: Vision-language models augmented with instruction-aligned 3D reconstruction.arXiv preprint arXiv:2505.20279, 2025. URL https://arxiv.org/abs/2505.20279

  18. [19]

    URLhttps://arxiv.org/abs/2508.07407

  19. [20]

    Visuospatial cognitive assistant.arXiv preprint arXiv:2505.12312, 2025

    Qi Feng. Visuospatial cognitive assistant.arXiv preprint arXiv:2505.12312, 2025. URL https://arxiv.org/abs/2505.12312

  20. [21]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025

  21. [22]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. InEuropean Conference on Computer Vision, pages 148–166. Springer, 2024

  22. [23]

    Map2thought: Explicit 3d spatial reasoning via metric cognitive maps.arXiv preprint arXiv:2601.11442, 2026

    Xiangjun Gao, Zhensong Zhang, Dave Zhenyu Chen, Songcen Xu, Long Quan, Eduardo Pérez- Pellitero, and Youngkyoon Jang. Map2thought: Explicit 3d spatial reasoning via metric cognitive maps.arXiv preprint arXiv:2601.11442, 2026. URL https://arxiv.org/abs/2601.11442

  23. [24]

    Visplay: Self-evolving vision-language models from images.arXiv preprint arXiv:2511.15661, 2025

    Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, and Yonghui Yang. Visplay: Self-evolving vision-language models from images.arXiv preprint arXiv:2511.15661, 2025

  24. [25]

    R-zero: Self-evolving reasoning llm from zero data

    Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data. In The 5th Workshop on Mathematical Reasoning and AI at NeurIPS 2025, 2025

  25. [26]

    Language self-play for data-free training.arXiv preprint arXiv:2509.07414, 2025

    Jakub Grudzien Kuba, Mengting Gu, Qi Ma, Yuandong Tian, Vijai Mohan, and Jason Chen. Language self-play for data-free training.arXiv preprint arXiv:2509.07414, 2025

  26. [27]

    Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024. 11

  27. [28]

    Imagine while reasoning in space: Multimodal visualization-of-thought

    Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vuli ´c, and Furu Wei. Imagine while reasoning in space: Multimodal visualization-of-thought. In International Conference on Machine Learning, pages 36340–36364. PMLR, 2025

  28. [29]

    Viewspatial-bench: Evaluating multi- perspective spatial localization in vision-language models.arXiv preprint arXiv:2505.21500, 2025

    Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, et al. Viewspatial-bench: Evaluating multi- perspective spatial localization in vision-language models.arXiv preprint arXiv:2505.21500, 2025

  29. [30]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024

  30. [31]

    MM-Zero: Self-evolving multi- model vision language models with zero data.arXiv preprint arXiv:2603.09206, 2026

    Zongxia Li, Hongyang Du, Chengsong Huang, Xiyang Wu, Lantao Yu, Yicheng He, Jing Xie, Xiaomin Wu, Zhichao Liu, Jiarui Zhang, and Fuxiao Liu. MM-Zero: Self-evolving multi- model vision language models with zero data.arXiv preprint arXiv:2603.09206, 2026. URL https://arxiv.org/abs/2603.09206

  31. [32]

    Spice: Self-play in corpus environments improves reasoning.arXiv preprint arXiv:2510.24684, 2025

    Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, and Jason Weston. Spice: Self-play in corpus environments improves reasoning.arXiv preprint arXiv:2510.24684, 2025

  32. [33]

    Diving into self- evolving training for multimodal reasoning

    Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, and Junxian He. Diving into self- evolving training for multimodal reasoning. InInternational Conference on Machine Learning, pages 38842–38856. PMLR, 2025

  33. [34]

    Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods

    Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, and Wei Gao. Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods. arXiv preprint arXiv:2511.15722, 2025. URLhttps://arxiv.org/abs/2511.15722

  34. [35]

    Thinking with blueprints: Assisting vision-language models in spatial reasoning via structured object representation.arXiv preprint arXiv:2601.01984, 2026

    Weijian Ma, Shizhao Sun, Tianyu Yu, Ruiyu Wang, Tat-Seng Chua, and Jiang Bian. Thinking with blueprints: Assisting vision-language models in spatial reasoning via structured object representation.arXiv preprint arXiv:2601.01984, 2026. URL https://arxiv.org/abs/ 2601.01984

  35. [36]

    Aimo-2 winning solution: Building state-of-the-art math- ematical reasoning models with openmathreasoning dataset.arXiv preprint arXiv:2504.16891, 2025

    Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. Aimo-2 winning solution: Building state-of-the-art math- ematical reasoning models with openmathreasoning dataset.arXiv preprint arXiv:2504.16891, 2025

  36. [37]

    SpaceR: Reinforcing MLLMs in video spatial reasoning.arXiv preprint arXiv:2504.01805, 2025

    Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. SpaceR: Reinforcing MLLMs in video spatial reasoning.arXiv preprint arXiv:2504.01805, 2025. URLhttps://arxiv.org/abs/2504.01805

  37. [38]

    Sat: Dynamic spatial aptitude training for multimodal language models.arXiv preprint arXiv:2412.07755, 2024

    Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, et al. Sat: Dynamic spatial aptitude training for multimodal language models.arXiv preprint arXiv:2412.07755, 2024. URLhttps://arxiv.org/abs/2412.07755

  38. [39]

    Openai gpt-5 system card, 2025

    Aaditya Singh et al. Openai gpt-5 system card, 2025. URL https://arxiv.org/abs/2601. 03267

  39. [40]

    Spacevista: All-scale visual spatial reasoning from mm to km.arXiv preprint arXiv:2510.09606, 2025

    Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, et al. Spacevista: All-scale visual spatial reasoning from mm to km.arXiv preprint arXiv:2510.09606, 2025. URLhttps://arxiv.org/abs/2510.09606

  40. [41]

    Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montser- rat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025. 12

  41. [42]

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

  42. [43]

    EvoLMM: Self-evolving large multimodal models with continuous rewards.arXiv preprint arXiv:2511.16672, 2025

    Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, and Fahad Khan. EvoLMM: Self-evolving large multimodal models with continuous rewards.arXiv preprint arXiv:2511.16672, 2025. URLhttps://arxiv.org/abs/2511.16672

  43. [44]

    Muirbench: A comprehensive benchmark for robust multi-image understanding.arXiv preprint arXiv:2406.09411, 2024

    Fei Wang, Xingyu Fu, James Y Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. Muirbench: A comprehensive benchmark for robust multi-image understanding.arXiv preprint arXiv:2406.09411, 2024

  44. [45]

    Self-improving multimodal reasoning with zero annotation.arXiv preprint arXiv:2601.10094, 2026

    Han Wang, Yi Yang, Jingyuan Hu, Minfeng Zhu, and Wei Chen. Self-improving multimodal reasoning with zero annotation.arXiv preprint arXiv:2601.10094, 2026. URL https://arxiv. org/abs/2601.10094

  45. [46]

    Measuring multimodal mathematical reasoning with math-vision dataset

    Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. Advances in Neural Information Processing Systems, 37:95095–95169, 2024

  46. [47]

    Spatial mental modeling from limited views

    Qineng Wang, Baiqiao Yin, Pingyue Zhang, et al. Spatial mental modeling from limited views. InarXiv preprint arXiv:2506.21458, 2025. URLhttps://arxiv.org/abs/2506.21458

  47. [48]

    Vision-zero: Scalable vlm self-improvement via strategic gamified self-play

    Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, and Wentian Zhao. Vision-zero: Scalable vlm self-improvement via strategic gamified self-play. arXiv preprint arXiv:2509.25541, 2025

  48. [49]

    Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural Information Processing Systems, 37: 113569–113697, 2024

    Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural Information Processing Systems, 37: 113569–113697, 2024

  49. [50]

    Toward training superintelligent software agents through self-play swe-rl.arXiv preprint arXiv:2512.18552, 2025

    Yuxiang Wei, Zhiqing Sun, Emily McMilin, Jonas Gehring, David Zhang, Gabriel Synnaeve, Daniel Fried, Lingming Zhang, and Sida Wang. Toward training superintelligent software agents through self-play swe-rl.arXiv preprint arXiv:2512.18552, 2025

  50. [51]

    Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747, 2025

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-MLLM: Boosting MLLM capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747, 2025. URL https://arxiv.org/abs/2505.23747

  51. [52]

    Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing

    Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

  52. [53]

    URLhttps://openreview.net/forum?id=yyWeSAsOhs

  53. [54]

    Chatting with images for introspective visual thinking.arXiv preprint arXiv:2602.11073, 2026

    Junfei Wu, Jian Guan, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tienie Tan. Chatting with images for introspective visual thinking.arXiv preprint arXiv:2602.11073, 2026. URL https://arxiv.org/abs/2602.11073

  54. [55]

    xAI. Grok 4. URLhttps://x.ai/news/grok-4. Model announcement

  55. [56]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025

  56. [58]

    URL https://arxiv.org/abs/2511.05491

  57. [59]

    Cambrian-S: Towards spatial supersensing in video

    Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, et al. Cambrian-S: Towards spatial supersensing in video. arXiv preprint arXiv:2511.04670, 2025. URL https://arxiv.org/abs/2511.04670

  58. [60]

    Mmsi-bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764, 2025

    Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, et al. Mmsi-bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764, 2025. URL https://arxiv.org/abs/2505.23764

  59. [61]

    Mindjourney: Test-time scaling with world models for spatial reasoning

    Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, and Chuang Gan. Mindjourney: Test-time scaling with world models for spatial reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=L2W4wQsNkY

  60. [62]

    Spell: Self-play reinforcement learning for evolving long-context language models. arXiv preprint arXiv:2509.23863, 2025

    Ziyi Yang, Weizhou Shen, Chenliang Li, Ruijun Chen, Fanqi Wan, Ming Yan, Xiaojun Quan, and Fei Huang. Spell: Self-play reinforcement learning for evolving long-context language models. arXiv preprint arXiv:2509.23863, 2025

  61. [63]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023

  62. [65]

    URL https://arxiv.org/abs/2401.10020

  63. [66]

    Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark

    Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134–15186, 2025

  64. [67]

    Zhenrui Yue, Kartikeya Upasani, Xianjun Yang, Suyu Ge, Shaoliang Nie, Yuning Mao, Zhe Liu, and Dong Wang. Dr. zero: Self-evolving search agents without training data. arXiv preprint arXiv:2601.07055, 2026

  65. [68]

    Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022

  66. [69]

    Think3d: Thinking with space for spatial reasoning. arXiv preprint arXiv:2601.13029, 2026

    Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, et al. Think3d: Thinking with space for spatial reasoning. arXiv preprint arXiv:2601.13029, 2026. URL https://arxiv.org/abs/2601.13029

  67. [70]

    Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025

    Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025. URL https://arxiv.org/abs/2505.03335

  68. [71]

    Spacemind: Camera-guided modality fusion for spatial reasoning in vision-language models. arXiv preprint arXiv:2511.23075, 2025

    Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, and Zizhuang Wei. Spacemind: Camera-guided modality fusion for spatial reasoning in vision-language models. arXiv preprint arXiv:2511.23075, 2025. URL https://arxiv.org/abs/2511.23075

  69. [72]

    Promptcot 2.0: Scaling prompt synthesis for large language model reasoning. arXiv preprint arXiv:2509.19894, 2025

    Xueliang Zhao, Wei Wu, Jian Guan, Zhuocheng Gong, and Lingpeng Kong. Promptcot 2.0: Scaling prompt synthesis for large language model reasoning. arXiv preprint arXiv:2509.19894, 2025

  70. [73]

    Swift: a scalable lightweight infrastructure for fine-tuning, 2024

    Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, and Yingda Chen. Swift: a scalable lightweight infrastructure for fine-tuning, 2024. URL https://arxiv.org/abs/2408.05517

  71. [74]

    Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

A Implementation Details

A.1 Difficulty Feedback Prompt

Starting from round t ≥ 2, the proposer’s...