
arxiv: 2606.06476 · v1 · pith:QPEH64OYnew · submitted 2026-06-04 · 💻 cs.CV

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

Pith reviewed 2026-06-28 02:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords spatial reasoning · vision-language models · world simulators · agentic reasoning · novel view synthesis · reinforcement learning for tool use

The pith

A VLM improves spatial reasoning by learning to request imagined viewpoints from a world simulator.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that vision-language models remain limited when spatial questions require layouts or viewpoints beyond the input images. It proposes solving this by letting the model actively call a world simulator during reasoning to produce new visual observations conditioned on camera motions. The paper presents both the simulator (Astra-WM), tuned for consistent novel views, and the policy (Astra-VL), trained via reinforcement learning to invoke the simulator selectively, as necessary for the gains. The reported evidence is per component: Astra-WM raises simulator-augmented Gemini-3-Flash from 45.1 to 49.5 on MMSI-Bench, and Astra-VL raises the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. If the claim holds, models could handle tasks that currently demand inference about unseen spaces by generating their own visual evidence on demand.

Core claim

The paper claims that coupling an RL-trained VLM policy with a view-consistency-tuned world simulator creates an agentic loop in which the model acquires imagined visual evidence only when it improves over direct answering, producing measurable gains on spatial reasoning benchmarks that neither component achieves alone.
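The loop described in the core claim can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `Step`, `imagination_loop`, `toy_policy`, `toy_simulator`, the `max_calls` budget, and the string stand-in for images are all hypothetical names and choices introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# All names below are hypothetical stand-ins, not the paper's API.
Image = str  # placeholder: an image is represented by a string label


@dataclass
class Step:
    """One policy decision: a camera-motion request or a final answer."""
    motion: Optional[str] = None  # e.g. "turn left 90 degrees"
    answer: Optional[str] = None


def imagination_loop(
    policy: Callable[[str, List[Image]], Step],
    simulator: Callable[[List[Image], str], Image],
    question: str,
    images: List[Image],
    max_calls: int = 3,
) -> Tuple[str, int]:
    """Run the policy until it answers; return (answer, simulator_calls)."""
    context = list(images)
    calls = 0
    while True:
        step = policy(question, context)
        if step.answer is not None:
            return step.answer, calls
        if calls >= max_calls or step.motion is None:
            # Budget exhausted or malformed step: stop with a fallback answer.
            return "unknown", calls
        # The imagined view joins the context for the next decision.
        context.append(simulator(context, step.motion))
        calls += 1


def toy_simulator(context: List[Image], motion: str) -> Image:
    return f"imagined[{motion}]"


def toy_policy(question: str, context: List[Image]) -> Step:
    # Request one imagined view, then answer once it is in context.
    if any(img.startswith("imagined") for img in context):
        return Step(answer="left")
    return Step(motion="turn left 90 degrees")


answer, calls = imagination_loop(
    toy_policy, toy_simulator, "Where is the door?", ["view_0"]
)
print(answer, calls)
```

A policy that answers directly returns with zero simulator calls, which is the behavior the selective-invocation claim depends on.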

What carries the argument

The two-phase RL curriculum that stabilizes tool-use exploration by first teaching the VLM when imagined observations help and then refining selective invocation.

If this is right

  • Adding the trained simulator to an existing VLM raises spatial benchmark performance even without retraining the policy.
  • The learned policy reduces unnecessary simulator calls while preserving gains from useful ones.
  • Imagined observations supply cross-view consistency and layout information absent from the original egocentric inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selective-imagination loop could apply to other reasoning domains where generating hypothetical observations would reduce uncertainty.
  • Success may depend on scaling the simulator to handle longer or more complex motion sequences than those tested.
  • If the curriculum generalizes, similar agentic training could turn other generative tools into on-demand evidence sources for VLMs.

Load-bearing premise

The simulator must output novel views reliable and informative enough that they genuinely help the model reason better than direct answers from the given images.

What would settle it

Run the same spatial benchmarks with the world simulator disabled or with a random-invocation policy and observe whether scores remain at or above the levels achieved by the full trained system.
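The shape of that comparison can be sketched with a toy harness. Everything here is an assumption made for illustration: the `ITEMS` labels, the `is_correct` outcome model, and the four invocation policies are invented, and the "selective" policy reads an oracle label that a real policy would not have.

```python
import random
from typing import Callable, Dict, List, Tuple

# Toy data: each item records how an imagined view would affect the answer.
#  +1: the view is required, 0: the view is irrelevant, -1: the view misleads.
ITEMS: List[int] = [+1, +1, 0, 0, 0, -1, +1, 0, -1, 0]


def is_correct(effect: int, called: bool) -> bool:
    """Toy outcome model (an assumption, not the paper's evaluation)."""
    if effect == +1:
        return called
    if effect == -1:
        return not called
    return True


def evaluate(policy: Callable[[int], bool], items: List[int]) -> Tuple[float, float]:
    """Return (accuracy, simulator call rate) for an invocation policy."""
    decisions = [policy(effect) for effect in items]
    correct = sum(is_correct(e, c) for e, c in zip(items, decisions))
    return correct / len(items), sum(decisions) / len(items)


rng = random.Random(0)
policies: Dict[str, Callable[[int], bool]] = {
    "never": lambda effect: False,             # simulator disabled
    "always": lambda effect: True,             # simulator on every item
    "random": lambda effect: rng.random() < 0.5,
    "selective": lambda effect: effect == +1,  # oracle upper bound
}
results = {name: evaluate(p, ITEMS) for name, p in policies.items()}
for name, (acc, rate) in results.items():
    print(f"{name:9s} accuracy={acc:.2f} call_rate={rate:.2f}")
```

In this toy setup the selective policy is the only one that reaches full accuracy, and it does so at a lower call rate than always calling. The paper's claim would be undercut if, on the real benchmarks, the "never" or "random" rows matched the trained system.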

Original abstract

While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we study this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that empowers VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, we propose a world-simulator-in-the-loop two-phase RL curriculum to stabilize tool-use exploration and advance the model's ability to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world-model-augmented reasoning requires learning when, where, and how to imagine.
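For scale, the three score changes quoted in the abstract can be turned into absolute and relative gains. The scores are copied from the abstract; the `gains` helper and the setting labels are ours, and "points" simply means the unit in which each benchmark reports its score.

```python
# Scores copied from the abstract: (setting, benchmark, before, after).
REPORTED = [
    ("Gemini-3-Flash, simulator-augmented (Astra-WM)", "MMSI-Bench", 45.1, 49.5),
    ("Qwen3-VL backbone (Astra-VL)", "MMSI-Bench", 29.8, 38.8),
    ("Qwen3-VL backbone (Astra-VL)", "MindCube", 36.8, 42.7),
]


def gains(before: float, after: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(after - before, 1)
    relative = round(100.0 * (after - before) / before, 1)
    return absolute, relative


table = [(s, b, *gains(lo, hi)) for s, b, lo, hi in REPORTED]
for setting, bench, absolute, relative in table:
    print(f"{setting} on {bench}: +{absolute} points ({relative}% relative)")
```

The gains work out to +4.4 points (9.8% relative), +9.0 points (30.2% relative), and +5.9 points (16.0% relative). The largest change is for the Qwen3-VL backbone on MMSI-Bench.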

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Astra, an agentic spatial reasoning framework that couples Astra-VL (an RL-trained VLM policy) with Astra-WM (a Bagel-based world simulator trained via view consistency tuning). It claims that this setup enables VLMs to actively acquire imagined novel-view observations during reasoning, with both components necessary, as shown by benchmark gains: Astra-WM lifts simulator-augmented Gemini-3-Flash from 45.1 to 49.5 on MMSI-Bench, while Astra-VL lifts Qwen3-VL from 29.8 to 38.8 on MMSI-Bench and 36.8 to 42.7 on MindCube.

Significance. If the central claims hold after addressing the gaps below, the work would be significant for agentic visual reasoning by demonstrating a concrete mechanism for world-model-augmented imagination; explicit credit is due to the two-phase RL curriculum for stabilizing tool-use exploration and the view consistency tuning for improving cross-view reliability.

major comments (3)
  1. [Abstract] Abstract: the necessity claim for both Astra-WM and Astra-VL rests on the reported deltas (45.1→49.5, 29.8→38.8, 36.8→42.7), yet no quantitative simulator diagnostics (pose error, cross-view LPIPS, or human preference on generated vs. real views) are supplied to confirm that the novel views are sufficiently reliable and informative.
  2. [Experiments] Experiments section: no ablation isolating the two-phase RL curriculum from plain RL fine-tuning is reported, leaving open whether the observed gains arise from genuine selective imagination or from extra context tokens and training; this directly undermines the load-bearing claim that the policy learns 'when, where, and how to imagine.'
  3. [Results] Results: the abstract and results supply no details on baselines, statistical tests, data splits, or potential confounds, which are required to substantiate that the gains support the central claim rather than artifacts of evaluation design.
minor comments (1)
  1. [Introduction] The distinction between Astra-VL (policy) and Astra-WM (simulator) could be introduced with a single clarifying sentence in the introduction to avoid any initial notation ambiguity.
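One of the simulator diagnostics requested in major comment 1, pose error, has a standard form: the geodesic angle between the commanded camera rotation and the rotation recovered from the generated view. The sketch below covers only that rotational term. How the recovered rotation is obtained (for example, from an external pose estimator) is left open, and the function names are ours.

```python
import math
from typing import List

Matrix = List[List[float]]


def yaw_rotation(degrees: float) -> Matrix:
    """Rotation about the vertical (y) axis by the given angle."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rotation_error_deg(r_pred: Matrix, r_true: Matrix) -> float:
    """Geodesic angle between two rotation matrices, in degrees."""
    # trace(R_pred^T R_true) equals the sum of elementwise products.
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


# Hypothetical case: the instruction asked for a 90 degree turn,
# but the generated view corresponds to an 80 degree turn.
error = rotation_error_deg(yaw_rotation(80.0), yaw_rotation(90.0))
print(f"rotation error: {error:.2f} degrees")
```

Averaging this error over a held-out set of camera motions would give one of the numbers the referee asks for; translation error and perceptual metrics would need separate treatment.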

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important gaps in substantiating the central claims about component necessity and evaluation rigor. We address each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the necessity claim for both Astra-WM and Astra-VL rests on the reported deltas (45.1→49.5, 29.8→38.8, 36.8→42.7), yet no quantitative simulator diagnostics (pose error, cross-view LPIPS, or human preference on generated vs. real views) are supplied to confirm that the novel views are sufficiently reliable and informative.

    Authors: We agree that the abstract's necessity claim would be strengthened by explicit simulator quality metrics. The manuscript describes view consistency tuning but does not report numerical diagnostics such as pose error, LPIPS, or human preference scores. In revision we will add these quantitative results (drawn from our internal evaluations) to the abstract and results sections to directly support the reliability of the imagined views. revision: yes

  2. Referee: [Experiments] Experiments section: no ablation isolating the two-phase RL curriculum from plain RL fine-tuning is reported, leaving open whether the observed gains arise from genuine selective imagination or from extra context tokens and training; this directly undermines the load-bearing claim that the policy learns 'when, where, and how to imagine.'

    Authors: The two-phase curriculum is presented as key to stabilizing tool-use exploration. However, the current manuscript does not include an ablation against standard RL fine-tuning. This is a substantive gap. We will add the requested ablation study in the revised experiments section, reporting performance with and without the two-phase structure to isolate its contribution to selective imagination. revision: yes

  3. Referee: [Results] Results: the abstract and results supply no details on baselines, statistical tests, data splits, or potential confounds, which are required to substantiate that the gains support the central claim rather than artifacts of evaluation design.

    Authors: We acknowledge that the provided abstract and results excerpt lack explicit discussion of baselines, statistical significance, data splits, and confounds. The full manuscript contains some baseline comparisons, but additional details are needed. In revision we will expand the results section with data-split information, statistical tests or confidence intervals, and explicit discussion of potential confounds to strengthen the evaluation. revision: partial
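One common way to attach the promised confidence intervals to an accuracy gain is a paired bootstrap over benchmark items. The sketch below is an illustration under stated assumptions: the per-item outcomes are toy values invented here, not results from the paper, and `paired_bootstrap_ci` is a hypothetical helper rather than anything the authors describe.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    base_correct: List[int],
    new_correct: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed accuracy gain, CI lower, CI upper) over paired items."""
    assert len(base_correct) == len(new_correct) and base_correct
    n = len(base_correct)
    # Per-item difference keeps the pairing between the two systems.
    diffs = [new - base for base, new in zip(base_correct, new_correct)]
    observed = sum(diffs) / n
    rng = random.Random(seed)
    # Resample items with replacement and recompute the mean gain each time.
    stats = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper


# Toy per-item outcomes (1 = correct), invented for illustration only.
base = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
new = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
observed, lower, upper = paired_bootstrap_ci(base, new)
print(f"gain={observed:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

An interval that excludes zero on the real per-item outcomes would address the referee's concern about statistical tests; it would not address data splits or confounds.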

Circularity Check

0 steps flagged

No circularity; empirical gains reported as direct experimental outcomes

full rationale

The paper's central claims rest on benchmark deltas (MMSI-Bench, MindCube) obtained after training Astra-WM with view consistency tuning and Astra-VL with two-phase RL. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce the reported improvements to inputs by construction. The evaluation is externally falsifiable via the cited benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters, axioms, or invented entities beyond the high-level framework components; training involves unspecified RL rewards and consistency losses.

pith-pipeline@v0.9.1-grok · 5855 in / 1052 out tokens · 60180 ms · 2026-06-28T02:22:17.945339+00:00 · methodology

