Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

Chenming Zhu; Jiangmiao Pang; Jingli Lin; Peizhou Cao; Tai Wang; Xihui Liu; Yilin Long

arxiv: 2606.06476 · v1 · pith:QPEH64OYnew · submitted 2026-06-04 · 💻 cs.CV

Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao, Tai Wang, Jiangmiao Pang, Xihui Liu

Pith reviewed 2026-06-28 02:22 UTC · model grok-4.3

classification 💻 cs.CV

keywords spatial reasoning · vision-language models · world simulators · agentic reasoning · novel view synthesis · reinforcement learning for tool use

The pith

A VLM improves spatial reasoning by learning to request imagined viewpoints from a world simulator.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that vision-language models remain limited when spatial questions require layouts or viewpoints beyond the input images. It proposes letting the model call a world simulator during reasoning to produce new visual observations conditioned on camera motions. Both the simulator, tuned for consistent novel views, and the policy, trained via reinforcement learning to invoke the simulator selectively, are reported as necessary for the gains. The abstract gives one result for each: the tuned simulator raises a simulator-augmented Gemini-3-Flash on MMSI-Bench, and the trained policy raises its Qwen3-VL backbone on MMSI-Bench and MindCube. If the claim holds, models could handle tasks that currently demand inference about unseen spaces by generating their own visual evidence on demand.

Core claim

The paper claims that coupling an RL-trained VLM policy with a view-consistency-tuned world simulator creates an agentic loop in which the model acquires imagined visual evidence only when it improves over direct answering, producing measurable gains on spatial reasoning benchmarks that neither component achieves alone.

What carries the argument

The two-phase RL curriculum that stabilizes tool-use exploration by first teaching the VLM when imagined observations help and then refining selective invocation.

If this is right

Adding the trained simulator to an existing VLM raises spatial benchmark performance even without retraining the policy.
The learned policy reduces unnecessary simulator calls while preserving gains from useful ones.
Imagined observations supply cross-view consistency and layout information absent from the original egocentric inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selective-imagination loop could apply to other reasoning domains where generating hypothetical observations would reduce uncertainty.
Success may depend on scaling the simulator to handle longer or more complex motion sequences than those tested.
If the curriculum generalizes, similar agentic training could turn other generative tools into on-demand evidence sources for VLMs.

Load-bearing premise

The simulator must output novel views reliable and informative enough that they genuinely help the model reason better than direct answers from the given images.

What would settle it

Run the same spatial benchmarks with the world simulator disabled or with a random-invocation policy and observe whether scores remain at or above the levels achieved by the full trained system.
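
The comparison described above can be sketched as a small harness. This is a minimal illustration under stated assumptions, not the paper's evaluation code: each toy `Item` records whether a question would be answered correctly with and without an imagined view, and the names `accuracy`, `compare`, and the coin-flip probability `p` are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

# Toy item: (correct when answering directly, correct when using an imagined view).
# Real numbers would come from running the system; these are placeholders.
Item = Tuple[bool, bool]


def accuracy(items: List[Item], invoke: Callable[[int], bool]) -> float:
    """Accuracy (in percent) under a rule that maps item index -> call the simulator?"""
    hits = sum(with_view if invoke(i) else direct
               for i, (direct, with_view) in enumerate(items))
    return 100.0 * hits / len(items)


def compare(items: List[Item],
            learned_invoke: Callable[[int], bool],
            seed: int = 0,
            p: float = 0.5) -> Dict[str, float]:
    """Score the same items with the simulator disabled, called at random, and called by a policy."""
    rng = random.Random(seed)                      # fixed seed so the random baseline is repeatable
    coin = [rng.random() < p for _ in items]       # one coin flip per item
    return {
        "disabled": accuracy(items, lambda i: False),
        "random": accuracy(items, lambda i: coin[i]),
        "learned": accuracy(items, learned_invoke),
    }


# Toy data: imagination helps item 1, hurts item 2, and is irrelevant for items 0 and 3.
toy_items = [(True, True), (False, True), (True, False), (False, False)]
# Stand-in for a well-trained policy: call the simulator only where it helps.
scores = compare(toy_items, lambda i: toy_items[i][1] and not toy_items[i][0])
print(scores)
```

If the full system's advantage came only from extra context rather than selective imagination, the "random" and "learned" rows would be expected to land close together on real data.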

Original abstract

While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we study this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that empowers VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, we propose a world-simulator-in-the-loop two-phase RL curriculum to stabilize tool-use exploration and advance the model's ability to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world-model-augmented reasoning requires learning when, where, and how to imagine.
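
The interaction pattern the abstract describes, a policy that either answers or asks a simulator for a new view given a natural-language camera motion, can be sketched as a short loop. This is an illustrative sketch only: `Imagine`, `Answer`, `Episode`, `run_episode`, and the `max_calls` budget are hypothetical names and choices, and the toy policy and simulator stand in for Astra-VL and Astra-WM, whose actual interfaces the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union


@dataclass
class Imagine:
    """Action: request a novel view for a natural-language camera motion."""
    motion: str


@dataclass
class Answer:
    """Action: stop reasoning and commit to a final answer."""
    text: str


Action = Union[Imagine, Answer]


@dataclass
class Episode:
    question: str
    images: List[str]   # context images plus any imagined views (paths or ids)
    calls: int = 0      # simulator invocations so far


def run_episode(policy: Callable[[Episode], Action],
                simulator: Callable[[List[str], str], str],
                question: str,
                context_images: List[str],
                max_calls: int = 3) -> Tuple[Optional[str], int]:
    """Alternate between policy decisions and simulator calls until an answer is given."""
    ep = Episode(question, list(context_images))
    for _ in range(max_calls + 1):
        action = policy(ep)
        if isinstance(action, Answer):
            return action.text, ep.calls
        if ep.calls == max_calls:
            break  # budget spent and the policy still has not answered
        ep.images.append(simulator(ep.images, action.motion))
        ep.calls += 1
    return None, ep.calls


def toy_policy(ep: Episode) -> Action:
    # Stand-in policy: imagine one extra view, then answer.
    return Imagine("turn left 90 degrees") if ep.calls == 0 else Answer("B")


def toy_simulator(images: List[str], motion: str) -> str:
    # Stand-in simulator: returns a label instead of a rendered image.
    return f"imagined[{motion}]"


print(run_episode(toy_policy, toy_simulator,
                  "Which object is behind the sofa?", ["view_0.png"]))
```

Returning the call count alongside the answer makes it easy to measure how selective a policy is, which is the behavior the RL curriculum is said to target.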

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gets concrete benchmark lifts by training a VLM policy to query a view-consistency-tuned world simulator for spatial tasks, but the abstract gives no simulator diagnostics or curriculum ablations, so the source of the gains stays unclear.

The main takeaway is that Astra lets a VLM learn to call a world simulator for imagined novel views during reasoning, and this produces measurable gains on MMSI-Bench and MindCube. The new pieces are the RL policy that decides when to invoke the simulator and the two-phase curriculum meant to keep the calls selective.

The work does report specific improvements: Astra-WM raises simulator-augmented Gemini-3-Flash from 45.1 to 49.5 on MMSI-Bench, and Astra-VL lifts the Qwen3-VL backbone from 29.8 to 38.8 on the same benchmark and from 36.8 to 42.7 on MindCube. Those numbers are the clearest evidence offered.
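
To put the three reported deltas on a common footing, the short script below computes absolute and relative gains from the before/after scores quoted above. The dictionary labels and the `gains` helper are hypothetical names used only for this illustration.

```python
# Before/after scores as quoted from the abstract.
results = {
    "Astra-WM, Gemini-3-Flash, MMSI-Bench": (45.1, 49.5),
    "Astra-VL, Qwen3-VL, MMSI-Bench": (29.8, 38.8),
    "Astra-VL, Qwen3-VL, MindCube": (36.8, 42.7),
}


def gains(before: float, after: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), each rounded to one decimal."""
    absolute = round(after - before, 1)
    relative = round(100.0 * (after - before) / before, 1)
    return absolute, relative


for name, (before, after) in results.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute} points ({relative}% relative)")
# Astra-WM, Gemini-3-Flash, MMSI-Bench: +4.4 points (9.8% relative)
# Astra-VL, Qwen3-VL, MMSI-Bench: +9.0 points (30.2% relative)
# Astra-VL, Qwen3-VL, MindCube: +5.9 points (16.0% relative)
```

The simulator swap yields the smallest lift and the policy training on MMSI-Bench the largest, though the two comparisons use different base models and are not directly comparable.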

The soft spots sit where the abstract stops. There are no reported metrics on the simulator itself (pose accuracy, cross-view consistency scores, or human judgments of the generated views), and no ablation that isolates the two-phase curriculum from ordinary RL fine-tuning. Without those, it is hard to rule out that the gains come from extra tokens or training rather than useful imagined evidence. The necessity claim for both components therefore rests on assumptions that the provided text does not test.

This is for groups working on embodied VLMs and world-model augmentation for robotics. A reader who wants to see whether active imagination can be made practical will find a clear experimental setup to examine.

It deserves peer review because the idea is concrete, the benchmarks are relevant, and the results are stated numerically; the methods section will just need close checking on the missing diagnostics.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Astra, an agentic spatial reasoning framework that couples Astra-VL (an RL-trained VLM policy) with Astra-WM (a Bagel-based world simulator trained via view consistency tuning). It claims that this setup enables VLMs to actively acquire imagined novel-view observations during reasoning, with both components necessary, as shown by benchmark gains: Astra-WM lifts simulator-augmented Gemini-3-Flash from 45.1 to 49.5 on MMSI-Bench, while Astra-VL lifts Qwen3-VL from 29.8 to 38.8 on MMSI-Bench and 36.8 to 42.7 on MindCube.

Significance. If the central claims hold after addressing the gaps below, the work would be significant for agentic visual reasoning by demonstrating a concrete mechanism for world-model-augmented imagination; explicit credit is due to the two-phase RL curriculum for stabilizing tool-use exploration and the view consistency tuning for improving cross-view reliability.

major comments (3)

[Abstract] Abstract: the necessity claim for both Astra-WM and Astra-VL rests on the reported deltas (45.1→49.5, 29.8→38.8, 36.8→42.7), yet no quantitative simulator diagnostics (pose error, cross-view LPIPS, or human preference on generated vs. real views) are supplied to confirm that the novel views are sufficiently reliable and informative.
[Experiments] Experiments section: no ablation isolating the two-phase RL curriculum from plain RL fine-tuning is reported, leaving open whether the observed gains arise from genuine selective imagination or from extra context tokens and training; this directly undermines the load-bearing claim that the policy learns 'when, where, and how to imagine.'
[Results] Results: the abstract and results supply no details on baselines, statistical tests, data splits, or potential confounds, which is required to substantiate that the gains support the central claim rather than artifacts of evaluation design.
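
As one example of the kind of statistical check the last comment asks for, the sketch below computes a Wilson score interval for a reported accuracy. It is an illustration, not something the manuscript reports: the benchmark size `n = 1000` is a placeholder assumption, and `wilson_interval` is a hypothetical helper implementing the standard Wilson formula.

```python
import math


def wilson_interval(accuracy_pct: float, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval (in percent) for an accuracy measured on n items."""
    p = accuracy_pct / 100.0
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return 100.0 * (centre - half), 100.0 * (centre + half)


n = 1000  # placeholder benchmark size, not taken from the paper
base_lo, base_hi = wilson_interval(29.8, n)    # Qwen3-VL backbone on MMSI-Bench
astra_lo, astra_hi = wilson_interval(38.8, n)  # Astra-VL on MMSI-Bench
print(f"backbone: [{base_lo:.1f}, {base_hi:.1f}]")
print(f"Astra-VL: [{astra_lo:.1f}, {astra_hi:.1f}]")
print("intervals overlap:", base_hi >= astra_lo)
```

Because both systems are scored on the same questions, a paired test such as McNemar's or a paired bootstrap over per-item correctness would be more sensitive than comparing two independent intervals; it requires per-item outputs that the abstract does not provide.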

minor comments (1)

[Introduction] The distinction between Astra-VL (policy) and Astra-WM (simulator) could be introduced with a single clarifying sentence in the introduction to avoid any initial notation ambiguity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important gaps in substantiating the central claims about component necessity and evaluation rigor. We address each point below and will revise the manuscript accordingly.

Referee: [Abstract] Abstract: the necessity claim for both Astra-WM and Astra-VL rests on the reported deltas (45.1→49.5, 29.8→38.8, 36.8→42.7), yet no quantitative simulator diagnostics (pose error, cross-view LPIPS, or human preference on generated vs. real views) are supplied to confirm that the novel views are sufficiently reliable and informative.

Authors: We agree that the abstract's necessity claim would be strengthened by explicit simulator quality metrics. The manuscript describes view consistency tuning but does not report numerical diagnostics such as pose error, LPIPS, or human preference scores. In revision we will add these quantitative results (drawn from our internal evaluations) to the abstract and results sections to directly support the reliability of the imagined views.

revision: yes

Referee: [Experiments] Experiments section: no ablation isolating the two-phase RL curriculum from plain RL fine-tuning is reported, leaving open whether the observed gains arise from genuine selective imagination or from extra context tokens and training; this directly undermines the load-bearing claim that the policy learns 'when, where, and how to imagine.'

Authors: The two-phase curriculum is presented as key to stabilizing tool-use exploration. However, the current manuscript does not include an ablation against standard RL fine-tuning. This is a substantive gap. We will add the requested ablation study in the revised experiments section, reporting performance with and without the two-phase structure to isolate its contribution to selective imagination.

revision: yes

Referee: [Results] Results: the abstract and results supply no details on baselines, statistical tests, data splits, or potential confounds, which is required to substantiate that the gains support the central claim rather than artifacts of evaluation design.

Authors: We acknowledge that the provided abstract and results excerpt lack explicit discussion of baselines, statistical significance, data splits, and confounds. The full manuscript contains some baseline comparisons, but additional details are needed. In revision we will expand the results section with data-split information, statistical tests or confidence intervals, and explicit discussion of potential confounds to strengthen the evaluation.

revision: partial

Circularity Check

0 steps flagged

No circularity; empirical gains reported as direct experimental outcomes

The paper's central claims rest on benchmark deltas (MMSI-Bench, MindCube) obtained after training Astra-WM with view consistency tuning and Astra-VL with two-phase RL. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce the reported improvements to inputs by construction. The evaluation is externally falsifiable via the cited benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters, axioms, or invented entities beyond the high-level framework components; training involves unspecified RL rewards and consistency losses.

pith-pipeline@v0.9.1-grok · 5855 in / 1052 out tokens · 60180 ms · 2026-06-28T02:22:17.945339+00:00 · methodology

Reference graph

Works this paper leans on

52 extracted references · 37 canonical work pages · 17 internal anchors

[1]

Mindcube: Spatial mental modeling from limited views, 2026

Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei-Fei, and Manling Li. Mindcube: Spatial mental modeling from limited views, 2026. URL https://arxiv.org/abs/2506.21458

work page arXiv 2026
[2]

Mmsi-bench: A benchmark for multi-image spatial intelligence

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. Mmsi-bench: A benchmark for multi-image spatial intelligence. In ICLR, 2025

2025
[3]

Ost-bench: Evaluating the capabilities of mllms in online spatio-temporal scene understanding

JingLi Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, and Jiangmiao Pang. Ost-bench: Evaluating the capabilities of mllms in online spatio-temporal scene understanding. arXiv preprint arXiv:2507.07984, 2025

work page arXiv 2025
[4]

Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence

Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, Wenbo Hu, Dahua Lin, Tai Wang, and Jiangmiao Pang. Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence, 2025. URL https://arxiv.org/abs/2512.10863

work page arXiv 2025
[5]

Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models

Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, and Yueting Zhuang. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models, 2025. URL https://arxiv.org/abs/2505.21500

work page arXiv 2025
[6]

3dsrbench: A comprehensive 3d spatial reasoning benchmark.arXiv preprint arXiv:2412.07825, 2024

Wufei Ma, Haoyu Chen, Guofeng Zhang, Celso M de Melo, Jieneng Chen, and Alan Yuille. 3dsrbench: A comprehensive 3d spatial reasoning benchmark.arXiv preprint arXiv:2412.07825, 2024

work page arXiv 2024
[8]

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces.arXiv:2412.14171, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning

Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, et al. Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning. arXiv preprint arXiv:2512.24330, 2025

work page arXiv 2025
[10]

Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025

Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, and Hengshuang Zhao. Visual spatial tuning, 2025. URL https://arxiv.org/abs/2511.05491

work page arXiv 2025
[11]

Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024

Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024

work page arXiv 2024
[12]

Ross3d: Reconstructive visual instruction tuning with 3d-awareness.arXiv preprint arXiv:2504.01901, 2025

Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, and Zhaoxiang Zhang. Ross3d: Reconstructive visual instruction tuning with 3d-awareness.arXiv preprint arXiv:2504.01901, 2025

work page arXiv 2025
[13]

Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, and Rakesh Ranjan. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. URL https://arxiv.org/abs/2505.20279

work page internal anchor Pith review Pith/arXiv arXiv
[15]

G2vlm: Geometry grounded vision language model with unified 3d reconstruction and spatial reasoning

Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, and Jiangmiao Pang. G2vlm: Geometry grounded vision language model with unified 3d reconstruction and spatial reasoning. arXiv preprint arXiv:2511.21688, 2025. URL https://arxiv.org/abs/2511.21688

work page arXiv 2025
[16]

Geometrically-constrained agent for spatial reasoning.arXiv preprint arXiv:2511.22659, 2025

Zeren Chen, Xiaoya Lu, Zhijie Zheng, Pengrui Li, Lehan He, Yijin Zhou, Jing Shao, Bohan Zhuang, and Lu Sheng. Geometrically-constrained agent for spatial reasoning, 2025. URL https://arxiv.org/abs/2511.22659

work page arXiv 2025
[17]

Tiger: Tool-integrated geometric reasoning in vision-language models for robotics.arXiv preprint arXiv:2510.07181, 2025

Yi Han, Enshen Zhou, Shanyu Rong, Jingkun An, Pengwei Wang, Zhongyuan Wang, Cheng Chi, Lu Sheng, and Shanghang Zhang. Tiger: Tool-integrated geometric reasoning in vision-language models for robotics, 2026. URL https://arxiv.org/abs/2510.07181

work page arXiv 2026
[18]

Cooper: A unified model for cooperative perception and reasoning in spatial intelligence.arXiv preprint arXiv:2512.04563, 2025

Zefeng Zhang, Xiangzhao Hao, Hengzhu Tang, Zhenyu Zhang, Jiawei Sheng, Xiaodong Li, Zhenyang Li, Li Gao, Daiting Shi, Dawei Yin, et al. Cooper: A unified model for cooperative perception and reasoning in spatial intelligence.arXiv preprint arXiv:2512.04563, 2025

work page arXiv 2025
[19]

Introducing o3 and o4 mini.https://openai.com/index/introducing-o3-and-o4-mini/, 2025

OpenAI. Introducing o3 and o4 mini.https://openai.com/index/introducing-o3-and-o4-mini/, 2025

2025
[20]

DeepEyesV2: Toward Agentic Multimodal Model

Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model.arXiv preprint arXiv:2511.05271, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers.arXiv preprint arXiv:2506.23918, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Thyme: Think Beyond Images

Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images.arXiv preprint arXiv:2508.11630, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, et al. Openthinkimg: Learning to think with images via visual tool reinforcement learning. arXiv preprint arXiv:2505.08617, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing" thinking with images" via reinforcement learning.arXiv preprint arXiv:2505.14362, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Visual generation unlocks human-like reasoning through multimodal world models.arXiv preprint arXiv:2601.19834, 2026

Jialong Wu, Xiaoying Zhang, Hongyi Yuan, Xiangcheng Zhang, Tianhao Huang, Changjing He, Chaoyi Deng, Renrui Zhang, Youbin Wu, and Mingsheng Long. Visual generation unlocks human-like reasoning through multimodal world models.arXiv preprint arXiv:2601.19834, 2026

work page arXiv 2026
[28]

Isaac sim: Robotics simulation and synthetic data generation.https://developer.nvidia.com/isaac/sim, 2025

NVIDIA. Isaac sim: Robotics simulation and synthetic data generation.https://developer.nvidia.com/isaac/sim, 2025

2025
[29]

Scannet++: A high-fidelity dataset of 3d indoor scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023

2023
[30]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017

2017
[31]

Matterport3D: Learning from RGB-D Data in Indoor Environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.arXiv preprint arXiv:1709.06158, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[32]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

2024
[33]

ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (R...

2021
[34]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys ’25, page 1279–1297. ACM, March 2025. doi: 10.1145/3689031.3696075. URL http://dx.doi.org/10.1145/3689031.3696075

work page doi:10.1145/3689031.3696075 2025
[35]

vllm-omni: Fully disaggregated serving for any-to-any multimodal models

Peiqi Yin, Jiangyun Zhu, Han Gao, Chenguang Zheng, Yongxiang Huang, Taichang Zhou, Ruirui Yang, Weizhi Liu, Weiqing Chen, Canlin Guo, et al. vllm-omni: Fully disaggregated serving for any-to-any multimodal models. arXiv preprint arXiv:2602.02204, 2026

work page arXiv 2026
[36]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models

Wufei Ma, Luoxin Ye, Celso M de Melo, Alan Yuille, and Jieneng Chen. Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17249–17260, 2025

2025
[39]

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Spatialladder: Progressive training for spatial reasoning in vision-language models.arXiv preprint arXiv:2510.08531, 2025

Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. Spatialladder: Progressive training for spatial reasoning in vision-language models. arXiv preprint arXiv:2510.08531, 2025

work page arXiv 2025
[41]

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029, 2025

BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029, 2025

work page arXiv 2025
[44]

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing.arXiv preprint arXiv:2506.09965, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Vlaser: Vision-language-action model with synergistic embodied reasoning.arXiv preprint arXiv:2510.11027, 2025

Ganlin Yang, Tianyi Zhang, Haoran Hao, Weiyun Wang, Yibin Liu, Dehui Wang, Guanzhou Chen, Zijian Cai, Junting Chen, Weijie Su, et al. Vlaser: Vision-language-action model with synergistic embodied reasoning.arXiv preprint arXiv:2510.11027, 2025

work page arXiv 2025
[46]

Glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025

2025
[47]

Gpt-4o.https://openai.com/index/hello-gpt-4o/, 2024

OpenAI. Gpt-4o.https://openai.com/index/hello-gpt-4o/, 2024

2024
[48]

Gemini 2.5: Our most intelligent ai model.https://blog.google/technology/google-deepmind/ gemini-model-thinking-updates-march-2025/, 2025

Google DeepMind. Gemini 2.5: Our most intelligent ai model.https://blog.google/technology/google-deepmind/ gemini-model-thinking-updates-march-2025/, 2025

2025
[49]

Gemini 3 flash.https://deepmind.google/models/gemini/flash/, December 2025

Google DeepMind. Gemini 3 flash.https://deepmind.google/models/gemini/flash/, December 2025. Model card and product documentation

2025
[50]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2024. URL https://arxiv.org/abs/2303.05499

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Scaling spa...

work page arXiv 2026
[52]

Scannet license. https://kaldir.vc.in.tum.de/scannet/ScanNet_TOS.pdf
[53]

Matterport3d license. https://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf

A Additional Details on Training Data

A.1 World Simulator SFT Data

To equip the world simulator (i.e., Bagel) with strong novel view synthesis capabilities, we construct a large-scale training dataset consisting of tuples (I_ctx, p, I_tgt), where I_ctx denotes a set of context images,...

[1] [1]

Mindcube: Spatial mental modeling from limited views, 2026

Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei-Fei, and Manling Li. Mindcube: Spatial mental modeling from limited views, 2026. URLhttps://arxiv.org/abs/2506.21458

work page arXiv 2026

[2] [2]

Mmsi-bench: A benchmark for multi-image spatial intelligence

Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. Mmsi-bench: A benchmark for multi-image spatial intelligence. InICLR, 2025

2025

[3] [3]

arXiv preprint arXiv:2507.07984 , year=

JingLi Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, and Jiangmiao Pang. Ost- bench: Evaluating the capabilities of mllms in online spatio-temporal scene understanding.arXiv preprint arXiv:2507.07984, 2025

work page arXiv 2025

[4] Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, Wenbo Hu, Dahua Lin, Tai Wang, and Jiangmiao Pang. Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence, 2025. URL https://arxiv.org/abs/2512.10863

[5] Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, and Yueting Zhuang. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models, 2025. URL https://arxiv.org/abs/2505.21500

[6] Wufei Ma, Haoyu Chen, Guofeng Zhang, Celso M de Melo, Jieneng Chen, and Alan Yuille. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. arXiv preprint arXiv:2412.07825, 2024

[8] Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv:2412.14171, 2024

[9] Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, et al. Sensenova-mars: Empowering multimodal agentic reasoning and search via reinforcement learning. arXiv preprint arXiv:2512.24330, 2025

[10] Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, and Hengshuang Zhao. Visual spatial tuning, 2025. URL https://arxiv.org/abs/2511.05491

[11] Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness. arXiv preprint arXiv:2409.18125, 2024

[12] Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, and Zhaoxiang Zhang. Ross3d: Reconstructive visual instruction tuning with 3d-awareness. arXiv preprint arXiv:2504.01901, 2025

[13] Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, and Rakesh Ranjan. Vlm-3r: Vision-language models augmented with instruction-aligned 3d reconstruction. URL https://arxiv.org/abs/2505.20279

[15] Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, and Jiangmiao Pang. G2vlm: Geometry grounded vision language model with unified 3d reconstruction and spatial reasoning. arXiv preprint arXiv:2511.21688, 2025. URL https://arxiv.org/abs/2511.21688

[16] Zeren Chen, Xiaoya Lu, Zhijie Zheng, Pengrui Li, Lehan He, Yijin Zhou, Jing Shao, Bohan Zhuang, and Lu Sheng. Geometrically-constrained agent for spatial reasoning, 2025. URL https://arxiv.org/abs/2511.22659

[17] Yi Han, Enshen Zhou, Shanyu Rong, Jingkun An, Pengwei Wang, Zhongyuan Wang, Cheng Chi, Lu Sheng, and Shanghang Zhang. Tiger: Tool-integrated geometric reasoning in vision-language models for robotics, 2026. URL https://arxiv.org/abs/2510.07181

[18] Zefeng Zhang, Xiangzhao Hao, Hengzhu Tang, Zhenyu Zhang, Jiawei Sheng, Xiaodong Li, Zhenyang Li, Li Gao, Daiting Shi, Dawei Yin, et al. Cooper: A unified model for cooperative perception and reasoning in spatial intelligence. arXiv preprint arXiv:2512.04563, 2025

[19] OpenAI. Introducing o3 and o4 mini. https://openai.com/index/introducing-o3-and-o4-mini/, 2025

[20] Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, and Xing Yu. Deepeyesv2: Toward agentic multimodal model. arXiv preprint arXiv:2511.05271, 2025

[21] Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918, 2025

[22] Yi-Fan Zhang, Xingyu Lu, Shukang Yin, Chaoyou Fu, Wei Chen, Xiao Hu, Bin Wen, Kaiyu Jiang, Changyi Liu, Tianke Zhang, et al. Thyme: Think beyond images. arXiv preprint arXiv:2508.11630, 2025

[23] Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, et al. Openthinkimg: Learning to think with images via visual tool reinforcement learning. arXiv preprint arXiv:2505.08617, 2025

[24] Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025

[25] Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025

[26] Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search. arXiv preprint arXiv:2509.07969, 2025

[27] Jialong Wu, Xiaoying Zhang, Hongyi Yuan, Xiangcheng Zhang, Tianhao Huang, Changjing He, Chaoyi Deng, Renrui Zhang, Youbin Wu, and Mingsheng Long. Visual generation unlocks human-like reasoning through multimodal world models. arXiv preprint arXiv:2601.19834, 2026

[28] NVIDIA. Isaac sim: Robotics simulation and synthetic data generation. https://developer.nvidia.com/isaac/sim, 2025

[29] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023

[30] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017

[31] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017

[32] Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

[33] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (R...

[34] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys ’25, page 1279–1297. ACM, March 2025. doi: 10.1145/3689031.3696075. URL http://dx.doi.org/10.1145/3689031.3696075

[35] Peiqi Yin, Jiangyun Zhu, Han Gao, Chenguang Zheng, Yongxiang Huang, Taichang Zhou, Ruirui Yang, Weizhi Liu, Weiqing Chen, Canlin Guo, et al. vllm-omni: Fully disaggregated serving for any-to-any multimodal models. arXiv preprint arXiv:2602.02204, 2026

[36] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

[37] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025

[38] Wufei Ma, Luoxin Ye, Celso M de Melo, Alan Yuille, and Jieneng Chen. Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17249–17260, 2025

[39] Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747, 2025

[40] Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. Spatialladder: Progressive training for spatial reasoning in vision-language models. arXiv preprint arXiv:2510.08531, 2025

[41] Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint arXiv:2504.01805, 2025

[42] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025

[43] BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, et al. Robobrain 2.0 technical report. arXiv preprint arXiv:2507.02029, 2025

[44] Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing. arXiv preprint arXiv:2506.09965, 2025

[45] Ganlin Yang, Tianyi Zhang, Haoran Hao, Weiyun Wang, Yibin Liu, Dehui Wang, Guanzhou Chen, Zijian Cai, Junting Chen, Weijie Su, et al. Vlaser: Vision-language-action model with synergistic embodied reasoning. arXiv preprint arXiv:2510.11027, 2025

[46] Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv e-prints, pages arXiv–2507, 2025

[47] OpenAI. Gpt-4o. https://openai.com/index/hello-gpt-4o/, 2024

[48] Google DeepMind. Gemini 2.5: Our most intelligent ai model. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/, 2025

[49] Google DeepMind. Gemini 3 flash. https://deepmind.google/models/gemini/flash/, December 2025. Model card and product documentation

[50] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2024. URL https://arxiv.org/abs/2303.05499

[51] Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Scaling spatial intelligence with multimodal foundation models. arXiv preprint arXiv:2511.13719...

[52] Scannet license. https://kaldir.vc.in.tum.de/scannet/ScanNet_TOS.pdf

[53] Matterport3d license. https://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf

A Additional Details on Training Data

A.1 World Simulator SFT Data

To equip the world simulator (i.e., Bagel) with strong novel view synthesis capabilities, we construct a large-scale training dataset consisting of tuples $(I_{ctx}, p, I_{tgt})$, where $I_{ctx}$ denotes a set of context images,...