Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators
Pith reviewed 2026-06-28 02:22 UTC · model grok-4.3
The pith
A VLM improves spatial reasoning by learning to request imagined viewpoints from a world simulator.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that coupling an RL-trained VLM policy with a view-consistency-tuned world simulator creates an agentic loop in which the model acquires imagined visual evidence only when that evidence improves on direct answering. The loop is reported to produce measurable gains on spatial reasoning benchmarks that neither component achieves alone.
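The loop described in the core claim can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Imagine` and `Answer` action types, the `run_episode` function, the call budget `max_calls`, and the toy policy and simulator are all hypothetical stand-ins for Astra-VL and Astra-WM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class Imagine:
    """Request a novel view; `motion` is a natural-language camera motion."""
    motion: str


@dataclass
class Answer:
    """Stop and answer from the views gathered so far."""
    text: str


Action = Union[Imagine, Answer]
View = str  # hypothetical stand-in for an image


def run_episode(
    policy: Callable[[str, List[View], bool], Action],
    simulator: Callable[[List[View], str], View],
    question: str,
    images: List[View],
    max_calls: int = 2,
) -> Tuple[str, int]:
    """Run one question; return the answer and the number of simulator calls."""
    views = list(images)
    for calls in range(max_calls + 1):
        can_imagine = calls < max_calls
        action = policy(question, views, can_imagine)
        if isinstance(action, Answer):
            return action.text, calls
        if not can_imagine:
            raise ValueError("policy must answer once the call budget is spent")
        # Append the imagined view so the next policy step can condition on it.
        views.append(simulator(views, action.motion))
    raise AssertionError("unreachable")


def toy_policy(question: str, views: List[View], can_imagine: bool) -> Action:
    # Hypothetical rule: imagine once if only a single view is available.
    if can_imagine and len(views) < 2:
        return Imagine("turn left 90 degrees")
    return Answer(f"answered from {len(views)} views")


def toy_simulator(views: List[View], motion: str) -> View:
    return f"imagined({motion})"


text, calls = run_episode(toy_policy, toy_simulator, "Where is the chair?", ["img0"])
print(text, calls)
```

Returning the call count alongside the answer makes it possible to check both halves of the claim: whether accuracy rises, and whether the simulator is invoked selectively rather than on every question.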
What carries the argument
The two-phase RL curriculum that stabilizes tool-use exploration by first teaching the VLM when imagined observations help and then refining selective invocation.
If this is right
- Adding the trained simulator to an existing VLM raises spatial benchmark performance even without retraining the policy.
- The learned policy reduces unnecessary simulator calls while preserving gains from useful ones.
- Imagined observations supply cross-view consistency and layout information absent from the original egocentric inputs.
Where Pith is reading between the lines
- The same selective-imagination loop could apply to other reasoning domains where generating hypothetical observations would reduce uncertainty.
- Success may depend on scaling the simulator to handle longer or more complex motion sequences than those tested.
- If the curriculum generalizes, similar agentic training could turn other generative tools into on-demand evidence sources for VLMs.
Load-bearing premise
The simulator must output novel views reliable and informative enough that they genuinely help the model reason better than direct answers from the given images.
What would settle it
Run the same spatial benchmarks with the world simulator disabled or with a random-invocation policy and observe whether scores remain at or above the levels achieved by the full trained system.
Original abstract
While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we study this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that empowers VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, we propose a world-simulator-in-the-loop two-phase RL curriculum to stabilize tool-use exploration and advance the model's ability to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world-model-augmented reasoning requires learning when, where, and how to imagine.
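For reference, the score changes quoted in the abstract can be restated as absolute and relative gains. The short Python sketch below uses only the six numbers given above; the dictionary labels and the `gains` helper are hypothetical names chosen for this illustration.

```python
# Before/after scores exactly as quoted in the abstract.
reported = {
    ("Gemini-3-Flash with Astra-WM", "MMSI-Bench"): (45.1, 49.5),
    ("Qwen3-VL to Astra-VL", "MMSI-Bench"): (29.8, 38.8),
    ("Qwen3-VL to Astra-VL", "MindCube"): (36.8, 42.7),
}


def gains(before: float, after: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


for (setup, bench), (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{setup} on {bench}: +{absolute} points ({relative}% relative)")
```

The relative figures only restate the abstract's numbers and say nothing about variance or significance, which the referee report below raises separately.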
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Astra, an agentic spatial reasoning framework that couples Astra-VL (an RL-trained VLM policy) with Astra-WM (a Bagel-based world simulator trained via view consistency tuning). It claims that this setup enables VLMs to actively acquire imagined novel-view observations during reasoning, with both components necessary, as shown by benchmark gains: Astra-WM lifts simulator-augmented Gemini-3-Flash from 45.1 to 49.5 on MMSI-Bench, while Astra-VL lifts Qwen3-VL from 29.8 to 38.8 on MMSI-Bench and 36.8 to 42.7 on MindCube.
Significance. If the central claims hold after addressing the gaps below, the work would be significant for agentic visual reasoning by demonstrating a concrete mechanism for world-model-augmented imagination; explicit credit is due to the two-phase RL curriculum for stabilizing tool-use exploration and the view consistency tuning for improving cross-view reliability.
major comments (3)
- [Abstract] The necessity claim for both Astra-WM and Astra-VL rests on the reported deltas (45.1→49.5, 29.8→38.8, 36.8→42.7), yet no quantitative simulator diagnostics (pose error, cross-view LPIPS, or human preference on generated vs. real views) are supplied to confirm that the novel views are sufficiently reliable and informative.
- [Experiments] No ablation isolating the two-phase RL curriculum from plain RL fine-tuning is reported, leaving open whether the observed gains arise from genuine selective imagination or from extra context tokens and training; this directly undermines the load-bearing claim that the policy learns 'when, where, and how to imagine.'
- [Results] The abstract and results supply no details on baselines, statistical tests, data splits, or potential confounds, all of which are required to show that the gains support the central claim rather than reflecting artifacts of evaluation design.
minor comments (1)
- [Introduction] The distinction between Astra-VL (policy) and Astra-WM (simulator) could be introduced with a single clarifying sentence in the introduction to avoid any initial notation ambiguity.
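One of the diagnostics requested in the first major comment, pose error, is commonly measured as the geodesic angle between a predicted and a target camera rotation. The Python sketch below shows that standard formula only. It is not the paper's metric, and it assumes that a rotation matrix for each generated view has already been recovered by some external pose estimator. The names `rotation_error_deg` and `yaw` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # 3x3 rotation matrix as nested lists


def rotation_error_deg(r_pred: Matrix, r_true: Matrix) -> float:
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_pred^T R_true) equals the element-wise product summed over all entries.
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def yaw(deg: float) -> Matrix:
    """Rotation about the z-axis by `deg` degrees (helper for the example)."""
    a = math.radians(deg)
    return [
        [math.cos(a), -math.sin(a), 0.0],
        [math.sin(a), math.cos(a), 0.0],
        [0.0, 0.0, 1.0],
    ]


# Example: the requested motion was a 90-degree turn, the generated view looks like 60.
print(rotation_error_deg(yaw(60.0), yaw(90.0)))
```

A translation error and a perceptual score such as LPIPS would be reported alongside this; both need components (a pose estimator, a pretrained network) that are outside the scope of this sketch.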
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important gaps in substantiating the central claims about component necessity and evaluation rigor. We address each point below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] The necessity claim for both Astra-WM and Astra-VL rests on the reported deltas (45.1→49.5, 29.8→38.8, 36.8→42.7), yet no quantitative simulator diagnostics (pose error, cross-view LPIPS, or human preference on generated vs. real views) are supplied to confirm that the novel views are sufficiently reliable and informative.
  Authors: We agree that the abstract's necessity claim would be strengthened by explicit simulator quality metrics. The manuscript describes view consistency tuning but does not report numerical diagnostics such as pose error, LPIPS, or human preference scores. In revision we will add these quantitative results (drawn from our internal evaluations) to the abstract and results sections to directly support the reliability of the imagined views.
  revision: yes
- Referee: [Experiments] No ablation isolating the two-phase RL curriculum from plain RL fine-tuning is reported, leaving open whether the observed gains arise from genuine selective imagination or from extra context tokens and training; this directly undermines the load-bearing claim that the policy learns 'when, where, and how to imagine.'
  Authors: The two-phase curriculum is presented as key to stabilizing tool-use exploration. However, the current manuscript does not include an ablation against standard RL fine-tuning. This is a substantive gap. We will add the requested ablation study in the revised experiments section, reporting performance with and without the two-phase structure to isolate its contribution to selective imagination.
  revision: yes
- Referee: [Results] The abstract and results supply no details on baselines, statistical tests, data splits, or potential confounds, all of which are required to show that the gains support the central claim rather than reflecting artifacts of evaluation design.
  Authors: We acknowledge that the provided abstract and results excerpt lack explicit discussion of baselines, statistical significance, data splits, and confounds. The full manuscript contains some baseline comparisons, but additional details are needed. In revision we will expand the results section with data-split information, statistical tests or confidence intervals, and explicit discussion of potential confounds to strengthen the evaluation.
  revision: partial
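The confidence intervals promised in the last response could be obtained in several ways. One common option is a paired bootstrap over benchmark questions, sketched below in Python. This is an illustrative assumption, not a procedure described in the manuscript or the rebuttal. The function name `paired_bootstrap_ci`, its parameters, and the toy correctness vectors are all hypothetical.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    base_correct: List[int],
    new_correct: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the accuracy difference (new minus base).

    Both lists hold 0/1 correctness for the same questions in the same order.
    """
    if len(base_correct) != len(new_correct) or not base_correct:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(base_correct)
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample question indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(new_correct[i] - base_correct[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data: 10 questions, the new system fixes three errors and introduces one.
base = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]
new = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0]
print(paired_bootstrap_ci(base, new))
```

Resampling whole questions keeps each baseline/new pair together, so the interval reflects per-question agreement between the two systems rather than treating their accuracies as independent samples.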
Circularity Check
No circularity; empirical gains reported as direct experimental outcomes
full rationale
The paper's central claims rest on benchmark deltas (MMSI-Bench, MindCube) obtained after training Astra-WM with view consistency tuning and Astra-VL with two-phase RL. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce the reported improvements to inputs by construction. The evaluation is externally falsifiable via the cited benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.