
arxiv: 2511.10946 · v3 · submitted 2025-11-14 · 💻 cs.CV

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

Pith reviewed 2026-05-17 22:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models, 3D perception, spatial reasoning, abstract bounding boxes, zero-shot learning, modality gap, embodied AI, robotics

The pith

Abstract bounding boxes let vision-language models retrieve 3D structure from 2D images and improve spatial reasoning without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that vision-language models fall short on spatial cognition and physical understanding because their 2D training leaves a modality gap that blocks efficient use of geometric information in 3D tasks. It introduces SandboxVLM, a zero-shot framework that reconstructs abstract 3D scenes from 2D inputs by generating multi-view priors, performing proxy elevation, applying multi-view voting and clustering, and then conducting 3D-aware reasoning. The method encodes geometric structure and physical kinematics inside abstract bounding boxes. Tests across multiple benchmarks and VLM backbones show consistent gains, including an 8.3 percent improvement on SAT Real. A reader would care because the approach suggests a lightweight route to better performance in robotics and embodied settings without any fine-tuning.

Core claim

Equipping VLMs with a 3D abstraction via abstract bounding boxes substantially enhances their 3D reasoning ability without additional training. The 3D Sandbox reconstruction and perception pipeline bridges the modality gap between 2D VLM training and 3D tasks by encoding geometric structure and physical kinematics, delivering consistent improvements across benchmarks in zero-shot evaluation.

What carries the argument

The 3D Sandbox reconstruction and perception pipeline, which uses abstract bounding boxes to encode geometric structure and physical kinematics from multi-view 2D inputs through four stages of prior generation, proxy elevation, voting and clustering, and 3D-aware reasoning.
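As a rough illustration of the voting-and-clustering idea, the sketch below turns labeled 3D points from several views into one abstract bounding box per object. It is not the paper's implementation: the voxel grid, the `min_views` threshold, the input format, and the `vote_and_box` helper are all assumptions made for this example, and the points are assumed to be already lifted into a shared 3D frame.

```python
from collections import defaultdict


def vote_and_box(observations, voxel=0.5, min_views=2):
    """Fuse per-view labeled 3D points into axis-aligned boxes (hypothetical helper).

    observations: iterable of (view_id, label, (x, y, z)) in a shared frame.
    A voxel is kept only if its top label was seen from at least `min_views`
    distinct views; ties between labels go to the first label encountered.
    """
    # voxel index -> label -> set of view ids that voted for that label
    votes = defaultdict(lambda: defaultdict(set))
    for view_id, label, (x, y, z) in observations:
        key = (int(x // voxel), int(y // voxel), int(z // voxel))
        votes[key][label].add(view_id)

    boxes = {}
    for key, per_label in votes.items():
        label, views = max(per_label.items(), key=lambda kv: len(kv[1]))
        if len(views) < min_views:
            continue  # too few views agree on this voxel, treat it as noise
        lo = tuple(k * voxel for k in key)
        hi = tuple((k + 1) * voxel for k in key)
        if label in boxes:
            old_lo, old_hi = boxes[label]
            lo = tuple(map(min, old_lo, lo))
            hi = tuple(map(max, old_hi, hi))
        boxes[label] = (lo, hi)
    return boxes


# Toy input: the chair is seen from several views, the table from only one.
observations = [
    (0, "chair", (0.1, 0.1, 0.1)),
    (1, "chair", (0.2, 0.2, 0.2)),
    (0, "chair", (0.6, 0.1, 0.1)),
    (2, "chair", (0.7, 0.2, 0.3)),
    (0, "table", (3.0, 3.0, 3.0)),
]
boxes = vote_and_box(observations)
print(boxes)
```

The single-view table point is discarded, which is the kind of filtering a cross-view vote is meant to provide. A real system would also have to separate multiple instances of the same label, which this sketch does not attempt.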

If this is right

  • VLMs gain measurable spatial intelligence in zero-shot settings on existing benchmarks.
  • The same abstraction approach works across different VLM backbones without retraining.
  • Better 3D reasoning supports downstream uses in robotics and embodied agents.
  • No additional training or fine-tuning data is required to obtain the reported gains; the added cost is the reconstruction pipeline run at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same bounding-box abstraction might transfer to other perception gaps such as video or multi-sensor inputs.
  • Combining the abstract 3D layer with explicit depth sensors could produce larger gains than either alone.
  • The pipeline's multi-view voting step might generalize to tasks that require consistent physical simulation across frames.

Load-bearing premise

The main reason VLMs struggle with 3D tasks is a modality gap between 2D training and 3D needs, and abstract bounding boxes can effectively encode and retrieve the missing geometric and kinematic information.

What would settle it

Running the SandboxVLM pipeline on the SAT Real benchmark or similar 3D spatial tasks and finding no improvement or even a drop in accuracy compared with the unmodified VLM baseline would falsify the central claim.
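One way to make this test concrete is a paired comparison on per-question correctness, so that "no improvement" has a statistical meaning. The sketch below uses an exact sign test (McNemar-style) on the questions where the two systems disagree. The function name `paired_accuracy_test` and the toy data are hypothetical; nothing here reproduces the paper's evaluation protocol or its results.

```python
from math import comb


def paired_accuracy_test(baseline_correct, pipeline_correct):
    """Return (accuracy gain, two-sided exact p-value) for paired outcomes.

    Both inputs are equal-length sequences of booleans, one entry per
    benchmark question. Only discordant pairs inform the test.
    """
    n = len(baseline_correct)
    if n == 0 or n != len(pipeline_correct):
        raise ValueError("need two non-empty sequences of equal length")

    pairs = list(zip(baseline_correct, pipeline_correct))
    gained = sum(1 for b, p in pairs if p and not b)  # pipeline fixes an error
    lost = sum(1 for b, p in pairs if b and not p)    # pipeline breaks an answer
    delta = (gained - lost) / n

    discordant = gained + lost
    if discordant == 0:
        return delta, 1.0
    # Under the null, each discordant pair is a fair coin flip.
    tail = sum(comb(discordant, k) for k in range(min(gained, lost) + 1))
    p_value = min(1.0, 2 * tail / 2 ** discordant)
    return delta, p_value


# Toy data only: 10 questions, the pipeline fixes 4 errors and breaks none.
baseline = [True] * 5 + [False] * 5
pipeline = [True] * 5 + [True] * 4 + [False]
delta, p_value = paired_accuracy_test(baseline, pipeline)
print(delta, p_value)
```

A gain that stays near zero, or a negative `delta`, on a benchmark such as SAT Real would count against the central claim; a positive `delta` with a large p-value would leave it unsettled.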

Figures

Figures reproduced from arXiv: 2511.10946 by Fangneng Zhan, Hanspeter Pfister, Kaichen Zhou, Paul Pu Liang, Yifan Liu, Yilun Du.

Figure 1: Motivation of SandboxVLM. (a) Existing VLMs are ...
Figure 2: Overview of the SandboxVLM pipeline. Given an input image and a textual query, the system builds a compact, 3D-aware, ...
Figure 3: Core modules of 3D Sandbox. (a) Proxy Elevation: The ...
Figure 4: Visualization of representations in ablation study. (a) ...

Original abstract

Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between the 3D tasks and the 2D training of VLM, which led to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLM. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods for instance. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SandboxVLM, a training-free framework that supplies VLMs with abstract bounding boxes derived from a four-stage 3D reconstruction and perception pipeline (multi-view priors with abstract control, proxy elevation, multi-view voting/clustering, and 3D-aware reasoning). The central claim is that this 3D abstraction bridges the modality gap between 2D VLM training and 3D spatial/physical reasoning tasks, yielding consistent zero-shot gains across benchmarks, including an 8.3% improvement on SAT Real relative to baselines.

Significance. If the reported gains prove robust and attributable to genuine geometric retrieval rather than prompting artifacts, the work would offer a practical route to improved spatial intelligence in VLMs for embodied applications. The approach is notable for its simplicity and lack of additional training, potentially influencing future designs of general-purpose agents that must reason about 3D structure from 2D inputs.

major comments (2)
  1. [Abstract] Abstract: the reported 8.3% gain on SAT Real is stated without identifying the baseline methods, reporting error bars, statistical significance tests, data splits, or controls. This omission leaves the central empirical claim only partially supported and prevents assessment of whether the improvement is reliable or reproducible.
  2. [Method] Method section (SandboxVLM pipeline): no quantitative checks are provided for the accuracy of proxy elevation or multi-view clustering (e.g., 3D IoU, depth consistency, or kinematic fidelity against ground-truth 3D data). Without such validation, it remains possible that observed gains arise from richer textual prompting or multi-view ensembling rather than retrieval of usable 3D structure, directly affecting the load-bearing assumption that the bounding-box abstraction encodes geometric and kinematic information.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'for instance' at the end of the results sentence is grammatically awkward and should be removed or rephrased for clarity.
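The 3D IoU check requested in major comment 2 is simple to state for axis-aligned boxes. The sketch below is a minimal version under that assumption; the `(lo, hi)` corner representation and the function names are chosen for this example, and oriented boxes (which the paper's abstraction may or may not use) would need a different intersection routine.

```python
def box_volume(lo, hi):
    """Volume of an axis-aligned box; degenerate or inverted boxes give 0."""
    volume = 1.0
    for a, b in zip(lo, hi):
        volume *= max(0.0, b - a)
    return volume


def iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (lo, hi)."""
    (a_lo, a_hi), (b_lo, b_hi) = box_a, box_b
    # The overlap region is bounded by the larger lows and the smaller highs.
    inter_lo = tuple(map(max, a_lo, b_lo))
    inter_hi = tuple(map(min, a_hi, b_hi))
    inter = box_volume(inter_lo, inter_hi)
    union = box_volume(a_lo, a_hi) + box_volume(b_lo, b_hi) - inter
    return inter / union if union > 0 else 0.0


# Illustrative boxes only: a predicted box shifted by one unit on each axis.
predicted = ((0, 0, 0), (2, 2, 2))
ground_truth = ((1, 1, 1), (3, 3, 3))
print(iou_3d(predicted, ground_truth))
```

Reporting this score per object against ground-truth 3D annotations, where they exist, would help separate gains from recovered geometry from gains due to richer prompting.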

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, clarifying our approach and indicating planned revisions to improve the presentation and support for our claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the reported 8.3% gain on SAT Real is stated without identifying the baseline methods, reporting error bars, statistical significance tests, data splits, or controls. This omission leaves the central empirical claim only partially supported and prevents assessment of whether the improvement is reliable or reproducible.

    Authors: We agree that greater specificity in the abstract would strengthen the central claim. The baseline in our experiments is the unmodified VLM without the SandboxVLM pipeline, evaluated on the standard SAT Real data split. Comprehensive results with error bars, ablations, and controls appear in the experimental section of the full manuscript. We will revise the abstract to explicitly name the baseline and direct readers to the detailed quantitative results, including any available statistical information, in the main body.
    revision: yes

  2. Referee: [Method] Method section (SandboxVLM pipeline): no quantitative checks are provided for the accuracy of proxy elevation or multi-view clustering (e.g., 3D IoU, depth consistency, or kinematic fidelity against ground-truth 3D data). Without such validation, it remains possible that observed gains arise from richer textual prompting or multi-view ensembling rather than retrieval of usable 3D structure, directly affecting the load-bearing assumption that the bounding-box abstraction encodes geometric and kinematic information.

    Authors: We acknowledge the value of direct quantitative validation for the intermediate stages. The manuscript validates the pipeline primarily through end-to-end task performance and ablation studies that demonstrate the contribution of each stage, including comparisons to prompting-only baselines. However, comprehensive ground-truth 3D annotations are not available across the zero-shot benchmarks, which limits direct computation of metrics such as 3D IoU. We will revise the method section to include a more explicit discussion of this limitation, the design rationale for the abstraction stages, and any feasible proxy analyses that can be added based on available data.
    revision: partial

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain

The paper introduces SandboxVLM as a four-stage pipeline (multi-view priors, proxy elevation, multi-view voting/clustering, 3D-aware reasoning) that supplies abstract bounding boxes to VLMs for improved 3D reasoning in zero-shot settings. The central claim rests on empirical gains (e.g., 8.3% on SAT Real) measured against baselines on public benchmarks. No equations, fitted parameters, or self-referential definitions appear in the abstract or described method; the pipeline is presented as an external augmentation rather than a quantity derived from the target performance metric. No load-bearing self-citations or uniqueness theorems are invoked. The evaluation is independent of the method's internal construction, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on standard multi-view geometry and existing VLM inference capabilities but introduces new combinations without explicit free parameters or external validation of the core abstraction.

axioms (1)
  • Domain assumption: Abstract bounding boxes can encode sufficient geometric structure and physical kinematics for effective VLM reasoning.
    Invoked as the bridge for the modality gap; central to all four pipeline stages but not independently verified in the abstract.
invented entities (1)
  • SandboxVLM 3D reconstruction and perception pipeline (no independent evidence)
    Purpose: To generate and leverage abstract bounding boxes for 3D-aware reasoning in VLMs.
    Newly proposed framework whose effectiveness is demonstrated only via the reported benchmark gains.

pith-pipeline@v0.9.0 · 5508 in / 1527 out tokens · 43521 ms · 2026-05-17T22:39:33.920430+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GeoWorld-VLM: Geometry from World Models for Vision-Language Models

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    GeoWorld-VLM distills geometric structure from camera-conditioned world models into VLMs by aligning visual features, improving spatial reasoning by about 4% on What'sUp and VSR benchmarks across two architectures whi...

  2. Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.
