Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

Fangneng Zhan; Hanspeter Pfister; Kaichen Zhou; Paul Pu Liang; Yifan Liu; Yilun Du

arxiv: 2511.10946 · v3 · submitted 2025-11-14 · 💻 cs.CV

Yifan Liu, Fangneng Zhan, Kaichen Zhou, Yilun Du, Paul Pu Liang, Hanspeter Pfister

Pith reviewed 2026-05-17 22:39 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language models · 3D perception · spatial reasoning · abstract bounding boxes · zero-shot learning · modality gap · embodied AI · robotics

The pith

Abstract bounding boxes let vision-language models retrieve 3D structure from 2D images and improve spatial reasoning without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that vision-language models fall short on spatial cognition and physical understanding because their 2D training leaves a modality gap that blocks efficient use of geometric information in 3D tasks. It introduces SandboxVLM, a zero-shot framework that reconstructs abstract 3D scenes from 2D inputs by generating multi-view priors, performing proxy elevation, applying multi-view voting and clustering, and then conducting 3D-aware reasoning. The method encodes geometric structure and physical kinematics inside abstract bounding boxes. Tests across multiple benchmarks and VLM backbones show consistent gains, including an 8.3 percent improvement on SAT Real. A reader would care because the approach suggests a lightweight route to better performance in robotics and embodied settings without any fine-tuning.

Core claim

Equipping VLMs with a 3D abstraction via abstract bounding boxes substantially enhances their 3D reasoning ability without additional training. The 3D Sandbox reconstruction and perception pipeline bridges the modality gap between 2D VLM training and 3D tasks by encoding geometric structure and physical kinematics, delivering consistent improvements across benchmarks in zero-shot evaluation.
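
To make the idea of "a 3D abstraction handed to a VLM" concrete, here is a minimal sketch of how a set of abstract boxes might be serialized into text and placed ahead of a query. The `AbstractBox` type, its fields, and the output format are hypothetical illustrations; the source does not say how SandboxVLM actually represents or presents its boxes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractBox:
    """Hypothetical axis-aligned 3D box: label, center, and size in scene units."""
    label: str
    center: tuple[float, float, float]
    size: tuple[float, float, float]


def describe_scene(boxes: list[AbstractBox]) -> str:
    """Serialize boxes into plain text that could be prepended to a VLM query."""
    lines = []
    for i, b in enumerate(boxes):
        cx, cy, cz = b.center
        sx, sy, sz = b.size
        lines.append(
            f"object {i}: {b.label}, center=({cx:.2f}, {cy:.2f}, {cz:.2f}), "
            f"size=({sx:.2f}, {sy:.2f}, {sz:.2f})"
        )
    return "\n".join(lines)


# Made-up scene for illustration only.
scene = [
    AbstractBox("chair", (0.5, 0.0, 2.0), (0.6, 0.9, 0.6)),
    AbstractBox("table", (-0.4, 0.0, 3.1), (1.2, 0.8, 0.7)),
]
prompt_context = describe_scene(scene)
print(prompt_context)
```

A text serialization like this keeps the backbone untouched, which is consistent with the training-free framing, though the paper may instead render the boxes visually.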

What carries the argument

The 3D Sandbox reconstruction and perception pipeline, which uses abstract bounding boxes to encode geometric structure and physical kinematics from multi-view 2D inputs through four stages of prior generation, proxy elevation, voting and clustering, and 3D-aware reasoning.
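
The voting stage can be pictured with a toy example: each view contributes lifted 3D points, a coarse grid counts how many views agree on each cell, and a box is drawn around the cells with enough support. This is a sketch under assumptions, not the paper's algorithm; the function name, the voxel-grid scheme, and the `cell` and `min_votes` parameters are all invented here, and it handles one object with no clustering step.

```python
from collections import Counter


def vote_and_box(views, cell=0.5, min_votes=2):
    """Return (lo, hi) corners of a box around grid cells seen by >= min_votes views.

    views: list of per-view point lists, each point an (x, y, z) tuple.
    Returns None when no cell gathers enough votes.
    """
    votes = Counter()
    for points in views:
        # Each view votes at most once per cell, so a dense view cannot dominate.
        cells = {tuple(int(c // cell) for c in p) for p in points}
        votes.update(cells)
    kept = [c for c, n in votes.items() if n >= min_votes]
    if not kept:
        return None
    lo = tuple(min(c[i] for c in kept) * cell for i in range(3))
    hi = tuple((max(c[i] for c in kept) + 1) * cell for i in range(3))
    return lo, hi


# Made-up points; the (3.0, 3.0, 3.0) point appears in one view only.
views = [
    [(0.1, 0.1, 1.1), (0.6, 0.2, 1.2)],
    [(0.2, 0.3, 1.3), (0.7, 0.1, 1.4), (3.0, 3.0, 3.0)],
    [(0.3, 0.2, 1.2)],
]
box = vote_and_box(views)
print(box)
```

The single-view outlier gets one vote and is dropped, which is the practical point of voting across views: points that only one synthesized view supports do not stretch the box.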

If this is right

VLMs gain measurable spatial intelligence in zero-shot settings on existing benchmarks.
The same abstraction approach works across different VLM backbones without retraining.
Better 3D reasoning supports downstream uses in robotics and embodied agents.
No extra training data or compute is required to obtain the reported gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bounding-box abstraction might transfer to other perception gaps such as video or multi-sensor inputs.
Combining the abstract 3D layer with explicit depth sensors could produce larger gains than either alone.
The pipeline's multi-view voting step might generalize to tasks that require consistent physical simulation across frames.

Load-bearing premise

The main reason VLMs struggle with 3D tasks is a modality gap between 2D training and 3D needs, and abstract bounding boxes can effectively encode and retrieve the missing geometric and kinematic information.

What would settle it

Running the SandboxVLM pipeline on the SAT Real benchmark or similar 3D spatial tasks and finding no improvement or even a drop in accuracy compared with the unmodified VLM baseline would falsify the central claim.
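
One way to decide whether a gain over the unmodified baseline is distinguishable from zero is a paired bootstrap over per-question correctness. The sketch below assumes both systems were scored on the same questions; the function name and the toy 0/1 vectors are illustrative placeholders, not data from the paper.

```python
import random


def paired_bootstrap_gain(baseline_correct, method_correct, n_resamples=2000, seed=0):
    """Return (observed gain, CI low, CI high) for accuracy, with a 95% percentile CI.

    Inputs are equal-length lists of 0/1 correctness flags for the same questions.
    """
    if len(baseline_correct) != len(method_correct):
        raise ValueError("both systems must be scored on the same questions")
    n = len(baseline_correct)
    diffs = [m - b for b, m in zip(baseline_correct, method_correct)]
    observed = sum(diffs) / n
    rng = random.Random(seed)
    gains = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        gains.append(sum(sample) / n)
    gains.sort()
    lo = gains[int(0.025 * n_resamples)]
    hi = gains[int(0.975 * n_resamples) - 1]
    return observed, lo, hi


# Placeholder correctness flags, invented for illustration.
baseline = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
method = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
observed, lo, hi = paired_bootstrap_gain(baseline, method)
print(f"gain={observed:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

If the interval comfortably excludes zero on the real benchmark, the falsification scenario above does not apply; if it straddles zero, the reported gain is not yet settled either way.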

Figures

Figures reproduced from arXiv: 2511.10946 by Fangneng Zhan, Hanspeter Pfister, Kaichen Zhou, Paul Pu Liang, Yifan Liu, Yilun Du.

**Figure 2.** Overview of the SandboxVLM pipeline. Given an input image and a textual query, the system builds a compact, 3D-aware, ...

**Figure 3.** Core modules of 3D Sandbox. (a) Proxy Elevation: The ...

**Figure 4.** Visualization of representations in ablation study. (a) ...

Original abstract

Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between the 3D tasks and the 2D training of VLM, which led to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLM. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods for instance. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SandboxVLM gives VLMs abstract bounding boxes through a four-stage pipeline to lift spatial performance without training, but the evidence that the boxes actually deliver usable 3D geometry remains indirect.

The main takeaway is a zero-shot method that uses abstract bounding boxes to improve 3D reasoning in vision-language models without any extra training. It claims an 8.3% improvement on SAT Real, but the details on how the gains were measured are sparse in the abstract. What is actually new is the specific four-stage process: generating multi-view priors with abstract control, proxy elevation to lift 2D to 3D-ish, multi-view voting and clustering, and finally 3D-aware reasoning.

The paper does well by applying this across multiple benchmarks and different VLM backbones in zero-shot settings, which makes the approach easy to try out. It earns credit for focusing on practical enhancements for embodied AI and robotics, where spatial cognition matters a lot. The consistent improvements suggest the abstraction helps bridge the 2D training gap.

The soft spots come in the evaluation. There are no mentions of error bars, statistical tests, or detailed baseline comparisons, and no checks to confirm the bounding boxes capture real geometric structure or kinematics. The stress-test note about depth ambiguity in proxy elevation is on point here; if the abstractions are noisy, the benefits might stem from richer multi-view prompts rather than genuine 3D perception. Full validation would need more on that.

This paper is for researchers and engineers looking to boost existing VLMs for spatial tasks in real-world applications. A reader interested in training-free methods for embodied intelligence would find it useful, even if they need to implement and test the pipeline themselves. It deserves a serious referee because the core technique is novel enough and the claims can be checked with additional experiments. I recommend sending it to peer review with feedback on strengthening the empirical support and adding ablations.
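
The depth-ambiguity concern is easy to see with the standard pinhole back-projection, which lifts a pixel to a 3D point once a depth is chosen. The sketch below is generic camera geometry, not the paper's proxy elevation procedure; the intrinsics and pixel values are assumed for illustration.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) at the given depth to camera-frame (X, Y, Z), pinhole model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Illustrative intrinsics (assumed values, not from the paper).
fx = fy = 500.0
cx, cy = 320.0, 240.0

# The same pixel lifted with two different depth guesses.
near = back_project(420.0, 240.0, 2.0, fx, fy, cx, cy)
far = back_project(420.0, 240.0, 4.0, fx, fy, cx, cy)
print(near, far)
```

Both points project to the same pixel, so a single image cannot tell them apart. Any error in the depth estimate slides the lifted point, and therefore the resulting box, along the viewing ray.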

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SandboxVLM, a training-free framework that supplies VLMs with abstract bounding boxes derived from a four-stage 3D reconstruction and perception pipeline (multi-view priors with abstract control, proxy elevation, multi-view voting/clustering, and 3D-aware reasoning). The central claim is that this 3D abstraction bridges the modality gap between 2D VLM training and 3D spatial/physical reasoning tasks, yielding consistent zero-shot gains across benchmarks, including an 8.3% improvement on SAT Real relative to baselines.

Significance. If the reported gains prove robust and attributable to genuine geometric retrieval rather than prompting artifacts, the work would offer a practical route to improved spatial intelligence in VLMs for embodied applications. The approach is notable for its simplicity and lack of additional training, potentially influencing future designs of general-purpose agents that must reason about 3D structure from 2D inputs.

major comments (2)

[Abstract] Abstract: the reported 8.3% gain on SAT Real is stated without identifying the baseline methods, reporting error bars, statistical significance tests, data splits, or controls. This omission leaves the central empirical claim only partially supported and prevents assessment of whether the improvement is reliable or reproducible.
[Method] Method section (SandboxVLM pipeline): no quantitative checks are provided for the accuracy of proxy elevation or multi-view clustering (e.g., 3D IoU, depth consistency, or kinematic fidelity against ground-truth 3D data). Without such validation, it remains possible that observed gains arise from richer textual prompting or multi-view ensembling rather than retrieval of usable 3D structure, directly affecting the load-bearing assumption that the bounding-box abstraction encodes geometric and kinematic information.
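
For reference, the 3D IoU check named in the second comment is straightforward for axis-aligned boxes. The sketch below assumes boxes are given as minimum and maximum corners; oriented boxes would need a different intersection routine, and the function name and example boxes are illustrative.

```python
def iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes.

    Each box is ((xmin, ymin, zmin), (xmax, ymax, zmax)). Returns a value in [0, 1].
    """
    (a_lo, a_hi), (b_lo, b_hi) = box_a, box_b
    inter = 1.0
    for i in range(3):
        overlap = min(a_hi[i], b_hi[i]) - max(a_lo[i], b_lo[i])
        if overlap <= 0:
            return 0.0  # no overlap along this axis means no intersection
        inter *= overlap

    def volume(lo, hi):
        return (hi[0] - lo[0]) * (hi[1] - lo[1]) * (hi[2] - lo[2])

    union = volume(a_lo, a_hi) + volume(b_lo, b_hi) - inter
    return inter / union


# Made-up predicted and ground-truth boxes, offset by one unit along x.
predicted = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
ground_truth = ((1.0, 0.0, 0.0), (3.0, 2.0, 2.0))
score = iou_3d(predicted, ground_truth)
print(round(score, 4))
```

Here the intersection volume is 4 and the union is 12, giving an IoU of one third. Reporting this per object against any scenes that do carry ground-truth 3D annotations would address the comment directly.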

minor comments (1)

[Abstract] Abstract: the phrase 'for instance' at the end of the results sentence is grammatically awkward and should be removed or rephrased for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, clarifying our approach and indicating planned revisions to improve the presentation and support for our claims.

Referee: [Abstract] Abstract: the reported 8.3% gain on SAT Real is stated without identifying the baseline methods, reporting error bars, statistical significance tests, data splits, or controls. This omission leaves the central empirical claim only partially supported and prevents assessment of whether the improvement is reliable or reproducible.

Authors: We agree that greater specificity in the abstract would strengthen the central claim. The baseline in our experiments is the unmodified VLM without the SandboxVLM pipeline, evaluated on the standard SAT Real data split. Comprehensive results with error bars, ablations, and controls appear in the experimental section of the full manuscript. We will revise the abstract to explicitly name the baseline and direct readers to the detailed quantitative results, including any available statistical information, in the main body.

revision: yes

Referee: [Method] Method section (SandboxVLM pipeline): no quantitative checks are provided for the accuracy of proxy elevation or multi-view clustering (e.g., 3D IoU, depth consistency, or kinematic fidelity against ground-truth 3D data). Without such validation, it remains possible that observed gains arise from richer textual prompting or multi-view ensembling rather than retrieval of usable 3D structure, directly affecting the load-bearing assumption that the bounding-box abstraction encodes geometric and kinematic information.

Authors: We acknowledge the value of direct quantitative validation for the intermediate stages. The manuscript validates the pipeline primarily through end-to-end task performance and ablation studies that demonstrate the contribution of each stage, including comparisons to prompting-only baselines. However, comprehensive ground-truth 3D annotations are not available across the zero-shot benchmarks, which limits direct computation of metrics such as 3D IoU. We will revise the method section to include a more explicit discussion of this limitation, the design rationale for the abstraction stages, and any feasible proxy analyses that can be added based on available data.

revision: partial

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain

The paper introduces SandboxVLM as a four-stage pipeline (multi-view priors, proxy elevation, multi-view voting/clustering, 3D-aware reasoning) that supplies abstract bounding boxes to VLMs for improved 3D reasoning in zero-shot settings. The central claim rests on empirical gains (e.g., 8.3% on SAT Real) measured against baselines on public benchmarks. No equations, fitted parameters, or self-referential definitions appear in the abstract or described method; the pipeline is presented as an external augmentation rather than a quantity derived from the target performance metric. No load-bearing self-citations or uniqueness theorems are invoked. The evaluation is independent of the method's internal construction, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on standard multi-view geometry and existing VLM inference capabilities but introduces new combinations without explicit free parameters or external validation of the core abstraction.

axioms (1)

domain assumption: Abstract bounding boxes can encode sufficient geometric structure and physical kinematics for effective VLM reasoning.
Invoked as the bridge for the modality gap; central to all four pipeline stages but not independently verified in the abstract.

invented entities (1)

SandboxVLM 3D reconstruction and perception pipeline (no independent evidence)
purpose: To generate and leverage abstract bounding boxes for 3D-aware reasoning in VLMs.
Newly proposed framework whose effectiveness is demonstrated only via the reported benchmark gains.

pith-pipeline@v0.9.0 · 5508 in / 1527 out tokens · 43521 ms · 2026-05-17T22:39:33.920430+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GeoWorld-VLM: Geometry from World Models for Vision-Language Models
cs.CV · 2026-05 · unverdicted · novelty 5.0

GeoWorld-VLM distills geometric structure from camera-conditioned world models into VLMs by aligning visual features, improving spatial reasoning by about 4% on What'sUp and VSR benchmarks across two architectures whi...

Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence
cs.CV · 2026-05 · unverdicted · novelty 5.0

Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.


work page 2008

[13] [13]

Unsupervised learning of 3d scene structure from images

Siyuan Huang, Siyuan Qi, Yixin Wu, and Song-Chun Zhu. Unsupervised learning of 3d scene structure from images. In NeurIPS, 2020. 2

work page 2020

[14] [14]

Image genera- tion from scene graphs

Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image genera- tion from scene graphs. InCVPR, 2018. 2

work page 2018

[15] [15]

Learning 3d shape representations by combining synthetic and real data

Abhishek Kar, Shubham Tulsiani, Joao Carreira, and Jiten- dra Malik. Learning 3d shape representations by combining synthetic and real data. InCVPR, 2017. 2

work page 2017

[16] [16]

3d gaussian splatting for real-time radiance field rendering, 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering, 2023. 4

work page 2023

[17] [17]

Lee, Jihyeon Je, Chanho Park, Mikaela Angelina Uy, Leonidas Guibas, and Minhyuk Sung

Phillip Y . Lee, Jihyeon Je, Chanho Park, Mikaela Angelina Uy, Leonidas Guibas, and Minhyuk Sung. Perspective- aware reasoning in vision-language models via mental im- agery simulation, 2025. 3

work page 2025

[18] [18]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024. 1

work page 2024

[19] [19]

3ur-llm: Unified understanding and reasoning for real 3d scenes with point clouds and language.arXiv preprint arXiv:2501.07819,

Zhiyuan Liu, Qing Wang, Hao Xu, et al. 3ur-llm: Unified understanding and reasoning for real 3d scenes with point clouds and language.arXiv preprint arXiv:2501.07819,

work page arXiv

[20] [20]

Visual embodied brain: Let multimodal large language mod- els see, think, and control in spaces, 2025

Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Hao- nan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, Shenglong Ye, Lewei Lu, Jingbo Wang, Wenhai Wang, Jifeng Dai, Yu Qiao, Rongrong Ji, and Xizhou Zhu. Visual embodied brain: Let multimodal large language mod- els see, think, and control in spaces, 2025. 5, 6

work page 2025

[21] [21]

Sajjadi, Jonathan T

Ricardo Martin-Brualla, Noha Radwan, Mehdi S.M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duck- worth. Nerfies: Deformable neural radiance fields. InICCV,

work page

[22] [22]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. InEuropean Conference on Computer Vision (ECCV), 2020. 2, 4

work page 2020

[23] [23]

Romero, Misha Smelyanskiy, Shuran Song, Lyne Tchapmi, Andrew Z

NVIDIA, :, Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Liang Feng, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee,...

work page 2025

[24] [24]

Qi, Hao Su, Kaichun Mo, and Leonidas J

Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InCVPR, 2017. 2

work page 2017

[25] [25]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26] [26]

Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko

Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kem- bhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko. Sat: Dynamic spatial aptitude training for multimodal language models, 2024. 2, 5

work page 2024

[27] [27]

Pv-rcnn: Point-voxel feature set abstraction for 3d object detection

Shaoshuai Shi, Chun Guo, Li Jiang, Zhe Wang, and Hong- sheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. InCVPR, 2020. 2

work page 2020

[28] [28]

Robobrain 2.0 technical report, 2025

BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, Yi Han, Yingbo Tang, Xi- angqi Xu, Wei Guo, Yaoxu Lyu, Yijie Xu, Jiayu Shi, Mengfei Du, Cheng Chi, Mengdi Zhao, Xiaoshuai Hao, Junkai Zhao, Xiaojie Zhang, Shanyu Rong, Huaihai Lyu, Zhengliang Cai, Yankai Fu, Ning Chen, Bolun ...

work page 2025

[29] [29]

Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation, 2024

Tencent Hunyuan3D Team. Hunyuan3d 1.0: A unified framework for text-to-3d and image-to-3d generation, 2024. 1, 3

work page 2024

[30] [30]

Vggt: Vi- sual geometry grounded transformer, 2025

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer, 2025. 2, 4, 7

work page 2025

[31] [31]

Vagen: Re- inforcing world model reasoning for multi-turn vlm agents,

Kangrui Wang, Pingyue Zhang, Zihan Wang, Yaning Gao, Linjie Li, Qineng Wang, Hanyang Chen, Chi Wan, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Yejin Choi, and Manling Li. Vagen: Re- inforcing world model reasoning for multi-turn vlm agents,

work page

[32] [32]

3d-llm: Inject- ing the 3d world into large language models.arXiv preprint arXiv:2307.12981, 2023

Yixin Wang, Jiayuan Gu, Ting Chen, et al. 3d-llm: Inject- ing the 3d world into large language models.arXiv preprint arXiv:2307.12981, 2023. 1, 2

work page arXiv 2023

[33] [33]

Generalization guides human exploration in vast decision spaces.Nature human behaviour, 2(12):915–924, 2018

Charley M Wu, Eric Schulz, Maarten Speekenbrink, Jonathan D Nelson, and Bj¨orn Meder. Generalization guides human exploration in vast decision spaces.Nature human behaviour, 2(12):915–924, 2018. 2, 4

work page 2018

[34] [34]

Pq-net: A generative part seq2seq network for 3d shapes

Ruihui Wu, Yifan Wang, Duygu Ceylan, Ersin Yumer, and Niloy J Mitra. Pq-net: A generative part seq2seq network for 3d shapes. InCVPR, 2020. 2

work page 2020

[35] [35]

Universal scene graph generation, 2025

Shengqiong Wu, Hao Fei, and Tat-Seng Chua. Universal scene graph generation, 2025. 2

work page 2025

[36] [36]

Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie

Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How mul- timodal large language models see, remember, and recall spaces, 2024. 2

work page 2024

[37] [37]

Magma: A foundation model for multimodal ai agents, 2025

Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, and Jianfeng Gao. Magma: A foundation model for multimodal ai agents, 2025. 5, 6

work page 2025

[38] [38]

Thinking in space: How mul- timodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How mul- timodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 1

work page 2025

[39] [39]

Omni-shapellm: Text-driven 3d shape generation and editing with llms.arXiv preprint arXiv:2406.08124, 2024

Yifan Yang, Kai Xu, et al. Omni-shapellm: Text-driven 3d shape generation and editing with llms.arXiv preprint arXiv:2406.08124, 2024. 2

work page arXiv 2024

[40] [40]

arXiv preprint arXiv:2402.17766 (2024)

Yifan Yang, Hanzhi Zhang, Kai Xu, et al. Shapellm: Univer- sal 3d shape understanding via large language models.arXiv preprint arXiv:2402.17766, 2024. 1, 2

work page arXiv 2024

[41] [41]

Mindjourney: Test-time scaling with world models for spatial reasoning

Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, and Chuang Gan. Mindjourney: Test-time scaling with world models for spa- tial reasoning.arXiv preprint arXiv:2507.12508, 2025. 1, 3, 5, 6

work page arXiv 2025

[42] [42]

Spatial mental modeling from limited views, 2025

Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chan- drasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, and Li Fei-Fei. Spatial mental modeling from limited views, 2025. 2

work page 2025

[43] [43]

Minigpt-3d: Effi- ciently aligning 3d point clouds with large language models via 2d priors.arXiv preprint arXiv:2405.01413, 2024

Yu Zhang, Wenxuan Li, Yang Zhao, et al. Minigpt-3d: Effi- ciently aligning 3d point clouds with large language models via 2d priors.arXiv preprint arXiv:2405.01413, 2024. 2

work page arXiv 2024

[44] [44]

Stable virtual camera: Generative view synthesis with diffusion models, 2025

Jensen Zhou, Hang Gao, Vikram V oleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models, 2025. 1, 2, 3, 7

work page 2025

[45] [45]

Dyn- point: Dynamic neural point for view synthesis.Advances in Neural Information Processing Systems, 36:69532–69545,

Kaichen Zhou, Jia-Xing Zhong, Sangyun Shin, Kai Lu, Yiyuan Yang, Andrew Markham, and Niki Trigoni. Dyn- point: Dynamic neural point for view synthesis.Advances in Neural Information Processing Systems, 36:69532–69545,

work page

[46] [46]

Page-4d: Disentangled pose and geometry estimation for vggt-4d perception, 2025

Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, and Mengyu Wang. Page-4d: Disentangled pose and geometry estimation for vggt-4d perception, 2025. 2, 8

work page 2025