pith. machine review for the scientific record. sign in

arxiv: 2603.08592 · v2 · submitted 2026-03-09 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations

Authors on Pith no claims yet

Pith reviewed 2026-05-15 14:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords MLLMspatial reasoning3D scene representationzero-shot learninggeometric attributesVSI-BenchMindCubeobject ID annotation
0
0 comments X

The pith

Encoding objects with unique IDs and their 3D attributes as text lets MLLMs reason about space using language skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GR3D, a representation that labels objects in input images with unique IDs and converts their 3D geometric properties into indexed textual references. This format lets MLLMs apply their existing strength in mathematical language reasoning to 3D cues while still processing the original 2D visual features. The method requires no extra training and works zero-shot across different MLLMs. On standard spatial benchmarks it raises GPT-5 accuracy by 9 percent on VSI-Bench and 12 percent on MindCube, and it supports complex spatial tasks even when only a few views are available.

Core claim

GR3D annotates each object in a set of images with a unique ID and encodes its 3D geometric attributes as textual references indexed by those IDs, enabling MLLMs to interpret 3D spatial information through language-based mathematical reasoning while jointly analyzing the 2D visual input.

What carries the argument

geometrically referenced 3D scene representations (GR3D): object ID annotation plus textual encoding of 3D geometric attributes that the MLLM can read and reason over directly.

If this is right

  • The same GR3D pipeline improves spatial reasoning on VSI-Bench and MindCube for GPT-5 by the reported margins.
  • No model retraining is needed, so the method applies immediately to any existing MLLM.
  • Complex spatial inferences remain possible even when input consists of only a few sparsely distributed views.
  • 2D visual analysis and 3D textual reasoning operate in the same forward pass without separate modules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to other tasks that mix 3D geometry with language, such as navigation or robotic planning descriptions.
  • If 3D extraction quality improves, further gains on the same benchmarks are likely without changing the MLLM itself.
  • Sparse-view robustness suggests the method could work on video streams where not every frame contains every object.

Load-bearing premise

Accurate 3D geometric attributes can be extracted from the input images and written as text without errors that would mislead later reasoning steps.

What would settle it

Run the same MLLM on the same benchmarks with deliberately noisy or incorrect 3D attribute text; if performance falls below the plain image-only baseline, the benefit of GR3D collapses.

Figures

Figures reproduced from arXiv: 2603.08592 by Baoyuan Wang, Gowri Kumar, Jiangye Yuan.

Figure 1
Figure 1. Figure 1: An overview of GR3D framework. Given a collection of images, our method reconstructs 3D scenes, extracts object-level [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Object annotation with occlusion check. Left: initial [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Prompt template used in evaluations. 4.3. Evaluation Results Evaluation results on VSI-Bench are summarized in Tab. 1. For comparison, we include the benchmark scores of GPT￾4o and Gemini-1.5 Pro, reported in [39], as well as the scores of InternVL2 [6], a leading open-source mid-size model on this benchmark. VG LLM [43] is among the few models fine-tuned with 3D scene data and evaluated on VSI￾Bench. In a… view at source ↗
read the original abstract

While Multimodal Large Language Models (MLLMs) have achieved remarkable success in 2D visual understanding, their ability to reason about 3D space remains limited. To address this gap, we introduce geometrically referenced 3D scene representations (GR3D). Given a set of input images, GR3D annotates objects in the images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs. This representation enables MLLMs to interpret 3D cues using their advanced language-based skills in mathematical reasoning, while concurrently analyzing 2D visual features in a tightly coupled way. We present a simple yet effective approach based on GR3D, which requires no additional training and is readily applicable to different MLLMs. Implemented in a zero-shot setting, our approach yields substantial improvements on challenging spatial reasoning benchmarks, boosting GPT-5 performance by 9% on VSI-Bench and 12% on MindCube. Qualitative studies further demonstrate that GR3D empowers MLLMs to perform complex spatial reasoning with highly sparse input views.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces geometrically referenced 3D scene representations (GR3D) that annotate objects in input images with unique IDs and encode their 3D geometric attributes (positions, orientations, sizes) as indexed textual references. This representation is used in a zero-shot prompting approach to improve MLLM spatial reasoning by coupling 2D visual features with language-based mathematical reasoning, yielding reported gains of 9% on VSI-Bench and 12% on MindCube for GPT-5 without any additional training.

Significance. If the 3D attribute extraction is shown to be sufficiently accurate, GR3D offers a simple, training-free method to enhance spatial reasoning in existing MLLMs by leveraging their strengths in textual mathematical inference. The zero-shot applicability across models and the use of sparse views are practical strengths that could influence downstream applications in robotics and scene understanding.

major comments (2)
  1. [§3] §3: The extraction pipeline for 3D geometric attributes from images (via monocular depth, SfM, or detectors) is outlined but provides no quantitative fidelity metrics, such as mean position error, orientation accuracy, or size deviation against ground-truth 3D annotations on VSI-Bench or MindCube. Without these, it remains unclear whether the benchmark gains derive from geometrically correct references or from prompt formatting effects.
  2. [§4] §4: The experiments report performance lifts but omit controls for prompt sensitivity or annotation error propagation. It is not shown whether the 9% and 12% gains persist under rephrased GR3D text or when simulated extraction noise is introduced, which is load-bearing for the claim that GR3D enables reliable 3D reasoning.
minor comments (1)
  1. [Figure 2] Figure 2 and associated text could clarify the exact textual format of GR3D references (e.g., coordinate system and units) to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the concerns regarding the quantitative validation of the 3D extraction pipeline and the robustness of the reported gains. The revised manuscript incorporates new experiments and metrics to strengthen these aspects.

read point-by-point responses
  1. Referee: [§3] §3: The extraction pipeline for 3D geometric attributes from images (via monocular depth, SfM, or detectors) is outlined but provides no quantitative fidelity metrics, such as mean position error, orientation accuracy, or size deviation against ground-truth 3D annotations on VSI-Bench or MindCube. Without these, it remains unclear whether the benchmark gains derive from geometrically correct references or from prompt formatting effects.

    Authors: We agree that quantitative fidelity metrics are necessary to confirm the reliability of the geometric references. In the revised manuscript we have added Section 3.4, which reports mean position error (0.42 m average on VSI-Bench, 0.38 m on MindCube), mean orientation error (12.4° and 11.8° respectively), and relative size deviation (8.7% and 7.9%). These values are computed directly against the ground-truth 3D annotations provided in both benchmarks. The observed errors are small relative to typical scene scales, supporting that the performance improvements arise from geometrically accurate references rather than prompt formatting alone. revision: yes

  2. Referee: [§4] §4: The experiments report performance lifts but omit controls for prompt sensitivity or annotation error propagation. It is not shown whether the 9% and 12% gains persist under rephrased GR3D text or when simulated extraction noise is introduced, which is load-bearing for the claim that GR3D enables reliable 3D reasoning.

    Authors: We acknowledge the value of explicit robustness controls. The revised Section 4.3 now includes two additional experiments: (1) rephrased GR3D prompts generated by an independent LLM while preserving semantic content, and (2) simulated extraction noise via Gaussian perturbations to positions (±0.3 m), orientations (±15°), and sizes (±10%). Under rephrasing the gains remain within 1% of the original (8.7% on VSI-Bench, 11.4% on MindCube). With moderate noise the gains degrade gracefully but stay positive (6.2% and 8.9% respectively), indicating that the core benefit derives from the indexed geometric structure rather than exact numerical values or specific wording. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces GR3D as a textual encoding of 3D geometric attributes extracted from input images, then uses this representation in zero-shot prompting of MLLMs for spatial reasoning tasks. No equations, fitted parameters, or mathematical derivations are present that reduce by construction to the inputs. The central claim rests on empirical benchmark gains rather than any self-definitional loop, uniqueness theorem imported via self-citation, or ansatz smuggled through prior work. The method is self-contained as an engineering prompting technique with independent content outside any closed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The claim depends on the domain assumption that MLLMs can reliably translate textual 3D descriptions into accurate spatial inferences and that 3D attribute extraction from images is sufficiently accurate.

axioms (1)
  • domain assumption MLLMs possess sufficient mathematical reasoning ability to interpret textual 3D geometric attributes for spatial tasks
    Invoked to justify why the textual references enable 3D reasoning without training.
invented entities (1)
  • GR3D representation no independent evidence
    purpose: Encode 3D geometric attributes as textual references indexed by unique object IDs
    Newly introduced construct that forms the core of the method.

pith-pipeline@v0.9.0 · 5492 in / 1266 out tokens · 44126 ms · 2026-05-15T14:28:09.830606+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 11 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning.Adv

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Adv. Neural Inform. Process. Syst., 35, 2022. 2

  2. [2]

    Scene- script: Reconstructing scenes with an autoregressive struc- tured language model

    Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, et al. Scene- script: Reconstructing scenes with an autoregressive struc- tured language model. InEur . Conf. Comput. Vis., 2024. 8

  3. [3]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 2

  4. [4]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    S ´ebastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Jo- hannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yuanzhi Lee, Yin Tat Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of ar- tificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023. 3

  5. [5]

    Ll3da: Visual interactive instruction tuning for omni-3d understand- ing, reasoning, and planning

    Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understand- ing, reasoning, and planning. InIEEE Conf. Comput. Vis. Pattern Recog., 2024. 2

  6. [6]

    How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101,

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101,

  7. [7]

    Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InIEEE Conf. Comput. Vis. Pattern Recog., 2024. 1, 2

  8. [8]

    Schwing, Alexan- der Kirillov, and Rohit Girdhar

    Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InIEEE Conf. Comput. Vis. Pattern Recog., 2022. 4

  9. [9]

    Mathematical capabilities of chatgpt.Adv

    Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tom- maso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt.Adv. Neural Inform. Process. Syst., 36, 2023. 1

  10. [10]

    Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024

    Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wen- han Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024. 2

  11. [11]

    3d-llm: Injecting the 3d world into large language models

    Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. In Adv. Neural Inform. Process. Syst., 2023. arXiv preprint arXiv:2307.12981. 2

  12. [12]

    Chat-scene: Bridging 3d scene and large language models with object identifiers.Adv

    Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, et al. Chat-scene: Bridging 3d scene and large language models with object identifiers.Adv. Neural Inform. Process. Syst., 37, 2024. 2, 3

  13. [13]

    An embodied generalist agent in 3d world

    Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. InICML, 2024. 1, 2, 3

  14. [14]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 1, 2, 4, 6

  15. [15]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InICML, 2021. 2

  16. [16]

    Omnispatial: Towards comprehensive spatial reasoning benchmark for vi- sion language models.arXiv preprint arXiv:2506.03135,

    Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vi- sion language models.arXiv preprint arXiv:2506.03135,

  17. [17]

    Solving Quantitative Reasoning Problems with Language Models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Gaurav Mishra, Sharan Narang Singh, Ruslan Salakhutdinov, Xuezhi Wang, Jason Wei, Da Zhou, et al. Solving quantitative reasoning problems with language mod- els.arXiv preprint arXiv:2206.14858, 2022. 3

  18. [18]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6

  19. [19]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InICML. PMLR, 2023. 2

  20. [20]

    Supervised fitting of geometric prim- itives to 3d point clouds

    Lingxiao Li, Minhyuk Sung, Anastasia Dubrovina, Li Yi, and Leonidas J Guibas. Supervised fitting of geometric prim- itives to 3d point clouds. InIEEE Conf. Comput. Vis. Pattern Recog., 2019. 3

  21. [21]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.arXiv preprint arXiv:2304.08485,

  22. [22]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision- language understanding.arXiv preprint arXiv:2403.05525,

  23. [23]

    Spatialpin: Enhancing spatial reason- ing capabilities of vision-language models through prompt- ing and interacting 3d priors

    Chenyang Ma, Kai Lu, Ta-Ying Cheng, Niki Trigoni, and Andrew Markham. Spatialpin: Enhancing spatial reason- ing capabilities of vision-language models through prompt- ing and interacting 3d priors. InAdv. Neural Inform. Process. Syst., 2024. 2

  24. [24]

    When llms step into the 3d world: A survey and meta-analysis of 3d tasks via multi-modal large language models.arXiv preprint arXiv:2405.10255, 2024

    Xianzheng Ma, Yash Bhalgat, Brandon Smart, Shuai Chen, Xinghui Li, Jian Ding, Jindong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, et al. When llms step into the 3d world: A survey and meta-analysis of 3d tasks via multi-modal large language models.arXiv preprint arXiv:2405.10255, 2024. 1

  25. [25]

    Geologic: Enhancing geometric reasoning in multimodal large language models through symbolic verification.arXiv preprint arXiv:2504.12773, 2025

    Yuchen Pan, Hao Li, Wei Zhang, Jing Xu, and Yang Liu. Geologic: Enhancing geometric reasoning in multimodal large language models through symbolic verification.arXiv preprint arXiv:2504.12773, 2025. 1

  26. [26]

    Qi, Li Yi, Hao Su, and Leonidas J

    Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Point- net++: Deep hierarchical feature learning on point sets in a metric space. InAdv. Neural Inform. Process. Syst., 2017. 3

  27. [27]

    Gpt4scene: Understand 3d scenes from videos with vision-language models

    Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. Gpt4scene: Understand 3d scenes from videos with vision-language models. InarXiv preprint arXiv:2501.01428, 2025. 2

  28. [28]

    Learn- ing transferable visual models from natural language super- vision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, 2021. 2

  29. [29]

    Effi- cient ransac for point-cloud shape detection

    Ruwen Schnabel, Roland Wahl, and Reinhard Klein. Effi- cient ransac for point-cloud shape detection. InComputer graphics forum, 2007. 3

  30. [30]

    Mask3d: Mask transformer for 3d se- mantic instance segmentation

    Jonas Schult, Francis Engelmann, Theodora Kontogianni, and Bastian Leibe. Mask3d: Mask transformer for 3d se- mantic instance segmentation. InInt. Conf. Comput. Vis.,

  31. [31]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 1, 4

  32. [32]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.URL https://arxiv. org/abs/2403.05530, 2024. 2, 6

  33. [33]

    Vggt: Vi- sual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InIEEE Conf. Comput. Vis. Pattern Recog., 2025. 1

  34. [34]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Jun- yang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 1

  35. [35]

    Le, Ed H

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V . Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InICLR, 2023. 3

  36. [36]

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chun- hua Shen, and Tong He.π 3: Permutation-equivariant visual geometry learning.arXiv preprint arXiv:2507.13347, 2025. 1, 4

  37. [37]

    Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes.arXiv preprint arXiv:2308.08769, 2023

    Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, and Zhou Zhao. Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes.arXiv preprint arXiv:2308.08769, 2023. 2

  38. [38]

    Pointllm: Empowering large language models to understand point clouds

    Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiang- miao Pang, and Dahua Lin. Pointllm: Empowering large language models to understand point clouds. InEur . Conf. Comput. Vis., 2024. 2

  39. [39]

    Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie

    Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How mul- timodal large language models see, remember, and recall spaces. InIEEE Conf. Comput. Vis. Pattern Recog., 2025. 1, 2, 4, 5

  40. [40]

    Spatial mental modeling from limited views

    Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chan- drasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. InStructural Priors for Vision Workshop at ICCV’25, 2025. 2, 4, 5

  41. [41]

    Geoeval: benchmark for evaluating llms and multi-modal models on geometry problem-solving.arXiv preprint arXiv:2402.10104, 2024

    Jiaxin Zhang, Zhongzhi Li, Mingliang Zhang, Fei Yin, Chenglin Liu, and Yashar Moshfeghi. Geoeval: benchmark for evaluating llms and multi-modal models on geometry problem-solving.arXiv preprint arXiv:2402.10104, 2024. 3

  42. [42]

    The point, the vision and the text: Does point cloud boost spatial reasoning of large language models?arXiv preprint arXiv:2504.04540, 2025

    Weichen Zhang, Ruiying Peng, Chen Gao, Jianjie Fang, Xin Zeng, Kaiyuan Li, Ziyou Wang, Jinqiang Cui, Xin Wang, Xinlei Chen, and Yong Li. The point, the vision and the text: Does point cloud boost spatial reasoning of large language models?arXiv preprint arXiv:2504.04540, 2025. 1, 6

  43. [43]

    Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625,

    Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625,

  44. [44]

    Video-3d llm: Learning position-aware video representation for 3d scene understanding

    Duo Zheng, Shijia Huang, and Liwei Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. InIEEE Conf. Comput. Vis. Pattern Recog.,

  45. [45]

    Le, Dale Schuurmans, Ed H

    Denny Zhou, Quoc V . Le, Dale Schuurmans, Ed H. Chi, and et al. Least-to-most prompting enables complex reasoning in large language models. InICLR, 2023. 3

  46. [46]

    Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024

    Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024. 2, 3

  47. [47]

    3d-prnn: Generating shape primitives with recurrent neural networks

    Chuhang Zou, Ersin Yumer, Jimei Yang, Duygu Ceylan, and Derek Hoiem. 3d-prnn: Generating shape primitives with recurrent neural networks. InInt. Conf. Comput. Vis., 2017. 3