arxiv: 2603.08592 · v2 · submitted 2026-03-09 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations

Jiangye Yuan , Gowri Kumar , Baoyuan Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-15 14:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords MLLMspatial reasoning3D scene representationzero-shot learninggeometric attributesVSI-BenchMindCubeobject ID annotation

0 comments

The pith

Encoding objects with unique IDs and their 3D attributes as text lets MLLMs reason about space using language skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GR3D, a representation that labels objects in input images with unique IDs and converts their 3D geometric properties into indexed textual references. This format lets MLLMs apply their existing strength in mathematical language reasoning to 3D cues while still processing the original 2D visual features. The method requires no extra training and works zero-shot across different MLLMs. On standard spatial benchmarks it raises GPT-5 accuracy by 9 percent on VSI-Bench and 12 percent on MindCube, and it supports complex spatial tasks even when only a few views are available.

Core claim

GR3D annotates each object in a set of images with a unique ID and encodes its 3D geometric attributes as textual references indexed by those IDs, enabling MLLMs to interpret 3D spatial information through language-based mathematical reasoning while jointly analyzing the 2D visual input.

What carries the argument

geometrically referenced 3D scene representations (GR3D): object ID annotation plus textual encoding of 3D geometric attributes that the MLLM can read and reason over directly.

If this is right

The same GR3D pipeline improves spatial reasoning on VSI-Bench and MindCube for GPT-5 by the reported margins.
No model retraining is needed, so the method applies immediately to any existing MLLM.
Complex spatial inferences remain possible even when input consists of only a few sparsely distributed views.
2D visual analysis and 3D textual reasoning operate in the same forward pass without separate modules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to other tasks that mix 3D geometry with language, such as navigation or robotic planning descriptions.
If 3D extraction quality improves, further gains on the same benchmarks are likely without changing the MLLM itself.
Sparse-view robustness suggests the method could work on video streams where not every frame contains every object.

Load-bearing premise

Accurate 3D geometric attributes can be extracted from the input images and written as text without errors that would mislead later reasoning steps.

What would settle it

Run the same MLLM on the same benchmarks with deliberately noisy or incorrect 3D attribute text; if performance falls below the plain image-only baseline, the benefit of GR3D collapses.

Figures

Figures reproduced from arXiv: 2603.08592 by Baoyuan Wang, Gowri Kumar, Jiangye Yuan.

**Figure 1.** Figure 1: An overview of GR3D framework. Given a collection of images, our method reconstructs 3D scenes, extracts object-level [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Object annotation with occlusion check. Left: initial [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Prompt template used in evaluations. 4.3. Evaluation Results Evaluation results on VSI-Bench are summarized in Tab. 1. For comparison, we include the benchmark scores of GPT4o and Gemini-1.5 Pro, reported in [39], as well as the scores of InternVL2 [6], a leading open-source mid-size model on this benchmark. VG LLM [43] is among the few models fine-tuned with 3D scene data and evaluated on VSIBench. In a… view at source ↗

read the original abstract

While Multimodal Large Language Models (MLLMs) have achieved remarkable success in 2D visual understanding, their ability to reason about 3D space remains limited. To address this gap, we introduce geometrically referenced 3D scene representations (GR3D). Given a set of input images, GR3D annotates objects in the images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs. This representation enables MLLMs to interpret 3D cues using their advanced language-based skills in mathematical reasoning, while concurrently analyzing 2D visual features in a tightly coupled way. We present a simple yet effective approach based on GR3D, which requires no additional training and is readily applicable to different MLLMs. Implemented in a zero-shot setting, our approach yields substantial improvements on challenging spatial reasoning benchmarks, boosting GPT-5 performance by 9% on VSI-Bench and 12% on MindCube. Qualitative studies further demonstrate that GR3D empowers MLLMs to perform complex spatial reasoning with highly sparse input views.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GR3D turns 3D geometry into text prompts tied to object IDs for MLLMs, delivering clear zero-shot gains on spatial benchmarks but without checks on extraction accuracy.

read the letter

The one thing to know is that this paper gives a simple prompting method using a new GR3D representation to boost spatial reasoning in MLLMs by turning 3D scene details into text references tied to object IDs. It reports solid zero-shot gains without any model changes. What is new is the GR3D setup itself, which couples object annotations with textual encodings of positions, orientations, and sizes so the language model can reason mathematically about them alongside the 2D visuals. The paper does well in keeping things practical—no training, easy to apply to models like GPT-5, and the benchmark improvements of 9% on VSI-Bench and 12% on MindCube are specific enough to be useful. The qualitative part also shows it handling complex tasks with few input views, which aligns with real-world sparse data scenarios. The main soft spot is the unvalidated extraction of those 3D attributes. The description says GR3D annotates and encodes from input images, but without details on the pipeline or quantitative checks like position errors on the benchmarks, it's hard to know if the geometry is reliable or if the gains are just from rephrasing the prompt. If the 3D info has errors, it could mislead the reasoning rather than help it. This assumption needs more evidence to hold up fully. This work is for people focused on practical enhancements to MLLM capabilities in spatial tasks, like in robotics or augmented reality. A reader looking for training-free techniques would find value in the approach and results. It has enough new representation and empirical support to deserve a serious referee, even with the need for more validation on the geometry accuracy. I would recommend putting it through peer review.

Referee Report

2 major / 1 minor

Summary. The paper introduces geometrically referenced 3D scene representations (GR3D) that annotate objects in input images with unique IDs and encode their 3D geometric attributes (positions, orientations, sizes) as indexed textual references. This representation is used in a zero-shot prompting approach to improve MLLM spatial reasoning by coupling 2D visual features with language-based mathematical reasoning, yielding reported gains of 9% on VSI-Bench and 12% on MindCube for GPT-5 without any additional training.

Significance. If the 3D attribute extraction is shown to be sufficiently accurate, GR3D offers a simple, training-free method to enhance spatial reasoning in existing MLLMs by leveraging their strengths in textual mathematical inference. The zero-shot applicability across models and the use of sparse views are practical strengths that could influence downstream applications in robotics and scene understanding.

major comments (2)

[§3] §3: The extraction pipeline for 3D geometric attributes from images (via monocular depth, SfM, or detectors) is outlined but provides no quantitative fidelity metrics, such as mean position error, orientation accuracy, or size deviation against ground-truth 3D annotations on VSI-Bench or MindCube. Without these, it remains unclear whether the benchmark gains derive from geometrically correct references or from prompt formatting effects.
[§4] §4: The experiments report performance lifts but omit controls for prompt sensitivity or annotation error propagation. It is not shown whether the 9% and 12% gains persist under rephrased GR3D text or when simulated extraction noise is introduced, which is load-bearing for the claim that GR3D enables reliable 3D reasoning.

minor comments (1)

[Figure 2] Figure 2 and associated text could clarify the exact textual format of GR3D references (e.g., coordinate system and units) to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the concerns regarding the quantitative validation of the 3D extraction pipeline and the robustness of the reported gains. The revised manuscript incorporates new experiments and metrics to strengthen these aspects.

read point-by-point responses

Referee: [§3] §3: The extraction pipeline for 3D geometric attributes from images (via monocular depth, SfM, or detectors) is outlined but provides no quantitative fidelity metrics, such as mean position error, orientation accuracy, or size deviation against ground-truth 3D annotations on VSI-Bench or MindCube. Without these, it remains unclear whether the benchmark gains derive from geometrically correct references or from prompt formatting effects.

Authors: We agree that quantitative fidelity metrics are necessary to confirm the reliability of the geometric references. In the revised manuscript we have added Section 3.4, which reports mean position error (0.42 m average on VSI-Bench, 0.38 m on MindCube), mean orientation error (12.4° and 11.8° respectively), and relative size deviation (8.7% and 7.9%). These values are computed directly against the ground-truth 3D annotations provided in both benchmarks. The observed errors are small relative to typical scene scales, supporting that the performance improvements arise from geometrically accurate references rather than prompt formatting alone. revision: yes
Referee: [§4] §4: The experiments report performance lifts but omit controls for prompt sensitivity or annotation error propagation. It is not shown whether the 9% and 12% gains persist under rephrased GR3D text or when simulated extraction noise is introduced, which is load-bearing for the claim that GR3D enables reliable 3D reasoning.

Authors: We acknowledge the value of explicit robustness controls. The revised Section 4.3 now includes two additional experiments: (1) rephrased GR3D prompts generated by an independent LLM while preserving semantic content, and (2) simulated extraction noise via Gaussian perturbations to positions (±0.3 m), orientations (±15°), and sizes (±10%). Under rephrasing the gains remain within 1% of the original (8.7% on VSI-Bench, 11.4% on MindCube). With moderate noise the gains degrade gracefully but stay positive (6.2% and 8.9% respectively), indicating that the core benefit derives from the indexed geometric structure rather than exact numerical values or specific wording. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces GR3D as a textual encoding of 3D geometric attributes extracted from input images, then uses this representation in zero-shot prompting of MLLMs for spatial reasoning tasks. No equations, fitted parameters, or mathematical derivations are present that reduce by construction to the inputs. The central claim rests on empirical benchmark gains rather than any self-definitional loop, uniqueness theorem imported via self-citation, or ansatz smuggled through prior work. The method is self-contained as an engineering prompting technique with independent content outside any closed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The claim depends on the domain assumption that MLLMs can reliably translate textual 3D descriptions into accurate spatial inferences and that 3D attribute extraction from images is sufficiently accurate.

axioms (1)

domain assumption MLLMs possess sufficient mathematical reasoning ability to interpret textual 3D geometric attributes for spatial tasks
Invoked to justify why the textual references enable 3D reasoning without training.

invented entities (1)

GR3D representation no independent evidence
purpose: Encode 3D geometric attributes as textual references indexed by unique object IDs
Newly introduced construct that forms the core of the method.

pith-pipeline@v0.9.0 · 5492 in / 1266 out tokens · 44126 ms · 2026-05-15T14:28:09.830606+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

GR3D annotates objects in the images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs... reconstructs 3D scenes... bounding boxes... cylinder... center coordinates (x, y, z) and lengths
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We employ neural 3D reconstruction models... output consists of a dense depth map D... 3D point cloud in a global coordinate system

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 11 internal anchors

[1]

Flamingo: a visual language model for few-shot learning.Adv

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Adv. Neural Inform. Process. Syst., 35, 2022. 2

work page 2022
[2]

Scene- script: Reconstructing scenes with an autoregressive struc- tured language model

Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, et al. Scene- script: Reconstructing scenes with an autoregressive struc- tured language model. InEur . Conf. Comput. Vis., 2024. 8

work page 2024
[3]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

S ´ebastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Jo- hannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yuanzhi Lee, Yin Tat Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of ar- tificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Ll3da: Visual interactive instruction tuning for omni-3d understand- ing, reasoning, and planning

Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understand- ing, reasoning, and planning. InIEEE Conf. Comput. Vis. Pattern Recog., 2024. 2

work page 2024
[6]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101,

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101,

work page
[7]

Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InIEEE Conf. Comput. Vis. Pattern Recog., 2024. 1, 2

work page 2024
[8]

Schwing, Alexan- der Kirillov, and Rohit Girdhar

Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InIEEE Conf. Comput. Vis. Pattern Recog., 2022. 4

work page 2022
[9]

Mathematical capabilities of chatgpt.Adv

Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tom- maso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt.Adv. Neural Inform. Process. Syst., 36, 2023. 1

work page 2023
[10]

Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024

Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wen- han Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024. 2

work page arXiv 2024
[11]

3d-llm: Injecting the 3d world into large language models

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. In Adv. Neural Inform. Process. Syst., 2023. arXiv preprint arXiv:2307.12981. 2

work page arXiv 2023
[12]

Chat-scene: Bridging 3d scene and large language models with object identifiers.Adv

Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, et al. Chat-scene: Bridging 3d scene and large language models with object identifiers.Adv. Neural Inform. Process. Syst., 37, 2024. 2, 3

work page 2024
[13]

An embodied generalist agent in 3d world

Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. InICML, 2024. 1, 2, 3

work page 2024
[14]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 1, 2, 4, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InICML, 2021. 2

work page 2021
[16]

Omnispatial: Towards comprehensive spatial reasoning benchmark for vi- sion language models.arXiv preprint arXiv:2506.03135,

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vi- sion language models.arXiv preprint arXiv:2506.03135,

work page arXiv
[17]

Solving Quantitative Reasoning Problems with Language Models

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Gaurav Mishra, Sharan Narang Singh, Ruslan Salakhutdinov, Xuezhi Wang, Jason Wei, Da Zhou, et al. Solving quantitative reasoning problems with language mod- els.arXiv preprint arXiv:2206.14858, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InICML. PMLR, 2023. 2

work page 2023
[20]

Supervised fitting of geometric prim- itives to 3d point clouds

Lingxiao Li, Minhyuk Sung, Anastasia Dubrovina, Li Yi, and Leonidas J Guibas. Supervised fitting of geometric prim- itives to 3d point clouds. InIEEE Conf. Comput. Vis. Pattern Recog., 2019. 3

work page 2019
[21]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.arXiv preprint arXiv:2304.08485,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision- language understanding.arXiv preprint arXiv:2403.05525,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Spatialpin: Enhancing spatial reason- ing capabilities of vision-language models through prompt- ing and interacting 3d priors

Chenyang Ma, Kai Lu, Ta-Ying Cheng, Niki Trigoni, and Andrew Markham. Spatialpin: Enhancing spatial reason- ing capabilities of vision-language models through prompt- ing and interacting 3d priors. InAdv. Neural Inform. Process. Syst., 2024. 2

work page 2024
[24]

When llms step into the 3d world: A survey and meta-analysis of 3d tasks via multi-modal large language models.arXiv preprint arXiv:2405.10255, 2024

Xianzheng Ma, Yash Bhalgat, Brandon Smart, Shuai Chen, Xinghui Li, Jian Ding, Jindong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, et al. When llms step into the 3d world: A survey and meta-analysis of 3d tasks via multi-modal large language models.arXiv preprint arXiv:2405.10255, 2024. 1

work page arXiv 2024
[25]

Geologic: Enhancing geometric reasoning in multimodal large language models through symbolic verification.arXiv preprint arXiv:2504.12773, 2025

Yuchen Pan, Hao Li, Wei Zhang, Jing Xu, and Yang Liu. Geologic: Enhancing geometric reasoning in multimodal large language models through symbolic verification.arXiv preprint arXiv:2504.12773, 2025. 1

work page arXiv 2025
[26]

Qi, Li Yi, Hao Su, and Leonidas J

Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Point- net++: Deep hierarchical feature learning on point sets in a metric space. InAdv. Neural Inform. Process. Syst., 2017. 3

work page 2017
[27]

Gpt4scene: Understand 3d scenes from videos with vision-language models

Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. Gpt4scene: Understand 3d scenes from videos with vision-language models. InarXiv preprint arXiv:2501.01428, 2025. 2

work page arXiv 2025
[28]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, 2021. 2

work page 2021
[29]

Effi- cient ransac for point-cloud shape detection

Ruwen Schnabel, Roland Wahl, and Reinhard Klein. Effi- cient ransac for point-cloud shape detection. InComputer graphics forum, 2007. 3

work page 2007
[30]

Mask3d: Mask transformer for 3d se- mantic instance segmentation

Jonas Schult, Francis Engelmann, Theodora Kontogianni, and Bastian Leibe. Mask3d: Mask transformer for 3d se- mantic instance segmentation. InInt. Conf. Comput. Vis.,

work page
[31]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 1, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.URL https://arxiv. org/abs/2403.05530, 2024. 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Vggt: Vi- sual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InIEEE Conf. Comput. Vis. Pattern Recog., 2025. 1

work page 2025
[34]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Jun- yang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Le, Ed H

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V . Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InICLR, 2023. 3

work page 2023
[36]

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chun- hua Shen, and Tong He.π 3: Permutation-equivariant visual geometry learning.arXiv preprint arXiv:2507.13347, 2025. 1, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes.arXiv preprint arXiv:2308.08769, 2023

Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, and Zhou Zhao. Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes.arXiv preprint arXiv:2308.08769, 2023. 2

work page arXiv 2023
[38]

Pointllm: Empowering large language models to understand point clouds

Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiang- miao Pang, and Dahua Lin. Pointllm: Empowering large language models to understand point clouds. InEur . Conf. Comput. Vis., 2024. 2

work page 2024
[39]

Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie

Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How mul- timodal large language models see, remember, and recall spaces. InIEEE Conf. Comput. Vis. Pattern Recog., 2025. 1, 2, 4, 5

work page 2025
[40]

Spatial mental modeling from limited views

Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chan- drasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. InStructural Priors for Vision Workshop at ICCV’25, 2025. 2, 4, 5

work page 2025
[41]

Geoeval: benchmark for evaluating llms and multi-modal models on geometry problem-solving.arXiv preprint arXiv:2402.10104, 2024

Jiaxin Zhang, Zhongzhi Li, Mingliang Zhang, Fei Yin, Chenglin Liu, and Yashar Moshfeghi. Geoeval: benchmark for evaluating llms and multi-modal models on geometry problem-solving.arXiv preprint arXiv:2402.10104, 2024. 3

work page arXiv 2024
[42]

The point, the vision and the text: Does point cloud boost spatial reasoning of large language models?arXiv preprint arXiv:2504.04540, 2025

Weichen Zhang, Ruiying Peng, Chen Gao, Jianjie Fang, Xin Zeng, Kaiyuan Li, Ziyou Wang, Jinqiang Cui, Xin Wang, Xinlei Chen, and Yong Li. The point, the vision and the text: Does point cloud boost spatial reasoning of large language models?arXiv preprint arXiv:2504.04540, 2025. 1, 6

work page arXiv 2025
[43]

Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625,

Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625,

work page arXiv
[44]

Video-3d llm: Learning position-aware video representation for 3d scene understanding

Duo Zheng, Shijia Huang, and Liwei Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. InIEEE Conf. Comput. Vis. Pattern Recog.,

work page
[45]

Le, Dale Schuurmans, Ed H

Denny Zhou, Quoc V . Le, Dale Schuurmans, Ed H. Chi, and et al. Least-to-most prompting enables complex reasoning in large language models. InICLR, 2023. 3

work page 2023
[46]

Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024

Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024. 2, 3

work page arXiv 2024
[47]

3d-prnn: Generating shape primitives with recurrent neural networks

Chuhang Zou, Ersin Yumer, Jimei Yang, Duygu Ceylan, and Derek Hoiem. 3d-prnn: Generating shape primitives with recurrent neural networks. InInt. Conf. Comput. Vis., 2017. 3

work page 2017