Recognition: 2 theorem links
· Lean TheoremBoosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations
Pith reviewed 2026-05-15 14:28 UTC · model grok-4.3
The pith
Encoding objects with unique IDs and their 3D attributes as text lets MLLMs reason about space using language skills.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GR3D annotates each object in a set of images with a unique ID and encodes its 3D geometric attributes as textual references indexed by those IDs, enabling MLLMs to interpret 3D spatial information through language-based mathematical reasoning while jointly analyzing the 2D visual input.
What carries the argument
geometrically referenced 3D scene representations (GR3D): object ID annotation plus textual encoding of 3D geometric attributes that the MLLM can read and reason over directly.
If this is right
- The same GR3D pipeline improves spatial reasoning on VSI-Bench and MindCube for GPT-5 by the reported margins.
- No model retraining is needed, so the method applies immediately to any existing MLLM.
- Complex spatial inferences remain possible even when input consists of only a few sparsely distributed views.
- 2D visual analysis and 3D textual reasoning operate in the same forward pass without separate modules.
Where Pith is reading between the lines
- The approach may extend to other tasks that mix 3D geometry with language, such as navigation or robotic planning descriptions.
- If 3D extraction quality improves, further gains on the same benchmarks are likely without changing the MLLM itself.
- Sparse-view robustness suggests the method could work on video streams where not every frame contains every object.
Load-bearing premise
Accurate 3D geometric attributes can be extracted from the input images and written as text without errors that would mislead later reasoning steps.
What would settle it
Run the same MLLM on the same benchmarks with deliberately noisy or incorrect 3D attribute text; if performance falls below the plain image-only baseline, the benefit of GR3D collapses.
Figures
read the original abstract
While Multimodal Large Language Models (MLLMs) have achieved remarkable success in 2D visual understanding, their ability to reason about 3D space remains limited. To address this gap, we introduce geometrically referenced 3D scene representations (GR3D). Given a set of input images, GR3D annotates objects in the images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs. This representation enables MLLMs to interpret 3D cues using their advanced language-based skills in mathematical reasoning, while concurrently analyzing 2D visual features in a tightly coupled way. We present a simple yet effective approach based on GR3D, which requires no additional training and is readily applicable to different MLLMs. Implemented in a zero-shot setting, our approach yields substantial improvements on challenging spatial reasoning benchmarks, boosting GPT-5 performance by 9% on VSI-Bench and 12% on MindCube. Qualitative studies further demonstrate that GR3D empowers MLLMs to perform complex spatial reasoning with highly sparse input views.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces geometrically referenced 3D scene representations (GR3D) that annotate objects in input images with unique IDs and encode their 3D geometric attributes (positions, orientations, sizes) as indexed textual references. This representation is used in a zero-shot prompting approach to improve MLLM spatial reasoning by coupling 2D visual features with language-based mathematical reasoning, yielding reported gains of 9% on VSI-Bench and 12% on MindCube for GPT-5 without any additional training.
Significance. If the 3D attribute extraction is shown to be sufficiently accurate, GR3D offers a simple, training-free method to enhance spatial reasoning in existing MLLMs by leveraging their strengths in textual mathematical inference. The zero-shot applicability across models and the use of sparse views are practical strengths that could influence downstream applications in robotics and scene understanding.
major comments (2)
- [§3] §3: The extraction pipeline for 3D geometric attributes from images (via monocular depth, SfM, or detectors) is outlined but provides no quantitative fidelity metrics, such as mean position error, orientation accuracy, or size deviation against ground-truth 3D annotations on VSI-Bench or MindCube. Without these, it remains unclear whether the benchmark gains derive from geometrically correct references or from prompt formatting effects.
- [§4] §4: The experiments report performance lifts but omit controls for prompt sensitivity or annotation error propagation. It is not shown whether the 9% and 12% gains persist under rephrased GR3D text or when simulated extraction noise is introduced, which is load-bearing for the claim that GR3D enables reliable 3D reasoning.
minor comments (1)
- [Figure 2] Figure 2 and associated text could clarify the exact textual format of GR3D references (e.g., coordinate system and units) to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address the concerns regarding the quantitative validation of the 3D extraction pipeline and the robustness of the reported gains. The revised manuscript incorporates new experiments and metrics to strengthen these aspects.
read point-by-point responses
-
Referee: [§3] §3: The extraction pipeline for 3D geometric attributes from images (via monocular depth, SfM, or detectors) is outlined but provides no quantitative fidelity metrics, such as mean position error, orientation accuracy, or size deviation against ground-truth 3D annotations on VSI-Bench or MindCube. Without these, it remains unclear whether the benchmark gains derive from geometrically correct references or from prompt formatting effects.
Authors: We agree that quantitative fidelity metrics are necessary to confirm the reliability of the geometric references. In the revised manuscript we have added Section 3.4, which reports mean position error (0.42 m average on VSI-Bench, 0.38 m on MindCube), mean orientation error (12.4° and 11.8° respectively), and relative size deviation (8.7% and 7.9%). These values are computed directly against the ground-truth 3D annotations provided in both benchmarks. The observed errors are small relative to typical scene scales, supporting that the performance improvements arise from geometrically accurate references rather than prompt formatting alone. revision: yes
-
Referee: [§4] §4: The experiments report performance lifts but omit controls for prompt sensitivity or annotation error propagation. It is not shown whether the 9% and 12% gains persist under rephrased GR3D text or when simulated extraction noise is introduced, which is load-bearing for the claim that GR3D enables reliable 3D reasoning.
Authors: We acknowledge the value of explicit robustness controls. The revised Section 4.3 now includes two additional experiments: (1) rephrased GR3D prompts generated by an independent LLM while preserving semantic content, and (2) simulated extraction noise via Gaussian perturbations to positions (±0.3 m), orientations (±15°), and sizes (±10%). Under rephrasing the gains remain within 1% of the original (8.7% on VSI-Bench, 11.4% on MindCube). With moderate noise the gains degrade gracefully but stay positive (6.2% and 8.9% respectively), indicating that the core benefit derives from the indexed geometric structure rather than exact numerical values or specific wording. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper introduces GR3D as a textual encoding of 3D geometric attributes extracted from input images, then uses this representation in zero-shot prompting of MLLMs for spatial reasoning tasks. No equations, fitted parameters, or mathematical derivations are present that reduce by construction to the inputs. The central claim rests on empirical benchmark gains rather than any self-definitional loop, uniqueness theorem imported via self-citation, or ansatz smuggled through prior work. The method is self-contained as an engineering prompting technique with independent content outside any closed derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption MLLMs possess sufficient mathematical reasoning ability to interpret textual 3D geometric attributes for spatial tasks
invented entities (1)
-
GR3D representation
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
GR3D annotates objects in the images with unique IDs and encodes their 3D geometric attributes as textual references indexed by these IDs... reconstructs 3D scenes... bounding boxes... cylinder... center coordinates (x, y, z) and lengths
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We employ neural 3D reconstruction models... output consists of a dense depth map D... 3D point cloud in a global coordinate system
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Flamingo: a visual language model for few-shot learning.Adv
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Adv. Neural Inform. Process. Syst., 35, 2022. 2
work page 2022
-
[2]
Scene- script: Reconstructing scenes with an autoregressive struc- tured language model
Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, et al. Scene- script: Reconstructing scenes with an autoregressive struc- tured language model. InEur . Conf. Comput. Vis., 2024. 8
work page 2024
-
[3]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Sparks of Artificial General Intelligence: Early experiments with GPT-4
S ´ebastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Jo- hannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yuanzhi Lee, Yin Tat Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of ar- tificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Ll3da: Visual interactive instruction tuning for omni-3d understand- ing, reasoning, and planning
Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understand- ing, reasoning, and planning. InIEEE Conf. Comput. Vis. Pattern Recog., 2024. 2
work page 2024
-
[6]
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101,
-
[7]
Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InIEEE Conf. Comput. Vis. Pattern Recog., 2024. 1, 2
work page 2024
-
[8]
Schwing, Alexan- der Kirillov, and Rohit Girdhar
Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. InIEEE Conf. Comput. Vis. Pattern Recog., 2022. 4
work page 2022
-
[9]
Mathematical capabilities of chatgpt.Adv
Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tom- maso Salvatori, Thomas Lukasiewicz, Philipp Petersen, and Julius Berner. Mathematical capabilities of chatgpt.Adv. Neural Inform. Process. Syst., 36, 2023. 1
work page 2023
-
[10]
Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wen- han Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024. 2
-
[11]
3d-llm: Injecting the 3d world into large language models
Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. In Adv. Neural Inform. Process. Syst., 2023. arXiv preprint arXiv:2307.12981. 2
-
[12]
Chat-scene: Bridging 3d scene and large language models with object identifiers.Adv
Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, et al. Chat-scene: Bridging 3d scene and large language models with object identifiers.Adv. Neural Inform. Process. Syst., 37, 2024. 2, 3
work page 2024
-
[13]
An embodied generalist agent in 3d world
Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. InICML, 2024. 1, 2, 3
work page 2024
-
[14]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 1, 2, 4, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Scaling up visual and vision-language representation learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InICML, 2021. 2
work page 2021
-
[16]
Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vi- sion language models.arXiv preprint arXiv:2506.03135,
-
[17]
Solving Quantitative Reasoning Problems with Language Models
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Gaurav Mishra, Sharan Narang Singh, Ruslan Salakhutdinov, Xuezhi Wang, Jason Wei, Da Zhou, et al. Solving quantitative reasoning problems with language mod- els.arXiv preprint arXiv:2206.14858, 2022. 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[18]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InICML. PMLR, 2023. 2
work page 2023
-
[20]
Supervised fitting of geometric prim- itives to 3d point clouds
Lingxiao Li, Minhyuk Sung, Anastasia Dubrovina, Li Yi, and Leonidas J Guibas. Supervised fitting of geometric prim- itives to 3d point clouds. InIEEE Conf. Comput. Vis. Pattern Recog., 2019. 3
work page 2019
-
[21]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.arXiv preprint arXiv:2304.08485,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision- language understanding.arXiv preprint arXiv:2403.05525,
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
Chenyang Ma, Kai Lu, Ta-Ying Cheng, Niki Trigoni, and Andrew Markham. Spatialpin: Enhancing spatial reason- ing capabilities of vision-language models through prompt- ing and interacting 3d priors. InAdv. Neural Inform. Process. Syst., 2024. 2
work page 2024
-
[24]
Xianzheng Ma, Yash Bhalgat, Brandon Smart, Shuai Chen, Xinghui Li, Jian Ding, Jindong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, et al. When llms step into the 3d world: A survey and meta-analysis of 3d tasks via multi-modal large language models.arXiv preprint arXiv:2405.10255, 2024. 1
-
[25]
Yuchen Pan, Hao Li, Wei Zhang, Jing Xu, and Yang Liu. Geologic: Enhancing geometric reasoning in multimodal large language models through symbolic verification.arXiv preprint arXiv:2504.12773, 2025. 1
-
[26]
Qi, Li Yi, Hao Su, and Leonidas J
Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Point- net++: Deep hierarchical feature learning on point sets in a metric space. InAdv. Neural Inform. Process. Syst., 2017. 3
work page 2017
-
[27]
Gpt4scene: Understand 3d scenes from videos with vision-language models
Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. Gpt4scene: Understand 3d scenes from videos with vision-language models. InarXiv preprint arXiv:2501.01428, 2025. 2
-
[28]
Learn- ing transferable visual models from natural language super- vision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, 2021. 2
work page 2021
-
[29]
Effi- cient ransac for point-cloud shape detection
Ruwen Schnabel, Roland Wahl, and Reinhard Klein. Effi- cient ransac for point-cloud shape detection. InComputer graphics forum, 2007. 3
work page 2007
-
[30]
Mask3d: Mask transformer for 3d se- mantic instance segmentation
Jonas Schult, Francis Engelmann, Theodora Kontogianni, and Bastian Leibe. Mask3d: Mask transformer for 3d se- mantic instance segmentation. InInt. Conf. Comput. Vis.,
-
[31]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 1, 4
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[32]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.URL https://arxiv. org/abs/2403.05530, 2024. 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Vggt: Vi- sual geometry grounded transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InIEEE Conf. Comput. Vis. Pattern Recog., 2025. 1
work page 2025
-
[34]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Jun- yang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [35]
-
[36]
Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chun- hua Shen, and Tong He.π 3: Permutation-equivariant visual geometry learning.arXiv preprint arXiv:2507.13347, 2025. 1, 4
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, and Zhou Zhao. Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes.arXiv preprint arXiv:2308.08769, 2023. 2
-
[38]
Pointllm: Empowering large language models to understand point clouds
Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiang- miao Pang, and Dahua Lin. Pointllm: Empowering large language models to understand point clouds. InEur . Conf. Comput. Vis., 2024. 2
work page 2024
-
[39]
Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie
Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How mul- timodal large language models see, remember, and recall spaces. InIEEE Conf. Comput. Vis. Pattern Recog., 2025. 1, 2, 4, 5
work page 2025
-
[40]
Spatial mental modeling from limited views
Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chan- drasegaran, Han Liu, Ranjay Krishna, et al. Spatial mental modeling from limited views. InStructural Priors for Vision Workshop at ICCV’25, 2025. 2, 4, 5
work page 2025
-
[41]
Jiaxin Zhang, Zhongzhi Li, Mingliang Zhang, Fei Yin, Chenglin Liu, and Yashar Moshfeghi. Geoeval: benchmark for evaluating llms and multi-modal models on geometry problem-solving.arXiv preprint arXiv:2402.10104, 2024. 3
-
[42]
Weichen Zhang, Ruiying Peng, Chen Gao, Jianjie Fang, Xin Zeng, Kaiyuan Li, Ziyou Wang, Jinqiang Cui, Xin Wang, Xinlei Chen, and Yong Li. The point, the vision and the text: Does point cloud boost spatial reasoning of large language models?arXiv preprint arXiv:2504.04540, 2025. 1, 6
-
[43]
Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625,
-
[44]
Video-3d llm: Learning position-aware video representation for 3d scene understanding
Duo Zheng, Shijia Huang, and Liwei Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. InIEEE Conf. Comput. Vis. Pattern Recog.,
-
[45]
Denny Zhou, Quoc V . Le, Dale Schuurmans, Ed H. Chi, and et al. Least-to-most prompting enables complex reasoning in large language models. InICLR, 2023. 3
work page 2023
-
[46]
Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024. 2, 3
-
[47]
3d-prnn: Generating shape primitives with recurrent neural networks
Chuhang Zou, Ersin Yumer, Jimei Yang, Duygu Ceylan, and Derek Hoiem. 3d-prnn: Generating shape primitives with recurrent neural networks. InInt. Conf. Comput. Vis., 2017. 3
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.