arxiv: 2511.16567 · v3 · submitted 2025-11-20 · 💻 cs.CV

POMA-3D: The Point Map Way to 3D Scene Understanding

Pith reviewed 2026-05-17 20:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords point maps · 3D scene understanding · self-supervised learning · 3D representations · embodied navigation · scene retrieval · geometric inputs · view-to-scene alignment

The pith

POMA-3D learns self-supervised 3D scene representations from point maps encoding explicit 3D coordinates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents POMA-3D, the first self-supervised model for 3D representations trained on point maps. Point maps put 3D coordinates on a structured 2D grid to maintain global geometry and work with 2D model inputs. A view-to-scene alignment strategy moves rich priors from 2D foundation models into the 3D features. POMA-JEPA is added to keep features geometrically consistent across multiple views of the same scene. A large new dataset called ScenePoint is built to support pretraining, and the resulting model improves performance on 3D question answering, navigation, retrieval, and localization using only geometric data.

Core claim

POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding by learning from point maps that preserve global 3D geometry while remaining compatible with 2D foundation models, using view-to-scene alignment and a joint embedding-predictive architecture to enforce consistency across views, all while relying solely on geometric inputs for diverse tasks.

What carries the argument

Point maps that encode explicit 3D coordinates on a structured 2D grid, combined with a view-to-scene alignment strategy to transfer 2D priors and POMA-JEPA for multi-view consistency.
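
As an illustration of what "3D coordinates on a structured 2D grid" can mean in practice, the following minimal sketch back-projects a depth map into a point map. It assumes a pinhole camera and a known camera-to-scene pose; the function name `depth_to_point_map` and its parameters are hypothetical and are not taken from the paper, whose own construction pipeline may differ.

```python
def depth_to_point_map(depth, fx, fy, cx, cy, cam_to_scene):
    """Back-project an H x W depth map into an H x W grid of 3D points.

    depth:        H x W nested list of depths along the camera z-axis.
    fx, fy, cx, cy: pinhole intrinsics (assumed camera model).
    cam_to_scene: 4 x 4 row-major rigid transform into a shared scene
                  frame (hypothetical input, standing in for the
                  canonical space mentioned in the abstract).
    """
    point_map = []
    for v, row in enumerate(depth):
        out_row = []
        for u, z in enumerate(row):
            # Pinhole back-projection of pixel (u, v) into the camera frame.
            p_cam = ((u - cx) / fx * z, (v - cy) / fy * z, z, 1.0)
            # Rigid transform into the shared scene frame.
            p_scene = tuple(
                sum(cam_to_scene[i][k] * p_cam[k] for k in range(4))
                for i in range(3)
            )
            out_row.append(p_scene)
        point_map.append(out_row)
    return point_map


# Usage: a 2 x 2 depth map seen by a camera shifted 1 unit along x.
pose = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
pm = depth_to_point_map([[2.0, 2.0], [2.0, 2.0]], 1.0, 1.0, 0.5, 0.5, pose)
```

The output keeps the image layout (one 3D point per pixel), which is what lets a 2D-style encoder consume it while every entry still carries scene-frame coordinates.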

If this is right

  • Benefits 3D question answering tasks with only 3D coordinate inputs.
  • Improves performance in embodied navigation and localization.
  • Enhances scene retrieval using the learned geometric representations.
  • Addresses scarcity of pretrained priors in 3D representation learning.
  • Supports both specialist and generalist 3D understanding models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Point maps could scale to larger datasets, which may improve generalization in real-world robotics applications.
  • The approach might extend to dynamic scenes if temporal consistency is added to the architecture.
  • Integration with language models could create multimodal 3D understanding systems beyond pure geometry.
  • Testing on outdoor or large-scale environments would reveal how well the global geometry preservation holds outside indoor rooms.

Load-bearing premise

That point maps preserve global 3D geometry sufficiently well and that the view-to-scene alignment strategy transfers rich 2D priors into 3D representations without major distortion or loss of information.

What would settle it

A controlled comparison on a 3D QA benchmark in which POMA-3D, given only point map inputs, shows no improvement over a standard point cloud model.

Figures

Figures reproduced from arXiv: 2511.16567 by Junpeng Jing, Krystian Mikolajczyk, Ranran Huang, Weixun Luo, Ye Mao.

Figure 1. Overview of POMA-3D. POMA-3D is a self-supervised 3D model pretrained on the large-scale point map dataset ScenePoint via alignment with 2D foundation models and the POMA-JEPA objective. The 3D features from pretrained POMA-3D transfer effectively to diverse 3D understanding tasks, including 3D visual question answering, embodied navigation, scene retrieval, and embodied localization.
Figure 2. Overview of the POMA-3D pretraining. POMA-3D is pretrained with two objectives: (1) aligning [CLS] embeddings from the point map context encoder with image and text embeddings from the frozen FG-CLIP using L_view and L_scene, and (2) reconstructing masked point map embeddings from the target encoder using unmasked embeddings from the context encoder via a predictor optimized by L_pjepa. The target encoder is …
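
The caption names the alignment losses but this page does not give their form. As a hedged illustration only, the sketch below shows a symmetric-free, one-directional InfoNCE loss of the kind commonly used with CLIP-style embeddings, where row i of the point map embeddings should match row i of the frozen image or text embeddings. Treating L_view or L_scene as InfoNCE is an assumption, and `info_nce` and its `temperature` default are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy in which queries[i] should match keys[i].

    queries: point map [CLS] embeddings (trainable side).
    keys:    embeddings from a frozen 2D model (assumed fixed targets).
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


# Usage: matched pairs give a much smaller loss than shuffled pairs.
emb = [[1.0, 0.0], [0.0, 1.0]]
matched = info_nce(emb, emb)
shuffled = info_nce(emb, emb[::-1])
```

Because only the query side would receive gradients in such a setup, the frozen 2D embeddings act as fixed targets that the point map encoder is pulled toward.
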
Figure 3. Qualitative scene retrieval results. Top-4 candidates from each method are shown. For the given query, only POMA-3D retrieves the unique ground-truth scene, while others fail to return bookshelf-containing scenes. Green boxes mark bookshelves.
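
The top-4 ranking described in this caption can be illustrated with a minimal nearest-neighbour sketch. It assumes that the query text and each scene have already been embedded into a shared space and that candidates are ranked by cosine similarity; both the ranking rule and the names `cosine` and `top_k_scenes` are assumptions for illustration, not the paper's evaluation code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_scenes(query_embedding, scene_embeddings, k=4):
    """Return the ids of the k scenes most similar to the query.

    scene_embeddings: dict mapping a scene id to its embedding
                      (hypothetical precomputed features).
    """
    scored = [
        (cosine(query_embedding, emb), scene_id)
        for scene_id, emb in scene_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [scene_id for _, scene_id in scored[:k]]


# Usage with three toy scenes and a two-dimensional embedding space.
scenes = {"a": [1.0, 0.1], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
ranked = top_k_scenes([1.0, 0.0], scenes, k=2)
```
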
Figure 4. Qualitative embodied localization results. Top: text describing the agent's current situation. Bottom: merged multi-view point maps, where red regions indicate the point map views retrieved by POMA-3D based on the text.
… point cloud–based VLL model SceneVerse by 6.2% on SQA3D and 4.5% on Hypo3D. Even when compared to large 3D LLMs such as LEO and LLaVA-3D, POMA-3D_spec still achieves noticeable improvements…
Figure 5. Comparison of POMA-3D and its baseline FG-CLIP.
Abstract

In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces POMA-3D, a self-supervised 3D representation model trained on point maps that encode explicit 3D coordinates on a structured 2D grid. It proposes a view-to-scene alignment strategy to inject 2D foundation-model priors and POMA-JEPA, a joint embedding-predictive architecture, to enforce cross-view geometric consistency. A new ScenePoint dataset is constructed from 6.5K room-level RGB-D scenes and 1M images for large-scale pretraining. The central claim is that POMA-3D serves as a strong backbone for both specialist and generalist 3D tasks (3D question answering, embodied navigation, scene retrieval, embodied localization) when supplied only with geometric inputs at inference.

Significance. If the empirical claims hold, the work offers a novel route to 3D representation learning that leverages the geometric fidelity of point maps while reusing 2D priors, directly addressing data scarcity. The introduction of the ScenePoint dataset and the explicit use of only 3D coordinates at test time are concrete strengths that could be adopted by the community.

major comments (3)
  1. [Abstract] Abstract: the stated performance benefits on 3D QA, navigation, retrieval and localization are presented without any quantitative numbers, baselines, or ablation results, so the load-bearing claim that POMA-3D is a 'strong backbone' cannot yet be evaluated.
  2. [View-to-scene alignment] View-to-scene alignment section: the manuscript provides no alignment-error metric, cross-view feature variance, or ablation that removes the alignment step, leaving open the possibility that the transfer of 2D priors distorts the canonical 3D coordinate grid.
  3. [POMA-JEPA and Experiments] POMA-JEPA description and experiments: the joint embedding-predictive loss is asserted to enforce geometric consistency, yet no quantitative verification (e.g., canonical-coordinate variance across views or ablation of the predictive term) is supplied to confirm that the regularization is strong enough to support the downstream gains.
minor comments (2)
  1. [Method] The notation for point maps versus canonical coordinates should be introduced with a single equation early in the method section to improve readability.
  2. [Figures] Figure captions for the alignment diagram should explicitly label the 2D-to-3D mapping operation.
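
The "cross-view feature variance" requested in the major comments could be computed along the following lines. This is a minimal sketch that assumes point correspondences across views are already known, so that `features[v][n]` is the feature of the same 3D point n as seen from view v; the function name and the choice of population variance are illustrative assumptions, not a metric defined in the manuscript.

```python
def cross_view_variance(features):
    """Mean per-dimension variance of matched point features across views.

    features[v][n] is the feature vector of point n in view v. A value
    of 0.0 means every view assigns identical features to each point.
    """
    num_views = len(features)
    num_points = len(features[0])
    dim = len(features[0][0])
    total = 0.0
    for n in range(num_points):
        for d in range(dim):
            values = [features[v][n][d] for v in range(num_views)]
            mean = sum(values) / num_views
            # Population variance of this feature dimension across views.
            total += sum((x - mean) ** 2 for x in values) / num_views
    return total / (num_points * dim)


# Usage: two views, one shared point, two feature dimensions.
consistent = cross_view_variance([[[1.0, 2.0]], [[1.0, 2.0]]])
inconsistent = cross_view_variance([[[0.0, 0.0]], [[2.0, 2.0]]])
```

Reporting this number with and without the consistency objective would be one way to run the ablation the referee asks for.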

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive and detailed review of our manuscript introducing POMA-3D. We appreciate the referee's recognition of the novelty in leveraging point maps for self-supervised 3D representation learning, the view-to-scene alignment strategy, the POMA-JEPA architecture, and the introduction of the ScenePoint dataset. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation of our empirical claims without altering the core contributions.

  1. Referee: [Abstract] Abstract: the stated performance benefits on 3D QA, navigation, retrieval and localization are presented without any quantitative numbers, baselines, or ablation results, so the load-bearing claim that POMA-3D is a 'strong backbone' cannot yet be evaluated.

    Authors: We agree that the abstract would be strengthened by including key quantitative highlights to support the claim that POMA-3D serves as a strong backbone. In the revised manuscript, we will update the abstract to report specific performance gains (e.g., relative improvements on 3D question answering, navigation success rates, retrieval mAP, and localization accuracy) with references to the corresponding tables and baselines in the experiments section. This will enable readers to directly assess the empirical strength of the results. revision: yes

  2. Referee: [View-to-scene alignment] View-to-scene alignment section: the manuscript provides no alignment-error metric, cross-view feature variance, or ablation that removes the alignment step, leaving open the possibility that the transfer of 2D priors distorts the canonical 3D coordinate grid.

    Authors: We acknowledge the value of additional quantitative validation for the view-to-scene alignment. We will add an alignment-error metric (e.g., average L2 distance between projected 2D features and canonical 3D coordinates) and report cross-view feature variance to demonstrate preservation of geometric structure. We will also include an ablation study that removes the alignment step and measures its impact on downstream tasks, which will directly address concerns about potential distortion of the 3D coordinate grid. revision: yes

  3. Referee: [POMA-JEPA and Experiments] POMA-JEPA description and experiments: the joint embedding-predictive loss is asserted to enforce geometric consistency, yet no quantitative verification (e.g., canonical-coordinate variance across views or ablation of the predictive term) is supplied to confirm that the regularization is strong enough to support the downstream gains.

    Authors: We agree that explicit quantitative verification would better substantiate the role of the joint embedding-predictive loss. In the revision, we will report canonical-coordinate variance across multiple views (before and after the predictive term) and include an ablation that isolates the predictive component, showing its contribution to geometric consistency and to the observed gains on downstream tasks such as navigation and scene retrieval. These additions will confirm that the regularization is effective. revision: yes

Circularity Check

0 steps flagged

No circularity in POMA-3D derivation chain

The paper introduces point maps as input, a view-to-scene alignment strategy, and POMA-JEPA as novel extensions of JEPA-style self-supervised learning. These are architectural choices evaluated empirically on the new ScenePoint dataset and downstream tasks (3D QA, navigation, etc.) using only geometric inputs. No load-bearing step reduces a claimed prediction or result to a fitted parameter, self-definition, or self-citation chain by construction. The central claims rest on experimental outcomes rather than equations that equate outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that point maps preserve global 3D geometry and on the effectiveness of the newly introduced alignment and consistency mechanisms; no explicit free parameters or invented physical entities are stated in the abstract.

axioms (1)
  • domain assumption: Point maps encode explicit 3D coordinates on a structured 2D grid while preserving global 3D geometry and remaining compatible with 2D foundation model inputs.
    Directly stated in the abstract as the foundation for transferring 2D priors.
invented entities (1)
  • POMA-JEPA (no independent evidence)
    purpose: Joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views.
    New architecture introduced to address view-dependence of point maps.
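
For readers unfamiliar with joint embedding-predictive objectives, the sketch below illustrates the general idea of predicting target embeddings at masked positions and scoring the prediction in embedding space. The squared-error form, the function name `masked_prediction_loss`, and the token-level mask are assumptions for illustration; the exact form of the paper's loss is not given on this page.

```python
def masked_prediction_loss(predicted, target, mask):
    """Mean squared error over masked token positions only.

    predicted: predictor outputs, one embedding per token position.
    target:    target-encoder embeddings for the same positions.
    mask:      booleans, True where the token was hidden from the
               context encoder and therefore has to be predicted.
    """
    total, count = 0.0, 0
    for pred_vec, target_vec, is_masked in zip(predicted, target, mask):
        if not is_masked:
            # Visible tokens do not contribute to the loss.
            continue
        total += sum((p - t) ** 2 for p, t in zip(pred_vec, target_vec))
        count += len(target_vec)
    return total / count if count else 0.0


# Usage: only the first of two token positions is masked.
loss = masked_prediction_loss(
    predicted=[[1.0, 1.0], [0.0, 0.0]],
    target=[[0.0, 0.0], [5.0, 5.0]],
    mask=[True, False],
)
```

The point of comparing embeddings rather than raw coordinates is that the model is rewarded for predicting what a hidden region represents, not for regressing its exact input values.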

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    UniScene3D learns unified 3D scene representations from colored pointmaps using contrastive CLIP pretraining plus cross-view geometric and grounded view alignments, achieving state-of-the-art results on viewpoint grou...
