POMA-3D: The Point Map Way to 3D Scene Understanding

Junpeng Jing; Krystian Mikolajczyk; Ranran Huang; Weixun Luo; Ye Mao

arxiv: 2511.16567 · v3 · submitted 2025-11-20 · 💻 cs.CV

Pith reviewed 2026-05-17 20:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords: point maps · 3D scene understanding · self-supervised learning · 3D representations · embodied navigation · scene retrieval · geometric inputs · view-to-scene alignment

The pith

POMA-3D learns self-supervised 3D scene representations from point maps encoding explicit 3D coordinates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents POMA-3D, the first self-supervised model for 3D representations trained on point maps. Point maps put 3D coordinates on a structured 2D grid to maintain global geometry and work with 2D model inputs. A view-to-scene alignment strategy moves rich priors from 2D foundation models into the 3D features. POMA-JEPA is added to keep features geometrically consistent across multiple views of the same scene. A large new dataset called ScenePoint is built to support pretraining, and the resulting model improves performance on 3D question answering, navigation, retrieval, and localization using only geometric data.

Core claim

POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding by learning from point maps that preserve global 3D geometry while remaining compatible with 2D foundation models, using view-to-scene alignment and a joint embedding-predictive architecture to enforce consistency across views, all while relying solely on geometric inputs for diverse tasks.

What carries the argument

Point maps that encode explicit 3D coordinates on a structured 2D grid, combined with a view-to-scene alignment strategy to transfer 2D priors and POMA-JEPA for multi-view consistency.
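To make the point-map idea concrete, here is a minimal NumPy sketch of one common way to build such a grid: back-projecting a depth image through pinhole intrinsics and a camera-to-world pose. The summary above does not describe how the paper constructs its point maps, so the function name `depth_to_point_map`, the pinhole model, and the pose convention are all assumptions made for illustration.

```python
import numpy as np

def depth_to_point_map(depth, K, cam_to_world):
    """Back-project an (H, W) depth image into an (H, W, 3) point map.

    Assumptions (not from the paper): pinhole intrinsics K (3x3), depth
    measured along the camera z-axis, and a 4x4 camera-to-world pose.
    """
    h, w = depth.shape
    # Pixel coordinates: u runs along columns, v along rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Camera-frame coordinates for every pixel.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth], axis=-1)   # (H, W, 3)
    # Rigid transform into a shared scene frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t                     # (H, W, 3)

# Toy example: a flat wall 2 units away, camera shifted 1 unit along x.
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 0.0, 0.0]
point_map = depth_to_point_map(np.full((3, 3), 2.0), K, pose)
```

The result keeps the image layout (one 3D coordinate per pixel), which is what lets a 2D-style encoder consume it, while the pose transform places every view in the same coordinate frame.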

If this is right

Benefits 3D question answering tasks with only 3D coordinate inputs.
Improves performance in embodied navigation and localization.
Enhances scene retrieval using the learned geometric representations.
Addresses scarcity of pretrained priors in 3D representation learning.
Supports both specialist and generalist 3D understanding models.
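As an illustration of the retrieval use case in the list above, the sketch below ranks scenes by cosine similarity between a query embedding and per-scene embeddings. It is a generic nearest-neighbour lookup, not the paper's retrieval pipeline; `top_k_scenes` and the toy embeddings are hypothetical.

```python
import numpy as np

def top_k_scenes(query_emb, scene_embs, k=4):
    """Return indices of the k scenes most similar to the query (cosine)."""
    q = query_emb / np.linalg.norm(query_emb)
    s = scene_embs / np.linalg.norm(scene_embs, axis=1, keepdims=True)
    scores = s @ q
    # argsort is ascending, so negate to rank from most to least similar.
    return np.argsort(-scores)[:k], scores

# Hypothetical scene embeddings (one row per scene) and a query embedding.
scene_embs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.7, 0.7, 0.0],
                       [0.0, 0.0, 1.0],
                       [-1.0, 0.0, 0.0]])
query = np.array([0.9, 0.1, 0.0])
ranked, scores = top_k_scenes(query, scene_embs, k=4)
```

In a text-to-scene setting, the query would come from a text encoder and the scene rows from the 3D backbone, which only works if the two embedding spaces have been aligned during pretraining.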

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Point maps could scale to even larger datasets for better generalization in real-world robotics applications.
The approach might extend to dynamic scenes if temporal consistency is added to the architecture.
Integration with language models could create multimodal 3D understanding systems beyond pure geometry.
Testing on outdoor or large-scale environments would reveal how well the global geometry preservation holds outside indoor rooms.

Load-bearing premise

That point maps preserve global 3D geometry sufficiently well and that the view-to-scene alignment strategy transfers rich 2D priors into 3D representations without major distortion or loss of information.

What would settle it

A controlled test where POMA-3D is compared to a standard point cloud model on a 3D QA benchmark and shows no improvement when using only the point map inputs.

Figures

Figures reproduced from arXiv: 2511.16567 by Junpeng Jing, Krystian Mikolajczyk, Ranran Huang, Weixun Luo, Ye Mao.

**Figure 1.** Overview of POMA-3D. POMA-3D is a self-supervised 3D model pretrained on the large-scale point map dataset ScenePoint via alignment with 2D foundation models and the POMA-JEPA objective. The 3D features from pretrained POMA-3D transfer effectively to diverse 3D understanding tasks, including 3D visual question answering, embodied navigation, scene retrieval, and embodied localization.

**Figure 2.** Overview of the POMA-3D pretraining. POMA-3D is pretrained with two objectives: (1) aligning [CLS] embeddings from the point map context encoder with image and text embeddings from the frozen FG-CLIP using L_view and L_scene, and (2) reconstructing masked point map embeddings from the target encoder using unmasked embeddings from the context encoder via a predictor optimized by L_pjepa. The target encoder is …
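The second objective in the Figure 2 caption (predicting embeddings of masked patches from visible ones) can be sketched in a few lines. This is a toy version under loud assumptions: the encoders and predictor are plain matrices, the predictor is conditioned on a hypothetical positional embedding `pos`, and the loss is a mean squared error. None of these choices is confirmed by the caption, and the way the target encoder is updated is left out because the caption is cut off at that point.

```python
import numpy as np

def jepa_style_loss(tokens, pos, mask, context_enc, target_enc, predictor):
    """Toy JEPA-style objective on patch tokens of shape (N, D).

    mask[i] is True for patches hidden from the context encoder.
    All modules are plain (D, D) matrices and the loss is an MSE;
    both are simplifying assumptions, not the paper's design.
    """
    # Targets: target-encoder embeddings of the masked patches.
    # During training these would be treated as fixed targets.
    targets = (tokens @ target_enc)[mask]                  # (M, D)
    # Context: embed only the visible patches, then pool them.
    context = (tokens[~mask] @ context_enc).mean(axis=0)   # (D,)
    # Predictor: one guess per masked patch, conditioned on its position.
    preds = (context + pos[mask]) @ predictor              # (M, D)
    return float(np.mean((preds - targets) ** 2))

rng = np.random.default_rng(0)
n_patches, dim = 8, 4
tokens = rng.normal(size=(n_patches, dim))    # stand-in point map patches
pos = rng.normal(size=(n_patches, dim))       # hypothetical position codes
mask = np.array([True, False] * 4)            # hide every other patch
W_context = rng.normal(size=(dim, dim))
W_target = W_context.copy()
W_predictor = np.eye(dim)
loss = jepa_style_loss(tokens, pos, mask, W_context, W_target, W_predictor)
```

The key property is that the loss lives in embedding space rather than coordinate space: the model is asked to agree on features for hidden regions, not to regress raw 3D points.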

**Figure 3.** Qualitative scene retrieval results. Top-4 candidates from each method are shown. For the given query, only POMA-3D retrieves the unique ground-truth scene, while others fail to return bookshelf-containing scenes. Green boxes mark bookshelves.

Panel text: “I am cooking in the kitchen area.” “I am washing my face.” “I am standing near to the curtain.” “I am printing something for my work.” “I am standing between two bl…

**Figure 4.** Qualitative embodied localization results. Top: text to describe the current agent’s situation. Bottom: merged multi-view point maps, where red regions indicate the point map views retrieved by POMA-3D based on the text.

… point cloud–based VLL model SceneVerse by 6.2% on SQA3D and 4.5% on Hypo3D. Even when compared to large 3D LLMs such as LEO and LLaVA-3D, POMA-3D_spec still achieves noticeable improvements…

**Figure 5.** Comparison of POMA-3D and its baseline FG-CLIP

Original abstract

In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/
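The abstract does not give the form of the view-to-scene alignment loss. One plausible reading, sketched below, is a CLIP-style symmetric contrastive (InfoNCE) loss applied at two levels: per view, and per scene after pooling the views. The function `info_nce`, the mean pooling, the temperature value, and the toy embeddings are all assumptions for illustration rather than the authors' formulation.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between row-aligned embeddings a, b of shape (N, D)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # (N, N) scaled cosine similarities

    def cross_entropy_diag(z):
        # Row-wise log-softmax, then pick the matching (diagonal) pair.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

rng = np.random.default_rng(0)
n_scenes, n_views, dim = 4, 3, 16
# Hypothetical frozen 2D-model embeddings, one per view.
image_emb = rng.normal(size=(n_scenes, n_views, dim))
# Hypothetical point map embeddings: noisy copies of their 2D targets.
pm_emb = image_emb + 0.01 * rng.normal(size=image_emb.shape)

# View level: every view is matched to its own 2D embedding.
loss_view = info_nce(pm_emb.reshape(-1, dim), image_emb.reshape(-1, dim))
# Scene level: mean-pool views into one embedding per scene (an assumption).
loss_scene = info_nce(pm_emb.mean(axis=1), image_emb.mean(axis=1))
# A deliberately mismatched pairing should score worse than the correct one.
loss_shuffled = info_nce(pm_emb.mean(axis=1), image_emb.mean(axis=1)[::-1])
```

Because the 2D model stays frozen in this setup, only the point map encoder moves, which is the sense in which 2D priors are "transferred" into the 3D features.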

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

POMA-3D tries to adapt 2D foundation models to 3D via point maps and a JEPA-style predictor, but the abstract supplies no numbers to show the gains are real.

The paper's main move is to feed explicit 3D coordinates arranged on a 2D grid into a self-supervised model so that existing 2D priors can be transferred without starting from scratch on 3D data. They add a view-to-scene alignment step to inject those priors and introduce POMA-JEPA to enforce cross-view consistency on the point-map features. They also release ScenePoint, a dataset built from 6.5K room scans and a million images, to support large-scale pretraining. This combination of input format, alignment, and predictive loss is not in the prior work they cite, so the setup itself counts as new for the 3D representation learning crowd.

The practical goal is clear: produce a backbone that works on 3D QA, navigation, retrieval, and localization when only raw coordinates are available at test time. That target is useful for robotics and embodied AI if the features actually hold up. The alignment and consistency mechanisms are reasonable extensions of JEPA ideas to geometry-preserving inputs, and the dataset construction looks like a straightforward way to scale up.

The main weakness is the complete absence of quantitative results, baselines, or ablations in the abstract. Without those, it is impossible to tell whether the alignment preserves global geometry or whether the predictive loss actually reduces view-dependent variance enough to deliver the claimed task improvements. The stress-test worry about distortion during alignment or weak regularization is exactly the spot that needs checking; if the full paper does not include alignment error metrics or an ablation that removes the alignment, the central claim stays unproven.

This work is aimed at researchers who already work on self-supervised 3D models or who want to bootstrap 3D representations from 2D foundation models. A reader focused on architecture details and dataset design would get value from the description even before the numbers are verified.

It deserves a serious referee because the problem it attacks is real and the proposed route is distinct enough to warrant checking the experiments in detail. I would send it to peer review rather than desk reject, with the expectation that the authors add the missing quantitative evidence and ablations in revision.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces POMA-3D, a self-supervised 3D representation model trained on point maps that encode explicit 3D coordinates on a structured 2D grid. It proposes a view-to-scene alignment strategy to inject 2D foundation-model priors and POMA-JEPA, a joint embedding-predictive architecture, to enforce cross-view geometric consistency. A new ScenePoint dataset is constructed from 6.5K room-level RGB-D scenes and 1M images for large-scale pretraining. The central claim is that POMA-3D serves as a strong backbone for both specialist and generalist 3D tasks (3D question answering, embodied navigation, scene retrieval, embodied localization) when supplied only with geometric inputs at inference.

Significance. If the empirical claims hold, the work offers a novel route to 3D representation learning that leverages the geometric fidelity of point maps while reusing 2D priors, directly addressing data scarcity. The introduction of the ScenePoint dataset and the explicit use of only 3D coordinates at test time are concrete strengths that could be adopted by the community.

major comments (3)

[Abstract] Abstract: the stated performance benefits on 3D QA, navigation, retrieval and localization are presented without any quantitative numbers, baselines, or ablation results, so the load-bearing claim that POMA-3D is a 'strong backbone' cannot yet be evaluated.
[View-to-scene alignment] View-to-scene alignment section: the manuscript provides no alignment-error metric, cross-view feature variance, or ablation that removes the alignment step, leaving open the possibility that the transfer of 2D priors distorts the canonical 3D coordinate grid.
[POMA-JEPA and Experiments] POMA-JEPA description and experiments: the joint embedding-predictive loss is asserted to enforce geometric consistency, yet no quantitative verification (e.g., canonical-coordinate variance across views or ablation of the predictive term) is supplied to confirm that the regularization is strong enough to support the downstream gains.
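The cross-view variance check requested in the major comments could be as simple as the following sketch. It measures how much the embeddings of the same scene differ across views; the metric definition, the function name `cross_view_variance`, and the synthetic features are hypothetical and are not taken from the manuscript.

```python
import numpy as np

def cross_view_variance(features):
    """Mean per-dimension variance of L2-normalised features across views.

    features: (V, D) array, one embedding per view of the same scene.
    Lower values mean the views agree more closely.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    return float(f.var(axis=0).mean())

rng = np.random.default_rng(0)
base = rng.normal(size=16)
# Views that nearly agree (what a consistency loss should encourage).
consistent = base + 0.01 * rng.normal(size=(5, 16))
# Unrelated embeddings standing in for view-dependent features.
inconsistent = rng.normal(size=(5, 16))

var_consistent = cross_view_variance(consistent)
var_inconsistent = cross_view_variance(inconsistent)
```

Reporting such a number with and without the predictive term would turn the consistency claim into something a reader can verify, which is the substance of the third major comment.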

minor comments (2)

[Method] The notation for point maps versus canonical coordinates should be introduced with a single equation early in the method section to improve readability.
[Figures] Figure captions for the alignment diagram should explicitly label the 2D-to-3D mapping operation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive and detailed review of our manuscript introducing POMA-3D. We appreciate the referee's recognition of the novelty in leveraging point maps for self-supervised 3D representation learning, the view-to-scene alignment strategy, the POMA-JEPA architecture, and the introduction of the ScenePoint dataset. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation of our empirical claims without altering the core contributions.

Referee: [Abstract] Abstract: the stated performance benefits on 3D QA, navigation, retrieval and localization are presented without any quantitative numbers, baselines, or ablation results, so the load-bearing claim that POMA-3D is a 'strong backbone' cannot yet be evaluated.

Authors: We agree that the abstract would be strengthened by including key quantitative highlights to support the claim that POMA-3D serves as a strong backbone. In the revised manuscript, we will update the abstract to report specific performance gains (e.g., relative improvements on 3D question answering, navigation success rates, retrieval mAP, and localization accuracy) with references to the corresponding tables and baselines in the experiments section. This will enable readers to directly assess the empirical strength of the results.
revision: yes

Referee: [View-to-scene alignment] View-to-scene alignment section: the manuscript provides no alignment-error metric, cross-view feature variance, or ablation that removes the alignment step, leaving open the possibility that the transfer of 2D priors distorts the canonical 3D coordinate grid.

Authors: We acknowledge the value of additional quantitative validation for the view-to-scene alignment. We will add an alignment-error metric (e.g., average L2 distance between projected 2D features and canonical 3D coordinates) and report cross-view feature variance to demonstrate preservation of geometric structure. We will also include an ablation study that removes the alignment step and measures its impact on downstream tasks, which will directly address concerns about potential distortion of the 3D coordinate grid.
revision: yes

Referee: [POMA-JEPA and Experiments] POMA-JEPA description and experiments: the joint embedding-predictive loss is asserted to enforce geometric consistency, yet no quantitative verification (e.g., canonical-coordinate variance across views or ablation of the predictive term) is supplied to confirm that the regularization is strong enough to support the downstream gains.

Authors: We agree that explicit quantitative verification would better substantiate the role of the joint embedding-predictive loss. In the revision, we will report canonical-coordinate variance across multiple views (before and after the predictive term) and include an ablation that isolates the predictive component, showing its contribution to geometric consistency and to the observed gains on downstream tasks such as navigation and scene retrieval. These additions will confirm that the regularization is effective.
revision: yes

Circularity Check

0 steps flagged

No circularity in POMA-3D derivation chain

The paper introduces point maps as input, a view-to-scene alignment strategy, and POMA-JEPA as novel extensions of JEPA-style self-supervised learning. These are architectural choices evaluated empirically on the new ScenePoint dataset and downstream tasks (3D QA, navigation, etc.) using only geometric inputs. No load-bearing step reduces a claimed prediction or result to a fitted parameter, self-definition, or self-citation chain by construction. The central claims rest on experimental outcomes rather than equations that equate outputs to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that point maps preserve global 3D geometry and on the effectiveness of the newly introduced alignment and consistency mechanisms; no explicit free parameters or invented physical entities are stated in the abstract.

axioms (1)

domain assumption: Point maps encode explicit 3D coordinates on a structured 2D grid while preserving global 3D geometry and remaining compatible with 2D foundation model inputs.
Directly stated in the abstract as the foundation for transferring 2D priors.

invented entities (1)

POMA-JEPA (no independent evidence)
purpose: Joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views.
New architecture introduced to address view-dependence of point maps.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear

"point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding
cs.CV · 2026-04 · unverdicted · novelty 6.0

UniScene3D learns unified 3D scene representations from colored pointmaps using contrastive CLIP pretraining plus cross-view geometric and grounded view alignments, achieving state-of-the-art results on viewpoint grou...

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1]

Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes

Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, and Leonidas Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. InEuropean conference on computer vision, pages 422–440. Springer, 2020. 1, 6

work page 2020
[2]

Self-supervised learning from images with a joint-embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bo- janowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023. 2

work page 2023
[3]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self- supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Scanqa: 3d question answering for spatial scene understanding

Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129– 19139, 2022. 2, 5, 6

work page 2022
[5]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

V-jepa: Latent video prediction for visual represen- tation learning

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. V-jepa: Latent video prediction for visual represen- tation learning. 2023. 2

work page 2023
[7]

ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021. 3

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

Scanrefer: 3d object localization in rgb-d scans using natural language

Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. InEuropean conference on computer vision, pages 202–221. Springer, 2020. 1, 2, 6

work page 2020
[9]

Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning

Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26428–26438, 2024. 2

work page 2024
[10]

3d aware region prompted vision language model.arXiv preprint arXiv:2509.13317,

An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiao- long Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, et al. 3d aware region prompted vision language model.arXiv preprint arXiv:2509.13317,

work page arXiv
[11]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 3

work page 2017
[12]

Procthor: Large-scale embodied ai using procedural generation.Ad- vances in Neural Information Processing Systems, 35:5982– 5994, 2022

Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation.Ad- vances in Neural Information Processing Systems, 35:5982– 5994, 2022. 3

work page 2022
[13]

A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022

Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022. 1

work page 2022
[14]

A point set generation network for 3d object reconstruction from a single image

Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 605–613, 2017. 5

work page 2017
[15]

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models aug- mented with instruction-aligned 3d reconstruction.arXiv preprint arXiv:2505.20279, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024

Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wen- han Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024. 2

work page arXiv 2024
[17]

Viewrefer: Grasp the multi-view knowledge for 3d visual grounding

Zoey Guo, Yiwen Tang, Ray Zhang, Dong Wang, Zhigang Wang, Bin Zhao, and Xuelong Li. Viewrefer: Grasp the multi-view knowledge for 3d visual grounding. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 15372–15383, 2023. 2

work page 2023
[18]

3d-llm: In- jecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494,

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: In- jecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494,

work page
[19]

3d-sis: 3d se- mantic instance segmentation of rgb-d scans

Ji Hou, Angela Dai, and Matthias Nießner. 3d-sis: 3d se- mantic instance segmentation of rgb-d scans. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4421–4430, 2019. 1, 2

work page 2019
[20]

An Embodied Generalist Agent in 3D World

Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023. 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

No pose at all: Self-supervised pose-free 3d gaussian splatting from sparse 9 views

Ranran Huang and Krystian Mikolajczyk. No pose at all: Self-supervised pose-free 3d gaussian splatting from sparse 9 views. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 27947–27957, 2025. 2

work page 2025
[22]

Multi- view transformer for 3d visual grounding

Shijia Huang, Yilun Chen, Jiaya Jia, and Liwei Wang. Multi- view transformer for 3d visual grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15524–15533, 2022. 2

work page 2022
[23]

Clip2point: Transfer clip to point cloud classifica- tion with image-depth pre-training

Tianyu Huang, Bowen Dong, Yunhan Yang, Xiaoshui Huang, Rynson WH Lau, Wanli Ouyang, and Wangmeng Zuo. Clip2point: Transfer clip to point cloud classifica- tion with image-depth pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22157–22167, 2023. 2

work page 2023
[24]

Sceneverse: Scaling 3d vision-language learning for grounded scene understanding

Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, and Siyuan Huang. Sceneverse: Scaling 3d vision-language learning for grounded scene understanding. InEuropean Conference on Computer Vision, pages 289–310. Springer, 2024. 2, 3, 5, 6

work page 2024
[25]

Pointgroup: Dual-set point grouping for 3d instance segmentation

Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi- Wing Fu, and Jiaya Jia. Pointgroup: Dual-set point grouping for 3d instance segmentation. InProceedings of the IEEE/CVF conference on computer vision and Pattern recognition, pages 4867–4876, 2020. 1, 2

work page 2020
[26]

Ground- ing image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and J´erˆome Revaud. Ground- ing image matching in 3d with mast3r. InEuropean Confer- ence on Computer Vision, pages 71–91. Springer, 2024. 2

work page 2024
[27]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Scenesplat: Gaussian splatting-based scene un- derstanding with vision-language pretraining.arXiv preprint arXiv:2503.18052, 2025

Yue Li, Qi Ma, Runyi Yang, Huapeng Li, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Ender Konukoglu, Theo Gevers, et al. Scenesplat: Gaussian splatting-based scene un- derstanding with vision-language pretraining.arXiv preprint arXiv:2503.18052, 2025. 2, 3

work page arXiv 2025
[29]

Embodied intelligence for 3d under- standing: A survey on 3d scene question answering.arXiv preprint arXiv:2502.00342, 2025

Zechuan Li, Hongshan Yu, Yihao Ding, Yan Li, Yong He, and Naveed Akhtar. Embodied intelligence for 3d under- standing: A survey on 3d scene question answering.arXiv preprint arXiv:2502.00342, 2025. 1

work page arXiv 2025
[30]

Multi- modal situated reasoning in 3d scenes.Advances in Neural Information Processing Systems, 37:140903–140936, 2024

Xiongkun Linghu, Jiangyong Huang, Xuesong Niu, Xiao- jian Shawn Ma, Baoxiong Jia, and Siyuan Huang. Multi- modal situated reasoning in 3d scenes.Advances in Neural Information Processing Systems, 37:140903–140936, 2024. 5, 6

work page 2024
[31]

Openshape: Scaling up 3d shape representation towards open-world understanding.Advances in neural information processing systems, 36:44860–44879, 2023

Minghua Liu, Ruoxi Shi, Kaiming Kuang, Yinhao Zhu, Xu- anlin Li, Shizhong Han, Hong Cai, Fatih Porikli, and Hao Su. Openshape: Scaling up 3d shape representation towards open-world understanding.Advances in neural information processing systems, 36:44860–44879, 2023. 2

work page 2023
[32]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 5

work page internal anchor Pith review Pith/arXiv arXiv 2017
[33]

Sqa3d: Situated question answering in 3d scenes,

Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yi- tao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022. 2, 5, 6, 8

work page arXiv 2022
[34]

Multiscan: Scalable rgbd scanning for 3d environments with articulated objects.Advances in neural information processing systems, 35:9058–9071, 2022

Yongsen Mao, Yiming Zhang, Hanxiao Jiang, Angel Chang, and Manolis Savva. Multiscan: Scalable rgbd scanning for 3d environments with articulated objects.Advances in neural information processing systems, 35:9058–9071, 2022. 3

work page 2022
[35]

Opendlign: Open-world point cloud understanding with depth-aligned images.Advances in Neural Information Pro- cessing Systems, 37:101144–101167, 2024

Ye Mao, Junpeng Jing, and Krystian Mikolajczyk. Opendlign: Open-world point cloud understanding with depth-aligned images.Advances in Neural Information Pro- cessing Systems, 37:101144–101167, 2024. 2

work page 2024
[36]

Hypo3d: Exploring hypothetical reason- ing in 3d.arXiv preprint arXiv:2502.00954, 2025

Ye Mao, Weixun Luo, Junpeng Jing, Anlan Qiu, and Krys- tian Mikolajczyk. Hypo3d: Exploring hypothetical reason- ing in 3d.arXiv preprint arXiv:2502.00954, 2025. 5, 6, 8

work page arXiv 2025
[37]

Masked autoencoders for 3d point cloud self- supervised learning.World Scientific Annual Review of Arti- ficial Intelligence, 1:2440001, 2023

Yatian Pang, Eng Hock Francis Tay, Li Yuan, and Zhenghua Chen. Masked autoencoders for 3d point cloud self- supervised learning.World Scientific Annual Review of Arti- ficial Intelligence, 1:2440001, 2023. 5

work page 2023
[38]

Pointnet: Deep learning on point sets for 3d classification and segmentation

Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660,

work page
[39] Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, and Kaisheng Ma. Shapellm: Universal 3d object understanding for embodied interaction. In European Conference on Computer Vision, pages 214–

[40] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 2

[41] Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai. arXiv preprint arXiv:2109.08238, 2021. 3

[42] Jonas Schult, Francis Engelmann, Alexander Hermans, Or Litany, Siyu Tang, and Bastian Leibe. Mask3d: Mask transformer for 3d semantic instance segmentation. arXiv preprint arXiv:2210.03105, 2022. 5

[43] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018. 3

[44] Anh Thai, Songyou Peng, Kyle Genova, Leonidas Guibas, and Thomas Funkhouser. Splattalk: 3d vqa with gaussian splatting. arXiv preprint arXiv:2503.06271, 2025. 2, 5, 6

[45] Johanna Wald, Armen Avetisyan, Nassir Navab, Federico Tombari, and Matthias Nießner. Rio: 3d object instance re-localization in changing indoor environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7658–7667, 2019. 3

[46] Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, and Zhaoxiang Zhang. Ross3d: Reconstructive visual instruction tuning with 3d-awareness. arXiv preprint arXiv:2504.01901, 2025. 2

[47] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 2, 3

[48] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024. 2

[49] Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, and Zhou Zhao. Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes. arXiv preprint arXiv:2308.08769, 2023. 2

[50] Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, and Yuhui Yin. Fg-clip: Fine-grained visual and textual alignment. arXiv preprint arXiv:2505.05071, 2025. 3, 4, 5, 6, 8

[51] Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1179–1189, 2023. 2

[52] Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, et al. Ulip-2: Towards scalable multimodal pre-training for 3d understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27091–27101, 2024. 2

[53] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021. 2

[54] Duo Zheng, Shijia Huang, and Liwei Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8995–9006, 2025. 2, 3, 6

[55] Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3d: A large photo-realistic dataset for structured 3d modeling. In European Conference on Computer Vision, pages 519–535. Springer, 2020. 3

[56] Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Uni3d: Exploring unified 3d representation at scale. arXiv preprint arXiv:2310.06773,

[57] Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness. arXiv preprint arXiv:2409.18125, 2024. 2, 6

[58] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. 3

[59] Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, and Qing Li. 3d-vista: Pre-trained transformer for 3d vision and text alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2911–2921, 2023. 2, 3, 5, 6
