POMA-3D: The Point Map Way to 3D Scene Understanding
Pith reviewed 2026-05-17 20:23 UTC · model grok-4.3
The pith
POMA-3D learns self-supervised 3D scene representations from point maps encoding explicit 3D coordinates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding by learning from point maps that preserve global 3D geometry while remaining compatible with 2D foundation models, using view-to-scene alignment and a joint embedding-predictive architecture to enforce consistency across views, all while relying solely on geometric inputs for diverse tasks.
What carries the argument
Point maps that encode explicit 3D coordinates on a structured 2D grid, combined with a view-to-scene alignment strategy to transfer 2D priors and POMA-JEPA for multi-view consistency.
If this is right
- Benefits 3D question answering tasks with only 3D coordinate inputs.
- Improves performance in embodied navigation and localization.
- Enhances scene retrieval using the learned geometric representations.
- Addresses scarcity of pretrained priors in 3D representation learning.
- Supports both specialist and generalist 3D understanding models.
Where Pith is reading between the lines
- Point maps could potentially scale to even larger datasets for better generalization in real-world robotics applications.
- The approach might extend to dynamic scenes if temporal consistency is added to the architecture.
- Integration with language models could create multimodal 3D understanding systems beyond pure geometry.
- Testing on outdoor or large-scale environments would reveal how well the global geometry preservation holds outside indoor rooms.
Load-bearing premise
That point maps preserve global 3D geometry sufficiently well and that the view-to-scene alignment strategy transfers rich 2D priors into 3D representations without major distortion or loss of information.
What would settle it
A controlled test where POMA-3D is compared to a standard point cloud model on a 3D QA benchmark and shows no improvement when using only the point map inputs.
Figures
read the original abstract
In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces POMA-3D, a self-supervised 3D representation model trained on point maps that encode explicit 3D coordinates on a structured 2D grid. It proposes a view-to-scene alignment strategy to inject 2D foundation-model priors and POMA-JEPA, a joint embedding-predictive architecture, to enforce cross-view geometric consistency. A new ScenePoint dataset is constructed from 6.5K room-level RGB-D scenes and 1M images for large-scale pretraining. The central claim is that POMA-3D serves as a strong backbone for both specialist and generalist 3D tasks (3D question answering, embodied navigation, scene retrieval, embodied localization) when supplied only with geometric inputs at inference.
Significance. If the empirical claims hold, the work offers a novel route to 3D representation learning that leverages the geometric fidelity of point maps while reusing 2D priors, directly addressing data scarcity. The introduction of the ScenePoint dataset and the explicit use of only 3D coordinates at test time are concrete strengths that could be adopted by the community.
major comments (3)
- [Abstract] Abstract: the stated performance benefits on 3D QA, navigation, retrieval and localization are presented without any quantitative numbers, baselines, or ablation results, so the load-bearing claim that POMA-3D is a 'strong backbone' cannot yet be evaluated.
- [View-to-scene alignment] View-to-scene alignment section: the manuscript provides no alignment-error metric, cross-view feature variance, or ablation that removes the alignment step, leaving open the possibility that the transfer of 2D priors distorts the canonical 3D coordinate grid.
- [POMA-JEPA and Experiments] POMA-JEPA description and experiments: the joint embedding-predictive loss is asserted to enforce geometric consistency, yet no quantitative verification (e.g., canonical-coordinate variance across views or ablation of the predictive term) is supplied to confirm that the regularization is strong enough to support the downstream gains.
minor comments (2)
- [Method] The notation for point maps versus canonical coordinates should be introduced with a single equation early in the method section to improve readability.
- [Figures] Figure captions for the alignment diagram should explicitly label the 2D-to-3D mapping operation.
Simulated Author's Rebuttal
Thank you for the constructive and detailed review of our manuscript introducing POMA-3D. We appreciate the referee's recognition of the novelty in leveraging point maps for self-supervised 3D representation learning, the view-to-scene alignment strategy, the POMA-JEPA architecture, and the introduction of the ScenePoint dataset. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the presentation of our empirical claims without altering the core contributions.
read point-by-point responses
-
Referee: [Abstract] Abstract: the stated performance benefits on 3D QA, navigation, retrieval and localization are presented without any quantitative numbers, baselines, or ablation results, so the load-bearing claim that POMA-3D is a 'strong backbone' cannot yet be evaluated.
Authors: We agree that the abstract would be strengthened by including key quantitative highlights to support the claim that POMA-3D serves as a strong backbone. In the revised manuscript, we will update the abstract to report specific performance gains (e.g., relative improvements on 3D question answering, navigation success rates, retrieval mAP, and localization accuracy) with references to the corresponding tables and baselines in the experiments section. This will enable readers to directly assess the empirical strength of the results. revision: yes
-
Referee: [View-to-scene alignment] View-to-scene alignment section: the manuscript provides no alignment-error metric, cross-view feature variance, or ablation that removes the alignment step, leaving open the possibility that the transfer of 2D priors distorts the canonical 3D coordinate grid.
Authors: We acknowledge the value of additional quantitative validation for the view-to-scene alignment. We will add an alignment-error metric (e.g., average L2 distance between projected 2D features and canonical 3D coordinates) and report cross-view feature variance to demonstrate preservation of geometric structure. We will also include an ablation study that removes the alignment step and measures its impact on downstream tasks, which will directly address concerns about potential distortion of the 3D coordinate grid. revision: yes
-
Referee: [POMA-JEPA and Experiments] POMA-JEPA description and experiments: the joint embedding-predictive loss is asserted to enforce geometric consistency, yet no quantitative verification (e.g., canonical-coordinate variance across views or ablation of the predictive term) is supplied to confirm that the regularization is strong enough to support the downstream gains.
Authors: We agree that explicit quantitative verification would better substantiate the role of the joint embedding-predictive loss. In the revision, we will report canonical-coordinate variance across multiple views (before and after the predictive term) and include an ablation that isolates the predictive component, showing its contribution to geometric consistency and to the observed gains on downstream tasks such as navigation and scene retrieval. These additions will confirm that the regularization is effective. revision: yes
Circularity Check
No circularity in POMA-3D derivation chain
full rationale
The paper introduces point maps as input, a view-to-scene alignment strategy, and POMA-JEPA as novel extensions of JEPA-style self-supervised learning. These are architectural choices evaluated empirically on the new ScenePoint dataset and downstream tasks (3D QA, navigation, etc.) using only geometric inputs. No load-bearing step reduces a claimed prediction or result to a fitted parameter, self-definition, or self-citation chain by construction. The central claims rest on experimental outcomes rather than equations that equate outputs to inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Point maps encode explicit 3D coordinates on a structured 2D grid while preserving global 3D geometry and remaining compatible with 2D foundation model inputs.
invented entities (1)
-
POMA-JEPA
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding
UniScene3D learns unified 3D scene representations from colored pointmaps using contrastive CLIP pretraining plus cross-view geometric and grounded view alignments, achieving state-of-the-art results on viewpoint grou...
Reference graph
Works this paper leans on
-
[1]
Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes
Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, and Leonidas Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. InEuropean conference on computer vision, pages 422–440. Springer, 2020. 1, 6
work page 2020
-
[2]
Self-supervised learning from images with a joint-embedding predictive architecture
Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bo- janowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023. 2
work page 2023
-
[3]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self- supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Scanqa: 3d question answering for spatial scene understanding
Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129– 19139, 2022. 2, 5, 6
work page 2022
-
[5]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
V-jepa: Latent video prediction for visual represen- tation learning
Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. V-jepa: Latent video prediction for visual represen- tation learning. 2023. 2
work page 2023
-
[7]
ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data
Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021. 3
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[8]
Scanrefer: 3d object localization in rgb-d scans using natural language
Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. InEuropean conference on computer vision, pages 202–221. Springer, 2020. 1, 2, 6
work page 2020
-
[9]
Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning
Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26428–26438, 2024. 2
work page 2024
-
[10]
3d aware region prompted vision language model.arXiv preprint arXiv:2509.13317,
An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiao- long Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, et al. 3d aware region prompted vision language model.arXiv preprint arXiv:2509.13317,
-
[11]
Scannet: Richly-annotated 3d reconstructions of indoor scenes
Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 3
work page 2017
-
[12]
Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. Procthor: Large-scale embodied ai using procedural generation.Ad- vances in Neural Information Processing Systems, 35:5982– 5994, 2022. 3
work page 2022
-
[13]
Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022. 1
work page 2022
-
[14]
A point set generation network for 3d object reconstruction from a single image
Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 605–613, 2017. 5
work page 2017
-
[15]
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, et al. Vlm-3r: Vision-language models aug- mented with instruction-aligned 3d reconstruction.arXiv preprint arXiv:2505.20279, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wen- han Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning.arXiv preprint arXiv:2403.11401, 2024. 2
-
[17]
Viewrefer: Grasp the multi-view knowledge for 3d visual grounding
Zoey Guo, Yiwen Tang, Ray Zhang, Dong Wang, Zhigang Wang, Bin Zhao, and Xuelong Li. Viewrefer: Grasp the multi-view knowledge for 3d visual grounding. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 15372–15383, 2023. 2
work page 2023
-
[18]
Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: In- jecting the 3d world into large language models.Advances in Neural Information Processing Systems, 36:20482–20494,
-
[19]
3d-sis: 3d se- mantic instance segmentation of rgb-d scans
Ji Hou, Angela Dai, and Matthias Nießner. 3d-sis: 3d se- mantic instance segmentation of rgb-d scans. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4421–4430, 2019. 1, 2
work page 2019
-
[20]
An Embodied Generalist Agent in 3D World
Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world.arXiv preprint arXiv:2311.12871, 2023. 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
No pose at all: Self-supervised pose-free 3d gaussian splatting from sparse 9 views
Ranran Huang and Krystian Mikolajczyk. No pose at all: Self-supervised pose-free 3d gaussian splatting from sparse 9 views. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 27947–27957, 2025. 2
work page 2025
-
[22]
Multi- view transformer for 3d visual grounding
Shijia Huang, Yilun Chen, Jiaya Jia, and Liwei Wang. Multi- view transformer for 3d visual grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15524–15533, 2022. 2
work page 2022
-
[23]
Clip2point: Transfer clip to point cloud classifica- tion with image-depth pre-training
Tianyu Huang, Bowen Dong, Yunhan Yang, Xiaoshui Huang, Rynson WH Lau, Wanli Ouyang, and Wangmeng Zuo. Clip2point: Transfer clip to point cloud classifica- tion with image-depth pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22157–22167, 2023. 2
work page 2023
-
[24]
Sceneverse: Scaling 3d vision-language learning for grounded scene understanding
Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, and Siyuan Huang. Sceneverse: Scaling 3d vision-language learning for grounded scene understanding. InEuropean Conference on Computer Vision, pages 289–310. Springer, 2024. 2, 3, 5, 6
work page 2024
-
[25]
Pointgroup: Dual-set point grouping for 3d instance segmentation
Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi- Wing Fu, and Jiaya Jia. Pointgroup: Dual-set point grouping for 3d instance segmentation. InProceedings of the IEEE/CVF conference on computer vision and Pattern recognition, pages 4867–4876, 2020. 1, 2
work page 2020
-
[26]
Ground- ing image matching in 3d with mast3r
Vincent Leroy, Yohann Cabon, and J´erˆome Revaud. Ground- ing image matching in 3d with mast3r. InEuropean Confer- ence on Computer Vision, pages 71–91. Springer, 2024. 2
work page 2024
-
[27]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
Yue Li, Qi Ma, Runyi Yang, Huapeng Li, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Ender Konukoglu, Theo Gevers, et al. Scenesplat: Gaussian splatting-based scene un- derstanding with vision-language pretraining.arXiv preprint arXiv:2503.18052, 2025. 2, 3
-
[29]
Zechuan Li, Hongshan Yu, Yihao Ding, Yan Li, Yong He, and Naveed Akhtar. Embodied intelligence for 3d under- standing: A survey on 3d scene question answering.arXiv preprint arXiv:2502.00342, 2025. 1
-
[30]
Xiongkun Linghu, Jiangyong Huang, Xuesong Niu, Xiao- jian Shawn Ma, Baoxiong Jia, and Siyuan Huang. Multi- modal situated reasoning in 3d scenes.Advances in Neural Information Processing Systems, 37:140903–140936, 2024. 5, 6
work page 2024
-
[31]
Minghua Liu, Ruoxi Shi, Kaiming Kuang, Yinhao Zhu, Xu- anlin Li, Shizhong Han, Hong Cai, Fatih Porikli, and Hao Su. Openshape: Scaling up 3d shape representation towards open-world understanding.Advances in neural information processing systems, 36:44860–44879, 2023. 2
work page 2023
-
[32]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 5
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[33]
Sqa3d: Situated question answering in 3d scenes,
Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yi- tao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes.arXiv preprint arXiv:2210.07474, 2022. 2, 5, 6, 8
-
[34]
Yongsen Mao, Yiming Zhang, Hanxiao Jiang, Angel Chang, and Manolis Savva. Multiscan: Scalable rgbd scanning for 3d environments with articulated objects.Advances in neural information processing systems, 35:9058–9071, 2022. 3
work page 2022
-
[35]
Ye Mao, Junpeng Jing, and Krystian Mikolajczyk. Opendlign: Open-world point cloud understanding with depth-aligned images.Advances in Neural Information Pro- cessing Systems, 37:101144–101167, 2024. 2
work page 2024
-
[36]
Hypo3d: Exploring hypothetical reason- ing in 3d.arXiv preprint arXiv:2502.00954, 2025
Ye Mao, Weixun Luo, Junpeng Jing, Anlan Qiu, and Krys- tian Mikolajczyk. Hypo3d: Exploring hypothetical reason- ing in 3d.arXiv preprint arXiv:2502.00954, 2025. 5, 6, 8
-
[37]
Yatian Pang, Eng Hock Francis Tay, Li Yuan, and Zhenghua Chen. Masked autoencoders for 3d point cloud self- supervised learning.World Scientific Annual Review of Arti- ficial Intelligence, 1:2440001, 2023. 5
work page 2023
-
[38]
Pointnet: Deep learning on point sets for 3d classification and segmentation
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660,
-
[39]
Shapellm: Universal 3d object understanding for embodied interaction
Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, and Kaisheng Ma. Shapellm: Universal 3d object understanding for embodied interaction. InEuropean Conference on Computer Vision, pages 214–
-
[40]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2
work page 2021
-
[41]
Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI
Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Un- dersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai.arXiv preprint arXiv:2109.08238, 2021. 3
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[42]
Jonas Schult, Francis Engelmann, Alexander Hermans, Or Litany, Siyu Tang, and Bastian Leibe. Mask3d: Mask trans- former for 3d semantic instance segmentation.arXiv preprint arXiv:2210.03105, 2022. 5
-
[43]
Conceptual captions: A cleaned, hypernymed, im- age alt-text dataset for automatic image captioning
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, im- age alt-text dataset for automatic image captioning. InPro- ceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018. 3
work page 2018
-
[44]
Splattalk: 3d vqa with gaussian splatting.arXiv preprint arXiv:2503.06271, 2025
Anh Thai, Songyou Peng, Kyle Genova, Leonidas Guibas, and Thomas Funkhouser. Splattalk: 3d vqa with gaussian splatting.arXiv preprint arXiv:2503.06271, 2025. 2, 5, 6
-
[45]
Rio: 3d object instance re- localization in changing indoor environments
Johanna Wald, Armen Avetisyan, Nassir Navab, Federico Tombari, and Matthias Nießner. Rio: 3d object instance re- localization in changing indoor environments. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 7658–7667, 2019. 3
work page 2019
-
[46]
Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, and Zhaoxiang Zhang. Ross3d: Recon- structive visual instruction tuning with 3d-awareness.arXiv preprint arXiv:2504.01901, 2025. 2
-
[47]
Vggt: Vi- sual geometry grounded transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InProceedings of the 10 Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 2, 3
work page 2025
-
[48]
Dust3r: Geometric 3d vi- sion made easy
Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vi- sion made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697– 20709, 2024. 2
work page 2024
-
[49]
Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes,
Zehan Wang, Haifeng Huang, Yang Zhao, Ziang Zhang, and Zhou Zhao. Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes.arXiv preprint arXiv:2308.08769, 2023. 2
-
[50]
Fg-clip: Fine-grained visual and textual alignment.arXiv preprint arXiv:2505.05071, 2025
Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, and Yuhui Yin. Fg- clip: Fine-grained visual and textual alignment.arXiv preprint arXiv:2505.05071, 2025. 3, 4, 5, 6, 8
-
[51]
Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding
Le Xue, Mingfei Gao, Chen Xing, Roberto Mart ´ın-Mart´ın, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 1179–1189, 2023. 2
work page 2023
-
[52]
Ulip-2: Towards scalable multimodal pre-training for 3d understanding
Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Jun- nan Li, Roberto Mart´ın-Mart´ın, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, et al. Ulip-2: Towards scalable multimodal pre-training for 3d understanding. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27091–27101, 2024. 2
work page 2024
-
[53]
Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021. 2
work page 2021
-
[54]
Video-3d llm: Learning position-aware video representation for 3d scene understanding
Duo Zheng, Shijia Huang, and Liwei Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8995–9006, 2025. 2, 3, 6
work page 2025
-
[55]
Structured3d: A large photo-realistic dataset for structured 3d modeling
Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3d: A large photo-realistic dataset for structured 3d modeling. InEuropean Conference on Computer Vision, pages 519–535. Springer, 2020. 3
work page 2020
-
[56]
Uni3d: Exploring unified 3d representation at scale
Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Uni3d: Exploring unified 3d representation at scale.arXiv preprint arXiv:2310.06773,
-
[57]
Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024. 2, 6
-
[58]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shen- glong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[59]
3d-vista: Pre-trained transformer for 3d vision and text alignment
Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, and Qing Li. 3d-vista: Pre-trained transformer for 3d vision and text alignment. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2911– 2921, 2023. 2, 3, 5, 6 11
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.