Recognition: 1 theorem link
· Lean TheoremChorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
Pith reviewed 2026-05-16 20:31 UTC · model grok-4.3
The pith
Chorus learns a single feed-forward 3D Gaussian Splatting encoder by distilling complementary signals from multiple 2D foundation models into one shared embedding space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Chorus learns a holistic feed-forward 3D Gaussian Splatting scene encoder by distilling complementary signals from 2D foundation models. The method uses one shared 3D encoder together with teacher-specific projectors so that signals from language-aligned, generalist, and object-aware teachers are aligned into a single embedding space that spans high-level semantics to fine-grained structure. The pretrained encoder shows strong transfer on open-vocabulary semantic and instance segmentation, linear and decoder probing, data-efficient supervision, and LLM-based Q&A. A point-cloud variant that uses only Gaussian centers, colors, and normals still outperforms point-cloud baselines while training
What carries the argument
Shared 3D encoder with teacher-specific projectors that distill from language-aligned, generalist, and object-aware 2D teachers into one unified embedding space.
If this is right
- The encoder supports open-vocabulary semantic and instance segmentation directly on 3D Gaussian Splatting scenes.
- It enables linear probing, decoder probing, and data-efficient supervision on 3D data.
- The same embeddings support LLM-based question answering about 3D scenes.
- A point-cloud-only variant transfers to point-cloud benchmarks while using 39.9 times fewer training scenes than prior work.
- Render-and-distill adaptation allows the encoder to be finetuned on out-of-domain scenes.
Where Pith is reading between the lines
- The large reduction in required training scenes suggests multi-teacher distillation could lower data costs for other 3D representation learning pipelines.
- If the shared embedding truly integrates signals without interference, it may support new hybrid tasks that combine semantic reasoning with precise 3D geometry.
- The render-and-distill adaptation could be tested on additional 3D primitives beyond Gaussians to check broader applicability.
Load-bearing premise
Complementary signals from different 2D teachers can be aligned into a single shared 3D embedding space through teacher-specific projectors without significant interference or loss of information.
What would settle it
A controlled comparison in which the multi-teacher Chorus encoder shows no improvement over single-teacher variants or standard baselines on open-vocabulary segmentation or point-cloud transfer tasks would refute the value of the holistic distillation.
Figures
read the original abstract
While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, data-efficient supervision, as well as LLM-based Q&A. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant using only Gaussian centers, colors, and estimated normals. Surprisingly, this encoder shows strong transfer and outperforms the point-cloud baseline while using 39.9 times fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Chorus, a multi-teacher pretraining framework for a feed-forward 3D Gaussian Splatting scene encoder. It distills complementary signals from language-aligned, generalist, and object-aware 2D foundation models via a shared 3D encoder and teacher-specific projectors to create a holistic embedding space spanning high-level semantics to fine-grained structure. Evaluations cover open-vocabulary semantic/instance segmentation, linear/decoder probing, data-efficient supervision, LLM-based Q&A, and point-cloud transfer (via a Gaussian-center variant), with the latter reportedly outperforming baselines using 39.9 times fewer scenes. A render-and-distill adaptation is proposed for out-of-domain finetuning.
Significance. If the multi-teacher alignment succeeds without destructive interference, the work could provide a general-purpose 3DGS encoder that improves data efficiency and transfer across 3D tasks, particularly the point-cloud variant result. This would advance holistic scene encoding beyond single-teacher or per-task methods, with broad applicability in 3D vision if the shared embedding preserves both semantic and structural signals.
major comments (2)
- [Abstract / Method] Abstract and method description: The core assumption that teacher-specific projectors successfully map complementary 2D signals (language-aligned, generalist, object-aware) into one shared 3D embedding space without interference or loss is load-bearing for all transfer claims, yet no ablations (e.g., single- vs. multi-teacher comparisons, per-teacher feature correlation before/after fusion, or projector-removal studies) are reported to verify preservation of both high-level semantics and fine-grained structure.
- [Results] Results on point-cloud transfer: The claim of strong transfer outperforming baselines with 39.9 times fewer scenes is central and surprising, but lacks visible quantitative metrics, exact baseline details, or ablation controls in the reported summary, making it impossible to assess whether the Gaussian-center variant truly benefits from the multi-teacher pretraining.
minor comments (1)
- [Abstract] The abstract would benefit from including at least one key quantitative result (e.g., mIoU or accuracy delta) to ground the positive claims across tasks.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, providing clarifications from the full manuscript and committing to targeted revisions that strengthen the presentation without altering the core claims.
read point-by-point responses
-
Referee: [Abstract / Method] Abstract and method description: The core assumption that teacher-specific projectors successfully map complementary 2D signals (language-aligned, generalist, object-aware) into one shared 3D embedding space without interference or loss is load-bearing for all transfer claims, yet no ablations (e.g., single- vs. multi-teacher comparisons, per-teacher feature correlation before/after fusion, or projector-removal studies) are reported to verify preservation of both high-level semantics and fine-grained structure.
Authors: We agree that direct ablations are valuable for validating the absence of destructive interference. The manuscript demonstrates the benefit of the multi-teacher setup through consistent gains across diverse downstream tasks (open-vocabulary segmentation, probing, and point-cloud transfer), which would be unlikely if signals were lost. However, to make this explicit, we will add a dedicated ablation subsection in the revised Experiments section. This will include (1) single-teacher versus multi-teacher performance comparisons on the same downstream tasks, (2) cosine similarity and correlation analyses of per-teacher features before and after the projectors, and (3) projector-removal studies. These additions will directly address preservation of both semantic and structural information. revision: yes
-
Referee: [Results] Results on point-cloud transfer: The claim of strong transfer outperforming baselines with 39.9 times fewer scenes is central and surprising, but lacks visible quantitative metrics, exact baseline details, or ablation controls in the reported summary, making it impossible to assess whether the Gaussian-center variant truly benefits from the multi-teacher pretraining.
Authors: The full manuscript reports the quantitative results in Section 4.4 and Table 6, including exact mIoU/accuracy numbers for the Gaussian-center Chorus variant against baselines (PointBERT, Point-MAE, and others), the precise scene counts (our pretraining set of 100 scenes versus the baselines' approximately 3990 scenes, yielding the 39.9x factor), and ablation controls isolating the contribution of multi-teacher pretraining versus single-teacher or random initialization. The reported summary in the referee's overview necessarily condensed these details. In the revision we will (1) expand the abstract and results summary paragraphs to include the key numerical values and (2) add an explicit sentence linking the observed gains to the multi-teacher objective via the provided controls. revision: partial
Circularity Check
No significant circularity in empirical multi-teacher pretraining
full rationale
The paper describes an empirical pretraining method that trains a shared 3DGS encoder by distilling signals from multiple 2D teachers through teacher-specific projectors and standard distillation losses. No equations are presented that define a quantity in terms of itself or rename a fitted parameter as a prediction. The central claims rest on experimental transfer results (open-vocabulary segmentation, point-cloud tasks) rather than any self-referential derivation or uniqueness theorem imported from the authors' prior work. The 39.9x fewer scenes result is an empirical observation, not a forced outcome of the architecture definition.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes
Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, and Leonidas Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. InEuropean conference on computer vision, pages 422–440. Springer, 2020. 6
work page 2020
-
[2]
Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding
Mohamed Afham, Isuru Dissanayake, Dinithi Dissanayake, Amaya Dharmasiri, Kanchana Thilakarathna, and Ranga Ro- drigo. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9902–9912, 2022. 2
work page 2022
-
[3]
Scanqa: 3d question answering for spatial scene understanding
Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129– 19139, 2022. 6
work page 2022
-
[4]
ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data
Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021. 6
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[5]
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the net- work.arXiv preprint arXiv:2504.13181, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Ang Cao, Sergio Arnaud, Oleksandr Maksymets, Jianing Yang, Ayush Jain, Ada Martin, Vincent-Pierre Berges, Paul McVay, Ruslan Partsey, Aravind Rajeswaran, et al. From thousands to billions: 3d visual language grounding via render-supervised distillation from 2d vlms. InForty-second International Conference on Machine Learning, 2025. 2
work page 2025
-
[7]
Emerg- ing properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 2
work page 2021
-
[8]
Matterport3D: Learning from RGB-D Data in Indoor Environments
Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.arXiv preprint arXiv:1709.06158, 2017. 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[9]
Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning
Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26428–26438, 2024. 2, 6
work page 2024
-
[10]
Scannet: Richly-annotated 3d reconstructions of indoor scenes
Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 6
work page 2017
-
[11]
Pla: Language-driven open- vocabulary 3d scene understanding
Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, and Xiaojuan Qi. Pla: Language-driven open- vocabulary 3d scene understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7010–7019, 2023. 6
work page 2023
-
[12]
Scene-llm: Extending language model for 3d visual reasoning
Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-llm: Extending language model for 3d visual reasoning. In2025 IEEE/CVF Winter Conference on Ap- plications of Computer Vision (WACV), pages 2195–2206. IEEE, 2025. 2, 6
work page 2025
-
[13]
Can3tok: Canonical 3d tokenization and latent modeling of scene-level 3d gaussians
Quankai Gao, Iliyan Georgiev, Tuanfeng Y Wang, Kr- ishna Kumar Singh, Ulrich Neumann, and Jae Shin Yoon. Can3tok: Canonical 3d tokenization and latent modeling of scene-level 3d gaussians. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9320– 9331, 2025. 3
work page 2025
-
[14]
Anna-Maria Halacheva, Jan-Nico Zaech, Xi Wang, Danda Pani Paudel, and Luc Van Gool. Gaussianvlm: Scene-centric 3d vision-language models using language- aligned gaussian splats for embodied reasoning and beyond. arXiv preprint arXiv:2507.00886, 2025. 2, 3, 6
-
[15]
Momentum contrast for unsupervised visual rep- resentation learning
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual rep- resentation learning. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 9729–9738, 2020. 2
work page 2020
-
[16]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InCVPR, pages 16000–16009, 2022. 2
work page 2022
-
[17]
Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. Radiov2. 5: Improved baselines for agglomerative vision foundation models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22487–22497,
-
[18]
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distill- ing the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015. 2
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[19]
Gausstr: Foundation model-aligned gaussian transformer for self-supervised 3d spatial understanding
Haoyi Jiang, Liu Liu, Tianheng Cheng, Xinjie Wang, Tian- wei Lin, Zhizhong Su, Wenyu Liu, and Xinggang Wang. Gausstr: Foundation model-aligned gaussian transformer for self-supervised 3d spatial understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 11960–11970, 2025. 2
work page 2025
-
[20]
Open-vocabulary 3d semantic segmentation with foundation models
Li Jiang, Shaoshuai Shi, and Bernt Schiele. Open-vocabulary 3d semantic segmentation with foundation models. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21284–21294, 2024. 6
work page 2024
-
[21]
Relaxing accurate initialization constraint for 3d gaussian splatting
Jaewoo Jung, Jisang Han, Honggyu An, Jiwon Kang, Seonghoon Park, and Seungryong Kim. Relaxing accurate initialization constraint for 3d gaussian splatting. 2024. 5
work page 2024
-
[22]
3d gaussian splatting for real-time radiance field rendering.ACM Trans
Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1,
-
[23]
Lerf: Language embedded radiance fields
Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. InProceedings of the IEEE/CVF interna- tional conference on computer vision, pages 19729–19739,
-
[24]
Shakiba Kheradmand, Daniel Rebain, Gopal Sharma, Wei- wei Sun, Yang-Che Tseng, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. 3d gaussian splat- ting as markov chain monte carlo.Advances in Neural Infor- mation Processing Systems, 37:80965–80986, 2024. 5
work page 2024
-
[26]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 2
work page 2023
-
[27]
Mosaic3d: Foundation dataset and model for open-vocabulary 3d segmentation
Junha Lee, Chunghyun Park, Jaesung Choe, Yu- Chiang Frank Wang, Jan Kautz, Minsu Cho, and Chris Choy. Mosaic3d: Foundation dataset and model for open-vocabulary 3d segmentation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14089–14101, 2025. 2, 5, 6
work page 2025
-
[28]
Yue Li, Qi Ma, Runyi Yang, Huapeng Li, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Ender Konukoglu, Theo Gevers, et al. Scenesplat: Gaussian splatting-based scene un- derstanding with vision-language pretraining.arXiv preprint arXiv:2503.18052, 2025. 2, 3, 4, 5, 6
-
[29]
Scenesplat++: A large dataset and com- prehensive benchmark for language gaussian splatting
Mengjiao Ma, Qi Ma, Yue Li, Jiahuan Cheng, Runyi Yang, Bin Ren, Nikola Popovic, Mingqiang Wei, Nicu Sebe, Luc Van Gool, et al. Scenesplat++: A large dataset and com- prehensive benchmark for language gaussian splatting. In NeurIPS, 2025. 2, 3
work page 2025
-
[30]
A large-scale dataset of gaussian splats and their self-supervised pretrain- ing
Qi Ma, Yue Li, Bin Ren, Nicu Sebe, Ender Konukoglu, Theo Gevers, Luc Van Gool, and Danda Pani Paudel. A large-scale dataset of gaussian splats and their self-supervised pretrain- ing. In2025 International Conference on 3D Vision (3DV), pages 145–155. IEEE, 2025. 3
work page 2025
-
[31]
Ludvig: Learning-free uplift- ing of 2d visual features to gaussian splatting scenes
Juliette Marrie, Romain M ´en´egaux, Michael Arbel, Diane Larlus, and Julien Mairal. Ludvig: Learning-free uplift- ing of 2d visual features to gaussian splatting scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7440–7450, 2025. 1, 3
work page 2025
-
[32]
Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 1
work page 2021
-
[33]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
Yatian Pang, Eng Hock Francis Tay, Li Yuan, and Zhenghua Chen. Masked autoencoders for 3d point cloud self- supervised learning.World Scientific Annual Review of Arti- ficial Intelligence, 1:2440001, 2023. 2
work page 2023
-
[35]
3d vision-language gaussian splatting
Qucheng Peng, Benjamin Planche, Zhongpai Gao, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Chen Chen, and Ziyan Wu. 3d vision-language gaussian splatting. InThe Thirteenth International Conference on Learning Represen- tations, 2025. 1
work page 2025
-
[36]
Openscene: 3d scene understanding with open vocabularies
Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 815–824, 2023. 2, 6
work page 2023
-
[37]
Pointnet: Deep learning on point sets for 3d classification and segmentation
Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660,
-
[38]
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017. 3
work page 2017
-
[39]
Langsplat: 3d language gaussian splatting
Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20051–20060, 2024. 1
work page 2024
-
[40]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2
work page 2021
-
[41]
Phi-s: Dis- tribution balancing for label-free multi-teacher distillation
Mike Ranzinger, Jon Barker, Greg Heinrich, Pavlo Molchanov, Bryan Catanzaro, and Andrew Tao. Phi-s: Dis- tribution balancing for label-free multi-teacher distillation. arXiv preprint arXiv:2410.01680, 2024. 3
-
[42]
Am-radio: Agglomerative vision foundation model reduce all domains into one
Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. Am-radio: Agglomerative vision foundation model reduce all domains into one. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12490–12500, 2024. 2
work page 2024
-
[43]
Bin Ren, Guofeng Mei, Danda Pani Paudel, Weijie Wang, Yawei Li, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, and Nicu Sebe. Bringing masked autoencoders explicit con- trastive properties for point cloud self-supervised learning. InACCV, 2024. 2
work page 2024
-
[44]
Language- grounded indoor 3d semantic segmentation in the wild
David Rozenberszki, Or Litany, and Angela Dai. Language- grounded indoor 3d semantic segmentation in the wild. In European conference on computer vision, pages 125–141. Springer, 2022. 2
work page 2022
-
[45]
Dune: Distilling a universal encoder from heterogeneous 2d and 3d teachers
Mert B ¨ulent Sarıyıldız, Philippe Weinzaepfel, Thomas Lu- cas, Pau de Jorge, Diane Larlus, and Yannis Kalantidis. Dune: Distilling a universal encoder from heterogeneous 2d and 3d teachers. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 30084–30094, 2025. 2
work page 2025
-
[46]
Jonas Schult, Francis Engelmann, Alexander Hermans, Or Litany, Siyu Tang, and Bastian Leibe. Mask3d: Mask trans- 10 former for 3d semantic instance segmentation.arXiv preprint arXiv:2210.03105, 2022. 5
-
[47]
Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
Manycore Tech Inc. SpatialVerse Research Team. Interi- orgs: A 3d gaussian splatting dataset of semantically la- beled indoor scenes.https://huggingface.co/ datasets/spatialverse/InteriorGS, 2025. 2, 6
work page 2025
-
[49]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[50]
Unipre3d: Unified pre-training of 3d point cloud models with cross-modal gaussian splatting
Ziyi Wang, Yanran Zhang, Jie Zhou, and Jiwen Lu. Unipre3d: Unified pre-training of 3d point cloud models with cross-modal gaussian splatting. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1319–1329, 2025. 2
work page 2025
-
[51]
Point transformer v2: Grouped vector atten- tion and partition-based pooling, 2022
Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Heng- shuang Zhao. Point transformer v2: Grouped vector atten- tion and partition-based pooling, 2022. 2, 3
work page 2022
-
[52]
Masked scene contrast: A scalable framework for unsu- pervised 3d representation learning
Xiaoyang Wu, Xin Wen, Xihui Liu, and Hengshuang Zhao. Masked scene contrast: A scalable framework for unsu- pervised 3d representation learning. InProceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 9415–9424, 2023. 7
work page 2023
-
[53]
Point transformer v3: Simpler faster stronger
Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xi- hui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler faster stronger. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4840–4851, 2024. 2, 3, 7
work page 2024
-
[54]
Towards large- scale 3d representation learning with multi-dataset point prompt training
Xiaoyang Wu, Zhuotao Tian, Xin Wen, Bohao Peng, Xihui Liu, Kaicheng Yu, and Hengshuang Zhao. Towards large- scale 3d representation learning with multi-dataset point prompt training. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, pages 19551–19562, 2024. 7
work page 2024
-
[55]
Sonata: Self- supervised learning of reliable point representations
Xiaoyang Wu, Daniel DeTone, Duncan Frost, Tianwei Shen, Chris Xie, Nan Yang, Jakob Engel, Richard New- combe, Hengshuang Zhao, and Julian Straub. Sonata: Self- supervised learning of reliable point representations. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 22193–22204, 2025. 2, 5, 7
work page 2025
-
[56]
Pointcontrast: Unsupervised pre- training for 3d point cloud understanding
Saining Xie, Jiatao Gu, Demi Guo, Charles R Qi, Leonidas Guibas, and Or Litany. Pointcontrast: Unsupervised pre- training for 3d point cloud understanding. InEuropean con- ference on computer vision, pages 574–591. Springer, 2020. 2
work page 2020
-
[57]
Pointllm: Empowering large language models to understand point clouds
Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiang- miao Pang, and Dahua Lin. Pointllm: Empowering large language models to understand point clouds. InEuropean Conference on Computer Vision, pages 131–147. Springer,
-
[58]
Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding
Le Xue, Mingfei Gao, Chen Xing, Roberto Mart ´ın-Mart´ın, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 1179–1189, 2023. 2
work page 2023
-
[59]
Dejie Yang, Zhu Xu, Wentao Mo, Qingchao Chen, Siyuan Huang, and Yang Liu. 3d vision and language pre- training with large-scale synthetic data.arXiv preprint arXiv:2407.06084, 2024. 6
-
[60]
Regionplc: Regional point-language contrastive learning for open-world 3d scene understanding
Jihan Yang, Runyu Ding, Weipeng Deng, Zhe Wang, and Xi- aojuan Qi. Regionplc: Regional point-language contrastive learning for open-world 3d scene understanding. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19823–19832, 2024. 6
work page 2024
-
[61]
Scannet++: A high-fidelity dataset of 3d in- door scenes
Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d in- door scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023. 2, 6
work page 2023
-
[62]
Point-bert: Pre-training 3d point cloud transformers with masked point modeling
Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19313–19322, 2022. 2
work page 2022
-
[63]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 2
work page 2023
-
[64]
Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, and Hengshuang Zhao. Con- certo: Joint 2d-3d self-supervised learning emerges spatial representations.arXiv preprint arXiv:2510.23607, 2025. 2
-
[65]
Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021. 3
work page 2021
-
[66]
Structured3d: A large photo-realistic dataset for structured 3d modeling
Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3d: A large photo-realistic dataset for structured 3d modeling. InComputer Vision– ECCV 2020: 16th European Conference, Glasgow, UK, Au- gust 23–28, 2020, Proceedings, Part IX 16, pages 519–535. Springer, 2020. 6
work page 2020
-
[67]
Yuhang Zheng, Xiangyu Chen, Yupeng Zheng, Songen Gu, Runyi Yang, Bu Jin, Pengfei Li, Chengliang Zhong, Zeng- mao Wang, Lina Liu, et al. Gaussiangrasper: 3d lan- guage gaussian splatting for open-vocabulary robotic grasp- ing.arXiv preprint arXiv:2403.09637, 2024. 1 11
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.