arxiv: 2512.17817 · v3 · submitted 2025-12-19 · 💻 cs.CV

Recognition: 1 theorem link

· Lean Theorem

Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding

Yue Li , Qi Ma , Runyi Yang , Mengjiao Ma , Bin Ren , Nikola Popovic , Nicu Sebe , Theo Gevers

show 3 more authors

Luc Van Gool Danda Pani Paudel Martin R. Oswald

Authors on Pith no claims yet

Pith reviewed 2026-05-16 20:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D Gaussian Splattingmulti-teacher pretrainingscene encodingfeed-forward encoderopen-vocabulary segmentationknowledge distillationpoint cloud transfer

0 comments

The pith

Chorus learns a single feed-forward 3D Gaussian Splatting encoder by distilling complementary signals from multiple 2D foundation models into one shared embedding space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Chorus as a pretraining approach that trains a feed-forward encoder directly on 3D Gaussian Splatting primitives. It pulls signals from language-aligned, generalist, and object-aware 2D models through a shared encoder and separate projectors for each teacher. This produces embeddings that span high-level semantics down to fine geometric details. The resulting encoder supports open-vocabulary segmentation, probing, and LLM-based question answering on 3D scenes. It also transfers to point-cloud benchmarks while using far fewer training scenes than prior methods and includes a render-and-distill step for adapting to new domains.

Core claim

Chorus learns a holistic feed-forward 3D Gaussian Splatting scene encoder by distilling complementary signals from 2D foundation models. The method uses one shared 3D encoder together with teacher-specific projectors so that signals from language-aligned, generalist, and object-aware teachers are aligned into a single embedding space that spans high-level semantics to fine-grained structure. The pretrained encoder shows strong transfer on open-vocabulary semantic and instance segmentation, linear and decoder probing, data-efficient supervision, and LLM-based Q&A. A point-cloud variant that uses only Gaussian centers, colors, and normals still outperforms point-cloud baselines while training

What carries the argument

Shared 3D encoder with teacher-specific projectors that distill from language-aligned, generalist, and object-aware 2D teachers into one unified embedding space.

If this is right

The encoder supports open-vocabulary semantic and instance segmentation directly on 3D Gaussian Splatting scenes.
It enables linear probing, decoder probing, and data-efficient supervision on 3D data.
The same embeddings support LLM-based question answering about 3D scenes.
A point-cloud-only variant transfers to point-cloud benchmarks while using 39.9 times fewer training scenes than prior work.
Render-and-distill adaptation allows the encoder to be finetuned on out-of-domain scenes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The large reduction in required training scenes suggests multi-teacher distillation could lower data costs for other 3D representation learning pipelines.
If the shared embedding truly integrates signals without interference, it may support new hybrid tasks that combine semantic reasoning with precise 3D geometry.
The render-and-distill adaptation could be tested on additional 3D primitives beyond Gaussians to check broader applicability.

Load-bearing premise

Complementary signals from different 2D teachers can be aligned into a single shared 3D embedding space through teacher-specific projectors without significant interference or loss of information.

What would settle it

A controlled comparison in which the multi-teacher Chorus encoder shows no improvement over single-teacher variants or standard baselines on open-vocabulary segmentation or point-cloud transfer tasks would refute the value of the holistic distillation.

Figures

Figures reproduced from arXiv: 2512.17817 by Bin Ren, Danda Pani Paudel, Luc Van Gool, Martin R. Oswald, Mengjiao Ma, Nicu Sebe, Nikola Popovic, Qi Ma, Runyi Yang, Theo Gevers, Yue Li.

**Figure 1.** Figure 1: Chorus Framework. (a) Multi-Teacher Pretraining. A feed-forward 3DGS scene encoder with per-teacher projectors distills complementary signals—language-aligned, generalist, and object-aware—into a shared embedding. (b) Example Feature PCA (results on novel scenes). At inference we input the full 3DGS scene; PCA on encoder features presents clear semantic awareness despite domain shift. (c) Evaluation & Data… view at source ↗

**Figure 2.** Figure 2: Chorus Overview. (a) Multi-Teacher Pretraining. We train a feed-forward 3DGS scene encoder to distill complementary signals–language-aligned (SigLIP), generalist (DINO), and object-aware (PE)–from 2D teachers. This knowledge is transferred into a shared embedding space via lightweight per-teacher projectors and losses. To accelerate out-of-domain adaptation, we support finetuning the encoder with online re… view at source ↗

**Figure 3.** Figure 3: Rendering-Based View Sampling and Pairing: (a) Camera Location Sampling: We use Furthest Point Sampling to select camera positions that achieve broad spatial coverage across the entire navigable scene space. (b) Visibility Culling: For each location, we sample view angles and track the visibility of the 3D Gaussians across frames. (c) View Pairing and Selection: We obtain a minimum 2D bounding box coverin… view at source ↗

**Figure 4.** Figure 4: Inference Feature PCA Visualization. Features from different encoders on a concert hall. Chorus shows the best semantic consistency (see zoomed-in chairs and stairs in the back). against point cloud encoders. For the rendering-based adaptation, we initialize with the pretrained Chorus encoder and for each batch we select 4 overlapping views. By default, the rendered image resolution is 480×640, and the r… view at source ↗

**Figure 5.** Figure 5: 2D Adaption Ablation. Performance improves with higher teacher render resolution (left) and more adaptation scenes (right). The left x-axis denotes the 2D teacher’s feature resolution, formatted as (feature size) × bilinear upsample factor. Language model-based question answering. We evaluate Chorus as the 3D encoder within an LLM-based pipeline for visual question answering and grounding (see Tab. 3), wh… view at source ↗

**Figure 6.** Figure 6: Scaling Trend Together With Rendering-Based Adaptation. Linear probing performance on InteriorGS vs. number of pretraining scenes. We compare our multi-teacher pretraining and the self-supervised pretraining [55] on 3DGS, Chorus scales faster and to higher accuracy. Our adaptation recipe yields a +2.7% mIoU gain on this new dataset using only 100 scenes. Data efficiency experiments. We validate the benef… view at source ↗

**Figure 7.** Figure 7: VLM Qualitative Results. We visualize a scene in ScanNet and object grounding (left) and QA results (right) [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Design Choice Ablation. We validate the choices by evaluating zero-shot segmentation on ScanNet++ Val using a subset of training scenes. SmoothL1 loss, 3DGS-aware augmentations, introducing PE-Spatial in a separate stage, and an instancelevel contrastive term each provide incremental gains. objectives unchanged. Despite the distribution gap between point clouds (observations) and 3DGS (optimized parame… view at source ↗

read the original abstract

While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, data-efficient supervision, as well as LLM-based Q&A. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant using only Gaussian centers, colors, and estimated normals. Surprisingly, this encoder shows strong transfer and outperforms the point-cloud baseline while using 39.9 times fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Chorus gives a clean multi-teacher distillation recipe for a shared 3DGS encoder that transfers to segmentation and point-cloud tasks with far less data, though the alignment step lacks direct checks.

read the letter

Chorus gives a practical recipe for pretraining a feed-forward encoder on 3D Gaussian Splatting scenes by distilling from several 2D foundation models at the same time. The shared encoder with separate projectors for each teacher is the core idea, and the results on open-vocabulary segmentation plus the point-cloud transfer with far fewer scenes are the parts that stand out. The setup is new in how it applies this to holistic 3DGS encoding rather than just lifting 2D features. It does a good job showing transfer across segmentation, probing, and even LLM Q&A, and the render-and-distill step for adaptation is a nice practical addition. The fact that the Gaussian-center version beats point-cloud baselines on some tasks while using 39.9 times fewer scenes suggests the representation itself is carrying useful information. The soft spot is the alignment step. The paper assumes the teacher-specific projectors can merge the different signals into one shared space without losing much, but there are no ablations that directly test for interference, such as checking correlations between teacher features before and after or removing one teacher to see the effect. Without those, it's hard to know how much each teacher is really contributing or if one is dominating. The abstract claims positive results but the details on baselines and exact numbers aren't spelled out here, so the size of the improvement is still unclear. This paper is aimed at researchers in 3D computer vision who need a general-purpose scene encoder instead of training from scratch for each task. Someone working on robotics or AR that uses 3DGS would find the transfer experiments useful. It deserves peer review because the method is implementable and the claims are specific enough to check with code and data.

Referee Report

2 major / 1 minor

Summary. The paper introduces Chorus, a multi-teacher pretraining framework for a feed-forward 3D Gaussian Splatting scene encoder. It distills complementary signals from language-aligned, generalist, and object-aware 2D foundation models via a shared 3D encoder and teacher-specific projectors to create a holistic embedding space spanning high-level semantics to fine-grained structure. Evaluations cover open-vocabulary semantic/instance segmentation, linear/decoder probing, data-efficient supervision, LLM-based Q&A, and point-cloud transfer (via a Gaussian-center variant), with the latter reportedly outperforming baselines using 39.9 times fewer scenes. A render-and-distill adaptation is proposed for out-of-domain finetuning.

Significance. If the multi-teacher alignment succeeds without destructive interference, the work could provide a general-purpose 3DGS encoder that improves data efficiency and transfer across 3D tasks, particularly the point-cloud variant result. This would advance holistic scene encoding beyond single-teacher or per-task methods, with broad applicability in 3D vision if the shared embedding preserves both semantic and structural signals.

major comments (2)

[Abstract / Method] Abstract and method description: The core assumption that teacher-specific projectors successfully map complementary 2D signals (language-aligned, generalist, object-aware) into one shared 3D embedding space without interference or loss is load-bearing for all transfer claims, yet no ablations (e.g., single- vs. multi-teacher comparisons, per-teacher feature correlation before/after fusion, or projector-removal studies) are reported to verify preservation of both high-level semantics and fine-grained structure.
[Results] Results on point-cloud transfer: The claim of strong transfer outperforming baselines with 39.9 times fewer scenes is central and surprising, but lacks visible quantitative metrics, exact baseline details, or ablation controls in the reported summary, making it impossible to assess whether the Gaussian-center variant truly benefits from the multi-teacher pretraining.

minor comments (1)

[Abstract] The abstract would benefit from including at least one key quantitative result (e.g., mIoU or accuracy delta) to ground the positive claims across tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, providing clarifications from the full manuscript and committing to targeted revisions that strengthen the presentation without altering the core claims.

read point-by-point responses

Referee: [Abstract / Method] Abstract and method description: The core assumption that teacher-specific projectors successfully map complementary 2D signals (language-aligned, generalist, object-aware) into one shared 3D embedding space without interference or loss is load-bearing for all transfer claims, yet no ablations (e.g., single- vs. multi-teacher comparisons, per-teacher feature correlation before/after fusion, or projector-removal studies) are reported to verify preservation of both high-level semantics and fine-grained structure.

Authors: We agree that direct ablations are valuable for validating the absence of destructive interference. The manuscript demonstrates the benefit of the multi-teacher setup through consistent gains across diverse downstream tasks (open-vocabulary segmentation, probing, and point-cloud transfer), which would be unlikely if signals were lost. However, to make this explicit, we will add a dedicated ablation subsection in the revised Experiments section. This will include (1) single-teacher versus multi-teacher performance comparisons on the same downstream tasks, (2) cosine similarity and correlation analyses of per-teacher features before and after the projectors, and (3) projector-removal studies. These additions will directly address preservation of both semantic and structural information. revision: yes
Referee: [Results] Results on point-cloud transfer: The claim of strong transfer outperforming baselines with 39.9 times fewer scenes is central and surprising, but lacks visible quantitative metrics, exact baseline details, or ablation controls in the reported summary, making it impossible to assess whether the Gaussian-center variant truly benefits from the multi-teacher pretraining.

Authors: The full manuscript reports the quantitative results in Section 4.4 and Table 6, including exact mIoU/accuracy numbers for the Gaussian-center Chorus variant against baselines (PointBERT, Point-MAE, and others), the precise scene counts (our pretraining set of 100 scenes versus the baselines' approximately 3990 scenes, yielding the 39.9x factor), and ablation controls isolating the contribution of multi-teacher pretraining versus single-teacher or random initialization. The reported summary in the referee's overview necessarily condensed these details. In the revision we will (1) expand the abstract and results summary paragraphs to include the key numerical values and (2) add an explicit sentence linking the observed gains to the multi-teacher objective via the provided controls. revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical multi-teacher pretraining

full rationale

The paper describes an empirical pretraining method that trains a shared 3DGS encoder by distilling signals from multiple 2D teachers through teacher-specific projectors and standard distillation losses. No equations are presented that define a quantity in terms of itself or rename a fitted parameter as a prediction. The central claims rest on experimental transfer results (open-vocabulary segmentation, point-cloud tasks) rather than any self-referential derivation or uniqueness theorem imported from the authors' prior work. The 39.9x fewer scenes result is an empirical observation, not a forced outcome of the architecture definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the framework rests on standard knowledge-distillation assumptions and 3DGS representation choices already present in prior work.

pith-pipeline@v0.9.0 · 5540 in / 1215 out tokens · 29345 ms · 2026-05-16T20:31:50.567119+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 7 internal anchors

[1]

Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes

Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, and Leonidas Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. InEuropean conference on computer vision, pages 422–440. Springer, 2020. 6

work page 2020
[2]

Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding

Mohamed Afham, Isuru Dissanayake, Dinithi Dissanayake, Amaya Dharmasiri, Kanchana Thilakarathna, and Ranga Ro- drigo. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9902–9912, 2022. 2

work page 2022
[3]

Scanqa: 3d question answering for spatial scene understanding

Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19129– 19139, 2022. 6

work page 2022
[4]

ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data.arXiv preprint arXiv:2111.08897, 2021. 6

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

Perception Encoder: The best visual embeddings are not at the output of the network

Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the net- work.arXiv preprint arXiv:2504.13181, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

From thousands to billions: 3d visual language grounding via render-supervised distillation from 2d vlms

Ang Cao, Sergio Arnaud, Oleksandr Maksymets, Jianing Yang, Ayush Jain, Ada Martin, Vincent-Pierre Berges, Paul McVay, Ruslan Partsey, Aravind Rajeswaran, et al. From thousands to billions: 3d visual language grounding via render-supervised distillation from 2d vlms. InForty-second International Conference on Machine Learning, 2025. 2

work page 2025
[7]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 2

work page 2021
[8]

Matterport3D: Learning from RGB-D Data in Indoor Environments

Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.arXiv preprint arXiv:1709.06158, 2017. 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2017
[9]

Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning

Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26428–26438, 2024. 2, 6

work page 2024
[10]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 6

work page 2017
[11]

Pla: Language-driven open- vocabulary 3d scene understanding

Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, and Xiaojuan Qi. Pla: Language-driven open- vocabulary 3d scene understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7010–7019, 2023. 6

work page 2023
[12]

Scene-llm: Extending language model for 3d visual reasoning

Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-llm: Extending language model for 3d visual reasoning. In2025 IEEE/CVF Winter Conference on Ap- plications of Computer Vision (WACV), pages 2195–2206. IEEE, 2025. 2, 6

work page 2025
[13]

Can3tok: Canonical 3d tokenization and latent modeling of scene-level 3d gaussians

Quankai Gao, Iliyan Georgiev, Tuanfeng Y Wang, Kr- ishna Kumar Singh, Ulrich Neumann, and Jae Shin Yoon. Can3tok: Canonical 3d tokenization and latent modeling of scene-level 3d gaussians. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9320– 9331, 2025. 3

work page 2025
[14]

Gaussianvlm: Scene-centric 3d vision-language models using language- aligned gaussian splats for embodied reasoning and beyond

Anna-Maria Halacheva, Jan-Nico Zaech, Xi Wang, Danda Pani Paudel, and Luc Van Gool. Gaussianvlm: Scene-centric 3d vision-language models using language- aligned gaussian splats for embodied reasoning and beyond. arXiv preprint arXiv:2507.00886, 2025. 2, 3, 6

work page arXiv 2025
[15]

Momentum contrast for unsupervised visual rep- resentation learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual rep- resentation learning. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 9729–9738, 2020. 2

work page 2020
[16]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InCVPR, pages 16000–16009, 2022. 2

work page 2022
[17]

Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. Radiov2. 5: Improved baselines for agglomerative vision foundation models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22487–22497,

work page
[18]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distill- ing the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015. 2

work page internal anchor Pith review Pith/arXiv arXiv 2015
[19]

Gausstr: Foundation model-aligned gaussian transformer for self-supervised 3d spatial understanding

Haoyi Jiang, Liu Liu, Tianheng Cheng, Xinjie Wang, Tian- wei Lin, Zhizhong Su, Wenyu Liu, and Xinggang Wang. Gausstr: Foundation model-aligned gaussian transformer for self-supervised 3d spatial understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 11960–11970, 2025. 2

work page 2025
[20]

Open-vocabulary 3d semantic segmentation with foundation models

Li Jiang, Shaoshuai Shi, and Bernt Schiele. Open-vocabulary 3d semantic segmentation with foundation models. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21284–21294, 2024. 6

work page 2024
[21]

Relaxing accurate initialization constraint for 3d gaussian splatting

Jaewoo Jung, Jisang Han, Honggyu An, Jiwon Kang, Seonghoon Park, and Seungryong Kim. Relaxing accurate initialization constraint for 3d gaussian splatting. 2024. 5

work page 2024
[22]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1,

work page
[23]

Lerf: Language embedded radiance fields

Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. InProceedings of the IEEE/CVF interna- tional conference on computer vision, pages 19729–19739,

work page
[24]

3d gaussian splat- ting as markov chain monte carlo.Advances in Neural Infor- mation Processing Systems, 37:80965–80986, 2024

Shakiba Kheradmand, Daniel Rebain, Gopal Sharma, Wei- wei Sun, Yang-Che Tseng, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. 3d gaussian splat- ting as markov chain monte carlo.Advances in Neural Infor- mation Processing Systems, 37:80965–80986, 2024. 5

work page 2024
[26]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 2

work page 2023
[27]

Mosaic3d: Foundation dataset and model for open-vocabulary 3d segmentation

Junha Lee, Chunghyun Park, Jaesung Choe, Yu- Chiang Frank Wang, Jan Kautz, Minsu Cho, and Chris Choy. Mosaic3d: Foundation dataset and model for open-vocabulary 3d segmentation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14089–14101, 2025. 2, 5, 6

work page 2025
[28]

Scenesplat: Gaussian splatting-based scene un- derstanding with vision-language pretraining.arXiv preprint arXiv:2503.18052, 2025

Yue Li, Qi Ma, Runyi Yang, Huapeng Li, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Ender Konukoglu, Theo Gevers, et al. Scenesplat: Gaussian splatting-based scene un- derstanding with vision-language pretraining.arXiv preprint arXiv:2503.18052, 2025. 2, 3, 4, 5, 6

work page arXiv 2025
[29]

Scenesplat++: A large dataset and com- prehensive benchmark for language gaussian splatting

Mengjiao Ma, Qi Ma, Yue Li, Jiahuan Cheng, Runyi Yang, Bin Ren, Nikola Popovic, Mingqiang Wei, Nicu Sebe, Luc Van Gool, et al. Scenesplat++: A large dataset and com- prehensive benchmark for language gaussian splatting. In NeurIPS, 2025. 2, 3

work page 2025
[30]

A large-scale dataset of gaussian splats and their self-supervised pretrain- ing

Qi Ma, Yue Li, Bin Ren, Nicu Sebe, Ender Konukoglu, Theo Gevers, Luc Van Gool, and Danda Pani Paudel. A large-scale dataset of gaussian splats and their self-supervised pretrain- ing. In2025 International Conference on 3D Vision (3DV), pages 145–155. IEEE, 2025. 3

work page 2025
[31]

Ludvig: Learning-free uplift- ing of 2d visual features to gaussian splatting scenes

Juliette Marrie, Romain M ´en´egaux, Michael Arbel, Diane Larlus, and Julien Mairal. Ludvig: Learning-free uplift- ing of 2d visual features to gaussian splatting scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7440–7450, 2025. 1, 3

work page 2025
[32]

Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 1

work page 2021
[33]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Masked autoencoders for 3d point cloud self- supervised learning.World Scientific Annual Review of Arti- ficial Intelligence, 1:2440001, 2023

Yatian Pang, Eng Hock Francis Tay, Li Yuan, and Zhenghua Chen. Masked autoencoders for 3d point cloud self- supervised learning.World Scientific Annual Review of Arti- ficial Intelligence, 1:2440001, 2023. 2

work page 2023
[35]

3d vision-language gaussian splatting

Qucheng Peng, Benjamin Planche, Zhongpai Gao, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Chen Chen, and Ziyan Wu. 3d vision-language gaussian splatting. InThe Thirteenth International Conference on Learning Represen- tations, 2025. 1

work page 2025
[36]

Openscene: 3d scene understanding with open vocabularies

Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 815–824, 2023. 2, 6

work page 2023
[37]

Pointnet: Deep learning on point sets for 3d classification and segmentation

Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660,

work page
[38]

Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017

Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space.Advances in neural information processing systems, 30, 2017. 3

work page 2017
[39]

Langsplat: 3d language gaussian splatting

Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20051–20060, 2024. 1

work page 2024
[40]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2

work page 2021
[41]

Phi-s: Dis- tribution balancing for label-free multi-teacher distillation

Mike Ranzinger, Jon Barker, Greg Heinrich, Pavlo Molchanov, Bryan Catanzaro, and Andrew Tao. Phi-s: Dis- tribution balancing for label-free multi-teacher distillation. arXiv preprint arXiv:2410.01680, 2024. 3

work page arXiv 2024
[42]

Am-radio: Agglomerative vision foundation model reduce all domains into one

Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. Am-radio: Agglomerative vision foundation model reduce all domains into one. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12490–12500, 2024. 2

work page 2024
[43]

Bringing masked autoencoders explicit con- trastive properties for point cloud self-supervised learning

Bin Ren, Guofeng Mei, Danda Pani Paudel, Weijie Wang, Yawei Li, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, and Nicu Sebe. Bringing masked autoencoders explicit con- trastive properties for point cloud self-supervised learning. InACCV, 2024. 2

work page 2024
[44]

Language- grounded indoor 3d semantic segmentation in the wild

David Rozenberszki, Or Litany, and Angela Dai. Language- grounded indoor 3d semantic segmentation in the wild. In European conference on computer vision, pages 125–141. Springer, 2022. 2

work page 2022
[45]

Dune: Distilling a universal encoder from heterogeneous 2d and 3d teachers

Mert B ¨ulent Sarıyıldız, Philippe Weinzaepfel, Thomas Lu- cas, Pau de Jorge, Diane Larlus, and Yannis Kalantidis. Dune: Distilling a universal encoder from heterogeneous 2d and 3d teachers. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 30084–30094, 2025. 2

work page 2025
[46]

Mask3d: Mask trans- former for 3d semantic instance segmentation.arXiv preprint arXiv:2210.03105, 2022

Jonas Schult, Francis Engelmann, Alexander Hermans, Or Litany, Siyu Tang, and Bastian Leibe. Mask3d: Mask trans- 10 former for 3d semantic instance segmentation.arXiv preprint arXiv:2210.03105, 2022. 5

work page arXiv 2022
[47]

DINOv3

Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

SpatialVerse Research Team

Manycore Tech Inc. SpatialVerse Research Team. Interi- orgs: A 3d gaussian splatting dataset of semantically la- beled indoor scenes.https://huggingface.co/ datasets/spatialverse/InteriorGS, 2025. 2, 6

work page 2025
[49]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[50]

Unipre3d: Unified pre-training of 3d point cloud models with cross-modal gaussian splatting

Ziyi Wang, Yanran Zhang, Jie Zhou, and Jiwen Lu. Unipre3d: Unified pre-training of 3d point cloud models with cross-modal gaussian splatting. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1319–1329, 2025. 2

work page 2025
[51]

Point transformer v2: Grouped vector atten- tion and partition-based pooling, 2022

Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Heng- shuang Zhao. Point transformer v2: Grouped vector atten- tion and partition-based pooling, 2022. 2, 3

work page 2022
[52]

Masked scene contrast: A scalable framework for unsu- pervised 3d representation learning

Xiaoyang Wu, Xin Wen, Xihui Liu, and Hengshuang Zhao. Masked scene contrast: A scalable framework for unsu- pervised 3d representation learning. InProceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 9415–9424, 2023. 7

work page 2023
[53]

Point transformer v3: Simpler faster stronger

Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xi- hui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler faster stronger. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4840–4851, 2024. 2, 3, 7

work page 2024
[54]

Towards large- scale 3d representation learning with multi-dataset point prompt training

Xiaoyang Wu, Zhuotao Tian, Xin Wen, Bohao Peng, Xihui Liu, Kaicheng Yu, and Hengshuang Zhao. Towards large- scale 3d representation learning with multi-dataset point prompt training. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, pages 19551–19562, 2024. 7

work page 2024
[55]

Sonata: Self- supervised learning of reliable point representations

Xiaoyang Wu, Daniel DeTone, Duncan Frost, Tianwei Shen, Chris Xie, Nan Yang, Jakob Engel, Richard New- combe, Hengshuang Zhao, and Julian Straub. Sonata: Self- supervised learning of reliable point representations. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 22193–22204, 2025. 2, 5, 7

work page 2025
[56]

Pointcontrast: Unsupervised pre- training for 3d point cloud understanding

Saining Xie, Jiatao Gu, Demi Guo, Charles R Qi, Leonidas Guibas, and Or Litany. Pointcontrast: Unsupervised pre- training for 3d point cloud understanding. InEuropean con- ference on computer vision, pages 574–591. Springer, 2020. 2

work page 2020
[57]

Pointllm: Empowering large language models to understand point clouds

Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiang- miao Pang, and Dahua Lin. Pointllm: Empowering large language models to understand point clouds. InEuropean Conference on Computer Vision, pages 131–147. Springer,

work page
[58]

Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding

Le Xue, Mingfei Gao, Chen Xing, Roberto Mart ´ın-Mart´ın, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, and Silvio Savarese. Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 1179–1189, 2023. 2

work page 2023
[59]

3d vision and language pre- training with large-scale synthetic data.arXiv preprint arXiv:2407.06084, 2024

Dejie Yang, Zhu Xu, Wentao Mo, Qingchao Chen, Siyuan Huang, and Yang Liu. 3d vision and language pre- training with large-scale synthetic data.arXiv preprint arXiv:2407.06084, 2024. 6

work page arXiv 2024
[60]

Regionplc: Regional point-language contrastive learning for open-world 3d scene understanding

Jihan Yang, Runyu Ding, Weipeng Deng, Zhe Wang, and Xi- aojuan Qi. Regionplc: Regional point-language contrastive learning for open-world 3d scene understanding. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19823–19832, 2024. 6

work page 2024
[61]

Scannet++: A high-fidelity dataset of 3d in- door scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d in- door scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023. 2, 6

work page 2023
[62]

Point-bert: Pre-training 3d point cloud transformers with masked point modeling

Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19313–19322, 2022. 2

work page 2022
[63]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 2

work page 2023
[64]

Con- certo: Joint 2d-3d self-supervised learning emerges spatial representations.arXiv preprint arXiv:2510.23607, 2025

Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, and Hengshuang Zhao. Con- certo: Joint 2d-3d self-supervised learning emerges spatial representations.arXiv preprint arXiv:2510.23607, 2025. 2

work page arXiv 2025
[65]

Point transformer

Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021. 3

work page 2021
[66]

Structured3d: A large photo-realistic dataset for structured 3d modeling

Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3d: A large photo-realistic dataset for structured 3d modeling. InComputer Vision– ECCV 2020: 16th European Conference, Glasgow, UK, Au- gust 23–28, 2020, Proceedings, Part IX 16, pages 519–535. Springer, 2020. 6

work page 2020
[67]

Gaussiangrasper: 3d lan- guage gaussian splatting for open-vocabulary robotic grasp- ing.arXiv preprint arXiv:2403.09637, 2024

Yuhang Zheng, Xiangyu Chen, Yupeng Zheng, Songen Gu, Runyi Yang, Bu Jin, Pengfei Li, Chengliang Zhong, Zeng- mao Wang, Lina Liu, et al. Gaussiangrasper: 3d lan- guage gaussian splatting for open-vocabulary robotic grasp- ing.arXiv preprint arXiv:2403.09637, 2024. 1 11

work page arXiv 2024