Relation-Centric Open-Vocabulary 3D Gaussian Segmentation

Eunsung Cha; Hyunjoon Lee; Jaesik Park

arxiv: 2607.01140 · v1 · pith:FMAQ7PQCnew · submitted 2026-07-01 · 💻 cs.CV

Relation-Centric Open-Vocabulary 3D Gaussian Segmentation

Eunsung Cha , Hyunjoon Lee , Jaesik Park This is my paper

Pith reviewed 2026-07-02 13:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D Gaussian segmentationopen-vocabulary segmentationpairwise relationsrelation graph3D scene understandingcomputer visioninstance segmentation

0 comments

The pith

PairGS reframes 3D Gaussian segmentation as pairwise relation modeling to separate Gaussians along object boundaries for any language query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PairGS as a way to handle open-vocabulary 3D Gaussian segmentation by estimating relations between pairs of Gaussians instead of embedding language into each Gaussian or optimizing per-Gaussian features. It draws on view contribution weights and multi-view mask evidence to build a relation graph through sparse candidate edges, precise affinity computation, and a hierarchical cluster tree. This setup targets cleaner boundary separation and multi-granular queries while skipping the expense of per-scene optimization. A reader would care if the method delivers both higher accuracy on benchmarks and substantially lower runtime than earlier approaches.

Core claim

PairGS explicitly constructs a relation graph for segmentation by first proposing sparse edge candidates using low-dimensional descriptors, then computing precise pairwise affinities only on those candidates, and finally building a hierarchical cluster tree for multi-granular querying, yielding state-of-the-art results on open-vocabulary 3D Gaussian segmentation benchmarks with a fast variant that runs 50 times faster than optimization-based instance-feature methods.

What carries the argument

The relation graph constructed from view contribution weights and multi-view mask evidence, which encodes pairwise affinities to group Gaussians into object-consistent clusters.

If this is right

Gaussian segmentations become responsive to diverse language queries without embedding language knowledge into individual points.
Hierarchical cluster trees support querying at multiple levels of granularity from the same relation graph.
Segmentation avoids the computational overhead of per-scene optimization of instance features.
Performance reaches state-of-the-art levels on standard open-vocabulary 3D Gaussian segmentation benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pairwise-relation construction could be tested on tasks such as 3D instance tracking across time or editing of reconstructed scenes.
If the low-dimensional descriptors reliably surface relevant edges, similar candidate-proposal steps might reduce cost in other dense 3D correspondence problems.
The approach suggests that explicit graph modeling may substitute for learned per-point embeddings in additional open-vocabulary 3D perception settings.

Load-bearing premise

View contribution weights and multi-view mask evidence supply signals rich enough to build an accurate relation graph that cleanly separates Gaussians along object boundaries for arbitrary language queries.

What would settle it

A test scene containing clear multi-view mask evidence where the resulting relation graph merges Gaussians from distinct objects or splits Gaussians from the same object under a straightforward language query.

Figures

Figures reproduced from arXiv: 2607.01140 by Eunsung Cha, Hyunjoon Lee, Jaesik Park.

**Figure 1.** Figure 1: We propose PairGS, a training-free method that estimates pairwise relations between Gaussians to enable efficient and fine-grained [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: Overview of PairGS. Our method begins with a pretrained 3DGS [ [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Illustration of the two sampling locations for robust mask [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Illustration of the Hierarchical Pruning steps in TreeDB [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison. We visualize open-vocabulary object selection results on LERF [21] dataset. Compared to languagebased [18] and instance-feature-based [6, 29, 48] baselines, PairGS produces cleaner 3D Gaussian segmentations for objects in contact with others, such as a spoon or a spatula. It also yields coherent instance separation for larger objects. In particular, THGS [12] is point-based and con… view at source ↗

**Figure 6.** Figure 6: Qualitative comparison. We visualize open-vocabulary semantic segmentation on ScanNet [11] dataset. Our method demonstrates robust performance across diverse queries, accurately segmenting objects at varying scales, such as large desks and smaller chairs [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

read the original abstract

Open-vocabulary 3D Gaussian segmentation is challenging because it requires language understanding for diverse queries and accurate separation of Gaussians along object boundaries. Prior approaches either embed language knowledge into individual Gaussians to improve query responsiveness or optimize per-Gaussian instance features to encode object identity. However, these strategies may produce noisy Gaussian segmentations or rely on cost-inefficient per-scene optimization. We propose PairGS, a framework that reframes Gaussian segmentation as modeling pairwise relations between Gaussians. 3D Gaussian representations provide rich signals for relation estimation, such as view contribution weights and multi-view mask evidence. By leveraging these cues, PairGS explicitly constructs a relation graph for segmentation without a heavy optimization process. PairGS first proposes sparse edge candidates using low-dimensional descriptors, computes precise pairwise affinities only on those candidates, and builds a hierarchical cluster tree for multi-granular querying. It achieves state-of-the-art results on open-vocabulary 3D Gaussian segmentation benchmarks, while the fast variant is 50x faster than optimization-based instance-feature approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PairGS reframes 3D Gaussian segmentation as pairwise relation modeling to skip per-scene optimization, but the abstract supplies zero numbers or benchmarks to back the SOTA and 50x speedup claims.

read the letter

The paper's main move is to build one relation graph from view contribution weights and multi-view mask evidence, then answer language queries by selecting subtrees in a hierarchical cluster tree. It first picks sparse edge candidates with low-dimensional descriptors before computing affinities only on those. This is distinct from the two prior strategies it names—per-Gaussian language embeddings and per-scene instance-feature optimization—and targets the efficiency problem directly.

The construction itself is straightforward and avoids heavy per-scene fitting, which is a practical plus for robotics or AR settings that need fast language-driven separation.

The soft spot is the performance claim. The abstract asserts state-of-the-art results and a 50x speedup for the fast variant, yet gives no benchmark names, no quantitative scores, no ablations, and no error analysis. Without those, the central assertion cannot be checked. The key assumption—that view weights plus 2D masks will produce clusters whose boundaries match semantic objects for arbitrary queries—also sits untested in the provided text. If the masks from open-vocabulary 2D models miss occluded or visually similar objects, or if the candidate selection drops critical edges, the graph will connect across boundaries or miss intra-object links. That concern from the stress-test note lands because the method relies entirely on those external 2D signals.

The method description looks internally consistent and does not appear to hide circular fitting. This is for people working on efficient 3D scene understanding who want an alternative to optimization-heavy pipelines. A reader focused on relation-based clustering would find the graph construction useful to examine.

If the full paper contains the missing experiments and comparisons, it deserves a serious referee. Otherwise the claims are too thin to evaluate.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces PairGS, a framework that reframes open-vocabulary 3D Gaussian segmentation as explicit modeling of pairwise relations between Gaussians. It leverages view contribution weights and multi-view mask evidence to propose sparse edge candidates, compute affinities, and construct a hierarchical cluster tree that supports multi-granular language queries, claiming state-of-the-art results on benchmarks together with a 50x speedup for the fast variant relative to optimization-based instance-feature methods.

Significance. If the performance claims are substantiated with quantitative evidence, the relation-centric formulation offers a potentially scalable alternative to per-scene optimization or per-Gaussian language embedding, which could reduce computational overhead in practical 3D scene understanding applications while preserving query flexibility.

major comments (2)

[Abstract] Abstract: the central SOTA and 50x speedup claims are asserted without any quantitative numbers, benchmark names, ablation tables, or error analysis, rendering the primary performance contribution unverifiable from the provided text.
[Abstract] PairGS construction (abstract description of relation graph): the claim that view contribution weights plus multi-view mask evidence suffice to produce a query-agnostic hierarchical cluster tree whose boundaries align with semantic objects for arbitrary queries is load-bearing, yet no analysis is supplied of failure modes when 2D masks from open-vocabulary models mis-segment visually similar or occluded objects or when low-dimensional descriptors omit critical edges.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the central SOTA and 50x speedup claims are asserted without any quantitative numbers, benchmark names, ablation tables, or error analysis, rendering the primary performance contribution unverifiable from the provided text.

Authors: We agree that the abstract would benefit from greater specificity to allow readers to immediately verify the claims. In the revised version we will expand the abstract to name the primary benchmarks, report concrete metrics (including the exact speedup factor relative to the cited optimization-based baselines), and explicitly reference the main-text tables and ablations that substantiate the SOTA results. revision: yes
Referee: [Abstract] PairGS construction (abstract description of relation graph): the claim that view contribution weights plus multi-view mask evidence suffice to produce a query-agnostic hierarchical cluster tree whose boundaries align with semantic objects for arbitrary queries is load-bearing, yet no analysis is supplied of failure modes when 2D masks from open-vocabulary models mis-segment visually similar or occluded objects or when low-dimensional descriptors omit critical edges.

Authors: The abstract summarizes the core design choice; the full manuscript already contains quantitative results across scenes containing occlusions and visually similar objects that implicitly evaluate robustness. Nevertheless, we acknowledge the value of an explicit failure-mode discussion. We will add a dedicated paragraph in the revised manuscript that examines cases where 2D mask errors or descriptor omissions occur and shows how the multi-view affinity computation and hierarchical clustering limit their impact. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method relies on external signals without self-referential reduction

full rationale

The paper reframes open-vocabulary 3D Gaussian segmentation as pairwise relation modeling and constructs a relation graph from view contribution weights and multi-view mask evidence. No equations, derivations, or self-citations are shown that reduce the claimed SOTA performance or cluster quality to a fitted parameter or input defined by the same data. The fast variant's speed advantage is presented as arising from avoiding per-scene optimization rather than any tautological prediction. The central construction uses external cues (masks, weights) to build the graph, making the derivation self-contained against benchmarks without fitting-input-called-prediction or self-definition patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method implicitly assumes that existing Gaussian attributes (view weights, masks) suffice for relation estimation.

pith-pipeline@v0.9.1-grok · 5711 in / 1116 out tokens · 21396 ms · 2026-07-02T13:31:08.378668+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 11 canonical work pages · 3 internal anchors

[1]

Principal component anal- ysis.Wiley interdisciplinary reviews: computational statistics, 2(4):433–459, 2010

Herv´e Abdi and Lynne J Williams. Principal component anal- ysis.Wiley interdisciplinary reviews: computational statistics, 2(4):433–459, 2010. 4

2010
[2]

Lightsplat: Fast and memory-efficient open-vocabulary 3d scene understanding in five seconds

Jaehun Bang, Jinhyeok Kim, Minji Kim, Seungheon Jeong, and Kyungdon Joo. Lightsplat: Fast and memory-efficient open-vocabulary 3d scene understanding in five seconds. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 19812–19821, 2026. 3

2026
[3]

Mip-nerf 360: Unbounded anti-aliased neural radiance fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. InProceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pages 5470–5479, 2022. 5, 9

2022
[4]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 2

2021
[5]

Segment any 3d gaussians

Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, and Qi Tian. Segment any 3d gaussians. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1971–1979, 2025. 2, 3

1971
[6]

Tackling view-dependent semantics in 3d language gaussian splatting

Jiazhong Cen, Xudong Zhou, Jiemin Fang, Changsong Wen, Lingxi Xie, XIAOPENG ZHANG, Wei Shen, and Qi Tian. Tackling view-dependent semantics in 3d language gaussian splatting. InForty-second International Conference on Ma- chine Learning, 2025. 2, 3, 5, 6, 7, 8, 1, 4

2025
[7]

Lifting by gaussians: A simple, fast and flexible method for 3d instance segmentation

Rohan Chacko, Nicolai H¨ani, Eldar Khaliullin, Lin Sun, and Douglas Lee. Lifting by gaussians: A simple, fast and flexible method for 3d instance segmentation. In2025 IEEE/CVF Win- ter Conference on Applications of Computer Vision (WACV), pages 3497–3507. IEEE, 2025. 3, 5

2025
[8]

Tracking anything with de- coupled video segmentation

Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, and Joon-Young Lee. Tracking anything with de- coupled video segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1316– 1326, 2023. 3, 7, 8, 5

2023
[9]

Occam’s lgs: An efficient ap- proach for language gaussian splatting.arXiv preprint arXiv:2412.01807, 2024

Jiahuan Cheng, Jan-Nico Zaech, Luc Van Gool, and Danda Pani Paudel. Occam’s lgs: An efficient ap- proach for language gaussian splatting.arXiv preprint arXiv:2412.01807, 2024. 2, 4, 6, 7, 3

work page arXiv 2024
[10]

Click-gaussian: Interactive segmenta- tion to any 3d gaussians

Seokhun Choi, Hyeonseop Song, Jaechul Kim, Taehyeong Kim, and Hoseok Do. Click-gaussian: Interactive segmenta- tion to any 3d gaussians. InEuropean Conference on Com- puter Vision, pages 289–305. Springer, 2024. 2, 3

2024
[11]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 7, 8, 3, 5

2017
[12]

Training-free hierarchi- cal scene understanding for gaussian splatting with superpoint graphs

Shaohui Dai, Yansong Qu, Zheyan Li, Xinyang Li, Shengchuan Zhang, and Liujuan Cao. Training-free hierarchi- cal scene understanding for gaussian splatting with superpoint graphs. InProceedings of the 33rd ACM International Con- ference on Multimedia, pages 3673–3682, 2025. 3, 5, 6, 7, 1, 2, 4

2025
[13]

Extrinsplat: Decoupling geometry and semantics for open-vocabulary understanding in 3d gaussian splatting

Jiayu Ding, Xinpeng Liu, Zhiyi Pan, Shiqiang Long, and Ge Li. Extrinsplat: Decoupling geometry and semantics for open-vocabulary understanding in 3d gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 31019–31028, 2026. 3

2026
[14]

Opennerf: open set 3d neural scene segmentation with pixel-wise features and rendered novel views.arXiv preprint arXiv:2404.03650,

Francis Engelmann, Fabian Manhardt, Michael Niemeyer, Keisuke Tateno, Marc Pollefeys, and Federico Tombari. Open- nerf: Open set 3d neural scene segmentation with pixel- wise features and rendered novel views.arXiv preprint arXiv:2404.03650, 2024. 2

work page arXiv 2024
[15]

Panoptic nerf: 3d-to-2d label transfer for panoptic urban scene segmentation

Xiao Fu, Shangzhan Zhang, Tianrun Chen, Yichong Lu, Lanyun Zhu, Xiaowei Zhou, Andreas Geiger, and Yiyi Liao. Panoptic nerf: 3d-to-2d label transfer for panoptic urban scene segmentation. In2022 International Conference on 3D Vision (3DV), pages 1–11. IEEE, 2022. 2

2022
[16]

2d gaussian splatting for geometrically ac- curate radiance fields

Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically ac- curate radiance fields. InACM SIGGRAPH 2024 conference papers, pages 1–11, 2024. 3, 7

2024
[17]

V otesplat: Hough voting gaussian splatting for 3d scene understanding

Minchao Jiang, Shunyu Jia, Jiaming Gu, Xiaoyuan Lu, Guangming Zhu, Anqi Dong, and Liang Zhang. V otesplat: Hough voting gaussian splatting for 3d scene understanding. arXiv preprint arXiv:2506.22799, 2025. 2, 3

work page arXiv 2025
[18]

Kim Jun-Seong, GeonU Kim, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, and Tae-Hyun Oh. Dr. splat: Directly referring 3d gaussian splatting via direct language embedding registration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14137–14146, 2025. 1, 2, 4, 6, 7, 3, 5

2025
[19]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023. 1, 2, 3, 7

2023
[20]

Lerf: Language embedded radiance fields

Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 19729–19739,
[21]

Lerf: Language embedded radiance fields

Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. InProceedings of the IEEE/CVF interna- tional conference on computer vision, pages 19729–19739,
[22]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 2, 3, 7, 8

2023
[23]

Tanks and temples: Benchmarking large-scale scene reconstruction.ACM Transactions on Graphics, 36(4), 2017

Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction.ACM Transactions on Graphics, 36(4), 2017. 6

2017
[24]

Decomposing nerf for editing via feature field distillation.Ad- vances in Neural Information Processing Systems, 35:23311– 23330, 2022

Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitzmann. Decomposing nerf for editing via feature field distillation.Ad- vances in Neural Information Processing Systems, 35:23311– 23330, 2022. 2 9

2022
[25]

Panoptic neural fields: A semantic object-aware neural scene representation

Abhijit Kundu, Kyle Genova, Xiaoqi Yin, Alireza Fathi, Car- oline Pantofaru, Leonidas J Guibas, Andrea Tagliasacchi, Frank Dellaert, and Thomas Funkhouser. Panoptic neural fields: A semantic object-aware neural scene representation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12871–12881, 2022. 2

2022
[26]

Cut pursuit: Fast algorithms to learn piecewise constant functions on general weighted graphs.SIAM Journal on Imaging Sciences, 10(4): 1724–1766, 2017

Loic Landrieu and Guillaume Obozinski. Cut pursuit: Fast algorithms to learn piecewise constant functions on general weighted graphs.SIAM Journal on Imaging Sciences, 10(4): 1724–1766, 2017. 3

2017
[27]

Cf3: Compact and fast 3d feature fields

Hyunjoon Lee, Joonkyu Min, and Jaesik Park. Cf3: Compact and fast 3d feature fields. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27906– 27916, 2025. 2, 4

2025
[28]

Language-driven Semantic Segmentation

Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Ren ´e Ranftl. Language-driven semantic seg- mentation.arXiv preprint arXiv:2201.03546, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

Instancegaussian: Appearance-semantic joint gaussian representation for 3d instance-level perception

Haijie Li, Yanmin Wu, Jiarui Meng, Qiankun Gao, Zhiyao Zhang, Ronggang Wang, and Jian Zhang. Instancegaussian: Appearance-semantic joint gaussian representation for 3d instance-level perception. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14078– 14088, 2025. 2, 3, 6, 7, 8, 4

2025
[30]

Su- pergseg: Open-vocabulary 3d segmentation with structured super-gaussians.arXiv preprint arXiv:2412.10231, 2024

Siyun Liang, Sen Wang, Kunyi Li, Michael Niemeyer, Ste- fano Gasperini, Nassir Navab, and Federico Tombari. Su- pergseg: Open-vocabulary 3d segmentation with structured super-gaussians.arXiv preprint arXiv:2412.10231, 2024. 2, 3

work page arXiv 2024
[31]

Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3292–3310, 2022

Yiyi Liao, Jun Xie, and Andreas Geiger. Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3292–3310, 2022. 5, 6

2022
[32]

Weakly supervised 3d open- vocabulary segmentation.Advances in Neural Information Processing Systems, 36:53433–53456, 2023

Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, and Shijian Lu. Weakly supervised 3d open- vocabulary segmentation.Advances in Neural Information Processing Systems, 36:53433–53456, 2023. 2

2023
[33]

Scaffold-gs: Structured 3d gaussians for view-adaptive rendering

Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20654–20664, 2024. 7

2024
[34]

Gaga: Group Any Gaussians via 3D-aware Memory Bank

Weijie Lyu, Xueting Li, Abhijit Kundu, Yi-Hsuan Tsai, and Ming-Hsuan Yang. Gaga: Group any gaussians via 3d-aware memory bank.arXiv preprint arXiv:2404.07977, 2024. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Ludvig: Learning-free uplifting of 2d visual features to gaussian splatting scenes.arXiv preprint arXiv:2410.14462, 2024

Juliette Marrie, Romain M ´en´egaux, Michael Arbel, Diane Larlus, and Julien Mairal. Ludvig: Learning-free uplifting of 2d visual features to gaussian splatting scenes.arXiv preprint arXiv:2410.14462, 2024. 2, 4

work page arXiv 2024
[36]

Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 2

2021
[37]

Graspgen: A diffusion-based framework for 6-dof grasping with on- generator training,

Adithyavairavan Murali, Balakumar Sundaralingam, Yu-Wei Chao, Wentao Yuan, Jun Yamada, Mark Carlson, Fabio Ramos, Stan Birchfield, Dieter Fox, and Clemens Eppner. Graspgen: A diffusion-based framework for 6-dof grasping with on-generator training.arXiv preprint arXiv:2507.13097,

work page arXiv
[38]

3d vision-language gaussian splatting.arXiv preprint arXiv:2410.07577, 2024

Qucheng Peng, Benjamin Planche, Zhongpai Gao, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Chen Chen, and Ziyan Wu. 3d vision-language gaussian splatting.arXiv preprint arXiv:2410.07577, 2024. 2

work page arXiv 2024
[39]

Langsplat: 3d language gaussian splatting

Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20051–20060, 2024. 6, 7, 1, 2, 4

2024
[40]

Goi: Find 3d gaussians of interest with an optimizable open-vocabulary semantic-space hyperplane

Yansong Qu, Shaohui Dai, Xinyang Li, Jianghang Lin, Liu- juan Cao, Shengchuan Zhang, and Rongrong Ji. Goi: Find 3d gaussians of interest with an optimizable open-vocabulary semantic-space hyperplane. InProceedings of the 32nd ACM international conference on multimedia, pages 5328–5337,
[41]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2, 3, 8

2021
[42]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment any- thing in images and videos.arXiv preprint arXiv:2408.00714,

work page internal anchor Pith review Pith/arXiv arXiv
[43]

Language- grounded indoor 3d semantic segmentation in the wild

David Rozenberszki, Or Litany, and Angela Dai. Language- grounded indoor 3d semantic segmentation in the wild. In European conference on computer vision, pages 125–141. Springer, 2022. 3

2022
[44]

Trace3d: Consistent segmentation lifting via gaussian instance tracing

Hongyu Shen, Junfeng Ni, Yixin Chen, Weishuo Li, Mingtao Pei, and Siyuan Huang. Trace3d: Consistent segmentation lifting via gaussian instance tracing. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6656–6666, 2025. 3

2025
[45]

Language embedded 3d gaussians for open-vocabulary scene understanding

Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 5333–5343, 2024. 2

2024
[46]

Panoptic lifting for 3d scene understanding with neural fields

Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bul ´o, Nor- man M ¨uller, Matthias Nießner, Angela Dai, and Peter Kontschieder. Panoptic lifting for 3d scene understanding with neural fields. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 9043–9052, 2023. 2

2023
[47]

Neural feature fusion fields: 3d distillation of self- supervised 2d image representations

Vadim Tschernezki, Iro Laina, Diane Larlus, and Andrea Vedaldi. Neural feature fusion fields: 3d distillation of self- supervised 2d image representations. In2022 International Conference on 3D Vision (3DV), pages 443–453. IEEE, 2022. 2

2022
[48]

Opengaussian: Towards point-level 10 3d gaussian-based open vocabulary understanding.Advances in Neural Information Processing Systems, 37:19114–19138,

Yanmin Wu, Jiarui Meng, Haijie Li, Chenming Wu, Yahao Shi, Xinhua Cheng, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, et al. Opengaussian: Towards point-level 10 3d gaussian-based open vocabulary understanding.Advances in Neural Information Processing Systems, 37:19114–19138,
[49]

1, 2, 3, 6, 7, 8, 4, 5
[50]

Relags: Relational language gaussian splatting

Yaxu Xie, Abdalla Arafa, Alireza Javanmardi, Christen Millerdurai, Jia Cheng Hu, Shaoxiang Wang, Alain Pagani, and Didier Stricker. Relags: Relational language gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23826– 23836, 2026. 3, 4

2026
[51]

Gaus- sian grouping: Segment and edit anything in 3d scenes

Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaus- sian grouping: Segment and edit anything in 3d scenes. In European conference on computer vision, pages 162–179. Springer, 2024. 2, 3, 7, 5

2024
[52]

Sai3d: Segment any instance in 3d scenes

Yingda Yin, Yuzheng Liu, Yang Xiao, Daniel Cohen-Or, Jing- wei Huang, and Baoquan Chen. Sai3d: Segment any instance in 3d scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3292–3302,
[53]

Nerflets: Local radiance fields for efficient structure-aware 3d scene represen- tation from 2d supervision

Xiaoshuai Zhang, Abhijit Kundu, Thomas Funkhouser, Leonidas Guibas, Hao Su, and Kyle Genova. Nerflets: Local radiance fields for efficient structure-aware 3d scene represen- tation from 2d supervision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8274–8284, 2023. 2

2023
[54]

Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields

Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Ze- hao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024. 2

2024
[55]

Cos3d: Collaborative open-vocabulary 3d segmentation

Runsong Zhu, Ka-Hei Hui, Zhengzhe Liu, Qianyi Wu, Weiliang Tang, Shi Qiu, Pheng-Ann Heng, and Chi-Wing Fu. Cos3d: Collaborative open-vocabulary 3d segmentation. arXiv preprint arXiv:2510.20238, 2025. 2, 3, 6, 7, 8, 4

work page arXiv 2025
[56]

Objectgs: Object-aware scene reconstruction and scene understand- ing via gaussian splatting

Ruijie Zhu, Mulin Yu, Linning Xu, Lihan Jiang, Yixuan Li, Tianzhu Zhang, Jiangmiao Pang, and Bo Dai. Objectgs: Object-aware scene reconstruction and scene understand- ing via gaussian splatting. InInternational Conference on Computer Vision (ICCV)(19/10/2025-23/10/2025, Honolulu, Hawai’i), 2025. 2, 3, 7, 5

2025
[57]

Traffic sign

Xingxing Zuo, Pouya Samangouei, Yunwen Zhou, Yan Di, and Mingyang Li. Fmgs: Foundation model embedded 3d gaussian splatting for holistic 3d scene understanding.Inter- national Journal of Computer Vision, 133(2):611–627, 2025. 2 11 A. Appendix Contents A.1 . Comparison of Feature Lifting Strategies . 1 A.2 . Our 3D–2D Association Approach . . . . . 1 A.3 ....

2025

[1] [1]

Principal component anal- ysis.Wiley interdisciplinary reviews: computational statistics, 2(4):433–459, 2010

Herv´e Abdi and Lynne J Williams. Principal component anal- ysis.Wiley interdisciplinary reviews: computational statistics, 2(4):433–459, 2010. 4

2010

[2] [2]

Lightsplat: Fast and memory-efficient open-vocabulary 3d scene understanding in five seconds

Jaehun Bang, Jinhyeok Kim, Minji Kim, Seungheon Jeong, and Kyungdon Joo. Lightsplat: Fast and memory-efficient open-vocabulary 3d scene understanding in five seconds. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 19812–19821, 2026. 3

2026

[3] [3]

Mip-nerf 360: Unbounded anti-aliased neural radiance fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. InProceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pages 5470–5479, 2022. 5, 9

2022

[4] [4]

Emerg- ing properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 2

2021

[5] [5]

Segment any 3d gaussians

Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, and Qi Tian. Segment any 3d gaussians. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1971–1979, 2025. 2, 3

1971

[6] [6]

Tackling view-dependent semantics in 3d language gaussian splatting

Jiazhong Cen, Xudong Zhou, Jiemin Fang, Changsong Wen, Lingxi Xie, XIAOPENG ZHANG, Wei Shen, and Qi Tian. Tackling view-dependent semantics in 3d language gaussian splatting. InForty-second International Conference on Ma- chine Learning, 2025. 2, 3, 5, 6, 7, 8, 1, 4

2025

[7] [7]

Lifting by gaussians: A simple, fast and flexible method for 3d instance segmentation

Rohan Chacko, Nicolai H¨ani, Eldar Khaliullin, Lin Sun, and Douglas Lee. Lifting by gaussians: A simple, fast and flexible method for 3d instance segmentation. In2025 IEEE/CVF Win- ter Conference on Applications of Computer Vision (WACV), pages 3497–3507. IEEE, 2025. 3, 5

2025

[8] [8]

Tracking anything with de- coupled video segmentation

Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, and Joon-Young Lee. Tracking anything with de- coupled video segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1316– 1326, 2023. 3, 7, 8, 5

2023

[9] [9]

Occam’s lgs: An efficient ap- proach for language gaussian splatting.arXiv preprint arXiv:2412.01807, 2024

Jiahuan Cheng, Jan-Nico Zaech, Luc Van Gool, and Danda Pani Paudel. Occam’s lgs: An efficient ap- proach for language gaussian splatting.arXiv preprint arXiv:2412.01807, 2024. 2, 4, 6, 7, 3

work page arXiv 2024

[10] [10]

Click-gaussian: Interactive segmenta- tion to any 3d gaussians

Seokhun Choi, Hyeonseop Song, Jaechul Kim, Taehyeong Kim, and Hoseok Do. Click-gaussian: Interactive segmenta- tion to any 3d gaussians. InEuropean Conference on Com- puter Vision, pages 289–305. Springer, 2024. 2, 3

2024

[11] [11]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 7, 8, 3, 5

2017

[12] [12]

Training-free hierarchi- cal scene understanding for gaussian splatting with superpoint graphs

Shaohui Dai, Yansong Qu, Zheyan Li, Xinyang Li, Shengchuan Zhang, and Liujuan Cao. Training-free hierarchi- cal scene understanding for gaussian splatting with superpoint graphs. InProceedings of the 33rd ACM International Con- ference on Multimedia, pages 3673–3682, 2025. 3, 5, 6, 7, 1, 2, 4

2025

[13] [13]

Extrinsplat: Decoupling geometry and semantics for open-vocabulary understanding in 3d gaussian splatting

Jiayu Ding, Xinpeng Liu, Zhiyi Pan, Shiqiang Long, and Ge Li. Extrinsplat: Decoupling geometry and semantics for open-vocabulary understanding in 3d gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 31019–31028, 2026. 3

2026

[14] [14]

Opennerf: open set 3d neural scene segmentation with pixel-wise features and rendered novel views.arXiv preprint arXiv:2404.03650,

Francis Engelmann, Fabian Manhardt, Michael Niemeyer, Keisuke Tateno, Marc Pollefeys, and Federico Tombari. Open- nerf: Open set 3d neural scene segmentation with pixel- wise features and rendered novel views.arXiv preprint arXiv:2404.03650, 2024. 2

work page arXiv 2024

[15] [15]

Panoptic nerf: 3d-to-2d label transfer for panoptic urban scene segmentation

Xiao Fu, Shangzhan Zhang, Tianrun Chen, Yichong Lu, Lanyun Zhu, Xiaowei Zhou, Andreas Geiger, and Yiyi Liao. Panoptic nerf: 3d-to-2d label transfer for panoptic urban scene segmentation. In2022 International Conference on 3D Vision (3DV), pages 1–11. IEEE, 2022. 2

2022

[16] [16]

2d gaussian splatting for geometrically ac- curate radiance fields

Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically ac- curate radiance fields. InACM SIGGRAPH 2024 conference papers, pages 1–11, 2024. 3, 7

2024

[17] [17]

V otesplat: Hough voting gaussian splatting for 3d scene understanding

Minchao Jiang, Shunyu Jia, Jiaming Gu, Xiaoyuan Lu, Guangming Zhu, Anqi Dong, and Liang Zhang. V otesplat: Hough voting gaussian splatting for 3d scene understanding. arXiv preprint arXiv:2506.22799, 2025. 2, 3

work page arXiv 2025

[18] [18]

Kim Jun-Seong, GeonU Kim, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, and Tae-Hyun Oh. Dr. splat: Directly referring 3d gaussian splatting via direct language embedding registration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14137–14146, 2025. 1, 2, 4, 6, 7, 3, 5

2025

[19] [19]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023. 1, 2, 3, 7

2023

[20] [20]

Lerf: Language embedded radiance fields

Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 19729–19739,

[21] [21]

Lerf: Language embedded radiance fields

Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. InProceedings of the IEEE/CVF interna- tional conference on computer vision, pages 19729–19739,

[22] [22]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 2, 3, 7, 8

2023

[23] [23]

Tanks and temples: Benchmarking large-scale scene reconstruction.ACM Transactions on Graphics, 36(4), 2017

Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction.ACM Transactions on Graphics, 36(4), 2017. 6

2017

[24] [24]

Decomposing nerf for editing via feature field distillation.Ad- vances in Neural Information Processing Systems, 35:23311– 23330, 2022

Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitzmann. Decomposing nerf for editing via feature field distillation.Ad- vances in Neural Information Processing Systems, 35:23311– 23330, 2022. 2 9

2022

[25] [25]

Panoptic neural fields: A semantic object-aware neural scene representation

Abhijit Kundu, Kyle Genova, Xiaoqi Yin, Alireza Fathi, Car- oline Pantofaru, Leonidas J Guibas, Andrea Tagliasacchi, Frank Dellaert, and Thomas Funkhouser. Panoptic neural fields: A semantic object-aware neural scene representation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12871–12881, 2022. 2

2022

[26] [26]

Cut pursuit: Fast algorithms to learn piecewise constant functions on general weighted graphs.SIAM Journal on Imaging Sciences, 10(4): 1724–1766, 2017

Loic Landrieu and Guillaume Obozinski. Cut pursuit: Fast algorithms to learn piecewise constant functions on general weighted graphs.SIAM Journal on Imaging Sciences, 10(4): 1724–1766, 2017. 3

2017

[27] [27]

Cf3: Compact and fast 3d feature fields

Hyunjoon Lee, Joonkyu Min, and Jaesik Park. Cf3: Compact and fast 3d feature fields. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27906– 27916, 2025. 2, 4

2025

[28] [28]

Language-driven Semantic Segmentation

Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Ren ´e Ranftl. Language-driven semantic seg- mentation.arXiv preprint arXiv:2201.03546, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[29] [29]

Instancegaussian: Appearance-semantic joint gaussian representation for 3d instance-level perception

Haijie Li, Yanmin Wu, Jiarui Meng, Qiankun Gao, Zhiyao Zhang, Ronggang Wang, and Jian Zhang. Instancegaussian: Appearance-semantic joint gaussian representation for 3d instance-level perception. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14078– 14088, 2025. 2, 3, 6, 7, 8, 4

2025

[30] [30]

Su- pergseg: Open-vocabulary 3d segmentation with structured super-gaussians.arXiv preprint arXiv:2412.10231, 2024

Siyun Liang, Sen Wang, Kunyi Li, Michael Niemeyer, Ste- fano Gasperini, Nassir Navab, and Federico Tombari. Su- pergseg: Open-vocabulary 3d segmentation with structured super-gaussians.arXiv preprint arXiv:2412.10231, 2024. 2, 3

work page arXiv 2024

[31] [31]

Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3292–3310, 2022

Yiyi Liao, Jun Xie, and Andreas Geiger. Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3292–3310, 2022. 5, 6

2022

[32] [32]

Weakly supervised 3d open- vocabulary segmentation.Advances in Neural Information Processing Systems, 36:53433–53456, 2023

Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, and Shijian Lu. Weakly supervised 3d open- vocabulary segmentation.Advances in Neural Information Processing Systems, 36:53433–53456, 2023. 2

2023

[33] [33]

Scaffold-gs: Structured 3d gaussians for view-adaptive rendering

Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20654–20664, 2024. 7

2024

[34] [34]

Gaga: Group Any Gaussians via 3D-aware Memory Bank

Weijie Lyu, Xueting Li, Abhijit Kundu, Yi-Hsuan Tsai, and Ming-Hsuan Yang. Gaga: Group any gaussians via 3d-aware memory bank.arXiv preprint arXiv:2404.07977, 2024. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

Ludvig: Learning-free uplifting of 2d visual features to gaussian splatting scenes.arXiv preprint arXiv:2410.14462, 2024

Juliette Marrie, Romain M ´en´egaux, Michael Arbel, Diane Larlus, and Julien Mairal. Ludvig: Learning-free uplifting of 2d visual features to gaussian splatting scenes.arXiv preprint arXiv:2410.14462, 2024. 2, 4

work page arXiv 2024

[36] [36]

Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 2

2021

[37] [37]

Graspgen: A diffusion-based framework for 6-dof grasping with on- generator training,

Adithyavairavan Murali, Balakumar Sundaralingam, Yu-Wei Chao, Wentao Yuan, Jun Yamada, Mark Carlson, Fabio Ramos, Stan Birchfield, Dieter Fox, and Clemens Eppner. Graspgen: A diffusion-based framework for 6-dof grasping with on-generator training.arXiv preprint arXiv:2507.13097,

work page arXiv

[38] [38]

3d vision-language gaussian splatting.arXiv preprint arXiv:2410.07577, 2024

Qucheng Peng, Benjamin Planche, Zhongpai Gao, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Chen Chen, and Ziyan Wu. 3d vision-language gaussian splatting.arXiv preprint arXiv:2410.07577, 2024. 2

work page arXiv 2024

[39] [39]

Langsplat: 3d language gaussian splatting

Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20051–20060, 2024. 6, 7, 1, 2, 4

2024

[40] [40]

Goi: Find 3d gaussians of interest with an optimizable open-vocabulary semantic-space hyperplane

Yansong Qu, Shaohui Dai, Xinyang Li, Jianghang Lin, Liu- juan Cao, Shengchuan Zhang, and Rongrong Ji. Goi: Find 3d gaussians of interest with an optimizable open-vocabulary semantic-space hyperplane. InProceedings of the 32nd ACM international conference on multimedia, pages 5328–5337,

[41] [41]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2, 3, 8

2021

[42] [42]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment any- thing in images and videos.arXiv preprint arXiv:2408.00714,

work page internal anchor Pith review Pith/arXiv arXiv

[43] [43]

Language- grounded indoor 3d semantic segmentation in the wild

David Rozenberszki, Or Litany, and Angela Dai. Language- grounded indoor 3d semantic segmentation in the wild. In European conference on computer vision, pages 125–141. Springer, 2022. 3

2022

[44] [44]

Trace3d: Consistent segmentation lifting via gaussian instance tracing

Hongyu Shen, Junfeng Ni, Yixin Chen, Weishuo Li, Mingtao Pei, and Siyuan Huang. Trace3d: Consistent segmentation lifting via gaussian instance tracing. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6656–6666, 2025. 3

2025

[45] [45]

Language embedded 3d gaussians for open-vocabulary scene understanding

Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 5333–5343, 2024. 2

2024

[46] [46]

Panoptic lifting for 3d scene understanding with neural fields

Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bul ´o, Nor- man M ¨uller, Matthias Nießner, Angela Dai, and Peter Kontschieder. Panoptic lifting for 3d scene understanding with neural fields. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 9043–9052, 2023. 2

2023

[47] [47]

Neural feature fusion fields: 3d distillation of self- supervised 2d image representations

Vadim Tschernezki, Iro Laina, Diane Larlus, and Andrea Vedaldi. Neural feature fusion fields: 3d distillation of self- supervised 2d image representations. In2022 International Conference on 3D Vision (3DV), pages 443–453. IEEE, 2022. 2

2022

[48] [48]

Opengaussian: Towards point-level 10 3d gaussian-based open vocabulary understanding.Advances in Neural Information Processing Systems, 37:19114–19138,

Yanmin Wu, Jiarui Meng, Haijie Li, Chenming Wu, Yahao Shi, Xinhua Cheng, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, et al. Opengaussian: Towards point-level 10 3d gaussian-based open vocabulary understanding.Advances in Neural Information Processing Systems, 37:19114–19138,

[49] [49]

1, 2, 3, 6, 7, 8, 4, 5

[50] [50]

Relags: Relational language gaussian splatting

Yaxu Xie, Abdalla Arafa, Alireza Javanmardi, Christen Millerdurai, Jia Cheng Hu, Shaoxiang Wang, Alain Pagani, and Didier Stricker. Relags: Relational language gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23826– 23836, 2026. 3, 4

2026

[51] [51]

Gaus- sian grouping: Segment and edit anything in 3d scenes

Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaus- sian grouping: Segment and edit anything in 3d scenes. In European conference on computer vision, pages 162–179. Springer, 2024. 2, 3, 7, 5

2024

[52] [52]

Sai3d: Segment any instance in 3d scenes

Yingda Yin, Yuzheng Liu, Yang Xiao, Daniel Cohen-Or, Jing- wei Huang, and Baoquan Chen. Sai3d: Segment any instance in 3d scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3292–3302,

[53] [53]

Nerflets: Local radiance fields for efficient structure-aware 3d scene represen- tation from 2d supervision

Xiaoshuai Zhang, Abhijit Kundu, Thomas Funkhouser, Leonidas Guibas, Hao Su, and Kyle Genova. Nerflets: Local radiance fields for efficient structure-aware 3d scene represen- tation from 2d supervision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8274–8284, 2023. 2

2023

[54] [54]

Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields

Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Ze- hao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024. 2

2024

[55] [55]

Cos3d: Collaborative open-vocabulary 3d segmentation

Runsong Zhu, Ka-Hei Hui, Zhengzhe Liu, Qianyi Wu, Weiliang Tang, Shi Qiu, Pheng-Ann Heng, and Chi-Wing Fu. Cos3d: Collaborative open-vocabulary 3d segmentation. arXiv preprint arXiv:2510.20238, 2025. 2, 3, 6, 7, 8, 4

work page arXiv 2025

[56] [56]

Objectgs: Object-aware scene reconstruction and scene understand- ing via gaussian splatting

Ruijie Zhu, Mulin Yu, Linning Xu, Lihan Jiang, Yixuan Li, Tianzhu Zhang, Jiangmiao Pang, and Bo Dai. Objectgs: Object-aware scene reconstruction and scene understand- ing via gaussian splatting. InInternational Conference on Computer Vision (ICCV)(19/10/2025-23/10/2025, Honolulu, Hawai’i), 2025. 2, 3, 7, 5

2025

[57] [57]

Traffic sign

Xingxing Zuo, Pouya Samangouei, Yunwen Zhou, Yan Di, and Mingyang Li. Fmgs: Foundation model embedded 3d gaussian splatting for holistic 3d scene understanding.Inter- national Journal of Computer Vision, 133(2):611–627, 2025. 2 11 A. Appendix Contents A.1 . Comparison of Feature Lifting Strategies . 1 A.2 . Our 3D–2D Association Approach . . . . . 1 A.3 ....

2025