Ilov3Splat: Instance-Level Open-Vocabulary 3D Scene Understanding in Gaussian Splatting

Binh Long Nguyen; Clinton Fookes; Kien Nguyen; Peyman Moghadam; Sridha Sridharan

arxiv: 2605.04506 · v2 · submitted 2026-05-06 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-14 22:07 UTC · model grok-4.3

classification: cs.CV · cs.AI

keywords: 3D Gaussian Splatting · open-vocabulary 3D understanding · instance segmentation · CLIP features · contrastive learning · SAM masks · view-consistent features · language grounding

The pith

Augmenting 3D Gaussian splats with view-consistent CLIP and instance feature fields enables open-vocabulary object identification in 3D scenes from text queries without category labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Ilov3Splat, which adds semantic representations directly to 3D Gaussian Splatting models. It encodes language-aligned CLIP features using multi-resolution hash embeddings and trains separate instance features with contrastive loss on SAM-generated masks. This produces view-consistent 3D feature fields that support matching text queries to groups of Gaussians via two-stage clustering. The result is the ability to select and segment arbitrary objects in reconstructed 3D scenes using natural language descriptions alone, without any pre-defined categories or manual 3D annotations. Experiments show gains over prior open-vocabulary 3D-GS baselines on object selection and instance segmentation metrics.

Core claim

Joint optimization of geometry and semantics in 3D Gaussian Splatting is achieved by attaching view-consistent feature fields: multi-resolution hash embeddings store dense CLIP-aligned features for language grounding, while a parallel instance field is trained contrastively on 2D SAM masks to enforce cross-view object coherence; at query time, CLIP-encoded text is matched to these fields followed by 3D clustering to recover the corresponding Gaussian groups.
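To make the query-time step concrete, here is a minimal sketch of matching a text embedding against per-Gaussian language features by cosine similarity. The function name `select_gaussians`, the threshold value, and the toy 2-D features are illustrative assumptions, not details from the paper; real CLIP embeddings have hundreds of dimensions, and the paper follows this matching step with 3D clustering.

```python
import numpy as np

def select_gaussians(text_emb, gaussian_feats, threshold=0.5):
    """Return indices of Gaussians whose feature is cosine-similar to the query.

    text_emb: (D,) text embedding (assumed to come from a CLIP text encoder).
    gaussian_feats: (G, D) language-aligned feature per Gaussian.
    threshold: hypothetical similarity cut-off, chosen here for illustration.
    """
    q = text_emb / np.linalg.norm(text_emb)
    f = gaussian_feats / np.linalg.norm(gaussian_feats, axis=1, keepdims=True)
    sim = f @ q  # cosine similarity of every Gaussian to the query
    return np.flatnonzero(sim > threshold), sim

# Toy scene: two Gaussians aligned with the query, one orthogonal to it.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.0])
idx, sim = select_gaussians(query, feats, threshold=0.8)
print(idx)
```

In this toy case the first two Gaussians pass the threshold and the third does not. A fixed threshold is the simplest choice; a relative score against a set of negative prompts would be another option.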

What carries the argument

View-consistent feature fields added to Gaussian splats, implemented via multi-resolution hash embedding of CLIP features plus contrastive loss on SAM masks to produce instance-level 3D descriptors.
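As a rough illustration of what a multi-resolution hash embedding does, the sketch below looks up a feature vector for each 3D position at several grid resolutions and concatenates the results. It uses the spatial-hash primes from Instant-NGP (Müller et al., 2022) but simplifies to a nearest-lower-vertex lookup with no trilinear interpolation. The table sizes, resolutions, and the name `hash_encode` are assumptions for this example, not the paper's configuration.

```python
import numpy as np

# Spatial-hash primes used by Instant-NGP (Mueller et al., 2022).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_encode(xyz, tables, base_res=16, growth=2.0):
    """Encode positions with one hash table per resolution level.

    xyz: (N, 3) positions normalized to [0, 1).
    tables: list of (T, F) arrays, one learnable table per level.
    Simplification: each point reads only its lower grid vertex,
    whereas Instant-NGP interpolates the 8 surrounding vertices.
    """
    feats = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)          # grid resolution at this level
        cell = np.floor(xyz * res).astype(np.uint64)   # integer vertex coordinates
        h = cell * PRIMES                              # uint64 arithmetic wraps on overflow
        idx = (h[:, 0] ^ h[:, 1] ^ h[:, 2]) % np.uint64(table.shape[0])
        feats.append(table[idx.astype(np.int64)])
    return np.concatenate(feats, axis=1)

rng = np.random.default_rng(0)
tables = [rng.normal(size=(2 ** 10, 2)) for _ in range(4)]  # 4 levels, 2 features each
xyz = rng.random((5, 3))                                    # e.g. Gaussian centers
enc = hash_encode(xyz, tables)
print(enc.shape)
```

Because the encoding depends only on the 3D position, the same Gaussian receives the same feature regardless of which camera observes it. In the paper's pipeline, per Figure 2, lightweight projection MLPs then map such encodings to the language and instance features.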

If this is right

Text queries can directly retrieve and segment coherent 3D object groups from the optimized Gaussian representation.
No manual 3D labels or closed category sets are required at training or test time.
The same feature fields support both dense semantic grounding and fine-grained instance distinction.
Two-stage 3D clustering converts per-Gaussian matches into usable object-level outputs for downstream tasks.
Performance improvements appear on standard open-vocabulary 3D benchmarks for both selection accuracy and segmentation quality.
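The abstract does not say what the two clustering stages are, so the following is only one plausible reading: first group Gaussians by feature similarity, then split each group into spatially connected parts so that two look-alike objects in different places become separate instances. The function names, thresholds, and toy data are all hypothetical.

```python
import numpy as np

def cluster_by_feature(feats, sim_thresh=0.9):
    """Stage 1 (assumed): greedy grouping by cosine similarity to a seed."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    labels = -np.ones(len(f), dtype=int)
    next_label = 0
    for i in range(len(f)):
        if labels[i] >= 0:
            continue
        members = (labels < 0) & (f @ f[i] > sim_thresh)
        labels[members] = next_label
        next_label += 1
    return labels

def split_by_space(xyz, labels, radius=0.5):
    """Stage 2 (assumed): split each feature cluster into connected components."""
    out = -np.ones(len(xyz), dtype=int)
    next_label = 0
    for lab in np.unique(labels):
        idx = np.flatnonzero(labels == lab)
        pts = xyz[idx]
        near = np.linalg.norm(pts[:, None] - pts[None], axis=2) <= radius
        seen = np.zeros(len(idx), dtype=bool)
        for start in range(len(idx)):
            if seen[start]:
                continue
            stack = [start]
            seen[start] = True
            while stack:  # depth-first flood fill over the proximity graph
                u = stack.pop()
                out[idx[u]] = next_label
                for v in np.flatnonzero(near[u] & ~seen):
                    seen[v] = True
                    stack.append(v)
            next_label += 1
    return out

# Toy scene: two chairs with identical features far apart, plus one table.
feats = np.array([[1.0, 0.0]] * 4 + [[0.0, 1.0]])
xyz = np.array([[0.0, 0, 0], [0.1, 0, 0], [5.0, 0, 0], [5.1, 0, 0], [2.0, 0, 0]])
stage1 = cluster_by_feature(feats)
stage2 = split_by_space(xyz, stage1)
print(stage1, stage2)
```

Stage 1 alone merges both chairs into one cluster; the spatial split separates them. The dense pairwise distance matrix is fine for a toy example but would need a spatial index at scene scale.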

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could support language-driven editing or manipulation of specific 3D objects once the feature fields are learned.
Extension to video or dynamic scenes would require adding temporal consistency constraints to the same hash-encoded fields.
Integration with robotics navigation stacks becomes feasible because natural-language commands map directly onto 3D Gaussian clusters.
The approach may lower the annotation burden for large-scale 3D scene datasets by relying only on 2D foundation models.

Load-bearing premise

Multi-resolution hash embedding of CLIP features together with contrastive training on SAM masks will yield sufficiently view-consistent 3D instance features for reliable downstream clustering without any category supervision.

What would settle it

Quantitative drop in 3D instance segmentation IoU on a benchmark scene containing fine-grained or partially occluded objects when the contrastive loss is removed or when SAM masks are replaced by random groupings, showing that the learned features fail to separate instances across views.
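For reference, the IoU in such an ablation can be computed per instance over Gaussian membership masks. This is the standard intersection-over-union definition; the helper name `instance_iou`, the empty-union convention, and the toy masks are assumptions for illustration, and the benchmark's exact protocol (matching of predicted to ground-truth instances, point versus Gaussian granularity) is not specified here.

```python
import numpy as np

def instance_iou(pred_mask, gt_mask):
    """Intersection over union between two boolean membership masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # convention chosen here for two empty masks
    return float(np.logical_and(pred, gt).sum() / union)

# Toy example over five Gaussians: 2 shared members, 4 in the union.
pred = [1, 1, 1, 0, 0]
gt = [0, 1, 1, 1, 0]
iou = instance_iou(pred, gt)
print(iou)
```

Comparing this score with and without the contrastive term, on the same scenes and the same matching rule, is the comparison the test above calls for.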

Figures

Figures reproduced from arXiv: 2605.04506 by Binh Long Nguyen, Clinton Fookes, Kien Nguyen, Peyman Moghadam, Sridha Sridharan.

**Figure 1.** Comparison between 2D-rendered, point-level, and our instance-level 3D open…

**Figure 2.** An overview of Ilov3Splat. Left: Our method learns language-aligned and instance-aware features for 3D Gaussians, computed via compact multi-resolution hash encoding and lightweight projection MLPs. Right: Feature learning is guided by multiview 2D signals, leveraging CLIP for language alignment, DINO for object boundary regularization, and SAM for instance-aware contrastive learning. instance-aware featu…

**Figure 3.** Qualitative results of 3D object selection on the LERF dataset.

**Figure 4.** Qualitative results of category-agnostic 3D instance segmentation on the Scan…

**Figure 5.** Qualitative ablation of Ilov3Splat. We visualize the impact of individual training components and pipeline stages in our model.

Abstract

We introduce Ilov3Splat, a novel framework for instance-level open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D-GS). Most prior work depends on 2D rendering-based matching or point-level semantic association, which undermines cross-view consistency, lacks coherent instance-level reasoning, and limits precision in downstream 3D tasks. To address these limitations, our method jointly optimizes scene geometry and semantic representations by augmenting Gaussian splats with view-consistent feature fields. Specifically, we leverage multi-resolution hash embedding to efficiently encode language-aligned CLIP features, enabling dense and coherent language grounding in 3D space. We further train an instance feature field using contrastive loss over SAM masks, supporting fine-grained object distinction across views. At inference time, CLIP-encoded queries are matched against the learned features, followed by two-stage 3D clustering to retrieve relevant Gaussian groups. This enables our framework to identify arbitrary objects in 3D scenes based on natural language descriptions, without requiring category supervision or manual annotations. Experiments on standard benchmarks demonstrate that Ilov3Splat outperforms prior open-vocabulary 3D-GS methods in both object selection and instance segmentation, offering a flexible and accurate solution for language-driven 3D scene understanding. Project page: https://csiro-robotics.github.io/Ilov3Splat.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Ilov3Splat adds a contrastive instance feature field on SAM masks to 3D Gaussian Splatting alongside hash-encoded CLIP features, but the abstract gives no numbers to check if view consistency actually holds for unsupervised clustering.

The main takeaway is that Ilov3Splat introduces an instance feature field trained via contrastive loss on SAM masks inside 3D Gaussian Splatting, combined with multi-resolution hash-encoded CLIP features for open-vocabulary 3D instance understanding. This is new in how it augments the splats with these view-consistent fields to enable language-driven retrieval through query matching and two-stage clustering. It improves on prior approaches that used 2D rendering or point-level associations by optimizing the semantics jointly with the geometry in 3D. The use of hash embeddings should make the CLIP integration efficient, and the contrastive term on masks supports distinguishing arbitrary objects without supervision.

The potential issue is that the contrastive loss is applied on 2D projections, so there is no direct mechanism to enforce that instance features for the same object remain consistent across different viewpoints. If drift occurs, the 3D clustering may not reliably group Gaussians belonging to one object from a text query. The abstract claims better performance on benchmarks for object selection and instance segmentation, but without any specific numbers, ablation studies, or error analysis provided, it's difficult to assess how well this holds up in practice.

This paper would be of interest to researchers in computer vision and robotics focused on 3D scene understanding with natural language. It offers a flexible pipeline for language-driven tasks in Gaussian Splatting representations. Given the practical nature of the approach and its grounding in pretrained models, it deserves peer review to evaluate the experimental results and confirm the consistency of the learned features.

Referee Report

1 major / 2 minor

Summary. The paper introduces Ilov3Splat, a framework for instance-level open-vocabulary 3D scene understanding built on 3D Gaussian Splatting. It augments Gaussian splats with view-consistent feature fields by using multi-resolution hash embeddings to encode CLIP features for dense language grounding and training an instance feature field via contrastive loss over SAM masks. At inference, CLIP-encoded text queries are matched to the learned features, followed by two-stage 3D clustering to retrieve relevant Gaussian groups, enabling identification of arbitrary objects from natural language descriptions without category supervision or annotations. The work claims to outperform prior open-vocabulary 3D-GS methods on standard benchmarks for object selection and instance segmentation.

Significance. If the learned 3D instance features prove sufficiently view-consistent to support reliable unsupervised clustering, the approach would meaningfully advance language-driven 3D scene understanding by providing coherent instance-level grounding directly in 3D space rather than relying on 2D rendering-based matching. The practical integration of efficient hash grids with pretrained CLIP and SAM models is a strength that could enable flexible open-vocabulary applications, though the absence of detailed quantitative validation in the reviewed text limits assessment of real-world impact.

major comments (1)

[Method (instance feature field)] Method section (instance feature field training): The contrastive loss operates on 2D projections of SAM masks, but no explicit regularization term or consistency penalty is described to enforce that instance features for the same 3D Gaussian remain coherent across views. This is load-bearing for the central claim, as the two-stage 3D clustering at inference relies on view-consistent features to group Gaussians belonging to an arbitrary object from a text query; without it, view-specific drift could invalidate the unsupervised retrieval.

minor comments (2)

[Abstract] Abstract: The claim of outperforming prior methods on standard benchmarks would be strengthened by including at least one key quantitative result (e.g., mIoU or accuracy delta) rather than a qualitative statement.
[Method] Notation and equations: The precise formulation of the contrastive loss (including temperature and positive/negative pair construction) and how the multi-resolution hash embedding is fused into the Gaussian splat optimization should be stated explicitly with equation numbers for reproducibility.
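To illustrate the kind of formulation the second minor comment asks for, here is a sketch of a temperature-scaled contrastive loss in which features sampled from the same SAM mask are positives and all other samples are negatives. This is a generic supervised-contrastive (InfoNCE-style) form and may differ from the paper's actual loss; `mask_contrastive_loss`, the temperature of 0.1, and the toy features are assumptions.

```python
import numpy as np

def mask_contrastive_loss(feats, mask_ids, temperature=0.1):
    """Supervised-contrastive loss over features labeled by mask id.

    feats: (N, D) rendered instance features at N sampled pixels.
    mask_ids: (N,) integer id of the SAM mask each pixel falls in.
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    logits = f @ f.T / temperature
    np.fill_diagonal(logits, -np.inf)                    # exclude self-pairs
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos = mask_ids[:, None] == mask_ids[None, :]
    np.fill_diagonal(pos, False)
    counts = pos.sum(axis=1)
    sums = np.where(pos, log_prob, 0.0).sum(axis=1)
    valid = counts > 0                                   # anchors with a positive
    return float(-(sums[valid] / counts[valid]).mean())

feats = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
loss_good = mask_contrastive_loss(feats, np.array([0, 0, 1, 1]))  # ids match features
loss_bad = mask_contrastive_loss(feats, np.array([0, 1, 0, 1]))   # ids contradict them
print(loss_good, loss_bad)
```

The loss is low when features within a mask agree and differ from other masks, and high when the mask assignment contradicts the feature geometry. SAM mask ids are only meaningful within a single image, so in this sketch pairs would be built per view.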

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed comment on the instance feature field. We appreciate the focus on view consistency, which is central to our claims, and will revise the manuscript to clarify the mechanism.

Referee: Method section (instance feature field training): The contrastive loss operates on 2D projections of SAM masks, but no explicit regularization term or consistency penalty is described to enforce that instance features for the same 3D Gaussian remain coherent across views. This is load-bearing for the central claim, as the two-stage 3D clustering at inference relies on view-consistent features to group Gaussians belonging to an arbitrary object from a text query; without it, view-specific drift could invalidate the unsupervised retrieval.

Authors: The instance feature field is parameterized directly on the 3D Gaussians via a shared multi-resolution hash embedding, not per-view 2D features. The contrastive loss is computed over SAM mask projections from multiple training views, so the identical 3D feature vector for each Gaussian participates in positive/negative pairs across all views in which it is visible. Any view-specific drift would increase the loss on other views, providing implicit cross-view regularization through the joint 3D optimization. We will add an explicit paragraph in Section 3.3 describing this effect and will include an ablation on an optional explicit consistency penalty (e.g., L2 feature distance across random view pairs) in the revision if it yields measurable gains.

revision: yes
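The optional penalty mentioned in the rebuttal could look roughly like the sketch below. Per-Gaussian features are shared across views, but the feature rendered at a pixel is a blend whose weights depend on the camera, so features of the same surface point can still differ between views. The blending model, the assumption that pixel correspondences between the two views are already known, and all names here are illustrative and not taken from the paper.

```python
import numpy as np

def render_features(blend_weights, gaussian_feats):
    """Alpha-blend shared per-Gaussian features into per-pixel features.

    blend_weights: (P, G) non-negative weights for P pixels over G Gaussians.
    gaussian_feats: (G, D) one feature vector per Gaussian, shared by all views.
    """
    return blend_weights @ gaussian_feats

def consistency_penalty(feats_view_a, feats_view_b):
    """Mean squared L2 distance between features of corresponding pixels."""
    diff = feats_view_a - feats_view_b
    return float(np.mean(np.sum(diff ** 2, axis=1)))

gaussian_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
w_a = np.array([[1.0, 0.0], [0.2, 0.8]])  # blending weights seen from view A
w_b = np.array([[0.9, 0.1], [0.2, 0.8]])  # same surface points seen from view B

rendered_a = render_features(w_a, gaussian_feats)
rendered_b = render_features(w_b, gaussian_feats)
penalty_same = consistency_penalty(rendered_a, rendered_a)
penalty_diff = consistency_penalty(rendered_a, rendered_b)
print(penalty_same, penalty_diff)
```

The penalty is zero when both views blend the Gaussians identically and grows as the blends diverge, which is the quantity an explicit cross-view term would push down.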

Circularity Check

0 steps flagged

No circularity: method relies on external pretrained CLIP/SAM and standard optimization

The paper presents an engineering pipeline that augments 3D Gaussian Splatting with hash-encoded CLIP features and a contrastive instance field trained on SAM masks. No equations, uniqueness theorems, or predictions are shown that reduce by construction to the target result. All load-bearing components (CLIP embeddings, SAM masks, contrastive loss, two-stage clustering) are imported from external models or standard techniques and treated as fixed inputs. The derivation chain is therefore self-contained against external benchmarks rather than internally circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about lifting 2D foundation-model outputs into coherent 3D fields; no free parameters or invented entities are introduced beyond the standard Gaussian Splatting representation.

axioms (2)

domain assumption: CLIP features extracted from 2D views can be lifted into a view-consistent 3D field via multi-resolution hash embedding.
Invoked when the method augments Gaussian splats with language-aligned features.
domain assumption: SAM-generated masks supply reliable instance-level supervision for contrastive training across views.
Used to train the instance feature field that enables fine-grained object distinction.

pith-pipeline@v0.9.0 · 5568 in / 1386 out tokens · 38262 ms · 2026-05-14T22:07:00.801129+00:00 · methodology


