Ilov3Splat: Instance-Level Open-Vocabulary 3D Scene Understanding in Gaussian Splatting
Pith reviewed 2026-05-14 22:07 UTC · model grok-4.3
The pith
Augmenting 3D Gaussian splats with view-consistent CLIP and instance feature fields enables open-vocabulary object identification in 3D scenes from text queries without category labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Geometry and semantics are optimized jointly in 3D Gaussian Splatting by attaching view-consistent feature fields to the splats. Multi-resolution hash embeddings store dense CLIP-aligned features for language grounding, while a parallel instance field is trained contrastively on 2D SAM masks to enforce cross-view object coherence. At query time, CLIP-encoded text is matched against these fields, and 3D clustering then recovers the corresponding Gaussian groups.
What carries the argument
View-consistent feature fields added to Gaussian splats, implemented via multi-resolution hash embedding of CLIP features plus contrastive loss on SAM masks to produce instance-level 3D descriptors.
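The query-time step can be pictured with a small sketch. It assumes each Gaussian already carries a language feature and an instance feature as plain lists of floats; the function names, the thresholds, and the greedy grouping are illustrative stand-ins and are not the paper's two-stage clustering.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def select_gaussians(text_emb, language_feats, threshold=0.8):
    """Indices of Gaussians whose language feature matches the text embedding.
    The threshold value is an assumption chosen for illustration."""
    return [i for i, f in enumerate(language_feats)
            if cosine(text_emb, f) >= threshold]

def group_by_instance(indices, instance_feats, min_sim=0.9):
    """Greedy grouping on instance features (hypothetical simplification):
    each Gaussian joins the first group whose seed it resembles."""
    groups = []
    for i in indices:
        for group in groups:
            if cosine(instance_feats[i], instance_feats[group[0]]) >= min_sim:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups

text = [1.0, 0.0]
language_feats = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.05], [0.95, 0.0]]
instance_feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.01]]
selected = select_gaussians(text, language_feats)
groups = group_by_instance(selected, instance_feats)
```

In this toy input, three Gaussians match the query on the language field, and the instance field then splits them into two separate objects, which is the role the summary assigns to the instance features.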
If this is right
- Text queries can directly retrieve and segment coherent 3D object groups from the optimized Gaussian representation.
- No manual 3D labels or closed category sets are required at training or test time.
- The same feature fields support both dense semantic grounding and fine-grained instance distinction.
- Two-stage 3D clustering converts per-Gaussian matches into usable object-level outputs for downstream tasks.
- Performance improvements appear on standard open-vocabulary 3D benchmarks for both selection accuracy and segmentation quality.
Where Pith is reading between the lines
- The method could support language-driven editing or manipulation of specific 3D objects once the feature fields are learned.
- Extension to video or dynamic scenes would require adding temporal consistency constraints to the same hash-encoded fields.
- Integration with robotics navigation stacks becomes feasible because natural-language commands map directly onto 3D Gaussian clusters.
- The approach may lower the annotation burden for large-scale 3D scene datasets by relying only on 2D foundation models.
Load-bearing premise
Multi-resolution hash embedding of CLIP features together with contrastive training on SAM masks will yield sufficiently view-consistent 3D instance features for reliable downstream clustering without any category supervision.
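For readers unfamiliar with multi-resolution hash embeddings, the lookup can be sketched as follows. This is a simplified illustration and not the paper's implementation: it uses the XOR-of-primes spatial hash from Instant-NGP, looks up only the nearest lower grid vertex instead of interpolating trilinearly, and assumes point coordinates in [0, 1). The `base_res` and `growth` values are placeholders.

```python
# Per-axis primes of the Instant-NGP spatial hash.
PRIMES = (1, 2654435761, 805459861)

def spatial_hash(ix, iy, iz, table_size):
    """Map integer grid coordinates to a slot in a fixed-size feature table."""
    h = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
    return h % table_size

def encode(point, tables, base_res=16, growth=2.0):
    """Concatenate one feature vector per resolution level.
    Simplification: nearest lower vertex only, no trilinear interpolation."""
    out = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)      # grid resolution at this level
        ix, iy, iz = (int(c * res) for c in point)  # floor for coordinates in [0, 1)
        out.extend(table[spatial_hash(ix, iy, iz, len(table))])
    return out

# Two levels, eight slots each, two-dimensional features (toy values).
tables = [[[float(level), float(slot)] for slot in range(8)] for level in range(2)]
code = encode((0.5, 0.25, 0.75), tables)
```

Because the tables are indexed by 3D position and not by camera view, any feature read from them is shared by every view that sees the same point, which is the property the premise relies on.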
What would settle it
Quantitative drop in 3D instance segmentation IoU on a benchmark scene containing fine-grained or partially occluded objects when the contrastive loss is removed or when SAM masks are replaced by random groupings, showing that the learned features fail to separate instances across views.
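The comparison described above reduces to an IoU measured with and without the component under test. A minimal sketch, assuming an instance is represented as a collection of Gaussian indices (the benchmark's actual evaluation protocol may differ):

```python
def instance_iou(predicted, ground_truth):
    """Intersection over union of two sets of Gaussian indices.
    Returns 0.0 when both sets are empty (a convention chosen for this sketch)."""
    pred, gt = set(predicted), set(ground_truth)
    union = pred | gt
    if not union:
        return 0.0
    return len(pred & gt) / len(union)

def iou_drop(full_prediction, ablated_prediction, ground_truth):
    """Positive values mean the ablated model segments the instance worse."""
    return (instance_iou(full_prediction, ground_truth)
            - instance_iou(ablated_prediction, ground_truth))

drop = iou_drop([0, 1, 2, 3], [2, 3, 4, 5], [0, 1, 2, 3])
```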
Original abstract
We introduce Ilov3Splat, a novel framework for instance-level open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D-GS). Most prior work depends on 2D rendering-based matching or point-level semantic association, which undermines cross-view consistency, lacks coherent instance-level reasoning, and limits precision in downstream 3D tasks. To address these limitations, our method jointly optimizes scene geometry and semantic representations by augmenting Gaussian splats with view-consistent feature fields. Specifically, we leverage multi-resolution hash embedding to efficiently encode language-aligned CLIP features, enabling dense and coherent language grounding in 3D space. We further train an instance feature field using contrastive loss over SAM masks, supporting fine-grained object distinction across views. At inference time, CLIP-encoded queries are matched against the learned features, followed by two-stage 3D clustering to retrieve relevant Gaussian groups. This enables our framework to identify arbitrary objects in 3D scenes based on natural language descriptions, without requiring category supervision or manual annotations. Experiments on standard benchmarks demonstrate that Ilov3Splat outperforms prior open-vocabulary 3D-GS methods in both object selection and instance segmentation, offering a flexible and accurate solution for language-driven 3D scene understanding. Project page: https://csiro-robotics.github.io/Ilov3Splat.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Ilov3Splat, a framework for instance-level open-vocabulary 3D scene understanding built on 3D Gaussian Splatting. It augments Gaussian splats with view-consistent feature fields by using multi-resolution hash embeddings to encode CLIP features for dense language grounding and training an instance feature field via contrastive loss over SAM masks. At inference, CLIP-encoded text queries are matched to the learned features, followed by two-stage 3D clustering to retrieve relevant Gaussian groups, enabling identification of arbitrary objects from natural language descriptions without category supervision or annotations. The work claims to outperform prior open-vocabulary 3D-GS methods on standard benchmarks for object selection and instance segmentation.
Significance. If the learned 3D instance features prove sufficiently view-consistent to support reliable unsupervised clustering, the approach would meaningfully advance language-driven 3D scene understanding by providing coherent instance-level grounding directly in 3D space rather than relying on 2D rendering-based matching. The practical integration of efficient hash grids with pretrained CLIP and SAM models is a strength that could enable flexible open-vocabulary applications, though the absence of detailed quantitative validation in the reviewed text limits assessment of real-world impact.
major comments (1)
- [Method, instance feature field training] The contrastive loss operates on 2D projections of SAM masks, but no explicit regularization term or consistency penalty is described to enforce that instance features for the same 3D Gaussian remain coherent across views. This is load-bearing for the central claim, as the two-stage 3D clustering at inference relies on view-consistent features to group Gaussians belonging to an arbitrary object from a text query; without it, view-specific drift could invalidate the unsupervised retrieval.
minor comments (2)
- [Abstract] Abstract: The claim of outperforming prior methods on standard benchmarks would be strengthened by including at least one key quantitative result (e.g., mIoU or accuracy delta) rather than a qualitative statement.
- [Method] Notation and equations: The precise formulation of the contrastive loss (including temperature and positive/negative pair construction) and how the multi-resolution hash embedding is fused into the Gaussian splat optimization should be stated explicitly with equation numbers for reproducibility.
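To make the request for an explicit formulation concrete, one common form such a loss takes is the supervised contrastive (InfoNCE-style) objective sketched below. It is an assumption for illustration and is not taken from the paper: features sharing a SAM mask id are treated as positives, all other features as negatives, and `temperature` is a placeholder value. Features are assumed to be non-zero vectors.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def contrastive_loss(features, mask_ids, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive.
    Positives: other features with the same mask id. Negatives: all the rest."""
    feats = [normalize(f) for f in features]
    n = len(feats)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and mask_ids[j] == mask_ids[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other feature.
        logits = {j: sum(a * b for a, b in zip(feats[i], feats[j])) / temperature
                  for j in range(n) if j != i}
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

ids = [0, 0, 1, 1]
loss_separated = contrastive_loss([[1, 0], [1, 0.01], [0, 1], [0.01, 1]], ids)
loss_mixed = contrastive_loss([[1, 0], [0, 1], [1, 0.01], [0.01, 1]], ids)
```

The loss is near zero when features from the same mask are aligned and large when they are mixed with features from other masks, which is why the temperature and the pair construction matter for reproducibility.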
Simulated Author's Rebuttal
We thank the referee for the detailed comment on the instance feature field. We appreciate the focus on view consistency, which is central to our claims, and will revise the manuscript to clarify the mechanism.
Referee: Method section (instance feature field training): The contrastive loss operates on 2D projections of SAM masks, but no explicit regularization term or consistency penalty is described to enforce that instance features for the same 3D Gaussian remain coherent across views. This is load-bearing for the central claim, as the two-stage 3D clustering at inference relies on view-consistent features to group Gaussians belonging to an arbitrary object from a text query; without it, view-specific drift could invalidate the unsupervised retrieval.
Authors: The instance feature field is parameterized directly on the 3D Gaussians via a shared multi-resolution hash embedding, not per-view 2D features. The contrastive loss is computed over SAM mask projections from multiple training views, so the identical 3D feature vector for each Gaussian participates in positive/negative pairs across all views in which it is visible. Any view-specific drift would increase the loss on other views, providing implicit cross-view regularization through the joint 3D optimization. We will add an explicit paragraph in Section 3.3 describing this effect and will include an ablation on an optional explicit consistency penalty (e.g., L2 feature distance across random view pairs) in the revision if it yields measurable gains.
revision: yes
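The optional penalty mentioned in the rebuttal could look roughly like the sketch below. It assumes that features rendered in two views have already been paired so that entry k in each list corresponds to the same 3D surface point; how such correspondences are obtained is not described in the reviewed text, and the function name is hypothetical.

```python
def consistency_penalty(feats_view_a, feats_view_b):
    """Mean squared L2 distance between paired feature vectors from two views.
    Pairing of entries across views is assumed to be given."""
    if len(feats_view_a) != len(feats_view_b):
        raise ValueError("both views must provide the same number of paired features")
    if not feats_view_a:
        return 0.0
    total = 0.0
    for fa, fb in zip(feats_view_a, feats_view_b):
        total += sum((x - y) ** 2 for x, y in zip(fa, fb))
    return total / len(feats_view_a)

penalty_same = consistency_penalty([[1.0, 0.0]], [[1.0, 0.0]])
penalty_diff = consistency_penalty([[1.0, 0.0]], [[0.0, 1.0]])
```

A penalty of zero means the two views agree exactly on the paired features; larger values indicate the view-specific drift the referee is concerned about.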
Circularity Check
No circularity: method relies on external pretrained CLIP/SAM and standard optimization
The paper presents an engineering pipeline that augments 3D Gaussian Splatting with hash-encoded CLIP features and a contrastive instance field trained on SAM masks. No equations, uniqueness theorems, or predictions are shown that reduce by construction to the target result. All load-bearing components (CLIP embeddings, SAM masks, contrastive loss, two-stage clustering) are imported from external models or standard techniques and treated as fixed inputs. The derivation chain is therefore self-contained against external benchmarks rather than internally circular.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: CLIP features extracted from 2D views can be lifted into a view-consistent 3D field via multi-resolution hash embedding
- domain assumption: SAM-generated masks supply reliable instance-level supervision for contrastive training across views