OP2GS: Object-Aware 3D Gaussian Splatting with Dual-Opacity Primitives

Guiyu Liu; Janne Heikkilä; Janne Mustaniemi; Juho Kannala; Niklas Vaara

arxiv: 2605.20044 · v1 · pith:VIQ343LJnew · submitted 2026-05-19 · 💻 cs.CV

Pith reviewed 2026-05-20 05:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D Gaussian Splatting · object-aware representation · dual opacity · instance segmentation · open-vocabulary scene understanding · neural rendering · 3D reconstruction

The pith

Each 3D Gaussian gets a second opacity so visual rendering stays accurate even when object labels are noisy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OP2GS to make 3D Gaussian Splatting aware of object instances without heavy feature storage or label errors. It does this by giving each Gaussian an original opacity for rendering the scene and a separate instance opacity that decides its contribution to object masks. This separation means Gaussians that receive wrong labels during projection can still help reconstruct the image but drop out of the mask computation. Training uses a random object loss based on visibility to learn the occupancy, followed by object-level semantic aggregation. The result is open-vocabulary performance at lower cost than distilling features and better label consistency than simple lifting methods.

Core claim

OP2GS augments each Gaussian primitive with an explicit instance identity and a dedicated instance opacity σ* for object-mask rendering. The original opacity σ handles visual reconstruction while σ* models contribution to a particular object mask. This dual-opacity formulation allows mislabeled Gaussians to remain available for image rendering while becoming transparent in the object-mask branch. A random object loss optimizes the 1D instance occupancy field using transmittance-based visibility, and semantic descriptors are attached at the object level through multi-view aggregation.

What carries the argument

The dual-opacity primitive that separates the original opacity σ for color rendering from the instance opacity σ* for mask rendering, combined with the random object loss for learning occupancy.
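
A minimal sketch of how the two opacities might act along a single ray, in plain Python. The exact mask-rendering equation is not given on this page, so the weighting `sigma_star * alpha * T` is an assumption, as are the `Splat` class, the `composite` function, and the toy numbers.

```python
from dataclasses import dataclass


@dataclass
class Splat:
    """One Gaussian's footprint on a single ray (hypothetical container)."""
    color: float       # grayscale radiance, for brevity
    alpha: float       # visual alpha: opacity sigma times the 2D Gaussian falloff
    instance_id: int   # lifted object label (may be wrong)
    sigma_star: float  # instance opacity in [0, 1]


def composite(splats, target_id):
    """Return (pixel_color, mask_value) for one ray, splats sorted front to back.

    Assumption (not stated in the source): the mask branch weights each splat
    by sigma_star * alpha * T, where the transmittance T is accumulated from
    the visual alphas only.
    """
    transmittance = 1.0
    color = 0.0
    mask = 0.0
    for s in splats:
        weight = s.alpha * transmittance
        color += s.color * weight            # visual branch ignores sigma_star
        if s.instance_id == target_id:
            mask += s.sigma_star * weight    # mask branch is gated by sigma_star
        transmittance *= 1.0 - s.alpha
    return color, mask


ray = [
    Splat(color=0.2, alpha=0.5, instance_id=7, sigma_star=0.0),  # mislabeled floater
    Splat(color=0.8, alpha=0.9, instance_id=7, sigma_star=1.0),  # true surface of object 7
]
pixel, mask = composite(ray, target_id=7)
```

With the floater's `sigma_star` at 0, the pixel color still includes the floater (0.46 in total) while the mask counts only the true surface (0.45). Setting the floater's `sigma_star` to 1 would raise the mask value to 0.95.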

Load-bearing premise

The assumption that a visibility-based loss can correctly figure out which Gaussians belong to which objects even when the initial labels lifted from 2D images are noisy.

What would settle it

Optimizing on a scene with available 3D ground-truth object labels and measuring whether the learned instance opacities correctly suppress Gaussians that do not belong to each object.
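
The check described above could be scored roughly as follows. This is a sketch under stated assumptions: the function name `suppression_report`, the 0.5 threshold, and the toy labels are all hypothetical and do not come from the paper.

```python
def suppression_report(lifted_ids, sigma_star, gt_ids, threshold=0.5):
    """Compare learned instance opacities with ground-truth 3D labels.

    A Gaussian counts as mislabeled when its lifted id differs from its
    ground-truth id. Returns two fractions:
      suppressed: mislabeled Gaussians whose sigma_star is below the threshold
      retained:   correctly labeled Gaussians whose sigma_star is at or above it
    """
    wrong = [s for p, s, g in zip(lifted_ids, sigma_star, gt_ids) if p != g]
    right = [s for p, s, g in zip(lifted_ids, sigma_star, gt_ids) if p == g]
    suppressed = sum(s < threshold for s in wrong) / len(wrong) if wrong else float("nan")
    retained = sum(s >= threshold for s in right) / len(right) if right else float("nan")
    return suppressed, retained


# Toy data: five Gaussians, two of them mislabeled (indices 1 and 4).
lifted = [1, 1, 2, 2, 2]
truth = [1, 3, 2, 2, 1]
opacity = [0.9, 0.1, 0.8, 0.3, 0.7]
suppressed, retained = suppression_report(lifted, opacity, truth)
```

In this toy case one of the two mislabeled Gaussians is suppressed and two of the three correct ones are retained. A convincing result on real data would need both fractions to be high.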

Figures

Figures reproduced from arXiv: 2605.20044 by Guiyu Liu, Janne Heikkilä, Janne Mustaniemi, Juho Kannala, Niklas Vaara.

**Figure 2.** The overall pipeline of our proposed OP2GS.

Contributions excerpted with this caption:
• Visibility-aware instance learning. We optimize σ* with a random object loss that reuses 3DGS transmittance, suppressing label contamination from floaters and ambiguous mask lifting.
• Compact open-vocabulary segmentation. OP2GS avoids high-dimensional per-Gaussian feature rendering while achieving competitive accuracy and 121 FPS inference.

**Figure 3.** Illustration of the dual-opacity rendering process. The dashed line indicates the same ray

**Figure 4.** Visualization of the training objective: different colors indicate Gaussians with various …

**Figure 5.** Example of 3D open-vocabulary segmentation on the LERF-Mask dataset.

**Figure 6.** Object embedding aggregation. We choose N training views to render object masks. The cropped object regions are fed into the CLIP image encoder, and the final object embedding is obtained by averaging features from N views.

**Figure 7.** Qualitative instance segmentation results on the Replica dataset. (a) Input image. (b) 2D …

**Figure 8.** Open-vocabulary 3D segmentation results of different methods on the LERF-Mask dataset.

**Figure 9.** (a) Ground-truth image. (b) Early stage rendered mask. (c) Final rendered mask. (d) Final …

**Figure 10.** Examples of rendered instance masks under different thresholds.

**Figure 11.** Qualitative rendering results on the Replica dataset. (a) Ground-truth image. (b) Gaussian …

Original abstract

3D Gaussian Splatting (3DGS) provides an explicit and efficient scene representation, but its primitives lack inherent object-level identity, hindering downstream tasks such as open-vocabulary scene understanding. Existing methods typically address this by either distilling high-dimensional feature embeddings into Gaussians or by lifting 2D mask labels into 3D via heuristic refinement. However, feature-based approaches incur heavy storage and decoding overhead, while lifting-based pipelines remain vulnerable to label contamination: Gaussians necessary for appearance reconstruction often receive incorrect object labels during 2D-to-3D projection. We propose OP2GS, an object-aware Gaussian representation that augments each primitive with an explicit instance identity and a dedicated instance opacity $\sigma^{*}$ for object-mask rendering. The original opacity $\sigma$ remains responsible for visual reconstruction, while $\sigma^{*}$ models whether a Gaussian should contribute to a particular object mask. This dual-opacity formulation decouples visual existence from instance occupancy: mislabeled Gaussians can remain available for image rendering while becoming transparent in the object-mask branch. To learn this representation, we introduce a random object loss that optimizes the 1D instance occupancy field using the standard transmittance-based visibility of 3DGS. Semantic descriptors are then attached at the object level through multi-view aggregation, eliminating per-Gaussian feature storage. Compared with feature-training approaches, OP2GS achieves competitive open-vocabulary performance while significantly reducing computational overhead. Compared with training-free pipelines, it leverages physically consistent occupancy learning to resolve visibility ambiguities.
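
The object-level aggregation mentioned in the abstract, and described in the Figure 6 caption as averaging CLIP features over N views, could be sketched as follows. Short hand-written vectors stand in for real CLIP features. The renormalization after averaging and the cosine-similarity query are assumptions, and the names `aggregate_object_embedding` and `query` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def aggregate_object_embedding(view_features):
    """Average the per-view feature vectors of one object, then renormalize."""
    n = len(view_features)
    dim = len(view_features[0])
    mean = [sum(f[d] for f in view_features) / n for d in range(dim)]
    return l2_normalize(mean)


def query(object_embeddings, text_embedding):
    """Return (best_object_id, scores) using cosine similarity (assumed metric)."""
    t = l2_normalize(text_embedding)
    scores = {
        oid: sum(a * b for a, b in zip(emb, t))
        for oid, emb in object_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Stand-in features: two objects, two views each, three dimensions.
views = {
    0: [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    1: [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
}
embeddings = {oid: aggregate_object_embedding(f) for oid, f in views.items()}
best, scores = query(embeddings, text_embedding=[0.9, 0.1, 0.0])
```

Because one descriptor is stored per object rather than per Gaussian, a query only compares the text embedding against as many vectors as there are objects.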

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Dual opacities separate visual rendering from object labels in 3DGS, but the loss may not pin down correct assignments from noisy lifted labels.

The main thing here is the dual-opacity primitive: standard opacity σ handles image rendering while a new instance opacity σ* controls contribution to object masks. This lets mislabeled Gaussians stay useful for visuals but drop out for masks, which directly targets the label contamination that comes with 2D-to-3D lifting in prior work. They optimize σ* with a random object loss that reuses the existing transmittance formulation, then attach semantics at the object level after multi-view aggregation instead of storing features on every Gaussian. That move should cut memory and decoding costs compared with feature-distillation baselines.

The separation is a clean mechanism that is not in the cited 3DGS literature, and it keeps the core rendering pipeline intact. The approach is coherent on its own terms and gives a practical way to add object awareness without heavy per-primitive overhead.

The soft spot is whether the random object loss can actually recover accurate per-Gaussian assignments from noisy lifted labels. It relies only on standard visibility without explicit denoising or multi-view consistency terms, so ambiguities in overlaps or occlusions could leave the 1D occupancy field under-constrained. The abstract names the equations but supplies no numbers, ablations, or error analysis, which makes it hard to judge if the optimizer lands on the right solution in practice.

This is for researchers extending 3D Gaussian Splatting toward open-vocabulary scene understanding in robotics or AR. A reader already working with Gaussian primitives would get value from the object-level attachment and the decoupling idea if the experiments hold up. I would send it for peer review so the results can be examined in detail.
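
The memory argument in the letter can be made concrete with back-of-envelope arithmetic. Every number below is hypothetical (scene size, feature dimension, object count), and the comparison ignores the compression that some feature-based methods apply, so it is a rough upper bound on the gap rather than a measured result.

```python
def per_gaussian_feature_floats(num_gaussians, feature_dim):
    """Extra floats when every Gaussian stores its own feature vector."""
    return num_gaussians * feature_dim


def object_level_floats(num_gaussians, num_objects, feature_dim):
    """Extra values when each Gaussian stores only an instance id and an
    instance opacity, and each object stores one descriptor."""
    return num_gaussians * 2 + num_objects * feature_dim


# Hypothetical scene: 1M Gaussians, 512-dimensional descriptors, 100 objects.
dense = per_gaussian_feature_floats(1_000_000, 512)
compact = object_level_floats(1_000_000, 100, 512)
ratio = dense / compact
```

Under these assumed sizes the dense layout needs 512,000,000 extra floats against 2,051,200 for the object-level layout, roughly a 250-fold difference.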

Referee Report

2 major / 2 minor

Summary. The paper proposes OP2GS, an object-aware extension to 3D Gaussian Splatting that augments each primitive with an explicit instance identity and a dedicated instance opacity σ* for object-mask rendering, while retaining the original opacity σ for visual reconstruction. The dual-opacity design is intended to decouple visual existence from instance occupancy so that Gaussians receiving incorrect 2D-to-3D lifted labels can still contribute to image synthesis but become transparent in the mask branch. A random object loss is introduced to optimize the 1D instance occupancy field via the standard transmittance formulation of 3DGS; semantic descriptors are subsequently attached at the object level through multi-view aggregation rather than per-Gaussian feature storage. The central claim is that this yields competitive open-vocabulary performance at substantially lower computational cost than feature-distillation or heuristic-lifting baselines.

Significance. If the random object loss reliably recovers accurate per-Gaussian assignments from noisy lifted labels, the dual-opacity representation would constitute a lightweight, physically motivated way to add object-level identity to explicit 3D scene representations without the storage overhead of high-dimensional features. This could meaningfully benefit downstream open-vocabulary tasks. The manuscript correctly identifies the label-contamination problem in existing lifting pipelines and proposes a clean architectural separation; however, the absence of quantitative results, ablations, or error analysis in the supplied text prevents assessment of whether the claimed resolution of visibility ambiguities is actually achieved.

major comments (2)

[Method section on random object loss] The random object loss is described as optimizing the instance occupancy field solely with the standard 3DGS transmittance-based visibility (no explicit denoising, multi-view consistency, or post-processing term). This formulation appears under-constrained for overlapping or partially occluded Gaussians, raising the risk that multiple occupancy solutions remain equally plausible and that the optimizer may not converge to the correct per-Gaussian instance assignments from noisy 2D-to-3D labels.
[Experiments / Results] No quantitative results, ablation studies, or error analysis are supplied to support the claims of competitive open-vocabulary performance or successful resolution of visibility ambiguities. Without reported metrics (e.g., mIoU on object masks, novel-view synthesis PSNR, or comparisons against lifting and feature baselines on standard benchmarks), the central empirical claim cannot be verified.
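
For reference, the two metrics named in the second major comment can be computed as in the sketch below. The definitions are the standard ones; the function names, the flattened-list data layout, and the choice to score an empty union as 1.0 are assumptions made for illustration.

```python
import math


def iou(pred, gt):
    """Intersection over union of two flattened binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0  # assumed convention for empty masks


def mean_iou(pred_masks, gt_masks):
    """Average IoU over a list of (prediction, ground truth) mask pairs."""
    return sum(iou(p, g) for p, g in zip(pred_masks, gt_masks)) / len(gt_masks)


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB for flattened images in [0, max_value]."""
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(reference)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


miou = mean_iou([[1, 1, 0, 0], [0, 1, 1, 0]], [[1, 0, 0, 0], [0, 1, 1, 0]])
quality = psnr([0.5, 0.5], [0.4, 0.6])
```

Here the first mask pair scores 0.5 and the second 1.0, giving a mean IoU of 0.75, and the per-pixel error of 0.1 gives a PSNR of about 20 dB.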

minor comments (2)

[Rendering formulation] Clarify the exact rendering equations for the object-mask branch (how σ* is combined with transmittance) and ensure they are presented alongside the standard 3DGS equations for direct comparison.
[Semantic descriptor attachment] The multi-view aggregation procedure for attaching semantic descriptors at the object level should be described with sufficient algorithmic detail (e.g., voting scheme, handling of conflicting labels) to allow reproduction.
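
As a reference point for the rendering-formulation comment, the standard 3DGS compositing equations are written out below next to one possible form of the mask branch. The first line is the usual front-to-back alpha blending. The second line is only an assumed reading of how σ* might enter, not the paper's stated equation. The symbols $G_i$ (2D Gaussian falloff at the pixel), $\ell_i$ (instance identity of Gaussian $i$), and $M_k$ (rendered mask of object $k$) are introduced here for illustration.

```latex
% Standard 3DGS color compositing along a ray, Gaussians sorted front to back
C = \sum_{i=1}^{N} c_i \, \alpha_i \, T_i,
\qquad
T_i = \prod_{j=1}^{i-1} (1 - \alpha_j),
\qquad
\alpha_i = \sigma_i \, G_i(\mathbf{x})

% Assumed mask branch for object k: same transmittance, gated by sigma^*
M_k = \sum_{i=1}^{N} \mathbb{1}[\ell_i = k] \, \sigma_i^{*} \, \alpha_i \, T_i
```

Under this reading, setting $\sigma_i^{*} = 0$ removes Gaussian $i$ from $M_k$ while leaving $C$ unchanged, which matches the decoupling the summary describes.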

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript. We address each major comment below, providing clarifications on the proposed method and indicating the revisions planned to strengthen the empirical support and exposition.

Referee: [Method section on random object loss] The random object loss is described as optimizing the instance occupancy field solely with the standard 3DGS transmittance-based visibility (no explicit denoising, multi-view consistency, or post-processing term). This formulation appears under-constrained for overlapping or partially occluded Gaussians, raising the risk that multiple occupancy solutions remain equally plausible and that the optimizer may not converge to the correct per-Gaussian instance assignments from noisy 2D-to-3D labels.

Authors: The random object loss operates by randomly sampling an object identity at each optimization step and supervising the rendered instance mask (produced via the dedicated opacity σ* and the standard 3DGS transmittance) against the corresponding lifted 2D label. Because the underlying 3DGS optimization already enforces multi-view photometric consistency, the occupancy field is indirectly constrained across views; Gaussians that are mislabeled in one view but correctly contribute to appearance in others can remain transparent in the instance branch without affecting visual reconstruction. We agree that an explicit discussion of convergence under occlusion would improve clarity, and we will expand the method section with a derivation of the loss and a qualitative analysis of ambiguous cases.
Revision: partial

Referee: [Experiments / Results] No quantitative results, ablation studies, or error analysis are supplied to support the claims of competitive open-vocabulary performance or successful resolution of visibility ambiguities. Without reported metrics (e.g., mIoU on object masks, novel-view synthesis PSNR, or comparisons against lifting and feature baselines on standard benchmarks), the central empirical claim cannot be verified.

Authors: We acknowledge that the current manuscript version does not present the full set of quantitative results, ablations, or error analysis. In the revised manuscript we will add a comprehensive experimental section that reports mIoU for object-mask rendering, PSNR/SSIM for novel-view synthesis, storage and runtime comparisons against feature-distillation and lifting baselines, and targeted ablations on the dual-opacity mechanism together with an analysis of label-contamination cases.
Revision: yes
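
The sampling scheme described in the first response could look roughly like the sketch below. Only the outline (sample one object identity, render its mask, compare with the lifted 2D label) comes from the text above. The binary cross-entropy form, the mask weighting `sigma_star * alpha * T`, and all names are assumptions.

```python
import math
import random


def render_mask(ray, target_id):
    """ray: list of (alpha, instance_id, sigma_star) tuples, front to back."""
    transmittance, mask = 1.0, 0.0
    for alpha, instance_id, sigma_star in ray:
        if instance_id == target_id:
            mask += sigma_star * alpha * transmittance  # assumed mask weighting
        transmittance *= 1.0 - alpha
    return mask


def random_object_loss(rays, label_image, object_ids, rng):
    """Sample one object id, then average a per-pixel binary cross-entropy
    (assumed loss form) between its rendered mask and the lifted 2D labels."""
    target_id = rng.choice(object_ids)
    eps = 1e-6
    total = 0.0
    for ray, label in zip(rays, label_image):
        m = min(max(render_mask(ray, target_id), eps), 1.0 - eps)
        y = 1.0 if label == target_id else 0.0
        total -= y * math.log(m) + (1.0 - y) * math.log(1.0 - m)
    return target_id, total / len(rays)


rng = random.Random(0)


def one_pixel_loss(sigma_star):
    # A floater labeled 7 sits in front of object 3; the 2D label says 3.
    ray = [(0.5, 7, sigma_star), (0.9, 3, 1.0)]
    return random_object_loss([ray], [3], object_ids=[7], rng=rng)[1]


loss_high = one_pixel_loss(1.0)  # floater fully counted in the mask of object 7
loss_low = one_pixel_loss(0.0)   # floater transparent in the mask branch
```

In this toy pixel the loss falls from about 0.693 to nearly zero when the floater's `sigma_star` drops to 0, so a gradient step would push the mislabeled Gaussian toward transparency in the mask branch without touching its visual alpha.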

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained.

The paper introduces a dual-opacity representation with distinct σ for visual rendering and σ* for instance occupancy, optimized via a random object loss that applies the pre-existing 3DGS transmittance visibility formulation to the 1D occupancy field. This does not reduce any claimed result or prediction to a fitted parameter or input quantity by construction, nor does it rely on self-citation chains or ansatzes that presuppose the target decoupling. The optimization is presented as an independent learning step that resolves label ambiguities from lifted 2D masks, with semantic descriptors aggregated separately at the object level; the central claims therefore retain independent content beyond the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The method rests on the standard 3DGS rendering pipeline and the assumption that 2D masks can be lifted to 3D with sufficient signal for the occupancy loss to recover correct assignments. No new physical constants or large numbers of fitted parameters are introduced in the abstract.

axioms (2)

[standard math] 3D Gaussian Splatting primitives can be rendered with standard alpha blending and transmittance-based visibility.
Invoked when the random object loss is defined using the existing 3DGS visibility formulation.
[domain assumption] 2D object masks lifted into 3D contain enough correct signal that an occupancy loss can separate mislabeled Gaussians.
Central premise required for the dual-opacity correction to succeed.

invented entities (1)

instance opacity σ* [no independent evidence]
Purpose: separate channel that decides whether a Gaussian contributes to a particular object mask.
New per-primitive quantity introduced to decouple visual rendering from instance occupancy.

pith-pipeline@v0.9.0 · 5825 in / 1602 out tokens · 37274 ms · 2026-05-20T05:56:16.104556+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We propose OP2GS, an object-aware Gaussian representation that augments each primitive with an explicit instance identity and a dedicated instance opacity σ* for object-mask rendering. The original opacity σ remains responsible for visual reconstruction, while σ* models whether a Gaussian should contribute to a particular object mask."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "This dual-opacity formulation decouples visual existence from instance occupancy"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 2 internal anchors

[1] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, October 2023.
[2] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.
[3] Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3DGS: Supercharging 3D Gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024.
[4] Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaussian grouping: Segment and edit anything in 3D scenes. In European Conference on Computer Vision. Springer, 2024.
[5] Yanmin Wu, Jiarui Meng, Haijie Li, Chenming Wu, Yahao Shi, Xinhua Cheng, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, and Jian Zhang. OpenGaussian: Towards point-level 3D Gaussian-based open vocabulary understanding. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
[6] Jaehun Bang, Jinhyeok Kim, Minji Kim, Seungheon Jeong, and Kyungdon Joo. LightSplat: Fast and memory-efficient open-vocabulary 3D scene understanding in five seconds. In CVPR, 2026.
[7] Jiahuan Cheng, Jan-Nico Zaech, Luc Van Gool, and Danda Pani Paudel. Occam's LGS: An efficient approach for language Gaussian splatting. arXiv preprint arXiv:2412.01807, 2024.
[8] Kim Jun-Seong, GeonU Kim, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, and Tae-Hyun Oh. Dr. Splat: Directly referring 3D Gaussian splatting via direct language embedding registration. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14137–14146, 2025.
[9] Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-GS: Structured 3D Gaussians for view-adaptive rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20654–20664, 2024.
[10] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[11] Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, and Qi Tian. Segment any 3D Gaussians. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1971–1979, 2025.
[12] Ruijie Zhu, Mulin Yu, Linning Xu, Lihan Jiang, Yixuan Li, Tianzhu Zhang, Jiangmiao Pang, and Bo Dai. ObjectGS: Object-aware scene reconstruction and scene understanding via Gaussian splatting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8350–8360, 2025.
[13] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. LERF: Language embedded radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12806–12816, 2023.
[14] Francis Engelmann, Fabian Manhardt, Michael Niemeyer, Keisuke Tateno, and Federico Tombari. OpenNeRF: Open set 3D neural scene segmentation with pixel-wise features and rendered novel views. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=SgjAojPKb3
[15] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. LangSplat: 3D language Gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23606–23615, 2024.
[16] Yue Li, Qi Ma, Runyi Yang, Huapeng Li, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Ender Konukoglu, Theo Gevers, et al. SceneSplat: Gaussian splatting-based scene understanding with vision-language pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4961–4972, 2025.
[17] Seokhun Choi, Hyeonseop Song, Jaechul Kim, Taehyeong Kim, and Hoseok Do. Click-Gaussian: Interactive segmentation to any 3D Gaussians. In ECCV (3), pages 289–305, 2024.
[18] SungMin Jang and Wonjun Kim. Identity-aware language Gaussian splatting for open-vocabulary 3D semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20467–20476, 2025.
[19] Sen Wang, Kunyi Li, Siyun Liang, Elena Alegret, Jing Ma, Nassir Navab, and Stefano Gasperini. Visibility-aware language aggregation for open-vocabulary segmentation in 3D Gaussian splatting. arXiv preprint arXiv:2509.05515, 2025.
[20] Juliette Marrie, Romain Ménégaux, Michael Arbel, Diane Larlus, and Julien Mairal. LUDVIG: Learning-free uplifting of 2D visual features to Gaussian splatting scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7440–7450, 2025.
[21] Jens Piekenbrinck, Christian Schmidt, Alexander Hermans, Narunas Vaskevicius, Timm Linder, and Bastian Leibe. OpenSplat3D: Open-vocabulary 3D instance segmentation using Gaussian splatting. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5246–5255, 2025.
[22] Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, and Shijian Lu. Weakly supervised 3D open-vocabulary segmentation. Advances in Neural Information Processing Systems, 36:53433–53456, 2023.
[23] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The Replica dataset: A digital replica of indoor spaces. CoRR, abs/1906.05797, 2019.
[24] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546, 2022.
[25] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7070, 2023.
[26] Rohan Chacko, Nicolai Häni, Eldar Khaliullin, Lin Sun, and Douglas Lee. Lifting by Gaussians: A simple, fast and flexible method for 3D instance segmentation. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3497–3507. IEEE, 2025.
[27] Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, and Joon-Young Lee. Tracking anything with decoupled video segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1316–1326, 2023.
[28] Jiazhong Cen, Zanwei Zhou, Jiemin Fang, Wei Shen, Lingxi Xie, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, et al. Segment anything in 3D with NeRFs. Advances in Neural Information Processing Systems, 36:25971–25990, 2023.
[29] google. slang-gaussian-rasterization. https://github.com/google/slang-gaussian-rasterization, 2024.
[30] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2D Gaussian splatting for geometrically accurate radiance fields. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024. (Extracted title for this entry: ViT-B-16.)

A Technical appendices. A.1 The Process of Open-vocabulary Querying. As shown in Fig. 6, we first render the object masks in N training views and crop th...

[1] [1]

Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment anything. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, October 2023

work page 2023

[2] [2]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023

work page 2023

[3] [3]

Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields

Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024

work page 2024

[4] [4]

Gaussian grouping: Segment and edit anything in 3d scenes

Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaussian grouping: Segment and edit anything in 3d scenes. InEuropean Conference on Computer Vision. Springer, 2024

work page 2024

[5] [5]

Opengaussian: Towards point-level 3d gaussian-based open vocabulary understanding

Yanmin Wu, Jiarui Meng, Haijie Li, Chenming Wu, Yahao Shi, Xinhua Cheng, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, and Jian Zhang. Opengaussian: Towards point-level 3d gaussian-based open vocabulary understanding. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024

[6] [6]

Lightsplat: Fast and memory-efficient open-vocabulary 3d scene understanding in five seconds

Jaehun Bang, Jinhyeok Kim, Minji Kim, Seungheon Jeong, and Kyungdon Joo. Lightsplat: Fast and memory-efficient open-vocabulary 3d scene understanding in five seconds. InCVPR, 2026

work page 2026

[7] [7]

Occam’s lgs: An efficient approach for language gaussian splatting.arXiv preprint arXiv:2412.01807, 2024

Jiahuan Cheng, Jan-Nico Zaech, Luc Van Gool, and Danda Pani Paudel. Occam’s lgs: An efficient approach for language gaussian splatting.arXiv preprint arXiv:2412.01807, 2024

work page arXiv 2024

[8] [8]

Kim Jun-Seong, GeonU Kim, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, and Tae-Hyun Oh. Dr. splat: Directly referring 3d gaussian splatting via direct language embedding registration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14137–14146, 2025

work page 2025

[9] [9]

Scaffold-gs: Structured 3d gaussians for view-adaptive rendering

Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20654–20664, 2024

work page 2024

[10] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

[11] Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, and Qi Tian. Segment any 3d gaussians. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1971–1979, 2025

[12] Ruijie Zhu, Mulin Yu, Linning Xu, Lihan Jiang, Yixuan Li, Tianzhu Zhang, Jiangmiao Pang, and Bo Dai. Objectgs: Object-aware scene reconstruction and scene understanding via gaussian splatting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8350–8360, 2025

[13] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12806–12816, 2023

[14] Francis Engelmann, Fabian Manhardt, Michael Niemeyer, Keisuke Tateno, and Federico Tombari. OpenNeRF: Open set 3d neural scene segmentation with pixel-wise features and rendered novel views. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=SgjAojPKb3

[15] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23606–23615, 2024

[16] Yue Li, Qi Ma, Runyi Yang, Huapeng Li, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Ender Konukoglu, Theo Gevers, et al. Scenesplat: Gaussian splatting-based scene understanding with vision-language pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4961–4972, 2025

[17] Seokhun Choi, Hyeonseop Song, Jaechul Kim, Taehyeong Kim, and Hoseok Do. Click-gaussian: Interactive segmentation to any 3d gaussians. In ECCV (3), pages 289–305, 2024

[18] SungMin Jang and Wonjun Kim. Identity-aware language gaussian splatting for open-vocabulary 3d semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20467–20476, 2025

[19] Sen Wang, Kunyi Li, Siyun Liang, Elena Alegret, Jing Ma, Nassir Navab, and Stefano Gasperini. Visibility-aware language aggregation for open-vocabulary segmentation in 3d gaussian splatting. arXiv preprint arXiv:2509.05515, 2025

[20] Juliette Marrie, Romain Ménégaux, Michael Arbel, Diane Larlus, and Julien Mairal. Ludvig: Learning-free uplifting of 2d visual features to gaussian splatting scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7440–7450, 2025

[21] Jens Piekenbrinck, Christian Schmidt, Alexander Hermans, Narunas Vaskevicius, Timm Linder, and Bastian Leibe. Opensplat3d: Open-vocabulary 3d instance segmentation using gaussian splatting. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5246–5255, 2025

[22] Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, and Shijian Lu. Weakly supervised 3d open-vocabulary segmentation. Advances in Neural Information Processing Systems, 36:53433–53456, 2023

[23] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces. CoRR, abs/1906.05797, 2019

[24] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546, 2022

[25] Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, and Diana Marculescu. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7061–7070, 2023

[26] Rohan Chacko, Nicolai Häni, Eldar Khaliullin, Lin Sun, and Douglas Lee. Lifting by gaussians: A simple, fast and flexible method for 3d instance segmentation. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3497–3507. IEEE, 2025

[27] Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, and Joon-Young Lee. Tracking anything with decoupled video segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1316–1326, 2023

[28] Jiazhong Cen, Zanwei Zhou, Jiemin Fang, Wei Shen, Lingxi Xie, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, et al. Segment anything in 3d with nerfs. Advances in Neural Information Processing Systems, 36:25971–25990, 2023

[29] google. slang-gaussian-rasterization. https://github.com/google/slang-gaussian-rasterization, 2024

[30]

ViT-B-16

Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. In ACM SIGGRAPH 2024 conference papers, pages 1–11, 2024

A Technical appendices and appendix

A.1 The Process of Open-vocabulary Querying

As shown in Fig. 6, we first render the object masks in N training views and crop th...