
arxiv: 2606.30638 · v1 · pith:EKGZWDQHnew · submitted 2026-06-29 · 💻 cs.CV

Open-Vocabulary and Referring Segmentation for 3D Gaussians Using 2D Detectors

Pith reviewed 2026-06-30 05:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D Gaussian Splatting · open-vocabulary segmentation · referring expression grounding · 2D object detectors · zero-shot learning · instance grouping · view aggregation · 3D scene understanding

The pith

GaussDet lets 3D Gaussian scenes handle complex referring expressions by voting from multi-view 2D detectors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GaussDet as a way to add open-vocabulary segmentation and referential grounding to 3D Gaussian Splatting without relying on CLIP feature distillation. It learns per-Gaussian instance features to form 3D groups, renders those groups, and collects semantic labels from 2D detectors across views to build a View-Aggregated Semantic Label Distribution for each instance. The aggregation step reduces noise from imperfect grouping and supports direct zero-shot use of referring expressions that go beyond simple nouns. Experiments on LeRF-OVS, ScanNet, and Ref-LeRF show gains over prior methods, with a reported 16.7 percent mIoU lift in strict zero-shot referential grounding.

Core claim

GaussDet decomposes scenes into 3D instances via learned Gaussian features, renders the instances, and aggregates discrete open-vocabulary labels from 2D detectors into a per-instance VASD; this produces robust semantics that extend from basic language queries to complex referential expressions and yields consistent improvements on open-vocabulary segmentation and referring grounding benchmarks.

What carries the argument

The View-Aggregated Semantic Label Distribution (VASD) formed by rendering 3D instance groups and collecting votes from multi-view 2D detections.
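
A minimal sketch of the voting step, under stated assumptions: each selected view supplies a boolean rendered mask for the instance and an integer label map from the 2D detector, every masked pixel casts one vote, and label ids lie in [0, num_labels). The function name, the per-pixel weighting, and the treatment of label 0 as an ordinary class are illustrative choices, not details confirmed by the paper.

```python
import numpy as np


def aggregate_label_distribution(masks, label_maps, num_labels):
    """Sum per-view label histograms under one instance's rendered masks.

    masks      : list of (H, W) boolean arrays, one per selected view
    label_maps : list of (H, W) integer arrays of detector label ids
    num_labels : number of distinct label ids (assumed to be 0..num_labels-1)

    Returns a length-num_labels vector that sums to 1, or all zeros when
    the instance is not visible in any view.
    """
    votes = np.zeros(num_labels, dtype=np.float64)
    for mask, labels in zip(masks, label_maps):
        mask = np.asarray(mask, dtype=bool)
        labels = np.asarray(labels)
        # Histogram of the detector labels that fall inside the rendered mask.
        votes += np.bincount(labels[mask], minlength=num_labels)
    total = votes.sum()
    return votes / total if total > 0 else votes


# Two toy views of the same instance; label 1 dominates across views.
masks = [
    np.array([[True, True], [True, True]]),
    np.array([[True, True], [False, False]]),
]
label_maps = [
    np.array([[1, 1], [1, 2]]),
    np.array([[1, 2], [0, 0]]),
]
distribution = aggregate_label_distribution(masks, label_maps, num_labels=3)
best_label = int(distribution.argmax())
```

Taking the argmax gives a hard label per instance; whether the method keeps the full distribution or only the top label at query time is not stated on this page.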

If this is right

  • The method supports zero-shot transfer from simple noun queries to complex referential expressions without retraining.
  • View aggregation acts as a regularizer that reduces spurious labels from low-quality 3D groups.
  • Consistent accuracy gains appear on both open-vocabulary segmentation and referring expression tasks.
  • The approach avoids the need for a predefined number of instances or bottom-up clustering used in earlier work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same voting mechanism could be tested on dynamic or time-varying Gaussian scenes if 2D detectors track objects across frames.
  • Replacing continuous CLIP embeddings with discrete detector outputs may lower memory use during scene optimization.
  • The framework suggests that any 2D model supplying instance-level labels could be plugged in without changing the 3D grouping step.
  • Embodied agents might use the resulting 3D instance labels for spatial planning tasks that require reference resolution.

Load-bearing premise

Multi-view aggregation of 2D detector outputs will consistently correct errors introduced by noisy 3D instance grouping.

What would settle it

The premise would be undermined if performance on Ref-LeRF remained unchanged when the view-aggregation step is removed and only single-view 2D detections are used.
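
Such a comparison would be scored in mIoU, the metric the paper reports. A minimal sketch of that metric on binary masks follows; the averaging unit (one mask pair per query) and the convention for two empty masks are assumptions, since the benchmark protocol is not described on this page.

```python
import numpy as np


def mask_iou(pred, gt):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty: treated as a perfect match (an assumed convention).
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)


def mean_iou(preds, gts):
    """Average IoU over paired predictions and ground-truth masks."""
    return float(np.mean([mask_iou(p, g) for p, g in zip(preds, gts)]))


# Toy example: one half-correct prediction and one exact prediction.
preds = [np.array([[1, 1], [0, 0]]), np.array([[0, 1], [0, 1]])]
gts = [np.array([[1, 0], [0, 0]]), np.array([[0, 1], [0, 1]])]
score = mean_iou(preds, gts)
```

Running the same function on predictions produced with and without view aggregation, on identical instance groups, would yield the two numbers the test calls for.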

Figures

Figures reproduced from arXiv: 2606.30638 by Jameel Hassan, Vishal Patel, Yasiru Ranasinghe.

Figure 1: Open-vocabulary and referential expression grounding with …
Figure 2: GaussDet overall pipeline. We train a 3D Gaussian splat of the scene using RGB images with augmented instance features per Gaussian, which are used to decompose the scene into instance groups G_i. Each instance group G_i is rendered to image frames, and the top-K views K_i per instance group are selected. We obtain semantic label maps S v using detections from a 2D open vocabulary object detector to generate …
Figure 3: View-Aggregated Semantic Label Distribution (VASD) generation.
Figure 4: Qualitative comparison on the LeRF dataset for Open-Vocabulary …
Figure 5: Qualitative results on Ref-LeRF dataset for Referential Grounding.
Figure 6: Comparison of Semantic Label Distributions.
Figure 7: … ablating this feature on the LeRF dataset reveals that Semantic Label Regularization in generating the semantic label distribution yields consistent improvements across all scenes, resulting in an average mIoU increase of 6.56%. Strategically, this formulation marks a deliberate divergence from prior open-vocabulary methods designed for point clouds, such as [5], which typically discard the background in…
Figure 8: Influence of top-K choice. … evaluated scenes. Crucially, because 3D instance groups must be rendered across all available views to compute their mask visibility scores, and the 2D discrete semantic label maps are pre-computed for all images, the marginal computational overhead of this ensembling approach is negligible. The additional cost is strictly limited to the sorting and aggregation of these pre-exis…
Figure 9: Additional Open-Vocabulary Segmentation comparisons on LeRF.
Figure 10: Qualitative comparison on the Ref-LeRF dataset.
Figure 11: Failure cases. Our method faces challenges with heavily overlapping objects lacking multi-view coverage, leading to incomplete segmentations or residual background artifacts. … in incomplete masks as seen in the “chopsticks” and “spoon” examples. Furthermore, the inherent limitations in the underlying 3D scene decomposition can sometimes still cause some degree of floating background artifacts to persist.…
Figure 12: Qualitative comparison on ScanNet point clouds.
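
The captions of Figures 2 and 8 mention selecting the top-K views per instance group by mask visibility score. A minimal sketch of that selection follows, assuming the visibility score of a view is simply the number of pixels in the instance's rendered mask; the paper's actual score definition is not given on this page, and both function names are hypothetical.

```python
import numpy as np


def pixel_count_visibility(masks):
    """Assumed visibility score: number of mask pixels per view."""
    return [int(np.asarray(m, dtype=bool).sum()) for m in masks]


def select_top_k_views(visibility_scores, k):
    """Return indices of the k highest-scoring views, best first.

    If fewer than k views exist, all views are returned. Ties keep the
    original view order because a stable sort is used.
    """
    scores = np.asarray(visibility_scores, dtype=float)
    k = min(k, scores.size)
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()


# Toy example: four views, the instance is invisible in view 2.
masks = [
    np.array([[1, 0], [0, 0]]),
    np.array([[1, 1], [1, 0]]),
    np.array([[0, 0], [0, 0]]),
    np.array([[1, 1], [0, 0]]),
]
scores = pixel_count_visibility(masks)
top_views = select_top_k_views(scores, k=2)
```

Because the scores come from masks that are rendered anyway, the only added work in this sketch is one sort per instance, which matches the caption's remark that the extra cost is limited to sorting and aggregation.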

3D Gaussian Splatting (3DGS) has emerged at the forefront of 3D scene reconstruction. Extending 3DGS with language-driven, open-vocabulary understanding has gained significant attention for real-world applications such as embodied AI. Recent methods achieve this by learning an instance feature attribute and assigning semantics by distilling high-dimensional Contrastive Language-Image Pretraining (CLIP) features directly into the scene representation. However, the instance grouping mechanisms of these methods either require a predefined number of instances or suffer from noise in their bottom-up grouping strategies. Furthermore, the reliance on CLIP restricts semantic understanding to simple noun phrases, preventing complex spatial reasoning and referential expression grounding. We present GaussDet, a method that circumvents the need for dense CLIP features by leveraging discrete, open-vocabulary 2D object detectors with referring expression capabilities. We learn instance features for individual Gaussians to decompose the scene into 3D instance groups. By rendering these groups and aggregating semantic votes from multi-view 2D detections, we generate a robust View-Aggregated Semantic Label Distribution (VASD) for each 3D instance. This view-aggregation strategy acts as a strong regularizer, attenuating spurious labels caused by low-quality instance grouping. Our approach enables a straightforward, zero-shot extension from simple language queries to complex referential grounding. Extensive evaluations across two key tasks -- open-vocabulary segmentation (LeRF-OVS, ScanNet) and referring expression grounding (Ref-LeRF) -- demonstrate that GaussDet achieves consistent improvements over existing methods. Most notably, we achieve a substantial 16.7% mIoU improvement in referential grounding within a strict zero-shot setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces GaussDet, which extends 3D Gaussian Splatting to open-vocabulary segmentation and referring expression grounding by learning per-Gaussian instance features, rendering instance groups, and aggregating semantic votes from multi-view 2D detectors into a View-Aggregated Semantic Label Distribution (VASD) per 3D instance. The view-aggregation is presented as a regularizer against spurious labels from imperfect bottom-up grouping. The method claims a straightforward zero-shot extension from simple queries to complex referential grounding and reports consistent gains over prior CLIP-based methods, including a 16.7% mIoU improvement on Ref-LeRF in a strict zero-shot setting.

Significance. If the results hold, the work is significant because it replaces dense CLIP feature distillation with discrete 2D detectors that natively support referring expressions, enabling complex spatial reasoning in 3DGS without predefined instance counts. The zero-shot referential grounding capability is a practical advance for embodied AI applications. The view-aggregation idea is conceptually appealing as a regularizer, though its contribution to the reported gains requires direct empirical isolation.

major comments (2)
  1. [Abstract] Abstract: the assertion that 'This view-aggregation strategy acts as a strong regularizer, attenuating spurious labels caused by low-quality instance grouping' is load-bearing for attributing the 16.7% mIoU gain, yet the manuscript provides no ablation that decouples grouping quality from the aggregation step (e.g., VASD versus per-view labels on identical rendered masks, or controlled injection of grouping noise). If errors are correlated across views, aggregation may reinforce rather than attenuate them.
  2. [Experiments] Experiments section: quantitative improvements are reported without sufficient detail on experimental methodology, potential confounding factors (such as detector choice or view selection), or error analysis, which limits assessment of whether the data support the zero-shot referential grounding claims.
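
The concern in comment 1 about correlated errors can be made concrete with a toy calculation. The sketch below is a deliberate simplification, not the paper's model: each of k views casts a binary vote that is correct with probability p, and the final label is the strict majority. With independent votes the majority is more reliable than a single view when p > 0.5 and less reliable when p < 0.5; with fully correlated votes (all views agree) the majority is correct with probability p regardless of k.

```python
from math import comb


def majority_correct_prob(p, k):
    """P(strict majority of k independent votes is correct).

    p : probability that a single view's vote is correct
    k : number of views, required to be odd so that ties cannot occur
    """
    if k % 2 == 0:
        raise ValueError("k must be odd to avoid ties")
    needed = k // 2 + 1
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(needed, k + 1))


# Independent votes: aggregation helps a decent per-view detector ...
helped = majority_correct_prob(0.7, 3)
# ... and hurts a per-view detector that is wrong more often than right.
hurt = majority_correct_prob(0.3, 3)
# Fully correlated votes: every view repeats the same answer, so the
# majority is correct with probability p, whatever k is.
fully_correlated = 0.7
```

An ablation on identical rendered masks, as the referee requests, would show which of these regimes the real detector errors resemble.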

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments below and will revise the manuscript to strengthen the presentation of our method and results.

  1. Referee: [Abstract] Abstract: the assertion that 'This view-aggregation strategy acts as a strong regularizer, attenuating spurious labels caused by low-quality instance grouping' is load-bearing for attributing the 16.7% mIoU gain, yet the manuscript provides no ablation that decouples grouping quality from the aggregation step (e.g., VASD versus per-view labels on identical rendered masks, or controlled injection of grouping noise). If errors are correlated across views, aggregation may reinforce rather than attenuate them.

    Authors: We agree that an explicit ablation isolating the contribution of view aggregation would strengthen the claim. In the revised manuscript we will add an ablation comparing VASD against per-view labels applied to identical rendered masks, and we will include controlled analysis of error correlation across views to address the possibility that aggregation could reinforce correlated mistakes. revision: yes

  2. Referee: [Experiments] Experiments section: quantitative improvements are reported without sufficient detail on experimental methodology, potential confounding factors (such as detector choice or view selection), or error analysis, which limits assessment of whether the data support the zero-shot referential grounding claims.

    Authors: We acknowledge that additional methodological detail is needed. The revised Experiments section will expand on detector configurations and versions used, view selection criteria and counts, controls for potential confounding factors, and a dedicated error analysis of failure modes in the zero-shot referential grounding experiments. revision: yes

Circularity Check

0 steps flagged

No circularity; method relies on external 2D detectors and standard aggregation


The paper presents GaussDet as a pipeline that learns instance features on 3D Gaussians, renders groups, and aggregates votes from off-the-shelf open-vocabulary 2D detectors to form VASD. No equations, fitted parameters, or self-citations are shown that reduce a claimed prediction or uniqueness result back to the input by construction. The regularization statement about view aggregation is an empirical claim about the method's behavior rather than a definitional or self-referential step. Evaluations on external benchmarks (LeRF-OVS, ScanNet, Ref-LeRF) are independent of any internal fitting loop described in the provided text. This is the normal case of a self-contained engineering method.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on free parameters, axioms, or invented entities; the method extends existing components without introducing new ones explicitly.

pith-pipeline@v0.9.1-grok · 5847 in / 1133 out tokens · 43795 ms · 2026-06-30T05:51:10.748340+00:00 · methodology


Reference graph

Works this paper leans on

45 extracted references · 10 canonical work pages · 3 internal anchors

  1. [1]

    arXiv preprint arXiv:2504.14151 (2025) 3

    Arnaud, S., McVay, P., Martin, A., Majumdar, A., Jatavallabhula, K.M., Thomas, P., Partsey, R., Dugas, D., Gejji, A., Sax, A., et al.: Locate 3d: Real-world object localization via self-supervised learning in 3d. arXiv preprint arXiv:2504.14151 (2025) 3

  2. [2]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025) 2

  3. [3]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srinivasan, P.P.: Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 5855–5864 (2021) 3

  4. [4]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Zip-nerf: Anti-aliased grid-based neural radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19697–19705 (2023) 3

  5. [5]

    arXiv preprint arXiv:2406.02548 (2024) 1, 6, 13

    Boudjoghra, M.E.A., Dai, A., Lahoud, J., Cholakkal, H., Anwer, R.M., Khan, S., Khan, F.S.: Open-yolo 3d: Towards fast and accurate open-vocabulary 3d instance segmentation. arXiv preprint arXiv:2406.02548 (2024) 1, 6, 13

  6. [6]

    SAM 3: Segment Anything with Concepts

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025) 1

  7. [7]

    In: Proceedings of the AAAI conference on artificial intelligence

    Cen, J., Fang, J., Yang, C., Xie, L., Zhang, X., Shen, W., Tian, Q.: Segment any 3d gaussians. In: Proceedings of the AAAI conference on artificial intelligence. vol. 39, pp. 1971–1979 (2025) 19

  8. [8]

    arXiv preprint arXiv:2505.24746 (2025) 2, 3, 4, 5, 9, 10, 11, 19, 20

    Cen, J., Zhou, X., Fang, J., Wen, C., Xie, L., Zhang, X., Shen, W., Tian, Q.: Tackling view-dependent semantics in 3d language gaussian splatting. arXiv preprint arXiv:2505.24746 (2025) 2, 3, 4, 5, 9, 10, 11, 19, 20

  9. [9]

    In: European conference on computer vision

    Chen, D.Z., Chang, A.X., Nießner, M.: Scanrefer: 3d object localization in rgb-d scans using natural language. In: European conference on computer vision. pp. 202–221. Springer (2020) 3

  10. [10]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Cheng, T., Song, L., Ge, Y., Liu, W., Wang, X., Shan, Y.: Yolo-world: Real-time open-vocabulary object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16901–16911 (2024) 2

  11. [11]

    In: 2024 IEEE International Conference on Robotics and Automation (ICRA)

    Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 5021–5028. IEEE (2024) 1

  12. [12]

    IEEE Robotics and Automation Letters (2025) 2, 3

    Halacheva, A.M., Zaech, J.N., Wang, X., Paudel, D.P., Van Gool, L.: Gaussianvlm: Scene-centric 3d vision-language models using language-aligned gaussian splats for embodied reasoning and beyond. IEEE Robotics and Automation Letters (2025) 2, 3

  13. [13]

    arXiv preprint arXiv:2508.08252 (2025) 2, 3, 8, 10

    He, S., Jie, G., Wang, C., Zhou, Y., Hu, S., Li, G., Ding, H.: Refersplat: Referring segmentation in 3d gaussian splatting. arXiv preprint arXiv:2508.08252 (2025) 2, 3, 8, 10

  14. [14]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Hess, G., Lindström, C., Fatemi, M., Petersson, C., Svensson, L.: Splatad: Real-time lidar and camera rendering with 3d gaussian splatting for autonomous driving. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 11982–11992 (2025) 2, 3

  15. [15]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Keetha, N., Karhade, J., Jatavallabhula, K.M., Yang, G., Scherer, S., Ramanan, D., Luiten, J.: Splatam: Splat track & map 3d gaussians for dense rgb-d slam. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 21357–21366 (2024) 2

  16. [16]

    ACM Trans

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G., et al.: 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42(4), 139–1 (2023) 1, 3

  17. [17]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4015–4026 (2023) 2, 19

  18. [18]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9579–9589 (2024) 3

  19. [19]

    Language-driven Semantic Segmentation

    Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546 (2022) 1

  20. [20]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Li, H., Wu, Y., Meng, J., Gao, Q., Zhang, Z., Wang, R., Zhang, J.: Instancegaussian: Appearance-semantic joint gaussian representation for 3d instance-level perception. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14078–14088 (2025) 3

  21. [21]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Li, Y., Ma, Q., Yang, R., Li, H., Ma, M., Ren, B., Popovic, N., Sebe, N., Konukoglu, E., Gevers, T., et al.: Scenesplat: Gaussian splatting-based scene understanding with vision-language pretraining. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4961–4972 (2025) 2

  22. [22]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Li, Z., Chen, Z., Li, Z., Xu, Y.: Spacetime gaussian feature splatting for real-time dynamic view synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8508–8520 (2024) 2

  23. [23]

    In: European Conference on Computer Vision

    Lin, Z., Geng, S., Zhang, R., Gao, P., De Melo, G., Wang, X., Dai, J., Qiao, Y., Li, H.: Frozen clip models are efficient video learners. In: European Conference on Computer Vision. pp. 388–404. Springer (2022) 1

  24. [24]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Ling, H., Kim, S.W., Torralba, A., Fidler, S., Kreis, K.: Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8576–8588 (2024) 2

  25. [25]

    In: European conference on computer vision

    Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: European conference on computer vision. pp. 38–55. Springer (2024) 3

  26. [26]

    In: 2024 International Conference on 3D Vision (3DV)

    Luiten, J., Kopanas, G., Leibe, B., Ramanan, D.: Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In: 2024 International Conference on 3D Vision (3DV). pp. 800–809. IEEE (2024) 2

  27. [27]

    McInnes, L., Healy, J., Astels, S., et al.: hdbscan: Hierarchical density based clustering. J. Open Source Softw. 2(11), 205 (2017) 2, 19

  28. [28]

    Communications of the ACM 65(1), 99–106 (2021) 3

    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021) 3

  29. [29]

    In: European conference on computer vision

    Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., et al.: Simple open-vocabulary object detection. In: European conference on computer vision. pp. 728–755. Springer (2022) 2

  30. [30]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Nguyen, P., Ngo, T.D., Kalogerakis, E., Gan, C., Tran, A., Pham, C., Nguyen, K.: Open3dis: Open-vocabulary 3d instance segmentation with 2d mask guidance. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4018–4028 (2024) 1

  31. [31]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Peng, S., Genova, K., Jiang, C., Tagliasacchi, A., Pollefeys, M., Funkhouser, T., et al.: Openscene: 3d scene understanding with open vocabularies. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 815–824 (2023) 1

  32. [32]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Qin, M., Li, W., Zhou, J., Wang, H., Pfister, H.: Langsplat: 3d language gaussian splatting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20051–20060 (2024) 2, 3, 20

  33. [33]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rasheed, H., Khattak, M.U., Maaz, M., Khan, S., Khan, F.S.: Fine-tuned clip models are efficient video learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6545–6554 (2023) 1

  34. [34]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Shi, J.C., Wang, M., Duan, H.B., Guan, S.H.: Language embedded 3d gaussians for open-vocabulary scene understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5333–5343 (2024) 2

  35. [35]

    arXiv preprint arXiv:2405.04378 (2024) 2, 3

    Shorinwa, O., Tucker, J., Smith, A., Swann, A., Chen, T., Firoozi, R., Kennedy III, M., Schwager, M.: Splat-mover: Multi-stage, open-vocabulary robotic manipulation via editable gaussian splatting. arXiv preprint arXiv:2405.04378 (2024) 2, 3

  36. [36]

    arXiv preprint arXiv:2306.13631 (2023) 1

    Takmaz, A., Fedele, E., Sumner, R.W., Pollefeys, M., Tombari, F., Engelmann, F.: Openmask3d: Open-vocabulary 3d instance segmentation. arXiv preprint arXiv:2306.13631 (2023) 1

  37. [37]

    Advances in Neural Information Processing Systems 37, 19114–19138 (2024) 2, 3, 4, 5, 8, 9, 19, 20

    Wu, Y., Meng, J., Li, H., Wu, C., Shi, Y., Cheng, X., Zhao, C., Feng, H., Ding, E., Wang, J., et al.: Opengaussian: Towards point-level 3d gaussian-based open vocabulary understanding. Advances in Neural Information Processing Systems 37, 19114–19138 (2024) 2, 3, 4, 5, 8, 9, 19, 20

  38. [38]

    In: European conference on computer vision

    Ye, M., Danelljan, M., Yu, F., Ke, L.: Gaussian grouping: Segment and edit anything in 3d scenes. In: European conference on computer vision. pp. 162–179. Springer (2024) 2

  39. [39]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Yi, T., Fang, J., Wang, J., Wu, G., Xie, L., Zhang, X., Liu, W., Tian, Q., Wang, X.: Gaussiandreamer: Fast generation from text to 3d gaussians by bridging 2d and 3d diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6796–6807 (2024) 2

  40. [40]

    Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., Zou, J.: When and why vision-language models behave like bags-of-words, and what to do about it? arXiv preprint arXiv:2210.01936 (2022) 2, 5

  41. [41]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Zareian, A., Rosa, K.D., Hu, D.H., Chang, S.F.: Open-vocabulary object detection using captions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14393–14402 (2021) 1

  42. [42]

    In: Proceedings of the IEEE international conference on computer vision

    Zhao, H., Puig, X., Zhou, B., Fidler, S., Torralba, A.: Open vocabulary scene parsing. In: Proceedings of the IEEE international conference on computer vision. pp. 2002–2010 (2017) 1

  43. [43]

    Zhou, H., Shao, J., Xu, L., Bai, D., Qiu, W., Liu, B., Wang, Y., Geiger, A., Liao, Y.: Hugs: Holistic urban 3d scene understanding via gaussian splatting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21336–21345 (2024) 2

  44. [44]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Zhou, S., Chang, H., Jiang, S., Fan, Z., Zhu, Z., Xu, D., Chari, P., You, S., Wang, Z., Kadambi, A.: Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21676–21685 (2024) 2, 3

  45. [45]

    Detect the objects corresponding to the following description if present: ‘{target_str}’

    Zhou, X., Lin, Z., Shan, X., Wang, Y., Sun, D., Yang, M.H.: Drivinggaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 21634–21643 (2024) 2, 3