OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention

Federico Tombari; Kunyi Li; Michael Niemeyer; Nassir Navab; Sen Wang; Stefano Gasperini

arxiv: 2605.06088 · v3 · pith:W2K32ELXnew · submitted 2026-05-07 · 💻 cs.CV

Kunyi Li, Michael Niemeyer, Sen Wang, Stefano Gasperini, Nassir Navab, Federico Tombari

Pith reviewed 2026-05-25 06:08 UTC · model grok-4.3

keywords open-vocabulary 3D scene understanding · Gaussian Splatting · semantic feature field · codebook attention · 3D semantic consistency · object-level consistency · open-vocabulary segmentation

The pith

OpenGaFF conditions open-vocabulary semantic predictions on 3D Gaussian geometry through a feature field and codebook attention to enforce spatial and object-level consistency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OpenGaFF to address fragmented semantic predictions in open-vocabulary 3D scenes represented by Gaussian Splatting. It defines semantics as a continuous function of each Gaussian's geometry and appearance, directly linking the two to reduce inconsistency across views. A learned codebook supplies shared semantic primitives, and codebook-guided attention matches query features to these entries to retrieve language descriptors while lowering variance inside objects. Experiments on standard 2D and 3D benchmarks show gains in segmentation quality and 3D coherence compared with prior methods.
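One way to picture "semantics as a continuous function of each Gaussian's geometry and appearance" is a small network that maps per-Gaussian attributes to a feature vector. The sketch below is illustrative only: the two-layer MLP, the attribute layout (position, scale, rotation quaternion, color), and the names `init_mlp` and `gaussian_feature_field` are assumptions made here, not the architecture described in the paper.

```python
import numpy as np

def init_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Randomly initialise a two-layer MLP (hypothetical stand-in for the field)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def gaussian_feature_field(params, position, scale, rotation, color):
    """Map per-Gaussian attributes to semantic features.

    position: (N, 3), scale: (N, 3), rotation: (N, 4) quaternion, color: (N, 3).
    The attribute choice is an assumption; the source only says
    "geometry and appearance".
    """
    x = np.concatenate([position, scale, rotation, color], axis=-1)  # (N, 13)
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)             # ReLU
    return h @ params["W2"] + params["b2"]                           # (N, out_dim)

# Two Gaussians with identical attributes receive identical features.
params = init_mlp(in_dim=13, hidden_dim=32, out_dim=8)
pos = np.zeros((2, 3))
scl = np.ones((2, 3))
rot = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (2, 1))
col = np.full((2, 3), 0.5)
feats = gaussian_feature_field(params, pos, scl, rot, col)
```

Because the output depends only on the attributes, Gaussians with similar geometry and appearance receive similar features in this sketch, which is the kind of coupling the summary describes.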

Core claim

By modeling semantics as a continuous function of Gaussian geometry and appearance in a Gaussian Feature Field, and retrieving language features through similarity matching against a structured codebook of shared primitives via codebook-guided attention, the method strengthens the geometry-semantics coupling and produces more spatially coherent and object-consistent open-vocabulary predictions in 3D scenes.

What carries the argument

The Gaussian Feature Field, which models semantics as a continuous function of Gaussian geometry and appearance, together with the codebook-guided attention mechanism that retrieves language features by similarity matching between query embeddings and learned codebook entries.
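Retrieval "by similarity matching between query embeddings and learned codebook entries" can be sketched as softmax attention over cosine similarities. This is a minimal illustration under stated assumptions: splitting the codebook into separate key and value arrays, using cosine similarity, the `temperature` parameter, and the function names are all hypothetical choices, not details confirmed by the source.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def codebook_attention(queries, codebook_keys, codebook_values, temperature=0.1):
    """Retrieve language features by matching queries against codebook entries.

    queries:         (N, d) per-Gaussian or per-pixel query embeddings
    codebook_keys:   (K, d) entries used for matching (assumed layout)
    codebook_values: (K, D) language descriptors returned by the lookup
    """
    sim = l2_normalize(queries) @ l2_normalize(codebook_keys).T  # (N, K) cosine similarity
    logits = sim / temperature
    logits -= logits.max(axis=-1, keepdims=True)                 # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)               # softmax over entries
    return weights @ codebook_values, weights

keys = np.eye(3)
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[0.9, 0.1, 0.0], [0.0, 0.0, 2.0]])
retrieved, attn = codebook_attention(queries, keys, values, temperature=0.05)
```

A low temperature pushes each query toward a single entry, so queries that match the same entry retrieve nearly the same descriptor. That is one plausible route to lower variance inside an object.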

If this is right

Semantic predictions become more consistent across multi-view observations of the same 3D structure.
Intra-object feature variance decreases, producing cleaner object boundaries in 3D.
The codebook entries become semantically interpretable, revealing the primitives the model has learned.
Segmentation quality improves on both 2D projection and direct 3D evaluation benchmarks.
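For the direct 3D evaluation, a query can be pictured as selecting the Gaussians whose language feature is close to a text embedding and scoring the selection against a ground-truth mask. The sketch below assumes cosine similarity with a fixed threshold and plain intersection-over-union; the threshold value and the function names are hypothetical, and the paper's actual selection rule may differ.

```python
import numpy as np

def select_gaussians(gaussian_feats, text_embedding, threshold=0.5):
    """Return a boolean mask of Gaussians matching the text query.

    gaussian_feats: (N, D) per-Gaussian language features
    text_embedding: (D,) embedding of the query text
    """
    g = gaussian_feats / np.linalg.norm(gaussian_feats, axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    sim = g @ t                      # cosine similarity per Gaussian
    return sim > threshold, sim

def iou(pred_mask, gt_mask):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0:
        return 0.0                   # convention chosen here for empty masks
    return float(np.logical_and(pred_mask, gt_mask).sum() / union)

feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
mask, sim = select_gaussians(feats, np.array([1.0, 0.0]), threshold=0.5)
score = iou(mask, np.array([True, False, False]))
```

Averaging `iou` over query classes gives an mIoU-style number of the kind reported in the figure captions.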

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same geometry-conditioned field could be tested on dynamic scenes by updating codebook entries across time steps.
If the codebook size is varied, the trade-off between consistency and vocabulary coverage becomes measurable.
The approach suggests that other 3D representations might benefit from explicit geometry-to-semantics conditioning rather than post-hoc fusion.

Load-bearing premise

A learned structured codebook of shared semantic primitives plus similarity-based attention will enforce object-level consistency without losing open-vocabulary capability or creating new inconsistencies.

What would settle it

Semantic feature variance measured inside individual objects remains high or open-vocabulary query performance drops below baselines on held-out scenes when the codebook attention is ablated.
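The variance test can be made concrete with a per-object statistic. The sketch below assumes per-Gaussian (or per-point) features and ground-truth object ids are available as arrays; the function name and the choice of mean squared distance to the object's mean feature are illustrative, not a metric defined in the source.

```python
import numpy as np

def intra_object_variance(features, object_ids):
    """Mean squared distance to the per-object mean feature.

    features:   (N, D) semantic features
    object_ids: (N,) integer object label per feature
    Returns a dict mapping object id to its variance score.
    """
    result = {}
    for obj in np.unique(object_ids):
        f = features[object_ids == obj]
        centered = f - f.mean(axis=0)
        result[int(obj)] = float((centered ** 2).sum(axis=1).mean())
    return result

features = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [5.0, 5.0]])
object_ids = np.array([0, 0, 1, 1])
scores = intra_object_variance(features, object_ids)  # {0: 1.0, 1: 0.0}
```

Comparing these scores with and without the codebook attention is the kind of ablation the statement above has in mind.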

Figures

Figures reproduced from arXiv: 2605.06088 by Federico Tombari, Kunyi Li, Michael Niemeyer, Nassir Navab, Sen Wang, Stefano Gasperini.

**Figure 1.** OpenGaFF is an open-vocabulary 3D scene understanding method that achieves precise segmentation and consistently high vision-language similarity scores in both 2D and 3D evaluations.

**Figure 2.** Overview of OpenGaFF. We first preprocess RGB images using foundation models to generate ground-truth language features. These features are clustered for structured language codebook initialization, while PCA is applied to per-view feature maps to obtain low-dimensional representations for supervising the Gaussian Feature Field. In Stage 1, the Gaussian Feature Field is trained by 2D feature distillation. …

**Figure 3.** Qualitative Evaluation of 2D and 3D Open-Vocabulary Query on LERF-OVS [28]. We visualize vision-language similarity as 2D heatmaps. For the shown 3D results, we perform open-vocabulary segmentation directly in 3D and render selected Gaussians. Our method achieves more precise and consistent segmentation in both 2D and 3D.

**Figure 4.** Qualitative Evaluation of 3D Open-Vocabulary Query on ScanNet-v2 [33]. We highlight the Gaussians corresponding to the query text. Our approach produces more spatially consistent responses, while LangSplatV2 [10] exhibits significant noise and inconsistent activations.

**Figure 5.** What Does the Codebook Capture? LangSplatV2 [10] encodes semantics in a distributed manner, where a single entry may represent multiple objects or an object may span multiple entries. In contrast, our method learns disentangled semantic units.

**Figure 6.** Ablation Studies. We conduct comprehensive ablation studies to demonstrate the effect of the different proposed contributions and report the mIoU of 3D OVS on the whole scene.

**Figure 7.** Ablation on Entropy Loss. We report the mIoU of 3D OVS on the Figurines scene. Larger λ_entropy values encourage stricter object-level bindings but may over-specialize entries, hurting rare object learning due to limited observations.

**Figure 8.** Illustration of Object-Level Feature Consistency. Compared with the language feature maps predicted by LangSplatV2 [10], ours are more consistent and clearer, demonstrating superior segmentation performance. Panel labels: Rendered RGB; Rendered LD Feature; Predicted Language Feature (Ours); Predicted Language Feature (LangSplatV2); View 1 and View 2.

**Figure 9.** Visualization of Codebook Entries. We present per-entry heatmaps and their corresponding masked RGB images to visualize the regions each codebook entry attends to. These results demonstrate that our codebook effectively captures disentangled and semantically meaningful units.

**Figure 10.** Qualitative Evaluation of 2D Open-Vocabulary Segmentation on MipNeRF360 [32]. Our method can predict more precise and consistent segmentation in both 2D and 3D.

**Figure 11.** Additional Qualitative Evaluation of 2D and 3D Open-Vocabulary Segmentation on LERF-OVS [28]. Our method can predict more precise and consistent segmentation in both 2D and 3D.

**Figure 12.** Additional Qualitative Evaluation of 3D Open-Vocabulary Segmentation on ScanNet-v2 [33]. We visualize the language feature point cloud. Our method predicts cleaner and more consistent language features.
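The preprocessing outlined in the Figure 2 caption (clustering language features to initialize the codebook, and PCA to obtain low-dimensional supervision targets) can be sketched with NumPy alone. The caption does not name the clustering algorithm, so plain k-means is used here as a stand-in; the function names, iteration count, and seed are likewise assumptions.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project (N, D) features onto their top principal directions."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]                       # (n_components, D)
    return centered @ basis.T, basis, mean

def assign_to_centers(features, centers):
    """Index of the nearest center (squared Euclidean distance) per feature."""
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def kmeans_codebook(features, k, n_iters=20, seed=0):
    """Initialise k codebook entries as k-means cluster centers (assumed choice)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), size=k, replace=False)
    centers = features[start].astype(float)
    for _ in range(n_iters):
        assign = assign_to_centers(features, centers)
        for j in range(k):
            members = features[assign == j]
            if len(members) > 0:                    # keep old center if cluster is empty
                centers[j] = members.mean(axis=0)
    return centers, assign_to_centers(features, centers)

feats = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
codebook, labels = kmeans_codebook(feats, k=2)
low_dim, basis, mean = pca_reduce(feats, n_components=1)
```

In this toy example the two well-separated groups end up in different clusters, and `low_dim` holds one coordinate per feature along the dominant direction.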

Original abstract

Understanding open-vocabulary 3D scenes with Gaussian-based representations remains challenging due to fragmented and spatially inconsistent semantic predictions across multi-view observations. In this paper, we present OpenGaFF, a novel framework for open-vocabulary 3D scene understanding built upon 3D Gaussian Splatting. At the core of our method is a Gaussian Feature Field that models semantics as a continuous function of Gaussian geometry and appearance. By explicitly conditioning semantic predictions on geometric structure, this formulation strengthens the coupling between geometry and semantics, leading to improved spatial coherence across similar structures in 3D space. To further enforce object-level semantic consistency, we introduce a structured codebook that serves as a set of shared semantic primitives. Furthermore, a codebook-guided attention mechanism is proposed to retrieve language features via similarity matching between query embeddings and learned codebook entries, enabling robust open-vocabulary reasoning while reducing intra-object feature variance. Extensive experiments on standard 2D and 3D open-vocabulary benchmarks demonstrate that our method consistently outperforms prior approaches, achieving improved segmentation quality, stronger 3D semantic consistency and a semantically interpretable codebook that provides insight into the learned representation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

OpenGaFF adds geometry-conditioned feature fields and codebook attention to Gaussian Splatting for open-vocabulary 3D consistency, but the abstract gives almost no technical specifics against which to check the claims.

This paper introduces a Gaussian Feature Field that conditions semantic predictions on the geometry and appearance of the Gaussians, plus a structured codebook with similarity-based attention to retrieve language features and reduce intra-object variance. The core move is to strengthen the link between geometry and semantics so that predictions stay coherent across views and similar structures, while the codebook acts as shared primitives for object-level consistency in open-vocabulary settings. That combination is the actual new piece on top of existing Gaussian Splatting work.

It directly targets the fragmentation problem that shows up in multi-view semantic labeling, and the framing around spatial coherence and reduced variance makes sense as a practical fix. The approach is straightforward enough that someone already working in this area could implement the high-level idea without too much trouble.

The main weakness is that everything stays conceptual: no equations for the feature field or attention, no description of how the codebook is trained or initialized, and no ablation or dataset details in the abstract. Without those, the claims of consistent outperformance and a semantically interpretable codebook cannot be verified against the paper's own evidence. The assumption that codebook matching will handle arbitrary queries without introducing new inconsistencies or losing flexibility is plausible but untested here.

This is for people already doing open-vocabulary 3D labeling or Gaussian Splatting extensions; a reader outside that niche gets little. It deserves peer review because the problem is real and the architectural direction is concrete, even if the current write-up needs the full experiments and implementation to stand up.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces OpenGaFF, a framework for open-vocabulary 3D scene understanding built upon 3D Gaussian Splatting. It proposes a Gaussian Feature Field that models semantics as a continuous function of Gaussian geometry and appearance, explicitly conditioning semantic predictions on geometric structure to improve spatial coherence. A structured codebook of shared semantic primitives is introduced, along with a codebook-guided attention mechanism that retrieves language features via similarity matching to enable robust open-vocabulary reasoning and reduce intra-object feature variance. Extensive experiments on standard 2D and 3D benchmarks are claimed to show consistent outperformance in segmentation quality and 3D semantic consistency, with the codebook providing semantic interpretability.

Significance. If the empirical results and technical mechanisms hold, the work would advance open-vocabulary 3D understanding by strengthening the geometry-semantics coupling in Gaussian representations and offering an interpretable codebook for consistency. The approach addresses fragmentation in multi-view semantic predictions, which is a relevant problem in the field.

Major comments (2)

[Abstract] The central claim that conditioning semantic predictions on geometric structure strengthens the coupling between geometry and semantics (leading to improved spatial coherence) is presented without any equations, derivations, or pseudocode; this makes the load-bearing assumption that the Gaussian Feature Field formulation achieves this coupling unverifiable from the provided text.

[Abstract] The assertion of consistent outperformance on standard 2D and 3D open-vocabulary benchmarks, improved segmentation quality, and stronger 3D semantic consistency is made without reference to specific datasets, metrics, baselines, ablation studies, or error bars; this undermines assessment of the empirical contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each point below and will revise the abstract to improve clarity and self-containment while preserving its summary nature.

Referee: [Abstract] The central claim that conditioning semantic predictions on geometric structure strengthens the coupling between geometry and semantics (leading to improved spatial coherence) is presented without any equations, derivations, or pseudocode; this makes the load-bearing assumption that the Gaussian Feature Field formulation achieves this coupling unverifiable from the provided text.

Authors: We agree that the abstract presents the claim at a high level. The Gaussian Feature Field is defined in the manuscript as a continuous function explicitly conditioned on Gaussian geometry (position, covariance, and appearance attributes), with the conditioning implemented via a geometry-aware feature extractor detailed in Section 3. The abstract avoids equations to maintain accessibility, but we will revise it to include a concise description of the conditioning mechanism (e.g., 'by parameterizing semantic features as a function of each Gaussian's geometric attributes') to make the coupling more explicit without requiring derivations.

Revision: yes

Referee: [Abstract] The assertion of consistent outperformance on standard 2D and 3D open-vocabulary benchmarks, improved segmentation quality, and stronger 3D semantic consistency is made without reference to specific datasets, metrics, baselines, ablation studies, or error bars; this undermines assessment of the empirical contribution.

Authors: The abstract summarizes the experimental findings reported in full in the Experiments section, which includes quantitative results on standard benchmarks, comparisons to baselines, ablation studies, and consistency metrics. We acknowledge that the abstract could be strengthened by referencing key evaluation aspects. We will revise the abstract to include brief mentions of the evaluation scope (e.g., 'on standard 2D and 3D benchmarks with segmentation and consistency metrics') to better contextualize the claims.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and description contain no equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations. The claims describe a conceptual framework (Gaussian Feature Field conditioned on geometry, codebook-guided attention) without any reduction of outputs to inputs by construction. No steps match the enumerated circularity patterns, as there are no mathematical steps or uniqueness theorems invoked that could be inspected for equivalence to their own premises. The derivation chain is therefore self-contained at the level of description.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; ledger left empty.

pith-pipeline@v0.9.0 · 5746 in / 1053 out tokens · 23035 ms · 2026-05-25T06:08:57.925848+00:00 · methodology

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 2 internal anchors

[1] Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-drive world models for autonomous driving. In European conference on computer vision, pages 55–72. Springer, 2024.

[2] Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14749–14759, 2024.

[3] Shuo Xing, Chengyuan Qian, Yuping Wang, Hongyuan Hua, Kexin Tian, Yang Zhou, and Zhengzhong Tu. Openemma: Open-source multimodal model for end-to-end autonomous driving. In Proceedings of the Winter Conference on Applications of Computer Vision, pages 1001–1009, 2025.

[4] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683, 2018.

[5] Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal ai agents. In Proceedings of the computer vision and pattern recognition conference, pages 14203–14214, 2025.

[6] Xiaojun Chang, Pengzhen Ren, Pengfei Xu, Zhihui Li, Xiaojiang Chen, and Alex Hauptmann. A comprehensive survey of scene graphs: Generation and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):1–26, 2021.

[7] Nathan Hughes, Yun Chang, Siyi Hu, Rajat Talak, Rumaia Abdulhai, Jared Strader, and Luca Carlone. Foundations of spatial perception for robotics: Hierarchical representations and real-time systems. The International Journal of Robotics Research, 43(10):1457–1505, 2024.

[8] Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 815–824, 2023.

[9] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20051–20060, 2024.

[10] Wanhua Li, Yujie Zhao, Minghan Qin, Yang Liu, Yuanhao Cai, and Chuang Gan. Langsplatv2: High-dimensional 3d language gaussian splatting with 450+ fps. In Annual Conference on Neural Information Processing Systems, 2025.

[11] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.

[12] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546, 2022.

[13] Yanmin Wu, Jiarui Meng, Haijie Li, Chenming Wu, Yahao Shi, Xinhua Cheng, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, et al. Opengaussian: Towards point-level 3d gaussian-based open vocabulary understanding. Advances in Neural Information Processing Systems, 37:19114–19138, 2024.

[14] Elena Alegret, Kunyi Li, Sen Wang, Siyun Liang, Michael Niemeyer, Stefano Gasperini, Nassir Navab, and Federico Tombari. Gala: Guided attention with language alignment for open vocabulary gaussian splatting. arXiv preprint arXiv:2508.14278, 2025.

[15] Siyun Liang, Sen Wang, Kunyi Li, Michael Niemeyer, Stefano Gasperini, Hendrik Lensch, Nassir Navab, and Federico Tombari. Supergseg: Open-vocabulary 3d segmentation with structured super-gaussians. arXiv preprint arXiv:2412.10231, 2024.

[16] Sen Wang, Kunyi Li, Siyun Liang, Elena Alegret, Jing Ma, Nassir Navab, and Stefano Gasperini. Visibility-aware language aggregation for open-vocabulary segmentation in 3d gaussian splatting. arXiv preprint arXiv:2509.05515, 2025.

[17] Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024.

[18] Yuning Peng, Haiping Wang, Yuan Liu, Chenglu Wen, Zhen Dong, and Bisheng Yang. Gags: Granularity-aware feature distillation for language gaussian splatting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 8376–8384, 2026.

[19] Kim Jun-Seong, GeonU Kim, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, and Tae-Hyun Oh. Dr. splat: Directly referring 3d gaussian splatting via direct language embedding registration. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14137–14146, 2025.

[20] Jiahuan Cheng, Jan-Nico Zaech, Luc Van Gool, and Danda Pani Paudel. Occam’s lgs: An efficient approach for language gaussian splatting. arXiv preprint arXiv:2412.01807, 2024.

[21] Yansong Qu, Shaohui Dai, Xinyang Li, Jianghang Lin, Liujuan Cao, Shengchuan Zhang, and Rongrong Ji. Goi: Find 3d gaussians of interest with an optimizable open-vocabulary semantic-space hyperplane. In Proceedings of the 32nd ACM international conference on multimedia, pages 5328–5337, 2024.

[22] Lei Tian, Xiaomin Li, Liqian Ma, Hefei Huang, Zirui Zheng, Hao Yin, Taiqing Li, Huchuan Lu, and Xu Jia. Ccl-lgs: Contrastive codebook learning for 3d language gaussian splatting. arXiv preprint arXiv:2505.20469, 2025.

[23] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal, 2024.

[24] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer, 2024.

[25] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.

[26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[27] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

[28] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. In Proceedings of the IEEE/CVF international conference on computer vision, pages 19729–19739, 2023.

[29] Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Goldberg, Matthew Tancik, and Angjoo Kanazawa. Garfield: Group anything with radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21530–21539, 2024.

[30] Haiyang Ying, Yixuan Yin, Jinzhi Zhang, Fan Wang, Tao Yu, Ruqi Huang, and Lu Fang. Omniseg3d: Omniversal 3d segmentation via hierarchical contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20612–20622, 2024.

[31] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

[32] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022.

[33] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.

[34] Vickie Ye, Ruilong Li, Justin Kerr, Matias Turkulainen, Brent Yi, Zhuoyang Pan, Otto Seiskari, Jianbo Ye, Jeffrey Hu, Matthew Tancik, et al. gsplat: An open-source library for gaussian splatting. Journal of Machine Learning Research, 26(34):1–17, 2025.

[35] Ketan Rajshekhar Shahapure and Charles Nicholas. Cluster quality analysis using silhouette score. In 2020 IEEE 7th international conference on data science and advanced analytics (DSAA), pages 747–748. IEEE, 2020.

[36] Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5333–5343, 2024.

[1] [1]

Drivedreamer: Towards real-world-drive world models for autonomous driving

Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drivedreamer: Towards real-world-drive world models for autonomous driving. InEuropean conference on computer vision, pages 55–72. Springer, 2024

work page 2024

[2] [2]

Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving

Yuqi Wang, Jiawei He, Lue Fan, Hongxin Li, Yuntao Chen, and Zhaoxiang Zhang. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14749–14759, 2024

work page 2024

[3] Shuo Xing, Chengyuan Qian, Yuping Wang, Hongyuan Hua, Kexin Tian, Yang Zhou, and Zhengzhong Tu. Openemma: Open-source multimodal model for end-to-end autonomous driving. In Proceedings of the Winter Conference on Applications of Computer Vision, pages 1001–1009, 2025.

[4] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3674–3683, 2018.

[5] Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal ai agents. In Proceedings of the computer vision and pattern recognition conference, pages 14203–14214, 2025.

[6] Xiaojun Chang, Pengzhen Ren, Pengfei Xu, Zhihui Li, Xiaojiang Chen, and Alex Hauptmann. A comprehensive survey of scene graphs: Generation and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):1–26, 2021.

[7] Nathan Hughes, Yun Chang, Siyi Hu, Rajat Talak, Rumaia Abdulhai, Jared Strader, and Luca Carlone. Foundations of spatial perception for robotics: Hierarchical representations and real-time systems. The International Journal of Robotics Research, 43(10):1457–1505, 2024.

[8] Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 815–824, 2023.

[9] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20051–20060, 2024.

[10] Wanhua Li, Yujie Zhao, Minghan Qin, Yang Liu, Yuanhao Cai, and Chuang Gan. Langsplatv2: High-dimensional 3d language gaussian splatting with 450+ fps. In Annual Conference on Neural Information Processing Systems, 2025.

[11] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023.

[12] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546, 2022.

[13] Yanmin Wu, Jiarui Meng, Haijie Li, Chenming Wu, Yahao Shi, Xinhua Cheng, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, et al. Opengaussian: Towards point-level 3d gaussian-based open vocabulary understanding. Advances in Neural Information Processing Systems, 37:19114–19138, 2024.

[14] Elena Alegret, Kunyi Li, Sen Wang, Siyun Liang, Michael Niemeyer, Stefano Gasperini, Nassir Navab, and Federico Tombari. Gala: Guided attention with language alignment for open vocabulary gaussian splatting. arXiv preprint arXiv:2508.14278, 2025.

[15] Siyun Liang, Sen Wang, Kunyi Li, Michael Niemeyer, Stefano Gasperini, Hendrik Lensch, Nassir Navab, and Federico Tombari. Supergseg: Open-vocabulary 3d segmentation with structured super-gaussians. arXiv preprint arXiv:2412.10231, 2024.

[16] Sen Wang, Kunyi Li, Siyun Liang, Elena Alegret, Jing Ma, Nassir Navab, and Stefano Gasperini. Visibility-aware language aggregation for open-vocabulary segmentation in 3d gaussian splatting. arXiv preprint arXiv:2509.05515, 2025.

[17] Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024.

[18] Yuning Peng, Haiping Wang, Yuan Liu, Chenglu Wen, Zhen Dong, and Bisheng Yang. Gags: Granularity-aware feature distillation for language gaussian splatting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 8376–8384, 2026.

[19] Kim Jun-Seong, GeonU Kim, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, and Tae-Hyun Oh. Dr. splat: Directly referring 3d gaussian splatting via direct language embedding registration. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14137–14146, 2025.

[20] Jiahuan Cheng, Jan-Nico Zaech, Luc Van Gool, and Danda Pani Paudel. Occam’s lgs: An efficient approach for language gaussian splatting. arXiv preprint arXiv:2412.01807, 2024.

[21] Yansong Qu, Shaohui Dai, Xinyang Li, Jianghang Lin, Liujuan Cao, Shengchuan Zhang, and Rongrong Ji. Goi: Find 3d gaussians of interest with an optimizable open-vocabulary semantic-space hyperplane. In Proceedings of the 32nd ACM international conference on multimedia, pages 5328–5337, 2024.

[22] Lei Tian, Xiaomin Li, Liqian Ma, Hefei Huang, Zirui Zheng, Hao Yin, Taiqing Li, Huchuan Lu, and Xu Jia. Ccl-lgs: Contrastive codebook learning for 3d language gaussian splatting. arXiv preprint arXiv:2505.20469, 2025.

[23] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal, 2024.

[24] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer, 2024.

[25] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.

[26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[27] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

[28] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. In Proceedings of the IEEE/CVF international conference on computer vision, pages 19729–19739, 2023.

[29] Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Goldberg, Matthew Tancik, and Angjoo Kanazawa. Garfield: Group anything with radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21530–21539, 2024.

[30] Haiyang Ying, Yixuan Yin, Jinzhi Zhang, Fan Wang, Tao Yu, Ruqi Huang, and Lu Fang. Omniseg3d: Omniversal 3d segmentation via hierarchical contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20612–20622, 2024.

[31] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

[32] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022.

[33] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.

[34] Vickie Ye, Ruilong Li, Justin Kerr, Matias Turkulainen, Brent Yi, Zhuoyang Pan, Otto Seiskari, Jianbo Ye, Jeffrey Hu, Matthew Tancik, et al. gsplat: An open-source library for gaussian splatting. Journal of Machine Learning Research, 26(34):1–17, 2025.

[35] Ketan Rajshekhar Shahapure and Charles Nicholas. Cluster quality analysis using silhouette score. In 2020 IEEE 7th international conference on data science and advanced analytics (DSAA), pages 747–748. IEEE, 2020.

[36] Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5333–5343, 2024.

A Implementation Details

Preprocessing. We follow LangSplat [9] to preprocess SAM [25] masks and CLIP [26] language...