QueryGaussian: Scalable and Training-Free Open-Vocabulary 3D Instance Retrieval

Chao Yue; Dongming Zhang; Jian Xue; Ke Lu; Xiuyuan Zhu; Zijie Yang

arXiv: 2606.19733 · v1 · pith:FJZQ4OYSnew · submitted 2026-06-18 · cs.CV · cs.AI

Xiuyuan Zhu, Ke Lu, Zijie Yang, Chao Yue, Jian Xue, Dongming Zhang

Pith reviewed 2026-06-26 17:50 UTC · model grok-4.3

keywords 3D instance retrieval · open-vocabulary · Gaussian splatting · training-free · scalable 3D search · semantic lifting · temporal fusion · city-scale scenes

The pith

QueryGaussian retrieves open-vocabulary 3D instances from city-scale scenes by lifting 2D masks into 3D without training or scene-wide embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing approaches embed semantic features into every 3D primitive, so memory and compute costs grow directly with scene size and cause failures on large environments. QueryGaussian instead uses pre-trained 2D models to interpret text prompts and projects segmentation masks into 3D through concurrent maximum-weight association. A temporal fusion module with multi-stage adaptive density clustering resolves projection ambiguities across views. The result matches prior accuracy while cutting GPU memory by over 70 percent and speeding inference by 180 times, allowing retrieval on scenes with tens of millions of Gaussians on ordinary hardware.

Core claim

QueryGaussian is a training-free framework for open-vocabulary 3D instance retrieval that decouples semantic understanding from geometric representation by lifting 2D segmentation masks into 3D via concurrent maximum-weight association and a temporal fusion module with multi-stage adaptive density clustering, thereby avoiding the linear scaling of memory and compute that occurs when semantic features are distilled into every primitive.
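The linear scaling named in the core claim can be made concrete with a rough back-of-the-envelope estimate. The sketch below is illustrative only: the function name, the 512-dimensional float32 feature size, and the scene sizes are assumptions chosen for the example, not figures from the paper.

```python
def semantic_memory_gib(num_gaussians: int, feature_dim: int, bytes_per_value: int = 4) -> float:
    """Memory needed to store one feature vector per Gaussian, in GiB."""
    return num_gaussians * feature_dim * bytes_per_value / 2**30


# Hypothetical setting: 512-dim float32 features (an assumption, not the paper's value).
for n in (1_000_000, 10_000_000, 50_000_000):
    print(f"{n:>11,d} Gaussians -> {semantic_memory_gib(n, 512):6.1f} GiB")
```

Under these assumed sizes the feature store alone grows in direct proportion to the number of Gaussians, which is the cost an instance-level query path avoids by never attaching features to primitives.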

What carries the argument

Instance-level query mechanism that lifts 2D segmentation masks into 3D via concurrent maximum-weight association plus temporal fusion with multi-stage adaptive density clustering.
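One way to picture the lifting step is as a voting scheme. The sketch below is not the authors' implementation. It assumes that each rendered view comes with a per-pixel map holding the index of the Gaussian with the largest blending weight (`-1` where nothing is rendered), and that a 2D model has already produced a boolean mask for the text query. The names `lift_mask`, `fuse_views`, and `min_ratio` are hypothetical, and the cross-view vote ratio is a simple stand-in for the paper's temporal fusion module.

```python
from collections import Counter


def lift_mask(max_weight_ids, mask):
    """Return (hits, seen): per-Gaussian pixel counts inside the mask and overall."""
    hits, seen = Counter(), Counter()
    for id_row, mask_row in zip(max_weight_ids, mask):
        for gid, inside in zip(id_row, mask_row):
            if gid < 0:  # pixel not covered by any Gaussian
                continue
            seen[gid] += 1
            if inside:
                hits[gid] += 1
    return hits, seen


def fuse_views(views, min_ratio=0.5):
    """Keep Gaussians whose visible pixels fall inside the mask at least min_ratio of the time."""
    total_hits, total_seen = Counter(), Counter()
    for ids, mask in views:
        hits, seen = lift_mask(ids, mask)
        total_hits.update(hits)
        total_seen.update(seen)
    return {g for g, n in total_seen.items() if total_hits[g] / n >= min_ratio}


# Two toy 2x3 views: (max-weight Gaussian id per pixel, query mask per pixel).
views = [
    ([[0, 0, 1], [2, 1, -1]], [[True, True, False], [False, True, False]]),
    ([[0, 1, 1], [2, 2, -1]], [[True, False, False], [False, False, False]]),
]
selected = fuse_views(views)
print(selected)
```

The point of the sketch is that the only per-Gaussian state is a pair of counters for the current query, so nothing semantic has to be stored in the scene itself.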

If this is right

GPU memory usage drops by more than 70 percent relative to scene-embedding baselines.
Inference accelerates by a factor of 180 while accuracy remains comparable to state-of-the-art methods.
Retrieval becomes feasible on city-scale scenes holding tens of millions of Gaussians using only consumer-grade hardware.
No per-scene training or tuning is required because the method relies on off-the-shelf 2D vision models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decoupling of semantics from geometry could be tested on other 3D representations such as point clouds or meshes to check whether the efficiency gain generalizes.
The temporal fusion module might extend naturally to video sequences, allowing retrieval in dynamic rather than static scenes.
Because the method avoids storing semantic features per primitive, it opens the possibility of on-the-fly retrieval during interactive navigation of very large environments.

Load-bearing premise

Lifting 2D segmentation masks into consistent 3D instances through maximum-weight association and adaptive density clustering will preserve semantic-visual alignment across views without any training or scene-specific tuning.
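The clustering half of this premise can be illustrated with a plain density-based filter. The sketch below implements ordinary DBSCAN in pure Python over the centers of the lifted Gaussians and discards points labeled as noise. It is a single-stage stand-in with fixed, hand-picked `eps` and `min_pts`, not the multi-stage adaptive variant the paper describes, and the brute-force neighbor search is only suitable for toy inputs.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force radius query; includes the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:  # noise reachable from a core point becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels


# Four tightly grouped centers plus one distant floater (toy data).
points = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0), (0.1, 0.1, 0), (5, 5, 5)]
labels = dbscan(points, eps=0.2, min_pts=3)
kept = [p for p, label in zip(points, labels) if label != -1]
print(labels, len(kept))
```

A fixed `eps` is exactly what would be expected to break when point density varies across a large scene, which is why the adaptivity of the paper's clustering matters for the premise.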

What would settle it

Apply the method to a city-scale scene containing at least ten million Gaussians, run a set of text-prompt retrievals on consumer hardware, and check whether it finishes without out-of-memory errors and matches ground-truth instance labels at rates comparable to prior methods.
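For the accuracy half of such a check, one common choice is intersection-over-union between the retrieved set of Gaussian indices and the ground-truth set for each prompt. The sketch below assumes that form of ground truth is available; the paper's actual metric is not stated in the text above, and `instance_iou` is a hypothetical helper.

```python
def instance_iou(predicted, ground_truth):
    """Intersection-over-union of two collections of Gaussian indices."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    union = predicted | ground_truth
    # Two empty sets are treated as a perfect match.
    return len(predicted & ground_truth) / len(union) if union else 1.0


# Toy example: half of the retrieved indices overlap the ground truth.
score = instance_iou([1, 2, 3, 4], [3, 4, 5, 6])
print(f"IoU = {score:.3f}")
```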

Figures

Figures reproduced from arXiv: 2606.19733 by Chao Yue, Dongming Zhang, Jian Xue, Ke Lu, Xiuyuan Zhu, Zijie Yang.

**Figure 1.** Overview of QueryGaussian. Given a 3DGS scene and a text query, the framework first renders multi-view images and records per-pixel maximum …

**Figure 2.** Qualitative comparisons on the small-scale indoor scene dataset. QueryGaussian produces cleaner segmentation with fewer floaters and noise …

**Figure 3.** Qualitative results on large-scale outdoor scenes. Existing scene-level methods fail due to OOM, while QueryGaussian successfully localizes …

**Figure 4.** Overview of the 3D spatial reasoning agent. The LLM decomposes a user query into retrieval instructions, QueryGaussian returns masks and 3D …

Original abstract

Efficiently retrieving specific 3D instances from large-scale scenes via natural language prompts remains a formidable challenge in multimedia analysis. Existing approaches predominantly follow a "scene-level embedding" paradigm, which requires distilling high-dimensional semantic features into every 3D primitive. This strategy suffers from a fundamental architectural bottleneck: memory and computational costs scale linearly with scene complexity, inevitably triggering out-of-memory (OOM) failures in city-scale environments. To address this barrier, we propose QueryGaussian, a training-free framework for expeditious and scalable open-vocabulary 3D instance retrieval. Unlike holistic semantic distillation, QueryGaussian employs an instance-level query mechanism that decouples semantic understanding from geometric representation. Specifically, we leverage pre-trained 2D vision models to interpret user prompts and lift segmentation masks into 3D via a concurrent maximum-weight association strategy, ensuring semantic-visual consistency. To mitigate projection ambiguity, we introduce a temporal fusion module with multi-stage adaptive density clustering. Experimental results demonstrate that QueryGaussian not only matches the accuracy of state-of-the-art methods but also delivers a decisive efficiency leap, reducing GPU memory usage by over 70% and accelerating inference by 180x. Crucially, QueryGaussian enables expeditious instance retrieval on city-scale scenes containing tens of millions of Gaussians using consumer-grade hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

QueryGaussian shifts to instance-level querying with 2D mask lifting and temporal clustering to cut memory and speed up retrieval on large Gaussian scenes, but the big efficiency numbers rest on unshown experimental details.

The main point is that this work avoids the linear memory cost of embedding semantics into every 3D Gaussian by querying instances directly. It lifts 2D segmentation masks from pre-trained models using concurrent maximum-weight association and adds a temporal fusion module with multi-stage adaptive density clustering to handle cross-view issues. The result is presented as training-free and able to run city-scale retrieval on consumer hardware while matching prior accuracy.

This approach is a direct response to the scaling problem in scene-level methods, and the decoupling of semantics from geometry is a sensible move that could make open-vocabulary search practical for very large environments. The use of existing 2D models without any fine-tuning keeps the method lightweight and easy to reproduce in principle.

The soft spots sit in the evaluation. The abstract states 70% memory reduction and 180x speedup with accuracy parity, yet the paper must show the exact datasets, baselines, run counts, and whether city-scale tests used real tens-of-millions-of-Gaussians scenes or smaller proxies. Projection ambiguities and density variations are common in such data, so the lifting and clustering steps need clear evidence that they maintain consistency without scene-specific tuning; if the ablations or failure cases are missing, that weakens the central claim. Minor gaps in reporting error bars or data exclusions would also need fixing.

This paper is for researchers working on scalable 3D scene understanding, robotics mapping, or AR retrieval who already use Gaussian representations. A reader focused on efficiency trade-offs would get concrete ideas from the query path even before the numbers are fully verified.

It deserves peer review so the implementation details and results can be checked against the claims.

Referee Report

2 major / 1 minor

Summary. The paper proposes QueryGaussian, a training-free open-vocabulary 3D instance retrieval framework for large-scale Gaussian scenes. It decouples semantics from geometry by using pre-trained 2D vision models to process natural language prompts, lifting 2D segmentation masks into 3D via concurrent maximum-weight association, and applying a temporal fusion module with multi-stage adaptive density clustering to address projection ambiguities. The central claims are that this achieves accuracy parity with state-of-the-art methods while reducing GPU memory by over 70% and accelerating inference by 180x, enabling city-scale retrieval on scenes with tens of millions of Gaussians using consumer hardware.

Significance. If the accuracy and efficiency claims hold under rigorous evaluation, the work would be significant for enabling scalable 3D instance retrieval without scene-specific training or linear memory scaling. The training-free design leveraging existing 2D models and the explicit focus on city-scale feasibility are clear strengths that could impact multimedia analysis and 3D scene understanding applications.

major comments (2)

[Abstract] Abstract: the central efficiency claims (70% memory reduction, 180x speedup, city-scale operation) rest on the instance-level query path preserving semantic-visual consistency, yet the abstract provides no information on evaluation datasets, baselines, error bars, or data exclusions, preventing verification of whether the concurrent maximum-weight association and multi-stage adaptive density clustering actually deliver the claimed accuracy parity.
[Method] Method description (lifting and fusion): the assumption that concurrent max-weight mask lifting plus temporal fusion with multi-stage adaptive density clustering maintains cross-view consistency without any training or scene-specific tuning is load-bearing for the scalability claim, but no analysis of robustness to projection ambiguities, view-dependent appearance, or density variations in tens-of-millions-Gaussian scenes is referenced.

minor comments (1)

[Abstract] The abstract would benefit from explicit mention of the datasets and quantitative metrics used to support the accuracy parity claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and propose targeted revisions to enhance the manuscript.

Referee: [Abstract] Abstract: the central efficiency claims (70% memory reduction, 180x speedup, city-scale operation) rest on the instance-level query path preserving semantic-visual consistency, yet the abstract provides no information on evaluation datasets, baselines, error bars, or data exclusions, preventing verification of whether the concurrent maximum-weight association and multi-stage adaptive density clustering actually deliver the claimed accuracy parity.

Authors: We agree that the abstract would benefit from additional context to support verifiability of the claims. We will revise the abstract to reference the main evaluation datasets (including city-scale scenes with tens of millions of Gaussians) and the primary baselines, and to indicate that accuracy results with error bars appear in the experiments. This change directly addresses the concern while preserving conciseness.

Revision: yes

Referee: [Method] Method description (lifting and fusion): the assumption that concurrent max-weight mask lifting plus temporal fusion with multi-stage adaptive density clustering maintains cross-view consistency without any training or scene-specific tuning is load-bearing for the scalability claim, but no analysis of robustness to projection ambiguities, view-dependent appearance, or density variations in tens-of-millions-Gaussian scenes is referenced.

Authors: The experiments section validates performance on large scenes via ablations of the fusion module. To strengthen the presentation of the load-bearing assumption, we will add a dedicated robustness analysis subsection (drawing on existing results and qualitative examples) that explicitly discusses handling of projection ambiguities, view-dependent effects, and density variations. This addition will reference the design elements without requiring new experiments.

Revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivation relies on pre-trained models and heuristics without self-referential reduction

The paper presents QueryGaussian as a training-free method that decouples semantic understanding from geometry by leveraging existing pre-trained 2D vision models, a concurrent maximum-weight association for mask lifting, and a temporal fusion module with multi-stage adaptive density clustering. No equations, fitted parameters, predictions that reduce to inputs by construction, or load-bearing self-citations appear in the provided text. The efficiency and scalability claims derive directly from this architectural choice rather than any circular step. The derivation chain is self-contained against external pre-trained components.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be extracted. The method relies on pre-trained 2D vision models and geometric lifting assumptions whose details are not provided.


[1] [1]

Lifting by gaussians: A simple, fast and flexible method for 3d instance segmentation

Rohan Chacko, Nicolai H ¨ani, Eldar Khaliullin, Lin Sun, and Douglas Lee. Lifting by gaussians: A simple, fast and flexible method for 3d instance segmentation. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3497–3507. IEEE, 2025

2025

[2] [2]

Tensorf: Tensorial radiance fields

Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. InEuropean Conference on Computer Vision (ECCV), 2022

2022

[3] [3]

Omnire: Omni urban scene reconstruction

Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo de Lutio, Jan- ick Martinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler, Marco Pavone, Li Song, and Yue Wang. Omnire: Omni urban scene reconstruction. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[4] [4]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

2025

[5] [5]

A density-based algorithm for discovering clusters in large spatial databases with noise

Martin Ester, Hans-Peter Kriegel, J ¨”org Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. InProceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, page 226–231. AAAI Press, 1996

1996

[6] [6]

Mini-splatting: Representing scenes with a constrained number of gaussians

Guangchi Fang and Bing Wang. Mini-splatting: Representing scenes with a constrained number of gaussians. InEuropean Conference on Computer Vision, 2024

2024

[7] [7]

Trips: Trilinear point splatting for real-time radiance field rendering

Linus Franke, Darius R ¨”uckert, Laura Fink, and Marc Stamminger. Trips: Trilinear point splatting for real-time radiance field rendering. Computer Graphics Forum, 43(2), 2024

2024

[8] [8]

Plenoxels: Radiance fields without neural networks

Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Ben- jamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. InCVPR, 2022

2022

[9] [9]

Garbin, Marek Kowalski, Matthew Johnson, Jamie Shot- ton, and Julien Valentin

Stephan J. Garbin, Marek Kowalski, Matthew Johnson, Jamie Shot- ton, and Julien Valentin. Fastnerf: High-fidelity neural rendering at 200fps. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14346–14355, October 2021

2021

[10] [10]

A hierar- chical compression technique for 3d gaussian splatting compression, 2025

He Huang, Wenjie Huang, Qi Yang, Yiling Xu, and Zhu li. A hierar- chical compression technique for 3d gaussian splatting compression, 2025

2025

[11] [11]

Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes

Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4220– 4230, June 2024

2024

[12] Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi-Wing Fu, and Jiaya Jia. Pointgroup: Dual-set point grouping for 3d instance segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020

[13] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), July 2023

[14] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19729–19739, October 2023

[15] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. In International Conference on Computer Vision (ICCV), 2023

[16] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023

[17] Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3205–3215, 2023

[18] Jiaqi Lin, Zhihao Li, Xiao Tang, Jianzhuang Liu, Shiyong Liu, Jiayue Liu, Yangdi Lu, Xiaofei Wu, Songcen Xu, Youliang Yan, and Wenming Yang. Vastgaussian: Vast 3d gaussians for large scene reconstruction. In CVPR, 2024

[19] Dong Liu, Zhiyong Wang, and Peiyuan Chen. Dsem-nerf: Multimodal feature fusion and global–local attention for enhanced 3d scene reconstruction. Information Fusion, 115:102752, 2025

[20] Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, and Shijian Lu. Weakly supervised 3d open-vocabulary segmentation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

[21] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023

[22] Yang Liu, Chuanchen Luo, Lue Fan, Naiyan Wang, Junran Peng, and Zhaoxiang Zhang. Citygaussian: Real-time high-quality large-scale scene rendering with gaussians. In European Conference on Computer Vision, pages 265–282. Springer, 2025

[23] Leland McInnes, John Healy, and Steve Astels. hdbscan: Hierarchical density based clustering. Journal of Open Source Software, 2(11):205, 2017

[24] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: representing scenes as neural radiance fields for view synthesis. Commun. ACM, 65(1):99–106, December 2021

[25] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, July 2022

[26] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20051–20060, June 2024

[27] Shi Qiu, Binzhu Xie, Qixuan Liu, and Pheng-Ann Heng. Advancing Extended Reality with 3D Gaussian Splatting: Innovations and Prospects. In 2025 IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality (AIxVR), pages 203–208, Los Alamitos, CA, USA, January 2025. IEEE Computer Society

[28] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machin..., 2021

[29] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:..., 2024

[30] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos, 2024

[31] Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and Bo Dai. Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians. arXiv preprint arXiv:2403.17898, 2024

[32] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks, 2024

[33] Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5333–5343, June 2024

[34] Xiaowei Song, Jv Zheng, Shiran Yuan, Huan-ang Gao, Jingwei Zhao, Xiang He, Weihao Gu, and Hao Zhao. Sa-gs: Scale-adaptive gaussian splatting for training-free anti-aliasing. arXiv preprint arXiv:2403.19615, 2024

[35] Haithem Turki, Deva Ramanan, and Mahadev Satyanarayanan. Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12922–12931, June 2022

[36] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20310–20320, June 2024

[37] Yanmin Wu, Jiarui Meng, Haijie Li, Chenming Wu, Yahao Shi, Xinhua Cheng, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, and Jian Zhang. Opengaussian: Towards point-level 3d gaussian-based open vocabulary understanding. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

[38] Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians for modeling dynamic urban scenes. In ECCV, 2024

[39] Zhiwen Yan, Weng Fei Low, Yu Chen, and Gim Hee Lee. Multi-scale 3d gaussian splatting for anti-aliased rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20923–20931, 2024

[40] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20331–20341, June 2024

[41] Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaussian grouping: Segment and edit anything in 3d scenes. In ECCV, 2024

[42] Mengyang Zhao, Quan Liu, Aadarsh Jha, Ruining Deng, Tianyuan Yao, Anita Mahadevan-Jansen, Matthew J. Tyska, Bryan A. Millis, and Yuankai Huo. Voxelembed: 3d instance segmentation and tracking with voxel embedding based deep learning. In Machine Learning in Medical Imaging: 12th International Workshop, MLMI 2021, Held in Conjunction with MICCAI 2021, Stras..., 2021

[43] Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, and Ming-Hsuan Yang. Drivinggaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21634–21643, 2024