Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

Changhao Chen; Changqing Zhou; Han Zhang; Yueru Luo; Zeyu Jiang

arxiv: 2602.22667 · v2 · pith:YQAQBUNMnew · submitted 2026-02-26 · 💻 cs.CV

Pith reviewed 2026-05-21 11:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords: open-vocabulary occupancy, indoor 3D prediction, monocular input, language-embedded Gaussians, binary supervision, volumetric aggregation, semantic alignment

The pith

A monocular method predicts open-vocabulary 3D occupancy in indoor scenes from single images using only binary occupancy labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to enable embodied agents to understand indoor environments with abundant and evolving semantic categories. It adopts a geometry-only supervision approach that relies solely on binary occupied-versus-free labels rather than dense semantic annotations. The framework employs 3D Language-Embedded Gaussians to link fine-grained geometry with language-aligned embeddings. An opacity-aware, Poisson-based operator replaces prior Gaussian-to-occupancy conversions, which the authors report fail to converge under this weak supervision, while a Progressive Temperature Decay schedule gradually sharpens opacities during splatting to reduce feature mixing. On Occ-ScanNet, the combination reaches 59.50 IoU and 21.05 mIoU in the open-vocabulary setting.

Core claim

The authors establish that 3D Language-Embedded Gaussians can serve as a unified intermediate representation for open-vocabulary occupancy prediction when paired with an opacity-aware Poisson-based volumetric aggregation operator and a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, allowing stable geometry-language alignment from monocular images under binary supervision alone.

What carries the argument

3D Language-Embedded Gaussians as a unified intermediate representation that couples fine-grained 3D geometry with language-aligned semantic embeddings, stabilized by an opacity-aware Poisson-based operator and Progressive Temperature Decay schedule.
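To make the idea of an opacity-aware, Poisson-based aggregation concrete, here is a minimal sketch. It is not the authors' operator, whose exact form is not given on this page. It assumes isotropic Gaussians, treats the opacity-weighted sum of kernel values at a query point as a Poisson rate, and reads occupancy as the probability of at least one event, 1 - exp(-rate). The names `gaussian_kernel` and `poisson_occupancy` and the toy values are hypothetical.

```python
import math

def gaussian_kernel(x, mean, scale):
    """Unnormalized isotropic Gaussian kernel; equals 1 when x == mean."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, mean))
    return math.exp(-0.5 * d2 / (scale ** 2))

def poisson_occupancy(x, gaussians):
    """Assumed form: P(occupied) = 1 - exp(-rate), rate = sum_i opacity_i * kernel_i(x)."""
    rate = sum(g["opacity"] * gaussian_kernel(x, g["mean"], g["scale"]) for g in gaussians)
    # -expm1(-rate) computes 1 - exp(-rate) without losing precision for tiny rates.
    return -math.expm1(-rate)

# Toy scene (hypothetical values): two overlapping Gaussians near the origin.
gaussians = [
    {"mean": (0.0, 0.0, 0.0), "scale": 0.2, "opacity": 0.9},
    {"mean": (0.1, 0.0, 0.0), "scale": 0.2, "opacity": 0.8},
]
near = poisson_occupancy((0.05, 0.0, 0.0), gaussians)  # between the two centers
far = poisson_occupancy((2.0, 0.0, 0.0), gaussians)    # far from both
print(f"near={near:.3f} far={far:.3e}")
```

In this form a Gaussian with zero opacity contributes nothing to the rate, and the output stays in [0, 1) however many Gaussians overlap, so it can be compared directly against binary occupied/free labels.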

If this is right

Open-vocabulary occupancy becomes feasible indoors without requiring dense semantic ground truth for every object category.
Embodied agents can maintain consistent 3D semantic maps as new object categories appear over time.
The same intermediate Gaussian representation can support both geometric reconstruction and language-based queries in a single forward pass.
Binary supervision reduces the annotation burden for training occupancy models in complex indoor layouts.
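The language-based queries mentioned in the list above can be illustrated with a small sketch. It assumes each occupied voxel carries a language-aligned embedding and that a text query is scored by cosine similarity. The three-dimensional vectors, the class names, and the helpers `cosine`, `query_scores`, and `label_voxels` are hypothetical; real embeddings would come from a language encoder and be much higher-dimensional.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def query_scores(voxel_embeddings, text_embedding):
    """Per-voxel score for one text query (the kind of value a heatmap would show)."""
    return [cosine(e, text_embedding) for e in voxel_embeddings]

def label_voxels(voxel_embeddings, occupied, text_embeddings):
    """Give each occupied voxel its best-matching query name; free voxels get None."""
    labels = []
    for emb, occ in zip(voxel_embeddings, occupied):
        if not occ:
            labels.append(None)
            continue
        labels.append(max(text_embeddings, key=lambda name: cosine(emb, text_embeddings[name])))
    return labels

# Toy data (hypothetical): three voxels, two text queries.
voxels = [(1.0, 0.1, 0.0), (0.0, 1.0, 0.2), (0.3, 0.3, 0.9)]
occupied = [True, True, False]
texts = {"chair": (1.0, 0.0, 0.0), "lamp": (0.0, 1.0, 0.0)}
labels = label_voxels(voxels, occupied, texts)
print(labels)
```

Because the label set is only a dictionary of text embeddings supplied at query time, adding a category means adding an entry, with no retraining implied by this sketch.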

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same stabilization techniques might transfer to outdoor settings if adjusted for sparser point distributions and different scale ranges.
Integration with online mapping systems could allow robots to incrementally update open-vocabulary occupancy without full scene re-training.
The approach suggests a broader pattern where language embeddings are anchored to geometry through differentiable rendering rather than direct feature matching.

Load-bearing premise

Existing Gaussian-to-Occupancy operators fail to converge under binary occupancy supervision, and the proposed opacity-aware Poisson replacement together with the temperature decay schedule will produce stable alignment without new artifacts or extra dense labels.

What would settle it

A direct test would replace the Poisson operator with a standard Gaussian-to-occupancy conversion while keeping all other components fixed and measure whether convergence and alignment quality collapse on the same indoor dataset under binary supervision.

Figures

Figures reproduced from arXiv: 2602.22667 by Changhao Chen, Changqing Zhou, Han Zhang, Yueru Luo, Zeyu Jiang.

**Figure 1.** Closed- vs. open-vocabulary occupancy. Prior methods [47, 50] trained under a closed vocabulary can label only the categories predefined at training time, which restricts real-world deployment. Our open-vocabulary approach aligns language with 3D occupancy and supports text queries for arbitrary categories. Right column (Random Class): text-conditioned per-voxel scores are visualized as heatmaps; darker r…

**Figure 2.** LegoOcc framework overview. From a monocular image, a feed-forward Gaussian model produces Language-Embedded Gaussians. Training proceeds along two coupled paths. Semantic learning: we differentiably render Gaussian features to the image with Progressive Temperature Decay and align them to a training-free open-vocabulary segmenter via a cosine objective Lfeat. Geometry learning: we convert Gaussians to occ…

**Figure 3.** Comparison of temperature schedules. Linear decay decreases τ uniformly, whereas our exponential schedule rapidly approaches Tmin, allocating more iterations for the model to adapt to the low-temperature regime.

**Figure 4.** Qualitative results on Occ-ScanNet. From top to bottom: (a) input images; (b) ground-truth semantic occupancy; (c) results from our re-implemented LOcc [53]; (d) our method. Both (c) and (d) are trained with geometry-only annotations and evaluated on the closed-vocabulary annotation of Occ-ScanNet.

**Figure 5.** Open-vocabulary qualitative results. Legends list the VLM-extracted object nouns used as text queries. (a) Input image. (b) Open-vocabulary 2D segmentation for queried nouns. (c) Our 3D open-vocabulary occupancy colored by the same categories.
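The contrast drawn in the Figure 3 caption, linear versus exponential temperature decay, can be sketched as follows. The functional forms, the `rate` constant, and the values of `t_max` and `t_min` are assumptions for illustration; the page does not give the paper's exact schedule or its hyperparameters.

```python
import math

def linear_decay(step, total_steps, t_max, t_min):
    """Temperature falls from t_max to t_min at a constant rate."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_max + (t_min - t_max) * frac

def exponential_decay(step, total_steps, t_max, t_min, rate=8.0):
    """Temperature drops quickly, then stays close to t_min.

    `rate` is a hypothetical steepness constant, not a value from the paper.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_min + (t_max - t_min) * math.exp(-rate * frac)

# Hypothetical settings: 100 steps, temperature from 1.0 down to 0.1.
for step in (0, 25, 50, 100):
    lin = linear_decay(step, 100, 1.0, 0.1)
    exp_ = exponential_decay(step, 100, 1.0, 0.1)
    print(f"step={step:3d} linear={lin:.3f} exponential={exp_:.3f}")
```

With these assumed settings the exponential curve is already close to its floor by the midpoint, while the linear curve is only halfway there, which matches the caption's point that more iterations are spent in the low-temperature regime.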

Original abstract

Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian-language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at https://github.com/JuIvyy/LegoOcc.
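For readers unfamiliar with the two metrics quoted in the abstract, the following sketch shows how IoU and mIoU are commonly computed for voxel occupancy. It uses the usual definitions: IoU treats every non-free voxel as occupied, and mIoU averages per-class IoU. The Occ-ScanNet protocol may differ in details such as ignored labels or visibility masks, and skipping classes absent from both prediction and ground truth is an assumption made here. The label arrays are toy values, not results.

```python
def occupancy_iou(pred, gt, free_label=0):
    """Binary IoU: any label other than free_label counts as occupied."""
    inter = sum(1 for p, g in zip(pred, gt) if p != free_label and g != free_label)
    union = sum(1 for p, g in zip(pred, gt) if p != free_label or g != free_label)
    return inter / union if union else float("nan")

def mean_iou(pred, gt, class_ids):
    """Average per-class IoU over classes present in pred or gt (an assumption)."""
    ious = []
    for c in class_ids:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")

# Toy flattened voxel grids: 0 = free, 1 and 2 = two semantic classes.
gt = [0, 1, 1, 2, 2, 0]
pred = [0, 1, 2, 2, 0, 1]
iou = occupancy_iou(pred, gt)
miou = mean_iou(pred, gt, class_ids=[1, 2])
print(f"IoU={iou:.3f} mIoU={miou:.3f}")
```

The sketch returns fractions in [0, 1]; the figures reported in the abstract are on a 0 to 100 scale.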

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper tackles indoor open-vocabulary occupancy from monocular input by swapping in an opacity-aware Poisson operator and a temperature decay schedule, after claiming that standard Gaussian-to-occupancy operators fail to converge under binary labels.

The authors tackle open-vocabulary 3D occupancy indoors from single images under weak binary supervision. They build on language-embedded Gaussians, replace the usual Gaussian-to-occupancy aggregation step with an opacity-aware Poisson volumetric operator, and add a Progressive Temperature Decay schedule to sharpen feature alignment during splatting. On Occ-ScanNet they report 59.50 IoU and 21.05 mIoU, which they position above prior occupancy and open-vocabulary baselines.

Referee Report

2 major / 2 minor

Summary. The paper proposes a monocular framework for open-vocabulary 3D occupancy prediction in indoor scenes. It builds on 3D Language-Embedded Gaussians as a unified representation for geometry and semantics. Under a geometry-only supervision paradigm using only binary occupied/free labels, the authors replace standard Gaussian-to-occupancy operators with an opacity-aware Poisson-based aggregation because existing operators are stated to fail to converge; they also introduce a Progressive Temperature Decay schedule to mitigate feature mixing in semantic alignment. On Occ-ScanNet the method reports 59.50 IoU and 21.05 mIoU, claiming to surpass prior occupancy and open-vocabulary approaches.

Significance. If the central claims hold, the work is significant for advancing open-vocabulary indoor occupancy prediction under weak supervision, which is relevant for embodied agents operating in dense, semantically rich environments. The geometry-only paradigm and the use of Gaussians as an intermediate representation are promising directions. Explicit credit is given for the commitment to release code, which supports reproducibility.

major comments (2)

[Abstract and §3, geometry supervision] The claim that 'existing Gaussian-to-Occupancy operators fail to converge under such weak supervision' is load-bearing for the introduction of the opacity-aware Poisson replacement, yet the manuscript provides no training curves, loss plots, or controlled ablation demonstrating divergence or collapse of standard splatting operators under binary labels. Without this evidence the necessity of the new operator remains unproven and any performance lift could be attributable to other factors.
[§4, experiments] The reported 59.50 IoU and 21.05 mIoU on Occ-ScanNet are presented without error bars, multiple-run statistics, or explicit dataset-split details; in addition, no ablation isolates the contribution of the Progressive Temperature Decay schedule. These omissions make it impossible to verify the robustness of the superiority claims over baselines.

minor comments (2)

[Abstract] The abstract states that the method 'surpasses all existing occupancy methods in IoU' but does not name the specific baselines or reference the corresponding table/figure for this comparison.
[§3] Notation for the parameters of the Progressive Temperature Decay schedule could be introduced more explicitly when first defined to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestions. We address each of the major comments below and outline the revisions we will make to the manuscript.

Referee: [Abstract and §3, geometry supervision] The claim that 'existing Gaussian-to-Occupancy operators fail to converge under such weak supervision' is load-bearing for the introduction of the opacity-aware Poisson replacement, yet the manuscript provides no training curves, loss plots, or controlled ablation demonstrating divergence or collapse of standard splatting operators under binary labels. Without this evidence the necessity of the new operator remains unproven and any performance lift could be attributable to other factors.

Authors: We agree that providing empirical evidence for the convergence failure of existing operators under geometry-only supervision would strengthen our justification for the opacity-aware Poisson aggregation. In the revised manuscript, we will add training curves and loss plots comparing standard splatting operators with our proposed method under binary occupancy labels. This will clearly demonstrate the divergence issue and support the necessity of the new operator.
Revision: yes

Referee: [§4, experiments] The reported 59.50 IoU and 21.05 mIoU on Occ-ScanNet are presented without error bars, multiple-run statistics, or explicit dataset-split details; in addition, no ablation isolates the contribution of the Progressive Temperature Decay schedule. These omissions make it impossible to verify the robustness of the superiority claims over baselines.

Authors: We acknowledge the importance of statistical robustness in the experimental results. We will conduct multiple runs with different random seeds and report the mean and standard deviation for the IoU and mIoU metrics, including error bars in the tables. We will also provide explicit details on the dataset splits used for Occ-ScanNet. Furthermore, we will include an ablation study that isolates the effect of the Progressive Temperature Decay schedule to demonstrate its contribution to the performance.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; results are empirical measurements on external benchmark

The paper reports measured performance (59.50 IoU, 21.05 mIoU on Occ-ScanNet) as outcomes of an experimental framework rather than quantities algebraically derived from its own fitted parameters or equations. The claim that existing Gaussian-to-Occupancy operators fail to converge is presented as an empirical observation motivating the new opacity-aware Poisson operator and Progressive Temperature Decay schedule; no self-citation, self-definitional loop, or fitted-input-renamed-as-prediction is exhibited in the provided text that would reduce the reported metrics to the inputs by construction. The geometry-language alignment and supervision paradigm are validated externally, leaving the derivation chain self-contained against benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the domain assumption that 3D Language-Embedded Gaussians provide a suitable joint geometry-language representation and on the empirical observation that standard operators fail under weak supervision; no explicit free parameters beyond the temperature schedule are named, and no new physical entities are postulated.

free parameters (1)

Progressive Temperature Decay schedule parameters
Hyperparameters controlling the rate at which opacities are sharpened during splatting; these are introduced to mitigate feature mixing and are presumably tuned on validation data.

axioms (1)

domain assumption: 3D Language-Embedded Gaussians serve as a unified intermediate representation coupling fine-grained 3D geometry with language-aligned semantic embedding
Invoked as the base representation upon which both the geometry and semantic modules are built.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction
cs.RO · 2026-04 · unverdicted · novelty 6.0

FreeOcc enables training-free open-vocabulary 3D occupancy prediction from RGB-D sequences by combining SLAM, dense Gaussian maps, off-the-shelf vision-language models, and probabilistic projection, achieving over 2x ...

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1] Anonymous. S2GO: Streaming sparse gaussian occupancy. In Submitted to The Fourteenth International Conference on Learning Representations, 2025. Under review.
[2] Anonymous. VGMOcc: Sparse gaussian occupancy prediction with visual geometry model priors. In Submitted to The Fourteenth International Conference on Learning Representations, 2025. Under review.
[3] Armen Avetisyan, Manuel Dahnert, Angela Dai, Manolis Savva, Angel X. Chang, and Matthias Niessner. Scan2cad: Learning cad model alignment in rgb-d scans. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report...
[5] Simon Boeder, Fabian Gigengack, and Benjamin Risse. Langocc: Self-supervised open vocabulary occupancy estimation via volume rendering. arXiv preprint arXiv:2407.17310,
[6] Anh-Quan Cao and Raoul De Charette. Monoscene: Monocular 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3991–4001, 2022.
[7] Loick Chambon, Eloi Zablocki, Alexandre Boulch, Mickael Chen, and Matthieu Cord. Gaussrender: Learning 3d occupancy with gaussian rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27010–27020, 2025.
[8] Zichao Dong, Hang Ji, Weikun Zhang, Xufeng Huang, and Junbo Chen. Og: Equip vision occupancy with instance segmentation and visual grounding. arXiv preprint arXiv:2307.05873, 2023.
[9] Yuhang Gao, Xiang Xiang, Sheng Zhong, and Guoyou Wang. Loc: A general language-guided framework for open-set 3d occupancy prediction. arXiv preprint arXiv:2510.22141, 2025.
[10] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9223–9232, 2023.
[11] Yuanhui Huang, Amonnut Thammatadatrakoon, Wenzhao Zheng, Yunpeng Zhang, Dalong Du, and Jiwen Lu. Gaussianformer-2: Probabilistic gaussian superposition for efficient 3d occupancy prediction. arXiv preprint arXiv:2412.04384, 2024.
[12] Yuanhui Huang, Wenzhao Zheng, Borui Zhang, Jie Zhou, and Jiwen Lu. Selfocc: Self-supervised vision-based 3d occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19946–19956, 2024.
[13] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Gaussianformer: Scene as gaussians for vision-based 3d semantic occupancy prediction. In European Conference on Computer Vision, pages 376–393. Springer,
[14] Haochen Jiang, Yueming Xu, Yihan Zeng, Hang Xu, Wei Zhang, Jianfeng Feng, and Li Zhang. Openocc: Open vocabulary 3d scene reconstruction via occupancy representation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024.
[15] KJ Joseph, Salman Khan, Fahad Shahbaz Khan, and Vineeth N Balasubramanian. Towards open world object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5830–5840,
[16] Kim Jun-Seong, Kim GeonU, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, and Tae-Hyun Oh. Dr. splat: Directly referring 3d gaussian splatting via direct language embedding registration. In CVPR, 2025.
[17] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,
[18] J. F. C. Kingman. Poisson processes. The Clarendon Press Oxford University Press, New York, 1993. Oxford Science Publications.
[19] Peizheng Li, Shuxiao Ding, You Zhou, Qingwen Zhang, Onat Inak, Larissa Triess, Niklas Hanselmann, Marius Cordts, and Andreas Zell. Ago: Adaptive grounding for open world 3d occupancy prediction. arXiv preprint arXiv:2504.10117, 2025.
[20] Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M Alvarez, Sanja Fidler, Chen Feng, and Anima Anandkumar. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9087–9098, 2023.
[21] Zhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, and Jose M Alvarez. FB-OCC: 3D occupancy prediction based on forward-backward view transformation. arXiv:2307.01492, 2023.
[22] Rui Liu, Wenguan Wang, and Yi Yang. Volumetric environment representation for vision-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16317–16328, 2024.
[23] Ruixun Liu, Lingyu Kong, Derun Li, and Hang Zhao. Occvla: Vision-language-action model with implicit 3d occupancy supervision. arXiv preprint arXiv:2509.05578, 2025.
[24] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer,
[25] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[26] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[27] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318. PMLR, 2013.
[28] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In European conference on computer vision, pages 194–210. Springer, 2020.
[29] Rui Qian, Haozhi Cao, Tianchen Deng, Shenghai Yuan, and Lihua Xie. Splatssc: Decoupled depth-guided gaussian splatting for semantic scene completion, 2025.
[30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[31] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159,
[32] S.M. Ross. Stochastic processes. Wiley, 1996.
[33] Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding. arXiv preprint arXiv:2311.18482, 2023.
[34] Yiang Shi, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Xinggang Wang. Occupancy as set of points. In European Conference on Computer Vision, pages 72–87. Springer,
[35] Yuheng Shi, Minjing Dong, and Chang Xu. Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation. arXiv preprint arXiv:2411.09219, 2024.
[36] Zhan Shi, Song Wang, Junbo Chen, and Jianke Zhu. A coarse-to-fine approach to multi-modality 3d occupancy grounding. arXiv preprint arXiv:2508.01197, 2025.
[37] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1746–1754, 2017.
[38] Zhiyu Tan, Zichao Dong, Cheng Zhang, Weikun Zhang, Hang Ji, and Hao Li. Ovo: Open-vocabulary occupancy,
[39] Pin Tang, Zhongdao Wang, Guoqing Wang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, and Chao Ma. Sparseocc: Rethinking sparse latent representation for vision-based semantic occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15035–15044, 2024.
[40] Emanuele Vespa, Nikolay Nikolov, Marius Grimm, Luigi Nardi, Paul H. J. Kelly, and Stefan Leutenegger. Efficient octree-based volumetric slam supporting signed-distance and occupancy mapping. IEEE Robotics and Automation Letters, 3(2):1144–1151, 2018.
[41] Antonin Vobecky, Oriane Siméoni, David Hurych, Spyridon Gidaris, Andrei Bursuc, Patrick Pérez, and Josef Sivic. Pop-3d: Open-vocabulary 3d occupancy prediction from images. Advances in Neural Information Processing Systems, 36:50545–50557, 2023.
[42] Hao Wang, Xiaobao Wei, Xiaoan Zhang, Jianing Li, Chengyu Bai, Ying Li, Ming Lu, Wenzhao Zheng, and Shanghang Zhang. Embodiedocc++: Boosting embodied 3d occupancy prediction with plane regularization and uncertainty sampler. arXiv preprint arXiv:2504.09540, 2025.
[43] Yida Wang, David Joseph Tan, Nassir Navab, and Federico Tombari. Forknet: Multi-branch volumetric semantic completion from a single depth image, 2019.
[44] Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, and Wenchao Ding. Occllama: An occupancy-language-action generative world model for autonomous driving. arXiv preprint arXiv:2409.03272, 2024.
[45] Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21729–21740, 2023.
[46] Shun-Cheng Wu, Kesuke Tateno, Nassir Navab, and Federico Tombari. Scfusion: Real-time incremental scene reconstruction with semantic completion. In 2020 International Conference on 3D Vision (3DV), pages 801–810, 2020.
[47] Yuqi Wu, Wenzhao Zheng, Sicheng Zuo, Yuanhui Huang, Jie Zhou, and Jiwen Lu. Embodiedocc: Embodied 3d occupancy prediction for vision-based online scene understanding. arXiv preprint arXiv:2412.04380, 2024.
[48]

Depth any- thing v2.Advances in Neural Information Processing Sys- tems, 37:21875–21911, 2024

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiao- gang Xu, Jiashi Feng, and Hengshuang Zhao. Depth any- thing v2.Advances in Neural Information Processing Sys- tems, 37:21875–21911, 2024. 6

work page 2024
[49]

Ndc-scene: Boost monocular 3d semantic scene completion in normalized de- vice coordinates space

Jiawei Yao, Chuming Li, Keqiang Sun, Yingjie Cai, Hao Li, Wanli Ouyang, and Hongsheng Li. Ndc-scene: Boost monocular 3d semantic scene completion in normalized de- vice coordinates space. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9421–9431. IEEE Computer Society, 2023. 2, 3

work page 2023
[50]

Monocular occupancy prediction for scalable indoor scenes

Hongxiao Yu, Yuqi Wang, Yuntao Chen, and Zhaoxiang Zhang. Monocular occupancy prediction for scalable indoor scenes. InEuropean Conference on Computer Vision, pages 38–54. Springer, 2024. 1, 2, 3, 6, 7

work page 2024
[51] Qiucheng Yu, Yuan Xie, and Xin Tan. Shtocc: Effective 3d occupancy prediction with sparse head and tail voxels. arXiv preprint arXiv:2505.22461, 2025.
[52] Zehao Yu, Torsten Sattler, and Andreas Geiger. Gaussian opacity fields: Efficient adaptive surface reconstruction in unbounded scenes. ACM Transactions on Graphics, 2024.
[53] Zhu Yu, Bowen Pang, Lizhe Liu, Runmin Zhang, Qiang Li, Si-Yuan Cao, Maochun Luo, Mingxia Chen, Sheng Yang, and Hui-Liang Shen. Language driven occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7548–7558, 2025.
[54] Chubin Zhang, Juncheng Yan, Yi Wei, Jiaxin Li, Li Liu, Yansong Tang, Yueqi Duan, and Jiwen Lu. Occnerf: Self-supervised multi-camera occupancy prediction with neural radiance fields. CoRR, abs/2312.09243, 2023.
[55] Yifan Zhang, Bingyi Kang, Bryan Hooi, Shuicheng Yan, and Jiashi Feng. Deep long-tailed learning: A survey. IEEE transactions on pattern analysis and machine intelligence, 45(9):10795–10816, 2023.
[56] Yunpeng Zhang, Zheng Zhu, and Dalong Du. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9433–9443, 2023.
[57] Zhang Zhang, Qiang Zhang, Wei Cui, Shuai Shi, Yijie Guo, Gang Han, Wen Zhao, Hengle Ren, Renjing Xu, and Jian Tang. Roboocc: Enhancing the geometric and semantic scene understanding for robots. arXiv preprint arXiv:2504.14604, 2025.
[58] Jilai Zheng, Pin Tang, Zhongdao Wang, Guoqing Wang, Xiangxuan Ren, Bailan Feng, and Chao Ma. Veon: Vocabulary-enhanced occupancy prediction. In European Conference on Computer Vision, pages 92–108. Springer, 2024.
[59] Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024.

[1] Anonymous. S2GO: Streaming sparse gaussian occupancy. In Submitted to The Fourteenth International Conference on Learning Representations, 2025. Under review.

[2] Anonymous. VGMOcc: Sparse gaussian occupancy prediction with visual geometry model priors. In Submitted to The Fourteenth International Conference on Learning Representations, 2025. Under review.

[3] Armen Avetisyan, Manuel Dahnert, Angela Dai, Manolis Savva, Angel X. Chang, and Matthias Niessner. Scan2cad: Learning cad model alignment in rgb-d scans. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report...

[5] Simon Boeder, Fabian Gigengack, and Benjamin Risse. Langocc: Self-supervised open vocabulary occupancy estimation via volume rendering. arXiv preprint arXiv:2407.17310,

[6] Anh-Quan Cao and Raoul De Charette. Monoscene: Monocular 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3991–4001, 2022.

[7] Loick Chambon, Eloi Zablocki, Alexandre Boulch, Mickael Chen, and Matthieu Cord. Gaussrender: Learning 3d occupancy with gaussian rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27010–27020, 2025.

[8] Zichao Dong, Hang Ji, Weikun Zhang, Xufeng Huang, and Junbo Chen. Og: Equip vision occupancy with instance segmentation and visual grounding. arXiv preprint arXiv:2307.05873, 2023.

[9] Yuhang Gao, Xiang Xiang, Sheng Zhong, and Guoyou Wang. Loc: A general language-guided framework for open-set 3d occupancy prediction. arXiv preprint arXiv:2510.22141, 2025.

[10] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9223–9232, 2023.

[11] Yuanhui Huang, Amonnut Thammatadatrakoon, Wenzhao Zheng, Yunpeng Zhang, Dalong Du, and Jiwen Lu. Gaussianformer-2: Probabilistic gaussian superposition for efficient 3d occupancy prediction. arXiv preprint arXiv:2412.04384, 2024.

[12] Yuanhui Huang, Wenzhao Zheng, Borui Zhang, Jie Zhou, and Jiwen Lu. Selfocc: Self-supervised vision-based 3d occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19946–19956, 2024.

[13] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Gaussianformer: Scene as gaussians for vision-based 3d semantic occupancy prediction. In European Conference on Computer Vision, pages 376–393. Springer,

[14] Haochen Jiang, Yueming Xu, Yihan Zeng, Hang Xu, Wei Zhang, Jianfeng Feng, and Li Zhang. Openocc: Open vocabulary 3d scene reconstruction via occupancy representation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024.

[15] KJ Joseph, Salman Khan, Fahad Shahbaz Khan, and Vineeth N Balasubramanian. Towards open world object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5830–5840,

[16] Kim Jun-Seong, Kim GeonU, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, and Tae-Hyun Oh. Dr. splat: Directly referring 3d gaussian splatting via direct language embedding registration. In CVPR, 2025.

[17] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,

[18] J. F. C. Kingman. Poisson processes. The Clarendon Press, Oxford University Press, New York, 1993. Oxford Science Publications.

[19] Peizheng Li, Shuxiao Ding, You Zhou, Qingwen Zhang, Onat Inak, Larissa Triess, Niklas Hanselmann, Marius Cordts, and Andreas Zell. Ago: Adaptive grounding for open world 3d occupancy prediction. arXiv preprint arXiv:2504.10117, 2025.

[20] Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M Alvarez, Sanja Fidler, Chen Feng, and Anima Anandkumar. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9087–9098, 2023.

[21] Zhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, and Jose M Alvarez. FB-OCC: 3D occupancy prediction based on forward-backward view transformation. arXiv:2307.01492, 2023.

[22] Rui Liu, Wenguan Wang, and Yi Yang. Volumetric environment representation for vision-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16317–16328, 2024.

[23] Ruixun Liu, Lingyu Kong, Derun Li, and Hang Zhao. Occvla: Vision-language-action model with implicit 3d occupancy supervision. arXiv preprint arXiv:2509.05578, 2025.

[24] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer,

[25] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

[26] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[27] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318. PMLR, 2013.

[28] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In European conference on computer vision, pages 194–210. Springer, 2020.

[29] Rui Qian, Haozhi Cao, Tianchen Deng, Shenghai Yuan, and Lihua Xie. Splatssc: Decoupled depth-guided gaussian splatting for semantic scene completion, 2025.

[30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[31] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159,

[32] S.M. Ross. Stochastic processes. Wiley, 1996.

[33] Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding. arXiv preprint arXiv:2311.18482, 2023.

[34] Yiang Shi, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Xinggang Wang. Occupancy as set of points. In European Conference on Computer Vision, pages 72–87. Springer,

[35] Yuheng Shi, Minjing Dong, and Chang Xu. Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation. arXiv preprint arXiv:2411.09219, 2024.

[36] Zhan Shi, Song Wang, Junbo Chen, and Jianke Zhu. A coarse-to-fine approach to multi-modality 3d occupancy grounding. arXiv preprint arXiv:2508.01197, 2025.

[37] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1746–1754, 2017.

[38] Zhiyu Tan, Zichao Dong, Cheng Zhang, Weikun Zhang, Hang Ji, and Hao Li. Ovo: Open-vocabulary occupancy,

[39] Pin Tang, Zhongdao Wang, Guoqing Wang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, and Chao Ma. Sparseocc: Rethinking sparse latent representation for vision-based semantic occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15035–15044, 2024.

[40] Emanuele Vespa, Nikolay Nikolov, Marius Grimm, Luigi Nardi, Paul H. J. Kelly, and Stefan Leutenegger. Efficient octree-based volumetric slam supporting signed-distance and occupancy mapping. IEEE Robotics and Automation Letters, 3(2):1144–1151, 2018.

[41] Antonin Vobecky, Oriane Siméoni, David Hurych, Spyridon Gidaris, Andrei Bursuc, Patrick Pérez, and Josef Sivic. Pop-3d: Open-vocabulary 3d occupancy prediction from images. Advances in Neural Information Processing Systems, 36:50545–50557, 2023.

[42] Hao Wang, Xiaobao Wei, Xiaoan Zhang, Jianing Li, Chengyu Bai, Ying Li, Ming Lu, Wenzhao Zheng, and Shanghang Zhang. Embodiedocc++: Boosting embodied 3d occupancy prediction with plane regularization and uncertainty sampler. arXiv preprint arXiv:2504.09540, 2025.

[43] Yida Wang, David Joseph Tan, Nassir Navab, and Federico Tombari. Forknet: Multi-branch volumetric semantic completion from a single depth image, 2019.

[44] Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, and Wenchao Ding. Occllama: An occupancy-language-action generative world model for autonomous driving. arXiv preprint arXiv:2409.03272, 2024.
