arxiv: 2602.22667 · v2 · pith:YQAQBUNMnew · submitted 2026-02-26 · 💻 cs.CV

Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes

Pith reviewed 2026-05-21 11:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-vocabulary occupancy · indoor 3D prediction · monocular input · language-embedded Gaussians · binary supervision · volumetric aggregation · semantic alignment

The pith

A monocular method predicts open-vocabulary 3D occupancy in indoor scenes from single images using only binary occupancy labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to enable embodied agents to understand indoor environments with abundant and evolving semantic categories. It adopts a geometry-only supervision approach that relies solely on binary occupied-versus-free labels rather than dense semantic annotations. The framework employs 3D Language-Embedded Gaussians to link fine-grained geometry with language-aligned embeddings. An opacity-aware Poisson-based operator replaces prior Gaussian-to-occupancy conversions to ensure convergence under weak supervision, while a Progressive Temperature Decay schedule gradually sharpens opacities during splatting to strengthen feature alignment. On Occ-ScanNet, this combination is reported to reach 59.50 IoU and 21.05 mIoU in the open-vocabulary setting.

Core claim

The authors' central claim is that 3D Language-Embedded Gaussians can serve as a unified intermediate representation for open-vocabulary occupancy prediction. Two components carry this under binary supervision alone: an opacity-aware Poisson-based volumetric aggregation operator, and a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting. Together they are said to allow stable geometry-language alignment from monocular images.

What carries the argument

3D Language-Embedded Gaussians as a unified intermediate representation that couples fine-grained 3D geometry with language-aligned semantic embeddings, stabilized by an opacity-aware Poisson-based operator and Progressive Temperature Decay schedule.
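
To make the idea of a Poisson-based, opacity-aware aggregation concrete, here is a minimal sketch. It is not the authors' operator: it assumes that each Gaussian contributes an opacity-weighted kernel value to a per-point intensity, and that occupancy is the probability of at least one Poisson event, 1 - exp(-intensity). The isotropic kernels, the function names `gaussian_density` and `poisson_occupancy`, and all numeric values are hypothetical.

```python
import math

def gaussian_density(x, mean, sigma):
    """Unnormalized isotropic Gaussian kernel at point x (peak value 1.0)."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return math.exp(-0.5 * sq_dist / (sigma ** 2))

def poisson_occupancy(x, gaussians):
    """Occupancy at x as P(at least one event) = 1 - exp(-intensity).

    `gaussians` is a list of (mean, sigma, opacity) tuples. Treating the
    opacity-weighted kernel sum as a Poisson intensity is an assumption
    made for illustration only.
    """
    intensity = sum(opacity * gaussian_density(x, mean, sigma)
                    for mean, sigma, opacity in gaussians)
    return 1.0 - math.exp(-intensity)

# Two hypothetical Gaussians: one fairly opaque, one nearly transparent.
gaussians = [
    ((0.0, 0.0, 0.0), 0.2, 0.9),
    ((1.0, 0.0, 0.0), 0.2, 0.05),
]
near_opaque = poisson_occupancy((0.0, 0.0, 0.0), gaussians)
near_transparent = poisson_occupancy((1.0, 0.0, 0.0), gaussians)
far_away = poisson_occupancy((5.0, 5.0, 5.0), gaussians)
```

In this sketch the output stays in [0, 1) however many Gaussians overlap, and a low-opacity Gaussian contributes little occupancy even at its own center, which is one plausible reading of "opacity-aware".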

If this is right

  • Open-vocabulary occupancy becomes feasible indoors without requiring dense semantic ground truth for every object category.
  • Embodied agents can maintain consistent 3D semantic maps as new object categories appear over time.
  • The same intermediate Gaussian representation can support both geometric reconstruction and language-based queries in a single forward pass.
  • Binary supervision reduces the annotation burden for training occupancy models in complex indoor layouts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same stabilization techniques might transfer to outdoor settings if adjusted for sparser point distributions and different scale ranges.
  • Integration with online mapping systems could allow robots to incrementally update open-vocabulary occupancy without full scene re-training.
  • The approach suggests a broader pattern where language embeddings are anchored to geometry through differentiable rendering rather than direct feature matching.

Load-bearing premise

Existing Gaussian-to-Occupancy operators fail to converge under binary occupancy supervision, and the proposed opacity-aware Poisson replacement together with the temperature decay schedule will produce stable alignment without new artifacts or extra dense labels.

What would settle it

A direct test would replace the Poisson operator with a standard Gaussian-to-occupancy conversion while keeping all other components fixed and measure whether convergence and alignment quality collapse on the same indoor dataset under binary supervision.

Figures

Figures reproduced from arXiv: 2602.22667 by Changhao Chen, Changqing Zhou, Han Zhang, Yueru Luo, Zeyu Jiang.

Figure 1
Figure 1: Closed- vs. open-vocabulary occupancy. Prior methods [47, 50] trained under a closed vocabulary can label only the categories predefined at training time, which restricts real-world deployment. Our open-vocabulary approach aligns language with 3D occupancy and supports text queries for arbitrary categories. Right column (Random Class): text-conditioned per-voxel scores are visualized as heatmaps; darker r…
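
The text-conditioned per-voxel scores mentioned in the Figure 1 caption can be illustrated with a minimal sketch. It assumes each voxel carries a language-aligned embedding and that a query is scored by cosine similarity against a text embedding; the toy 3-dimensional vectors and the names `cosine_similarity` and `query_voxels` are hypothetical, and the paper's actual scoring may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def query_voxels(voxel_embeddings, text_embedding):
    """Return one similarity score per voxel for a single text query."""
    return [cosine_similarity(v, text_embedding) for v in voxel_embeddings]

# Toy embeddings standing in for real language-aligned features.
voxel_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
chair_query = [1.0, 0.0, 0.0]
scores = query_voxels(voxel_embeddings, chair_query)
best_voxel = max(range(len(scores)), key=scores.__getitem__)
```

Because the query is just a vector, any category name that a text encoder can embed could be scored this way, without being part of a fixed training vocabulary.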
Figure 2
Figure 2: LegoOcc Framework Overview. From a monocular image, a feed-forward Gaussian model produces Language-Embedded Gaussians. Training proceeds along two coupled paths. Semantic learning: we differentiably render Gaussian features to the image with Progressive Temperature Decay and align them to a training-free open-vocabulary segmenter via a cosine objective L_feat. Geometry learning: we convert Gaussians to occ…
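
A minimal sketch of why sharpening opacities can reduce feature mixing when Gaussian features are rendered to a pixel. It assumes opacity is parameterized as sigmoid(logit / τ) and that features are blended by standard front-to-back alpha compositing; this parameterization, the function names, and the logit values are assumptions for illustration, not the paper's stated formulation.

```python
import math

def sharpened_opacity(logit, tau):
    """Opacity as a temperature-scaled sigmoid; smaller tau gives harder values."""
    return 1.0 / (1.0 + math.exp(-logit / tau))

def compositing_weights(logits, tau):
    """Front-to-back alpha-compositing weights for depth-sorted Gaussians."""
    weights = []
    transmittance = 1.0
    for logit in logits:
        alpha = sharpened_opacity(logit, tau)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights

# Three hypothetical Gaussians along one ray, nearest first.
logits = [1.0, 1.0, -1.0]
soft = compositing_weights(logits, tau=1.0)
sharp = compositing_weights(logits, tau=0.1)
```

At τ = 1.0 the second Gaussian still receives a noticeable share of the pixel, so the rendered feature is a blend; at τ = 0.1 nearly all weight falls on the front Gaussian, so the rendered feature stays close to a single embedding.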
Figure 3
Figure 3: Comparison of temperature schedules. Linear decay decreases τ uniformly, whereas our exponential schedule rapidly approaches Tmin, allocating more iterations for the model to adapt to the low-temperature regime.
3.5. Losses. We optimize the network with a composite objective that couples 3D geometry supervision with 2D language-aligned guidance.
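
The contrast drawn in the Figure 3 caption can be sketched numerically. The exact schedule is not given in the text here, so the following assumes a simple exponential form τ(t) = T_min + (T_max - T_min)·exp(-rate·t/N) and compares it with linear interpolation; the values of `t_max`, `t_min`, `rate`, and `total_steps` are hypothetical.

```python
import math

def linear_decay(step, total_steps, t_max, t_min):
    """Temperature that falls uniformly from t_max to t_min."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_max - (t_max - t_min) * frac

def exponential_decay(step, total_steps, t_max, t_min, rate=5.0):
    """Temperature that drops quickly at first, then flattens near t_min.

    The functional form and `rate` are illustrative assumptions.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_min + (t_max - t_min) * math.exp(-rate * frac)

total_steps = 1000
t_max, t_min = 1.0, 0.1
halfway_linear = linear_decay(500, total_steps, t_max, t_min)
halfway_exponential = exponential_decay(500, total_steps, t_max, t_min)
```

Halfway through training the linear schedule is still at the midpoint between the two temperatures, while the exponential one is already close to `t_min`, leaving the remaining steps to be spent in the low-temperature regime.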
Figure 4
Figure 4: Qualitative results on Occ-ScanNet. From top to bottom: (a) input images; (b) ground-truth semantic occupancy; (c) results from our re-implemented LOcc [53]; (d) our method. Both (c) and (d) are trained with geometry-only annotations and evaluated on the closed-vocabulary annotation of Occ-ScanNet.
Figure 5
Figure 5: Open-vocabulary qualitative results. Legends list the VLM-extracted object nouns used as text queries. (a) Input image. (b) Open-vocabulary 2D segmentation for queried nouns. (c) Our 3D open-vocabulary occupancy colored by the same categories.
These results demonstrate the model's ability to localize and identify free-form indoor categories in 3D.
5. Conclusion. We introduced a monocular open-vocabu…
Original abstract

Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian-language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at https://github.com/JuIvyy/LegoOcc.
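
For readers unfamiliar with the two metrics quoted in the abstract, here is a minimal sketch under a common convention: IoU measures occupied-versus-free overlap regardless of class, and mIoU averages per-class IoU over the semantic classes. The label `0` for free space, the skipping of classes absent from both prediction and ground truth, and the toy label arrays are assumptions; the Occ-ScanNet protocol may differ in detail.

```python
def binary_iou(pred, gt, free_label=0):
    """Geometry IoU: overlap of occupied voxels, ignoring semantic class."""
    inter = sum(1 for p, g in zip(pred, gt)
                if p != free_label and g != free_label)
    union = sum(1 for p, g in zip(pred, gt)
                if p != free_label or g != free_label)
    return inter / union if union else 0.0

def mean_iou(pred, gt, class_ids):
    """Semantic mIoU: mean of per-class IoU over classes that appear."""
    ious = []
    for c in class_ids:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy flattened voxel labels: 0 = free, 1 and 2 = two semantic classes.
gt = [0, 1, 1, 2, 2, 0]
pred = [0, 1, 2, 2, 0, 1]
geometry_iou = binary_iou(pred, gt)
semantic_miou = mean_iou(pred, gt, class_ids=[1, 2])
```

The toy example shows how the two numbers can diverge: a voxel that is correctly marked occupied but given the wrong class still counts toward IoU while hurting mIoU.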

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a monocular framework for open-vocabulary 3D occupancy prediction in indoor scenes. It builds on 3D Language-Embedded Gaussians as a unified representation for geometry and semantics. Under a geometry-only supervision paradigm using only binary occupied/free labels, the authors replace standard Gaussian-to-occupancy operators with an opacity-aware Poisson-based aggregation because existing operators are stated to fail to converge; they also introduce a Progressive Temperature Decay schedule to mitigate feature mixing in semantic alignment. On Occ-ScanNet the method reports 59.50 IoU and 21.05 mIoU, claiming to surpass prior occupancy and open-vocabulary approaches.

Significance. If the central claims hold, the work is significant for advancing open-vocabulary indoor occupancy prediction under weak supervision, which is relevant for embodied agents operating in dense, semantically rich environments. The geometry-only paradigm and the use of Gaussians as an intermediate representation are promising directions. Explicit credit is given for the commitment to release code, which supports reproducibility.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (geometry supervision): the claim that 'existing Gaussian-to-Occupancy operators fail to converge under such weak supervision' is load-bearing for the introduction of the opacity-aware Poisson replacement, yet the manuscript provides no training curves, loss plots, or controlled ablation demonstrating divergence or collapse of standard splatting operators under binary labels. Without this evidence the necessity of the new operator remains unproven and any performance lift could be attributable to other factors.
  2. [§4] §4 (experiments): the reported 59.50 IoU and 21.05 mIoU on Occ-ScanNet are presented without error bars, multiple-run statistics, or explicit dataset-split details; in addition, no ablation isolates the contribution of the Progressive Temperature Decay schedule. These omissions make it impossible to verify the robustness of the superiority claims over baselines.
minor comments (2)
  1. [Abstract] The abstract states that the method 'surpasses all existing occupancy methods in IoU' but does not name the specific baselines or reference the corresponding table/figure for this comparison.
  2. [§3] Notation for the parameters of the Progressive Temperature Decay schedule could be introduced more explicitly when first defined to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestions. We address each of the major comments below and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (geometry supervision): the claim that 'existing Gaussian-to-Occupancy operators fail to converge under such weak supervision' is load-bearing for the introduction of the opacity-aware Poisson replacement, yet the manuscript provides no training curves, loss plots, or controlled ablation demonstrating divergence or collapse of standard splatting operators under binary labels. Without this evidence the necessity of the new operator remains unproven and any performance lift could be attributable to other factors.

    Authors: We agree that providing empirical evidence for the convergence failure of existing operators under geometry-only supervision would strengthen our justification for the opacity-aware Poisson aggregation. In the revised manuscript, we will add training curves and loss plots comparing standard splatting operators with our proposed method under binary occupancy labels. This will clearly demonstrate the divergence issue and support the necessity of the new operator. revision: yes

  2. Referee: [§4] §4 (experiments): the reported 59.50 IoU and 21.05 mIoU on Occ-ScanNet are presented without error bars, multiple-run statistics, or explicit dataset-split details; in addition, no ablation isolates the contribution of the Progressive Temperature Decay schedule. These omissions make it impossible to verify the robustness of the superiority claims over baselines.

    Authors: We acknowledge the importance of statistical robustness in the experimental results. We will conduct multiple runs with different random seeds and report the mean and standard deviation for the IoU and mIoU metrics, including error bars in the tables. We will also provide explicit details on the dataset splits used for Occ-ScanNet. Furthermore, we will include an ablation study that isolates the effect of the Progressive Temperature Decay schedule to demonstrate its contribution to the performance. revision: yes
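
The multi-seed reporting promised in the second response can be sketched with the Python standard library. The per-seed scores below are arbitrary placeholders chosen only to make the arithmetic easy to follow; they are not results from the paper, and `summarize_runs` is a hypothetical helper name.

```python
import statistics

def summarize_runs(values):
    """Return (mean, sample standard deviation) over per-seed metric values."""
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std

# Placeholder values, NOT measurements from the paper.
placeholder_scores = [10.0, 12.0, 14.0]
mean_score, std_score = summarize_runs(placeholder_scores)
report = f"{mean_score:.2f} ± {std_score:.2f}"
```

Reporting the sample standard deviation alongside the mean lets a reader judge whether a gap between two methods exceeds seed-to-seed variation.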

Circularity Check

0 steps flagged

No significant circularity; results are empirical measurements on external benchmark

full rationale

The paper reports measured performance (59.50 IoU, 21.05 mIoU on Occ-ScanNet) as outcomes of an experimental framework rather than quantities algebraically derived from its own fitted parameters or equations. The claim that existing Gaussian-to-Occupancy operators fail to converge is presented as an empirical observation motivating the new opacity-aware Poisson operator and Progressive Temperature Decay schedule; no self-citation, self-definitional loop, or fitted-input-renamed-as-prediction is exhibited in the provided text that would reduce the reported metrics to the inputs by construction. The geometry-language alignment and supervision paradigm are validated externally, leaving the derivation chain self-contained against benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the domain assumption that 3D Language-Embedded Gaussians provide a suitable joint geometry-language representation and on the empirical observation that standard operators fail under weak supervision; no explicit free parameters beyond the temperature schedule are named, and no new physical entities are postulated.

free parameters (1)
  • Progressive Temperature Decay schedule parameters
    Hyperparameters controlling the rate at which opacities are sharpened during splatting; these are introduced to mitigate feature mixing and are presumably tuned on validation data.
axioms (1)
  • domain assumption: 3D Language-Embedded Gaussians serve as a unified intermediate representation coupling fine-grained 3D geometry with language-aligned semantic embedding
    Invoked as the base representation upon which both the geometry and semantic modules are built.

pith-pipeline@v0.9.0 · 5790 in / 1438 out tokens · 55766 ms · 2026-05-21T11:46:20.265055+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction

    cs.RO 2026-04 unverdicted novelty 6.0

    FreeOcc enables training-free open-vocabulary 3D occupancy prediction from RGB-D sequences by combining SLAM, dense Gaussian maps, off-the-shelf vision-language models, and probabilistic projection, achieving over 2x ...

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    S2GO: Streaming sparse gaussian occupancy

    Anonymous. S2GO: Streaming sparse gaussian occupancy. In Submitted to The Fourteenth International Conference on Learning Representations, 2025. Under review.

  2. [2]

    VGMOcc: Sparse gaussian occupancy prediction with visual geometry model priors

    Anonymous. VGMOcc: Sparse gaussian occupancy prediction with visual geometry model priors. In Submitted to The Fourteenth International Conference on Learning Representations, 2025. Under review.

  3. [3]

    Scan2cad: Learning cad model alignment in rgb-d scans

    Armen Avetisyan, Manuel Dahnert, Angela Dai, Manolis Savva, Angel X. Chang, and Matthias Niessner. Scan2cad: Learning cad model alignment in rgb-d scans. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

  4. [4]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical repor...

  5. [5]

    Langocc: Self-supervised open vocabulary occupancy estimation via volume rendering

    Simon Boeder, Fabian Gigengack, and Benjamin Risse. Langocc: Self-supervised open vocabulary occupancy estimation via volume rendering. arXiv preprint arXiv:2407.17310,

  6. [6]

    Monoscene: Monocular 3d semantic scene completion

    Anh-Quan Cao and Raoul De Charette. Monoscene: Monocular 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3991–4001, 2022.

  7. [7]

    Gaussrender: Learning 3d occupancy with gaussian rendering

    Loick Chambon, Eloi Zablocki, Alexandre Boulch, Mickael Chen, and Matthieu Cord. Gaussrender: Learning 3d occupancy with gaussian rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27010–27020, 2025.

  8. [8]

    Og: Equip vision occupancy with instance segmentation and visual grounding

    Zichao Dong, Hang Ji, Weikun Zhang, Xufeng Huang, and Junbo Chen. Og: Equip vision occupancy with instance segmentation and visual grounding. arXiv preprint arXiv:2307.05873, 2023.

  9. [9]

    Loc: A general language-guided framework for open-set 3d occupancy prediction

    Yuhang Gao, Xiang Xiang, Sheng Zhong, and Guoyou Wang. Loc: A general language-guided framework for open-set 3d occupancy prediction. arXiv preprint arXiv:2510.22141, 2025.

  10. [10]

    Tri-perspective view for vision-based 3d semantic occupancy prediction

    Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9223–9232, 2023.

  11. [11]

    Gaussianformer-2: Probabilistic gaussian superposition for efficient 3d occupancy prediction

    Yuanhui Huang, Amonnut Thammatadatrakoon, Wenzhao Zheng, Yunpeng Zhang, Dalong Du, and Jiwen Lu. Gaussianformer-2: Probabilistic gaussian superposition for efficient 3d occupancy prediction. arXiv preprint arXiv:2412.04384, 2024.

  12. [12]

    Selfocc: Self-supervised vision-based 3d occupancy prediction

    Yuanhui Huang, Wenzhao Zheng, Borui Zhang, Jie Zhou, and Jiwen Lu. Selfocc: Self-supervised vision-based 3d occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19946–19956, 2024.

  13. [13]

    Gaussianformer: Scene as gaussians for vision-based 3d semantic occupancy prediction

    Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Gaussianformer: Scene as gaussians for vision-based 3d semantic occupancy prediction. In European Conference on Computer Vision, pages 376–393. Springer,

  14. [14]

    Openocc: Open vocabulary 3d scene reconstruction via occupancy representation

    Haochen Jiang, Yueming Xu, Yihan Zeng, Hang Xu, Wei Zhang, Jianfeng Feng, and Li Zhang. Openocc: Open vocabulary 3d scene reconstruction via occupancy representation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024.

  15. [15]

    Towards open world object detection

    KJ Joseph, Salman Khan, Fahad Shahbaz Khan, and Vineeth N Balasubramanian. Towards open world object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5830–5840,

  16. [16]

    Kim Jun-Seong, Kim GeonU, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, and Tae-Hyun Oh. Dr. splat: Directly referring 3d gaussian splatting via direct language embedding registration. In CVPR, 2025.

  17. [17]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,

  18. [18]

    J. F. C. Kingman. Poisson processes. The Clarendon Press Oxford University Press, New York, 1993. Oxford Science Publications.

  19. [19]

    Ago: Adaptive grounding for open world 3d occupancy prediction

    Peizheng Li, Shuxiao Ding, You Zhou, Qingwen Zhang, Onat Inak, Larissa Triess, Niklas Hanselmann, Marius Cordts, and Andreas Zell. Ago: Adaptive grounding for open world 3d occupancy prediction. arXiv preprint arXiv:2504.10117, 2025.

  20. [20]

    Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion

    Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M Alvarez, Sanja Fidler, Chen Feng, and Anima Anandkumar. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9087–9098, 2023.

  21. [21]

    Fb-occ: 3d occupancy prediction based on forward-backward view transformation

    Zhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, and Jose M Alvarez. FB-OCC: 3D occupancy prediction based on forward-backward view transformation. arXiv:2307.01492, 2023.

  22. [22]

    Volumetric environment representation for vision-language navigation

    Rui Liu, Wenguan Wang, and Yi Yang. Volumetric environment representation for vision-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16317–16328, 2024.

  23. [23]

    Occvla: Vision-language-action model with implicit 3d occupancy supervision

    Ruixun Liu, Lingyu Kong, Derun Li, and Hang Zhao. Occvla: Vision-language-action model with implicit 3d occupancy supervision. arXiv preprint arXiv:2509.05578, 2025.

  24. [24]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer,

  25. [25]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

  26. [26]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  27. [27]

    On the difficulty of training recurrent neural networks

    Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318. PMLR, 2013.

  28. [28]

    Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d

    Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In European conference on computer vision, pages 194–210. Springer, 2020.

  29. [29]

    Splatssc: Decoupled depth-guided gaussian splatting for semantic scene completion

    Rui Qian, Haozhi Cao, Tianchen Deng, Shenghai Yuan, and Lihua Xie. Splatssc: Decoupled depth-guided gaussian splatting for semantic scene completion, 2025.

  30. [30]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

  31. [31]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159,

  32. [32]

    Stochastic processes

    S.M. Ross. Stochastic processes. Wiley, 1996.

  33. [33]

    Language embedded 3d gaussians for open-vocabulary scene understanding

    Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding. arXiv preprint arXiv:2311.18482, 2023.

  34. [34]

    Occupancy as set of points

    Yiang Shi, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Xinggang Wang. Occupancy as set of points. In European Conference on Computer Vision, pages 72–87. Springer,

  35. [35]

    Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation

    Yuheng Shi, Minjing Dong, and Chang Xu. Harnessing vision foundation models for high-performance, training-free open vocabulary segmentation. arXiv preprint arXiv:2411.09219, 2024.

  36. [36]

    A coarse-to-fine approach to multi-modality 3d occupancy grounding

    Zhan Shi, Song Wang, Junbo Chen, and Jianke Zhu. A coarse-to-fine approach to multi-modality 3d occupancy grounding. arXiv preprint arXiv:2508.01197, 2025.

  37. [37]

    Semantic scene completion from a single depth image

    Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1746–1754, 2017.

  38. [38]

    Ovo: Open-vocabulary occupancy

    Zhiyu Tan, Zichao Dong, Cheng Zhang, Weikun Zhang, Hang Ji, and Hao Li. Ovo: Open-vocabulary occupancy,

  39. [39]

    Sparseocc: Rethinking sparse latent representation for vision-based semantic occupancy prediction

    Pin Tang, Zhongdao Wang, Guoqing Wang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, and Chao Ma. Sparseocc: Rethinking sparse latent representation for vision-based semantic occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15035–15044, 2024.

  40. [40]

    Emanuele Vespa, Nikolay Nikolov, Marius Grimm, Luigi Nardi, Paul H. J. Kelly, and Stefan Leutenegger. Efficient octree-based volumetric slam supporting signed-distance and occupancy mapping. IEEE Robotics and Automation Letters, 3(2):1144–1151, 2018.

  41. [41]

    Pop-3d: Open-vocabulary 3d occupancy prediction from images

    Antonin Vobecky, Oriane Siméoni, David Hurych, Spyridon Gidaris, Andrei Bursuc, Patrick Pérez, and Josef Sivic. Pop-3d: Open-vocabulary 3d occupancy prediction from images. Advances in Neural Information Processing Systems, 36:50545–50557, 2023.

  42. [42]

    Embodiedocc++: Boosting embodied 3d occupancy prediction with plane regularization and uncertainty sampler

    Hao Wang, Xiaobao Wei, Xiaoan Zhang, Jianing Li, Chengyu Bai, Ying Li, Ming Lu, Wenzhao Zheng, and Shanghang Zhang. Embodiedocc++: Boosting embodied 3d occupancy prediction with plane regularization and uncertainty sampler. arXiv preprint arXiv:2504.09540, 2025.

  43. [43]

    Forknet: Multi-branch volumetric semantic completion from a single depth image

    Yida Wang, David Joseph Tan, Nassir Navab, and Federico Tombari. Forknet: Multi-branch volumetric semantic completion from a single depth image, 2019.

  44. [44]

    Occllama: An occupancy-language-action generative world model for autonomous driving

    Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, and Wenchao Ding. Occllama: An occupancy-language-action generative world model for autonomous driving. arXiv preprint arXiv:2409.03272, 2024.

  45. [45]

    Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving

    Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21729–21740, 2023.

  46. [46]

    Scfusion: Real-time incremental scene reconstruction with semantic completion

    Shun-Cheng Wu, Kesuke Tateno, Nassir Navab, and Federico Tombari. Scfusion: Real-time incremental scene reconstruction with semantic completion. In 2020 International Conference on 3D Vision (3DV), pages 801–810, 2020.

  47. [47]

    Embodiedocc: Embodied 3d occupancy prediction for vision-based online scene understanding

    Yuqi Wu, Wenzhao Zheng, Sicheng Zuo, Yuanhui Huang, Jie Zhou, and Jiwen Lu. Embodiedocc: Embodied 3d occupancy prediction for vision-based online scene understanding. arXiv preprint arXiv:2412.04380, 2024.

  48. [48]

    Depth anything v2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024.

  49. [49]

    Ndc-scene: Boost monocular 3d semantic scene completion in normalized device coordinates space

    Jiawei Yao, Chuming Li, Keqiang Sun, Yingjie Cai, Hao Li, Wanli Ouyang, and Hongsheng Li. Ndc-scene: Boost monocular 3d semantic scene completion in normalized device coordinates space. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9421–9431. IEEE Computer Society, 2023.

  50. [50]

    Monocular occupancy prediction for scalable indoor scenes

    Hongxiao Yu, Yuqi Wang, Yuntao Chen, and Zhaoxiang Zhang. Monocular occupancy prediction for scalable indoor scenes. In European Conference on Computer Vision, pages 38–54. Springer, 2024.

  51. [51]

    Shtocc: Effective 3d occupancy prediction with sparse head and tail voxels

    Qiucheng Yu, Yuan Xie, and Xin Tan. Shtocc: Effective 3d occupancy prediction with sparse head and tail voxels. arXiv preprint arXiv:2505.22461, 2025.

  52. [52]

    Gaussian opacity fields: Efficient adaptive surface reconstruction in unbounded scenes

    Zehao Yu, Torsten Sattler, and Andreas Geiger. Gaussian opacity fields: Efficient adaptive surface reconstruction in unbounded scenes. ACM Transactions on Graphics, 2024.

  53. [53]

    Language driven occupancy prediction

    Zhu Yu, Bowen Pang, Lizhe Liu, Runmin Zhang, Qiang Li, Si-Yuan Cao, Maochun Luo, Mingxia Chen, Sheng Yang, and Hui-Liang Shen. Language driven occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7548–7558, 2025.

  54. [54]

    Occnerf: Self-supervised multi-camera occupancy prediction with neural radiance fields

    Chubin Zhang, Juncheng Yan, Yi Wei, Jiaxin Li, Li Liu, Yansong Tang, Yueqi Duan, and Jiwen Lu. Occnerf: Self-supervised multi-camera occupancy prediction with neural radiance fields. CoRR, abs/2312.09243, 2023.

  55. [55]

    Deep long-tailed learning: A survey

    Yifan Zhang, Bingyi Kang, Bryan Hooi, Shuicheng Yan, and Jiashi Feng. Deep long-tailed learning: A survey. IEEE transactions on pattern analysis and machine intelligence, 45(9):10795–10816, 2023.

  56. [56]

    Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction

    Yunpeng Zhang, Zheng Zhu, and Dalong Du. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9433–9443, 2023.

  57. [57]

    Roboocc: Enhancing the geometric and semantic scene understanding for robots

    Zhang Zhang, Qiang Zhang, Wei Cui, Shuai Shi, Yijie Guo, Gang Han, Wen Zhao, Hengle Ren, Renjing Xu, and Jian Tang. Roboocc: Enhancing the geometric and semantic scene understanding for robots. arXiv preprint arXiv:2504.14604, 2025.

  58. [58]

    Veon: Vocabulary-enhanced occupancy prediction

    Jilai Zheng, Pin Tang, Zhongdao Wang, Guoqing Wang, Xiangxuan Ren, Bailan Feng, and Chao Ma. Veon: Vocabulary-enhanced occupancy prediction. In European Conference on Computer Vision, pages 92–108. Springer, 2024.

  59. [59]

    Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields

    Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024.