BePo: Dual Representation for 3D Occupancy Prediction
Pith reviewed 2026-05-19 11:25 UTC · model grok-4.3
The pith
BePo combines BEV and sparse points with cross-attention to fix small-object weaknesses in 3D occupancy prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BePo features a dual representation of BEV and sparse points. The 3D information learned in the sparse points branch is shared with the BEV stream via cross-attention, which injects learning signals of difficult objects on the BEV plane. The outputs of both branches are then fused to generate the final 3D occupancy predictions.
What carries the argument
Dual representation of BEV and sparse points, with cross-attention that transfers 3D signals from the sparse branch to strengthen the BEV branch before fusion.
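To make the transfer step concrete, the following is a minimal sketch of cross-attention in which BEV cells act as queries and sparse-point features act as keys and values. It is an illustration under stated assumptions, not the paper's module: the function names `softmax` and `cross_attend` are hypothetical, the query, key, and value projections are taken to be identity maps, and no positional encoding is used, since the source does not specify any of these.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(bev_feats, point_feats):
    """Each BEV cell (query) attends over all sparse points (keys and values).

    Assumption: identity projections for Q, K, V and a residual update.
    A trained model would use learned linear maps instead.
    """
    dim = len(bev_feats[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for query in bev_feats:
        # Scaled dot-product score between this BEV cell and every point.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in point_feats]
        weights = softmax(scores)
        # Attention-weighted sum of the point features (values).
        context = [sum(w * value[c] for w, value in zip(weights, point_feats))
                   for c in range(dim)]
        # Residual connection keeps the original BEV signal.
        updated.append([q + c for q, c in zip(query, context)])
    return updated

bev = [[1.0, 0.0], [0.0, 1.0]]                  # two BEV cells, feature dim 2
points = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # three sparse points
out = cross_attend(bev, points)
```

In this toy input each BEV cell draws most of its added signal from the point whose feature is most similar to it, which is the behavior the dual-representation argument relies on.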
If this is right
- Small and distant objects receive stronger feature signals on the BEV plane.
- Large flat surfaces remain well modeled by the BEV branch while varied-size objects are handled by the points branch.
- Inference cost stays low because neither branch requires dense 3D volumes.
- Final occupancy maps improve on multiple public driving and indoor benchmarks.
Where Pith is reading between the lines
- The same dual-branch pattern with cross-attention sharing could be tested on other 3D perception tasks such as depth completion or instance segmentation.
- Replacing cross-attention with a cheaper sharing module might preserve gains at even lower runtime.
- The approach points to hybrid representations as a general route for balancing detail and efficiency in real-time 3D vision.
Load-bearing premise
Cross-attention will move useful 3D signals from sparse points to the BEV branch without introducing noise or alignment errors that degrade the fused result.
What would settle it
Ablating the cross-attention link on the same benchmarks and measuring whether small-object accuracy falls back to single-branch BEV levels.
Original abstract
3D occupancy infers fine-grained 3D geometry and semantics, which is critical for autonomous driving. Most existing approaches carry high compute costs, requiring dense 3D feature volumes and cross-attention to effectively aggregate information. More efficient methods adopt Bird's Eye View (BEV) or sparse points as the scene representation, leading to much reduced runtime. However, BEV struggles with small objects that often have very limited feature representation, especially after being projected to the ground plane. Sparse points, on the other hand, can model objects of various sizes in 3D space but are inefficient at capturing flat surfaces or large objects. To address these shortcomings, we present BePo, which features a dual representation of BEV and sparse points. The 3D information learned in the sparse points branch is shared with the BEV stream via cross-attention, which injects learning signals of difficult objects on the BEV plane. The outputs of both branches are then fused to generate the final 3D occupancy predictions. Extensive experiments on a suite of challenging benchmarks including Occ3D-nuScenes, Occ3D-Waymo and Occ-ScanNet demonstrate the superiority of our proposed BePo. In addition, BePo carries low inference cost even when compared to the latest efficient methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BePo, a dual-representation architecture for 3D occupancy prediction that combines a Bird's Eye View (BEV) branch with a sparse-points branch. 3D cues learned in the sparse-points branch are transferred to the BEV branch via cross-attention to improve modeling of small and difficult objects; the two branch outputs are then fused to produce the final occupancy map. Experiments on Occ3D-nuScenes, Occ3D-Waymo, and Occ-ScanNet are reported to show superior accuracy together with low inference cost relative to prior efficient methods.
Significance. If the cross-attention transfer proves reliable, the dual-representation idea offers a principled way to mitigate the complementary weaknesses of pure BEV (poor small-object support after ground-plane projection) and pure sparse points (weak coverage of large flat surfaces). Demonstrating both accuracy gains and low runtime on multiple large-scale benchmarks would strengthen the case for efficient 3D occupancy pipelines in autonomous driving.
major comments (2)
- [Method (cross-attention)] The description of the cross-attention module (method section) supplies no equations or diagram specifying how queries, keys, and values are formed, whether 3D positional encodings or projection-aware masking are used, or how height information from sparse points is preserved when attending to the BEV plane. Because the central claim rests on clean transfer of 3D signals for difficult objects, this omission is load-bearing and risks unaddressed 2D-3D misalignment.
- [Experiments / Table 1] Table 1 (or equivalent main-results table) reports overall mIoU gains, yet no ablation isolates the contribution of the cross-attention link versus the dual-branch fusion alone, nor any per-class breakdown for small objects. Without these controls it is impossible to confirm that the claimed improvement for difficult objects originates from the proposed mechanism rather than from increased capacity or hyper-parameter tuning.
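The per-class breakdown requested above can be illustrated with a short sketch. It computes intersection-over-union per class on flattened voxel labels and then averages over the classes that are present. The function names and the class ids (0 = free, 1 = road, 2 = pedestrian) are hypothetical and the labels are toy data, not results from the paper.

```python
def per_class_iou(pred, target, num_classes):
    """Return one IoU per class, or None when the class is absent from both."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        ious.append(inter / union if union else None)
    return ious

def mean_iou(ious):
    """Average IoU over the classes that actually occur."""
    present = [v for v in ious if v is not None]
    return sum(present) / len(present)

# Toy flattened voxel labels (hypothetical ids: 0 free, 1 road, 2 pedestrian).
target = [0, 0, 1, 1, 1, 1, 2, 2]
pred = [0, 0, 1, 1, 1, 1, 2, 0]
ious = per_class_iou(pred, target, num_classes=3)
miou = mean_iou(ious)
```

Here the large class scores 1.0 while the small class scores 0.5, yet the mean stays above 0.7. This is why an aggregate mIoU alone cannot show whether small objects improved.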
minor comments (2)
- [Abstract] The abstract states superiority on three benchmarks but does not include even a single numeric result or runtime figure; readers must reach the tables to evaluate the magnitude of the claimed gains.
- [Method] Notation for the fusion step (e.g., how BEV and sparse-point features are combined before the final decoder) is introduced without an accompanying equation or pseudocode.
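As an illustration of what explicit fusion pseudocode might look like, the sketch below adds weighted class scores from sparse points onto dense per-voxel scores and takes the arg-max. This is one plausible scheme chosen for clarity, not the paper's fusion step, which the source does not specify. The names `voxel_index`, `fuse`, and `weight`, the dictionary layout, and the voxel size are all assumptions.

```python
def voxel_index(point, voxel_size):
    """Map a 3D point (x, y, z) to integer voxel coordinates by flooring."""
    return tuple(int(coord // voxel_size) for coord in point)

def fuse(bev_scores, points, point_scores, voxel_size, weight=1.0):
    """Add weighted sparse-point class scores onto dense BEV-derived scores.

    bev_scores maps a voxel index to a list of per-class scores.
    Points that fall outside the dense grid are ignored.
    """
    fused = {idx: list(scores) for idx, scores in bev_scores.items()}
    for point, scores in zip(points, point_scores):
        idx = voxel_index(point, voxel_size)
        if idx in fused:
            fused[idx] = [f + weight * s for f, s in zip(fused[idx], scores)]
    # The final label per voxel is the class with the highest fused score.
    return {idx: max(range(len(s)), key=s.__getitem__)
            for idx, s in fused.items()}

bev_scores = {(0, 0, 0): [0.6, 0.4], (1, 0, 0): [0.9, 0.1]}
points = [(0.2, 0.1, 0.3), (5.0, 5.0, 5.0)]   # second point lies off the grid
point_scores = [[0.0, 0.5], [0.0, 1.0]]
labels = fuse(bev_scores, points, point_scores, voxel_size=0.4)
```

In the toy input the first voxel flips from class 0 to class 1 once the point evidence is added, while the second voxel keeps its BEV-only label.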
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential of the dual-representation approach in mitigating complementary weaknesses of BEV and sparse-points representations. We address each major comment below and will revise the manuscript accordingly to improve clarity and experimental rigor.
Point-by-point responses
- Referee: [Method (cross-attention)] The description of the cross-attention module (method section) supplies no equations or diagram specifying how queries, keys, and values are formed, whether 3D positional encodings or projection-aware masking are used, or how height information from sparse points is preserved when attending to the BEV plane. Because the central claim rests on clean transfer of 3D signals for difficult objects, this omission is load-bearing and risks unaddressed 2D-3D misalignment.
  Authors: We agree that the method section would benefit from greater technical detail on the cross-attention module. In the revised manuscript we will add explicit equations defining the query, key, and value projections, describe the use of 3D positional encodings, and clarify any projection-aware masking. A new diagram will illustrate the module, and we will explain how height information from the sparse-points branch is preserved when attending to the BEV plane. These additions will make the 3D signal transfer mechanism fully transparent.
  Revision: yes
- Referee: [Experiments / Table 1] Table 1 (or equivalent main-results table) reports overall mIoU gains, yet no ablation isolates the contribution of the cross-attention link versus the dual-branch fusion alone, nor any per-class breakdown for small objects. Without these controls it is impossible to confirm that the claimed improvement for difficult objects originates from the proposed mechanism rather than from increased capacity or hyper-parameter tuning.
  Authors: We acknowledge the importance of isolating the cross-attention contribution. The revised version will include new ablation studies comparing the full BePo model against a dual-branch baseline that performs fusion without the cross-attention transfer. We will also add per-class mIoU breakdowns, with emphasis on small and difficult object categories, to demonstrate that the observed gains for these classes arise from the proposed mechanism rather than capacity alone.
  Revision: yes
Circularity Check
No significant circularity; architectural choice validated externally
full rationale
The paper describes BePo as a dual-representation architecture combining BEV and sparse points branches, with cross-attention to transfer 3D signals and subsequent fusion for occupancy output. This is presented purely as a design choice whose value is assessed via empirical results on independent external benchmarks (Occ3D-nuScenes, Occ3D-Waymo, Occ-ScanNet). No equations, derivations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the text. The central claims do not reduce to their own inputs by construction and remain self-contained against external evaluation.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: dual-branch design: a query-based sparse points branch and a BEV branch... cross-attention... fused to generate the final 3D occupancy predictions
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.