BePo: Dual Representation for 3D Occupancy Prediction

Amin Ansari; Fatih Porikli; Hong Cai; Jisoo Jeong; Shizhong Han; Yinhao Zhu; Yunxiao Shi

arxiv: 2506.07002 · v2 · submitted 2025-06-08 · 💻 cs.CV

Yunxiao Shi, Hong Cai, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Amin Ansari, Fatih Porikli

Pith reviewed 2026-05-19 11:25 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D occupancy prediction · BEV representation · sparse points · cross-attention · autonomous driving · nuScenes · Waymo

The pith

BePo combines BEV and sparse points with cross-attention to fix small-object weaknesses in 3D occupancy prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes BePo to improve 3D occupancy prediction for autonomous driving by using two scene representations at once. Bird's Eye View alone loses detail on small objects after ground-plane projection, while sparse points capture size variation well but handle large flat surfaces poorly. BePo runs both representations in parallel branches, uses cross-attention to move 3D information from the sparse-points branch into the BEV branch, and fuses the two outputs for the final occupancy map. This dual setup is intended to deliver better accuracy on difficult objects without the high cost of dense 3D volumes. Tests on Occ3D-nuScenes, Occ3D-Waymo, and Occ-ScanNet show gains over prior efficient methods at low inference cost.

Core claim

BePo features a dual representation of BEV and sparse points. The 3D information learned in the sparse points branch is shared with the BEV stream via cross-attention, which injects learning signals of difficult objects on the BEV plane. The outputs of both branches are then fused to generate the final 3D occupancy predictions.

What carries the argument

Dual representation of BEV and sparse points, with cross-attention that transfers 3D signals from the sparse branch to strengthen the BEV branch before fusion.
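The transfer step described above can be pictured with a minimal sketch of single-head scaled dot-product cross-attention, in which BEV cells act as queries and sparse-point features act as keys and values. This is an illustration under stated assumptions, not the paper's module: the function names, the residual update, and the absence of learned projections or positional encodings are all choices made here for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(bev_queries, point_keys, point_values):
    """Hypothetical single-head cross-attention from BEV cells to sparse points.

    bev_queries:  list of BEV cell feature vectors (queries)
    point_keys:   list of sparse-point feature vectors (keys)
    point_values: list of sparse-point feature vectors (values),
                  assumed to have the same length as the queries
    Returns one updated vector per BEV cell (query plus attended values).
    """
    dim = len(bev_queries[0])
    updated = []
    for q in bev_queries:
        # Similarity of this BEV cell to every sparse point.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in point_keys
        ]
        weights = softmax(scores)
        # Weighted average of the point values.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, point_values))
            for i in range(dim)
        ]
        # Residual connection: keep the BEV feature, add the 3D signal.
        updated.append([qi + mi for qi, mi in zip(q, mixed)])
    return updated
```

In this sketch every BEV cell can attend to every point. Whether the actual model restricts attention spatially is exactly the kind of detail the referee report further down asks about.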

If this is right

Small and distant objects receive stronger feature signals on the BEV plane.
Large flat surfaces remain well modeled by the BEV branch while varied-size objects are handled by the points branch.
Inference cost stays low because neither branch requires dense 3D volumes.
Final occupancy maps improve on multiple public driving and indoor benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dual-branch pattern with cross-attention sharing could be tested on other 3D perception tasks such as depth completion or instance segmentation.
Replacing cross-attention with a cheaper sharing module might preserve gains at even lower runtime.
The approach points to hybrid representations as a general route for balancing detail and efficiency in real-time 3D vision.

Load-bearing premise

Cross-attention will move useful 3D signals from sparse points to the BEV branch without introducing noise or alignment errors that degrade the fused result.

What would settle it

Ablating the cross-attention link on the same benchmarks and measuring whether small-object accuracy falls back to single-branch BEV levels.
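Such an ablation would be scored with per-class IoU, so that small-object classes can be compared separately from the overall mean. The sketch below is a generic illustration over flat lists of voxel labels; the function names are hypothetical, and benchmark-specific details such as visibility masks or ignored labels are deliberately left out.

```python
def per_class_iou(pred, target, num_classes):
    """IoU per class over flat lists of predicted and ground-truth voxel labels.

    Classes that appear in neither list are omitted from the result.
    """
    ious = {}
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious[c] = inter / union
    return ious


def mean_iou(ious, classes=None):
    """Mean IoU over all scored classes, or over a chosen subset of classes."""
    keys = list(ious) if classes is None else [c for c in classes if c in ious]
    if not keys:
        return None
    return sum(ious[c] for c in keys) / len(keys)
```

Passing a subset such as the small-object class indices to `mean_iou` gives the number that would be compared between the full model and the variant without the cross-attention link.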

Figures

Figures reproduced from arXiv: 2506.07002 by Amin Ansari, Fatih Porikli, Hong Cai, Jisoo Jeong, Shizhong Han, Yinhao Zhu, Yunxiao Shi.

**Figure 1.** Accuracy (mIoU on Occ3D-nuScenes [4, 37]) vs. inference latency (ms) measured on a single NVIDIA A100 GPU. BePo outperforms previous methods while maintaining competitive inference latency.

**Figure 2.** Overview of our proposed BePo. First, an image backbone (… [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

**Figure 3.** Example qualitative 3D semantic occupancy prediction of BePo on Occ3D-nuScenes validation set. Cons. Veh stands for … [PITH_FULL_IMAGE:figures/full_fig_p006_3.png]

**Figure 4.** Example qualitative comparison between BePo and FlashOcc [… [PITH_FULL_IMAGE:figures/full_fig_p007_4.png]

Original abstract

3D occupancy infers fine-grained 3D geometry and semantics, which is critical for autonomous driving. Most existing approaches carry high compute costs, requiring dense 3D feature volumes and cross-attention to effectively aggregate information. More efficient methods adopt Bird's Eye View (BEV) or sparse points as the scene representation, leading to much reduced runtime. However, BEV struggles with small objects that often have very limited feature representation, especially after being projected to the ground plane. Sparse points, on the other hand, can model objects of various sizes in 3D space, but are inefficient at capturing flat surfaces or large objects. To address these shortcomings, we present BePo, which features a dual representation of BEV and sparse points. The 3D information learned in the sparse points branch is shared with the BEV stream via cross-attention, which injects learning signals of difficult objects on the BEV plane. The outputs of both branches are then fused to generate the final 3D occupancy predictions. Extensive experiments on a suite of challenging benchmarks including Occ3D-nuScenes, Occ3D-Waymo and Occ-ScanNet demonstrate the superiority of our proposed BePo. In addition, BePo carries low inference cost even when compared to the latest efficient methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

BePo pairs BEV with a sparse-points branch and cross-attention to feed 3D cues back for small objects, then fuses the outputs, with reported gains on three benchmarks at low cost.

The main takeaway is that BePo runs a BEV stream alongside a sparse-points stream, routes 3D information from the points branch into BEV features through cross-attention to help with objects that lose detail after ground-plane projection, and then combines the two outputs for the final occupancy map. This directly targets the complementary weaknesses the abstract lays out: BEV is efficient but weak on small or thin structures, while sparse points keep height and size variation but fall short on large flat surfaces. The cross-attention step is the concrete mechanism offered to share the missing signals without reverting to dense 3D volumes.

The paper shows this on Occ3D-nuScenes, Occ3D-Waymo, and Occ-ScanNet, claiming better results than recent efficient baselines while keeping inference cost competitive. That combination of accuracy lift and speed is the practical part worth noting for driving perception work.

The soft spots sit in the missing details around implementation and evidence. The description does not include the actual benchmark tables, ablation numbers, or error breakdowns, so the scale of improvement is difficult to judge from the abstract alone. More critically, the cross-attention needs to manage the 2D-3D mismatch between BEV plane features and 3D points; without explicit positional encodings, height preservation, or projection-aware masking, attention can latch onto misaligned or noisy pairs, especially for distant or tiny objects. The stress-test concern about clean signal transfer is reasonable and should be checked against the full architecture diagrams and equations.

This work is aimed at researchers and engineers already working on efficient 3D occupancy for autonomous driving. Someone familiar with BEV and sparse methods will see the value in the fusion strategy and the multi-dataset results. It shows straightforward engagement with the trade-offs in the literature and rests on standard benchmarks rather than circular claims.

I would bring it to a reading group to walk through the attention construction. It deserves peer review so the community can verify the gains and the alignment handling under closer inspection.

Referee Report

2 major / 2 minor

Summary. The paper introduces BePo, a dual-representation architecture for 3D occupancy prediction that combines a Bird's Eye View (BEV) branch with a sparse-points branch. 3D cues learned in the sparse-points branch are transferred to the BEV branch via cross-attention to improve modeling of small and difficult objects; the two branch outputs are then fused to produce the final occupancy map. Experiments on Occ3D-nuScenes, Occ3D-Waymo, and Occ-ScanNet are reported to show superior accuracy together with low inference cost relative to prior efficient methods.

Significance. If the cross-attention transfer proves reliable, the dual-representation idea offers a principled way to mitigate the complementary weaknesses of pure BEV (poor small-object support after ground-plane projection) and pure sparse points (weak coverage of large flat surfaces). Demonstrating both accuracy gains and low runtime on multiple large-scale benchmarks would strengthen the case for efficient 3D occupancy pipelines in autonomous driving.

major comments (2)

[Method (cross-attention)] The description of the cross-attention module (method section) supplies no equations or diagram specifying how queries, keys, and values are formed, whether 3D positional encodings or projection-aware masking are used, or how height information from sparse points is preserved when attending to the BEV plane. Because the central claim rests on clean transfer of 3D signals for difficult objects, this omission is load-bearing and risks unaddressed 2D-3D misalignment.
[Experiments / Table 1] Table 1 (or equivalent main-results table) reports overall mIoU gains, yet no ablation isolates the contribution of the cross-attention link versus the dual-branch fusion alone, nor any per-class breakdown for small objects. Without these controls it is impossible to confirm that the claimed improvement for difficult objects originates from the proposed mechanism rather than from increased capacity or hyper-parameter tuning.
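The alignment question raised in the first major comment can be made concrete with a small sketch of how 3D points might be associated with BEV cells while keeping their height. Everything here is a hypothetical illustration (the function names, grid parameters, and the choice to carry `z` alongside the point index are assumptions), and it is not a description of what BePo does.

```python
import math


def point_to_bev_cell(x, y, x_min, y_min, cell_size, grid_w, grid_h):
    """Map a point's ground-plane position to a (row, col) BEV cell.

    Returns None when the point falls outside the grid.
    """
    col = math.floor((x - x_min) / cell_size)
    row = math.floor((y - y_min) / cell_size)
    if 0 <= col < grid_w and 0 <= row < grid_h:
        return row, col
    return None


def group_points_by_cell(points, x_min, y_min, cell_size, grid_w, grid_h):
    """Group (x, y, z) points by BEV cell, keeping point index and height.

    The retained z value is what a height-aware positional encoding or a
    projection-aware attention mask could be built from.
    """
    cells = {}
    for idx, (x, y, z) in enumerate(points):
        cell = point_to_bev_cell(x, y, x_min, y_min, cell_size, grid_w, grid_h)
        if cell is not None:
            cells.setdefault(cell, []).append((idx, z))
    return cells
```

A mapping like this would let attention be restricted to points that project into or near a given cell, which is one way the misalignment risk named by the referee could be limited.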

minor comments (2)

[Abstract] The abstract states superiority on three benchmarks but does not include even a single numeric result or runtime figure; readers must reach the tables to evaluate the magnitude of the claimed gains.
[Method] Notation for the fusion step (e.g., how BEV and sparse-point features are combined before the final decoder) is introduced without an accompanying equation or pseudocode.
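For the fusion step mentioned in the last minor comment, one plausible reading is a per-voxel combination of class logits from the two branches. The sketch below assumes additive fusion with a scalar weight, which is a guess made purely for illustration; the names `fuse_logits` and `predict_classes` are hypothetical and the paper may combine the branches differently.

```python
def fuse_logits(bev_logits, point_preds, weight=1.0):
    """Add sparse-branch class logits onto dense BEV-branch logits per voxel.

    bev_logits:  dict mapping voxel index (i, j, k) -> list of class logits
    point_preds: list of (voxel index, list of class logits) from the
                 sparse-points branch; voxels outside the grid are skipped
    weight:      assumed scalar weight on the sparse-branch contribution
    """
    fused = {voxel: list(logits) for voxel, logits in bev_logits.items()}
    for voxel, logits in point_preds:
        if voxel in fused:
            fused[voxel] = [a + weight * b for a, b in zip(fused[voxel], logits)]
    return fused


def predict_classes(fused):
    """Pick the highest-scoring class index for every voxel."""
    return {
        voxel: max(range(len(logits)), key=logits.__getitem__)
        for voxel, logits in fused.items()
    }
```

With this form, a voxel that the BEV branch scores weakly can still be flipped to the correct class when a sparse point lands in it with a confident prediction, while voxels untouched by any point keep the BEV decision.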

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of the dual-representation approach in mitigating complementary weaknesses of BEV and sparse-points representations. We address each major comment below and will revise the manuscript accordingly to improve clarity and experimental rigor.

Referee: [Method (cross-attention)] The description of the cross-attention module (method section) supplies no equations or diagram specifying how queries, keys, and values are formed, whether 3D positional encodings or projection-aware masking are used, or how height information from sparse points is preserved when attending to the BEV plane. Because the central claim rests on clean transfer of 3D signals for difficult objects, this omission is load-bearing and risks unaddressed 2D-3D misalignment.

Authors: We agree that the method section would benefit from greater technical detail on the cross-attention module. In the revised manuscript we will add explicit equations defining the query, key, and value projections, describe the use of 3D positional encodings, and clarify any projection-aware masking. A new diagram will illustrate the module, and we will explain how height information from the sparse-points branch is preserved when attending to the BEV plane. These additions will make the 3D signal transfer mechanism fully transparent.

revision: yes

Referee: [Experiments / Table 1] Table 1 (or equivalent main-results table) reports overall mIoU gains, yet no ablation isolates the contribution of the cross-attention link versus the dual-branch fusion alone, nor any per-class breakdown for small objects. Without these controls it is impossible to confirm that the claimed improvement for difficult objects originates from the proposed mechanism rather than from increased capacity or hyper-parameter tuning.

Authors: We acknowledge the importance of isolating the cross-attention contribution. The revised version will include new ablation studies comparing the full BePo model against a dual-branch baseline that performs fusion without the cross-attention transfer. We will also add per-class mIoU breakdowns, with emphasis on small and difficult object categories, to demonstrate that the observed gains for these classes arise from the proposed mechanism rather than capacity alone.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; architectural choice validated externally

The paper describes BePo as a dual-representation architecture combining BEV and sparse points branches, with cross-attention to transfer 3D signals and subsequent fusion for occupancy output. This is presented purely as a design choice whose value is assessed via empirical results on independent external benchmarks (Occ3D-nuScenes, Occ3D-Waymo, Occ-ScanNet). No equations, derivations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the text. The central claims do not reduce to their own inputs by construction and remain self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not introduce or rely on any explicit free parameters, mathematical axioms, or newly invented physical entities; the contribution is an architectural design whose validity rests on empirical benchmark results.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

dual-branch design: a query-based sparse points branch and a BEV branch... cross-attention... fused to generate the final 3D occupancy predictions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 4 internal anchors

[1] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9297–9307,
[2] Maxim Berman, Amal Rannen Triki, and Matthew B Blaschko. The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4413–4421,
[3] Aljaz Bozic, Pablo Palafox, Justus Thies, Angela Dai, and Matthias Nießner. Transformerfusion: Monocular rgb scene reconstruction using transformers. Advances in Neural Information Processing Systems, 34:1403–1414, 2021.
[4] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multi-modal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
[5] Anh-Quan Cao and Raoul De Charette. Monoscene: Monocular 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3991–4001, 2022.
[6] Anh-Quan Cao and Raoul de Charette. Scenerf: Self-supervised monocular 3d scene reconstruction with radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9387–9398, 2023.
[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
[8] OpenScene Contributors. Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving. https://github.com/OpenDriveLab/OpenScene, 2023.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[10] Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 605–613, 2017.
[11] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition, pages 3354–3361. IEEE, 2012.
[12] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237,
[13] Adam W Harley, Zhaoyuan Fang, Jie Li, Rares Ambrus, and Katerina Fragkiadaki. Simple-bev: What really matters for multi-sensor bev perception? In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2759–2765. IEEE, 2023.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[15] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790,
[16] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9223–9232, 2023.
[17] Yuanhui Huang, Wenzhao Zheng, Borui Zhang, Jie Zhou, and Jiwen Lu. Selfocc: Self-supervised vision-based 3d occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19946–19956, 2024.
[18] Pierre-David Letourneau, Manish Kumar Singh, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Dalton Jones, Matthew Harper Langston, Hong Cai, and Fatih Porikli. Padre: A unifying polynomial attention drop-in replacement for efficient vision transformer. International Conference on Learning Representations (ICLR), 2025.
[19] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. In European conference on computer vision, pages 1–18. Springer, 2022.
[20] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
[21] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, 2017.
[22] Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, and Zhizhong Su. Sparse4d: Multi-view 3d object detection with sparse spatial-temporal fusion. arXiv preprint arXiv:2211.10581, 2022.
[23] Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, and Limin Wang. Sparsebev: High-performance sparse 3d object detection from multi-camera videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18580–18590, 2023.
[24] Haisong Liu, Haiguang Wang, Yang Chen, Zetong Yang, Jia Zeng, Li Chen, and Limin Wang. Fully sparse 3d panoptic occupancy prediction. In Proceedings of the European Conference on Computer Vision, 2024.
[25] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. arXiv preprint arXiv:2203.05625, 2022.
[26] Yili Liu, Linzhan Mou, Xuan Yu, Chenrui Han, Sitong Mao, Rong Xiong, and Yue Wang. Let occ flow: Self-supervised 3d occupancy flow prediction. The Conference on Robot Learning (CoRL), 2024.
[27] I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[28] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[29] Zak Murez, Tarrence Van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3d scene reconstruction from posed images. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, pages 414–431. Springer, 2020.
[30] Mingjie Pan, Jiaming Liu, Renrui Zhang, Peixiang Huang, Xiaoqi Li, Hongwei Xie, Bing Wang, Li Liu, and Shanghang Zhang. Renderocc: Vision-centric 3d occupancy prediction with 2d rendering supervision. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 12404–12411. IEEE, 2024.
[31] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 194–210. Springer, 2020.
[32] Yunxiao Shi, Hong Cai, Amin Ansari, and Fatih Porikli. Ega-depth: Efficient guided attention for self-supervised multi-camera depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 119–129, 2023.
[33] Yunxiao Shi, Manish Kumar Singh, Hong Cai, and Fatih Porikli. Decotr: Enhancing depth completion with 2d and 3d attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10736–10746, 2024.
[34] Yunxiao Shi, Hong Cai, Amin Ansari, and Fatih Porikli. H3o: Hyper-efficient 3d occupancy prediction with heterogeneous supervision. 2025.
[35] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020.
[36] Pin Tang, Zhongdao Wang, Guoqing Wang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, and Chao Ma. Sparseocc: Rethinking sparse latent representation for vision-based semantic occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15035–15044, 2024.
[37] Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving. Advances in Neural Information Processing Systems, 36, 2024.
[38] Wenwen Tong, Chonghao Sima, Tai Wang, Li Chen, Silei Wu, Hanming Deng, Yi Gu, Lewei Lu, Ping Luo, Dahua Lin, et al. Scene as occupancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8406–8415, 2023.
[39] A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
[40] Jiabao Wang, Zhaojiang Liu, Qiang Meng, Liujiang Yan, Ke Wang, Jie Yang, Wei Liu, Qibin Hou, and Mingming Cheng. Opus: Occupancy prediction using a sparse set. In Advances in Neural Information Processing Systems, 2024.
[41] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022.
[42] Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21729–21740, 2023.
[43] Yuan Wu, Zhiqiang Yan, Zhengxue Wang, Xiang Li, Le Hui, and Jian Yang. Deep height decoupling for precise vision-based 3d occupancy prediction. arXiv preprint arXiv:2409.07972, 2024.
[44] Rajeev Yasarla, Hong Cai, Jisoo Jeong, Yunxiao Shi, Risheek Garrepalli, and Fatih Porikli. Mamo: Leveraging memory and attention for monocular video depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8754–8764, 2023.
[45] Rajeev Yasarla, Manish Kumar Singh, Hong Cai, Yunxiao Shi, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Risheek Garrepalli, and Fatih Porikli. Futuredepth: Learning to predict the future improves video depth estimation. Proceedings of the European Conference on Computer Vision, 2024.
[46] Zichen Yu, Changyong Shu, Jiajun Deng, Kangjie Lu, Zongdai Liu, Jiangyong Yu, Dawei Yang, Hui Li, and Yan Chen. Flashocc: Fast and memory-efficient occupancy prediction via channel-to-height plugin. arXiv preprint arXiv:2311.12058, 2023.
[47] Chubin Zhang, Juncheng Yan, Yi Wei, Jiaxin Li, Li Liu, Yansong Tang, Yueqi Duan, and Jiwen Lu. Occnerf: Self-supervised multi-camera occupancy prediction with neural radiance fields. arXiv preprint arXiv:2312.09243, 2023.
[48] Yunpeng Zhang, Zheng Zhu, and Dalong Du. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9433–9443,

[1] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9297–9307,

[2] Maxim Berman, Amal Rannen Triki, and Matthew B Blaschko. The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4413–4421,

[3] Aljaz Bozic, Pablo Palafox, Justus Thies, Angela Dai, and Matthias Nießner. Transformerfusion: Monocular rgb scene reconstruction using transformers. Advances in Neural Information Processing Systems, 34:1403–1414, 2021. 5

[4] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multi-modal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020. 1, 2, 5, 7

[5] Anh-Quan Cao and Raoul De Charette. Monoscene: Monocular 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3991–4001, 2022. 2, 5

[6] Anh-Quan Cao and Raoul de Charette. Scenerf: Self-supervised monocular 3d scene reconstruction with radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9387–9398, 2023. 2

[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020. 2, 3

[8] OpenScene Contributors. Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving. https://github.com/OpenDriveLab/OpenScene, 2023. 2

[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 2

[10] Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 605–613, 2017. 4

[11] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition, pages 3354–3361. IEEE, 2012. 2

[12] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237,

[13] Adam W Harley, Zhaoyuan Fang, Jie Li, Rares Ambrus, and Katerina Fragkiadaki. Simple-bev: What really matters for multi-sensor bev perception? In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2759–2765. IEEE, 2023. 1, 2

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 3, 5

[15] Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790,

[16] Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9223–9232, 2023. 2, 5, 7, 8

[17] Yuanhui Huang, Wenzhao Zheng, Borui Zhang, Jie Zhou, and Jiwen Lu. Selfocc: Self-supervised vision-based 3d occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19946–19956, 2024. 2

[18] Pierre-David Letourneau, Manish Kumar Singh, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Dalton Jones, Matthew Harper Langston, Hong Cai, and Fatih Porikli. Padre: A unifying polynomial attention drop-in replacement for efficient vision transformer. International Conference on Learning Representations (ICLR), 2025. 2

[19] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In European conference on computer vision, pages 1–18. Springer, 2022. 2, 3, 5, 7, 8

[20] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017. 4

[21] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, 2017. 4

[22] Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, and Zhizhong Su. Sparse4d: Multi-view 3d object detection with sparse spatial-temporal fusion. arXiv preprint arXiv:2211.10581, 2022. 2

[23] Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, and Limin Wang. Sparsebev: High-performance sparse 3d object detection from multi-camera videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18580–18590, 2023. 1, 2, 3, 4

[24] Haisong Liu, Haiguang Wang, Yang Chen, Zetong Yang, Jia Zeng, Li Chen, and Limin Wang. Fully sparse 3d panoptic occupancy prediction. In Proceedings of the European Conference on Computer Vision, 2024. 2

[25] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. arXiv preprint arXiv:2203.05625, 2022. 2

[26] Yili Liu, Linzhan Mou, Xuan Yu, Chenrui Han, Sitong Mao, Rong Xiong, and Yue Wang. Let occ flow: Self-supervised 3d occupancy flow prediction. The Conference on Robot Learning (CoRL), 2024. 2

[27] I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 5

[28] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016. 6

[29] Zak Murez, Tarrence Van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3d scene reconstruction from posed images. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, pages 414–431. Springer, 2020. 5

[30] Mingjie Pan, Jiaming Liu, Renrui Zhang, Peixiang Huang, Xiaoqi Li, Hongwei Xie, Bing Wang, Li Liu, and Shanghang Zhang. Renderocc: Vision-centric 3d occupancy prediction with 2d rendering supervision. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 12404–12411. IEEE, 2024. 1, 2, 5, 7, 8

[31] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 194–210. Springer, 2020. 4

[32] Yunxiao Shi, Hong Cai, Amin Ansari, and Fatih Porikli. Ega-depth: Efficient guided attention for self-supervised multi-camera depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 119–129, 2023. 1

[33] Yunxiao Shi, Manish Kumar Singh, Hong Cai, and Fatih Porikli. Decotr: Enhancing depth completion with 2d and 3d attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10736–10746, 2024. 1

[34] Yunxiao Shi, Hong Cai, Amin Ansari, and Fatih Porikli. H3o: Hyper-efficient 3d occupancy prediction with heterogeneous supervision. 2025. 2

[35] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020. 2, 5

[36] Pin Tang, Zhongdao Wang, Guoqing Wang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, and Chao Ma. Sparseocc: Rethinking sparse latent representation for vision-based semantic occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15035–15044, 2024. 2, 5

[37] Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving. Advances in Neural Information Processing Systems, 36, 2024. 1, 2, 5, 7

[38] Wenwen Tong, Chonghao Sima, Tai Wang, Li Chen, Silei Wu, Hanming Deng, Yi Gu, Lewei Lu, Ping Luo, Dahua Lin, et al. Scene as occupancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8406–8415, 2023. 2

[39] A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017. 2, 4

[40] Jiabao Wang, Zhaojiang Liu, Qiang Meng, Liujiang Yan, Ke Wang, Jie Yang, Wei Liu, Qibin Hou, and Mingming Cheng. Opus: Occupancy prediction using a sparse set. In Advances in Neural Information Processing Systems, 2024. 2, 3, 4, 5, 7, 8

[41] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022. 1, 2, 3

[42] Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21729–21740, 2023. 1, 2, 3, 7, 8

[43] Yuan Wu, Zhiqiang Yan, Zhengxue Wang, Xiang Li, Le Hui, and Jian Yang. Deep height decoupling for precise vision-based 3d occupancy prediction. arXiv preprint arXiv:2409.07972, 2024. 2

[44] Rajeev Yasarla, Hong Cai, Jisoo Jeong, Yunxiao Shi, Risheek Garrepalli, and Fatih Porikli. Mamo: Leveraging memory and attention for monocular video depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8754–8764, 2023. 1

[45] Rajeev Yasarla, Manish Kumar Singh, Hong Cai, Yunxiao Shi, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Risheek Garrepalli, and Fatih Porikli. Futuredepth: Learning to predict the future improves video depth estimation. Proceedings of the European Conference on Computer Vision, 2024. 1

[46] Zichen Yu, Changyong Shu, Jiajun Deng, Kangjie Lu, Zongdai Liu, Jiangyong Yu, Dawei Yang, Hui Li, and Yan Chen. Flashocc: Fast and memory-efficient occupancy prediction via channel-to-height plugin. arXiv preprint arXiv:2311.12058, 2023. 1, 2, 3, 5, 6, 7, 8

[47] Chubin Zhang, Juncheng Yan, Yi Wei, Jiaxin Li, Li Liu, Yansong Tang, Yueqi Duan, and Jiwen Lu. Occnerf: Self-supervised multi-camera occupancy prediction with neural radiance fields. arXiv preprint arXiv:2312.09243, 2023. 2

[48] Yunpeng Zhang, Zheng Zhu, and Dalong Du. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9433–9443,