arxiv: 2506.07002 · v2 · submitted 2025-06-08 · 💻 cs.CV

BePo: Dual Representation for 3D Occupancy Prediction

Pith reviewed 2026-05-19 11:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D occupancy prediction · BEV representation · sparse points · cross-attention · autonomous driving · nuScenes · Waymo

The pith

BePo combines BEV and sparse points with cross-attention to fix small-object weaknesses in 3D occupancy prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes BePo to improve 3D occupancy prediction for autonomous driving by using two scene representations at once. Bird's Eye View alone loses detail on small objects after ground-plane projection, while sparse points capture size variation well but handle large flat surfaces poorly. BePo runs both representations in parallel branches, uses cross-attention to move 3D information from the sparse-points branch into the BEV branch, and fuses the two outputs for the final occupancy map. This dual setup is intended to deliver better accuracy on difficult objects without the high cost of dense 3D volumes. Tests on Occ3D-nuScenes, Occ3D-Waymo, and Occ-ScanNet show gains over prior efficient methods at low inference cost.

Core claim

BePo features a dual representation of BEV and sparse points. The 3D information learned in the sparse points branch is shared with the BEV stream via cross-attention, which injects learning signals of difficult objects on the BEV plane. The outputs of both branches are then fused to generate the final 3D occupancy predictions.
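
The transfer described above can be illustrated with a minimal sketch of single-head scaled dot-product cross-attention, in which BEV cell features act as queries and sparse-point features act as keys and values. This is not the authors' module: the function name `cross_attend`, the identity Q/K/V projections, the residual update, and the toy inputs are all assumptions made for illustration, and positional encodings are not modeled.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(bev_feats, point_feats):
    """Update each BEV cell feature with an attention-weighted sum of point features.

    bev_feats:   list of feature vectors used as queries, each of length d
    point_feats: list of feature vectors used as keys and values, each of length d
    Learned Q/K/V projections are omitted (identity) to keep the sketch short.
    """
    d = len(bev_feats[0])
    scale = 1.0 / math.sqrt(d)
    updated = []
    for q in bev_feats:
        # Scaled dot-product similarity between this BEV cell and every point.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in point_feats]
        weights = softmax(scores)
        # Attention-weighted sum of the point features (the values).
        context = [sum(w * v[j] for w, v in zip(weights, point_feats)) for j in range(d)]
        # Residual update (an assumption): the BEV feature keeps its own content.
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated

# Toy example: 2 BEV cells, 3 sparse points, feature dimension 2.
bev = [[1.0, 0.0], [0.0, 1.0]]
points = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
fused_bev = cross_attend(bev, points)
```

In the toy example each BEV cell attends most strongly to the point whose feature is aligned with it, so the matching channel is boosted more than the other one.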

What carries the argument

Dual representation of BEV and sparse points, with cross-attention that transfers 3D signals from the sparse branch to strengthen the BEV branch before fusion.

If this is right

  • Small and distant objects receive stronger feature signals on the BEV plane.
  • Large flat surfaces remain well modeled by the BEV branch while varied-size objects are handled by the points branch.
  • Inference cost stays low because neither branch requires dense 3D volumes.
  • Final occupancy maps improve on multiple public driving and indoor benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dual-branch pattern with cross-attention sharing could be tested on other 3D perception tasks such as depth completion or instance segmentation.
  • Replacing cross-attention with a cheaper sharing module might preserve gains at even lower runtime.
  • The approach points to hybrid representations as a general route for balancing detail and efficiency in real-time 3D vision.

Load-bearing premise

Cross-attention will move useful 3D signals from sparse points to the BEV branch without introducing noise or alignment errors that degrade the fused result.

What would settle it

Ablating the cross-attention link on the same benchmarks and measuring whether small-object accuracy falls back to single-branch BEV levels.
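
A minimal sketch of how such an ablation could be scored, assuming flattened per-voxel class labels and a hypothetical class index standing in for a small-object category. The label lists are toy values, not benchmark results, and the helper names `per_class_iou` and `mean_iou` are illustrative; benchmark protocols may additionally apply visibility masks that are not modeled here.

```python
def per_class_iou(pred, target, num_classes):
    """Return the IoU of each class over flat label lists; None if a class never appears."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        ious.append(inter / union if union else None)
    return ious

def mean_iou(ious):
    """Average the per-class IoUs, skipping classes that never appear."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid) if valid else None

SMALL_OBJECT_CLASS = 2  # hypothetical index for a small-object category

# Toy voxel labels (assumed): ground truth and two model variants.
target = [0, 0, 1, 1, 2, 2]
with_cross_attention = [0, 0, 1, 1, 2, 0]
without_cross_attention = [0, 0, 1, 1, 0, 0]

iou_with = per_class_iou(with_cross_attention, target, num_classes=3)
iou_without = per_class_iou(without_cross_attention, target, num_classes=3)

# The quantity the ablation cares about: the change on the small-object class.
small_object_gain = iou_with[SMALL_OBJECT_CLASS] - iou_without[SMALL_OBJECT_CLASS]
```

Reporting the per-class gap alongside the overall mean keeps a small-object regression from being hidden by classes that cover many more voxels.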

Figures

Figures reproduced from arXiv: 2506.07002 by Amin Ansari, Fatih Porikli, Hong Cai, Jisoo Jeong, Shizhong Han, Yinhao Zhu, Yunxiao Shi.

Figure 1: Accuracy (mIoU on Occ3D-nuScenes [4, 37]) vs. inference latency (ms) measured on a single NVIDIA A100 GPU. BePo outperforms previous methods while maintaining competitive inference latency.
Figure 2: Overview of our proposed BePo. First, an image backbone ( …
Figure 3: Example qualitative 3D semantic occupancy prediction of BePo on Occ3D-nuScenes validation set. Cons. Veh stands for …
Figure 4: Example qualitative comparison between BePo and FlashOcc [ …
Original abstract

3D occupancy infers fine-grained 3D geometry and semantics, which is critical for autonomous driving. Most existing approaches carry high compute costs, requiring dense 3D feature volumes and cross-attention to effectively aggregate information. More efficient methods adopt Bird's Eye View (BEV) or sparse points as the scene representation, leading to much reduced runtime. However, BEV struggles with small objects that often have very limited feature representation, especially after being projected to the ground plane. Sparse points, on the other hand, can model objects of various sizes in 3D space, but are inefficient at capturing flat surfaces or large objects. To address these shortcomings, we present BePo, which features a dual representation of BEV and sparse points. The 3D information learned in the sparse points branch is shared with the BEV stream via cross-attention, which injects learning signals of difficult objects on the BEV plane. The outputs of both branches are then fused to generate the final 3D occupancy predictions. Extensive experiments on a suite of challenging benchmarks including Occ3D-nuScenes, Occ3D-Waymo and Occ-ScanNet demonstrate the superiority of our proposed BePo. In addition, BePo carries low inference cost even when compared to the latest efficient methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces BePo, a dual-representation architecture for 3D occupancy prediction that combines a Bird's Eye View (BEV) branch with a sparse-points branch. 3D cues learned in the sparse-points branch are transferred to the BEV branch via cross-attention to improve modeling of small and difficult objects; the two branch outputs are then fused to produce the final occupancy map. Experiments on Occ3D-nuScenes, Occ3D-Waymo, and Occ-ScanNet are reported to show superior accuracy together with low inference cost relative to prior efficient methods.

Significance. If the cross-attention transfer proves reliable, the dual-representation idea offers a principled way to mitigate the complementary weaknesses of pure BEV (poor small-object support after ground-plane projection) and pure sparse points (weak coverage of large flat surfaces). Demonstrating both accuracy gains and low runtime on multiple large-scale benchmarks would strengthen the case for efficient 3D occupancy pipelines in autonomous driving.

major comments (2)
  1. [Method (cross-attention)] The description of the cross-attention module (method section) supplies no equations or diagram specifying how queries, keys, and values are formed, whether 3D positional encodings or projection-aware masking are used, or how height information from sparse points is preserved when attending to the BEV plane. Because the central claim rests on clean transfer of 3D signals for difficult objects, this omission is load-bearing and risks unaddressed 2D-3D misalignment.
  2. [Experiments / Table 1] Table 1 (or equivalent main-results table) reports overall mIoU gains, yet no ablation isolates the contribution of the cross-attention link versus the dual-branch fusion alone, nor any per-class breakdown for small objects. Without these controls it is impossible to confirm that the claimed improvement for difficult objects originates from the proposed mechanism rather than from increased capacity or hyper-parameter tuning.
minor comments (2)
  1. [Abstract] The abstract states superiority on three benchmarks but does not include even a single numeric result or runtime figure; readers must reach the tables to evaluate the magnitude of the claimed gains.
  2. [Method] Notation for the fusion step (e.g., how BEV and sparse-point features are combined before the final decoder) is introduced without an accompanying equation or pseudocode.
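
To make minor comment 2 concrete, the following is a sketch of one possible fusion rule: additive fusion of class logits on a voxel grid. The source text does not specify how BePo fuses the two branches, so everything here is an assumption made for illustration: the function name `fuse_occupancy`, the `[X][Y][Z][C]` logit layout, the voxelization of points by floor division, and the toy values.

```python
import math

def fuse_occupancy(bev_logits, points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Add sparse-point class logits into a dense logit grid, then take the argmax.

    bev_logits: nested list [X][Y][Z][C] of class logits from the BEV branch
    points:     list of ((x, y, z), [C logits]) pairs from the sparse-points branch
    Returns a nested list [X][Y][Z] of predicted class indices.
    """
    size_x, size_y, size_z = len(bev_logits), len(bev_logits[0]), len(bev_logits[0][0])
    # Copy the grid so the BEV logits are not modified in place.
    fused = [[[list(cell) for cell in column] for column in plane] for plane in bev_logits]
    for position, logits in points:
        # Map the metric position to integer voxel indices.
        i, j, k = (math.floor((p - o) / voxel_size) for p, o in zip(position, origin))
        # Points that fall outside the grid are ignored.
        if 0 <= i < size_x and 0 <= j < size_y and 0 <= k < size_z:
            fused[i][j][k] = [a + b for a, b in zip(fused[i][j][k], logits)]
    return [[[max(range(len(cell)), key=cell.__getitem__) for cell in column]
             for column in plane] for plane in fused]

# Toy grid of 2 x 1 x 1 voxels with 2 classes (0 = free, 1 = occupied).
bev_logits = [[[[1.0, 0.0]]], [[[1.0, 0.0]]]]
# One in-range point voting for "occupied" and one point outside the grid.
sparse_points = [((1.5, 0.5, 0.5), [0.0, 3.0]), ((5.0, 0.5, 0.5), [0.0, 3.0])]
labels = fuse_occupancy(bev_logits, sparse_points, voxel_size=1.0)
```

An equation or pseudocode at this level of detail in the manuscript would let readers tell whether the points branch overrides, reweights, or merely supplements the BEV prediction.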

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of the dual-representation approach in mitigating complementary weaknesses of BEV and sparse-points representations. We address each major comment below and will revise the manuscript accordingly to improve clarity and experimental rigor.

  1. Referee: [Method (cross-attention)] The description of the cross-attention module (method section) supplies no equations or diagram specifying how queries, keys, and values are formed, whether 3D positional encodings or projection-aware masking are used, or how height information from sparse points is preserved when attending to the BEV plane. Because the central claim rests on clean transfer of 3D signals for difficult objects, this omission is load-bearing and risks unaddressed 2D-3D misalignment.

    Authors: We agree that the method section would benefit from greater technical detail on the cross-attention module. In the revised manuscript we will add explicit equations defining the query, key, and value projections, describe the use of 3D positional encodings, and clarify any projection-aware masking. A new diagram will illustrate the module, and we will explain how height information from the sparse-points branch is preserved when attending to the BEV plane. These additions will make the 3D signal transfer mechanism fully transparent. revision: yes

  2. Referee: [Experiments / Table 1] Table 1 (or equivalent main-results table) reports overall mIoU gains, yet no ablation isolates the contribution of the cross-attention link versus the dual-branch fusion alone, nor any per-class breakdown for small objects. Without these controls it is impossible to confirm that the claimed improvement for difficult objects originates from the proposed mechanism rather than from increased capacity or hyper-parameter tuning.

    Authors: We acknowledge the importance of isolating the cross-attention contribution. The revised version will include new ablation studies comparing the full BePo model against a dual-branch baseline that performs fusion without the cross-attention transfer. We will also add per-class mIoU breakdowns, with emphasis on small and difficult object categories, to demonstrate that the observed gains for these classes arise from the proposed mechanism rather than capacity alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; architectural choice validated externally

The paper describes BePo as a dual-representation architecture combining BEV and sparse points branches, with cross-attention to transfer 3D signals and subsequent fusion for occupancy output. This is presented purely as a design choice whose value is assessed via empirical results on independent external benchmarks (Occ3D-nuScenes, Occ3D-Waymo, Occ-ScanNet). No equations, derivations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the text. The central claims do not reduce to their own inputs by construction and remain self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not introduce or rely on any explicit free parameters, mathematical axioms, or newly invented physical entities; the contribution is an architectural design whose validity rests on empirical benchmark results.

pith-pipeline@v0.9.0 · 5770 in / 1084 out tokens · 26284 ms · 2026-05-19T11:25:41.882903+00:00 · methodology



Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 4 internal anchors

  1. [1]

    Semantickitti: A dataset for semantic scene understanding of lidar sequences

    Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9297–9307,

  2. [2]

    The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks

    Maxim Berman, Amal Rannen Triki, and Matthew B Blaschko. The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4413–4421,

  3. [3]

    Transformerfusion: Monocular rgb scene reconstruction using transformers

    Aljaz Bozic, Pablo Palafox, Justus Thies, Angela Dai, and Matthias Nießner. Transformerfusion: Monocular rgb scene reconstruction using transformers. Advances in Neural Information Processing Systems, 34:1403–1414, 2021.

  4. [4]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.

  5. [5]

    Monoscene: Monocular 3d semantic scene completion

    Anh-Quan Cao and Raoul De Charette. Monoscene: Monocular 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3991–4001, 2022.

  6. [6]

    Scenerf: Self-supervised monocular 3d scene reconstruction with radiance fields

    Anh-Quan Cao and Raoul de Charette. Scenerf: Self-supervised monocular 3d scene reconstruction with radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9387–9398, 2023.

  7. [7]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.

  8. [8]

    Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving

    OpenScene Contributors. Openscene: The largest up-to-date 3d occupancy prediction benchmark in autonomous driving. https://github.com/OpenDriveLab/OpenScene, 2023.

  9. [9]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

  10. [10]

    A point set generation network for 3d object reconstruction from a single image

    Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 605–613, 2017.

  11. [11]

    Are we ready for autonomous driving? the kitti vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition, pages 3354–3361. IEEE, 2012.

  12. [12]

    Vision meets robotics: The kitti dataset

    Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237,

  13. [13]

    Simple-bev: What really matters for multi-sensor bev perception?

    Adam W Harley, Zhaoyuan Fang, Jie Li, Rares Ambrus, and Katerina Fragkiadaki. Simple-bev: What really matters for multi-sensor bev perception? In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2759–2765. IEEE, 2023.

  14. [14]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

  15. [15]

    BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View

    Junjie Huang, Guan Huang, Zheng Zhu, Yun Ye, and Dalong Du. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790,

  16. [16]

    Tri-perspective view for vision-based 3d semantic occupancy prediction

    Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, and Jiwen Lu. Tri-perspective view for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9223–9232, 2023.

  17. [17]

    Selfocc: Self-supervised vision-based 3d occupancy prediction

    Yuanhui Huang, Wenzhao Zheng, Borui Zhang, Jie Zhou, and Jiwen Lu. Selfocc: Self-supervised vision-based 3d occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19946–19956, 2024.

  18. [18]

    Padre: A unifying polynomial attention drop-in replacement for efficient vision transformer

    Pierre-David Letourneau, Manish Kumar Singh, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Dalton Jones, Matthew Harper Langston, Hong Cai, and Fatih Porikli. Padre: A unifying polynomial attention drop-in replacement for efficient vision transformer. International Conference on Learning Representation (ICLR), 2025.

  19. [19]

    Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In European conference on computer vision, pages 1–18. Springer, 2022.

  20. [20]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.

  21. [21]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, 2017.

  22. [22]

    Sparse4D: Multi-view 3D object detection with sparse spatial-temporal fusion

    Xuewu Lin, Tianwei Lin, Zixiang Pei, Lichao Huang, and Zhizhong Su. Sparse4d: Multi-view 3d object detection with sparse spatial-temporal fusion. arXiv preprint arXiv:2211.10581, 2022.

  23. [23]

    Sparsebev: High-performance sparse 3d object detection from multi-camera videos

    Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, and Limin Wang. Sparsebev: High-performance sparse 3d object detection from multi-camera videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18580–18590, 2023.

  24. [24]

    Fully sparse 3d panoptic occupancy prediction

    Haisong Liu, Haiguang Wang, Yang Chen, Zetong Yang, Jia Zeng, Li Chen, and Limin Wang. Fully sparse 3d panoptic occupancy prediction. In Proceedings of the European Conference on Computer Vision, 2024.

  25. [25]

    Petr: Position embedding transformation for multi-view 3d object detection

    Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. arXiv preprint arXiv:2203.05625, 2022.

  26. [26]

    Let occ flow: Self-supervised 3d occupancy flow prediction

    Yili Liu, Linzhan Mou, Xuan Yu, Chenrui Han, Sitong Mao, Rong Xiong, and Yue Wang. Let occ flow: Self-supervised 3d occupancy flow prediction. The Conference on Robot Learning (CoRL), 2024.

  27. [27]

    Decoupled Weight Decay Regularization

    I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  28. [28]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

  29. [29]

    Atlas: End-to-end 3d scene reconstruction from posed images

    Zak Murez, Tarrence Van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3d scene reconstruction from posed images. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, pages 414–431. Springer, 2020.

  30. [30]

    Renderocc: Vision-centric 3d occupancy prediction with 2d rendering supervision

    Mingjie Pan, Jiaming Liu, Renrui Zhang, Peixiang Huang, Xiaoqi Li, Hongwei Xie, Bing Wang, Li Liu, and Shanghang Zhang. Renderocc: Vision-centric 3d occupancy prediction with 2d rendering supervision. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 12404–12411. IEEE, 2024.

  31. [31]

    Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d

    Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 194–210. Springer, 2020.

  32. [32]

    Ega-depth: Efficient guided attention for self-supervised multi-camera depth estimation

    Yunxiao Shi, Hong Cai, Amin Ansari, and Fatih Porikli. Ega-depth: Efficient guided attention for self-supervised multi-camera depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 119–129, 2023.

  33. [33]

    Decotr: Enhancing depth completion with 2d and 3d attentions

    Yunxiao Shi, Manish Kumar Singh, Hong Cai, and Fatih Porikli. Decotr: Enhancing depth completion with 2d and 3d attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10736–10746, 2024.

  34. [34]

    H3o: Hyper-efficient 3d occupancy prediction with heterogeneous supervision

    Yunxiao Shi, Hong Cai, Amin Ansari, and Fatih Porikli. H3o: Hyper-efficient 3d occupancy prediction with heterogeneous supervision. 2025.

  35. [35]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020.

  36. [36]

    Sparseocc: Rethinking sparse latent representation for vision-based semantic occupancy prediction

    Pin Tang, Zhongdao Wang, Guoqing Wang, Jilai Zheng, Xiangxuan Ren, Bailan Feng, and Chao Ma. Sparseocc: Rethinking sparse latent representation for vision-based semantic occupancy prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15035–15044, 2024.

  37. [37]

    Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving

    Xiaoyu Tian, Tao Jiang, Longfei Yun, Yucheng Mao, Huitong Yang, Yue Wang, Yilun Wang, and Hang Zhao. Occ3d: A large-scale 3d occupancy prediction benchmark for autonomous driving. Advances in Neural Information Processing Systems, 36, 2024.

  38. [38]

    Scene as occupancy

    Wenwen Tong, Chonghao Sima, Tai Wang, Li Chen, Silei Wu, Hanming Deng, Yi Gu, Lewei Lu, Ping Luo, Dahua Lin, et al. Scene as occupancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8406–8415, 2023.

  39. [39]

    Attention is all you need

    A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

  40. [40]

    Opus: Occupancy prediction using a sparse set

    Jiabao Wang, Zhaojiang Liu, Qiang Meng, Liujiang Yan, Ke Wang, Jie Yang, Wei Liu, Qibin Hou, and Mingming Cheng. Opus: Occupancy prediction using a sparse set. In Advances in Neural Information Processing Systems, 2024.

  41. [41]

    Detr3d: 3d object detection from multi-view images via 3d-to-2d queries

    Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022.

  42. [42]

    Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving

    Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Jie Zhou, and Jiwen Lu. Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21729–21740, 2023.

  43. [43]

    Deep height decoupling for precise vision-based 3d occupancy prediction

    Yuan Wu, Zhiqiang Yan, Zhengxue Wang, Xiang Li, Le Hui, and Jian Yang. Deep height decoupling for precise vision-based 3d occupancy prediction. arXiv preprint arXiv:2409.07972, 2024.

  44. [44]

    Mamo: Leveraging memory and attention for monocular video depth estimation

    Rajeev Yasarla, Hong Cai, Jisoo Jeong, Yunxiao Shi, Risheek Garrepalli, and Fatih Porikli. Mamo: Leveraging memory and attention for monocular video depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8754–8764, 2023.

  45. [45]

    Futuredepth: Learning to predict the future improves video depth estimation

    Rajeev Yasarla, Manish Kumar Singh, Hong Cai, Yunxiao Shi, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Risheek Garrepalli, and Fatih Porikli. Futuredepth: Learning to predict the future improves video depth estimation. Proceedings of the European Conference on Computer Vision, 2024.

  46. [46]

    FlashOcc: Fast and memory-efficient occupancy prediction via channel-to-height plugin

    Zichen Yu, Changyong Shu, Jiajun Deng, Kangjie Lu, Zongdai Liu, Jiangyong Yu, Dawei Yang, Hui Li, and Yan Chen. Flashocc: Fast and memory-efficient occupancy prediction via channel-to-height plugin. arXiv preprint arXiv:2311.12058, 2023.

  47. [47]

    Occnerf: Self-supervised multi-camera occupancy prediction with neural radiance fields

    Chubin Zhang, Juncheng Yan, Yi Wei, Jiaxin Li, Li Liu, Yansong Tang, Yueqi Duan, and Jiwen Lu. Occnerf: Self-supervised multi-camera occupancy prediction with neural radiance fields. arXiv preprint arXiv:2312.09243, 2023.

  48. [48]

    Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction

    Yunpeng Zhang, Zheng Zhu, and Dalong Du. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9433–9443,