pith. machine review for the scientific record.

arxiv: 2602.06400 · v2 · submitted 2026-02-06 · 💻 cs.CV · cs.AI · cs.RO

Recognition: 1 theorem link · Lean Theorem

TFusionOcc: T-Primitive Based Object-Centric Multi-Sensor Fusion Framework for 3D Occupancy Prediction

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 07:27 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords 3D occupancy prediction · multi-sensor fusion · object-centric representation · Student's t-distribution · T-primitives · semantic occupancy · autonomous driving · nuScenes

The pith

T-primitives based on the Student's t-distribution model complex 3D structures more effectively than Gaussians in multi-sensor fusion for occupancy prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TFusionOcc, which replaces Gaussian primitives with a family of T-primitives derived from the Student's t-distribution to predict 3D semantic occupancy. These include plain T-primitives, T-Superquadrics, and deformable T-Superquadrics with inverse warping, which handle non-convex and asymmetric shapes better than prior methods. A unified T-mixture model jointly represents occupancy and semantics, while a tightly coupled multi-stage architecture fuses camera and LiDAR cues. Experiments show state-of-the-art performance on nuScenes and robustness under most corruptions on nuScenes-C.
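To make the mixture readout concrete, here is a minimal editorial sketch of how occupancy and semantics could be read out from a set of T-primitives modeled as multivariate Student's t kernels. This is an illustration, not the paper's code: the kernel parameterization, the responsibility-weighted semantic readout, and every name in it are assumptions.

```python
import numpy as np
from scipy.special import gammaln

def t_kernel(x, mu, Sigma_inv, log_det_Sigma, nu):
    """Multivariate Student's t density (heavier tails than a Gaussian).

    x: (N, 3) query points; mu: (3,) primitive center;
    Sigma_inv: (3, 3) inverse scale matrix; log_det_Sigma: log|Sigma|;
    nu: degrees of freedom (nu -> infinity recovers the Gaussian).
    """
    d = mu.shape[0]
    diff = x - mu
    maha = np.einsum('ni,ij,nj->n', diff, Sigma_inv, diff)
    log_norm = (gammaln((nu + d) / 2.0) - gammaln(nu / 2.0)
                - 0.5 * d * np.log(nu * np.pi) - 0.5 * log_det_Sigma)
    return np.exp(log_norm - 0.5 * (nu + d) * np.log1p(maha / nu))

def tmm_readout(x, centers, Sigma_invs, log_dets, nus, weights, class_logits):
    """Joint occupancy/semantics readout from a T-mixture model (TMM).

    Occupancy at x is the weighted sum of kernel densities; semantics are
    the responsibility-weighted average of per-primitive class distributions
    (a standard mixture readout, not necessarily the paper's exact form).
    """
    dens = np.stack([t_kernel(x, c, Si, ld, nu)
                     for c, Si, ld, nu
                     in zip(centers, Sigma_invs, log_dets, nus)], axis=1)
    resp = weights[None, :] * dens               # (N, K) unnormalized responsibilities
    occupancy = resp.sum(axis=1)                 # (N,)
    probs = np.exp(class_logits - class_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)    # (K, C) per-primitive softmax
    semantics = (resp / (occupancy[:, None] + 1e-9)) @ probs  # (N, C)
    return occupancy, semantics
```

The only structural difference from a Gaussian mixture is the kernel: the polynomial tail (1 + maha/nu)^(-(nu+d)/2) decays more slowly than exp(-maha/2), so a single primitive can cover elongated or fringed structure that a Gaussian would need several components to approximate.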

Core claim

TFusionOcc shows that T-primitives from the Student's t-distribution, especially the deformable T-Superquadric variant, together with a T-mixture model, enable superior object-centric modeling of fine-grained geometric and semantic scene structure when integrating camera and LiDAR data, outperforming voxel-based and Gaussian-primitive baselines on nuScenes.

What carries the argument

The T-primitive family (plain T-primitive, T-Superquadric, deformable T-Superquadric with inverse warping) based on the Student's t-distribution, unified through the T-mixture model for joint occupancy and semantic modeling.

If this is right

  • Enables finer modeling of complex scene elements than Gaussian primitives allow.
  • Delivers state-of-the-art 3D semantic occupancy results on the nuScenes dataset.
  • Maintains strong performance under most sensor corruptions on nuScenes-C.
  • Supports safer autonomous vehicle navigation via improved geometric and semantic scene detail.
  • Avoids redundant computation on empty space through an object-centric representation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The deformable T-Superquadric could extend to tracking moving objects across frames.
  • The probabilistic T-mixture formulation may yield improved uncertainty estimates for planning systems.
  • Similar primitives might apply to other 3D vision tasks such as reconstruction from sparse views.
  • The fusion architecture could incorporate additional sensors like radar for adverse conditions.

Load-bearing premise

T-primitives can represent complex non-convex and asymmetric structures more effectively than Gaussian primitives while the multi-stage fusion adds no new failure modes.

What would settle it

A head-to-head test on nuScenes showing no accuracy gain over Gaussian primitives on scenes with highly asymmetric or non-convex objects would falsify the modeling advantage.

Figures

Figures reproduced from arXiv: 2602.06400 by Julie Stephany Berrio, Mao Shan, Stewart Worrall, Yaoqi Huang, Zhenxing Ming.

Figure 1. Pipeline of three approaches: the voxel-based approach (top), the 3D-Gaussian-primitive-based object-centric approach (middle), and our approach (bottom). A set of learnable primitives is used to model occupied regions more selectively.

Figure 2. Overall architecture of TFusionOcc. The pipeline comprises a camera branch, a LiDAR branch, and a multi-stage fusion branch. The camera branch extracts multi-scale visual features and predicts a pseudo 3D point cloud from surround-view images; the pseudo point cloud is then projected and cylindrically partitioned, yielding camera-based, multi-scale, dense depth maps and a voxel volume.

Figure 3. Inner structure of DepthNet. The 1/8-scale visual features first generate a 1/8-scale depth map; bilinear interpolation then produces the 1/16- and 1/32-scale depth maps. The image-based pseudo point cloud is generated solely from the 1/8-scale depth map.

Figure 5. Skeleton Merge Module. The upper LiDAR branch serves as the main skeleton, providing the foundational structure of the 3D scene; the lower camera branch augments that skeleton with more detailed local structure to compensate for the fine-grained geometry the main skeleton lacks. Any voxel in the cylindrical camera volume that spatially overlaps with the LiDAR anchors is removed.

Figure 6. Early-Fusion Module. Each occupied voxel center in the cylindrical LiDAR volume serves as an anchor and is projected onto multi-scale visual features to aggregate semantic information, yielding a semantic-aware LiDAR feature.

Figure 7. Fused-depth-map-guided 3D deformable attention. For the same visual feature at location (u_i, v_j), several depth values in [0.0, 1.0] along the depth axis weight the same pixel feature, producing a depth-encoded visual feature that mitigates the ambiguity of the 3D-to-2D projection.

Figure 8. Gate-Concatenation and Weight-Summation Fusion Module. The left part shows the multi-modality weighted-summation fusion; the right part shows the multi-modality gated-concatenation fusion. The outputs of the two parts are further fused to produce the final fused feature for each T-primitive.

Figure 9. Performance at different sector ranges for the 3D semantic occupancy prediction task: (a) mIoU and (b) IoU over the whole SurroundOcc-nuScenes validation set; (c) mIoU and (d) IoU on its rainy-scenario subset.

Figure 10. Qualitative results compared against other SOTA algorithms.
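Figure 7's depth weighting is the one mechanism in the captions compact enough to state in a few lines: a per-pixel distribution over depth bins scales the same 2D feature at every depth hypothesis, so ambiguous projections contribute proportionally instead of committing to a single depth per ray. A minimal sketch of that lift (an LSS-style outer product; shapes and names are illustrative, not the paper's code):

```python
import numpy as np

def depth_weighted_lift(feat_2d, depth_prob):
    """Lift 2D features along a depth axis by weighting with depth bins.

    feat_2d:    (H, W, C) visual features
    depth_prob: (H, W, D) per-pixel distribution over D depth bins (sums to 1)
    returns:    (H, W, D, C) depth-encoded features: each depth hypothesis
                receives the same pixel feature scaled by its probability.
    """
    return depth_prob[..., :, None] * feat_2d[..., None, :]
```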
read the original abstract

The prediction of 3D semantic occupancy enables autonomous vehicles (AVs) to perceive the fine-grained geometric and semantic scene structure for safe navigation and decision-making. Existing methods mainly rely on either voxel-based representations, which incur redundant computation over empty regions, or on object-centric Gaussian primitives, which are limited in modeling complex, non-convex, and asymmetric structures. In this paper, we present TFusionOcc, a T-primitive-based object-centric multi-sensor fusion framework for 3D semantic occupancy prediction. Specifically, we introduce a family of Student's t-distribution-based T-primitives, including the plain T-primitive, T-Superquadric, and deformable T-Superquadric with inverse warping, where the deformable T-Superquadric serves as the key geometry-enhancing primitive. We further develop a unified probabilistic formulation based on the Student's t-distribution and the T-mixture model (TMM) to jointly model occupancy and semantics, and design a tightly coupled multi-stage fusion architecture to effectively integrate camera and LiDAR cues. Extensive experiments on nuScenes show state-of-the-art performance, while additional evaluations on nuScenes-C demonstrate strong robustness under most corruption scenarios. The code will be available at: https://github.com/DanielMing123/TFusionOcc

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces TFusionOcc, an object-centric multi-sensor fusion framework for 3D semantic occupancy prediction that replaces Gaussian primitives with a family of Student's t-distribution-based T-primitives (plain T-primitive, T-Superquadric, and deformable T-Superquadric with inverse warping). It presents a unified probabilistic formulation via the T-mixture model (TMM) to jointly model occupancy and semantics, together with a tightly coupled multi-stage fusion architecture that integrates camera and LiDAR features. Experiments on nuScenes report state-of-the-art performance for 3D semantic occupancy, while evaluations on nuScenes-C demonstrate robustness under most corruption scenarios. The authors promise to release the code.

Significance. If the quantitative claims hold, the work offers a meaningful advance over prior object-centric methods by using T-primitives that can represent non-convex and asymmetric geometry more flexibly than Gaussians, while the TMM formulation provides a coherent probabilistic treatment of both occupancy and semantics. The multi-stage fusion design and the release of code plus results on both clean and corrupted nuScenes data constitute concrete strengths that support reproducibility and practical relevance for autonomous driving perception.

minor comments (2)
  1. [§4.1] §4.1 and Table 2: the main comparison table reports mIoU and mAP but does not list the exact training schedule, optimizer settings, or number of runs; adding these details would allow direct reproduction of the SOTA numbers.
  2. [§3.3] §3.3, Eq. (7): the inverse warping operation for the deformable T-Superquadric is described at a high level; a short pseudocode block or explicit formula for the warping function would clarify how it differs from standard superquadric deformation.
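To illustrate the pattern the second comment asks for (an editorial sketch, not the paper's Eq. (7)): a deformable superquadric can be evaluated by inverse warping, mapping each world-space query point back through the inverse of the deformation before applying the canonical inside-outside function. The linear taper below is a standard Barr-style deformation chosen for concreteness; the paper's learned warp will differ.

```python
import numpy as np

def superquadric_io(p, scale, eps1, eps2):
    """Canonical superquadric inside-outside function.

    F(p) < 1 inside, == 1 on the surface, > 1 outside.
    p: (N, 3) points in the primitive's canonical frame;
    scale = (a1, a2, a3); eps1, eps2 control roundness vs. squareness.
    """
    x, y, z = (np.abs(p) / scale).T
    return ((x ** (2.0 / eps2) + y ** (2.0 / eps2)) ** (eps2 / eps1)
            + z ** (2.0 / eps1))

def inverse_taper(p, k):
    """Inverse of a linear taper along z (a Barr-style global deformation).

    The forward warp scales (x, y) by (1 + k*z); evaluating the deformed
    shape at a world point means undoing that: divide (x, y) by the factor.
    Assumes |k * z| < 1 so the factor never vanishes.
    """
    q = p.astype(float).copy()
    s = 1.0 + k * q[:, 2]
    q[:, 0] /= s
    q[:, 1] /= s
    return q

def deformed_superquadric_io(p, scale, eps1, eps2, k):
    """Deformed superquadric via inverse warping: F_def(p) = F(W^{-1}(p))."""
    return superquadric_io(inverse_taper(p, k), scale, eps1, eps2)
```

The same structure generalizes: any invertible learned warp slots into the place of `inverse_taper`, and only its inverse is ever evaluated at query time.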

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review, the recognition of our contributions in using T-primitives for more flexible geometry modeling, and the recommendation for minor revision. We appreciate the comments on the unified TMM formulation, multi-stage fusion, and robustness evaluations on nuScenes-C.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces a family of T-primitives based on the Student's t-distribution and a unified probabilistic TMM formulation to model occupancy and semantics, followed by a multi-stage fusion architecture. No equations or steps reduce the claimed performance or robustness results to quantities defined solely by fitted parameters from the same dataset or by self-referential definitions. The central claims rest on experimental tables, ablations, and evaluations on nuScenes and nuScenes-C rather than on any load-bearing self-citation chain, uniqueness theorem imported from prior author work, or ansatz smuggled via citation. The derivation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the modeling power of the new T-primitives and the effectiveness of the multi-stage fusion; the abstract invokes the Student's t-distribution as a domain-appropriate probability model without deriving it from first principles.

axioms (1)
  • domain assumption: Student's t-distribution provides a suitable probabilistic basis for modeling 3D occupancy and semantics.
    Invoked in the unified probabilistic formulation and T-mixture model described in the abstract.
invented entities (1)
  • T-primitive family (plain T-primitive, T-Superquadric, deformable T-Superquadric) · no independent evidence
    purpose: to represent complex non-convex and asymmetric 3D structures more flexibly than Gaussian primitives
    New geometric primitive family introduced in the paper
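For reference, the textbook form behind the axiom: the d-dimensional Student's t density with location \(\boldsymbol{\mu}\), scale \(\Sigma\), and \(\nu\) degrees of freedom has polynomial rather than exponential tail decay, and recovers the Gaussian in the limit \(\nu \to \infty\) (the paper's exact parameterization may differ):

```latex
f(\mathbf{x}) \;=\;
\frac{\Gamma\!\left(\frac{\nu+d}{2}\right)}
     {\Gamma\!\left(\frac{\nu}{2}\right)\,(\nu\pi)^{d/2}\,\lvert\Sigma\rvert^{1/2}}
\left[\,1 + \frac{1}{\nu}\,
      (\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})
\right]^{-\frac{\nu+d}{2}},
\qquad
f \;\xrightarrow{\;\nu\to\infty\;}\; \mathcal{N}(\boldsymbol{\mu},\,\Sigma).
```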

pith-pipeline@v0.9.0 · 5546 in / 1376 out tokens · 27768 ms · 2026-05-16T07:27:53.547304+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 1 internal anchor

  1. [1] Z. Ming, J. S. Berrio, M. Shan, and S. Worrall, "OccFusion: Multi-sensor fusion framework for 3D semantic occupancy prediction," IEEE Transactions on Intelligent Vehicles, 2024.

  2. [2] J. Pan, Z. Wang, and L. Wang, "Co-Occ: Coupling explicit feature fusion with volume rendering regularization for multi-modal 3D semantic occupancy prediction," IEEE Robotics and Automation Letters, 2024.

  3. [3] S. Zhang, Y. Zhai, J. Mei, and Y. Hu, "FusionOcc: Multi-modal fusion for 3D occupancy prediction," in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 787–796.

  4. [4] Z. Ming, J. S. Berrio, M. Shan, Y. Huang, H. Lyu, N. H. K. Tran, T.-Y. Tseng, and S. Worrall, "OccCylindrical: Multi-modal fusion with cylindrical representation for 3D semantic occupancy prediction," arXiv preprint arXiv:2505.03284, 2025.

  5. [5] Z. Yang, Y. Dong, J. Wang, H. Wang, L. Ma, Z. Cui, Q. Liu, H. Pei, K. Zhang, and C. Zhang, "DAOcc: 3D object detection assisted multi-sensor fusion for 3D occupancy prediction," IEEE Transactions on Circuits and Systems for Video Technology, 2025.

  6. [6] Z. Duan, C. Dang, X. Hu, P. An, J. Ding, J. Zhan, Y. Xu, and J. Ma, "SDGOcc: Semantic and depth-guided bird's-eye view transformation for 3D multimodal occupancy prediction," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 6751–6760.

  7. [7] Y. Shi, K. Jiang, J. Miao, K. Wang, K. Qian, Y. Wang, J. Li, T. Wen, M. Yang, Y. Xu et al., "EffOcc: Learning efficient occupancy networks from minimal labels for autonomous driving," in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 17008–17015.

  8. [8] L. Zhao, S. Wei, J. Hays, and L. Gan, "GaussianFormer3D: Multi-modal Gaussian-based semantic occupancy prediction with 3D deformable attention," arXiv preprint arXiv:2505.10685, 2025.

  9. [9] T. Pavković, M.-A. N. Mahani, J. Niedermayer, and J. Betz, "GaussianFusionOcc: A seamless sensor fusion approach for 3D occupancy prediction using 3D Gaussians," arXiv preprint arXiv:2507.18522, 2025.

  10. [10] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuScenes: A multimodal dataset for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11621–11631.

  11. [11] S. Xie, L. Kong, W. Zhang, J. Ren, L. Pan, K. Chen, and Z. Liu, "Benchmarking and improving bird's eye view perception robustness in autonomous driving," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  12. [12] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.

  13. [13] T. Yang, Y. Qian, W. Yan, C. Wang, and M. Yang, "AdaptiveOcc: Adaptive octree-based network for multi-camera 3D semantic occupancy prediction in autonomous driving," IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 3, pp. 2173–2187, 2024.

  14. [14] L. Zheng, J. Liu, R. Guan, L. Yang, S. Lu, Y. Li, X. Bai, J. Bai, Z. Ma, H.-L. Shen et al., "Doracamom: Joint 3D detection and occupancy prediction with multi-view 4D radars and cameras for omnidirectional perception," IEEE Transactions on Circuits and Systems for Video Technology, 2026.

  15. [15] Y. Huang, W. Zheng, Y. Zhang, J. Zhou, and J. Lu, "GaussianFormer: Scene as Gaussians for vision-based 3D semantic occupancy prediction," in European Conference on Computer Vision. Springer, 2024, pp. 376–393.

  16. [16] Y. Huang, A. Thammatadatrakoon, W. Zheng, Y. Zhang, D. Du, and J. Lu, "GaussianFormer-2: Probabilistic Gaussian superposition for efficient 3D occupancy prediction," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 27477–27486.

  17. [17] K. Song, Y. Wu, C. Siu, H. Xiong, and Q. Xu, "GraphGSOcc: Semantic-geometric graph transformer with dynamic-static decoupling for 3D Gaussian splatting-based occupancy prediction," IEEE Transactions on Circuits and Systems for Video Technology, 2026.

  18. [18] H. Zhou, X. Zhu, X. Song, Y. Ma, Z. Wang, H. Li, and D. Lin, "Cylinder3D: An effective 3D framework for driving-scene LiDAR semantic segmentation," arXiv preprint arXiv:2008.01550, 2020.

  19. [19] Y. Yan, Y. Mao, and B. Li, "SECOND: Sparsely embedded convolutional detection," Sensors, vol. 18, no. 10, p. 3337, 2018.

  20. [20] H. Li, H. Zhang, Z. Zeng, S. Liu, F. Li, T. Ren, and L. Zhang, "DFA3D: 3D deformable attention for 2D-to-3D feature lifting," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6684–6693.

  21. [21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

  22. [22] T. Wang, X. Zhu, J. Pang, and D. Lin, "FCOS3D: Fully convolutional one-stage monocular 3D object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 913–922.

  23. [23] M. Berman, A. R. Triki, and M. B. Blaschko, "The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4413–4421.

  24. [24] Y. Wei, L. Zhao, W. Zheng, Z. Zhu, J. Zhou, and J. Lu, "SurroundOcc: Multi-camera 3D occupancy prediction for autonomous driving," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 21729–21740.

  25. [25] X. Tian, T. Jiang, L. Yun, Y. Mao, H. Yang, Y. Wang, Y. Wang, and H. Zhao, "Occ3D: A large-scale 3D occupancy prediction benchmark for autonomous driving," Advances in Neural Information Processing Systems, vol. 36, pp. 64318–64330, 2023.

  26. [26] A.-Q. Cao and R. de Charette, "MonoScene: Monocular 3D semantic scene completion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3991–4001.

  27. [27] Z. Murez, T. Van As, J. Bartolozzi, A. Sinha, V. Badrinarayanan, and A. Rabinovich, "Atlas: End-to-end 3D scene reconstruction from posed images," in Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII. Springer, 2020, pp. 414–431.

  28. [28] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, and J. Dai, "BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers," in ECCV. Springer, 2022, pp. 1–18.

  29. [29] Y. Huang, W. Zheng, Y. Zhang, J. Zhou, and J. Lu, "Tri-perspective view for vision-based 3D semantic occupancy prediction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9223–9232.

  30. [30] X. Wang, Z. Zhu, W. Xu, Y. Zhang, Y. Wei, X. Chi, Y. Ye, D. Du, J. Lu, and X. Wang, "OpenOccupancy: A large scale benchmark for surrounding semantic occupancy perception," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17850–17859.

  31. [31] Z. Ming, J. S. Berrio, M. Shan, and S. Worrall, "InverseMatrixVT3D: An efficient projection matrix-based approach for 3D occupancy prediction," in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 9565–9572.

  32. [32] Y. Zhang, Z. Zhu, and D. Du, "OccFormer: Dual-path transformer for vision-based 3D semantic occupancy prediction," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9433–9443.

  33. [33] Z. Li, Z. Yu, D. Austin, M. Fang, S. Lan, J. Kautz, and J. M. Alvarez, "FB-OCC: 3D occupancy prediction based on forward-backward view transformation," arXiv preprint arXiv:2307.01492, 2023.

  34. [34] M. Pan, J. Liu, R. Zhang, P. Huang, X. Li, H. Xie, B. Wang, L. Liu, and S. Zhang, "RenderOcc: Vision-centric 3D occupancy prediction with 2D rendering supervision," in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 12404–12411.

  35. [35] S. Zuo, W. Zheng, X. Han, L. Yang, Y. Pan, and J. Lu, "QuadricFormer: Scene as superquadrics for 3D semantic occupancy prediction," arXiv preprint arXiv:2506.10977, 2025.

  36. [36] Z. Ming, J. S. Berrio-Perez, M. Shan, and S. Worrall, "Inverse++: Vision-centric 3D semantic occupancy prediction assisted with 3D object detection," Neurocomputing, p. 132162, 2025.

  37. [37] L. Roldao, R. de Charette, and A. Verroust-Blondet, "LMSCNet: Lightweight multiscale 3D semantic completion," in 2020 International Conference on 3D Vision (3DV). IEEE, 2020, pp. 111–119.

  38. [38] J. Huang, G. Huang, Z. Zhu, Y. Ye, and D. Du, "BEVDet: High-performance multi-camera 3D object detection in bird-eye-view," arXiv preprint arXiv:2112.11790, 2021.

  39. [39] Y. Li, H. Bao, Z. Ge, J. Yang, J. Sun, and Z. Li, "BEVStereo: Enhancing depth estimation in multi-view 3D object detection with temporal stereo," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1486–1494.

  40. [40] H. Zhang, X. Yan, D. Bai, J. Gao, P. Wang, B. Liu, S. Cui, and Z. Li, "RadOcc: Learning cross-modality occupancy knowledge through rendering assisted distillation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, 2024, pp. 7060–7068.

  41. [41] Y. Ren, L. Wang, M. Li, H. Jiang, Z. Cui, M. Yang, H. Yu, and D. Yang, "RM2Occ: Re-projection multi-task multi-sensor fusion for autonomous driving 3D object detection and occupancy perception," IEEE Transactions on Intelligent Transportation Systems, 2025.

  42. [42] H. Li, Y. Hou, X. Xing, Y. Ma, X. Sun, and Y. Zhang, "OccMamba: Semantic occupancy prediction with state space models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 11949–11959.