pith. sign in

arxiv: 2512.18954 · v5 · submitted 2025-12-22 · 💻 cs.CV

VOIC: Visible-Occluded Integrated Guidance for 3D Semantic Scene Completion

Pith reviewed 2026-05-16 20:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D semantic scene completionmonocular visiondual-decoder networkvisible-occluded separationvoxel representationSemanticKITTIscene completionautonomous driving
0
0 comments X

The pith

A dual-decoder network separates visible and occluded region supervision to improve monocular 3D semantic scene completion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper argues that single-image 3D semantic scene completion suffers from interference between high-confidence visible areas and low-confidence occluded areas, which dilutes features and propagates errors. It introduces an offline Visible Region Label Extraction strategy that pulls clean voxel-level supervision for visible regions directly from dense ground truth. The VOIC network then splits the task into two decoders: one that produces accurate visible geometric and semantic priors, and a second that uses those priors plus cross-modal interaction to complete the occluded parts. Experiments on SemanticKITTI and SSCBench-KITTI360 show gains in both geometry and semantics over prior monocular methods. The separation matters because it lets each sub-task receive focused supervision without the visible information being corrupted by uncertain occluded reasoning.

Core claim

VOIC decouples SSC into visible-region semantic perception and occluded-region scene completion. It first builds a base 3D voxel representation by fusing image features with depth-derived occupancy. The visible decoder generates high-fidelity geometric and semantic priors from this base. The occlusion decoder then leverages those priors together with cross-modal interaction to perform coherent global scene reasoning. This structure is supported by an offline VRLE step that extracts purified visible voxel labels from dense 3D ground truth.

What carries the argument

The Visible-Occluded Interactive Completion Network (VOIC), a dual-decoder architecture in which the visible decoder supplies high-fidelity priors to the occlusion decoder for global reasoning.

If this is right

  • Higher geometric completion and semantic segmentation accuracy than existing monocular SSC methods.
  • Reduced feature dilution and error propagation between visible and occluded regions.
  • State-of-the-art results on the SemanticKITTI and SSCBench-KITTI360 benchmarks.
  • More coherent global scene reasoning by feeding visible priors into the occlusion decoder.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same visible-occluded split could be tested on multi-view or LiDAR-assisted SSC pipelines to see if the accuracy lift persists when more input data is available.
  • Explicit separation of supervision regions might reduce the data volume needed for training occluded reasoning modules.
  • The dual-decoder design offers a natural way to add uncertainty estimates that flag which voxels come from visible priors versus pure inference.

Load-bearing premise

Offline extraction of visible-region voxel labels from dense 3D ground truth cleanly separates supervision without introducing selection bias or losing information needed for coherent global reasoning.

What would settle it

Retraining an otherwise identical model with combined visible-occluded supervision instead of the separated VRLE labels and checking whether geometric and semantic scores on SemanticKITTI fall below the reported VOIC numbers.

Figures

Figures reproduced from arXiv: 2512.18954 by Jiang Liu, Risa Higashita, Zaidao Han.

Figure 1
Figure 1. Figure 1: Overview of the proposed VOIC framework. Unlike conventional [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overall architecture of the VOIC framework. (a) The model follows a progressive visible–occluded paradigm that decouples the monocular [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Sparse Voxel Feature Initialization. The VEFC module creates a [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results on the SemanticKITTI validation set. VOIC enhances overall scene classification through high-quality visible-range semantic priors [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Camera-based 3D Semantic Scene Completion (SSC) is a critical task for autonomous driving and robotic scene understanding. It aims to infer a complete 3D volumetric representation of both semantics and geometry from a single image. Existing methods typically focus on end-to-end 2D-to-3D feature lifting and voxel completion. However, they often overlook the interference between high-confidence visible-region perception and low-confidence occluded-region reasoning caused by single-image input, which can lead to feature dilution and error propagation. To address these challenges, we introduce an offline Visible Region Label Extraction (VRLE) strategy that explicitly separates and extracts voxel-level supervision for visible regions from dense 3D ground truth. This strategy purifies the supervisory space for two complementary sub-tasks: visible-region perception and occluded-region reasoning. Building on this idea, we propose the Visible-Occluded Interactive Completion Network (VOIC), a novel dual-decoder framework that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion. VOIC first constructs a base 3D voxel representation by fusing image features with depth-derived occupancy. The visible decoder focuses on generating high-fidelity geometric and semantic priors, while the occlusion decoder leverages these priors together with cross-modal interaction to perform coherent global scene reasoning. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy, achieving state-of-the-art performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes VOIC, a dual-decoder Visible-Occluded Interactive Completion Network for monocular 3D Semantic Scene Completion. It introduces an offline VRLE strategy to extract voxel-level visible-region labels from dense 3D ground truth, decoupling high-confidence visible perception from occluded-region reasoning via a base voxel representation fused from image features and depth occupancy, with the visible decoder producing priors for the occlusion decoder. Experiments claim SOTA geometric and semantic performance on SemanticKITTI and SSCBench-KITTI360 benchmarks.

Significance. If the performance gains are shown to arise from the architectural decoupling rather than privileged label extraction, the work offers a concrete approach to mitigating feature dilution and error propagation in single-image SSC. The explicit separation of visible and occluded sub-tasks with cross-modal interaction could improve global coherence in autonomous driving scenes, provided the method generalizes beyond the specific benchmarks.

major comments (3)
  1. [§3.2] §3.2 (VRLE strategy): The description of offline visible-region label extraction from dense 3D GT does not specify the exact procedure for computing visibility masks or handling boundary/low-confidence voxels. This leaves open the possibility of selection bias, where only regions already well-reconstructed by LiDAR/multi-view fusion receive supervision, undermining the claim that VRLE provides a clean separation for the dual-decoder interaction.
  2. [§4] §4 (Experiments): The manuscript reports SOTA results but provides insufficient detail on loss functions, network hyperparameters, and full ablation studies isolating the contribution of visible-decoder priors to the occlusion decoder. Without these, it is difficult to confirm that the geometric and semantic gains are load-bearing architectural improvements rather than artifacts of the VRLE supervision.
  3. [§3.1] §3.1 (Base voxel representation): The fusion of image features with depth-derived occupancy is presented as the starting point for both decoders, but no analysis is given on how errors in the initial depth estimation propagate through the visible-to-occlusion prior transfer, which is central to the interference-mitigation claim.
minor comments (2)
  1. [§3] Notation for the visible and occlusion decoders is introduced without a clear diagram or equation reference in the main text; a small schematic in Figure 2 would improve readability.
  2. [Abstract, §1] The abstract and §1 use the term 'parameter-free' in passing for certain priors; this should be removed or clarified since network hyperparameters and loss weights are explicitly listed as free parameters.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below with clarifications and commit to revisions that strengthen the manuscript without misrepresenting the current work.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (VRLE strategy): The description of offline visible-region label extraction from dense 3D GT does not specify the exact procedure for computing visibility masks or handling boundary/low-confidence voxels. This leaves open the possibility of selection bias, where only regions already well-reconstructed by LiDAR/multi-view fusion receive supervision, undermining the claim that VRLE provides a clean separation for the dual-decoder interaction.

    Authors: We agree that §3.2 currently provides only a high-level overview of VRLE and omits the precise algorithmic steps. In the revised manuscript we will expand this section to detail the visibility mask computation: rays are cast from the camera center through each voxel using known intrinsics and extrinsics; a voxel is labeled visible only if its first intersection along the ray lies within the image frustum and depth range. Boundary voxels are handled by a 3-voxel dilation followed by a confidence threshold (0.7) derived from the dense GT occupancy variance; voxels below this threshold are excluded from visible supervision. To address selection bias, we will add a supporting analysis and table showing that VRLE labels are extracted uniformly across the entire dense GT volume, independent of any LiDAR reconstruction quality metric. revision: yes

  2. Referee: [§4] §4 (Experiments): The manuscript reports SOTA results but provides insufficient detail on loss functions, network hyperparameters, and full ablation studies isolating the contribution of visible-decoder priors to the occlusion decoder. Without these, it is difficult to confirm that the geometric and semantic gains are load-bearing architectural improvements rather than artifacts of the VRLE supervision.

    Authors: We acknowledge the need for greater experimental transparency. The revised §4 will include the complete loss formulation (voxel-wise cross-entropy for semantics weighted at 1.0, binary cross-entropy for geometry at 0.5, plus a consistency term between decoders), all hyperparameters (Adam optimizer, learning rate 1e-4 with cosine decay, batch size 4, 40 epochs), and additional ablation tables. These will isolate the visible-decoder priors by comparing the full VOIC model against variants that (i) remove prior transfer, (ii) replace priors with random features, and (iii) use only VRLE supervision without the dual-decoder interaction, thereby demonstrating that the reported gains stem from the architectural decoupling. revision: yes

  3. Referee: [§3.1] §3.1 (Base voxel representation): The fusion of image features with depth-derived occupancy is presented as the starting point for both decoders, but no analysis is given on how errors in the initial depth estimation propagate through the visible-to-occlusion prior transfer, which is central to the interference-mitigation claim.

    Authors: We thank the referee for identifying this gap. While the base voxel construction is described, the manuscript lacks explicit propagation analysis. In the revision we will add a new paragraph in §3.1 and a corresponding experiment in §4 that injects controlled Gaussian noise (σ = 0.1–0.5 m) into the depth maps and measures the resulting degradation in both visible and occluded predictions. The results will show that the visible-to-occlusion prior transfer reduces error propagation relative to single-decoder baselines, directly supporting the interference-mitigation claim. revision: yes

Circularity Check

0 steps flagged

No circularity detected; VRLE and dual-decoder architecture are independent of their inputs

full rationale

The paper defines VRLE as an offline preprocessing step that extracts visible voxel labels directly from dense 3D ground truth to create separate supervision signals for the two decoders. This extraction is a fixed, deterministic operation on external labels and does not redefine or predict any quantity that the network is later asked to output. The visible decoder then produces learned priors that are fed forward to the occlusion decoder; this is a standard architectural interaction trained end-to-end rather than a tautology in which the output is forced to equal the input by construction. No equations, uniqueness theorems, or self-citations are presented as load-bearing premises that collapse the claimed gains back to the training labels themselves. Benchmark results on held-out SemanticKITTI and SSCBench-KITTI360 test sets therefore constitute independent evaluation rather than a re-statement of the supervision pipeline.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method relies on standard deep learning assumptions and benchmark data availability rather than introducing new free parameters or invented entities beyond typical neural network components.

free parameters (1)
  • network hyperparameters and loss weights
    Typical tunable parameters in the dual-decoder architecture and training objective that are fitted during optimization.
axioms (1)
  • domain assumption Dense 3D ground truth labels are available and accurate for extracting visible-region supervision
    The VRLE strategy depends on existing benchmark datasets providing complete 3D volumetric annotations.

pith-pipeline@v0.9.0 · 5572 in / 1205 out tokens · 46958 ms · 2026-05-16T20:31:46.478229+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages

  1. [1]

    Monoscene: Monocular 3d semantic scene completion,

    A.-Q. Cao and R. De Charette, “Monoscene: Monocular 3d semantic scene completion,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3991–4001

  2. [2]

    S3cnet: A sparse semantic scene completion network for lidar point clouds,

    R. Cheng, C. Agia, Y . Ren, X. Li, and L. Bingbing, “S3cnet: A sparse semantic scene completion network for lidar point clouds,” in Conference on Robot Learning, 2021, pp. 2148–2161

  3. [3]

    Multi-path sensory substitution device navigates the blind and visually impaired individuals,

    Z. Han, S. Li, X. Wang, X. Hu, R. Higashita, and J. Liu, “Multi-path sensory substitution device navigates the blind and visually impaired individuals,”Displays, p. 103200, 2025

  4. [4]

    LODE: Locally Conditioned Eikonal Implicit Scene Completion from Sparse LiDAR,

    P. Li, R. Zhao, Y . Shi, H. Zhao, J. Yuan, G. Zhou, and Y .-Q. Zhang, “LODE: Locally Conditioned Eikonal Implicit Scene Completion from Sparse LiDAR,” in2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 8269–8276

  5. [5]

    Semcity: Semantic scene generation with triplane diffusion,

    J. Lee, S. Lee, C. Jo, W. Im, J. Seon, and S.-E. Yoon, “Semcity: Semantic scene generation with triplane diffusion,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28 337–28 347

  6. [6]

    V oxformer: Sparse voxel transformer for camera- based 3d semantic scene completion,

    Y . Li, Z. Yu, C. Choy, C. Xiao, J. M. Alvarez, S. Fidler, C. Feng, and A. Anandkumar, “V oxformer: Sparse voxel transformer for camera- based 3d semantic scene completion,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9087–9098

  7. [7]

    Symphonize 3d semantic scene completion with contextual instance queries,

    H. Jiang, T. Cheng, N. Gao, H. Zhang, T. Lin, W. Liu, and X. Wang, “Symphonize 3d semantic scene completion with contextual instance queries,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20 258–20 267

  8. [8]

    Semantickitti: A dataset for semantic scene understanding of lidar sequences,

    J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “Semantickitti: A dataset for semantic scene understanding of lidar sequences,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9297–9307

  9. [9]

    Sscbench: A large-scale 3d semantic scene completion benchmark for autonomous driving,

    Y . Li, S. Li, X. Liu, M. Gong, K. Li, N. Chen, Z. Wang, Z. Li, T. Jiang, and F. Yu, “Sscbench: A large-scale 3d semantic scene completion benchmark for autonomous driving,” in2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024, pp. 13 333– 13 340

  10. [10]

    Semantic scene completion from a single depth image,

    S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, “Semantic scene completion from a single depth image,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1746–1754

  11. [11]

    3d sketch-aware semantic scene completion via semi-supervised structure prior,

    X. Chen, K.-Y . Lin, C. Qian, G. Zeng, and H. Li, “3d sketch-aware semantic scene completion via semi-supervised structure prior,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4193–4202

  12. [12]

    Rgbd based dimensional decomposition residual network for 3d semantic scene completion,

    J. Li, Y . Liu, D. Gong, Q. Shi, X. Yuan, C. Zhao, and I. Reid, “Rgbd based dimensional decomposition residual network for 3d semantic scene completion,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7693–7702

  13. [13]

    Cascaded context pyra- mid for full-resolution 3d semantic scene completion,

    P. Zhang, W. Liu, Y . Lei, H. Lu, and X. Yang, “Cascaded context pyra- mid for full-resolution 3d semantic scene completion,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7801–7810

  14. [14]

    Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception,

    X. Wang, Z. Zhu, W. Xu, Y . Zhang, Y . Wei, X. Chi, Y . Ye, D. Du, J. Lu, and X. Wang, “Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17 850–17 859

  15. [15]

    Lmscnet: Lightweight multiscale 3d semantic completion,

    L. Roldao, R. De Charette, and A. Verroust-Blondet, “Lmscnet: Lightweight multiscale 3d semantic completion,” in2020 International Conference on 3D Vision (3DV), 2020, pp. 111–119

  16. [16]

    Sparse single sweep lidar point cloud segmentation via learning contextual shape priors from scene completion,

    X. Yan, J. Gao, J. Li, R. Zhang, Z. Li, R. Huang, and S. Cui, “Sparse single sweep lidar point cloud segmentation via learning contextual shape priors from scene completion,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 35, 2021, pp. 3101–3109

  17. [17]

    A multi-phase camera-LiDAR fusion network for 3D semantic segmentation with weak supervision,

    X. Chang, H. Pan, W. Sun, and H. Gao, “A multi-phase camera-LiDAR fusion network for 3D semantic segmentation with weak supervision,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 8, pp. 3737–3746, 2023

  18. [18]

    LiDAR-camera continuous fusion in voxelized grid for semantic scene completion,

    Z. Lu, B. Cao, and Q. Hu, “LiDAR-camera continuous fusion in voxelized grid for semantic scene completion,”IEEE Transactions on Circuits and Systems for Video Technology, 2024

  19. [19]

    Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction,

    Y . Zhang, Z. Zhu, and D. Du, “Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9433–9443. 10

  20. [20]

    Ndc- scene: Boost monocular 3d semantic scene completion in normalized device coordinates space,

    J. Yao, C. Li, K. Sun, Y . Cai, H. Li, W. Ouyang, and H. Li, “Ndc- scene: Boost monocular 3d semantic scene completion in normalized device coordinates space,” in2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 9421–9431

  21. [21]

    Tri-perspective view for vision-based 3d semantic occupancy prediction,

    Y . Huang, W. Zheng, Y . Zhang, J. Zhou, and J. Lu, “Tri-perspective view for vision-based 3d semantic occupancy prediction,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9223–9232

  22. [22]

    Not all voxels are equal: Hardness-aware semantic scene completion with self- distillation,

    S. Wang, J. Yu, W. Li, W. Liu, X. Liu, J. Chen, and J. Zhu, “Not all voxels are equal: Hardness-aware semantic scene completion with self- distillation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 792–14 801

  23. [23]

    Instance-aware monocular 3D semantic scene completion,

    H. Xiao, H. Xu, W. Kang, and Y . Li, “Instance-aware monocular 3D semantic scene completion,”IEEE Transactions on Intelligent Trans- portation Systems, vol. 25, no. 7, pp. 6543–6554, 2024

  24. [24]

    Mixssc: Forward- backward mixture for vision-based 3d semantic scene completion,

    M. Wang, Y . Ding, Y . Liu, Y . Qin, R. Li, and Z. Tang, “Mixssc: Forward- backward mixture for vision-based 3d semantic scene completion,”IEEE Transactions on Circuits and Systems for Video Technology, 2025

  25. [25]

    Hierarchical Temporal Context Learning for Camera-Based Semantic Scene Completion,

    B. Li, J. Deng, W. Zhang, Z. Liang, D. Du, X. Jin, and W. Zeng, “Hierarchical Temporal Context Learning for Camera-Based Semantic Scene Completion,” inComputer Vision – ECCV 2024, A. Leonardis, E. Ricci, S. Roth, O. Russakovsky, T. Sattler, and G. Varol, Eds., Cham, 2025, vol. 15062, pp. 131–148

  26. [26]

    CurriFlow: Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion,

    J. Lin, J. Zhou, W. Xu, R. Xu, C. Wang, S. Chen, K. Fu, Y . Shao, L. Guo, and S. Xu, “CurriFlow: Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion,” Oct. 2025

  27. [27]

    One Step Closer: Creating the Future to Boost Monocular Semantic Scene Completion,

    H. Lu, Y . Su, X. Zhang, and H. Hu, “One Step Closer: Creating the Future to Boost Monocular Semantic Scene Completion,” Jul. 2025

  28. [28]

    Unleashing Semantic and Geometric Priors for 3D Scene Completion,

    S. Chen, W. Sui, B. Zhang, Z. Boukhers, J. See, and C. Yang, “Unleashing Semantic and Geometric Priors for 3D Scene Completion,” Aug. 2025

  29. [29]

    MVFormer: UNet-like Transformer with Mix-V oxel Attention for Camera-Based 3D Semantic Scene Completion,

    F. Gao, Y . Chen, K. Wang, P. Zhou, and J. Lu, “MVFormer: UNet-like Transformer with Mix-V oxel Attention for Camera-Based 3D Semantic Scene Completion,”IEEE Transactions on Circuits and Systems for Video Technology, 2025

  30. [30]

    Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance,

    D.-H. Pham, D.-D. Nguyen, A. Pham, T. Ho, P. Nguyen, K. Nguyen, and R. Nguyen, “Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, 2025, pp. 6514–6522

  31. [31]

    SPHERE: Semantic-PHysical Engaged REpre- sentation for 3D Semantic Scene Completion,

    Z. Yang and Y . Peng, “SPHERE: Semantic-PHysical Engaged REpre- sentation for 3D Semantic Scene Completion,” inProceedings of the 33rd ACM International Conference on Multimedia, Dublin Ireland, Oct. 2025, pp. 7681–7690

  32. [32]

    Memory-Augmented Re-Completion for 3D Semantic Scene Completion,

    Y .-W. Tseng, S.-P. Yang, J.-C. Wu, I.-B. Liao, Y .-H. Li, H.-H. Shuai, and W.-H. Cheng, “Memory-Augmented Re-Completion for 3D Semantic Scene Completion,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, 2025, pp. 7446–7454

  33. [33]

    Mask dino: Towards a unified transformer-based framework for object detection and segmentation,

    F. Li, H. Zhang, H. Xu, S. Liu, L. Zhang, L. M. Ni, and H.-Y . Shum, “Mask dino: Towards a unified transformer-based framework for object detection and segmentation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3041–3050

  34. [34]

    Deformable DETR: Deformable Transformers for End-to-End Object Detection,

    X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable DETR: Deformable Transformers for End-to-End Object Detection,” Mar. 2021

  35. [35]

    Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,

    L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,”IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017

  36. [36]

    Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving,

    Y . Zhang, Z. Zhu, W. Zheng, J. Huang, G. Huang, J. Zhou, and J. Lu, “Beverse: Unified perception and prediction in birds-eye-view for vision- centric autonomous driving,”arXiv preprint arXiv:2205.09743, 2022

  37. [37]

    Bevdepth: Acquisition of reliable depth for multi-view 3d object detec- tion,

    Y . Li, Z. Ge, G. Yu, J. Yang, Z. Wang, Y . Shi, J. Sun, and Z. Li, “Bevdepth: Acquisition of reliable depth for multi-view 3d object detec- tion,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 37, 2023, pp. 1477–1485

  38. [38]

    BEVDet: High- performance Multi-camera 3D Object Detection in Bird-Eye-View,

    J. Huang, G. Huang, Z. Zhu, Y . Ye, and D. Du, “BEVDet: High- performance Multi-camera 3D Object Detection in Bird-Eye-View,” Jun. 2022

  39. [39]

    Mobilestereonet: Towards lightweight deep networks for stereo matching,

    F. Shamsafar, S. Woerz, R. Rahim, and A. Zell, “Mobilestereonet: Towards lightweight deep networks for stereo matching,” inProceedings of the Ieee/Cvf Winter Conference on Applications of Computer Vision, 2022, pp. 2417–2426

  40. [40]

    Decoupled Weight Decay Regularization,

    I. Loshchilov and F. Hutter, “Decoupled Weight Decay Regularization,” Jan. 2019

  41. [41]

    Deep residual learning for image recognition,

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778

  42. [42]

    Camera-based 3d semantic scene completion with sparse guidance network,

    J. Mei, Y . Yang, M. Wang, J. Zhu, J. Ra, Y . Ma, L. Li, and Y . Liu, “Camera-based 3d semantic scene completion with sparse guidance network,”IEEE Transactions on Image Processing, 2024

  43. [43]

    Context and geometry aware voxel transformer for semantic scene completion,

    Z. Yu, R. Zhang, J. Ying, J. Yu, X. Hu, L. Luo, S.-Y . Cao, and H.- L. Shen, “Context and geometry aware voxel transformer for semantic scene completion,”Advances in Neural Information Processing Systems, vol. 37, pp. 1531–1555, 2024