pith. machine review for the scientific record.

arxiv: 2604.05780 · v1 · submitted 2026-04-07 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D semantic scene completion · monocular vision · voxel attention · sparsity awareness · foreground modulation · SemanticKITTI

The pith

VoxSAMNet uses a dummy shortcut to skip empty voxels and foreground modulation to improve monocular 3D semantic scene completion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper aims to show that explicitly accounting for two properties of 3D scenes, that over 93% of voxels are empty and that foreground objects are rare, leads to better reconstruction of full semantic scenes from a single RGB image. The authors build VoxSAMNet around two corresponding ideas: a module that routes attention around empty space and a strategy that emphasizes and protects features from rare classes. If true, this design would make single-camera 3D perception more practical and accurate for applications such as self-driving cars and robot navigation.

Core claim

The core claim is that the Dummy Shortcut for Feature Refinement (DSFR) module bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention, and that the Foreground Modulation Strategy, combining Foreground Dropout with a Text-Guided Image Filter, alleviates overfitting on long-tailed classes. Together they yield state-of-the-art results of 18.2% mIoU on SemanticKITTI and 20.2% on SSCBench-KITTI-360, beating earlier monocular and stereo approaches.
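The routing idea is easy to picture in code. Below is a minimal sketch, assuming the mechanism works roughly as the abstract describes: empty voxels are answered by one shared learned embedding instead of participating in attention, while occupied voxels are refined normally. Standard multi-head attention stands in for the paper's deformable attention here, and every name (`DummyShortcutLayer`, the occupancy mask) is illustrative rather than the authors' code.

```python
# Illustrative sketch of a dummy-shortcut layer (not the authors' implementation).
# Empty voxels are bypassed with a single shared learned "dummy" embedding;
# occupied voxels are refined by attention over occupied voxels only.
import torch
import torch.nn as nn

class DummyShortcutLayer(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.dummy = nn.Parameter(torch.zeros(1, dim))   # shared dummy node
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, voxels: torch.Tensor, occupied: torch.Tensor) -> torch.Tensor:
        """voxels: (B, N, C) flattened voxel features; occupied: (B, N) bool mask."""
        refined = voxels.clone()
        for b in range(voxels.shape[0]):                 # per-sample loop, kept simple
            occ = voxels[b, occupied[b]].unsqueeze(0)    # (1, n_occ, C)
            if occ.shape[1] > 0:
                out, _ = self.attn(occ, occ, occ)        # attention over occupied only
                refined[b, occupied[b]] = self.norm(occ + out).squeeze(0)
            refined[b, ~occupied[b]] = self.dummy        # bypass for empty voxels
        return refined
```

On grids where over 93% of voxels are empty, the quadratic attention cost in this sketch is paid only over the occupied minority, which is the efficiency argument the paper's design rests on.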

What carries the argument

The DSFR module, which uses a shared dummy node to handle voxel sparsity in attention, together with the Foreground Modulation Strategy, which addresses semantic imbalance.
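Foreground Dropout admits a similarly compact sketch. The material above does not specify its exact form, so the drop rate and per-voxel granularity below are assumptions; the idea is that randomly silencing rare-class features during training keeps the network from memorizing the few foreground examples it sees.

```python
# Hedged sketch of Foreground Dropout; the rate and granularity are assumptions.
import torch

def foreground_dropout(feats: torch.Tensor, fg_mask: torch.Tensor,
                       p: float = 0.2, training: bool = True) -> torch.Tensor:
    """feats: (B, N, C) voxel features; fg_mask: (B, N) bool, True = foreground class."""
    if not training or p <= 0.0:
        return feats
    # Draw a Bernoulli drop decision per voxel, then restrict it to foreground voxels.
    drop = torch.rand_like(fg_mask, dtype=feats.dtype) < p
    drop = drop & fg_mask
    return feats.masked_fill(drop.unsqueeze(-1), 0.0)   # zero dropped foreground features
```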

If this is right

  • Delivers higher mIoU than previous methods on two KITTI-based benchmarks.
  • Minimizes processing of the vast majority of empty voxels.
  • Improves performance on rare foreground semantic classes.
  • Provides evidence that sparsity and imbalance must be explicitly modeled in voxel-based SSC.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar dummy node tricks could simplify attention in other sparse 3D data structures like point clouds.
  • The text-guided filter suggests a way to incorporate language priors into vision models for better class balance (a minimal sketch follows this list).
  • Lower compute from skipping empties may enable real-time SSC on embedded hardware.
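As a speculative illustration of that language-prior point, such a filter could enter as simply as re-weighting image features by their similarity to class-name embeddings. This assumes CLIP-style text embeddings; the paper's actual TGIF pipeline is not described in the material above.

```python
# Speculative sketch of text-guided feature filtering (not the paper's TGIF).
import torch
import torch.nn.functional as F

def text_guided_filter(img_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """img_feats: (B, C, H, W) image features; text_embeds: (K, C), one per class name."""
    B, C, H, W = img_feats.shape
    feats = F.normalize(img_feats.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, C)
    texts = F.normalize(text_embeds, dim=-1)                           # (K, C)
    sim = feats @ texts.t()                                            # cosine sims (B, HW, K)
    weight = sim.max(dim=-1).values.clamp(min=0).view(B, 1, H, W)      # best class match
    return img_feats * (1.0 + weight)                                  # residual-style emphasis
```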

Load-bearing premise

That the reported performance gains result directly from the DSFR module and Foreground Modulation Strategy rather than from choices in training, augmentation, or the underlying network architecture.

What would settle it

Reproducing the baseline methods with identical training settings and finding that adding the proposed modules does not produce the claimed mIoU gains on SemanticKITTI would disprove the contribution of those components.
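For reference, this is the metric under dispute: a minimal mIoU computation for voxel label grids, following the common SemanticKITTI convention of averaging per-class IoU. Exact ignore/empty-voxel handling varies by benchmark and is an assumption here.

```python
# Minimal mIoU over voxel label grids; the ignore-label handling is an assumption.
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray,
             num_classes: int, ignore_label: int = 255) -> float:
    valid = gt != ignore_label                  # exclude unlabeled/ignore voxels
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c) & valid)
        union = np.sum(((pred == c) | (gt == c)) & valid)
        if union > 0:                           # skip classes absent from both grids
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```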

Figures

Figures reproduced from arXiv: 2604.05780 by HaoAng Lu, Longjun Gao, Xiaoning Zhang, Yuanqi Su, Yu Xue.

Figure 1. Motivation for VoxSAMNet. (a) Current methods such as BEVFormer …
Figure 2. Overview of VoxSAMNet and the structure of its modules. (a) The pipeline flow of the proposed VoxSAMNet. (b) The specific …
Figure 3. Result of VoxSAMNet. Qualitative visual comparisons with Monoscene …
Original abstract

Monocular Semantic Scene Completion (SSC) aims to reconstruct complete 3D semantic scenes from a single RGB image, offering a cost-effective solution for autonomous driving and robotics. However, the inherently imbalanced nature of voxel distributions, where over 93% of voxels are empty and foreground classes are rare, poses significant challenges. Existing methods often suffer from redundant emphasis on uninformative voxels and poor generalization to long-tailed categories. To address these issues, we propose VoxSAMNet (Voxel Sparsity-Aware Modulation Network), a unified framework that explicitly models voxel sparsity and semantic imbalance. Our approach introduces: (1) a Dummy Shortcut for Feature Refinement (DSFR) module that bypasses empty voxels via a shared dummy node while refining occupied ones with deformable attention; and (2) a Foreground Modulation Strategy combining Foreground Dropout (FD) and Text-Guided Image Filter (TGIF) to alleviate overfitting and enhance class-relevant features. Extensive experiments on the public benchmarks SemanticKITTI and SSCBench-KITTI-360 demonstrate that VoxSAMNet achieves state-of-the-art performance, surpassing prior monocular and stereo baselines with mIoU scores of 18.2% and 20.2%, respectively. Our results highlight the importance of sparsity-aware and semantics-guided design for efficient and accurate 3D scene completion, offering a promising direction for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes VoxSAMNet for monocular 3D semantic scene completion, addressing the challenges of extreme voxel sparsity (over 93% empty voxels) and long-tailed foreground classes. It introduces the DSFR module, which uses a dummy shortcut to bypass empty voxels and deformable attention on occupied ones, along with a Foreground Modulation Strategy combining Foreground Dropout (FD) and Text-Guided Image Filter (TGIF). Experiments on SemanticKITTI and SSCBench-KITTI-360 report SOTA mIoU scores of 18.2% and 20.2%, outperforming prior monocular and stereo baselines.

Significance. If the reported gains prove robustly attributable to the sparsity-aware and foreground-modulation components rather than implementation details, the work would offer a practical advance in efficient SSC for autonomous driving by reducing redundant computation on empty space and improving rare-class performance.

major comments (1)
  1. [Experiments] Experiments section (likely §4): The central attribution of the 18.2%/20.2% mIoU gains to DSFR and FD+TGIF is load-bearing but unsupported without matched re-implementations of baselines under identical training schedules, augmentations, optimizers, and backbones. Table 1 or 2 reports overall results but provides no ablation isolating these modules from confounders, undermining the claim that sparsity modeling and text-guided filtering are the sources of improvement.
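To make the requested protocol concrete, here is a minimal sketch of the matched ablation grid the comment asks for. The config keys, module flags, and the train/evaluate callables are hypothetical stand-ins, not the authors' code; the point is that only the module toggles vary across runs.

```python
# Hypothetical ablation grid: every training choice is frozen except the
# three module toggles, so any mIoU difference is attributable to the modules.
from itertools import product

BASE = dict(backbone="resnet50", optimizer="adamw", lr=2e-4,
            epochs=30, augmentations="default", seed=0)

def run_ablation(train, evaluate):
    results = {}
    for dsfr, fd, tgif in product([False, True], repeat=3):
        cfg = dict(BASE, use_dsfr=dsfr, use_fd=fd, use_tgif=tgif)
        model = train(cfg)                       # identical schedule for every cell
        results[(dsfr, fd, tgif)] = evaluate(model, split="val")  # per-cell mIoU
    return results
```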
minor comments (1)
  1. [Abstract] The abstract and introduction could more precisely define the DSFR dummy node and TGIF text embedding process to aid reproducibility.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below and commit to revisions that strengthen the experimental validation.

Point-by-point responses
  1. Referee: The central attribution of the 18.2%/20.2% mIoU gains to DSFR and FD+TGIF is load-bearing but unsupported without matched re-implementations of baselines under identical training schedules, augmentations, optimizers, and backbones. Table 1 or 2 reports overall results but provides no ablation isolating these modules from confounders, undermining the claim that sparsity modeling and text-guided filtering are the sources of improvement.

    Authors: We agree that clear isolation of the DSFR module and Foreground Modulation Strategy (FD + TGIF) is essential to support the attribution of gains. The manuscript already contains ablation studies (Tables 3 and 4) that remove each proposed component while holding training schedule, augmentations, optimizer, and backbone fixed, showing consistent drops in mIoU. To directly address the concern about matched baseline re-implementations, we will add a new set of experiments in the revised version that re-train the strongest prior monocular and stereo baselines under identical conditions to our method. These results will be reported alongside the existing tables to demonstrate that the observed improvements stem from the sparsity-aware and foreground-modulation designs rather than implementation differences. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture evaluated on external benchmarks

Full rationale

The paper proposes VoxSAMNet with the DSFR module (dummy shortcut plus deformable attention) and Foreground Modulation (FD plus TGIF), then reports mIoU results from training and evaluation on SemanticKITTI and SSCBench-KITTI-360. These are standard empirical outcomes on external data under standard training procedures, not predictions or derivations that reduce to the paper's own inputs or equations by construction. There are no mathematical first-principles claims, no fitted parameters renamed as predictions, and no self-citation chains bearing the central result. The work is evaluated directly against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a purely empirical computer-vision paper. No theoretical axioms, free parameters in a derivation, or newly postulated physical entities are introduced; the model parameters are learned from data and the modules are engineering choices.

pith-pipeline@v0.9.0 · 5560 in / 1256 out tokens · 31023 ms · 2026-05-10T20:17:36.252192+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

56 extracted references · 14 canonical work pages · 2 internal anchors

  1. Anunay, Pankaj, and Chhavi Dhiman. Depthnet: A monocular depth estimation framework. In 2021 International Conference on Engineering and Emerging Technologies (ICEET), pages 1–6, 2021.
  2. Jongseong Bae, Junwoo Ha, and Ha Young Kim. Three cars approaching within 100m! Enhancing distant geometry by tri-axis voxel scanning for camera-based semantic scene completion, 2025.
  3. Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. SemanticKITTI: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9297–9307, 2019.
  4. Anh-Quan Cao and Raoul de Charette. Monoscene: Monocular 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3991–4001, 2022.
  5. Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 424–432. Springer, 2016.
  6. Yubo Cui, Zhiheng Li, Jiaqiang Wang, and Zheng Fang. Loma: Language-assisted semantic occupancy network via triplane mamba. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2609–2617, 2025.
  7. Ross Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
  8. Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
  9. Yu He, Kang Zhou, and Lifang Tian. Multi-modal scene graph inspired policy for visual navigation. The Journal of Supercomputing, 81(1):107, 2025.
  10. Haoyi Jiang, Tianheng Cheng, Naiyu Gao, Haoyang Zhang, Tianwei Lin, Wenyu Liu, and Xinggang Wang. Symphonize 3d semantic scene completion with contextual instance queries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20258–20267, 2024.
  11. Hyo-Jun Lee, Yeong Jun Koh, Hanul Kim, Hyunseop Kim, Yonguk Lee, and Jinu Lee. Soap: Vision-centric 3d semantic scene completion with scene-adaptive decoder and occluded region-aware view projection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17145–17154, 2025.
  12. Bohan Li, Yasheng Sun, Xin Jin, Wenjun Zeng, Zheng Zhu, Xiaofeng Wang, Yunpeng Zhang, James Okae, Hang Xiao, and Dalong Du. Stereoscene: BEV-assisted stereo matching empowers 3d semantic scene completion. arXiv preprint arXiv:2303.13959, 1(3):6, 2023.
  13. Hongyang Li, Hao Zhang, Zhaoyang Zeng, Shilong Liu, Feng Li, Tianhe Ren, and Lei Zhang. Dfa3d: 3d deformable attention for 2d-to-3d feature lifting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6684–6693, 2023.
  14. Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.
  15. Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M. Alvarez, Sanja Fidler, Chen Feng, and Anima Anandkumar. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9087–9098, 2023.
  16. Yiming Li, Sihang Li, Xinhao Liu, Moonjun Gong, Kenan Li, Nuo Chen, Zijun Wang, Zhiheng Li, Tao Jiang, Fisher Yu, et al. Sscbench: A large-scale 3d semantic scene completion benchmark for autonomous driving. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 13333–13340. IEEE, 2024.
  17. Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270, 2022.
  18. Zhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, and Jose M Alvarez. Fb-occ: 3d occupancy prediction based on forward-backward view transformation. arXiv preprint arXiv:2307.01492, 2023.
  19. Jing Liang, He Yin, Xuewei Qi, Jong Jin Park, Min Sun, Rajasimman Madhivanan, and Dinesh Manocha. Etformer: Efficient triplane deformable attention for 3d semantic scene completion from monocular camera. arXiv preprint arXiv:2410.11019, 2024.
  20. Li Liang, Naveed Akhtar, Jordan Vice, Xiangrui Kong, and Ajmal Saeed Mian. Skip mamba diffusion for monocular 3d semantic scene completion. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5155–5163, 2025.
  21. Yiyi Liao, Jun Xie, and Andreas Geiger. Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3292–3310, 2022.
  22. Liang Lin, Zhihao Xu, Junhao Dong, Jian Zhao, Yuchen Yuan, Guibin Zhang, Miao Yu, Yiming Zhang, Zhengtao Yao, Huahui Yi, et al. Orthalign: Orthogonal subspace decomposition for non-interfering multi-objective alignment. arXiv preprint arXiv:2509.24610, 2025.
  23. Liang Lin, Miao Yu, Kaiwen Luo, Yibo Zhang, Lilan Peng, Dexian Wang, Xuehai Tang, Yuanhe Zhang, Xikang Yang, Zhenhong Zhou, et al. Hidden in the noise: Unveiling backdoors in audio llms alignment through latent acoustic pattern triggers. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 32015–32023, 2026.
  24. Enyu Liu, En Yu, Sijia Chen, and Wenbing Tao. Disentangling instance and scene contexts for 3d semantic scene completion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 26999–27009, 2025.
  25. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
  26. Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.
  27. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
  28. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  29. Haoang Lu, Yuanqi Su, Xiaoning Zhang, Longjun Gao, Yu Xue, and Le Wang. Vishall3d: Monocular semantic scene completion from reconstructing the visible regions to hallucinating the invisible regions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 28674–28684, 2025.
  30. Jianbiao Mei, Yu Yang, Mengmeng Wang, Junyu Zhu, Jongwon Ra, Yukai Ma, Laijian Li, and Yong Liu. Camera-based 3d semantic scene completion with sparse guidance network. IEEE Transactions on Image Processing, 33:5468–5481, 2024.
  31. Ruihang Miao, Weizhou Liu, Mingrui Chen, Zheng Gong, Weixin Xu, Chen Hu, and Shuchang Zhou. Occdepth: A depth-aware method for 3d semantic scene completion. arXiv preprint arXiv:2302.13540, 2023.
  32. Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In European Conference on Computer Vision, pages 194–210. Springer, 2020.
  33. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  34. Shuran Song, Fisher Yu, Andy Zeng, Angel X. Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  35. Meng Wang, Yan Ding, Yumeng Liu, Yunchuan Qin, Ruihui Li, and Zhuo Tang. Mixssc: Forward-backward mixture for vision-based 3d semantic scene completion. IEEE Transactions on Circuits and Systems for Video Technology, 35(6):5684–5696, 2025.
  36. Meng Wang, Huilong Pi, Ruihui Li, Yunchuan Qin, Zhuo Tang, and Kenli Li. Vlscene: Vision-language guidance distillation for camera-based 3d semantic scene completion. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7808–7816, 2025.
  37. Meng Wang, Fan Wu, Ruihui Li, Yunchuan Qin, Zhuo Tang, and Kenli Li. Learning temporal 3d semantic scene completion via optical flow guidance. arXiv preprint arXiv:2502.14520, 2025.
  38. Meng Wang, Fan Wu, Yunchuan Qin, Ruihui Li, Zhuo Tang, and Kenli Li. Vision-based 3d semantic scene completion via capture dynamic representations. Knowledge-Based Systems, page 114550, 2025.
  39. Song Wang, Jiawei Yu, Wentong Li, Wenyu Liu, Xiaolu Liu, Junbo Chen, and Jianke Zhu. Not all voxels are equal: Hardness-aware semantic scene completion with self-distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14792–14801, 2024.
  40. Yu Wang and Chao Tong. H2gformer: Horizontal-to-global voxel transformer for 3d semantic scene completion. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5722–5730, 2024.
  41. Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning, pages 180–191. PMLR, 2022.
  42. Jiawei Yao and Jusheng Zhang. Depthssc: Depth-spatial alignment and dynamic voxel resolution for monocular 3d semantic scene completion. arXiv preprint arXiv:2311.17084, 7:16, 2023.
  43. Jiawei Yao, Chuming Li, Keqiang Sun, Yingjie Cai, Hao Li, Wanli Ouyang, and Hongsheng Li. Ndc-scene: Boost monocular 3d semantic scene completion in normalized device coordinates space. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9421–9431. IEEE, 2023.
  44. Zhu Yu, Runmin Zhang, Jiacheng Ying, Junchen Yu, Xiaohai Hu, Lun Luo, Si-Yuan Cao, and Hui-Liang Shen. Context and geometry aware voxel transformer for semantic scene completion. Advances in Neural Information Processing Systems, 37:1531–1555, 2024.
  45. Dongxu Zhang, Ning Yang, Jihua Zhu, Jinnan Yang, Miao Xin, and Baoliang Tian. Ascot: An adaptive self-correction chain-of-thought method for late-stage fragility in llms. arXiv preprint arXiv:2508.05282, 2025.
  46. Dongxu Zhang, Hongqiang Lin, Yiding Sun, Pengyu Wang, Qirui Wang, Ning Yang, and Jihua Zhu. Not all queries need deep thought: Coficot for adaptive coarse-to-fine stateful refinement. arXiv preprint arXiv:2603.08251, 2026.
  47. Dongxu Zhang, Yiding Sun, Pengcheng Li, Yumou Liu, Hongqiang Lin, Haoran Xu, Xiaoxuan Mu, Liang Lin, Wenbiao Yan, Ning Yang, et al. Pointcot: A multi-modal benchmark for explicit 3d geometric reasoning. arXiv preprint arXiv:2602.23945, 2026.
  48. Dongxu Zhang, Yiding Sun, Cheng Tan, Wenbiao Yan, Ning Yang, Jihua Zhu, and Haijun Zhang. Chain-of-thought compression should not be blind: V-skip for efficient multimodal reasoning via dual-path anchoring. arXiv preprint arXiv:2601.13879, 2026.
  49. Dongxu Zhang, Yingsen Wang, Yiding Sun, Haoran Xu, Peilin Fan, and Jihua Zhu. Cmhanet: A cross-modal hybrid attention network for point cloud registration. Neurocomputing, page 133318, 2026.
  50. Dongxu Zhang, Jihua Zhu, Shiqi Li, Wenbiao Yan, Haoran Xu, Peilin Fan, and Huimin Lu. Igasa: Integrated geometry-aware and skip-attention modules for enhanced point cloud registration. IEEE Transactions on Circuits and Systems for Video Technology, 2026.
  51. Renrui Zhang, Han Qiu, Tai Wang, Ziyu Guo, Ziteng Cui, Yu Qiao, Hongsheng Li, and Peng Gao. Monodetr: Depth-guided transformer for monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9155–9166, 2023.
  52. Yunpeng Zhang, Zheng Zhu, and Dalong Du. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9433–9443, 2023.
  53. Yuheng Zhang, Mengfei Duan, Kunyu Peng, Yuhang Wang, Ruiping Liu, Fei Teng, Kai Luo, Zhiyong Li, and Kailun Yang. Out-of-distribution semantic occupancy prediction.
  54. Yuchen Zhang, Yaxiong Wang, Yujiao Wu, Lianwei Wu, Li Zhu, and Zhedong Zheng. The coherence trap: When mllm-crafted narratives exploit manipulated visual contexts, 2026.
  55. Yupeng Zheng, Xiang Li, Pengfei Li, Yuhang Zheng, Bu Jin, Chengliang Zhong, Xiaoxiao Long, Hao Zhao, and Qichao Zhang. Monoocc: Digging into monocular semantic occupancy prediction. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 18398–18405. IEEE, 2024.
  56. Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.