arxiv: 2509.23880 · v3 · submitted 2025-09-28 · 💻 cs.CV

Learning Adaptive Pseudo-Label Selection for Semi-Supervised 3D Object Detection

Pith reviewed 2026-05-18 11:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords semi-supervised 3D object detection · pseudo-label selection · teacher-student framework · adaptive thresholding · quality assessment network · KITTI · Waymo

The pith

Two networks trained on pseudo-label alignment with ground truth enable automatic adaptive selection of high-quality labels for semi-supervised 3D object detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to replace manual or partially informed confidence thresholds with a learnable module that picks reliable pseudo-labels from a teacher model. It adds a quality-assessment network that fuses available scores and a separate network that sets thresholds according to context such as object distance, class, and training state. Both networks are trained directly on how closely the chosen pseudo-labels match ground-truth boxes, and a soft-supervision step lets the student model down-weight noisier labels. A sympathetic reader would expect this to let semi-supervised 3D detectors exploit more unlabeled scenes without losing precision or missing rare contexts.
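The selection step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `fused_quality` and `context_threshold` are hypothetical hand-written stand-ins for the two learned networks, and the weights, base thresholds, and the choice to relax the threshold with distance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PseudoLabel:
    cls_name: str      # predicted class
    distance_m: float  # distance from the sensor, in metres
    cls_conf: float    # classification confidence in [0, 1]
    objectness: float  # objectness score in [0, 1]


def fused_quality(label, w_conf=0.5, w_obj=0.5):
    """Hypothetical stand-in for the quality-assessment network: a fixed weighted mean."""
    return w_conf * label.cls_conf + w_obj * label.objectness


def context_threshold(label, base=None, slope_per_m=0.002, floor=0.3):
    """Hypothetical stand-in for the threshold network.

    Assumption for illustration only: the threshold depends on class and
    decreases linearly with distance, never dropping below `floor`.
    """
    base = base or {"Car": 0.7, "Pedestrian": 0.6, "Cyclist": 0.6}
    return max(floor, base[label.cls_name] - slope_per_m * label.distance_m)


def select(labels):
    """Keep a pseudo-label when its fused quality reaches its own context threshold."""
    return [lab for lab in labels if fused_quality(lab) >= context_threshold(lab)]
```

With these made-up numbers, a distant pedestrian with fused quality 0.55 passes its relaxed threshold of about 0.5, whereas a single fixed cutoff of 0.6 would have discarded it. That is the kind of contextual coverage the summary refers to.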

Core claim

The central claim is that a learnable pseudo-labeling module, built from a quality-assessment network performing score fusion and a threshold network producing context-adaptive decisions, can be supervised by the geometric alignment of pseudo-labels to ground-truth boxes; when combined with soft supervision that prioritizes cleaner labels, this module selects high-precision pseudo-labels while preserving wider contextual coverage and higher recall than fixed-threshold or earlier dynamic methods.

What carries the argument

A learnable pseudo-labeling module containing a quality-assessment network and a context-adaptive threshold network, both supervised by the alignment between pseudo-labels and ground-truth bounding boxes.
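One way to picture the alignment signal is as an IoU between each pseudo-label and the ground-truth boxes on labeled frames. The sketch below is a simplification under stated assumptions: boxes are axis-aligned and given as `(x_min, y_min, z_min, x_max, y_max, z_max)`, whereas real 3D detection boxes also carry a heading angle, and the function names are hypothetical rather than taken from the paper.

```python
def iou_3d_axis_aligned(a, b):
    """IoU of two axis-aligned boxes (x_min, y_min, z_min, x_max, y_max, z_max)."""
    inter = 1.0
    for i in range(3):
        overlap = min(a[i + 3], b[i + 3]) - max(a[i], b[i])
        if overlap <= 0:
            return 0.0  # no overlap along this axis, so the boxes are disjoint
        inter *= overlap

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)


def quality_targets(pseudo_boxes, gt_boxes):
    """For each pseudo-label, the best IoU with any GT box (0.0 when there are none)."""
    return [
        max((iou_3d_axis_aligned(p, g) for g in gt_boxes), default=0.0)
        for p in pseudo_boxes
    ]
```

Targets of this kind exist only where ground truth exists, which is why the supervision is confined to the labeled subset.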

If this is right

  • The method produces higher overall performance than prior SS3DOD approaches on the KITTI and Waymo benchmarks.
  • Selected pseudo-labels achieve high precision together with broader coverage of contexts and improved recall rates.
  • Soft supervision lets the student network focus training on cleaner labels even when some pseudo-label noise remains.
  • Context-aware thresholds replace the need for hand-tuned confidence cutoffs.
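The soft-supervision point can be illustrated with a quality-weighted loss. The exact formulation used in the paper is not given on this page, so the following is only one plausible weighting scheme, assumed for illustration: each pseudo-label's loss is scaled by its quality score and the result is normalized by the total weight.

```python
def soft_weighted_loss(per_label_losses, quality_scores):
    """Quality-weighted mean of per-label losses.

    Assumed scheme, not the paper's: weights are the raw quality scores.
    Returns 0.0 when every weight is zero.
    """
    if len(per_label_losses) != len(quality_scores):
        raise ValueError("losses and quality scores must have the same length")
    total_weight = sum(quality_scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(loss * q for loss, q in zip(per_label_losses, quality_scores))
    return weighted / total_weight
```

Under this scheme a label with quality 0.75 contributes three times as much to the student's loss as one with quality 0.25, so noisier labels are down-weighted rather than discarded.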

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment-based supervision could be tested on other pseudo-labeling pipelines outside 3D detection.
  • Removing manual threshold search may simplify scaling to new unlabeled datasets or sensor setups.
  • The two-network design might be combined with different teacher-student architectures to check for further gains.

Load-bearing premise

Alignment of pseudo-labels with ground-truth bounding boxes supplies reliable and sufficient supervision for the quality-assessment and threshold networks without introducing harmful bias.

What would settle it

On a held-out validation set with full ground truth, measure whether the adaptive module's selected labels show measurably higher precision at the same or higher recall than a fixed-threshold baseline; if no consistent gain appears across distance and class bins, the adaptive selection claim is refuted.
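The check proposed above amounts to computing precision and recall separately for each context bin and each selection rule. A minimal sketch follows, assuming each candidate pseudo-label has already been assigned to a bin and matched against ground truth, and that each ground-truth box is matched by at most one candidate. The record layout and bin names are hypothetical.

```python
from collections import defaultdict


def precision_recall_by_bin(candidates, gt_counts):
    """Per-bin (precision, recall) of the selected pseudo-labels.

    candidates: iterable of (bin_name, selected, matches_gt) tuples.
    gt_counts:  dict mapping bin_name to the number of GT boxes in that bin.
    """
    selected = defaultdict(int)
    hits = defaultdict(int)
    for bin_name, is_selected, matches_gt in candidates:
        if is_selected:
            selected[bin_name] += 1
            hits[bin_name] += int(matches_gt)

    result = {}
    for bin_name, n_gt in gt_counts.items():
        precision = hits[bin_name] / selected[bin_name] if selected[bin_name] else 0.0
        recall = hits[bin_name] / n_gt if n_gt else 0.0
        result[bin_name] = (precision, recall)
    return result
```

Running this once with the adaptive module's selections and once with a fixed-threshold baseline, over the same candidates and bins, gives the side-by-side comparison the test calls for.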

Figures

Figures reproduced from arXiv: 2509.23880 by Taehun Kong, Tae-Kyun Kim.

Figure 1: Overview of the proposed framework compared to the …
Figure 2: (a), (b), and (c) show that classification confidence and objectness have different distributions depending on the context. (b) …
Figure 3: Overview of the proposed framework, consisting of two main components: the Pseudo-label Selection Module (PSM), which …
Figure 4: The correlation between GT-IoU and each score for …
Figure 5: CTE thresholds by classes and distances.
Figure 7: Quantitative comparisons of pseudo-label qualities on …
Figure 8: Qualitative comparisons of pseudo-labels on KITTI.
Original abstract

Semi-supervised 3D object detection (SS3DOD) aims to reduce costly 3D annotations utilizing unlabeled data. Recent studies adopt pseudo-label-based teacher-student frameworks and demonstrate impressive performance. The main challenge of these frameworks is in selecting high-quality pseudo-labels from the teacher's predictions. Most previous methods, however, select pseudo-labels by comparing confidence scores over thresholds manually set. The latest works tackle the challenge either by dynamic thresholding or refining the quality of pseudo-labels. Such methods still overlook contextual information e.g. object distances, classes, and learning states, and inadequately assess the pseudo-label quality using partial information available from the networks. In this work, we propose a novel SS3DOD framework featuring a learnable pseudo-labeling module designed to automatically and adaptively select high-quality pseudo-labels. Our approach introduces two networks at the teacher output level. These networks reliably assess the quality of pseudo-labels by the score fusion and determine context-adaptive thresholds, which are supervised by the alignment of pseudo-labels over GT bounding boxes. Additionally, we introduce a soft supervision strategy that can learn robustly under pseudo-label noises. This helps the student network prioritize cleaner labels over noisy ones in semi-supervised learning. Extensive experiments on the KITTI and Waymo datasets demonstrate the effectiveness of our method. The proposed method selects high-precision pseudo-labels while maintaining a wider coverage of contexts and a higher recall rate, significantly improving relevant SS3DOD methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a semi-supervised 3D object detection (SS3DOD) framework that augments teacher-student pseudo-labeling with a learnable module containing a quality-assessment network (via score fusion) and a threshold network that produces context-adaptive thresholds. Both networks are supervised by alignment between teacher pseudo-labels and ground-truth boxes on labeled data; a soft-supervision strategy is added to let the student down-weight noisy labels. Experiments on KITTI and Waymo are reported to yield higher-precision pseudo-labels, broader contextual coverage, and improved recall over prior SS3DOD baselines.

Significance. If the empirical gains are reproducible and the learned selector generalizes, the work would meaningfully advance pseudo-label selection in 3D detection by replacing hand-tuned or non-contextual thresholds with data-driven, context-aware components. The soft-supervision mechanism is a constructive addition for noise robustness.

major comments (2)
  1. [Abstract / §3] Abstract and §3 (method): the claim that the two networks learn a generalizable quality function rests on supervision obtained solely from GT alignment on labeled data. Because this supervision is unavailable on the unlabeled distribution, it is unclear whether the resulting thresholds and quality scores avoid systematic under- or over-selection when object distances, classes, or learning states differ from the labeled subset; this directly underpins the stated improvements in recall and contextual coverage.
  2. [§4] §4 (experiments): the abstract describes the method as 'significantly improving relevant SS3DOD methods' yet provides no numerical deltas, ablation tables isolating the contribution of the quality-assessment versus threshold network, or direct comparison against recent dynamic-thresholding baselines; without these the central performance claim cannot be verified.
minor comments (2)
  1. Clarify the precise input features and output heads of the two networks and how their predictions are fused at inference time on unlabeled frames.
  2. Specify the exact formulation of the soft-supervision loss (e.g., weighting scheme or temperature) and whether it is applied only to the student or also back to the teacher.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions we will make.

  1. Referee: [Abstract / §3] Abstract and §3 (method): the claim that the two networks learn a generalizable quality function rests on supervision obtained solely from GT alignment on labeled data. Because this supervision is unavailable on the unlabeled distribution, it is unclear whether the resulting thresholds and quality scores avoid systematic under- or over-selection when object distances, classes, or learning states differ from the labeled subset; this directly underpins the stated improvements in recall and contextual coverage.

    Authors: The quality-assessment and threshold networks are supervised exclusively on labeled data where GT alignments are available. These networks learn to map contextual features—object distances, classes, and model learning state—to quality scores and adaptive thresholds. The same feature extraction is used for unlabeled data, so the learned mapping is applied directly. Experiments on KITTI and Waymo, which contain diverse distances and classes, show higher recall and broader context coverage than baselines, providing empirical evidence that systematic under- or over-selection is mitigated. In revision we will add a paragraph in §3 explicitly discussing the generalization assumption and its empirical support. revision: partial

  2. Referee: [§4] §4 (experiments): the abstract describes the method as 'significantly improving relevant SS3DOD methods' yet provides no numerical deltas, ablation tables isolating the contribution of the quality-assessment versus threshold network, or direct comparison against recent dynamic-thresholding baselines; without these the central performance claim cannot be verified.

    Authors: The current abstract statement is qualitative. Section 4 already reports mAP and recall gains over multiple SS3DOD baselines on both KITTI and Waymo. To address the request directly, we will (1) insert concrete numerical deltas into the abstract, (2) add ablation tables that isolate the quality-assessment network from the threshold network, and (3) include comparisons against recent dynamic-thresholding methods. These changes will be incorporated in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's central mechanism trains quality-assessment and threshold networks via explicit supervision from alignment between teacher pseudo-labels and ground-truth boxes on labeled data, then applies the learned context-adaptive thresholds to unlabeled pseudo-labels. This external GT-derived signal is independent of the model's own predictions on the target unlabeled distribution and does not reduce any claimed prediction or result to a fitted input or self-definition by construction. No equations or steps in the abstract or description equate the output selection to the input supervision through renaming or tautology. The soft-supervision strategy for the student is likewise a standard noise-robust loss rather than a circular re-use of the same fitted values. The overall framework therefore remains self-contained with respect to external benchmarks on KITTI and Waymo.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard semi-supervised assumptions plus the new claim that two auxiliary networks can reliably learn quality and thresholds from GT alignment.

axioms (1)
  • domain assumption: Pseudo-label quality can be reliably assessed and thresholded by learned score fusion and context features.
    This is the core modeling choice introduced in the abstract.
