Recognition: unknown
CLLAP: Contrastive Learning-based LiDAR-Augmented Pretraining for Enhanced Radar-Camera Fusion
Pith reviewed 2026-05-08 04:52 UTC · model grok-4.3
The pith
CLLAP generates pseudo-radar from LiDAR data to pretrain radar-camera fusion models via contrastive learning for improved 3D object detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLLAP leverages abundant LiDAR data to generate pseudo-radar data using the proposed L2R Sampling method, then feeds this data into a novel dual-stage, dual-modality contrastive learning strategy that enables effective self-supervised learning from paired pseudo-radar and image data. This procedure pretrains existing radar-camera fusion models in a plug-and-play manner, enhancing their feature extraction and 3D detection performance.
What carries the argument
The L2R (LiDAR-to-Radar) Sampling method that converts LiDAR point clouds into pseudo-radar returns, paired with a dual-stage dual-modality contrastive learning objective that aligns pseudo-radar and camera features for self-supervised pretraining.
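The review does not reproduce the details of L2R Sampling, but the general idea of converting a LiDAR sweep into radar-like returns can be illustrated with a minimal sketch; the elevation cutoff, point budget, and noise levels below are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of LiDAR-to-pseudo-radar conversion in the spirit of L2R
# Sampling. All thresholds and noise parameters are illustrative assumptions.
import numpy as np

def lidar_to_pseudo_radar(points, max_elev_deg=5.0, n_keep=200,
                          range_sigma=0.3, azim_sigma_deg=0.5, seed=0):
    """points: (N, 3) LiDAR x, y, z in the ego frame -> (M, 2) pseudo-radar x, y."""
    gen = np.random.default_rng(seed)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.hypot(x, y)                       # ground-plane range
    elev = np.degrees(np.arctan2(z, r))      # elevation angle
    azim = np.arctan2(y, x)                  # azimuth angle

    # Radar effectively sees a thin, near-horizontal slice of the scene.
    keep = np.abs(elev) < max_elev_deg
    r, azim = r[keep], azim[keep]

    # Radar returns are far sparser than LiDAR: subsample to a small budget.
    idx = gen.choice(r.size, size=min(n_keep, r.size), replace=False)
    r, azim = r[idx], azim[idx]

    # Perturb range and azimuth to mimic radar measurement noise.
    r = r + gen.normal(0.0, range_sigma, r.shape)
    azim = azim + gen.normal(0.0, np.radians(azim_sigma_deg), azim.shape)

    return np.stack([r * np.cos(azim), r * np.sin(azim)], axis=1)
```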
If this is right
- Existing radar-camera fusion architectures gain improved feature extractors without requiring large amounts of annotated radar data.
- Detection accuracy and robustness increase on standard autonomous-driving benchmarks such as NuScenes and Lyft Level 5 across multiple baseline models.
- The pretraining procedure can be inserted as a modular step before supervised fine-tuning on any radar-camera fusion pipeline.
- Performance benefits appear in both normal and adverse weather conditions where radar is intended to provide complementary information to cameras.
Where Pith is reading between the lines
- If the L2R-generated pseudo-radar proves sufficiently realistic, the same pipeline could be scaled to much larger unlabeled LiDAR corpora to produce ever-stronger initializations.
- The contrastive pretraining strategy might be adapted to other sensor pairs where one modality is data-rich and the other is annotation-scarce, such as camera-thermal or camera-sonar fusion.
- A direct test on real radar sequences that lack corresponding LiDAR could reveal whether the learned representations remain effective when the input distribution shifts away from the pseudo-radar training regime.
Load-bearing premise
Pseudo-radar signals created from LiDAR by the L2R method capture enough of the statistical and geometric properties of real radar returns that pretraining on them transfers usefully to models later trained on actual radar data.
What would settle it
A controlled experiment in which a radar-camera fusion model is first pretrained with CLLAP on LiDAR-derived pseudo-radar and then fine-tuned on a fixed real-radar dataset, compared against an identical model trained from scratch on the same real-radar data; if the CLLAP-pretrained version shows no accuracy gain or a clear drop, the central claim is falsified.
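Schematically, that comparison is a two-arm protocol with a shared architecture and shared real-radar fine-tuning data; the sketch below uses placeholder function and dataset names, not the paper's actual API.

```python
# Schematic of the controlled comparison described above. All callables and
# dataset handles are hypothetical placeholders; only the structure matters:
# identical architecture and fine-tuning, with and without CLLAP pretraining.
def run_controlled_comparison(build_model, cllap_pretrain, finetune, evaluate,
                              pseudo_radar_image_pairs, real_radar_train, real_radar_test):
    # Arm A: CLLAP pretraining on LiDAR-derived pseudo-radar, then fine-tuning.
    model_a = build_model()
    model_a = cllap_pretrain(model_a, pseudo_radar_image_pairs)
    model_a = finetune(model_a, real_radar_train)

    # Arm B: identical model trained from scratch on the same real-radar data.
    model_b = build_model()
    model_b = finetune(model_b, real_radar_train)

    # The central claim is falsified if Arm A shows no gain (or a clear drop).
    return evaluate(model_a, real_radar_test), evaluate(model_b, real_radar_test)
```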
Original abstract
Accurate 3D object detection is critical for autonomous driving, necessitating reliable, cost-effective sensors capable of operating in adverse weather conditions. Camera and millimeter-wave radar fusion has emerged as a promising solution; however, these methods often rely on finely annotated radar data, which is scarce and labor-intensive to produce. To address this challenge, we present CLLAP, a Contrastive Learning-based LiDAR-Augmented Pretraining framework that enhances the performance of existing radar-camera fusion methods for 3D object detection. CLLAP leverages abundant LiDAR data to generate pseudo-radar data using the proposed L2R (LiDAR-to-Radar) Sampling method. Then, it incorporates this data into a novel dual-stage, dual-modality contrastive learning strategy, enabling effective self-supervised learning from paired pseudo-radar and image data. This approach facilitates effective pretraining of existing radar-camera fusion models in a plug-and-play manner, enhancing their feature extraction capabilities and improving detection accuracy and robustness. Experimental results using NuScenes and Lyft Level 5 datasets demonstrate significant performance improvements across three baseline models, highlighting CLLAP's effectiveness in advancing radar-camera fusion for autonomous driving applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CLLAP, a Contrastive Learning-based LiDAR-Augmented Pretraining framework for improving radar-camera fusion in 3D object detection. It generates pseudo-radar data from abundant LiDAR using the L2R Sampling method and employs a dual-stage, dual-modality contrastive learning strategy to pretrain fusion models using paired pseudo-radar and image data. This plug-and-play pretraining is claimed to enhance feature extraction and detection performance, with experimental results on NuScenes and Lyft Level 5 datasets showing significant improvements across three baseline models.
Significance. If the central assumption holds—that pseudo-radar generated via L2R sufficiently approximates real mmWave radar for effective transfer learning—this work could substantially address the data scarcity issue in radar-camera fusion, enabling better use of self-supervised learning from LiDAR to boost performance in adverse conditions. It builds on standard contrastive learning techniques and public datasets, offering a practical way to leverage more abundant sensor data.
major comments (2)
- [L2R Sampling method description] The fidelity of the generated pseudo-radar to real radar returns is the load-bearing assumption for the entire framework. The manuscript does not provide statistical comparisons (e.g., sparsity, range-azimuth distributions, or reflection patterns) or ablations against simpler LiDAR projections to validate that L2R captures radar-specific traits like noise and multi-path effects sufficiently for the contrastive pretraining to transfer to real radar-camera tasks. One concrete form such a comparison could take is sketched after this list.
- [Experimental results] The claim of 'significant performance improvements' on NuScenes and Lyft datasets across three baselines lacks any quantitative metrics, ablation studies, or error analysis. This omission prevents assessment of whether the gains are substantial, consistent, or attributable to the pretraining rather than other factors.
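One way to operationalize the requested fidelity check is to compare per-frame point counts and normalized range-azimuth histograms between pseudo-radar and real radar; the statistics and distance measure in this sketch are illustrative assumptions, not the authors' protocol.

```python
# Illustrative fidelity check for pseudo-radar vs. real radar, assuming each
# frame is an (N_i, 2) array of (range, azimuth) returns. The chosen statistics
# (mean point count, total-variation distance between 2D histograms) are ours.
import numpy as np

def fidelity_report(pseudo_frames, real_frames, n_range=32, n_azim=36, max_range=100.0):
    def normalized_histogram(frames):
        pts = np.concatenate(frames, axis=0)
        h, _, _ = np.histogram2d(pts[:, 0], pts[:, 1],
                                 bins=[n_range, n_azim],
                                 range=[[0.0, max_range], [-np.pi, np.pi]])
        return h / h.sum()

    # Sparsity gap: difference in mean returns per frame.
    sparsity_gap = abs(np.mean([len(f) for f in pseudo_frames]) -
                       np.mean([len(f) for f in real_frames]))
    # Total-variation distance between range-azimuth occupancy distributions.
    tv_dist = 0.5 * np.abs(normalized_histogram(pseudo_frames) -
                           normalized_histogram(real_frames)).sum()
    return {"mean_point_count_gap": sparsity_gap, "range_azimuth_tv": tv_dist}
```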
minor comments (1)
- [Abstract] Including specific quantitative results (e.g., mAP improvements) would strengthen the abstract and allow readers to immediately gauge the claimed gains.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.
Point-by-point responses
-
Referee: The fidelity of the generated pseudo-radar to real radar returns is the load-bearing assumption for the entire framework. The manuscript does not provide statistical comparisons (e.g., sparsity, range-azimuth distributions, or reflection patterns) or ablations against simpler LiDAR projections to validate that L2R captures radar-specific traits like noise and multi-path effects sufficiently for the contrastive pretraining to transfer to real radar-camera tasks.
Authors: We agree that validating the fidelity of the L2R Sampling method is critical to substantiate the core assumption of the framework. The current manuscript describes the L2R method and its design rationale but does not include the requested statistical validations or ablations. In the revised manuscript, we will add quantitative statistical comparisons of sparsity, range-azimuth distributions, and reflection patterns between pseudo-radar and real radar data. We will also include ablation experiments contrasting L2R against simpler LiDAR projections to demonstrate the value of capturing radar-specific traits such as noise and multi-path effects for effective transfer to real radar-camera fusion tasks. revision: yes
-
Referee: The claim of 'significant performance improvements' on NuScenes and Lyft datasets across three baselines lacks any quantitative metrics, ablation studies, or error analysis. This omission prevents assessment of whether the gains are substantial, consistent, or attributable to the pretraining rather than other factors.
Authors: We acknowledge that the experimental section requires more rigorous quantitative support and analysis to fully substantiate the performance claims. While the manuscript reports improvements across baselines on the two datasets, it does not provide the level of detail requested. In the revision, we will expand the results with specific quantitative metrics (including exact mAP/NDS deltas), detailed ablation studies isolating the contributions of the dual-stage dual-modality contrastive pretraining, and error analysis to assess consistency and attribute gains specifically to the pretraining rather than other factors. revision: yes
Circularity Check
No circularity: method uses external LiDAR data and standard contrastive learning with empirical validation on public datasets
full rationale
The paper's chain proceeds from abundant external LiDAR point clouds (NuScenes, Lyft) through a newly proposed L2R sampling procedure to generate pseudo-radar, followed by dual-stage contrastive pretraining on pseudo-radar/image pairs, then plug-and-play fine-tuning on real radar-camera fusion baselines. None of these steps reduce by construction to their own outputs: L2R is an explicit sampling rule, not a fitted parameter renamed as a prediction; contrastive loss is the standard InfoNCE formulation applied to generated pairs; performance gains are measured on held-out real radar data rather than on the pseudo-radar used for pretraining. No self-citation supplies a uniqueness theorem or load-bearing premise, and no equation equates a derived quantity to an input by definition. The transfer assumption (pseudo-radar fidelity) is an empirical claim subject to external falsification, not a circular derivation.
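The rationale refers to the standard InfoNCE formulation applied to generated pairs; a minimal sketch of that loss on paired pseudo-radar and image embeddings follows. The symmetric two-direction form and temperature value are generic choices, not necessarily the paper's exact configuration.

```python
# Minimal InfoNCE (contrastive) loss over paired pseudo-radar / image embeddings,
# in the standard formulation the rationale refers to. Symmetric form and
# temperature are generic assumptions, not the paper's exact configuration.
import torch
import torch.nn.functional as F

def info_nce(radar_feats, image_feats, temperature=0.07):
    """radar_feats, image_feats: (B, D) embeddings for B paired samples."""
    radar = F.normalize(radar_feats, dim=1)
    image = F.normalize(image_feats, dim=1)
    logits = radar @ image.t() / temperature                  # (B, B) similarities
    targets = torch.arange(len(radar), device=logits.device)  # diagonal = positives
    # Each pseudo-radar embedding should match its own image, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```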
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pseudo-radar data generated from LiDAR can serve as an effective proxy for real radar in self-supervised contrastive pretraining of fusion models
invented entities (1)
-
L2R (LiDAR-to-Radar) Sampling method
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding
Mohamed Afham, Isuru Dissanayake, Dinithi Dissanayake, Amaya Dharmasiri, Kanchana Thilakarathna, and Ranga Rodrigo. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9902–9912, 2022.
2022
-
[2]
Radardistill: Boosting radar-based object detection performance via knowledge distillation from lidar features
Geonho Bang, Kwangjin Choi, Jisong Kim, Dongsuk Kum, and Jun Won Choi. Radardistill: Boosting radar-based object detection performance via knowledge distillation from lidar features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15491–15500, 2024.
2024
-
[3]
Rctdistill: Cross-modal knowledge distillation framework for radar-camera 3d object detection with temporal fusion
Geonho Bang, Minjae Seong, Jisong Kim, Geunju Baek, Daye Oh, Junhyung Kim, Junho Koh, and Jun Won Choi. Rctdistill: Cross-modal knowledge distillation framework for radar-camera 3d object detection with temporal fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 25315–25324, 2025.
2025
-
[4]
Felipe Manfio Barbosa and Fernando Santos Osório. Camera-radar perception for autonomous vehicles and ADAS: Concepts, datasets and metrics. arXiv preprint arXiv:2303.04302, 2023.
-
[5]
nuScenes: A multimodal dataset for autonomous driving
Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.
2020
-
[6]
Benchmarking robustness of 3d object detection to common corruptions
Yinpeng Dong, Caixin Kang, Jinlai Zhang, Zijian Zhu, Yikai Wang, Xiao Yang, Hang Su, Xingxing Wei, and Jun Zhu. Benchmarking robustness of 3d object detection to common corruptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1022–1032, 2023.
2023
-
[7]
A point set generation network for 3d object reconstruction from a single image
Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 605–613, 2017.
2017
-
[8]
4d mmwave radar for autonomous driving perception: a comprehensive survey. IEEE Transactions on Intelligent Vehicles, 2024
Lili Fan, Junhao Wang, Yuanmeng Chang, Yuke Li, Yutong Wang, and Dongpu Cao. 4d mmwave radar for autonomous driving perception: a comprehensive survey. IEEE Transactions on Intelligent Vehicles, 2024.
2024
-
[9]
Deformable feature fusion network for multi-modal 3d object detection
Kun Guo, Tong Gan, Zhao Ding, and Qiang Ling. Deformable feature fusion network for multi-modal 3d object detection. In 2024 3rd International Conference on Robotics, Artificial Intelligence and Intelligent Control (RAIIC), pages 363–367. IEEE, 2024.
2024
-
[10]
Multimodal 3d object detection on unseen domains
Deepti Hegde, Suhas Lohit, Kuan-Chuan Peng, Michael J Jones, and Vishal M Patel. Multimodal 3d object detection on unseen domains. arXiv preprint arXiv:2404.11764, 2024.
-
[11]
One thousand and one hours: Self-driving motion prediction dataset
John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Long Chen, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset. In Conference on Robot Learning, pages 409–418. PMLR, 2021.
2021
-
[12]
Craft: Camera-radar 3d object detection with spatio-contextual fusion transformer
Youngseok Kim, Sanmin Kim, Jun Won Choi, and Dongsuk Kum. Craft: Camera-radar 3d object detection with spatio-contextual fusion transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1160–1168, 2023.
2023
-
[13]
Crn: Camera radar net for accurate, robust, efficient 3d perception
Youngseok Kim, Juyeb Shin, Sanmin Kim, In-Jae Lee, Jun Won Choi, and Dongsuk Kum. Crn: Camera radar net for accurate, robust, efficient 3d perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17615–17626, 2023.
2023
-
[14]
Clusterfusion: Leveraging radar spatial features for radar-camera 3d object detection in autonomous vehicles. IEEE Access, 2023
Irfan Tito Kurniawan and Bambang Riyanto Trilaksono. Clusterfusion: Leveraging radar spatial features for radar-camera 3d object detection in autonomous vehicles. IEEE Access, 2023.
2023
-
[15]
Modcl: multi-modal object detection with end-to-end contrastive learning in indoor scene
Zixu Lan, Fang Deng, Angang Zhang, and Zhongjian Chen. Modcl: multi-modal object detection with end-to-end contrastive learning in indoor scene. In International Conference on Algorithms, High Performance Computing, and Artificial Intelligence (AHPCAI 2024), pages 1032–1038. SPIE, 2024.
2024
-
[16]
Samplenet: Differentiable point cloud sampling
Itai Lang, Asaf Manor, and Shai Avidan. Samplenet: Differentiable point cloud sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7578–7588, 2020.
2020
-
[17]
Contrastive representation learning: A framework and review
Phuc H Le-Khac, Graham Healy, and Alan F Smeaton. Contrastive representation learning: A framework and review. IEEE Access, 8:193907–193934, 2020.
2020
-
[18]
Lidar-to-radar translation based on voxel feature extraction module for radar data augmentation
Jinho Lee, Geonkyu Bang, Takaya Shimizu, Masato Iehara, and Shunsuke Kamijo. Lidar-to-radar translation based on voxel feature extraction module for radar data augmentation. Sensors, 24(2):559, 2024.
2024
-
[19]
Rcbevdet: Radar-camera fusion in bird’s eye view for 3d object detection
Zhiwei Lin, Zhe Liu, Zhongyu Xia, Xinhao Wang, Yongtao Wang, Shengxiang Qi, Yang Dong, Nan Dong, Le Zhang, and Ce Zhu. Rcbevdet: Radar-camera fusion in bird's eye view for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14928–14937, 2024.
2024
-
[20]
Flownet3d: Learning scene flow in 3d point clouds
Xingyu Liu, Charles R Qi, and Leonidas J Guibas. Flownet3d: Learning scene flow in 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 529–537, 2019.
2019
-
[21]
V2x-dsi: A density-sensitive infrastructure lidar benchmark for economic vehicle-to-everything cooperative perception
Xinyu Liu, Baolu Li, Runsheng Xu, Jiaqi Ma, Xiaopeng Li, Jinlong Li, and Hongkai Yu. V2x-dsi: A density-sensitive infrastructure lidar benchmark for economic vehicle-to-everything cooperative perception. In 2024 IEEE Intelligent Vehicles Symposium (IV), pages 490–495. IEEE, 2024.
2024
-
[22]
Bevfusion: Multi-task multi-sensor fusion with unified bird's-eye view representation
Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela L Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird's-eye view representation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2774–2781. IEEE, 2023.
2023
-
[23]
Centerfusion: Center-based radar and camera fusion for 3d object detection
Ramin Nabati and Hairong Qi. Centerfusion: Center-based radar and camera fusion for 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1527–1536, 2021.
2021
-
[24]
Representation Learning with Contrastive Predictive Coding
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
2018
-
[25]
Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection
Jinhyung Park, Chenfeng Xu, Shijia Yang, Kurt Keutzer, Kris Kitani, Masayoshi Tomizuka, and Wei Zhan. Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection. arXiv preprint arXiv:2210.02443, 2022.
-
[26]
Vision-based smart monitoring and assessment of highway pavement infrastructures
Cheng Peng. Vision-Based Smart Monitoring and Assessment of Highway Pavement Infrastructures. PhD thesis, Purdue University Graduate School.
-
[27]
Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John Duchi, and Percy Liang. Understanding and mitigating the tradeoff between robustness and accuracy. arXiv preprint arXiv:2002.10716, 2020.
-
[28]
Jonas Schramm, Niclas Vödisch, Kürsat Petek, B Ravi Kiran, Senthil Yogamani, Wolfram Burgard, and Abhinav Valada. Bevcar: Camera-radar fusion for bev map and object segmentation. arXiv preprint arXiv:2403.11761, 2024.
-
[29]
Ziying Song, Feiyang Jia, Hongyu Pan, Yadan Luo, Caiyan Jia, Guoxin Zhang, Lin Liu, Yang Ji, Lei Yang, and Li Wang. Contrastalign: Toward robust bev feature alignment via contrastive learning for multi-modal 3d object detection. arXiv preprint arXiv:2405.16873, 2024.
-
[30]
L2r gan: Lidar-to-radar translation
Leichen Wang, Bastian Goldluecke, and Carsten Anklam. L2r gan: Lidar-to-radar translation. In Proceedings of the Asian Conference on Computer Vision, 2020.
2020
-
[31]
Exploring object-centric temporal modeling for efficient multi-view 3d object detection
Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xiangyu Zhang. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3621–3631, 2023.
2023
-
[32]
Crrfnet: An adaptive traffic object detection method based on camera and radar radio frequency fusion. Transportation Research Part C: Emerging Technologies, 166:104791, 2024
Wenbo Wang and Weibin Zhang. Crrfnet: An adaptive traffic object detection method based on camera and radar radio frequency fusion. Transportation Research Part C: Emerging Technologies, 166:104791, 2024.
2024
-
[33]
Attention-based point cloud edge sampling
Chengzhi Wu, Junwei Zheng, Julius Pfrommer, and Jürgen Beyerer. Attention-based point cloud edge sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5333–5343, 2023.
2023
-
[34]
Mvfusion: Multi-view 3d object detection with semantic-aligned radar and camera fusion
Zizhang Wu, Guilian Chen, Yuanzhu Gan, Lei Wang, and Jian Pu. Mvfusion: Multi-view 3d object detection with semantic-aligned radar and camera fusion. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2766–2773. IEEE, 2023.
2023
-
[35]
Sckd: Semi-supervised cross-modality knowledge distillation for 4d radar object detection
Ruoyu Xu, Zhiyu Xiang, Chenwei Zhang, Hanzhi Zhong, Xijun Zhao, Ruina Dang, Peng Xu, Tianyu Pu, and Eryun Liu. Sckd: Semi-supervised cross-modality knowledge distillation for 4d radar object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8933–8941, 2025.
2025
-
[36]
Radar-camera fusion for object detection and semantic segmentation in autonomous driving: A comprehensive review. IEEE Transactions on Intelligent Vehicles, 2023
Shanliang Yao, Runwei Guan, Xiaoyu Huang, Zhuoxiao Li, Xiangyu Sha, Yong Yue, Eng Gee Lim, Hyungjoon Seo, Ka Lok Man, Xiaohui Zhu, et al. Radar-camera fusion for object detection and semantic segmentation in autonomous driving: A comprehensive review. IEEE Transactions on Intelligent Vehicles, 2023.
2023
-
[37]
Pastefusion: innovating multimodal sensor fusion for enhanced 3d object detection
Yuhong Yuan, Kai Zhang, Mingbo Yang, Shuxiang Li, and Yu Liang. Pastefusion: innovating multimodal sensor fusion for enhanced 3d object detection. In International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024), pages 932–938. SPIE, 2024.
2024
-
[38]
Contrastive late fusion for 3d object detection. IEEE Transactions on Intelligent Vehicles, 2024
Tingyu Zhang, Zhigang Liang, Yanzhao Yang, Xinyu Yang, Yu Zhu, and Jian Wang. Contrastive late fusion for 3d object detection. IEEE Transactions on Intelligent Vehicles, 2024.
2024
-
[39]
Crkd: Enhanced camera-radar object detection with cross-modality knowledge distillation
Lingjun Zhao, Jingyu Song, and Katherine A Skinner. Crkd: Enhanced camera-radar object detection with cross-modality knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15470–15480, 2024.
2024
-
[40]
Bev-radar: bidirectional radar-camera fusion for 3d object detection. JUSTC, 54(1):0101–1, 2024
Yuan Zhao, Lu Zhang, Jiajun Deng, and Yanyong Zhang. Bev-radar: bidirectional radar-camera fusion for 3d object detection. JUSTC, 54(1):0101–1, 2024.
2024
-
[41]
Bridging the view disparity between radar and camera features for multi-modal fusion 3d object detection. IEEE Transactions on Intelligent Vehicles, 8(2):1523–1535, 2023
Taohua Zhou, Junjie Chen, Yining Shi, Kun Jiang, Mengmeng Yang, and Diange Yang. Bridging the view disparity between radar and camera features for multi-modal fusion 3d object detection. IEEE Transactions on Intelligent Vehicles, 8(2):1523–1535, 2023.
2023
-
[42]
Voxelnet: End-to-end learning for point cloud based 3d object detection
Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.
2018
-
[43]
Overview. The appendix offers comprehensive explanations of the methodologies introduced in the main text, together with additional experimental results and extended visual analyses. The supplementary material is organized into the following sections: • Sec. 2 Methodology Supplement – Sec. 2.1 Sliding Window Feature Matching Mechanism – Sec. 2.2 BCSA Module...
-
[44]
Methodology Supplement 2.1. Sliding Window Feature Matching Mechanism. Cross-modality feature misalignment presents a significant challenge in multi-modal contrastive learning for radar-camera fusion, as naively treating spatially corresponding features as positive pairs often results in suboptimal alignment. To address this limitation, we proposed a me...
-
[45]
Visualization of Experimental Results Figure 8 provides a visual comparison between the results produced by our proposed method and those generated by the CRN baseline
Visual supplementation 3.1. Visualization of Experimental Results. Figure 8 provides a visual comparison between the results produced by our proposed method and those generated by the CRN baseline. The green solid rectangle denotes the ground truth bounding box, the red dotted rectangle represents the prediction from the baseline model, and the blue dott...
-
[46]
We adopt the SGD optimizer with a learning rate of 2×10⁻⁴, momentum of 0.9, and weight decay of 1×10⁻⁵
Supplementary Experiments. Implementation Settings. Our proposed model is implemented using the PyTorch framework and trained on NVIDIA GeForce RTX 4090 and NVIDIA H800 Tensor Core GPUs. We adopt the SGD optimizer with a learning rate of 2×10⁻⁴, momentum of 0.9, and weight decay of 1×10⁻⁵. The batch size is set to 6 during pretraining. Figure 10. Adverse Weath...
2023
discussion (0)