Recognition: unknown
CLLAP: Contrastive Learning-based LiDAR-Augmented Pretraining for Enhanced Radar-Camera Fusion
Pith reviewed 2026-05-08 04:52 UTC · model grok-4.3
The pith
CLLAP generates pseudo-radar from LiDAR data to pretrain radar-camera fusion models via contrastive learning for improved 3D object detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLLAP leverages abundant LiDAR data to generate pseudo-radar data using the proposed L2R Sampling method, then feeds this data into a novel dual-stage, dual-modality contrastive learning strategy that enables effective self-supervised learning from paired pseudo-radar and image data. This procedure pretrains existing radar-camera fusion models in a plug-and-play manner, enhancing their feature extraction and 3D detection performance.
What carries the argument
The L2R (LiDAR-to-Radar) Sampling method that converts LiDAR point clouds into pseudo-radar returns, paired with a dual-stage dual-modality contrastive learning objective that aligns pseudo-radar and camera features for self-supervised pretraining.
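The review does not reproduce the details of L2R Sampling, but the general idea of converting a LiDAR sweep into radar-like returns can be illustrated with a minimal sketch; the elevation cutoff, point budget, and noise levels below are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of LiDAR-to-pseudo-radar conversion in the spirit of L2R
# Sampling. All thresholds and noise parameters are illustrative assumptions.
import numpy as np

def lidar_to_pseudo_radar(points, max_elev_deg=5.0, n_keep=200,
                          range_sigma=0.3, azim_sigma_deg=0.5, seed=0):
    """points: (N, 3) LiDAR x, y, z in the ego frame -> (M, 2) pseudo-radar x, y."""
    gen = np.random.default_rng(seed)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.hypot(x, y)                       # ground-plane range
    elev = np.degrees(np.arctan2(z, r))      # elevation angle
    azim = np.arctan2(y, x)                  # azimuth angle

    # Radar effectively sees a thin, near-horizontal slice of the scene.
    keep = np.abs(elev) < max_elev_deg
    r, azim = r[keep], azim[keep]

    # Radar returns are far sparser than LiDAR: subsample to a small budget.
    idx = gen.choice(r.size, size=min(n_keep, r.size), replace=False)
    r, azim = r[idx], azim[idx]

    # Perturb range and azimuth to mimic radar measurement noise.
    r = r + gen.normal(0.0, range_sigma, r.shape)
    azim = azim + gen.normal(0.0, np.radians(azim_sigma_deg), azim.shape)

    return np.stack([r * np.cos(azim), r * np.sin(azim)], axis=1)
```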
If this is right
- Existing radar-camera fusion architectures gain improved feature extractors without requiring large amounts of annotated radar data.
- Detection accuracy and robustness increase on standard autonomous-driving benchmarks such as NuScenes and Lyft Level 5 across multiple baseline models.
- The pretraining procedure can be inserted as a modular step before supervised fine-tuning on any radar-camera fusion pipeline.
- Performance benefits appear in both normal and adverse weather conditions where radar is intended to provide complementary information to cameras.
Where Pith is reading between the lines
- If the L2R-generated pseudo-radar proves sufficiently realistic, the same pipeline could be scaled to much larger unlabeled LiDAR corpora to produce ever-stronger initializations.
- The contrastive pretraining strategy might be adapted to other sensor pairs where one modality is data-rich and the other is annotation-scarce, such as camera-thermal or camera-sonar fusion.
- A direct test on real radar sequences that lack corresponding LiDAR could reveal whether the learned representations remain effective when the input distribution shifts away from the pseudo-radar training regime.
Load-bearing premise
Pseudo-radar signals created from LiDAR by the L2R method capture enough of the statistical and geometric properties of real radar returns that pretraining on them transfers usefully to models later trained on actual radar data.
What would settle it
A controlled experiment in which a radar-camera fusion model is first pretrained with CLLAP on LiDAR-derived pseudo-radar and then fine-tuned on a fixed real-radar dataset, compared against an identical model trained from scratch on the same real-radar data; if the CLLAP-pretrained version shows no accuracy gain or a clear drop, the central claim is falsified.
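Schematically, that comparison is a two-arm protocol with a shared architecture and shared real-radar fine-tuning data; the sketch below uses placeholder function and dataset names, not the paper's actual API.

```python
# Schematic of the controlled comparison described above. All callables and
# dataset handles are hypothetical placeholders; only the structure matters:
# identical architecture and fine-tuning, with and without CLLAP pretraining.
def run_controlled_comparison(build_model, cllap_pretrain, finetune, evaluate,
                              pseudo_radar_image_pairs, real_radar_train, real_radar_test):
    # Arm A: CLLAP pretraining on LiDAR-derived pseudo-radar, then fine-tuning.
    model_a = build_model()
    model_a = cllap_pretrain(model_a, pseudo_radar_image_pairs)
    model_a = finetune(model_a, real_radar_train)

    # Arm B: identical model trained from scratch on the same real-radar data.
    model_b = build_model()
    model_b = finetune(model_b, real_radar_train)

    # The central claim is falsified if Arm A shows no gain (or a clear drop).
    return evaluate(model_a, real_radar_test), evaluate(model_b, real_radar_test)
```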
Original abstract
Accurate 3D object detection is critical for autonomous driving, necessitating reliable, cost-effective sensors capable of operating in adverse weather conditions. Camera and millimeter-wave radar fusion has emerged as a promising solution; however, these methods often rely on finely annotated radar data, which is scarce and labor-intensive to produce. To address this challenge, we present CLLAP, a Contrastive Learning-based LiDAR-Augmented Pretraining framework that enhances the performance of existing radar-camera fusion methods for 3D object detection. CLLAP leverages abundant LiDAR data to generate pseudo-radar data using the proposed L2R (LiDAR-to-Radar) Sampling method. Then, it incorporates this data into a novel dual-stage, dual-modality contrastive learning strategy, enabling effective self-supervised learning from paired pseudo-radar and image data. This approach facilitates effective pretraining of existing radar-camera fusion models in a plug-and-play manner, enhancing their feature extraction capabilities and improving detection accuracy and robustness. Experimental results using NuScenes and Lyft Level 5 datasets demonstrate significant performance improvements across three baseline models, highlighting CLLAP's effectiveness in advancing radar-camera fusion for autonomous driving applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CLLAP, a Contrastive Learning-based LiDAR-Augmented Pretraining framework for improving radar-camera fusion in 3D object detection. It generates pseudo-radar data from abundant LiDAR using the L2R Sampling method and employs a dual-stage, dual-modality contrastive learning strategy to pretrain fusion models using paired pseudo-radar and image data. This plug-and-play pretraining is claimed to enhance feature extraction and detection performance, with experimental results on NuScenes and Lyft Level 5 datasets showing significant improvements across three baseline models.
Significance. If the central assumption holds—that pseudo-radar generated via L2R sufficiently approximates real mmWave radar for effective transfer learning—this work could substantially address the data scarcity issue in radar-camera fusion, enabling better use of self-supervised learning from LiDAR to boost performance in adverse conditions. It builds on standard contrastive learning techniques and public datasets, offering a practical way to leverage more abundant sensor data.
major comments (2)
- [L2R Sampling method description] The fidelity of the generated pseudo-radar to real radar returns is the load-bearing assumption for the entire framework. The manuscript does not provide statistical comparisons (e.g., sparsity, range-azimuth distributions, or reflection patterns) or ablations against simpler LiDAR projections to validate that L2R captures radar-specific traits like noise and multi-path effects sufficiently for the contrastive pretraining to transfer to real radar-camera tasks. One concrete form such a comparison could take is sketched after this list.
- [Experimental results] The claim of 'significant performance improvements' on NuScenes and Lyft datasets across three baselines lacks any quantitative metrics, ablation studies, or error analysis. This omission prevents assessment of whether the gains are substantial, consistent, or attributable to the pretraining rather than other factors.
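One way to operationalize the requested fidelity check is to compare per-frame point counts and normalized range-azimuth histograms between pseudo-radar and real radar; the statistics and distance measure in this sketch are illustrative assumptions, not the authors' protocol.

```python
# Illustrative fidelity check for pseudo-radar vs. real radar, assuming each
# frame is an (N_i, 2) array of (range, azimuth) returns. The chosen statistics
# (mean point count, total-variation distance between 2D histograms) are ours.
import numpy as np

def fidelity_report(pseudo_frames, real_frames, n_range=32, n_azim=36, max_range=100.0):
    def normalized_histogram(frames):
        pts = np.concatenate(frames, axis=0)
        h, _, _ = np.histogram2d(pts[:, 0], pts[:, 1],
                                 bins=[n_range, n_azim],
                                 range=[[0.0, max_range], [-np.pi, np.pi]])
        return h / h.sum()

    # Sparsity gap: difference in mean returns per frame.
    sparsity_gap = abs(np.mean([len(f) for f in pseudo_frames]) -
                       np.mean([len(f) for f in real_frames]))
    # Total-variation distance between range-azimuth occupancy distributions.
    tv_dist = 0.5 * np.abs(normalized_histogram(pseudo_frames) -
                           normalized_histogram(real_frames)).sum()
    return {"mean_point_count_gap": sparsity_gap, "range_azimuth_tv": tv_dist}
```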
minor comments (1)
- [Abstract] Including specific quantitative results (e.g., mAP improvements) would strengthen the abstract and allow readers to immediately gauge the claimed gains.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.
Point-by-point responses
-
Referee: The fidelity of the generated pseudo-radar to real radar returns is the load-bearing assumption for the entire framework. The manuscript does not provide statistical comparisons (e.g., sparsity, range-azimuth distributions, or reflection patterns) or ablations against simpler LiDAR projections to validate that L2R captures radar-specific traits like noise and multi-path effects sufficiently for the contrastive pretraining to transfer to real radar-camera tasks.
Authors: We agree that validating the fidelity of the L2R Sampling method is critical to substantiate the core assumption of the framework. The current manuscript describes the L2R method and its design rationale but does not include the requested statistical validations or ablations. In the revised manuscript, we will add quantitative statistical comparisons of sparsity, range-azimuth distributions, and reflection patterns between pseudo-radar and real radar data. We will also include ablation experiments contrasting L2R against simpler LiDAR projections to demonstrate the value of capturing radar-specific traits such as noise and multi-path effects for effective transfer to real radar-camera fusion tasks. revision: yes
-
Referee: The claim of 'significant performance improvements' on NuScenes and Lyft datasets across three baselines lacks any quantitative metrics, ablation studies, or error analysis. This omission prevents assessment of whether the gains are substantial, consistent, or attributable to the pretraining rather than other factors.
Authors: We acknowledge that the experimental section requires more rigorous quantitative support and analysis to fully substantiate the performance claims. While the manuscript reports improvements across baselines on the two datasets, it does not provide the level of detail requested. In the revision, we will expand the results with specific quantitative metrics (including exact mAP/NDS deltas), detailed ablation studies isolating the contributions of the dual-stage dual-modality contrastive pretraining, and error analysis to assess consistency and attribute gains specifically to the pretraining rather than other factors. revision: yes
Circularity Check
No circularity: method uses external LiDAR data and standard contrastive learning with empirical validation on public datasets
full rationale
The paper's chain proceeds from abundant external LiDAR point clouds (NuScenes, Lyft) through a newly proposed L2R sampling procedure to generate pseudo-radar, followed by dual-stage contrastive pretraining on pseudo-radar/image pairs, then plug-and-play fine-tuning on real radar-camera fusion baselines. None of these steps reduce by construction to their own outputs: L2R is an explicit sampling rule, not a fitted parameter renamed as a prediction; contrastive loss is the standard InfoNCE formulation applied to generated pairs; performance gains are measured on held-out real radar data rather than on the pseudo-radar used for pretraining. No self-citation supplies a uniqueness theorem or load-bearing premise, and no equation equates a derived quantity to an input by definition. The transfer assumption (pseudo-radar fidelity) is an empirical claim subject to external falsification, not a circular derivation.
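The rationale refers to the standard InfoNCE formulation applied to generated pairs; a minimal sketch of that loss on paired pseudo-radar and image embeddings follows. The symmetric two-direction form and temperature value are generic choices, not necessarily the paper's exact configuration.

```python
# Minimal InfoNCE (contrastive) loss over paired pseudo-radar / image embeddings,
# in the standard formulation the rationale refers to. Symmetric form and
# temperature are generic assumptions, not the paper's exact configuration.
import torch
import torch.nn.functional as F

def info_nce(radar_feats, image_feats, temperature=0.07):
    """radar_feats, image_feats: (B, D) embeddings for B paired samples."""
    radar = F.normalize(radar_feats, dim=1)
    image = F.normalize(image_feats, dim=1)
    logits = radar @ image.t() / temperature                  # (B, B) similarities
    targets = torch.arange(len(radar), device=logits.device)  # diagonal = positives
    # Each pseudo-radar embedding should match its own image, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```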
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pseudo-radar data generated from LiDAR can serve as an effective proxy for real radar in self-supervised contrastive pretraining of fusion models
invented entities (1)
-
L2R (LiDAR-to-Radar) Sampling method
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding
Mohamed Afham, Isuru Dissanayake, Dinithi Dissanayake, Amaya Dharmasiri, Kanchana Thilakarathna, and Ranga Rodrigo. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9902–9912, 2022.
2022
-
[2]
Radardistill: Boosting radar-based object detection performance via knowledge distillation from lidar features
Geonho Bang, Kwangjin Choi, Jisong Kim, Dongsuk Kum, and Jun Won Choi. Radardistill: Boosting radar-based object detection performance via knowledge distillation from lidar features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15491–15500, 2024.
2024
-
[3]
Rctdistill: Cross-modal knowledge distillation framework for radar-camera 3d object detection with temporal fusion
Geonho Bang, Minjae Seong, Jisong Kim, Geunju Baek, Daye Oh, Junhyung Kim, Junho Koh, and Jun Won Choi. Rctdistill: Cross-modal knowledge distillation framework for radar-camera 3d object detection with temporal fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 25315–25324, 2025.
2025
-
[4]
Felipe Manfio Barbosa and Fernando Santos Osório. Camera-radar perception for autonomous vehicles and ADAS: Concepts, datasets and metrics. arXiv preprint arXiv:2303.04302, 2023.
-
[5]
nuScenes: A multimodal dataset for autonomous driving
Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.
2020
-
[6]
Benchmarking robustness of 3d object detection to common corruptions
Yinpeng Dong, Caixin Kang, Jinlai Zhang, Zijian Zhu, Yikai Wang, Xiao Yang, Hang Su, Xingxing Wei, and Jun Zhu. Benchmarking robustness of 3d object detection to common corruptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1022–1032, 2023.
2023
-
[7]
A point set generation network for 3d object reconstruction from a single image
Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 605–613, 2017.
2017
-
[8]
4d mmwave radar for autonomous driving perception: a comprehensive survey. IEEE Transactions on Intelligent Vehicles, 2024
Lili Fan, Junhao Wang, Yuanmeng Chang, Yuke Li, Yutong Wang, and Dongpu Cao. 4d mmwave radar for autonomous driving perception: a comprehensive survey. IEEE Transactions on Intelligent Vehicles, 2024.
2024
-
[9]
Deformable feature fusion network for multi-modal 3d object detection
Kun Guo, Tong Gan, Zhao Ding, and Qiang Ling. Deformable feature fusion network for multi-modal 3d object detection. In 2024 3rd International Conference on Robotics, Artificial Intelligence and Intelligent Control (RAIIC), pages 363–367. IEEE, 2024.
2024
-
[10]
Multimodal 3d object detection on unseen domains
Deepti Hegde, Suhas Lohit, Kuan-Chuan Peng, Michael J Jones, and Vishal M Patel. Multimodal 3d object detection on unseen domains. arXiv preprint arXiv:2404.11764, 2024.
-
[11]
One thousand and one hours: Self-driving motion prediction dataset
John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Long Chen, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset. In Conference on Robot Learning, pages 409–418. PMLR, 2021.
2021
-
[12]
Craft: Camera-radar 3d object detection with spatio-contextual fusion transformer
Youngseok Kim, Sanmin Kim, Jun Won Choi, and Dongsuk Kum. Craft: Camera-radar 3d object detection with spatio-contextual fusion transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1160–1168, 2023.
2023
-
[13]
Crn: Camera radar net for accurate, robust, efficient 3d perception
Youngseok Kim, Juyeb Shin, Sanmin Kim, In-Jae Lee, Jun Won Choi, and Dongsuk Kum. Crn: Camera radar net for accurate, robust, efficient 3d perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17615–17626, 2023.
2023
-
[14]
Clusterfusion: Leveraging radar spatial features for radar-camera 3d object detection in autonomous vehicles. IEEE Access, 2023
Irfan Tito Kurniawan and Bambang Riyanto Trilaksono. Clusterfusion: Leveraging radar spatial features for radar-camera 3d object detection in autonomous vehicles. IEEE Access, 2023.
2023
-
[15]
Modcl: multi-modal object detection with end-to-end contrastive learning in indoor scene
Zixu Lan, Fang Deng, Angang Zhang, and Zhongjian Chen. Modcl: multi-modal object detection with end-to-end contrastive learning in indoor scene. In International Conference on Algorithms, High Performance Computing, and Artificial Intelligence (AHPCAI 2024), pages 1032–1038. SPIE, 2024.
2024
-
[16]
Samplenet: Differentiable point cloud sampling
Itai Lang, Asaf Manor, and Shai Avidan. Samplenet: Differentiable point cloud sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7578–7588, 2020.
2020
-
[17]
Contrastive representation learning: A framework and review
Phuc H Le-Khac, Graham Healy, and Alan F Smeaton. Contrastive representation learning: A framework and review. IEEE Access, 8:193907–193934, 2020.
2020
-
[18]
Lidar-to-radar translation based on voxel feature extraction module for radar data augmentation
Jinho Lee, Geonkyu Bang, Takaya Shimizu, Masato Iehara, and Shunsuke Kamijo. Lidar-to-radar translation based on voxel feature extraction module for radar data augmentation. Sensors, 24(2):559, 2024.
2024
-
[19]
Rcbevdet: Radar-camera fusion in bird’s eye view for 3d object detection
Zhiwei Lin, Zhe Liu, Zhongyu Xia, Xinhao Wang, Yongtao Wang, Shengxiang Qi, Yang Dong, Nan Dong, Le Zhang, and Ce Zhu. Rcbevdet: Radar-camera fusion in bird's eye view for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14928–14937, 2024.
2024
-
[20]
Flownet3d: Learning scene flow in 3d point clouds
Xingyu Liu, Charles R Qi, and Leonidas J Guibas. Flownet3d: Learning scene flow in 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 529–537, 2019.
2019
-
[21]
V2x-dsi: A density-sensitive infrastructure lidar benchmark for economic vehicle-to-everything cooperative perception
Xinyu Liu, Baolu Li, Runsheng Xu, Jiaqi Ma, Xiaopeng Li, Jinlong Li, and Hongkai Yu. V2x-dsi: A density-sensitive infrastructure lidar benchmark for economic vehicle-to-everything cooperative perception. In 2024 IEEE Intelligent Vehicles Symposium (IV), pages 490–495. IEEE, 2024.
2024
-
[22]
Bevfusion: Multi-task multi-sensor fusion with unified bird's-eye view representation
Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela L Rus, and Song Han. Bevfusion: Multi-task multi-sensor fusion with unified bird's-eye view representation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2774–2781. IEEE, 2023.
2023
-
[23]
Centerfusion: Center-based radar and camera fusion for 3d object detection
Ramin Nabati and Hairong Qi. Centerfusion: Center-based radar and camera fusion for 3d object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1527–1536, 2021.
2021
-
[24]
Representation Learning with Contrastive Predictive Coding
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
2018
-
[25]
Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection
Jinhyung Park, Chenfeng Xu, Shijia Yang, Kurt Keutzer, Kris Kitani, Masayoshi Tomizuka, and Wei Zhan. Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection. arXiv preprint arXiv:2210.02443, 2022.
-
[26]
Vision-based smart monitoring and assessment of highway pavement infrastructures
Cheng Peng. Vision-Based Smart Monitoring and Assessment of Highway Pavement Infrastructures. PhD thesis, Purdue University Graduate School.
-
[27]
Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John Duchi, and Percy Liang. Understanding and mitigating the tradeoff between robustness and accuracy. arXiv preprint arXiv:2002.10716, 2020.
-
[28]
Jonas Schramm, Niclas Vödisch, Kürsat Petek, B Ravi Kiran, Senthil Yogamani, Wolfram Burgard, and Abhinav Valada. Bevcar: Camera-radar fusion for bev map and object segmentation. arXiv preprint arXiv:2403.11761, 2024.
-
[29]
Ziying Song, Feiyang Jia, Hongyu Pan, Yadan Luo, Caiyan Jia, Guoxin Zhang, Lin Liu, Yang Ji, Lei Yang, and Li Wang. Contrastalign: Toward robust bev feature alignment via contrastive learning for multi-modal 3d object detection. arXiv preprint arXiv:2405.16873, 2024.
-
[30]
L2r gan: Lidar-to-radar translation
Leichen Wang, Bastian Goldluecke, and Carsten Anklam. L2r gan: Lidar-to-radar translation. In Proceedings of the Asian Conference on Computer Vision, 2020.
2020
-
[31]
Exploring object-centric temporal modeling for efficient multi-view 3d object detection
Shihao Wang, Yingfei Liu, Tiancai Wang, Ying Li, and Xiangyu Zhang. Exploring object-centric temporal modeling for efficient multi-view 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3621–3631, 2023.
2023
-
[32]
Crrfnet: An adaptive traffic object detection method based on camera and radar radio frequency fusion. Transportation Research Part C: Emerging Technologies, 166:104791, 2024
Wenbo Wang and Weibin Zhang. Crrfnet: An adaptive traffic object detection method based on camera and radar radio frequency fusion. Transportation Research Part C: Emerging Technologies, 166:104791, 2024.
2024
-
[33]
Attention-based point cloud edge sampling
Chengzhi Wu, Junwei Zheng, Julius Pfrommer, and Jürgen Beyerer. Attention-based point cloud edge sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5333–5343, 2023.
2023
-
[34]
Mvfusion: Multi-view 3d object detection with semantic-aligned radar and camera fusion
Zizhang Wu, Guilian Chen, Yuanzhu Gan, Lei Wang, and Jian Pu. Mvfusion: Multi-view 3d object detection with semantic-aligned radar and camera fusion. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2766–2773. IEEE, 2023.
2023
-
[35]
Sckd: Semi-supervised cross-modality knowledge distillation for 4d radar object detection
Ruoyu Xu, Zhiyu Xiang, Chenwei Zhang, Hanzhi Zhong, Xijun Zhao, Ruina Dang, Peng Xu, Tianyu Pu, and Eryun Liu. Sckd: Semi-supervised cross-modality knowledge distillation for 4d radar object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8933–8941, 2025.
2025
-
[36]
Radar-camera fusion for object detection and semantic segmentation in autonomous driving: A comprehensive review. IEEE Transactions on Intelligent Vehicles, 2023
Shanliang Yao, Runwei Guan, Xiaoyu Huang, Zhuoxiao Li, Xiangyu Sha, Yong Yue, Eng Gee Lim, Hyungjoon Seo, Ka Lok Man, Xiaohui Zhu, et al. Radar-camera fusion for object detection and semantic segmentation in autonomous driving: A comprehensive review. IEEE Transactions on Intelligent Vehicles, 2023.
2023
-
[37]
Pastefusion: innovating multimodal sensor fusion for enhanced 3d object detection
Yuhong Yuan, Kai Zhang, Mingbo Yang, Shuxiang Li, and Yu Liang. Pastefusion: innovating multimodal sensor fusion for enhanced 3d object detection. In International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024), pages 932–938. SPIE, 2024.
2024
-
[38]
Contrastive late fusion for 3d object detection. IEEE Transactions on Intelligent Vehicles, 2024
Tingyu Zhang, Zhigang Liang, Yanzhao Yang, Xinyu Yang, Yu Zhu, and Jian Wang. Contrastive late fusion for 3d object detection. IEEE Transactions on Intelligent Vehicles, 2024.
2024
-
[39]
Crkd: Enhanced camera-radar object detection with cross-modality knowledge distillation
Lingjun Zhao, Jingyu Song, and Katherine A Skinner. Crkd: Enhanced camera-radar object detection with cross-modality knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15470–15480, 2024.
2024
-
[40]
Bev-radar: bidirectional radar-camera fusion for 3d object detection. JUSTC, 54(1):0101–1, 2024
Yuan Zhao, Lu Zhang, Jiajun Deng, and Yanyong Zhang. Bev-radar: bidirectional radar-camera fusion for 3d object detection. JUSTC, 54(1):0101–1, 2024.
2024
-
[41]
Bridging the view disparity between radar and camera features for multi-modal fusion 3d object detection. IEEE Transactions on Intelligent Vehicles, 8(2):1523–1535, 2023
Taohua Zhou, Junjie Chen, Yining Shi, Kun Jiang, Mengmeng Yang, and Diange Yang. Bridging the view disparity between radar and camera features for multi-modal fusion 3d object detection. IEEE Transactions on Intelligent Vehicles, 8(2):1523–1535, 2023.
2023
-
[42]
Voxelnet: End-to-end learning for point cloud based 3d object detection
Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.
2018
-
[43]
Overview. The appendix offers comprehensive explanations of the methodologies introduced in the main text, together with additional experimental results and extended visual analyses. The supplementary material is organized into the following sections: • Sec. 2 Methodology Supplement – Sec. 2.1 Sliding Window Feature Matching Mechanism – Sec. 2.2 BCSA Module...
-
[44]
Methodology Supplement 2.1. Sliding Window Feature Matching Mechanism. Cross-modality feature misalignment presents a significant challenge in multi-modal contrastive learning for radar-camera fusion, as naively treating spatially corresponding features as positive pairs often results in suboptimal alignment. To address this limitation, we proposed a me...
-
[45]
Visualization of Experimental Results Figure 8 provides a visual comparison between the results produced by our proposed method and those generated by the CRN baseline
Visual supplementation 3.1. Visualization of Experimental Results. Figure 8 provides a visual comparison between the results produced by our proposed method and those generated by the CRN baseline. The green solid rectangle denotes the ground truth bounding box, the red dotted rectangle represents the prediction from the baseline model, and the blue dott...
-
[46]
We adopt the SGD optimizer with a learning rate of 2×10⁻⁴, momentum of 0.9, and weight decay of 1×10⁻⁵
Supplementary Experiments. Implementation Settings. Our proposed model is implemented using the PyTorch framework and trained on NVIDIA GeForce RTX 4090 and NVIDIA H800 Tensor Core GPUs. We adopt the SGD optimizer with a learning rate of 2×10⁻⁴, momentum of 0.9, and weight decay of 1×10⁻⁵. The batch size is set to 6 during pretraining. Figure 10. Adverse Weath...
2023
discussion (0)