Control Your Queries: Heterogeneous Query Interaction for Camera-Radar Fusion
Pith reviewed 2026-05-07 16:55 UTC · model grok-4.3
The pith
ConFusion fuses camera and radar by letting image, radar, and world queries interact directly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ConFusion introduces heterogeneous query interaction as a fusion paradigm for camera-radar 3D object detection. It combines image queries, radar queries, and learnable world queries distributed in 3D space to improve query initialization and object coverage. Heterogeneous query mixing performs dedicated cross-type attention after feature sampling to consolidate complementary object evidence, while interactive query swap sampling allows related queries to exchange informative feature tokens under attention and geometric constraints. On the nuScenes dataset this produces 59.1 mAP and 65.6 NDS on validation and 61.6 mAP and 67.9 NDS on the test set.
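To make the mechanism concrete, the following is a minimal sketch of what cross-type attention among the three query types could look like; the class name QMixBlock, the shapes, and the attend-only-to-the-other-types policy are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of QMix-style cross-type attention (not the paper's code).
import torch
import torch.nn as nn

class QMixBlock(nn.Module):
    """Mixes image, radar, and world queries with dedicated cross-type attention.

    Assumed design: after per-query feature sampling, each query type attends
    to the concatenation of the other two types, so complementary object
    evidence can flow across modalities.
    """
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q_img, q_rad, q_world):
        # Each tensor: (batch, num_queries_of_that_type, dim).
        groups = {"img": q_img, "rad": q_rad, "world": q_world}
        out = []
        for name, q in groups.items():
            # Cross-type only: keys/values come from the OTHER query types.
            others = torch.cat([v for k, v in groups.items() if k != name], dim=1)
            mixed, _ = self.attn(q, others, others)
            out.append(self.norm(q + mixed))  # residual + norm
        return tuple(out)

# Toy usage: 2 scenes; 200 image, 100 radar, 900 world queries; dim 256.
qi, qr, qw = (torch.randn(2, n, 256) for n in (200, 100, 900))
qi2, qr2, qw2 = QMixBlock()(qi, qr, qw)
print(qi2.shape, qr2.shape, qw2.shape)
```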
What carries the argument
Heterogeneous query interaction, which mixes image queries, radar queries, and 3D world queries through QMix cross-type attention and QSwap token exchange to consolidate complementary sensor evidence.
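A similarly hedged sketch of the QSwap idea follows, assuming one simple policy: each query exchanges its highest-scoring sampled token with its nearest counterpart of the other type, gated by a 3D distance threshold (the geometric constraint). The function qswap, the score inputs, and the replace-the-weakest-slot rule are hypothetical stand-ins for whatever the paper actually uses.

```python
# Hypothetical sketch of QSwap-style token exchange (assumed semantics).
import torch

def qswap(tokens_a, tokens_b, centers_a, centers_b, scores_a, scores_b,
          radius: float = 2.0):
    """Swap each A-query's best token into its nearest B-query (and back),
    but only when the two queries' 3D reference points lie within `radius`.

    tokens_*:  (N, K, C) sampled feature tokens per query
    centers_*: (N, 3)    query reference points in 3D space
    scores_*:  (N, K)    per-token attention scores
    """
    dist = torch.cdist(centers_a, centers_b)  # (Na, Nb) pairwise distances
    nearest = dist.argmin(dim=1)              # partner index for each A-query
    valid = dist.gather(1, nearest[:, None]).squeeze(1) < radius

    best_a, best_b = scores_a.argmax(dim=1), scores_b.argmax(dim=1)
    swapped_a, swapped_b = tokens_a.clone(), tokens_b.clone()
    for i in torch.nonzero(valid).flatten().tolist():
        j = nearest[i].item()
        # One simple policy: the partner's best token replaces the weakest slot.
        swapped_a[i, scores_a[i].argmin()] = tokens_b[j, best_b[j]]
        swapped_b[j, scores_b[j].argmin()] = tokens_a[i, best_a[i]]
    return swapped_a, swapped_b

# Toy usage: 5 image queries and 4 radar queries, 8 tokens each, 256-d.
ta, tb = torch.randn(5, 8, 256), torch.randn(4, 8, 256)
ca, cb = torch.rand(5, 3) * 10, torch.rand(4, 3) * 10
sa, sb = torch.rand(5, 8), torch.rand(4, 8)
out_a, out_b = qswap(ta, tb, ca, cb, sa, sb)
print(out_a.shape, out_b.shape)
```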
Where Pith is reading between the lines
- The same query-interaction structure could be adapted to camera-lidar fusion by adding lidar-specific query types.
- If the attention mechanisms prove robust, the method might reduce the impact of temporary sensor dropouts in real driving.
- Query swapping under geometric constraints may transfer to other query-based detectors that currently sample features independently.
Load-bearing premise
Cross-type attention after sampling and query swapping will reliably merge useful camera and radar evidence without introducing new conflicts or missed detections on unseen scenes or under degraded sensors.
What would settle it
Running ConFusion on nuScenes validation scenes with synthetic radar noise or camera occlusions added, and measuring whether its mAP falls below that of the best prior method.
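A minimal sketch of the degradation side of such a test, assuming two simple perturbations (dropped and jittered radar returns, a zeroed image rectangle). The detector and the mAP evaluation are stand-ins here, since ConFusion and nuScenes are not bundled with this review.

```python
# Hypothetical degradation harness for the proposed robustness test.
import torch

def degrade_radar(points: torch.Tensor, drop_p: float = 0.3,
                  sigma: float = 0.5) -> torch.Tensor:
    """Randomly drop radar returns and jitter the survivors' xyz positions."""
    keep = torch.rand(points.shape[0]) > drop_p
    noisy = points[keep].clone()
    noisy[:, :3] += sigma * torch.randn_like(noisy[:, :3])
    return noisy

def occlude_camera(image: torch.Tensor, frac: float = 0.25) -> torch.Tensor:
    """Zero out a random rectangle covering roughly `frac` of the image."""
    _, h, w = image.shape
    bh, bw = int(h * frac ** 0.5), int(w * frac ** 0.5)
    y = torch.randint(0, h - bh + 1, (1,)).item()
    x = torch.randint(0, w - bw + 1, (1,)).item()
    out = image.clone()
    out[:, y:y + bh, x:x + bw] = 0.0
    return out

# Toy usage with stand-in data: 100 radar points (x, y, z, rcs, v), one frame.
radar = torch.randn(100, 5)
frame = torch.rand(3, 224, 224)
print(degrade_radar(radar).shape)    # fewer, noisier returns
print(occlude_camera(frame).mean())  # darker image after occlusion
# The actual test would feed clean vs. degraded inputs to ConFusion and the
# strongest prior baseline, then compare the mAP gap on nuScenes validation.
```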
Original abstract
In autonomous driving, camera-radar fusion offers complementary sensing and low deployment cost. Existing methods perform fusion through input mixing, feature map mixing, or query-based feature sampling. We propose a new fusion paradigm, termed heterogeneous query interaction, and present ConFusion, a camera-radar 3D object detector. ConFusion combines image queries, radar queries, and learnable world queries distributed in 3D space to improve query initialization and object coverage. To encourage cross-type interaction among heterogeneous queries, we introduce heterogeneous query mixing (QMix), which performs dedicated cross-type attention after feature sampling to consolidate complementary object evidence. We further propose interactive query swap sampling (QSwap), which improves feature sampling by allowing related queries to exchange informative feature tokens under attention and geometric constraints. Experiments on the nuScenes dataset show that ConFusion achieves state-of-the-art performance, reaching 59.1 mAP and 65.6 NDS on the validation set, and 61.6 mAP and 67.9 NDS on the test set.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ConFusion, a camera-radar 3D object detector for autonomous driving that uses heterogeneous queries (image queries, radar queries, and learnable world queries) distributed in 3D space. It proposes heterogeneous query mixing (QMix) to perform dedicated cross-type attention after feature sampling for consolidating complementary evidence, and interactive query swap sampling (QSwap) to allow related queries to exchange tokens under attention and geometric constraints. Experiments on nuScenes report state-of-the-art results of 59.1 mAP and 65.6 NDS on validation and 61.6 mAP and 67.9 NDS on test.
Significance. If the reported gains are robustly attributable to the heterogeneous query interaction mechanisms, the work would advance query-based multi-modal fusion by showing how cross-type attention and constrained token exchange can better integrate complementary camera-radar cues than prior mixing or sampling approaches, with potential benefits for low-cost, robust perception systems.
major comments (3)
- [Experiments] Experimental results section: The SOTA numbers (59.1 mAP / 65.6 NDS on validation, 61.6 / 67.9 on test) are presented without ablation tables, error bars, or multi-run statistics, even though such evidence is load-bearing for the central claim that QMix and QSwap, rather than other design choices, drive the improvements over prior query/feature mixing baselines.
- [Method] Method description (QMix and QSwap): No evaluation or analysis is provided on whether the cross-type attention and token-exchange mechanisms introduce new failure modes (e.g., attention dilution under sparse radar returns or degraded camera features) on unseen scenes or under sensor degradation, undermining confidence that the paradigm reliably consolidates evidence without regressions.
- [Implementation] Implementation details: The number, initialization, and optimization of the learnable world queries, along with the specific attention weights and geometric thresholds in QMix/QSwap, are treated as free parameters without sensitivity analysis or ablation, leaving the reproducibility and generality of the reported benchmark gains unclear.
minor comments (1)
- [Method] Notation for query types and sampling steps could be clarified with a single diagram or consistent symbols across sections to aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We agree that strengthening the experimental validation, discussing potential limitations, and improving implementation transparency will enhance the paper. Below we address each major comment point by point, indicating the revisions we will make.
Point-by-point responses
- Referee: [Experiments] Experimental results section: The SOTA numbers (59.1 mAP / 65.6 NDS on validation, 61.6 / 67.9 on test) are presented without ablation tables, error bars, or multi-run statistics, even though such evidence is load-bearing for the central claim that QMix and QSwap, rather than other design choices, drive the improvements over prior query/feature mixing baselines.
Authors: We acknowledge that the current version lacks dedicated ablation tables isolating QMix and QSwap and does not report error bars or multi-run statistics. In the revised manuscript we will add a comprehensive ablation study comparing the full ConFusion model against variants without QMix, without QSwap, and against prior query/feature mixing baselines. We will also rerun the main experiments with multiple random seeds and include error bars to demonstrate that the reported gains (59.1 mAP / 65.6 NDS) are stable and attributable to the proposed heterogeneous query interaction mechanisms. revision: yes
- Referee: [Method] Method description (QMix and QSwap): No evaluation or analysis is provided on whether the cross-type attention and token-exchange mechanisms introduce new failure modes (e.g., attention dilution under sparse radar returns or degraded camera features) on unseen scenes or under sensor degradation, undermining confidence that the paradigm reliably consolidates evidence without regressions.
Authors: We agree this analysis is important for establishing reliability. While the original submission focused on overall gains, the revised version will include a dedicated limitations subsection. It will discuss potential failure modes such as attention dilution with very sparse radar returns and provide qualitative examples from nuScenes validation scenes where camera features are degraded. We will also note that a full quantitative study on synthetic sensor degradations lies outside the current scope but is a valuable direction for future work. revision: partial
- Referee: [Implementation] Implementation details: The number, initialization, and optimization of the learnable world queries, along with the specific attention weights and geometric thresholds in QMix/QSwap, are treated as free parameters without sensitivity analysis or ablation, leaving the reproducibility and generality of the reported benchmark gains unclear.
Authors: We will expand the implementation details section to address these points. The revision will report a sensitivity analysis for the number of learnable world queries (e.g., 100–500) and compare random versus learned initialization. We will also provide ablation results on the attention weights in QMix and the geometric thresholds in QSwap, showing that performance remains stable within reasonable ranges; a minimal sketch of the sweep protocol appears after these responses. These additions will improve reproducibility and clarify the generality of the chosen hyperparameters. revision: yes
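A minimal sketch of what the promised sweep protocol could look like, with a stand-in scoring function in place of a full ConFusion training run; the printed numbers are placeholders, not results.

```python
# Hypothetical sensitivity-sweep protocol: world-query count x random seeds,
# reported as mean +/- std. `train_and_eval` is a stand-in for real training.
import random
import statistics

def train_and_eval(num_world_queries: int, seed: int) -> float:
    """Stand-in for training ConFusion and returning validation mAP."""
    random.seed(seed)
    # Placeholder score: a noisy plateau, just so the script runs end to end.
    return 55.0 + 4.0 * min(num_world_queries, 300) / 300 + random.gauss(0, 0.2)

SEEDS = [0, 1, 2]
for n_queries in (100, 200, 300, 400, 500):
    runs = [train_and_eval(n_queries, s) for s in SEEDS]
    mean, std = statistics.mean(runs), statistics.stdev(runs)
    print(f"world queries={n_queries:3d}  mAP={mean:.2f} +/- {std:.2f}")
```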
Circularity Check
No circularity detected in method derivation or benchmark claims
full rationale
The paper describes an architectural paradigm (heterogeneous queries + QMix cross-type attention after sampling + QSwap token exchange under constraints) whose components are defined independently of the reported nuScenes mAP/NDS numbers. No equations are shown that would make the performance metrics equivalent to fitted parameters or self-referential definitions inside the paper. The SOTA results are presented as empirical outcomes on an external dataset, not as quantities derived by construction from the model's own inputs or prior self-citations. The derivation chain remains self-contained and externally testable.
Axiom & Free-Parameter Ledger
free parameters (2)
- number and initialization of learnable world queries
- attention weights and geometric thresholds inside QMix and QSwap
axioms (1)
- domain assumption: Queries of different modalities can be meaningfully compared after feature sampling