Control Your Queries: Heterogeneous Query Interaction for Camera-Radar Fusion
Pith reviewed 2026-05-07 16:55 UTC · model grok-4.3
The pith
ConFusion fuses camera and radar by letting image, radar, and world queries interact directly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ConFusion introduces heterogeneous query interaction as a fusion paradigm for camera-radar 3D object detection. It combines image queries, radar queries, and learnable world queries distributed in 3D space to improve query initialization and object coverage. Heterogeneous query mixing performs dedicated cross-type attention after feature sampling to consolidate complementary object evidence, while interactive query swap sampling allows related queries to exchange informative feature tokens under attention and geometric constraints. On the nuScenes dataset this produces 59.1 mAP and 65.6 NDS on validation and 61.6 mAP and 67.9 NDS on the test set.
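To make the mechanism concrete, the following is a minimal sketch of what cross-type attention among the three query types could look like; the class name QMixBlock, the shapes, and the attend-only-to-the-other-types policy are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of QMix-style cross-type attention (not the paper's code).
import torch
import torch.nn as nn

class QMixBlock(nn.Module):
    """Mixes image, radar, and world queries with dedicated cross-type attention.

    Assumed design: after per-query feature sampling, each query type attends
    to the concatenation of the other two types, so complementary object
    evidence can flow across modalities.
    """
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q_img, q_rad, q_world):
        # Each tensor: (batch, num_queries_of_that_type, dim).
        groups = {"img": q_img, "rad": q_rad, "world": q_world}
        out = []
        for name, q in groups.items():
            # Cross-type only: keys/values come from the OTHER query types.
            others = torch.cat([v for k, v in groups.items() if k != name], dim=1)
            mixed, _ = self.attn(q, others, others)
            out.append(self.norm(q + mixed))  # residual + norm
        return tuple(out)

# Toy usage: 2 scenes; 200 image, 100 radar, 900 world queries; dim 256.
qi, qr, qw = (torch.randn(2, n, 256) for n in (200, 100, 900))
qi2, qr2, qw2 = QMixBlock()(qi, qr, qw)
print(qi2.shape, qr2.shape, qw2.shape)
```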
What carries the argument
Heterogeneous query interaction, which mixes image queries, radar queries, and 3D world queries through QMix cross-type attention and QSwap token exchange to consolidate complementary sensor evidence.
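A similarly hedged sketch of the QSwap idea follows, assuming one simple policy: each query exchanges its highest-scoring sampled token with its nearest counterpart of the other type, gated by a 3D distance threshold (the geometric constraint). The function qswap, the score inputs, and the replace-the-weakest-slot rule are hypothetical stand-ins for whatever the paper actually uses.

```python
# Hypothetical sketch of QSwap-style token exchange (assumed semantics).
import torch

def qswap(tokens_a, tokens_b, centers_a, centers_b, scores_a, scores_b,
          radius: float = 2.0):
    """Swap each A-query's best token into its nearest B-query (and back),
    but only when the two queries' 3D reference points lie within `radius`.

    tokens_*:  (N, K, C) sampled feature tokens per query
    centers_*: (N, 3)    query reference points in 3D space
    scores_*:  (N, K)    per-token attention scores
    """
    dist = torch.cdist(centers_a, centers_b)  # (Na, Nb) pairwise distances
    nearest = dist.argmin(dim=1)              # partner index for each A-query
    valid = dist.gather(1, nearest[:, None]).squeeze(1) < radius

    best_a, best_b = scores_a.argmax(dim=1), scores_b.argmax(dim=1)
    swapped_a, swapped_b = tokens_a.clone(), tokens_b.clone()
    for i in torch.nonzero(valid).flatten().tolist():
        j = nearest[i].item()
        # One simple policy: the partner's best token replaces the weakest slot.
        swapped_a[i, scores_a[i].argmin()] = tokens_b[j, best_b[j]]
        swapped_b[j, scores_b[j].argmin()] = tokens_a[i, best_a[i]]
    return swapped_a, swapped_b

# Toy usage: 5 image queries and 4 radar queries, 8 tokens each, 256-d.
ta, tb = torch.randn(5, 8, 256), torch.randn(4, 8, 256)
ca, cb = torch.rand(5, 3) * 10, torch.rand(4, 3) * 10
sa, sb = torch.rand(5, 8), torch.rand(4, 8)
out_a, out_b = qswap(ta, tb, ca, cb, sa, sb)
print(out_a.shape, out_b.shape)
```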
Where Pith is reading between the lines
- The same query-interaction structure could be adapted to camera-lidar fusion by adding lidar-specific query types.
- If the attention mechanisms prove robust, the method might reduce the impact of temporary sensor dropouts in real driving.
- Query swapping under geometric constraints may transfer to other query-based detectors that currently sample features independently.
Load-bearing premise
Cross-type attention after sampling and query swapping will reliably merge useful camera and radar evidence without introducing new conflicts or missed detections on unseen scenes or under degraded sensors.
What would settle it
Running ConFusion on nuScenes validation scenes with synthetic radar noise or camera occlusions added, and measuring whether its mAP falls below that of the best prior method.
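A minimal sketch of the degradation side of such a test, assuming two simple perturbations (dropped and jittered radar returns, a zeroed image rectangle). The detector and the mAP evaluation are stand-ins here, since ConFusion and nuScenes are not bundled with this review.

```python
# Hypothetical degradation harness for the proposed robustness test.
import torch

def degrade_radar(points: torch.Tensor, drop_p: float = 0.3,
                  sigma: float = 0.5) -> torch.Tensor:
    """Randomly drop radar returns and jitter the survivors' xyz positions."""
    keep = torch.rand(points.shape[0]) > drop_p
    noisy = points[keep].clone()
    noisy[:, :3] += sigma * torch.randn_like(noisy[:, :3])
    return noisy

def occlude_camera(image: torch.Tensor, frac: float = 0.25) -> torch.Tensor:
    """Zero out a random rectangle covering roughly `frac` of the image."""
    _, h, w = image.shape
    bh, bw = int(h * frac ** 0.5), int(w * frac ** 0.5)
    y = torch.randint(0, h - bh + 1, (1,)).item()
    x = torch.randint(0, w - bw + 1, (1,)).item()
    out = image.clone()
    out[:, y:y + bh, x:x + bw] = 0.0
    return out

# Toy usage with stand-in data: 100 radar points (x, y, z, rcs, v), one frame.
radar = torch.randn(100, 5)
frame = torch.rand(3, 224, 224)
print(degrade_radar(radar).shape)    # fewer, noisier returns
print(occlude_camera(frame).mean())  # darker image after occlusion
# The actual test would feed clean vs. degraded inputs to ConFusion and the
# strongest prior baseline, then compare the mAP gap on nuScenes validation.
```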
Original abstract
In autonomous driving, camera-radar fusion offers complementary sensing and low deployment cost. Existing methods perform fusion through input mixing, feature map mixing, or query-based feature sampling. We propose a new fusion paradigm, termed heterogeneous query interaction, and present ConFusion, a camera-radar 3D object detector. ConFusion combines image queries, radar queries, and learnable world queries distributed in 3D space to improve query initialization and object coverage. To encourage cross-type interaction among heterogeneous queries, we introduce heterogeneous query mixing (QMix), which performs dedicated cross-type attention after feature sampling to consolidate complementary object evidence. We further propose interactive query swap sampling (QSwap), which improves feature sampling by allowing related queries to exchange informative feature tokens under attention and geometric constraints. Experiments on the nuScenes dataset show that ConFusion achieves state-of-the-art performance, reaching 59.1 mAP and 65.6 NDS on the validation set, and 61.6 mAP and 67.9 NDS on the test set.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ConFusion, a camera-radar 3D object detector for autonomous driving that uses heterogeneous queries (image queries, radar queries, and learnable world queries) distributed in 3D space. It proposes heterogeneous query mixing (QMix) to perform dedicated cross-type attention after feature sampling for consolidating complementary evidence, and interactive query swap sampling (QSwap) to allow related queries to exchange tokens under attention and geometric constraints. Experiments on nuScenes report state-of-the-art results of 59.1 mAP and 65.6 NDS on validation and 61.6 mAP and 67.9 NDS on test.
Significance. If the reported gains are robustly attributable to the heterogeneous query interaction mechanisms, the work would advance query-based multi-modal fusion by showing how cross-type attention and constrained token exchange can better integrate complementary camera-radar cues than prior mixing or sampling approaches, with potential benefits for low-cost, robust perception systems.
major comments (3)
- [Experiments] Experimental results section: The SOTA numbers (59.1 mAP / 65.6 NDS on validation, 61.6 / 67.9 on test) are presented without ablation tables, error bars, or multi-run statistics, even though such evidence is load-bearing for the central claim that QMix and QSwap, rather than other design choices, drive the improvements over prior query/feature mixing baselines.
- [Method] Method description (QMix and QSwap): No evaluation or analysis is provided on whether the cross-type attention and token-exchange mechanisms introduce new failure modes (e.g., attention dilution under sparse radar returns or degraded camera features) on unseen scenes or under sensor degradation, undermining confidence that the paradigm reliably consolidates evidence without regressions.
- [Implementation] Implementation details: The number, initialization, and optimization of the learnable world queries, along with the specific attention weights and geometric thresholds in QMix/QSwap, are treated as free parameters without sensitivity analysis or ablation, leaving the reproducibility and generality of the reported benchmark gains unclear.
minor comments (1)
- [Method] Notation for query types and sampling steps could be clarified with a single diagram or consistent symbols across sections to aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We agree that strengthening the experimental validation, discussing potential limitations, and improving implementation transparency will enhance the paper. Below we address each major comment point by point, indicating the revisions we will make.
Point-by-point responses
- Referee: [Experiments] Experimental results section: The SOTA numbers (59.1 mAP / 65.6 NDS on validation, 61.6 / 67.9 on test) are presented without ablation tables, error bars, or multi-run statistics, even though such evidence is load-bearing for the central claim that QMix and QSwap, rather than other design choices, drive the improvements over prior query/feature mixing baselines.
Authors: We acknowledge that the current version lacks dedicated ablation tables isolating QMix and QSwap and does not report error bars or multi-run statistics. In the revised manuscript we will add a comprehensive ablation study comparing the full ConFusion model against variants without QMix, without QSwap, and against prior query/feature mixing baselines. We will also rerun the main experiments with multiple random seeds and include error bars to demonstrate that the reported gains (59.1 mAP / 65.6 NDS) are stable and attributable to the proposed heterogeneous query interaction mechanisms. revision: yes
- Referee: [Method] Method description (QMix and QSwap): No evaluation or analysis is provided on whether the cross-type attention and token-exchange mechanisms introduce new failure modes (e.g., attention dilution under sparse radar returns or degraded camera features) on unseen scenes or under sensor degradation, undermining confidence that the paradigm reliably consolidates evidence without regressions.
Authors: We agree this analysis is important for establishing reliability. While the original submission focused on overall gains, the revised version will include a dedicated limitations subsection. It will discuss potential failure modes such as attention dilution with very sparse radar returns and provide qualitative examples from nuScenes validation scenes where camera features are degraded. We will also note that a full quantitative study on synthetic sensor degradations lies outside the current scope but is a valuable direction for future work. revision: partial
- Referee: [Implementation] Implementation details: The number, initialization, and optimization of the learnable world queries, along with the specific attention weights and geometric thresholds in QMix/QSwap, are treated as free parameters without sensitivity analysis or ablation, leaving the reproducibility and generality of the reported benchmark gains unclear.
Authors: We will expand the implementation details section to address these points. The revision will report a sensitivity analysis for the number of learnable world queries (e.g., 100–500) and compare random versus learned initialization. We will also provide ablation results on the attention weights in QMix and the geometric thresholds in QSwap, showing that performance remains stable within reasonable ranges; a minimal sketch of the sweep protocol appears after these responses. These additions will improve reproducibility and clarify the generality of the chosen hyperparameters. revision: yes
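A minimal sketch of what the promised sweep protocol could look like, with a stand-in scoring function in place of a full ConFusion training run; the printed numbers are placeholders, not results.

```python
# Hypothetical sensitivity-sweep protocol: world-query count x random seeds,
# reported as mean +/- std. `train_and_eval` is a stand-in for real training.
import random
import statistics

def train_and_eval(num_world_queries: int, seed: int) -> float:
    """Stand-in for training ConFusion and returning validation mAP."""
    random.seed(seed)
    # Placeholder score: a noisy plateau, just so the script runs end to end.
    return 55.0 + 4.0 * min(num_world_queries, 300) / 300 + random.gauss(0, 0.2)

SEEDS = [0, 1, 2]
for n_queries in (100, 200, 300, 400, 500):
    runs = [train_and_eval(n_queries, s) for s in SEEDS]
    mean, std = statistics.mean(runs), statistics.stdev(runs)
    print(f"world queries={n_queries:3d}  mAP={mean:.2f} +/- {std:.2f}")
```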
Circularity Check
No circularity detected in method derivation or benchmark claims
full rationale
The paper describes an architectural paradigm (heterogeneous queries + QMix cross-type attention after sampling + QSwap token exchange under constraints) whose components are defined independently of the reported nuScenes mAP/NDS numbers. No equations are shown that would make the performance metrics equivalent to fitted parameters or self-referential definitions inside the paper. The SOTA results are presented as empirical outcomes on an external dataset, not as quantities derived by construction from the model's own inputs or prior self-citations. The derivation chain remains self-contained and externally testable.
Axiom & Free-Parameter Ledger
free parameters (2)
- number and initialization of learnable world queries
- attention weights and geometric thresholds inside QMix and QSwap
axioms (1)
- domain assumption: Queries of different modalities can be meaningfully compared after feature sampling