arxiv: 2411.13311 · v1 · submitted 2024-11-20 · 💻 cs.CV · cs.AI

A Resource Efficient Fusion Network for Object Detection in Bird's-Eye View using Camera and Raw Radar Data

Pith reviewed 2026-05-23 08:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords object detection · sensor fusion · bird's-eye view · camera · radar · range-Doppler spectrum · RADIal dataset · autonomous driving

The pith

Fusing camera bird's-eye-view features with range-azimuth features recovered from raw radar range-Doppler spectrum achieves competitive object detection accuracy at lower computational cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes processing camera images through an encoder-decoder to extract features in the bird's-eye-view polar domain while feeding the raw radar range-Doppler spectrum into a separate decoder that recovers range-azimuth features. These two feature maps are fused to drive object detection without performing conventional radar signal processing steps such as point cloud generation. The method is evaluated on the RADIal dataset against prior fusion approaches both for detection accuracy and for metrics of computational complexity. A sympathetic reader would care because cameras supply semantic detail while radar operates in poor weather, yet most existing fusions incur heavy radar preprocessing costs that this direct-spectrum route aims to sidestep.

Core claim

The central claim is that object detection in bird's-eye view can be performed by transforming camera images into the BEV polar domain and extracting features with a dedicated encoder-decoder architecture, recovering range-azimuth features from the raw range-Doppler radar spectrum via a radar decoder, and fusing the two resulting maps to reach detection performance competitive with existing methods while lowering computational complexity on the RADIal dataset.

What carries the argument

The camera BEV-polar encoder-decoder paired with the radar decoder that reconstructs range-azimuth features directly from the raw range-Doppler input; their outputs are fused for detection.
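
The fusion step described above can be pictured with a small sketch. The source does not say which fusion operator the paper uses, so the channel concatenation followed by a 1x1 mixing layer below is an assumption chosen for illustration. The function name `fuse_features`, the tensor shapes, and the weights are all hypothetical.

```python
import numpy as np

def fuse_features(camera_feat, radar_feat, weights, bias):
    """Fuse two feature maps that share one range-azimuth grid.

    camera_feat: (Cc, R, A) camera features in the BEV polar domain.
    radar_feat:  (Cr, R, A) range-azimuth features from the radar decoder.
    weights:     (Cout, Cc + Cr) mixing matrix, i.e. a 1x1 convolution.
    bias:        (Cout,) per-output-channel offset.
    NOTE: concatenation + 1x1 mixing is an assumed operator, not the paper's.
    """
    if camera_feat.shape[1:] != radar_feat.shape[1:]:
        raise ValueError("feature maps must share the range-azimuth grid")
    # Stack both modalities along the channel axis.
    stacked = np.concatenate([camera_feat, radar_feat], axis=0)
    # A 1x1 convolution is a linear mix of channels at every grid cell.
    return np.einsum("oc,cra->ora", weights, stacked) + bias[:, None, None]

# Hypothetical shapes: 4 camera channels, 3 radar channels, 8 x 6 grid.
rng = np.random.default_rng(0)
camera = rng.normal(size=(4, 8, 6))
radar = rng.normal(size=(3, 8, 6))
fused = fuse_features(camera, radar, rng.normal(size=(2, 7)), np.zeros(2))
print(fused.shape)
```

The only hard requirement the sketch enforces is that both inputs live on the same range-azimuth grid, which is what the BEV polar transform on the camera side is meant to provide.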

If this is right

  • Object detection proceeds without conventional radar point-cloud extraction or signal processing.
  • Detection accuracy remains competitive with prior camera-radar fusion methods on the RADIal dataset.
  • Overall computational complexity is reduced relative to methods that ingest processed radar data.
  • The raw-spectrum route supplies sufficient information for the fusion step to succeed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The efficiency gain could support higher frame-rate operation on embedded vehicle hardware.
  • The same raw-spectrum decoder might be tested on other radar datasets to check whether the accuracy-complexity trade-off generalizes.
  • Because radar remains functional when cameras are degraded by weather, the fusion could be examined for robustness in rain or fog even though the paper reports only nominal conditions.

Load-bearing premise

The raw range-Doppler spectrum contains enough semantic and structural information that a dedicated decoder can recover usable range-azimuth features for fusion with camera features.

What would settle it

Running the proposed network on the RADIal dataset and measuring detection accuracy below existing fusion baselines or computational metrics above those baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2411.13311 by Gijs Dubbelman, Kavin Chandrasekaran, Pavol Jancura, Sorin Grigorescu.

Figure 1: Architecture Overview: The image processing pipeline first transforms the camera image into Bird’s-Eye View (BEV). Subsequently, the resultant BEV undergoes conversion into polar representation, directly mapping to the Range-Azimuth (RA) image. Object detection is performed on RA image features fused with radar features from the radar decoder. The predictions obtained in the RA view are shown in the camera…
Figure 2: Image Processing Pipeline: The objects in the frame (four cars) marked in different colors are reflected in the BEV Cartesian and Polar pixel images. The origin is at the bottom center. The azimuth (θ), range (r) ground truth polar coordinates are marked for reference. r denotes the distance from the objects to the ego vehicle (in meters); θ represents the angle at which the objects are located in degrees.…
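
The Cartesian-to-polar step in the Figure 2 caption can be sketched as a resampling of the BEV image onto a range-azimuth grid. This is an illustration under stated assumptions, not the paper's pipeline. The function name `bev_cartesian_to_polar`, the grid spacing, nearest-neighbour lookup, and the sign convention (azimuth measured from the forward axis, positive to the right) are all hypothetical. The one detail taken from the caption is the origin at the bottom center of the image.

```python
import numpy as np

def bev_cartesian_to_polar(bev, metres_per_pixel, ranges_m, azimuths_deg):
    """Resample a Cartesian BEV image onto a (range, azimuth) grid.

    bev: (H, W) or (H, W, C) image whose origin (ego vehicle) is the
         bottom-centre pixel, as stated in the Figure 2 caption.
    Returns an array of shape (len(ranges_m), len(azimuths_deg), ...).
    ASSUMPTIONS: nearest-neighbour sampling; azimuth is 0 straight ahead
    and positive to the right.
    """
    height, width = bev.shape[:2]
    centre_col = (width - 1) / 2.0
    r, theta = np.meshgrid(ranges_m, np.deg2rad(azimuths_deg), indexing="ij")
    x = r * np.sin(theta)  # lateral offset in metres
    y = r * np.cos(theta)  # forward distance in metres
    cols = np.rint(centre_col + x / metres_per_pixel).astype(int)
    rows = np.rint((height - 1) - y / metres_per_pixel).astype(int)
    # Cells that fall outside the Cartesian image stay zero.
    inside = (rows >= 0) & (rows < height) & (cols >= 0) & (cols < width)
    polar = np.zeros(r.shape + bev.shape[2:], dtype=bev.dtype)
    polar[inside] = bev[rows[inside], cols[inside]]
    return polar

# One marker 30 m straight ahead on a 1 m/pixel, 101 x 101 BEV image.
bev = np.zeros((101, 101))
bev[70, 50] = 1.0
polar = bev_cartesian_to_polar(bev, 1.0, np.arange(0.0, 51.0), np.arange(-45.0, 46.0))
print(np.argwhere(polar == 1.0))
```

Nearest-neighbour lookup keeps the sketch short. An interpolating resampler would give smoother results near the origin, where many polar cells map to the same few Cartesian pixels.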
Figure 3: The camera only and radar only encoder contains four ResNet-50-like blocks with a pre-encoder block. The features from each of those blocks are named x0, x1, x2, x3, and x4. The thick blue curved arrow takes the encoder’s output to the decoder’s input in order to expand the input feature maps to higher resolutions. The dotted lines represent the skip connections used to preserve spatial information. The fe…
Figure 4: Qualitative detection results from the proposed fusion model. The predictions obtained in the RA view (represented…
Figure 5: The prediction in blue and the ground truth in green are shown in (a) front-view camera and (b) BEV Polar image. Zoom in to better visualize.

VIII. CONCLUSION AND FUTURE WORK: In this work, upon proposing a fusion strategy in BEV space, we analysed how the performance affects the computational metrics in various aspects. Our approach demonstrates proficient performance while upholding a comparatively low…
Abstract

Cameras can be used to perceive the environment around the vehicle, while affordable radar sensors are popular in autonomous driving systems as they can withstand adverse weather conditions unlike cameras. However, radar point clouds are sparser with low azimuth and elevation resolution that lack semantic and structural information of the scenes, resulting in generally lower radar detection performance. In this work, we directly use the raw range-Doppler (RD) spectrum of radar data, thus avoiding radar signal processing. We independently process camera images within the proposed comprehensive image processing pipeline. Specifically, first, we transform the camera images to Bird's-Eye View (BEV) Polar domain and extract the corresponding features with our camera encoder-decoder architecture. The resultant feature maps are fused with Range-Azimuth (RA) features, recovered from the RD spectrum input from the radar decoder to perform object detection. We evaluate our fusion strategy with other existing methods not only in terms of accuracy but also on computational complexity metrics on RADIal dataset.
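
For readers unfamiliar with the radar input named in the abstract: a range-Doppler spectrum is conventionally obtained from FMCW radar samples by one FFT along the samples within each chirp (range) and one FFT across chirps (Doppler). The sketch below illustrates that generic textbook step on a synthetic single-target signal. It is not the RADIal preprocessing, and the function names, array shapes, and target bins are assumptions made for the example.

```python
import numpy as np

def range_doppler_spectrum(adc):
    """Magnitude range-Doppler map from complex samples.

    adc: complex array of shape (num_chirps, num_samples).
    Returns an array of shape (num_samples, num_chirps): rows are range
    bins, columns are Doppler bins with zero Doppler in the centre column.
    """
    range_profile = np.fft.fft(adc, axis=1)        # fast time -> range
    doppler = np.fft.fft(range_profile, axis=0)    # slow time -> Doppler
    doppler = np.fft.fftshift(doppler, axes=0)     # centre zero Doppler
    return np.abs(doppler).T

def synthetic_target(num_chirps, num_samples, range_bin, doppler_bin):
    """Hypothetical noise-free signal for one target at exact FFT bins."""
    chirp = np.arange(num_chirps)[:, None]
    sample = np.arange(num_samples)[None, :]
    phase = 2j * np.pi * (range_bin * sample / num_samples
                          + doppler_bin * chirp / num_chirps)
    return np.exp(phase)

adc = synthetic_target(num_chirps=64, num_samples=128, range_bin=20, doppler_bin=5)
rd = range_doppler_spectrum(adc)
peak = np.unravel_index(np.argmax(rd), rd.shape)
print(rd.shape, peak)
```

The peak lands at range bin 20 and at Doppler column 37, that is, 5 bins to the right of the zero-Doppler column at index 32. The map has no azimuth axis, which is why the paper needs a decoder to recover range-azimuth features before fusing with the camera branch.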

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes a fusion network for object detection in bird's-eye view that processes camera images through a BEV-polar transformation and encoder-decoder pipeline, recovers range-azimuth features from the raw radar range-Doppler spectrum via a dedicated decoder, and fuses the resulting feature maps to perform detection. It claims this strategy achieves competitive accuracy while reducing computational complexity relative to existing methods, with evaluation on the RADIal dataset.

Significance. If the empirical claims hold, the work could contribute to resource-efficient multi-modal perception for autonomous driving by avoiding conventional radar signal processing and directly ingesting raw spectra. The dual focus on accuracy and complexity metrics addresses a relevant practical constraint. However, the provided manuscript contains only an abstract with no quantitative results, architecture details, ablations, or comparisons, so no assessment of actual significance is possible.

major comments (1)
  1. [Abstract] Abstract: The central claim that the proposed camera-radar fusion 'performs object detection with competitive accuracy and reduced computational complexity' cannot be evaluated because the manuscript supplies no accuracy metrics, complexity numbers (e.g., FLOPs, latency), baseline comparisons, ablation studies, or error analysis. This absence directly prevents verification of the result and of the assumption that raw RD-spectrum-derived RA features supply sufficient semantic information when fused with BEV-polar camera features.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review. We address the major comment below and note that the current submission consists solely of the abstract, as indicated in the provided materials.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the proposed camera-radar fusion 'performs object detection with competitive accuracy and reduced computational complexity' cannot be evaluated because the manuscript supplies no accuracy metrics, complexity numbers (e.g., FLOPs, latency), baseline comparisons, ablation studies, or error analysis. This absence directly prevents verification of the result and of the assumption that raw RD-spectrum-derived RA features supply sufficient semantic information when fused with BEV-polar camera features.

    Authors: We agree that the abstract as provided does not contain the quantitative results, metrics, or analyses needed to substantiate the claims. The full manuscript will be expanded to include accuracy metrics (e.g., mAP, precision/recall), computational complexity measures (FLOPs, parameters, inference latency), direct comparisons against existing camera-radar fusion baselines, ablation studies on the fusion components, and error analysis, all evaluated on the RADIal dataset. These additions will enable verification of the performance claims and the sufficiency of the raw RD-derived RA features.

    revision: yes
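
The rebuttal names precision and recall among the accuracy metrics to be reported. As a reminder of how those are defined once detections have been matched to ground truth, a minimal sketch follows. The function name and the example counts are illustrative only and are not results from the paper. How a detection is matched to a ground-truth object (for instance, by a distance threshold in range and azimuth) is left out and would have to follow the evaluation protocol of the dataset.

```python
def detection_scores(true_positives, false_positives, false_negatives):
    """Precision, recall and F1 from matched-detection counts.

    Empty denominators yield 0.0 rather than raising, which is a
    convention chosen for this sketch.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Illustrative counts: 8 correct detections, 2 spurious, 2 missed.
print(detection_scores(8, 2, 2))
```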

Circularity Check

0 steps flagged

No circularity in abstract; derivation chain absent

Full rationale

Only the abstract is available and it contains no equations, fitted parameters, predictions, or self-citations. The text describes an architecture (raw RD spectrum to RA features via decoder, camera to BEV-polar features, fusion for detection) without any claimed derivation that reduces outputs to inputs by construction. This is the most common honest finding when no load-bearing mathematical steps are present.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no explicit free parameters, axioms, or invented entities; the central claim rests on unstated assumptions about feature sufficiency and dataset representativeness that cannot be enumerated from the given text.

pith-pipeline@v0.9.0 · 5692 in / 1054 out tokens · 30978 ms · 2026-05-23T08:11:55.605890+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REFNet++: Multi-Task Efficient Fusion of Camera and Radar Sensor Data in Bird's-Eye Polar View

    cs.CV 2026-05 unverdicted novelty 4.0

    REFNet++ aligns raw camera images and radar range-Doppler data into a shared bird's-eye polar view using variational encoders for multi-task vehicle detection and free space segmentation on the RADIal dataset.
