BEVCALIB: LiDAR-Camera Calibration via Geometry-Guided Bird's-Eye View Representations
Pith reviewed 2026-05-19 11:06 UTC · model grok-4.3
The pith
BEVCALIB recovers LiDAR-camera extrinsic parameters by fusing bird's-eye view features extracted separately from each sensor.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BEVCALIB demonstrates that geometry-guided BEV features, extracted separately from camera and LiDAR, can be fused in a shared space and selectively decoded to regress accurate extrinsic parameters directly from noisy raw inputs. Relative to the strongest prior baseline, the reported average improvements are 47.08 percent in translation and 82.32 percent in rotation on KITTI, and 78.17 percent in translation and 68.29 percent in rotation on NuScenes.
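The page does not say how these percentages are computed. A minimal sketch, assuming they are relative error reductions against the baseline (an assumption, not a definition taken from the paper); the function name and the sample numbers are hypothetical:

```python
def relative_improvement(baseline_error: float, method_error: float) -> float:
    """Percentage by which method_error undercuts baseline_error.

    Assumed definition: 100 * (baseline - method) / baseline.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - method_error) / baseline_error


# Hypothetical errors, not values from the paper.
print(relative_improvement(10.0, 5.0))  # 50.0
```

Under this reading, an 82.32 percent improvement means the method's error is roughly 18 percent of the baseline's, so the absolute gain depends heavily on how large the baseline error was.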
What carries the argument
A shared BEV feature space, built by separate camera and LiDAR BEV extractors, followed by a novel feature selector that filters the most important geometric cues in the transformation decoder.
If this is right
- Calibration can be performed from ordinary driving sequences instead of dedicated controlled data collections.
- Vehicles can maintain accurate sensor alignment during operation despite vibrations or temperature drift.
- Downstream multi-modal perception tasks receive more reliable fused inputs under real-world noise.
- Among open-source methods, the reported error is an order of magnitude lower than that of the best reproducible baseline.
Where Pith is reading between the lines
- The same separate-BEV-then-fuse pattern could be tested for calibrating other sensor pairs such as radar and camera.
- An online version running at frame rate might support continuous self-calibration on moving platforms.
- The feature selector could be reused in other BEV fusion networks to lower memory use without losing geometric accuracy.
- Evaluating the method on datasets captured with non-standard vehicle rigs would test whether the BEV assumption holds beyond KITTI and NuScenes.
Load-bearing premise
The approach assumes that separately extracted camera and LiDAR BEV features contain sufficient undistorted geometric information to recover accurate extrinsic parameters even when the input data contains noise and without additional explicit geometric constraints or hand-crafted correspondences.
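One way camera features can be placed in 3D without knowing the LiDAR-camera extrinsics is to back-project each pixel along its ray using the camera intrinsics and a per-pixel depth distribution, in the spirit of lift-splat style encoders. The sketch below is illustrative only and is not the paper's implementation; the function names, the pinhole parameters `fx`, `fy`, `cx`, `cy`, and the depth bins are all assumptions.

```python
def unproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) at a given depth into the camera frame.

    Uses only the pinhole intrinsics; no extrinsic transform is applied.
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def lift_pixel(u, v, depth_bins, depth_probs, fx, fy, cx, cy):
    """Spread one pixel along its ray: one weighted 3D point per depth bin."""
    if len(depth_bins) != len(depth_probs):
        raise ValueError("depth_bins and depth_probs must have equal length")
    return [
        (unproject_pixel(u, v, d, fx, fy, cx, cy), p)
        for d, p in zip(depth_bins, depth_probs)
    ]


# Hypothetical intrinsics and a two-bin depth distribution.
points = lift_pixel(820, 240, [5.0, 10.0], [0.3, 0.7], 500.0, 500.0, 320.0, 240.0)
```

The resulting points live in the camera frame, so any error in the depth distribution moves them along the ray. That is the depth-noise sensitivity the premise depends on.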
What would settle it
Running the method on a new dataset recorded with deliberately introduced large sensor misalignment or extreme sensor noise and observing that the resulting translation and rotation errors fail to beat the best baseline would falsify the central claim.
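Such a test needs explicit error metrics. The sketch below uses a common convention, Euclidean distance for translation and geodesic angle for rotation; the page does not state the paper's exact metric definitions, so both functions and the helper `rot_z` are assumptions for illustration.

```python
import math


def rot_z(deg):
    """3x3 rotation about the z-axis by deg degrees (helper for the example)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_pred^T R_gt) equals the sum of elementwise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error(t_pred, t_gt):
    """Euclidean distance between two translation vectors."""
    return math.dist(t_pred, t_gt)


# A 10-degree yaw offset and a 0.5 m translation offset (hypothetical).
print(rotation_error_deg(rot_z(10.0), rot_z(0.0)))         # close to 10 degrees
print(translation_error([0.3, 0.4, 0.0], [0.0, 0.0, 0.0]))  # close to 0.5
```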
Original abstract
Accurate LiDAR-camera calibration is fundamental to fusing multi-modal perception in autonomous driving and robotic systems. Traditional calibration methods require extensive data collection in controlled environments and cannot compensate for transformation changes during vehicle/robot movement. In this paper, we propose the first model that uses bird's-eye view (BEV) features to perform LiDAR-camera calibration from raw data, termed BEVCALIB. To achieve this, we extract camera BEV features and LiDAR BEV features separately and fuse them into a shared BEV feature space. To fully utilize the geometric information from the BEV feature, we introduce a novel feature selector to filter the most important features in the transformation decoder, which reduces memory consumption and enables efficient training. Extensive evaluations on KITTI, NuScenes, and our own dataset demonstrate that BEVCALIB establishes a new state of the art. Under various noise conditions, BEVCALIB outperforms the best baseline in the literature by an average of (47.08%, 82.32%) on the KITTI dataset, and (78.17%, 68.29%) on the NuScenes dataset, in terms of (translation, rotation), respectively. In the open-source domain, it improves the best reproducible baseline by one order of magnitude. Our code and demo results are available at https://cisl.ucr.edu/BEVCalib.
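The abstract does not say how the feature selector chooses which features to keep. As a rough illustration of why selection saves memory, the sketch below keeps only the k BEV cells with the largest feature norm. Top-k by L2 norm is a stand-in criterion chosen for this example, not the paper's selector, and the function name and nested-list layout are hypothetical.

```python
import math


def select_top_k_cells(bev, k):
    """Return (row, col, feature) for the k cells with the largest L2 norm.

    bev is a nested list: bev[row][col] is a feature vector (list of floats).
    """
    scored = [
        (math.sqrt(sum(v * v for v in feat)), r, c)
        for r, row in enumerate(bev)
        for c, feat in enumerate(row)
    ]
    # Highest-norm cells first; ties keep their row-major order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(r, c, bev[r][c]) for _, r, c in scored[:k]]


# A toy 2x2 fused BEV grid with 2-channel features (hypothetical values).
fused = [[[0.0, 0.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 2.0]]]
kept = select_top_k_cells(fused, 2)
```

A decoder that attends over the k kept cells instead of every grid cell handles a sequence of length k rather than height times width, which is where a memory saving of the kind the abstract describes would come from.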
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes BEVCALIB, the first method to perform LiDAR-camera extrinsic calibration directly from raw data by separately extracting camera and LiDAR BEV features, fusing them into a shared BEV space, and employing a novel feature selector module inside a transformation decoder to recover the calibration parameters. Extensive experiments on KITTI, NuScenes, and a custom dataset are reported to show new state-of-the-art accuracy, with average improvements over the best baseline of (47.08%, 82.32%) on KITTI and (78.17%, 68.29%) on NuScenes for (translation, rotation) under added noise; the method is also claimed to improve the best open-source baseline by an order of magnitude.
Significance. If the empirical gains prove robust and the BEV representations truly preserve undistorted geometry without circular dependence on the estimated extrinsics, the work would be significant for online calibration in autonomous driving and robotics, where traditional methods require controlled setups. The open release of code and the efficiency-oriented feature selector are clear strengths. The contribution is tempered, however, by the absence of architectural, loss, and training details that would allow independent verification of the SOTA claims.
major comments (2)
- [Method (BEV feature extraction and fusion)] The central claim rests on the assumption that camera BEV features can be extracted independently of the extrinsic parameters under estimation and still retain sufficient undistorted geometric information for the transformation decoder. Standard camera-to-BEV lifting (depth estimation or homography) is either extrinsic-dependent or introduces depth errors that propagate directly into the fused representation; the manuscript does not describe an extrinsic-independent mechanism or quantify depth-noise sensitivity, so the reported robustness under added noise rests on an untested premise.
- [Experiments and results] The SOTA and percentage-improvement claims (e.g., 47.08% translation / 82.32% rotation on KITTI) are presented without accompanying network architecture diagrams, loss-function definitions, training protocol, ablation studies on the feature selector, or statistical significance tests. These omissions make it impossible to determine whether the gains arise from the proposed BEV fusion or from unstated implementation choices, directly undermining the central empirical contribution.
minor comments (2)
- [Abstract] The abstract states that the method 'fully utilizes the geometric information from the BEV feature' yet provides no explicit geometric loss or correspondence term; a brief clarification of whether any such term is used would improve readability.
- [Experiments] Table or figure captions that report the exact noise levels and the precise definition of the 'best baseline' would help readers reproduce the percentage gains.
Simulated Author's Rebuttal
We thank the referee for the thorough review and valuable feedback on our work. We address the major comments point by point below, providing clarifications and indicating where revisions will be made to improve the manuscript.
- Referee: [Method (BEV feature extraction and fusion)] The central claim rests on the assumption that camera BEV features can be extracted independently of the extrinsic parameters under estimation and still retain sufficient undistorted geometric information for the transformation decoder. Standard camera-to-BEV lifting (depth estimation or homography) is either extrinsic-dependent or introduces depth errors that propagate directly into the fused representation; the manuscript does not describe an extrinsic-independent mechanism or quantify depth-noise sensitivity, so the reported robustness under added noise rests on an untested premise.
  Authors: We appreciate this observation. In BEVCALIB, a dedicated camera BEV encoder lifts 2D image features into BEV space using a learned depth distribution. The encoder is trained end-to-end but does not take the extrinsic calibration parameters as input: the projection uses the camera intrinsics only, and the extrinsics are estimated later, in the decoder. This design keeps the camera BEV features independent of the extrinsics being estimated. We acknowledge, however, that the manuscript does not describe this mechanism in enough detail and lacks a sensitivity analysis for depth noise. We will revise the method section to describe the extrinsic-independent BEV lifting more clearly and add experiments quantifying the impact of depth estimation errors. Revision: yes
- Referee: [Experiments and results] The SOTA and percentage-improvement claims (e.g., 47.08% translation / 82.32% rotation on KITTI) are presented without accompanying network architecture diagrams, loss-function definitions, training protocol, ablation studies on the feature selector, or statistical significance tests. These omissions make it impossible to determine whether the gains arise from the proposed BEV fusion or from unstated implementation choices, directly undermining the central empirical contribution.
  Authors: We agree that these details are crucial for reproducibility and for substantiating the claims. Some of this information currently appears in the supplementary material, but we recognize that it belongs in the main paper. In the revised version, we will add a network architecture diagram, explicit loss function formulations, detailed training protocols, ablation studies specifically on the feature selector module, and statistical significance tests (e.g., paired t-tests or confidence intervals) for the reported improvements. This will allow readers to better verify the source of the performance gains. Revision: yes
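The paired t-test mentioned in the response can be sketched as follows, assuming per-sample errors for the baseline and the proposed method are measured on the same test frames. The function name and the error values are hypothetical, and only the test statistic is computed; turning it into a p-value requires the t distribution with n - 1 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(errors_a, errors_b):
    """t-statistic for paired samples; positive when errors_a is larger on average."""
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if std_diff == 0:
        raise ValueError("differences have zero variance")
    return mean_diff / (std_diff / math.sqrt(len(diffs)))


# Hypothetical per-frame errors, not values from the paper.
baseline_errors = [3.0, 5.0, 4.0, 6.0]
method_errors = [1.0, 2.0, 2.0, 2.0]
t_stat = paired_t_statistic(baseline_errors, method_errors)
```

Pairing matters here because both methods see the same frames and the same injected noise, so per-frame differences remove scene-level variation that an unpaired comparison would count as noise.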
Circularity Check
No significant circularity; empirical architecture with external validation
The paper describes an empirical neural architecture that extracts separate camera and LiDAR BEV features, fuses them, and decodes extrinsics via a feature selector. No equations, derivations, or self-citation chains are shown that reduce the reported calibration accuracy or transformation parameters to quantities defined by the method's own fitted inputs or prior self-references. Validation occurs on external benchmarks (KITTI, NuScenes) under added noise, satisfying the criterion for self-contained, falsifiable results independent of internal definitions.
Axiom & Free-Parameter Ledger
free parameters (1)
- network architecture hyperparameters
axioms (1)
- domain assumption: BEV features from camera and LiDAR contain complementary geometric information sufficient for extrinsic estimation
invented entities (1)
- feature selector module: no independent evidence