arxiv: 2506.02587 · v2 · submitted 2025-06-03 · 💻 cs.CV · cs.RO

BEVCALIB: LiDAR-Camera Calibration via Geometry-Guided Bird's-Eye View Representations

Pith reviewed 2026-05-19 11:06 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords LiDAR-camera calibration, bird's-eye view, extrinsic parameters, multi-modal fusion, feature selector, autonomous driving, geometric registration

The pith

BEVCALIB recovers LiDAR-camera extrinsic parameters by fusing bird's-eye view features extracted separately from each sensor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BEVCALIB as the first model to perform LiDAR-camera calibration using bird's-eye view features from raw data. Camera and LiDAR data are each projected into BEV space, their features are extracted independently, and then fused into one shared representation. A feature selector then picks the most informative elements to feed into a decoder that outputs the rigid transformation between the two sensors. This setup removes the need for controlled calibration environments or manual correspondences while handling noise and vehicle motion. Readers care because it makes accurate multi-modal fusion practical for autonomous vehicles that cannot return to a lab for recalibration.
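To make concrete what the regressed rigid transformation is used for, the following minimal sketch maps a LiDAR point into the camera frame and projects it onto the image. It is an illustration of the standard pinhole model, not the paper's code: the function names, the identity rotation, and the intrinsics (`fx`, `fy`, `cx`, `cy`) are all hypothetical, and real rigs also involve an axis-convention change between the LiDAR and camera frames that is folded into the rotation.

```python
def apply_extrinsic(point_lidar, rotation, translation):
    """Map a LiDAR-frame point into the camera frame: p_cam = R @ p_lidar + t."""
    return tuple(
        sum(rotation[i][k] * point_lidar[k] for k in range(3)) + translation[i]
        for i in range(3)
    )


def project_to_image(point_cam, fx, fy, cx, cy):
    """Pinhole projection; returns None for points at or behind the camera."""
    x, y, z = point_cam
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical values: identity rotation, zero translation, simple intrinsics.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = apply_extrinsic((1.0, 0.0, 10.0), IDENTITY, (0.0, 0.0, 0.0))
pixel = project_to_image(p_cam, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
print(pixel)
```

An error in the rotation or translation shifts every projected LiDAR point in the image, which is why miscalibration degrades any downstream step that overlays the two sensors.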

Core claim

BEVCALIB demonstrates that geometry-guided BEV features extracted separately from camera and LiDAR can be fused in a shared space and selectively decoded to regress accurate extrinsic parameters directly from noisy raw inputs, delivering average error reductions of 47.08 percent in translation and 82.32 percent in rotation on KITTI and 78.17 percent and 68.29 percent on NuScenes relative to the strongest prior baseline.
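The percentages above are relative error reductions. This page does not spell out how they are aggregated, so the sketch below assumes one plausible reading: compute the reduction per noise condition, then average. The error values are made up for illustration and are not results from the paper.

```python
def relative_reduction(baseline_error: float, method_error: float) -> float:
    """Percent by which method_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - method_error) / baseline_error


def mean_relative_reduction(pairs):
    """Average the per-condition reductions over (baseline, method) error pairs."""
    reductions = [relative_reduction(b, m) for b, m in pairs]
    return sum(reductions) / len(reductions)


# Hypothetical translation errors (cm) under three noise settings.
example_pairs = [(10.0, 5.0), (20.0, 12.0), (8.0, 4.0)]
print(round(mean_relative_reduction(example_pairs), 2))  # 46.67
```

Averaging per-condition percentages and taking the percentage of averaged errors generally give different numbers, so the exact definition matters when comparing against the reported figures.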

What carries the argument

A shared BEV feature space, built from separate camera and LiDAR BEV extractors, followed by a novel feature selector that filters the most informative geometric cues before the transformation decoder.

If this is right

  • Calibration can be performed from ordinary driving sequences instead of dedicated controlled data collections.
  • Vehicles can maintain accurate sensor alignment during operation despite vibrations or temperature drift.
  • Downstream multi-modal perception tasks receive more reliable fused inputs under real-world noise.
  • Reproducible open-source calibration reaches an order-of-magnitude lower error than earlier public baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separate-BEV-then-fuse pattern could be tested for calibrating other sensor pairs such as radar and camera.
  • An online version running at frame rate might support continuous self-calibration on moving platforms.
  • The feature selector could be reused in other BEV fusion networks to lower memory use without losing geometric accuracy.
  • Evaluating the method on datasets captured with non-standard vehicle rigs would test whether the BEV assumption holds beyond KITTI and NuScenes.

Load-bearing premise

The approach assumes that separately extracted camera and LiDAR BEV features contain sufficient undistorted geometric information to recover accurate extrinsic parameters even when the input data contains noise and without additional explicit geometric constraints or hand-crafted correspondences.

What would settle it

Running the method on a new dataset recorded with deliberately introduced large sensor misalignment or extreme sensor noise and observing that the resulting translation and rotation errors fail to beat the best baseline would falsify the central claim.
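Such a test needs explicit definitions of translation and rotation error. The page does not give the paper's definitions, so the sketch below assumes two common choices: Euclidean distance between translation vectors and the geodesic angle between rotation matrices. The helper `rot_z` and the sample angles are hypothetical.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose3(a):
    """Transpose a 3x3 matrix."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle (degrees) of the relative rotation R_pred^T @ R_gt."""
    rel = matmul3(transpose3(r_pred), r_gt)
    cos_angle = (rel[0][0] + rel[1][1] + rel[2][2] - 1.0) / 2.0
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth translations."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(t_pred, t_gt)))


# Hypothetical prediction that is off by 2 degrees of yaw and 0.5 m.
print(round(rotation_error_deg(rot_z(12.0), rot_z(10.0)), 6))
print(translation_error((0.3, 0.0, 0.4), (0.0, 0.0, 0.0)))
```

With these metrics fixed, the falsification check reduces to comparing the method's errors against the best baseline's errors on the same perturbed inputs.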

Figures

Figures reproduced from arXiv: 2506.02587 by Divyank Shah, Hang Qiu, Jerry Li, Justin Yue, Konstantinos Karydis, Weiduo Yuan.

Figure 1: Overall architecture of BEVCALIB. The overall pipeline of our model consists of BEV feature extraction, FPN BEV Encoder, and geometry-guided BEV decoder (GGBD). For BEV feature extraction (§3.2), the inputs of the camera and LiDAR are extracted into BEV features through different backbones separately, then fused into a shared BEV feature space. The FPN BEV encoder is used to improve the multi-scale geome…

Figure 2: Overall Architecture of Geometry-Guided BEV Decoder (GGBD). The GGBD component contains a feature selector (left) and a refinement module (right). The feature selector calculates the positions of BEV features using Equation 1. The corresponding positional embeddings (PE) are added to keep the geometry information of the selected feature. After the decoder, the refinement module adds an average-pooling ope…

Figure 3: Error Distribution of BEVCALIB and Other Baselines on CALIBDB and KITTI Qualitative Results

Figure 4: Qualitative results. A comparison of LiDAR-camera overlays from KITTI sequences.
Abstract

Accurate LiDAR-camera calibration is fundamental to fusing multi-modal perception in autonomous driving and robotic systems. Traditional calibration methods require extensive data collection in controlled environments and cannot compensate for the transformation changes during the vehicle/robot movement. In this paper, we propose the first model that uses bird's-eye view (BEV) features to perform LiDAR-camera calibration from raw data, termed BEVCALIB. To achieve this, we extract camera BEV features and LiDAR BEV features separately and fuse them into a shared BEV feature space. To fully utilize the geometric information from the BEV feature, we introduce a novel feature selector to filter the most important features in the transformation decoder, which reduces memory consumption and enables efficient training. Extensive evaluations on KITTI, NuScenes, and our own dataset demonstrate that BEVCALIB establishes a new state of the art. Under various noise conditions, BEVCALIB outperforms the best baseline in the literature by an average of (47.08%, 82.32%) on the KITTI dataset, and (78.17%, 68.29%) on the NuScenes dataset, in terms of (translation, rotation), respectively. In the open-source domain, it improves the best reproducible baseline by one order of magnitude. Our code and demo results are available at https://cisl.ucr.edu/BEVCalib.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes BEVCALIB, the first method to perform LiDAR-camera extrinsic calibration directly from raw data by separately extracting camera and LiDAR BEV features, fusing them into a shared BEV space, and employing a novel feature selector module inside a transformation decoder to recover the calibration parameters. Extensive experiments on KITTI, NuScenes, and a custom dataset are reported to show new state-of-the-art accuracy, with average improvements over the best baseline of (47.08%, 82.32%) on KITTI and (78.17%, 68.29%) on NuScenes for (translation, rotation) under added noise; the method is also claimed to improve the best open-source baseline by an order of magnitude.

Significance. If the empirical gains prove robust and the BEV representations truly preserve undistorted geometry without circular dependence on the estimated extrinsics, the work would be significant for online calibration in autonomous driving and robotics, where traditional methods require controlled setups. The open release of code and the efficiency-oriented feature selector are clear strengths. The contribution is tempered, however, by the absence of architectural, loss, and training details that would allow independent verification of the SOTA claims.

major comments (2)
  1. [Method (BEV feature extraction and fusion)] The central claim rests on the assumption that camera BEV features can be extracted independently of the extrinsic parameters under estimation and still retain sufficient undistorted geometric information for the transformation decoder. Standard camera-to-BEV lifting (depth estimation or homography) is either extrinsic-dependent or introduces depth errors that propagate directly into the fused representation; the manuscript does not describe an extrinsic-independent mechanism or quantify depth-noise sensitivity, so the reported robustness under added noise rests on an untested premise.
  2. [Experiments and results] The SOTA and percentage-improvement claims (e.g., 47.08% translation / 82.32% rotation on KITTI) are presented without accompanying network architecture diagrams, loss-function definitions, training protocol, ablation studies on the feature selector, or statistical significance tests. These omissions make it impossible to determine whether the gains arise from the proposed BEV fusion or from unstated implementation choices, directly undermining the central empirical contribution.
minor comments (2)
  1. [Abstract] The abstract states that the method 'fully utilizes the geometric information from the BEV feature' yet provides no explicit geometric loss or correspondence term; a brief clarification of whether any such term is used would improve readability.
  2. [Experiments] Table or figure captions that report the exact noise levels and the precise definition of the 'best baseline' would help readers reproduce the percentage gains.
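As background for major comment 1, a depth-based lift of a single pixel can be written with camera intrinsics alone, which keeps the lifted points in the camera frame. The sketch below is a generic pinhole back-projection in the style of depth-distribution lifting; it is not taken from the manuscript, and the function names, depth bins, probabilities, and intrinsics are all hypothetical.

```python
def unproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) at a given depth into the camera frame."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def lift_pixel(u, v, depth_bins, depth_probs, fx, fy, cx, cy):
    """Return one (camera-frame point, weight) pair per candidate depth bin."""
    if len(depth_bins) != len(depth_probs):
        raise ValueError("depth_bins and depth_probs must have equal length")
    return [
        (unproject_pixel(u, v, d, fx, fy, cx, cy), p)
        for d, p in zip(depth_bins, depth_probs)
    ]


# Hypothetical pixel, depth distribution, and intrinsics.
lifted = lift_pixel(
    740.0, 360.0,
    depth_bins=[5.0, 10.0, 20.0],
    depth_probs=[0.2, 0.7, 0.1],
    fx=1000.0, fy=1000.0, cx=640.0, cy=360.0,
)
for point, weight in lifted:
    print(point, weight)
```

No extrinsic parameter appears in this computation, but any error in the depth distribution moves the lifted points along the viewing ray, which is the sensitivity the referee asks the authors to quantify.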

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and valuable feedback on our work. We address the major comments point by point below, providing clarifications and indicating where revisions will be made to improve the manuscript.

Point-by-point responses
  1. Referee: [Method (BEV feature extraction and fusion)] The central claim rests on the assumption that camera BEV features can be extracted independently of the extrinsic parameters under estimation and still retain sufficient undistorted geometric information for the transformation decoder. Standard camera-to-BEV lifting (depth estimation or homography) is either extrinsic-dependent or introduces depth errors that propagate directly into the fused representation; the manuscript does not describe an extrinsic-independent mechanism or quantify depth-noise sensitivity, so the reported robustness under added noise rests on an untested premise.

    Authors: We appreciate this observation. In BEVCALIB, the camera BEV features are extracted using a dedicated BEV encoder that operates directly on the 2D image features projected into BEV space using a learned depth distribution, which is trained end-to-end but does not require the extrinsic calibration parameters as input. The projection is based on the camera intrinsics only, and the extrinsic is estimated later in the decoder. This design ensures independence from the estimated extrinsics. However, we acknowledge that additional details on this mechanism and a sensitivity analysis to depth noise were not sufficiently elaborated. We will revise the method section to include a clearer description of the extrinsic-independent BEV lifting and add experiments quantifying the impact of depth estimation errors. revision: yes

  2. Referee: [Experiments and results] The SOTA and percentage-improvement claims (e.g., 47.08% translation / 82.32% rotation on KITTI) are presented without accompanying network architecture diagrams, loss-function definitions, training protocol, ablation studies on the feature selector, or statistical significance tests. These omissions make it impossible to determine whether the gains arise from the proposed BEV fusion or from unstated implementation choices, directly undermining the central empirical contribution.

    Authors: We agree that providing these details is crucial for reproducibility and to substantiate the claims. The current manuscript includes some of this information in the supplementary material, but we recognize it should be more prominently featured in the main paper. In the revised version, we will add a network architecture diagram, explicit loss function formulations, detailed training protocols, ablation studies specifically on the feature selector module, and statistical significance tests (e.g., paired t-tests or confidence intervals) for the reported improvements. This will allow readers to better verify the source of the performance gains. revision: yes
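The second response mentions paired t-tests as one option for significance testing. A minimal sketch of the paired t statistic follows, assuming per-sample errors for the baseline and the method are available on the same inputs. The error values are invented for illustration, and converting the statistic to a p-value would additionally require a t-distribution table or a statistics library.

```python
import math
import statistics


def paired_t_statistic(baseline_errors, method_errors):
    """Return (t statistic, degrees of freedom) for paired error samples."""
    if len(baseline_errors) != len(method_errors):
        raise ValueError("inputs must have the same length")
    diffs = [b - m for b, m in zip(baseline_errors, method_errors)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired samples")
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd == 0.0:
        raise ValueError("differences have zero variance")
    t_stat = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical per-sample rotation errors (degrees) on the same four inputs.
baseline = [2.0, 3.0, 4.0, 5.0]
method = [1.0, 1.5, 2.5, 3.0]
t_stat, dof = paired_t_statistic(baseline, method)
print(round(t_stat, 4), dof)
```

A positive statistic here means the baseline's errors are larger on average; pairing by input removes scene-to-scene variation that would otherwise inflate the variance.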

Circularity Check

0 steps flagged

No significant circularity; empirical architecture with external validation

full rationale

The paper describes an empirical neural architecture that extracts separate camera and LiDAR BEV features, fuses them, and decodes extrinsics via a feature selector. No equations, derivations, or self-citation chains are shown that reduce the reported calibration accuracy or transformation parameters to quantities defined by the method's own fitted inputs or prior self-references. Validation occurs on external benchmarks (KITTI, NuScenes) under added noise, satisfying the criterion for self-contained, falsifiable results independent of internal definitions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that BEV representations preserve enough geometric structure for calibration and on standard supervised learning assumptions that paired sensor data with ground-truth extrinsics are available for training.

free parameters (1)
  • network architecture hyperparameters
    Layer sizes, learning rate, and feature dimensions are chosen during model design and training.
axioms (1)
  • domain assumption: BEV features from camera and LiDAR contain complementary geometric information sufficient for extrinsic estimation
    Invoked by the decision to extract and fuse BEV features as the core representation.
invented entities (1)
  • feature selector module (no independent evidence)
    purpose: Filters the most important features inside the transformation decoder to reduce memory and enable efficient training.
    Introduced as a novel component of the architecture.
