BEVCALIB: LiDAR-Camera Calibration via Geometry-Guided Bird's-Eye View Representations

Divyank Shah; Hang Qiu; Jerry Li; Justin Yue; Konstantinos Karydis; Weiduo Yuan

arxiv: 2506.02587 · v2 · submitted 2025-06-03 · 💻 cs.CV · cs.RO

Pith reviewed 2026-05-19 11:06 UTC · model grok-4.3

keywords: LiDAR-camera calibration · bird's-eye view · extrinsic parameters · multi-modal fusion · feature selector · autonomous driving · geometric registration

The pith

BEVCALIB recovers LiDAR-camera extrinsic parameters by fusing bird's-eye view features extracted separately from each sensor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BEVCALIB as the first model to perform LiDAR-camera calibration using bird's-eye view features from raw data. Camera and LiDAR data are each projected into BEV space, their features are extracted independently, and then fused into one shared representation. A feature selector then picks the most informative elements to feed into a decoder that outputs the rigid transformation between the two sensors. This setup removes the need for controlled calibration environments or manual correspondences while handling noise and vehicle motion. Readers care because it makes accurate multi-modal fusion practical for autonomous vehicles that cannot return to a lab for recalibration.
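As a rough illustration of what the recovered rigid transformation is used for, the sketch below maps a LiDAR point into the camera frame with a rotation `R` and translation `t`, then projects it with pinhole intrinsics. The function names, the axis convention, and all numbers are assumptions made for illustration; none of them come from the paper or its released code.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def lidar_to_camera(p_lidar: Vec3, R: Sequence[Sequence[float]], t: Vec3) -> Vec3:
    """Apply the extrinsic transform: p_cam = R @ p_lidar + t."""
    x, y, z = (
        sum(R[i][j] * p_lidar[j] for j in range(3)) + t[i] for i in range(3)
    )
    return (x, y, z)


def project(
    p_cam: Vec3, fx: float, fy: float, cx: float, cy: float
) -> Optional[Tuple[float, float]]:
    """Pinhole projection; returns None for points behind the camera."""
    x, y, z = p_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


# Assumed axis convention (hypothetical rig): LiDAR x forward, y left, z up;
# camera x right, y down, z forward. Intrinsics and the point are made up.
R = [[0.0, -1.0, 0.0], [0.0, 0.0, -1.0], [1.0, 0.0, 0.0]]
t = (0.0, 0.0, 0.0)
p_cam = lidar_to_camera((10.0, -2.0, 0.0), R, t)
pixel = project(p_cam, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(p_cam, pixel)
```

An error in `R` or `t` shifts where LiDAR points land in the image, which is why miscalibration degrades any downstream fusion.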

Core claim

BEVCALIB demonstrates that geometry-guided BEV features extracted separately from camera and LiDAR can be fused in a shared space and selectively decoded to regress accurate extrinsic parameters directly from noisy raw inputs, delivering average error reductions of 47.08 percent in translation and 82.32 percent in rotation on KITTI and 78.17 percent and 68.29 percent on NuScenes relative to the strongest prior baseline.
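To make the percentage figures concrete, the sketch below shows the usual definition of relative error reduction. The function name and the numbers are hypothetical, and the paper may aggregate over noise settings in a different way before reporting its averages.

```python
def relative_reduction(baseline_err: float, method_err: float) -> float:
    """Percent by which method_err is lower than baseline_err."""
    if baseline_err <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_err - method_err) / baseline_err


# Hypothetical numbers for illustration only (not from the paper):
# a baseline at 10.0 cm translation error and a method at 5.0 cm.
print(relative_reduction(10.0, 5.0))  # 50.0
```

Under this definition, an 82.32 percent reduction leaves 17.68 percent of the baseline error, and an order-of-magnitude improvement corresponds to a reduction of about 90 percent.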

What carries the argument

Shared BEV feature space created by separate camera and LiDAR BEV extractors, followed by a novel feature selector that filters important geometric cues before the transformation decoder.

If this is right

Calibration can be performed from ordinary driving sequences instead of dedicated controlled data collections.
Vehicles can maintain accurate sensor alignment during operation despite vibrations or temperature drift.
Downstream multi-modal perception tasks receive more reliable fused inputs under real-world noise.
Reproducible open-source calibration reaches an order-of-magnitude lower error than earlier public baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separate-BEV-then-fuse pattern could be tested for calibrating other sensor pairs such as radar and camera.
An online version running at frame rate might support continuous self-calibration on moving platforms.
The feature selector could be reused in other BEV fusion networks to lower memory use without losing geometric accuracy.
Evaluating the method on datasets captured with non-standard vehicle rigs would test whether the BEV assumption holds beyond KITTI and NuScenes.

Load-bearing premise

The approach assumes that separately extracted camera and LiDAR BEV features contain sufficient undistorted geometric information to recover accurate extrinsic parameters even when the input data contains noise and without additional explicit geometric constraints or hand-crafted correspondences.

What would settle it

Running the method on a new dataset recorded with deliberately introduced large sensor misalignment or extreme sensor noise and observing that the resulting translation and rotation errors fail to beat the best baseline would falsify the central claim.
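A test of this kind needs a definition of translation and rotation error. The sketch below uses one common convention: Euclidean distance between translations and the geodesic angle between rotations. The paper may define its metrics differently (for example per axis), and all names here are hypothetical.

```python
import math
from typing import List, Sequence

Mat3 = List[List[float]]


def rot_z(deg: float) -> Mat3:
    """Rotation about the z axis by `deg` degrees (helper for the example)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def translation_error(t_pred: Sequence[float], t_gt: Sequence[float]) -> float:
    """Euclidean distance between predicted and ground-truth translations."""
    return math.dist(t_pred, t_gt)


def rotation_error_deg(r_pred: Mat3, r_gt: Mat3) -> float:
    """Geodesic angle of R_gt^T @ R_pred, in degrees."""
    # trace(A^T B) equals the sum of elementwise products of A and B.
    trace = sum(r_gt[i][j] * r_pred[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


# Made-up example: prediction is off by 3 degrees of yaw and 0.5 m.
print(round(rotation_error_deg(rot_z(5.0), rot_z(2.0)), 6))
print(translation_error((0.3, 0.4, 0.0), (0.0, 0.0, 0.0)))
```

Comparing these two numbers for BEVCALIB and for the strongest baseline, on data with large injected misalignment, is the kind of check the paragraph above describes.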

Figures

Figures reproduced from arXiv: 2506.02587 by Divyank Shah, Hang Qiu, Jerry Li, Justin Yue, Konstantinos Karydis, Weiduo Yuan.

**Figure 1.** Overall architecture of BEVCALIB. The overall pipeline of our model consists of BEV feature extraction, FPN BEV Encoder, and geometry-guided BEV decoder (GGBD). For BEV feature extraction (§3.2), the inputs of the camera and LiDAR are extracted into BEV features through different backbones separately, then fused into a shared BEV feature space. The FPN BEV encoder is used to improve the multi-scale geome…

**Figure 2.** Overall Architecture of Geometry-Guided BEV Decoder (GGBD). The GGBD component contains a feature selector (left) and a refinement module (right). The feature selector calculates the positions of BEV features using Equation 1. The corresponding positional embeddings (PE) are added to keep the geometry information of the selected feature. After the decoder, the refinement module adds an average-pooling ope…

**Figure 3.** Error Distribution of BEVCALIB and Other Baselines on CALIBDB and KITTI Qualitative Results

**Figure 4.** Qualitative results. A comparison of LiDAR-camera overlays from KITTI sequences.

Original abstract

Accurate LiDAR-camera calibration is fundamental to fusing multi-modal perception in autonomous driving and robotic systems. Traditional calibration methods require extensive data collection in controlled environments and cannot compensate for the transformation changes during the vehicle/robot movement. In this paper, we propose the first model that uses bird's-eye view (BEV) features to perform LiDAR-camera calibration from raw data, termed BEVCALIB. To achieve this, we extract camera BEV features and LiDAR BEV features separately and fuse them into a shared BEV feature space. To fully utilize the geometric information from the BEV feature, we introduce a novel feature selector to filter the most important features in the transformation decoder, which reduces memory consumption and enables efficient training. Extensive evaluations on KITTI, NuScenes, and our own dataset demonstrate that BEVCALIB establishes a new state of the art. Under various noise conditions, BEVCALIB outperforms the best baseline in the literature by an average of (47.08%, 82.32%) on the KITTI dataset, and (78.17%, 68.29%) on the NuScenes dataset, in terms of (translation, rotation), respectively. In the open-source domain, it improves the best reproducible baseline by one order of magnitude. Our code and demo results are available at https://cisl.ucr.edu/BEVCalib.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BEVCALIB claims the first BEV-feature fusion approach for raw-data LiDAR-camera calibration and reports large error reductions, but the camera BEV step looks like it could introduce unaddressed dependence on the parameters being estimated.

The main thing to know is that this paper puts forward BEVCALIB as the first method to calibrate LiDAR and camera extrinsics directly from raw inputs by pulling separate BEV features from each sensor, fusing them, and feeding the result through a transformation decoder with a feature selector. They report clear gains over baselines on KITTI and NuScenes under added noise, plus an order-of-magnitude improvement on the best open reproducible baseline, and they release code.

Referee Report

2 major / 2 minor

Summary. The paper proposes BEVCALIB, the first method to perform LiDAR-camera extrinsic calibration directly from raw data by separately extracting camera and LiDAR BEV features, fusing them into a shared BEV space, and employing a novel feature selector module inside a transformation decoder to recover the calibration parameters. Extensive experiments on KITTI, NuScenes, and a custom dataset are reported to show new state-of-the-art accuracy, with average improvements over the best baseline of (47.08%, 82.32%) on KITTI and (78.17%, 68.29%) on NuScenes for (translation, rotation) under added noise; the method is also claimed to improve the best open-source baseline by an order of magnitude.

Significance. If the empirical gains prove robust and the BEV representations truly preserve undistorted geometry without circular dependence on the estimated extrinsics, the work would be significant for online calibration in autonomous driving and robotics, where traditional methods require controlled setups. The open release of code and the efficiency-oriented feature selector are clear strengths. The contribution is tempered, however, by the absence of architectural, loss, and training details that would allow independent verification of the SOTA claims.

major comments (2)

[Method (BEV feature extraction and fusion)] The central claim rests on the assumption that camera BEV features can be extracted independently of the extrinsic parameters under estimation and still retain sufficient undistorted geometric information for the transformation decoder. Standard camera-to-BEV lifting (depth estimation or homography) is either extrinsic-dependent or introduces depth errors that propagate directly into the fused representation; the manuscript does not describe an extrinsic-independent mechanism or quantify depth-noise sensitivity, so the reported robustness under added noise rests on an untested premise.
[Experiments and results] The SOTA and percentage-improvement claims (e.g., 47.08% translation / 82.32% rotation on KITTI) are presented without accompanying network architecture diagrams, loss-function definitions, training protocol, ablation studies on the feature selector, or statistical significance tests. These omissions make it impossible to determine whether the gains arise from the proposed BEV fusion or from unstated implementation choices, directly undermining the central empirical contribution.
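For orientation on the first major comment: in Lift-Splat-style lifting, unprojecting a pixel at a candidate depth into the camera's own frame needs only the intrinsics, while placing those points in another sensor's frame is where extrinsics enter. The sketch below shows only the intrinsics-only step. It is a minimal illustration under assumed names and made-up intrinsics, not the paper's implementation.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def unproject(
    u: float, v: float, depth: float, fx: float, fy: float, cx: float, cy: float
) -> Vec3:
    """Pixel (u, v) at a given depth -> 3D point in the camera frame.

    Only intrinsics (fx, fy, cx, cy) are used; no extrinsics appear here.
    """
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def frustum_points(
    u: float,
    v: float,
    depth_bins: Sequence[float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Vec3]:
    """One candidate 3D point per depth bin along the pixel's viewing ray."""
    return [unproject(u, v, d, fx, fy, cx, cy) for d in depth_bins]


# Hypothetical intrinsics and depth bins, for illustration only.
pts = frustum_points(420.0, 240.0, [5.0, 10.0, 20.0], 500.0, 500.0, 320.0, 240.0)
print(pts)
```

A learned depth distribution would weight these candidate points. Any error in that distribution moves features along the ray, which is the depth-noise sensitivity the referee asks the authors to quantify.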

minor comments (2)

[Abstract] The abstract says the feature selector is introduced 'to fully utilize the geometric information from the BEV feature', yet it mentions no explicit geometric loss or correspondence term; a brief clarification of whether any such term is used would improve readability.
[Experiments] Table or figure captions that report the exact noise levels and the precise definition of the 'best baseline' would help readers reproduce the percentage gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and valuable feedback on our work. We address the major comments point by point below, providing clarifications and indicating where revisions will be made to improve the manuscript.

Referee: [Method (BEV feature extraction and fusion)] The central claim rests on the assumption that camera BEV features can be extracted independently of the extrinsic parameters under estimation and still retain sufficient undistorted geometric information for the transformation decoder. Standard camera-to-BEV lifting (depth estimation or homography) is either extrinsic-dependent or introduces depth errors that propagate directly into the fused representation; the manuscript does not describe an extrinsic-independent mechanism or quantify depth-noise sensitivity, so the reported robustness under added noise rests on an untested premise.

Authors: We appreciate this observation. In BEVCALIB, the camera BEV features are produced by a dedicated BEV encoder: 2D image features are projected into BEV space using a learned depth distribution, which is trained end-to-end and does not take the extrinsic calibration parameters as input. The projection relies on the camera intrinsics only, and the extrinsics are estimated later in the decoder. This design keeps the camera BEV features independent of the extrinsics being estimated. However, we acknowledge that this mechanism was not described in enough detail and that no sensitivity analysis to depth noise was provided. We will revise the method section to include a clearer description of the extrinsic-independent BEV lifting and add experiments quantifying the impact of depth estimation errors. revision: yes
Referee: [Experiments and results] The SOTA and percentage-improvement claims (e.g., 47.08% translation / 82.32% rotation on KITTI) are presented without accompanying network architecture diagrams, loss-function definitions, training protocol, ablation studies on the feature selector, or statistical significance tests. These omissions make it impossible to determine whether the gains arise from the proposed BEV fusion or from unstated implementation choices, directly undermining the central empirical contribution.

Authors: We agree that providing these details is crucial for reproducibility and to substantiate the claims. The current manuscript includes some of this information in the supplementary material, but we recognize it should be more prominently featured in the main paper. In the revised version, we will add a network architecture diagram, explicit loss function formulations, detailed training protocols, ablation studies specifically on the feature selector module, and statistical significance tests (e.g., paired t-tests or confidence intervals) for the reported improvements. This will allow readers to better verify the source of the performance gains. revision: yes
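The paired t-test mentioned in the second response can be sketched as follows. This is a minimal illustration assuming per-sample errors for a baseline and a method measured on the same samples. The data are invented and the function name is hypothetical.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(baseline: Sequence[float], method: Sequence[float]) -> float:
    """t statistic for the mean of paired differences (baseline - method)."""
    if len(baseline) != len(method) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [b - m for b, m in zip(baseline, method)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Invented per-sample errors, for illustration only (not from the paper).
baseline_err = [1.0, 1.2, 0.9, 1.1]
method_err = [0.5, 0.6, 0.5, 0.6]
t_stat = paired_t_statistic(baseline_err, method_err)
print(round(t_stat, 3), "with", len(baseline_err) - 1, "degrees of freedom")
```

The statistic is compared against a t distribution with n - 1 degrees of freedom. SciPy's `scipy.stats.ttest_rel` computes the same statistic together with a p-value.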

Circularity Check

0 steps flagged

No significant circularity; empirical architecture with external validation

The paper describes an empirical neural architecture that extracts separate camera and LiDAR BEV features, fuses them, and decodes extrinsics via a feature selector. No equations, derivations, or self-citation chains are shown that reduce the reported calibration accuracy or transformation parameters to quantities defined by the method's own fitted inputs or prior self-references. Validation occurs on external benchmarks (KITTI, NuScenes) under added noise, satisfying the criterion for self-contained, falsifiable results independent of internal definitions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that BEV representations preserve enough geometric structure for calibration and on standard supervised learning assumptions that paired sensor data with ground-truth extrinsics are available for training.

free parameters (1)

network architecture hyperparameters
Layer sizes, learning rate, and feature dimensions are chosen during model design and training.

axioms (1)

domain assumption: BEV features from camera and LiDAR contain complementary geometric information sufficient for extrinsic estimation
Invoked by the decision to extract and fuse BEV features as the core representation.

invented entities (1)

feature selector module (no independent evidence)
purpose: Filters the most important features inside the transformation decoder to reduce memory and enable efficient training
Introduced as a novel component of the architecture

[1] [1]

A. J. Sathyamoorthy, J. Liang, U. Patel, T. Guan, R. Chandra, and D. Manocha. Densecavoid: Real-time navigation in dense crowds using anticipatory behaviors. In2020 IEEE International Conference on Robotics and Automation (ICRA) , pages 11345–11352, 2020. doi:10.1109/ ICRA40945.2020.9197379

work page arXiv 2020

[2] [2]

Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. Rus, and S. Han. Bevfusion: Multi-task multi- sensor fusion with unified bird’s-eye view representation. In IEEE International Conference on Robotics and Automation (ICRA), 2023

work page 2023

[3] [3]

S. R. Mhatre and J. W. Bakal. Deepfusion: A novel deep learning technique for enhanced image super-resolution. In 2024 3rd International Conference on Automation, Computing and Renewable Systems (ICACRS), pages 991–998, 2024. doi:10.1109/ICACRS62842.2024. 10841630

work page doi:10.1109/icacrs62842.2024 2024

[4] J.-K. Huang and J. W. Grizzle. Improvements to Target-Based 3D LiDAR to Camera Calibration. IEEE Access, 8:134101–134110, 2020. doi:10.1109/ACCESS.2020.3010734

[5] Q. Zhang and R. Pless. Extrinsic calibration of a camera and laser range finder (improves camera calibration). In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566), volume 3, pages 2301–2306 vol.3, 2004. doi:10.1109/IROS.2004.1389752

[6] G. Yan, F. He, C. Shi, P. Wei, X. Cai, and Y. Li. Joint camera intrinsic and lidar-camera extrinsic calibration. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11446–11452, 2023. doi:10.1109/ICRA48891.2023.10160542

[7] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR), 2013

[8] N. Schneider, F. Piewak, C. Stiller, and U. Franke. Regnet: Multimodal sensor registration using deep neural networks. In 2017 IEEE Intelligent Vehicles Symposium (IV), pages 1803–1810, 2017. doi:10.1109/IVS.2017.7995968

[9] X. Lv, B. Wang, Z. Dou, D. Ye, and S. Wang. Lccnet: Lidar and camera self-calibration using cost volume network. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2888–2895, 2021. doi:10.1109/CVPRW53098.2021.00324

[10] K. Koide, S. Oishi, M. Yokozuka, and A. Banno. General, single-shot, target-less, and automatic lidar-camera extrinsic calibration toolbox. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11301–11307. IEEE, 2023

[11] Z. Luo, G. Yan, X. Cai, and B. Shi. Zero-training lidar-camera extrinsic calibration method using segment anything model. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 14472–14478, 2024. doi:10.1109/ICRA57147.2024.10610983

[12] Y.-C. Lee and K.-W. Chen. Lccraft: Lidar and camera calibration using recurrent all-pairs field transforms without precise initial guess. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 16669–16675, 2024. doi:10.1109/ICRA57147.2024.10610756

[13] Q. Herau, N. Piasco, M. Bennehar, L. Roldão, D. Tsishkou, C. Migniot, P. Vasseur, and C. Demonceaux. Moisst: Multimodal optimization of implicit scene for spatiotemporal calibration. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1810–1817. IEEE, Oct. 2023. doi:10.1109/iros55552.2023.10342427. URL http://dx.do...

[14] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuscenes: A multimodal dataset for autonomous driving. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11618–11628, 2020. doi:10.1109/CVPR42600.2020.01164

[15] P. Xiao, Z. Shao, S. Hao, Z. Zhang, X. Chai, J. Jiao, Z. Li, J. Wu, K. Sun, K. Jiang, Y. Wang, and D. Yang. Pandaset: Advanced sensor suite dataset for autonomous driving. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pages 3095–3101, 2021. doi:10.1109/ITSC48978.2021.9565009

[16] J. Shi, Z. Zhu, J. Zhang, R. Liu, Z. Wang, S. Chen, and H. Liu. Calibrcnn: Calibrating camera and lidar by recurrent convolutional neural network and geometric constraints. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10197–10202, 2020. doi:10.1109/IROS45743.2020.9341147

[17] Y. Xiao, Y. Li, C. Meng, X. Li, J. Ji, and Y. Zhang. Calibformer: A transformer-based automatic lidar-camera calibration network, 2024. URL https://arxiv.org/abs/2311.15241

[18] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, and J. Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270, 2022

[19] Y. Wang, V. Guizilini, T. Zhang, Y. Wang, H. Zhao, and J. M. Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In The Conference on Robot Learning (CoRL), 2021

[20] H. Liu, Y. Teng, T. Lu, H. Wang, and L. Wang. Sparsebev: High-performance sparse 3d object detection from multi-camera videos, 2023. URL https://arxiv.org/abs/2308.09244

[21] Q. Li, Y. Wang, Y. Wang, and H. Zhao. Hdmapnet: An online hd map construction and evaluation framework. arXiv preprint arXiv:2107.06307, 2021

[22] S. Choi, J. Kim, H. Shin, and J. W. Choi. Mask2map: Vectorized hd map construction using bird’s eye view segmentation masks. In European Conference on Computer Vision, 2024

[23] J. Ross, O. Mendez, A. Saha, M. Johnson, and R. Bowden. Bev-slam: Building a globally-consistent world map using monocular vision. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3830–3836, 2022. doi:10.1109/IROS47612.2022.9981258

[24] L. Luo, S. Zheng, Y. Li, Y. Fan, B. Yu, S.-Y. Cao, J. Li, and H.-L. Shen. Bevplace: Learning lidar-based place recognition using bird’s eye view images. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8666–8675, 2023. doi:10.1109/ICCV51070.2023.00799

[25] Y. Zhang, Z. Zhu, and D. Du. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. arXiv preprint arXiv:2304.05316, 2023

[26] J. Li, X. He, C. Zhou, X. Cheng, Y. Wen, and D. Zhang. Viewformer: Exploring spatiotemporal modeling for multi-view 3d occupancy perception via view-guided transformers. arXiv preprint arXiv:2405.04299, 2024

[27] L. Zhang, Y. Xiong, Z. Yang, S. Casas, R. Hu, and R. Urtasun. Learning unsupervised world models for autonomous driving via discrete diffusion. ICLR, 2024

[28] Y. Zhang, S. Gong, K. Xiong, X. Ye, X. Tan, F. Wang, J. Huang, H. Wu, and H. Wang. Bevworld: A multimodal world model for autonomous driving via unified bev latent space, 2024. URL https://arxiv.org/abs/2407.05679

[29] S. Verma, J. S. Berrio, S. Worrall, and E. Nebot. Automatic extrinsic calibration between a camera and a 3d lidar using 3d point and plane correspondences. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 3906–3912, 2019. doi:10.1109/ITSC.2019.8917108

[30] R. Ishikawa, T. Oishi, and K. Ikeuchi. Lidar and camera calibration using motion estimated by sensor fusion odometry, 2018. URL https://arxiv.org/abs/1804.05178

[31] G. Pandey, J. R. McBride, S. Savarese, and R. M. Eustice. Automatic targetless extrinsic calibration of a 3d lidar and camera by maximizing mutual information. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, AAAI’12, pages 2053–2059. AAAI Press, 2012

[32] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich. Superglue: Learning feature matching with graph neural networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4937–4946, 2020. doi:10.1109/CVPR42600.2020.00499

[33] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick. Segment anything. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3992–4003, 2023. doi:10.1109/ICCV51070.2023.00371

[34] K. Petek, N. Vödisch, J. Meyer, D. Cattaneo, A. Valada, and W. Burgard. Automatic targetless camera-lidar calibration from motion and deep point correspondences. IEEE Robotics and Automation Letters, 9(11):9978–9985, 2024

[35] Q. Herau, N. Piasco, M. Bennehar, L. Roldao, D. Tsishkou, C. Migniot, P. Vasseur, and C. Demonceaux. Soac: Spatio-temporal overlap-aware multi-sensor calibration using neural radiance fields. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15131–15140, 2024. doi:10.1109/CVPR52733.2024.01433

[36] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: representing scenes as neural radiance fields for view synthesis. Commun. ACM, 65(1):99–106, Dec. 2021. ISSN 0001-0782. doi:10.1145/3503250. URL https://doi.org/10.1145/3503250

[37] Z. Yang, G. Chen, H. Zhang, K. Ta, I. A. Bârsan, D. Murphy, S. Manivasagam, and R. Urtasun. Unical: Unified neural sensor calibration. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XXXVI, pages 327–345, Berlin, Heidelberg, 2024. Springer-Verlag. ISBN 978-3-031-72763-4. doi:10.1...

[38] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), July 2023. URL https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

[39] Q. Herau, M. Bennehar, A. Moreau, N. Piasco, L. Roldao, D. Tsishkou, C. Migniot, P. Vasseur, and C. Demonceaux. 3dgs-calib: 3d gaussian splatting for multimodal spatiotemporal calibration, 2024

[40] H. Li, C. Sima, J. Dai, W. Wang, L. Lu, H. Wang, J. Zeng, Z. Li, J. Yang, H. Deng, H. Tian, E. Xie, J. Xie, L. Chen, T. Li, Y. Li, Y. Gao, X. Jia, S. Liu, J. Shi, D. Lin, and Y. Qiao. Delving into the devils of bird’s-eye-view perception: A review, evaluation and recipe. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–20, 2023....

[41] Y. Ma, T. Wang, X. Bai, H. Yang, Y. Hou, Y. Wang, Y. Qiao, R. Yang, and X. Zhu. Vision-centric bev perception: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10978–10997, 2024. doi:10.1109/TPAMI.2024.3449912

[42] J. Philion and S. Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Proceedings of the European Conference on Computer Vision, 2020

[43] Y. Yan, Y. Mao, and B. Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10), 2018. ISSN 1424-8220. doi:10.3390/s18103337. URL https://www.mdpi.com/1424-8220/18/10/3337

[44] W. Liao, S. Qiang, X. Li, X. Chen, H. Wang, Y. Liang, J. Yan, T. He, and P. Peng. Calibrbev: Multi-camera calibration via reversed bird’s-eye-view representations for autonomous driving. In Proceedings of the 32nd ACM International Conference on Multimedia, MM ’24, pages 9145–9154, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 979...

[45] G. Iyer, R. K. Ram, J. K. Murthy, and K. M. Krishna. Calibnet: Geometrically supervised extrinsic calibration using 3d spatial transformer networks. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, Oct. 2018. doi:10.1109/iros.2018.8593693. URL http://dx.doi.org/10.1109/IROS.2018.8593693

[46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964

[47] A. Kendall, M. Grimes, and R. Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, pages 2938–2946, USA, 2015. IEEE Computer Society. ISBN 9781467383912. doi:10.1109/ICCV.2015.336. URL https://doi.org/10.1109/ICCV.2015.336

[48] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows, 2021. URL https://arxiv.org/abs/2103.14030
