
arxiv: 2605.23397 · v1 · pith:D6PZZSAFnew · submitted 2026-05-22 · 💻 cs.CV

Joint Target-Less Intrinsic and Extrinsic Camera-LiDAR Calibration using Deep Point Correspondences

Pith reviewed 2026-05-25 04:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords: camera calibration, LiDAR calibration, target-less calibration, deep point correspondences, joint optimization, structure from motion, multi-modal perception, intrinsic extrinsic calibration

The pith

A target-less method jointly recovers camera intrinsics and camera-LiDAR extrinsics from raw images via deep point correspondences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a calibration pipeline that estimates both the internal camera parameters, including distortion, and the rigid transform to a LiDAR sensor without any special targets or pre-rectified images. It begins with structure-from-motion to obtain an initial guess for the intrinsics, then applies a deep matcher that finds pixel-to-point links even on distorted raw frames, and finally runs a single nonlinear optimization that refines intrinsics and extrinsics together. A reader would care because most existing target-less extrinsic methods require known, correct intrinsics first, so removing that prerequisite simplifies deployment on new camera-LiDAR rigs. The evaluation on KITTI shows that the joint solution yields better extrinsics than extrinsic-only baselines while also producing usable intrinsics.
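The final stage described above can be pictured as a reprojection residual over one stacked parameter vector. The following is a minimal sketch, not the authors' implementation: the parameter layout, the axis-angle rotation, and the helper names `rotate`, `residuals`, and `pinhole` are assumptions made for illustration, and the plain pinhole projection omits distortion for brevity.

```python
import math

def rotate(rotvec, p):
    """Rotate 3D point p by the axis-angle vector rotvec (Rodrigues' formula)."""
    rx, ry, rz = rotvec
    theta = math.sqrt(rx * rx + ry * ry + rz * rz)
    if theta < 1e-12:
        return tuple(p)  # negligible rotation
    kx, ky, kz = rx / theta, ry / theta, rz / theta
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    dot = kx * x + ky * y + kz * z
    cross = (ky * z - kz * y, kz * x - kx * z, kx * y - ky * x)
    return (x * c + cross[0] * s + kx * dot * (1 - c),
            y * c + cross[1] * s + ky * dot * (1 - c),
            z * c + cross[2] * s + kz * dot * (1 - c))

def pinhole(q, intrinsics):
    """Hypothetical distortion-free projection; intrinsics = (fx, fy, cx, cy)."""
    fx, fy, cx, cy = intrinsics
    return fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy

def residuals(params, lidar_points, pixels, project=pinhole):
    """Stacked reprojection residuals.

    params holds the intrinsics first, then 3 axis-angle rotation values
    and 3 translation values (an assumed layout).
    """
    intrinsics, rotvec, t = params[:-6], params[-6:-3], params[-3:]
    res = []
    for p, (u, v) in zip(lidar_points, pixels):
        q = rotate(rotvec, p)
        q = (q[0] + t[0], q[1] + t[1], q[2] + t[2])  # LiDAR frame -> camera frame
        u_hat, v_hat = project(q, intrinsics)
        res.extend([u_hat - u, v_hat - v])
    return res

# One correspondence, identity extrinsics, correct intrinsics: zero residual.
params = [700.0, 700.0, 600.0, 180.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(residuals(params, [(1.0, 0.0, 2.0)], [(950.0, 180.0)]))
```

A nonlinear least-squares solver would minimize the sum of squares of this vector over all parameters at once, which is what separates the joint approach from one that holds the intrinsics fixed.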

Core claim

The authors claim that extending deep correspondence calibration to unknown intrinsics—via SfM initialization, a matcher tolerant of raw distorted images, and tight coupling inside joint nonlinear optimization—yields the first fully target-less pipeline for simultaneous intrinsic and extrinsic camera-LiDAR calibration, with measurable gains in extrinsic accuracy on unseen pairs.

What carries the argument

Deep pixel-point correspondences extended to raw images and coupled inside joint nonlinear optimization over intrinsics and extrinsics.
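One common way to write such a joint problem is shown below. The notation is an assumption made here for illustration and is not taken from the paper: \theta collects the intrinsics and distortion coefficients, (R, t) is the LiDAR-to-camera transform, p_i is a LiDAR point, u_i is its matched pixel, \pi_\theta is the distorted pinhole projection, and \rho is an optional robust loss.

```latex
\min_{\theta,\, R,\, t} \; \sum_{i=1}^{N}
  \rho\!\left( \left\| \pi_{\theta}\!\left( R\, p_i + t \right) - u_i \right\|^{2} \right),
\qquad
\theta = (f_x, f_y, c_x, c_y, k_1, k_2, p_1, p_2)
```

Holding \theta fixed recovers an extrinsic-only formulation, so the comparison between the two settings amounts to whether \theta is a free variable.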

If this is right

  • Extrinsic accuracy improves when intrinsics are optimized at the same time rather than held fixed.
  • Accurate pinhole intrinsics with radial-tangential distortion are recovered as a direct output.
  • Calibration succeeds on unrectified raw images without a separate intrinsic step.
  • The pipeline works for previously unseen camera-LiDAR pairings after SfM initialization.
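For reference, the pinhole model with radial-tangential distortion mentioned in the list above can be sketched as follows. This uses the widely adopted convention with two radial coefficients (k1, k2) and two tangential coefficients (p1, p2); how many coefficients the paper estimates is not stated in the material here, so that choice and the function name `project` are assumptions.

```python
def project(point_cam, fx, fy, cx, cy, k1=0.0, k2=0.0, p1=0.0, p2=0.0):
    """Project a 3D point in the camera frame to distorted pixel coordinates."""
    X, Y, Z = point_cam
    if Z <= 0:
        raise ValueError("point must lie in front of the camera")
    # Normalized image coordinates.
    x, y = X / Z, Y / Z
    r2 = x * x + y * y
    # Radial term, then tangential terms.
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    # Apply focal lengths and principal point.
    return fx * x_d + cx, fy * y_d + cy

# Hypothetical intrinsics; a point on the optical axis lands on the principal point.
print(project((0.0, 0.0, 1.0), 700.0, 700.0, 600.0, 180.0))
```

With all four distortion coefficients at zero the function reduces to the plain pinhole model, which is the form that assumes rectified images.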

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the matcher remains stable across lighting changes, the same pipeline could support periodic recalibration during long-term robot operation.
  • The approach might transfer to other sensor combinations provided a comparable deep correspondence model exists for them.
  • Failure of the SfM stage on texture-poor scenes would block the entire joint optimization regardless of later steps.

Load-bearing premise

The deep correspondence network must produce usable matches on images whose focal length and distortion parameters are still unknown.

What would settle it

Run the full pipeline on KITTI sequences whose initial intrinsics are deliberately offset (20 percent in focal length, plus noticeable distortion). If the final extrinsic error fails to beat the extrinsic-only baseline, the joint approach is falsified.
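A minimal sketch of how such a stress test could be set up is given below. The function names, the dictionary layout, the size of the distortion offset, and the use of a mean over runs are all assumptions made for illustration; the calibration runs themselves are outside the sketch.

```python
def perturb_intrinsics(intrinsics, focal_scale=1.2, k1_offset=0.1):
    """Return a copy with focal lengths scaled and radial distortion offset."""
    perturbed = dict(intrinsics)
    perturbed["fx"] *= focal_scale
    perturbed["fy"] *= focal_scale
    perturbed["k1"] = perturbed.get("k1", 0.0) + k1_offset
    return perturbed

def joint_claim_survives(joint_errors, baseline_errors):
    """True if the mean joint extrinsic error is strictly below the baseline mean."""
    mean_joint = sum(joint_errors) / len(joint_errors)
    mean_baseline = sum(baseline_errors) / len(baseline_errors)
    return mean_joint < mean_baseline

# Hypothetical starting intrinsics, offset by 20 percent in focal length.
start = perturb_intrinsics({"fx": 700.0, "fy": 700.0, "cx": 600.0, "cy": 180.0})
print(start)
```

The error lists passed to `joint_claim_survives` would come from running both the joint and the extrinsic-only calibration from the same perturbed starting point.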

Figures

Figures reproduced from arXiv: 2605.23397 by Abhinav Valada, Daniele Cattaneo, Simon Bultmann.

Figure 1: Pipeline for joint intrinsic and extrinsic camera-LiDAR calibration.

Abstract

Accurate camera-LiDAR calibration is a prerequisite for robust multi-modal perception in robotics. Recent target-less approaches based on deep point correspondences achieve remarkable performance for extrinsic calibration but assume rectified images with known intrinsics. In this work, we overcome this limitation and present the first fully target-less pipeline that jointly estimates camera intrinsics (pinhole model with radial-tangential distortion) and camera-LiDAR extrinsics with deep pixel-point correspondences. Our approach extends deep correspondence-based calibration by (i) automatic intrinsic initialization via structure-from-motion, (ii) generalizing camera-LiDAR matching to raw images with unknown intrinsics including distortion, and (iii) tightly coupling correspondence estimation with joint nonlinear optimization over both intrinsics and extrinsics. We evaluate our method on the KITTI dataset with unseen camera-LiDAR pairs and demonstrate that joint calibration achieves improved extrinsic accuracy while additionally recovering accurate intrinsics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents the first fully target-less pipeline for jointly estimating camera intrinsics (pinhole model with radial-tangential distortion) and camera-LiDAR extrinsics using deep pixel-point correspondences. It uses structure-from-motion for automatic intrinsic initialization, generalizes the matching to raw images with unknown intrinsics, and tightly couples correspondence estimation with joint nonlinear optimization. Evaluation on the KITTI dataset with unseen camera-LiDAR pairs shows improved extrinsic accuracy and recovery of accurate intrinsics.

Significance. If the results hold, this approach would be significant for multi-modal perception in robotics by enabling calibration without targets or pre-known intrinsics. The joint optimization and use of deep correspondences represent a practical advancement over previous methods that assume rectified images with known intrinsics. The ability to handle raw data could facilitate easier deployment in real-world scenarios.

major comments (2)
  1. [Abstract] Abstract: The abstract claims 'improved extrinsic accuracy' on KITTI but provides no quantitative metrics, baselines, or error analysis. This makes it impossible to assess whether the joint method delivers a load-bearing improvement over prior extrinsic-only approaches.
  2. [Method] Method section: The central claim requires that the deep correspondence network (developed for rectified images) produces reliable matches on raw images whose distortion parameters are unknown and jointly estimated. No ablation or diagnostic is shown for correspondence accuracy or bias under radial-tangential distortion; if the feature extractor is sensitive, the subsequent nonlinear optimization can converge to a local minimum whose extrinsic error exceeds the claimed gain.
minor comments (2)
  1. [Experiments] Experiments: Define the exact KITTI sequences and the protocol for 'unseen camera-LiDAR pairs' to support reproducibility.
  2. Notation: Ensure all variables in the joint optimization objective (intrinsics, extrinsics, correspondence weights) are explicitly defined before first use.
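The quantitative metrics requested in major comment 1 are commonly reported as a rotation angle and a translation distance between the estimated and reference extrinsics. The sketch below shows one standard way to compute them; the function names and the choice of degrees and a Euclidean norm are assumptions, since the paper's exact error definitions are not given here.

```python
import math

def rotation_error_deg(R_est, R_gt):
    """Angle in degrees of the relative rotation R_est^T R_gt (3x3 nested lists)."""
    # trace(R_est^T R_gt) equals the sum of elementwise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against round-off
    return math.degrees(math.acos(cos_angle))

def translation_error(t_est, t_gt):
    """Euclidean distance between two translation vectors, in their input units."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_est, t_gt)))

# Example: a 10 degree yaw offset and a 3-4-0 translation offset.
a = math.radians(10.0)
R_yaw = [[math.cos(a), -math.sin(a), 0.0],
         [math.sin(a), math.cos(a), 0.0],
         [0.0, 0.0, 1.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(rotation_error_deg(R_yaw, identity), translation_error((0.03, 0.04, 0.0), (0.0, 0.0, 0.0)))
```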

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to strengthen the presentation of results and add requested diagnostics.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract claims 'improved extrinsic accuracy' on KITTI but provides no quantitative metrics, baselines, or error analysis. This makes it impossible to assess whether the joint method delivers a load-bearing improvement over prior extrinsic-only approaches.

    Authors: We agree that the abstract should include quantitative support for the claim of improved extrinsic accuracy. In the revised version we will expand the abstract to report specific translation/rotation errors, the baselines used, and the magnitude of improvement on the KITTI unseen-pair experiments. revision: yes

  2. Referee: [Method] Method section: The central claim requires that the deep correspondence network (developed for rectified images) produces reliable matches on raw images whose distortion parameters are unknown and jointly estimated. No ablation or diagnostic is shown for correspondence accuracy or bias under radial-tangential distortion; if the feature extractor is sensitive, the subsequent nonlinear optimization can converge to a local minimum whose extrinsic error exceeds the claimed gain.

    Authors: The referee correctly notes the absence of a direct ablation on correspondence accuracy under unknown radial-tangential distortion. While the end-to-end KITTI results and the SfM initialization plus joint optimization provide indirect evidence of robustness, we acknowledge that an explicit diagnostic (e.g., matching precision/recall before and after distortion correction, or failure-case analysis) is missing. We will add this ablation study to the revised manuscript. revision: yes
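The matching diagnostic promised in the second response could take a form like the following. This is an illustrative sketch only: the function name, the dictionary layout keyed by LiDAR point id, and the 2 pixel threshold are assumptions, and the reference pixel locations would have to come from a trusted calibration.

```python
import math

def match_precision_recall(predicted, reference, threshold_px=2.0):
    """Precision and recall of pixel-point matches.

    predicted: dict mapping LiDAR point id -> predicted pixel (u, v)
    reference: dict mapping LiDAR point id -> reference pixel (u, v)
    A prediction counts as correct if it lies within threshold_px of the reference.
    """
    correct = 0
    for point_id, (u, v) in predicted.items():
        ref = reference.get(point_id)
        if ref is not None and math.hypot(u - ref[0], v - ref[1]) <= threshold_px:
            correct += 1
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall

predicted = {1: (10.0, 10.0), 2: (50.0, 50.0), 3: (5.0, 5.0)}
reference = {1: (10.5, 10.0), 2: (60.0, 50.0), 4: (0.0, 0.0)}
print(match_precision_recall(predicted, reference))
```

Computing these two numbers once on raw frames and once on rectified frames would expose any bias the distortion introduces into the matcher.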

Circularity Check

0 steps flagged

No circularity: joint optimization and SfM initialization are independent extensions

full rationale

The paper extends prior deep correspondence methods with SfM-based intrinsic initialization and joint nonlinear optimization over intrinsics plus extrinsics. No equation or claim reduces a derived quantity to a fitted parameter by construction, nor does any load-bearing premise rest on a self-citation chain that itself lacks external verification. The evaluation on unseen KITTI pairs supplies an external benchmark, keeping the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review limits visibility into implementation details; the pinhole model with radial-tangential distortion is a standard assumption, and the deep network likely contains many learned parameters not enumerated here.

axioms (2)
  • domain assumption Pinhole camera model with radial-tangential distortion accurately represents the imaging process for the evaluated cameras.
    Invoked when stating the intrinsics to be estimated.
  • domain assumption Structure-from-motion provides a reliable initial estimate of intrinsics for subsequent joint optimization.
    Stated as automatic intrinsic initialization step.

pith-pipeline@v0.9.0 · 5686 in / 1243 out tokens · 17841 ms · 2026-05-25T04:52:45.898034+00:00 · methodology

