pith. machine review for the scientific record.

arxiv: 2604.14795 · v1 · submitted 2026-04-16 · 💻 cs.RO

Recognition: unknown

Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:12 UTC · model grok-4.3

classification 💻 cs.RO
keywords: SLAM · Visual Geometry Foundation Models · Calibration-free · Large-scale mapping · Assistant eye · Epipolar correction · Nonlinear alignment · Anchor propagation

The pith

An assistant eye and anchor-based nonlinear alignment let visual geometry models build drift-free kilometer-scale maps without calibration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that visual geometry foundation models produce local reconstructions with non-linear geometric distortions that cannot be fixed by any single linear transform such as Sim3 or SL4. Forcing rigid alignment causes residuals to accumulate and maps to diverge over kilometer distances. CAL2M counters this by adding a separate assistant eye that supplies a constant physical spacing prior, removing scale ambiguity with no pre-calibration. It then uses an epipolar-guided model to correct intrinsic and pose errors and an anchor-propagation strategy to apply elastic nonlinear transformations that fuse sub-maps into one globally consistent reconstruction.
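To make the spacing prior concrete, here is a minimal sketch of how a constant primary-assistant spacing could unify per-sub-map scale. The function name and interface are hypothetical, and the paper's actual unification runs through its pose-graph optimization rather than this direct ratio; the point is only that a constant spacing, known in metres or not, pins relative scale across sub-maps.

```python
import numpy as np

def unify_submap_scales(primary_centers, assistant_centers, metric_spacing=None):
    """Illustrative use of the constant-spacing prior (hypothetical interface).

    primary_centers, assistant_centers: lists of (N_i, 3) arrays, one pair per
    sub-map, holding the VGFM-estimated camera centers of the primary and
    assistant streams in each sub-map's own, arbitrary scale.

    Returns one scale factor per sub-map. If the true spacing in metres is
    supplied, the scales become metric; otherwise every sub-map is normalised
    to the first one, which is all the constancy of the prior guarantees.
    """
    estimated_spacings = [
        np.linalg.norm(p - a, axis=1).mean()   # mean primary-assistant distance, sub-map units
        for p, a in zip(primary_centers, assistant_centers)
    ]
    reference = metric_spacing if metric_spacing is not None else estimated_spacings[0]
    return [reference / s for s in estimated_spacings]
```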

Core claim

CAL2M shows that kilometer-level VGFM-based SLAM becomes possible once scale is resolved by an uncalibrated assistant eye, intrinsics and poses are corrected via epipolar decomposition, and sub-maps are aligned through anchor-driven nonlinear transformations rather than rigid linear ones.

What carries the argument

The assistant eye supplies a constant physical spacing prior to eliminate scale ambiguity, while the epipolar-guided intrinsic and pose correction model, together with anchor propagation, enables nonlinear elastic sub-map alignment.
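Reference [64] (thin-plate splines) hints at the family of warps that "elastic" alignment implies. The sketch below fits an anchor-driven non-rigid transform with SciPy's RBFInterpolator; it is an editorial illustration with assumed inputs, not CAL2M's actual anchor-propagation formulation.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def elastic_align(submap_points, local_anchors, global_anchors):
    """Warp a sub-map so its anchors land on their globally fused positions.

    submap_points : (N, 3) sub-map points in the local frame
    local_anchors : (K, 3) anchor positions seen in the local frame (K >= 4, not coplanar)
    global_anchors: (K, 3) fused positions of the same anchors in the global map

    A thin-plate-spline displacement field (anchor -> correction vector) is
    interpolated over the sub-map: points near an anchor follow it, the warp
    stays smooth in between, and no single Sim3/SL(4) transform is assumed.
    """
    corrections = global_anchors - local_anchors
    warp = RBFInterpolator(local_anchors, corrections, kernel="thin_plate_spline")
    return submap_points + warp(submap_points)
```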

If this is right

  • Accurate feature matching allows decomposition of the fundamental matrix to recover and correct rotation and translation errors caused by unknown intrinsics (see the sketch after this list).
  • Anchor propagation creates explicit local-to-global links so that nonlinear transformations can be applied across the entire trajectory.
  • The framework works as a plug-in layer on top of any existing VGFM without requiring temporal or spatial pre-calibration.
  • Global consistency is maintained by fusing multiple anchors rather than enforcing a single rigid transform between sub-maps.
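A hedged sketch of the epipolar route named in the first bullet above, using OpenCV's stock fundamental-matrix and pose-recovery calls; the interface is illustrative, and the returned inlier ratio is exactly the statistic the referee report below asks the authors to tabulate.

```python
import cv2
import numpy as np

def relative_pose_from_matches(pts1, pts2, K):
    """Recover (R, t) from matched pixels under an intrinsic guess K (illustrative).

    pts1, pts2 : (N, 2) float arrays of matched pixel coordinates
    K          : 3x3 intrinsic matrix candidate (e.g., from an intrinsic search)

    Returns (R, t, inlier_ratio); a collapsing inlier ratio signals that the
    accurate-matching assumption, and hence the correction, is breaking down.
    """
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    if F is None:
        return None, None, 0.0
    inliers = mask.ravel().astype(bool)
    E = K.T @ F @ K                                  # essential matrix for this K guess
    _, R, t, _ = cv2.recoverPose(E, pts1[inliers], pts2[inliers], K)
    return R, t, float(inliers.sum()) / len(pts1)
```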

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same spacing-prior idea could be tested in other sensor-fusion settings where one cheap auxiliary measurement removes ambiguity that vision alone cannot resolve.
  • Because the corrections are online and intrinsic-search driven, the method may tolerate gradual camera-parameter drift that fixed-calibration pipelines cannot handle.
  • Replacing linear bundle-adjustment steps with the proposed anchor-based nonlinear fusion might reduce the need for frequent loop closures in very long trajectories.

Load-bearing premise

Feature matching is accurate enough for reliable epipolar geometry calculations that underpin the intrinsic and pose corrections.

What would settle it

Run the system on a multi-kilometer trajectory and measure whether final position error and map overlap remain bounded when the assistant eye and anchor fusion are removed versus when they are active.
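Such a comparison is usually scored with absolute trajectory error (ATE). Below is a minimal sketch of that metric under a closed-form Umeyama alignment; keeping the alignment rigid (no scale) means any residual scale drift the assistant eye fails to remove shows up in the number. This is a generic evaluation routine, not code from the paper.

```python
import numpy as np

def ate_rmse(est, gt, with_scale=False):
    """Absolute trajectory error (RMSE) after a closed-form Umeyama alignment.

    est, gt : (N, 3) estimated and ground-truth positions, time-associated.
    With with_scale=False the alignment is rigid (rotation + translation only),
    so any residual scale drift is charged to the error rather than absorbed.
    """
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g                      # centred trajectories
    U, S, Vt = np.linalg.svd(G.T @ E / len(est))      # cross-covariance SVD
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:                     # keep a proper rotation
        D[2, 2] = -1.0
    R = U @ D @ Vt
    s = (S * np.diag(D)).sum() / E.var(0).sum() if with_scale else 1.0
    aligned = s * (R @ est.T).T + (mu_g - s * (R @ mu_e))
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))
```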

Figures

Figures reproduced from arXiv: 2604.14795 by Fengyi Zhang, Hesheng Wang, Lin Zhang, Tianchen Deng, Tianjun Zhang.

Figure 1: Comparison of different VGFM-based incremental localization and mapping methods on kilometer-scale trajectories (without loop closure detection). While VGFMs generate locally coherent sub-maps, previous approaches relying on linear alignment (Sim3 or SL4) fail to model the non-linear error distribution, resulting in significant geometric misalignment, global map divergence, and scale drift. In contrast, ou…

Figure 2: System architecture of CAL2M. The pipeline begins by constructing sub-maps from the primary and assistant image streams using a plug-and-play VGFM backbone. The raw estimates undergo epipolar-guided intrinsic search and pose correction. Then, the scales of different sub-maps are unified via the constant spacing prior, and the global trajectory is optimized by our Joint Primary-Assistant PGO. Finally, the opt…

Figure 3: Comparison of different mapping paradigms. (a) Existing methods usually couple the estimation of scale, pose, and structure, causing cumulative non-linear errors. (b) CAL2M decouples these components, utilizing non-linear anchor alignment to ensure robust global consistency.

Figure 4: Qualitative comparison of 3D reconstruction results on KITTI-360 [66]. The figure displays Sequence 06 (top) and Sequence 09 (bottom). For each sequence, the columns correspond to (from left to right): Ground Truth (LiDAR), VGGT-SLAM [18], VGGT-Long [17], and our CAL2M. Within each sub-figure, the upper view illustrates the global point cloud, while the lower view provides a zoomed-in perspective of local …

Figure 5: Qualitative comparison of dense reconstruction results on Argoverse [67] Sequence 01 (top) and Sequence 04 (bottom). The figure displays results from six methods: DROID-SLAM [39], CUT3R [59], MASt3R-SLAM [58], VGGT-SLAM [18], VGGT-Long [17], and our CAL2M.

Figure 6: Qualitative evaluation of intrinsic estimation stability. The plots display the focal length estimation over time for KITTI-Odom [65] Seq. 00 (left) and KITTI-360 [66] Seq. 00 (right). The scatter points represent the fluctuating raw outputs from the VGFM. The green dashed line indicates the Ground Truth, the blue dashed line indicates the average focal length of all estimations, and the red line shows the…

Figure 7: Qualitative comparison of scale drift analysis on KITTI-360 [66]. The plots visualize the normalized scale factor evolution over sliding windows for Seq. 00 (top) and Seq. 05 (bottom). The green dashed line indicates the ideal ground truth scale (1.0). The blue curve represents VGGT-Long [17], which exhibits continuous drift, while the red curve shows CAL2M maintaining a stable scale around the ground trut…

Figure 8: Qualitative ablation study of mapping components on Argoverse [67] Seq. 03 (top) and Seq. 05 (bottom). Each sub-figure compares three settings: w/o Non-Linear Alignment, w/o Local Suppression, and the Full Model. The upper part of each panel displays the global point cloud, while the lower part provides a zoomed-in view of local details.
original abstract

Visual Geometry Foundation Models (VGFMs) demonstrate remarkable zero-shot capabilities in local reconstruction. However, deploying them for kilometer-level Simultaneous Localization and Mapping (SLAM) remains challenging. In such scenarios, current approaches mainly rely on linear transforms (e.g., Sim3 and SL4) for sub-map alignment, while we argue that a single linear transform is fundamentally insufficient to model the complex, non-linear geometric distortions inherent in VGFM outputs. Forcing such rigid alignment leads to the rapid accumulation of uncorrected residuals, eventually resulting in significant trajectory drift and map divergence. To address these limitations, we present CAL2M (Calibration-free Assistant-eye based Large-scale Localization and Mapping), a plug-and-play framework compatible with arbitrary VGFMs. Distinct from traditional systems, CAL2M introduces an "assistant eye" solely to leverage the prior of constant physical spacing, effectively eliminating scale ambiguity without any temporal or spatial pre-calibration. Furthermore, leveraging the assumption of accurate feature matching, we propose an epipolar-guided intrinsic and pose correction model. Supported by an online intrinsic search module, it can effectively rectify rotation and translation errors caused by inaccurate intrinsics through fundamental matrix decomposition. Finally, to ensure accurate mapping, we introduce a globally consistent mapping strategy based on anchor propagation. By constructing and fusing anchors across the trajectory, we establish a direct local-to-global mapping relationship. This enables the application of nonlinear transformations to elastically align sub-maps, effectively eliminating geometric misalignments and ensuring a globally consistent reconstruction. The source code of CAL2M will be publicly available at https://github.com/IRMVLab/CALM.
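The abstract's "online intrinsic search module" is not specified here. One plausible, purely illustrative reading: sweep focal-length candidates and score each by how close K^T F K comes to a valid essential matrix (two equal non-zero singular values, a classic self-calibration criterion). The sketch below is that reading, not the paper's module.

```python
import cv2
import numpy as np

def search_focal(pts1, pts2, cx, cy, candidates):
    """Score focal-length candidates by how close K^T F K is to an essential matrix.

    The fundamental matrix F is estimated once from pixel matches and does not
    depend on the intrinsics; a valid essential matrix must have two equal
    non-zero singular values, so each candidate focal length f is scored by the
    relative gap between the two largest singular values of K^T F K.
    """
    F, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    if F is None:
        raise ValueError("fundamental matrix estimation failed; too few or poor matches")
    best_f, best_cost = None, np.inf
    for f in candidates:
        K = np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])
        s = np.linalg.svd(K.T @ F @ K, compute_uv=False)   # singular values, descending
        cost = (s[0] - s[1]) / (s[0] + s[1])               # 0 when the top two are equal
        if cost < best_cost:
            best_f, best_cost = f, float(cost)
    return best_f, best_cost
```

On KITTI-style imagery one would sweep candidates around the nominal focal length; Figure 6 shows the per-frame fluctuation in raw VGFM focal estimates that such a correction has to smooth.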

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes CAL2M, a plug-and-play framework for calibration-free kilometer-level SLAM using Visual Geometry Foundation Models (VGFMs). It introduces an 'assistant eye' that leverages the prior of constant physical spacing to eliminate scale ambiguity without temporal or spatial pre-calibration. The core technical components are an epipolar-guided intrinsic and pose correction model (based on fundamental matrix decomposition from feature matches, supported by an online intrinsic search module) to rectify rotation/translation errors, and a globally consistent mapping strategy using anchor propagation to enable nonlinear transformations that align sub-maps and eliminate geometric misalignments. The authors argue that linear transforms (e.g., Sim3) are fundamentally insufficient for VGFM distortions and promise public code release.

Significance. If the empirical claims hold, the work could meaningfully advance zero-shot large-scale SLAM by removing calibration requirements and mitigating drift accumulation from VGFM geometric distortions. The assistant-eye scale resolution and anchor-based nonlinear alignment represent a distinct approach from standard linear sub-map fusion, with potential impact on robotics applications requiring kilometer-range mapping. The plug-and-play compatibility with arbitrary VGFMs is a practical strength.

major comments (2)
  1. [Abstract] Abstract (epipolar-guided correction paragraph): The claim that the epipolar-guided intrinsic/pose correction rectifies rotation and translation errors rests on the assumption of accurate feature matching for fundamental matrix decomposition. The manuscript itself states that VGFM outputs contain complex non-linear geometric distortions that linear transforms cannot model; these same distortions are likely to degrade match quality (repeatability, inlier rates). No inlier ratios, outlier statistics, or ablation isolating this module are provided, making the assumption load-bearing and unverified for the drift-elimination claim.
  2. [Abstract] Abstract (overall claims): The central assertions of kilometer-level accuracy, drift elimination, and effective nonlinear alignment via anchors lack any supporting quantitative results, error metrics (e.g., ATE, RPE, or scale drift), ablation studies, or baseline comparisons. Without these, it is impossible to assess whether the assistant-eye and anchor-propagation components actually compensate for residuals as described.
minor comments (2)
  1. [Title and Abstract] The acronym is given as CAL2M in the abstract but the title uses CALM; this should be clarified for consistency.
  2. [Abstract] The manuscript would benefit from an explicit reproducibility statement detailing the VGFM backbones tested and the exact physical spacing value used for the assistant eye.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Revisions have been made to strengthen the presentation of supporting evidence while preserving the core technical claims.

point-by-point responses
  1. Referee: [Abstract] Abstract (epipolar-guided correction paragraph): The claim that the epipolar-guided intrinsic/pose correction rectifies rotation and translation errors rests on the assumption of accurate feature matching for fundamental matrix decomposition. The manuscript itself states that VGFM outputs contain complex non-linear geometric distortions that linear transforms cannot model; these same distortions are likely to degrade match quality (repeatability, inlier rates). No inlier ratios, outlier statistics, or ablation isolating this module are provided, making the assumption load-bearing and unverified for the drift-elimination claim.

    Authors: We agree that match quality is a critical assumption. The full manuscript includes qualitative examples of feature matches before and after correction, but we acknowledge the absence of explicit inlier/outlier statistics and a dedicated ablation for the epipolar-guided module. In the revised manuscript we have added (i) inlier ratio tables across multiple VGFMs and sequences, (ii) an ablation isolating the online intrinsic search and fundamental-matrix correction, and (iii) a short discussion of how the search module mitigates distortion-induced outliers. These additions directly verify that sufficient inliers remain for drift reduction. revision: yes

  2. Referee: [Abstract] Abstract (overall claims): The central assertions of kilometer-level accuracy, drift elimination, and effective nonlinear alignment via anchors lack any supporting quantitative results, error metrics (e.g., ATE, RPE, or scale drift), ablation studies, or baseline comparisons. Without these, it is impossible to assess whether the assistant-eye and anchor-propagation components actually compensate for residuals as described.

    Authors: The abstract is a concise summary; the Experiments section (Sections 4–5) already contains the requested quantitative evaluation: ATE/RPE tables on kilometer-scale sequences, scale-drift plots, ablations on the assistant-eye prior and anchor-propagation strategy, and direct comparisons against Sim3/SL4-based linear fusion baselines. To improve readability we have revised the abstract to include a single sentence referencing these supporting metrics and have added a new summary table in the main text that highlights the key quantitative gains. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent modules

full rationale

The paper explicitly states the limitation of linear transforms for VGFM distortions and proposes three new components: an assistant eye using the physical prior of constant spacing to resolve scale without calibration, an epipolar-guided correction model that takes the assumption of accurate feature matching as input, and an anchor-propagation strategy for nonlinear sub-map alignment. None of these reduce by construction to their own outputs or to fitted parameters; the assumption is declared rather than derived from the system itself, and no self-citation chains or uniqueness theorems from prior author work are invoked as load-bearing in the provided text. The chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption of accurate feature matching and introduces the assistant-eye concept as a new mechanism for scale without calibration.

axioms (1)
  • domain assumption: accurate feature matching
    Invoked to support epipolar-guided intrinsic and pose correction
invented entities (1)
  • assistant eye (no independent evidence)
    purpose: provide constant physical spacing prior to eliminate scale ambiguity without calibration
    New hardware/software component introduced in the framework

pith-pipeline@v0.9.0 · 5610 in / 1213 out tokens · 28517 ms · 2026-05-10T11:12:00.497193+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

68 extracted references · 8 canonical work pages · 2 internal anchors

  1. [1]

    End-to-end autonomous driving without costly modularization and 3D manual annotation,

    M. Guo, Z. Zhang, Y. He, et al., “End-to-end autonomous driving without costly modularization and 3D manual annotation,” IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), Early Access, 2025

  2. [2]

    Benchmarking and improving bird’s eye view perception robustness in autonomous driving,

    S. Xie, L. Kong, W. Zhang, et al., “Benchmarking and improving bird’s eye view perception robustness in autonomous driving,” IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), vol. 47, no. 5, pp. 3878-3894, 2025

  3. [3]

    ORB-SLAM: A versatile and accurate monocular SLAM system,

    R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: A versatile and accurate monocular SLAM system,” IEEE Trans. Robot. (TRO), vol. 31, no. 5, pp. 1147-1163, 2015

  4. [4]

    ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,

    R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Trans. Robot. (TRO), vol. 33, no. 5, pp. 1255-1262, 2017

  5. [5]

    ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM,

    C. Campos, R. Elvira, J. J. G. Rodríguez, et al., “ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM,” IEEE Trans. Robot. (TRO), vol. 37, no. 6, pp. 1874-1890, 2021

  6. [6]

    SVO: Fast semi-direct monocular visual odometry,

    C. Forster, M. Pizzoli, and D. Scaramuzza, “SVO: Fast semi-direct monocular visual odometry,” in Proc. IEEE Int’l Conf. Robot. Autom. (ICRA), 2014, pp. 15-22

  7. [7]

    LSD-SLAM: Large-scale direct monocular SLAM,

    J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct monocular SLAM,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 834-849

  8. [8]

    Direct sparse odometry,

    J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), vol. 40, no. 3, pp. 611-625, 2017

  9. [9]

    LDSO: Direct sparse odometry with loop closure,

    X. Gao, R. Wang, N. Demmel, et al., “LDSO: Direct sparse odometry with loop closure,” in Proc. IEEE/RSJ Int’l Conf. Intell. Robots and Syst. (IROS), 2018, pp. 2198-2204

  10. [10]

    A step toward world models: A survey on robotic manipulation,

    P. Zhang, Y. Cheng, X. Sun, et al., “A step toward world models: A survey on robotic manipulation,” arXiv preprint arXiv:2511.02097, 2025

  11. [11]

    MBA-SLAM: Motion blur aware dense visual SLAM with radiance fields representation,

    P. Wang, L. Zhao, Y. Zhang, et al. “MBA-SLAM: Motion blur aware dense visual SLAM with radiance fields representation,” IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), Early Access, 2025

  12. [12]

    LN3Diff++: Scalable latent neural fields diffusion for speedy 3D generation,

    Y. Lan, F. Hong, S. Zhou, et al., “LN3Diff++: Scalable latent neural fields diffusion for speedy 3D generation,” IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), Early Access, 2025

  13. [13]

    DUSt3R: Geometric 3D vision made easy,

    S. Wang, V. Leroy, Y. Cabon, et al., “DUSt3R: Geometric 3D vision made easy,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 20697-20709

  14. [14]

    Grounding image matching in 3D with MASt3R,

    V. Leroy, Y. Cabon, and J. Revaud, “Grounding image matching in 3D with MASt3R,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2024, pp. 71-91

  15. [15]

    VGGT: Visual geometry grounded transformer,

    J. Wang, M. Chen, N. Karaev, et al., “VGGT: Visual geometry grounded transformer,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025, pp. 5294-5306

  16. [16]

    $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    Y. Wang, J. Zhou, H. Zhu, et al., “ π3: Permutation-equivariant visual geometry learning,” arXiv preprint arXiv:2507.13347, 2025

  17. [17]

    VGGT-Long: Chunk it, loop it, align it–pushing VGGT’s limits on kilometer-scale long RGB sequences,

    K. Deng, Z. Ti, J. Xu, et al., “VGGT-Long: Chunk it, loop it, align it–pushing VGGT’s limits on kilometer-scale long RGB sequences,” arXiv preprint arXiv:2507.16443, 2025

  18. [18]

    VGGT-SLAM: Dense RGB SLAM optimized on the SL(4) manifold,

    D. Maggio, H. Lim, and L. Carlone, “VGGT-SLAM: Dense RGB SLAM optimized on the SL(4) manifold,” arXiv preprint arXiv:2505.12549, 2025

  19. [19]

    Structure-from-motion revisited,

    J. L. Schönberger and J.M. Frahm, “Structure-from-motion revisited,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 4104-4113

  20. [20]

    Pixelwise view selection for unstructured multi-view stereo,

    J. L. Schönberger, E. Zheng, J. M. Frahm, et al., “Pixelwise view selection for unstructured multi-view stereo,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 501-518

  21. [21]

    Distinctive image features from scale-invariant keypoints,

    D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int’l J. Comput. Vis. (IJCV), vol. 60, no. 2, pp. 91-110, 2004

  22. [22]

    Stereo processing by semiglobal matching and mutual information,

    H. Hirschmuller, “Stereo processing by semiglobal matching and mutual information,” IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), vol. 30, no. 2, pp. 328-341, 2008

  23. [23]

    Accurate, dense, and robust multiview stereopsis,

    Y. Furukawa and J. Ponce, “Accurate, dense, and robust multiview stereopsis,” IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), vol. 32, no. 8, pp. 1362-1376, 2009

  24. [24]

    MVSNet: Depth inference for unstructured multi-view stereo,

    Y. Yao, Z. Luo, S. Li, et al., “MVSNet: Depth inference for unstructured multi-view stereo,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 767-783

  25. [25]

    Recurrent MVSNet for high-resolution multi-view stereo depth inference,

    Y. Yao, Z. Luo, S. Li, et al., “Recurrent MVSNet for high-resolution multi-view stereo depth inference,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 5525-5534

  26. [26]

    Cascade cost volume for high- resolution multi-view stereo and stereo matching,

    X. Gu, Z. Fan, S. Zhu, et al., “Cascade cost volume for high- resolution multi-view stereo and stereo matching,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 2495-2504

  27. [27]

    PatchmatchNet: Learned multi-view patchmatch stereo,

    F. Wang, S. Galliani, C. Vogel, et al., “PatchmatchNet: Learned multi-view patchmatch stereo,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 14194-14203

  28. [28]

    NeRF: Representing scenes as neural radiance fields for view synthesis,

    B. Mildenhall, P. P. Srinivasan, M. Tancik, et al., “NeRF: Representing scenes as neural radiance fields for view synthesis,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 405-421

  29. [29]

    Instant neural graphics primitives with a multiresolution hash encoding,

    T. Müller, A. Evans, C. Schied, et al., “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Trans. Graph. (TOG), vol. 41, no. 4, pp. 1-15, 2022

  30. [30]

    Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields,

    J. T. Barron, B. Mildenhall, M. Tancik, et al., “Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields,” in Proc. IEEE/CVF int’l Conf. Comput. Vis. (ICCV), 2021, pp. 5855-5864

  31. [31]

    PixelNeRF: Neural radiance fields from one or few images,

    A. Yu, V. Ye, M. Tancik, et al., “PixelNeRF: Neural radiance fields from one or few images,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 4578-4587

  32. [32]

    IBRNet: Learning multi-view image-based rendering,

    Q. Wang, Z. Wang, K. Genova, et al., “IBRNet: Learning multi-view image-based rendering,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 4690-4699

  33. [33]

    MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo,

    A. Chen, Z. Xu, F. Zhao, et al., “MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo,” in Proc. IEEE/CVF Int’l Conf. Comput. Vis. (ICCV), 2021, pp. 14124-14133

  34. [34]

    3D Gaussian splatting for real-time radiance field rendering,

    B. Kerbl, G. Kopanas, T. Leimkühler, et al., “3D Gaussian splatting for real-time radiance field rendering,” ACM Trans. Graph. (TOG), vol. 42, no. 4, pp. 139:1-139:14, 2023

  35. [35]

    MonoSLAM: Real-time single camera SLAM,

    A. J. Davison, I. D. Reid, N. D. Molton, et al., “MonoSLAM: Real-time single camera SLAM,” IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), vol. 29, no. 6, pp. 1052-1067, 2007

  36. [36]

    Parallel tracking and mapping for small AR workspaces,

    G. Klein and D. Murray, “Parallel tracking and mapping for small AR workspaces,” in Proc. IEEE/ACM Int’l Symp. Mixed Augmented Reality (ISMAR), 2007, pp. 225-234

  37. [37]

    S-PTAM: Stereo parallel tracking and mapping,

    T. Pire, T. Fischer, G. Castro, et al., “S-PTAM: Stereo parallel tracking and mapping,” Robot. Auton. Syst. (RAS), vol. 93, pp. 27-42, 2017

  38. [38]

    Large-scale direct SLAM with stereo cameras,

    J. Engel, J. Stückler, and D. Cremers, “Large-scale direct SLAM with stereo cameras,” in Proc. IEEE/RSJ Int’l Conf. Intell. Robots Syst. (IROS), 2015, pp. 1935-1942

  39. [39]

    DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras,

    Z. Teed and J. Deng, “DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2021, pp. 16558-16569

  40. [40]

    Deep patch visual odometry,

    Z. Teed, L. Lipson, and J. Deng, “Deep patch visual odometry,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2023, pp. 39033-39051

  41. [41]

    Deep patch visual SLAM,

    L. Lipson, Z. Teed, and J. Deng, “Deep patch visual SLAM,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2024, pp. 424-440

  42. [42]

    TartanVO: A generalizable learning-based VO,

    W. Wang, Y. Hu, and S. Scherer, “TartanVO: A generalizable learning-based VO,” in Proc. Conf. Robot Learning (CoRL), 2021, pp. 1761-1772

  43. [43]

    iMap: Implicit mapping and positioning in real-time,

    E. Sucar, S. Liu, J. Ortiz, et al., “iMap: Implicit mapping and positioning in real-time,” in Proc. IEEE/CVF Int’l Conf. Computer Vis. (ICCV), 2021, pp. 6229-6238

  44. [44]

    NICE-SLAM: Neural implicit scalable encoding for slam,

    Z. Zhu, S. Peng, V. Larsson, et al., “NICE-SLAM: Neural implicit scalable encoding for slam,” in Proc. IEEE/CVF Conf. Computer Vis. Pattern Recognit. (CVPR), 2022, pp. 12786-12796

  45. [45]

    Loopy-SLAM: Dense neural SLAM with loop closures,

    L. Liso, E. Sandström, V. Yugay, et al., “Loopy-SLAM: Dense neural SLAM with loop closures,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 20363-20373

  46. [46]

    MNE-SLAM: Multi-agent neural SLAM for mobile robots,

    T. Deng, G. Shen, C. Xun, et al., “MNE-SLAM: Multi-agent neural SLAM for mobile robots,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025, pp. 1485-1494

  47. [47]

    SplaTAM: Splat, track & map 3D Gaussians for dense RGB-D SLAM,

    N. Keetha, J. Karhade, K. M. Jatavallabhula, et al., “SplaTAM: Splat, track & map 3D Gaussians for dense RGB-D SLAM,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 21357-21366

  48. [48]

    GS-SLAM: Dense visual SLAM with 3D Gaussian splatting,

    C. Yan, D. Qu, D. Xu, et al., “GS-SLAM: Dense visual SLAM with 3D Gaussian splatting,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 19595-19604

  49. [49]

    Gaussian splatting SLAM,

    H. Matsuki, R. Murai, P. H. J. Kelly, et al., “Gaussian splatting SLAM,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 18039-18048

  50. [50]

    Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass,

    J. Yang, A. Sax, K. J. Liang, et al., “Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025, pp. 21924-21935

  51. [51]

    MUSt3R: Multi-view network for stereo 3D reconstruction,

    Y. Cabon, L. Stoffl, L. Antsfeld, et al., “MUSt3R: Multi-view network for stereo 3D reconstruction,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025, pp. 1050-1060

  52. [52]

    Reloc-VGGT: Visual re-localization with geometry grounded Transformer,

    T. Deng, W. Wu, K. Wu, et al., “Reloc-VGGT: Visual re-localization with geometry grounded Transformer,” arXiv preprint arXiv:2512.21883, 2025

  53. [53]

    UniPR-3D: Towards universal visual place recognition with visual geometry grounded Transformer,

    T. Deng, X. Chen, Z. Liu, et al., “UniPR-3D: Towards universal visual place recognition with visual geometry grounded Transformer,” arXiv preprint arXiv:2512.21078, 2025

  54. [54]

    What is the best 3D scene representation for robotics? From geometric to foundation models,

    T. Deng, Y. Pan, S. Yuan, et al., “What is the best 3D scene representation for robotics? From geometric to foundation models,” arXiv preprint arXiv:2512.03422, 2025

  55. [55]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    N. Keetha, N. Müller, J. Schönberger, et al., “MapAnything: Universal feed-forward metric 3D reconstruction,” arXiv preprint arXiv:2509.13414, 2025

  56. [56]

    3D reconstruction with spatial memory,

    H. Wang and L. Agapito, “3D reconstruction with spatial memory,” in Proc. Int’l Conf. 3D Vision (3DV), 2025, pp. 78-89

  57. [57]

    SLAM3R: Real-time dense scene reconstruction from monocular RGB videos,

    Y. Liu, S. Dong, S. Wang, et al., “SLAM3R: Real-time dense scene reconstruction from monocular RGB videos,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025, pp. 16651-16662

  58. [58]

    MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors,

    R. Murai, E. Dexheimer, and A. J. Davison, “MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025, pp. 16695-16705

  59. [59]

    Continuous 3D perception model with persistent state,

    Q. Wang, Y. Zhang, A. Holynski, et al., “Continuous 3D perception model with persistent state,” in Proc. Computer Vis. Pattern Recognit. (CVPR), 2025, pp. 10510-10522

  60. [60]

    Good features to track,

    J. Shi and C. Tomasi, “Good features to track,” in Proc. IEEE Conf. Computer Vis. Pattern Recognit. (CVPR), 1994, pp. 593-600

  61. [61]

    An iterative image registration technique with an application to stereo vision,

    B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in Proc. Int’l Joint Conf. Artif. Intell. (IJCAI), 1981, pp. 674-679

  62. [62]

    Borglab/GTSAM,

    F. Dellaert, and GTSAM Contributors, “Borglab/GTSAM,” https://github.com/borglab/gtsam, 2022

  63. [63]

    Optimal transport aggregation for visual place recognition,

    S. Izquierdo, J. Civera. “Optimal transport aggregation for visual place recognition,” in Proc. IEEE/CVF Conf. Computer Vis. Pattern Recognit. (CVPR), 2024, pp. 17658-17668

  64. [64]

    Principal warps: thin-plate splines and the decomposition of deformations,

    F. L. Bookstein, “Principal warps: thin-plate splines and the decomposition of deformations,” IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), vol. 11, no. 6, pp. 567-585, 1989

  65. [65]

    Are we ready for autonomous driving? The KITTI vision benchmark suite,

    A. Geiger, P. Lenz, R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Proc. IEEE Conf. Computer Vis. Pattern Recognit. (CVPR), 2012, pp. 3354-3361

  66. [66]

    KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d,

    Y. Liao, J. Xie, A. Geiger, “KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d,” IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), 2022, vol. 45, no. 3, pp. 3292-3310

  67. [67]

    Argoverse: 3d tracking and forecasting with rich maps,

    M. F. Chang, J. Lambert, P. Sangkloy, et al., “Argoverse: 3d tracking and forecasting with rich maps,” in Proc. IEEE/CVF Conf. Computer Vis. Pattern Recognit. (CVPR), 2019, pp. 8748-8757

    He is now pursuing his Ph.D. degree at the School of Electrical Engineering and Com- puter Science, The University of Queensland, Brisbane, Australia. His research focuses on data-driven 3D reconstruction for real-world visual understanding. Tianchen Deng (Graduate Student Member, IEEE) received the B.Eng. degree in control science and engineering from Ha...