arxiv: 2505.12549 · v2 · pith:7XBKXFOJnew · submitted 2025-05-18 · 💻 cs.CV

VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

Pith reviewed 2026-05-20 20:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords: VGGT-SLAM · dense SLAM · SL(4) manifold · projective ambiguity · homography optimization · uncalibrated monocular · submap alignment · loop closure

The pith

Optimizing 15-DoF homography transforms on the SL(4) manifold aligns VGGT submaps to recover consistent dense geometry from uncalibrated monocular video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that similarity transforms are insufficient for aligning submaps from uncalibrated cameras because scene reconstruction remains ambiguous up to a 15-degree-of-freedom projective transform. VGGT-SLAM therefore optimizes homography transforms between sequential submaps directly on the SL(4) manifold while enforcing loop closure constraints. This incremental global alignment produces coherent dense maps from long video sequences that cannot be processed by VGGT in a single forward pass due to memory limits. A reader would care because the method turns a feed-forward reconstruction network into a practical SLAM pipeline that works with ordinary monocular cameras.

Core claim

VGGT-SLAM recovers a consistent scene reconstruction across submaps by optimizing 15-degrees-of-freedom homography transforms between sequential submaps on the SL(4) manifold while accounting for potential loop closure constraints, addressing the projective reconstruction ambiguity that arises when cameras are uncalibrated and no assumptions are made about motion or scene structure.

What carries the argument

Optimization over the SL(4) manifold to estimate 15-DoF homography transforms between VGGT submaps that absorb the projective ambiguity of uncalibrated monocular views.
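
As a rough illustration of why a 4x4 homography carries 15 degrees of freedom, the sketch below rescales a matrix to unit determinant (16 entries minus one scale) and checks that the rescaling leaves the mapped 3D point unchanged. The helper names and the example matrix are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only; helper names and the matrix H are hypothetical.

def det(m):
    """Determinant of a square matrix (list of lists) by cofactor expansion."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j in range(len(m)):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


def normalize_to_sl4(h):
    """Rescale a 4x4 homography so that its determinant is 1.

    Assumes det(h) > 0: for a 4x4 matrix det(-h) == det(h), so a sign flip
    cannot repair a negative determinant.
    """
    d = det(h)
    if d <= 0:
        raise ValueError("expected a homography with positive determinant")
    scale = d ** 0.25
    return [[v / scale for v in row] for row in h]


def apply_homography(h, point):
    """Map a 3D point through h using homogeneous coordinates."""
    x = list(point) + [1.0]
    y = [sum(h[i][k] * x[k] for k in range(4)) for i in range(4)]
    return [y[0] / y[3], y[1] / y[3], y[2] / y[3]]


H = [
    [2.0, 0.1, 0.0, 0.5],
    [0.0, 2.0, 0.2, -0.3],
    [0.1, 0.0, 2.0, 1.0],
    [0.05, 0.0, 0.1, 2.0],
]
H_sl4 = normalize_to_sl4(H)
p = [1.0, 2.0, 3.0]
print(apply_homography(H, p))
print(apply_homography(H_sl4, p))
```

Both calls map the point to the same location, because the overall scale of a homography cancels when the homogeneous coordinate is divided out; fixing the determinant simply removes that redundant scale.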

If this is right

  • Dense maps can be built from video sequences longer than those VGGT can process at once without exceeding GPU memory.
  • Loop closure constraints integrate naturally into the same 15-DoF projective alignment.
  • Map quality improves over similarity-based submap alignment when cameras are uncalibrated.
  • The approach extends any feed-forward reconstruction method to incremental SLAM on ordinary monocular input.
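
The incremental alignment these points rely on can be sketched as a chain of relative transforms. The snippet below is a minimal illustration under an assumed convention (the absolute transform of submap k is the product of the relative transforms before it); the function names and toy matrices are hypothetical and do not come from the paper.

```python
# Hypothetical sketch: names, convention, and toy transforms are illustrative.

def identity():
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]


def matmul(a, b):
    """Product of two 4x4 matrices stored as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def chain(relative):
    """Compose relative transforms H_{k-1,k} into absolute ones H_{0,k}."""
    absolute = [identity()]
    for h in relative:
        absolute.append(matmul(absolute[-1], h))
    return absolute


# A unit translation along x (a special case of a homography, det = 1).
translate_x = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# A genuinely projective transform: the last row is not (0, 0, 0, 1); det = 1.
projective = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.1, 0.0, 0.0, 1.0],
]

absolute = chain([translate_x, projective])
for row in absolute[2]:
    print(row)
```

Because each relative transform only needs the two submaps it connects, the chain can grow with the video without reprocessing earlier frames. A loop closure adds one more relative transform between non-adjacent submaps.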

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar manifold optimization may be needed for any submap-based monocular pipeline that relies on uncalibrated feed-forward networks.
  • The method could be tested on other reconstruction backbones to measure how much the choice of manifold affects final accuracy.
  • Hybrid systems might switch between similarity and full projective alignments depending on whether camera intrinsics are known or estimated.

Load-bearing premise

Optimizing 15-DoF homography transforms on the SL(4) manifold between VGGT submaps, together with loop closures, is sufficient to recover a globally consistent scene geometry despite the projective reconstruction ambiguity of uncalibrated monocular cameras.

What would settle it

A direct comparison of global consistency error (against ground-truth geometry) between VGGT-SLAM and a similarity-transform baseline on a long monocular sequence whose length exceeds single-pass VGGT capacity; substantially lower error for the SL(4) version would support the claim.

Original abstract

We present VGGT-SLAM, a dense RGB SLAM system constructed by incrementally and globally aligning submaps created from the feed-forward scene reconstruction approach VGGT using only uncalibrated monocular cameras. While related works align submaps using similarity transforms (i.e., translation, rotation, and scale), we show that such approaches are inadequate in the case of uncalibrated cameras. In particular, we revisit the idea of reconstruction ambiguity, where given a set of uncalibrated cameras with no assumption on the camera motion or scene structure, the scene can only be reconstructed up to a 15-degrees-of-freedom projective transformation of the true geometry. This inspires us to recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degrees-of-freedom homography transforms between sequential submaps while accounting for potential loop closure constraints. As verified by extensive experiments, we demonstrate that VGGT-SLAM achieves improved map quality using long video sequences that are infeasible for VGGT due to its high GPU requirements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents VGGT-SLAM, a dense RGB SLAM system that incrementally constructs and globally aligns submaps generated by the feed-forward VGGT scene reconstruction method using only uncalibrated monocular video input. It addresses projective reconstruction ambiguity by optimizing 15-DoF homography transforms on the SL(4) manifold between sequential submaps while incorporating loop-closure constraints, claiming this yields improved map quality for long sequences infeasible for direct VGGT processing due to GPU limits.

Significance. If the central claim holds, the work extends feed-forward dense reconstruction to long uncalibrated sequences by handling the standard 15-DoF projective gauge freedom of uncalibrated SfM through manifold-constrained alignment. Optimizing explicitly on SL(4), rather than over similarity transforms, is a clear methodological contribution that follows directly from projective geometry; the approach could influence future monocular SLAM pipelines that combine learned submap generators with global consistency enforcement.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): The central claim of 'improved map quality' and 'extensive experiments' is supported only by qualitative statements; no quantitative metrics (e.g., absolute trajectory error, reconstruction completeness, or comparison tables against similarity-transform baselines) or error analysis are referenced, leaving the performance advantage over prior submap alignment methods unverifiable from the provided description.
  2. [§3.2] §3.2 (SL(4) Optimization): The formulation of the 15-DoF homography optimization on the SL(4) manifold is presented as sufficient to resolve inter-submap inconsistencies, but the manuscript does not specify how the objective function weights the loop-closure constraints relative to sequential alignments or whether the optimization is guaranteed to converge to a globally consistent gauge; this is load-bearing for the claim that projective ambiguity is fully mitigated.
minor comments (2)
  1. [§3] Notation for the SL(4) manifold and homography parameterization could be clarified with an explicit definition of the Lie algebra basis or retraction used in the optimizer.
  2. [Abstract] The abstract states that similarity transforms are 'inadequate' for uncalibrated cameras but does not cite the specific prior works that used them, which would help situate the contribution.
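
To make the first minor comment concrete, one possible explicit definition is sketched here: a basis of 15 traceless generators for the Lie algebra sl(4), and a retraction that updates a homography by right-multiplying with the matrix exponential of a tangent vector. The basis ordering, the right-multiplication convention, and the series-based exponential are assumptions chosen for illustration; the paper may define these differently.

```python
# Hypothetical sketch: basis ordering, right-multiplication, and the series
# exponential are illustration choices, not taken from the paper.

def identity():
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([r[:j] + r[j + 1:] for r in m[1:]])
               for j in range(len(m)))


def sl4_basis():
    """Return 15 traceless 4x4 generators spanning the Lie algebra sl(4)."""
    basis = []
    for i in range(4):
        for j in range(4):
            if i != j:  # 12 off-diagonal generators E_ij
                g = [[0.0] * 4 for _ in range(4)]
                g[i][j] = 1.0
                basis.append(g)
    for i in range(3):  # 3 diagonal generators E_ii - E_44
        g = [[0.0] * 4 for _ in range(4)]
        g[i][i], g[3][3] = 1.0, -1.0
        basis.append(g)
    return basis


def expm(a, terms=20, squarings=10):
    """Matrix exponential via scaling and squaring with a Taylor series."""
    scaled = [[v / 2 ** squarings for v in row] for row in a]
    result, term = identity(), identity()
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in matmul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


def retract(h, xi, basis):
    """Update h by a 15-vector xi: h <- h * exp(sum_k xi[k] * basis[k])."""
    a = [[sum(x * g[i][j] for x, g in zip(xi, basis)) for j in range(4)]
         for i in range(4)]
    return matmul(h, expm(a))


BASIS = sl4_basis()
xi = [0.01 * (k + 1) for k in range(15)]
H_new = retract(identity(), xi, BASIS)
print(det(H_new))
```

Since every generator is traceless, the exponential has unit determinant up to floating-point error, so the update stays on SL(4) without any re-normalization step.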

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and verifiability, and we address each point below.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claim of 'improved map quality' and 'extensive experiments' is supported only by qualitative statements; no quantitative metrics (e.g., absolute trajectory error, reconstruction completeness, or comparison tables against similarity-transform baselines) or error analysis are referenced, leaving the performance advantage over prior submap alignment methods unverifiable from the provided description.

    Authors: We agree that quantitative metrics would strengthen the central claim. The submitted manuscript emphasizes qualitative results from experiments on long sequences, but we will revise §4 to add a comparison table reporting absolute trajectory error (ATE), reconstruction completeness, and other metrics against similarity-transform baselines, along with a brief error analysis. This revision will make the advantages verifiable.
    revision: yes

  2. Referee: [§3.2] §3.2 (SL(4) Optimization): The formulation of the 15-DoF homography optimization on the SL(4) manifold is presented as sufficient to resolve inter-submap inconsistencies, but the manuscript does not specify how the objective function weights the loop-closure constraints relative to sequential alignments or whether the optimization is guaranteed to converge to a globally consistent gauge; this is load-bearing for the claim that projective ambiguity is fully mitigated.

    Authors: We acknowledge the need for greater specification here. We will revise §3.2 to explicitly describe the objective function as a weighted sum of sequential alignment and loop-closure terms (with equal weights in the reported experiments, or a tunable parameter), and to discuss convergence: the non-convex manifold optimization has no theoretical global guarantee but empirically reaches a consistent gauge that mitigates projective ambiguity, as supported by our results. A short paragraph on this will be added.
    revision: yes
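
A weighted-sum objective of the kind this response describes could look like the following sketch. It only evaluates the cost for fixed submap transforms; the squared Frobenius residual, the edge tuples, and the weight names w_seq and w_loop are assumptions made for illustration and are not the paper's formulation.

```python
# Hypothetical sketch: the Frobenius residual, the edge tuples, and the weight
# names are illustration choices, not the paper's formulation.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(tx):
    """Toy 4x4 transform: translation by tx along x (det = 1)."""
    return [
        [1.0, 0.0, 0.0, tx],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]


def edge_residual(h_i, h_j, h_ij):
    """Squared Frobenius distance between h_i * h_ij and the stored h_j."""
    pred = matmul(h_i, h_ij)
    return sum((pred[r][c] - h_j[r][c]) ** 2
               for r in range(4) for c in range(4))


def total_cost(nodes, seq_edges, loop_edges, w_seq=1.0, w_loop=1.0):
    """Weighted sum of sequential and loop-closure residuals."""
    seq = sum(edge_residual(nodes[i], nodes[j], h) for i, j, h in seq_edges)
    loop = sum(edge_residual(nodes[i], nodes[j], h) for i, j, h in loop_edges)
    return w_seq * seq + w_loop * loop


# Three submaps whose absolute transforms agree with the sequential edges.
nodes = [translation(0.0), translation(1.0), translation(2.0)]
seq_edges = [(0, 1, translation(1.0)), (1, 2, translation(1.0))]
# A loop closure from submap 2 back to submap 0 that disagrees by 0.1.
loop_edges = [(2, 0, translation(-1.9))]

cost_equal = total_cost(nodes, seq_edges, loop_edges)
cost_heavy = total_cost(nodes, seq_edges, loop_edges, w_loop=2.0)
print(cost_equal, cost_heavy)
```

In this toy case the sequential terms vanish and the whole cost comes from the loop closure, so doubling w_loop doubles the cost. A solver would then trade that error off against the sequential edges according to the two weights.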

Circularity Check

0 steps flagged

No significant circularity

The derivation rests on the standard 15-DoF projective ambiguity result from uncalibrated SfM literature, which is an external, independently established mathematical fact rather than a quantity fitted or defined within the paper. The choice to optimize 15-DoF homographies on the SL(4) manifold follows directly as the natural gauge-fixing step for aligning VGGT submaps; this is not a self-definitional reduction, a renamed empirical pattern, or a load-bearing self-citation chain. No equations or claims in the provided text reduce the central result to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach depends on the classical result that uncalibrated monocular reconstruction is ambiguous up to a 15-DoF projective transform; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Given a set of uncalibrated cameras with no assumption on camera motion or scene structure, the scene can only be reconstructed up to a 15-DoF projective transformation.
    This is explicitly invoked to justify replacing similarity transforms with SL(4) homographies.
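
For reference, the ambiguity this axiom refers to can be written in one line. The notation (cameras P_i, homogeneous scene points X_j, image points x_ij) is the usual multiple-view-geometry convention and is assumed here, not quoted from the paper.

```latex
x_{ij} \simeq P_i X_j = \left(P_i H^{-1}\right)\left(H X_j\right)
\quad \text{for any invertible } H \in \mathbb{R}^{4 \times 4}.
% H has 16 entries and acts up to scale, leaving 16 - 1 = 15 degrees of freedom.
% When \det H > 0, fixing the scale with \det H = 1 places H in SL(4).
```

Every such H yields cameras and points that reproduce the same images, so image measurements alone cannot distinguish the reconstructions.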

pith-pipeline@v0.9.0 · 5723 in / 1269 out tokens · 89407 ms · 2026-05-20T20:57:33.220923+00:00 · methodology

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

    cs.CV 2026-05 unverdicted novelty 7.0

    Stream3D is a training-free method that maintains temporal consistency in 3D generation from monocular streams by dynamically caching a fixed number of informative historical frames using an evidence score.

  2. Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

    cs.CV 2026-05 unverdicted novelty 7.0

    Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.

  3. Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model

    cs.RO 2026-05 unverdicted novelty 7.0

    A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.

  4. VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

    cs.CV 2026-05 unverdicted novelty 7.0

    VGGT-Edit proposes a native 3D text-conditioned editing framework using depth-synchronized injection and residual field prediction, plus the DeltaScene dataset, outperforming 2D-lifting methods.

  5. Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 7.0

    Ray-aware pointers that track both location and viewing direction enable adaptive retain-or-replace memory updates for more stable streaming 3D reconstruction.

  6. AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision

    cs.CV 2026-04 conditional novelty 7.0

    AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.

  7. Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye

    cs.RO 2026-04 unverdicted novelty 7.0

    CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.

  8. PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM

    cs.RO 2026-05 unverdicted novelty 6.0

    PRISM-SLAM achieves scale-aware metric SLAM from RGB input by anchoring VFM depth priors with Plücker ray-distance factors in a factor graph and using dynamic scene uncertainty gating, producing metric trajectories wh...

  9. Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    Ray-aware pointer memory with adaptive retain-or-replace updates enhances stability and accuracy in streaming 3D reconstruction.

  10. RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

    cs.CV 2026-04 unverdicted novelty 6.0

    RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph opti...

  11. Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

  12. ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging

    cs.CV 2026-04 conditional novelty 6.0

    ZeD-MAP integrates incremental cluster-based bundle adjustment with zero-shot diffusion depth estimation to deliver metrically consistent real-time depth maps from high-resolution UAV imagery.

  13. ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging

    cs.CV 2026-04 unverdicted novelty 6.0

    ZeD-MAP uses incremental bundle adjustment on image clusters to guide zero-shot diffusion depth estimation, delivering sub-meter accuracy (0.87 m XY, 0.12 m Z) at 1.5-5 seconds per image on high-resolution aerial data.

  14. Depth Anything 3: Recovering the Visual Space from Any Views

    cs.CV 2025-11 unverdicted novelty 6.0

    DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.

  15. PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving

    cs.CV 2025-07 unverdicted novelty 6.0

    PRIX presents an efficient camera-only planner with a novel CaRT module that matches larger multimodal models on NavSim and nuScenes while reducing model size and inference time.

  16. Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

    cs.CV 2025-07 unverdicted novelty 6.0

    Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.

  17. MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM

    cs.RO 2026-04 unverdicted novelty 5.0

    MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.

  18. Metric, inertially aligned monocular state estimation via kinetodynamic priors

    cs.RO 2025-11 unverdicted novelty 5.0

    The method combines a learned deformation model, continuous B-spline kinematics, and Newton's Second Law to enable accurate pose estimation and metric scale plus gravity recovery in monocular visual odometry on non-ri...

  19. SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors

    cs.CV 2025-11 unverdicted novelty 5.0

    SING3R-SLAM adds submap-level global alignment and reconstruction priors to a Gaussian map to reduce drift and improve local geometry in monocular indoor SLAM.

  20. TTT3R: 3D Reconstruction as Test-Time Training

    cs.CV 2025-09 unverdicted novelty 5.0

    TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.

  21. ViPE: Video Pose Engine for 3D Geometric Perception

    cs.CV 2025-08 unverdicted novelty 5.0

    ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.

  22. VGGT-SLAM++

    cs.CV 2026-04 unverdicted novelty 4.0

    VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.

  23. VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

    cs.CV 2025-07 conditional novelty 4.0

    VGGT-Long extends VGGT with chunking, overlap alignment, and loop closure to produce consistent kilometer-scale 3D reconstructions from monocular RGB sequences without retraining or extra supervision.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 21 Pith papers · 5 internal anchors

  1. [1]

    Abate, Y

    M. Abate, Y . Chang, N. Hughes, and L. Carlone. Kimera2: Robust and accurate metric-semantic SLAM in the real world. In Intl. Sym. on Experimental Robotics (ISER) , 2023

  2. [2]

    A tutorial on se(3) transformation parameterizations and on-manifold optimization

    Jose Luis Blanco. A tutorial on se(3) transformation parameterizations and on-manifold optimization. 09 2010

  3. [3]

    Accelerated coordinate encoding: Learning to relocalize in minutes using rgb and poses

    Eric Brachmann, Tommaso Cavallari, and Victor Adrian Prisacariu. Accelerated coordinate encoding: Learning to relocalize in minutes using rgb and poses. In CVPR, 2023

  4. [4]

    Visual camera re-localization from rgb and rgb-d images using dsac

    Eric Brachmann and Carsten Rother. Visual camera re-localization from rgb and rgb-d images using dsac. pami, 44(9):5847–5865, 2021

  5. [5]

    Bradley, T

    D. Bradley, T. Boubekeur, and W. Heidrich. Accurate multi-view reconstruction using robust binocular stereo and surface meshing. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) , pages 1–8, 2008

  6. [6]

    Cadena, L

    C. Cadena, L. Carlone, H. Carrillo, Y . Latif, D. Scaramuzza, J. Neira, I. Reid, and J.J. Leonard. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Robotics, 32(6):1309–1332, 2016

  7. [7]

    ORB- SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM

    Carlos Campos, Richard Elvira, Juan J Gómez Rodríguez, José MM Montiel, and Juan D Tardós. ORB- SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM. IEEE Trans. Robotics, 2021

  8. [8]

    PreF3R: Pose-Free Feed-Forward 3D Gaussian Splatting from Variable-length Image Sequence

    Zequn Chen, Jiezhi Yang, and Heng Yang. PreF3R: Pose-Free Feed-Forward 3D Gaussian Splatting from Variable-length Image Sequence. arXiv preprint arXiv:2411.16877, 2024

  9. [9]

    Presentations for 3-dimensional special linear groups over integer rings

    Marston Conder, Edmund Robertson, and Peter Williams. Presentations for 3-dimensional special linear groups over integer rings. Proceedings of the American Mathematical Society , 115(1):19–26, 1992

  10. [10]

    maplab 2.0–A modular and multi-modal mapping framework

    Andrei Cramariuc, Lukas Bernreiter, Florian Tschopp, Marius Fehr, Victor Reijgwart, Juan Nieto, Roland Siegwart, and Cesar Cadena. maplab 2.0–A modular and multi-modal mapping framework. IEEE Robotics and Automation Letters, 8(2):520–527, 2022

  11. [11]

    Czarnowski, T

    J. Czarnowski, T. Laidlow, R. Clark, and A. Davison. DeepFactors: Real-time probabilistic dense monocular SLAM. IEEE Robotics and Automation Letters , 5(2):721–728, 2020

  12. [12]

    Factor graphs and GTSAM: A hands-on introduction

    Frank Dellaert. Factor graphs and GTSAM: A hands-on introduction. Technical Report GT-RIM-CP&R- 2012-002, Georgia Institute of Technology, September 2012

  13. [13]

    Reloc3R: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization

    Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, and Yanchao Yang. Reloc3R: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization. arXiv preprint arXiv:2412.08376, 2024

  14. [14]

    MASt3R-SfM: a Fully-Integrated Solution for Unconstrained Structure-from-Motion

    Bardienus Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3R-SfM: a Fully-Integrated Solution for Unconstrained Structure-from-Motion. arXiv preprint arXiv:2409.19152, 2024

  15. [15]

    Lie groups for 2D and 3D transformations

    Ethan Eade. Lie groups for 2D and 3D transformations. URL http://ethaneade. com/lie. pdf, revised Dec , 117:118, 2013

  16. [16]

    Ebadi, L

    K. Ebadi, L. Bernreiter, H. Biggie, G. Catt, Y . Chang, A. Chatterjee, C.E. Denniston, S-P. Deschênes, K. Harlow, S. Khattak, L. Nogueira, M. Palieri, P. Petrá ˘cek, P. Petrlík, A. Reinke, V . Krátký, S. Zhao, A. Agha-mohammadi, K. Alexis, C. Heckman, K. Khosoussi, N. Kottege, B. Morrell, M. Hutter, F. Pauling, F. Pomerleau, M. Saska, S. Scherer, R. Siegw...

  17. [17]

    Fischler and R

    M. Fischler and R. Bolles. Random sample consensus: a paradigm for model fitting with application to image analysis and automated cartography. Commun. ACM, 24:381–395, 1981

  18. [18]

    Furukawa, B

    Y . Furukawa, B. Curless, S. M. Seitz, and R. Szeliski. Towards internet-scale multi-view stereo. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) , pages 1434–1441, 2010

  19. [19]

    A tutorial on graph-based SLAM

    Giorgio Grisetti, Rainer Kümmerle, Cyrill Stachniss, and Wolfram Burgard. A tutorial on graph-based SLAM. IEEE Intelligent Transportation Systems Magazine, 2(4):31–43, 2010

  20. [20]

    evo: Python package for the evaluation of odometry and SLAM

    Michael Grupp. evo: Python package for the evaluation of odometry and SLAM. https://github. com/MichaelGrupp/evo, 2017

  21. [21]

    Homography estimation on the special linear group based on direct point correspondence

    Tarek Hamel, Robert Mahony, Jochen Trumpf, Pascal Morin, and Minh-Duc Hua. Homography estimation on the special linear group based on direct point correspondence. In 2011 50th IEEE Conference on Decision and Control and European Control Conference , pages 7902–7908, 2011

  22. [22]

    An algorithm for self calibration from several views

    Richard Hartley. An algorithm for self calibration from several views. In cvpr, pages 908–912, 1994

  23. [23]

    Hartley and A

    R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000

  24. [24]

    R. I. Hartley. In defense of the eight-point algorithm. IEEE Trans. Pattern Anal. Machine Intell., 19(6):580– 593, June 1997. 10

  25. [25]

    R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004

  26. [26]

    Optimal transport aggregation for visual place recognition

    Sergio Izquierdo and Javier Civera. Optimal transport aggregation for visual place recognition. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) , June 2024

  27. [27]

    Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors

    Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, and Jerome Revaud. Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors. arXiv preprint arXiv:2503.17316, 2025

  28. [28]

    gradslam: Dense slam meets automatic differentiation

    Krishna Murthy Jatavallabhula, Ganesh Iyer, and Liam Paull. gradslam: Dense slam meets automatic differentiation. In IEEE Intl. Conf. on Robotics and Automation (ICRA) , pages 2130–2137. IEEE, 2020

  29. [29]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023

  30. [30]

    Koenderink and A.J

    J.J. Koenderink and A.J. vanDoorn. Affine structure from motion. Journal of the Optical Society of America A, 8(2):377–385, 1991

  31. [31]

    Quatro++: Robust global registration exploiting ground segmentation for loop closing in LiDAR SLAM

    Hyungtae Lim, Beomsoo Kim, Daebeom Kim, Eungchang Mason Lee, and Hyun Myung. Quatro++: Robust global registration exploiting ground segmentation for loop closing in LiDAR SLAM. Intl. J. of Robotics Research, pages 685–715, 2024

  32. [32]

    Deep patch visual SLAM

    Lahav Lipson, Zachary Teed, and Jia Deng. Deep patch visual SLAM. In European Conf. on Computer Vision (ECCV), pages 424–440, 2024

  33. [33]

    Parametric dense visual SLAM

    Steven Lovegrove. Parametric dense visual SLAM. PhD thesis, 2012

  34. [34]

    Real-time spherical mosaicing using whole image alignment

    Steven Lovegrove and Andrew J Davison. Real-time spherical mosaicing using whole image alignment. In eccv, pages 73–86. Springer, 2010

  35. [35]

    D.G. Lowe. Distinctive image features from scale-invariant keypoints. Intl. J. of Computer Vision , 60(2):91–110, 2004

  36. [36]

    B. D. Lucas and Takeo Kanade. An iterative image registration technique with an application in stereo vision. In Intl. Joint Conf. on AI (IJCAI) , pages 674–679, 1981

  37. [37]

    Maggio, Y

    D. Maggio, Y . Chang, N. Hughes, M. Trang, D. Griffith, C. Dougherty, E. Cristofalo, L. Schmid, and L. Carlone. Clio: Real-time task-driven open-set 3D scene graphs. IEEE Robotics and Automation Letters (RA-L), 9(10):8921–8928, 2024

  38. [38]

    Matas and O

    J. Matas and O. Chum. Randomized RANSAC with sequential probability ratio test. In Intl. Conf. on Computer Vision (ICCV), pages 1727–1732, 2005

  39. [39]

    Gaussian splatting SLAM

    Hidenobu Matsuki, Riku Murai, Paul HJ Kelly, and Andrew J Davison. Gaussian splatting SLAM. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) , pages 18039–18048, 2024

  40. [40]

    C. Mei, S. Benhimane, E. Malis, and P. Rives. Constrained multiple planar template tracking for central catadioptric cameras. In British Machine Vision Conf. (BMVC), September 2006

  41. [41]

    C. Mei, S. Benhimane, E. Malis, and P. Rives. Homography-based tracking for central catadioptric cameras. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS) , October 2006

  42. [42]

    C. Mei, S. Benhimane, E. Malis, and P. Rives. Efficient homography-based tracking and 3-D reconstruction for single-viewpoint sensors. IEEE Trans. Robotics, 24(6):1352–1364, Dec. 2008

  43. [43]

    Mouragnon, M

    E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd. 3d reconstruction of complex structures with bundle adjustment: an incremental approach. In IEEE Intl. Conf. on Robotics and Automation (ICRA) , pages 3055–3061, May 2006

  44. [44]

    MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors

    Riku Murai, Eric Dexheimer, and Andrew J Davison. MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors. arXiv preprint arXiv:2412.12392, 2024

  45. [45]

    An efficient solution to the five-point relative pose problem

    David Nistér. An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Anal. Machine Intell., 26(6):756–770, 2004

  46. [46]

    D. Nistér. A minimal solution to the generalised 3-point pose problem. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 560–567, 2004

  47. [47]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  48. [48]

    Global Structure-from-Motion Revisited

    Linfei Pan, Daniel Barath, Marc Pollefeys, and Johannes Lutz Schönberger. Global Structure-from-Motion Revisited. In European Conf. on Computer Vision (ECCV), 2024

  49. [49]

    The Weiszfeld Algorithm: Proof, Amendments, and Extensions , pages 357–389"

    Frank Plastria. The Weiszfeld Algorithm: Proof, Amendments, and Extensions , pages 357–389". Springer US, New York, NY , 2011

  50. [50]

    A General Optimization-based Framework for Global Pose Estimation with Multiple Sensors

    Tong Qin, Shaozu Cao, Jie Pan, and Shaojie Shen. A general optimization-based framework for global pose estimation with multiple sensors. arXiv preprint: 1901.03642, 2019

  51. [51]

    Vins-mono: A robust and versatile monocular visual-inertial state estimator

    Tong Qin, Peiliang Li, and Shaojie Shen. Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Transactions on Robotics, 34(4):1004–1020, 2018

  52. [52]

    Ranftl, A

    R. Ranftl, A. Bochkovskiy, and V . Koltun. Vision transformers for dense prediction. In Intl. Conf. on Computer Vision (ICCV), pages 12179–12188, 2021

  53. [53]

    An incremental trust-region method for robust online sparse least-squares estimation

    D.M. Rosen, M. Kaess, and J.J. Leonard. An incremental trust-region method for robust online sparse least-squares estimation. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 1262–1269, St. Paul, MN, May 2012

  54. [54]

    Kimera: an open-source library for real-time metric-semantic localization and mapping

    A. Rosinol, M. Abate, Y. Chang, and L. Carlone. Kimera: an open-source library for real-time metric-semantic localization and mapping. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 1689–1696, 2020. arXiv preprint arXiv:1910.02490

  55. [55]

    Kimera: from SLAM to spatial perception with 3D dynamic scene graphs

    A. Rosinol, A. Violette, M. Abate, N. Hughes, Y. Chang, J. Shi, A. Gupta, and L. Carlone. Kimera: from SLAM to spatial perception with 3D dynamic scene graphs. Intl. J. of Robotics Research, 40(12–14):1510–1546, 2021

  56. [56]

    Structure-from-motion revisited

    Johannes L Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016

  57. [57]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European Conf. on Computer Vision (ECCV), pages 501–518. Springer, 2016

  58. [58]

    Scene coordinate regression forests for camera relocalization in RGB-D images

    J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2930–2937, 2013

  59. [59]

    Systems and experiment paper: Construction of panoramic image mosaics with global and local alignment

    Heung-Yeung Shum and Richard Szeliski. Systems and experiment paper: Construction of panoramic image mosaics with global and local alignment. International Journal of Computer Vision, 36:101–130, 2000

  60. [60]

    Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

    Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3R: Zero-shot gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024

  61. [61]

    Quaternion kinematics for the error-state Kalman filter

    Joan Sola. Quaternion kinematics for the error-state Kalman filter. arXiv preprint arXiv:1711.02508, 2017

  62. [62]

    DynaVINS: A visual-inertial SLAM for dynamic environments

    Seungwon Song, Hyungtae Lim, Alex Junho Lee, and Hyun Myung. DynaVINS: A visual-inertial SLAM for dynamic environments. IEEE Robotics and Automation Letters, 7(4):11523–11530, 2022

  63. [63]

    A benchmark for the evaluation of RGB-D SLAM systems

    Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), pages 573–580. IEEE, 2012

  64. [64]

    DEEPV2D: Video to depth with differentiable structure from motion

    Zachary Teed and Jia Deng. DEEPV2D: Video to depth with differentiable structure from motion. Intl. Conf. on Learning Representations (ICLR), 2018

  65. [65]

    DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras

    Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems (NIPS), 2021

  66. [66]

    GeoCalib: Learning Single-image Calibration with Geometric Optimization

    Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. GeoCalib: Learning Single-image Calibration with Geometric Optimization. In European Conf. on Computer Vision (ECCV), pages 1–20. Springer, 2024

  67. [67]

    3D Reconstruction with Spatial Memory

    Hengyi Wang and Lourdes Agapito. 3D reconstruction with spatial memory. arXiv preprint arXiv:2408.16061, 2024

  68. [68]

    VGGT: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. arXiv preprint arXiv:2503.11651, 2025

  69. [69]

    Continuous 3D Perception Model with Persistent State

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3D Perception Model with Persistent State. arXiv preprint arXiv:2501.12387, 2025

  70. [70]

    DUSt3R: Geometric 3D vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024

  71. [71]

    Motion and structure from line correspondences: Closed-form solution, uniqueness, and optimization

    Juyang Weng, Thomas Huang, and Narendra Ahuja. Motion and structure from line correspondences: Closed-form solution, uniqueness, and optimization. IEEE Trans. Pattern Anal. Machine Intell., 14(3), 1992

  72. [72]

    ElasticFusion: Real-Time Dense SLAM and Light Source Estimation

    T. Whelan, R.F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger. ElasticFusion: Real-Time Dense SLAM and Light Source Estimation. 2016

  73. [73]

    GS-SLAM: Dense visual slam with 3d gaussian splatting

    Chi Yan, Delin Qu, Dan Xu, Bin Zhao, Zhigang Wang, Dong Wang, and Xuelong Li. GS-SLAM: Dense visual slam with 3d gaussian splatting. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 19595–19604, 2024

  74. [74]

    Sim-sync: From certifiably optimal synchronization over the 3D similarity group to scene reconstruction with learned depth

    Xihang Yu and Heng Yang. Sim-sync: From certifiably optimal synchronization over the 3D similarity group to scene reconstruction with learned depth. IEEE Robotics and Automation Letters, 2024

  75. [75]

    GO-SLAM: Global optimization for consistent 3D instant reconstruction

    Youmin Zhang, Fabio Tosi, Stefano Mattoccia, and Matteo Poggi. GO-SLAM: Global optimization for consistent 3D instant reconstruction. In Intl. Conf. on Computer Vision (ICCV), pages 3727–3737, 2023

  76. [76]

    Revisiting the PnP problem: A fast, general and optimal solution

    Yinqiang Zheng, Yubin Kuang, Shigeki Sugimoto, Kalle Astrom, and Masatoshi Okutomi. Revisiting the PnP problem: A fast, general and optimal solution. In Intl. Conf. on Computer Vision (ICCV), pages 2344–2351, 2013

  77. [77]

    A general and simple method for camera pose and focal length determination

    Yinqiang Zheng, Shigeki Sugimoto, Imari Sato, and Masatoshi Okutomi. A general and simple method for camera pose and focal length determination. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014

  78. [78]

    NICER-SLAM: Neural implicit scene encoding for RGB SLAM

    Zihan Zhu, Songyou Peng, Viktor Larsson, Zhaopeng Cui, Martin R Oswald, Andreas Geiger, and Marc Pollefeys. NICER-SLAM: Neural implicit scene encoding for RGB SLAM. In IEEE International Conference on 3D Vision (3DV), pages 42–52, 2024

A Tangent space of SL(4)

Here, we provide the explicit 15 generators, Gk ∀k : {1 : 15}, of SL(4), which allow us to r...

# of submaps (# of loops)

preventing an estimated alignment, and thus we do not include w = 1 for TUM. This is due to the reasons discussed in Sec. 6. In particular, for the floor scene a large portion of the images view only a planar scene, which makes estimation of the full 15-DOF homography matrix degenerate, and for the 360 scene, using a small submap size suc...
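
As a rough illustration of the tangent-space parameterization named in Appendix A, the sketch below builds one possible basis of 15 traceless 4×4 matrices and maps a 15-vector to an SL(4) element through the matrix exponential. The basis ordering, the function names (`sl4_generators`, `expm_taylor`, `retract`), and the right-multiplicative update are assumptions made for illustration; they are not taken from the paper, whose explicit generators are not reproduced here.

```python
import numpy as np


def sl4_generators():
    """Return one basis of 15 traceless 4x4 matrices (assumed ordering)."""
    gens = []
    # 12 off-diagonal elementary matrices E_ij with i != j.
    for i in range(4):
        for j in range(4):
            if i != j:
                G = np.zeros((4, 4))
                G[i, j] = 1.0
                gens.append(G)
    # 3 traceless diagonal matrices E_ii - E_(i+1)(i+1).
    for i in range(3):
        G = np.zeros((4, 4))
        G[i, i] = 1.0
        G[i + 1, i + 1] = -1.0
        gens.append(G)
    return gens


def expm_taylor(X, terms=20, squarings=8):
    """Matrix exponential via scaling and squaring with a truncated Taylor series."""
    A = X / (2 ** squarings)
    result = np.eye(X.shape[0])
    term = np.eye(X.shape[0])
    for k in range(1, terms + 1):
        term = term @ A / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result


def retract(H, xi):
    """Update H by a 15-vector xi: H @ exp(sum_k xi_k G_k) (assumed convention)."""
    X = sum(x * G for x, G in zip(xi, sl4_generators()))
    return H @ expm_taylor(X)


# Small random tangent-space step applied to the identity homography.
rng = np.random.default_rng(0)
xi = 0.1 * rng.standard_normal(15)
H_new = retract(np.eye(4), xi)
det = np.linalg.det(H_new)
```

Because every generator is traceless, det(exp(X)) = exp(tr X) = 1, so the updated matrix stays on SL(4) up to floating-point error.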