VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

Dominic Maggio; Hyungtae Lim; Luca Carlone

arxiv: 2505.12549 · v2 · pith:7XBKXFOJnew · submitted 2025-05-18 · 💻 cs.CV

Pith reviewed 2026-05-20 20:57 UTC · model grok-4.3

classification 💻 cs.CV

keywords: VGGT-SLAM · dense SLAM · SL(4) manifold · projective ambiguity · homography optimization · uncalibrated monocular · submap alignment · loop closure

The pith

Optimizing 15-DoF homography transforms on the SL(4) manifold aligns VGGT submaps to recover consistent dense geometry from uncalibrated monocular video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that similarity transforms are insufficient for aligning submaps from uncalibrated cameras because scene reconstruction remains ambiguous up to a 15-degree-of-freedom projective transform. VGGT-SLAM therefore optimizes homography transforms between sequential submaps directly on the SL(4) manifold while enforcing loop closure constraints. This incremental global alignment produces coherent dense maps from long video sequences that cannot be processed by VGGT in a single forward pass due to memory limits. A reader would care because the method turns a feed-forward reconstruction network into a practical SLAM pipeline that works with ordinary monocular cameras.

Core claim

VGGT-SLAM recovers a consistent scene reconstruction across submaps by optimizing 15-degrees-of-freedom homography transforms between sequential submaps on the SL(4) manifold while accounting for potential loop closure constraints, addressing the projective reconstruction ambiguity that arises when cameras are uncalibrated and no assumptions are made about motion or scene structure.

What carries the argument

Optimization over the SL(4) manifold to estimate 15-DoF homography transforms between VGGT submaps that absorb the projective ambiguity of uncalibrated monocular views.
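
To make the 15-DoF count concrete, here is a minimal NumPy sketch, not the authors' implementation. The helper names `to_sl4` and `apply_homography` are hypothetical. It shows that a 4x4 homography acts on 3D points only up to overall scale, so fixing the determinant to 1 removes one of the 16 matrix entries and leaves 15 degrees of freedom.

```python
import numpy as np

def to_sl4(H):
    """Rescale a 4x4 matrix with positive determinant so that det(H) = 1."""
    H = np.asarray(H, dtype=float)
    det = np.linalg.det(H)
    if det <= 0:
        raise ValueError("expected a 4x4 matrix with positive determinant")
    return H / det ** 0.25  # scaling by s multiplies the determinant by s**4

def apply_homography(H, points):
    """Apply a 4x4 homography to an (N, 3) array of 3D points."""
    points = np.asarray(points, dtype=float)
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    mapped = homogeneous @ H.T
    return mapped[:, :3] / mapped[:, 3:4]  # divide out the homogeneous coordinate

# Illustrative data only: a small random perturbation of the identity.
rng = np.random.default_rng(0)
H_raw = np.eye(4) + 0.05 * rng.standard_normal((4, 4))
H = to_sl4(H_raw)
points = rng.uniform(-1.0, 1.0, size=(5, 3))

# The raw and normalized matrices move the points identically.
same_up_to_scale = np.allclose(apply_homography(H_raw, points),
                               apply_homography(H, points))
```

A similarity transform is the special case in which the last row of the matrix is (0, 0, 0, 1) and the upper-left 3x3 block is a scaled rotation, which accounts for only 7 of these 15 degrees of freedom.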

If this is right

Dense maps can be built from video sequences longer than those VGGT can process at once without exceeding GPU memory.
Loop closure constraints integrate naturally into the same 15-DoF projective alignment.
Map quality improves over similarity-based submap alignment when cameras are uncalibrated.
The approach extends any feed-forward reconstruction method to incremental SLAM on ordinary monocular input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar manifold optimization may be needed for any submap-based monocular pipeline that relies on uncalibrated feed-forward networks.
The method could be tested on other reconstruction backbones to measure how much the choice of manifold affects final accuracy.
Hybrid systems might switch between similarity and full projective alignments depending on whether camera intrinsics are known or estimated.

Load-bearing premise

Optimizing 15-DoF homography transforms on the SL(4) manifold between VGGT submaps, together with loop closures, is sufficient to recover a globally consistent scene geometry despite the projective reconstruction ambiguity of uncalibrated monocular cameras.

What would settle it

A direct comparison of global consistency error (against ground-truth geometry) between VGGT-SLAM and a similarity-transform baseline on a long monocular sequence whose length exceeds single-pass VGGT capacity; substantially lower error for the SL(4) version would support the claim.
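
The shape of such a comparison can be sketched on synthetic data. The toy below is an assumption-laden illustration, not the paper's experiment, and it says nothing about real sequences. It distorts a random point cloud with a known projective transform, then fits a least-squares similarity (Umeyama) and a full 4x4 homography (direct linear transform) to the same correspondences. All function names are hypothetical.

```python
import numpy as np

def to_homogeneous(points):
    return np.hstack([points, np.ones((len(points), 1))])

def apply_homography(H, points):
    mapped = to_homogeneous(points) @ H.T
    return mapped[:, :3] / mapped[:, 3:4]

def fit_similarity(src, dst):
    """Least-squares similarity transform (Umeyama), returned as a 4x4 matrix."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0  # avoid returning a reflection
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / ((src_c ** 2).sum() / len(src))
    T = np.eye(4)
    T[:3, :3] = scale * R
    T[:3, 3] = mu_d - scale * R @ mu_s
    return T

def fit_homography(src, dst):
    """Direct linear transform for a 4x4 homography (at least 5 generic points)."""
    rows = []
    for X, Y in zip(to_homogeneous(src), dst):
        for k in range(3):
            row = np.zeros(16)
            row[4 * k: 4 * k + 4] = X   # coefficients of row k of H
            row[12:16] = -Y[k] * X      # coefficients of the last row of H
            rows.append(row)
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    return Vt[-1].reshape(4, 4)         # null vector, defined up to scale

def rms(a, b):
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))

# Synthetic ground truth: an affine perturbation plus a nonzero projective row.
rng = np.random.default_rng(0)
src = rng.uniform(-1.0, 1.0, size=(50, 3))
H_true = np.eye(4)
H_true[:3, :3] += 0.05 * rng.standard_normal((3, 3))
H_true[3, :3] = [0.10, -0.05, 0.08]
dst = apply_homography(H_true, src)

err_similarity = rms(apply_homography(fit_similarity(src, dst), src), dst)
err_homography = rms(apply_homography(fit_homography(src, dst), src), dst)
print(f"similarity RMS: {err_similarity:.4f}, homography RMS: {err_homography:.2e}")
```

On noise-free data the homography fit reproduces the distortion to numerical precision, while the similarity fit leaves a residual. A real evaluation would replace the synthetic cloud with submap overlaps and measure error against ground-truth geometry.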

Original abstract

We present VGGT-SLAM, a dense RGB SLAM system constructed by incrementally and globally aligning submaps created from the feed-forward scene reconstruction approach VGGT using only uncalibrated monocular cameras. While related works align submaps using similarity transforms (i.e., translation, rotation, and scale), we show that such approaches are inadequate in the case of uncalibrated cameras. In particular, we revisit the idea of reconstruction ambiguity, where given a set of uncalibrated cameras with no assumption on the camera motion or scene structure, the scene can only be reconstructed up to a 15-degrees-of-freedom projective transformation of the true geometry. This inspires us to recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degrees-of-freedom homography transforms between sequential submaps while accounting for potential loop closure constraints. As verified by extensive experiments, we demonstrate that VGGT-SLAM achieves improved map quality using long video sequences that are infeasible for VGGT due to its high GPU requirements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VGGT-SLAM aligns VGGT submaps via SL(4) optimization to fix projective ambiguity and handle longer monocular sequences, with a clear theoretical motivation but limited visible evidence on the actual gains.

Hi, the one or two things to know about this paper are that it splits long videos into VGGT submaps and aligns them by optimizing 15-DoF homographies on the SL(4) manifold instead of similarity transforms, and that this is presented as a way to produce consistent dense maps when direct VGGT use is too expensive on GPU memory. The approach follows directly from the standard projective ambiguity in uncalibrated SfM, so the choice to optimize over the full homography group makes sense on paper and lets them add loop closures without changing the core idea.

What the work does well is spell out why similarity-based stitching is insufficient here and then propose a manifold optimization that matches the 15 degrees of freedom. That step is not in the prior VGGT or submap-alignment literature they cite, and it targets a practical bottleneck for feed-forward reconstruction on extended sequences.

The soft spots are mostly in the evaluation. The abstract states that extensive experiments show improved map quality, yet no metrics, baselines, or error analysis appear in the provided material, which leaves the size of the improvement and the reliability of the optimization unclear. The assumption that homography alignment plus loops will recover globally consistent geometry is reasonable given the theory, but real drift from noisy submaps or convergence problems could still show up and would need checking in the full results.

This paper is aimed at researchers working on dense RGB SLAM and monocular reconstruction for robotics or AR. Readers who want to scale feed-forward methods to longer videos without massive compute would find the core idea useful. It deserves a serious referee because the technical step is grounded and addresses a real limitation even if the experiments need more detail. I recommend sending it for peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript presents VGGT-SLAM, a dense RGB SLAM system that incrementally constructs and globally aligns submaps generated by the feed-forward VGGT scene reconstruction method using only uncalibrated monocular video input. It addresses projective reconstruction ambiguity by optimizing 15-DoF homography transforms on the SL(4) manifold between sequential submaps while incorporating loop-closure constraints, claiming this yields improved map quality for long sequences infeasible for direct VGGT processing due to GPU limits.

Significance. If the central claim holds, the work would provide a practical extension of feed-forward dense reconstruction methods to long uncalibrated sequences by leveraging standard 15-DoF gauge freedom in SfM through manifold-constrained alignment. Explicit optimization on SL(4) rather than similarity transforms is a clear methodological contribution that directly follows from projective geometry; the approach could influence future monocular SLAM pipelines that combine learned submap generators with global consistency enforcement.

major comments (2)

[Abstract and §4] Abstract and §4 (Experiments): The central claim of 'improved map quality' and 'extensive experiments' is supported only by qualitative statements; no quantitative metrics (e.g., absolute trajectory error, reconstruction completeness, or comparison tables against similarity-transform baselines) or error analysis are referenced, leaving the performance advantage over prior submap alignment methods unverifiable from the provided description.
[§3.2] §3.2 (SL(4) Optimization): The formulation of the 15-DoF homography optimization on the SL(4) manifold is presented as sufficient to resolve inter-submap inconsistencies, but the manuscript does not specify how the objective function weights the loop-closure constraints relative to sequential alignments or whether the optimization is guaranteed to converge to a globally consistent gauge; this is load-bearing for the claim that projective ambiguity is fully mitigated.

minor comments (2)

[§3] Notation for the SL(4) manifold and homography parameterization could be clarified with an explicit definition of the Lie algebra basis or retraction used in the optimizer.
[Abstract] The abstract states that similarity transforms are 'inadequate' for uncalibrated cameras but does not cite the specific prior works that used them, which would help situate the contribution.
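
For readers unfamiliar with what the first minor comment asks for, the following is one standard way to write down a basis of the Lie algebra sl(4) and an exponential-map retraction. It is a sketch under stated assumptions: the material reviewed here does not say which basis or retraction the paper uses, and the names `sl4_basis`, `hat`, `expm`, and `retract` are hypothetical.

```python
import numpy as np

def sl4_basis():
    """Return 15 traceless 4x4 generators spanning the Lie algebra sl(4)."""
    basis = []
    for i in range(4):
        for j in range(4):
            if i != j:
                G = np.zeros((4, 4))
                G[i, j] = 1.0
                basis.append(G)              # 12 off-diagonal generators
    for i in range(3):
        G = np.zeros((4, 4))
        G[i, i], G[i + 1, i + 1] = 1.0, -1.0
        basis.append(G)                      # 3 traceless diagonal generators
    return basis

def hat(xi, basis):
    """Map a 15-vector of tangent coordinates to a traceless 4x4 matrix."""
    return sum(c * G for c, G in zip(xi, basis))

def expm(A, terms=20):
    """Matrix exponential via scaling and squaring with a truncated Taylor series."""
    norm = np.linalg.norm(A, ord=1)
    squarings = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    A_scaled = A / (2 ** squarings)
    result, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A_scaled / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result

def retract(H, xi, basis):
    """Step from H in SL(4) along tangent coordinates xi: H @ exp(hat(xi))."""
    return H @ expm(hat(xi, basis))

basis = sl4_basis()
rng = np.random.default_rng(0)
xi = 0.1 * rng.standard_normal(15)           # illustrative tangent step
H_next = retract(np.eye(4), xi, basis)
```

Because every generator is traceless, det(exp(A)) = exp(tr(A)) = 1, so each update stays on the manifold without an explicit re-normalization step.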

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and verifiability, and we address each point below.

Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claim of 'improved map quality' and 'extensive experiments' is supported only by qualitative statements; no quantitative metrics (e.g., absolute trajectory error, reconstruction completeness, or comparison tables against similarity-transform baselines) or error analysis are referenced, leaving the performance advantage over prior submap alignment methods unverifiable from the provided description.

Authors: We agree that quantitative metrics would strengthen the central claim. The submitted manuscript emphasizes qualitative results from experiments on long sequences, but we will revise §4 to add a comparison table reporting absolute trajectory error (ATE), reconstruction completeness, and other metrics against similarity-transform baselines, along with a brief error analysis. This revision will make the advantages verifiable.
Revision: yes

Referee: [§3.2] §3.2 (SL(4) Optimization): The formulation of the 15-DoF homography optimization on the SL(4) manifold is presented as sufficient to resolve inter-submap inconsistencies, but the manuscript does not specify how the objective function weights the loop-closure constraints relative to sequential alignments or whether the optimization is guaranteed to converge to a globally consistent gauge; this is load-bearing for the claim that projective ambiguity is fully mitigated.

Authors: We acknowledge the need for greater specification here. We will revise §3.2 to explicitly describe the objective function as a weighted sum of sequential alignment and loop-closure terms (with equal weights in the reported experiments, or a tunable parameter), and to discuss convergence: the non-convex manifold optimization has no theoretical global guarantee but empirically reaches a consistent gauge that mitigates projective ambiguity, as supported by our results. A short paragraph on this will be added.
Revision: yes
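
One plausible shape for the "weighted sum" described in this response, written in the usual pose-graph style, is sketched below. Every symbol is an assumption introduced for illustration and none is taken from the paper: $H_i \in \mathrm{SL}(4)$ is the homography attached to submap $i$, $\tilde{H}_{i,j}$ is a measured relative homography, $\mathcal{L}$ is the set of loop-closure pairs, and $w_{\mathrm{seq}}$, $w_{\mathrm{loop}}$ are the two weights.

```latex
\min_{H_1, \dots, H_N \in \mathrm{SL}(4)}
  \; w_{\mathrm{seq}} \sum_{i=1}^{N-1}
     \left\| \mathrm{Log}\!\left( \tilde{H}_{i,i+1}^{-1} \, H_i^{-1} H_{i+1} \right) \right\|^2
  \; + \; w_{\mathrm{loop}} \sum_{(i,j) \in \mathcal{L}}
     \left\| \mathrm{Log}\!\left( \tilde{H}_{i,j}^{-1} \, H_i^{-1} H_j \right) \right\|^2
```

Under this reading, "equal weights" corresponds to $w_{\mathrm{seq}} = w_{\mathrm{loop}}$. Each residual vanishes when the estimated relative transform $H_i^{-1} H_j$ agrees with its measurement, and the logarithm is only guaranteed to be well defined near the identity, which is one reason such a non-convex problem carries no global guarantee.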

Circularity Check

0 steps flagged

No significant circularity

The derivation rests on the standard 15-DoF projective ambiguity result from uncalibrated SfM literature, which is an external, independently established mathematical fact rather than a quantity fitted or defined within the paper. The choice to optimize 15-DoF homographies on the SL(4) manifold follows directly as the natural gauge-fixing step for aligning VGGT submaps; this is not a self-definitional reduction, a renamed empirical pattern, or a load-bearing self-citation chain. No equations or claims in the provided text reduce the central result to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach depends on the classical result that uncalibrated monocular reconstruction is ambiguous up to a 15-DoF projective transform; no new free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Given a set of uncalibrated cameras with no assumption on camera motion or scene structure, the scene can only be reconstructed up to a 15-DoF projective transformation.
This is explicitly invoked to justify replacing similarity transforms with SL(4) homographies.
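
The classical argument behind this axiom fits in one line. With notation introduced here for illustration, let $P_i$ be the $3 \times 4$ projection matrix of camera $i$, $X_j$ a homogeneous scene point, and $x_{ij}$ its image. For any invertible $4 \times 4$ matrix $H$:

```latex
x_{ij} \simeq P_i X_j = \left( P_i H^{-1} \right) \left( H X_j \right),
\qquad H \in \mathbb{R}^{4 \times 4}, \; \det H \neq 0
```

The cameras $P_i H^{-1}$ and points $H X_j$ therefore produce exactly the same images as the original ones, so image measurements alone cannot distinguish the two reconstructions. Since $H$ and $sH$ act identically for any nonzero scalar $s$, the ambiguity has $16 - 1 = 15$ degrees of freedom, and when $\det H > 0$ the representative with $\det H = 1$ lies in SL(4). A similarity transform (3 for rotation, 3 for translation, 1 for scale) covers only 7 of them.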

pith-pipeline@v0.9.0 · 5723 in / 1269 out tokens · 89407 ms · 2026-05-20T20:57:33.220923+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "we revisit the idea of reconstruction ambiguity... recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degrees-of-freedom homography transforms"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory
cs.CV · 2026-05 · unverdicted · novelty 7.0
Stream3D is a training-free method that maintains temporal consistency in 3D generation from monocular streams by dynamically caching a fixed number of informative historical frames using an evidence score.

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory
cs.CV · 2026-05 · unverdicted · novelty 7.0
Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.

Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model
cs.RO · 2026-05 · unverdicted · novelty 7.0
A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.

VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
cs.CV · 2026-05 · unverdicted · novelty 7.0
VGGT-Edit proposes a native 3D text-conditioned editing framework using depth-synchronized injection and residual field prediction, plus the DeltaScene dataset, outperforming 2D-lifting methods.

Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction
cs.CV · 2026-05 · unverdicted · novelty 7.0
Ray-aware pointers that track both location and viewing direction enable adaptive retain-or-replace memory updates for more stable streaming 3D reconstruction.

AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
cs.CV · 2026-04 · conditional · novelty 7.0
AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.

Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye
cs.RO · 2026-04 · unverdicted · novelty 7.0
CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.

PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM
cs.RO · 2026-05 · unverdicted · novelty 6.0
PRISM-SLAM achieves scale-aware metric SLAM from RGB input by anchoring VFM depth priors with Plücker ray-distance factors in a factor graph and using dynamic scene uncertainty gating, producing metric trajectories wh...

Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction
cs.CV · 2026-05 · unverdicted · novelty 6.0
Ray-aware pointer memory with adaptive retain-or-replace updates enhances stability and accuracy in streaming 3D reconstruction.

RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
cs.CV · 2026-04 · unverdicted · novelty 6.0
RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph opti...

Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
cs.CV · 2026-04 · unverdicted · novelty 6.0
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging
cs.CV · 2026-04 · conditional · novelty 6.0
ZeD-MAP integrates incremental cluster-based bundle adjustment with zero-shot diffusion depth estimation to deliver metrically consistent real-time depth maps from high-resolution UAV imagery.

ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging
cs.CV · 2026-04 · unverdicted · novelty 6.0
ZeD-MAP uses incremental bundle adjustment on image clusters to guide zero-shot diffusion depth estimation, delivering sub-meter accuracy (0.87 m XY, 0.12 m Z) at 1.5-5 seconds per image on high-resolution aerial data.

Depth Anything 3: Recovering the Visual Space from Any Views
cs.CV · 2025-11 · unverdicted · novelty 6.0
DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.

PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving
cs.CV · 2025-07 · unverdicted · novelty 6.0
PRIX presents an efficient camera-only planner with a novel CaRT module that matches larger multimodal models on NavSim and nuScenes while reducing model size and inference time.

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
cs.CV · 2025-07 · unverdicted · novelty 6.0
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.

MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM
cs.RO · 2026-04 · unverdicted · novelty 5.0
MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.

Metric, inertially aligned monocular state estimation via kinetodynamic priors
cs.RO · 2025-11 · unverdicted · novelty 5.0
The method combines a learned deformation model, continuous B-spline kinematics, and Newton's Second Law to enable accurate pose estimation and metric scale plus gravity recovery in monocular visual odometry on non-ri...

SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors
cs.CV · 2025-11 · unverdicted · novelty 5.0
SING3R-SLAM adds submap-level global alignment and reconstruction priors to a Gaussian map to reduce drift and improve local geometry in monocular indoor SLAM.

TTT3R: 3D Reconstruction as Test-Time Training
cs.CV · 2025-09 · unverdicted · novelty 5.0
TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.

ViPE: Video Pose Engine for 3D Geometric Perception
cs.CV · 2025-08 · unverdicted · novelty 5.0
ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.

VGGT-SLAM++
cs.CV · 2026-04 · unverdicted · novelty 4.0
VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.

VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences
cs.CV · 2025-07 · conditional · novelty 4.0
VGGT-Long extends VGGT with chunking, overlap alignment, and loop closure to produce consistent kilometer-scale 3D reconstructions from monocular RGB sequences without retraining or extra supervision.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 21 Pith papers · 5 internal anchors

[1]

Abate, Y

M. Abate, Y . Chang, N. Hughes, and L. Carlone. Kimera2: Robust and accurate metric-semantic SLAM in the real world. In Intl. Sym. on Experimental Robotics (ISER) , 2023

work page 2023
[2]

A tutorial on se(3) transformation parameterizations and on-manifold optimization

Jose Luis Blanco. A tutorial on se(3) transformation parameterizations and on-manifold optimization. 09 2010

work page 2010
[3]

Accelerated coordinate encoding: Learning to relocalize in minutes using rgb and poses

Eric Brachmann, Tommaso Cavallari, and Victor Adrian Prisacariu. Accelerated coordinate encoding: Learning to relocalize in minutes using rgb and poses. In CVPR, 2023

work page 2023
[4]

Visual camera re-localization from rgb and rgb-d images using dsac

Eric Brachmann and Carsten Rother. Visual camera re-localization from rgb and rgb-d images using dsac. pami, 44(9):5847–5865, 2021

work page 2021
[5]

Bradley, T

D. Bradley, T. Boubekeur, and W. Heidrich. Accurate multi-view reconstruction using robust binocular stereo and surface meshing. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) , pages 1–8, 2008

work page 2008
[6]

Cadena, L

C. Cadena, L. Carlone, H. Carrillo, Y . Latif, D. Scaramuzza, J. Neira, I. Reid, and J.J. Leonard. Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Trans. Robotics, 32(6):1309–1332, 2016

work page 2016
[7]

ORB- SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM

Carlos Campos, Richard Elvira, Juan J Gómez Rodríguez, José MM Montiel, and Juan D Tardós. ORB- SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM. IEEE Trans. Robotics, 2021

work page 2021
[8]

PreF3R: Pose-Free Feed-Forward 3D Gaussian Splatting from Variable-length Image Sequence

Zequn Chen, Jiezhi Yang, and Heng Yang. PreF3R: Pose-Free Feed-Forward 3D Gaussian Splatting from Variable-length Image Sequence. arXiv preprint arXiv:2411.16877, 2024

work page arXiv 2024
[9]

Presentations for 3-dimensional special linear groups over integer rings

Marston Conder, Edmund Robertson, and Peter Williams. Presentations for 3-dimensional special linear groups over integer rings. Proceedings of the American Mathematical Society , 115(1):19–26, 1992

work page 1992
[10]

maplab 2.0–A modular and multi-modal mapping framework

Andrei Cramariuc, Lukas Bernreiter, Florian Tschopp, Marius Fehr, Victor Reijgwart, Juan Nieto, Roland Siegwart, and Cesar Cadena. maplab 2.0–A modular and multi-modal mapping framework. IEEE Robotics and Automation Letters, 8(2):520–527, 2022

work page 2022
[11]

Czarnowski, T

J. Czarnowski, T. Laidlow, R. Clark, and A. Davison. DeepFactors: Real-time probabilistic dense monocular SLAM. IEEE Robotics and Automation Letters , 5(2):721–728, 2020

work page 2020
[12]

Factor graphs and GTSAM: A hands-on introduction

Frank Dellaert. Factor graphs and GTSAM: A hands-on introduction. Technical Report GT-RIM-CP&R- 2012-002, Georgia Institute of Technology, September 2012

work page 2012
[13]

Reloc3R: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization

Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, and Yanchao Yang. Reloc3R: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization. arXiv preprint arXiv:2412.08376, 2024

work page arXiv 2024
[14]

MASt3R-SfM: a Fully-Integrated Solution for Unconstrained Structure-from-Motion

Bardienus Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3R-SfM: a Fully-Integrated Solution for Unconstrained Structure-from-Motion. arXiv preprint arXiv:2409.19152, 2024

work page arXiv 2024
[15]

Lie groups for 2D and 3D transformations

Ethan Eade. Lie groups for 2D and 3D transformations. URL http://ethaneade. com/lie. pdf, revised Dec , 117:118, 2013

work page 2013
[16]

Ebadi, L

K. Ebadi, L. Bernreiter, H. Biggie, G. Catt, Y . Chang, A. Chatterjee, C.E. Denniston, S-P. Deschênes, K. Harlow, S. Khattak, L. Nogueira, M. Palieri, P. Petrá ˘cek, P. Petrlík, A. Reinke, V . Krátký, S. Zhao, A. Agha-mohammadi, K. Alexis, C. Heckman, K. Khosoussi, N. Kottege, B. Morrell, M. Hutter, F. Pauling, F. Pomerleau, M. Saska, S. Scherer, R. Siegw...

work page 2024
[17]

Fischler and R

M. Fischler and R. Bolles. Random sample consensus: a paradigm for model fitting with application to image analysis and automated cartography. Commun. ACM, 24:381–395, 1981

work page 1981
[18]

Furukawa, B

Y . Furukawa, B. Curless, S. M. Seitz, and R. Szeliski. Towards internet-scale multi-view stereo. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) , pages 1434–1441, 2010

work page 2010
[19]

A tutorial on graph-based SLAM

Giorgio Grisetti, Rainer Kümmerle, Cyrill Stachniss, and Wolfram Burgard. A tutorial on graph-based SLAM. IEEE Intelligent Transportation Systems Magazine, 2(4):31–43, 2010

work page 2010
[20]

evo: Python package for the evaluation of odometry and SLAM

Michael Grupp. evo: Python package for the evaluation of odometry and SLAM. https://github. com/MichaelGrupp/evo, 2017

work page 2017
[21]

Homography estimation on the special linear group based on direct point correspondence

Tarek Hamel, Robert Mahony, Jochen Trumpf, Pascal Morin, and Minh-Duc Hua. Homography estimation on the special linear group based on direct point correspondence. In 2011 50th IEEE Conference on Decision and Control and European Control Conference , pages 7902–7908, 2011

work page 2011
[22]

An algorithm for self calibration from several views

Richard Hartley. An algorithm for self calibration from several views. In cvpr, pages 908–912, 1994

work page 1994
[23]

Hartley and A

R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000

work page 2000
[24]

R. I. Hartley. In defense of the eight-point algorithm. IEEE Trans. Pattern Anal. Machine Intell., 19(6):580– 593, June 1997. 10

work page 1997
[25]

R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004

work page 2004
[26]

Optimal transport aggregation for visual place recognition

Sergio Izquierdo and Javier Civera. Optimal transport aggregation for visual place recognition. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) , June 2024

work page 2024
[27]

Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors

Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, and Jerome Revaud. Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors. arXiv preprint arXiv:2503.17316, 2025

work page arXiv 2025
[28]

gradslam: Dense slam meets automatic differentiation

Krishna Murthy Jatavallabhula, Ganesh Iyer, and Liam Paull. gradslam: Dense slam meets automatic differentiation. In IEEE Intl. Conf. on Robotics and Automation (ICRA) , pages 2130–2137. IEEE, 2020

work page 2020
[29]

3d gaussian splatting for real-time radiance field rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023

work page 2023
[30]

Koenderink and A.J

J.J. Koenderink and A.J. vanDoorn. Affine structure from motion. Journal of the Optical Society of America A, 8(2):377–385, 1991

work page 1991
[31]

Quatro++: Robust global registration exploiting ground segmentation for loop closing in LiDAR SLAM

Hyungtae Lim, Beomsoo Kim, Daebeom Kim, Eungchang Mason Lee, and Hyun Myung. Quatro++: Robust global registration exploiting ground segmentation for loop closing in LiDAR SLAM. Intl. J. of Robotics Research, pages 685–715, 2024

work page 2024
[32]

Deep patch visual SLAM

Lahav Lipson, Zachary Teed, and Jia Deng. Deep patch visual SLAM. In European Conf. on Computer Vision (ECCV), pages 424–440, 2024

work page 2024
[33]

Parametric dense visual SLAM

Steven Lovegrove. Parametric dense visual SLAM. PhD thesis, 2012

work page 2012
[34]

Real-time spherical mosaicing using whole image alignment

Steven Lovegrove and Andrew J Davison. Real-time spherical mosaicing using whole image alignment. In eccv, pages 73–86. Springer, 2010

work page 2010
[35]

D.G. Lowe. Distinctive image features from scale-invariant keypoints. Intl. J. of Computer Vision , 60(2):91–110, 2004

work page 2004
[36]

B. D. Lucas and Takeo Kanade. An iterative image registration technique with an application in stereo vision. In Intl. Joint Conf. on AI (IJCAI) , pages 674–679, 1981

work page 1981
[37]

Maggio, Y

D. Maggio, Y . Chang, N. Hughes, M. Trang, D. Griffith, C. Dougherty, E. Cristofalo, L. Schmid, and L. Carlone. Clio: Real-time task-driven open-set 3D scene graphs. IEEE Robotics and Automation Letters (RA-L), 9(10):8921–8928, 2024

work page 2024
[38]

Matas and O

J. Matas and O. Chum. Randomized RANSAC with sequential probability ratio test. In Intl. Conf. on Computer Vision (ICCV), pages 1727–1732, 2005

work page 2005
[39]

Gaussian splatting SLAM

Hidenobu Matsuki, Riku Murai, Paul HJ Kelly, and Andrew J Davison. Gaussian splatting SLAM. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) , pages 18039–18048, 2024

work page 2024
[40]

C. Mei, S. Benhimane, E. Malis, and P. Rives. Constrained multiple planar template tracking for central catadioptric cameras. In British Machine Vision Conf. (BMVC), September 2006

work page 2006
[41]

C. Mei, S. Benhimane, E. Malis, and P. Rives. Homography-based tracking for central catadioptric cameras. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS) , October 2006

work page 2006
[42]

C. Mei, S. Benhimane, E. Malis, and P. Rives. Efficient homography-based tracking and 3-D reconstruction for single-viewpoint sensors. IEEE Trans. Robotics, 24(6):1352–1364, Dec. 2008

work page 2008
[43]

E. Mouragnon, M. Lhuillier, M. Dhome, F. Dekeyser, and P. Sayd. 3D reconstruction of complex structures with bundle adjustment: an incremental approach. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 3055–3061, May 2006
[44]

Riku Murai, Eric Dexheimer, and Andrew J Davison. MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors. arXiv preprint arXiv:2412.12392, 2024
[45]

David Nistér. An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Anal. Machine Intell., 26(6):756–770, 2004
[46]

D. Nistér. A minimal solution to the generalised 3-point pose problem. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 560–567, 2004

[47]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023
[48]

Linfei Pan, Daniel Barath, Marc Pollefeys, and Johannes Lutz Schönberger. Global Structure-from-Motion Revisited. In European Conf. on Computer Vision (ECCV), 2024
[49]

Frank Plastria. The Weiszfeld Algorithm: Proof, Amendments, and Extensions, pages 357–389. Springer US, New York, NY, 2011
[50]

Tong Qin, Shaozu Cao, Jie Pan, and Shaojie Shen. A general optimization-based framework for global pose estimation with multiple sensors. arXiv preprint arXiv:1901.03642, 2019
[51]

Tong Qin, Peiliang Li, and Shaojie Shen. VINS-Mono: A robust and versatile monocular visual-inertial state estimator. IEEE Transactions on Robotics, 34(4):1004–1020, 2018
[52]

R. Ranftl, A. Bochkovskiy, and V. Koltun. Vision transformers for dense prediction. In Intl. Conf. on Computer Vision (ICCV), pages 12179–12188, 2021
[53]

D.M. Rosen, M. Kaess, and J.J. Leonard. An incremental trust-region method for robust online sparse least-squares estimation. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 1262–1269, St. Paul, MN, May 2012
[54]

A. Rosinol, M. Abate, Y. Chang, and L. Carlone. Kimera: an open-source library for real-time metric-semantic localization and mapping. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 1689–1696, 2020. arXiv preprint arXiv:1910.02490
[55]

A. Rosinol, A. Violette, M. Abate, N. Hughes, Y. Chang, J. Shi, A. Gupta, and L. Carlone. Kimera: from SLAM to spatial perception with 3D dynamic scene graphs. Intl. J. of Robotics Research, 40(12–14):1510–1546, 2021
[56]

Johannes L Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016
[57]

Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European Conf. on Computer Vision (ECCV), pages 501–518. Springer, 2016
[58]

J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2930–2937, 2013
[59]

Heung-Yeung Shum and Richard Szeliski. Systems and experiment paper: Construction of panoramic image mosaics with global and local alignment. International Journal of Computer Vision, 36:101–130, 2000
[60]

Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3R: Zero-shot gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024
[61]

Joan Sola. Quaternion kinematics for the error-state Kalman filter. arXiv preprint arXiv:1711.02508, 2017
[62]

Seungwon Song, Hyungtae Lim, Alex Junho Lee, and Hyun Myung. DynaVINS: A visual-inertial SLAM for dynamic environments. IEEE Robotics and Automation Letters, 7(4):11523–11530, 2022
[63]

Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), pages 573–580. IEEE, 2012
[64]

Zachary Teed and Jia Deng. DEEPV2D: Video to depth with differentiable structure from motion. Intl. Conf. on Learning Representations (ICLR), 2018
[65]

Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems (NIPS), 2021
[66]

Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. GeoCalib: Learning Single-image Calibration with Geometric Optimization. In European Conf. on Computer Vision (ECCV), pages 1–20. Springer, 2024
[67]

Hengyi Wang and Lourdes Agapito. 3D reconstruction with spatial memory. arXiv preprint arXiv:2408.16061, 2024
[68]

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. arXiv preprint arXiv:2503.11651, 2025
[69]

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3D Perception Model with Persistent State. arXiv preprint arXiv:2501.12387, 2025
[70]

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUST3R: Geometric 3D vision made easy. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024
[71]

Juyang Weng, Thomas Huang, and Narendra Ahuja. Motion and structure from line correspondences: Closed-form solution, uniqueness, and optimization. IEEE Trans. Pattern Anal. Machine Intell., 14(3), 1992
[72]

T. Whelan, R.F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger. ElasticFusion: Real-Time Dense SLAM and Light Source Estimation. 2016
[73]

Chi Yan, Delin Qu, Dan Xu, Bin Zhao, Zhigang Wang, Dong Wang, and Xuelong Li. GS-SLAM: Dense visual SLAM with 3D Gaussian splatting. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 19595–19604, 2024
[74]

Xihang Yu and Heng Yang. Sim-sync: From certifiably optimal synchronization over the 3D similarity group to scene reconstruction with learned depth. IEEE Robotics and Automation Letters, 2024
[75]

Youmin Zhang, Fabio Tosi, Stefano Mattoccia, and Matteo Poggi. GO-SLAM: Global optimization for consistent 3D instant reconstruction. In Intl. Conf. on Computer Vision (ICCV), pages 3727–3737, 2023
[76]

Yinqiang Zheng, Yubin Kuang, Shigeki Sugimoto, Kalle Astrom, and Masatoshi Okutomi. Revisiting the PnP problem: A fast, general and optimal solution. In Intl. Conf. on Computer Vision (ICCV), pages 2344–2351, 2013
[77]

Yinqiang Zheng, Shigeki Sugimoto, Imari Sato, and Masatoshi Okutomi. A general and simple method for camera pose and focal length determination. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014
[78]

Zihan Zhu, Songyou Peng, Viktor Larsson, Zhaopeng Cui, Martin R Oswald, Andreas Geiger, and Marc Pollefeys. NICER-SLAM: Neural implicit scene encoding for RGB SLAM. In IEEE International Conference on 3D Vision (3DV), pages 42–52, 2024.

A Tangent space of SL(4)

Here, we provide the explicit 15 generators, G_k, k ∈ {1, …, 15}, of SL(4), which allow us to r...
[79]

# of submaps (# of loops)

preventing an estimated alignment, and thus we do not include w = 1 for TUM. This is due to the reasons discussed in Sec. 6. In particular, for the floor scene a large portion of the images view only a planar scene, which makes the estimation of the full 15-DOF homography matrix degenerate, and for the 360 scene, using a small submap size suc...

