VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold
Pith reviewed 2026-05-20 20:57 UTC · model grok-4.3
The pith
Optimizing 15-DoF homography transforms on the SL(4) manifold aligns VGGT submaps to recover consistent dense geometry from uncalibrated monocular video.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VGGT-SLAM recovers a consistent scene reconstruction across submaps by optimizing 15-degrees-of-freedom homography transforms between sequential submaps on the SL(4) manifold while accounting for potential loop closure constraints, addressing the projective reconstruction ambiguity that arises when cameras are uncalibrated and no assumptions are made about motion or scene structure.
What carries the argument
Optimization over the SL(4) manifold to estimate 15-DoF homography transforms between VGGT submaps that absorb the projective ambiguity of uncalibrated monocular views.
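To make the "15 degrees of freedom on SL(4)" statement concrete, here is a minimal pure-Python sketch. It assumes one standard basis of the Lie algebra sl(4) (12 off-diagonal unit matrices plus 3 traceless diagonal ones); this is not necessarily the set of generators the paper uses, and the helper names (`sl4_generators`, `hat`, `expm`, `retract`) are hypothetical. It shows that updating a homography by the exponential of a traceless matrix keeps its determinant at 1, so an optimizer stepping this way stays on the manifold.

```python
import random

N = 4

def zeros():
    return [[0.0] * N for _ in range(N)]

def identity():
    return [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def sl4_generators():
    """One valid basis of sl(4): 12 off-diagonal units + 3 traceless diagonals."""
    gens = []
    for i in range(N):
        for j in range(N):
            if i != j:
                g = zeros()
                g[i][j] = 1.0
                gens.append(g)
    for i in range(N - 1):
        g = zeros()
        g[i][i], g[i + 1][i + 1] = 1.0, -1.0
        gens.append(g)
    return gens

def hat(xi, gens):
    """Map a 15-vector of coefficients to a traceless 4x4 matrix."""
    a = zeros()
    for c, g in zip(xi, gens):
        for i in range(N):
            for j in range(N):
                a[i][j] += c * g[i][j]
    return a

def expm(a, squarings=8, terms=12):
    """Matrix exponential via scaling and squaring with a truncated Taylor series."""
    scale = 2.0 ** squarings
    s = [[x / scale for x in row] for row in a]
    result, term = identity(), identity()
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, s)]
        result = [[result[i][j] + term[i][j] for j in range(N)] for i in range(N)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    d = 1.0
    for c in range(N):
        p = max(range(c, N), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-15:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d
        d *= m[c][c]
        for r in range(c + 1, N):
            f = m[r][c] / m[c][c]
            for k in range(c, N):
                m[r][k] -= f * m[c][k]
    return d

def retract(h, xi, gens):
    """Right-multiplicative update H <- H * exp(sum_k xi_k G_k)."""
    return matmul(h, expm(hat(xi, gens)))

gens = sl4_generators()
rng = random.Random(0)
xi = [rng.uniform(-0.3, 0.3) for _ in gens]   # arbitrary small tangent vector
H = retract(identity(), xi, gens)
print(len(gens), det(H))
```

The determinant stays at 1 because det(exp(A)) = exp(tr(A)) and every generator is traceless; a 4x4 matrix has 16 entries, and fixing the determinant removes one, which leaves the 15 degrees of freedom.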
If this is right
- Dense maps can be built from video sequences longer than those VGGT can process at once without exceeding GPU memory.
- Loop closure constraints integrate naturally into the same 15-DoF projective alignment.
- Map quality improves over similarity-based submap alignment when cameras are uncalibrated.
- The approach extends any feed-forward reconstruction method to incremental SLAM on ordinary monocular input.
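The contrast between similarity and projective alignment in the list above can be illustrated with a small sketch. Assuming exactly five 3D point correspondences in general position and a normalization that fixes the bottom-right entry of the homography to 1, a 4x4 projective map can be recovered by solving a 15x15 linear system. This is a textbook direct linear estimate written for illustration, not the estimator used in the paper; `solve`, `apply_h`, `fit_homography_3d`, and all numeric values are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    Assumes A is square and nonsingular (true for points in general position)."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def apply_h(H, p):
    """Apply a 4x4 homography to a 3D point and dehomogenize."""
    X = [p[0], p[1], p[2], 1.0]
    y = [sum(h * v for h, v in zip(row, X)) for row in H]
    return [y[0] / y[3], y[1] / y[3], y[2] / y[3]]

def fit_homography_3d(src, dst):
    """Estimate a 4x4 homography (with h44 fixed to 1) from exactly 5 point pairs."""
    A, b = [], []
    for (x, y, z), target in zip(src, dst):
        X = [x, y, z, 1.0]
        for axis in range(3):
            # h_axis . X - t * (h41 x + h42 y + h43 z) = t
            row = [0.0] * 15
            row[4 * axis: 4 * axis + 4] = X
            row[12:15] = [-target[axis] * x, -target[axis] * y, -target[axis] * z]
            A.append(row)
            b.append(target[axis])
    h = solve(A, b) + [1.0]
    return [h[0:4], h[4:8], h[8:12], h[12:16]]

# Made-up projective map; its last row is not (0, 0, 0, 1), so no similarity
# or affine transform can reproduce it.
H_true = [[1.1, 0.2, 0.0, 0.3],
          [-0.1, 0.9, 0.1, -0.2],
          [0.0, 0.1, 1.2, 0.4],
          [0.02, -0.01, 0.03, 1.0]]

# Five points with no four coplanar.
src = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
       (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
dst = [apply_h(H_true, p) for p in src]

H_fit = fit_homography_3d(src, dst)
check_point = (0.4, -0.3, 2.0)
print(apply_h(H_fit, check_point), apply_h(H_true, check_point))
```

A similarity transform has 7 parameters and always maps the plane at infinity to itself, so it cannot absorb the non-trivial last row that appears here; in practice one would use many more correspondences and a robust estimator rather than a minimal five-point solve.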
Where Pith is reading between the lines
- Similar manifold optimization may be needed for any submap-based monocular pipeline that relies on uncalibrated feed-forward networks.
- The method could be tested on other reconstruction backbones to measure how much the choice of manifold affects final accuracy.
- Hybrid systems might switch between similarity and full projective alignments depending on whether camera intrinsics are known or estimated.
Load-bearing premise
Optimizing 15-DoF homography transforms on the SL(4) manifold between VGGT submaps, together with loop closures, is sufficient to recover a globally consistent scene geometry despite the projective reconstruction ambiguity of uncalibrated monocular cameras.
What would settle it
A direct comparison of global consistency error (against ground-truth geometry) between VGGT-SLAM and a similarity-transform baseline on a long monocular sequence whose length exceeds single-pass VGGT capacity; substantially lower error for the SL(4) version would support the claim.
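One common way to score such a comparison is the root-mean-square absolute trajectory error. The sketch below assumes the estimated and ground-truth positions have already been associated frame by frame and aligned into a common frame, which is itself a modeling choice for an uncalibrated pipeline. The function name `ate_rmse` and the toy trajectories are made up for illustration and are not results from the paper.

```python
import math

def ate_rmse(estimated, ground_truth):
    """RMS translational error between two already-aligned position sequences."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [sum((e - g) ** 2 for e, g in zip(p, q))
               for p, q in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))

# Toy positions in metres (illustrative only).
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
drifting_run = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (2.0, 0.2, 0.0), (3.0, 0.3, 0.0)]
tight_run = [(0.0, 0.0, 0.0), (1.0, 0.02, 0.0), (2.0, 0.01, 0.0), (3.0, 0.0, 0.0)]

print(ate_rmse(drifting_run, ground_truth), ate_rmse(tight_run, ground_truth))
```

Reporting this number for both alignment models on the same long sequence, together with a dense-map metric such as reconstruction completeness, would give the kind of evidence described above.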
Original abstract
We present VGGT-SLAM, a dense RGB SLAM system constructed by incrementally and globally aligning submaps created from the feed-forward scene reconstruction approach VGGT using only uncalibrated monocular cameras. While related works align submaps using similarity transforms (i.e., translation, rotation, and scale), we show that such approaches are inadequate in the case of uncalibrated cameras. In particular, we revisit the idea of reconstruction ambiguity, where given a set of uncalibrated cameras with no assumption on the camera motion or scene structure, the scene can only be reconstructed up to a 15-degrees-of-freedom projective transformation of the true geometry. This inspires us to recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degrees-of-freedom homography transforms between sequential submaps while accounting for potential loop closure constraints. As verified by extensive experiments, we demonstrate that VGGT-SLAM achieves improved map quality using long video sequences that are infeasible for VGGT due to its high GPU requirements.
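The reconstruction ambiguity mentioned in the abstract can be checked numerically. In this minimal sketch (the camera matrix, points, and homography are hypothetical values chosen for illustration), warping every 3D point by an invertible 4x4 matrix H and every camera by the inverse of H leaves all image projections unchanged, so images alone cannot distinguish the true scene from its projectively warped copy.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (adequate for a small demo)."""
    n = len(m)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def project(P, X):
    """Project a homogeneous 3D point X with a 3x4 camera matrix P to pixels."""
    x = [sum(p * v for p, v in zip(row, X)) for row in P]
    return (x[0] / x[2], x[1] / x[2])

# Hypothetical camera and homogeneous scene points.
P = [[800.0, 0.0, 320.0, 10.0],
     [0.0, 800.0, 240.0, -5.0],
     [0.0, 0.0, 1.0, 0.5]]
points = [[0.2, -0.1, 4.0, 1.0], [1.0, 0.5, 6.0, 1.0], [-0.7, 0.3, 5.0, 1.0]]

# Arbitrary invertible 4x4 matrix; the last row makes it genuinely projective.
H = [[1.1, 0.2, 0.0, 0.3],
     [-0.1, 0.9, 0.1, -0.2],
     [0.0, 0.1, 1.2, 0.4],
     [0.02, -0.01, 0.03, 1.0]]

P_warped = matmul(P, inverse(H))                      # P' = P H^{-1}
points_warped = [[sum(h * v for h, v in zip(row, X)) for row in H]
                 for X in points]                     # X' = H X

original = [project(P, X) for X in points]
warped = [project(P_warped, X) for X in points_warped]
print(original)
print(warped)
```

Because H is defined only up to scale, its 16 entries carry 15 degrees of freedom, which matches the count quoted in the abstract.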
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents VGGT-SLAM, a dense RGB SLAM system that incrementally constructs and globally aligns submaps generated by the feed-forward VGGT scene reconstruction method using only uncalibrated monocular video input. It addresses projective reconstruction ambiguity by optimizing 15-DoF homography transforms on the SL(4) manifold between sequential submaps while incorporating loop-closure constraints, claiming this yields improved map quality for long sequences infeasible for direct VGGT processing due to GPU limits.
Significance. If the central claim holds, the work would provide a practical extension of feed-forward dense reconstruction methods to long uncalibrated sequences by exploiting the standard 15-DoF gauge freedom of uncalibrated SfM through manifold-constrained alignment. Explicit optimization on SL(4) rather than over similarity transforms is a clear methodological contribution that follows directly from projective geometry; the approach could influence future monocular SLAM pipelines that combine learned submap generators with global consistency enforcement.
major comments (2)
- [Abstract and §4 (Experiments)] The central claim of 'improved map quality' and 'extensive experiments' is supported only by qualitative statements; no quantitative metrics (e.g., absolute trajectory error, reconstruction completeness, or comparison tables against similarity-transform baselines) or error analysis are referenced, leaving the performance advantage over prior submap alignment methods unverifiable from the provided description.
- [§3.2 (SL(4) Optimization)] The formulation of the 15-DoF homography optimization on the SL(4) manifold is presented as sufficient to resolve inter-submap inconsistencies, but the manuscript does not specify how the objective function weights the loop-closure constraints relative to sequential alignments or whether the optimization is guaranteed to converge to a globally consistent gauge; this is load-bearing for the claim that projective ambiguity is fully mitigated.
minor comments (2)
- [§3] Notation for the SL(4) manifold and homography parameterization could be clarified with an explicit definition of the Lie algebra basis or retraction used in the optimizer.
- [Abstract] The abstract states that similarity transforms are 'inadequate' for uncalibrated cameras but does not cite the specific prior works that used them, which would help situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and verifiability, and we address each point below.
- Referee: [Abstract and §4 (Experiments)] The central claim of 'improved map quality' and 'extensive experiments' is supported only by qualitative statements; no quantitative metrics (e.g., absolute trajectory error, reconstruction completeness, or comparison tables against similarity-transform baselines) or error analysis are referenced, leaving the performance advantage over prior submap alignment methods unverifiable from the provided description.
  Authors: We agree that quantitative metrics would strengthen the central claim. The submitted manuscript emphasizes qualitative results from experiments on long sequences, but we will revise §4 to add a comparison table reporting absolute trajectory error (ATE), reconstruction completeness, and other metrics against similarity-transform baselines, along with a brief error analysis. This revision will make the advantages verifiable.
  Revision: yes
- Referee: [§3.2 (SL(4) Optimization)] The formulation of the 15-DoF homography optimization on the SL(4) manifold is presented as sufficient to resolve inter-submap inconsistencies, but the manuscript does not specify how the objective function weights the loop-closure constraints relative to sequential alignments or whether the optimization is guaranteed to converge to a globally consistent gauge; this is load-bearing for the claim that projective ambiguity is fully mitigated.
  Authors: We acknowledge the need for greater specification here. We will revise §3.2 to explicitly describe the objective function as a weighted sum of sequential alignment and loop-closure terms (with equal weights in the reported experiments, or a tunable parameter), and to discuss convergence: the non-convex manifold optimization has no theoretical global guarantee but empirically reaches a consistent gauge that mitigates projective ambiguity, as supported by our results. A short paragraph on this will be added.
  Revision: yes
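For readers who want to see what a "weighted sum of sequential alignment and loop-closure terms" could look like, the following is one plausible form, written as a sketch rather than as the paper's actual objective. All symbols are assumptions introduced here: $H_i$ is the homography taking submap $i$ to a global frame, $\tilde H_{ij}$ is a measured relative homography between submaps $i$ and $j$, $\mathcal{L}$ is the set of loop-closure pairs, $\operatorname{Log}$ maps an SL(4) element near the identity to its 15 coefficients in a chosen basis $G_k$ of the Lie algebra, and $w_{\mathrm{seq}}, w_{\mathrm{lc}}$ are the two weights.

```latex
\min_{H_1,\dots,H_N \in SL(4)}
  \; w_{\mathrm{seq}} \sum_{i=1}^{N-1}
     \bigl\| \operatorname{Log}\!\bigl(\tilde H_{i,i+1}^{-1}\, H_i^{-1} H_{i+1}\bigr) \bigr\|^2
  \;+\; w_{\mathrm{lc}} \sum_{(i,j) \in \mathcal{L}}
     \bigl\| \operatorname{Log}\!\bigl(\tilde H_{ij}^{-1}\, H_i^{-1} H_j\bigr) \bigr\|^2,
\qquad
H_i \leftarrow H_i \exp\!\Bigl(\sum_{k=1}^{15} \xi_{i,k}\, G_k\Bigr).
```

Setting $w_{\mathrm{seq}} = w_{\mathrm{lc}}$ corresponds to the equal-weight case mentioned in the response; the update on the right keeps each $H_i$ on SL(4) at every iteration, but the problem remains non-convex, so only convergence to a local minimum can be expected.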
Circularity Check
No significant circularity
The derivation rests on the standard 15-DoF projective ambiguity result from uncalibrated SfM literature, which is an external, independently established mathematical fact rather than a quantity fitted or defined within the paper. The choice to optimize 15-DoF homographies on the SL(4) manifold follows directly as the natural gauge-fixing step for aligning VGGT submaps; this is not a self-definitional reduction, a renamed empirical pattern, or a load-bearing self-citation chain. No equations or claims in the provided text reduce the central result to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Given a set of uncalibrated cameras with no assumption on camera motion or scene structure, the scene can only be reconstructed up to a 15-DoF projective transformation.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "we revisit the idea of reconstruction ambiguity... recover a consistent scene reconstruction across submaps by optimizing over the SL(4) manifold, thus estimating 15-degrees-of-freedom homography transforms"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 23 Pith papers
- Stream3D: Sequential Multi-View 3D Generation via Evidential Memory
  Stream3D is a training-free method that maintains temporal consistency in 3D generation from monocular streams by dynamically caching a fixed number of informative historical frames using an evidence score.
- Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory
  Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.
- Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model
  A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.
- VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
  VGGT-Edit proposes a native 3D text-conditioned editing framework using depth-synchronized injection and residual field prediction, plus the DeltaScene dataset, outperforming 2D-lifting methods.
- Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction
  Ray-aware pointers that track both location and viewing direction enable adaptive retain-or-replace memory updates for more stable streaming 3D reconstruction.
- AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
  AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.
- Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye
  CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.
- PRISM-SLAM: Probabilistic Ray-Grounded Inference for Scale-aware Metric SLAM
  PRISM-SLAM achieves scale-aware metric SLAM from RGB input by anchoring VFM depth priors with Plücker ray-distance factors in a factor graph and using dynamic scene uncertainty gating, producing metric trajectories wh...
- Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction
  Ray-aware pointer memory with adaptive retain-or-replace updates enhances stability and accuracy in streaming 3D reconstruction.
- RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
  RADIO-ViPE performs online open-vocabulary semantic SLAM directly from monocular RGB video in dynamic environments by tightly coupling vision-language embeddings from foundation models with geometric factor-graph opti...
- Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
  Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
- ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging
  ZeD-MAP integrates incremental cluster-based bundle adjustment with zero-shot diffusion depth estimation to deliver metrically consistent real-time depth maps from high-resolution UAV imagery.
- ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging
  ZeD-MAP uses incremental bundle adjustment on image clusters to guide zero-shot diffusion depth estimation, delivering sub-meter accuracy (0.87 m XY, 0.12 m Z) at 1.5-5 seconds per image on high-resolution aerial data.
- Depth Anything 3: Recovering the Visual Space from Any Views
  DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
- PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving
  PRIX presents an efficient camera-only planner with a novel CaRT module that matches larger multimodal models on NavSim and nuScenes while reducing model size and inference time.
- Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
  Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
- MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM
  MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.
- Metric, inertially aligned monocular state estimation via kinetodynamic priors
  The method combines a learned deformation model, continuous B-spline kinematics, and Newton's Second Law to enable accurate pose estimation and metric scale plus gravity recovery in monocular visual odometry on non-ri...
- SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors
  SING3R-SLAM adds submap-level global alignment and reconstruction priors to a Gaussian map to reduce drift and improve local geometry in monocular indoor SLAM.
- TTT3R: 3D Reconstruction as Test-Time Training
  TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.
- ViPE: Video Pose Engine for 3D Geometric Perception
  ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.
- VGGT-SLAM++
  VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.
- VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences
  VGGT-Long extends VGGT with chunking, overlap alignment, and loop closure to produce consistent kilometer-scale 3D reconstructions from monocular RGB sequences without retraining or extra supervision.
A Tangent space of SL(4)
Here, we provide the explicit 15 generators, G_k, k ∈ {1, ..., 15}, of SL(4), which allow us to r...
...preventing an estimated alignment, and thus we do not include the w = 1 for TUM. This is due to reasons discussed in Sec. 6. Particularly, for the floor scene a large portion of the images view only a planar scene, which makes the estimation of the full 15-DoF homography matrix degenerate, and for the 360 scene, using a small submap size suc...