VGGT-SLAM++
Pith reviewed 2026-05-10 17:38 UTC · model grok-4.3
The pith
VGGT-SLAM++ restores frequent local bundle adjustment in transformer SLAM by building DEM tiles from VGGT submaps and retrieving neighbors with DINOv2 embeddings and VPR.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VGGT-SLAM++ restores high-cadence local bundle adjustment through a spatially corrective back-end. For each VGGT submap the system constructs a dense planar-canonical DEM, partitions it into patches, computes DINOv2 embeddings for those patches, and integrates the submap into a covisibility graph. Spatial neighbors are retrieved via a VPR module operating inside the covisibility window; the retrieved neighbors trigger frequent local optimization that stabilizes trajectories. On standard SLAM benchmarks the resulting system achieves state-of-the-art accuracy, substantially reduces short-term drift, accelerates graph convergence, and maintains global consistency with compact DEM tiles and sublinear retrieval.
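As a rough illustration of the DEM construction and patching step, the following Python sketch bins 3D points into a height grid and cuts the grid into tiles. It is not the paper's implementation: it assumes the points are already expressed in a plane-aligned frame (the planar-canonical alignment itself is not reproduced), it keeps the maximum height per cell as one possible convention, and the names `rasterize_dem`, `split_into_patches`, `cell_size`, and `patch` are hypothetical. The DINOv2 embedding of each tile is omitted.

```python
import math

def rasterize_dem(points, cell_size):
    """Bin (x, y, z) points into a 2D grid, keeping the max height per cell.

    Assumption: z is already normal to the reference plane.
    Cells that receive no point stay None.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    cols = int(math.floor((max(xs) - x0) / cell_size)) + 1
    rows = int(math.floor((max(ys) - y0) / cell_size)) + 1
    dem = [[None] * cols for _ in range(rows)]
    for x, y, z in points:
        c = int(math.floor((x - x0) / cell_size))
        r = int(math.floor((y - y0) / cell_size))
        if dem[r][c] is None or z > dem[r][c]:
            dem[r][c] = z
    return dem

def split_into_patches(dem, patch):
    """Partition the grid into non-overlapping tiles of at most patch x patch cells."""
    rows, cols = len(dem), len(dem[0])
    tiles = {}
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            tiles[(r // patch, c // patch)] = [row[c:c + patch] for row in dem[r:r + patch]]
    return tiles

# Synthetic points, for illustration only.
points = [(0.0, 0.0, 1.0), (0.4, 0.1, 2.0), (1.2, 0.0, 0.5), (3.9, 3.9, 4.0)]
dem = rasterize_dem(points, cell_size=1.0)
tiles = split_into_patches(dem, patch=2)
```

Each entry of `tiles` is the kind of patch that would then be rendered and passed to an image encoder; how the paper renders DEM patches for DINOv2 is not specified here.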
What carries the argument
The DEM-based covisibility graph whose patches carry DINOv2 embeddings and whose neighbors are retrieved by VPR inside the covisibility window, thereby triggering local bundle adjustment on VGGT submaps.
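To make the retrieval step concrete, here is a minimal Python sketch of neighbor selection restricted to a covisibility window. It is an assumption-laden stand-in, not the paper's VPR module: it scores candidates by cosine similarity between one embedding per submap, scans the window by brute force, and uses hypothetical names (`retrieve_neighbors`, `window`, `threshold`, `top_k`). A sublinear retrieval scheme would replace the linear scan with an approximate nearest-neighbor index.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def retrieve_neighbors(query_id, embeddings, window, threshold, top_k):
    """Return up to top_k submap ids from `window` whose similarity to the
    query is at least `threshold`, most similar first."""
    q = embeddings[query_id]
    scored = [(cosine(q, embeddings[i]), i) for i in window if i != query_id]
    scored = [(s, i) for s, i in scored if s >= threshold]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:top_k]]

# Toy embeddings (hypothetical values, three dimensions for readability).
embeddings = {
    0: [1.0, 0.0, 0.0],
    1: [0.9, 0.1, 0.0],
    2: [0.0, 1.0, 0.0],
    3: [0.8, 0.0, 0.6],
    4: [1.0, 0.0, 0.0],  # identical to submap 0 but outside the window
}
neighbors = retrieve_neighbors(0, embeddings, window=[0, 1, 2, 3], threshold=0.7, top_k=2)
```

In this toy setup submap 4 is never returned even though its embedding matches the query exactly, which is the effect of restricting retrieval to the covisibility window.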
If this is right
- Short-term pose drift is substantially reduced because local bundle adjustment now runs at high cadence.
- Pose-graph convergence accelerates because corrective constraints are added more frequently and more locally.
- Global consistency is preserved over large maps while memory usage remains bounded by the compact DEM tiles.
- Retrieval of neighbors stays sublinear, allowing the system to scale without quadratic growth in computation.
- The front-end continues to use the feed-forward VGGT transformer plus Sim(3) solution, so the improvement is localized to the back-end.
Where Pith is reading between the lines
- The same DEM-plus-VPR pattern could be grafted onto other transformer-based odometry pipelines that currently lack dense local optimization.
- Dense geometric proxies such as DEM tiles may serve as a general bridge between sparse transformer outputs and classical dense bundle adjustment.
- If retrieval remains efficient at city scale, the approach could support lifelong mapping with bounded memory growth.
- The method implicitly assumes that DINOv2 embeddings capture sufficient geometric similarity for reliable neighbor selection in the covisibility graph.
Load-bearing premise
Constructing planar-canonical DEMs from VGGT submaps, embedding their patches with DINOv2, and retrieving neighbors via VPR will reliably trigger effective local bundle adjustment without introducing fresh drift or scalability problems.
What would settle it
Run VGGT-SLAM++ on a long continuous trajectory benchmark previously used for VGGT-SLAM; if short-term drift remains comparable to the baseline or local optimization fails to converge, the claim that the new back-end reliably stabilizes trajectories is false.
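One way to quantify the short-term drift in such a test is a relative pose error over a fixed frame gap. The Python sketch below is a simplified, translation-only version: it compares displacement vectors between frames `i` and `i + delta`, ignores rotation, and assumes both trajectories are already in the same frame and scale. The function name `rpe_translation_rmse` and the trajectories are hypothetical.

```python
import math

def rpe_translation_rmse(est, gt, delta=1):
    """RMSE of the difference between estimated and ground-truth
    displacement over `delta` frames (positions only, no rotation)."""
    errs = []
    for i in range(len(est) - delta):
        d_est = [b - a for a, b in zip(est[i], est[i + delta])]
        d_gt = [b - a for a, b in zip(gt[i], gt[i + delta])]
        errs.append(sum((e - g) ** 2 for e, g in zip(d_est, d_gt)))
    return math.sqrt(sum(errs) / len(errs))

# Synthetic trajectories, for illustration only.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
baseline = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (2.2, 0.0, 0.0), (3.3, 0.0, 0.0)]
candidate = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.1, 0.0, 0.0), (3.1, 0.0, 0.0)]

rpe_baseline = rpe_translation_rmse(baseline, gt)
rpe_candidate = rpe_translation_rmse(candidate, gt)
```

If the candidate back-end does not yield a lower value than the baseline on the same sequence, the drift-reduction claim would not be supported for that sequence.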
Original abstract
We introduce VGGT-SLAM++, a complete visual SLAM system that leverages the geometry-rich outputs of the Visual Geometry Grounded Transformer (VGGT). The system comprises a visual odometry (front-end) fusing the VGGT feed-forward transformer and a Sim(3) solution, a Digital Elevation Map (DEM)-based graph construction module, and a back-end that jointly enable accurate large-scale mapping with bounded memory. While prior transformer-based SLAM pipelines such as VGGT-SLAM rely primarily on sparse loop closures or global Sim(3) manifold constraints - allowing short-horizon pose drift - VGGT-SLAM++ restores high-cadence local bundle adjustment (LBA) through a spatially corrective back-end. For each VGGT submap, we construct a dense planar-canonical DEM, partition it into patches, and compute their DINOv2 embeddings to integrate the submap into a covisibility graph. Spatial neighbors are retrieved using a Visual Place Recognition (VPR) module within the covisibility window, triggering frequent local optimization that stabilizes trajectories. Across standard SLAM benchmarks, VGGT-SLAM++ achieves state-of-the-art accuracy, substantially reducing short-term drift, accelerating graph convergence, and maintaining global consistency with compact DEM tiles and sublinear retrieval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VGGT-SLAM++, a visual SLAM system that extends prior transformer-based pipelines by fusing VGGT feed-forward outputs with a Sim(3) front-end for visual odometry, constructing planar-canonical DEMs from VGGT submaps, partitioning them into patches embedded via DINOv2, and using a VPR module to retrieve spatial neighbors inside the covisibility window. This triggers frequent local bundle adjustment (LBA) in the back-end, with the overall system claimed to deliver bounded-memory large-scale mapping. The central claim is that VGGT-SLAM++ achieves state-of-the-art accuracy across standard SLAM benchmarks while substantially reducing short-term drift, accelerating graph convergence, and preserving global consistency via compact DEM tiles and sublinear retrieval.
Significance. If the empirical claims hold, the work would advance visual SLAM by restoring high-cadence LBA to transformer-based systems that previously relied on sparse loop closures or global Sim(3) constraints, thereby mitigating short-horizon drift without sacrificing scalability. The DEM-based graph and the DINOv2+VPR retrieval mechanism offer a concrete route to memory-bounded, sublinear neighbor selection; explicit credit is due for composing established modules (VGGT, DINOv2, Sim(3)) as black-box inputs rather than re-deriving them. Reproducibility would be strengthened by release of the full pipeline and benchmark scripts.
major comments (2)
- [§3.3] Back-end module (likely §3.3): The claim that VPR retrieval inside the covisibility window reliably triggers effective LBA and reduces short-term drift is load-bearing for the SOTA accuracy assertion, yet the manuscript supplies no ablation isolating this component's contribution versus the VGGT+Sim(3) front-end alone. Inaccurate DINOv2 embeddings or poorly tuned retrieval thresholds could introduce false-positive constraints that increase rather than decrease drift, directly undermining the central drift-reduction guarantee.
- [§4] Experiments section (likely §4): The abstract asserts 'state-of-the-art accuracy' and 'substantially reducing short-term drift' on standard benchmarks, but the manuscript must furnish concrete quantitative tables (e.g., ATE/RPE on KITTI, EuRoC, or TUM sequences) with direct comparisons to VGGT-SLAM, ORB-SLAM3, and other DEM-based baselines, plus error analysis and ablation on retrieval thresholds. Absence of these metrics leaves the central claim unsupported.
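For readers unfamiliar with the ATE figure requested above, the following Python sketch shows the basic computation under a deliberately reduced alignment. It only removes the centroid offset between the two trajectories; standard evaluations typically fit a full rigid or similarity transform first, which is not shown. The function name `ate_rmse_centroid_aligned` and the trajectories are hypothetical.

```python
import math

def ate_rmse_centroid_aligned(est, gt):
    """RMSE of position error after subtracting each trajectory's centroid.

    Simplification: no rotation or scale alignment is performed.
    """
    n = len(est)
    dim = len(est[0])
    c_est = [sum(p[k] for p in est) / n for k in range(dim)]
    c_gt = [sum(p[k] for p in gt) / n for k in range(dim)]
    sq = 0.0
    for e, g in zip(est, gt):
        sq += sum(((e[k] - c_est[k]) - (g[k] - c_gt[k])) ** 2 for k in range(dim))
    return math.sqrt(sq / n)

# Synthetic trajectories, for illustration only.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
shifted = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (7.0, 5.0, 5.0)]   # pure offset
drifting = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (2.0, 0.4, 0.0)]  # lateral drift

ate_shifted = ate_rmse_centroid_aligned(shifted, gt)
ate_drifting = ate_rmse_centroid_aligned(drifting, gt)
```

A pure offset yields zero error after alignment, while accumulated lateral drift does not, which is why ATE is reported after alignment rather than on raw coordinates.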
minor comments (3)
- [Abstract] Abstract: The acronym 'LBA' appears without prior expansion; first use should read 'local bundle adjustment (LBA)'.
- [§3] Notation: 'planar-canonical DEM' and 'sublinear retrieval' are introduced without a concise definition or pointer to the precise algorithmic realization (e.g., how the canonical plane is chosen or how sublinearity is measured).
- [References] References: The manuscript should cite the original VGGT paper and the DINOv2 work explicitly when describing the front-end and embedding modules.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on VGGT-SLAM++. The comments highlight important aspects of the back-end design and experimental validation that we will address in revision. Below we respond point-by-point to the major comments.
- Referee: [§3.3] Back-end module (likely §3.3): The claim that VPR retrieval inside the covisibility window reliably triggers effective LBA and reduces short-term drift is load-bearing for the SOTA accuracy assertion, yet the manuscript supplies no ablation isolating this component's contribution versus the VGGT+Sim(3) front-end alone. Inaccurate DINOv2 embeddings or poorly tuned retrieval thresholds could introduce false-positive constraints that increase rather than decrease drift, directly undermining the central drift-reduction guarantee.
  Authors: We agree that an explicit ablation isolating the VPR-triggered LBA is required to substantiate the drift-reduction claim. In the revised manuscript we will add a controlled ablation that disables the VPR neighbor retrieval and subsequent high-cadence LBA while retaining the VGGT+Sim(3) front-end, reporting ATE/RPE differences on the same sequences. We will also include retrieval-precision statistics and a sensitivity study on the VPR threshold to quantify the risk of false-positive constraints. The covisibility-window restriction is intended to limit erroneous matches, but the added analysis will make this explicit.
  Revision: yes
- Referee: [§4] Experiments section (likely §4): The abstract asserts 'state-of-the-art accuracy' and 'substantially reducing short-term drift' on standard benchmarks, but the manuscript must furnish concrete quantitative tables (e.g., ATE/RPE on KITTI, EuRoC, or TUM sequences) with direct comparisons to VGGT-SLAM, ORB-SLAM3, and other DEM-based baselines, plus error analysis and ablation on retrieval thresholds. Absence of these metrics leaves the central claim unsupported.
  Authors: We acknowledge that the experimental section must provide the requested quantitative tables and ablations to support the abstract claims. The revised manuscript will expand Section 4 with complete ATE and RPE tables on KITTI, EuRoC, and TUM sequences, including direct comparisons against VGGT-SLAM, ORB-SLAM3, and additional DEM-based baselines. We will also add error-distribution plots and a dedicated ablation on retrieval-threshold values. These additions will ensure every accuracy and drift claim is backed by explicit numerical evidence.
  Revision: yes
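The retrieval-threshold sensitivity study mentioned in the responses could take roughly the following shape. This Python sketch is illustrative only: it assumes each retrieved pair comes with a similarity score and a ground-truth label saying whether the two submaps truly overlap, and it reports precision and recall per threshold. The function name `sweep_thresholds` and the scored pairs are hypothetical, not results from the paper.

```python
def sweep_thresholds(scored_pairs, thresholds):
    """scored_pairs: list of (similarity, is_true_neighbor).

    Returns one (threshold, n_accepted, precision, recall) row per threshold.
    Precision is None when nothing is accepted.
    """
    total_true = sum(1 for _, ok in scored_pairs if ok)
    rows = []
    for t in thresholds:
        accepted = [ok for s, ok in scored_pairs if s >= t]
        tp = sum(1 for ok in accepted if ok)
        precision = tp / len(accepted) if accepted else None
        recall = tp / total_true if total_true else None
        rows.append((t, len(accepted), precision, recall))
    return rows

# Synthetic scores and labels, for illustration only.
pairs = [(0.95, True), (0.90, True), (0.85, False),
         (0.80, True), (0.60, False), (0.55, False)]
rows = sweep_thresholds(pairs, thresholds=[0.5, 0.7, 0.9])
```

On this toy data a stricter threshold raises precision (fewer false-positive constraints) at the cost of recall, which is the trade-off the requested ablation would need to characterize on real sequences.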
Circularity Check
No circularity: system composes external modules without self-referential definitions or fitted predictions
The paper describes a composite SLAM pipeline that takes VGGT outputs, Sim(3) poses, DINOv2 embeddings, and VPR retrieval as independent inputs to construct DEM tiles and trigger LBA. No equation or claim reduces a performance metric to a parameter fitted from the same metric; no self-citation is invoked as a uniqueness theorem; the SOTA claim is presented as an empirical outcome on external benchmarks rather than a derivation that loops back to its own construction. The derivation chain is therefore self-contained against external modules and data.