pith. machine review for the scientific record. sign in

arxiv: 2604.06830 · v1 · submitted 2026-04-08 · 💻 cs.CV · cs.RO

Recognition: unknown

VGGT-SLAM++

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:38 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords visual SLAMVGGTdigital elevation maplocal bundle adjustmentDINOv2visual place recognitioncovisibility graphtransformer-based SLAM
0
0 comments X

The pith

VGGT-SLAM++ restores frequent local bundle adjustment in transformer SLAM by building DEM tiles from VGGT submaps and retrieving neighbors with DINOv2 embeddings and VPR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VGGT-SLAM++ as a full visual SLAM pipeline that takes the geometry-rich outputs of the Visual Geometry Grounded Transformer and adds a back-end to enable high-cadence local optimization. For each VGGT submap it builds a dense planar-canonical digital elevation map, splits the map into patches, embeds those patches with DINOv2, and inserts them into a covisibility graph. Spatial neighbors are then found with a visual place recognition module operating inside the covisibility window, which triggers repeated local bundle adjustment. A sympathetic reader would care because prior transformer SLAM systems suffered accumulating short-term drift; this design keeps trajectories stable over long distances while using compact tiles and sublinear retrieval so that memory stays bounded.

Core claim

VGGT-SLAM++ restores high-cadence local bundle adjustment through a spatially corrective back-end. For each VGGT submap the system constructs a dense planar-canonical DEM, partitions it into patches, computes DINOv2 embeddings for those patches, and integrates the submap into a covisibility graph. Spatial neighbors are retrieved via a VPR module operating inside the covisibility window; the retrieved neighbors trigger frequent local optimization that stabilizes trajectories. On standard SLAM benchmarks the resulting system achieves state-of-the-art accuracy, substantially reduces short-term drift, accelerates graph convergence, and maintains global consistency with compact DEM tiles and sub-

What carries the argument

The DEM-based covisibility graph whose patches carry DINOv2 embeddings and whose neighbors are retrieved by VPR inside the covisibility window, thereby triggering local bundle adjustment on VGGT submaps.

If this is right

  • Short-term pose drift is substantially reduced because local bundle adjustment now runs at high cadence.
  • Pose-graph convergence accelerates because corrective constraints are added more frequently and more locally.
  • Global consistency is preserved over large maps while memory usage remains bounded by the compact DEM tiles.
  • Retrieval of neighbors stays sublinear, allowing the system to scale without quadratic growth in computation.
  • The front-end continues to use the feed-forward VGGT transformer plus Sim(3) solution, so the improvement is localized to the back-end.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same DEM-plus-VPR pattern could be grafted onto other transformer-based odometry pipelines that currently lack dense local optimization.
  • Dense geometric proxies such as DEM tiles may serve as a general bridge between sparse transformer outputs and classical dense bundle adjustment.
  • If retrieval remains efficient at city scale, the approach could support lifelong mapping with bounded memory growth.
  • The method implicitly assumes that DINOv2 embeddings capture sufficient geometric similarity for reliable neighbor selection in the covisibility graph.

Load-bearing premise

Constructing planar-canonical DEMs from VGGT submaps, embedding their patches with DINOv2, and retrieving neighbors via VPR will reliably trigger effective local bundle adjustment without introducing fresh drift or scalability problems.

What would settle it

Run VGGT-SLAM++ on a long continuous trajectory benchmark previously used for VGGT-SLAM; if short-term drift remains comparable to the baseline or local optimization fails to converge, the claim that the new back-end reliably stabilizes trajectories is false.

Figures

Figures reproduced from arXiv: 2604.06830 by Avilasha Mandal, Chetan Arora, Rajesh Kumar, Sudarshan Sunil Harithas.

Figure 1
Figure 1. Figure 1: VGGT-SLAM++ provides an end-to-end SLAM architecture that stabilizes transformer-based odometry by using a low-fidelity geometric representation to support a high-cadence optimization back-end. The trajectories in the upper row are the odometry based trajectories while the lower row corresponds to the corrected trajectories when stabilised by our back-end. Abstract We introduce VGGT-SLAM++, a complete visu… view at source ↗
Figure 2
Figure 2. Figure 2: (A) DEM-based scene representation on the TUM RGB-D freiburg1 teddy dataset. The DEMs provide a compact 2.5D encoding retaining geometric structure. (B) DEM visualizations from the 7-Scenes dataset. (C) DEMs generated for the Virtual KITTI (Sequence 01) dataset. (D) A full KITTI Odometry (Sequence 09) sequence demonstrating a complete loop, illustrating the ability of DEMs and our SLAM back-end pipeline to… view at source ↗
Figure 3
Figure 3. Figure 3: Complete pipeline overview. The proposed VGGT-SLAM++ system comprises three main components: (a) Front-end: A Sim(3) odometry module that optimizes the relative poses of submaps generated by the feed-forward VGGT network. (b) Covisibility graph construction: A DEM-based map representation is used to compute structure-aware embeddings leveraging DINOv2, and an averaged tile score is used to insert spatially… view at source ↗
Figure 5
Figure 5. Figure 5: (A), (B), (C): zero-shot object detection from DEMs proving structure preservation. (D) DEM of TUM-teddy and (E) color coding. GPU; FAISS-HNSW indexing executes on CPU, ensur￾ing constant memory usage per submap. The results are benchmarked using the root mean squared Absolute Trajectory Error (ATE) [95] (ATE rmse) in meters. Memory Profile. At inference time, only the current submaps’ (in covisibility win… view at source ↗
Figure 6
Figure 6. Figure 6: VGGT-SLAM++ results for : (A) custom data (406.8m) recorded by GoPro HERO10 camera with GPS groundtruth with 2m precision. (ATE RMSE 18 ± 2 m); (B) custom data (1.8m) recorded by a OAK-1 camera with a Humanoid robot kinematics groundtruth (ATE RMSE 0.02m); (C) custom data (1.8m) recorded by a OAK-1 camera with Cobot forward kinematics groundtruth (ATE RMSE 0.01m); (D) KITTI Odometry 06 sequence (1230 m; AT… view at source ↗
Figure 7
Figure 7. Figure 7: The red line is the ground truth reference from GPS readings and the blue line is the estimation by VGGT￾SLAM++ for the custom GoPro camera dataset [Axes’ units are in meters]. the DEM that violate the planar assumption and might introduce unstable gradients for both DINOv2 [59] em￾beddings and the Sim(3) backend [69]. To prevent this, VGGT–SLAM++ applies a physically-motivated depth filter ∀pi = (xi , yi … view at source ↗
Figure 8
Figure 8. Figure 8: The DEM rendered from the 3D points aligned by odometry over the KITTI Sequence 05, with color mapping for better visualisation. process consists of (i) fitting a stable reference plane, (ii) expressing all points in a canonical orthonormal frame, (iii) rasterising heights at a chosen spatial resolution, and (iv) feeding the grayscale DEM for DINOv2-based re￾trieval. Input. Let P = {pi} N i=1, pi = (xi , y… view at source ↗
read the original abstract

We introduce VGGT-SLAM++, a complete visual SLAM system that leverages the geometry-rich outputs of the Visual Geometry Grounded Transformer (VGGT). The system comprises a visual odometry (front-end) fusing the VGGT feed-forward transformer and a Sim(3) solution, a Digital Elevation Map (DEM)-based graph construction module, and a back-end that jointly enable accurate large-scale mapping with bounded memory. While prior transformer-based SLAM pipelines such as VGGT-SLAM rely primarily on sparse loop closures or global Sim(3) manifold constraints - allowing short-horizon pose drift - VGGT-SLAM++ restores high-cadence local bundle adjustment (LBA) through a spatially corrective back-end. For each VGGT submap, we construct a dense planar-canonical DEM, partition it into patches, and compute their DINOv2 embeddings to integrate the submap into a covisibility graph. Spatial neighbors are retrieved using a Visual Place Recognition (VPR) module within the covisibility window, triggering frequent local optimization that stabilizes trajectories. Across standard SLAM benchmarks, VGGT-SLAM++ achieves state-of-the-art accuracy, substantially reducing short-term drift, accelerating graph convergence, and maintaining global consistency with compact DEM tiles and sublinear retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces VGGT-SLAM++, a visual SLAM system that extends prior transformer-based pipelines by fusing VGGT feed-forward outputs with a Sim(3) front-end for visual odometry, constructing planar-canonical DEMs from VGGT submaps, partitioning them into patches embedded via DINOv2, and using a VPR module to retrieve spatial neighbors inside the covisibility window. This triggers frequent local bundle adjustment (LBA) in the back-end, with the overall system claimed to deliver bounded-memory large-scale mapping. The central claim is that VGGT-SLAM++ achieves state-of-the-art accuracy across standard SLAM benchmarks while substantially reducing short-term drift, accelerating graph convergence, and preserving global consistency via compact DEM tiles and sublinear retrieval.

Significance. If the empirical claims hold, the work would advance visual SLAM by restoring high-cadence LBA to transformer-based systems that previously relied on sparse loop closures or global Sim(3) constraints, thereby mitigating short-horizon drift without sacrificing scalability. The DEM-based graph and DINOv2+VPR retrieval mechanism offers a concrete route to memory-bounded, sublinear neighbor selection; explicit credit is due for composing established modules (VGGT, DINOv2, Sim(3)) as black-box inputs rather than re-deriving them. Reproducibility would be strengthened by release of the full pipeline and benchmark scripts.

major comments (2)
  1. [§3.3] Back-end module (likely §3.3): The claim that VPR retrieval inside the covisibility window reliably triggers effective LBA and reduces short-term drift is load-bearing for the SOTA accuracy assertion, yet the manuscript supplies no ablation isolating this component's contribution versus the VGGT+Sim(3) front-end alone. Inaccurate DINOv2 embeddings or poorly tuned retrieval thresholds could introduce false-positive constraints that increase rather than decrease drift, directly undermining the central drift-reduction guarantee.
  2. [§4] Experiments section (likely §4): The abstract asserts 'state-of-the-art accuracy' and 'substantially reducing short-term drift' on standard benchmarks, but the manuscript must furnish concrete quantitative tables (e.g., ATE/RPE on KITTI, EuRoC, or TUM sequences) with direct comparisons to VGGT-SLAM, ORB-SLAM3, and other DEM-based baselines, plus error analysis and ablation on retrieval thresholds. Absence of these metrics leaves the central claim unsupported.
minor comments (3)
  1. [Abstract] Abstract: The acronym 'LBA' appears without prior expansion; first use should read 'local bundle adjustment (LBA)'.
  2. [§3] Notation: 'planar-canonical DEM' and 'sublinear retrieval' are introduced without a concise definition or pointer to the precise algorithmic realization (e.g., how the canonical plane is chosen or how sublinearity is measured).
  3. [References] References: The manuscript should cite the original VGGT paper and the DINOv2 work explicitly when describing the front-end and embedding modules.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on VGGT-SLAM++. The comments highlight important aspects of the back-end design and experimental validation that we will address in revision. Below we respond point-by-point to the major comments.

read point-by-point responses
  1. Referee: [§3.3] Back-end module (likely §3.3): The claim that VPR retrieval inside the covisibility window reliably triggers effective LBA and reduces short-term drift is load-bearing for the SOTA accuracy assertion, yet the manuscript supplies no ablation isolating this component's contribution versus the VGGT+Sim(3) front-end alone. Inaccurate DINOv2 embeddings or poorly tuned retrieval thresholds could introduce false-positive constraints that increase rather than decrease drift, directly undermining the central drift-reduction guarantee.

    Authors: We agree that an explicit ablation isolating the VPR-triggered LBA is required to substantiate the drift-reduction claim. In the revised manuscript we will add a controlled ablation that disables the VPR neighbor retrieval and subsequent high-cadence LBA while retaining the VGGT+Sim(3) front-end, reporting ATE/RPE differences on the same sequences. We will also include retrieval-precision statistics and a sensitivity study on the VPR threshold to quantify the risk of false-positive constraints. The covisibility-window restriction is intended to limit erroneous matches, but the added analysis will make this explicit. revision: yes

  2. Referee: [§4] Experiments section (likely §4): The abstract asserts 'state-of-the-art accuracy' and 'substantially reducing short-term drift' on standard benchmarks, but the manuscript must furnish concrete quantitative tables (e.g., ATE/RPE on KITTI, EuRoC, or TUM sequences) with direct comparisons to VGGT-SLAM, ORB-SLAM3, and other DEM-based baselines, plus error analysis and ablation on retrieval thresholds. Absence of these metrics leaves the central claim unsupported.

    Authors: We acknowledge that the experimental section must provide the requested quantitative tables and ablations to support the abstract claims. The revised manuscript will expand Section 4 with complete ATE and RPE tables on KITTI, EuRoC, and TUM sequences, including direct comparisons against VGGT-SLAM, ORB-SLAM3, and additional DEM-based baselines. We will also add error-distribution plots and a dedicated ablation on retrieval-threshold values. These additions will ensure every accuracy and drift claim is backed by explicit numerical evidence. revision: yes

Circularity Check

0 steps flagged

No circularity: system composes external modules without self-referential definitions or fitted predictions

full rationale

The paper describes a composite SLAM pipeline that takes VGGT outputs, Sim(3) poses, DINOv2 embeddings, and VPR retrieval as independent inputs to construct DEM tiles and trigger LBA. No equation or claim reduces a performance metric to a parameter fitted from the same metric; no self-citation is invoked as a uniqueness theorem; the SOTA claim is presented as an empirical outcome on external benchmarks rather than a derivation that loops back to its own construction. The derivation chain is therefore self-contained against external modules and data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.0 · 5528 in / 1147 out tokens · 98475 ms · 2026-05-10T17:38:58.690966+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

105 extracted references · 14 canonical work pages · 5 internal anchors

  1. [1]

    Using the fundamentals of the theory of measurement errors in performing geodesic measurement and calculation works

    Bakhriddin Akhmedov. Using the fundamentals of the theory of measurement errors in performing geodesic measurement and calculation works. InE3S Web of Con- ferences, page 03012. EDP Sciences, 2023. 6

  2. [2]

    Talking to dino: Bridg- ing self-supervised vision backbones with language for open-vocabulary segmentation

    Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, and Rita Cucchiara. Talking to dino: Bridg- ing self-supervised vision backbones with language for open-vocabulary segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22025–22035, 2025. 5

  3. [3]

    Surf: Speeded up robust features

    Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. InEuropean conference on computer vision, pages 404–417. Springer, 2006. 2

  4. [4]

    Shape indexing using approximate nearest-neighbour search in high- dimensional spaces

    Jeffrey S Beis and David G Lowe. Shape indexing using approximate nearest-neighbour search in high- dimensional spaces. InProceedings of IEEE computer society conference on computer vision and pattern recog- nition, pages 1000–1006. IEEE, 1997. 16

  5. [5]

    The relationship between recall and precision.Journal of the American society for information science, 45(1):12–19, 1994

    Michael Buckland and Fredric Gey. The relationship between recall and precision.Journal of the American society for information science, 45(1):12–19, 1994. 16

  6. [6]

    Boosting with the l 2 loss: regression and classification.Journal of the American Statistical Association, 98(462):324–339, 2003

    Peter B¨uhlmann and Bin Yu. Boosting with the l 2 loss: regression and classification.Journal of the American Statistical Association, 98(462):324–339, 2003. 16

  7. [7]

    A gauss—newton method for convex composite optimization.Mathematical Programming, 71(2):179–194, 1995

    James V Burke and Michael C Ferris. A gauss—newton method for convex composite optimization.Mathematical Programming, 71(2):179–194, 1995. 6

  8. [8]

    The euroc micro aerial vehicle datasets.The International Journal of Robotics Research, 35(10):1157–1163, 2016

    Michael Burri, Janosch Nikolic, Pascal Gohl, Thomas Schneider, Joern Rehder, Sammy Omari, Markus W Achtelik, and Roland Siegwart. The euroc micro aerial vehicle datasets.The International Journal of Robotics Research, 35(10):1157–1163, 2016. 6, 7, 8, 12

  9. [9]

    Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam.IEEE transactions on robotics, 37(6): 1874–1890, 2021

    Carlos Campos, Richard Elvira, Juan J G´omez Rodr´ıguez, Jos´e MM Montiel, and Juan D Tard ´os. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam.IEEE transactions on robotics, 37(6): 1874–1890, 2021. 2, 8

  10. [10]

    G´omez Rodr´ıguez, Jos´e M

    Carlos Campos, Richard Elvira, Juan J. G´omez Rodr´ıguez, Jos´e M. M. Montiel, and Juan D. Tard´os. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam.IEEE Transactions on Robotics, 37 (6):1874–1890, 2021. 7

  11. [11]

    A lidar/visual slam backend with loop closure detection and graph optimization.Remote sensing, 13(14):2720, 2021

    Shoubin Chen, Baoding Zhou, Changhui Jiang, Weixing Xue, and Qingquan Li. A lidar/visual slam backend with loop closure detection and graph optimization.Remote sensing, 13(14):2720, 2021. 2

  12. [12]

    Stachniss

    Xieyuanli Chen, Thomas L ¨abe, Andres Milioto, Timo R¨ohling, Olga Vysotska, Alexandre Haag, Jens Behley, and Cyrill Stachniss. Overlapnet: Loop closing for lidar- based slam.arXiv preprint arXiv:2105.11344, 2021. 3

  13. [13]

    Recovering shape by shading and stereo under lambertian shading model

    Chi Kin Chow and Shiu Yin Yuen. Recovering shape by shading and stereo under lambertian shading model. International journal of computer vision, 85(1):58–100,

  14. [14]

    Deepfactors: Real-time probabilistic dense monocular slam.IEEE Robotics and Automation Letters, 5(2):721–728, 2020

    Jan Czarnowski, Tristan Laidlow, Ronald Clark, and An- drew J Davison. Deepfactors: Real-time probabilistic dense monocular slam.IEEE Robotics and Automation Letters, 5(2):721–728, 2020. 8, 13

  15. [15]

    Superpoint: Self-supervised interest point detec- tion and description

    Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi- novich. Superpoint: Self-supervised interest point detec- tion and description. InProceedings of the IEEE confer- ence on computer vision and pattern recognition work- shops, pages 224–236, 2018. 2

  16. [16]

    Oxford university press,

    Simon Kirwan Donaldson and Peter B Kronheimer.The geometry of four-manifolds. Oxford university press,

  17. [17]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 2

  18. [18]

    The faiss library.IEEE Transactions on Big Data, 2025

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazar ´e, Maria Lomeli, Lucas Hosseini, and Herv ´e J´egou. The faiss library.IEEE Transactions on Big Data, 2025. 4, 5, 16

  19. [19]

    Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion

    Bardienus Pieter Duisterhof, Lojze Zust, Philippe Weinza- epfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion. In2025 International Conference on 3D Vision (3DV), pages 1–10. IEEE, 2025. 3

  20. [20]

    An evalua- tion of the rgb-d slam system

    Felix Endres, J ¨urgen Hess, Nikolas Engelhard, J ¨urgen Sturm, Daniel Cremers, and Wolfram Burgard. An evalua- tion of the rgb-d slam system. In2012 IEEE international conference on robotics and automation, pages 1691–1696. IEEE, 2012. 2

  21. [21]

    Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Communi- cations of the ACM, 24(6):381–395, 1981

    Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Communi- cations of the ACM, 24(6):381–395, 1981. 5, 13, 14

  22. [22]

    Virtual worlds as proxy for multi-object tracking analysis

    Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analysis. InProceedings of the IEEE conference on com- puter vision and pattern recognition, pages 4340–4349,

  23. [23]

    Variable baseline/resolution stereo

    David Gallup, Jan-Michael Frahm, Philippos Mordohai, and Marc Pollefeys. Variable baseline/resolution stereo. In2008 IEEE conference on computer vision and pattern recognition, pages 1–8. IEEE, 2008. 16

  24. [24]

    Real-time loop detection with bags of binary words

    Dorian Galvez-Lopez and Juan D Tardos. Real-time loop detection with bags of binary words. In2011 IEEE/RSJ In- ternational Conference on Intelligent Robots and Systems, pages 51–58. IEEE, 2011. 2

  25. [25]

    Ldso: Direct sparse odometry with loop clo- sure

    Xiang Gao, Rui Wang, Nikolaus Demmel, and Daniel Cremers. Ldso: Direct sparse odometry with loop clo- sure. In2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2198–2204. IEEE, 2018. 7

  26. [26]

    Vision meets robotics: The kitti dataset

    Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The international journal of robotics research, 32(11): 1231–1237, 2013. 6, 7, 8, 12, 15

  27. [27]

    Multi-level mapping: Real-time dense monocular slam

    W Nicholas Greene, Kyel Ok, Peter Lommel, and Nicholas Roy. Multi-level mapping: Real-time dense monocular slam. In2016 IEEE International Confer- ence on Robotics and Automation (ICRA), pages 833–840. IEEE, 2016. 16

  28. [28]

    Findernet: A data augmentation free canoni- calization aided loop detection and closure technique for point clouds in 6-dof separation

    Sudarshan S Harithas, Gurkirat Singh, Aneesh Chavan, Sarthak Sharma, Suraj Patni, Chetan Arora, and Madhava Krishna. Findernet: A data augmentation free canoni- calization aided loop detection and closure technique for point clouds in 6-dof separation. InProceedings of the IEEE/CVF Winter Conference on Applications of Com- puter Vision, pages 8399–8408, 2...

  29. [29]

    Cambridge university press,

    Richard Hartley and Andrew Zisserman.Multiple view geometry in computer vision. Cambridge university press,

  30. [30]

    Determining op- tical flow.Artificial intelligence, 17(1-3):185–203, 1981

    Berthold KP Horn and Brian G Schunck. Determining op- tical flow.Artificial intelligence, 17(1-3):185–203, 1981. 2

  31. [31]

    Pow3r: Empow- ering unconstrained 3d reconstruction with camera and scene priors

    Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, and Jerome Revaud. Pow3r: Empow- ering unconstrained 3d reconstruction with camera and scene priors. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1071–1081, 2025. 3

  32. [32]

    Vision- based 2.5 d terrain modeling for humanoid locomotion

    Satoshi Kagami, Koichi Nishiwaki, James J Kuffner, Kei Okada, Masayuki Inaba, and Hirochika Inoue. Vision- based 2.5 d terrain modeling for humanoid locomotion. In2003 IEEE International Conference on Robotics and Automation (Cat. No. 03CH37422), pages 2141–2146. IEEE, 2003. 5

  33. [33]

    Anyloc: Towards universal visual place recognition.IEEE Robotics and Automation Letters, 9(2):1286–1293, 2023

    Nikhil Keetha, Avneesh Mishra, Jay Karhade, Kr- ishna Murthy Jatavallabhula, Sebastian Scherer, Madhava Krishna, and Sourav Garg. Anyloc: Towards universal visual place recognition.IEEE Robotics and Automation Letters, 9(2):1286–1293, 2023. 2, 5, 16

  34. [34]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman M ¨uller, Johannes Sch ¨onberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. Mapanything: Universal feed-forward metric 3d reconstruction.arXiv preprint arXiv:2509.13414, 2025. 3

  35. [35]

    Optimizing disparity for motion in depth

    Petr Kellnhofer, Tobias Ritschel, Karol Myszkowski, and Hans-Peter Seidel. Optimizing disparity for motion in depth. InComputer Graphics Forum, pages 143–152. Wiley Online Library, 2013. 16

  36. [36]

    Hillshading of terrain using layer tints with aspect-variant luminosity

    Patrick J Kennelly and A Jon Kimerling. Hillshading of terrain using layer tints with aspect-variant luminosity. Cartography and Geographic Information Science, 31(2): 67–77, 2004. 15

  37. [37]

    G-cut3r: Guided 3d reconstruction with camera and depth prior integration

    Ramil Khafizov, Artem Komarichev, Ruslan Rakhimov, Peter Wonka, and Evgeny Burnaev. G-cut3r: Guided 3d reconstruction with camera and depth prior integration. arXiv preprint arXiv:2508.11379, 2025. 8

  38. [38]

    PhD thesis, The University of Waikato,

    Ashraf Masood Kibriya.Fast algorithms for nearest neighbour search. PhD thesis, The University of Waikato,

  39. [39]

    Cosine similarity to determine similarity measure: Study case in online essay assessment

    Alfirna Rizqi Lahitani, Adhistya Erna Permanasari, and Noor Akhmad Setiawan. Cosine similarity to determine similarity measure: Study case in online essay assessment. In2016 4th International conference on cyber and IT service management, pages 1–6. IEEE, 2016. 16

  40. [40]

    Delving into the devils of bird’s- eye-view perception: A review, evaluation and recipe

    Hongyang Li, Chonghao Sima, Jifeng Dai, Wenhai Wang, Lewei Lu, Huijie Wang, Jia Zeng, Zhiqi Li, Jiazhi Yang, Hanming Deng, et al. Delving into the devils of bird’s- eye-view perception: A review, evaluation and recipe. IEEE Transactions on Pattern Analysis and Machine In- telligence, 46(4):2151–2170, 2023. 14

  41. [41]

    Robust estimation of sim- ilarity transformation for visual object tracking

    Yang Li, Jianke Zhu, Steven CH Hoi, Wenjie Song, Zhe- feng Wang, and Hantang Liu. Robust estimation of sim- ilarity transformation for visual object tracking. InPro- ceedings of the AAAI conference on artificial intelligence, pages 8666–8673, 2019. 6

  42. [42]

    V oxformer: Sparse voxel transformer for camera-based 3d semantic scene completion

    Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M Alvarez, Sanja Fidler, Chen Feng, and Anima Anandkumar. V oxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9087–9098, 2023. 2

  43. [43]

    Lightglue: Local feature matching at light speed

    Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. InProceedings of the IEEE/CVF international conference on computer vision, pages 17627–17638, 2023. 2

  44. [44]

    Deep patch visual slam

    Lahav Lipson, Zachary Teed, and Jia Deng. Deep patch visual slam. InEuropean Conference on Computer Vision, pages 424–440. Springer, 2024. 7, 8

  45. [45]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer, 2024. 6

  46. [46]

    On 7-dimensional lie algebras admitting levi-nondegenerate orbits in c 4.Trudy Moskovskogo Matematicheskogo Obshchestva, 84(2):205– 230, 2023

    Alexander Vasil’evich Loboda. On 7-dimensional lie algebras admitting levi-nondegenerate orbits in c 4.Trudy Moskovskogo Matematicheskogo Obshchestva, 84(2):205– 230, 2023. 17

  47. [47]

    Distinctive image features from scale- invariant keypoints.International journal of computer vision, 60(2):91–110, 2004

    David G Lowe. Distinctive image features from scale- invariant keypoints.International journal of computer vision, 60(2):91–110, 2004. 2

  48. [48]

    Visual place recognition: A survey.ieee transactions on robotics, 32(1):1–19, 2015

    Stephanie Lowry, Niko S¨underhauf, Paul Newman, John J Leonard, David Cox, Peter Corke, and Michael J Milford. Visual place recognition: A survey.ieee transactions on robotics, 32(1):1–19, 2015. 2

  49. [49]

    An iterative image reg- istration technique with an application to stereo vision

    Bruce D Lucas and Takeo Kanade. An iterative image reg- istration technique with an application to stereo vision. In IJCAI’81: 7th international joint conference on Artificial intelligence, pages 674–679, 1981. 2, 3

  50. [50]

    arXiv preprint arXiv:2505.12549 (2025)

    Dominic Maggio, Hyungtae Lim, and Luca Carlone. Vggt-slam: Dense rgb slam optimized on the sl (4) mani- fold.arXiv preprint arXiv:2505.12549, 2025. 2, 3, 7, 8, 13

  51. [51]

    Multi-class generative adversar- ial networks with the l2 loss function.arXiv preprint arXiv:1611.04076, 5:1057–7149, 2016

    Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, and Zhen Wang. Multi-class generative adversar- ial networks with the l2 loss function.arXiv preprint arXiv:1611.04076, 5:1057–7149, 2016. 16

  52. [52]

    Semanticfusion: Dense 3d semantic mapping with convolutional neural networks

    John McCormac, Ankur Handa, Andrew Davison, and Stefan Leutenegger. Semanticfusion: Dense 3d semantic mapping with convolutional neural networks. In2017 IEEE International Conference on Robotics and automa- tion (ICRA), pages 4628–4635. IEEE, 2017. 2

  53. [53]

    Jacobian varieties

    James S Milne. Jacobian varieties. InArithmetic geometry, pages 167–212. Springer, 1986. 17

  54. [54]

    Orb-slam2: An open- source slam system for monocular, stereo, and rgb-d cam- eras.IEEE transactions on robotics, 33(5):1255–1262,

    Raul Mur-Artal and Juan D Tard´os. Orb-slam2: An open- source slam system for monocular, stereo, and rgb-d cam- eras.IEEE transactions on robotics, 33(5):1255–1262,

  55. [55]

    Orb-slam: A versatile and accurate monocular slam system.IEEE transactions on robotics, 31(5):1147– 1163, 2015

    Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: A versatile and accurate monocular slam system.IEEE transactions on robotics, 31(5):1147– 1163, 2015. 2, 4, 13

  56. [56]

    Mast3r-slam: Real-time dense slam with 3d reconstruc- tion priors

    Riku Murai, Eric Dexheimer, and Andrew J Davison. Mast3r-slam: Real-time dense slam with 3d reconstruc- tion priors. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16695–16705,

  57. [57]

    A 2.5 d map-based mobile robot localization via cooperation of aerial and ground robots.Sensors, 17(12):2730, 2017

    Tae Hyeon Nam, Jae Hong Shim, and Young Im Cho. A 2.5 d map-based mobile robot localization via cooperation of aerial and ground robots.Sensors, 17(12):2730, 2017. 5

  58. [58]

    Vi- sual odometry

    David Nist´er, Oleg Naroditsky, and James Bergen. Vi- sual odometry. InProceedings of the 2004 IEEE Com- puter Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., pages I–I. Ieee, 2004. 2

  59. [59]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timoth´ee Darcet, Th´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023. 2, 3, 5, 13, 15, 16

  60. [60]

    Towards accurate loop closure detection in semantic slam with 3d seman- tic covisibility graphs.IEEE Robotics and Automation Letters, 7(2):2455–2462, 2022

    Zhentian Qian, Jie Fu, and Jing Xiao. Towards accurate loop closure detection in semantic slam with 3d seman- tic covisibility graphs.IEEE Robotics and Automation Letters, 7(2):2455–2462, 2022. 2, 4

  61. [61]

    Orb: An efficient alternative to sift or surf

    Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In 2011 International conference on computer vision, pages 2564–2571. Ieee, 2011. 2

  62. [62]

    Superglue: Learning feature matching with graph neural networks

    Paul-Edouard Sarlin, Daniel DeTone, Tomasz Mal- isiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4938–4947, 2020. 2

  63. [63]

    Visual odometry [tutorial].IEEE robotics & automation maga- zine, 18(4):80–92, 2011

    Davide Scaramuzza and Friedrich Fraundorfer. Visual odometry [tutorial].IEEE robotics & automation maga- zine, 18(4):80–92, 2011. 2

  64. [64]

    Scene coordinate regression forests for camera relocalization in rgb-d images

    Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2930– 2937, 2013. 6, 12

  65. [65]

    DINOv3

    Oriane Sim´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 2

  66. [66]

    Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs.arXiv preprint arXiv:2408.13912, 2024

    Brandon Smart, Chuanxia Zheng, Iro Laina, and Vic- tor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs.arXiv preprint arXiv:2408.13912, 2024. 3

  67. [67]

    Simultaneous localization and mapping

    Cyrill Stachniss, John J Leonard, and Sebastian Thrun. Simultaneous localization and mapping. InSpringer hand- book of robotics, pages 1153–1176. Springer, 2016. 1

  68. [68]

    On the early history of the singular value decomposition.SIAM review, 35(4):551–566, 1993

    Gilbert W Stewart. On the early history of the singular value decomposition.SIAM review, 35(4):551–566, 1993. 5, 16

  69. [69]

    Scale drift-aware large scale monocular slam.Robotics: science and Systems VI, 2(3):7, 2010

    Hauke Strasdat, J Montiel, and Andrew J Davison. Scale drift-aware large scale monocular slam.Robotics: science and Systems VI, 2(3):7, 2010. 2, 13

  70. [70]

    A benchmark for the eval- uation of rgb-d slam systems

    J¨urgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the eval- uation of rgb-d slam systems. In2012 IEEE/RSJ interna- tional conference on intelligent robots and systems, pages 573–580. IEEE, 2012. 6, 12

  71. [71]

    Openvslam: A versatile visual slam framework

    Shinya Sumikura, Mikiya Shibuya, and Ken Sakurada. Openvslam: A versatile visual slam framework. InPro- ceedings of the 27th ACM international conference on multimedia, pages 2292–2295, 2019. 2

  72. [72]

    Loftr: Detector-free local feature match- ing with transformers

    Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature match- ing with transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8922–8931, 2021. 2

  73. [73]

    Towards a robust back-end for pose graph slam

    Niko S ¨underhauf and Peter Protzel. Towards a robust back-end for pose graph slam. In2012 IEEE international conference on robotics and automation, pages 1254–1261. IEEE, 2012. 2

  74. [74]

    Cnn-slam: Real-time dense monocular slam with learned depth prediction

    Keisuke Tateno, Federico Tombari, Iro Laina, and Nassir Navab. Cnn-slam: Real-time dense monocular slam with learned depth prediction. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6243–6252, 2017. 2

  75. [75]

    Deep V2D : Video to depth with differentiable structure from motion

    Zachary Teed and Jia Deng. Deepv2d: Video to depth with differentiable structure from motion.arXiv preprint arXiv:1812.04605, 2018. 8

  76. [76]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. InEuropean conference on computer vision, pages 402–419. Springer, 2020. 2

  77. [77]

    Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras.Advances in neural information processing systems, 34:16558–16569,

    Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras.Advances in neural information processing systems, 34:16558–16569,

  78. [78]

    Probabilistic robotics.Communications of the ACM, 45(3):52–57, 2002

    Sebastian Thrun. Probabilistic robotics.Communications of the ACM, 45(3):52–57, 2002. 1

  79. [79]

    Tangent space estimation for smooth embeddings of riemannian manifolds®.Information and Inference: A Journal of the IMA, 2(1):69–114, 2013

    Hemant Tyagi, Elıf Vural, and Pascal Frossard. Tangent space estimation for smooth embeddings of riemannian manifolds®.Information and Inference: A Journal of the IMA, 2(1):69–114, 2013. 6

  80. [80]

    Demon: Depth and motion network for learning monocular stereo

    Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. Demon: Depth and motion network for learning monocular stereo. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5038–5047, 2017. 2

Showing first 80 references.