pith. machine review for the scientific record. sign in

arxiv: 2604.08718 · v2 · submitted 2026-04-09 · 💻 cs.CV · cs.AI· cs.RO

Recognition: unknown

Accelerating Transformer-Based Monocular SLAM via Geometric Utility Scoring

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.RO
keywords monocular SLAMGeometric Foundation Modelsframe gatingcomputational efficiencykeyframe selectionlightweight networkgeometric utilitytransformer acceleration
0
0 comments X

The pith

A lightweight predictor scores each frame's mapping value before expensive geometric decoding, letting monocular SLAM skip most redundant frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Geometric Foundation Models deliver strong 3D priors for monocular SLAM but force full costly processing on every frame just to learn whether the frame adds anything new. LeanGate inserts a small feed-forward network that estimates geometric utility ahead of the heavy stages and rejects low-value frames early. This early filter removes more than 90 percent of the redundant work while the final map and tracking accuracy stay the same as the dense baseline. The method therefore turns a dense, slow pipeline into one that runs five times faster on standard benchmarks.

Core claim

LeanGate is a lightweight feed-forward network that outputs a geometric utility score for each incoming video frame. The score is computed before any dense feature extraction or matching from the Geometric Foundation Model, so frames judged low-value are dropped without performing the expensive steps. On standard SLAM benchmarks this gating cuts tracking floating-point operations by more than 85 percent, raises end-to-end throughput by a factor of five, and preserves the tracking and mapping accuracy of the original dense system.

What carries the argument

LeanGate, a lightweight feed-forward frame-gating network that predicts a geometric utility score to decide a frame's mapping value before heavy GFM feature extraction and matching.

If this is right

  • More than 90 percent of incoming frames can be rejected without ever invoking dense GFM decoding.
  • Tracking floating-point operations drop by over 85 percent on standard benchmarks.
  • End-to-end throughput rises by a factor of five with no drop in accuracy.
  • The gating module works as a plug-and-play addition to existing GFM-based SLAM pipelines.
  • Keyframe selection happens before rather than after the costly geometric stage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same early-rejection idea could apply to other dense transformer pipelines where most inputs add little new information.
  • Running the utility predictor on-device could let SLAM operate longer on battery-powered robots without custom accelerators.
  • If the predictor is trained jointly with the downstream SLAM loss, its decisions might become even more aligned with final map quality.
  • Extending the score to also estimate expected tracking drift could further reduce unnecessary bundle-adjustment steps.

Load-bearing premise

The lightweight network can correctly judge a frame's future mapping usefulness from its raw pixels without running the full expensive geometric decoding first.

What would settle it

Apply LeanGate to a long video sequence containing subtle but critical new geometry in frames the predictor rejects, then measure whether the final map completeness or tracking error falls below the dense baseline.

Figures

Figures reproduced from arXiv: 2604.08718 by Andrew Feng, Bangya Liu, Dayou Li, Hao Wang, Mingyu Ding, Nuo Chen, Suman Banerjee, Xinmiao Xiong, Yang Zhou, Zhiwen Fan.

Figure 1
Figure 1. Figure 1: We present LeanGate, a geometry-aware lightweight frame gating network that bypasses over [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: ATE and runtime under naive stride policies across [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Averaged per-scene time breakdown profiled on [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: System Overview for LeanGate. Left: score-only distillation. Right: MASt3R-SLAM slimmed by LeanGate. The [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative 3D trajectory comparisons on TUM [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
read the original abstract

Geometric Foundation Models (GFMs) have recently advanced monocular SLAM by providing robust, calibration-free 3D priors. However, deploying these models on dense video streams introduces significant computational redundancy. Current GFM-based SLAM systems typically rely on post hoc keyframe selection. Because of this, they must perform expensive dense geometric decoding simply to determine whether a frame contains novel geometry, resulting in late rejection and wasted computation. To mitigate this inefficiency, we propose LeanGate, a lightweight feed-forward frame-gating network. LeanGate predicts a geometric utility score to assess a frame's mapping value prior to the heavy GFM feature extraction and matching stages. As a predictive plug-and-play module, our approach bypasses over 90% of redundant frames. Evaluations on standard SLAM benchmarks demonstrate that LeanGate reduces tracking FLOPs by more than 85% and achieves a 5x end-to-end throughput speedup. Furthermore, it maintains the tracking and mapping accuracy of dense baselines. Project page: https://lean-gate.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes LeanGate, a lightweight feed-forward neural network that predicts a geometric utility score for input frames in GFM-based monocular SLAM. This score is used to gate frames before performing expensive dense geometric feature extraction and matching, thereby avoiding redundant computation on frames with low mapping value. The method is presented as a plug-and-play module that can be inserted into existing pipelines; the abstract reports that it bypasses over 90% of redundant frames, reduces tracking FLOPs by more than 85%, delivers a 5x end-to-end throughput increase, and preserves the tracking and mapping accuracy of dense baselines on standard SLAM benchmarks.

Significance. If the reported speedups and accuracy preservation hold under rigorous verification, the work addresses a practical deployment bottleneck in recent GFM-SLAM systems by moving keyframe selection upstream of the heavy transformer stages. The plug-and-play design and substantial reported FLOPs/throughput gains could improve real-time viability on resource-limited hardware without requiring changes to the underlying GFM or SLAM backend. The empirical nature of the contribution (trained predictor rather than analytic derivation) makes the strength of the result dependent on the quality and breadth of the experimental validation.

major comments (2)
  1. [§4] §4 (Experimental Setup): the abstract claims maintenance of 'tracking and mapping accuracy of dense baselines,' but without reported error bars, number of runs, or per-sequence breakdowns it is impossible to assess whether the observed differences are statistically significant or merely within the variance of the dense baseline itself.
  2. [§3.2] §3.2 (LeanGate Architecture): the claim that the lightweight predictor accurately forecasts mapping value 'prior to the heavy GFM feature extraction' is load-bearing for the entire speedup argument; the manuscript must demonstrate that the predictor's false-negative rate on high-utility frames does not degrade downstream mapping quality beyond the reported tolerance.
minor comments (3)
  1. [Abstract] The abstract states 'bypasses over 90% of redundant frames' while the results claim 'more than 85% FLOPs reduction'; these two figures should be reconciled with a precise definition of 'redundant' and an explicit mapping from bypassed frames to measured FLOPs savings.
  2. [§3] Notation for the utility score (e.g., how it is normalized or thresholded) is introduced without a dedicated equation; adding a short definition in §3 would improve clarity.
  3. [Abstract] The project page URL is given but no supplementary material link or code repository is mentioned; providing at least a pointer to the trained model weights would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. We address each major comment below with clarifications and commit to targeted revisions that will improve the statistical rigor and validation of our claims.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup): the abstract claims maintenance of 'tracking and mapping accuracy of dense baselines,' but without reported error bars, number of runs, or per-sequence breakdowns it is impossible to assess whether the observed differences are statistically significant or merely within the variance of the dense baseline itself.

    Authors: We agree that the absence of error bars and run statistics limits the ability to evaluate statistical significance. In the revised manuscript we will expand Section 4 to report error bars computed over five independent runs (different random seeds for both training and evaluation) on the TUM RGB-D and EuRoC sequences. We will also add per-sequence tables for ATE, RPE, and mapping completeness, allowing direct comparison of observed differences against the variance of the dense baseline. These additions will confirm that any small deviations remain within the baseline variance. revision: yes

  2. Referee: [§3.2] §3.2 (LeanGate Architecture): the claim that the lightweight predictor accurately forecasts mapping value 'prior to the heavy GFM feature extraction' is load-bearing for the entire speedup argument; the manuscript must demonstrate that the predictor's false-negative rate on high-utility frames does not degrade downstream mapping quality beyond the reported tolerance.

    Authors: We acknowledge that an explicit quantification of the false-negative rate is required to fully support the upstream gating argument. In the revised Section 3.2 we will insert a dedicated analysis that measures the false-negative rate on held-out validation sequences and correlates it with downstream mapping quality (reconstruction density and tracking drift). We will further include a threshold-sensitivity ablation showing that the resulting quality degradation stays within the tolerance already demonstrated by the main accuracy tables. These additions will directly substantiate that the predictor preserves mapping fidelity prior to GFM extraction. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical gating validated by benchmarks

full rationale

The paper presents LeanGate as a lightweight feed-forward network trained to predict per-frame geometric utility scores, which are then used to gate expensive GFM processing in monocular SLAM. The central claims (85%+ FLOP reduction, 5x throughput, preserved accuracy) are framed as empirical outcomes measured on standard SLAM benchmarks rather than derived from first-principles equations or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the results; the predictor is an independent learned component whose accuracy is externally falsifiable. The derivation chain therefore remains self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The contribution rests on standard deep learning assumptions for training a predictor; no explicit free parameters or invented physical entities are stated in the abstract.

axioms (1)
  • domain assumption A lightweight neural network can be trained to predict geometric utility of frames for SLAM mapping.
    Core premise enabling the early-gating design.
invented entities (1)
  • LeanGate no independent evidence
    purpose: Lightweight feed-forward frame-gating network for geometric utility scoring.
    The proposed module is the main new artifact; evidence is limited to the authors' reported experiments.

pith-pipeline@v0.9.0 · 5506 in / 1347 out tokens · 124902 ms · 2026-05-10T17:38:52.927215+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages

  1. [1]

    Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,

    C. Cadena, L. Carlone, H. Carrillo, Y . Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,”IEEE T-RO, vol. 32, no. 6, 2017

  2. [2]

    Pose estimation for augmented reality: a hands-on survey,

    E. Marchand, H. Uchiyama, and F. Spindler, “Pose estimation for augmented reality: a hands-on survey,”IEEE TVCG, vol. 22, no. 12, 2015

  3. [3]

    Semi-dense visual odometry for ar on a smartphone,

    T. Schöps, J. Engel, and D. Cremers, “Semi-dense visual odometry for ar on a smartphone,” inISMAR, 2014, pp. 145–150

  4. [4]

    Parallel tracking and mapping for small ar workspaces,

    G. Klein and D. Murray, “Parallel tracking and mapping for small ar workspaces,” inISMAR, 2007, pp. 225–234

  5. [5]

    Orb-slam: A versatile and accurate monocular slam system,

    R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: A versatile and accurate monocular slam system,”IEEE T-RO, vol. 31, no. 5, 2015

  6. [6]

    Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,

    R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,”IEEE T-RO, vol. 33, no. 5, 2017

  7. [7]

    Multi-view stereo: A tutorial,

    Y . Furukawa and C. Hernández, “Multi-view stereo: A tutorial,”FnT CGV, vol. 9, no. 1-2, 2015

  8. [8]

    Large-scale direct slam with stereo cameras,

    J. Engel, J. Stückler, and D. Cremers, “Large-scale direct slam with stereo cameras,” inIROS, 2015, pp. 1935–1942

  9. [9]

    Pix- elwise view selection for unstructured multi-view stereo,

    J. L. Schönberger, E. Zheng, J.-M. Frahm, and M. Pollefeys, “Pix- elwise view selection for unstructured multi-view stereo,” inECCV, 2016, pp. 501–518

  10. [10]

    Mvsnet: Depth inference for unstructured multi-view stereo,

    Y . Yao, Z. Luo, S. Li, T. Fang, and L. Quan, “Mvsnet: Depth inference for unstructured multi-view stereo,” inECCV, 2018, pp. 767–783

  11. [11]

    Dtam: Dense tracking and mapping in real-time,

    R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “Dtam: Dense tracking and mapping in real-time,” inICCV, 2011, pp. 2320–2327

  12. [12]

    Deeptam: Deep tracking and mapping,

    H. Zhou, B. Ummenhofer, and T. Brox, “Deeptam: Deep tracking and mapping,” inECCV, 2018, pp. 822–838

  13. [13]

    Codeslam—learning a compact, optimisable representation for dense visual slam,

    M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison, “Codeslam—learning a compact, optimisable representation for dense visual slam,” inCVPR, 2018, pp. 2560–2568

  14. [14]

    Deepfactors: Real-time probabilistic dense monocular slam,

    J. Czarnowski, T. Laidlow, R. Clark, and A. J. Davison, “Deepfactors: Real-time probabilistic dense monocular slam,”IEEE RA-L, vol. 5, no. 2, 2020

  15. [15]

    Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras,

    Z. Teed and J. Deng, “Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras,”NeurIPS, vol. 34, pp. 16 558–16 569, 2021

  16. [16]

    Dust3r: Geometric 3d vision made easy,

    S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” inCVPR, 2024, pp. 20 697–20 709

  17. [17]

    Grounding image matching in 3d with mast3r,

    V . Leroy, Y . Cabon, and J. Revaud, “Grounding image matching in 3d with mast3r,” inECCV, 2024, pp. 71–91

  18. [18]

    Vggt: Visual geometry grounded transformer,

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” inCVPR, 2025, pp. 5294–5306

  19. [19]

    Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion,

    B. P. Duisterhof, L. Zust, P. Weinzaepfel, V . Leroy, Y . Cabon, and J. Revaud, “Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion,” in3DV, 2025, pp. 1–10

  20. [20]

    MASt3R-SLAM: Real- time dense SLAM with 3D reconstruction priors,

    R. Murai, E. Dexheimer, and A. J. Davison, “MASt3R-SLAM: Real- time dense SLAM with 3D reconstruction priors,” inCVPR, 2025

  21. [21]

    Instantsplat: Un- bounded sparse-view pose-free gaussian splatting in 40 sec- onds.arXiv preprint arXiv:2403.20309, 2024

    Z. Fan, W. Cong, K. Wen, K. Wang, J. Zhang, X. Ding, D. Xu, B. Ivanovic, M. Pavone, G. Pavlakoset al., “Instantsplat: Sparse-view gaussian splatting in seconds,”arXiv:2403.20309, 2024

  22. [22]

    A benchmark for the evaluation of rgb-d slam systems,

    J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” inIROS, 2012

  23. [23]

    Real-time rgb-d camera relocalization,

    B. Glocker, S. Izadi, J. Shotton, and A. Criminisi, “Real-time rgb-d camera relocalization,” inISMAR, 2013, pp. 173–179

  24. [24]

    The euroc micro aerial vehicle datasets,

    M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The euroc micro aerial vehicle datasets,” IJRR, 2016

  25. [25]

    Vision meets robotics: The kitti dataset,

    A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,”IJRR, vol. 32, no. 11, 2013

  26. [26]

    Bad slam: Bundle adjusted direct rgb-d slam,

    T. Schops, T. Sattler, and M. Pollefeys, “Bad slam: Bundle adjusted direct rgb-d slam,” inCVPR, 2019, pp. 134–144

  27. [27]

    Introducing slambench, a performance and accuracy benchmarking methodology for slam,

    L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. Kelly, A. J. Davison, M. Luján, M. F. O’Boyle, G. Rileyet al., “Introducing slambench, a performance and accuracy benchmarking methodology for slam,” inICRA, 2015, pp. 5783–5790

  28. [28]

    arXiv e-prints , keywords =

    J. Engel, V . Usenko, and D. Cremers, “A photometrically calibrated benchmark for monocular visual odometry,”arXiv:1607.02555, 2016

  29. [29]

    Svo: Fast semi-direct monocular visual odometry,

    C. Forster, M. Pizzoli, and D. Scaramuzza, “Svo: Fast semi-direct monocular visual odometry,” inICRA, 2014, pp. 15–22

  30. [30]

    Lsd-slam: Large-scale direct monocular slam,

    J. Engel, T. Schöps, and D. Cremers, “Lsd-slam: Large-scale direct monocular slam,” inECCV, 2014, pp. 834–849

  31. [31]

    Superpoint: Self- supervised interest point detection and description,

    D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self- supervised interest point detection and description,” inCVPRW, 2018, pp. 224–236

  32. [32]

    D2-net: A trainable cnn for joint description and detection of local features,

    M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler, “D2-net: A trainable cnn for joint description and detection of local features,” inCVPR, 2019, pp. 8092–8101

  33. [33]

    Su- perglue: Learning feature matching with graph neural networks,

    P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Su- perglue: Learning feature matching with graph neural networks,” in CVPR, 2020, pp. 4938–4947

  34. [34]

    Loftr: Detector- free local feature matching with transformers,

    J. Sun, Z. Shen, Y . Wang, H. Bao, and X. Zhou, “Loftr: Detector- free local feature matching with transformers,” inCVPR, 2021, pp. 8922–8931

  35. [35]

    Ba-net: Dense bundle ad- justment network.arXiv preprint arXiv:1806.04807, 2018

    C. Tang and P. Tan, “Ba-net: Dense bundle adjustment network,” arXiv:1806.04807, 2018

  36. [36]

    Raft: Recurrent all-pairs field transforms for optical flow,

    Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms for optical flow,” inECCV, 2020, pp. 402–419

  37. [37]

    Gms: Grid-based motion statistics for fast, ultra-robust feature correspondence,

    J. Bian, W.-Y . Lin, Y . Matsushita, S.-K. Yeung, T.-D. Nguyen, and M.-M. Cheng, “Gms: Grid-based motion statistics for fast, ultra-robust feature correspondence,” inCVPR, 2017, pp. 4181–4190

  38. [38]

    Dkm: Dense kernelized feature matching for geometry estimation,

    J. Edstedt, I. Athanasiadis, M. Wadenbäck, and M. Felsberg, “Dkm: Dense kernelized feature matching for geometry estimation,” inCVPR, 2023, pp. 17 765–17 775

  39. [39]

    Pdc-net+: Enhanced probabilistic dense correspondence network,

    P. Truong, M. Danelljan, R. Timofte, and L. Van Gool, “Pdc-net+: Enhanced probabilistic dense correspondence network,”IEEE TPAMI, vol. 45, no. 8, 2023

  40. [40]

    Aslfeat: Learning local features of accurate shape and localization,

    Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y . Yao, S. Li, T. Fang, and L. Quan, “Aslfeat: Learning local features of accurate shape and localization,” inCVPR, 2020, pp. 6589–6598

  41. [41]

    Lightglue: Local feature matching at light speed,

    P. Lindenberger, P.-E. Sarlin, and M. Pollefeys, “Lightglue: Local feature matching at light speed,” inICCV, 2023, pp. 17 627–17 638

  42. [42]

    Key.net: Keypoint detection by handcrafted and learned cnn filters,

    J. Barroso-Laguna, E. Riba, D. Ponsa, and K. Mikolajczyk, “Key.net: Keypoint detection by handcrafted and learned cnn filters,” inICCV, 2019

  43. [43]

    Segformer: Simple and efficient design for semantic segmentation with transformers,

    E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,”NeurIPS, vol. 34, pp. 12 077–12 090, 2021

  44. [44]

    Structure-from-motion revisited,

    J. L. Schonberger and J.-M. Frahm, “Structure-from-motion revisited,” inCVPR, 2016, pp. 4104–4113

  45. [45]

    imap: Implicit mapping and positioning in real-time,

    E. Sucar, S. Liu, J. Ortiz, and A. J. Davison, “imap: Implicit mapping and positioning in real-time,” inICCV, 2021, pp. 6229–6238

  46. [46]

    Nice-slam: Neural implicit scalable encoding for slam,

    Z. Zhu, S. Peng, V . Larsson, W. Xu, H. Bao, Z. Cui, M. R. Oswald, and M. Pollefeys, “Nice-slam: Neural implicit scalable encoding for slam,” inCVPR, 2022, pp. 12 786–12 796

  47. [47]

    Ec3r-slam: Efficient and consistent monocular dense slam with feed-forward 3d reconstruction,

    L. Hu, N. A. Oufroukh, F. Bonardi, and R. Ghandour, “Ec3r-slam: Efficient and consistent monocular dense slam with feed-forward 3d reconstruction,” 2025

  48. [48]

    3d gaussian splatting for real-time radiance field rendering

    B. Kerbl, G. Kopanas, T. Leimkühler, G. Drettakiset al., “3d gaussian splatting for real-time radiance field rendering.”ACM TOG, vol. 42, no. 4, 2023

  49. [49]

    Lamar: Benchmarking localization and mapping for augmented reality,

    P.-E. Sarlin, M. Dusmanu, J. L. Schönberger, P. Speciale, L. Gruber, V . Larsson, O. Miksik, and M. Pollefeys, “Lamar: Benchmarking localization and mapping for augmented reality,” inECCV, 2022, pp. 686–704

  50. [50]

    Map-free visual relocalization: Metric pose relative to a single image,

    E. Arnold, J. Wynn, S. Vicente, G. Garcia-Hernando, A. Monszpart, V . Prisacariu, D. Turmukhambetov, and E. Brachmann, “Map-free visual relocalization: Metric pose relative to a single image,” inECCV, 2022, pp. 690–708

  51. [51]

    Accelerated coordi- nate encoding: Learning to relocalize in minutes using rgb and poses,

    E. Brachmann, T. Cavallari, and V . A. Prisacariu, “Accelerated coordi- nate encoding: Learning to relocalize in minutes using rgb and poses,” inCVPR, 2023, pp. 5044–5053

  52. [52]

    Vggt-slam: Dense rgb slam optimized on the sl (4) manifold,

    D. Maggio, H. Lim, and L. Carlone, “Vggt-slam: Dense rgb slam optimized on the sl (4) manifold,”NeurIPS, vol. 39, 2025

  53. [53]

    VGGT-SLAM 2.0: Real-time Dense Feed- forward Scene Reconstruction,

    D. Maggio and L. Carlone, “Vggt-slam 2.0: Real time dense feed- forward scene reconstruction,”arXiv preprint arXiv:2601.19887, 2026

  54. [54]

    Hartley and A

    R. Hartley and A. Zisserman,Multiple view geometry in computer vision. Cambridge university press, 2003

  55. [55]

    Bundle adjustment—a modern synthesis,

    B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle adjustment—a modern synthesis,” inVision Algorithms, 1999, pp. 298–372

  56. [56]

    Point-slam: Dense neural point cloud-based slam,

    E. Sandström, Y . Li, L. Van Gool, and M. R. Oswald, “Point-slam: Dense neural point cloud-based slam,” inICCV, 2023, pp. 18 433– 18 444

  57. [57]

    Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,

    C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós, “Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,”IEEE T-RO, vol. 37, no. 6, 2021

  58. [58]

    Deep patch visual slam,

    L. Lipson, Z. Teed, and J. Deng, “Deep patch visual slam,” inECCV, 2024, pp. 424–440

  59. [59]

    Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views,

    S. Zhang, J. Wang, Y . Xu, N. Xue, C. Rupprecht, X. Zhou, Y . Shen, and G. Wetzstein, “Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views,” 2025

  60. [60]

    Scannet++: A high-fidelity dataset of 3d indoor scenes,

    C. Yeshwanth, Y .-C. Liu, M. Nießner, and A. Dai, “Scannet++: A high-fidelity dataset of 3d indoor scenes,” inICCV, 2023, pp. 12–22

  61. [61]

    Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views,

    S. Zhang, J. Wang, Y . Xu, N. Xue, C. Rupprecht, X. Zhou, Y . Shen, and G. Wetzstein, “Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views,” inCVPR, 2025, pp. 21 936–21 947

  62. [62]

    MapAnything: Universal feed-forward metric 3D reconstruction,

    N. Keetha, N. Müller, J. Schönberger, L. Porzi, Y . Zhang, T. Fischer, A. Knapitsch, D. Zauss, E. Weber, N. Antunes, J. Luiten, M. Lopez- Antequera, S. R. Bulò, C. Richardt, D. Ramanan, S. Scherer, and P. Kontschieder, “MapAnything: Universal feed-forward metric 3D reconstruction,” inInternational Conference on 3D Vision (3DV). IEEE, 2026