arxiv: 2604.08718 · v2 · submitted 2026-04-09 · 💻 cs.CV · cs.AI· cs.RO

Recognition: unknown

Accelerating Transformer-Based Monocular SLAM via Geometric Utility Scoring

Xinmiao Xiong , Bangya Liu , Hao Wang , Dayou Li , Nuo Chen , Andrew Feng , Mingyu Ding , Suman Banerjee

show 2 more authors

Yang Zhou Zhiwen Fan

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.RO

keywords monocular SLAMGeometric Foundation Modelsframe gatingcomputational efficiencykeyframe selectionlightweight networkgeometric utilitytransformer acceleration

0 comments

The pith

A lightweight predictor scores each frame's mapping value before expensive geometric decoding, letting monocular SLAM skip most redundant frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Geometric Foundation Models deliver strong 3D priors for monocular SLAM but force full costly processing on every frame just to learn whether the frame adds anything new. LeanGate inserts a small feed-forward network that estimates geometric utility ahead of the heavy stages and rejects low-value frames early. This early filter removes more than 90 percent of the redundant work while the final map and tracking accuracy stay the same as the dense baseline. The method therefore turns a dense, slow pipeline into one that runs five times faster on standard benchmarks.

Core claim

LeanGate is a lightweight feed-forward network that outputs a geometric utility score for each incoming video frame. The score is computed before any dense feature extraction or matching from the Geometric Foundation Model, so frames judged low-value are dropped without performing the expensive steps. On standard SLAM benchmarks this gating cuts tracking floating-point operations by more than 85 percent, raises end-to-end throughput by a factor of five, and preserves the tracking and mapping accuracy of the original dense system.

What carries the argument

LeanGate, a lightweight feed-forward frame-gating network that predicts a geometric utility score to decide a frame's mapping value before heavy GFM feature extraction and matching.

If this is right

More than 90 percent of incoming frames can be rejected without ever invoking dense GFM decoding.
Tracking floating-point operations drop by over 85 percent on standard benchmarks.
End-to-end throughput rises by a factor of five with no drop in accuracy.
The gating module works as a plug-and-play addition to existing GFM-based SLAM pipelines.
Keyframe selection happens before rather than after the costly geometric stage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same early-rejection idea could apply to other dense transformer pipelines where most inputs add little new information.
Running the utility predictor on-device could let SLAM operate longer on battery-powered robots without custom accelerators.
If the predictor is trained jointly with the downstream SLAM loss, its decisions might become even more aligned with final map quality.
Extending the score to also estimate expected tracking drift could further reduce unnecessary bundle-adjustment steps.

Load-bearing premise

The lightweight network can correctly judge a frame's future mapping usefulness from its raw pixels without running the full expensive geometric decoding first.

What would settle it

Apply LeanGate to a long video sequence containing subtle but critical new geometry in frames the predictor rejects, then measure whether the final map completeness or tracking error falls below the dense baseline.

Figures

Figures reproduced from arXiv: 2604.08718 by Andrew Feng, Bangya Liu, Dayou Li, Hao Wang, Mingyu Ding, Nuo Chen, Suman Banerjee, Xinmiao Xiong, Yang Zhou, Zhiwen Fan.

**Figure 3.** Figure 3: ATE and runtime under naive stride policies across [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Averaged per-scene time breakdown profiled on [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗

**Figure 5.** Figure 5: System Overview for LeanGate. Left: score-only distillation. Right: MASt3R-SLAM slimmed by LeanGate. The [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative 3D trajectory comparisons on TUM [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

read the original abstract

Geometric Foundation Models (GFMs) have recently advanced monocular SLAM by providing robust, calibration-free 3D priors. However, deploying these models on dense video streams introduces significant computational redundancy. Current GFM-based SLAM systems typically rely on post hoc keyframe selection. Because of this, they must perform expensive dense geometric decoding simply to determine whether a frame contains novel geometry, resulting in late rejection and wasted computation. To mitigate this inefficiency, we propose LeanGate, a lightweight feed-forward frame-gating network. LeanGate predicts a geometric utility score to assess a frame's mapping value prior to the heavy GFM feature extraction and matching stages. As a predictive plug-and-play module, our approach bypasses over 90% of redundant frames. Evaluations on standard SLAM benchmarks demonstrate that LeanGate reduces tracking FLOPs by more than 85% and achieves a 5x end-to-end throughput speedup. Furthermore, it maintains the tracking and mapping accuracy of dense baselines. Project page: https://lean-gate.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LeanGate adds an early lightweight predictor to skip redundant frames in GFM-based monocular SLAM, delivering the reported 85% FLOPs cut and 5x speedup if the gate holds up outside the test sets.

read the letter

The core idea here is straightforward engineering: run a cheap feed-forward net on incoming frames to score their geometric utility before committing to the heavy GFM feature extraction and matching. This flips the usual post-hoc keyframe selection, where the expensive model runs first and then discards most output. The paper shows this bypasses over 90% of frames on standard benchmarks while keeping tracking and mapping accuracy in line with the dense baselines. That produces the claimed tracking FLOPs reduction above 85% and the 5x end-to-end throughput gain. Those numbers address a practical pain point for real-time robotics or AR deployments where dense video streams are the norm.

Referee Report

2 major / 3 minor

Summary. The paper proposes LeanGate, a lightweight feed-forward neural network that predicts a geometric utility score for input frames in GFM-based monocular SLAM. This score is used to gate frames before performing expensive dense geometric feature extraction and matching, thereby avoiding redundant computation on frames with low mapping value. The method is presented as a plug-and-play module that can be inserted into existing pipelines; the abstract reports that it bypasses over 90% of redundant frames, reduces tracking FLOPs by more than 85%, delivers a 5x end-to-end throughput increase, and preserves the tracking and mapping accuracy of dense baselines on standard SLAM benchmarks.

Significance. If the reported speedups and accuracy preservation hold under rigorous verification, the work addresses a practical deployment bottleneck in recent GFM-SLAM systems by moving keyframe selection upstream of the heavy transformer stages. The plug-and-play design and substantial reported FLOPs/throughput gains could improve real-time viability on resource-limited hardware without requiring changes to the underlying GFM or SLAM backend. The empirical nature of the contribution (trained predictor rather than analytic derivation) makes the strength of the result dependent on the quality and breadth of the experimental validation.

major comments (2)

[§4] §4 (Experimental Setup): the abstract claims maintenance of 'tracking and mapping accuracy of dense baselines,' but without reported error bars, number of runs, or per-sequence breakdowns it is impossible to assess whether the observed differences are statistically significant or merely within the variance of the dense baseline itself.
[§3.2] §3.2 (LeanGate Architecture): the claim that the lightweight predictor accurately forecasts mapping value 'prior to the heavy GFM feature extraction' is load-bearing for the entire speedup argument; the manuscript must demonstrate that the predictor's false-negative rate on high-utility frames does not degrade downstream mapping quality beyond the reported tolerance.

minor comments (3)

[Abstract] The abstract states 'bypasses over 90% of redundant frames' while the results claim 'more than 85% FLOPs reduction'; these two figures should be reconciled with a precise definition of 'redundant' and an explicit mapping from bypassed frames to measured FLOPs savings.
[§3] Notation for the utility score (e.g., how it is normalized or thresholded) is introduced without a dedicated equation; adding a short definition in §3 would improve clarity.
[Abstract] The project page URL is given but no supplementary material link or code repository is mentioned; providing at least a pointer to the trained model weights would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. We address each major comment below with clarifications and commit to targeted revisions that will improve the statistical rigor and validation of our claims.

read point-by-point responses

Referee: [§4] §4 (Experimental Setup): the abstract claims maintenance of 'tracking and mapping accuracy of dense baselines,' but without reported error bars, number of runs, or per-sequence breakdowns it is impossible to assess whether the observed differences are statistically significant or merely within the variance of the dense baseline itself.

Authors: We agree that the absence of error bars and run statistics limits the ability to evaluate statistical significance. In the revised manuscript we will expand Section 4 to report error bars computed over five independent runs (different random seeds for both training and evaluation) on the TUM RGB-D and EuRoC sequences. We will also add per-sequence tables for ATE, RPE, and mapping completeness, allowing direct comparison of observed differences against the variance of the dense baseline. These additions will confirm that any small deviations remain within the baseline variance. revision: yes
Referee: [§3.2] §3.2 (LeanGate Architecture): the claim that the lightweight predictor accurately forecasts mapping value 'prior to the heavy GFM feature extraction' is load-bearing for the entire speedup argument; the manuscript must demonstrate that the predictor's false-negative rate on high-utility frames does not degrade downstream mapping quality beyond the reported tolerance.

Authors: We acknowledge that an explicit quantification of the false-negative rate is required to fully support the upstream gating argument. In the revised Section 3.2 we will insert a dedicated analysis that measures the false-negative rate on held-out validation sequences and correlates it with downstream mapping quality (reconstruction density and tracking drift). We will further include a threshold-sensitivity ablation showing that the resulting quality degradation stays within the tolerance already demonstrated by the main accuracy tables. These additions will directly substantiate that the predictor preserves mapping fidelity prior to GFM extraction. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical gating validated by benchmarks

full rationale

The paper presents LeanGate as a lightweight feed-forward network trained to predict per-frame geometric utility scores, which are then used to gate expensive GFM processing in monocular SLAM. The central claims (85%+ FLOP reduction, 5x throughput, preserved accuracy) are framed as empirical outcomes measured on standard SLAM benchmarks rather than derived from first-principles equations or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to force the results; the predictor is an independent learned component whose accuracy is externally falsifiable. The derivation chain therefore remains self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The contribution rests on standard deep learning assumptions for training a predictor; no explicit free parameters or invented physical entities are stated in the abstract.

axioms (1)

domain assumption A lightweight neural network can be trained to predict geometric utility of frames for SLAM mapping.
Core premise enabling the early-gating design.

invented entities (1)

LeanGate no independent evidence
purpose: Lightweight feed-forward frame-gating network for geometric utility scoring.
The proposed module is the main new artifact; evidence is limited to the authors' reported experiments.

pith-pipeline@v0.9.0 · 5506 in / 1347 out tokens · 124902 ms · 2026-05-10T17:38:52.927215+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages

[1]

Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,

C. Cadena, L. Carlone, H. Carrillo, Y . Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,”IEEE T-RO, vol. 32, no. 6, 2017

work page 2017
[2]

Pose estimation for augmented reality: a hands-on survey,

E. Marchand, H. Uchiyama, and F. Spindler, “Pose estimation for augmented reality: a hands-on survey,”IEEE TVCG, vol. 22, no. 12, 2015

work page 2015
[3]

Semi-dense visual odometry for ar on a smartphone,

T. Schöps, J. Engel, and D. Cremers, “Semi-dense visual odometry for ar on a smartphone,” inISMAR, 2014, pp. 145–150

work page 2014
[4]

Parallel tracking and mapping for small ar workspaces,

G. Klein and D. Murray, “Parallel tracking and mapping for small ar workspaces,” inISMAR, 2007, pp. 225–234

work page 2007
[5]

Orb-slam: A versatile and accurate monocular slam system,

R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: A versatile and accurate monocular slam system,”IEEE T-RO, vol. 31, no. 5, 2015

work page 2015
[6]

Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,

R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,”IEEE T-RO, vol. 33, no. 5, 2017

work page 2017
[7]

Multi-view stereo: A tutorial,

Y . Furukawa and C. Hernández, “Multi-view stereo: A tutorial,”FnT CGV, vol. 9, no. 1-2, 2015

work page 2015
[8]

Large-scale direct slam with stereo cameras,

J. Engel, J. Stückler, and D. Cremers, “Large-scale direct slam with stereo cameras,” inIROS, 2015, pp. 1935–1942

work page 2015
[9]

Pix- elwise view selection for unstructured multi-view stereo,

J. L. Schönberger, E. Zheng, J.-M. Frahm, and M. Pollefeys, “Pix- elwise view selection for unstructured multi-view stereo,” inECCV, 2016, pp. 501–518

work page 2016
[10]

Mvsnet: Depth inference for unstructured multi-view stereo,

Y . Yao, Z. Luo, S. Li, T. Fang, and L. Quan, “Mvsnet: Depth inference for unstructured multi-view stereo,” inECCV, 2018, pp. 767–783

work page 2018
[11]

Dtam: Dense tracking and mapping in real-time,

R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “Dtam: Dense tracking and mapping in real-time,” inICCV, 2011, pp. 2320–2327

work page 2011
[12]

Deeptam: Deep tracking and mapping,

H. Zhou, B. Ummenhofer, and T. Brox, “Deeptam: Deep tracking and mapping,” inECCV, 2018, pp. 822–838

work page 2018
[13]

Codeslam—learning a compact, optimisable representation for dense visual slam,

M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison, “Codeslam—learning a compact, optimisable representation for dense visual slam,” inCVPR, 2018, pp. 2560–2568

work page 2018
[14]

Deepfactors: Real-time probabilistic dense monocular slam,

J. Czarnowski, T. Laidlow, R. Clark, and A. J. Davison, “Deepfactors: Real-time probabilistic dense monocular slam,”IEEE RA-L, vol. 5, no. 2, 2020

work page 2020
[15]

Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras,

Z. Teed and J. Deng, “Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras,”NeurIPS, vol. 34, pp. 16 558–16 569, 2021

work page 2021
[16]

Dust3r: Geometric 3d vision made easy,

S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” inCVPR, 2024, pp. 20 697–20 709

work page 2024
[17]

Grounding image matching in 3d with mast3r,

V . Leroy, Y . Cabon, and J. Revaud, “Grounding image matching in 3d with mast3r,” inECCV, 2024, pp. 71–91

work page 2024
[18]

Vggt: Visual geometry grounded transformer,

J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” inCVPR, 2025, pp. 5294–5306

work page 2025
[19]

Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion,

B. P. Duisterhof, L. Zust, P. Weinzaepfel, V . Leroy, Y . Cabon, and J. Revaud, “Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion,” in3DV, 2025, pp. 1–10

work page 2025
[20]

MASt3R-SLAM: Real- time dense SLAM with 3D reconstruction priors,

R. Murai, E. Dexheimer, and A. J. Davison, “MASt3R-SLAM: Real- time dense SLAM with 3D reconstruction priors,” inCVPR, 2025

work page 2025
[21]

Instantsplat: Un- bounded sparse-view pose-free gaussian splatting in 40 sec- onds.arXiv preprint arXiv:2403.20309, 2024

Z. Fan, W. Cong, K. Wen, K. Wang, J. Zhang, X. Ding, D. Xu, B. Ivanovic, M. Pavone, G. Pavlakoset al., “Instantsplat: Sparse-view gaussian splatting in seconds,”arXiv:2403.20309, 2024

work page arXiv 2024
[22]

A benchmark for the evaluation of rgb-d slam systems,

J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” inIROS, 2012

work page 2012
[23]

Real-time rgb-d camera relocalization,

B. Glocker, S. Izadi, J. Shotton, and A. Criminisi, “Real-time rgb-d camera relocalization,” inISMAR, 2013, pp. 173–179

work page 2013
[24]

The euroc micro aerial vehicle datasets,

M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The euroc micro aerial vehicle datasets,” IJRR, 2016

work page 2016
[25]

Vision meets robotics: The kitti dataset,

A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,”IJRR, vol. 32, no. 11, 2013

work page 2013
[26]

Bad slam: Bundle adjusted direct rgb-d slam,

T. Schops, T. Sattler, and M. Pollefeys, “Bad slam: Bundle adjusted direct rgb-d slam,” inCVPR, 2019, pp. 134–144

work page 2019
[27]

Introducing slambench, a performance and accuracy benchmarking methodology for slam,

L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. Kelly, A. J. Davison, M. Luján, M. F. O’Boyle, G. Rileyet al., “Introducing slambench, a performance and accuracy benchmarking methodology for slam,” inICRA, 2015, pp. 5783–5790

work page 2015
[28]

arXiv e-prints , keywords =

J. Engel, V . Usenko, and D. Cremers, “A photometrically calibrated benchmark for monocular visual odometry,”arXiv:1607.02555, 2016

work page arXiv 2016
[29]

Svo: Fast semi-direct monocular visual odometry,

C. Forster, M. Pizzoli, and D. Scaramuzza, “Svo: Fast semi-direct monocular visual odometry,” inICRA, 2014, pp. 15–22

work page 2014
[30]

Lsd-slam: Large-scale direct monocular slam,

J. Engel, T. Schöps, and D. Cremers, “Lsd-slam: Large-scale direct monocular slam,” inECCV, 2014, pp. 834–849

work page 2014
[31]

Superpoint: Self- supervised interest point detection and description,

D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self- supervised interest point detection and description,” inCVPRW, 2018, pp. 224–236

work page 2018
[32]

D2-net: A trainable cnn for joint description and detection of local features,

M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler, “D2-net: A trainable cnn for joint description and detection of local features,” inCVPR, 2019, pp. 8092–8101

work page 2019
[33]

Su- perglue: Learning feature matching with graph neural networks,

P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Su- perglue: Learning feature matching with graph neural networks,” in CVPR, 2020, pp. 4938–4947

work page 2020
[34]

Loftr: Detector- free local feature matching with transformers,

J. Sun, Z. Shen, Y . Wang, H. Bao, and X. Zhou, “Loftr: Detector- free local feature matching with transformers,” inCVPR, 2021, pp. 8922–8931

work page 2021
[35]

Ba-net: Dense bundle ad- justment network.arXiv preprint arXiv:1806.04807, 2018

C. Tang and P. Tan, “Ba-net: Dense bundle adjustment network,” arXiv:1806.04807, 2018

work page arXiv 2018
[36]

Raft: Recurrent all-pairs field transforms for optical flow,

Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms for optical flow,” inECCV, 2020, pp. 402–419

work page 2020
[37]

Gms: Grid-based motion statistics for fast, ultra-robust feature correspondence,

J. Bian, W.-Y . Lin, Y . Matsushita, S.-K. Yeung, T.-D. Nguyen, and M.-M. Cheng, “Gms: Grid-based motion statistics for fast, ultra-robust feature correspondence,” inCVPR, 2017, pp. 4181–4190

work page 2017
[38]

Dkm: Dense kernelized feature matching for geometry estimation,

J. Edstedt, I. Athanasiadis, M. Wadenbäck, and M. Felsberg, “Dkm: Dense kernelized feature matching for geometry estimation,” inCVPR, 2023, pp. 17 765–17 775

work page 2023
[39]

Pdc-net+: Enhanced probabilistic dense correspondence network,

P. Truong, M. Danelljan, R. Timofte, and L. Van Gool, “Pdc-net+: Enhanced probabilistic dense correspondence network,”IEEE TPAMI, vol. 45, no. 8, 2023

work page 2023
[40]

Aslfeat: Learning local features of accurate shape and localization,

Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y . Yao, S. Li, T. Fang, and L. Quan, “Aslfeat: Learning local features of accurate shape and localization,” inCVPR, 2020, pp. 6589–6598

work page 2020
[41]

Lightglue: Local feature matching at light speed,

P. Lindenberger, P.-E. Sarlin, and M. Pollefeys, “Lightglue: Local feature matching at light speed,” inICCV, 2023, pp. 17 627–17 638

work page 2023
[42]

Key.net: Keypoint detection by handcrafted and learned cnn filters,

J. Barroso-Laguna, E. Riba, D. Ponsa, and K. Mikolajczyk, “Key.net: Keypoint detection by handcrafted and learned cnn filters,” inICCV, 2019

work page 2019
[43]

Segformer: Simple and efficient design for semantic segmentation with transformers,

E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,”NeurIPS, vol. 34, pp. 12 077–12 090, 2021

work page 2021
[44]

Structure-from-motion revisited,

J. L. Schonberger and J.-M. Frahm, “Structure-from-motion revisited,” inCVPR, 2016, pp. 4104–4113

work page 2016
[45]

imap: Implicit mapping and positioning in real-time,

E. Sucar, S. Liu, J. Ortiz, and A. J. Davison, “imap: Implicit mapping and positioning in real-time,” inICCV, 2021, pp. 6229–6238

work page 2021
[46]

Nice-slam: Neural implicit scalable encoding for slam,

Z. Zhu, S. Peng, V . Larsson, W. Xu, H. Bao, Z. Cui, M. R. Oswald, and M. Pollefeys, “Nice-slam: Neural implicit scalable encoding for slam,” inCVPR, 2022, pp. 12 786–12 796

work page 2022
[47]

Ec3r-slam: Efficient and consistent monocular dense slam with feed-forward 3d reconstruction,

L. Hu, N. A. Oufroukh, F. Bonardi, and R. Ghandour, “Ec3r-slam: Efficient and consistent monocular dense slam with feed-forward 3d reconstruction,” 2025

work page 2025
[48]

3d gaussian splatting for real-time radiance field rendering

B. Kerbl, G. Kopanas, T. Leimkühler, G. Drettakiset al., “3d gaussian splatting for real-time radiance field rendering.”ACM TOG, vol. 42, no. 4, 2023

work page 2023
[49]

Lamar: Benchmarking localization and mapping for augmented reality,

P.-E. Sarlin, M. Dusmanu, J. L. Schönberger, P. Speciale, L. Gruber, V . Larsson, O. Miksik, and M. Pollefeys, “Lamar: Benchmarking localization and mapping for augmented reality,” inECCV, 2022, pp. 686–704

work page 2022
[50]

Map-free visual relocalization: Metric pose relative to a single image,

E. Arnold, J. Wynn, S. Vicente, G. Garcia-Hernando, A. Monszpart, V . Prisacariu, D. Turmukhambetov, and E. Brachmann, “Map-free visual relocalization: Metric pose relative to a single image,” inECCV, 2022, pp. 690–708

work page 2022
[51]

Accelerated coordi- nate encoding: Learning to relocalize in minutes using rgb and poses,

E. Brachmann, T. Cavallari, and V . A. Prisacariu, “Accelerated coordi- nate encoding: Learning to relocalize in minutes using rgb and poses,” inCVPR, 2023, pp. 5044–5053

work page 2023
[52]

Vggt-slam: Dense rgb slam optimized on the sl (4) manifold,

D. Maggio, H. Lim, and L. Carlone, “Vggt-slam: Dense rgb slam optimized on the sl (4) manifold,”NeurIPS, vol. 39, 2025

work page 2025
[53]

VGGT-SLAM 2.0: Real-time Dense Feed- forward Scene Reconstruction,

D. Maggio and L. Carlone, “Vggt-slam 2.0: Real time dense feed- forward scene reconstruction,”arXiv preprint arXiv:2601.19887, 2026

work page arXiv 2026
[54]

Hartley and A

R. Hartley and A. Zisserman,Multiple view geometry in computer vision. Cambridge university press, 2003

work page 2003
[55]

Bundle adjustment—a modern synthesis,

B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle adjustment—a modern synthesis,” inVision Algorithms, 1999, pp. 298–372

work page 1999
[56]

Point-slam: Dense neural point cloud-based slam,

E. Sandström, Y . Li, L. Van Gool, and M. R. Oswald, “Point-slam: Dense neural point cloud-based slam,” inICCV, 2023, pp. 18 433– 18 444

work page 2023
[57]

Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,

C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós, “Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,”IEEE T-RO, vol. 37, no. 6, 2021

work page 2021
[58]

Deep patch visual slam,

L. Lipson, Z. Teed, and J. Deng, “Deep patch visual slam,” inECCV, 2024, pp. 424–440

work page 2024
[59]

Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views,

S. Zhang, J. Wang, Y . Xu, N. Xue, C. Rupprecht, X. Zhou, Y . Shen, and G. Wetzstein, “Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views,” 2025

work page 2025
[60]

Scannet++: A high-fidelity dataset of 3d indoor scenes,

C. Yeshwanth, Y .-C. Liu, M. Nießner, and A. Dai, “Scannet++: A high-fidelity dataset of 3d indoor scenes,” inICCV, 2023, pp. 12–22

work page 2023
[61]

Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views,

S. Zhang, J. Wang, Y . Xu, N. Xue, C. Rupprecht, X. Zhou, Y . Shen, and G. Wetzstein, “Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views,” inCVPR, 2025, pp. 21 936–21 947

work page 2025
[62]

MapAnything: Universal feed-forward metric 3D reconstruction,

N. Keetha, N. Müller, J. Schönberger, L. Porzi, Y . Zhang, T. Fischer, A. Knapitsch, D. Zauss, E. Weber, N. Antunes, J. Luiten, M. Lopez- Antequera, S. R. Bulò, C. Richardt, D. Ramanan, S. Scherer, and P. Kontschieder, “MapAnything: Universal feed-forward metric 3D reconstruction,” inInternational Conference on 3D Vision (3DV). IEEE, 2026

work page 2026