pith. machine review for the scientific record.

arxiv: 2604.04055 · v1 · submitted 2026-04-05 · 💻 cs.CV · cs.RO

Recognition: 2 Lean theorem links

DINO-VO: Learning Where to Focus for Enhanced State Estimation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:02 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords visual odometry · patch selection · differentiable bundle adjustment · monocular tracking · scene generalization · feature extraction · end-to-end training · inverse depth

The pith

DINO-VO replaces fixed rules for picking image patches with a trainable selector to improve monocular visual odometry across environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual odometry estimates camera motion from image sequences, yet most systems still depend on hand-crafted heuristics to choose which image regions to track, and these choices often break down in large outdoor scenes. DINO-VO embeds a differentiable adaptive patch selector directly into an end-to-end training loop so the network learns which patches yield reliable features. The architecture couples this selector to a multi-task feature extractor and a differentiable bundle adjustment module that incorporates inverse depth priors. Experiments on the TartanAir, KITTI, EuRoC, and TUM datasets show the resulting system tracks more accurately while generalizing from synthetic training data to real indoor and outdoor environments.

Core claim

DINO-VO is an end-to-end monocular visual odometry pipeline whose central component is a differentiable adaptive patch selector that learns to extract higher-quality patches. The selector is trained jointly with a multi-task feature extraction module and a differentiable bundle adjustment layer that uses inverse depth priors, connecting appearance learning directly to geometric state estimation and yielding improved tracking accuracy on synthetic, indoor, and outdoor benchmarks.

What carries the argument

The differentiable adaptive patch selector, which learns during training to choose patches that improve feature quality and state estimation.
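A minimal sketch of one way such a selector can be made differentiable: a learned score map followed by a cell-wise soft-argmax, so gradients from downstream tracking losses reach the selection step itself. The paper does not publish its selector's parameterization; the module below, its cell size, and its temperature are illustrative assumptions, not DINO-VO's actual design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SoftPatchSelector(nn.Module):
        """Toy differentiable patch selection: one soft patch center per cell."""
        def __init__(self, feat_dim: int, cell: int = 16, temperature: float = 0.1):
            super().__init__()
            self.cell = cell
            self.temperature = temperature
            self.score_head = nn.Conv2d(feat_dim, 1, kernel_size=1)  # per-pixel score

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            """feats: (B, C, H, W) dense features -> (B, N, 2) patch centers in pixels."""
            B, _, H, W = feats.shape
            c = self.cell
            assert H % c == 0 and W % c == 0, "toy version needs divisible sizes"
            scores = self.score_head(feats)                          # (B, 1, H, W)
            cells = F.unfold(scores, kernel_size=c, stride=c)        # (B, c*c, N)
            attn = F.softmax(cells.transpose(1, 2) / self.temperature, dim=-1)
            # Pixel offsets inside a cell, in unfold's row-major order.
            ys, xs = torch.meshgrid(torch.arange(c), torch.arange(c), indexing="ij")
            offs = torch.stack([xs.reshape(-1), ys.reshape(-1)], -1).to(feats)
            soft_xy = attn @ offs                                    # expected (x, y)
            # Add each cell's top-left corner to get absolute coordinates.
            gy, gx = torch.meshgrid(torch.arange(0, H, c), torch.arange(0, W, c),
                                    indexing="ij")
            corners = torch.stack([gx.reshape(-1), gy.reshape(-1)], -1).to(feats)
            return soft_xy + corners

One patch per cell keeps spatial coverage, and the temperature sets how closely the soft pick approximates a hard argmax; annealing it toward zero yields near-discrete selection while staying trainable.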

If this is right

  • The system reaches state-of-the-art tracking accuracy on TartanAir, KITTI, EuRoC, and TUM.
  • It maintains performance when moving from synthetic training data to real indoor and outdoor test environments.
  • The joint training of patch selection, feature extraction, and bundle adjustment reduces the gap between learned appearance cues and geometric optimization.
  • Heuristic patch selection is replaced by a data-driven alternative that improves robustness in large-scale scenes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar differentiable selection modules could be inserted into other SLAM pipelines that currently rely on fixed feature detectors.
  • If the selector generalizes reliably, deployment of visual odometry in novel environments may require less manual tuning of feature parameters.
  • The inverse-depth prior inside the differentiable bundle adjustment suggests a route for incorporating additional geometric cues without breaking end-to-end differentiability; a toy sketch of how such a prior enters a BA residual follows this list.
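The sketch below shows, under stated assumptions, how an inverse-depth prior can sit inside a differentiable bundle-adjustment residual. The single-point setup, the prior weight w_prior, and the Gauss-Newton step on inverse depth alone are illustrative choices; the paper's BA layer is not specified at this level of detail.

    import torch

    K = torch.tensor([[320.0, 0.0, 320.0],
                      [0.0, 320.0, 240.0],
                      [0.0,   0.0,   1.0]])

    def reproject(uv, rho, R, t):
        """Back-project pixel uv at inverse depth rho, move by (R, t), reproject."""
        ray = torch.linalg.inv(K) @ torch.tensor([uv[0], uv[1], 1.0])
        q = K @ (R @ (ray / rho) + t)                     # point in target camera
        return q[:2] / q[2]

    def residual(rho, uv_src, uv_obs, R, t, rho_prior, w_prior=0.5):
        r_photo = reproject(uv_src, rho, R, t) - uv_obs   # reprojection error (2,)
        r_prior = w_prior * (rho - rho_prior)             # inverse-depth prior (1,)
        return torch.cat([r_photo, r_prior.reshape(1)])

    # A few damped Gauss-Newton steps on rho, with the Jacobian from autograd.
    uv_src, uv_obs = (300.0, 200.0), torch.tensor([305.0, 200.0])
    R, t = torch.eye(3), torch.tensor([0.1, 0.0, 0.0])    # small sideways motion
    rho = torch.tensor(0.2)                               # initial guess: depth 5 m
    rho_prior = torch.tensor(0.25)                        # e.g. from a depth network
    for _ in range(5):
        J = torch.autograd.functional.jacobian(
            lambda x: residual(x, uv_src, uv_obs, R, t, rho_prior), rho)
        r = residual(rho, uv_src, uv_obs, R, t, rho_prior)
        rho = rho - (J @ r) / (J @ J + 1e-6)
    print(f"refined inverse depth: {rho.item():.4f}")

Because every term stays a smooth function of rho, the prior simply appears as one more residual row; a full BA layer applies the same pattern jointly over poses and many inverse depths with sparse normal equations.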

Load-bearing premise

End-to-end training of the adaptive patch selector will consistently select higher-quality patches that boost accuracy and generalization without overfitting to the training sets.

What would settle it

A controlled experiment would settle it: evaluate DINO-VO on a new large-scale outdoor sequence never seen during training; if it produces larger trajectory errors than a strong heuristic baseline, the generalization claim is falsified.
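Concretely, the standard way to score such an experiment is absolute trajectory error (ATE RMSE) after a similarity alignment, since monocular VO recovers trajectories only up to scale. A minimal numpy sketch of that protocol, assuming (N, 3) position arrays; the evo toolkit [22] in the reference list implements the same computation:

    import numpy as np

    def ate_rmse(est: np.ndarray, gt: np.ndarray) -> float:
        """est, gt: (N, 3) camera positions. ATE RMSE after Sim(3) (Umeyama) alignment."""
        mu_e, mu_g = est.mean(0), gt.mean(0)
        E, G = est - mu_e, gt - mu_g
        U, S, Vt = np.linalg.svd(G.T @ E / len(est))      # cross-covariance
        D = np.eye(3)
        if np.linalg.det(U @ Vt) < 0:                     # keep a proper rotation
            D[2, 2] = -1.0
        R = U @ D @ Vt
        s = np.trace(np.diag(S) @ D) / E.var(0).sum()     # optimal scale
        aligned = s * (R @ E.T).T + mu_g
        return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))

    # Toy usage: a scaled, rotated copy of the ground truth has (near-)zero ATE.
    rng = np.random.default_rng(0)
    gt = np.cumsum(rng.normal(size=(100, 3)), axis=0)
    theta = 0.3
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
    est = 0.5 * (Rz @ gt.T).T + np.array([1.0, -2.0, 0.5])
    print(f"ATE RMSE: {ate_rmse(est, gt):.6f} (should be ~0)")

On the toy input the estimate is an exact similarity transform of the ground truth, so the error vanishes; on real data, a held-out sequence where this number is consistently worse than a tuned heuristic baseline would be the falsifier described above.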

Figures

Figures reproduced from arXiv: 2604.04055 by Guanghao Li, Jian Pu, Junpeng Ma, Qi Chen, Sijia Hu, Xiangyang Xue, Xin Gao.

Figure 1. Overview of the system. Three modules establish our system from left to right: Multi-task Feature Extractor, Adaptive …
Figure 2. Architecture of the Multi-task Feature Extractor. The …
Figure 3. Pipeline for the adaptive patch selector. We utilize …
Figure 4. Comparison of our adaptive patch selector with existing systems. The first row is the input image from different datasets. The …
Figure 5. Comparison of reconstruction results on TartanAir [63]. …
Figure 6. Percentage of effective patch projections in bundle adjustment. …
Figure 7. Trajectory comparison between DPV-SLAM [40] and our system with the loop-closure mechanism in KITTI Sequence 00. Fig. 4 illustrates the qualitative results of the adaptive patch selector: compared to DPVO [60], the system focuses more on meaningful objects, such as pipelines and buildings, than on the sky.
read the original abstract

We present DINO Patch Visual Odometry (DINO-VO), an end-to-end monocular visual odometry system with strong scene generalization. Current Visual Odometry (VO) systems often rely on heuristic feature extraction strategies, which can degrade accuracy and robustness, particularly in large-scale outdoor environments. DINO-VO addresses these limitations by incorporating a differentiable adaptive patch selector into the end-to-end pipeline, improving the quality of extracted patches and enhancing generalization across diverse datasets. Additionally, our system integrates a multi-task feature extraction module with a differentiable bundle adjustment (BA) module that leverages inverse depth priors, enabling the system to learn and utilize appearance and geometric information effectively. This integration bridges the gap between feature learning and state estimation. Extensive experiments on the TartanAir, KITTI, Euroc, and TUM datasets demonstrate that DINO-VO exhibits strong generalization across synthetic, indoor, and outdoor environments, achieving state-of-the-art tracking accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

4 major / 1 minor

Summary. The paper introduces DINO-VO, an end-to-end monocular visual odometry system that incorporates a differentiable adaptive patch selector, a multi-task feature extraction module, and a differentiable bundle adjustment module leveraging inverse depth priors. It claims strong scene generalization and state-of-the-art tracking accuracy across the TartanAir, KITTI, EuRoC, and TUM datasets, spanning synthetic, indoor, and outdoor environments.

Significance. If substantiated, the approach of learning patch selection end-to-end to bridge feature extraction and state estimation could meaningfully advance visual odometry by reducing reliance on heuristic feature strategies and improving robustness in diverse settings.

major comments (4)
  1. [Abstract] The central claim of state-of-the-art tracking accuracy and strong generalization is asserted without any reported quantitative metrics, baseline comparisons, ablation results, or error statistics, rendering it impossible to assess whether the gains are robust or affected by dataset choices.
  2. [Method] The differentiable adaptive patch selector is described as learning to focus on higher-quality patches via end-to-end training, yet no details are provided on its parameterization, loss formulation, or how gradients from the multi-task extractor and differentiable BA (with inverse depth priors) prevent it from learning dataset-specific heuristics rather than scene-agnostic improvements.
  3. [Experiments] No ablation isolating the adaptive patch selector (e.g., fixed vs. learned selection) is reported on the four datasets, which is load-bearing for the generalization claim since the selector is a fitted component whose contribution must be validated separately from the rest of the pipeline.
  4. [Experiments] The manuscript reports results only on TartanAir, KITTI, EuRoC, and TUM with no held-out evaluation on environments outside these distributions, leaving the strong generalization claim without a direct test against the risk of overfitting to training-set appearance or geometry patterns.
minor comments (1)
  1. [Abstract] The phrase 'extensive experiments' is used without specifying the evaluation protocol, metrics (e.g., ATE, RPE), or number of runs, which reduces clarity.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that several clarifications and additions are needed to strengthen the presentation of our claims. Below we address each major comment point by point and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] The central claim of state-of-the-art tracking accuracy and strong generalization is asserted without any reported quantitative metrics, baseline comparisons, ablation results, or error statistics, rendering it impossible to assess whether the gains are robust or affected by dataset choices.

    Authors: We agree that the abstract would be more informative with concrete metrics. In the revised manuscript we will add key quantitative results, including average translation and rotation errors on KITTI, EuRoC, and TUM relative to the strongest baselines, to support the state-of-the-art and generalization claims. revision: yes

  2. Referee: [Method] The differentiable adaptive patch selector is described as learning to focus on higher-quality patches via end-to-end training, yet no details are provided on its parameterization, loss formulation, or how gradients from the multi-task extractor and differentiable BA (with inverse depth priors) prevent it from learning dataset-specific heuristics rather than scene-agnostic improvements.

    Authors: We acknowledge the omission of implementation details. The revised method section will include: (1) the exact parameterization of the patch selector network, (2) the composite loss function used to train it jointly with the multi-task features, and (3) an explanation of how back-propagation through the differentiable inverse-depth bundle adjustment encourages scene-agnostic patch selection rather than dataset-specific heuristics. A toy illustration of what such a composite loss could look like is sketched after these responses. revision: yes

  3. Referee: [Experiments] No ablation isolating the adaptive patch selector (e.g., fixed vs. learned selection) is reported on the four datasets, which is load-bearing for the generalization claim since the selector is a fitted component whose contribution must be validated separately from the rest of the pipeline.

    Authors: We agree that an explicit ablation is necessary. We will add a new table and accompanying text that compares the full DINO-VO pipeline against a controlled variant that uses fixed (non-learned) patch selection on all four datasets, thereby isolating the contribution of the adaptive selector to both accuracy and cross-scene generalization. revision: yes

  4. Referee: [Experiments] The manuscript reports results only on TartanAir, KITTI, EuRoC, and TUM with no held-out evaluation on environments outside these distributions, leaving the strong generalization claim without a direct test against the risk of overfitting to training-set appearance or geometry patterns.

    Authors: The four datasets were deliberately chosen to span synthetic, outdoor driving, and indoor environments, which is the standard protocol for assessing generalization in visual odometry. Nevertheless, we recognize that an explicit held-out test on a completely unseen environment would provide stronger evidence. We will expand the discussion section to explicitly address this limitation and list it as an important direction for future work; we do not currently have additional held-out results to report. revision: partial
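Response 2 above promises the composite loss formulation. As a reading aid, here is a hedged sketch of the general shape such a loss could take; the supervised pose term, patch-reprojection term, and weights are illustrative assumptions, not the paper's actual design.

    import torch

    def composite_loss(pose_pred, pose_gt, reproj_pred, reproj_gt,
                       w_pose: float = 10.0, w_reproj: float = 0.1):
        """pose_*: (B, 6) se(3) twists; reproj_*: (B, N, 2) patch reprojections."""
        pose_term = torch.mean(torch.abs(pose_pred - pose_gt))       # L1 on pose
        reproj_term = torch.mean(
            torch.linalg.norm(reproj_pred - reproj_gt, dim=-1))      # patch flow error
        return w_pose * pose_term + w_reproj * reproj_term

    # Toy usage with random stand-ins for network outputs and supervision.
    loss = composite_loss(torch.randn(2, 6), torch.randn(2, 6),
                          torch.randn(2, 8, 2), torch.randn(2, 8, 2))
    print(f"loss: {loss.item():.3f}")

In losses of this family, the pose term supervises the BA output directly while the reprojection term shapes the features and the selector; how the real paper balances the two is exactly what the promised revision should specify.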

Circularity Check

0 steps flagged

No circularity: end-to-end training and empirical validation on standard benchmarks remain independent of fitted inputs by construction

full rationale

The paper introduces DINO-VO as an end-to-end monocular VO pipeline that incorporates a differentiable adaptive patch selector, multi-task feature extraction, and differentiable BA with inverse depth priors. The central claims of improved patch quality and strong generalization across synthetic, indoor, and outdoor scenes are supported by reported tracking accuracy on TartanAir, KITTI, EuRoC, and TUM. No derivation step reduces a prediction to its own inputs by construction, no self-citation is load-bearing for a uniqueness theorem, and no ansatz is smuggled in via prior work. The learned selector is a fitted component whose benefit is asserted via experiments rather than assumed tautologically, making the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on standard deep-learning assumptions plus the claim that the patch selector learns useful focus without additional regularization beyond the tracking loss.

free parameters (1)
  • neural network weights
    All parameters of the feature extractor, patch selector, and depth predictor are fitted end-to-end on the training sequences.
axioms (1)
  • standard math: The patch selector and bundle adjustment modules are fully differentiable.
    Required for gradient-based end-to-end training; invoked implicitly by the description of the pipeline.
invented entities (1)
  • DINO Patch Visual Odometry (DINO-VO) architecture (no independent evidence)
    purpose: End-to-end monocular state estimation with learned patch selection
    Newly proposed system combining DINO features, adaptive patch selection, and differentiable BA; no independent evidence outside the paper is supplied.
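The differentiability axiom above is testable in isolation: torch.autograd.gradcheck compares autograd gradients against finite differences and passes only if an op is cleanly differentiable. A minimal check against a generic soft-argmax selection op (a standalone stand-in, not the paper's module):

    import torch

    def soft_argmax_1d(scores: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
        """Differentiable 'pick': expected index under a softmax over scores."""
        weights = torch.softmax(scores / temperature, dim=-1)
        idx = torch.arange(scores.shape[-1], dtype=scores.dtype)
        return (weights * idx).sum(-1)

    scores = torch.randn(5, dtype=torch.double, requires_grad=True)  # float64 for gradcheck
    assert torch.autograd.gradcheck(soft_argmax_1d, (scores,), eps=1e-6)
    print("soft selection is differentiable (gradcheck passed)")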

pith-pipeline@v0.9.0 · 5476 in / 1268 out tokens · 32167 ms · 2026-05-13T17:02:38.905558+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 6 internal anchors

  1. [1] Michael Bloesch, Jan Czarnowski, Ronald Clark, Stefan Leutenegger, and Andrew J. Davison. CodeSLAM: Learning a compact, optimisable representation for dense visual SLAM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2560–2568, 2018.

  2. [2] Hudson Martins Silva Bruno and Esther Luna Colombini. LIFT-SLAM: A deep-learning feature-based monocular visual SLAM method. Neurocomputing, 455:97–110, 2021.

  3. [3] Michael Burri, Janosch Nikolic, Pascal Gohl, Thomas Schneider, Joern Rehder, Sammy Omari, Markus W. Achtelik, and Roland Siegwart. The EuRoC micro aerial vehicle datasets. The International Journal of Robotics Research, 35(10):1157–1163, 2016.

  4. [4] Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM. IEEE Transactions on Robotics, 37(6):1874–1890, 2021.

  5. [5] Qi Chen, Yu Cao, Jiawei Hou, Guanghao Li, Shoumeng Qiu, Bo Chen, Xiangyang Xue, Hong Lu, and Jian Pu. VPL-SLAM: A vertical line supported point line monocular SLAM system. IEEE Transactions on Intelligent Transportation Systems, 2024.

  6. [6] Qi Chen, Guanghao Li, Xiangyang Xue, and Jian Pu. Multi-LIO: A lightweight multiple LiDAR-inertial odometry system. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 13748–13754, 2024.

  7. [7] Jan Czarnowski, Tristan Laidlow, Ronald Clark, and Andrew J. Davison. DeepFactors: Real-time probabilistic dense monocular SLAM. IEEE Robotics and Automation Letters, 5(2):721–728, 2020.

  8. [8] Andrew J. Davison, Ian D. Reid, Nicholas D. Molton, and Olivier Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067, 2007.

  9. [9] Tianchen Deng, Yaohui Chen, Leyan Zhang, Jianfei Yang, Shenghai Yuan, Jiuming Liu, Danwei Wang, Hesheng Wang, and Weidong Chen. Compact 3D Gaussian splatting for dense visual SLAM. arXiv preprint arXiv:2403.11247, 2024.

  10. [10] Tianchen Deng, Guole Shen, Tong Qin, Jianyu Wang, Wentao Zhao, Jingchuan Wang, Danwei Wang, and Weidong Chen. PLGSLAM: Progressive neural scene representation with local to global bundle adjustment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19657–19666, 2024.

  11. [11] Tianchen Deng, Xuefeng Chen, Yi Chen, Qu Chen, Yuyao Xu, Lijin Yang, Le Xu, Yu Zhang, Bo Zhang, Wuxiong Huang, and Hesheng Wang. GaussianDWM: 3D Gaussian driving world model for unified scene understanding and multi-modal generation. arXiv preprint arXiv:2512.23180.

  12. [12] Tianchen Deng, Yue Pan, Shenghai Yuan, Dong Li, Chen Wang, Mingrui Li, Long Chen, Lihua Xie, Danwei Wang, Jingchuan Wang, Javier Civera, Hesheng Wang, and Weidong Chen. What is the best 3D scene representation for robotics? From geometric to foundation models. arXiv preprint arXiv:2512.03422, 2025.

  13. [13] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224–236, 2018.

  14. [14] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision (ECCV), pages 834–849. Springer, 2014.

  15. [15] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3):611–625, 2017.

  16. [16] Christian Forster, Matia Pizzoli, and Davide Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 15–22. IEEE, 2014.

  17. [17] Dorian Gálvez-López and Juan D. Tardós. Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics, 28(5):1188–1197, 2012.

  18. [18] Xin Gao and Jian Pu. Deep incomplete multi-view learning via cyclic permutation of VAEs. In The Thirteenth International Conference on Learning Representations, 2025.

  19. [19] Xiang Gao, Rui Wang, Nikolaus Demmel, and Daniel Cremers. LDSO: Direct sparse odometry with loop closure. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2198–2204. IEEE, 2018.

  20. [20] Xin Gao, Jiyao Liu, Guanghao Li, Yueming Lyu, Jianxiong Gao, Weichen Yu, Ningsheng Xu, Liang Wang, Caifeng Shan, Ziwei Liu, et al. GOOD: Training-free guided diffusion sampling for out-of-distribution detection. In The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.

  21. [21] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.

  22. [22] Michael Grupp. evo: Python package for the evaluation of odometry and SLAM. https://github.com/MichaelGrupp/evo, 2017.

  23. [23] Jiasheng Guo, Xin Gao, Yuxiang Yan, Guanghao Li, and Jian Pu. Dark-ISP: Enhancing raw image processing for low-light object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9583–9593, 2025.

  24. [24] Zhuolin He, Xinrun Li, Jiacheng Tang, Shoumeng Qiu, Wenfu Wang, Xiangyang Xue, and Jian Pu. Toward camera open-set 3D object detection for autonomous driving scenarios. IEEE Transactions on Intelligent Transportation Systems, 26(12):23190–23201, 2025.

  25. [25] Zhuolin He, Jing Li, Guanghao Li, Xiaolei Chen, Jiacheng Tang, Siyang Zhang, Zhounan Jin, Feipeng Cai, Bin Li, Jian Pu, et al. DynamicVGGT: Learning dynamic point maps for 4D scene reconstruction in autonomous driving. arXiv preprint arXiv:2603.08254, 2026.

  26. [26] Jia Hu, Zhexi Lian, Xuerun Yan, Ruiang Bi, Dou Shen, Yu Ruan, and Haoran Wang. MPCFormer: A physics-informed data-driven approach for explainable socially-aware autonomous driving. arXiv preprint arXiv:2512.03795, 2025.

  27. [27] Huajian Huang, Longwei Li, Hui Cheng, and Sai-Kit Yeung. Photo-SLAM: Real-time simultaneous localization and photorealistic mapping for monocular, stereo, and RGB-D cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21584–21593, 2024.

  28. [28] Mohammad Mahdi Johari, Camilla Carta, and François Fleuret. ESLAM: Efficient dense SLAM system based on hybrid representation of signed distance fields. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17408–17419, 2023.

  29. [29] Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, and Jonathon Luiten. SplaTAM: Splat, track & map 3D Gaussians for dense RGB-D SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21357–21366, 2024.

  30. [30] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.

  31. [31] Georg Klein and David Murray. Parallel tracking and mapping for small AR workspaces. In IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 225–234. IEEE, 2007.

  32. [32] Guanghao Li, Kerui Ren, Linning Xu, Zhewen Zheng, Changjian Jiang, Xin Gao, Bo Dai, Jian Pu, Mulin Yu, and Jiangmiao Pang. ArtDeco: Toward high-fidelity on-the-fly reconstruction with hierarchical Gaussian structure and feed-forward guidance. In The Fourteenth International Conference on Learning Representations.

  33. [33] Guanghao Li, Yu Cao, Qi Chen, Xin Gao, Yifan Yang, and Jian Pu. PAPL-SLAM: Principal axis-anchored monocular point-line SLAM. IEEE Robotics and Automation Letters.

  34. [34] Guanghao Li, Qi Chen, Sijia Hu, Yuxiang Yan, and Jian Pu. Constrained Gaussian splatting via implicit TSDF hash grid for dense RGB-D SLAM. IEEE Transactions on Artificial Intelligence, 2025.

  35. [35] Guanghao Li, Qi Chen, Yuxiang Yan, and Jian Pu. EC-SLAM: Effectively constrained neural RGB-D SLAM with TSDF hash encoding and joint optimization. Pattern Recognition, 170:112034, 2026.

  36. [36] Ruihao Li, Sen Wang, Zhiqiang Long, and Dongbing Gu. UnDeepVO: Monocular visual odometry through unsupervised deep learning. In IEEE International Conference on Robotics and Automation (ICRA), pages 7286–7291. IEEE, 2018.

  37. [37] Ruihao Li, Sen Wang, and Dongbing Gu. DeepSLAM: A robust monocular SLAM system with unsupervised deep learning. IEEE Transactions on Industrial Electronics, 68(4):3577–3587, 2020.

  38. [38] Zhexi Lian, Haoran Wang, Xuerun Yan, Weimeng Lin, Xianhong Zhang, Yongyu Chen, and Jia Hu. Fine-tuning is not enough: A parallel framework for collaborative imitation and reinforcement learning in end-to-end autonomous driving. arXiv preprint arXiv:2603.13842, 2026.

  39. [39] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17627–17638, 2023.

  40. [40] Lahav Lipson, Zachary Teed, and Jia Deng. Deep patch visual SLAM. In European Conference on Computer Vision (ECCV), pages 424–440. Springer, 2025.

  41. [41] Lorenzo Liso, Erik Sandström, Vladimir Yugay, Luc Van Gool, and Martin R. Oswald. Loopy-SLAM: Dense neural SLAM with loop closures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20363–20373, 2024.

  42. [42] Junpeng Ma, Qizhe Zhang, Ming Lu, Zhibin Wang, Qiang Zhou, Jun Song, and Shanghang Zhang. MMG-Vid: Maximizing marginal gains at segment-level and token-level for efficient video LLMs. arXiv preprint arXiv:2508.21044, 2025.

  43. [43] Junpeng Ma, Sashuai Zhou, Guanghao Li, Xin Gao, Yue Cao, Hengyu Zeng, Yuxiang Yan, Zhibin Wang, Jun Song, Bo Zheng, et al. GIFT: Global irreplaceability frame targeting for efficient video understanding. arXiv preprint arXiv:2603.25072, 2026.

  44. [44] Hidenobu Matsuki, Riku Murai, Paul H. J. Kelly, and Andrew J. Davison. Gaussian splatting SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18039–18048, 2024.

  45. [45] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  46. [46] Raul Mur-Artal and Juan D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.

  47. [47] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.

  48. [48] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, pages 1–31, 2024.

  49. [49] Zhexi Peng, Tianjia Shao, Yong Liu, Jingke Zhou, Yin Yang, Jingdong Wang, and Kun Zhou. RTG-SLAM: Real-time 3D reconstruction at scale using Gaussian splatting. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.

  50. [50] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.

  51. [51] Edward Rosten and Tom Drummond. Machine learning for high-speed corner detection. In European Conference on Computer Vision (ECCV), pages 430–443. Springer, 2006.

  52. [52] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision, pages 2564–2571. IEEE, 2011.

  53. [53] Jianbo Shi and Carlo Tomasi. Good features to track. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 593–600. IEEE, 1994.

  54. [54] Edgar Sucar, Shikun Liu, Joseph Ortiz, and Andrew J. Davison. iMAP: Implicit mapping and positioning in real-time. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6229–6238, 2021.

  55. [55] Jiacheng Tang, Mingyue Feng, Jiachao Liu, Yaonong Wang, and Jian Pu. Decoupling scene perception and ego status: A multi-context fusion approach for enhanced generalization in end-to-end autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9413–9420, 2026.

  56. [56] Jiacheng Tang, Zhiyuan Zhou, Zhuolin He, Jia Zhang, Kai Zhang, and Jian Pu. CausalVAD: De-confounding end-to-end autonomous driving via causal intervention. arXiv preprint arXiv:2603.18561, 2026.

  57. [57] Keisuke Tateno, Federico Tombari, Iro Laina, and Nassir Navab. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6243–6252, 2017.

  58. [58] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II, pages 402–419. Springer, 2020.

  59. [59] Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. Advances in Neural Information Processing Systems, 34:16558–16569, 2021.

  60. [60] Zachary Teed, Lahav Lipson, and Jia Deng. Deep patch visual odometry. Advances in Neural Information Processing Systems, 36, 2024.

  61. [61] Hengyi Wang, Jingwen Wang, and Lourdes Agapito. Co-SLAM: Joint coordinate and sparse parametric encodings for neural real-time SLAM. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13293–13302, 2023.

  62. [62] Sen Wang, Ronald Clark, Hongkai Wen, and Niki Trigoni. DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In IEEE International Conference on Robotics and Automation (ICRA), pages 2043–2050. IEEE, 2017.

  63. [63] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. TartanAir: A dataset to push the limits of visual SLAM. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916. IEEE, 2020.

  64. [64] Wenshan Wang, Yaoyu Hu, and Sebastian Scherer. TartanVO: A generalizable learning-based VO. In Conference on Robot Learning, pages 1761–1772. PMLR, 2021.

  65. [65] Kuan Xu, Yuefan Hao, Shenghai Yuan, Chen Wang, and Lihua Xie. AirSLAM: An efficient and illumination-robust point-line visual SLAM system. arXiv preprint arXiv:2408.03520, 2024.

  66. [66] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2. arXiv preprint arXiv:2406.09414, 2024.

  67. [67] Youmin Zhang, Fabio Tosi, Stefano Mattoccia, and Matteo Poggi. GO-SLAM: Global optimization for consistent 3D instant reconstruction. In IEEE/CVF International Conference on Computer Vision, pages 3727–3737, 2023.

  68. [68] Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. SparseVLM: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024.

  69. [69] Zhiqi Zhao, Chang Wu, Xiaotong Kong, Zejie Lv, Xiaoqi Du, and Qiyan Li. Light-SLAM: A robust deep-learning visual SLAM system based on LightGlue under challenging lighting conditions. arXiv preprint arXiv:2407.02382, 2024.

  70. [70] Huizhong Zhou, Benjamin Ummenhofer, and Thomas Brox. DeepTAM: Deep tracking and mapping. In European Conference on Computer Vision (ECCV), pages 822–838, 2018.

  71. [71] Sashuai Zhou, Qiang Zhou, Junpeng Ma, Yue Cao, Ruofan Hu, Ziang Zhang, Xiaoda Yang, Zhibin Wang, Jun Song, Cheng Yu, et al. SpatialReward: Verifiable spatial reward modeling for fine-grained spatial consistency in text-to-image generation. arXiv preprint arXiv:2603.22228, 2026.

  72. [72] Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R. Oswald, and Marc Pollefeys. NICE-SLAM: Neural implicit scalable encoding for SLAM. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12786–12796, 2022.