DINO-VO: Learning Where to Focus for Enhanced State Estimation
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-13 17:02 UTC · model grok-4.3
The pith
DINO-VO replaces fixed rules for picking image patches with a trainable selector to improve monocular visual odometry across environments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DINO-VO is an end-to-end monocular visual odometry pipeline whose central component is a differentiable adaptive patch selector that learns to extract higher-quality patches; this selector is trained jointly with a multi-task feature extraction module and a differentiable bundle adjustment layer that uses inverse depth priors, thereby connecting appearance learning directly to geometric state estimation and yielding improved tracking accuracy on synthetic, indoor, and outdoor benchmarks.
What carries the argument
The differentiable adaptive patch selector, which learns during training to choose patches that improve feature quality and state estimation.
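A minimal sketch of one way such a selector can be made differentiable, assuming a DPVO-style pipeline: a small scoring head ranks candidate patch embeddings, and a Gumbel-softmax relaxation of top-k keeps gradients flowing back into the scorer. The class name, shapes, and the particular relaxation are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftPatchSelector(nn.Module):
    """Illustrative differentiable patch selector (not the paper's code).

    Scores N candidate patch embeddings and softly selects k of them with
    a Gumbel-softmax relaxation, so the scoring head receives gradients
    from whatever loss is applied downstream (e.g., a differentiable BA).
    """

    def __init__(self, feat_dim: int, k: int, tau: float = 0.5):
        super().__init__()
        self.score_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2), nn.ReLU(),
            nn.Linear(feat_dim // 2, 1),
        )
        self.k = k
        self.tau = tau  # temperature: lower = harder selection

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (N, D) embeddings of N candidate patches.
        logits = self.score_head(patch_feats).squeeze(-1)  # (N,)
        # k independent relaxed one-hot draws; each row softly picks a patch.
        # (A sketch-level simplification: it can pick the same patch twice;
        # a top-k-without-replacement relaxation would avoid that.)
        u = torch.rand(self.k, logits.numel()).clamp_(1e-9, 1 - 1e-9)
        gumbel = -torch.log(-torch.log(u))
        weights = F.softmax((logits.unsqueeze(0) + gumbel) / self.tau, dim=-1)
        # Soft-selected patch features, differentiable w.r.t. the scorer.
        return weights @ patch_feats  # (k, D)
```

At inference time one would switch to a hard top-k over the same logits; the relaxation is only needed so that training signal from the geometric loss can reach the selector.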
If this is right
- The system reaches state-of-the-art tracking accuracy on TartanAir, KITTI, Euroc, and TUM.
- It maintains performance when moving from synthetic training data to real indoor and outdoor test environments.
- The joint training of patch selection, feature extraction, and bundle adjustment reduces the gap between learned appearance cues and geometric optimization.
- Heuristic patch selection is replaced by a data-driven alternative that improves robustness in large-scale scenes.
Where Pith is reading between the lines
- Similar differentiable selection modules could be inserted into other SLAM pipelines that currently rely on fixed feature detectors.
- If the selector generalizes reliably, deployment of visual odometry in novel environments may require less manual tuning of feature parameters.
- The inverse-depth prior inside the differentiable bundle adjustment suggests a route for incorporating additional geometric cues without breaking end-to-end differentiability.
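One standard way to write such an objective, as a point of reference: the formulation below is the common inverse-depth bundle-adjustment residual used in learned VO systems, with a quadratic depth prior added. The paper's exact residual, weighting, and prior are assumptions here.

```latex
% Patch centre p_i in frame i with inverse depth d_i is lifted by the
% inverse projection, transported by the relative pose, and reprojected;
% \hat{p}_{ij} is the predicted correspondence in frame j.
\[
  r_{ij} = \hat{p}_{ij} - \Pi\!\left( T_j T_i^{-1} \, \Pi^{-1}(p_i, d_i) \right)
\]
% The differentiable BA layer minimizes weighted residuals jointly over
% poses and inverse depths; a Gaussian prior with mean \bar{d}_i (e.g. from
% a depth network) enters as a quadratic term without breaking gradients:
\[
  \min_{\{T_k\},\, \{d_i\}} \; \sum_{(i,j)} \lVert r_{ij} \rVert_{\Sigma_{ij}^{-1}}^{2}
  \; + \; \lambda \sum_i \bigl( d_i - \bar{d}_i \bigr)^{2}
\]
```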
Load-bearing premise
End-to-end training of the adaptive patch selector will consistently select higher-quality patches that boost accuracy and generalization without overfitting to the training sets.
What would settle it
A controlled experiment in which DINO-VO is evaluated on a new large-scale outdoor sequence never seen during training and produces larger trajectory errors than a strong heuristic baseline would falsify the generalization claim.
Original abstract
We present DINO Patch Visual Odometry (DINO-VO), an end-to-end monocular visual odometry system with strong scene generalization. Current Visual Odometry (VO) systems often rely on heuristic feature extraction strategies, which can degrade accuracy and robustness, particularly in large-scale outdoor environments. DINO-VO addresses these limitations by incorporating a differentiable adaptive patch selector into the end-to-end pipeline, improving the quality of extracted patches and enhancing generalization across diverse datasets. Additionally, our system integrates a multi-task feature extraction module with a differentiable bundle adjustment (BA) module that leverages inverse depth priors, enabling the system to learn and utilize appearance and geometric information effectively. This integration bridges the gap between feature learning and state estimation. Extensive experiments on the TartanAir, KITTI, Euroc, and TUM datasets demonstrate that DINO-VO exhibits strong generalization across synthetic, indoor, and outdoor environments, achieving state-of-the-art tracking accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DINO-VO, an end-to-end monocular visual odometry system that incorporates a differentiable adaptive patch selector, a multi-task feature extraction module, and a differentiable bundle adjustment module leveraging inverse depth priors. It claims strong scene generalization and state-of-the-art tracking accuracy across the TartanAir, KITTI, Euroc, and TUM datasets spanning synthetic, indoor, and outdoor environments.
Significance. If substantiated, the approach of learning patch selection end-to-end to bridge feature extraction and state estimation could meaningfully advance visual odometry by reducing reliance on heuristic feature strategies and improving robustness in diverse settings.
major comments (4)
- [Abstract] The central claim of state-of-the-art tracking accuracy and strong generalization is asserted without any reported quantitative metrics, baseline comparisons, ablation results, or error statistics, rendering it impossible to assess whether the gains are robust or affected by dataset choices.
- [Method] The differentiable adaptive patch selector is described as learning to focus on higher-quality patches via end-to-end training, yet no details are provided on its parameterization, loss formulation, or how gradients from the multi-task extractor and differentiable BA (with inverse depth priors) prevent it from learning dataset-specific heuristics rather than scene-agnostic improvements.
- [Experiments] No ablation isolating the adaptive patch selector (e.g., fixed vs. learned selection) is reported on the four datasets, which is load-bearing for the generalization claim since the selector is a fitted component whose contribution must be validated separately from the rest of the pipeline.
- [Experiments] The manuscript reports results only on TartanAir, KITTI, Euroc, and TUM, with no held-out evaluation on environments outside these distributions, leaving the strong generalization claim without a direct test against the risk of overfitting to training-set appearance or geometry patterns.
minor comments (1)
- [Abstract] The phrase 'extensive experiments' is used without specifying the evaluation protocol, metrics (e.g., ATE, RPE), or number of runs, which reduces clarity.
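For context on the metrics named above: ATE measures global trajectory error after aligning the estimated trajectory to ground truth, while RPE measures drift over fixed-length relative motions. Below is a minimal sketch of ATE with a simplified Umeyama similarity alignment; it is an illustrative reconstruction, not the paper's protocol, and evaluation suites such as evo implement both metrics with proper timestamp association.

```python
import numpy as np

def ate_rmse(gt: np.ndarray, est: np.ndarray) -> float:
    """RMSE of absolute trajectory error after similarity alignment.

    gt, est: (N, 3) arrays of ground-truth and estimated positions.
    Simplified Umeyama alignment, with scale estimation because
    monocular VO recovers trajectories only up to scale.
    """
    mu_g, mu_e = gt.mean(axis=0), est.mean(axis=0)
    G, E = gt - mu_g, est - mu_e
    # Cross-covariance (target^T source); SVD gives the aligning rotation.
    U, S, Vt = np.linalg.svd(G.T @ E)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U) * np.linalg.det(Vt))])
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / (E ** 2).sum()  # optimal scale
    aligned = s * (E @ R.T) + mu_g
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))
```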
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that several clarifications and additions are needed to strengthen the presentation of our claims. Below we address each major comment point by point and indicate the revisions we will make.
read point-by-point responses
- Referee: [Abstract] The central claim of state-of-the-art tracking accuracy and strong generalization is asserted without any reported quantitative metrics, baseline comparisons, ablation results, or error statistics, rendering it impossible to assess whether the gains are robust or affected by dataset choices.
  Authors: We agree that the abstract would be more informative with concrete metrics. In the revised manuscript we will add key quantitative results, including average translation and rotation errors on KITTI, Euroc, and TUM relative to the strongest baselines, to support the state-of-the-art and generalization claims. revision: yes
- Referee: [Method] The differentiable adaptive patch selector is described as learning to focus on higher-quality patches via end-to-end training, yet no details are provided on its parameterization, loss formulation, or how gradients from the multi-task extractor and differentiable BA (with inverse depth priors) prevent it from learning dataset-specific heuristics rather than scene-agnostic improvements.
  Authors: We acknowledge the omission of implementation details. The revised method section will include: (1) the exact parameterization of the patch selector network, (2) the composite loss function used to train it jointly with the multi-task features, and (3) an explanation of how back-propagation through the differentiable inverse-depth bundle adjustment encourages scene-agnostic patch selection rather than dataset-specific heuristics. revision: yes
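A minimal sketch of the form such a composite loss could take, assuming a pose-supervision term plus a residual term from the differentiable BA layer; the weighting, the 6-DoF parameterization, and the function names are illustrative assumptions, not the paper's loss.

```python
import torch

def composite_vo_loss(pred_pose: torch.Tensor,
                      gt_pose: torch.Tensor,
                      ba_residuals: torch.Tensor,
                      w_pose: float = 1.0,
                      w_reproj: float = 0.1) -> torch.Tensor:
    """Illustrative composite loss for joint end-to-end training.

    pred_pose, gt_pose: (B, 6) poses in a tangent-space parameterization
    (a Lie-group library would give a proper geodesic error; plain L1
    stands in here).
    ba_residuals: reprojection residuals returned by the differentiable
    BA layer; penalizing them pushes gradients back through patch
    selection and feature extraction.
    """
    pose_loss = torch.abs(pred_pose - gt_pose).mean()
    reproj_loss = ba_residuals.pow(2).mean()
    return w_pose * pose_loss + w_reproj * reproj_loss
```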
- Referee: [Experiments] No ablation isolating the adaptive patch selector (e.g., fixed vs. learned selection) is reported on the four datasets, which is load-bearing for the generalization claim since the selector is a fitted component whose contribution must be validated separately from the rest of the pipeline.
  Authors: We agree that an explicit ablation is necessary. We will add a new table and accompanying text comparing the full DINO-VO pipeline against a controlled variant that uses fixed (non-learned) patch selection on all four datasets, thereby isolating the contribution of the adaptive selector to both accuracy and cross-scene generalization. revision: yes
- Referee: [Experiments] The manuscript reports results only on TartanAir, KITTI, Euroc, and TUM, with no held-out evaluation on environments outside these distributions, leaving the strong generalization claim without a direct test against the risk of overfitting to training-set appearance or geometry patterns.
  Authors: The four datasets were deliberately chosen to span synthetic, outdoor driving, and indoor environments, which is the standard protocol for assessing generalization in visual odometry. Nevertheless, we recognize that an explicit held-out test on a completely unseen environment would provide stronger evidence. We will expand the discussion section to explicitly address this limitation and list it as an important direction for future work; we do not currently have additional held-out results to report. revision: partial
Circularity Check
No circularity: end-to-end training and empirical validation on standard benchmarks remain independent of fitted inputs by construction.
full rationale
The paper introduces DINO-VO as an end-to-end monocular VO pipeline that incorporates a differentiable adaptive patch selector, multi-task feature extraction, and differentiable BA with inverse depth priors. The central claims of improved patch quality and strong generalization across synthetic/indoor/outdoor scenes are supported by reported tracking accuracy on TartanAir, KITTI, Euroc, and TUM. No derivation step reduces a prediction to its own inputs by construction, no self-citation is load-bearing for a uniqueness theorem, and no ansatz is smuggled via prior work. The learned selector is a fitted component whose benefit is asserted via experiments rather than tautologically assumed, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- neural network weights
axioms (1)
- standard math · The patch selector and bundle adjustment modules are fully differentiable (see the sketch after this ledger)
invented entities (1)
- DINO Patch Visual Odometry (DINO-VO) architecture · no independent evidence
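The differentiability axiom is the kind of property that can be checked numerically. A hedged sketch, where `ba_layer` stands in for a hypothetical differentiable bundle-adjustment module with an assumed (poses, inverse depths) interface:

```python
import torch

def check_ba_differentiable(ba_layer) -> bool:
    """Numerically verify gradients flow through a (hypothetical) BA layer.

    Shapes and the (poses, inverse_depths) interface are illustrative
    assumptions; gradcheck requires double-precision inputs and a layer
    that returns double-precision tensors.
    """
    poses = torch.randn(2, 6, dtype=torch.double, requires_grad=True)   # se(3) increments
    inv_depths = torch.rand(8, dtype=torch.double, requires_grad=True)  # inverse depths
    return torch.autograd.gradcheck(ba_layer, (poses, inv_depths),
                                    eps=1e-6, atol=1e-4)
```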
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "incorporating a differentiable adaptive patch selector... multi-task feature extraction module with a differentiable bundle adjustment (BA) module that leverages inverse depth priors"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Extensive experiments on the TartanAir, KITTI, Euroc, and TUM datasets demonstrate... state-of-the-art tracking accuracy"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.