pith. machine review for the scientific record.

arxiv: 2604.04055 · v1 · submitted 2026-04-05 · 💻 cs.CV · cs.RO

Recognition: 2 Lean theorem links

DINO-VO: Learning Where to Focus for Enhanced State Estimation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:02 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords visual odometry · patch selection · differentiable bundle adjustment · monocular tracking · scene generalization · feature extraction · end-to-end training · inverse depth

The pith

DINO-VO replaces fixed rules for picking image patches with a trainable selector to improve monocular visual odometry across environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual odometry estimates camera motion from image sequences, yet most systems still depend on hand-crafted heuristics to choose which image regions to track, and these choices often break down in large outdoor scenes. DINO-VO embeds a differentiable adaptive patch selector directly into an end-to-end training loop so the network learns which patches yield reliable features. The architecture couples this selector to a multi-task feature extractor and a differentiable bundle adjustment module that incorporates inverse depth priors. Experiments on the TartanAir, KITTI, EuRoC, and TUM datasets show the resulting system tracks more accurately while generalizing from synthetic training data to real indoor and outdoor environments.

Core claim

DINO-VO is an end-to-end monocular visual odometry pipeline whose central component is a differentiable adaptive patch selector that learns to extract higher-quality patches. The selector is trained jointly with a multi-task feature extraction module and a differentiable bundle adjustment layer that uses inverse depth priors, connecting appearance learning directly to geometric state estimation and yielding improved tracking accuracy on synthetic, indoor, and outdoor benchmarks.

What carries the argument

The differentiable adaptive patch selector, which learns during training to choose patches that improve feature quality and state estimation.
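A minimal sketch of one way such a selector can be made differentiable: a learned score map followed by a cell-wise soft-argmax, so gradients from downstream tracking losses reach the selection step itself. The paper does not publish its selector's parameterization; the module below, its cell size, and its temperature are illustrative assumptions, not DINO-VO's actual design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SoftPatchSelector(nn.Module):
        """Toy differentiable patch selection: one soft patch center per cell."""
        def __init__(self, feat_dim: int, cell: int = 16, temperature: float = 0.1):
            super().__init__()
            self.cell = cell
            self.temperature = temperature
            self.score_head = nn.Conv2d(feat_dim, 1, kernel_size=1)  # per-pixel score

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            """feats: (B, C, H, W) dense features -> (B, N, 2) patch centers in pixels."""
            B, _, H, W = feats.shape
            c = self.cell
            assert H % c == 0 and W % c == 0, "toy version needs divisible sizes"
            scores = self.score_head(feats)                          # (B, 1, H, W)
            cells = F.unfold(scores, kernel_size=c, stride=c)        # (B, c*c, N)
            attn = F.softmax(cells.transpose(1, 2) / self.temperature, dim=-1)
            # Pixel offsets inside a cell, in unfold's row-major order.
            ys, xs = torch.meshgrid(torch.arange(c), torch.arange(c), indexing="ij")
            offs = torch.stack([xs.reshape(-1), ys.reshape(-1)], -1).to(feats)
            soft_xy = attn @ offs                                    # expected (x, y)
            # Add each cell's top-left corner to get absolute coordinates.
            gy, gx = torch.meshgrid(torch.arange(0, H, c), torch.arange(0, W, c),
                                    indexing="ij")
            corners = torch.stack([gx.reshape(-1), gy.reshape(-1)], -1).to(feats)
            return soft_xy + corners

One patch per cell keeps spatial coverage, and the temperature sets how closely the soft pick approximates a hard argmax; annealing it toward zero yields near-discrete selection while staying trainable.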

If this is right

  • The system reaches state-of-the-art tracking accuracy on TartanAir, KITTI, EuRoC, and TUM.
  • It maintains performance when moving from synthetic training data to real indoor and outdoor test environments.
  • The joint training of patch selection, feature extraction, and bundle adjustment reduces the gap between learned appearance cues and geometric optimization.
  • Heuristic patch selection is replaced by a data-driven alternative that improves robustness in large-scale scenes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar differentiable selection modules could be inserted into other SLAM pipelines that currently rely on fixed feature detectors.
  • If the selector generalizes reliably, deployment of visual odometry in novel environments may require less manual tuning of feature parameters.
  • The inverse-depth prior inside the differentiable bundle adjustment suggests a route for incorporating additional geometric cues without breaking end-to-end differentiability; a toy sketch of how such a prior enters a BA residual follows this list.
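The sketch below shows, under stated assumptions, how an inverse-depth prior can sit inside a differentiable bundle-adjustment residual. The single-point setup, the prior weight w_prior, and the Gauss-Newton step on inverse depth alone are illustrative choices; the paper's BA layer is not specified at this level of detail.

    import torch

    K = torch.tensor([[320.0, 0.0, 320.0],
                      [0.0, 320.0, 240.0],
                      [0.0,   0.0,   1.0]])

    def reproject(uv, rho, R, t):
        """Back-project pixel uv at inverse depth rho, move by (R, t), reproject."""
        ray = torch.linalg.inv(K) @ torch.tensor([uv[0], uv[1], 1.0])
        q = K @ (R @ (ray / rho) + t)                     # point in target camera
        return q[:2] / q[2]

    def residual(rho, uv_src, uv_obs, R, t, rho_prior, w_prior=0.5):
        r_photo = reproject(uv_src, rho, R, t) - uv_obs   # reprojection error (2,)
        r_prior = w_prior * (rho - rho_prior)             # inverse-depth prior (1,)
        return torch.cat([r_photo, r_prior.reshape(1)])

    # A few damped Gauss-Newton steps on rho, with the Jacobian from autograd.
    uv_src, uv_obs = (300.0, 200.0), torch.tensor([305.0, 200.0])
    R, t = torch.eye(3), torch.tensor([0.1, 0.0, 0.0])    # small sideways motion
    rho = torch.tensor(0.2)                               # initial guess: depth 5 m
    rho_prior = torch.tensor(0.25)                        # e.g. from a depth network
    for _ in range(5):
        J = torch.autograd.functional.jacobian(
            lambda x: residual(x, uv_src, uv_obs, R, t, rho_prior), rho)
        r = residual(rho, uv_src, uv_obs, R, t, rho_prior)
        rho = rho - (J @ r) / (J @ J + 1e-6)
    print(f"refined inverse depth: {rho.item():.4f}")

Because every term stays a smooth function of rho, the prior simply appears as one more residual row; a full BA layer applies the same pattern jointly over poses and many inverse depths with sparse normal equations.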

Load-bearing premise

End-to-end training of the adaptive patch selector will consistently select higher-quality patches that boost accuracy and generalization without overfitting to the training sets.

What would settle it

A controlled experiment would settle it: evaluate DINO-VO on a new large-scale outdoor sequence never seen during training; if it produces larger trajectory errors than a strong heuristic baseline, the generalization claim is falsified.
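Concretely, the standard way to score such an experiment is absolute trajectory error (ATE RMSE) after a similarity alignment, since monocular VO recovers trajectories only up to scale. A minimal numpy sketch of that protocol, assuming (N, 3) position arrays; the evo toolkit [22] in the reference list implements the same computation:

    import numpy as np

    def ate_rmse(est: np.ndarray, gt: np.ndarray) -> float:
        """est, gt: (N, 3) camera positions. ATE RMSE after Sim(3) (Umeyama) alignment."""
        mu_e, mu_g = est.mean(0), gt.mean(0)
        E, G = est - mu_e, gt - mu_g
        U, S, Vt = np.linalg.svd(G.T @ E / len(est))      # cross-covariance
        D = np.eye(3)
        if np.linalg.det(U @ Vt) < 0:                     # keep a proper rotation
            D[2, 2] = -1.0
        R = U @ D @ Vt
        s = np.trace(np.diag(S) @ D) / E.var(0).sum()     # optimal scale
        aligned = s * (R @ E.T).T + mu_g
        return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))

    # Toy usage: a scaled, rotated copy of the ground truth has (near-)zero ATE.
    rng = np.random.default_rng(0)
    gt = np.cumsum(rng.normal(size=(100, 3)), axis=0)
    theta = 0.3
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
    est = 0.5 * (Rz @ gt.T).T + np.array([1.0, -2.0, 0.5])
    print(f"ATE RMSE: {ate_rmse(est, gt):.6f} (should be ~0)")

On the toy input the estimate is an exact similarity transform of the ground truth, so the error vanishes; on real data, a held-out sequence where this number is consistently worse than a tuned heuristic baseline would be the falsifier described above.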

Figures

Figures reproduced from arXiv: 2604.04055 by Guanghao Li, Jian Pu, Junpeng Ma, Qi Chen, Sijia Hu, Xiangyang Xue, Xin Gao.

Figure 1. Overview of the system. Three modules establish our system from left to right: Multi-task Feature Extractor, Adaptive …
Figure 2. Architecture of the Multi-task Feature Extractor. The …
Figure 3. Pipeline for the adaptive patch selector. We utilize …
Figure 4. Comparison of our adaptive patch selector with existing systems. The first row is the input image from different datasets. The …
Figure 5. Comparison of reconstruction results on TartanAir [63]. …
Figure 6. Percentage of effective patch projections in bundle adjustment. …
Figure 7. Trajectory comparison between DPV-SLAM [40] and our system with the loop-closure mechanism in KITTI Sequence 00. Fig. 4 illustrates the qualitative results of the adaptive patch selector: compared to DPVO [60], the system focuses more on meaningful objects, such as pipelines and buildings, than on the sky.
read the original abstract

We present DINO Patch Visual Odometry (DINO-VO), an end-to-end monocular visual odometry system with strong scene generalization. Current Visual Odometry (VO) systems often rely on heuristic feature extraction strategies, which can degrade accuracy and robustness, particularly in large-scale outdoor environments. DINO-VO addresses these limitations by incorporating a differentiable adaptive patch selector into the end-to-end pipeline, improving the quality of extracted patches and enhancing generalization across diverse datasets. Additionally, our system integrates a multi-task feature extraction module with a differentiable bundle adjustment (BA) module that leverages inverse depth priors, enabling the system to learn and utilize appearance and geometric information effectively. This integration bridges the gap between feature learning and state estimation. Extensive experiments on the TartanAir, KITTI, Euroc, and TUM datasets demonstrate that DINO-VO exhibits strong generalization across synthetic, indoor, and outdoor environments, achieving state-of-the-art tracking accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

4 major / 1 minor

Summary. The paper introduces DINO-VO, an end-to-end monocular visual odometry system that incorporates a differentiable adaptive patch selector, a multi-task feature extraction module, and a differentiable bundle adjustment module leveraging inverse depth priors. It claims strong scene generalization and state-of-the-art tracking accuracy across the TartanAir, KITTI, EuRoC, and TUM datasets, spanning synthetic, indoor, and outdoor environments.

Significance. If substantiated, the approach of learning patch selection end-to-end to bridge feature extraction and state estimation could meaningfully advance visual odometry by reducing reliance on heuristic feature strategies and improving robustness in diverse settings.

major comments (4)
  1. [Abstract] The central claim of state-of-the-art tracking accuracy and strong generalization is asserted without any reported quantitative metrics, baseline comparisons, ablation results, or error statistics, rendering it impossible to assess whether the gains are robust or affected by dataset choices.
  2. [Method] The differentiable adaptive patch selector is described as learning to focus on higher-quality patches via end-to-end training, yet no details are provided on its parameterization, loss formulation, or how gradients from the multi-task extractor and differentiable BA (with inverse depth priors) prevent it from learning dataset-specific heuristics rather than scene-agnostic improvements.
  3. [Experiments] No ablation isolating the adaptive patch selector (e.g., fixed vs. learned selection) is reported on the four datasets, which is load-bearing for the generalization claim since the selector is a fitted component whose contribution must be validated separately from the rest of the pipeline.
  4. [Experiments] The manuscript reports results only on TartanAir, KITTI, EuRoC, and TUM with no held-out evaluation on environments outside these distributions, leaving the strong generalization claim without a direct test against the risk of overfitting to training-set appearance or geometry patterns.
minor comments (1)
  1. [Abstract] The phrase 'extensive experiments' is used without specifying the evaluation protocol, metrics (e.g., ATE, RPE), or number of runs, which reduces clarity.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that several clarifications and additions are needed to strengthen the presentation of our claims. Below we address each major comment point by point and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] The central claim of state-of-the-art tracking accuracy and strong generalization is asserted without any reported quantitative metrics, baseline comparisons, ablation results, or error statistics, rendering it impossible to assess whether the gains are robust or affected by dataset choices.

    Authors: We agree that the abstract would be more informative with concrete metrics. In the revised manuscript we will add key quantitative results, including average translation and rotation errors on KITTI, EuRoC, and TUM relative to the strongest baselines, to support the state-of-the-art and generalization claims. revision: yes

  2. Referee: [Method] The differentiable adaptive patch selector is described as learning to focus on higher-quality patches via end-to-end training, yet no details are provided on its parameterization, loss formulation, or how gradients from the multi-task extractor and differentiable BA (with inverse depth priors) prevent it from learning dataset-specific heuristics rather than scene-agnostic improvements.

    Authors: We acknowledge the omission of implementation details. The revised method section will include: (1) the exact parameterization of the patch selector network, (2) the composite loss function used to train it jointly with the multi-task features, and (3) an explanation of how back-propagation through the differentiable inverse-depth bundle adjustment encourages scene-agnostic patch selection rather than dataset-specific heuristics. A toy illustration of what such a composite loss could look like is sketched after these responses. revision: yes

  3. Referee: [Experiments] No ablation isolating the adaptive patch selector (e.g., fixed vs. learned selection) is reported on the four datasets, which is load-bearing for the generalization claim since the selector is a fitted component whose contribution must be validated separately from the rest of the pipeline.

    Authors: We agree that an explicit ablation is necessary. We will add a new table and accompanying text that compares the full DINO-VO pipeline against a controlled variant that uses fixed (non-learned) patch selection on all four datasets, thereby isolating the contribution of the adaptive selector to both accuracy and cross-scene generalization. revision: yes

  4. Referee: [Experiments] The manuscript reports results only on TartanAir, KITTI, EuRoC, and TUM with no held-out evaluation on environments outside these distributions, leaving the strong generalization claim without a direct test against the risk of overfitting to training-set appearance or geometry patterns.

    Authors: The four datasets were deliberately chosen to span synthetic, outdoor driving, and indoor environments, which is the standard protocol for assessing generalization in visual odometry. Nevertheless, we recognize that an explicit held-out test on a completely unseen environment would provide stronger evidence. We will expand the discussion section to explicitly address this limitation and list it as an important direction for future work; we do not currently have additional held-out results to report. revision: partial
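Response 2 above promises the composite loss formulation. As a reading aid, here is a hedged sketch of the general shape such a loss could take; the supervised pose term, patch-reprojection term, and weights are illustrative assumptions, not the paper's actual design.

    import torch

    def composite_loss(pose_pred, pose_gt, reproj_pred, reproj_gt,
                       w_pose: float = 10.0, w_reproj: float = 0.1):
        """pose_*: (B, 6) se(3) twists; reproj_*: (B, N, 2) patch reprojections."""
        pose_term = torch.mean(torch.abs(pose_pred - pose_gt))       # L1 on pose
        reproj_term = torch.mean(
            torch.linalg.norm(reproj_pred - reproj_gt, dim=-1))      # patch flow error
        return w_pose * pose_term + w_reproj * reproj_term

    # Toy usage with random stand-ins for network outputs and supervision.
    loss = composite_loss(torch.randn(2, 6), torch.randn(2, 6),
                          torch.randn(2, 8, 2), torch.randn(2, 8, 2))
    print(f"loss: {loss.item():.3f}")

In losses of this family, the pose term supervises the BA output directly while the reprojection term shapes the features and the selector; how the real paper balances the two is exactly what the promised revision should specify.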

Circularity Check

0 steps flagged

No circularity: end-to-end training and empirical validation on standard benchmarks remain independent of fitted inputs by construction

full rationale

The paper introduces DINO-VO as an end-to-end monocular VO pipeline that incorporates a differentiable adaptive patch selector, multi-task feature extraction, and differentiable BA with inverse depth priors. The central claims of improved patch quality and strong generalization across synthetic, indoor, and outdoor scenes are supported by reported tracking accuracy on TartanAir, KITTI, EuRoC, and TUM. No derivation step reduces a prediction to its own inputs by construction, no self-citation is load-bearing for a uniqueness theorem, and no ansatz is smuggled in via prior work. The learned selector is a fitted component whose benefit is asserted via experiments rather than assumed tautologically, making the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on standard deep-learning assumptions plus the claim that the patch selector learns useful focus without additional regularization beyond the tracking loss.

free parameters (1)
  • neural network weights
    All parameters of the feature extractor, patch selector, and depth predictor are fitted end-to-end on the training sequences.
axioms (1)
  • standard math: The patch selector and bundle adjustment modules are fully differentiable.
    Required for gradient-based end-to-end training; invoked implicitly by the description of the pipeline.
invented entities (1)
  • DINO Patch Visual Odometry (DINO-VO) architecture (no independent evidence)
    purpose: End-to-end monocular state estimation with learned patch selection
    Newly proposed system combining DINO features, adaptive patch selection, and differentiable BA; no independent evidence outside the paper is supplied.
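The differentiability axiom above is testable in isolation: torch.autograd.gradcheck compares autograd gradients against finite differences and passes only if an op is cleanly differentiable. A minimal check against a generic soft-argmax selection op (a standalone stand-in, not the paper's module):

    import torch

    def soft_argmax_1d(scores: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
        """Differentiable 'pick': expected index under a softmax over scores."""
        weights = torch.softmax(scores / temperature, dim=-1)
        idx = torch.arange(scores.shape[-1], dtype=scores.dtype)
        return (weights * idx).sum(-1)

    scores = torch.randn(5, dtype=torch.double, requires_grad=True)  # float64 for gradcheck
    assert torch.autograd.gradcheck(soft_argmax_1d, (scores,), eps=1e-6)
    print("soft selection is differentiable (gradcheck passed)")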

pith-pipeline@v0.9.0 · 5476 in / 1268 out tokens · 32167 ms · 2026-05-13T17:02:38.905558+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 6 internal anchors

  1. [1] Michael Bloesch, Jan Czarnowski, Ronald Clark, Stefan Leutenegger, and Andrew J. Davison. CodeSLAM: Learning a compact, optimisable representation for dense visual SLAM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2560–2568, 2018.

  2. [2] Hudson Martins Silva Bruno and Esther Luna Colombini. LIFT-SLAM: A deep-learning feature-based monocular visual SLAM method. Neurocomputing, 455:97–110, 2021.

  3. [3] Michael Burri, Janosch Nikolic, Pascal Gohl, Thomas Schneider, Joern Rehder, Sammy Omari, Markus W. Achtelik, and Roland Siegwart. The EuRoC micro aerial vehicle datasets. The International Journal of Robotics Research, 35(10):1157–1163, 2016.

  4. [4] Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM. IEEE Transactions on Robotics, 37(6):1874–1890, 2021.

  5. [5] Qi Chen, Yu Cao, Jiawei Hou, Guanghao Li, Shoumeng Qiu, Bo Chen, Xiangyang Xue, Hong Lu, and Jian Pu. VPL-SLAM: A vertical line supported point line monocular SLAM system. IEEE Transactions on Intelligent Transportation Systems, 2024.

  6. [6] Qi Chen, Guanghao Li, Xiangyang Xue, and Jian Pu. Multi-LIO: A lightweight multiple LiDAR-inertial odometry system. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 13748–13754, 2024.

  7. [7] Jan Czarnowski, Tristan Laidlow, Ronald Clark, and Andrew J. Davison. DeepFactors: Real-time probabilistic dense monocular SLAM. IEEE Robotics and Automation Letters, 5(2):721–728, 2020.

  8. [8] Andrew J. Davison, Ian D. Reid, Nicholas D. Molton, and Olivier Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067, 2007.

  9. [9] Tianchen Deng, Yaohui Chen, Leyan Zhang, Jianfei Yang, Shenghai Yuan, Jiuming Liu, Danwei Wang, Hesheng Wang, and Weidong Chen. Compact 3D Gaussian splatting for dense visual SLAM. arXiv preprint arXiv:2403.11247, 2024.

  10. [10] Tianchen Deng, Guole Shen, Tong Qin, Jianyu Wang, Wentao Zhao, Jingchuan Wang, Danwei Wang, and Weidong Chen. PLGSLAM: Progressive neural scene representation with local to global bundle adjustment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19657–19666, 2024.

  11. [11] Tianchen Deng, Xuefeng Chen, Yi Chen, Qu Chen, Yuyao Xu, Lijin Yang, Le Xu, Yu Zhang, Bo Zhang, Wuxiong Huang, and Hesheng Wang. GaussianDWM: 3D Gaussian driving world model for unified scene understanding and multi-modal generation. arXiv preprint arXiv:2512.23180.

  12. [12] Tianchen Deng, Yue Pan, Shenghai Yuan, Dong Li, Chen Wang, Mingrui Li, Long Chen, Lihua Xie, Danwei Wang, Jingchuan Wang, Javier Civera, Hesheng Wang, and Weidong Chen. What is the best 3D scene representation for robotics? From geometric to foundation models. arXiv preprint arXiv:2512.03422, 2025.

  13. [13] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224–236, 2018.

  14. [14] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision (ECCV), pages 834–849. Springer, 2014.

  15. [15] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3):611–625, 2017.

  16. [16] Christian Forster, Matia Pizzoli, and Davide Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 15–22. IEEE, 2014.

  17. [17] Dorian Gálvez-López and Juan D. Tardós. Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics, 28(5):1188–1197, 2012.

  18. [18] Xin Gao and Jian Pu. Deep incomplete multi-view learning via cyclic permutation of VAEs. In The Thirteenth International Conference on Learning Representations, 2025.

  19. [19] Xiang Gao, Rui Wang, Nikolaus Demmel, and Daniel Cremers. LDSO: Direct sparse odometry with loop closure. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2198–2204. IEEE, 2018.

  20. [20] Xin Gao, Jiyao Liu, Guanghao Li, Yueming Lyu, Jianxiong Gao, Weichen Yu, Ningsheng Xu, Liang Wang, Caifeng Shan, Ziwei Liu, et al. GOOD: Training-free guided diffusion sampling for out-of-distribution detection. In The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025.

  21. [21] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.

  22. [22] Michael Grupp. evo: Python package for the evaluation of odometry and SLAM. https://github.com/MichaelGrupp/evo, 2017.

  23. [23] Jiasheng Guo, Xin Gao, Yuxiang Yan, Guanghao Li, and Jian Pu. Dark-ISP: Enhancing raw image processing for low-light object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9583–9593, 2025.

  24. [24] Zhuolin He, Xinrun Li, Jiacheng Tang, Shoumeng Qiu, Wenfu Wang, Xiangyang Xue, and Jian Pu. Toward camera open-set 3D object detection for autonomous driving scenarios. IEEE Transactions on Intelligent Transportation Systems, 26(12):23190–23201, 2025.

  25. [25] Zhuolin He, Jing Li, Guanghao Li, Xiaolei Chen, Jiacheng Tang, Siyang Zhang, Zhounan Jin, Feipeng Cai, Bin Li, Jian Pu, et al. DynamicVGGT: Learning dynamic point maps for 4D scene reconstruction in autonomous driving. arXiv preprint arXiv:2603.08254, 2026.

  26. [26] Jia Hu, Zhexi Lian, Xuerun Yan, Ruiang Bi, Dou Shen, Yu Ruan, and Haoran Wang. MPCFormer: A physics-informed data-driven approach for explainable socially-aware autonomous driving. arXiv preprint arXiv:2512.03795, 2025.

  27. [27] Huajian Huang, Longwei Li, Hui Cheng, and Sai-Kit Yeung. Photo-SLAM: Real-time simultaneous localization and photorealistic mapping for monocular, stereo, and RGB-D cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21584–21593, 2024.

  28. [28] Mohammad Mahdi Johari, Camilla Carta, and François Fleuret. ESLAM: Efficient dense SLAM system based on hybrid representation of signed distance fields. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17408–17419, 2023.

  29. [29] Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, and Jonathon Luiten. SplaTAM: Splat, track & map 3D Gaussians for dense RGB-D SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21357–21366, 2024.

  30. [30] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.

  31. [31] Georg Klein and David Murray. Parallel tracking and mapping for small AR workspaces. In IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 225–234. IEEE, 2007.

  32. [32] Guanghao Li, Kerui Ren, Linning Xu, Zhewen Zheng, Changjian Jiang, Xin Gao, Bo Dai, Jian Pu, Mulin Yu, and Jiangmiao Pang. ArtDeco: Toward high-fidelity on-the-fly reconstruction with hierarchical Gaussian structure and feed-forward guidance. In The Fourteenth International Conference on Learning Representations.

  33. [33] Guanghao Li, Yu Cao, Qi Chen, Xin Gao, Yifan Yang, and Jian Pu. PAPL-SLAM: Principal axis-anchored monocular point-line SLAM. IEEE Robotics and Automation Letters.

  34. [34] Guanghao Li, Qi Chen, Sijia Hu, Yuxiang Yan, and Jian Pu. Constrained Gaussian splatting via implicit TSDF hash grid for dense RGB-D SLAM. IEEE Transactions on Artificial Intelligence, 2025.

  35. [35] Guanghao Li, Qi Chen, Yuxiang Yan, and Jian Pu. EC-SLAM: Effectively constrained neural RGB-D SLAM with TSDF hash encoding and joint optimization. Pattern Recognition, 170:112034, 2026.

  36. [36] Ruihao Li, Sen Wang, Zhiqiang Long, and Dongbing Gu. UnDeepVO: Monocular visual odometry through unsupervised deep learning. In IEEE International Conference on Robotics and Automation (ICRA), pages 7286–7291. IEEE, 2018.

  37. [37] Ruihao Li, Sen Wang, and Dongbing Gu. DeepSLAM: A robust monocular SLAM system with unsupervised deep learning. IEEE Transactions on Industrial Electronics, 68(4):3577–3587, 2020.

  38. [38] Zhexi Lian, Haoran Wang, Xuerun Yan, Weimeng Lin, Xianhong Zhang, Yongyu Chen, and Jia Hu. Fine-tuning is not enough: A parallel framework for collaborative imitation and reinforcement learning in end-to-end autonomous driving. arXiv preprint arXiv:2603.13842, 2026.

  39. [39] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17627–17638, 2023.

  40. [40] Lahav Lipson, Zachary Teed, and Jia Deng. Deep patch visual SLAM. In European Conference on Computer Vision (ECCV), pages 424–440. Springer, 2025.

  41. [41] Lorenzo Liso, Erik Sandström, Vladimir Yugay, Luc Van Gool, and Martin R. Oswald. Loopy-SLAM: Dense neural SLAM with loop closures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20363–20373, 2024.

  42. [42] Junpeng Ma, Qizhe Zhang, Ming Lu, Zhibin Wang, Qiang Zhou, Jun Song, and Shanghang Zhang. MMG-Vid: Maximizing marginal gains at segment-level and token-level for efficient video LLMs. arXiv preprint arXiv:2508.21044, 2025.

  43. [43] Junpeng Ma, Sashuai Zhou, Guanghao Li, Xin Gao, Yue Cao, Hengyu Zeng, Yuxiang Yan, Zhibin Wang, Jun Song, Bo Zheng, et al. GIFT: Global irreplaceability frame targeting for efficient video understanding. arXiv preprint arXiv:2603.25072, 2026.

  44. [44] Hidenobu Matsuki, Riku Murai, Paul H. J. Kelly, and Andrew J. Davison. Gaussian splatting SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18039–18048, 2024.

  45. [45] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  46. [46] Raul Mur-Artal and Juan D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.

  47. [47] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.

  48. [48] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, pages 1–31, 2024.

  49. [49] Zhexi Peng, Tianjia Shao, Yong Liu, Jingke Zhou, Yin Yang, Jingdong Wang, and Kun Zhou. RTG-SLAM: Real-time 3D reconstruction at scale using Gaussian splatting. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.

  50. [50] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.

  51. [51] Edward Rosten and Tom Drummond. Machine learning for high-speed corner detection. In European Conference on Computer Vision (ECCV), pages 430–443. Springer, 2006.

  52. [52] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision, pages 2564–2571. IEEE, 2011.

  53. [53] Jianbo Shi and Carlo Tomasi. Good features to track. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 593–600. IEEE, 1994.

  54. [54] Edgar Sucar, Shikun Liu, Joseph Ortiz, and Andrew J. Davison. iMAP: Implicit mapping and positioning in real-time. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6229–6238, 2021.

  55. [55] Jiacheng Tang, Mingyue Feng, Jiachao Liu, Yaonong Wang, and Jian Pu. Decoupling scene perception and ego status: A multi-context fusion approach for enhanced generalization in end-to-end autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 9413–9420, 2026.

  56. [56] Jiacheng Tang, Zhiyuan Zhou, Zhuolin He, Jia Zhang, Kai Zhang, and Jian Pu. CausalVAD: De-confounding end-to-end autonomous driving via causal intervention. arXiv preprint arXiv:2603.18561, 2026.

  57. [57] Keisuke Tateno, Federico Tombari, Iro Laina, and Nassir Navab. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6243–6252, 2017.

  58. [58] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II, pages 402–419. Springer, 2020.

  59. [59] Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. Advances in Neural Information Processing Systems, 34:16558–16569, 2021.

  60. [60] Zachary Teed, Lahav Lipson, and Jia Deng. Deep patch visual odometry. Advances in Neural Information Processing Systems, 36, 2024.

  61. [61] Hengyi Wang, Jingwen Wang, and Lourdes Agapito. Co-SLAM: Joint coordinate and sparse parametric encodings for neural real-time SLAM. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13293–13302, 2023.

  62. [62] Sen Wang, Ronald Clark, Hongkai Wen, and Niki Trigoni. DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In IEEE International Conference on Robotics and Automation (ICRA), pages 2043–2050. IEEE, 2017.

  63. [63] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. TartanAir: A dataset to push the limits of visual SLAM. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916. IEEE, 2020.

  64. [64] Wenshan Wang, Yaoyu Hu, and Sebastian Scherer. TartanVO: A generalizable learning-based VO. In Conference on Robot Learning, pages 1761–1772. PMLR, 2021.

  65. [65] Kuan Xu, Yuefan Hao, Shenghai Yuan, Chen Wang, and Lihua Xie. AirSLAM: An efficient and illumination-robust point-line visual SLAM system. arXiv preprint arXiv:2408.03520, 2024.

  66. [66] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2. arXiv preprint arXiv:2406.09414, 2024.

  67. [67] Youmin Zhang, Fabio Tosi, Stefano Mattoccia, and Matteo Poggi. GO-SLAM: Global optimization for consistent 3D instant reconstruction. In IEEE/CVF International Conference on Computer Vision, pages 3727–3737, 2023.

  68. [68] Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. SparseVLM: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024.

  69. [69] Zhiqi Zhao, Chang Wu, Xiaotong Kong, Zejie Lv, Xiaoqi Du, and Qiyan Li. Light-SLAM: A robust deep-learning visual SLAM system based on LightGlue under challenging lighting conditions. arXiv preprint arXiv:2407.02382, 2024.

  70. [70] Huizhong Zhou, Benjamin Ummenhofer, and Thomas Brox. DeepTAM: Deep tracking and mapping. In European Conference on Computer Vision (ECCV), pages 822–838, 2018.

  71. [71] Sashuai Zhou, Qiang Zhou, Junpeng Ma, Yue Cao, Ruofan Hu, Ziang Zhang, Xiaoda Yang, Zhibin Wang, Jun Song, Cheng Yu, et al. SpatialReward: Verifiable spatial reward modeling for fine-grained spatial consistency in text-to-image generation. arXiv preprint arXiv:2603.22228, 2026.

  72. [72] Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R. Oswald, and Marc Pollefeys. NICE-SLAM: Neural implicit scalable encoding for SLAM. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12786–12796, 2022.