DL-VINS-Factory: A Modular Framework for Learned Visual Front-Ends in Visual-Inertial SLAM

Melissa Jia Ying Chong; Shoon Kit Lim; Ting Yang Ling

arxiv: 2607.01757 · v1 · pith:E3WX3M4Fnew · submitted 2026-07-02 · 💻 cs.CV · cs.RO

DL-VINS-Factory: A Modular Framework for Learned Visual Front-Ends in Visual-Inertial SLAM

Shoon Kit Lim , Melissa Jia Ying Chong , Ting Yang Ling This is my paper

Pith reviewed 2026-07-03 16:26 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords visual-inertial SLAMlearned featuresfeature extractionoptical flowloop closureembedded systemsALIKEDLightGlue

0 comments

The pith

Learned visual front-ends are viable for real-time embedded VI-SLAM but not universally superior to classical tracking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DL-VINS-Factory to plug several learned feature extractors into a complete visual-inertial SLAM pipeline and measure their effect on accuracy and speed. It compares ALIKED, RaCo, SuperPoint and XFeat paired with either Lucas-Kanade flow or LightGlue matching against a GFTT baseline, using the same Ceres sliding-window estimator and optional AnyLoc loop closure. Across indoor, outdoor, aggressive-motion and degraded datasets the results show measurable accuracy gains for some learned combinations yet no blanket advantage, with all valid setups reaching real-time rates on a Jetson after TensorRT acceleration. A sympathetic reader cares because the work supplies concrete numbers on when deep-learning detectors actually help inside a tightly coupled SLAM system rather than in isolation.

Core claim

DL-VINS-Factory demonstrates that learned front-ends can be swapped into a shared VI-SLAM architecture; relative to GFTT+LK, ALIKED+LG lowers EuRoC ATE by 5 percent monocular and 7 percent stereo with loop closure, by 12 percent on NTU-VIRAL stereo, while SuperPoint+LK and RaCo+LK cut ATE by 29 percent and 38 percent on the two Botanic Garden cameras; every configuration runs between 29-47 FPS monocular and 18-33 FPS stereo on Jetson hardware, and AnyLoc returns roughly 2-7 times more valid loops than BRIEF+DBoW2.

What carries the argument

DL-VINS-Factory, a modular integration layer that attaches learned extractors to either LK optical-flow tracking or LG descriptor matching, then feeds the resulting tracks into a shared Ceres sliding-window back-end with optional AnyLoc DINOv2-VLAD loop closure.

If this is right

Learned extractors can be inserted without altering the back-end optimizer or loop-closure module.
TensorRT acceleration on Jetson AGX Orin keeps all tested configurations inside real-time bounds.
AnyLoc loop closure yields substantially more valid loops than classical bag-of-words methods.
Optical-flow tracking remains preferable in some visual environments even when learned keypoints are available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Additional datasets with different sensors or lighting statistics could shift the observed ranking between learned and classical front-ends.
The apparent superiority of learned methods is sensitive to the precise classical baseline chosen for comparison.
The same modular structure could be reused to benchmark future extractors or matchers without re-implementing the estimator.
Hardware-specific acceleration remains essential for any claim of practicality on embedded platforms.

Load-bearing premise

The four chosen datasets together with the GFTT+LK baseline adequately represent the range of conditions and classical methods against which learned front-ends should be judged.

What would settle it

A fifth dataset or operating condition in which every learned front-end either exceeds the real-time FPS limits or produces higher ATE than the GFTT+LK baseline across the board would falsify the viability claim.

Figures

Figures reproduced from arXiv: 2607.01757 by Melissa Jia Ying Chong, Shoon Kit Lim, Ting Yang Ling.

**Figure 1.** Figure 1: System overview of the proposed DL-VINS-Factory framework. Dashed blue blocks represent the original VINS-Fusion implementation, whereas [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: a) ALIKED model serialized into two independent engines, with custom CUDA Kernel replacing the DKD for keypoint selection. b) RaCo reused [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative comparison of feature distribution and temporal tracking across extractors and datasets. Blue points show tracked current-frame features, [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Median ATE ratio comparing stereo versus monocular accuracy [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Mean ATE comparing Lucas-Kanade optical flow (+LK) against [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 7.** Figure 7: Trajectory evaluation on the SubT MRS smoke room sequence with origin aligned to ground truth (GT). Results are obtained with maximum extracted keypoints capped at 256. ”X” markers indicate the onset of divergence, while inset images highlight the transition from partial to full visual occlusion. TABLE VII THE ANYLOC COSINE SIMILARITY RETRIEVAL GATE IMPACT ON POSITIVE LOOP CLOSURE. Dataset Percentage of K… view at source ↗

**Figure 6.** Figure 6: Trajectory evaluation on the SubT MRS Lowlight 1 sequence, comparing various tracking methods against ground truth (GT) using SE3 alignment. The top plot shows the full length, where several methods experience significant divergence near the end of the run. The bottom plot displays the same trajectories with poses trimmed after 740 seconds, isolating the impact of the dynamically moving object while rig ar… view at source ↗

**Figure 8.** Figure 8: Cross-platform ATE[m] reproducibility scatter plot in log scale. The [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: Average per-frame compute breakdown on the EuRoC dataset. The stacked bars illustrate the latency contributions of individual front-end and back-end [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗

**Figure 10.** Figure 10: (Top) Jetson power consumption and (Bottom) feature-extraction [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗

**Figure 11.** Figure 11: Keypoint count ablation on Jetson AGX Orin. Relative accuracy (left) and front-end compute time (right) are normalized to a 256-keypoint baseline [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗

**Figure 12.** Figure 12: EuRoC mono-inertial 3D trajectory plot [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗

**Figure 13.** Figure 13: EuRoC stereo-inertial 3D trajectory plot [PITH_FULL_IMAGE:figures/full_fig_p022_13.png] view at source ↗

**Figure 14.** Figure 14: NTU-VIRAL mono-inertial 3D trajectory plot [PITH_FULL_IMAGE:figures/full_fig_p023_14.png] view at source ↗

**Figure 15.** Figure 15: NTU-VIRAL stereo-inertial 3D trajectory plot [PITH_FULL_IMAGE:figures/full_fig_p024_15.png] view at source ↗

**Figure 16.** Figure 16: Botanic Garden Gray mono-inertial 3D trajectory plot [PITH_FULL_IMAGE:figures/full_fig_p025_16.png] view at source ↗

**Figure 17.** Figure 17: Botanic Garden Gray stereo-inertial 3D trajectory plot [PITH_FULL_IMAGE:figures/full_fig_p026_17.png] view at source ↗

**Figure 18.** Figure 18: Botanic Garden RGB mono-inertial 3D trajectory plot [PITH_FULL_IMAGE:figures/full_fig_p027_18.png] view at source ↗

**Figure 19.** Figure 19: Botanic Garden RGB stereo-inertial 3D trajectory plot [PITH_FULL_IMAGE:figures/full_fig_p028_19.png] view at source ↗

**Figure 20.** Figure 20: SubT MRS mono-inertial 3D trajectory plot [PITH_FULL_IMAGE:figures/full_fig_p029_20.png] view at source ↗

read the original abstract

Deep-learning features excel in visual matching, yet their practical value in tightly coupled visual-inertial SLAM (VI-SLAM) remains insufficiently characterized. We present DL-VINS-Factory, a unified framework that integrates learned feature extractors (ALIKED, RaCo, SuperPoint, XFeat) with either Lucas--Kanade (LK) optical-flow tracking or LightGlue (LG) descriptor matching. All front-ends share a sliding-window Ceres back-end, with optional AnyLoc DINOv2-VLAD loop closure, and 4-DoF pose-graph optimization. We benchmark the system across the four datasets covering indoor, unstructured outdoor, aggressive-motion, and visually degraded conditions. Results show that learned front-ends are viable for real-time embedded VI-SLAM, but are not universally superior to classical tracking. Relative to the corresponding GFTT+LK baseline, ALIKED+LG reduces EuRoC ATE by $5\%$ in monocular odometry and by $7\%$ in stereo with loop-closure. On NTU-VIRAL, where aggressive aerial motion increases inter-frame viewpoint change, ALIKED+LG stereo reduces loop-closed ATE by $12\%$. In Botanic Garden dataset, optical-flow tracking remains preferable, but learned keypoints still improve over the baseline GFTT, in which SuperPoint+LK reduces grayscale camera ATE by $29\%$, while RaCo+LK reduces RGB camera ATE by $38\%$. On SubT-MRS, learned front-ends display varying degree of improvement based on individual cases. With TensorRT acceleration on a Jetson AGX Orin, all valid configurations run in real time between $29$--$47$ FPS in monocular mode and $18$--$33$ FPS in stereo mode for the EuRoC and NTU-VIRAL datasets. AnyLoc further confirms roughly $2$--$7\times$ more valid loops than BRIEF+DBoW2. The implementation is open-sourced at https://github.com/limshoonkit/DL-VINS-Factory-ROS2/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper supplies a practical open-source modular testbed and concrete benchmark numbers for swapping learned extractors into VI-SLAM, but the evidence for its main takeaway rests on a single classical baseline and thin statistical reporting.

read the letter

The useful part is the unified framework that keeps the Ceres sliding-window back-end fixed while swapping in ALIKED, RaCo, SuperPoint or XFeat, then pairs them with either LK flow or LightGlue matching, plus optional AnyLoc loop closure. They run the same pipeline on EuRoC, NTU-VIRAL, Botanic Garden and SubT-MRS, report ATE numbers, and give TensorRT FPS on a Jetson. That combination of code and cross-dataset timing data is new enough to be worth having around.

The numbers themselves are modest: ALIKED+LG beats GFTT+LK by 5 % monocular and 7 % stereo-with-loops on EuRoC, 12 % on NTU-VIRAL stereo, with bigger gains on one Botanic Garden camera but mixed results elsewhere. Real-time rates (29–47 FPS mono, 18–33 FPS stereo) are credible for embedded use. The open-source release and the fact that AnyLoc finds more loops than BRIEF+DBoW2 are clear positives.

The soft spots are straightforward. No error bars or significance tests appear on the ATE deltas, so it is impossible to judge whether the 5–12 % differences are reliable or just noise. The comparison is only against GFTT+LK; stronger classical pipelines could change the relative ranking. The four datasets cover indoor, outdoor, aggressive and degraded scenes, yet the claim that learned front-ends are “not universally superior” still hinges on whether these conditions and this one baseline are representative. If they are not, the nuanced conclusion does not travel far.

This is for practitioners who need a ready-made test harness and timing numbers before they invest in learned features for their own VI-SLAM stack. It is not a theoretical advance, but the implementation is reproducible and the benchmarking gap it fills is real. I would send it to peer review.

Referee Report

3 major / 1 minor

Summary. The paper presents DL-VINS-Factory, a modular open-source framework integrating learned feature extractors (ALIKED, RaCo, SuperPoint, XFeat) with LK optical flow or LightGlue matching into a tightly-coupled VI-SLAM pipeline sharing a Ceres sliding-window backend and optional AnyLoc loop closure. Across EuRoC, NTU-VIRAL, Botanic Garden and SubT-MRS, it reports that learned front-ends run in real time on Jetson AGX Orin (29-47 FPS monocular, 18-33 FPS stereo) and yield mixed accuracy results versus a GFTT+LK baseline: 5% ATE reduction on EuRoC monocular, 7% on EuRoC stereo with loop closure, 12% on NTU-VIRAL stereo, with optical-flow tracking preferred on Botanic Garden and case-dependent gains on SubT-MRS. The central claim is that learned front-ends are viable for embedded VI-SLAM but not universally superior to classical tracking.

Significance. If the empirical findings are robust, the work supplies a practical, reproducible toolkit for testing learned visual front-ends inside a full VI-SLAM stack and identifies concrete regimes (aggressive motion, loop closure) where descriptor matching helps. The open-source release and embedded timing numbers are directly usable by the robotics community.

major comments (3)

[Abstract / Experimental results] Abstract and experimental results: the reported ATE reductions (5% EuRoC monocular, 7% stereo with loop closure, 12% NTU-VIRAL stereo) are given as point values with no error bars, no statistical significance tests, and no description of run count, dataset splits, or failure-case handling, so the viability and superiority claims cannot be quantitatively assessed.
[Results and discussion] Baseline and conclusion: the statement that learned front-ends are 'not universally superior to classical tracking' is supported only by comparison to GFTT+LK; no additional classical pipelines (e.g., ORB-SLAM-style detection/matching or other detectors) are evaluated, leaving open whether the nuanced conclusion generalizes beyond the chosen baseline.
[Experimental setup] Dataset justification: the four datasets are described as covering indoor/outdoor/aggressive/degraded conditions, yet the paper supplies no explicit argument or ablation showing that this selection adequately samples the space of operating regimes needed to support the 'not universally superior' claim.

minor comments (1)

[Abstract] The abstract states that AnyLoc yields 'roughly 2--7× more valid loops' than BRIEF+DBoW2 but does not define the criterion used to count a loop as valid.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve the presentation of results, clarify baselines, and strengthen dataset justification.

read point-by-point responses

Referee: [Abstract / Experimental results] Abstract and experimental results: the reported ATE reductions (5% EuRoC monocular, 7% stereo with loop closure, 12% NTU-VIRAL stereo) are given as point values with no error bars, no statistical significance tests, and no description of run count, dataset splits, or failure-case handling, so the viability and superiority claims cannot be quantitatively assessed.

Authors: We agree that the reported ATE values are point estimates and that additional details would strengthen the quantitative assessment. In the revised manuscript we will add the number of runs performed per configuration, any observed variability, a description of the evaluation protocol including dataset splits, and notes on failure-case handling. Formal statistical significance testing was not performed in the original work; we will note this limitation explicitly in the discussion. revision: yes
Referee: [Results and discussion] Baseline and conclusion: the statement that learned front-ends are 'not universally superior to classical tracking' is supported only by comparison to GFTT+LK; no additional classical pipelines (e.g., ORB-SLAM-style detection/matching or other detectors) are evaluated, leaving open whether the nuanced conclusion generalizes beyond the chosen baseline.

Authors: The GFTT+LK baseline was selected because it provides the most direct comparison to the optical-flow tracking configurations within our shared Ceres backend. We acknowledge that the conclusion is therefore scoped to this baseline. In revision we will explicitly qualify the statement to reflect this scope and add a short discussion of how other classical pipelines (such as ORB-based matching) might differ, while noting that full integration of alternative back-ends lies outside the current modular front-end focus. revision: yes
Referee: [Experimental setup] Dataset justification: the four datasets are described as covering indoor/outdoor/aggressive/degraded conditions, yet the paper supplies no explicit argument or ablation showing that this selection adequately samples the space of operating regimes needed to support the 'not universally superior' claim.

Authors: The four datasets were chosen for their complementary operating regimes (structured indoor, aggressive aerial motion, unstructured outdoor, and visually degraded subterranean) and their established use in the VI-SLAM literature. The observed case-dependent performance differences across these regimes directly support the non-universal superiority claim. In the revised experimental-setup section we will add an explicit paragraph justifying the selection on these grounds. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with no derivations or fitted predictions.

full rationale

The paper describes a modular integration framework (DL-VINS-Factory) for learned feature extractors (ALIKED, RaCo, SuperPoint, XFeat) paired with LK or LG tracking, sharing a Ceres back-end and optional AnyLoc loop closure. All reported results consist of direct ATE and runtime measurements on four fixed datasets (EuRoC, NTU-VIRAL, Botanic Garden, SubT-MRS) relative to a GFTT+LK baseline. No equations, parameter fits, uniqueness theorems, or self-citations are invoked to derive any quantity; the viability and 'not universally superior' statements are simple empirical summaries of the tabulated comparisons. The study is therefore self-contained against external benchmarks and contains no load-bearing steps that reduce to their own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation is present; the work is an empirical software framework and benchmark study.

pith-pipeline@v0.9.1-grok · 5943 in / 1120 out tokens · 19547 ms · 2026-07-03T16:26:05.108263+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 7 canonical work pages

[1]

Orb: An efficient alternative to sift or surf,

E. Rublee, V . Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in2011 International Conference on Computer Vision, 2011, pp. 2564–2571

2011
[2]

Brisk: Binary robust invariant scalable keypoints,

S. Leutenegger, M. Chli, and R. Y . Siegwart, “Brisk: Binary robust invariant scalable keypoints,” in2011 International Conference on Computer Vision, 2011, pp. 2548–2555

2011
[3]

Comparative study of deep learning based features in slam,

C. Deng, K. Qiu, R. Xiong, and C. Zhou, “Comparative study of deep learning based features in slam,” in2019 4th Asia-Pacific Conference on Intelligent Robot Systems (ACIRS), 2019, pp. 250–254

2019
[4]

Superpoint: Self- supervised interest point detection and description,

D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self- supervised interest point detection and description,” in2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 337–33 712

2018
[5]

SuperGlue: Learning feature matching with graph neural networks,

P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “SuperGlue: Learning feature matching with graph neural networks,” inCVPR, 2020. [Online]. Available: https://arxiv.org/abs/1911.11763

work page arXiv 2020
[6]

Closed- loop benchmarking of stereo visual-inertial slam systems: Understanding the impact of drift and latency on tracking accuracy,

Y . Zhao, J. S. Smith, S. H. Karumanchi, and P. A. Vela, “Closed- loop benchmarking of stereo visual-inertial slam systems: Understanding the impact of drift and latency on tracking accuracy,” in2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 1105–1112

2020
[7]

Bags of binary words for fast place recognition in image sequences,

D. Galvez-L ´opez and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,”IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012

2012
[8]

Anyloc: Towards universal visual place recognition,

N. Keethaet al., “Anyloc: Towards universal visual place recognition,” IEEE Robotics and Automation Letters, vol. 9, no. 2, pp. 1286–1293, 2024

2024
[9]

Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,

C. Campos, R. Elvira, J. J. G. Rodr ´ıguez, J. M. M. Montiel, and J. D. Tard ´os, “Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,”IEEE Transactions on Robotics, vol. 37, no. 6, pp. 1874–1890, 2021

2021
[10]

Okvis2-x: Open keyframe-based visual-inertial slam configurable with dense depth or lidar, and gnss,

S. Boche, J. Jung, S. B. Laina, and S. Leutenegger, “Okvis2-x: Open keyframe-based visual-inertial slam configurable with dense depth or lidar, and gnss,”IEEE Transactions on Robotics, vol. 41, pp. 6064– 6083, 2025

2025
[11]

A general optimisation-based framework for global pose estimation with multiple sensors,

T. Qin, S. Cao, J. Pan, and S. Shen, “A general optimisation-based framework for global pose estimation with multiple sensors,”IET Cyber- Systems and Robotics, vol. 7, no. 1, p. e70023, 2025. [Online]. Available: https://ietresearch.onlinelibrary.wiley.com/doi/abs/10.1049/csy2.70023

work page doi:10.1049/csy2.70023 2025
[12]

Openvins: A research platform for visual-inertial estimation,

P. Geneva, K. Eckenhoff, W. Lee, Y . Yang, and G. Huang, “Openvins: A research platform for visual-inertial estimation,” in2020 IEEE In- ternational Conference on Robotics and Automation (ICRA), 2020, pp. 4666–4672

2020
[13]

Kimera: an open- source library for real-time metric-semantic localization and mapping,

A. Rosinol, M. Abate, Y . Chang, and L. Carlone, “Kimera: an open- source library for real-time metric-semantic localization and mapping,” in2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 1689–1696

2020
[14]

Airslam: An efficient and illumination-robust point-line visual slam system,

K. Xu, Y . Hao, S. Yuan, C. Wang, and L. Xie, “Airslam: An efficient and illumination-robust point-line visual slam system,”IEEE Transactions on Robotics, vol. 41, pp. 1673–1692, 2025

2025
[15]

Gcnv2: Efficient correspondence prediction for real-time slam,

J. Tang, L. Ericson, J. Folkesson, and P. Jensfelt, “Gcnv2: Efficient correspondence prediction for real-time slam,”IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3505–3512, 2019

2019
[16]

Dxslam: A robust and efficient visual slam system with deep features,

D. Liet al., “Dxslam: A robust and efficient visual slam system with deep features,” in2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 4958–4965

2020
[17]

Hfnet-slam: An accurate and real-time monocular slam system with deep features,

L. Liu and J. M. Aitken, “Hfnet-slam: An accurate and real-time monocular slam system with deep features,”Sensors, vol. 23, no. 4,
[18]

Available: https://www.mdpi.com/1424-8220/23/4/2113

[Online]. Available: https://www.mdpi.com/1424-8220/23/4/2113
[19]

Supslam: A robust visual inertial slam system using superpoint for unmanned aerial vehicles,

C. H. Quach, M. D. Phung, H. V . Le, and S. Perry, “Supslam: A robust visual inertial slam system using superpoint for unmanned aerial vehicles,” in2021 8th NAFOSTED Conference on Information and Computer Science (NICS), 2021, pp. 507–512

2021
[20]

D-vins: Dynamic adaptive visual–inertial slam with imu prior and semantic constraints in dynamic scenes,

Y . Sunet al., “D-vins: Dynamic adaptive visual–inertial slam with imu prior and semantic constraints in dynamic scenes,” Remote Sensing, vol. 15, no. 15, 2023. [Online]. Available: https: //www.mdpi.com/2072-4292/15/15/3881

2023
[21]

Supervins: A real- time visual-inertial slam framework for challenging imaging conditions,

H. Luo, Y . Liu, C. Guo, Z. Li, and W. Song, “Supervins: A real- time visual-inertial slam framework for challenging imaging conditions,” IEEE Sensors Journal, vol. 25, no. 13, pp. 26 042–26 050, 2025

2025
[22]

LoFTR: Detector-free local feature matching with transformers,

J. Sun, Z. Shen, Y . Wang, H. Bao, and X. Zhou, “LoFTR: Detector-free local feature matching with transformers,”CVPR, 2021

2021
[23]

Raft: Recurrent all-pairs field transforms for optical flow,

Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms for optical flow,” inComputer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II. Berlin, Heidelberg: Springer-Verlag, 2020, p. 402–419. [Online]. Available: https://doi.org/10.1007/978-3-030-58536-5 24

work page doi:10.1007/978-3-030-58536-5 2020
[24]

Tightly-coupled visual- inertial odometry with robust feature association in dynamic illumination environments,

J. Zhang, C. Zhang, Q. Liu, Q. Ma, and J. Qin, “Tightly-coupled visual- inertial odometry with robust feature association in dynamic illumination environments,”Robotica, vol. 43, no. 6, p. 2304–2319, 2025

2025
[25]

DROID-SLAM: Deep visual SLAM for monoc- ular, stereo, and RGB-D cameras,

Z. Teed and J. Deng, “DROID-SLAM: Deep visual SLAM for monoc- ular, stereo, and RGB-D cameras,” inAdvances in Neural Information Processing Systems, 2021

2021
[26]

Deep patch visual slam,

L. Lipson, Z. Teed, and J. Deng, “Deep patch visual slam,” in Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part II. Berlin, Heidelberg: Springer-Verlag, 2024, p. 424–440. [Online]. Available: https://doi.org/10.1007/978-3-031-72627-9 24

work page doi:10.1007/978-3-031-72627-9 2024
[27]

Netvlad: Cnn architecture for weakly supervised place recognition,

R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad: Cnn architecture for weakly supervised place recognition,” in2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5297–5307

2016
[28]

Mixvpr: Feature mixing for visual place recognition,

A. Ali-Bey, B. Chaib-Draa, and P. Gigu ´ere, “Mixvpr: Feature mixing for visual place recognition,” in2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 2997–3006

2023
[29]

Eigenplaces: Training viewpoint robust models for visual place recognition,

G. Berton, G. Trivigno, B. Caputo, and C. Masone, “Eigenplaces: Training viewpoint robust models for visual place recognition,” in2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 11 046–11 056

2023
[30]

Rethinking visual geo- localization for large-scale applications,

G. Berton, C. Masone, and B. Caputo, “Rethinking visual geo- localization for large-scale applications,” in2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 4868– 4878

2022
[31]

DINOv2: Learning robust visual features without supervision,

M. Oquabet al., “DINOv2: Learning robust visual features without supervision,”Transactions on Machine Learning Research, 2024, featured Certification. [Online]. Available: https://openreview.net/ forum?id=a68SUt6zFt

2024
[32]

VINS-Mono: A robust and versatile monoc- ular visual-inertial state estimator,

T. Qin, P. Li, and S. Shen, “VINS-Mono: A robust and versatile monoc- ular visual-inertial state estimator,”IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018

2018
[33]

Aliked: A lighter keypoint and descriptor extraction network via deformable transformation,

X. Zhaoet al., “Aliked: A lighter keypoint and descriptor extraction network via deformable transformation,”IEEE Transactions on Instru- mentation and Measurement, vol. 72, pp. 1–16, 2023

2023
[34]

RaCo: Ranking and Covariance for Practical Learned Keypoints ,

A. Shenoi, P. Lindenberger, P.-E. Sarlin, and M. Pollefeys, “ RaCo: Ranking and Covariance for Practical Learned Keypoints ,” in2026 International Conference on 3D Vision (3DV). Los Alamitos, CA, USA: IEEE Computer Society, Mar. 2026, pp. 267–276. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/3DV69130.2026.00033

work page doi:10.1109/3dv69130.2026.00033 2026
[35]

Xfeat: Accelerated features for lightweight image matching,

G. Potje, F. Cadar, A. Araujo, R. Martins, and E. R. Nascimento, “Xfeat: Accelerated features for lightweight image matching,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 2682–2691

2024
[36]

MAC-VO: Metrics-aware covariance for learning-based stereo visual odometry,

Y . Qiu, Y . Chen, Z. Zhang, W. Wang, and S. Scherer, “MAC-VO: Metrics-aware covariance for learning-based stereo visual odometry,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 3803–3814

2025
[37]

Lightglue: Local feature matching at light speed,

P. Lindenberger, P.-E. Sarlin, and M. Pollefeys, “Lightglue: Local feature matching at light speed,” in2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 17 581–17 592

2023
[38]

MAGSAC++, a fast, reliable and accurate robust estimator,

D. Barath, J. Noskova, M. Ivashechkin, and J. Matas, “MAGSAC++, a fast, reliable and accurate robust estimator,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1304–1312

2020
[39]

Billion-scale similarity search with GPUs,

J. Johnson, M. Douze, and H. J ´egou, “Billion-scale similarity search with GPUs,”IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535– 547, 2019

2019
[40]

The euroc micro aerial vehicle datasets,

M. Burriet al., “The euroc micro aerial vehicle datasets,”The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016. [Online]. Available: https://doi.org/10.1177/ 0278364915620033 JOURNAL OF LATEX 17

2016
[41]

Ntu viral: A visual-inertial-ranging-lidar dataset, from an aerial vehicle viewpoint,

T.-M. Nguyenet al., “Ntu viral: A visual-inertial-ranging-lidar dataset, from an aerial vehicle viewpoint,”The International Journal of Robotics Research, vol. 41, no. 3, pp. 270–280, 2022. [Online]. Available: https://doi.org/10.1177/02783649211052312

work page doi:10.1177/02783649211052312 2022
[42]

Botanicgarden: A high-quality dataset for robot navigation in unstructured natural environments,

Y . Liuet al., “Botanicgarden: A high-quality dataset for robot navigation in unstructured natural environments,”IEEE Robotics and Automation Letters, vol. 9, no. 3, pp. 2798–2805, 2024

2024
[43]

Subt-mrs dataset: Pushing slam towards all-weather environments,

S. Zhaoet al., “Subt-mrs dataset: Pushing slam towards all-weather environments,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 22 647–22 657

2024
[44]

Good features to track,

J. Shi and Tomasi, “Good features to track,” in1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 1994, pp. 593–600

1994
[45]

An iterative image registration technique with an application to stereo vision,

B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” inProceedings of the 7th Inter- national Joint Conference on Artificial Intelligence - Volume 2, ser. IJCAI’81. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1981, p. 674–679

1981
[46]

Distinctive image features from scale-invariant keypoints,

D. G. Lowe, “Distinctive image features from scale-invariant keypoints,”International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004. [Online]. Available: https://doi.org/10.1023/B: VISI.0000029664.99615.94

work page doi:10.1023/b: 2004
[47]

Breaking of brightness consistency in optical flow with a lightweight cnn network,

Y . Lin, S. Wang, Y . Jiang, and B. Han, “Breaking of brightness consistency in optical flow with a lightweight cnn network,”IEEE Robotics and Automation Letters, vol. 9, no. 8, pp. 6840–6847, 2024. JOURNAL OF LATEX 18 APPENDIXA COVARIANCE-WEIGHTEDREPROJECTION This section details the covariance formulation proposed in Sec. III-A. Given measurement uncert...

2024

[1] [1]

Orb: An efficient alternative to sift or surf,

E. Rublee, V . Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in2011 International Conference on Computer Vision, 2011, pp. 2564–2571

2011

[2] [2]

Brisk: Binary robust invariant scalable keypoints,

S. Leutenegger, M. Chli, and R. Y . Siegwart, “Brisk: Binary robust invariant scalable keypoints,” in2011 International Conference on Computer Vision, 2011, pp. 2548–2555

2011

[3] [3]

Comparative study of deep learning based features in slam,

C. Deng, K. Qiu, R. Xiong, and C. Zhou, “Comparative study of deep learning based features in slam,” in2019 4th Asia-Pacific Conference on Intelligent Robot Systems (ACIRS), 2019, pp. 250–254

2019

[4] [4]

Superpoint: Self- supervised interest point detection and description,

D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self- supervised interest point detection and description,” in2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 337–33 712

2018

[5] [5]

SuperGlue: Learning feature matching with graph neural networks,

P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “SuperGlue: Learning feature matching with graph neural networks,” inCVPR, 2020. [Online]. Available: https://arxiv.org/abs/1911.11763

work page arXiv 2020

[6] [6]

Closed- loop benchmarking of stereo visual-inertial slam systems: Understanding the impact of drift and latency on tracking accuracy,

Y . Zhao, J. S. Smith, S. H. Karumanchi, and P. A. Vela, “Closed- loop benchmarking of stereo visual-inertial slam systems: Understanding the impact of drift and latency on tracking accuracy,” in2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 1105–1112

2020

[7] [7]

Bags of binary words for fast place recognition in image sequences,

D. Galvez-L ´opez and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,”IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012

2012

[8] [8]

Anyloc: Towards universal visual place recognition,

N. Keethaet al., “Anyloc: Towards universal visual place recognition,” IEEE Robotics and Automation Letters, vol. 9, no. 2, pp. 1286–1293, 2024

2024

[9] [9]

Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,

C. Campos, R. Elvira, J. J. G. Rodr ´ıguez, J. M. M. Montiel, and J. D. Tard ´os, “Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam,”IEEE Transactions on Robotics, vol. 37, no. 6, pp. 1874–1890, 2021

2021

[10] [10]

Okvis2-x: Open keyframe-based visual-inertial slam configurable with dense depth or lidar, and gnss,

S. Boche, J. Jung, S. B. Laina, and S. Leutenegger, “Okvis2-x: Open keyframe-based visual-inertial slam configurable with dense depth or lidar, and gnss,”IEEE Transactions on Robotics, vol. 41, pp. 6064– 6083, 2025

2025

[11] [11]

A general optimisation-based framework for global pose estimation with multiple sensors,

T. Qin, S. Cao, J. Pan, and S. Shen, “A general optimisation-based framework for global pose estimation with multiple sensors,”IET Cyber- Systems and Robotics, vol. 7, no. 1, p. e70023, 2025. [Online]. Available: https://ietresearch.onlinelibrary.wiley.com/doi/abs/10.1049/csy2.70023

work page doi:10.1049/csy2.70023 2025

[12] [12]

Openvins: A research platform for visual-inertial estimation,

P. Geneva, K. Eckenhoff, W. Lee, Y . Yang, and G. Huang, “Openvins: A research platform for visual-inertial estimation,” in2020 IEEE In- ternational Conference on Robotics and Automation (ICRA), 2020, pp. 4666–4672

2020

[13] [13]

Kimera: an open- source library for real-time metric-semantic localization and mapping,

A. Rosinol, M. Abate, Y . Chang, and L. Carlone, “Kimera: an open- source library for real-time metric-semantic localization and mapping,” in2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 1689–1696

2020

[14] [14]

Airslam: An efficient and illumination-robust point-line visual slam system,

K. Xu, Y . Hao, S. Yuan, C. Wang, and L. Xie, “Airslam: An efficient and illumination-robust point-line visual slam system,”IEEE Transactions on Robotics, vol. 41, pp. 1673–1692, 2025

2025

[15] [15]

Gcnv2: Efficient correspondence prediction for real-time slam,

J. Tang, L. Ericson, J. Folkesson, and P. Jensfelt, “Gcnv2: Efficient correspondence prediction for real-time slam,”IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3505–3512, 2019

2019

[16] [16]

Dxslam: A robust and efficient visual slam system with deep features,

D. Liet al., “Dxslam: A robust and efficient visual slam system with deep features,” in2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 4958–4965

2020

[17] [17]

Hfnet-slam: An accurate and real-time monocular slam system with deep features,

L. Liu and J. M. Aitken, “Hfnet-slam: An accurate and real-time monocular slam system with deep features,”Sensors, vol. 23, no. 4,

[18] [18]

Available: https://www.mdpi.com/1424-8220/23/4/2113

[Online]. Available: https://www.mdpi.com/1424-8220/23/4/2113

[19] [19]

Supslam: A robust visual inertial slam system using superpoint for unmanned aerial vehicles,

C. H. Quach, M. D. Phung, H. V . Le, and S. Perry, “Supslam: A robust visual inertial slam system using superpoint for unmanned aerial vehicles,” in2021 8th NAFOSTED Conference on Information and Computer Science (NICS), 2021, pp. 507–512

2021

[20] [20]

D-vins: Dynamic adaptive visual–inertial slam with imu prior and semantic constraints in dynamic scenes,

Y . Sunet al., “D-vins: Dynamic adaptive visual–inertial slam with imu prior and semantic constraints in dynamic scenes,” Remote Sensing, vol. 15, no. 15, 2023. [Online]. Available: https: //www.mdpi.com/2072-4292/15/15/3881

2023

[21] [21]

Supervins: A real- time visual-inertial slam framework for challenging imaging conditions,

H. Luo, Y . Liu, C. Guo, Z. Li, and W. Song, “Supervins: A real- time visual-inertial slam framework for challenging imaging conditions,” IEEE Sensors Journal, vol. 25, no. 13, pp. 26 042–26 050, 2025

2025

[22] [22]

LoFTR: Detector-free local feature matching with transformers,

J. Sun, Z. Shen, Y . Wang, H. Bao, and X. Zhou, “LoFTR: Detector-free local feature matching with transformers,”CVPR, 2021

2021

[23] [23]

Raft: Recurrent all-pairs field transforms for optical flow,

Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms for optical flow,” inComputer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II. Berlin, Heidelberg: Springer-Verlag, 2020, p. 402–419. [Online]. Available: https://doi.org/10.1007/978-3-030-58536-5 24

work page doi:10.1007/978-3-030-58536-5 2020

[24] [24]

Tightly-coupled visual- inertial odometry with robust feature association in dynamic illumination environments,

J. Zhang, C. Zhang, Q. Liu, Q. Ma, and J. Qin, “Tightly-coupled visual- inertial odometry with robust feature association in dynamic illumination environments,”Robotica, vol. 43, no. 6, p. 2304–2319, 2025

2025

[25] [25]

DROID-SLAM: Deep visual SLAM for monoc- ular, stereo, and RGB-D cameras,

Z. Teed and J. Deng, “DROID-SLAM: Deep visual SLAM for monoc- ular, stereo, and RGB-D cameras,” inAdvances in Neural Information Processing Systems, 2021

2021

[26] [26]

Deep patch visual slam,

L. Lipson, Z. Teed, and J. Deng, “Deep patch visual slam,” in Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part II. Berlin, Heidelberg: Springer-Verlag, 2024, p. 424–440. [Online]. Available: https://doi.org/10.1007/978-3-031-72627-9 24

work page doi:10.1007/978-3-031-72627-9 2024

[27] [27]

Netvlad: Cnn architecture for weakly supervised place recognition,

R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad: Cnn architecture for weakly supervised place recognition,” in2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5297–5307

2016

[28] [28]

Mixvpr: Feature mixing for visual place recognition,

A. Ali-Bey, B. Chaib-Draa, and P. Gigu ´ere, “Mixvpr: Feature mixing for visual place recognition,” in2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 2997–3006

2023

[29] [29]

Eigenplaces: Training viewpoint robust models for visual place recognition,

G. Berton, G. Trivigno, B. Caputo, and C. Masone, “Eigenplaces: Training viewpoint robust models for visual place recognition,” in2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 11 046–11 056

2023

[30] [30]

Rethinking visual geo- localization for large-scale applications,

G. Berton, C. Masone, and B. Caputo, “Rethinking visual geo- localization for large-scale applications,” in2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 4868– 4878

2022

[31] [31]

DINOv2: Learning robust visual features without supervision,

M. Oquabet al., “DINOv2: Learning robust visual features without supervision,”Transactions on Machine Learning Research, 2024, featured Certification. [Online]. Available: https://openreview.net/ forum?id=a68SUt6zFt

2024

[32] [32]

VINS-Mono: A robust and versatile monoc- ular visual-inertial state estimator,

T. Qin, P. Li, and S. Shen, “VINS-Mono: A robust and versatile monoc- ular visual-inertial state estimator,”IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018

2018

[33] [33]

Aliked: A lighter keypoint and descriptor extraction network via deformable transformation,

X. Zhaoet al., “Aliked: A lighter keypoint and descriptor extraction network via deformable transformation,”IEEE Transactions on Instru- mentation and Measurement, vol. 72, pp. 1–16, 2023

2023

[34] [34]

RaCo: Ranking and Covariance for Practical Learned Keypoints ,

A. Shenoi, P. Lindenberger, P.-E. Sarlin, and M. Pollefeys, “ RaCo: Ranking and Covariance for Practical Learned Keypoints ,” in2026 International Conference on 3D Vision (3DV). Los Alamitos, CA, USA: IEEE Computer Society, Mar. 2026, pp. 267–276. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/3DV69130.2026.00033

work page doi:10.1109/3dv69130.2026.00033 2026

[35] [35]

Xfeat: Accelerated features for lightweight image matching,

G. Potje, F. Cadar, A. Araujo, R. Martins, and E. R. Nascimento, “Xfeat: Accelerated features for lightweight image matching,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 2682–2691

2024

[36] [36]

MAC-VO: Metrics-aware covariance for learning-based stereo visual odometry,

Y . Qiu, Y . Chen, Z. Zhang, W. Wang, and S. Scherer, “MAC-VO: Metrics-aware covariance for learning-based stereo visual odometry,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 3803–3814

2025

[37] [37]

Lightglue: Local feature matching at light speed,

P. Lindenberger, P.-E. Sarlin, and M. Pollefeys, “Lightglue: Local feature matching at light speed,” in2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 17 581–17 592

2023

[38] [38]

MAGSAC++, a fast, reliable and accurate robust estimator,

D. Barath, J. Noskova, M. Ivashechkin, and J. Matas, “MAGSAC++, a fast, reliable and accurate robust estimator,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1304–1312

2020

[39] [39]

Billion-scale similarity search with GPUs,

J. Johnson, M. Douze, and H. J ´egou, “Billion-scale similarity search with GPUs,”IEEE Transactions on Big Data, vol. 7, no. 3, pp. 535– 547, 2019

2019

[40] [40]

The euroc micro aerial vehicle datasets,

M. Burriet al., “The euroc micro aerial vehicle datasets,”The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016. [Online]. Available: https://doi.org/10.1177/ 0278364915620033 JOURNAL OF LATEX 17

2016

[41] [41]

Ntu viral: A visual-inertial-ranging-lidar dataset, from an aerial vehicle viewpoint,

T.-M. Nguyenet al., “Ntu viral: A visual-inertial-ranging-lidar dataset, from an aerial vehicle viewpoint,”The International Journal of Robotics Research, vol. 41, no. 3, pp. 270–280, 2022. [Online]. Available: https://doi.org/10.1177/02783649211052312

work page doi:10.1177/02783649211052312 2022

[42] [42]

Botanicgarden: A high-quality dataset for robot navigation in unstructured natural environments,

Y . Liuet al., “Botanicgarden: A high-quality dataset for robot navigation in unstructured natural environments,”IEEE Robotics and Automation Letters, vol. 9, no. 3, pp. 2798–2805, 2024

2024

[43] [43]

Subt-mrs dataset: Pushing slam towards all-weather environments,

S. Zhaoet al., “Subt-mrs dataset: Pushing slam towards all-weather environments,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 22 647–22 657

2024

[44] [44]

Good features to track,

J. Shi and Tomasi, “Good features to track,” in1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 1994, pp. 593–600

1994

[45] [45]

An iterative image registration technique with an application to stereo vision,

B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” inProceedings of the 7th Inter- national Joint Conference on Artificial Intelligence - Volume 2, ser. IJCAI’81. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1981, p. 674–679

1981

[46] [46]

Distinctive image features from scale-invariant keypoints,

D. G. Lowe, “Distinctive image features from scale-invariant keypoints,”International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004. [Online]. Available: https://doi.org/10.1023/B: VISI.0000029664.99615.94

work page doi:10.1023/b: 2004

[47] [47]

Breaking of brightness consistency in optical flow with a lightweight cnn network,

Y . Lin, S. Wang, Y . Jiang, and B. Han, “Breaking of brightness consistency in optical flow with a lightweight cnn network,”IEEE Robotics and Automation Letters, vol. 9, no. 8, pp. 6840–6847, 2024. JOURNAL OF LATEX 18 APPENDIXA COVARIANCE-WEIGHTEDREPROJECTION This section details the covariance formulation proposed in Sec. III-A. Given measurement uncert...

2024