Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model

Changhao Chen; Dezhen Song; Haoang Li; Huajian Zeng; Jiaqi Yang; Liang Li; Xingxing Zuo; Yuantai Zhang

arxiv: 2605.17327 · v1 · pith:XGHL5L4Wnew · submitted 2026-05-17 · 💻 cs.RO · cs.AI · cs.CV

Yuantai Zhang, Jiaqi Yang, Huajian Zeng, Changhao Chen, Haoang Li, Liang Li, Dezhen Song, Xingxing Zuo

Pith reviewed 2026-05-20 12:58 UTC · model grok-4.3

classification 💻 cs.RO cs.AI cs.CV

keywords monocular visual-inertial initialization · feature-free VINS · feed-forward 3D model · point cloud fusion · scale and gravity estimation · visually degraded environments

The pith

A feed-forward 3D model lets monocular visual-inertial systems initialize without tracking any visual features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that point clouds predicted directly from images by a feed-forward 3D model can replace the usual visual feature correspondences in monocular VINS initialization. These up-to-scale clouds are fused with short inertial sequences to jointly recover initial scale, velocity, and gravity direction. Removing feature tracking cuts system complexity and raises reliability, with experiments reporting success rates above 90 percent and average initialization times below 1.2 seconds. The approach maintains performance in low-texture or motion-blurred scenes where traditional methods commonly fail.

Core claim

The authors establish that a feature-free initialization procedure, built on up-to-scale point clouds from a single-image feed-forward 3D model and aligned with inertial measurements, can estimate the initial metric scale, velocity, and gravity vector more reliably and from a shorter data window (typically under 1.2 s, versus the 3-4 s most existing methods require) than methods that depend on visual feature tracking and correspondence.

What carries the argument

Feed-forward 3D model that outputs up-to-scale point clouds from individual images, which are then registered to short IMU sequences to solve the joint estimation of scale, velocity, and gravity direction.
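To make the fusion step concrete, the following is a minimal, dependency-free Python sketch of a linear solve for scale, initial velocity, and gravity. It is an illustration under stated assumptions, not the authors' implementation. It assumes the registered point clouds have already been reduced to up-to-scale sensor positions `p_hat` expressed in the first IMU frame, that the camera-IMU lever arm is negligible, and that `alpha` holds bias-corrected, rotation-compensated double integrals of the accelerometer readings. Under those assumptions each frame contributes the constraint $s\,\hat{p}_k - v\,t_k - \tfrac{1}{2}\,g\,t_k^2 = \alpha_k$. All function and variable names are hypothetical.

```python
from typing import List, Sequence, Tuple


def solve_linear(M: List[List[float]], y: List[float]) -> List[float]:
    """Solve M x = y for a small dense system by Gaussian elimination
    with partial pivoting."""
    n = len(y)
    aug = [list(M[i]) + [y[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("rank-deficient system (insufficient motion?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def init_scale_velocity_gravity(
    times: Sequence[float],
    p_hat: Sequence[Sequence[float]],
    alpha: Sequence[Sequence[float]],
) -> Tuple[float, List[float], List[float]]:
    """Least-squares estimate of x = [s, v (3), g (3)].

    Assumed model (hypothetical, see lead-in):
        s * p_hat_k - v * t_k - 0.5 * g * t_k**2 = alpha_k
    """
    rows: List[List[float]] = []
    rhs: List[float] = []
    for t, p, a in zip(times, p_hat, alpha):
        for axis in range(3):
            row = [0.0] * 7
            row[0] = p[axis]              # coefficient of the scale s
            row[1 + axis] = -t            # coefficient of the velocity component
            row[4 + axis] = -0.5 * t * t  # coefficient of the gravity component
            rows.append(row)
            rhs.append(a[axis])
    # Normal equations (A^T A) x = A^T b; acceptable for a 7-unknown sketch.
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(7)] for i in range(7)]
    Atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(7)]
    x = solve_linear(AtA, Atb)
    return x[0], x[1:4], x[4:7]
```

In this sketch, a stationary or exactly constant-velocity window makes the scale column linearly dependent on the velocity and gravity columns, so the solver should raise instead of returning a meaningless scale. A production initializer would more likely use a QR or SVD solve and enforce the known gravity magnitude.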

If this is right

Initialization succeeds in more than 90 percent of trials on standard benchmarks.
The required data duration typically drops below 1.2 seconds.
Performance holds across indoor and outdoor scenes, including those with visual degradation.
System design simplifies by dropping all visual feature extraction and matching steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same point-cloud predictions could support quick metric recovery in other monocular robotic tasks that currently rely on structure-from-motion bootstrapping.
Because the method tolerates short data windows, it may allow repeated re-initialization during long missions when tracking is lost.
Combining the predicted clouds with additional depth priors from the same model family could further tighten the scale estimate without extra sensors.

Load-bearing premise

The geometric structure in the predicted point clouds remains accurate enough, despite unknown absolute scale, for inertial fusion to recover reliable initial state estimates.

What would settle it

A dataset sequence where the 3D model produces point clouds whose relative geometry deviates substantially from ground truth, causing the fused initialization to produce scale or gravity errors larger than those of feature-based baselines.
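Such a comparison needs explicit error measures. The sketch below shows two common choices, relative scale error and the angle between estimated and reference gravity vectors. These are assumptions made for illustration: this page does not state the paper's exact success criteria, and the function names are hypothetical.

```python
import math
from typing import Sequence


def scale_error_percent(s_est: float, s_true: float) -> float:
    """Relative scale error in percent; s_true must be nonzero."""
    if s_true == 0:
        raise ValueError("reference scale must be nonzero")
    return 100.0 * abs(s_est - s_true) / abs(s_true)


def gravity_angle_error_deg(g_est: Sequence[float], g_true: Sequence[float]) -> float:
    """Angle in degrees between estimated and reference gravity vectors."""
    dot = sum(a * b for a, b in zip(g_est, g_true))
    norm = math.sqrt(sum(a * a for a in g_est)) * math.sqrt(sum(b * b for b in g_true))
    if norm == 0:
        raise ValueError("gravity vectors must be nonzero")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))
```

Applying the same two functions to the feature-free method and to each feature-based baseline on the same sequence would give the side-by-side numbers this test calls for.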

Figures

Figures reproduced from arXiv: 2605.17327 by Changhao Chen, Dezhen Song, Haoang Li, Huajian Zeng, Jiaqi Yang, Liang Li, Xingxing Zuo, Yuantai Zhang.

**Figure 1.** System overview. A feed-forward 3D model predicts up-to-scale point clouds (yellow) from input images, while IMU …

**Figure 2.** Comparison of factor graphs for nonlinear refinement. (a) Traditional feature-based optimization jointly optimizes IMU …

**Figure 3.** Failure source distribution across methods. Obs.: in …

**Figure 4.** Per-attempt initialization runtime breakdown. Infer.: …

**Figure 5.** Example scenes from our self-collected dataset: (a) …

**Figure 6.** Trajectory comparison on representative sequences.

Original abstract

Fast and reliable initialization is critical for monocular visual-inertial navigation systems (VINS), as it establishes the starting conditions for subsequent state estimation. Despite steady progress, most existing methods heavily rely on visual feature correspondences and require 3-4 seconds of sensory data for successful initialization, which limits their applicability and efficiency. With the advent of feed-forward 3D models that can directly predict point clouds from images, we revisit the visual-inertial initialization problem from a concise perspective. In this work, we propose a feature-free initialization framework that leverages up-to-scale point clouds predicted by a feed-forward 3D model, thereby obviating the need for visual feature tracking and estimation. This design substantially reduces system complexity and improves the reliability of initialization. Experiments on public datasets demonstrate that the proposed feature-free initialization method achieves the highest success rate, exceeding 90%, and significantly reduces the data duration required for successful initialization, typically to under 1.2 s. We further validate our method on a self-collected dataset covering various indoor and outdoor scenarios, demonstrating robust performance, particularly in visually degraded environments where existing methods often fail. The code and dataset are available at https://github.com/Yuantai-Z/FF-VIO-Init.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper replaces feature tracking with point clouds from a feed-forward 3D model to cut monocular VINS initialization to under 1.2 seconds at over 90 percent success, but the gains rest on unablated depth accuracy.

The main point is that they drop visual feature correspondences entirely and feed up-to-scale point clouds from a pre-trained 3D model straight into the initialization solver. This lets them shorten the required data window from the usual 3-4 seconds down to typically under 1.2 seconds while reporting success rates above 90 percent on public datasets and their own indoor-outdoor collection, with particular gains in degraded scenes where feature methods drop out. Releasing code and data helps anyone who wants to check the numbers themselves.

What works is the simplification: no tracking, no correspondence search, just direct geometry from the model fused with IMU for scale, velocity, and gravity. That matches the practical need for faster starts in robotics and AR.

The soft spots line up with the stress-test note. Because the pipeline is deliberately feature-free, depth errors or biases from the 3D model go straight into the joint optimization; the abstract gives aggregate success rates but does not isolate the model's contribution through ground-truth depth ablations or per-sequence error stats. Without those, it is difficult to separate how much the reported robustness comes from the model versus the short IMU segment. Scale recovery is claimed via IMU fusion, yet any consistent bias in the predicted clouds could still affect the solution.

The paper is aimed at VINS practitioners who need quicker, more reliable initialization on real hardware. Readers working on monocular systems or low-texture environments would find the timing and success-rate numbers useful if the full experiments hold up. I would send it to peer review because the core substitution is straightforward, the empirical claims are specific, and the released code makes verification feasible even if the validation needs tightening on error sources.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a feature-free initialization method for monocular visual-inertial navigation systems (VINS) that replaces visual feature tracking with up-to-scale point clouds predicted by a pre-trained feed-forward 3D model. The approach jointly estimates initial scale, velocity, and gravity direction from short IMU sequences fused with these point clouds. Experiments on public datasets report success rates exceeding 90% with typical initialization times under 1.2 s, and additional validation on a self-collected dataset shows robustness in visually degraded indoor and outdoor scenes.

Significance. If the performance claims hold under closer scrutiny, the method could meaningfully simplify VINS pipelines by eliminating feature correspondence requirements and shortening the data window needed for reliable initialization. The shift to an external feed-forward 3D model is a notable departure from conventional feature-based or optimization-heavy initialization strategies and may prove useful in real-time or resource-constrained settings.

major comments (2)

[Experiments] The central performance claims (success rate >90 %, initialization <1.2 s, robustness in degraded scenes) rest on the assumption that the feed-forward model's up-to-scale point clouds supply sufficient metric geometry for joint scale-velocity-gravity recovery. However, the experimental section provides only aggregate success rates without ablation studies that replace the predicted depths with ground-truth depths or report per-sequence depth-error statistics; this omission leaves the contribution of point-cloud accuracy unisolated and the headline claims only partially supported.
[Method] The manuscript does not detail how scale ambiguity is resolved when fusing the up-to-scale point clouds with inertial measurements, nor does it quantify error propagation from depth prediction noise into the optimization; given the deliberately feature-free design, any systematic bias in the 3D model directly affects the recovered metric quantities and should be analyzed explicitly.

minor comments (2)

[Abstract and Experiments] The abstract and experimental results mention quantitative comparisons but do not list the exact baseline methods, their reported success rates, or the precise success criteria used; adding a table with these values would improve clarity.
[Experiments] No error bars or statistical significance tests accompany the reported success rates and timing figures; including these would strengthen the presentation of the quantitative results.
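One standard way to attach an uncertainty band to a reported success rate is the Wilson score interval for a binomial proportion. The sketch below is offered as an illustration of what the referee asks for, not as part of the paper; the function name is hypothetical, and any counts passed to it (for example 92 successes in 100 attempts) would be placeholders rather than figures from the manuscript.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to an approximate 95 percent confidence level.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half
```

The Wilson form is preferred here over the plain normal approximation because it stays inside [0, 1] and behaves sensibly when the success rate is close to 100 percent, which is the regime the paper reports.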

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important opportunities to strengthen the experimental support and clarify the methodological details. We address each major comment below and will revise the manuscript accordingly to improve rigor and transparency.

Referee: [Experiments] The central performance claims (success rate >90 %, initialization <1.2 s, robustness in degraded scenes) rest on the assumption that the feed-forward model's up-to-scale point clouds supply sufficient metric geometry for joint scale-velocity-gravity recovery. However, the experimental section provides only aggregate success rates without ablation studies that replace the predicted depths with ground-truth depths or report per-sequence depth-error statistics; this omission leaves the contribution of point-cloud accuracy unisolated and the headline claims only partially supported.

Authors: We agree that isolating the contribution of the predicted point-cloud accuracy would strengthen the claims. In the revised manuscript we will add ablation experiments on the public datasets (where ground-truth depths are available) that directly compare initialization performance using the feed-forward predictions versus ground-truth depths. We will also report per-sequence depth-prediction error statistics together with their correlation to initialization success and failure cases.

Revision: yes

Referee: [Method] The manuscript does not detail how scale ambiguity is resolved when fusing the up-to-scale point clouds with inertial measurements, nor does it quantify error propagation from depth prediction noise into the optimization; given the deliberately feature-free design, any systematic bias in the 3D model directly affects the recovered metric quantities and should be analyzed explicitly.

Authors: We acknowledge that the current description of scale recovery and noise propagation is insufficiently detailed. The scale factor is recovered jointly with velocity and gravity direction inside a single least-squares optimization that aligns the up-to-scale point clouds with IMU-predicted motion over the short initialization window. In the revision we will expand Section 3 with the complete optimization objective, the explicit scale parameterization, and a dedicated sensitivity analysis (including both analytic propagation bounds and empirical results) that quantifies how depth-prediction noise affects the recovered metric quantities.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; method relies on external pre-trained model and empirical validation

The paper presents a practical initialization framework that feeds up-to-scale point clouds from an external feed-forward 3D model into an IMU-fusion optimizer for scale, velocity and gravity. All performance claims (>90 % success, <1.2 s duration, robustness in degraded scenes) are supported by direct experimental results on public benchmarks and a self-collected dataset rather than by any internal derivation, fitted parameter, or self-citation chain. No equation or algorithmic step reduces to a quantity defined by the same step; the 3D model itself is treated as a black-box input whose accuracy is an independent assumption, not a quantity derived inside the paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the accuracy of an external feed-forward 3D reconstruction network and on standard VINS assumptions about rigid-body motion and sensor calibration; no new free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Standard VINS assumptions of rigid body motion, known camera intrinsics, and inertial sensor bias models hold during the short initialization window.
These background assumptions are required for any monocular VINS initializer and are invoked implicitly when fusing point clouds with IMU data.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We propose a feature-free initialization framework that leverages up-to-scale point clouds predicted by a feed-forward 3D model... closed-form linear system $\bar{A}(\cdot)\,x = \bar{b}(\cdot)$ ... state vector contains only scale, initial velocity, and gravity: $x = [\, s \;\; {}^{I_0}v_{I_0}^{\top} \;\; {}^{I_0}g^{\top} \,]^{\top}$
IndisputableMonolith/Foundation/DimensionForcing.lean alexander_duality_circle_linking unclear

Experiments on public datasets demonstrate that the proposed feature-free initialization method achieves the highest success rate, exceeding 90%, and significantly reduces the data duration required for successful initialization, typically to under 1.2 s.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 6 internal anchors

[1]

A Multi-State Constraint Kalman Filter for Vision-aided Inertial Nav- igation,

A. I. Mourikis and S. I. Roumeliotis, “A Multi-State Constraint Kalman Filter for Vision-aided Inertial Nav- igation,” inIEEE International Conference on Robotics and Automation (ICRA), 2007, pp. 3565–3572

work page 2007
[2]

Keyframe-Based Visual-Inertial Odometry Using Nonlinear Optimization,

S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-Based Visual-Inertial Odometry Using Nonlinear Optimization,”The International Jour- nal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015

work page 2015
[3]

OpenVINS: A Research Platform for Visual-Inertial Es- timation,

P. Geneva, K. Eckenhoff, W. Lee, Y . Yang, and G. Huang, “OpenVINS: A Research Platform for Visual-Inertial Es- timation,” inIEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 4666–4672

work page 2020
[4]

LIC- Fusion: LiDAR-Inertial-Camera Odometry,

X. Zuo, P. Geneva, W. Lee, Y . Liu, and G. Huang, “LIC- Fusion: LiDAR-Inertial-Camera Odometry,” inIEEE/RSJ International Conference on Intelligent Robots and Sys- tems (IROS), 2019, pp. 5848–5854

work page 2019
[5]

Towards More Precise and Robust Position- ing in Urban Environments Through an Enhanced FGO- Based GNSS RTK Framework,

Y . Zhang, F. Zhu, Q. Cai, J. Lv, Z. Xu, X. Chen, and X. Zhang, “Towards More Precise and Robust Position- ing in Urban Environments Through an Enhanced FGO- Based GNSS RTK Framework,”IEEE Transactions on Intelligent Vehicles, vol. 9, pp. 7603–7616, 2024

work page 2024
[6]

Estimating Body and Hand Motion in an Ego-sensed World,

B. Yi, V . Ye, M. Zheng, Y . Li, L. M ¨uller, G. Pavlakos, Y . Ma, J. Malik, and A. Kanazawa, “Estimating Body and Hand Motion in an Ego-sensed World,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 7072–7084

work page 2025
[7]

Aria Gen 2 Pilot Dataset,

C. Kong, J. Fort, A. Kanget al., “Aria Gen 2 Pilot Dataset,” arXiv preprint arXiv:2510.16134, 2025

work page arXiv 2025
[8]

Khronos: A Unified Approach for Spatio-Temporal Metric-Semantic SLAM in Dynamic Environments,

L. Schmid, M. Abate, Y . Chang, and L. Carlone, “Khronos: A Unified Approach for Spatio-Temporal Metric-Semantic SLAM in Dynamic Environments,” in Robotics: Science and Systems (RSS), 2024

work page 2024
[9]

Language-EXtended Indoor SLAM (LEXIS): A Versa- tile System for Real-time Visual Scene Understanding,

C. Kassab, M. Mattamala, L. Zhang, and M. Fallon, “Language-EXtended Indoor SLAM (LEXIS): A Versa- tile System for Real-time Visual Scene Understanding,” inIEEE International Conference on Robotics and Au- tomation (ICRA), 2024, pp. 15 988–15 994

work page 2024
[10]

GNSS/Multisensor Fusion Using Continuous-Time Fac- tor Graph Optimization for Robust Localization,

H. Zhang, C.-C. Chen, H. Vallery, and T. D. Barfoot, “GNSS/Multisensor Fusion Using Continuous-Time Fac- tor Graph Optimization for Robust Localization,”IEEE Transactions on Robotics, vol. 40, pp. 4003–4023, 2024

work page 2024
[11]

PO-GVINS: A Tightly Coupled GNSS- Visual-Inertial Navigation Framework Using Pose-Only Representation,

Z. Xu, F. Zhu, Z. Zhang, C. Jian, J. Lv, Y . Zhang, and X. Zhang, “PO-GVINS: A Tightly Coupled GNSS- Visual-Inertial Navigation Framework Using Pose-Only Representation,”IEEE Robotics and Automation Letters, vol. 10, pp. 10 830–10 837, 2025

work page 2025
[12]

Estimator Initializa- tion in Vision-Aided Inertial Navigation with Unknown Camera-IMU Calibration,

T.-C. Dong-Si and A. I. Mourikis, “Estimator Initializa- tion in Vision-Aided Inertial Navigation with Unknown Camera-IMU Calibration,” inIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012, pp. 1064–1071

work page 2012
[13]

StructVIO: Visual-Inertial Odometry With Structural Regularity of Man-Made Environments,

D. Zou, Y . Wu, L. Pei, H. Ling, and W. Yu, “StructVIO: Visual-Inertial Odometry With Structural Regularity of Man-Made Environments,”IEEE Transactions on Robotics, vol. 35, pp. 999–1013, 2019

work page 2019
[14]

ORB-SLAM3: An Ac- curate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM,

C. Campos, R. Elvira, J. J. G. Rodriguez, J. M. M. Montiel, and J. D. Tardos, “ORB-SLAM3: An Ac- curate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM,”IEEE Transactions on Robotics, vol. 37, pp. 1874–1890, 2021

work page 2021
[15]

DUSt3R: Geometric 3D Vision Made Easy,

S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Re- vaud, “DUSt3R: Geometric 3D Vision Made Easy,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 20 697–20 709

work page 2024
[16]

VGGT: Visual Geometry Grounded Transformer,

J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “VGGT: Visual Geometry Grounded Transformer,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 5294– 5306

work page 2025
[17]

Continuous 3D Perception Model with Persistent State,

Q. Wang, Y . Zhang, A. Holynski, A. A. Efros, and A. Kanazawa, “Continuous 3D Perception Model with Persistent State,” inIEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2025, pp. 10 510–10 522

work page 2025
[18]

VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator,

T. Qin, P. Li, and S. Shen, “VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator,” IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004– 1020, 2018

work page 2018
[19]

Visual-Inertial Monoc- ular SLAM With Map Reuse,

R. Mur-Artal and J. D. Tard ´os, “Visual-Inertial Monoc- ular SLAM With Map Reuse,”IEEE Robotics and Au- tomation Letters, vol. 2, pp. 796–803, 2017

work page 2017
[20]

Closed-Form Solution of Visual-Inertial Structure from Motion,

A. Martinelli, “Closed-Form Solution of Visual-Inertial Structure from Motion,”International Journal of Com- puter Vision, vol. 106, no. 2, pp. 138–152, 2014

work page 2014
[21]

Simultaneous State Initialization and Gyroscope Bias Calibration in Visual Inertial Aided Navigation,

J. Kaiser, A. Martinelli, F. Fontana, and D. Scaramuzza, “Simultaneous State Initialization and Gyroscope Bias Calibration in Visual Inertial Aided Navigation,”IEEE Robotics and Automation Letters, vol. 2, pp. 18–25, 2017

work page 2017
[22]

Fast and Robust Initialization for Visual-Inertial SLAM,

C. Campos, J. M. Montiel, and J. D. Tard ´os, “Fast and Robust Initialization for Visual-Inertial SLAM,” inIEEE International Conference on Robotics and Automation (ICRA), 2019, pp. 1288–1294

work page 2019
[23]

A Rotation- Translation-Decoupled Solution for Robust and Efficient Visual-Inertial Initialization,

Y . He, B. Xu, Z. Ouyang, and H. Li, “A Rotation- Translation-Decoupled Solution for Robust and Efficient Visual-Inertial Initialization,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 739–748

work page 2023
[24]

sqrtvins: Ro- bust and Ultrafast Square-Root Filter-Based 3D Motion Tracking,

Y . Peng, C. Chen, K. Wu, and G. Huang, “sqrtvins: Ro- bust and Ultrafast Square-Root Filter-Based 3D Motion Tracking,”IEEE Transactions on Robotics, vol. 41, pp. 6570–6589, 2025

work page 2025
[25]

Structure-from- Motion Revisited,

J. L. Sch ¨onberger and J.-M. Frahm, “Structure-from- Motion Revisited,” inIEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2016, pp. 4104–4113

work page 2016
[26]

Grounding Image Matching in 3D with MASt3R,

V . Leroy, Y . Cabon, and J. Revaud, “Grounding Image Matching in 3D with MASt3R,” inEuropean Conference on Computer Vision (ECCV), 2024, pp. 71–91

work page 2024
[27]

TTT3R: 3D Reconstruction as Test-Time Training

X. Chen, Y . Chen, Y . Xiu, A. Geiger, and A. Chen, “TTT3R: 3D Reconstruction as Test-Time Training,” arXiv preprint arXiv:2509.26645, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

3D Reconstruction with Spatial Memory

H. Wang and L. Agapito, “3D Reconstruction with Spa- tial Memory,” arXiv preprint arXiv:2408.16061, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

MASt3R- SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors,

R. Murai, E. Dexheimer, and A. J. Davison, “MASt3R- SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 16 695– 16 705

work page 2025
[30]

VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

D. Maggio, H. Lim, and L. Carlone, “VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold,” arXiv preprint arXiv:2505.12549, 2025

work page internal anchor Pith review arXiv 2025
[31]

Metric3D V2: A Versatile Monocular Geometric Foundation Model for Zero-Shot Metric Depth and Surface Normal Es- timation,

M. Hu, W. Yin, C. Zhang, Z. Cai, X. Long, H. Chen, K. Wang, G. Yu, C. Shen, and S. Shen, “Metric3D V2: A Versatile Monocular Geometric Foundation Model for Zero-Shot Metric Depth and Surface Normal Es- timation,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10 579–10 596, 2024

work page 2024
[32]

MapAnything: Universal Feed-Forward Metric 3D Reconstruction

N. Keetha, N. M ¨uller, J. Sch ¨onberger, L. Porzi, Y . Zhang, T. Fischer, A. Knapitsch, D. Zauss, E. We- ber, N. Antunes, J. Luiten, M. Lopez-Antequera, S. R. Bul `o, C. Richardt, D. Ramanan, S. Scherer, and P. Kontschieder, “MapAnything: Universal Feed- Forward Metric 3D Reconstruction,” arXiv preprint arXiv:2509.13414, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Amb3r: Accurate feed-forward metric-scale 3d reconstruc- tion with backend.arXiv preprint arXiv:2511.20343, 2025

H. Wang and L. Agapito, “AMB3R: Accurate Feed- Forward Metric-Scale 3D Reconstruction with Backend,” arXiv preprint arXiv:2511.20343, 2025

work page arXiv 2025
[34]

DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras,

Z. Teed and J. Deng, “DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras,” inAdvances in Neural Information Processing Systems (NeurIPS), 2021, pp. 16 558–16 569

work page 2021
[35]

Imperative Learning: A Self-Supervised Neuro-Symbolic Learning Framework for Robot Autonomy,

C. Wang, K. Ji, J. Geng, Z. Ren, T. Fu, F. Yang, Y . Guo, H. He, X. Chen, Z. Zhan, Q. Du, S. Su, B. Li, Y . Qiu, Y . Du, Q. Li, Y . Yang, X. Lin, and Z. Zhao, “Imperative Learning: A Self-Supervised Neuro-Symbolic Learning Framework for Robot Autonomy,”The International Journal of Robotics Research, p. 02783649251353181, 2025

work page 2025
[36]

SLAM- Former: Putting SLAM into One Transformer,

Y . Yuan, Z. Chen, K. Li, W. Wang, and H. Zhao, “SLAM- Former: Putting SLAM into One Transformer,” arXiv preprint arXiv:2509.16909, 2025

work page arXiv 2025
[37]

CodeVIO: Visual-Inertial Odometry with Learned Optimizable Dense Depth,

X. Zuo, N. Merrill, W. Li, Y . Liu, M. Pollefeys, and G. Huang, “CodeVIO: Visual-Inertial Odometry with Learned Optimizable Dense Depth,” inIEEE Interna- tional Conference on Robotics and Automation (ICRA), 2021, pp. 14 382–14 388

work page 2021
[38]

Visual-Inertial SLAM as Sim- ple as A, B, VINS,

N. Merrill and G. Huang, “Visual-Inertial SLAM as Sim- ple as A, B, VINS,” arXiv preprint arXiv:2406.05969, 2024

work page arXiv 2024
[39]

LiDAR-VGGT: Cross-Modal Coarse-to-Fine Fusion for Globally Consistent and Metric-Scale Dense Mapping,

L. Wang, L. Guo, Z. Xu, Q. Wang, F. Gao, and X. Chen, “LiDAR-VGGT: Cross-Modal Coarse-to-Fine Fusion for Globally Consistent and Metric-Scale Dense Mapping,” arXiv preprint arXiv:2511.01186, 2025

work page arXiv 2025
[40]

Learned Monocular Depth Priors in Visual-Inertial Initialization,

Y . Zhou, A. Kar, E. Turner, A. Kowdle, C. X. Guo, R. C. DuToit, and K. Tsotsos, “Learned Monocular Depth Priors in Visual-Inertial Initialization,” inEuropean Con- ference on Computer Vision (ECCV), 2022, pp. 552–570

work page 2022
[41]

Fast Monocular Visual-Inertial Initialization Leveraging Learned Single-View Depth,

N. Merrill, P. Geneva, S. Katragadda, C. Chen, and G. Huang, “Fast Monocular Visual-Inertial Initialization Leveraging Learned Single-View Depth,” inRobotics: Science and Systems (RSS), 2023

work page 2023
[42]

Strapdown Inertial Navigation Integration Algorithm Design Part 1: Attitude Algorithms,

P. G. Savage, “Strapdown Inertial Navigation Integration Algorithm Design Part 1: Attitude Algorithms,”Journal of Guidance, Control, and Dynamics, vol. 21, pp. 19–28, 1998

work page 1998
[43]

Visual-Inertial-Aided Nav- igation for High-Dynamic Motion in Built Environ- ments Without Initial Conditions,

T. Lupton and S. Sukkarieh, “Visual-Inertial-Aided Nav- igation for High-Dynamic Motion in Built Environ- ments Without Initial Conditions,”IEEE Transactions on Robotics, vol. 28, no. 1, pp. 61–76, 2012

work page 2012
[44]

IMU Preintegration on Manifold for Efficient Visual- Inertial Maximum-a-Posteriori Estimation,

C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “IMU Preintegration on Manifold for Efficient Visual- Inertial Maximum-a-Posteriori Estimation,” inRobotics: Science and Systems (RSS), 2015

work page 2015
[45]

Consistency Analysis and Improvement of Vision-aided Inertial Navigation,

J. A. Hesch, D. G. Kottas, S. L. Bowman, and S. I. Roumeliotis, “Consistency Analysis and Improvement of Vision-aided Inertial Navigation,”IEEE Transactions on Robotics, vol. 30, pp. 158–176, 2014

work page 2014
[46]

Inverse Depth Parametrization for Monocular SLAM,

J. Civera, A. J. Davison, and J. M. M. Montiel, “Inverse Depth Parametrization for Monocular SLAM,”IEEE Transactions on Robotics, vol. 24, pp. 932–945, 2008

work page 2008
[47]

Learn- ing Single Camera Depth Estimation Using Dual-Pixels,

R. Garg, N. Wadhwa, S. Ansari, and J. T. Barron, “Learn- ing Single Camera Depth Estimation Using Dual-Pixels,” inIEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 7628–7637

work page 2019
[48]

The Eu- RoC Micro Aerial Vehicle Datasets,

M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The Eu- RoC Micro Aerial Vehicle Datasets,”The International Journal of Robotics Research, vol. 35, pp. 1157–1163, 2016

work page 2016
[49]

The TUM VI Benchmark for Evaluat- ing Visual-Inertial Odometry,

D. Schubert, T. Goll, N. Demmel, V . Usenko, J. St¨uckler, and D. Cremers, “The TUM VI Benchmark for Evaluat- ing Visual-Inertial Odometry,” inIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1680–1687

work page 2018
[50]

Z. Zhang and D. Scaramuzza, “A Tutorial on Quantitative Trajectory Evaluation for Visual(-Inertial) Odometry,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 7244–7251
[51]

H. Lin, S. Chen, J. Liew, D. Y. Chen, Z. Li, G. Shi, J. Feng, and B. Kang, “Depth Anything 3: Recovering the Visual Space from Any Views,” arXiv preprint arXiv:2511.10647, 2025
[52]

Y. Wang, J. Zhou, H. Zhu, W. Chang, Y. Zhou, Z. Li, J. Chen, J. Pang, C. Shen, and T. He, “$\pi^3$: Scalable Permutation-Equivariant Visual Geometry Learning,” arXiv preprint arXiv:2507.13347, 2025

SUPPLEMENTARY MATERIAL

VII. METHOD DETAILS

In this section, we provide additional algorithmic details of the proposed feature-free method.

A. Rank Analysis of...
