DeepVIO: Self-supervised Deep Learning of Monocular Visual Inertial Odometry using 3D Geometric Constraints
Pith reviewed 2026-05-25 15:04 UTC · model grok-4.3
The pith
DeepVIO trains a monocular visual-inertial odometry network in a self-supervised manner, using 3D optical flow and 6-DoF poses computed from stereo sequences as constraints on the 2D flow network and on the IMU preintegration and fusion networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeepVIO directly merges 2D optical flow features and IMU data for monocular trajectory estimation. The 2D flow network is trained against the projection of 3D optical flow obtained from stereo, the IMU preintegration and fusion networks minimize losses derived from stereo-computed 6-DoF ego-motion, and an IMU status update scheme refines the bias estimates. The resulting trajectories are reported to outperform other learned methods and to tolerate imperfect camera-IMU calibration better than traditional approaches.
What carries the argument
3D geometric constraints (3D optical flow and 6-DoF poses computed from stereo sequences) that supply supervisory signals by projection onto the monocular 2D flow network and ego-motion losses for the IMU preintegration and fusion networks.
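To make the projection step concrete, here is a minimal sketch of how a 3D flow vector induces a 2D flow target under a pinhole camera model. The function names, the L1 loss form, and the intrinsics are illustrative assumptions; the abstract does not specify the paper's actual loss formulation.

```python
import numpy as np

def project(points, fx, fy, cx, cy):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    x = fx * points[:, 0] / points[:, 2] + cx
    y = fy * points[:, 1] / points[:, 2] + cy
    return np.stack([x, y], axis=1)

def flow_2d_from_3d(points, flow_3d, fx, fy, cx, cy):
    """2D flow induced by moving each 3D point by its 3D flow vector."""
    moved = project(points + flow_3d, fx, fy, cx, cy)
    return moved - project(points, fx, fy, cx, cy)

def flow_projection_loss(pred_flow_2d, points, flow_3d, fx, fy, cx, cy):
    """Mean L1 gap between a predicted 2D flow and the projected 3D flow.

    Hypothetical loss form, chosen only for illustration.
    """
    target = flow_2d_from_3d(points, flow_3d, fx, fy, cx, cy)
    return float(np.mean(np.abs(pred_flow_2d - target)))

# Toy example with made-up intrinsics and two stereo-derived points.
points = np.array([[0.0, 0.0, 10.0], [1.0, -0.5, 5.0]])
flow_3d = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, -1.0]])
target = flow_2d_from_3d(points, flow_3d, fx=700.0, fy=700.0, cx=600.0, cy=180.0)
# target is approximately [[70, 0], [35, -17.5]] pixels
```

The second point shows why depth matters: a purely forward motion of 1 m at 5 m range still produces a sizeable image-plane flow, which a 2D-only target could not attribute to depth change.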
If this is right
- The 2D optical flow network learns under direct projection of the corresponding 3D optical flow.
- The LSTM-style IMU preintegration network and fusion network are optimized by minimizing losses from stereo-derived ego-motion constraints.
- An IMU bias update scheme improves pose estimates by tracking additional gyroscope and accelerometer bias terms.
- The trained system reports higher accuracy and data adaptability than prior learning-based methods on KITTI and EuRoC.
- The approach reduces sensitivity to inaccurate camera-IMU calibration, timing offsets, and missing measurements relative to traditional VIO pipelines.
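The bias bullet above can be illustrated with a classical dead-reckoning sketch. This is not the paper's LSTM-style preintegration network or its status update scheme; it is a plain strapdown integrator, with made-up bias values and an assumed z-up gravity convention, that shows how uncorrected gyroscope and accelerometer biases turn into position drift.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])  # world-frame gravity, z up (assumed convention)

def so3_exp(phi):
    """Rodrigues' formula: rotation vector (3,) -> rotation matrix (3, 3)."""
    angle = np.linalg.norm(phi)
    if angle < 1e-12:
        return np.eye(3)
    k = phi / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def integrate_imu(gyro, accel, dt, gyro_bias, accel_bias):
    """Dead-reckon rotation, velocity and position from bias-corrected samples."""
    R, v, p = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a in zip(gyro, accel):
        a_world = R @ (a - accel_bias) + GRAVITY
        p = p + v * dt + 0.5 * a_world * dt ** 2
        v = v + a_world * dt
        R = R @ so3_exp((w - gyro_bias) * dt)
    return R, v, p

# Stationary sensor whose readings carry constant biases (made-up values).
n, dt = 100, 0.01  # 1 second of data at 100 Hz
true_gyro_bias = np.array([0.0, 0.0, 0.01])
true_accel_bias = np.array([0.1, 0.0, 0.0])
gyro = np.tile(true_gyro_bias, (n, 1))
accel = np.tile(np.array([0.0, 0.0, 9.81]) + true_accel_bias, (n, 1))

_, _, p_corrected = integrate_imu(gyro, accel, dt, true_gyro_bias, true_accel_bias)
_, _, p_uncorrected = integrate_imu(gyro, accel, dt, np.zeros(3), np.zeros(3))
```

With the biases supplied, the stationary sensor stays at the origin. With them ignored, a 0.1 m/s^2 accelerometer bias alone yields roughly 5 cm of drift after one second, and the error grows quadratically with time.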
Where Pith is reading between the lines
- The same stereo-to-monocular transfer could be applied to train networks for other sensor pairs where one modality is easier to calibrate than the other.
- Robustness to missing data suggests the architecture might continue functioning when individual frames or IMU packets are dropped in real deployments.
- Testing the method on sequences with dynamic objects or low-texture regions would check whether the claimed advantage of 3D flow over 2D flow holds in those regimes.
Load-bearing premise
Stereo-derived 3D optical flow and poses act as accurate, unbiased targets that transfer to monocular 2D flow and IMU networks without adding systematic error.
What would settle it
Measure absolute trajectory error of the trained monocular model against both other learned VIO networks and classical VIO pipelines on the same sequences after deliberately adding known calibration offsets or frame drops; lower error for DeepVIO under these perturbations would support the robustness claim.
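A minimal sketch of the measurement proposed here, assuming absolute trajectory error is computed as translational RMSE after a rigid (rotation plus translation) alignment of the estimated positions to the reference. The trajectory and drift values are synthetic, and the choice of rigid alignment without scale is an assumption; a benchmark may prescribe a different protocol.

```python
import numpy as np

def align_rigid(est, gt):
    """Least-squares rotation R and translation t mapping est (N, 3) onto gt (N, 3)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_g - R @ mu_e
    return R, t

def ate_rmse(est, gt):
    """Translational RMSE between aligned estimate and reference positions."""
    R, t = align_rigid(est, gt)
    aligned = est @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))

# Synthetic reference trajectory.
ts = np.linspace(0.0, 10.0, 50)
gt = np.stack([ts, np.sin(ts), 0.1 * ts ** 2], axis=1)

# Estimate expressed in a different frame: zero ATE after alignment.
c, s = np.cos(0.3), np.sin(0.3)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
est_clean = gt @ Rz.T + np.array([1.0, -2.0, 0.5])

# Same estimate with a slowly growing drift, standing in for a perturbed run.
est_drift = est_clean + np.outer(ts, [0.02, 0.0, 0.0])

ate_clean = ate_rmse(est_clean, gt)
ate_drift = ate_rmse(est_drift, gt)
```

Running the same function on each method's output, once on unperturbed input and once with a known calibration offset or dropped frames injected, gives the pairwise comparison the test calls for.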
Original abstract
This paper presents a self-supervised deep learning network for monocular visual inertial odometry (named DeepVIO). DeepVIO provides absolute trajectory estimation by directly merging 2D optical flow features (OFF) and Inertial Measurement Unit (IMU) data. Specifically, it first estimates the depth and dense 3D point cloud of each scene by using stereo sequences, and then obtains 3D geometric constraints including 3D optical flow and 6-DoF pose as supervisory signals. Note that such 3D optical flow is robust and accurate in the presence of dynamic objects and textureless environments. In DeepVIO training, the 2D optical flow network is constrained by the projection of its corresponding 3D optical flow, and the LSTM-style IMU preintegration network and the fusion network are learned by minimizing the loss functions from ego-motion constraints. Furthermore, we employ an IMU status update scheme to improve IMU pose estimation through updating the additional gyroscope and accelerometer bias. The experimental results on KITTI and EuRoC datasets show that DeepVIO outperforms state-of-the-art learning-based methods in terms of accuracy and data adaptability. Compared to the traditional methods, DeepVIO reduces the impacts of inaccurate Camera-IMU calibrations and of unsynchronized and missing data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents DeepVIO, a self-supervised deep learning method for monocular visual-inertial odometry. It first computes depth and dense 3D point clouds from stereo sequences to derive 3D optical flow and 6-DoF poses as supervisory signals. The 2D optical flow network is trained by minimizing projection error to the 3D flow; LSTM-style IMU preintegration and fusion networks are trained on the 6-DoF poses, with an additional IMU status update for bias correction. Experiments on KITTI and EuRoC are claimed to show superior accuracy and data adaptability versus state-of-the-art learning-based methods, plus reduced sensitivity to calibration errors, unsynchronized data, and missing measurements relative to traditional VIO.
Significance. If the headline accuracy and robustness claims hold after proper validation, the work would demonstrate a practical route to training monocular VIO networks without direct pose ground truth by transferring 3D geometric constraints computed from stereo. The approach could improve handling of dynamic objects and textureless regions via the 3D flow supervision and offer better tolerance to IMU calibration and timing issues.
major comments (2)
- [Abstract] Abstract: the self-supervised framing is load-bearing for the contribution, yet supervision is generated from independent stereo sequences (depth, 3D flow, 6-DoF poses) rather than monocular input alone. This external grounding must be shown to transfer without systematic bias (baseline calibration, disparity quantization, dynamic-object handling) that would be absent at monocular test time; no error statistics on the 3D flow targets or ablation with deliberately degraded stereo calibration are referenced.
- [Abstract] Abstract: the central performance claims (outperformance on KITTI/EuRoC, reduced impact of calibration/unsync/missing data) rest on experimental results that are not quantified here. Without reported metrics, loss formulations, network diagrams, or ablations, it is impossible to assess whether the 3D-flow projection loss or IMU bias update actually drives the stated gains.
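The disparity-quantization concern in the first major comment can be made concrete with the standard rectified-stereo relation Z = f * B / d. The sketch below uses illustrative values (focal length 720 px, baseline 0.54 m, half-pixel disparity error) that are assumptions, not figures from the paper, and shows how a fixed disparity error becomes a depth error that grows with the square of the depth.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres from disparity in pixels for a rectified stereo pair."""
    return focal_px * baseline_m / disparity_px

def depth_error_first_order(depth_m, focal_px, baseline_m, disparity_err_px):
    """First-order depth error: |dZ| ~= Z^2 / (f * B) * |dd|."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_err_px

# Illustrative numbers only (assumed, not taken from the paper).
focal_px, baseline_m = 720.0, 0.54
z = depth_from_disparity(10.0, focal_px, baseline_m)          # about 38.88 m
err = depth_error_first_order(z, focal_px, baseline_m, 0.5)   # about 1.94 m
err_far = depth_error_first_order(2.0 * z, focal_px, baseline_m, 0.5)
```

Doubling the range quadruples the depth uncertainty, so any 3D flow target built from distant points inherits a noticeably larger error than one built from nearby points. Error statistics on the 3D flow targets, as requested in the comment, would show how much of this reaches the training signal.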
minor comments (1)
- [Abstract] Abstract: the phrasing 'self-supervised deep learning network for monocular visual inertial odometry' should be clarified to distinguish the training supervision source from the monocular inference setting.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. We address each major comment point by point below.
- Referee: [Abstract] Abstract: the self-supervised framing is load-bearing for the contribution, yet supervision is generated from independent stereo sequences (depth, 3D flow, 6-DoF poses) rather than monocular input alone. This external grounding must be shown to transfer without systematic bias (baseline calibration, disparity quantization, dynamic-object handling) that would be absent at monocular test time; no error statistics on the 3D flow targets or ablation with deliberately degraded stereo calibration are referenced.
  Authors: We agree the abstract should more precisely describe the training setup. Stereo sequences are used only to compute the 3D geometric constraints that serve as training signals; the resulting network is strictly monocular plus IMU at test time. We will revise the abstract to state this distinction explicitly. To address the request for validation of transfer, we will add error statistics on the 3D flow targets and an ablation study using deliberately degraded stereo calibration in the revised manuscript.
  Revision: yes
- Referee: [Abstract] Abstract: the central performance claims (outperformance on KITTI/EuRoC, reduced impact of calibration/unsync/missing data) rest on experimental results that are not quantified here. Without reported metrics, loss formulations, network diagrams, or ablations, it is impossible to assess whether the 3D-flow projection loss or IMU bias update actually drives the stated gains.
  Authors: The abstract is a high-level summary. We will expand it to include the principal quantitative metrics (e.g., translation and rotation errors on KITTI and EuRoC) that support the performance claims. The full manuscript already contains the loss formulations, network diagrams, and ablation studies; these sections will be referenced more explicitly from the revised abstract.
  Revision: yes
Circularity Check
No significant circularity: stereo-derived supervision supplies independent external targets
The paper generates 3D optical flow and 6-DoF poses from stereo sequences as supervisory signals for training the monocular 2D flow, IMU preintegration, and fusion networks. These targets are computed separately from the monocular inputs and are not defined in terms of the network outputs or fitted parameters that are then renamed as predictions. The training minimizes projection errors and ego-motion losses against these external geometric constraints, providing grounding outside the monocular inference pipeline. No self-citation chains, self-definitional loops, or ansatzes imported via prior work are present in the derivation. The approach remains self-contained against the stereo benchmarks used for supervision.