arxiv: 1906.11435 · v2 · pith:ULAQQTGJnew · submitted 2019-06-27 · 💻 cs.RO · cs.CV

DeepVIO: Self-supervised Deep Learning of Monocular Visual Inertial Odometry using 3D Geometric Constraints

Pith reviewed 2026-05-25 15:04 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords visual inertial odometry · self-supervised learning · monocular · optical flow · IMU preintegration · stereo supervision · ego-motion estimation

The pith

DeepVIO trains a monocular visual-inertial odometry network self-supervised by projecting 3D optical flow and poses from stereo sequences into constraints on 2D flow and IMU fusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeepVIO to estimate absolute trajectories from monocular images plus IMU readings without labeled supervision. It first computes depth, 3D point clouds, 3D optical flow, and 6-DoF poses from stereo image pairs, then uses the projection of the 3D flow to constrain a 2D optical flow network while an LSTM-style IMU preintegration network and fusion network are trained by minimizing ego-motion losses. An additional update step corrects gyroscope and accelerometer bias during inference. The resulting system reports higher accuracy than prior learning-based VIO methods on KITTI and EuRoC while showing reduced sensitivity to calibration, synchronization, and missing-data problems compared with classical pipelines.
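The first stage described above (depth and a dense 3D point cloud from stereo) can be illustrated with the standard rectified-stereo relation Z = f·B/d followed by pinhole back-projection. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `disparity_to_points`, the intrinsics `fx, fy, cx, cy`, and the `baseline` parameter are hypothetical, and the input is assumed to be a disparity map in pixels from an already rectified pair.

```python
import numpy as np

def disparity_to_points(disparity, fx, fy, cx, cy, baseline):
    """Back-project a disparity map (pixels) to 3D points in the left-camera frame.

    Assumes a rectified stereo pair and a pinhole camera (illustrative only).
    Returns an (H, W, 3) array of points and an (H, W) validity mask.
    """
    h, w = disparity.shape
    # Pixel coordinate grids: u runs along columns, v along rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = disparity > 0
    z = np.full(disparity.shape, np.nan)
    # Depth from disparity: Z = f * B / d, only where disparity is positive.
    z[valid] = fx * baseline / disparity[valid]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1), valid

# Usage with made-up numbers: a 2x2 disparity map with one invalid pixel.
disp = np.array([[10.0, 10.0], [10.0, 0.0]])
points, mask = disparity_to_points(disp, fx=100.0, fy=100.0, cx=0.5, cy=0.5, baseline=0.5)
```

Invalid pixels are kept as NaN rather than dropped so that the point cloud stays aligned with the image grid, which makes per-pixel comparison with a flow field straightforward.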

Core claim

DeepVIO directly merges 2D optical flow features and IMU data for monocular trajectory estimation; the 2D flow network is trained under the projected 3D optical flow obtained from stereo, the IMU preintegration and fusion networks minimize losses derived from stereo-computed 6-DoF ego-motion, and an IMU status update scheme refines bias estimates, yielding trajectories that outperform other learned methods and tolerate imperfect camera-IMU calibration better than traditional approaches.

What carries the argument

3D geometric constraints (3D optical flow and 6-DoF poses computed from stereo sequences) that supply supervisory signals by projection onto the monocular 2D flow network and ego-motion losses for the IMU preintegration and fusion networks.
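One way to read "supervisory signals by projection" is that each 3D point and its 3D displacement are projected into the image, and the pixel difference becomes the target for the 2D flow network. The sketch below assumes a pinhole camera and an average endpoint-error loss; the summary does not give the paper's exact loss formulation, so `projected_flow` and `endpoint_loss` are hypothetical names for illustration only.

```python
import numpy as np

def project(points, fx, fy, cx, cy):
    """Pinhole projection of (..., 3) camera-frame points to (..., 2) pixel coordinates."""
    x, y, z = points[..., 0], points[..., 1], points[..., 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=-1)

def projected_flow(points, flow3d, fx, fy, cx, cy):
    """2D flow induced by a 3D flow field: pi(P + dP) - pi(P)."""
    return project(points + flow3d, fx, fy, cx, cy) - project(points, fx, fy, cx, cy)

def endpoint_loss(pred_flow2d, target_flow2d, valid):
    """Mean Euclidean endpoint error over valid pixels (an assumed loss, not the paper's)."""
    err = np.linalg.norm(pred_flow2d - target_flow2d, axis=-1)
    return float(err[valid].mean())

# Usage with made-up numbers: one point 5 m ahead moving 0.5 m to the right.
P = np.array([[[0.0, 0.0, 5.0]]])
dP = np.array([[[0.5, 0.0, 0.0]]])
target = projected_flow(P, dP, fx=100.0, fy=100.0, cx=0.0, cy=0.0)
loss = endpoint_loss(np.array([[[7.0, 4.0]]]), target, np.array([[True]]))
```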

If this is right

  • The 2D optical flow network learns under direct projection of the corresponding 3D optical flow.
  • The LSTM-style IMU preintegration network and fusion network are optimized by minimizing losses from stereo-derived ego-motion constraints.
  • An IMU bias update scheme improves pose estimates by tracking additional gyroscope and accelerometer bias terms.
  • The trained system reports higher accuracy and data adaptability than prior learning-based methods on KITTI and EuRoC.
  • The approach reduces sensitivity to inaccurate camera-IMU calibration, timing offsets, and missing measurements relative to traditional VIO pipelines.
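To make the role of the gyroscope and accelerometer bias terms concrete, the following sketch integrates raw IMU samples between two camera frames after subtracting fixed biases. It is the classical discrete preintegration recursion, swapped in here purely for illustration; DeepVIO replaces this step with a learned LSTM-style network, and the names `so3_exp` and `integrate_imu` are hypothetical. Gravity compensation and noise propagation are deliberately left out.

```python
import numpy as np

def so3_exp(phi):
    """Rodrigues formula: rotation vector (3,) to rotation matrix (3, 3)."""
    angle = np.linalg.norm(phi)
    if angle < 1e-12:
        return np.eye(3)
    k = phi / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def integrate_imu(gyro, accel, dt, gyro_bias, accel_bias):
    """Relative rotation, velocity change and position change over a sample window.

    gyro, accel: (N, 3) arrays of raw measurements; dt: sample period in seconds.
    Results are expressed in the body frame at the start of the window.
    """
    R = np.eye(3)
    v = np.zeros(3)
    p = np.zeros(3)
    for w, a in zip(gyro, accel):
        a_corr = R @ (a - accel_bias)           # bias-corrected acceleration
        p = p + v * dt + 0.5 * a_corr * dt**2   # position uses the old velocity
        v = v + a_corr * dt
        R = R @ so3_exp((w - gyro_bias) * dt)   # bias-corrected rotation update
    return R, v, p
```

Because every sample is corrected before integration, an error in the assumed bias accumulates over the whole window, which is why a scheme that keeps updating the bias estimates can improve the resulting pose.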

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same stereo-to-monocular transfer could be applied to train networks for other sensor pairs where one modality is easier to calibrate than the other.
  • Robustness to missing data suggests the architecture might continue functioning when individual frames or IMU packets are dropped in real deployments.
  • Testing the method on sequences with dynamic objects or low-texture regions would check whether the claimed advantage of 3D flow over 2D flow holds in those regimes.

Load-bearing premise

Stereo-derived 3D optical flow and poses act as accurate, unbiased targets that transfer to monocular 2D flow and IMU networks without adding systematic error.

What would settle it

Measure absolute trajectory error of the trained monocular model against both other learned VIO networks and classical VIO pipelines on the same sequences after deliberately adding known calibration offsets or frame drops; lower error for DeepVIO under these perturbations would support the robustness claim.
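The metric in this test can be sketched as a root-mean-square absolute trajectory error computed after a rigid alignment of the estimated positions to the ground truth. The code below is an illustrative version using the Kabsch/Horn closed-form alignment without scale; the function names are hypothetical, and it assumes the two trajectories are already time-associated, with one 3D position per row.

```python
import numpy as np

def align_rigid(est, gt):
    """Least-squares rotation R and translation t mapping est (N, 3) onto gt (N, 3)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that R is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_g - R @ mu_e
    return R, t

def ate_rmse(est, gt):
    """RMSE of position error after rigid alignment (no scale correction)."""
    R, t = align_rigid(est, gt)
    aligned = est @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

Running `ate_rmse` on the same sequences before and after injecting a known calibration offset or dropping frames gives the paired numbers this test calls for; the perturbation itself is applied to the inputs of each method, not to the metric.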

Figures

Figures reproduced from arXiv: 1906.11435 by Guoguang Du, Liming Han, Shiguo Lian, Yimin Lin.

Figure 1: The pipeline of the DeepVIO. Novel 3D optical flow and stereo [...]
Figure 2: Illustration of our proposed framework in training and inferring phase. DeepVIO consists of CNN-Flow, LSTM-IMU and FC-Fusion, which is [...]
Figure 3: Illustration of disparity map and point cloud obtained from stereo [...]
Figure 4: Illustration of 3D optical flow from (a) left and (b) right [...]
Figure 5: Comparison of training losses using our models on KITTI.
Figure 7: Comparison of synthetic 2D optical flow and normal 2D optical [...]
Figure 8: Translation and orientation errors on the KITTI 10.
Original abstract

This paper presents a self-supervised deep learning network for monocular visual inertial odometry (named DeepVIO). DeepVIO provides absolute trajectory estimation by directly merging 2D optical flow features (OFF) and Inertial Measurement Unit (IMU) data. Specifically, it first estimates the depth and dense 3D point cloud of each scene by using stereo sequences, and then obtains 3D geometric constraints including 3D optical flow and 6-DoF pose as supervisory signals. Note that such 3D optical flow shows robustness and accuracy to dynamic objects and textureless environments. In DeepVIO training, the 2D optical flow network is constrained by the projection of its corresponding 3D optical flow, and the LSTM-style IMU preintegration network and the fusion network are learned by minimizing the loss functions from ego-motion constraints. Furthermore, we employ an IMU status update scheme to improve IMU pose estimation through updating the additional gyroscope and accelerometer bias. The experimental results on KITTI and EuRoC datasets show that DeepVIO outperforms state-of-the-art learning-based methods in terms of accuracy and data adaptability. Compared to the traditional methods, DeepVIO reduces the impacts of inaccurate Camera-IMU calibrations, unsynchronized data, and missing data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents DeepVIO, a self-supervised deep learning method for monocular visual-inertial odometry. It first computes depth and dense 3D point clouds from stereo sequences to derive 3D optical flow and 6-DoF poses as supervisory signals. The 2D optical flow network is trained by minimizing projection error to the 3D flow; LSTM-style IMU preintegration and fusion networks are trained on the 6-DoF poses, with an additional IMU status update for bias correction. Experiments on KITTI and EuRoC are claimed to show superior accuracy and data adaptability versus state-of-the-art learning-based methods, plus reduced sensitivity to calibration errors, unsynchronized data, and missing measurements relative to traditional VIO.

Significance. If the headline accuracy and robustness claims hold after proper validation, the work would demonstrate a practical route to training monocular VIO networks without direct pose ground truth by transferring 3D geometric constraints computed from stereo. The approach could improve handling of dynamic objects and textureless regions via the 3D flow supervision and offer better tolerance to IMU calibration and timing issues.

major comments (2)
  1. [Abstract] Abstract: the self-supervised framing is load-bearing for the contribution, yet supervision is generated from independent stereo sequences (depth, 3D flow, 6-DoF poses) rather than monocular input alone. This external grounding must be shown to transfer without systematic bias (baseline calibration, disparity quantization, dynamic-object handling) that would be absent at monocular test time; no error statistics on the 3D flow targets or ablation with deliberately degraded stereo calibration are referenced.
  2. [Abstract] Abstract: the central performance claims (outperformance on KITTI/EuRoC, reduced impact of calibration/unsync/missing data) rest on experimental results that are not quantified here. Without reported metrics, loss formulations, network diagrams, or ablations, it is impossible to assess whether the 3D-flow projection loss or IMU bias update actually drives the stated gains.
minor comments (1)
  1. [Abstract] Abstract: the phrasing 'self-supervised deep learning network for monocular visual inertial odometry' should be clarified to distinguish the training supervision source from the monocular inference setting.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each major comment point by point below.

  1. Referee: [Abstract] Abstract: the self-supervised framing is load-bearing for the contribution, yet supervision is generated from independent stereo sequences (depth, 3D flow, 6-DoF poses) rather than monocular input alone. This external grounding must be shown to transfer without systematic bias (baseline calibration, disparity quantization, dynamic-object handling) that would be absent at monocular test time; no error statistics on the 3D flow targets or ablation with deliberately degraded stereo calibration are referenced.

    Authors: We agree the abstract should more precisely describe the training setup. Stereo sequences are used only to compute the 3D geometric constraints that serve as training signals; the resulting network is strictly monocular plus IMU at test time. We will revise the abstract to state this distinction explicitly. To address the request for validation of transfer, we will add error statistics on the 3D flow targets and an ablation study using deliberately degraded stereo calibration in the revised manuscript. revision: yes

  2. Referee: [Abstract] Abstract: the central performance claims (outperformance on KITTI/EuRoC, reduced impact of calibration/unsync/missing data) rest on experimental results that are not quantified here. Without reported metrics, loss formulations, network diagrams, or ablations, it is impossible to assess whether the 3D-flow projection loss or IMU bias update actually drives the stated gains.

    Authors: The abstract is a high-level summary. We will expand it to include the principal quantitative metrics (e.g., translation and rotation errors on KITTI and EuRoC) that support the performance claims. The full manuscript already contains the loss formulations, network diagrams, and ablation studies; these sections will be referenced more explicitly from the revised abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity: stereo-derived supervision supplies independent external targets


The paper generates 3D optical flow and 6-DoF poses from stereo sequences as supervisory signals for training the monocular 2D flow, IMU preintegration, and fusion networks. These targets are computed separately from the monocular inputs and are not defined in terms of the network outputs or fitted parameters that are then renamed as predictions. The training minimizes projection errors and ego-motion losses against these external geometric constraints, providing grounding outside the monocular inference pipeline. No self-citation chains, self-definitional loops, or ansatzes imported via prior work are present in the derivation. The approach remains self-contained against the stereo benchmarks used for supervision.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no concrete equations, hyperparameters, or modeling choices, preventing enumeration of free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5766 in / 1090 out tokens · 26372 ms · 2026-05-25T15:04:10.669411+00:00 · methodology

