
arxiv: 1906.08889 · v1 · pith:4U3ZTULInew · submitted 2019-06-20 · 💻 cs.RO · cs.CV · eess.IV

SGANVO: Unsupervised Deep Visual Odometry and Depth Estimation with Stacked Generative Adversarial Networks

Pith reviewed 2026-05-25 19:16 UTC · model grok-4.3

classification 💻 cs.RO · cs.CV · eess.IV
keywords unsupervised visual odometry · depth estimation · generative adversarial networks · stacked GAN · ego-motion estimation · KITTI dataset · recurrent representation

The pith

SGANVO, a stacked GAN, is reported to produce unsupervised depth and ego-motion estimates on the KITTI dataset that are better than or comparable to those of prior unsupervised methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SGANVO, a system of stacked GAN layers for unsupervised visual depth and ego-motion estimation from video. The lowest layer handles direct depth and motion prediction, higher layers extract spatial features, and recurrent connections across layers capture temporal dynamics. This setup is positioned as an advance over encoder-decoder networks, RCNNs, and earlier GAN uses by leveraging the adversarial training process. Results on the KITTI dataset are reported as better or comparable to prior unsupervised methods, particularly in challenging scenes. A reader would care because such methods could support more reliable camera-based navigation without requiring labeled training data.

Core claim

This paper proposes a novel unsupervised network system for visual depth and ego-motion estimation: Stacked Generative Adversarial Network(SGANVO). It consists of a stack of GAN layers, of which the lowest layer estimates the depth and ego-motion while the higher layers estimate the spatial features. It can also capture the temporal dynamic due to the use of a recurrent representation across the layers. The evaluation results show that our proposed method can produce better or comparable results in depth and ego-motion estimation.

What carries the argument

A stack of GAN layers in which the lowest layer estimates depth and ego-motion, the higher layers estimate spatial features, and a recurrent representation across layers captures temporal dynamics.
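
For orientation, the adversarial game that a GAN layer builds on can be written in its standard form, as introduced in the original GAN paper [14]. This is the generic objective only: the summary here does not state SGANVO's per-layer losses, so the symbols G, D, p_data, and p_z are the textbook ones, not the authors' notation.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

In a setting like this one, the generator output would plausibly be an image synthesized from the estimated depth and ego-motion, with the discriminator comparing it against a real frame. That mapping is an assumption drawn from the abstract's remark about generating pictures, not a statement of the paper's formulation.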

Load-bearing premise

That the specific stack of GAN layers combined with recurrent representation across layers will capture both spatial features and temporal dynamics sufficiently to improve estimation accuracy beyond prior unsupervised methods.

What would settle it

Direct comparison of depth estimation errors (such as absolute relative error) and ego-motion accuracy (such as trajectory error) between SGANVO and prior unsupervised methods on the KITTI dataset; if SGANVO's errors are not lower than or equal to those of the baselines, the central claim does not hold.
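
The comparison described above can be made concrete with a small sketch. It computes three depth error metrics (Abs Rel, Sq Rel, RMSE) as they are commonly defined in the KITTI depth literature, plus a check that one method's errors are all lower than or equal to another's. The function names `depth_errors` and `no_worse_than` are hypothetical and are not taken from the paper; depths are assumed to be flat sequences of metres, with a ground-truth value of 0 marking a pixel that has no measurement.

```python
import math
from typing import Dict, Sequence


def depth_errors(pred: Sequence[float], gt: Sequence[float]) -> Dict[str, float]:
    """Return Abs Rel, Sq Rel and RMSE over pixels with valid ground truth (gt > 0)."""
    # Keep only pixels that have a ground-truth measurement (assumption: 0 = missing).
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    n = len(pairs)
    # Mean of |pred - gt| / gt.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Mean of (pred - gt)^2 / gt.
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    # Root of the mean squared error.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse}


def no_worse_than(ours: Dict[str, float], baseline: Dict[str, float]) -> bool:
    """True if every error in `ours` is lower than or equal to the baseline's."""
    return all(ours[name] <= baseline[name] for name in ours)
```

Under this reading, the "better or comparable" claim would be checked by running `depth_errors` for each method on the same test split and passing the results to `no_worse_than`. How much slack "comparable" allows is not defined in the summary, so the strict check here is only one possible interpretation.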

Figures

Figures reproduced from arXiv: 1906.08889 by Dongbing Gu, Tuo Feng.

Figure 1: Our proposed SGANVO architecture for the depth and …
Figure 2: The network is unfolded in time. The temporal dynamic …
Figure 3: Illustrated above are qualitative comparisons of our …
Figure 4: Our proposed SGANVO system to estimate the ego …
Original abstract

Recently end-to-end unsupervised deep learning methods have achieved an effect beyond geometric methods for visual depth and ego-motion estimation tasks. These data-based learning methods perform more robustly and accurately in some of the challenging scenes. The encoder-decoder network has been widely used in the depth estimation and the RCNN has brought significant improvements in the ego-motion estimation. Furthermore, the latest use of Generative Adversarial Nets(GANs) in depth and ego-motion estimation has demonstrated that the estimation could be further improved by generating pictures in the game learning process. This paper proposes a novel unsupervised network system for visual depth and ego-motion estimation: Stacked Generative Adversarial Network(SGANVO). It consists of a stack of GAN layers, of which the lowest layer estimates the depth and ego-motion while the higher layers estimate the spatial features. It can also capture the temporal dynamic due to the use of a recurrent representation across the layers. See Fig.1 for details. We select the most commonly used KITTI [1] data set for evaluation. The evaluation results show that our proposed method can produce better or comparable results in depth and ego-motion estimation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SGANVO, a stacked generative adversarial network for unsupervised depth and ego-motion estimation. The architecture consists of multiple GAN layers where the lowest layer estimates depth and ego-motion, higher layers estimate spatial features, and recurrent representations across layers capture temporal dynamics. Evaluation is performed on the KITTI dataset, with the claim that the method produces better or comparable results to prior unsupervised approaches.

Significance. If the quantitative improvements and architectural advantages are substantiated, the work would contribute a novel combination of stacked GANs and cross-layer recurrence to unsupervised visual odometry, potentially improving robustness in challenging scenes over standard encoder-decoder or RCNN baselines. The paper receives credit for explicitly describing the layered GAN structure and recurrent mechanism in the abstract and for selecting the standard KITTI benchmark for evaluation.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: the central claim that SGANVO produces 'better or comparable results' is stated without any quantitative metrics (e.g., Abs Rel, RMSE for depth or ATE for ego-motion), ablation studies, or explicit comparison tables against baselines such as SfMLearner or prior GAN methods. This absence prevents verification of the performance claim, which is load-bearing for the paper's contribution.
  2. [Method] Method / Network Architecture section: the assumption that the specific stack of GAN layers plus recurrent cross-layer representation will sufficiently capture both spatial features and temporal dynamics to yield accuracy gains is presented without supporting analysis, such as feature visualization, ablation on recurrence, or comparison of loss terms. This is the weakest link in the central claim.
minor comments (2)
  1. [Abstract] Abstract: 'Nets(GANs)' is missing a space; 'pictures in the game learning process' is informal and should be clarified to 'synthesized images during adversarial training'.
  2. [Related Work] The manuscript should include a dedicated related-work subsection contrasting the stacked recurrent GAN design against existing unsupervised VO methods that also employ adversarial losses.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and agree that the manuscript requires revisions to include quantitative metrics, comparison tables, and supporting analysis for the architectural claims.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the central claim that SGANVO produces 'better or comparable results' is stated without any quantitative metrics (e.g., Abs Rel, RMSE for depth or ATE for ego-motion), ablation studies, or explicit comparison tables against baselines such as SfMLearner or prior GAN methods. This absence prevents verification of the performance claim, which is load-bearing for the paper's contribution.

    Authors: We agree that the abstract and experiments section lack specific quantitative metrics and explicit comparison tables. In the revised manuscript we will add a results table reporting Abs Rel, Sq Rel, RMSE, and ATE values on the KITTI dataset together with direct numerical comparisons to SfMLearner and prior unsupervised GAN-based methods. This will allow verification of the 'better or comparable' claim. revision: yes

  2. Referee: [Method] Method / Network Architecture section: the assumption that the specific stack of GAN layers plus recurrent cross-layer representation will sufficiently capture both spatial features and temporal dynamics to yield accuracy gains is presented without supporting analysis, such as feature visualization, ablation on recurrence, or comparison of loss terms. This is the weakest link in the central claim.

    Authors: The current manuscript describes the stacked GAN layers and recurrent cross-layer mechanism but does not provide ablations or visualizations. We will add an ablation study isolating the contribution of the recurrent connections, together with feature visualizations and a comparison of loss terms, in the revised version to substantiate the architectural design. revision: yes
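
The ATE value mentioned in the first response could be computed along the following lines. This is a minimal sketch, assuming trajectories are given as equal-length lists of 3D camera positions, that both are shifted to start at the origin, and that a single least-squares scale factor is fitted to the prediction (monocular estimates are only defined up to scale; the step would be unnecessary for a method that recovers metric scale). The function name `ate_rmse` is hypothetical, and the paper's actual evaluation protocol is not stated in this summary.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def ate_rmse(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """RMSE of position error after origin alignment and least-squares scaling."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")

    # Shift both trajectories so that their first position is the origin.
    p: List[Point] = [tuple(a - b for a, b in zip(q, pred[0])) for q in pred]
    g: List[Point] = [tuple(a - b for a, b in zip(q, gt[0])) for q in gt]

    # Scale s minimising sum ||s * p_i - g_i||^2, i.e. s = <p, g> / <p, p>.
    num = sum(a * b for q, r in zip(p, g) for a, b in zip(q, r))
    den = sum(a * a for q in p for a in q)
    scale = num / den if den > 0 else 1.0

    # Root of the mean squared position error over all frames.
    sq_err = sum((scale * a - b) ** 2 for q, r in zip(p, g) for a, b in zip(q, r))
    return math.sqrt(sq_err / len(p))
```

Fitting only a translation and a scale keeps the sketch short. A fuller evaluation would usually also align rotation, which is omitted here.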

Circularity Check

0 steps flagged

No significant circularity in claimed results

Full rationale

The paper presents an empirical unsupervised learning architecture (stacked GANs with recurrent cross-layer representation) and reports its performance on the standard KITTI benchmark. No derivation chain, first-principles prediction, or mathematical reduction is claimed or present in the abstract or described text. The evaluation results are obtained by training and testing the proposed network on KITTI splits, which constitutes standard held-out benchmark validation of an empirical method rather than any self-definitional, fitted-input-renamed-as-prediction, or self-citation-load-bearing step. No equations, uniqueness theorems, or ansatzes are invoked that collapse to the inputs by construction. The central claim therefore remains externally falsifiable against the benchmark and does not reduce to its own training procedure.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the effectiveness of the stacked adversarial architecture and recurrent connections, which are introduced without independent evidence beyond the KITTI evaluation.

free parameters (1)
  • network architecture and loss weights
    Deep network parameters and training hyperparameters are fitted to the KITTI data to achieve the reported performance.
axioms (1)
  • domain assumption: Adversarial training via stacked GANs improves depth and ego-motion estimation accuracy
    Invoked when the abstract states that the latest use of GANs has demonstrated further improvement.


Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 2 internal anchors

  1. [1] Geiger, Andreas, Philip Lenz, and Raquel Urtasun. "Are we ready for autonomous driving? the kitti vision benchmark suite." 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012

  2. [2] Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. "Spatial transformer networks." Advances in neural information processing systems. 2015

  3. [3] Garg, Ravi, et al. "Unsupervised cnn for single view depth estimation: Geometry to the rescue." European Conference on Computer Vision. Springer, Cham, 2016

  4. [4] Godard, Clément, Oisin Mac Aodha, and Gabriel J. Brostow. "Unsupervised monocular depth estimation with left-right consistency." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017

  5. [5] Zhou, Tinghui, et al. "Unsupervised learning of depth and ego-motion from video." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017

  6. [6] Yin, Zhichao, and Jianping Shi. "Geonet: Unsupervised learning of dense depth, optical flow and camera pose." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018

  7. [7] Godard, Clément, et al. "Digging into self-supervised monocular depth estimation." arXiv preprint arXiv:1806.01260 (2018)

  8. [8] Pillai, Sudeep, Rares Ambrus, and Adrien Gaidon. "SuperDepth: Self-Supervised, Super-Resolved Monocular Depth Estimation." arXiv preprint arXiv:1810.01849 (2018)

  9. [9] Li, Ruihao, et al. "Undeepvo: Monocular visual odometry through unsupervised deep learning." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018

  10. [10] Ranjan, Anurag, et al. "Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019

  11. [11] Wang, Yang, et al. "Joint Unsupervised Learning of Optical Flow and Depth by Watching Stereo Videos." arXiv preprint arXiv:1810.03654 (2018)

  12. [12] Sun, Deqing, et al. "PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018

  13. [13] Almalioglu, Yasin, et al. "GANVO: Unsupervised Deep Monocular Visual Odometry and Depth Estimation with Generative Adversarial Networks." arXiv preprint arXiv:1809.05786 (2018)

  14. [14] Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems. 2014

  15. [15] CS Kumar, Arun, Suchendra M. Bhandarkar, and Mukta Prasad. "Monocular depth prediction using generative adversarial networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018

  16. [16] Aleotti, Filippo, et al. "Generative Adversarial Networks for unsupervised monocular depth prediction." Proceedings of the European Conference on Computer Vision (ECCV). 2018

  17. [17] Pilzer, Andrea, et al. "Unsupervised adversarial depth estimation using cycled generative networks." 2018 International Conference on 3D Vision (3DV). IEEE, 2018

  18. [18] Gwn Lore, Kin, et al. "Generative adversarial networks for depth map estimation from RGB video." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018

  19. [19] Shi, Wenzhe, et al. "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016

  20. [20] Klambauer, Günter, et al. "Self-normalizing neural networks." Advances in neural information processing systems. 2017

  21. [21] Eigen, David, Christian Puhrsch, and Rob Fergus. "Depth map prediction from a single image using a multi-scale deep network." Advances in neural information processing systems. 2014

  22. [22] Mahjourian, Reza, Martin Wicke, and Anelia Angelova. "Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018

  23. [23] Chavdarova, Tatjana, and François Fleuret. "SGAN: An Alternative Training of Generative Adversarial Networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018

  24. [24] Wang, Sen, et al. "End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks." The International Journal of Robotics Research 37.4-5 (2018): 513-542

  25. [25] Mur-Artal, Raul, and Juan D. Tardós. "Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras." IEEE Transactions on Robotics 33.5 (2017): 1255-1262

  26. [26] Luo, Y., J. Ren, M. Lin, et al. "Single view stereo matching." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 155-163

  27. [27] Cordts, Marius, et al. "The cityscapes dataset for semantic urban scene understanding." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016