SGANVO: Unsupervised Deep Visual Odometry and Depth Estimation with Stacked Generative Adversarial Networks
Pith reviewed 2026-05-25 19:16 UTC · model grok-4.3
The pith
The SGANVO stacked GAN reports unsupervised depth and ego-motion estimates on the KITTI dataset that are better than or comparable to those of prior unsupervised methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This paper proposes a novel unsupervised network system for visual depth and ego-motion estimation: Stacked Generative Adversarial Network (SGANVO). It consists of a stack of GAN layers, of which the lowest layer estimates the depth and ego-motion while the higher layers estimate the spatial features. It can also capture the temporal dynamics due to the use of a recurrent representation across the layers. The evaluation results show that our proposed method can produce better or comparable results in depth and ego-motion estimation.
What carries the argument
The stack of GAN layers where the lowest layer estimates depth and ego-motion, higher layers estimate spatial features, and recurrent representation across layers captures temporal dynamics.
Load-bearing premise
That the specific stack of GAN layers combined with recurrent representation across layers will capture both spatial features and temporal dynamics sufficiently to improve estimation accuracy beyond prior unsupervised methods.
What would settle it
Direct comparison of depth estimation errors (such as absolute relative error) and ego-motion accuracy (such as trajectory error) between SGANVO and prior unsupervised methods on the KITTI dataset; if SGANVO errors are not lower or equal, the central claim does not hold.
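The depth error measures named above can be sketched as follows. This is a minimal illustration that assumes the definitions commonly used in the monocular depth literature (Abs Rel as the mean of |d - d*| / d*, Sq Rel as the mean of (d - d*)^2 / d*, RMSE as the root of the mean squared difference). The function name `depth_metrics` and the flat-list inputs are hypothetical and are not taken from the paper.

```python
import math


def depth_metrics(pred, gt):
    """Return Abs Rel, Sq Rel and RMSE for paired depth values.

    pred and gt are equal-length sequences of depths in the same unit.
    Entries whose ground-truth depth is not positive are skipped, which
    mirrors the usual handling of pixels without a valid measurement.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse}


if __name__ == "__main__":
    # Toy values only; these are not results from the paper.
    print(depth_metrics([2.0, 4.0, 9.0], [1.0, 4.0, 0.0]))
```

Lower values are better for all three measures, so the comparison described above amounts to checking whether each SGANVO value is at or below the corresponding baseline value on the same test split.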
Original abstract
Recently end-to-end unsupervised deep learning methods have achieved an effect beyond geometric methods for visual depth and ego-motion estimation tasks. These data-based learning methods perform more robustly and accurately in some of the challenging scenes. The encoder-decoder network has been widely used in the depth estimation and the RCNN has brought significant improvements in the ego-motion estimation. Furthermore, the latest use of Generative Adversarial Nets(GANs) in depth and ego-motion estimation has demonstrated that the estimation could be further improved by generating pictures in the game learning process. This paper proposes a novel unsupervised network system for visual depth and ego-motion estimation: Stacked Generative Adversarial Network(SGANVO). It consists of a stack of GAN layers, of which the lowest layer estimates the depth and ego-motion while the higher layers estimate the spatial features. It can also capture the temporal dynamic due to the use of a recurrent representation across the layers. See Fig.1 for details. We select the most commonly used KITTI [1] data set for evaluation. The evaluation results show that our proposed method can produce better or comparable results in depth and ego-motion estimation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SGANVO, a stacked generative adversarial network for unsupervised depth and ego-motion estimation. The architecture consists of multiple GAN layers where the lowest layer estimates depth and ego-motion, higher layers estimate spatial features, and recurrent representations across layers capture temporal dynamics. Evaluation is performed on the KITTI dataset, with the claim that the method produces better or comparable results to prior unsupervised approaches.
Significance. If the quantitative improvements and architectural advantages are substantiated, the work would contribute a novel combination of stacked GANs and cross-layer recurrence to unsupervised visual odometry, potentially improving robustness in challenging scenes over standard encoder-decoder or RCNN baselines. The paper receives credit for explicitly describing the layered GAN structure and recurrent mechanism in the abstract and for selecting the standard KITTI benchmark for evaluation.
major comments (2)
- [Abstract / Experiments] The central claim that SGANVO produces 'better or comparable results' is stated without any quantitative metrics (e.g., Abs Rel, RMSE for depth or ATE for ego-motion), ablation studies, or explicit comparison tables against baselines such as SfMLearner or prior GAN methods. This absence prevents verification of the performance claim, which is load-bearing for the paper's contribution.
- [Method] Method / Network Architecture section: the assumption that the specific stack of GAN layers plus recurrent cross-layer representation will sufficiently capture both spatial features and temporal dynamics to yield accuracy gains is presented without supporting analysis, such as feature visualization, ablation on recurrence, or comparison of loss terms. This is the weakest link in the central claim.
minor comments (2)
- [Abstract] Abstract: 'Nets(GANs)' is missing a space; 'pictures in the game learning process' is informal and should be clarified to 'synthesized images during adversarial training'.
- [Related Work] The manuscript should include a dedicated related-work subsection contrasting the stacked recurrent GAN design against existing unsupervised VO methods that also employ adversarial losses.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and agree that the manuscript requires revisions to include quantitative metrics, comparison tables, and supporting analysis for the architectural claims.
- Referee: [Abstract / Experiments] The central claim that SGANVO produces 'better or comparable results' is stated without any quantitative metrics (e.g., Abs Rel, RMSE for depth or ATE for ego-motion), ablation studies, or explicit comparison tables against baselines such as SfMLearner or prior GAN methods. This absence prevents verification of the performance claim, which is load-bearing for the paper's contribution.
  Authors: We agree that the abstract and experiments section lack specific quantitative metrics and explicit comparison tables. In the revised manuscript we will add a results table reporting Abs Rel, Sq Rel, RMSE, and ATE values on the KITTI dataset together with direct numerical comparisons to SfMLearner and prior unsupervised GAN-based methods. This will allow verification of the 'better or comparable' claim.
  Revision: yes
- Referee: [Method] Method / Network Architecture section: the assumption that the specific stack of GAN layers plus recurrent cross-layer representation will sufficiently capture both spatial features and temporal dynamics to yield accuracy gains is presented without supporting analysis, such as feature visualization, ablation on recurrence, or comparison of loss terms. This is the weakest link in the central claim.
  Authors: The current manuscript describes the stacked GAN layers and recurrent cross-layer mechanism but does not provide ablations or visualizations. We will add an ablation study isolating the contribution of the recurrent connections, together with feature visualizations and a comparison of loss terms, in the revised version to substantiate the architectural design.
  Revision: yes
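The ATE value mentioned in the first response could be computed along these lines. This is a hedged sketch, not the authors' evaluation code: it assumes ATE is reported as the RMSE of position differences, that both trajectories are shifted to start at the origin, and that a single least-squares scale factor is applied to the prediction (a common choice for monocular methods, whose scale is ambiguous). Rotation alignment is deliberately left out, and the name `ate_rmse` and its parameters are hypothetical.

```python
import math


def ate_rmse(pred, gt, align_scale=True):
    """RMSE of position error between two trajectories of (x, y, z) points.

    Assumptions of this sketch: both trajectories are translated so that
    their first pose sits at the origin, and, if align_scale is True, the
    prediction is multiplied by the scale that minimises the squared error.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    p0, g0 = pred[0], gt[0]
    pred = [tuple(a - b for a, b in zip(p, p0)) for p in pred]
    gt = [tuple(a - b for a, b in zip(g, g0)) for g in gt]
    scale = 1.0
    if align_scale:
        num = sum(a * b for p, g in zip(pred, gt) for a, b in zip(p, g))
        den = sum(a * a for p in pred for a in p)
        if den > 0:
            scale = num / den
    sq_err = sum(
        (scale * a - b) ** 2 for p, g in zip(pred, gt) for a, b in zip(p, g)
    )
    return math.sqrt(sq_err / len(pred))


if __name__ == "__main__":
    # Toy straight-line trajectories; not data from the paper.
    gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    pred = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(ate_rmse(pred, gt), ate_rmse(pred, gt, align_scale=False))
```

Whether scale alignment is applied changes the number substantially, so a results table should state the alignment protocol next to each ATE value.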
Circularity Check
No significant circularity in claimed results
The paper presents an empirical unsupervised learning architecture (stacked GANs with recurrent cross-layer representation) and reports its performance on the standard KITTI benchmark. No derivation chain, first-principles prediction, or mathematical reduction is claimed or present in the abstract or described text. The evaluation results are obtained by training and testing the proposed network on KITTI splits, which is standard benchmark validation of an empirical method rather than a self-definitional step, a fitted input renamed as a prediction, or a load-bearing self-citation. No equations, uniqueness theorems, or ansatzes are invoked that collapse to the inputs by construction. The central claim therefore remains externally falsifiable against the benchmark and does not reduce to its own training procedure.
Axiom & Free-Parameter Ledger
free parameters (1)
- network architecture and loss weights
axioms (1)
- domain assumption: Adversarial training via stacked GANs improves depth and ego-motion estimation accuracy
Reference graph
Works this paper leans on
[1] Geiger, Andreas, Philip Lenz, and Raquel Urtasun. "Are we ready for autonomous driving? the kitti vision benchmark suite." 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012.
[2] Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. "Spatial transformer networks." Advances in neural information processing systems. 2015.
[3] Garg, Ravi, et al. "Unsupervised cnn for single view depth estimation: Geometry to the rescue." European Conference on Computer Vision. Springer, Cham, 2016.
[4] Godard, Clément, Oisin Mac Aodha, and Gabriel J. Brostow. "Unsupervised monocular depth estimation with left-right consistency." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[5] Zhou, Tinghui, et al. "Unsupervised learning of depth and ego-motion from video." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[6] Yin, Zhichao, and Jianping Shi. "Geonet: Unsupervised learning of dense depth, optical flow and camera pose." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[7] Godard, Clément, et al. "Digging into self-supervised monocular depth estimation." arXiv preprint arXiv:1806.01260 (2018).
[8] Pillai, Sudeep, Rares Ambrus, and Adrien Gaidon. "SuperDepth: Self-Supervised, Super-Resolved Monocular Depth Estimation." arXiv preprint arXiv:1810.01849 (2018).
[9] Li, Ruihao, et al. "Undeepvo: Monocular visual odometry through unsupervised deep learning." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.
[10] Ranjan, Anurag, et al. "Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
[11] Wang, Yang, et al. "Joint Unsupervised Learning of Optical Flow and Depth by Watching Stereo Videos." arXiv preprint arXiv:1810.03654 (2018).
[12] Sun, Deqing, et al. "PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[13] Almalioglu, Yasin, et al. "GANVO: Unsupervised Deep Monocular Visual Odometry and Depth Estimation with Generative Adversarial Networks." arXiv preprint arXiv:1809.05786 (2018).
[14] Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems. 2014.
[15] CS Kumar, Arun, Suchendra M. Bhandarkar, and Mukta Prasad. "Monocular depth prediction using generative adversarial networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018.
[16] Aleotti, Filippo, et al. "Generative Adversarial Networks for unsupervised monocular depth prediction." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[17] Pilzer, Andrea, et al. "Unsupervised adversarial depth estimation using cycled generative networks." 2018 International Conference on 3D Vision (3DV). IEEE, 2018.
[18] Gwn Lore, Kin, et al. "Generative adversarial networks for depth map estimation from RGB video." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018.
[19] Shi, Wenzhe, et al. "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[20] Klambauer, Günter, et al. "Self-normalizing neural networks." Advances in neural information processing systems. 2017.
[21] Eigen, David, Christian Puhrsch, and Rob Fergus. "Depth map prediction from a single image using a multi-scale deep network." Advances in neural information processing systems. 2014.
[22] Mahjourian, Reza, Martin Wicke, and Anelia Angelova. "Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[23] Chavdarova, Tatjana, and François Fleuret. "SGAN: An Alternative Training of Generative Adversarial Networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[24] Wang, Sen, et al. "End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks." The International Journal of Robotics Research 37.4-5 (2018): 513-542.
[25] Mur-Artal, Raul, and Juan D. Tardós. "Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras." IEEE Transactions on Robotics 33.5 (2017): 1255-1262.
[26] Luo, Y., J. Ren, M. Lin, et al. "Single view stereo matching." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 155-163.
[27] Cordts, Marius, et al. "The cityscapes dataset for semantic urban scene understanding." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.