Scene Motion Decomposition for Learnable Visual Odometry

Anna Vorontsova; Anton Konushin; Filipp Konokhov; Igor Slinko; Olga Barinova

arxiv: 1907.07227 · v1 · pith:JQJHFSOPnew · submitted 2019-07-16 · 💻 cs.CV

Scene Motion Decomposition for Learnable Visual Odometry

Igor Slinko , Anna Vorontsova , Filipp Konokhov , Olga Barinova , Anton Konushin This is my paper

Pith reviewed 2026-05-24 20:48 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual odometrymotion mapsoptical flowdepthego-motionscene motion decomposition6DoF estimationdeep networks

0 comments

The pith

Reformulating ego-motion as 3D scene motion via per-DoF maps lets networks predict camera motion more accurately than stacked depth and flow inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that ego-motion estimation can be recast as estimating the motion of an entire 3D scene relative to a fixed camera by first computing 6DoF motion for each visible point from optical flow and depth. These per-point motions are then organized into separate motion maps, one for each degree of freedom, which serve as input to a deep network that outputs the overall 6DoF scene motion. A sympathetic reader would care because the decomposition supplies the network with explicitly separated motion signals instead of raw stacked channels, producing measurable accuracy gains on both outdoor and indoor benchmarks. The work also contributes a network architecture built to process these maps and surpass standard RGB and RGB-D learnable baselines.

Core claim

The central claim is that representing the full scene motion as a sum of per-point 6DoF motions, expressed as one motion map per degree of freedom derived directly from optical flow and depth, allows a deep neural network to recover the 6DoF scene motion (equivalent to camera ego-motion) with higher accuracy than when depth and optical flow are simply concatenated as network input.

What carries the argument

Motion maps, each encoding one of the six degrees of freedom from the per-point 6DoF motions obtained via optical flow and depth.

If this is right

Accuracy improves over naive depth-plus-optical-flow stacking on both outdoor and indoor datasets.
A network architecture that processes the separate motion maps outperforms standard learnable RGB and RGB-D visual-odometry baselines.
The network receives motion maps as input and directly regresses the six degrees of freedom of scene motion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The per-DoF separation might allow independent weighting or masking of individual motion components when some degrees of freedom are less reliable.
The same decomposition could be applied to sequences that contain a small number of moving objects by first segmenting rigid regions.
Because each map isolates one axis of motion, the architecture may generalize to new sensor configurations that supply only a subset of the six degrees of freedom.

Load-bearing premise

Ego-motion equals the rigid motion of the visible 3D scene relative to a static camera and the per-point 6DoF values computed from optical flow and depth are accurate enough to serve as direct network inputs.

What would settle it

Run the method on a dataset that contains independently moving non-rigid objects or that supplies deliberately noisy or incomplete depth and optical flow; accuracy gains should disappear if the rigid-scene and accurate-input premises are false.

Figures

Figures reproduced from arXiv: 1907.07227 by Anna Vorontsova, Anton Konushin, Filipp Konokhov, Igor Slinko, Olga Barinova.

**Figure 2.** Figure 2: With only one component of 6DoF changing, motion maps provide an easily interpretable representation of the camera motion. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Network architecture. Inputs can be one of: OF, OF stacked with disparity or motion maps [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Predictions of our approach using different inputs on KITTI [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Predictions of ORB SLAM2 and our approach on DISCOMAN trajectories [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

read the original abstract

Optical Flow (OF) and depth are commonly used for visual odometry since they provide sufficient information about camera ego-motion in a rigid scene. We reformulate the problem of ego-motion estimation as a problem of motion estimation of a 3D-scene with respect to a static camera. The entire scene motion can be represented as a combination of motions of its visible points. Using OF and depth we estimate a motion of each point in terms of 6DoF and represent results in the form of motion maps, each one addressing single degree of freedom. In this work we provide motion maps as inputs to a deep neural network that predicts 6DoF of scene motion. Through our evaluation on outdoor and indoor datasets we show that utilizing motion maps leads to accuracy improvement in comparison with naive stacking of depth and OF. Another contribution of our work is a novel network architecture that efficiently exploits motion maps and outperforms learnable RGB/RGB-D baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The motion maps idea is a reasonable input tweak but the per-point 6DoF derivation from 3 measurements looks underconstrained without global assumptions.

read the letter

The main thing to know is that this paper introduces 'motion maps' — six channels, one per degree of freedom — derived from optical flow and depth, and feeds them to a new network for ego-motion estimation. The abstract claims this beats naive stacking of depth and flow. What they do is reframe ego-motion as scene motion relative to a static camera, compute per-point 6DoF motions, and build maps from that. They also propose a network architecture that uses these maps efficiently. On the positive side, the idea of separating the degrees of freedom into explicit maps is a clean way to structure the input, and if the experiments hold, it shows a modest accuracy lift on standard indoor and outdoor sets compared to RGB or RGB-D baselines. The potential issue is in the motion map construction itself. Optical flow gives 2D motion and depth gives the third coordinate, so each point yields a 3D velocity vector. Recovering a full 6DoF rigid-body motion (3 translation + 3 rotation) from that single vector is not possible without additional constraints. The paper must be using some form of global rigidity or smoothing to fill in the maps, which could make the subsequent network partly redundant or introduce noise that the network then has to overcome. If the gains come mostly from the architecture rather than the maps, that changes how much credit the representation deserves. The experiments are described as comparisons on external datasets, so no obvious circularity in the evaluation. The literature citations in the abstract look standard for visual odometry work. This is the kind of paper that would interest researchers building learned visual odometry pipelines for robotics. The representation is novel enough within the cited work that a referee should check the details of how the maps are computed and whether the accuracy numbers are robust. I would send it out for review rather than desk reject, mainly to see the full method and results.

Referee Report

2 major / 1 minor

Summary. The paper reformulates ego-motion estimation in visual odometry as the problem of estimating the rigid 6DoF motion of a 3D scene relative to a static camera. It computes per-point 6DoF motions from optical flow and depth, represents them as six motion maps (one per degree of freedom), and feeds these maps into a novel neural network architecture to regress the overall scene motion. Evaluations on outdoor and indoor datasets are claimed to show accuracy gains versus naive depth+OF stacking and versus learnable RGB/RGB-D baselines.

Significance. If the motion-map construction is shown to be non-circular and the accuracy gains are reproducible, the work would supply a geometrically motivated input representation that could improve learnable visual odometry pipelines; the novel architecture would constitute an additional engineering contribution.

major comments (2)

[Abstract and Method] Abstract (first two paragraphs) and Method section: the claim that 'using OF and depth we estimate a motion of each point in terms of 6DoF' is underconstrained. A single point supplies only a 3D velocity vector (3 measurements); recovering a full 6DoF rigid-body twist per point requires either global rigidity constraints (rendering the subsequent network redundant) or produces non-unique/noisy maps. This directly undermines the premise that the maps are 'sufficient and accurate inputs' and that their use explains the reported gains versus naive stacking.
[Evaluation] Evaluation section: the abstract asserts accuracy improvement on outdoor and indoor datasets, yet no quantitative results, network details, dataset statistics, or ablation tables are visible in the provided text. Without these, the central empirical claim cannot be assessed and the comparison to 'naive stacking of depth and OF' remains unverifiable.

minor comments (1)

The abstract would benefit from a brief equation or diagram clarifying how the six motion maps are derived from the 3D velocity at each point.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below and will incorporate clarifications where needed in the revised manuscript.

read point-by-point responses

Referee: [Abstract and Method] Abstract (first two paragraphs) and Method section: the claim that 'using OF and depth we estimate a motion of each point in terms of 6DoF' is underconstrained. A single point supplies only a 3D velocity vector (3 measurements); recovering a full 6DoF rigid-body twist per point requires either global rigidity constraints (rendering the subsequent network redundant) or produces non-unique/noisy maps. This directly undermines the premise that the maps are 'sufficient and accurate inputs' and that their use explains the reported gains versus naive stacking.

Authors: We agree the current wording in the abstract and method is imprecise and can be misread as claiming independent per-point 6DoF recovery. In the manuscript the motion maps are obtained by first lifting 2D optical flow to 3D scene flow using depth, then expressing the resulting 3D velocity field in a 6-channel representation (one channel per twist component) that is spatially varying only where depth or flow discontinuities occur. Because the underlying scene is assumed rigid, the six channels are highly correlated across pixels; the network's role is to learn a robust spatial aggregation rather than to solve an independent 6DoF problem per pixel. We will revise the method section to include the explicit equations that convert (depth, flow) pairs into the six motion-map channels and to state the rigidity assumption up front. This revision will also add a short paragraph contrasting the motion-map representation with naive depth+OF stacking. revision: yes
Referee: [Evaluation] Evaluation section: the abstract asserts accuracy improvement on outdoor and indoor datasets, yet no quantitative results, network details, dataset statistics, or ablation tables are visible in the provided text. Without these, the central empirical claim cannot be assessed and the comparison to 'naive stacking of depth and OF' remains unverifiable.

Authors: The full manuscript (arXiv:1907.07227) contains a complete Evaluation section with (i) absolute trajectory error and relative pose error tables on KITTI sequences 00-10 and on the TUM RGB-D fr1/fr2/fr3 sequences, (ii) network architecture diagrams and layer counts, (iii) dataset statistics (number of frames, sequence lengths, camera intrinsics), and (iv) ablation tables that isolate the contribution of the motion-map input versus raw depth+OF stacking and versus RGB/RGB-D baselines. The text excerpt supplied to the referee appears to have been truncated before these tables. No changes to the experimental content are required, but we will ensure the revised submission places all tables and the network diagram immediately after the method section for easier reference. revision: no

Circularity Check

0 steps flagged

No significant circularity; method is empirical input-to-output mapping on external data.

full rationale

The paper reformulates ego-motion estimation as scene-motion prediction and constructs motion maps from OF+depth as network inputs, then evaluates accuracy gains on outdoor/indoor datasets. No equations, self-citations, or parameter-fitting steps are shown that reduce the claimed 6DoF output to the inputs by construction. The central claim rests on empirical comparison rather than a closed mathematical derivation, satisfying the default expectation of non-circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that scene motion decomposes into independent 6DoF point motions recoverable from optical flow and depth, plus the invented entity of motion maps; no free parameters are visible in the abstract.

axioms (1)

domain assumption Ego-motion equals rigid 3D scene motion relative to a static camera and can be recovered from per-point 6DoF motions derived from optical flow and depth.
Invoked in the reformulation paragraph of the abstract.

invented entities (1)

motion maps no independent evidence
purpose: Six separate maps each encoding one degree of freedom of per-point 3D motion.
New representation introduced to serve as network input.

pith-pipeline@v0.9.0 · 5702 in / 1202 out tokens · 23979 ms · 2026-05-24T20:48:11.976346+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking (D=3) echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We reformulate the problem of ego-motion estimation as a problem of motion estimation of a 3D-scene with respect to a static camera. ... Using OF and depth we estimate a motion of each point in terms of 6DoF and represent results in the form of motion maps, each one addressing single degree of freedom.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the entire scene motion can be represented as a combination of motions of its visible points

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 4 internal anchors

[1]

https: //www.ambarella.com/technology/ technology-overview

Ambarella cvﬂow technology overview. https: //www.ambarella.com/technology/ technology-overview. Accessed: 2018-10-30. 1

work page 2018
[2]

Nvidia optical ﬂow sdk. 2

work page
[3]

”GANVO: Unsupervised Deep Monocular Visual Odometry and Depth Estimation with Generative Adversarial Networks.” arXiv preprint arXiv:1809.05786 (2018)

Y . Almalioglu, M. R. U. Saputra, P. P. de Gusmao, A. Markham, and N. Trigoni. Ganvo: Unsupervised deep monocular visual odometry and depth estimation with generative adversarial networks. arXiv preprint arXiv:1809.05786, 2018. 1

work page arXiv 2018
[4]

Casser, S

V . Casser, S. Pirk, R. Mahjourian, and A. Angelova. Un- supervised learning of depth and ego-motion: A structured approach. In Thirty-Third AAAI Conference on Artiﬁcial In- telligence (AAAI-19), 2019. 1, 6

work page 2019
[5]

Costante and T

G. Costante and T. A. Ciarfuglia. Ls-vo: Learning dense op- tical subspace for robust visual odometry estimation. IEEE Robotics and Automation Letters, 3(3):1735–1742, 2018. 1, 3, 5, 6, 7, 8

work page 2018
[6]

ENG: End-to-end Neural Geometry for Robust Depth and Pose Estimation using CNNs

T. Dharmasiri, A. Spek, and T. Drummond. Eng: End-to-end neural geometry for robust depth and pose estimation using cnns. arXiv preprint arXiv:1807.05705, 2018. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2018
[7]

Dosovitskiy, P

A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V . Golkov, P. Van Der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical ﬂow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015. 1, 2

work page 2015
[8]

Engel, V

J. Engel, V . Koltun, and D. Cremers. Direct sparse odometry. IEEE transactions on pattern analysis and machine intelli- gence, 40(3):611–625, 2018. 1, 2

work page 2018
[9]

Forster, M

C. Forster, M. Pizzoli, and D. Scaramuzza. Svo: Fast semi- direct monocular visual odometry. In Robotics and Automa- tion (ICRA), 2014 IEEE International Conference on, pages 15–22. IEEE, 2014. 1

work page 2014
[10]

H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth esti- mation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002–2011, 2018. 1

work page 2002
[11]

Geiger, P

A. Geiger, P. Lenz, and R. Urtasun. Are we ready for au- tonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012. 2, 3, 6

work page 2012
[12]

Geiger, J

A. Geiger, J. Ziegler, and C. Stiller. Stereoscan: Dense 3d reconstruction in real-time. 2011 IEEE Intelligent Vehicles Symposium (IV), pages 963–968, 2011. 7

work page 2011
[13]

E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical ﬂow estimation with deep networks. In IEEE conference on computer vision and pattern recognition (CVPR), volume 2, page 6, 2017. 1

work page 2017
[14]

C. Kerl, J. Sturm, and D. Cremers. Robust odometry estima- tion for rgb-d cameras. In Robotics and Automation (ICRA), 2013 IEEE International Conference on , pages 3748–3754. IEEE, 2013. 2

work page 2013
[15]

T. Lei, H. Joshi, R. Barzilay, T. Jaakkola, K. Tymoshenko, A. Moschitti, and L. Marquez. Semi-supervised ques- tion retrieval with gated convolutions. arXiv preprint arXiv:1512.05726, 2015. 2, 5

work page internal anchor Pith review Pith/arXiv arXiv 2015
[16]

R. Li, S. Wang, Z. Long, and D. Gu. Undeepvo: Monocular visual odometry through unsupervised deep learning. 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7286–7291, 2018. 7

work page 2018
[17]

Liang, Y

Z. Liang, Y . Feng, Y . Chen, and L. Zhang. Learning for disparity estimation through feature constancy. In Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2811–2820, 2018. 1

work page 2018
[18]

W. Luo, A. G. Schwing, and R. Urtasun. Efﬁcient deep learn- ing for stereo matching. In Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition, pages 5695–5703, 2016. 1

work page 2016
[19]

Z. Lv, K. Kim, A. Troccoli, D. Sun, J. Rehg, and J. Kautz. Learning rigidity in dynamic scenes with a moving camera for 3d motion ﬁeld estimation. In ECCV, 2018. 5, 6, 7, 8

work page 2018
[20]

A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation

N. Mayer, E. Ilg, P. H ¨ausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train con- volutional networks for disparity, optical ﬂow, and scene ﬂow estimation. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) , 2016. arXiv:1512.02134. 6

work page internal anchor Pith review Pith/arXiv arXiv 2016
[21]

Mur-Artal and J

R. Mur-Artal and J. D. Tard ´os. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras.IEEE Transactions on Robotics, 33(5):1255–1262, 2017. 2

work page 2017
[22]

Steinbr ¨ucker, J

F. Steinbr ¨ucker, J. Sturm, and D. Cremers. Real-time vi- sual odometry from dense rgb-d images. In Computer Vi- sion Workshops (ICCV Workshops), 2011 IEEE Interna- tional Conference on, pages 719–722. IEEE, 2011. 2

work page 2011
[23]

Sturm, N

J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cre- mers. A benchmark for the evaluation of rgb-d slam sys- tems. 2012 IEEE/RSJ International Conference on Intelli- gent Robots and Systems, pages 573–580, 2012. 7

work page 2012
[24]

D. Sun, X. Yang, M.-Y . Liu, and J. Kautz. Pwc-net: Cnns for optical ﬂow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018. 1, 6

work page 2018
[25]

Ummenhofer, H

B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and motion net- work for learning monocular stereo. In IEEE Conference on computer vision and pattern recognition (CVPR), volume 5, page 6, 2017. 2, 3

work page 2017
[26]

S. Wang, R. Clark, H. Wen, and N. Trigoni. Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In Robotics and Automa- tion (ICRA), 2017 IEEE International Conference on, pages 2043–2050. IEEE, 2017. 1, 2, 6, 7

work page 2017
[27]

S. Wang, R. Clark, H. Wen, and N. Trigoni. End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks. The International Journal of Robotics Research, 37(4-5):513–542, 2018. 2, 7

work page 2018
[28]

FastDepth: Fast Monoc- ular Depth Estimation on Embedded Systems

Wofk, Diana and Ma, Fangchang and Yang, Tien-Ju and Karaman, Sertac and Sze, Vivienne. FastDepth: Fast Monoc- ular Depth Estimation on Embedded Systems. In IEEE In- ternational Conference on Robotics and Automation (ICRA),

work page
[29]

F. Xue, Q. Wang, X. Wang, W. Dong, J. Wang, and H. Zha. Guided feature selection for deep visual odometry. CoRR, abs/1811.09935, 2018. 7

work page internal anchor Pith review Pith/arXiv arXiv 2018
[30]

C. Zhao, L. Sun, P. Purkait, T. Duckett, and R. Stolkin. Learning monocular visual odometry with dense 3d mapping from dense 3d ﬂow. Intelligent Robots and Systems (IROS), 2018 International Conference on, 2018. 3, 5, 6, 7, 8

work page 2018
[31]

H. Zhou, B. Ummenhofer, and T. Brox. Deeptam: Deep tracking and mapping. In European Conference on Com- puter Vision (ECCV), 2018. 6

work page 2018

[1] [1]

https: //www.ambarella.com/technology/ technology-overview

Ambarella cvﬂow technology overview. https: //www.ambarella.com/technology/ technology-overview. Accessed: 2018-10-30. 1

work page 2018

[2] [2]

Nvidia optical ﬂow sdk. 2

work page

[3] [3]

”GANVO: Unsupervised Deep Monocular Visual Odometry and Depth Estimation with Generative Adversarial Networks.” arXiv preprint arXiv:1809.05786 (2018)

Y . Almalioglu, M. R. U. Saputra, P. P. de Gusmao, A. Markham, and N. Trigoni. Ganvo: Unsupervised deep monocular visual odometry and depth estimation with generative adversarial networks. arXiv preprint arXiv:1809.05786, 2018. 1

work page arXiv 2018

[4] [4]

Casser, S

V . Casser, S. Pirk, R. Mahjourian, and A. Angelova. Un- supervised learning of depth and ego-motion: A structured approach. In Thirty-Third AAAI Conference on Artiﬁcial In- telligence (AAAI-19), 2019. 1, 6

work page 2019

[5] [5]

Costante and T

G. Costante and T. A. Ciarfuglia. Ls-vo: Learning dense op- tical subspace for robust visual odometry estimation. IEEE Robotics and Automation Letters, 3(3):1735–1742, 2018. 1, 3, 5, 6, 7, 8

work page 2018

[6] [6]

ENG: End-to-end Neural Geometry for Robust Depth and Pose Estimation using CNNs

T. Dharmasiri, A. Spek, and T. Drummond. Eng: End-to-end neural geometry for robust depth and pose estimation using cnns. arXiv preprint arXiv:1807.05705, 2018. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2018

[7] [7]

Dosovitskiy, P

A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V . Golkov, P. Van Der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical ﬂow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015. 1, 2

work page 2015

[8] [8]

Engel, V

J. Engel, V . Koltun, and D. Cremers. Direct sparse odometry. IEEE transactions on pattern analysis and machine intelli- gence, 40(3):611–625, 2018. 1, 2

work page 2018

[9] [9]

Forster, M

C. Forster, M. Pizzoli, and D. Scaramuzza. Svo: Fast semi- direct monocular visual odometry. In Robotics and Automa- tion (ICRA), 2014 IEEE International Conference on, pages 15–22. IEEE, 2014. 1

work page 2014

[10] [10]

H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth esti- mation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002–2011, 2018. 1

work page 2002

[11] [11]

Geiger, P

A. Geiger, P. Lenz, and R. Urtasun. Are we ready for au- tonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012. 2, 3, 6

work page 2012

[12] [12]

Geiger, J

A. Geiger, J. Ziegler, and C. Stiller. Stereoscan: Dense 3d reconstruction in real-time. 2011 IEEE Intelligent Vehicles Symposium (IV), pages 963–968, 2011. 7

work page 2011

[13] [13]

E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical ﬂow estimation with deep networks. In IEEE conference on computer vision and pattern recognition (CVPR), volume 2, page 6, 2017. 1

work page 2017

[14] [14]

C. Kerl, J. Sturm, and D. Cremers. Robust odometry estima- tion for rgb-d cameras. In Robotics and Automation (ICRA), 2013 IEEE International Conference on , pages 3748–3754. IEEE, 2013. 2

work page 2013

[15] [15]

T. Lei, H. Joshi, R. Barzilay, T. Jaakkola, K. Tymoshenko, A. Moschitti, and L. Marquez. Semi-supervised ques- tion retrieval with gated convolutions. arXiv preprint arXiv:1512.05726, 2015. 2, 5

work page internal anchor Pith review Pith/arXiv arXiv 2015

[16] [16]

R. Li, S. Wang, Z. Long, and D. Gu. Undeepvo: Monocular visual odometry through unsupervised deep learning. 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7286–7291, 2018. 7

work page 2018

[17] [17]

Liang, Y

Z. Liang, Y . Feng, Y . Chen, and L. Zhang. Learning for disparity estimation through feature constancy. In Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2811–2820, 2018. 1

work page 2018

[18] [18]

W. Luo, A. G. Schwing, and R. Urtasun. Efﬁcient deep learn- ing for stereo matching. In Proceedings of the IEEE Con- ference on Computer Vision and Pattern Recognition, pages 5695–5703, 2016. 1

work page 2016

[19] [19]

Z. Lv, K. Kim, A. Troccoli, D. Sun, J. Rehg, and J. Kautz. Learning rigidity in dynamic scenes with a moving camera for 3d motion ﬁeld estimation. In ECCV, 2018. 5, 6, 7, 8

work page 2018

[20] [20]

A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation

N. Mayer, E. Ilg, P. H ¨ausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train con- volutional networks for disparity, optical ﬂow, and scene ﬂow estimation. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) , 2016. arXiv:1512.02134. 6

work page internal anchor Pith review Pith/arXiv arXiv 2016

[21] [21]

Mur-Artal and J

R. Mur-Artal and J. D. Tard ´os. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras.IEEE Transactions on Robotics, 33(5):1255–1262, 2017. 2

work page 2017

[22] [22]

Steinbr ¨ucker, J

F. Steinbr ¨ucker, J. Sturm, and D. Cremers. Real-time vi- sual odometry from dense rgb-d images. In Computer Vi- sion Workshops (ICCV Workshops), 2011 IEEE Interna- tional Conference on, pages 719–722. IEEE, 2011. 2

work page 2011

[23] [23]

Sturm, N

J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cre- mers. A benchmark for the evaluation of rgb-d slam sys- tems. 2012 IEEE/RSJ International Conference on Intelli- gent Robots and Systems, pages 573–580, 2012. 7

work page 2012

[24] [24]

D. Sun, X. Yang, M.-Y . Liu, and J. Kautz. Pwc-net: Cnns for optical ﬂow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018. 1, 6

work page 2018

[25] [25]

Ummenhofer, H

B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and motion net- work for learning monocular stereo. In IEEE Conference on computer vision and pattern recognition (CVPR), volume 5, page 6, 2017. 2, 3

work page 2017

[26] [26]

S. Wang, R. Clark, H. Wen, and N. Trigoni. Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In Robotics and Automa- tion (ICRA), 2017 IEEE International Conference on, pages 2043–2050. IEEE, 2017. 1, 2, 6, 7

work page 2017

[27] [27]

S. Wang, R. Clark, H. Wen, and N. Trigoni. End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks. The International Journal of Robotics Research, 37(4-5):513–542, 2018. 2, 7

work page 2018

[28] [28]

FastDepth: Fast Monoc- ular Depth Estimation on Embedded Systems

Wofk, Diana and Ma, Fangchang and Yang, Tien-Ju and Karaman, Sertac and Sze, Vivienne. FastDepth: Fast Monoc- ular Depth Estimation on Embedded Systems. In IEEE In- ternational Conference on Robotics and Automation (ICRA),

work page

[29] [29]

F. Xue, Q. Wang, X. Wang, W. Dong, J. Wang, and H. Zha. Guided feature selection for deep visual odometry. CoRR, abs/1811.09935, 2018. 7

work page internal anchor Pith review Pith/arXiv arXiv 2018

[30] [30]

C. Zhao, L. Sun, P. Purkait, T. Duckett, and R. Stolkin. Learning monocular visual odometry with dense 3d mapping from dense 3d ﬂow. Intelligent Robots and Systems (IROS), 2018 International Conference on, 2018. 3, 5, 6, 7, 8

work page 2018

[31] [31]

H. Zhou, B. Ummenhofer, and T. Brox. Deeptam: Deep tracking and mapping. In European Conference on Com- puter Vision (ECCV), 2018. 6

work page 2018