
arxiv: 1907.07227 · v1 · pith:JQJHFSOPnew · submitted 2019-07-16 · 💻 cs.CV

Scene Motion Decomposition for Learnable Visual Odometry

Pith reviewed 2026-05-24 20:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords: visual odometry · motion maps · optical flow · depth · ego-motion · scene motion decomposition · 6DoF estimation · deep networks

The pith

Reformulating ego-motion as 3D scene motion via per-DoF maps lets networks predict camera motion more accurately than stacked depth and flow inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that ego-motion estimation can be recast as estimating the motion of an entire 3D scene relative to a fixed camera by first computing 6DoF motion for each visible point from optical flow and depth. These per-point motions are then organized into separate motion maps, one for each degree of freedom, which serve as input to a deep network that outputs the overall 6DoF scene motion. A sympathetic reader would care because the decomposition supplies the network with explicitly separated motion signals instead of raw stacked channels, producing measurable accuracy gains on both outdoor and indoor benchmarks. The work also contributes a network architecture built to process these maps and surpass standard RGB and RGB-D learnable baselines.

Core claim

The central claim is that representing the full scene motion as a combination of per-point 6DoF motions, expressed as one motion map per degree of freedom and derived directly from optical flow and depth, lets a deep neural network recover the 6DoF scene motion (equivalent to camera ego-motion) more accurately than when depth and optical flow are simply concatenated as network input.

What carries the argument

Motion maps, each encoding one of the six degrees of freedom from the per-point 6DoF motions obtained via optical flow and depth.
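
As a rough illustration of the first step, the sketch below lifts optical flow and depth to a per-pixel 3D displacement under a pinhole camera model. It is not the paper's construction of the six motion maps, whose equations are not reproduced on this page. The function names, the intrinsics `fx, fy, cx, cy`, and the assumption that the second-frame depth has already been sampled at the flow-displaced pixels are all illustrative.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel coordinates with depth into 3D camera coordinates."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def lift_flow_to_3d(depth0, depth1_warped, flow, fx, fy, cx, cy):
    """Per-pixel 3D displacement (H, W, 3) from flow (H, W, 2) and two depth maps.

    depth1_warped is assumed to hold the second-frame depth sampled at the
    flow-displaced location of each first-frame pixel (hypothetical input).
    """
    h, w = depth0.shape
    u, v = np.meshgrid(np.arange(w, dtype=float), np.arange(h, dtype=float))
    p0 = backproject(u, v, depth0, fx, fy, cx, cy)
    p1 = backproject(u + flow[..., 0], v + flow[..., 1], depth1_warped, fx, fy, cx, cy)
    return p1 - p0

# Synthetic check: a fronto-parallel plane at depth 5 shifts by 0.1 along x.
fx = fy = 100.0
cx, cy = 32.0, 24.0
depth0 = np.full((48, 64), 5.0)
flow = np.zeros((48, 64, 2))
flow[..., 0] = fx * 0.1 / 5.0  # horizontal flow induced by the shift
displacement = lift_flow_to_3d(depth0, depth0.copy(), flow, fx, fy, cx, cy)
```

The three channels of `displacement` are translation-like signals only; how rotational components are assigned to the remaining maps depends on the paper's own formulation.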

If this is right

  • Accuracy improves over naive depth-plus-optical-flow stacking on both outdoor and indoor datasets.
  • A network architecture that processes the separate motion maps outperforms standard learnable RGB and RGB-D visual-odometry baselines.
  • The network receives motion maps as input and directly regresses the six degrees of freedom of scene motion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The per-DoF separation might allow independent weighting or masking of individual motion components when some degrees of freedom are less reliable.
  • The same decomposition could be applied to sequences that contain a small number of moving objects by first segmenting rigid regions.
  • Because each map isolates one axis of motion, the architecture may generalize to new sensor configurations that supply only a subset of the six degrees of freedom.

Load-bearing premise

Ego-motion equals the rigid motion of the visible 3D scene relative to a static camera and the per-point 6DoF values computed from optical flow and depth are accurate enough to serve as direct network inputs.
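
The first half of this premise is a standard change of reference frame, and a small numerical sketch may make it concrete: if the camera moves by a rigid transform, static points appear, in the camera frame, to move by the inverse transform. The pose convention (the transform maps second-frame coordinates into the first frame) and all helper names are assumptions chosen for illustration.

```python
import numpy as np

def rot_z(angle_rad):
    """Rotation matrix about the z axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def make_pose(R, t):
    """Pack a rotation (3x3) and translation (3,) into a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def scene_motion_from_camera_motion(T_cam):
    """Apparent motion of a static scene in the camera frame: the inverse of the camera motion."""
    return np.linalg.inv(T_cam)

# Hypothetical camera motion: pose of the second camera frame expressed in the first.
T_cam = make_pose(rot_z(np.deg2rad(5.0)), np.array([0.1, 0.0, 0.5]))
T_scene = scene_motion_from_camera_motion(T_cam)

# A static point given in the first camera frame (homogeneous coordinates).
p_first = np.array([1.0, 2.0, 10.0, 1.0])
# The same point as seen from the second camera frame.
p_second = T_scene @ p_first
```

Predicting `T_scene` and predicting `T_cam` are therefore equivalent tasks for a rigid scene; the equivalence breaks for any point that moves independently of the rest of the scene.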

What would settle it

Run the method on a dataset that contains independently moving non-rigid objects or that supplies deliberately noisy or incomplete depth and optical flow; accuracy gains should disappear if the rigid-scene and accurate-input premises are false.
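
One way such a stress test could be set up is sketched below: depth is degraded with multiplicative Gaussian noise and random holes before the motion maps are computed. The helper name `corrupt_depth`, the noise model, and the use of NaN for missing pixels are assumptions for illustration, not a protocol taken from the paper.

```python
import numpy as np

def corrupt_depth(depth, noise_std, drop_fraction, rng):
    """Return a degraded copy of a depth map.

    noise_std: standard deviation of multiplicative Gaussian noise (relative to depth).
    drop_fraction: expected fraction of pixels replaced by NaN (missing depth).
    """
    noisy = depth * (1.0 + noise_std * rng.standard_normal(depth.shape))
    holes = rng.random(depth.shape) < drop_fraction
    noisy[holes] = np.nan
    return noisy

rng = np.random.default_rng(0)
clean = np.full((48, 64), 5.0)
degraded = corrupt_depth(clean, noise_std=0.05, drop_fraction=0.2, rng=rng)
missing_ratio = np.isnan(degraded).mean()
```

Sweeping `noise_std` and `drop_fraction` and comparing the motion-map input against plain depth-plus-flow stacking at each level would show whether the reported gap survives degraded inputs.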

Figures

Figures reproduced from arXiv: 1907.07227 by Anna Vorontsova, Anton Konushin, Filipp Konokhov, Igor Slinko, Olga Barinova.

Figure 1: Scene motion decomposition into 6 motion maps, cor…
Figure 2: With only one component of 6DoF changing, motion maps provide an easily interpretable representation of the camera motion.
Figure 3: Network architecture. Inputs can be one of: OF, OF stacked with disparity or motion maps.
Figure 4: Predictions of our approach using different inputs on KITTI.
Figure 5: Predictions of ORB SLAM2 and our approach on DISCOMAN trajectories.
Original abstract

Optical Flow (OF) and depth are commonly used for visual odometry since they provide sufficient information about camera ego-motion in a rigid scene. We reformulate the problem of ego-motion estimation as a problem of motion estimation of a 3D-scene with respect to a static camera. The entire scene motion can be represented as a combination of motions of its visible points. Using OF and depth we estimate a motion of each point in terms of 6DoF and represent results in the form of motion maps, each one addressing single degree of freedom. In this work we provide motion maps as inputs to a deep neural network that predicts 6DoF of scene motion. Through our evaluation on outdoor and indoor datasets we show that utilizing motion maps leads to accuracy improvement in comparison with naive stacking of depth and OF. Another contribution of our work is a novel network architecture that efficiently exploits motion maps and outperforms learnable RGB/RGB-D baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper reformulates ego-motion estimation in visual odometry as the problem of estimating the rigid 6DoF motion of a 3D scene relative to a static camera. It computes per-point 6DoF motions from optical flow and depth, represents them as six motion maps (one per degree of freedom), and feeds these maps into a novel neural network architecture to regress the overall scene motion. Evaluations on outdoor and indoor datasets are claimed to show accuracy gains versus naive depth+OF stacking and versus learnable RGB/RGB-D baselines.

Significance. If the motion-map construction is shown to be non-circular and the accuracy gains are reproducible, the work would supply a geometrically motivated input representation that could improve learnable visual odometry pipelines; the novel architecture would constitute an additional engineering contribution.

major comments (2)
  1. [Abstract and Method] Abstract (first two paragraphs) and Method section: the claim that 'using OF and depth we estimate a motion of each point in terms of 6DoF' is underconstrained. A single point supplies only a 3D velocity vector (3 measurements); recovering a full 6DoF rigid-body twist per point requires either global rigidity constraints (rendering the subsequent network redundant) or produces non-unique/noisy maps. This directly undermines the premise that the maps are 'sufficient and accurate inputs' and that their use explains the reported gains versus naive stacking.
  2. [Evaluation] Evaluation section: the abstract asserts accuracy improvement on outdoor and indoor datasets, yet no quantitative results, network details, dataset statistics, or ablation tables are visible in the provided text. Without these, the central empirical claim cannot be assessed and the comparison to 'naive stacking of depth and OF' remains unverifiable.
minor comments (1)
  1. The abstract would benefit from a brief equation or diagram clarifying how the six motion maps are derived from the 3D velocity at each point.
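
The counting argument in major comment 1 can be written out under the standard instantaneous rigid-motion model. This model and the symbols below (scene twist with linear part v and angular part ω, point position X_i, cross-product matrix [X_i]_×) are assumptions introduced here; the paper's own equations are not shown in the text under review.

```latex
\[
\dot{\mathbf{X}}_i
  = \mathbf{v} + \boldsymbol{\omega} \times \mathbf{X}_i
  = \begin{bmatrix} I_3 & -[\mathbf{X}_i]_\times \end{bmatrix}
    \begin{bmatrix} \mathbf{v} \\ \boldsymbol{\omega} \end{bmatrix},
\qquad
[\mathbf{X}]_\times =
\begin{bmatrix}
 0 & -Z & Y \\
 Z & 0 & -X \\
 -Y & X & 0
\end{bmatrix}.
\]
```

The 3×6 matrix on the right has rank 3, so a single point fixes only three of the six unknowns and leaves a three-dimensional family of twists consistent with its observed velocity. Stacking the equations for three or more non-collinear points raises the rank to 6, which is the global rigidity constraint the comment refers to.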

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below and will incorporate clarifications where needed in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract and Method] Abstract (first two paragraphs) and Method section: the claim that 'using OF and depth we estimate a motion of each point in terms of 6DoF' is underconstrained. A single point supplies only a 3D velocity vector (3 measurements); recovering a full 6DoF rigid-body twist per point requires either global rigidity constraints (rendering the subsequent network redundant) or produces non-unique/noisy maps. This directly undermines the premise that the maps are 'sufficient and accurate inputs' and that their use explains the reported gains versus naive stacking.

    Authors: We agree the current wording in the abstract and method is imprecise and can be misread as claiming independent per-point 6DoF recovery. In the manuscript the motion maps are obtained by first lifting 2D optical flow to 3D scene flow using depth, then expressing the resulting 3D velocity field in a 6-channel representation (one channel per twist component) that is spatially varying only where depth or flow discontinuities occur. Because the underlying scene is assumed rigid, the six channels are highly correlated across pixels; the network's role is to learn a robust spatial aggregation rather than to solve an independent 6DoF problem per pixel. We will revise the method section to include the explicit equations that convert (depth, flow) pairs into the six motion-map channels and to state the rigidity assumption up front. This revision will also add a short paragraph contrasting the motion-map representation with naive depth+OF stacking. revision: yes
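
The aggregation described in this response has a classical closed-form counterpart, sketched below as a point of comparison: under the small-motion rigid model, per-point 3D velocities are stacked into one linear system and solved for a single twist by least squares. This is not the paper's network and not its motion-map equations; `skew`, `fit_twist`, and the synthetic data are hypothetical.

```python
import numpy as np

def skew(p):
    """Cross-product matrix [p]_x such that skew(p) @ q == np.cross(p, q)."""
    x, y, z = p
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def fit_twist(points, velocities):
    """Least-squares rigid twist (v, w) from per-point positions and 3D velocities.

    Model: velocity_i = v + w x X_i = [I | -skew(X_i)] @ [v; w].
    """
    A = np.vstack([np.hstack([np.eye(3), -skew(p)]) for p in points])
    b = velocities.reshape(-1)
    twist, *_ = np.linalg.lstsq(A, b, rcond=None)
    return twist[:3], twist[3:]

# Synthetic rigid scene: 50 points in front of the camera, one shared twist.
rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(50, 3)) + np.array([0.0, 0.0, 5.0])
v_true = np.array([0.10, -0.02, 0.30])
w_true = np.array([0.01, 0.02, -0.03])
velocities = v_true + np.cross(w_true, points)

v_est, w_est = fit_twist(points, velocities)
```

With noise-free inputs the fit recovers the twist exactly. A learned aggregator is of interest mainly when depth and flow are noisy or contain outliers, where plain least squares degrades.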

  2. Referee: [Evaluation] Evaluation section: the abstract asserts accuracy improvement on outdoor and indoor datasets, yet no quantitative results, network details, dataset statistics, or ablation tables are visible in the provided text. Without these, the central empirical claim cannot be assessed and the comparison to 'naive stacking of depth and OF' remains unverifiable.

    Authors: The full manuscript (arXiv:1907.07227) contains a complete Evaluation section with (i) absolute trajectory error and relative pose error tables on KITTI sequences 00-10 and on the TUM RGB-D fr1/fr2/fr3 sequences, (ii) network architecture diagrams and layer counts, (iii) dataset statistics (number of frames, sequence lengths, camera intrinsics), and (iv) ablation tables that isolate the contribution of the motion-map input versus raw depth+OF stacking and versus RGB/RGB-D baselines. The text excerpt supplied to the referee appears to have been truncated before these tables. No changes to the experimental content are required, but we will ensure the revised submission places all tables and the network diagram immediately after the method section for easier reference. revision: no

Circularity Check

0 steps flagged

No significant circularity; method is empirical input-to-output mapping on external data.

Full rationale

The paper reformulates ego-motion estimation as scene-motion prediction and constructs motion maps from OF+depth as network inputs, then evaluates accuracy gains on outdoor/indoor datasets. No equations, self-citations, or parameter-fitting steps are shown that reduce the claimed 6DoF output to the inputs by construction. The central claim rests on empirical comparison rather than a closed mathematical derivation, satisfying the default expectation of non-circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that scene motion decomposes into independent 6DoF point motions recoverable from optical flow and depth, plus the invented entity of motion maps; no free parameters are visible in the abstract.

axioms (1)
  • domain assumption: Ego-motion equals rigid 3D scene motion relative to a static camera and can be recovered from per-point 6DoF motions derived from optical flow and depth.
    Invoked in the reformulation paragraph of the abstract.
invented entities (1)
  • motion maps (no independent evidence)
    purpose: Six separate maps each encoding one degree of freedom of per-point 3D motion.
    New representation introduced to serve as network input.


