
arxiv: 2605.07945 · v1 · submitted 2026-05-08 · 💻 cs.CV

Rebalancing gradient to improve self-supervised co-training of depth, odometry and optical flow predictions

Pith reviewed 2026-05-11 03:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised learning · monocular depth estimation · optical flow · visual odometry · co-training · gradient balancing · photometric loss

The pith

CoopNet rebalances gradients among co-trained networks to equalize learning progress in self-supervised depth, odometry, and optical flow prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces CoopNet to make co-training of depth, odometry, and optical flow networks more effective in a self-supervised setting. It does so by dynamically adjusting the share of the gradient each network receives during backpropagation, based on a model of where their photometric reconstructions disagree. The disagreement is interpreted as coming from moving objects that should not influence depth and odometry training. Theoretical arguments and tests on real driving videos support the model, leading to predictions that match or exceed previous methods on standard benchmarks.

Core claim

The central claim is that a distribution model of photometric errors can identify regions where the depth-plus-odometry reconstruction and the optical-flow reconstruction strongly disagree; these regions are taken to be moving objects. By using this model to apportion the training gradient, CoopNet aims to keep the networks from interfering with one another's learning and, per the abstract, improves on or is comparable to the state of the art in depth, odometry, and optical flow prediction on KITTI and Cityscapes.

What carries the argument

CoopNet's hybrid loss models the distribution of photometric reconstruction errors made by the depth-odometry pair and by the optical flow network, and uses the disagreement between the two to dynamically weight each network's gradient contribution.
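To make the mechanism concrete, here is a minimal NumPy sketch of disagreement-based masking. It is not the paper's loss: it assumes ∆ is the per-pixel difference between the rigid (depth + odometry) photometric error and the optical-flow photometric error, that pixels outside a central quantile band of ∆ are treated as likely moving, and that the weighting is a hard 0/1 mask. The names `rigid_pixel_weights`, `masked_photometric_loss`, and the level `eta` are hypothetical.

```python
import numpy as np

def rigid_pixel_weights(err_rigid, err_flow, eta=0.9):
    """Return (delta, weights); weights is 1.0 where a pixel is treated as rigid."""
    # Assumed definition of the disagreement signal (not taken from the paper).
    delta = err_rigid - err_flow
    # Assumed symmetric band: keep the central `eta` fraction of delta values.
    lo, hi = np.quantile(delta, [(1.0 - eta) / 2.0, (1.0 + eta) / 2.0])
    weights = ((delta >= lo) & (delta <= hi)).astype(np.float64)
    return delta, weights

def masked_photometric_loss(err_rigid, weights):
    """Mean rigid photometric error over the pixels kept by the mask."""
    kept = weights.sum()
    return float((err_rigid * weights).sum() / max(kept, 1.0))

# Synthetic example: the flow error is uniformly small, the rigid error is
# small everywhere except on a 2x2 patch standing in for a moving object.
rng = np.random.default_rng(0)
err_flow = np.full((8, 8), 0.05)
err_rigid = 0.05 + 0.01 * rng.standard_normal((8, 8))
err_rigid[2:4, 2:4] = 0.60

delta, w = rigid_pixel_weights(err_rigid, err_flow, eta=0.9)
loss = masked_photometric_loss(err_rigid, w)
print("pixels kept:", int(w.sum()), "masked loss:", round(loss, 4))
```

In this toy setting the high-disagreement patch is excluded from the depth and odometry loss, which is the behaviour the summary attributes to CoopNet. The dynamic apportionment of gradient between networks described in the paper is not reproduced here.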

If this is right

  • Depth and odometry training automatically down-weights pixels that belong to moving objects.
  • Optical flow training benefits from the same disagreement signal.
  • The overall system reaches performance at or above prior self-supervised methods without extra supervision.
  • Equitable learning progress occurs across the three tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could be applied to other co-training scenarios where multiple predictors share a common loss but have different sensitivities to certain data regions.
  • Disagreement between independent reconstructions may serve as a general indicator for dynamic elements in unsupervised scene understanding.
  • Further gains might come from combining this rebalancing with other self-supervised signals such as semantic segmentation.

Load-bearing premise

The pixels belonging to moving objects are precisely the ones where the depth-plus-odometry and optical-flow reconstructions disagree the most.

What would settle it

Training CoopNet on KITTI and finding that depth estimation error is unchanged or worse than that of a baseline without gradient rebalancing would falsify the claim.
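The comparison described above could be scored as in the sketch below. It assumes the absolute relative error (Abs Rel), a metric commonly reported for monocular depth, as the measure of "depth estimation error"; this page does not say which metric the paper uses. The arrays are made-up toy values, not results, and `abs_rel` is a hypothetical helper.

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative depth error over pixels with valid ground truth."""
    valid = gt > 0                      # assumption: 0 marks missing ground truth
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

# Toy depth maps in metres (illustrative values only).
gt = np.array([[10.0, 20.0], [0.0, 40.0]])
baseline = np.array([[12.0, 25.0], [5.0, 30.0]])     # stand-in: no rebalancing
rebalanced = np.array([[11.0, 21.0], [5.0, 38.0]])   # stand-in: with rebalancing

err_base = abs_rel(baseline, gt)
err_rebal = abs_rel(rebalanced, gt)
# Under this reading, the claim would be in trouble if err_rebal >= err_base
# on the real benchmark.
print(err_base, err_rebal, err_rebal < err_base)
```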

Figures

Figures reproduced from arXiv: 2605.07945 by Antoine Manzanera, David Filliat, Marwane Hariat.

Figure 2. Degenerate cases with black stains (correspond…
Figure 1. Density models of ∆ used in our work, for all the pixels (black), rigid pixels (blue dashed), and mobile pixels (red dashed). This is the result of the statistical analysis of ∆ on all the images of KITTI and highlights the intrinsic bias of the Gaussian distribution, the moving pixels following a bimodal distribution centred on both sides of the tails and the rigid pixels located around the mean value. N…
Figure 3. Smoothing issue around moving objects. The…
Figure 4. Diagram depicting CoopNet. The Quantile Module takes as input the rigid flow inferred by the pair $(\mathcal{D}_{\theta}, \mathcal{T}_{\alpha})$ and the flow produced by $\mathcal{F}_{\delta}$ to compute ∆. The running values $(\widetilde{q_{-\eta}}, \widetilde{q_{\eta}})$ are updated with the $P^2$ algorithm [18] to be used at the next epoch. The current values (…
Figure 5. Comparison of depth map estimation algorithms in challenging situations. White dashed rectangles target the…
Figure 6. Illustration of the large variations of ∆ between positive and negative values in challenging cases indicative of a strong ambiguity. Also note the dominance of the red colour due to the intrinsic bias mentioned in Sec. 3.1. tricky cases where it’s quite difficult, even for a human, to determine the flow displacement: for instance pixels at the edges, in high-texture areas or around thin objects. Converse…
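The Figure 4 caption mentions running quantile values updated with the P2 algorithm [18]. As a hedged illustration of how such a streaming estimator works, the sketch below implements the standard five-marker P2 update for a single quantile level in plain Python. The class name `P2Quantile`, the levels 0.05 and 0.95, and the synthetic Gaussian stand-in for ∆ are assumptions made here; the paper's Quantile Module may batch values or choose η differently.

```python
import random

class P2Quantile:
    """Streaming estimate of one quantile using five markers (P2 algorithm)."""

    def __init__(self, p):
        self.p = p
        self.q = []                                        # marker heights
        self.n = [1, 2, 3, 4, 5]                           # actual marker positions
        self.nd = [1, 1 + 2 * p, 1 + 4 * p, 3 + 2 * p, 5]  # desired positions
        self.dn = [0, p / 2, p, (1 + p) / 2, 1]            # desired-position increments

    def _parabolic(self, i, s):
        q, n = self.q, self.n
        return q[i] + s / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + s) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
            + (n[i + 1] - n[i] - s) * (q[i] - q[i - 1]) / (n[i] - n[i - 1]))

    def update(self, x):
        q, n = self.q, self.n
        if len(q) < 5:                  # bootstrap: keep the first five samples sorted
            q.append(x)
            q.sort()
            return
        if x < q[0]:                    # new minimum
            q[0], k = x, 0
        elif x >= q[4]:                 # new maximum (or tie with it)
            q[4], k = x, 3
        else:                           # cell k such that q[k] <= x < q[k + 1]
            k = max(i for i in range(4) if q[i] <= x)
        for i in range(k + 1, 5):       # markers above the cell shift right by one
            n[i] += 1
        for i in range(5):
            self.nd[i] += self.dn[i]
        for i in (1, 2, 3):             # nudge interior markers toward desired positions
            d = self.nd[i] - n[i]
            if (d >= 1 and n[i + 1] - n[i] > 1) or (d <= -1 and n[i - 1] - n[i] < -1):
                s = 1 if d > 0 else -1
                cand = self._parabolic(i, s)
                if not (q[i - 1] < cand < q[i + 1]):   # fall back to linear interpolation
                    cand = q[i] + s * (q[i + s] - q[i]) / (n[i + s] - n[i])
                q[i] = cand
                n[i] += s

    def value(self):
        if not self.q:
            raise ValueError("no samples seen yet")
        if len(self.q) < 5:             # too few samples: read off the sorted buffer
            return self.q[int(self.p * (len(self.q) - 1))]
        return self.q[2]                # the middle marker tracks the p-quantile

random.seed(0)
low, high = P2Quantile(0.05), P2Quantile(0.95)
for _ in range(20000):
    delta = random.gauss(0.0, 1.0)      # synthetic stand-in for a disagreement value
    low.update(delta)
    high.update(delta)
q_low, q_high = low.value(), high.value()
print(q_low, q_high)
```

The appeal of this estimator in a training loop is that it keeps only five markers per quantile, so thresholds can be tracked over a whole epoch without storing every per-pixel value.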
Original abstract

We present CoopNet, an approach that improves the cooperation of co-trained networks by dynamically adapting the apportionment of gradient, to ensure equitable learning progress. It is applied to motion-aware self-supervised prediction of depth maps, by introducing a new hybrid loss, based on a distribution model of photometric reconstruction errors made by, on the one hand the depth + odometry paired networks, and on the other hand the optical flow network. This model essentially assumes that the pixels from moving objects (that must be discarded for training depth and odometry), correspond to those where the two reconstructions strongly disagree. We justify this model by theoretical considerations and experimental evidences. A comparative evaluation on KITTI and CityScapes datasets shows that CoopNet improves or is comparable to the state-of-the-art in depth, odometry and optical flow predictions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents CoopNet, which improves self-supervised co-training of depth, odometry, and optical flow networks by dynamically rebalancing gradient apportionment via a new hybrid loss. The loss relies on a distribution model of photometric reconstruction errors, positing that pixels from moving objects (to be discarded for depth/odometry training) are exactly those where the depth-plus-odometry reconstruction and the optical-flow reconstruction disagree strongly. This model is justified by theoretical considerations and experimental evidence; comparative results on KITTI and Cityscapes show that CoopNet improves or matches state-of-the-art performance across the three tasks.

Significance. If the core modeling assumption and gradient-rebalancing mechanism prove robust, the work could meaningfully advance self-supervised multi-task learning in computer vision by addressing gradient imbalance without explicit moving-object supervision. The experimental evaluation on standard benchmarks (KITTI, Cityscapes) provides evidence of practical utility and comparability to prior art. However, the potential circularity in fitting the error-distribution parameters on the same training data and the lack of detailed isolation of the rebalancing contribution limit the strength of the novelty claim relative to incidental regularization effects.

major comments (3)
  1. [Abstract / model justification] Abstract and model justification: The load-bearing assumption that strong disagreement between depth+odometry and flow photometric errors exactly identifies moving-object pixels is stated without a full derivation or sensitivity analysis. This correspondence is justified only by 'theoretical considerations and experimental evidences,' yet it is vulnerable to common violations (illumination variation, specular surfaces, textureless regions) that can produce disagreement on static geometry; if inaccurate, the dynamic apportionment systematically misallocates gradients and the claimed equitable cooperation mechanism is undermined.
  2. [Hybrid loss definition] Hybrid loss and distribution model: The error-distribution parameters are fitted to the training data on which the networks are also trained, creating a circularity where the rebalancing signal is defined in terms of quantities derived from the networks being optimized. The manuscript must demonstrate that this fitting procedure is stable, does not collapse to trivial solutions, and yields gains beyond what a fixed or non-circular mask would achieve; otherwise the central claim of improved cooperation via dynamic adaptation rests on an internally dependent construction.
  3. [Comparative evaluation] Experimental evaluation: While results on KITTI and Cityscapes are reported as state-of-the-art or comparable, the manuscript lacks ablations that isolate the contribution of the gradient-rebalancing component from other loss terms or training choices. Without such controls, observed improvements cannot be confidently attributed to the proposed equitable-learning mechanism rather than incidental regularization.
minor comments (2)
  1. [Method] Notation for the photometric error distribution and its parameters should be introduced with explicit equations early in the method section to allow readers to verify the claimed parameter-free or distribution-based properties.
  2. [Experiments] Figure captions and table legends could more clearly indicate which metrics are depth, odometry, or flow, and whether reported numbers are means or medians across sequences.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major point below and describe the revisions we will incorporate to strengthen the manuscript.

  1. Referee: [Abstract / model justification] Abstract and model justification: The load-bearing assumption that strong disagreement between depth+odometry and flow photometric errors exactly identifies moving-object pixels is stated without a full derivation or sensitivity analysis. This correspondence is justified only by 'theoretical considerations and experimental evidences,' yet it is vulnerable to common violations (illumination variation, specular surfaces, textureless regions) that can produce disagreement on static geometry; if inaccurate, the dynamic apportionment systematically misallocates gradients and the claimed equitable cooperation mechanism is undermined.

    Authors: We agree that the current presentation would benefit from a fuller derivation and explicit sensitivity analysis. In the revised manuscript we will expand the theoretical justification section with a step-by-step derivation of the photometric-error disagreement model under the self-supervised photometric consistency assumption. We will also add a dedicated sensitivity study (including controlled perturbations for illumination changes, specular highlights, and textureless regions) to quantify how often such violations produce false-positive disagreements on static geometry and to show that the resulting gradient rebalancing remains beneficial overall. revision: yes

  2. Referee: [Hybrid loss definition] Hybrid loss and distribution model: The error-distribution parameters are fitted to the training data on which the networks are also trained, creating a circularity where the rebalancing signal is defined in terms of quantities derived from the networks being optimized. The manuscript must demonstrate that this fitting procedure is stable, does not collapse to trivial solutions, and yields gains beyond what a fixed or non-circular mask would achieve; otherwise the central claim of improved cooperation via dynamic adaptation rests on an internally dependent construction.

    Authors: We acknowledge the circularity concern. In the revision we will add experiments that (i) refit the distribution parameters on a held-out validation split and compare performance to the original training-set fitting, (ii) monitor the fitted parameters across epochs to demonstrate stability and absence of collapse to trivial masks, and (iii) directly compare the full dynamic hybrid loss against both a fixed-mask baseline and a non-circular (pre-computed) mask variant. These controls will clarify whether the observed gains derive from the adaptive rebalancing rather than incidental regularization. revision: yes

  3. Referee: [Comparative evaluation] Experimental evaluation: While results on KITTI and Cityscapes are reported as state-of-the-art or comparable, the manuscript lacks ablations that isolate the contribution of the gradient-rebalancing component from other loss terms or training choices. Without such controls, observed improvements cannot be confidently attributed to the proposed equitable-learning mechanism rather than incidental regularization.

    Authors: We agree that stronger isolation of the rebalancing contribution is needed. The revised manuscript will include a new ablation section containing: (a) CoopNet with the hybrid loss replaced by a standard photometric loss, (b) CoopNet with dynamic rebalancing disabled (fixed equal weighting), and (c) CoopNet augmented with alternative regularization terms that do not rely on error-distribution modeling. Quantitative comparisons on KITTI and Cityscapes will be reported to attribute performance differences specifically to the proposed gradient-apportionment mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation remains self-contained

full rationale

The paper introduces a hybrid loss whose distribution model is presented as an explicit modeling assumption (pixels of moving objects correspond to strong disagreement between depth+odometry and flow reconstructions), justified separately by theoretical arguments and experimental evidence rather than by fitting parameters to the training outputs themselves. The gradient rebalancing is then derived directly from this loss during training; no equation reduces the rebalancing signal to a fitted parameter or to a self-citation chain that would make the claimed improvement tautological. The central claim therefore retains independent content beyond its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim rests on one domain assumption about disagreement identifying moving objects and on an unspecified distribution model whose parameters are fitted to data. No new physical entities are introduced.

free parameters (1)
  • distribution model parameters
    The photometric error distribution model requires parameters that are fitted or chosen to separate static and moving-object pixels.
axioms (1)
  • domain assumption: Pixels from moving objects correspond to locations where the depth+odometry and optical-flow reconstructions strongly disagree.
    This assumption directly justifies both the hybrid loss and the decision to discard those pixels for depth and odometry training.



Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages

  1. [1]

    J.W. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.M. Cheng, and I. Reid. Unsupervised scale-consistent depth and ego-motion learning from monocular video. In International Conference on Neural Information Processing Systems (NIPS), pages 35–45, 2019

  2. [2]

    V. Casser, S. Pirk, R. Mahjourian, and A. Angelova. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In AAAI Conference on Artificial Intelligence, volume 33, pages 8001–8008, 2018

  3. [3]

    Y. Chen, C. Schmid, and C. Sminchisescu. Self-supervised learning with geometric constraints in monocular video. In ICCV, pages 7063–7072, 2019

  4. [4]

    M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213–3223, 2016

  5. [5]

    A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, pages 2758–2766, 2015

  6. [6]

    D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In International Conference on Neural Information Processing Systems (NIPS), pages 2366–2374, 2014

  7. [7]

    F. Gao, J. Yu, H. Shen, Y. Wang, and H. Yang. Attentional separation-and-aggregation network for self-supervised depth-pose learning in dynamic scenes. In Conference on Robot Learning (CoRL 2020), Cambridge MA, 2020

  8. [8]

    R. Garg, V. Kumar, B.G. Gustavo, and I. Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV, pages 740–756. Springer International Publishing, 2016

  9. [9]

    A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361, 2012

  10. [10]

    C. Godard, O.M. Aodha, M. Firman, and G. Brostow. Digging into self-supervised monocular depth estimation. In ICCV, pages 3828–3838, 2019

  11. [11]

    A. Gordon, H. Li, R. Jonschkowski, and A. Angelova. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In ICCV, pages 8977–8986, 2019

  12. [12]

    V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon. 3D packing for self-supervised monocular depth estimation. In CVPR, pages 2485–2494, 2020

  13. [13]

    V. Guizilini, R. Hou, J. Li, R. Ambrus, and A. Gaidon. Semantically guided representation learning for self-supervised monocular depth. In International Conference on Learning Representations, 2020

  14. [14]

    K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017

  15. [15]

    J. Hur and S. Roth. Self-supervised monocular scene flow estimation. In CVPR, 2020

  16. [16]

    E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, pages 2462–2470, 2017

  17. [17]

    M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In International Conference on Neural Information Processing Systems (NIPS), 2016

  18. [18]

    R. Jain and I. Chlamtac. The P2 algorithm for dynamic calculation of quantiles and histograms without storing observations. Communications of the ACM, 28, 1985

  19. [19]

    S. Jia, X. Pei, W. Yao, and S.C. Wong. Self-supervised depth estimation leveraging global perception and geometric smoothness using on-board videos. CoRR, abs/2106.03505, 2021

  20. [20]

    A. Johnston and G. Carneiro. Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In CVPR, pages 4756–4765, 2020

  21. [21]

    U.H. Kim and J.H. Kim. Revisiting self-supervised monocular depth estimation. Robot Intelligence Technology and Applications, 2022

  22. [22]

    D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, 2015

  23. [23]

    M. Klingner, J.A. Termöhlen, J. Mikolajczyk, and T. Fingscheidt. Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. In ECCV, pages 582–600. Springer International Publishing, 2020

  24. [24]

    S. Lee, S. Im, S. Lin, and I.S. Kweon. Learning monocular depth in dynamic scenes via instance-aware projection consistency. In AAAI Conference on Artificial Intelligence, 2021

  25. [25]

    S. Lee, S. Im, S. Lin, and S. Kweon. Learning residual flow as dynamic motion from stereo videos. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019

  26. [26]

    H. Li, A. Gordon, H. Zhao, V. Casser, and A. Angelova. Unsupervised monocular depth learning in dynamic scenes. In Conference on Robot Learning (PMLR), pages 1908–1917, 2020

  27. [27]

    J. Li, J. Zhao, S. Song, and T. Feng. Unsupervised joint learning of depth, optical flow, ego-motion from video. CoRR, abs/2105.14520, 2021

  28. [28]

    R. Li, X. He, D. Xue, S. Su, Q. Mao, Y. Zhu, J. Sun, and Y. Zhang. Learning depth via leveraging semantics: Self-supervised monocular depth estimation with both implicit and explicit semantic guidance. CoRR, abs/2102.06685, 2021

  29. [29]

    T.Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. Lawrence Zitnick, and P. Dollár. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer International Publishing, 2015

  30. [30]

    L. Liu, G. Zhai, W. Ye, and Y. Liu. Unsupervised learning of scene flow estimation fusing with local rigidity. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 876–882. International Joint Conferences on Artificial Intelligence Organization, 2019

  31. [31]

    S. Meister, J. Hur, and S. Roth. Unsupervised learning of optical flow with a bidirectional census loss. In AAAI, New Orleans, Louisiana, Feb. 2018

  32. [32]

    R. Mur-Artal, J.M.M. Montiel, and J. Tardós. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31:1147–1163, 10 2015

  33. [33]

    O. Russakovsky, J. Deng, H. Yao, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis., pages 211–252, 2015

  34. [34]

    D. Sun, X. Yang, M.Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, pages 8943–8943, 2018

  35. [35]

    F. Tosi, F. Aleotti, P.Z. Ramirez, M. Poggi, S. Salti, L.D. Stefano, and S. Mattoccia. Distilled semantics for comprehensive scene understanding from videos. In Proceedings of IEEE CVPR, pages 4654–4665, 2020

  36. [36]

    C. Wang, J.M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In CVPR, 2018

  37. [37]

    Y. Wang, Y. Yang, Z. Yang, L. Zhao, P. Wang, and W. Xu. Occlusion aware unsupervised learning of optical flow. In CVPR, 2018

  38. [38]

    Z. Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, pages 600–612, 2004

  39. [39]

    J. Watson, O.M. Aodha, V. Prisacariu, G. Brostow, and M. Firman. The temporal opportunist: Self-supervised multi-frame monocular depth. In CVPR, pages 1164–1174, 2021

  40. [40]

    Z. Yang, P. Wang, Y. Wang, W. Xu, and R. Nevatia. Every pixel counts: Unsupervised geometry learning with holistic 3D motion understanding. In ECCV 2018 Workshops, 2018

  41. [41]

    Z. Yin and J. Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In CVPR, pages 1983–1992, 2018

  42. [42]

    J.J. Yu, A.W. Harley, and K.G. Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In ECCV 2016 Workshops, pages 3–10. Springer International Publishing, 2016

  43. [43]

    T. Zhou, M. Brown, N. Snavely, and D.G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, pages 1851–1860, 2017

  44. [44]

    Y. Zou, Z. Luo, and J.B. Huang. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In European Conference on Computer Vision, 2018