Rebalancing gradient to improve self-supervised co-training of depth, odometry and optical flow predictions
Pith reviewed 2026-05-11 03:23 UTC · model grok-4.3
The pith
CoopNet rebalances gradients among co-trained networks to equalize learning progress in self-supervised depth, odometry, and optical flow prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a distribution model of photometric errors can identify regions of strong disagreement between the depth-plus-odometry reconstruction and the optical-flow reconstruction, and that these regions can be taken to be moving objects. By using this model to apportion the training gradient, CoopNet aims to keep the networks from interfering with one another's learning, and it reports accuracy that improves on or is comparable to the state of the art in depth, odometry, and optical flow prediction on KITTI and Cityscapes.
What carries the argument
CoopNet's hybrid loss, which models the distribution of photometric reconstruction errors from the depth-odometry pair versus the optical flow network and uses their disagreement to dynamically weight the gradient contributions.
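The mechanism can be illustrated with a minimal NumPy sketch. It is not the paper's loss: the function names, the fixed threshold `tau`, the `sharpness` constant, and the sigmoid weighting are all assumptions chosen for illustration, and the paper's actual distribution model is not reproduced here. The sketch only shows the general idea of down-weighting pixels where the rigid (depth-plus-odometry) reconstruction is much worse than the flow reconstruction.

```python
import numpy as np

def photometric_error(target, reconstruction):
    """Per-pixel mean absolute difference over channels: (H, W, 3) -> (H, W)."""
    return np.abs(target - reconstruction).mean(axis=-1)

def disagreement_weights(err_rigid, err_flow, tau=0.1, sharpness=50.0):
    """Soft weights in (0, 1), close to 0 where the rigid error exceeds
    the flow error by more than tau (hypothetical threshold)."""
    gap = err_rigid - err_flow
    return 1.0 / (1.0 + np.exp(sharpness * (gap - tau)))

# Synthetic stand-in data, not taken from any dataset.
rng = np.random.default_rng(0)
target = rng.random((8, 8, 3))
recon_flow = target + 0.01 * rng.standard_normal(target.shape)
recon_rigid = recon_flow.copy()
recon_rigid[2:4, 2:4] += 0.5  # hypothetical moving object the rigid model misses

err_rigid = photometric_error(target, recon_rigid)
err_flow = photometric_error(target, recon_flow)
w = disagreement_weights(err_rigid, err_flow)

# Weighted loss for the depth and odometry networks versus the plain mean.
masked_loss = (w * err_rigid).sum() / w.sum()
unmasked_loss = err_rigid.mean()
```

In this toy setup the weighted loss is dominated by the static pixels, so the simulated moving patch contributes almost nothing to the depth and odometry loss, which is the behaviour the summary attributes to the hybrid loss.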
If this is right
- Depth and odometry training automatically down-weights pixels that belong to moving objects.
- Optical flow training benefits from the same disagreement signal.
- The overall system matches or exceeds prior self-supervised methods without extra supervision.
- Equitable learning progress occurs across the three tasks.
Where Pith is reading between the lines
- This approach could be applied to other co-training scenarios where multiple predictors share a common loss but have different sensitivities to certain data regions.
- Disagreement between independent reconstructions may serve as a general indicator for dynamic elements in unsupervised scene understanding.
- Further gains might come from combining this rebalancing with other self-supervised signals such as semantic segmentation.
Load-bearing premise
The pixels belonging to moving objects are precisely the ones where the depth-plus-odometry and optical-flow reconstructions disagree the most.
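One way to write this premise down, using notation that is hypothetical and not taken from the paper: let e_r(p) and e_f(p) be the photometric errors at pixel p of the depth-plus-odometry and optical-flow reconstructions, let M be the set of moving-object pixels, and let tau be a disagreement threshold. Measuring disagreement by the signed gap is itself an assumption of this sketch.

```latex
\mathcal{M} \;\approx\; \{\, p \;:\; e_{r}(p) - e_{f}(p) > \tau \,\}
```

The premise fails wherever a static pixel also satisfies the inequality, or a moving pixel does not.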
What would settle it
Running the CoopNet training on the KITTI dataset and finding that depth estimation error remains unchanged or worsens compared to a baseline without gradient rebalancing would falsify the claim.
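A comparison of this kind could be scripted roughly as follows. This is a sketch under stated assumptions: the function name, the toy arrays, and the choice of two metrics (absolute relative error and RMSE) are illustrative, and the 80 m depth cap is the convention commonly used for KITTI depth evaluation rather than something stated in this review. The arrays are placeholders, not results.

```python
import numpy as np

def depth_errors(gt, pred, min_depth=1e-3, max_depth=80.0):
    """Return (abs_rel, rmse) over pixels with valid ground-truth depth."""
    gt = np.asarray(gt, dtype=float)
    pred = np.asarray(pred, dtype=float)
    valid = (gt > min_depth) & (gt < max_depth)
    gt_v = gt[valid]
    pred_v = np.clip(pred[valid], min_depth, max_depth)
    abs_rel = float(np.mean(np.abs(gt_v - pred_v) / gt_v))
    rmse = float(np.sqrt(np.mean((gt_v - pred_v) ** 2)))
    return abs_rel, rmse

# Toy arrays standing in for real predictions; 0.0 marks missing ground truth.
gt = np.array([[2.0, 4.0], [10.0, 0.0]])
pred_baseline = np.array([[1.0, 5.0], [12.0, 3.0]])
pred_rebalanced = np.array([[1.5, 4.5], [11.0, 3.0]])

base = depth_errors(gt, pred_baseline)
reb = depth_errors(gt, pred_rebalanced)

# The claim would be in trouble if the rebalanced error were not lower.
claim_survives = reb[0] < base[0]
```

On real data the two prediction arrays would come from a baseline trained without gradient rebalancing and from the full method, evaluated on the same split.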
Original abstract
We present CoopNet, an approach that improves the cooperation of co-trained networks by dynamically adapting the apportionment of gradient, to ensure equitable learning progress. It is applied to motion-aware self-supervised prediction of depth maps, by introducing a new hybrid loss, based on a distribution model of photo-metric reconstruction errors made by, on the one hand the depth + odometry paired networks, and on the other hand the optical flow network. This model essentially assumes that the pixels from moving objects (that must be discarded for training depth and odometry), correspond to those where the two reconstructions strongly disagree. We justify this model by theoretical considerations and experimental evidences. A comparative evaluation on KITTI and CityScapes datasets shows that CoopNet improves or is comparable to the state-of-the-art in depth, odometry and optical flow predictions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents CoopNet, which improves self-supervised co-training of depth, odometry, and optical flow networks by dynamically rebalancing gradient apportionment via a new hybrid loss. The loss relies on a distribution model of photometric reconstruction errors, positing that pixels from moving objects (to be discarded for depth/odometry training) are exactly those where the depth-plus-odometry reconstruction and the optical-flow reconstruction disagree strongly. This model is justified by theoretical considerations and experimental evidence; comparative results on KITTI and Cityscapes show that CoopNet improves or matches state-of-the-art performance across the three tasks.
Significance. If the core modeling assumption and gradient-rebalancing mechanism prove robust, the work could meaningfully advance self-supervised multi-task learning in computer vision by addressing gradient imbalance without explicit moving-object supervision. The experimental evaluation on standard benchmarks (KITTI, Cityscapes) provides evidence of practical utility and comparability to prior art. However, the potential circularity in fitting the error-distribution parameters on the same training data and the lack of detailed isolation of the rebalancing contribution limit the strength of the novelty claim relative to incidental regularization effects.
major comments (3)
- [Abstract / model justification] Abstract and model justification: The load-bearing assumption that strong disagreement between depth+odometry and flow photometric errors exactly identifies moving-object pixels is stated without a full derivation or sensitivity analysis. This correspondence is justified only by 'theoretical considerations and experimental evidences,' yet it is vulnerable to common violations (illumination variation, specular surfaces, textureless regions) that can produce disagreement on static geometry; if inaccurate, the dynamic apportionment systematically misallocates gradients and the claimed equitable cooperation mechanism is undermined.
- [Hybrid loss definition] Hybrid loss and distribution model: The error-distribution parameters are fitted to the training data on which the networks are also trained, creating a circularity where the rebalancing signal is defined in terms of quantities derived from the networks being optimized. The manuscript must demonstrate that this fitting procedure is stable, does not collapse to trivial solutions, and yields gains beyond what a fixed or non-circular mask would achieve; otherwise the central claim of improved cooperation via dynamic adaptation rests on an internally dependent construction.
- [Comparative evaluation] Experimental evaluation: While results on KITTI and Cityscapes are reported as state-of-the-art or comparable, the manuscript lacks ablations that isolate the contribution of the gradient-rebalancing component from other loss terms or training choices. Without such controls, observed improvements cannot be confidently attributed to the proposed equitable-learning mechanism rather than incidental regularization.
minor comments (2)
- [Method] Notation for the photometric error distribution and its parameters should be introduced with explicit equations early in the method section to allow readers to verify the claimed parameter-free or distribution-based properties.
- [Experiments] Figure captions and table legends could more clearly indicate which metrics are depth, odometry, or flow, and whether reported numbers are means or medians across sequences.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major point below and describe the revisions we will incorporate to strengthen the manuscript.
- Referee: [Abstract / model justification] Abstract and model justification: The load-bearing assumption that strong disagreement between depth+odometry and flow photometric errors exactly identifies moving-object pixels is stated without a full derivation or sensitivity analysis. This correspondence is justified only by 'theoretical considerations and experimental evidences,' yet it is vulnerable to common violations (illumination variation, specular surfaces, textureless regions) that can produce disagreement on static geometry; if inaccurate, the dynamic apportionment systematically misallocates gradients and the claimed equitable cooperation mechanism is undermined.
  Authors: We agree that the current presentation would benefit from a fuller derivation and explicit sensitivity analysis. In the revised manuscript we will expand the theoretical justification section with a step-by-step derivation of the photometric-error disagreement model under the self-supervised photometric consistency assumption. We will also add a dedicated sensitivity study (including controlled perturbations for illumination changes, specular highlights, and textureless regions) to quantify how often such violations produce false-positive disagreements on static geometry and to show that the resulting gradient rebalancing remains beneficial overall.
  Revision: yes
- Referee: [Hybrid loss definition] Hybrid loss and distribution model: The error-distribution parameters are fitted to the training data on which the networks are also trained, creating a circularity where the rebalancing signal is defined in terms of quantities derived from the networks being optimized. The manuscript must demonstrate that this fitting procedure is stable, does not collapse to trivial solutions, and yields gains beyond what a fixed or non-circular mask would achieve; otherwise the central claim of improved cooperation via dynamic adaptation rests on an internally dependent construction.
  Authors: We acknowledge the circularity concern. In the revision we will add experiments that (i) refit the distribution parameters on a held-out validation split and compare performance to the original training-set fitting, (ii) monitor the fitted parameters across epochs to demonstrate stability and absence of collapse to trivial masks, and (iii) directly compare the full dynamic hybrid loss against both a fixed-mask baseline and a non-circular (pre-computed) mask variant. These controls will clarify whether the observed gains derive from the adaptive rebalancing rather than incidental regularization.
  Revision: yes
- Referee: [Comparative evaluation] Experimental evaluation: While results on KITTI and Cityscapes are reported as state-of-the-art or comparable, the manuscript lacks ablations that isolate the contribution of the gradient-rebalancing component from other loss terms or training choices. Without such controls, observed improvements cannot be confidently attributed to the proposed equitable-learning mechanism rather than incidental regularization.
  Authors: We agree that stronger isolation of the rebalancing contribution is needed. The revised manuscript will include a new ablation section containing: (a) CoopNet with the hybrid loss replaced by a standard photometric loss, (b) CoopNet with dynamic rebalancing disabled (fixed equal weighting), and (c) CoopNet augmented with alternative regularization terms that do not rely on error-distribution modeling. Quantitative comparisons on KITTI and Cityscapes will be reported to attribute performance differences specifically to the proposed gradient-apportionment mechanism.
  Revision: yes
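The stability monitoring promised in the second response, item (ii), could look something like the sketch below. Everything in it is an assumption made for illustration: the function name, the use of the per-epoch fraction of down-weighted pixels as the monitored quantity, and the two tolerances. The rebuttal does not say which quantity would be tracked or how collapse would be defined.

```python
def check_mask_stability(masked_fraction_per_epoch,
                         collapse_eps=0.01, drift_tol=0.05, window=3):
    """Flag trivial or unstable masks from a per-epoch history.

    masked_fraction_per_epoch: fraction of pixels down-weighted in each epoch
    (hypothetical monitored quantity, values in [0, 1]).
    """
    history = list(masked_fraction_per_epoch)
    if not history:
        raise ValueError("need at least one epoch of history")
    latest = history[-1]
    # Collapse: the mask discards almost nothing or almost everything.
    collapsed = latest < collapse_eps or latest > 1.0 - collapse_eps
    # Drift: the fraction is still moving over the last `window` epochs.
    recent = history[-window:]
    drifting = (max(recent) - min(recent)) > drift_tol
    return {"collapsed": collapsed, "drifting": drifting}

stable_run = check_mask_stability([0.30, 0.12, 0.08, 0.07, 0.07])
collapsed_run = check_mask_stability([0.20, 0.10, 0.0])
```

A run that settles at a small but nonzero masked fraction passes both checks, while a run whose mask shrinks to nothing is flagged as collapsed.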
Circularity Check
No circularity: derivation remains self-contained
The paper introduces a hybrid loss whose distribution model is presented as an explicit modeling assumption (pixels of moving objects correspond to strong disagreement between depth+odometry and flow reconstructions), justified separately by theoretical arguments and experimental evidence rather than by fitting parameters to the training outputs themselves. The gradient rebalancing is then derived directly from this loss during training; no equation reduces the rebalancing signal to a fitted parameter or to a self-citation chain that would make the claimed improvement tautological. The central claim therefore retains independent content beyond its inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- distribution model parameters
axioms (1)
- domain assumption: Pixels from moving objects correspond to locations where the depth+odometry and optical-flow reconstructions strongly disagree.