pith. machine review for the scientific record.

arxiv: 2604.22686 · v3 · submitted 2026-04-24 · 💻 cs.CV

Recognition: 2 theorem links


SS3D: End2End Self-Supervised 3D from Web Videos


Pith reviewed 2026-05-14 21:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised 3D · monocular depth estimation · ego-motion · camera intrinsics · web video pretraining · structure from motion · zero-shot transfer · feed-forward 3D

The pith

Pretraining a single feed-forward network on filtered web videos enables joint monocular estimation of depth, ego-motion, and intrinsics with strong zero-shot transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SS3D, an end-to-end self-supervised pipeline that pretrains a model on roughly 100 million frames drawn from YouTube-8M to learn 3D estimation directly from monocular video. It jointly outputs depth, camera motion, and intrinsic parameters in one forward pass by scaling structure-from-motion supervision to unconstrained web data. A multi-view signal proxy filters videos and orders them into a curriculum to create usable training signals despite sparse multi-view content and high data heterogeneity. An intrinsics-first two-stage schedule and a single-checkpoint evaluation protocol keep the joint learning stable. If successful, this approach shows that large unlabeled video collections can produce general 3D perception models that transfer across domains and improve when fine-tuned on target datasets.
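The pipeline shape this describes is simple to state: score each clip for usable multi-view signal, drop the weak ones, and order the rest from strongest to weakest signal before training. A minimal sketch of that control flow; every name here (`Clip`, `mvs`, `tau`) is an illustrative assumption, not the paper's API:

```python
# Sketch of the filter-then-curriculum stage described above.
# Threshold and ordering policy are assumptions, not the paper's values.
import random
from dataclasses import dataclass

@dataclass
class Clip:
    video_id: str
    mvs: float  # multi-view signal proxy score; higher = more usable parallax

def build_curriculum(clips: list[Clip], tau: float = 0.5) -> list[Clip]:
    """Drop clips with weak multi-view signal, then present the survivors
    from easiest (strongest signal) to hardest."""
    kept = [c for c in clips if c.mvs >= tau]
    return sorted(kept, key=lambda c: c.mvs, reverse=True)

corpus = [Clip(f"yt{i:04d}", random.random()) for i in range(1000)]
schedule = build_curriculum(corpus)
print(f"{len(schedule)}/{len(corpus)} clips kept; training starts on high-signal clips")
```

The intrinsics-first schedule is a separate knob on top of this: the same ordered stream is consumed in two stages, with intrinsics stabilized before full joint optimization.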

Core claim

By combining a multi-view signal proxy for video filtering and curriculum sampling with an intrinsics-first training schedule, a feed-forward model can be trained end-to-end on web-scale video to predict depth, ego-motion, and intrinsics together; the resulting checkpoint exhibits strong cross-domain zero-shot performance and outperforms prior self-supervised baselines after fine-tuning.

What carries the argument

The multi-view signal proxy (MVS), which filters unconstrained web videos and performs curriculum sampling to supply stable SfM supervision signals despite weak multi-view observability.
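The page does not reproduce the proxy itself, so the following is only a guess at the kind of statistic such a proxy could compute: residual parallax between matched keypoints after discounting a global image shift, since a pure camera pan carries almost no depth signal. Everything here is an assumption for illustration:

```python
# A crude stand-in for a multi-view signal score, assuming matched
# keypoints between two frames are available. Not the paper's MVS.
import numpy as np

def mvs_proxy(pts_a: np.ndarray, pts_b: np.ndarray, thresh_px: float = 2.0) -> float:
    """pts_a, pts_b: (N, 2) matched pixel coordinates in two frames.
    Returns the fraction of matches showing depth-induced parallax."""
    flow = pts_b - pts_a                       # apparent per-point motion
    residual = flow - np.median(flow, axis=0)  # discount a global pan/shift
    parallax = np.linalg.norm(residual, axis=1)
    return float((parallax > thresh_px).mean())

rng = np.random.default_rng(0)
pts = rng.uniform(0, 1920, size=(500, 2))
pan_only = pts + np.array([15.0, 0.0])                # pure pan: no usable geometry
translating = pan_only + rng.normal(0, 5, (500, 2))   # parallax-like scatter
print(mvs_proxy(pts, pan_only), mvs_proxy(pts, translating))  # ~0.0 vs ~0.9
```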

If this is right

  • A single checkpoint can be used for coherent end-to-end 3D estimation without separate heads or post-processing steps (a backprojection sketch follows this list).
  • Zero-shot transfer to new video domains becomes competitive with supervised methods trained on those domains.
  • Fine-tuning on limited labeled data yields higher accuracy than fine-tuning from prior self-supervised checkpoints.
  • Joint prediction of depth, ego-motion, and intrinsics remains stable under a unified evaluation protocol.
  • The released pretrained model can serve as a drop-in starting point for downstream monocular 3D tasks.
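The first point above is mostly geometry: once one forward pass yields depth and intrinsics, a camera-frame point cloud follows by backprojection, X = D(u, v) · K⁻¹ · [u, v, 1]ᵀ, and predicted ego-motion chains successive frames into a single reconstruction. A minimal numpy sketch with illustrative values:

```python
# Backprojection: predicted depth + predicted intrinsics -> point cloud.
import numpy as np

def backproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """depth: (H, W) metric depth map; K: (3, 3) intrinsics.
    Returns (H*W, 3) points in the camera frame."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T       # rays at unit depth (z = 1)
    return rays * depth.reshape(-1, 1)    # scale each ray by its depth

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
depth = np.full((480, 640), 2.0)          # a flat wall 2 m away
points = backproject(depth, K)
print(points.shape, points[:, 2].mean())  # (307200, 3), ~2.0
```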

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same filtering and curriculum logic could be applied to other large unlabeled video sources such as social media or surveillance archives to further scale pretraining.
  • The learned 3D priors might transfer to related tasks like visual odometry or novel-view synthesis without additional supervision.
  • If the MVS proxy generalizes, future work could test whether even larger corpora produce monotonic gains in cross-domain robustness.
  • Robotics and AR systems could adopt the released checkpoint for real-time monocular 3D without domain-specific retraining.

Load-bearing premise

The multi-view signal proxy can consistently select and order web videos so that they supply enough consistent multi-view geometry for stable self-supervised training.

What would settle it

Run the same model and training schedule on the unfiltered YouTube-8M corpus without MVS selection and measure whether zero-shot depth and ego-motion accuracy on held-out domains falls to the level of prior self-supervised baselines or training diverges.
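The comparison would be judged with the metric the paper's tables already use, absolute relative error: AbsRel = mean(|d_pred - d_gt| / d_gt), lower is better. A toy sketch of the decision rule, with synthetic numbers standing in for the two checkpoints:

```python
# Decision rule for the falsification test above: same model, same
# schedule, filtered vs. unfiltered corpus, compared by zero-shot AbsRel.
import numpy as np

def abs_rel(pred: np.ndarray, gt: np.ndarray) -> float:
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

gt = np.random.default_rng(1).uniform(1.0, 80.0, 10_000)   # held-out depths
pred_filtered = gt * (1 + np.random.default_rng(2).normal(0, 0.10, gt.shape))
pred_unfiltered = gt * (1 + np.random.default_rng(3).normal(0, 0.25, gt.shape))
print(f"MVS-filtered: {abs_rel(pred_filtered, gt):.3f}")
print(f"unfiltered:   {abs_rel(pred_unfiltered, gt):.3f}")
# The MVS claim survives only if the filtered run is clearly better at equal scale.
```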

Figures

Figures reproduced from arXiv: 2604.22686 by Antoine Manzanera, David Filliat, Gianni Franchi, Marwane Hariat.

Figure 1
Figure 1. SS3D reconstructs both the exterior (left) and interior (right) of the Sagrada Familia from two casual videos, one recorded outside and one inside (see Appendix E for links to both videos). The reconstruction relies on self-supervised estimates of depth, camera pose, and intrinsics. Depth maps and camera trajectories are visualized, with each camera shown along its corresponding viewpoint. …
Figure 2
Figure 2. Overview of SS3D. (1) Train self-supervised, sub-domain experts on multi-domain web video. (2) Distill expert predictions into a single deployable student model. (3) At inference, the student predicts depth, pose, and intrinsics, which together induce a 3D reconstruction (e.g., a point cloud) displayed on the far right. For illustration, “domains” are schematic and may not match the clustering used in training.
Figure 3
Figure 3. Training dynamics for indoor and outdoor contexts: naively scaling to more data yields few to no gains. Experiments use distillation of two experts, indoor and outdoor. [The source caption embeds an ablation table with rows B1, B2, +U, +YT, +D, and Full over the MVS+Curr, L_distill, YTB8M, and Unified-3D components, reporting Abs Rel ↓ on KITTI and NYU in zero-shot (ZS) and fine-tuned (FT) settings; the table is truncated at source.]
Figure 4
Figure 4. Qualitative results on KITTI after finetuning. Each row shows (left to right): input, point cloud, depth.
Original abstract

We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.
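For readers new to SfM-style self-supervision, the training signal is photometric: predicted depth, ego-motion, and intrinsics together warp a source frame onto the target frame, and the residual between warp and target supervises all three heads at once (in the spirit of references [13] and [67] below; the paper's exact loss terms are not shown on this page). A minimal PyTorch sketch:

```python
# Minimal photometric reprojection loss, the core of SfM self-supervision.
# Shapes and the plain L1 photometric penalty are illustrative simplifications.
import torch
import torch.nn.functional as F

def photometric_loss(I_t, I_s, depth_t, T_ts, K):
    """I_t, I_s: (1, 3, H, W) target/source frames; depth_t: (1, 1, H, W);
    T_ts: (4, 4) target-to-source pose; K: (3, 3) float intrinsics."""
    _, _, H, W = I_t.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], 0).float().reshape(3, -1)
    cam = (torch.linalg.inv(K) @ pix) * depth_t.reshape(1, -1)  # backproject
    cam_h = torch.cat([cam, torch.ones(1, H * W)], 0)           # homogeneous
    proj = K @ (T_ts @ cam_h)[:3]                               # into source camera
    uv = proj[:2] / proj[2].clamp(min=1e-6)                     # perspective divide
    grid = torch.stack([2 * uv[0] / (W - 1) - 1,                # normalize to [-1, 1]
                        2 * uv[1] / (H - 1) - 1], -1).reshape(1, H, W, 2)
    I_warp = F.grid_sample(I_s, grid, align_corners=True)       # sample source
    return (I_warp - I_t).abs().mean()                          # L1 residual
```

In SS3D's setting K itself is a network output, which is why the intrinsics-first schedule matters: a bad K corrupts every warp and therefore every gradient.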

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SS3D, a self-supervised pretraining pipeline for end-to-end monocular 3D estimation that jointly predicts depth, ego-motion, and camera intrinsics from video. It scales SfM supervision to ~100M frames from filtered YouTube-8M web videos using a multi-view signal proxy (MVS) for filtering and curriculum sampling, an intrinsics-first two-stage training schedule, and expert distillation into a single student model. The central claim is that this yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines, with code and checkpoint released.

Significance. If the empirical claims hold under standard controls, the work would demonstrate a practical route to web-scale self-supervision for feed-forward 3D models, reducing dependence on curated multi-view datasets and improving generalization. The release of reproducible code and a single checkpoint is a clear strength that supports follow-up research.

major comments (3)
  1. [Method (MVS proxy description)] The MVS filtering and curriculum-sampling procedure (described in the method section) is load-bearing for the scaling claim, yet the manuscript provides no quantitative validation that MVS scores correlate with downstream SfM stability metrics such as successful reconstruction rate, median reprojection error, or depth consistency on held-out web videos. Without an ablation comparing filtered vs. unfiltered YouTube-8M subsets or a correlation plot, it remains possible that reported gains arise primarily from dataset scale rather than the proposed proxy. (A minimal version of the requested correlation check is sketched after this report.)
  2. [Training schedule and distillation] The two-stage intrinsics schedule and expert-distillation step presuppose that the MVS-filtered signal is already sufficiently clean for joint end-to-end optimization. The paper should report an ablation (e.g., in the experiments section) that isolates the contribution of each component by training a single-stage baseline on the same filtered corpus and measuring zero-shot transfer degradation.
  3. [Experiments (zero-shot evaluation)] The zero-shot cross-domain transfer results rely on a unified single-checkpoint evaluation protocol, but the manuscript does not detail how domain-specific intrinsics or motion statistics are handled at test time. A concrete failure case or per-domain breakdown would be needed to confirm that the reported improvements are not artifacts of evaluation protocol differences with prior baselines.
minor comments (2)
  1. [Method] Notation for the MVS proxy score should be defined explicitly with an equation rather than described only in prose, to allow readers to reproduce the filtering thresholds.
  2. [Experiments] The abstract states '~100M frames after filtering'; the exact filtering ratio and final dataset statistics should appear in a table in the experiments section for transparency.
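What major comment 1 asks for is cheap to state: a rank correlation between MVS scores and a downstream SfM stability metric on held-out clips. A sketch with synthetic numbers (the real test would use the paper's corpus and its reconstruction metrics):

```python
# Rank-correlation check requested in major comment 1, on synthetic data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
mvs_scores = rng.uniform(0, 1, 200)                    # proxy score per clip
# Hypothetical downstream outcome: per-clip SfM reconstruction success,
# noisily tied to the proxy only if the proxy actually works.
success_rate = np.clip(mvs_scores + rng.normal(0, 0.2, 200), 0, 1)

rho, p = spearmanr(mvs_scores, success_rate)
print(f"Spearman rho = {rho:.2f} (p = {p:.1e})")
# Strong positive rho supports the proxy; rho near zero would point to
# dataset scale, not filtering, as the source of the reported gains.
```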

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed review and positive assessment of the work's potential impact. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to include additional ablations, clarifications, and analyses to strengthen the claims.

Point-by-point responses
  1. Referee: The MVS filtering and curriculum-sampling procedure (described in the method section) is load-bearing for the scaling claim, yet the manuscript provides no quantitative validation that MVS scores correlate with downstream SfM stability metrics such as successful reconstruction rate, median reprojection error, or depth consistency on held-out web videos. Without an ablation comparing filtered vs. unfiltered YouTube-8M subsets or a correlation plot, it remains possible that reported gains arise primarily from dataset scale rather than the proposed proxy.

    Authors: We agree that quantitative validation of the MVS proxy would strengthen the method section. In the revised manuscript, we include a new subsection with correlation analysis between MVS scores and SfM reconstruction metrics on held-out videos, as well as performance comparison on filtered versus unfiltered data subsets. These additions show that the filtering improves stability beyond mere scale. revision: yes

  2. Referee: The two-stage intrinsics schedule and expert-distillation step presuppose that the MVS-filtered signal is already sufficiently clean for joint end-to-end optimization. The paper should report an ablation (e.g., in the experiments section) that isolates the contribution of each component by training a single-stage baseline on the same filtered corpus and measuring zero-shot transfer degradation.

    Authors: We acknowledge the need for isolating the contributions of the training components. We have added an ablation study in the experiments section comparing the full two-stage schedule with distillation against a single-stage baseline trained on the same filtered corpus. The results confirm the benefits of each component for zero-shot transfer performance. revision: yes

  3. Referee: The zero-shot cross-domain transfer results rely on a unified single-checkpoint evaluation protocol, but the manuscript does not detail how domain-specific intrinsics or motion statistics are handled at test time. A concrete failure case or per-domain breakdown would be needed to confirm that the reported improvements are not artifacts of evaluation protocol differences with prior baselines.

    Authors: We clarify that at test time, the model relies on its own predicted intrinsics and ego-motion without domain-specific priors. In the revised manuscript, we have added a per-domain breakdown of the zero-shot results and included a discussion of representative failure cases to demonstrate the robustness of the unified evaluation protocol. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline relies on external SfM and verifiable pretraining results.

Full rationale

The paper describes an end-to-end self-supervised pipeline that filters web videos via an MVS proxy, applies SfM-based supervision, and reports empirical zero-shot transfer and fine-tuning gains on YouTube-8M. No equation or claim reduces a reported prediction to a fitted parameter or self-citation by construction; the method uses external SfM tools, releases code and checkpoints, and evaluates on held-out domains. The central claims are therefore falsifiable against independent benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the standard domain assumption that SfM supplies usable self-supervision and on the newly introduced MVS proxy whose effectiveness is not independently verified outside the paper.

free parameters (1)
  • MVS filtering and curriculum thresholds
    Parameters that decide which web clips are retained and in what order they are presented during training.
axioms (1)
  • domain assumption: Structure-from-motion provides reliable depth and ego-motion signals even on unconstrained web video after filtering
    Invoked as the foundation of the entire self-supervision pipeline.
invented entities (1)
  • multi-view signal proxy (MVS): no independent evidence
    purpose: Filtering and curriculum sampling of web videos to stabilize self-supervision
    New component introduced to address weak observability and heterogeneity; no external falsifiable test is described.

pith-pipeline@v0.9.0 · 5465 in / 1427 out tokens · 63490 ms · 2026-05-14T21:23:03.555977+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 8 internal anchors

  1. YouTube-8M: A Large-Scale Video Classification Benchmark
     Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., Vijayanarasimhan, S.: Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016)

  2. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
     Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)

  3. AdaBins: Depth Estimation Using Adaptive Bins
     Bhat, S.F., Alhashim, I., Wonka, P.: Adabins: Depth estimation using adaptive bins. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4009–4018 (2021)

  4. ZoeDepth: Zero-Shot Transfer by Combining Relative and Metric Depth
     Bhat, S.F., Birkl, R., Wofk, D., Wonka, P., Müller, M.: Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288 (2023)

  5. Unsupervised Scale-Consistent Depth and Ego-Motion Learning from Monocular Video
     Bian, J., Li, Z., Wang, N., Zhan, H., Shen, C., Cheng, M.M., Reid, I.: Unsupervised scale-consistent depth and ego-motion learning from monocular video. Advances in Neural Information Processing Systems 32 (2019)

  6. MiDaS v3.1 – A Model Zoo for Robust Monocular Relative Depth Estimation
     Birkl, R., Wofk, D., Müller, M.: MiDaS v3.1 – a model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460 (2023)

  7. A Naturalistic Open Source Movie for Optical Flow Evaluation
     Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: European Conference on Computer Vision. pp. 611–

  8. Improving Monocular Depth Estimation by Leveraging Structural Awareness and Complementary Datasets
     Chen, T., An, S., Zhang, Y., Ma, C., Wang, H., Guo, X., Zheng, W.: Improving monocular depth estimation by leveraging structural awareness and complementary datasets. In: European Conference on Computer Vision. pp. 90–108. Springer (2020)

  9. Self-Supervised Learning with Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera
     Chen, Y., Schmid, C., Sminchisescu, C.: Self-supervised learning with geometric constraints in monocular video: Connecting flow, depth, and camera. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7063–7072 (2019)

  10. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
      Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  11. Deeper into Self-Supervised Monocular Indoor Depth Estimation
      Fan, C., Yin, Z., Li, Y., Zhang, F.: Deeper into self-supervised monocular indoor depth estimation. arXiv preprint arXiv:2312.01283 (2023)

  12. Vision Meets Robotics: The KITTI Dataset
      Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32(11), 1231–1237 (2013)

  13. Digging into Self-Supervised Monocular Depth Estimation
      Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J.: Digging into self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3828–3838 (2019)

  14. Depth from Videos in the Wild: Unsupervised Monocular Depth Learning from Unknown Cameras
      Gordon, A., Li, H., Jonschkowski, R., Angelova, A.: Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8977–8986 (2019)

  15. 3D Packing for Self-Supervised Monocular Depth Estimation
      Guizilini, V., Ambrus, R., Pillai, S., Raventos, A., Gaidon, A.: 3d packing for self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2485–2494 (2020)

  16. Semantically-Guided Representation Learning for Self-Supervised Monocular Depth
      Guizilini, V., Hou, R., Li, J., Ambrus, R., Gaidon, A.: Semantically-guided representation learning for self-supervised monocular depth. arXiv preprint arXiv:2002.12319 (2020)

  17. Rebalancing Gradient to Improve Self-Supervised Co-Training of Depth, Odometry and Optical Flow Predictions
      Hariat, M., Manzanera, A., Filliat, D.: Rebalancing gradient to improve self-supervised co-training of depth, odometry and optical flow predictions. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1267–1276 (2023)

  18. Improved Monocular Depth Prediction Using Distance Transform over Pre-Semantic Contours with Self-Supervised Neural Networks
      Hariat, M., Manzanera, A., Filliat, D.: Improved monocular depth prediction using distance transform over pre-semantic contours with self-supervised neural networks. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 21868–21879 (2025)

  19. Masked Autoencoders Are Scalable Vision Learners
      He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16000–16009 (2022)

  20. RA-Depth: Resolution Adaptive Self-Supervised Monocular Depth Estimation
      He, M., Hui, L., Bian, Y., Ren, J., Xie, J., Yang, J.: RA-Depth: Resolution adaptive self-supervised monocular depth estimation. In: European Conference on Computer Vision. pp. 565–581. Springer (2022)

  21. Multiview Photometric Stereo
      Hernandez, C., Vogiatzis, G., Cipolla, R.: Multiview photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(3), 548–554 (2008)

  22. Self-Supervised Depth Estimation Leveraging Global Perception and Geometric Smoothness Using On-Board Videos
      Jia, S., Pei, X., Yao, W., Wong, S.C.: Self-supervised depth estimation leveraging global perception and geometric smoothness using on-board videos. arXiv preprint arXiv:2106.03505 (2021)

  23. MapAnything: Universal Feed-Forward Metric 3D Reconstruction
      Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., et al.: Mapanything: Universal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414 (2025)

  24. Adam: A Method for Stochastic Optimization
      Kingma, D.P.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  25. Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance
      Klingner, M., Termöhlen, J.A., Mikolajczyk, J., Fingscheidt, T.: Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. In: European Conference on Computer Vision. pp. 582–600. Springer (2020)

  26. Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency
      Lee, S., Im, S., Lin, S., Kweon, I.S.: Learning monocular depth in dynamic scenes via instance-aware projection consistency. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 1863–1872 (2021)

  27. StructDepth: Leveraging the Structural Regularities for Self-Supervised Indoor Depth Estimation
      Li, B., Huang, Y., Liu, Z., Zou, D., Yu, W.: Structdepth: Leveraging the structural regularities for self-supervised indoor depth estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12663–12673 (2021)

  28. Unsupervised Monocular Depth Learning in Dynamic Scenes
      Li, H., Gordon, A., Zhao, H., Casser, V., Angelova, A.: Unsupervised monocular depth learning in dynamic scenes. In: Conference on Robot Learning. pp. 1908–

  29. Learning Depth via Leveraging Semantics: Self-Supervised Monocular Depth Estimation with Both Implicit and Explicit Semantic Guidance
      Li, R., Xue, D., Su, S., He, X., Mao, Q., Zhu, Y., Sun, J., Zhang, Y.: Learning depth via leveraging semantics: Self-supervised monocular depth estimation with both implicit and explicit semantic guidance. Pattern Recognition 137, 109297 (2023)

  30. MonoIndoor++: Towards Better Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments
      Li, R., Ji, P., Xu, Y., Bhanu, B.: Monoindoor++: Towards better practice of self-supervised monocular depth estimation for indoor environments. IEEE Transactions on Circuits and Systems for Video Technology 33(2), 830–846 (2022)

  31. Depth Anything 3: Recovering the Visual Space from Any Views
      Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

  32. HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation
      Lyu, X., Liu, L., Wang, M., Kong, X., Liu, L., Liu, Y., Chen, X., Yuan, Y.: HR-Depth: High resolution self-supervised monocular depth estimation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 2294–2301 (2021)

  33. DINOv2: Learning Robust Visual Features without Supervision
      Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  34. PyTorch: An Imperative Style, High-Performance Deep Learning Library
      Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019)

  35. UniDepth: Universal Monocular Metric Depth Estimation
      Piccinelli, L., Yang, Y.H., Sakaridis, C., Segu, M., Li, S., Van Gool, L., Yu, F.: Unidepth: Universal monocular metric depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10106–10116 (2024)

  36. Does It Work Outside This Benchmark? Introducing the Rigid Depth Constructor Tool
      Pinard, C., Manzanera, A.: Does it work outside this benchmark? Introducing the rigid depth constructor tool: Depth validation dataset construction in rigid scenes for the masses. Multimedia Tools and Applications 82(27), 41641–41667 (2023)

  37. On the Uncertainty of Self-Supervised Monocular Depth Estimation
      Poggi, M., Aleotti, F., Tosi, F., Mattoccia, S.: On the uncertainty of self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3227–3237 (2020)

  38. Learning Transferable Visual Models from Natural Language Supervision
      Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)

  39. Deep Learning-Based Depth Estimation Methods from Monocular Image and Videos: A Comprehensive Survey
      Rajapaksha, U., Sohel, F., Laga, H., Diepeveen, D., Bennamoun, M.: Deep learning-based depth estimation methods from monocular image and videos: A comprehensive survey. ACM Comput. Surv. 56(12) (Oct 2024). https://doi.org/10.1145/3677327

  40. Perception of Shape from Shading
      Ramachandran, V.S.: Perception of shape from shading. Nature 331(6152), 163–166 (1988)

  41. Vision Transformers for Dense Prediction
      Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12179–12188 (2021)

  42. Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer
      Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(3), 1623–1637 (2020)

  43. Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation
      Ranjan, A., Jampani, V., Balles, L., Kim, K., Sun, D., Wulff, J., Black, M.J.: Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12240–12249 (2019)

  44. A Systematic Literature Review on Deep Learning-Based Depth Estimation in Computer Vision
      Rohan, A., Hasan, M.J., Petrovski, A.: A systematic literature review on deep learning-based depth estimation in computer vision (2025), https://arxiv.org/abs/2501.05147

  45. ImageNet Large Scale Visual Recognition Challenge
      Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3), 211–252 (2015)

  46. Boosting Monocular Depth with Panoptic Segmentation Maps
      Saeedan, F., Roth, S.: Boosting monocular depth with panoptic segmentation maps. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 3853–3862 (2021)

  47. Structure-from-Motion Revisited
      Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4104–4113 (2016)

  48. Multi-Task Learning as Multi-Objective Optimization
      Sener, O., Koltun, V.: Multi-task learning as multi-objective optimization. Advances in Neural Information Processing Systems 31 (2018)

  49. Feature-Metric Loss for Self-Supervised Learning of Depth and Egomotion
      Shu, C., Yu, K., Duan, Z., Yang, K.: Feature-metric loss for self-supervised learning of depth and egomotion. In: European Conference on Computer Vision. pp. 572–

  50. Indoor Segmentation and Support Inference from RGBD Images
      Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: European Conference on Computer Vision. pp. 746–

  51. A Benchmark for the Evaluation of RGB-D SLAM Systems
      Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of rgb-d slam systems. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 573–580. IEEE (2012)

  52. A New Perspective [on] Shape-from-Shading
      Tankus, Sochen, Yeshurun: A new perspective [on] shape-from-shading. In: Proceedings of IEEE International Conference on Computer Vision. pp. 862–869. IEEE (2003)

  53. VGGT: Visual Geometry Grounded Transformer
      Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)

  54. DUSt3R: Geometric 3D Vision Made Easy
      Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geometric 3d vision made easy. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20697–20709 (2024)

  55. EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones
      Wang, Y., Yue, Y., Lu, R., Liu, T., Zhong, Z., Song, S., Huang, G.: Efficienttrain: Exploring generalized curriculum learning for training visual backbones. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5852–5864 (2023)

  56. The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth
      Watson, J., Mac Aodha, O., Prisacariu, V., Brostow, G., Firman, M.: The temporal opportunist: Self-supervised multi-frame monocular depth. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1164–1174 (2021)

  57. AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos
      Wimbauer, F., Chen, W., Muhle, D., Rupprecht, C., Cremers, D.: Anycam: Learning to recover camera poses and intrinsics from casual videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 16717–16727 (2025)

  58. MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts
      Xie, Z., Zhang, Y., Zhuang, C., Shi, Q., Liu, Z., Gu, J., Zhang, G.: Mode: A mixture-of-experts model with mutual distillation among the experts. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 16067–16075 (2024)

  59. Self-Supervised Monocular Depth Learning in Low-Texture Areas
      Xu, W., Zou, L., Wu, L., Fu, Z.: Self-supervised monocular depth learning in low-texture areas. Remote Sensing 13(9), 1673 (2021)

  60. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
      Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Unleashing the power of large-scale unlabeled data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10371–10381 (2024)

  61. GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose
      Yin, Z., Shi, J.: Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1983–1992 (2018)

  62. Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature Reconstruction
      Zhan, H., Garg, R., Weerasekera, C.S., Li, K., Agarwal, H., Reid, I.: Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 340–349 (2018)

  63. MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer
      Zhao, C., Zhang, Y., Poggi, M., Tosi, F., Guo, X., Zhu, Z., Huang, G., Tang, Y., Mattoccia, S.: Monovit: Self-supervised monocular depth estimation with a vision transformer. In: 2022 International Conference on 3D Vision (3DV). pp. 668–678. IEEE (2022)

  64. Self-Supervised Monocular Depth Estimation with Internal Feature Fusion
      Zhou, H., Greenwood, D., Taylor, S.: Self-supervised monocular depth estimation with internal feature fusion. arXiv preprint arXiv:2110.09482 (2021)

  65. Self-Supervised Monocular Depth Estimation with Internal Feature Fusion (BMVC 2021)
      Zhou, H., Greenwood, D., Taylor, S.: Self-supervised monocular depth estimation with internal feature fusion. In: British Machine Vision Conference (BMVC) (2021)

  66. Moving Indoor: Unsupervised Video Depth Learning in Challenging Environments
      Zhou, J., Wang, Y., Qin, K., Zeng, W.: Moving indoor: Unsupervised video depth learning in challenging environments. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8618–8627 (2019)

  67. Unsupervised Learning of Depth and Ego-Motion from Video
      Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1851–1858 (2017)

  68. The Edge of Depth: Explicit Constraints between Segmentation and Depth
      Zhu, S., Brazil, G., Liu, X.: The edge of depth: Explicit constraints between segmentation and depth. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13116–13125 (2020)

  69. DF-Net: Unsupervised Joint Learning of Depth and Flow Using Cross-Task Consistency
      Zou, Y., Luo, Z., Huang, J.B.: Df-net: Unsupervised joint learning of depth and flow using cross-task consistency. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 36–53 (2018)