arxiv: 2604.22686 · v3 · submitted 2026-04-24 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

SS3D: End2End Self-Supervised 3D from Web Videos

Marwane Hariat , Gianni Franchi , David Filliat , Antoine Manzanera

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords self-supervised 3Dmonocular depth estimationego-motioncamera intrinsicsweb video pretrainingstructure from motionzero-shot transferfeed-forward 3D

0 comments

The pith

Pretraining a single feed-forward network on filtered web videos enables joint monocular estimation of depth, ego-motion, and intrinsics with strong zero-shot transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SS3D, an end-to-end self-supervised pipeline that pretrains a model on roughly 100 million frames drawn from YouTube-8M to learn 3D estimation directly from monocular video. It jointly outputs depth, camera motion, and intrinsic parameters in one forward pass by scaling structure-from-motion supervision to unconstrained web data. A multi-view signal proxy filters videos and orders them into a curriculum to create usable training signals despite sparse multi-view content and high data heterogeneity. An intrinsics-first two-stage schedule and a single-checkpoint evaluation protocol keep the joint learning stable. If successful, this approach shows that large unlabeled video collections can produce general 3D perception models that transfer across domains and improve when fine-tuned on target datasets.

Core claim

By combining a multi-view signal proxy for video filtering and curriculum sampling with an intrinsics-first training schedule, a feed-forward model can be trained end-to-end on web-scale video to predict depth, ego-motion, and intrinsics together; the resulting checkpoint exhibits strong cross-domain zero-shot performance and outperforms prior self-supervised baselines after fine-tuning.

What carries the argument

The multi-view signal proxy (MVS), which filters unconstrained web videos and performs curriculum sampling to supply stable SfM supervision signals despite weak multi-view observability.

If this is right

A single checkpoint can be used for coherent end-to-end 3D estimation without separate heads or post-processing steps.
Zero-shot transfer to new video domains becomes competitive with supervised methods trained on those domains.
Fine-tuning on limited labeled data yields higher accuracy than fine-tuning from prior self-supervised checkpoints.
Joint prediction of depth, ego-motion, and intrinsics remains stable under a unified evaluation protocol.
The released pretrained model can serve as a drop-in starting point for downstream monocular 3D tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same filtering and curriculum logic could be applied to other large unlabeled video sources such as social media or surveillance archives to further scale pretraining.
The learned 3D priors might transfer to related tasks like visual odometry or novel-view synthesis without additional supervision.
If the MVS proxy generalizes, future work could test whether even larger corpora produce monotonic gains in cross-domain robustness.
Robotics and AR systems could adopt the released checkpoint for real-time monocular 3D without domain-specific retraining.

Load-bearing premise

The multi-view signal proxy can consistently select and order web videos so that they supply enough consistent multi-view geometry for stable self-supervised training.

What would settle it

Run the same model and training schedule on the unfiltered YouTube-8M corpus without MVS selection and measure whether zero-shot depth and ego-motion accuracy on held-out domains falls to the level of prior self-supervised baselines or training diverges.

Figures

Figures reproduced from arXiv: 2604.22686 by Antoine Manzanera, David Filliat, Gianni Franchi, Marwane Hariat.

**Figure 1.** Figure 1: SS3D reconstructs both the exterior (left) and interior (right) of the Sagrada Familia from two casual videos: one recorded outside and one inside (see Appendix E for the link towards both videos). The reconstruction relies on self-supervised estimates of depth, camera pose, and intrinsics. Depth maps and camera trajectories are visualized, with each camera shown along its corresponding viewpoint. Thumbnai… view at source ↗

**Figure 2.** Figure 2: Overview of SS3D (1) Train self-supervised, sub-domain experts on multidomain web video. (2) Distill expert predictions into a single deployable student model. (3) At inference, the student predicts depth, pose, and intrinsics, which together induce a 3D reconstruction (e.g., point cloud) displayed on the far right. For illustration, “domains” are schematic and may not match the clustering used in trainin… view at source ↗

**Figure 3.** Figure 3: Training dynamics for indoor and outdoor contexts. We can see that naively scaling to more data yields few to no gains. Experiments done with distillation of two experts: indoor and outdoor. ID MVS+Curr Ldistill YTB8M Unified 3D KITTI Abs Rel ↓ NYU Abs Rel ↓ ZS FT ZS FT B1 – 0.082 – 0.115 B2 – 0.080 – 0.117 +U ✓ – 0.078 – 0.111 +YT ✓ ✓ 0.19 0.079 0.26 0.112 +D ✓ ✓ ✓ 0.101 0.072 0.125 0.098 Full ✓ ✓ ✓ ✓ 0.0… view at source ↗

**Figure 4.** Figure 4: Qualitative results on KITTI after finetuning. Each row shows (left to right): Input, Point Cloud, Depth. References 1. Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., Vijayanarasimhan, S.: Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016) 2. Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Rob… view at source ↗

read the original abstract

We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SS3D gives a concrete pipeline for scaling SfM self-supervision to web video via MVS filtering and distillation, but the filtering step is the part that still needs the most checking.

read the letter

The main thing to know is that this paper shows how to train a single model for depth, motion, and intrinsics on roughly 100 million frames from YouTube-8M by first using a multi-view signal proxy to filter and curriculum-sample the videos, then running a two-stage intrinsics schedule and distilling from an expert into the student. They claim this produces better zero-shot transfer and fine-tuning than earlier self-supervised baselines, and they release both code and the checkpoint. That combination of filtering, scheduling, and distillation inside one end-to-end loop is the clearest new piece relative to the baselines they cite. Releasing the artifacts is also useful; anyone who wants to test the model on their own data can do so directly. The approach is coherent on paper and addresses the real problems of weak multi-view and high heterogeneity in web video. The soft spot is exactly the one the stress-test note flags: the MVS proxy has to do the heavy lifting of identifying videos that actually give stable SfM signals. If those scores only weakly track real geometry quality or depth consistency, then the large corpus could still contain a lot of noisy supervision, and the reported gains might trace more to dataset scale or evaluation choices than to the two-stage schedule or distillation. Without seeing the ablations on the filtering thresholds or failure cases, it is hard to judge how much the method itself moves the needle. This is the kind of work that matters for people trying to pretrain 3D models for robotics or AR without labels. A reader who needs a starting checkpoint or ideas for handling messy video data will get something concrete from it. I would send it to peer review because the technical choices are spelled out, the artifacts are available for verification, and the scaling claim is testable even if the current evidence is still preliminary.

Referee Report

3 major / 2 minor

Summary. The paper introduces SS3D, a self-supervised pretraining pipeline for end-to-end monocular 3D estimation that jointly predicts depth, ego-motion, and camera intrinsics from video. It scales SfM supervision to ~100M frames from filtered YouTube-8M web videos using a multi-view signal proxy (MVS) for filtering and curriculum sampling, an intrinsics-first two-stage training schedule, and expert distillation into a single student model. The central claim is that this yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines, with code and checkpoint released.

Significance. If the empirical claims hold under standard controls, the work would demonstrate a practical route to web-scale self-supervision for feed-forward 3D models, reducing dependence on curated multi-view datasets and improving generalization. The release of reproducible code and a single checkpoint is a clear strength that supports follow-up research.

major comments (3)

[Method (MVS proxy description)] The MVS filtering and curriculum-sampling procedure (described in the method section) is load-bearing for the scaling claim, yet the manuscript provides no quantitative validation that MVS scores correlate with downstream SfM stability metrics such as successful reconstruction rate, median reprojection error, or depth consistency on held-out web videos. Without an ablation comparing filtered vs. unfiltered YouTube-8M subsets or a correlation plot, it remains possible that reported gains arise primarily from dataset scale rather than the proposed proxy.
[Training schedule and distillation] The two-stage intrinsics schedule and expert-distillation step presuppose that the MVS-filtered signal is already sufficiently clean for joint end-to-end optimization. The paper should report an ablation (e.g., in the experiments section) that isolates the contribution of each component by training a single-stage baseline on the same filtered corpus and measuring zero-shot transfer degradation.
[Experiments (zero-shot evaluation)] The zero-shot cross-domain transfer results rely on a unified single-checkpoint evaluation protocol, but the manuscript does not detail how domain-specific intrinsics or motion statistics are handled at test time. A concrete failure case or per-domain breakdown would be needed to confirm that the reported improvements are not artifacts of evaluation protocol differences with prior baselines.

minor comments (2)

[Method] Notation for the MVS proxy score should be defined explicitly with an equation rather than described only in prose, to allow readers to reproduce the filtering thresholds.
[Experiments] The abstract states '~100M frames after filtering'; the exact filtering ratio and final dataset statistics should appear in a table in the experiments section for transparency.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed review and positive assessment of the work's potential impact. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to include additional ablations, clarifications, and analyses to strengthen the claims.

read point-by-point responses

Referee: The MVS filtering and curriculum-sampling procedure (described in the method section) is load-bearing for the scaling claim, yet the manuscript provides no quantitative validation that MVS scores correlate with downstream SfM stability metrics such as successful reconstruction rate, median reprojection error, or depth consistency on held-out web videos. Without an ablation comparing filtered vs. unfiltered YouTube-8M subsets or a correlation plot, it remains possible that reported gains arise primarily from dataset scale rather than the proposed proxy.

Authors: We agree that quantitative validation of the MVS proxy would strengthen the method section. In the revised manuscript, we include a new subsection with correlation analysis between MVS scores and SfM reconstruction metrics on held-out videos, as well as performance comparison on filtered versus unfiltered data subsets. These additions show that the filtering improves stability beyond mere scale. revision: yes
Referee: The two-stage intrinsics schedule and expert-distillation step presuppose that the MVS-filtered signal is already sufficiently clean for joint end-to-end optimization. The paper should report an ablation (e.g., in the experiments section) that isolates the contribution of each component by training a single-stage baseline on the same filtered corpus and measuring zero-shot transfer degradation.

Authors: We acknowledge the need for isolating the contributions of the training components. We have added an ablation study in the experiments section comparing the full two-stage schedule with distillation against a single-stage baseline trained on the same filtered corpus. The results confirm the benefits of each component for zero-shot transfer performance. revision: yes
Referee: The zero-shot cross-domain transfer results rely on a unified single-checkpoint evaluation protocol, but the manuscript does not detail how domain-specific intrinsics or motion statistics are handled at test time. A concrete failure case or per-domain breakdown would be needed to confirm that the reported improvements are not artifacts of evaluation protocol differences with prior baselines.

Authors: We clarify that at test time, the model relies on its own predicted intrinsics and ego-motion without domain-specific priors. In the revised manuscript, we have added a per-domain breakdown of the zero-shot results and included a discussion of representative failure cases to demonstrate the robustness of the unified evaluation protocol. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline relies on external SfM and verifiable pretraining results.

full rationale

The paper describes an end-to-end self-supervised pipeline that filters web videos via an MVS proxy, applies SfM-based supervision, and reports empirical zero-shot transfer and fine-tuning gains on YouTube-8M. No equation or claim reduces a reported prediction to a fitted parameter or self-citation by construction; the method uses external SfM tools, releases code and checkpoints, and evaluates on held-out domains. The central claims are therefore falsifiable against independent benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the standard domain assumption that SfM supplies usable self-supervision and on the newly introduced MVS proxy whose effectiveness is not independently verified outside the paper.

free parameters (1)

MVS filtering and curriculum thresholds
Parameters that decide which web clips are retained and in what order they are presented during training.

axioms (1)

domain assumption Structure-from-motion provides reliable depth and ego-motion signals even on unconstrained web video after filtering
Invoked as the foundation of the entire self-supervision pipeline.

invented entities (1)

multi-view signal proxy (MVS) no independent evidence
purpose: Filtering and curriculum sampling of web videos to stabilize self-supervision
New component introduced to address weak observability and heterogeneity; no external falsifiable test is described.

pith-pipeline@v0.9.0 · 5465 in / 1427 out tokens · 63490 ms · 2026-05-14T21:23:03.555977+00:00 · methodology

Review history (3 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Photometric Self-Supervision. The reconstruction loss ... ρ(·) is a robust penalty (Charbonnier + SSIM). ... multi-view objective Ψ = Σ Ψi→j
IndisputableMonolith/Foundation/BranchSelection.lean RCLCombiner_isCoupling_iff unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Multi-View Signal Proxy (MVS) ... Pt,t+1 = rH / rF ... MVS(v) = average Pt,t+1

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 8 internal anchors

[1]

YouTube-8M: A Large-Scale Video Classification Benchmark

Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., Vijayanarasimhan, S.: Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[2]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Bhat, S.F., Alhashim, I., Wonka, P.: Adabins: Depth estimation using adaptive bins. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4009–4018 (2021)

work page 2021
[4]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

Bhat,S.F.,Birkl,R.,Wofk,D.,Wonka,P.,Müller,M.:Zoedepth:Zero-shottransfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288 (2023)

work page internal anchor Pith review arXiv 2023
[5]

Advances in neural information processing systems32(2019)

Bian, J., Li, Z., Wang, N., Zhan, H., Shen, C., Cheng, M.M., Reid, I.: Unsupervised scale-consistent depth and ego-motion learning from monocular video. Advances in neural information processing systems32(2019)

work page 2019
[6]

1–a model zoo for robust monocular relative depth estimation

Birkl, R., Wofk, D., Müller, M.: Midas v3. 1–a model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460 (2023)

work page arXiv 2023
[7]

In: European conference on computer vision

Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: European conference on computer vision. pp. 611–

work page
[8]

In: European Conference on Computer Vision

Chen, T., An, S., Zhang, Y., Ma, C., Wang, H., Guo, X., Zheng, W.: Improving monocular depth estimation by leveraging structural awareness and complemen- tary datasets. In: European Conference on Computer Vision. pp. 90–108. Springer (2020)

work page 2020
[9]

In: Proceed- ings of the IEEE/CVF international conference on computer vision

Chen, Y., Schmid, C., Sminchisescu, C.: Self-supervised learning with geometric constraints in monocular video: Connecting flow, depth, and camera. In: Proceed- ings of the IEEE/CVF international conference on computer vision. pp. 7063–7072 (2019) 16 M. Hariat et al

work page 2019
[10]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

work page internal anchor Pith review Pith/arXiv arXiv 2010
[11]

arXiv preprint arXiv:2312.01283 (2023)

Fan, C., Yin, Z., Li, Y., Zhang, F.: Deeper into self-supervised monocular indoor depth estimation. arXiv preprint arXiv:2312.01283 (2023)

work page arXiv 2023
[12]

The international journal of robotics research32(11), 1231–1237 (2013)

Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The kitti dataset. The international journal of robotics research32(11), 1231–1237 (2013)

work page 2013
[13]

In: Proceedings of the IEEE/CVF in- ternational conference on computer vision

Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J.: Digging into self- supervised monocular depth estimation. In: Proceedings of the IEEE/CVF in- ternational conference on computer vision. pp. 3828–3838 (2019)

work page 2019
[14]

In: Proceedings of the IEEE/CVF international conference on computer vision

Gordon, A., Li, H., Jonschkowski, R., Angelova, A.: Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 8977–8986 (2019)

work page 2019
[15]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Guizilini, V., Ambrus, R., Pillai, S., Raventos, A., Gaidon, A.: 3d packing for self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2485–2494 (2020)

work page 2020
[16]

arXiv preprint arXiv:2002.12319 (2020)

Guizilini, V., Hou, R., Li, J., Ambrus, R., Gaidon, A.: Semantically-guided representation learning for self-supervised monocular depth. arXiv preprint arXiv:2002.12319 (2020)

work page arXiv 2002
[17]

In: Pro- ceedings of the IEEE/CVF winter conference on applications of computer vision

Hariat, M., Manzanera, A., Filliat, D.: Rebalancing gradient to improve self- supervised co-training of depth, odometry and optical flow predictions. In: Pro- ceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 1267–1276 (2023)

work page 2023
[18]

In: Proceedings of the Computer Vision and Pattern Recognition Confer- ence

Hariat, M., Manzanera, A., Filliat, D.: Improved monocular depth prediction us- ing distance transform over pre-semantic contours with self-supervised neural net- works. In: Proceedings of the Computer Vision and Pattern Recognition Confer- ence. pp. 21868–21879 (2025)

work page 2025
[19]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009 (2022)

work page 2022
[20]

He, M., Hui, L., Bian, Y., Ren, J., Xie, J., Yang, J.: Ra-depth: Resolution adaptive self-supervisedmonoculardepthestimation.In:EuropeanConferenceonComputer Vision. pp. 565–581. Springer (2022)

work page 2022
[21]

IEEE Transactions on Pattern Analysis and Machine Intelligence30(3), 548–554 (2008)

Hernandez, C., Vogiatzis, G., Cipolla, R.: Multiview photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence30(3), 548–554 (2008)

work page 2008
[22]

arXiv preprint arXiv:2106.03505 (2021)

Jia, S., Pei, X., Yao, W., Wong, S.C.: Self-supervised depth estimation leveraging global perception and geometric smoothness using on-board videos. arXiv preprint arXiv:2106.03505 (2021)

work page arXiv 2021
[23]

MapAnything: Universal Feed-Forward Metric 3D Reconstruction

Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., et al.: Mapanything: Univer- sal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Adam: A Method for Stochastic Optimization

Kingma, D.P.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[25]

In: European conference on computer vision

Klingner, M., Termöhlen, J.A., Mikolajczyk, J., Fingscheidt, T.: Self-supervised monocular depth estimation: Solving the dynamic object problem by semantic guidance. In: European conference on computer vision. pp. 582–600. Springer (2020)

work page 2020
[26]

In: Proceedings of the AAAI conference on artificial intelligence

Lee, S., Im, S., Lin, S., Kweon, I.S.: Learning monocular depth in dynamic scenes via instance-aware projection consistency. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 1863–1872 (2021) Abbreviated paper title 17

work page 2021
[27]

In: Proceedings of the IEEE/CVF international conference on computer vision

Li, B., Huang, Y., Liu, Z., Zou, D., Yu, W.: Structdepth: Leveraging the struc- tural regularities for self-supervised indoor depth estimation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 12663–12673 (2021)

work page 2021
[28]

In: Conference on Robot Learning

Li, H., Gordon, A., Zhao, H., Casser, V., Angelova, A.: Unsupervised monocular depth learning in dynamic scenes. In: Conference on Robot Learning. pp. 1908–

work page 1908
[29]

Pattern Recognition137, 109297 (2023)

Li, R., Xue, D., Su, S., He, X., Mao, Q., Zhu, Y., Sun, J., Zhang, Y.: Learning depth via leveraging semantics: Self-supervised monocular depth estimation with both implicit and explicit semantic guidance. Pattern Recognition137, 109297 (2023)

work page 2023
[30]

IEEE Transac- tions on Circuits and Systems for Video Technology33(2), 830–846 (2022)

Li, R., Ji, P., Xu, Y., Bhanu, B.: Monoindoor++: Towards better practice of self- supervised monocular depth estimation for indoor environments. IEEE Transac- tions on Circuits and Systems for Video Technology33(2), 830–846 (2022)

work page 2022
[31]

Depth Anything 3: Recovering the Visual Space from Any Views

Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

In: Proceedings of the AAAI conference on artificial intelligence

Lyu, X., Liu, L., Wang, M., Kong, X., Liu, L., Liu, Y., Chen, X., Yuan, Y.: Hr- depth: High resolution self-supervised monocular depth estimation. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 2294–2301 (2021)

work page 2021
[33]

DINOv2: Learning Robust Visual Features without Supervision

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Advances in neural information processing sys- tems32(2019)

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high- performance deep learning library. Advances in neural information processing sys- tems32(2019)

work page 2019
[35]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Piccinelli, L., Yang, Y.H., Sakaridis, C., Segu, M., Li, S., Van Gool, L., Yu, F.: Unidepth: Universal monocular metric depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10106– 10116 (2024)

work page 2024
[36]

Multimedia Tools and Applications82(27), 41641–41667 (2023)

Pinard, C., Manzanera, A.: Does it work outside this benchmark? introducing the rigid depth constructor tool: Depth validation dataset construction in rigid scenes for the masses. Multimedia Tools and Applications82(27), 41641–41667 (2023)

work page 2023
[37]

In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition

Poggi, M., Aleotti, F., Tosi, F., Mattoccia, S.: On the uncertainty of self-supervised monocular depth estimation. In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition. pp. 3227–3237 (2020)

work page 2020
[38]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

work page 2021
[39]

ACM Comput

Rajapaksha, U., Sohel, F., Laga, H., Diepeveen, D., Bennamoun, M.: Deep learning-based depth estimation methods from monocular image and videos: A comprehensive survey. ACM Comput. Surv.56(12) (oct 2024).https://doi.org/ 10.1145/3677327

work page doi:10.1145/3677327 2024
[40]

Nature331(6152), 163– 166 (1988)

Ramachandran, V.S.: Perception of shape from shading. Nature331(6152), 163– 166 (1988)

work page 1988
[41]

In: Proceedings of the IEEE/CVF international conference on computer vision

Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 12179–12188 (2021) 18 M. Hariat et al

work page 2021
[42]

IEEE transactions on pattern analysis and machine intelligence44(3), 1623–1637 (2020)

Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence44(3), 1623–1637 (2020)

work page 2020
[43]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Ranjan, A., Jampani, V., Balles, L., Kim, K., Sun, D., Wulff, J., Black, M.J.: Com- petitivecollaboration:Jointunsupervisedlearningofdepth,cameramotion,optical flow and motion segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12240–12249 (2019)

work page 2019
[44]

Rohan, A., Hasan, M.J., Petrovski, A.: A systematic literature review on deep learning-based depth estimation in computer vision (2025),https://arxiv.org/ abs/2501.05147

work page arXiv 2025
[45]

International journal of computer vision115(3), 211–252 (2015)

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recog- nition challenge. International journal of computer vision115(3), 211–252 (2015)

work page 2015
[46]

In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

Saeedan, F., Roth, S.: Boosting monocular depth with panoptic segmentation maps. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 3853–3862 (2021)

work page 2021
[47]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4104–4113 (2016)

work page 2016
[48]

Ad- vances in neural information processing systems31(2018)

Sener, O., Koltun, V.: Multi-task learning as multi-objective optimization. Ad- vances in neural information processing systems31(2018)

work page 2018
[49]

In: European Conference on Computer Vision

Shu, C., Yu, K., Duan, Z., Yang, K.: Feature-metric loss for self-supervised learning of depth and egomotion. In: European Conference on Computer Vision. pp. 572–

work page
[50]

In: European conference on computer vision

Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: European conference on computer vision. pp. 746–

work page
[51]

In: 2012 IEEE/RSJ international conference on intelligent robots and systems

Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of rgb-d slam systems. In: 2012 IEEE/RSJ international conference on intelligent robots and systems. pp. 573–580. IEEE (2012)

work page 2012
[52]

In: Pro- ceedingsofIEEEInternationalConferenceonComputerVision.pp.862–869.IEEE (2003)

Tankus, Sochen, Yeshurun: A new perspective [on] shape-from-shading. In: Pro- ceedingsofIEEEInternationalConferenceonComputerVision.pp.862–869.IEEE (2003)

work page 2003
[53]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 5294–5306 (2025)

work page 2025
[54]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geometric 3d vision made easy. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20697–20709 (2024)

work page 2024
[55]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Wang, Y., Yue, Y., Lu, R., Liu, T., Zhong, Z., Song, S., Huang, G.: Efficient- train: Exploring generalized curriculum learning for training visual backbones. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5852–5864 (2023)

work page 2023
[56]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Watson, J., Mac Aodha, O., Prisacariu, V., Brostow, G., Firman, M.: The temporal opportunist: Self-supervised multi-frame monocular depth. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1164–1174 (2021)

work page 2021
[57]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Wimbauer, F., Chen, W., Muhle, D., Rupprecht, C., Cremers, D.: Anycam: Learn- ing to recover camera poses and intrinsics from casual videos. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 16717–16727 (2025) Abbreviated paper title 19

work page 2025
[58]

In: Proceed- ings of the AAAI Conference on Artificial Intelligence

Xie, Z., Zhang, Y., Zhuang, C., Shi, Q., Liu, Z., Gu, J., Zhang, G.: Mode: A mixture-of-experts model with mutual distillation among the experts. In: Proceed- ings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 16067–16075 (2024)

work page 2024
[59]

Remote Sensing13(9), 1673 (2021)

Xu, W., Zou, L., Wu, L., Fu, Z.: Self-supervised monocular depth learning in low- texture areas. Remote Sensing13(9), 1673 (2021)

work page 2021
[60]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Un- leashing the power of large-scale unlabeled data. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10371–10381 (2024)

work page 2024
[61]

In: Proceedings of the IEEE conference on computer vision and pat- tern recognition

Yin, Z., Shi, J.: Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In: Proceedings of the IEEE conference on computer vision and pat- tern recognition. pp. 1983–1992 (2018)

work page 1983
[62]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Zhan, H., Garg, R., Weerasekera, C.S., Li, K., Agarwal, H., Reid, I.: Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 340–349 (2018)

work page 2018
[63]

In: 2022 international conference on 3D vision (3DV)

Zhao, C., Zhang, Y., Poggi, M., Tosi, F., Guo, X., Zhu, Z., Huang, G., Tang, Y., Mattoccia, S.: Monovit: Self-supervised monocular depth estimation with a vision transformer. In: 2022 international conference on 3D vision (3DV). pp. 668–678. IEEE (2022)

work page 2022
[64]

arXiv preprint arXiv:2110.09482 (2021)

Zhou, H., Greenwood, D., Taylor, S.: Self-supervised monocular depth estimation with internal feature fusion. arXiv preprint arXiv:2110.09482 (2021)

work page arXiv 2021
[65]

In: British Machine Vision Conference (BMVC) (2021)

Zhou, H., Greenwood, D., Taylor, S.: Self-supervised monocular depth estimation with internal feature fusion. In: British Machine Vision Conference (BMVC) (2021)

work page 2021
[66]

In: Proceedings of the IEEE/CVF interna- tional conference on computer vision

Zhou, J., Wang, Y., Qin, K., Zeng, W.: Moving indoor: Unsupervised video depth learning in challenging environments. In: Proceedings of the IEEE/CVF interna- tional conference on computer vision. pp. 8618–8627 (2019)

work page 2019
[67]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1851–1858 (2017)

work page 2017
[68]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Zhu, S., Brazil, G., Liu, X.: The edge of depth: Explicit constraints between seg- mentation and depth. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13116–13125 (2020)

work page 2020
[69]

In: Proceedings of the European conference on computer vision (ECCV)

Zou, Y., Luo, Z., Huang, J.B.: Df-net: Unsupervised joint learning of depth and flow using cross-task consistency. In: Proceedings of the European conference on computer vision (ECCV). pp. 36–53 (2018)

work page 2018