UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler
Pith reviewed 2026-05-17 09:05 UTC · model grok-4.3
The pith
UniDepthV2 predicts metric 3D points directly from single images across domains without extra inputs or retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UniDepthV2 reconstructs metric 3D scenes from single images across domains by implementing a self-promptable camera module that predicts a dense camera representation to condition depth features, exploiting a pseudo-spherical output representation that disentangles the camera and depth representations, and proposing a geometric invariance loss that promotes the invariance of camera-prompted depth features. The model further improves its predecessor through an edge-guided loss for sharper localization, a revisited and simplified architectural design, and an additional uncertainty-level output.
What carries the argument
A self-promptable camera module that predicts a dense camera representation to condition depth features, combined with a pseudo-spherical output representation and a geometric invariance loss.
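To make the pseudo-spherical output representation concrete, here is a minimal sketch, assuming a pinhole camera and an output parameterized by per-pixel azimuth, elevation, and log-depth. The function names and the angle conventions are illustrative assumptions, not the released UniDepth code. The point is that the two angles depend only on the camera, while the log-depth channel carries the scene, which is what lets the two be predicted separately.

```python
import numpy as np

def pinhole_rays(fx, fy, cx, cy, height, width):
    """Unit ray direction for every pixel of an assumed pinhole camera."""
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    rays = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

def rays_to_angles(rays):
    """Dense camera representation: per-pixel azimuth and elevation in radians."""
    azimuth = np.arctan2(rays[..., 0], rays[..., 2])
    elevation = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))
    return azimuth, elevation

def angles_to_points(azimuth, elevation, log_depth):
    """Combine camera angles with log-depth into metric 3D points.

    Depth is taken to be the z-coordinate along the optical axis (assumption).
    """
    z = np.exp(log_depth)
    # Rebuild the unit ray direction from the two angles.
    dir_x = np.cos(elevation) * np.sin(azimuth)
    dir_y = np.sin(elevation)
    dir_z = np.cos(elevation) * np.cos(azimuth)
    # Stretch each ray until its z-component equals the metric depth.
    scale = z / dir_z
    return np.stack([dir_x * scale, dir_y * scale, z], axis=-1)
```

With a known camera the angles can be computed from intrinsics as above; in the self-prompted setting described in the paper they would instead come from the network's own camera prediction.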
If this is right
- Metric 3D reconstruction becomes feasible from ordinary single images in previously unseen environments.
- Downstream tasks gain access to per-pixel uncertainty estimates for confidence-aware processing.
- Edge localization in depth maps improves without requiring separate post-processing steps.
- A single trained model replaces the need for domain-specific depth estimators in many settings.
- Inference runs with a simpler and more efficient network than the prior version.
Where Pith is reading between the lines
- The same disentanglement of geometry and appearance could be tested on related tasks such as surface normal estimation or novel-view synthesis.
- If the uncertainty output correlates well with actual error, it might support active sensing strategies that query only low-confidence regions.
- The approach opens the possibility of deploying depth-aware systems in robotics or augmented reality without per-scene camera recalibration.
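Whether the uncertainty output correlates with actual error can be checked with a rank correlation. The sketch below is one way to do it, assuming per-pixel prediction, ground truth, and uncertainty arrays of the same shape; the helper names and the convention that non-positive ground truth marks missing pixels are assumptions for illustration.

```python
import numpy as np

def rankdata(values):
    """Ranks 0..n-1 by ascending value (ties broken by position)."""
    ranks = np.empty(values.size, dtype=np.float64)
    ranks[np.argsort(values, kind="stable")] = np.arange(values.size)
    return ranks

def uncertainty_error_correlation(pred_depth, gt_depth, uncertainty):
    """Spearman rank correlation between uncertainty and absolute depth error."""
    valid = gt_depth > 0  # assumption: non-positive ground truth means "no label"
    error = np.abs(pred_depth[valid] - gt_depth[valid])
    r_err = rankdata(error)
    r_unc = rankdata(uncertainty[valid])
    # Pearson correlation of the ranks is the Spearman coefficient.
    return float(np.corrcoef(r_err, r_unc)[0, 1])
```

A value near 1 would mean the model ranks its own errors well; a value near 0 would mean the uncertainty channel carries little usable confidence information.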
Load-bearing premise
The self-promptable camera module and geometric invariance loss can reliably disentangle and generalize camera and depth features without domain-specific information or post-hoc adjustments.
What would settle it
The claim of universal generalization would be falsified by a new test domain in which the model, used without fine-tuning or camera calibration, produces depth values that deviate systematically from ground-truth metric distances.
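A systematic deviation of this kind could be detected with standard depth metrics plus a scale statistic. The following sketch assumes strictly positive predictions and uses non-positive ground truth as the missing-label marker; the function name and the returned keys are illustrative choices, not part of the paper's evaluation code.

```python
import numpy as np

def metric_depth_report(pred_depth, gt_depth):
    """Summarize metric accuracy and systematic scale bias on valid pixels."""
    valid = gt_depth > 0  # assumption: non-positive ground truth means "no label"
    pred, gt = pred_depth[valid], gt_depth[valid]
    ratio = pred / gt
    return {
        # Mean absolute relative error.
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),
        # Fraction of pixels whose ratio to ground truth is within 1.25.
        "delta1": float(np.mean(np.maximum(ratio, 1.0 / ratio) < 1.25)),
        # Median prediction-to-truth ratio; far from 1 signals a scale bias.
        "median_scale": float(np.median(ratio)),
    }
```

A median scale that stays well away from 1 across a whole domain, rather than scattering around it, is the systematic pattern the falsification test describes.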
Original abstract
Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE paradigm, UniDepthV2 directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepthV2 implements a self-promptable camera module predicting a dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles the camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. UniDepthV2 improves its predecessor UniDepth model via a new edge-guided loss which enhances the localization and sharpness of edges in the metric depth outputs, a revisited, simplified and more efficient architectural design, and an additional uncertainty-level output which enables downstream tasks requiring confidence. Thorough evaluations on ten depth datasets in a zero-shot regime consistently demonstrate the superior performance and generalization of UniDepthV2. Code and models are available at https://github.com/lpiccinelli-eth/UniDepth
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UniDepthV2, a model for monocular metric depth estimation (MMDE) that directly predicts metric 3D points from single images across domains without additional inputs. It builds on UniDepth with a self-promptable camera module that outputs a dense camera representation, a pseudo-spherical output representation to disentangle camera and depth features, a geometric invariance loss, an edge-guided loss for improved edge sharpness, a simplified architecture, and an uncertainty output. The central claim is superior zero-shot generalization on ten depth datasets.
Significance. If the disentanglement of camera parameters from scene content holds and zero-shot results are free of domain leakage, this would represent a meaningful step toward practical universal MMDE for downstream 3D tasks. The public release of code and models strengthens the contribution by enabling direct reproducibility and extension.
major comments (3)
- [§3.2] Self-promptable camera module: The module is described as predicting a dense camera representation to condition depth features, but the manuscript lacks controlled experiments (e.g., varying focal length or principal point while holding scene content fixed) that would demonstrate whether predictions rely on true camera intrinsics rather than image appearance cues such as object scale or texture. This directly bears on the zero-shot generalization claim.
- [§4] Experimental setup and Tables 1 and 2: The zero-shot evaluations across ten datasets report superior performance, yet the text does not explicitly confirm that training data splits exclude any overlap with evaluation domains or that baselines were re-trained under identical conditions without implicit domain cues. This information is load-bearing for the generalization superiority assertion.
- [§3.3] Geometric invariance loss: The loss is introduced to enforce invariance of camera-prompted depth features, but no quantitative ablation or feature-space analysis (e.g., cosine similarity of depth features under synthetic camera perturbations) is provided to verify that it successfully separates camera and depth representations rather than allowing domain-specific correlations to persist.
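The controlled experiment raised in the §3.2 comment, varying focal length while holding scene content fixed, can be approximated by a center crop followed by a resize back to the original resolution. The sketch below only computes the crop box and the intrinsics that the perturbed image should have; both helper names are hypothetical, and the continuous pixel-coordinate convention for the principal point is an assumption.

```python
def center_crop_box(width, height, zoom):
    """Centered crop box (x0, y0, w, h) covering 1/zoom of each image side."""
    w, h = width / zoom, height / zoom
    return (width - w) / 2.0, (height - h) / 2.0, w, h

def crop_resize_intrinsics(fx, fy, cx, cy, crop_box, out_size):
    """Pinhole intrinsics after cropping (x0, y0, w, h) and resizing to (out_w, out_h)."""
    x0, y0, w, h = crop_box
    out_w, out_h = out_size
    sx, sy = out_w / w, out_h / h
    # Cropping shifts the principal point; resizing scales everything.
    return fx * sx, fy * sy, (cx - x0) * sx, (cy - y0) * sy
```

Because the scene itself is unchanged, the metric depth of corresponding pixels should stay the same while the expected camera representation changes. A model that has truly separated camera from depth should track the new intrinsics without shifting its depth.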
minor comments (2)
- [Figure 4] The uncertainty visualization would be clearer with an explicit color scale and comparison to ground-truth error maps.
- [§3.1] Notation: The pseudo-spherical representation is introduced without a compact mathematical definition (e.g., an equation relating spherical coordinates to metric depth and camera parameters); adding this would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while committing to revisions that directly strengthen the claims on camera disentanglement, data integrity, and loss validation.
- Referee: [§3.2] Self-promptable camera module: The module is described as predicting a dense camera representation to condition depth features, but the manuscript lacks controlled experiments (e.g., varying focal length or principal point while holding scene content fixed) that would demonstrate whether predictions rely on true camera intrinsics rather than image appearance cues such as object scale or texture. This directly bears on the zero-shot generalization claim.
  Authors: We agree that explicit controlled experiments would provide clearer evidence that the self-promptable camera module captures true intrinsics rather than appearance cues. In the revised manuscript we will add a dedicated experiment subsection that fixes scene content and synthetically varies focal length and principal point; we will report the resulting shifts in the predicted dense camera representation and the downstream metric depth to demonstrate responsiveness to camera parameters.
  Revision: yes
- Referee: [§4] Experimental setup and Tables 1 and 2: The zero-shot evaluations across ten datasets report superior performance, yet the text does not explicitly confirm that training data splits exclude any overlap with evaluation domains or that baselines were re-trained under identical conditions without implicit domain cues. This information is load-bearing for the generalization superiority assertion.
  Authors: We confirm that the training splits were constructed with no overlap with any evaluation domain and that all baselines were re-trained from scratch under identical data, hyperparameters, and protocol. In the revised Section 4 we will add an explicit paragraph detailing the split construction procedure and training conditions to make this information unambiguous and to reinforce the zero-shot generalization results.
  Revision: yes
- Referee: [§3.3] Geometric invariance loss: The loss is introduced to enforce invariance of camera-prompted depth features, but no quantitative ablation or feature-space analysis (e.g., cosine similarity of depth features under synthetic camera perturbations) is provided to verify that it successfully separates camera and depth representations rather than allowing domain-specific correlations to persist.
  Authors: We acknowledge that a quantitative feature-space analysis would strengthen the justification for the geometric invariance loss. In the revised manuscript we will include a new ablation that reports average cosine similarity of depth features under controlled synthetic camera perturbations, together with the corresponding depth accuracy metrics, to demonstrate that the loss successfully promotes separation of camera and depth representations.
  Revision: yes
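The cosine-similarity analysis discussed in the last exchange could look roughly like the sketch below. It assumes two depth feature maps of shape (H, W, C), taken from the same scene under two camera perturbations and already warped into pixel-wise correspondence; the function name and the alignment step are assumptions, since the manuscript does not specify the protocol.

```python
import numpy as np

def mean_cosine_similarity(feat_a, feat_b, eps=1e-8):
    """Average per-location cosine similarity between two (H, W, C) feature maps."""
    dot = np.sum(feat_a * feat_b, axis=-1)
    norm = np.linalg.norm(feat_a, axis=-1) * np.linalg.norm(feat_b, axis=-1)
    # eps guards against division by zero for all-zero feature vectors.
    return float(np.mean(dot / np.maximum(norm, eps)))
```

Reporting this number with and without the geometric invariance loss, alongside depth accuracy, would show whether the loss actually makes the features more stable under camera changes.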
Circularity Check
No significant circularity in derivation chain
The paper presents an empirical ML architecture for monocular metric depth estimation, with innovations including a self-promptable camera module, pseudo-spherical output representation, geometric invariance loss, and edge-guided loss. These are introduced as design choices to promote disentanglement of camera and depth features, supported by zero-shot evaluations on ten external benchmark datasets. No equations, predictions, or first-principles results reduce by construction to quantities fitted on evaluation data or to unverified self-citations. Self-reference to the predecessor UniDepth model describes incremental improvements rather than serving as load-bearing justification for the core claims. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- training hyperparameters and loss weights
axioms (1)
- domain assumption: Single images contain sufficient cues for metric depth and camera parameters across arbitrary domains
invented entities (2)
- self-promptable camera module: no independent evidence
- pseudo-spherical output representation: no independent evidence
Forward citations
Cited by 17 Pith papers
- CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography
  CARD is a new multi-modal driving dataset delivering ~500K dense depth pixels per frame from challenging road topographies using stereo cameras and fused LiDARs over 110 km.
- Anny-Fit: All-Age Human Mesh Recovery
  Anny-Fit jointly optimizes all-age multi-person 3D human meshes in camera coordinates using complementary signals from off-the-shelf depth, segmentation, keypoint, and VLM networks, yielding better reprojection, depth...
- DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity
  Dual-pixel defocus blur enables absolute scale estimation in SfM without reference objects or calibration.
- LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation
  A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.
- Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction
  ADM-GS decomposes static background appearance into traversal-invariant material and traversal-dependent illumination via a frequency-separated neural light field, yielding +0.98 dB PSNR gains and better cross-travers...
- Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation
  Low-rank decoder adaptation enables efficient test-time optimization for zero-shot depth completion by updating only the subspace containing depth-relevant information.
- RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes
  RAD retrieves semantically similar RGB-D context samples for low-confidence regions and fuses them via matched cross-attention to cut relative absolute depth error by 29.2% on NYU Depth v2 underrepresented classes whi...
- HumanSplatHMR: Closing the Loop Between Human Mesh Recovery and Gaussian Splatting Avatar
  HumanSplatHMR closes the loop between human mesh recovery and Gaussian Splatting by using photometric, segmentation, and depth losses to refine poses during avatar optimization.
- Vista4D: Video Reshooting with 4D Point Clouds
  Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.
- Enhancing Glass Surface Reconstruction via Depth Prior for Robot Navigation
  A training-free RANSAC-based fusion of depth foundation model priors with sensor data recovers accurate metric depth on glass, supported by a new GlassRecon RGB-D dataset with derived ground truth.
- In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting
  A selective regularization framework lets scale-ambiguous monocular depth priors improve Gaussian Splatting geometry and rendering by isolating and supervising only ill-posed regions.
- OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness
  OpenVO estimates ego-motion from monocular dashcam footage with varying observation rates and uncalibrated cameras by encoding temporal dynamics in a two-frame regression framework and using 3D priors from foundation ...
- Depth Anything 3: Recovering the Visual Space from Any Views
  DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
- The Midas Touch for Metric Depth
  MTD turns relative depth into metric depth via segment-wise sparse graph optimization and discontinuity-aware geodesic pixel refinement, claiming better accuracy and generalization than prior depth methods.
- ViPE: Video Pose Engine for 3D Geometric Perception
  ViPE estimates camera intrinsics, motion, and dense near-metric depth from uncalibrated videos, outperforming baselines on TUM and KITTI while releasing annotations for 96M frames across real and generated videos.
- MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
  MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.
- Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation
  Monocular depth estimation with UniDepthV2 on Raspberry Pi enables cost-effective rover navigation, proving more robust than stereo vision in real-world tests at 0.1 FPS depth and 10 FPS detection.
Reference graph
Works this paper leans on
-
[1]
Depth-supervised nerf: Fewer views and faster training for free,
K. Deng, A. Liu, J.-Y . Zhu, and D. Ramanan, “Depth-supervised nerf: Fewer views and faster training for free,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12 882–12 891. 1
work page 2022
-
[2]
Does computer vision matter for action?
B. Zhou, P. Kr ¨ahenb¨uhl, and V . Koltun, “Does computer vision matter for action?”Science Robotics, vol. 4, 5 2019. 1 12
work page 2019
-
[3]
Towards real-time monocular depth estimation for robotics: A survey,
X. Dong, M. A. Garratt, S. G. Anavatti, and H. A. Abbass, “Towards real-time monocular depth estimation for robotics: A survey,”IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 10, pp. 16 940–16 961, 2022. 1
work page 2022
-
[4]
Y . Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8445–8453. 1
work page 2019
-
[5]
Is pseudo-lidar needed for monocular 3d object detection?
D. Park, R. Ambrus, V . Guizilini, J. Li, and A. Gaidon, “Is pseudo-lidar needed for monocular 3d object detection?” inIEEE/CVF International Conference on Computer Vision (ICCV), 2021. 1
work page 2021
-
[6]
Depth map prediction from a single image using a multi-scale deep network,
D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 3. Neural information processing systems foundation, 6 2014, pp. 2366–2374. 1, 3
work page 2014
-
[7]
Deep ordinal regression network for monocular depth estimation,
H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for monocular depth estimation,”Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2002–2011, 6 2018. 1, 3
work page 2002
-
[8]
Adabins: Depth estimation using adaptive bins,
S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using adaptive bins,”Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4008–4017, 11 2020. 1, 3, 9
work page 2020
-
[9]
Vision transformers for dense prediction,
R. Ranftl, A. Bochkovskiy, and V . Koltun, “Vision transformers for dense prediction,”Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12 159–12 168, 3 2021. 1
work page 2021
-
[10]
P3Depth: Monocular depth estimation with a piecewise planarity prior,
V . Patil, C. Sakaridis, A. Liniger, and L. V . Gool, “P3Depth: Monocular depth estimation with a piecewise planarity prior,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022, pp. 1600–1611. 1, 3
work page 2022
-
[11]
Neural window fully- connected crfs for monocular depth estimation,
W. Yuan, X. Gu, Z. Dai, S. Zhu, and P. Tan, “Neural window fully- connected crfs for monocular depth estimation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022, pp. 3906–3915. 1, 3, 9
work page 2022
-
[12]
iDisc: Internal discretization for monocular depth estimation,
L. Piccinelli, C. Sakaridis, and F. Yu, “iDisc: Internal discretization for monocular depth estimation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 1, 3, 9
work page 2023
-
[13]
Train in germany, test in the usa: Making 3d object detectors generalize,
Y . Wang, X. Chen, Y . You, L. E. Li, B. Hariharan, M. Campbell, K. Q. Weinberger, and W. L. Chao, “Train in germany, test in the usa: Making 3d object detectors generalize,”Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11 710–11 720, 5 2020. 1
work page 2020
-
[14]
Metric3d: Towards zero-shot metric 3d prediction from a single image,
W. Yin, C. Zhang, H. Chen, Z. Cai, G. Yu, K. Wang, X. Chen, and C. Shen, “Metric3d: Towards zero-shot metric 3d prediction from a single image,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 9043–9053. 1, 3, 5, 7, 8, 9, 10
work page 2023
-
[15]
Towards zero-shot scale-aware monocular depth estimation,
V . Guizilini, I. Vasiljevic, D. Chen, R. Ambrus,, and A. Gaidon, “Towards zero-shot scale-aware monocular depth estimation,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 9233–9243. 1, 3
work page 2023
-
[16]
M. Hu, W. Yin, C. Zhang, Z. Cai, X. Long, H. Chen, K. Wang, G. Yu, C. Shen, and S. Shen, “Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estima- tion,”arXiv preprint arXiv:2404.15506, 2024. 1, 3, 6, 7, 8, 9, 10
-
[17]
Unidepth: Universal monocular metric depth estimation,
L. Piccinelli, Y .-H. Yang, C. Sakaridis, M. Segu, S. Li, L. Van Gool, and F. Yu, “Unidepth: Universal monocular metric depth estimation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 10 106–10 116. 1, 2, 3, 6, 7, 8, 9, 10, 11
work page 2024
-
[18]
Perception-aware chance-constrained model predictive control for uncertain environments,
A. D. Bonzanini, A. Mesbah, and S. Di Cairano, “Perception-aware chance-constrained model predictive control for uncertain environments,” in2021 American Control Conference (ACC). IEEE, 2021, pp. 2082–
work page 2021
-
[19]
Stochastic model predictive control: An overview and perspectives for future research,
A. Mesbah, “Stochastic model predictive control: An overview and perspectives for future research,”IEEE Control Systems Magazine, vol. 36, no. 6, pp. 30–44, 2016. 2
work page 2016
-
[20]
Safe perception-based control under stochastic sensor uncertainty using con- formal prediction,
S. Yang, G. J. Pappas, R. Mangharam, and L. Lindemann, “Safe perception-based control under stochastic sensor uncertainty using con- formal prediction,”arXiv preprint arXiv:2304.00194, 2023. 2
-
[21]
Robust model predictive control: A survey,
A. Bemporad and M. Morari, “Robust model predictive control: A survey,” inRobustness in identification and control. Springer, 2007, pp. 207–
work page 2007
-
[22]
Indoor seg- mentation and support inference from rgbd images,
P. K. Nathan Silberman, Derek Hoiem and R. Fergus, “Indoor seg- mentation and support inference from rgbd images,” inThe European Conference on Computer Vision (ECCV), 2012. 2
work page 2012
-
[23]
Are we ready for autonomous driving? The KITTI vision benchmark suite,
A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2012. 2
work page 2012
-
[24]
Towards robust monocular depth estimation: Mixing datasets for zero-shot cross- dataset transfer,
R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V . Koltun, “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross- dataset transfer,”IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 44, no. 3, pp. 1623–1637, 2020. 2
work page 2020
-
[25]
Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans,
A. Eftekhar, A. Sax, J. Malik, and A. Zamir, “Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 10 786–10 796. 2
work page 2021
-
[26]
Learning to recover 3d scene shape from a single image,
W. Yin, J. Zhang, O. Wang, S. Niklaus, L. Mai, S. Chen, and C. Shen, “Learning to recover 3d scene shape from a single image,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 204–213. 2
work page 2021
-
[27]
Repurposing diffusion-based image generators for monocular depth estimation,
B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monocular depth estimation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 9492–9502. 2
work page 2024
-
[28]
Depth anything: Unleashing the power of large-scale unlabeled data,
L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” inPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 10 371–10 381. 2
work page 2024
-
[29]
L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,”arXiv preprint arXiv:2406.09414, 2024. 2, 8, 9
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Doll ´ar, and R. Girshick, “Segment anything,” 2023. 3
work page 2023
-
[31]
Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,
X. Fu, W. Yin, M. Hu, K. Wang, Y . Ma, P. Tan, S. Shen, D. Lin, and X. Long, “Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,” inEuropean Conference on Computer Vision (ECCV), 2024. 3
work page 2024
-
[32]
Lotus: Diffusion-based visual foundation model for high-quality dense prediction,
J. He, H. Li, W. Yin, Y . Liang, L. Li, K. Zhou, H. Zhang, B. Liu, and Y .-C. Chen, “Lotus: Diffusion-based visual foundation model for high-quality dense prediction,” inInternational Conference on Learning Representations (ICLR), 2025. 3
work page 2025
-
[33]
Video depth anything: Consistent depth estimation for super-long videos,
S. Chen, H. Guo, S. Zhu, F. Zhang, Z. Huang, J. Feng, and B. Kang, “Video depth anything: Consistent depth estimation for super-long videos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 22 831–22 840. 3
work page 2025
-
[34]
Depthcrafter: Generating consistent long depth sequences for open-world videos,
W. Hu, X. Gao, X. Li, S. Zhao, X. Cun, Y . Zhang, L. Quan, and Y . Shan, “Depthcrafter: Generating consistent long depth sequences for open-world videos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 2005–2015. 3
work page 2025
-
[35]
Deeper depth prediction with fully convolutional residual networks,
I. Laina, C. Rupprecht, V . Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,”Proceedings of the International Conference on 3D Vision (3DV), pp. 239–248, 6
-
[36]
Learning depth from single monoc- ular images using deep convolutional neural fields,
F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single monoc- ular images using deep convolutional neural fields,”IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 38, pp. 2024–2039, 2 2015. 3
work page 2024
-
[37]
Transformer-based attention networks for continuous pixel-wise prediction,
G. Yang, H. Tang, M. Ding, N. Sebe, and E. Ricci, “Transformer-based attention networks for continuous pixel-wise prediction,”Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 16 249–16 259, 3 2021. 3
work page 2021
-
[38]
L. Piccinelli, T. Wandel, C. Sakaridis, W. Abbeloos, and L. V . Gool, “Video depth propagation,” 2025. [Online]. Available: https: //arxiv.org/abs/2512.10725 3
-
[39]
ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth
S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. M¨uller, “Zoedepth: Zero- shot transfer by combining relative and metric depth,”arXiv preprint arXiv:2302.12288, 2023. 3, 7, 8, 9, 10
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
Cam-convs: Camera-aware multi-scale convolutions for single- view depth,
J. M. Facil, B. Ummenhofer, H. Zhou, L. Montesano, T. Brox, and J. Civera, “Cam-convs: Camera-aware multi-scale convolutions for single- view depth,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11 826–11 835. 3
work page 2019
-
[41]
From big to small: Multi- scale local planar guidance for monocular depth estimation,
J. H. Lee, M. Han, D. W. Ko, and I. H. Suh, “From big to small: Multi- scale local planar guidance for monocular depth estimation,”CoRR, vol. abs/1907.10326, 7 2019. 3, 9
-
[42]
Mapillary planet-scale depth dataset,
M. L. Antequera, P. Gargallo, M. Hofinger, S. R. Bul `o, Y . Kuang, and P. Kontschieder, “Mapillary planet-scale depth dataset,” inThe European Conference on Computer Vision (ECCV). Springer International Pub- lishing, 2020, pp. 589–604. 3, 6
work page 2020
-
[43]
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
A. Bochkovskii, A. Delaunoy, H. Germain, M. Santos, Y . Zhou, S. R. Richter, and V . Koltun, “Depth pro: Sharp monocular metric depth in less than a second,”arXiv preprint arXiv:2410.02073, 2024. 3, 6, 7, 8, 10
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[44]
UniK3D: Universal camera monocular 3d estimation,
L. Piccinelli, C. Sakaridis, M. Segu, Y .-H. Yang, S. Li, W. Abbeloos, and L. Van Gool, “UniK3D: Universal camera monocular 3d estimation,” 13 inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3
work page 2025
-
[45]
Diffusion models for monocular depth estimation: Overcoming challenging conditions,
F. Tosi, P. Zama Ramirez, and M. Poggi, “Diffusion models for monocular depth estimation: Overcoming challenging conditions,” inEuropean Conference on Computer Vision (ECCV), 2024. 3
work page 2024
-
[46]
Robust monocular depth estimation under chal- lenging conditions,
S. Gasperini, V . Olsson, M. Poggi, F. Tosi, S. Salti, L. Di Stefano, K. AAstr ”om, J. Gonfaus, L. Van Gool, R. Timofte, A. N ”aslund, and L. Bitti, “Robust monocular depth estimation under chal- lenging conditions,” inProceedings of the IEEE/CVF International Con- ference on Computer Vision (ICCV), October 2023, pp. 7897–7908. 3
work page 2023
-
[47]
Learning depth estimation for transparent and mirror surfaces,
A. Costanzino, F. Tosi, M. Poggi, S. Salti, S. Mattoccia, and L. Di Ste- fano, “Learning depth estimation for transparent and mirror surfaces,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 17 770–17 780. 3
work page 2023
-
[48]
Booster: A benchmark for depth from images of specular and transparent surfaces,
P. Z. Ramirez, A. Costanzino, F. Tosi, M. Poggi, S. Salti, S. Mattoccia, and L. Di Stefano, “Booster: A benchmark for depth from images of specular and transparent surfaces,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 1, pp. 85–102, 2024. 3, 6
work page 2024
-
[49]
Slic superpixels compared to state-of-the-art superpixel methods,
R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. S ¨usstrunk, “Slic superpixels compared to state-of-the-art superpixel methods,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2274–2282, 2012. 5
work page 2012
-
[50]
An image is worth 16x16 words: Transformers for image recognition at scale,
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Un- terthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” inInternational Conference on Learning Representations (ICLR). OpenReview.net, 2021. 5, 7
work page 2021
-
[51]
Grounding image matching in 3d with mast3r
V . Leroy, Y . Cabon, and J. Revaud, “Grounding image matching in 3d with mast3r,”arXiv preprint arXiv:2406.09756, 2024. 7, 8, 10
-
[52]
J. Geyer, Y . Kassahun, M. Mahmudi, X. Ricou, R. Durgesh, A. S. Chung, L. Hauswald, V . H. Pham, M. M ¨uhlegg, S. Dorn, T. Fernandez, M. J ¨anicke, S. Mirashi, C. Savani, M. Sturm, O. V orobiov, M. Oelker, S. Garreis, and P. Schuberth, “A2D2: Audi Autonomous Driving Dataset,”arXiv preprint arXiv:2004.06320, 2020. [Online]. Available: https://www.a2d2.audi 6
-
[53]
B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hartnett, J. K. Pontes, D. Ramanan, P. Carr, and J. Hays, “Argoverse 2: Next generation datasets for self-driving perception and forecasting,” in Advances in Neural Information Processing Systems,
-
[54]
G. Baruch, Z. Chen, A. Dehghan, T. Dimry, Y. Feigin, P. Fu, T. Gebauer, B. Joffe, D. Kurz, A. Schwartz, and E. Shulman, “ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data,” in Advances in Neural Information Processing Systems (NIPS),
-
[55]
M. J. Black, P. Patel, J. Tesch, and J. Yang, “BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 8726–8737.
-
[56]
Y. Yao, Z. Luo, S. Li, J. Zhang, Y. Ren, L. Zhou, T. Fang, and L. Quan, “Blendedmvs: A large-scale dataset for generalized multi-view stereo networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1790–1799.
-
[57]
L. Ling, Y. Sheng, Z. Tu, W. Zhao, C. Xin, K. Wan, L. Yu, Q. Guo, Z. Yu, Y. Lu et al., “DL3DV-10k: A large-scale scene dataset for deep learning-based 3d vision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 22160–22169.
-
[58]
G. Yang, X. Song, C. Huang, Z. Deng, J. Shi, and B. Zhou, “Drivingstereo: A large-scale dataset for stereo matching in autonomous driving scenarios,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
-
[59]
N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rupprecht, “Dynamicstereo: Consistent dynamic depth from stereo videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
-
[60]
H.-A. Le, T. Mensink, P. Das, S. Karaoglu, and T. Gevers, “Eden: Multimodal synthetic dataset of enclosed garden scenes,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 1579–1589.
-
[61]
Y. Liu, Y. Liu, C. Jiang, K. Lyu, W. Wan, H. Shen, B. Liang, Z. Fu, H. Wang, and L. Yi, “Hoi4d: A 4d egocentric dataset for category-level human-object interaction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 21013–21022.
-
[62]
S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. M. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, M. Savva, Y. Zhao, and D. Batra, “Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI,” in Advances in Neural Information Processing Systems (NIPS), 2021.
-
[63]
A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,” in Proceedings of the International Conference on 3D Vision (3DV), 2017.
-
[64]
Y. Li, L. Jiang, L. Xu, Y. Xiangli, Z. Wang, D. Lin, and B. Dai, “Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3205–3215.
-
[65]
Z. Li and N. Snavely, “Megadepth: Learning single-view depth prediction from internet photos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2041–2050.
-
[66]
E. Arnold, J. Wynn, S. Vicente, G. Garcia-Hernando, Á. Monszpart, V. A. Prisacariu, D. Turmukhambetov, and E. Brachmann, “Map-free visual relocalization: Metric pose relative to a single image,” in European Conference on Computer Vision (ECCV), 2022.
-
[67]
Y. Zheng, A. W. Harley, B. Shen, G. Wetzstein, and L. J. Guibas, “Pointodyssey: A large-scale synthetic dataset for long-term point tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 19855–19865.
-
[68]
A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-
[69]
C. Yeshwanth, Y.-C. Liu, M. Nießner, and A. Dai, “Scannet++: A high-fidelity dataset of 3d indoor scenes,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
-
[70]
W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the limits of visual slam,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 4909–4916.
-
[71]
A. R. Zamir, A. Sax, W. B. Shen, L. Guibas, J. Malik, and S. Savarese, “Taskonomy: Disentangling task transfer learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
-
[72]
P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2446–2454.
-
[73]
H. Xia, Y. Fu, S. Liu, and X. Wang, “Rgbd objects in the wild: Scaling real-world 3d object learning from rgb-d videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 22378–22389.
-
[74]
S. Song, S. P. Lichtenberg, and J. Xiao, “Sun rgb-d: A rgb-d scene understanding benchmark suite,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), vol. 07-12-June-2015, pp. 567–576, 10 2015.
-
[75]
T. Koch, L. Liebel, M. Körner, and F. Fraundorfer, “Comparison of monocular depth estimation methods using geometrically relevant metrics on the IBims-1 dataset,” Computer Vision and Image Understanding (CVIU), vol. 191, p. 102877, 2020.
-
[76]
J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” in Proc. of the International Conference on Intelligent Robot Systems (IROS), 2012.
-
[77]
T. Schöps, J. L. Schönberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, “A multi-view stereo benchmark with high-resolution images and multi-camera videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-
[78]
D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in The European Conference on Computer Vision (ECCV), ser. Part IV, LNCS 7577. Springer, 2012, pp. 611–625.
-
[79]
V. Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon, “3d packing for self-supervised monocular depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
-
[80]
H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.