{"paper":{"title":"UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"UniDepthV2 predicts metric 3D points directly from single images across domains without extra inputs or retraining.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Christos Sakaridis, Luc Van Gool, Luigi Piccinelli, Mattia Segu, Siyuan Li, Wim Abbeloos, Yung-Hsu Yang","submitted_at":"2025-02-27T14:03:15Z","abstract_excerpt":"Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE paradigm, UniDepthV2 directly predicts metric 3D points from the input image at inference"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"UniDepthV2 is capable of reconstructing metric 3D scenes from solely single images across domains, improves its predecessor via edge-guided loss, simplified design, and uncertainty output, and shows superior zero-shot performance on ten depth datasets.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The self-promptable camera module and geometric invariance loss can reliably disentangle and generalize camera and depth features without domain-specific information or post-hoc adjustments.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"UniDepthV2 predicts metric 3D points directly from single images using a self-promptable camera module, pseudo-spherical representation, and new losses for improved cross-domain generalization.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"UniDepthV2 predicts metric 3D points directly from single images across domains without extra inputs or retraining.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"70e151f1aee0359192101646a69b9d071af0425368c7b3fc4d7a1b041e3699f0"},"source":{"id":"2502.20110","kind":"arxiv","version":2},"verdict":{"id":"fde7e9ff-f63a-49a8-adc9-1343c26eedfd","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T09:05:20.568310Z","strongest_claim":"UniDepthV2 is capable of reconstructing metric 3D scenes from solely single images across domains, improves its predecessor via edge-guided loss, simplified design, and uncertainty output, and shows superior zero-shot performance on ten depth datasets.","one_line_summary":"UniDepthV2 predicts metric 3D points directly from single images using a self-promptable camera module, pseudo-spherical representation, and new losses for improved cross-domain generalization.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The self-promptable camera module and geometric invariance loss can reliably disentangle and generalize camera and depth features without domain-specific information or post-hoc adjustments.","pith_extraction_headline":"UniDepthV2 predicts metric 3D points directly from single images across domains without extra inputs or retraining."},"references":{"count":90,"sample":[{"doi":"","year":2022,"title":"Depth-supervised nerf: Fewer views and faster training for free,","work_id":"9c964b73-0d43-4117-af0c-fc8a9a4c5eb7","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Does computer vision matter for action?","work_id":"55872667-fb85-41e1-a5e3-80b88c3c6854","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Towards real-time monocular depth estimation for robotics: A survey,","work_id":"7c4b0726-6b1e-4f6a-bd77-fe262de6f0da","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,","work_id":"2961b922-30d2-434c-a7e8-45d28720004b","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Is pseudo-lidar needed for monocular 3d object detection?","work_id":"ba6cffc8-53f7-4039-8bd0-5aa435d2c91d","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":90,"snapshot_sha256":"6cf81852b32717d4f66610fea45cdd9f877fd3179a7241dbf8ac190ea9f830b9","internal_anchors":4},"formal_canon":{"evidence_count":2,"snapshot_sha256":"a8b7e840c26150e5a07040cf5c1935a61fd9b77b26155a6f025f47136dce9335"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}