{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:T7ZZFOUO5OYYOO3TTYPI755A6W","short_pith_number":"pith:T7ZZFOUO","schema_version":"1.0","canonical_sha256":"9ff392ba8eebb1873b739e1e8ff7a0f58135acb7684e5745e21045aa8f19539c","source":{"kind":"arxiv","id":"2502.20110","version":2},"attestation_state":"computed","paper":{"title":"UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"UniDepthV2 predicts metric 3D points directly from single images across domains without extra inputs or retraining.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Christos Sakaridis, Luc Van Gool, Luigi Piccinelli, Mattia Segu, Siyuan Li, Wim Abbeloos, Yung-Hsu Yang","submitted_at":"2025-02-27T14:03:15Z","abstract_excerpt":"Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE paradigm, UniDepthV2 directly predicts metric 3D points from the input image at inference"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2502.20110","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","primary_cat":"cs.CV","submitted_at":"2025-02-27T14:03:15Z","cross_cats_sorted":[],"title_canon_sha256":"68a0fcd2de0d339511f8281b97df677821d623c66641a7b3566c8e9a30afa586","abstract_canon_sha256":"e9a08e1f1d555ae397933a1837c99b6d2833f6cc5b0c1eea97f923d444200fa7"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:14.500620Z","signature_b64":"JUYDVhVNEy/v3S8dJo0JGoFtLo1bc/z0Pdbx/gZo1+BxqE0NmNcdeHsX2fwK4WVjeT1oLghoenwyujHL+joYAg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"9ff392ba8eebb1873b739e1e8ff7a0f58135acb7684e5745e21045aa8f19539c","last_reissued_at":"2026-05-17T23:38:14.500017Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:14.500017Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"UniDepthV2 predicts metric 3D points directly from single images across domains without extra inputs or retraining.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Christos Sakaridis, Luc Van Gool, Luigi Piccinelli, Mattia Segu, Siyuan Li, Wim Abbeloos, Yung-Hsu Yang","submitted_at":"2025-02-27T14:03:15Z","abstract_excerpt":"Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE paradigm, UniDepthV2 directly predicts metric 3D points from the input image at inference"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"UniDepthV2 is capable of reconstructing metric 3D scenes from solely single images across domains, improves its predecessor via edge-guided loss, simplified design, and uncertainty output, and shows superior zero-shot performance on ten depth datasets.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The self-promptable camera module and geometric invariance loss can reliably disentangle and generalize camera and depth features without domain-specific information or post-hoc adjustments.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"UniDepthV2 predicts metric 3D points directly from single images using a self-promptable camera module, pseudo-spherical representation, and new losses for improved cross-domain generalization.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"UniDepthV2 predicts metric 3D points directly from single images across domains without extra inputs or retraining.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"70e151f1aee0359192101646a69b9d071af0425368c7b3fc4d7a1b041e3699f0"},"source":{"id":"2502.20110","kind":"arxiv","version":2},"verdict":{"id":"fde7e9ff-f63a-49a8-adc9-1343c26eedfd","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T09:05:20.568310Z","strongest_claim":"UniDepthV2 is capable of reconstructing metric 3D scenes from solely single images across domains, improves its predecessor via edge-guided loss, simplified design, and uncertainty output, and shows superior zero-shot performance on ten depth datasets.","one_line_summary":"UniDepthV2 predicts metric 3D points directly from single images using a self-promptable camera module, pseudo-spherical representation, and new losses for improved cross-domain generalization.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The self-promptable camera module and geometric invariance loss can reliably disentangle and generalize camera and depth features without domain-specific information or post-hoc adjustments.","pith_extraction_headline":"UniDepthV2 predicts metric 3D points directly from single images across domains without extra inputs or retraining."},"references":{"count":90,"sample":[{"doi":"","year":2022,"title":"Depth-supervised nerf: Fewer views and faster training for free,","work_id":"9c964b73-0d43-4117-af0c-fc8a9a4c5eb7","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Does computer vision matter for action?","work_id":"55872667-fb85-41e1-a5e3-80b88c3c6854","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Towards real-time monocular depth estimation for robotics: A survey,","work_id":"7c4b0726-6b1e-4f6a-bd77-fe262de6f0da","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,","work_id":"2961b922-30d2-434c-a7e8-45d28720004b","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Is pseudo-lidar needed for monocular 3d object detection?","work_id":"ba6cffc8-53f7-4039-8bd0-5aa435d2c91d","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":90,"snapshot_sha256":"6cf81852b32717d4f66610fea45cdd9f877fd3179a7241dbf8ac190ea9f830b9","internal_anchors":4},"formal_canon":{"evidence_count":2,"snapshot_sha256":"a8b7e840c26150e5a07040cf5c1935a61fd9b77b26155a6f025f47136dce9335"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2502.20110","created_at":"2026-05-17T23:38:14.500103+00:00"},{"alias_kind":"arxiv_version","alias_value":"2502.20110v2","created_at":"2026-05-17T23:38:14.500103+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2502.20110","created_at":"2026-05-17T23:38:14.500103+00:00"},{"alias_kind":"pith_short_12","alias_value":"T7ZZFOUO5OYY","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"T7ZZFOUO5OYYOO3T","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"T7ZZFOUO","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":17,"internal_anchor_count":17,"sample":[{"citing_arxiv_id":"2508.10934","citing_title":"ViPE: Video Pose Engine for 3D Geometric Perception","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2602.09532","citing_title":"RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2602.19035","citing_title":"OpenVO: Open-World Visual Odometry with Temporal Dynamics Awareness","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2603.01765","citing_title":"Efficient Test-Time Optimization for Depth Completion via Low-Rank Decoder Adaptation","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2507.02546","citing_title":"MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11578","citing_title":"The Midas Touch for Metric Depth","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04728","citing_title":"Anny-Fit: All-Age Human Mesh Recovery","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21915","citing_title":"Vista4D: Video Reshooting with 4D Point Clouds","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18336","citing_title":"Enhancing Glass Surface Reconstruction via Depth Prior for Robot Navigation","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01852","citing_title":"DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09352","citing_title":"LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2511.10647","citing_title":"Depth Anything 3: Recovering the Visual Space from Any Views","ref_index":66,"is_internal_anchor":true},{"citing_arxiv_id":"2604.05908","citing_title":"Appearance Decomposition Gaussian Splatting for Multi-Traversal Reconstruction","ref_index":64,"is_internal_anchor":true},{"citing_arxiv_id":"2604.05715","citing_title":"In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05014","citing_title":"CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02784","citing_title":"HumanSplatHMR: Closing the Loop Between Human Mesh Recovery and Gaussian Splatting Avatar","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22331","citing_title":"Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation","ref_index":7,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/T7ZZFOUO5OYYOO3TTYPI755A6W","json":"https://pith.science/pith/T7ZZFOUO5OYYOO3TTYPI755A6W.json","graph_json":"https://pith.science/api/pith-number/T7ZZFOUO5OYYOO3TTYPI755A6W/graph.json","events_json":"https://pith.science/api/pith-number/T7ZZFOUO5OYYOO3TTYPI755A6W/events.json","paper":"https://pith.science/paper/T7ZZFOUO"},"agent_actions":{"view_html":"https://pith.science/pith/T7ZZFOUO5OYYOO3TTYPI755A6W","download_json":"https://pith.science/pith/T7ZZFOUO5OYYOO3TTYPI755A6W.json","view_paper":"https://pith.science/paper/T7ZZFOUO","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2502.20110&json=true","fetch_graph":"https://pith.science/api/pith-number/T7ZZFOUO5OYYOO3TTYPI755A6W/graph.json","fetch_events":"https://pith.science/api/pith-number/T7ZZFOUO5OYYOO3TTYPI755A6W/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/T7ZZFOUO5OYYOO3TTYPI755A6W/action/timestamp_anchor","attest_storage":"https://pith.science/pith/T7ZZFOUO5OYYOO3TTYPI755A6W/action/storage_attestation","attest_author":"https://pith.science/pith/T7ZZFOUO5OYYOO3TTYPI755A6W/action/author_attestation","sign_citation":"https://pith.science/pith/T7ZZFOUO5OYYOO3TTYPI755A6W/action/citation_signature","submit_replication":"https://pith.science/pith/T7ZZFOUO5OYYOO3TTYPI755A6W/action/replication_record"}},"created_at":"2026-05-17T23:38:14.500103+00:00","updated_at":"2026-05-17T23:38:14.500103+00:00"}