{"paper":{"title":"Depth Pro: Sharp Monocular Metric Depth in Less Than a Second","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Depth Pro produces sharp, metric-scale depth maps from single images in 0.3 seconds without any camera metadata.","cross_cats":["cs.LG"],"primary_cat":"cs.CV","authors_text":"Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Stephan R. Richter, Vladlen Koltun, Yichao Zhou","submitted_at":"2024-10-02T22:42:20Z","abstract_excerpt":"We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines re"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. 
And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the training protocol combining real and synthetic datasets, together with the multi-scale vision transformer, achieves both high metric accuracy and fine boundary tracing in zero-shot settings without camera intrinsics.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Depth Pro is a fast foundation model for zero-shot metric monocular depth estimation that produces sharp high-resolution depth maps with absolute scale using a multi-scale vision transformer.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Depth Pro produces sharp, metric-scale depth maps from single images in 0.3 seconds without any camera metadata.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"33fdaba0a1b0a6ef23567915298b3a0db14507139ad27ded34d4d371be6110a3"},"source":{"id":"2410.02073","kind":"arxiv","version":2},"verdict":{"id":"8d9e293a-1ea3-44e4-97cf-b67cbd78cb18","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T20:40:37.299863Z","strongest_claim":"We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. 
And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU.","one_line_summary":"Depth Pro is a fast foundation model for zero-shot metric monocular depth estimation that produces sharp high-resolution depth maps with absolute scale using a multi-scale vision transformer.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the training protocol combining real and synthetic datasets, together with the multi-scale vision transformer, achieves both high metric accuracy and fine boundary tracing in zero-shot settings without camera intrinsics.","pith_extraction_headline":"Depth Pro produces sharp, metric-scale depth maps from single images in 0.3 seconds without any camera metadata."},"references":{"count":294,"sample":[{"doi":"","year":null,"title":"Defocus deblurring using dual-pixel data. ECCV","work_id":"276fd506-629e-4bb1-9f04-24451c334aef","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"RCA engineer","work_id":"6ddc4f2f-28ac-4db4-a96d-9c30847ac9cb","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Attention Attention Everywhere: Monocular Depth Prediction with Skip Attention","work_id":"421fc360-e547-40f6-afdd-a9c85f6761b6","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Unilmv2: Pseudo-masked language models for unified language model pre-training. ICML","work_id":"1a6b7713-24f1-43c5-a6d0-bb06b38e3ce8","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Hangbo Bao and Li Dong and Songhao Piao and Furu Wei","work_id":"4b07335e-a784-4b28-9333-b92fac5f3143","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":294,"snapshot_sha256":"9efdf4becbb4f22097dd23350371fb98e22c0f162dd02a8c4bef9d60a70f2fe6","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"4cc9cca9276f8fc2b11233da516301a0f5c9ce3c54c373a1221bbe6010c435d3"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}