MapAnything: Evaluating Monocular Metric Depth Models for 3D Urban Asset Localization
Pith reviewed 2026-05-21 21:51 UTC · model grok-4.3
The pith
MapAnything derives accurate geocoordinates for city objects like signs and damage from one ordinary photo.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By leveraging advanced Metric Depth Estimation models, MapAnything accurately calculates object geocoordinates, converting 2D image data into valuable 3D spatial information through the integration of estimated camera-to-object distance with geometric principles and known camera specifications.
What carries the argument
Metric depth estimation that predicts distances, combined with geometric principles and camera specifications for geocoordinate calculation.
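A minimal sketch of how such a calculation could look, assuming a pinhole camera with square pixels, a spherical Earth, and a flat local tangent plane around the camera. The function `locate_object`, all of its parameters, and the rotation convention are illustrative assumptions rather than the paper's implementation. Only the two relations quoted from the paper elsewhere on this page (horizontal distance as d·cos of the effective pitch, and the latitude update scaled by 180/π) are taken from the source.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation (assumption)


def locate_object(cam_lat, cam_lon, heading_deg, pitch_deg,
                  u, v, width, height, hfov_deg, distance_m):
    """Estimate (lat, lon) in degrees of the object seen at pixel (u, v).

    distance_m is assumed to be the camera-to-object distance along the
    viewing ray, e.g. read from a metric depth map. heading_deg is the
    compass direction of the optical axis (clockwise from north) and
    pitch_deg its elevation above the horizon. All names are hypothetical.
    """
    # Focal length in pixels from the horizontal field of view.
    f = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

    # Unit viewing ray in the camera frame: x right, y up, z forward.
    x, y, z = u - width / 2.0, height / 2.0 - v, f
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm

    # Undo the camera pitch so that "up" is vertical and "fwd" is horizontal.
    p = math.radians(pitch_deg)
    up = y * math.cos(p) + z * math.sin(p)
    fwd = -y * math.sin(p) + z * math.cos(p)

    # Rotate by the heading into east/north components.
    h = math.radians(heading_deg)
    east = x * math.cos(h) + fwd * math.sin(h)
    north = -x * math.sin(h) + fwd * math.cos(h)

    # Effective pitch of the ray and its horizontal range.
    pitch_eff = math.asin(max(-1.0, min(1.0, up)))
    d_horizontal = distance_m * math.cos(pitch_eff)

    # Split the horizontal range into north and east offsets in meters.
    bearing = math.atan2(east, north)
    d_north = d_horizontal * math.cos(bearing)
    d_east = d_horizontal * math.sin(bearing)

    # Convert metric offsets to angular offsets (radians), then to degrees.
    d_lat = d_north / EARTH_RADIUS_M
    d_lon = d_east / (EARTH_RADIUS_M * math.cos(math.radians(cam_lat)))
    return cam_lat + math.degrees(d_lat), cam_lon + math.degrees(d_lon)
```

If a depth model reports depth along the optical axis instead of along the ray, the value would need to be divided by the ray's forward component before being passed in; which convention applies depends on the model and is not stated on this page.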
If this is right
- Automated mapping of traffic signs and road pavement damage from monocular images.
- Granular analysis of accuracy across distance intervals and semantic areas such as roads and vegetation.
- Practical demonstration for integration into automated urban inventory systems.
- Reduced manual effort in maintaining urban digital twins and spatial datasets.
Where Pith is reading between the lines
- The framework could be deployed on smartphones for crowdsourced urban mapping.
- Accuracy might improve further by fusing multiple images from different viewpoints.
- This could complement or reduce the need for dedicated mapping vehicles equipped with LiDAR.
Load-bearing premise
The distances estimated by metric depth models are accurate enough in complex urban environments to allow reliable geocoordinate calculation using geometry and camera specs.
What would settle it
Finding that the geocoordinate errors compared to LiDAR ground truth are consistently larger than acceptable for inventory purposes, for example more than one meter at close range.
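Such a check could be sketched as follows, assuming predicted and reference positions are available as (latitude, longitude) pairs in degrees. The helper names and the default 1 m tolerance are illustrative; the tolerance mirrors the example above and is not a threshold stated by the paper.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation (assumption)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def share_over_tolerance(predicted, reference, tolerance_m=1.0):
    """Return (fraction of objects whose error exceeds tolerance_m, list of errors)."""
    errors = [haversine_m(p[0], p[1], r[0], r[1]) for p, r in zip(predicted, reference)]
    over = sum(1 for e in errors if e > tolerance_m)
    return over / len(errors), errors
```

A consistently high fraction at short camera-to-object distances would count against the load-bearing premise; a low fraction would support it for that range only.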
Original abstract
City administrations increasingly rely on comprehensive databases and urban digital twins of city assets, such as traffic signs and trees, as well as incidents like graffiti or road damage, to maintain an effective overview of urban conditions. Digitization has increased the demand for continuously updated spatial datasets, yet current data acquisition and maintenance processes still involve considerable manual effort, posing significant scalability challenges. This paper introduces MapAnything, a novel geo-localization framework that automates the spatial mapping of urban objects and incidents from a single monocular image. By leveraging advanced Metric Depth Estimation models, MapAnything accurately calculates object geocoordinates, converting 2D image data into valuable 3D spatial information. The methodology integrates the estimated camera-to-object distance with geometric principles and known camera specifications. We present a detailed validation of the framework, comparing its distance-estimation accuracy against high-precision LiDAR point clouds in complex urban environments. Our evaluation provides a granular analysis of spatial performance across various distance intervals and semantic areas, such as roads and vegetation. Finally, we demonstrate the framework's practical efficacy through specific use cases, including mapping traffic signs and road pavement damage, and provide recommendations for its integration into automated urban inventory systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MapAnything, a framework for 3D urban asset localization from monocular images using metric depth estimation models. It claims to accurately calculate object geocoordinates by combining estimated depths with known camera intrinsics and geometric principles. The paper provides a validation study comparing distance estimates to LiDAR point clouds in complex urban environments, with granular analysis across distance intervals and semantic classes, and demonstrates use cases for mapping traffic signs and road damage.
Significance. If the accuracy claims hold under the reported conditions, this work could enable scalable, low-cost 3D mapping of urban assets using standard cameras, addressing scalability challenges in maintaining urban digital twins. The evaluation against external LiDAR data and focus on practical use cases strengthen its potential impact in computer vision applications for smart cities.
major comments (2)
- §4.2 (LiDAR validation): The reported distance errors are presented per bin and semantic class, but the manuscript does not quantify how these translate to geocoordinate error in meters or whether they remain below the tolerance needed for the traffic-sign and pavement-damage use cases asserted in §5.1.
- §3.2 (depth-to-geocoordinate pipeline): The fusion of monocular metric depth with ray geometry and known intrinsics/extrinsics is described at a high level; no explicit error-propagation analysis or sensitivity study is provided for urban factors (occlusion, specular surfaces) that the abstract identifies as the target domain.
minor comments (2)
- Abstract: The phrase 'advanced Metric Depth Estimation models' is used without naming the specific models or indicating whether they are used off-the-shelf or fine-tuned on urban data.
- Table 2: Column headers for semantic classes should explicitly state the number of samples per bin to allow readers to assess statistical reliability of the per-class results.
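The per-bin reporting asked for in the Table 2 comment could be sketched as follows: each distance bin carries its sample count next to the error metrics. The function name and bin edges are hypothetical, and the metric definitions used here (MAE as the mean of |predicted − reference|, ARE as the mean of |predicted − reference| / reference) are the standard ones, assumed rather than confirmed to match the paper's.

```python
from collections import defaultdict


def binned_depth_errors(predicted_m, reference_m, bin_edges_m):
    """Group samples by reference distance and report n, MAE and ARE per bin.

    Returns {(lo, hi): {"n": int, "mae": float, "are": float}}.
    Bins are half-open intervals [lo, hi); samples outside all bins are skipped.
    """
    # Per bin: [sample count, sum of absolute errors, sum of relative errors]
    sums = defaultdict(lambda: [0, 0.0, 0.0])
    for pred, ref in zip(predicted_m, reference_m):
        if ref <= 0:
            continue  # relative error is undefined for non-positive distances
        for lo, hi in zip(bin_edges_m[:-1], bin_edges_m[1:]):
            if lo <= ref < hi:
                err = abs(pred - ref)
                acc = sums[(lo, hi)]
                acc[0] += 1
                acc[1] += err
                acc[2] += err / ref
                break
    return {
        b: {"n": n, "mae": s_abs / n, "are": s_rel / n}
        for b, (n, s_abs, s_rel) in sums.items()
    }
```

Running the same function once per semantic class would give the per-class, per-bin counts the referee requests.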
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of practical applicability and methodological rigor that we address below. We have revised the manuscript accordingly where feasible.
Point-by-point responses
- Referee: §4.2 (LiDAR validation): The reported distance errors are presented per bin and semantic class, but the manuscript does not quantify how these translate to geocoordinate error in meters or whether they remain below the tolerance needed for the traffic-sign and pavement-damage use cases asserted in §5.1.
  Authors: We agree that connecting distance errors to geocoordinate accuracy and use-case tolerances strengthens the validation. In the revised manuscript, we have added explicit propagation calculations in §4.2 that convert binned distance errors to approximate geocoordinate errors (lateral and depth components) using the camera intrinsics and typical viewing angles. For the §5.1 use cases, we now include a direct comparison: our median errors remain below 1.5 m for distances under 15 m, which aligns with common tolerances for traffic-sign inventory (0.5–2 m) and pavement-damage mapping; we note larger errors beyond 25 m and suggest multi-frame fusion as mitigation.
  Revision: yes
- Referee: §3.2 (depth-to-geocoordinate pipeline): The fusion of monocular metric depth with ray geometry and known intrinsics/extrinsics is described at a high level; no explicit error-propagation analysis or sensitivity study is provided for urban factors (occlusion, specular surfaces) that the abstract identifies as the target domain.
  Authors: The pipeline is formalized via the ray-casting equations in §3.2 that combine metric depth with known intrinsics and extrinsics. We acknowledge the absence of a dedicated error-propagation or sensitivity analysis. The revised version adds a dedicated paragraph in §3.2 that qualitatively discusses error contributions from occlusion and specular surfaces, drawing on failure cases observed in our urban LiDAR validation set. A full quantitative sensitivity study would require new controlled experiments beyond the current scope; we have therefore marked this as future work while providing the initial analysis requested.
  Revision: partial
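To make the propagation idea in these responses concrete, here is a first-order sketch of how a distance error and a heading error might translate into a horizontal position error. It is an illustration under stated assumptions, not the calculation added to the manuscript: errors are treated as small and independent, and camera position and pitch errors are ignored. The function and parameter names are hypothetical.

```python
import math


def propagate_position_error(distance_m, distance_err_m, pitch_deg, heading_err_deg):
    """First-order horizontal position error from range and heading errors.

    Returns (along_ray_err_m, lateral_err_m, total_err_m).
    Assumes small, independent errors (illustrative simplification).
    """
    pitch = math.radians(pitch_deg)
    d_horizontal = distance_m * math.cos(pitch)
    # A range error shifts the object along the viewing direction.
    along_ray_err = distance_err_m * math.cos(pitch)
    # A heading error sweeps the object sideways by arc length d * angle.
    lateral_err = d_horizontal * math.radians(heading_err_deg)
    # Independent components combine in quadrature.
    total_err = math.hypot(along_ray_err, lateral_err)
    return along_ray_err, lateral_err, total_err
```

Under these assumptions the lateral term grows linearly with distance for a fixed heading error, which would be one reason to expect larger position errors at longer range.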
Circularity Check
No circularity: evaluation rests on external LiDAR ground truth
Full rationale
The paper introduces MapAnything as a framework that applies off-the-shelf metric depth estimation models, combines their camera-to-object distance outputs with known camera intrinsics/extrinsics and ray geometry, and validates the resulting geocoordinates against independent high-precision LiDAR point clouds. The abstract and methodology description contain no fitted parameters that are later renamed as predictions, no self-definitional equations, and no load-bearing uniqueness theorems imported from prior self-citations. Granular performance analysis across distance bins and semantic classes is presented as direct empirical comparison rather than any reduction to the input assumptions by construction. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Metric depth estimation models yield sufficiently accurate camera-to-object distances for geocoordinate computation in complex urban environments.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
  unclear: Relation between the paper passage and the cited Recognition theorem.
  We set up the following equations following the pinhole camera model... $d_{\text{horizontal}} = d \cos(\theta_{\text{pitch,eff}})$ ... $\text{lat}_{\text{object}} = \text{lat}_{\text{camera}} + (180/\pi) \cdot \Delta\text{lat}$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Supplementary Material
- Semantic Groups for Test Images. Figure 5: Semantic groups for twelve of the sample images. Original images from Cyclomedia.
- All ARE plots for one Example Image. [Figure: Absolute Relative Error (ARE) for different depth estimation models on the same example image. Panels: (a) DepthAnything-S, (b) DepthAnything-B, (c) DepthAnything-L, (d) UniDepth-S, (e) UniDepth-B, ... Scale: Error Magnitude, 0.0 to 1.0.]
- All MAE plots for one Example Image. [Figure: Mean Absolute Error (MAE) for different depth estimation models on the same example image. Panels: (a) DepthAnything-S, (b) DepthAnything-B, (c) DepthAnything-L, (d) UniDepth-S, (e) UniDepth-B, (f) UniDepth-L, ... Scale: Error Magnitude, 0 to 30.]
- Traffic Signs: Image Coverage in Annotated Area. [Figure 8: The road network according to OpenStreetMap in the area where signs were annotated (a) and the ... Panels: (a) Road Network, (b) Road covered by Cyclomedia Images, (c) Road covered by Mapillary Images. Legend: Image Coverage Area, Covered Road Segments, Camera Points.]
- Traffic Signs: Segmentation. We chose three models to segment the traffic signs from the images for comparison: (1) a U-Net model trained on the A2D2 dataset, (2) a SegFormer model fine-tuned on the A2D2 dataset, (3) a Mask2Former model trained on the Mapillary Vistas dataset. We annotated 100 images of our Cyclomedia dataset and calculated the Intersection over Union (IoU) between the annotated and the predicted masks for each image. The average over all annotated images is shown in Table 5. The models trained on the A2D2 dataset perform significantly better. We as...
- Traffic Signs: Deviation Box Plots. [Figure: box plots per model. x-axis: Distance to camera (m), bins 5-10, 10-15, 15-20, 20-25, 25-30. y-axis: Error in predicted position (m), 0 to 10. Panels: (a) DepthAnything-B, (b) DepthPro (with cam), ...]
- Road Damages: Deviation Box Plots. [Figure: box plots per model. x-axis: Distance to camera (m), bins 2-4, 4-6, 6-8, 8-10. y-axis: Error in predicted position (m), 0 to 10. Panels: (a) DepthAnything-B, (b) DepthPro (with cam), (c) Metric3D-ViT, ...]