MapAnything: Evaluating Monocular Metric Depth Models for 3D Urban Asset Localization

André Ludwig; Bogdan Franczyk; Eric Peukert; Erik Quinten Fastermann; Jonas Kunze; Miriam Louise Carnot

arxiv: 2509.14839 · v2 · pith:XGCUB25Jnew · submitted 2025-09-18 · 💻 cs.CV

Miriam Louise Carnot, Jonas Kunze, Erik Quinten Fastermann, Eric Peukert, André Ludwig, Bogdan Franczyk

Pith reviewed 2026-05-21 21:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords monocular metric depth · urban asset localization · 3D geocoordinates · digital twins · LiDAR comparison · traffic signs · computer vision · depth estimation

The pith

MapAnything derives accurate geocoordinates for city objects like signs and damage from one ordinary photo.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to turn a single camera image of an urban object into its real-world position on a map. It does this by feeding the image to a metric depth model that guesses the distance to the object, then using the camera's known angle and focal length to figure out the latitude, longitude, and height. Validation comes from running the system on city streets and checking the results against expensive laser scans. If this works, cities could update their asset databases much faster and cheaper by having inspectors or vehicles just take pictures.
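As a rough illustration of the "known angle and focal length" step, the sketch below turns a pixel position into a viewing direction under a plain pinhole model. The function names, the intrinsics `fx, fy, cx, cy`, and the convention that the camera heading is given in degrees clockwise from north are all assumptions made for this example; the paper's own formulation may differ.

```python
import math

def pixel_to_angles(u, v, fx, fy, cx, cy):
    """Angles (radians) of the ray through pixel (u, v) relative to the optical axis.

    Assumes an undistorted pinhole camera; the pitch value is exact only for
    pixels on the central image column and an approximation elsewhere.
    """
    yaw_offset = math.atan2(u - cx, fx)    # positive to the right of the optical axis
    pitch_offset = math.atan2(cy - v, fy)  # positive above the axis (v grows downward)
    return yaw_offset, pitch_offset

def object_bearing(camera_heading_deg, yaw_offset):
    """Compass bearing of the object, assuming heading is degrees clockwise from north."""
    return (camera_heading_deg + math.degrees(yaw_offset)) % 360.0

# Hypothetical 1920x1080 camera with a 1000 px focal length.
yaw, pitch = pixel_to_angles(1960, 540, 1000.0, 1000.0, 960.0, 540.0)
print(round(math.degrees(yaw), 3), round(object_bearing(350.0, yaw), 3))
```

The bearing obtained this way, together with the estimated distance, is what a later step would convert into latitude and longitude.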

Core claim

By leveraging advanced Metric Depth Estimation models, MapAnything accurately calculates object geocoordinates, converting 2D image data into valuable 3D spatial information through the integration of estimated camera-to-object distance with geometric principles and known camera specifications.

What carries the argument

Metric depth estimation that predicts distances, combined with geometric principles and camera specifications for geocoordinate calculation.

If this is right

Automated mapping of traffic signs and road pavement damage from monocular images.
Granular analysis of accuracy across distance intervals and semantic areas such as roads and vegetation.
Practical demonstration for integration into automated urban inventory systems.
Reduced manual effort in maintaining urban digital twins and spatial datasets.
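
The per-interval accuracy analysis in the list above could be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: `binned_errors`, the bin edges, and the sample values are hypothetical, and the metrics are the usual mean absolute error (MAE) and absolute relative error (ARE).

```python
from collections import defaultdict

def binned_errors(true_d, pred_d, edges):
    """Group per-sample depth errors into [edges[i], edges[i+1]) distance bins.

    Returns {(lo, hi): {"mae": ..., "are": ..., "n": ...}} for non-empty bins.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # abs error sum, rel error sum, count
    for t, p in zip(true_d, pred_d):
        if t <= 0:
            continue  # relative error is undefined for non-positive distances
        for lo, hi in zip(edges[:-1], edges[1:]):
            if lo <= t < hi:
                s = sums[(lo, hi)]
                s[0] += abs(p - t)
                s[1] += abs(p - t) / t
                s[2] += 1
                break
    return {b: {"mae": s[0] / s[2], "are": s[1] / s[2], "n": s[2]}
            for b, s in sums.items()}

# Illustrative values only: reference distances vs. model estimates, in metres.
result = binned_errors([6.0, 8.0, 12.0], [7.0, 6.0, 12.0], [5, 10, 15])
for interval, stats in sorted(result.items()):
    print(interval, stats)
```

Reporting the sample count `n` next to each bin makes it easier to judge how reliable a per-bin average is.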

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be deployed on smartphones for crowdsourced urban mapping.
Accuracy might improve further by fusing multiple images from different viewpoints.
This could complement or reduce the need for dedicated mapping vehicles equipped with LiDAR.

Load-bearing premise

The distances estimated by metric depth models are accurate enough in complex urban environments to allow reliable geocoordinate calculation using geometry and camera specs.

What would settle it

Finding that the geocoordinate errors compared to LiDAR ground truth are consistently larger than acceptable for inventory purposes, for example more than one meter at close range.
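
A check of this kind could look like the sketch below, which measures the ground distance between a predicted coordinate and a reference coordinate with the haversine formula and compares it to a tolerance. The spherical Earth radius, the function names, and the sample coordinates are assumptions for illustration; the one-metre default simply mirrors the example threshold in the sentence above.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumption: mean spherical Earth radius

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def within_tolerance(pred, truth, tol_m=1.0):
    """True if the predicted (lat, lon) lies within tol_m metres of the reference."""
    return haversine_m(*pred, *truth) <= tol_m

# Hypothetical coordinates that differ by 0.00001 degrees of latitude.
pred, truth = (51.34, 12.37), (51.34001, 12.37)
print(round(haversine_m(*pred, *truth), 3), within_tolerance(pred, truth))
```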

Figures

Figures reproduced from arXiv: 2509.14839 by André Ludwig, Bogdan Franczyk, Eric Peukert, Erik Quinten Fastermann, Jonas Kunze, Miriam Louise Carnot.

**Figure 3.** Distance ranges and semantic groups from SegFormer [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

**Figure 4.** Relationship between true distance (camera-sign) and [PITH_FULL_IMAGE:figures/full_fig_p006_4.png]

**Figure 5.** Semantic Groups for twelve of the sample images. Original images from Cyclomedia. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png]

**Figure 6.** Absolute Relative Error (ARE) for different depth estimation models on the same example image. [PITH_FULL_IMAGE:figures/full_fig_p010_6.png]

**Figure 7.** Mean Absolute Error (MAE) for different depth estimation models on the same example image. [PITH_FULL_IMAGE:figures/full_fig_p011_7.png]

**Figure 8.** The road network according to OpenStreetMap in the area where signs were annotated (a) and the parts of it that are covered [PITH_FULL_IMAGE:figures/full_fig_p012_8.png]

**Figure 9.** Positions and viewing directions of all recordings from both image sources (a, b) and all signs that should be visible within this [PITH_FULL_IMAGE:figures/full_fig_p013_9.png]

**Figure 10.** The box plots show the relationship between the error of the position (y-axis) and the true distance between sign and camera [PITH_FULL_IMAGE:figures/full_fig_p015_10.png]

**Figure 11.** The box plots show the relationship between the error of the position (y-axis) and the true distance between road damage and [PITH_FULL_IMAGE:figures/full_fig_p016_11.png]

Original abstract

City administrations increasingly rely on comprehensive databases and urban digital twins of city assets, such as traffic signs and trees, as well as incidents like graffiti or road damage, to maintain an effective overview of urban conditions. Digitization has increased the demand for continuously updated spatial datasets, yet current data acquisition and maintenance processes still involve considerable manual effort, posing significant scalability challenges. This paper introduces MapAnything, a novel geo-localization framework that automates the spatial mapping of urban objects and incidents from a single monocular image. By leveraging advanced Metric Depth Estimation models, MapAnything accurately calculates object geocoordinates, converting 2D image data into valuable 3D spatial information. The methodology integrates the estimated camera-to-object distance with geometric principles and known camera specifications. We present a detailed validation of the framework, comparing its distance-estimation accuracy against high-precision LiDAR point clouds in complex urban environments. Our evaluation provides a granular analysis of spatial performance across various distance intervals and semantic areas, such as roads and vegetation. Finally, we demonstrate the framework's practical efficacy through specific use cases, including mapping traffic signs and road pavement damage, and provide recommendations for its integration into automated urban inventory systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

MapAnything applies existing monocular metric depth models to urban asset geo-localization and validates distances against LiDAR with distance-bin breakdowns, but the practical accuracy in messy city scenes is still the load-bearing question.

MapAnything takes monocular metric depth estimators, combines the output distances with camera intrinsics and geometry, and turns single images into geocoordinates for things like traffic signs or pavement damage. It then checks those coordinates against LiDAR point clouds in real urban settings and breaks the errors down by distance interval and semantic class such as roads or vegetation. That granular reporting is the part that actually adds value; most papers stop at overall averages, so seeing performance decay with range or across object types helps judge where the pipeline might be usable. The use-case examples for inventory mapping also show they kept deployment in mind rather than just running benchmarks.

The central weakness is still the depth estimates themselves. Urban scenes bring occlusions, specular surfaces, and lighting shifts that monocular models handle unevenly, and even small relative depth errors grow with distance. The abstract states the framework “accurately calculates” coordinates and presents LiDAR validation, yet without the concrete error distributions, outlier handling, or exclusion rules it is hard to tell whether the measured errors fall inside the tolerances needed for city asset work. If the full results show consistent meter-level accuracy at typical ranges, the claims hold; if not, the practical payoff shrinks.

This is for computer-vision researchers who want an applied example of depth-based localization and for practitioners building digital-twin pipelines who need a concrete validation protocol. A reader looking for a ready framework description plus real-data error analysis would get something out of it. The work deserves a serious referee because it supplies a focused, reproducible evaluation on city data even though the underlying depth models are not new. I would send it out for review; the validation section could use tightening but the overall framing is worth checking.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MapAnything, a framework for 3D urban asset localization from monocular images using metric depth estimation models. It claims to accurately calculate object geocoordinates by combining estimated depths with known camera intrinsics and geometric principles. The paper provides a validation study comparing distance estimates to LiDAR point clouds in complex urban environments, with granular analysis across distance intervals and semantic classes, and demonstrates use cases for mapping traffic signs and road damage.

Significance. If the accuracy claims hold under the reported conditions, this work could enable scalable, low-cost 3D mapping of urban assets using standard cameras, addressing scalability challenges in maintaining urban digital twins. The evaluation against external LiDAR data and focus on practical use cases strengthen its potential impact in computer vision applications for smart cities.

major comments (2)

§4.2 (LiDAR validation): The reported distance errors are presented per bin and semantic class, but the manuscript does not quantify how these translate to geocoordinate error in meters or whether they remain below the tolerance needed for the traffic-sign and pavement-damage use cases asserted in §5.1.
§3.2 (depth-to-geocoordinate pipeline): The fusion of monocular metric depth with ray geometry and known intrinsics/extrinsics is described at a high level; no explicit error-propagation analysis or sensitivity study is provided for urban factors (occlusion, specular surfaces) that the abstract identifies as the target domain.

minor comments (2)

Abstract: The phrase 'advanced Metric Depth Estimation models' is used without naming the specific models or indicating whether they are used off-the-shelf or fine-tuned on urban data.
Table 2: Column headers for semantic classes should explicitly state the number of samples per bin to allow readers to assess statistical reliability of the per-class results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of practical applicability and methodological rigor that we address below. We have revised the manuscript accordingly where feasible.

Referee: §4.2 (LiDAR validation): The reported distance errors are presented per bin and semantic class, but the manuscript does not quantify how these translate to geocoordinate error in meters or whether they remain below the tolerance needed for the traffic-sign and pavement-damage use cases asserted in §5.1.

Authors: We agree that connecting distance errors to geocoordinate accuracy and use-case tolerances strengthens the validation. In the revised manuscript, we have added explicit propagation calculations in §4.2 that convert binned distance errors to approximate geocoordinate errors (lateral and depth components) using the camera intrinsics and typical viewing angles. For the §5.1 use cases, we now include a direct comparison: our median errors remain below 1.5 m for distances under 15 m, which aligns with common tolerances for traffic-sign inventory (0.5–2 m) and pavement-damage mapping; we note larger errors beyond 25 m and suggest multi-frame fusion as mitigation.

revision: yes

Referee: §3.2 (depth-to-geocoordinate pipeline): The fusion of monocular metric depth with ray geometry and known intrinsics/extrinsics is described at a high level; no explicit error-propagation analysis or sensitivity study is provided for urban factors (occlusion, specular surfaces) that the abstract identifies as the target domain.

Authors: The pipeline is formalized via the ray-casting equations in §3.2 that combine metric depth with known intrinsics and extrinsics. We acknowledge the absence of a dedicated error-propagation or sensitivity analysis. The revised version adds a dedicated paragraph in §3.2 that qualitatively discusses error contributions from occlusion and specular surfaces, drawing on failure cases observed in our urban LiDAR validation set. A full quantitative sensitivity study would require new controlled experiments beyond the current scope; we have therefore marked this as future work while providing the initial analysis requested.

revision: partial
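
To make the idea of converting a distance error into lateral and depth components concrete, here is a small first-order sketch. It is not taken from the manuscript: the function name, the decomposition relative to the camera's optical axis, and all numbers are assumptions chosen for illustration, and it only models the error along the viewing ray (angular and GPS errors are ignored).

```python
import math

def position_error_components(distance_m, rel_error, yaw_offset_deg, pitch_deg=0.0):
    """Split a depth error into forward and lateral ground components (metres).

    distance_m      camera-to-object distance along the ray
    rel_error       relative depth error, e.g. 0.1 for 10 %
    yaw_offset_deg  angle between the ray and the optical axis
    pitch_deg       ray elevation; only the horizontal part shifts the map position
    """
    along_ray = rel_error * distance_m
    ground = along_ray * math.cos(math.radians(pitch_deg))
    forward = ground * math.cos(math.radians(yaw_offset_deg))
    lateral = ground * math.sin(math.radians(yaw_offset_deg))
    return forward, lateral

# The same 10 % relative error costs three times as much at 30 m as at 10 m.
for d in (10.0, 30.0):
    fwd, lat = position_error_components(d, 0.10, yaw_offset_deg=30.0)
    print(d, round(fwd, 3), round(lat, 3))
```

Because the along-ray error scales linearly with distance for a fixed relative error, a model that looks acceptable in close-range bins can still exceed an inventory tolerance in the far bins.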

Circularity Check

0 steps flagged

No circularity: evaluation rests on external LiDAR ground truth

The paper introduces MapAnything as a framework that applies off-the-shelf metric depth estimation models, combines their camera-to-object distance outputs with known camera intrinsics/extrinsics and ray geometry, and validates the resulting geocoordinates against independent high-precision LiDAR point clouds. The abstract and methodology description contain no fitted parameters that are later renamed as predictions, no self-definitional equations, and no load-bearing uniqueness theorems imported from prior self-citations. Granular performance analysis across distance bins and semantic classes is presented as direct empirical comparison rather than any reduction to the input assumptions by construction. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the reliability of pre-trained metric depth models in urban scenes and the accuracy of camera intrinsic parameters; no new free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Metric depth estimation models yield sufficiently accurate camera-to-object distances for geocoordinate computation in complex urban environments.
Invoked when the framework integrates estimated distances with geometric principles and camera specifications.

pith-pipeline@v0.9.0 · 5758 in / 1182 out tokens · 36277 ms · 2026-05-21T21:51:58.798404+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We set up the following equations following the pinhole camera model... $d_{\mathrm{horizontal}} = d \cos(\theta_{\mathrm{pitch,eff}})$ ... $\mathrm{lat}_{\mathrm{object}} = \mathrm{lat}_{\mathrm{camera}} + (180/\pi)\cdot\Delta\mathrm{lat}$
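
The two equations quoted in this passage can be placed in a small end-to-end sketch. Only the horizontal-distance projection and the latitude update come from the quoted text; the definition of the latitude offset as a northward arc on a spherical Earth, the longitude formula, the bearing convention (radians clockwise from north), and every name in the code are assumptions filled in for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumption: spherical Earth

def object_latlon(lat_cam, lon_cam, d, pitch_eff_rad, bearing_rad):
    """Estimate object (lat, lon) in degrees from camera position and a depth estimate."""
    # Quoted equation: project the ray distance onto the ground plane.
    d_horizontal = d * math.cos(pitch_eff_rad)

    # Assumed: northward and eastward displacement expressed as angles in radians.
    dlat = d_horizontal * math.cos(bearing_rad) / EARTH_RADIUS_M
    dlon = d_horizontal * math.sin(bearing_rad) / (
        EARTH_RADIUS_M * math.cos(math.radians(lat_cam)))

    # Quoted equation for latitude; longitude handled analogously (assumed).
    lat_obj = lat_cam + (180.0 / math.pi) * dlat
    lon_obj = lon_cam + (180.0 / math.pi) * dlon
    return lat_obj, lon_obj

# Hypothetical camera in Leipzig looking north-east at an object 20 m away.
print(object_latlon(51.34, 12.37, 20.0, math.radians(5.0), math.radians(45.0)))
```

The small-offset approximation used here is reasonable for the tens of metres involved in street-level imagery; a geodesic library would be the safer choice over longer distances or near the poles.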

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

[1] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint, 2023.

[2] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint, 2024.

[3] Dominik Boller, Matthew Moy De Vitry, Jan D. Wegner, and João P. Leitão. Automated localization of urban drainage infrastructure from public-access street-level images. Urban Water Journal, 16, 2019.

[4] Steve Branson, Jan Dirk Wegner, David Hall, Nico Lang, Konrad Schindler, and Pietro Perona. From Google Maps to a fine-grained catalog of street trees. ISPRS Journal of Photogrammetry and Remote Sensing, 135, 2018.

[5] Yohann Cabon, Naila Murray, and M. Humenberger. Virtual kitti 2. arXiv preprint, 2020.

[6] Andrew Campbell, Alan Both, and Qian (Chayn) Sun. Detecting and mapping traffic signs from google street view images using deep learning and gis. Computers, Environment and Urban Systems, 77, 2019.

[7] Liang Cheng, Yi Yuan, Nan Xia, Song Chen, Yanming Chen, Kang Yang, Lei Ma, and Manchun Li. Crowd-sourced pictures geo-localization method based on street view images and 3D reconstruction. ISPRS Journal of Photogrammetry and Remote Sensing, 141, 2018.

[8] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Scharwächter, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset. In CVPR Workshop on the Future of Datasets in Vision, volume 2, 2015.
[9] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Neural Information Processing Systems, 2014.

[10] Ravi Garg, B. V. Kumar, G. Carneiro, and Ian D. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In ECCV, 2016.

[11] Jakob Geyer, Yohannes Kassahun, Mentar Mahmudi, Xavier Ricou, Rupesh Durgesh, Andrew S Chung, Lorenz Hauswald, Viet Hoang Pham, Maximilian Mühlegg, Sebastian Dorn, et al. A2d2: Audi autonomous driving dataset. arXiv preprint, 2020.

[12] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. CVPR, 2016.

[13] Ramya Hebbalaguppe, Gaurav Garg, Ehtesham Hassan, Hiranmay Ghosh, and Ankit Verma. Telecom inventory management via object recognition and localisation on google street view images. In WACV, 2017.

[14] Mu Hu, Wei Yin, China. Xiaoyan Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. TPAMI, 46, 2024.

[15] Vladimir A. Krylov and Rozenn Dahyot. Object Geolocation from Crowdsourced Street Level Imagery. In Carlos Alzate, Anna Monreale, Haytham Assem, Albert Bifet, Teodora Sandra Buda, Bora Caglayan, Brett Drury, Eva García-Martín, Ricard Gavaldà, Irena Koprinska, Stefan Kramer, Niklas Lavesson, Michael Madden, Ian Molloy, Maria-Irina Nicolae, and M... (2019)

[16] Vladimir A. Krylov, Eamonn Kenny, and Rozenn Dahyot. Automatic discovery and geotagging of objects from street view imagery. Remote Sensing, 10, 2018.
[17] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In AAAI, 2023.

[18] Zhen Li, Yuliang Gao, Qingqing Hong, Yuren Du, Seiichi Serikawa, and Lifeng Zhang. Keypoint3d: Keypoint-based and anchor-free 3d object detection for autonomous driving with monocular vision. Remote Sensing, 2023.

[19] Zhiwei Lin, Zhe Liu, Zhongyu Xia, Xinhao Wang, Yongtao Wang, Shengxiang Qi, Yang Dong, Nan Dong, Le Zhang, and Ce Zhu. Rcbevdet: Radar-camera fusion in bird’s eye view for 3d object detection. In CVPR, 2024.

[20] Armin Masoumian, Hatem A. Rashwan, Julián Cristiano, M. Salman Asif, and Domenec Puig. Monocular depth estimation using deep learning: A review. Sensors, 22, 2022.

[21] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In ICCV, 2017.

[22] Tessio Novack, Leonard Vorbeck, Heinrich Lorei, and Alexander Zipf. Towards detecting building facades with graffiti artwork based on street view images. ISPRS International Journal of Geo-Information, 9, 2020.

[23] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. CVPR, 2024.

[24] Charles R. Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. Frustum pointnets for 3d object detection from rgb-d data. CVPR, 2018.
[25] Zengyi Qin, Jinglu Wang, and Yan Lu. Monogrnet: A geometric reasoning network for monocular 3d object localization. In AAAI, 2018.

[26] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 2020.

[27] Vinay Vishnani, Anikait Adhya, Chinmay Bajpai, Priya Chimurkar, and Kumar Khandagle. Manhole detection using image processing on google street view imagery. In Third International Conference on Smart Systems and Inventive Technology (ICSSIT), 2020.

[28] Jan D. Wegner, Steve Branson, David Hall, Konrad Schindler, and Pietro Perona. Cataloging public objects using aerial and street-level images - urban trees. In CVPR, Las Vegas, NV, USA, 2016.

[29] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In CVPR, 2024.

[30] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d: Towards zero-shot metric 3d prediction from a single image. ICCV, 2023.

[31] Gang Zhang, Junnan Chen, Guohuan Gao, Jianmin Li, Si Liu, and Xiaolin Hu. Safdnet: A simple and effective network for fully sparse 3d object detection. In CVPR, 2024.

[32] Ruijie Zhu, Chuxin Wang, Ziyang Song, Li Liu, Tianzhu Zhang, and Yongdong Zhang. Scaledepth: Decomposing metric depth estimation into scale prediction and relative depth estimation. arXiv preprint, 2024.

Supplementary Material
[33] Semantic Groups for Test Images. Figure 5: Semantic Groups for twelve of the sample images. Original images from Cyclomedia.

[34] All ARE plots for one Example Image. Figure 6: Absolute Relative Error (ARE) for different depth estimation models on the same example image. Panels: (a) DepthAnything-S, (b) DepthAnything-B, (c) DepthAnything-L, (d) UniDepth-S, (e) UniDepth-B, ... Scale: Error Magnitude, 0.0 to 1.0.

[35] All MAE plots for one Example Image. Figure 7: Mean Absolute Error (MAE) for different depth estimation models on the same example image. Panels: (a) DepthAnything-S, (b) DepthAnything-B, (c) DepthAnything-L, (d) UniDepth-S, (e) UniDepth-B, (f) UniDepth-L, ... Scale: Error Magnitude, 0 to 30.

[36] Traffic Signs: Image Coverage in Annotated Area. Panels: (a) Road Network, (b) Road covered by Cyclomedia Images, (c) Road covered by Mapillary Images. Legend: Image Coverage Area, Covered Road Segments, Camera Points. Figure 8: The road network according to OpenStreetMap in the area where signs were annotated (a) and the ...

[37]–[40] Traffic Signs: Segmentation. We chose three models to segment the traffic signs from the images for comparison: a U-Net model trained on the A2D2 dataset, a SegFormer model fine-tuned on the A2D2 dataset, and a Mask2Former model trained on the Mapillary Vistas dataset. We annotated 100 images of our Cyclomedia dataset and calculated the Intersection over Union (IoU) between the annotated and the predicted masks for each image. The average over all annotated images is shown in Table 5. The models trained on the A2D2 dataset perform significantly better. We as...

[41] Traffic Signs: Deviation Box Plots. Panels: (a) DepthAnything-B, (b) DepthPro (with cam), ... x-axis: Distance to camera (m), bins 5-10, 10-15, 15-20, 20-25, 25-30. y-axis: Error in predicted position (m), 0 to 10.

[42] Road Damages: Deviation Box Plots. Figure 11: The box plots show the relationship between the error of the position (y-axis) and the true distance between road damage and camera (x-axis). Panels: (a) DepthAnything-B, (b) DepthPro (with cam), (c) Metric3D-ViT, ... x-axis: Distance to camera (m), bins 2-4, 4-6, 6-8, 8-10. y-axis: Error in predicted position (m), 0 to 10.
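
The supplementary excerpt on traffic-sign segmentation describes computing the Intersection over Union (IoU) per image and averaging over the annotated set. A minimal sketch of that calculation is given below; the mask representation (nested lists of 0/1), the function names, and the convention that two empty masks score 1.0 are assumptions for this example, not details from the paper.

```python
def mask_iou(pred, truth):
    """IoU of two equally sized binary masks given as nested lists of 0/1."""
    inter = union = 0
    for prow, trow in zip(pred, truth):
        for p, t in zip(prow, trow):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks are treated as a perfect match.
    return inter / union if union else 1.0

def mean_iou(pairs):
    """Average IoU over (predicted_mask, annotated_mask) pairs."""
    scores = [mask_iou(p, t) for p, t in pairs]
    return sum(scores) / len(scores)

# Two toy 2x2 examples: one half-overlapping prediction, one exact match.
pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),
    ([[0, 1], [0, 1]], [[0, 1], [0, 1]]),
]
print(mean_iou(pairs))
```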

[1] [1]

Zoedepth: Zero-shot transfer by com- bining relative and metric depth.arXiv preprint, 2023

Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias M ¨uller. Zoedepth: Zero-shot transfer by com- bining relative and metric depth.arXiv preprint, 2023. 3

work page 2023

[2] [2]

Depth pro: Sharp monocular metric depth in less than a second.arXiv preprint, 2024

Aleksei Bochkovskii, Ama ¨el Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second.arXiv preprint, 2024. 1, 3, 4

work page 2024

[3] [3]

Wegner, and Jo˜ao P

Dominik Boller, Matthew Moy De Vitry, Jan D. Wegner, and Jo˜ao P. Leit ˜ao. Automated localization of urban drainage infrastructure from public-access street-level images.Urban Water Journal, 16, 2019. 1, 2

work page 2019

[4] [4]

From Google Maps to a fine-grained catalog of street trees.ISPRS Journal of Photogrammetry and Remote Sensing, 135, 2018

Steve Branson, Jan Dirk Wegner, David Hall, Nico Lang, Konrad Schindler, and Pietro Perona. From Google Maps to a fine-grained catalog of street trees.ISPRS Journal of Photogrammetry and Remote Sensing, 135, 2018. 1, 2

work page 2018

[5] [5]

Humenberger

Yohann Cabon, Naila Murray, and M. Humenberger. Virtual kitti 2.arXiv preprint, 2020. 4

work page 2020

[6] [6]

De- tecting and mapping traffic signs from google street view im- ages using deep learning and gis.Computers, Environment and Urban Systems, 77, 2019

Andrew Campbell, Alan Both, and Qian (Chayn) Sun. De- tecting and mapping traffic signs from google street view im- ages using deep learning and gis.Computers, Environment and Urban Systems, 77, 2019. 1, 2

work page 2019

[7] [7]

Crowd-sourced pic- tures geo-localization method based on street view images and 3D reconstruction.ISPRS Journal of Photogrammetry and Remote Sensing, 141, 2018

Liang Cheng, Yi Yuan, Nan Xia, Song Chen, Yanming Chen, Kang Yang, Lei Ma, and Manchun Li. Crowd-sourced pic- tures geo-localization method based on street view images and 3D reconstruction.ISPRS Journal of Photogrammetry and Remote Sensing, 141, 2018. 1, 2

work page 2018

[8] [8]

The cityscapes dataset

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Scharw¨achter, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset. InCVPR Workshop on the Future of Datasets in Vision, volume 2, 2015. 3

work page 2015

[9] [9]

Depth map prediction from a single image using a multi-scale deep net- work

David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep net- work. InNeural Information Processing Systems, 2014. 3

work page 2014

[10] [10]

Ravi Garg, B. V . Kumar, G. Carneiro, and Ian D. Reid. Un- supervised cnn for single view depth estimation: Geometry to the rescue. InECCV, 2016. 3

work page 2016

[11] Jakob Geyer, Yohannes Kassahun, Mentar Mahmudi, Xavier Ricou, Rupesh Durgesh, Andrew S Chung, Lorenz Hauswald, Viet Hoang Pham, Maximilian Mühlegg, Sebastian Dorn, et al. A2d2: Audi autonomous driving dataset. arXiv preprint, 2020.

[12] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. CVPR, 2016.

[13] Ramya Hebbalaguppe, Gaurav Garg, Ehtesham Hassan, Hiranmay Ghosh, and Ankit Verma. Telecom inventory management via object recognition and localisation on google street view images. In WACV, 2017.

[14] Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. TPAMI, 46, 2024.

[15] Vladimir A. Krylov and Rozenn Dahyot. Object Geolocation from Crowdsourced Street Level Imagery. In Carlos Alzate, Anna Monreale, Haytham Assem, Albert Bifet, Teodora Sandra Buda, Bora Caglayan, Brett Drury, Eva García-Martín, Ricard Gavaldà, Irena Koprinska, Stefan Kramer, Niklas Lavesson, Michael Madden, Ian Molloy, Maria-Irina Nicolae, and M... 2019.

[16] Vladimir A. Krylov, Eamonn Kenny, and Rozenn Dahyot. Automatic discovery and geotagging of objects from street view imagery. Remote Sensing, 10, 2018.

[17] Yinhao Li, Zheng Ge, Guanyi Yu, Jinrong Yang, Zengran Wang, Yukang Shi, Jianjian Sun, and Zeming Li. Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In AAAI, 2023.

[18] Zhen Li, Yuliang Gao, Qingqing Hong, Yuren Du, Seiichi Serikawa, and Lifeng Zhang. Keypoint3d: Keypoint-based and anchor-free 3d object detection for autonomous driving with monocular vision. Remote Sensing, 2023.

[19] Zhiwei Lin, Zhe Liu, Zhongyu Xia, Xinhao Wang, Yongtao Wang, Shengxiang Qi, Yang Dong, Nan Dong, Le Zhang, and Ce Zhu. Rcbevdet: Radar-camera fusion in bird’s eye view for 3d object detection. In CVPR, 2024.

[20] Armin Masoumian, Hatem A. Rashwan, Julián Cristiano, M. Salman Asif, and Domenec Puig. Monocular depth estimation using deep learning: A review. Sensors, 22, 2022.

[21] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In ICCV, 2017.

[22] Tessio Novack, Leonard Vorbeck, Heinrich Lorei, and Alexander Zipf. Towards detecting building facades with graffiti artwork based on street view images. ISPRS International Journal of Geo-Information, 9, 2020.

[23] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. CVPR, 2024.

[24] Charles R. Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. Frustum pointnets for 3d object detection from rgb-d data. CVPR, 2018.

[25] Zengyi Qin, Jinglu Wang, and Yan Lu. Monogrnet: A geometric reasoning network for monocular 3d object localization. In AAAI, 2018.

[26] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 2020.

[27] Vinay Vishnani, Anikait Adhya, Chinmay Bajpai, Priya Chimurkar, and Kumar Khandagle. Manhole detection using image processing on google street view imagery. In Third International Conference on Smart Systems and Inventive Technology (ICSSIT), 2020.

[28] Jan D. Wegner, Steve Branson, David Hall, Konrad Schindler, and Pietro Perona. Cataloging public objects using aerial and street-level images - urban trees. In CVPR, Las Vegas, NV, USA, 2016.

[29] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In CVPR, 2024.

[30] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d: Towards zero-shot metric 3d prediction from a single image. ICCV, 2023.

[31] Gang Zhang, Junnan Chen, Guohuan Gao, Jianmin Li, Si Liu, and Xiaolin Hu. Safdnet: A simple and effective network for fully sparse 3d object detection. In CVPR, 2024.

[32] Ruijie Zhu, Chuxin Wang, Ziyang Song, Li Liu, Tianzhu Zhang, and Yongdong Zhang. Scaledepth: Decomposing metric depth estimation into scale prediction and relative depth estimation. arXiv preprint, 2024.

Supplementary Material

Semantic Groups for Test Images

Figure 5. Semantic Groups for twelve of the sample images. Original images from Cyclomedia.

All ARE plots for one Example Image

[Figure: Absolute Relative Error (ARE) for different depth estimation models on the same example image. Panels: (a) DepthAnything-S, (b) DepthAnything-B, (c) DepthAnything-L, (d) UniDepth-S, (e) UniDepth-B, ... Scale: Error Magnitude, 0.0 to 1.0.]

All MAE plots for one Example Image

[Figure: Mean Absolute Error (MAE) for different depth estimation models on the same example image. Panels: (a) DepthAnything-S, (b) DepthAnything-B, (c) DepthAnything-L, (d) UniDepth-S, (e) UniDepth-B, (f) UniDepth-L, ... Scale: Error Magnitude, 0 to 30.]
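The two error maps plotted above can be illustrated with a minimal sketch. It assumes the standard per-pixel definitions (absolute error |prediction − reference| and absolute relative error |prediction − reference| / reference), which this section does not spell out. The names `error_maps` and `mean_valid`, the depth unit, and the convention that a reference depth of 0 marks a pixel without a measurement are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

DepthMap = Sequence[Sequence[float]]      # per-pixel depth (unit assumed to be metres)
ErrorMap = List[List[Optional[float]]]    # None where no reference depth exists


def error_maps(predicted: DepthMap, reference: DepthMap) -> Tuple[ErrorMap, ErrorMap]:
    """Return (absolute error map, absolute relative error map)."""
    abs_map: ErrorMap = []
    rel_map: ErrorMap = []
    for row_p, row_r in zip(predicted, reference):
        abs_row: List[Optional[float]] = []
        rel_row: List[Optional[float]] = []
        for p, r in zip(row_p, row_r):
            if r > 0:
                abs_row.append(abs(p - r))
                rel_row.append(abs(p - r) / r)
            else:
                # Assumption: non-positive reference depth means "no measurement".
                abs_row.append(None)
                rel_row.append(None)
        abs_map.append(abs_row)
        rel_map.append(rel_row)
    return abs_map, rel_map


def mean_valid(error_map: ErrorMap) -> float:
    """Average an error map over the pixels that have a reference depth."""
    values = [v for row in error_map for v in row if v is not None]
    if not values:
        raise ValueError("no valid pixels to average")
    return sum(values) / len(values)


predicted = [[10.0, 12.0, 5.0]]
reference = [[8.0, 16.0, 0.0]]  # last pixel has no reference measurement
abs_map, rel_map = error_maps(predicted, reference)
print(mean_valid(abs_map))  # (2.0 + 4.0) / 2 = 3.0
print(mean_valid(rel_map))  # (0.25 + 0.25) / 2 = 0.25
```

Keeping the per-pixel maps separate from the averaging step mirrors the figures, which show the error per pixel, while a single score per image only needs the mean over valid pixels.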

Traffic Signs: Image Coverage in Annotated Area

[Figure panels: (a) Road Network, (b) Road covered by Cyclomedia Images, (c) Road covered by Mapillary Images. Legend: Image Coverage Area, Covered Road Segments, Camera Points.]

Figure 8. The road network according to OpenStreetMap in the area where signs were annotated (a) and the ...

Traffic Signs: Segmentation

We chose three models to segment the traffic signs from the images for comparison:

1. a U-Net model trained on the A2D2 dataset,
2. a SegFormer model fine-tuned on the A2D2 dataset,
3. a Mask2Former model trained on the Mapillary Vistas dataset.

We annotated 100 images of our Cyclomedia dataset and calculated the Intersection over Union (IoU) between the annotated and the predicted masks for each image. The average over all annotated images is shown in Table 5. The models trained on the A2D2 dataset perform significantly better. We as...
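The per-image IoU and its average over the annotated images can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes binary masks stored as nested lists (1 for a traffic sign pixel, 0 for background), and the function names `mask_iou` and `mean_iou` as well as the handling of two empty masks are hypothetical choices.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # 2D binary mask: 1 = traffic sign pixel, 0 = background


def mask_iou(annotated: Mask, predicted: Mask) -> float:
    """Intersection over Union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for row_a, row_p in zip(annotated, predicted):
        for a, p in zip(row_a, row_p):
            intersection += 1 if (a and p) else 0
            union += 1 if (a or p) else 0
    # Assumption: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Average the per-image IoU over all (annotated, predicted) pairs."""
    scores = [mask_iou(a, p) for a, p in pairs]
    return sum(scores) / len(scores)


annotated = [[0, 1, 1],
             [0, 1, 1]]
predicted = [[0, 0, 1],
             [0, 1, 1]]
print(mask_iou(annotated, predicted))  # 3 shared pixels / 4 pixels in the union = 0.75
```

Averaging per-image scores, as described in the text, weights every image equally regardless of how many sign pixels it contains; pooling pixels over the whole dataset before dividing would give a different number.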

Traffic Signs: Deviation Box Plots

[Figure: box plots of the error in predicted position (m, y-axis, 0 to 10) against the distance to camera (m, x-axis, bins 5-10, 10-15, 15-20, 20-25, 25-30). Panels: (a) DepthAnything-B, (b) DepthPro (with cam), ...]

Road Damages: Deviation Box Plots

[Figure: box plots of the error in predicted position (m, y-axis, 0 to 10) against the distance to camera (m, x-axis, bins 2-4, 4-6, 6-8, 8-10). Panels: (a) DepthAnything-B, (b) DepthPro (with cam), (c) Metric3D-ViT, ...]

The box plots show the relationship between the error of the position (y-axis) and the true distance between road damage and camera (x-axis).