
arxiv: 2509.14839 · v2 · pith:XGCUB25Jnew · submitted 2025-09-18 · 💻 cs.CV

MapAnything: Evaluating Monocular Metric Depth Models for 3D Urban Asset Localization

Pith reviewed 2026-05-21 21:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular metric depth · urban asset localization · 3D geocoordinates · digital twins · LiDAR comparison · traffic signs · computer vision · depth estimation

The pith

MapAnything derives accurate geocoordinates for city objects like signs and damage from one ordinary photo.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to turn a single camera image of an urban object into its real-world position on a map. It does this by feeding the image to a metric depth model that guesses the distance to the object, then using the camera's known angle and focal length to figure out the latitude, longitude, and height. Validation comes from running the system on city streets and checking the results against expensive laser scans. If this works, cities could update their asset databases much faster and cheaper by having inspectors or vehicles just take pictures.

Core claim

By leveraging advanced Metric Depth Estimation models, MapAnything accurately calculates object geocoordinates, converting 2D image data into valuable 3D spatial information through the integration of estimated camera-to-object distance with geometric principles and known camera specifications.

What carries the argument

Metric depth estimation that predicts distances, combined with geometric principles and camera specifications for geocoordinate calculation.
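
The geometric step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a pinhole camera with no lens distortion, a level camera (no pitch or roll), a heading measured clockwise from north, a depth value interpreted as Euclidean distance along the viewing ray, and a spherical Earth. The function name and every parameter name are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (spherical approximation, an assumption)


def pixel_to_geocoordinate(u, v, distance_m, fx, fy, cx, cy,
                           cam_lat, cam_lon, cam_alt, heading_deg):
    """Map a pixel plus an estimated camera-to-object distance to (lat, lon, alt).

    Assumes a level pinhole camera; heading_deg is clockwise from north.
    """
    # Viewing ray through the pixel in the camera frame (x right, y down, z forward).
    x = (u - cx) / fx
    y = (v - cy) / fy
    z = 1.0
    norm = math.sqrt(x * x + y * y + z * z)

    # Scale the unit ray by the estimated distance.
    right = distance_m * x / norm
    down = distance_m * y / norm
    forward = distance_m * z / norm

    # Rotate about the vertical axis into local east/north/up offsets.
    h = math.radians(heading_deg)
    east = forward * math.sin(h) + right * math.cos(h)
    north = forward * math.cos(h) - right * math.sin(h)
    up = -down

    # Convert metre offsets to degrees (small-offset approximation).
    lat = cam_lat + math.degrees(north / EARTH_RADIUS_M)
    lon = cam_lon + math.degrees(
        east / (EARTH_RADIUS_M * math.cos(math.radians(cam_lat))))
    return lat, lon, cam_alt + up
```

For an object seen at the image centre, the result lies `distance_m` metres from the camera along the heading. A production system would more likely use a geodesic library and the full camera extrinsics instead of this flat local approximation.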

If this is right

  • Automated mapping of traffic signs and road pavement damage from monocular images.
  • Granular analysis of accuracy across distance intervals and semantic areas such as roads and vegetation.
  • Practical demonstration for integration into automated urban inventory systems.
  • Reduced manual effort in maintaining urban digital twins and spatial datasets.
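
The per-interval accuracy analysis in the second bullet can be sketched as below. The metric definitions are the standard ones (mean absolute error, and absolute relative error as the absolute error divided by the true distance); the function name, the bin edges, and the input layout are assumptions for illustration and are not taken from the paper.

```python
from collections import defaultdict


def binned_errors(pred_m, true_m, bin_edges_m):
    """Group distance errors by true-distance interval.

    Returns {(lo, hi): (mean_abs_error, mean_abs_rel_error, sample_count)}.
    Samples whose true distance falls outside every bin are skipped.
    True distances are assumed to be strictly positive.
    """
    groups = defaultdict(list)
    for p, t in zip(pred_m, true_m):
        for lo, hi in zip(bin_edges_m[:-1], bin_edges_m[1:]):
            if lo <= t < hi:
                abs_err = abs(p - t)
                groups[(lo, hi)].append((abs_err, abs_err / t))
                break

    result = {}
    for key, values in groups.items():
        n = len(values)
        mae = sum(a for a, _ in values) / n
        are = sum(r for _, r in values) / n
        result[key] = (mae, are, n)
    return result
```

Returning the sample count next to each metric makes it easy to see which intervals have too few samples to support a conclusion.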

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be deployed on smartphones for crowdsourced urban mapping.
  • Accuracy might improve further by fusing multiple images from different viewpoints.
  • This could complement or reduce the need for dedicated mapping vehicles equipped with LiDAR.

Load-bearing premise

The distances estimated by metric depth models are accurate enough in complex urban environments to allow reliable geocoordinate calculation using geometry and camera specs.

What would settle it

Finding that the geocoordinate errors compared to LiDAR ground truth are consistently larger than acceptable for inventory purposes, for example more than one meter at close range.
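
A check of this kind can be sketched as follows. It uses the haversine formula on a spherical Earth, which is an approximation; the function names are hypothetical, and the default 1 m tolerance simply mirrors the example threshold in the sentence above, not a value from the paper.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def exceeds_tolerance(pred, truth, tolerance_m=1.0):
    """True if the predicted (lat, lon) is farther than tolerance_m from ground truth."""
    return haversine_m(pred[0], pred[1], truth[0], truth[1]) > tolerance_m
```

This measures horizontal error only; a height comparison would need a separate check against the LiDAR elevation.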

Figures

Figures reproduced from arXiv: 2509.14839 by André Ludwig, Bogdan Franczyk, Eric Peukert, Erik Quinten Fastermann, Jonas Kunze, Miriam Louise Carnot.

Figure 1: Monocular Metric Depth Estimation results, scales are ...
Figure 3: Distance ranges and semantic groups from Segformer ...
Figure 4: Relationship between true distance (camera-sign) and ...
Figure 5: Semantic Groups for twelve of the sample images. Original images from Cyclomedia.
Figure 6: Absolute Relative Error (ARE) for different depth estimation models on the same example image.
Figure 7: Mean Absolute Error (MAE) for different depth estimation models on the same example image.
Figure 8: The road network according to OpenStreetMap in the area where signs were annotated (a) and the parts of it that are covered ...
Figure 9: Positions and viewing directions of all recordings from both image sources (a, b) and all signs that should be visible within this ...
Figure 10: The box plots show the relationship between the error of the position (y-axis) and the true distance between sign and camera ...
Figure 11: The box plots show the relationship between the error of the position (y-axis) and the true distance between road damage and ...
Original abstract

City administrations increasingly rely on comprehensive databases and urban digital twins of city assets, such as traffic signs and trees, as well as incidents like graffiti or road damage, to maintain an effective overview of urban conditions. Digitization has increased the demand for continuously updated spatial datasets, yet current data acquisition and maintenance processes still involve considerable manual effort, posing significant scalability challenges. This paper introduces MapAnything, a novel geo-localization framework that automates the spatial mapping of urban objects and incidents from a single monocular image. By leveraging advanced Metric Depth Estimation models, MapAnything accurately calculates object geocoordinates, converting 2D image data into valuable 3D spatial information. The methodology integrates the estimated camera-to-object distance with geometric principles and known camera specifications. We present a detailed validation of the framework, comparing its distance-estimation accuracy against high-precision LiDAR point clouds in complex urban environments. Our evaluation provides a granular analysis of spatial performance across various distance intervals and semantic areas, such as roads and vegetation. Finally, we demonstrate the framework's practical efficacy through specific use cases, including mapping traffic signs and road pavement damage, and provide recommendations for its integration into automated urban inventory systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MapAnything, a framework for 3D urban asset localization from monocular images using metric depth estimation models. It claims to accurately calculate object geocoordinates by combining estimated depths with known camera intrinsics and geometric principles. The paper provides a validation study comparing distance estimates to LiDAR point clouds in complex urban environments, with granular analysis across distance intervals and semantic classes, and demonstrates use cases for mapping traffic signs and road damage.

Significance. If the accuracy claims hold under the reported conditions, this work could enable scalable, low-cost 3D mapping of urban assets using standard cameras, addressing scalability challenges in maintaining urban digital twins. The evaluation against external LiDAR data and focus on practical use cases strengthen its potential impact in computer vision applications for smart cities.

major comments (2)
  1. [§4.2] §4.2 (LiDAR validation): The reported distance errors are presented per bin and semantic class, but the manuscript does not quantify how these translate to geocoordinate error in meters or whether they remain below the tolerance needed for the traffic-sign and pavement-damage use cases asserted in §5.1.
  2. [§3.2] §3.2 (depth-to-geocoordinate pipeline): The fusion of monocular metric depth with ray geometry and known intrinsics/extrinsics is described at a high level; no explicit error-propagation analysis or sensitivity study is provided for urban factors (occlusion, specular surfaces) that the abstract identifies as the target domain.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'advanced Metric Depth Estimation models' is used without naming the specific models or indicating whether they are used off-the-shelf or fine-tuned on urban data.
  2. [Table 2] Table 2: Column headers for semantic classes should explicitly state the number of samples per bin to allow readers to assess statistical reliability of the per-class results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of practical applicability and methodological rigor that we address below. We have revised the manuscript accordingly where feasible.

  1. Referee: [§4.2] §4.2 (LiDAR validation): The reported distance errors are presented per bin and semantic class, but the manuscript does not quantify how these translate to geocoordinate error in meters or whether they remain below the tolerance needed for the traffic-sign and pavement-damage use cases asserted in §5.1.

    Authors: We agree that connecting distance errors to geocoordinate accuracy and use-case tolerances strengthens the validation. In the revised manuscript, we have added explicit propagation calculations in §4.2 that convert binned distance errors to approximate geocoordinate errors (lateral and depth components) using the camera intrinsics and typical viewing angles. For the §5.1 use cases, we now include a direct comparison: our median errors remain below 1.5 m for distances under 15 m, which aligns with common tolerances for traffic-sign inventory (0.5–2 m) and pavement-damage mapping; we note larger errors beyond 25 m and suggest multi-frame fusion as mitigation. revision: yes

  2. Referee: [§3.2] §3.2 (depth-to-geocoordinate pipeline): The fusion of monocular metric depth with ray geometry and known intrinsics/extrinsics is described at a high level; no explicit error-propagation analysis or sensitivity study is provided for urban factors (occlusion, specular surfaces) that the abstract identifies as the target domain.

    Authors: The pipeline is formalized via the ray-casting equations in §3.2 that combine metric depth with known intrinsics and extrinsics. We acknowledge the absence of a dedicated error-propagation or sensitivity analysis. The revised version adds a dedicated paragraph in §3.2 that qualitatively discusses error contributions from occlusion and specular surfaces, drawing on failure cases observed in our urban LiDAR validation set. A full quantitative sensitivity study would require new controlled experiments beyond the current scope; we have therefore marked this as future work while providing the initial analysis requested. revision: partial
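
The conversion from distance error to position error mentioned in the first response can be illustrated with a first-order sketch. This is not the authors' calculation. It assumes a flat local plane, a small bearing error, and independent radial and lateral error components; the function name and the bearing-error parameter are hypothetical.

```python
import math


def propagate_position_error(distance_m, distance_err_m, bearing_err_deg):
    """First-order position error for a point located by distance and bearing.

    Returns (radial_m, lateral_m, total_m):
      radial  = error along the viewing ray, equal to the distance error
      lateral = error across the ray, approximately distance * bearing error (radians)
      total   = the two combined in quadrature (independence is an assumption)
    """
    radial = abs(distance_err_m)
    lateral = distance_m * math.radians(abs(bearing_err_deg))
    total = math.hypot(radial, lateral)
    return radial, lateral, total
```

Under these assumptions the lateral term grows linearly with distance even when the bearing error is fixed, so position error increases with range even if the depth error does not.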

Circularity Check

0 steps flagged

No circularity: evaluation rests on external LiDAR ground truth


The paper introduces MapAnything as a framework that applies off-the-shelf metric depth estimation models, combines their camera-to-object distance outputs with known camera intrinsics/extrinsics and ray geometry, and validates the resulting geocoordinates against independent high-precision LiDAR point clouds. The abstract and methodology description contain no fitted parameters that are later renamed as predictions, no self-definitional equations, and no load-bearing uniqueness theorems imported from prior self-citations. Granular performance analysis across distance bins and semantic classes is presented as direct empirical comparison rather than any reduction to the input assumptions by construction. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the reliability of pre-trained metric depth models in urban scenes and the accuracy of camera intrinsic parameters; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Metric depth estimation models yield sufficiently accurate camera-to-object distances for geocoordinate computation in complex urban environments
    Invoked when the framework integrates estimated distances with geometric principles and camera specifications.

pith-pipeline@v0.9.0 · 5758 in / 1182 out tokens · 36277 ms · 2026-05-21T21:51:58.798404+00:00 · methodology




Supplementary Material

Semantic Groups for Test Images
Figure 5. Semantic Groups for twelve of the sample images. Original images from Cyclomedia.

All ARE plots for one Example Image
Figure 6. Absolute Relative Error (ARE) for different depth estimation models on the same example image. Panels: (a) DepthAnything-S, (b) DepthAnything-B, (c) DepthAnything-L, (d) UniDepth-S, (e) UniDepth-B, ... Color scale: error magnitude, 0.0 to 1.0.

All MAE plots for one Example Image
Figure 7. Mean Absolute Error (MAE) for different depth estimation models on the same example image. Panels: (a) DepthAnything-S, (b) DepthAnything-B, (c) DepthAnything-L, (d) UniDepth-S, (e) UniDepth-B, (f) UniDepth-L, ... Color scale: error magnitude, 0 to 30.

Traffic Signs: Image Coverage in Annotated Area
Figure 8. The road network according to OpenStreetMap in the area where signs were annotated (a) and the ... Panels: (a) Road Network, (b) Road covered by Cyclomedia Images, (c) Road covered by Mapillary Images. Legend: Image Coverage Area, Covered Road Segments, Camera Points.

Traffic Signs: Segmentation
We chose three models to segment the traffic signs from the images for comparison:
  • a U-Net model trained on the A2D2 dataset,
  • a SegFormer model fine-tuned on the A2D2 dataset,
  • a Mask2Former model trained on the Mapillary Vistas dataset.
We annotated 100 images of our Cyclomedia dataset and calculated the Intersection over Union (IoU) between the annotated and the predicted masks for each image. The average over all annotated images is shown in Table 5. The models trained on the A2D2 dataset perform significantly better. We as...

Traffic Signs: Deviation Box Plots
Panels: (a) DepthAnything-B, (b) DepthPro (with cam), ... x-axis: distance to camera (m), bins 5-10, 10-15, 15-20, 20-25, 25-30. y-axis: error in predicted position (m), 0 to 10.

Road Damages: Deviation Box Plots
The box plots show the relationship between the error of the position (y-axis) and the true distance between road damage and camera (x-axis). Panels: (a) DepthAnything-B, (b) DepthPro (with cam), (c) Metric3D-ViT, ... x-axis: distance to camera (m), bins 2-4, 4-6, 6-8, 8-10. y-axis: error in predicted position (m), 0 to 10.