arxiv: 2606.29716 · v1 · pith:ZRI3VHGOnew · submitted 2026-06-29 · 💻 cs.CV

AerialMetric: Benchmarking and Adapting UAV Monocular Metric Depth Estimation in the Real World

Pith reviewed 2026-06-30 06:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular metric depth estimation · UAV aerial imagery · benchmark dataset · domain gap · fine-tuning adaptation · image-depth pairs · viewpoint evaluation · computer vision

The pith

AerialMetric supplies 68K image-depth pairs that let fine-tuned models close the domain gap and reach state-of-the-art metric depth accuracy from UAV viewpoints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Models trained on ground-level scenes produce large errors on aerial UAV images because of shifts in viewpoint, altitude, and scale. The paper supplies AerialMetric, a dataset of 52K real-world and 16K synthetic image-depth pairs drawn from photogrammetry, controlled flights, synthetic scenes, and in-the-wild sources, each carrying reliable metric ground truth. Systematic tests quantify how viewpoint, altitude, and camera parameters degrade existing models. Fine-tuning representative metric depth networks on the new data produces leading performance across the four aerial subsets.

Core claim

AerialMetric provides 52K real and 16K synthetic image-depth pairs with metric ground truth across four complementary subsets that jointly cover photogrammetry data, controlled aerial acquisition, photorealistic synthetic scenes, and in-the-wild imagery. Evaluation of existing models under aerial conditions reveals the size of the domain gap and the separate effects of viewpoint, altitude, and camera parameters. Fine-tuning representative models on the dataset establishes a comprehensive aerial benchmark and delivers state-of-the-art metric depth performance on diverse UAV imagery.

What carries the argument

The AerialMetric dataset of 68K image-depth pairs with reliable metric ground truth collected under UAV viewpoints.

If this is right

  • Existing models exhibit clear performance drops when applied to aerial viewpoints.
  • Viewpoint, altitude, and camera parameters each measurably affect metric depth accuracy.
  • Fine-tuning on AerialMetric creates a new public benchmark for aerial monocular metric depth.
  • Adapted models reach state-of-the-art results across the four aerial subsets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • UAV navigation and mapping systems could adopt the fine-tuned models for more reliable obstacle avoidance and terrain reconstruction.
  • The same real-plus-synthetic collection strategy may transfer to other robotics settings that face large viewpoint shifts.
  • Public release of the pairs, code, and weights could accelerate metric depth work for satellite or underwater imagery.
  • Testing the adapted models on live UAV video with changing lighting or motion would reveal whether the benchmark gains survive dynamic flight.

Load-bearing premise

The image-depth pairs supply accurate and representative metric ground truth for real UAV operating conditions.

What would settle it

Independent real UAV flights with LiDAR-verified depths showing that fine-tuned models produce no accuracy gain over untuned baselines would falsify the adaptation claim.
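As a rough illustration of how such a check could be scored, the sketch below computes δ1 (the accuracy measure named in the Figure 1 caption, here with the conventional 1.25 threshold) and absolute relative error for an untuned baseline and a fine-tuned model against reference depths. The function names and every depth value are hypothetical placeholders chosen for illustration; nothing here is taken from the paper's code or results.

```python
def _valid_pairs(pred, gt):
    """Keep only pixels whose reference depth is positive (valid)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid reference depths")
    return pairs


def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold."""
    pairs = _valid_pairs(pred, gt)
    # A non-positive prediction is counted as a miss rather than skipped.
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(pairs)


def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


# Hypothetical depths in metres for four pixels (illustration only).
reference = [50.0, 80.0, 120.0, 100.0]   # e.g. LiDAR-verified depths
baseline = [20.0, 45.0, 60.0, 130.0]     # untuned model (made-up values)
fine_tuned = [48.0, 85.0, 110.0, 104.0]  # adapted model (made-up values)

for name, pred in (("baseline", baseline), ("fine-tuned", fine_tuned)):
    print(name, "delta1 =", delta1(pred, reference),
          "AbsRel =", round(abs_rel(pred, reference), 4))
```

Under this framing, the adaptation claim would be in doubt if the fine-tuned scores on independently verified flights were no better than the baseline scores.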

Figures

Figures reproduced from arXiv: 2606.29716 by Chuan Huang, Chuanyu Fu, Guanying Chen, Shuguang Cui, Xiaochun Cao, Yin Zou, Yuqi Zhang, Zhiyuan Yuan, Zhongqiang Song.

Figure 1: By benchmarking existing methods on the AerialMetric dataset, we identify a significant domain gap in aerial depth estimation (the value reported in the left figure is δ1; higher is better). Fine-tuning the model with our dataset resolves this issue, delivering robust metric depth estimation on both aerial and ground-level scenes.
Figure 2: Data collection pipeline of the proposed AerialMetric dataset. … capture in real-world scenes. AerialMetric-Wild consists of diverse in-the-wild aerial images collected from the Internet, serving as a challenging evaluation set for cross-domain generalization. Based on the proposed AerialMetric dataset, we conduct a comprehensive analysis of existing metric depth estimation methods under aerial settings. Our…
Figure 3: Visualization of the introduced AerialMetric dataset. … shift, we introduce AerialMetric, a large-scale, high-quality aerial depth dataset built through a hybrid construction pipeline that achieves geometric accuracy, parameter decoupling, and scene generalization. The dataset comprises four complementary components (see …
Figure 4: Qualitative comparison with representative methods on aerial scenes.
Figure 5: Radar summary of parameter robustness on AerialMetric-Decoupled. Larger radius indicates better performance under each decoupled setting.
Figure 7: Heatmap visualization of metric accuracy across altitudes and pitch angles. … aerial-specific camera factors, whereas prior foundation models yield irregular and contracted footprints, indicating severe sensitivity to these parameters.
Figure 8: Qualitative comparison of fine-tuning strategies.
Figure 1: Depth statistics of the proposed datasets. … reconstructed from a dense collection of 2,368 high-resolution frames (6000 × 4000 pixels). The reconstruction process involved an extensive 640-minute optimization via the DJI Terra [12] 3D engine to ensure a high-fidelity, geometrically consistent mesh. In terms of spatial resolution, the pipeline achieved an average Ground Sampling Distance (GSD) of 1.63 cm, ca…
Figure 2: Statistics of the AerialMetric-Oblique dataset. The left panel shows the image count across various discrete Field of View (FOV) angles. The right panel illustrates the distribution of camera pitch angles.
Figure 3: Visual overview of the AerialMetric-Oblique dataset.
Figure 4: Automated flight trajectories for the AerialMetric-Decoupled dataset. The overlapping central segments represent the primary decoupled flight paths, while the peripheral trajectories correspond to auxiliary oblique photography missions designed to capture multi-angle perspectives for high-fidelity mesh reconstruction. A.3 AerialMetric-Wild: Reliable Pseudo-Metric Labels for Open-World Evaluation. The Aerial…
Figure 5: Representative scenes from the AerialMetric-Wild dataset.
Figure 6: Photorealistic simulation environment in UE4. By streaming massive 3D tiles via Cesium and utilizing AirSim for UAV control, our pipeline facilitates the scalable generation of synthetic data with continuous, parameterized camera poses and accurate Z-buffer depth.
Figure 7: Google Earth Studio rendering pipeline for synthetic data generation.
Figure 8: Results on AerialMetric-Decoupled across FOV, pitch, and altitude.
Figure 9: Comprehensive qualitative comparison of 2D metric depth maps.
Figure 10: 3D point cloud comparison under UAV perspective.
Figure 11: Zero-shot generalization on unseen AerialMetric-Wild videos.
Original abstract

This paper addresses the problem of monocular metric depth estimation in aerial UAV imagery. Although recent data-driven methods have achieved remarkable progress in ground-level scenarios, models trained primarily on street-view and indoor datasets exhibit significant domain gaps when applied to aerial viewpoints. To tackle these challenges, we introduce AerialMetric, a benchmark dataset designed to evaluate and facilitate the adaptation of monocular metric depth estimation under UAV aerial viewpoints. The dataset consists of four complementary subsets collected from different sources, jointly covering real-world photogrammetry data, controlled aerial acquisition settings, photorealistic synthetic scenes, and in-the-wild Internet imagery. Totally, AerialMetric provides 52K real-world and 16K synthetic image-depth pairs with reliable metric ground truth. Based on this dataset, we conduct systematic evaluations of existing state-of-the-art models under aerial settings and investigate the impact of viewpoint, altitude, and camera parameters on metric depth prediction. In addition, by fine-tuning representative metric depth model on our dataset, we establish a comprehensive aerial benchmark and achieve state-of-the-art performance across diverse aerial imagery. Our dataset, code, and model weight are publicly available at https://kuieless.github.io/AerialMetric-ECCV2026-page/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces AerialMetric, a benchmark dataset for monocular metric depth estimation under UAV aerial viewpoints. It comprises four subsets (real photogrammetry, controlled aerial acquisition, photorealistic synthetic, and in-the-wild imagery) totaling 52K real and 16K synthetic image-depth pairs claimed to have reliable metric ground truth. The work evaluates existing SOTA models, analyzes effects of viewpoint/altitude/camera parameters, and reports that fine-tuning representative models on the dataset yields state-of-the-art performance across diverse aerial imagery, with dataset, code, and weights released publicly.

Significance. If the absolute metric scale of the real photogrammetry pairs is independently validated and representative of UAV conditions, the dataset would address a clear domain gap between ground-level and aerial depth estimation, providing a reproducible benchmark that enables systematic adaptation and evaluation. The public release of data, code, and weights is a clear strength supporting reproducibility.

major comments (1)
  1. [Dataset Construction] Dataset section (photogrammetry subset description): the assertion of 'reliable metric ground truth' for the 52K real-world pairs is load-bearing for the fine-tuning and SOTA claims, yet the manuscript provides no explicit description of absolute scale recovery (e.g., RTK-GPS fusion, known baselines, or barometric altitude) nor any error statistics against an external reference. Standard SfM/MVS pipelines recover depths only up to scale, so without this the reported domain-gap closure may optimize for pseudo-metric rather than true metric depth.
minor comments (1)
  1. [Abstract] Abstract: 'Totally, AerialMetric provides' should be rephrased to 'In total, AerialMetric provides' for standard academic English.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential value of AerialMetric as a benchmark. We respond point-by-point to the single major comment below.

Point-by-point responses
  1. Referee: [Dataset Construction] Dataset section (photogrammetry subset description): the assertion of 'reliable metric ground truth' for the 52K real-world pairs is load-bearing for the fine-tuning and SOTA claims, yet the manuscript provides no explicit description of absolute scale recovery (e.g., RTK-GPS fusion, known baselines, or barometric altitude) nor any error statistics against an external reference. Standard SfM/MVS pipelines recover depths only up to scale, so without this the reported domain-gap closure may optimize for pseudo-metric rather than true metric depth.

    Authors: We agree that the manuscript would be strengthened by an explicit description of absolute scale recovery for the photogrammetry subset. The 52K pairs were sourced from professional surveying pipelines that incorporate RTK-GPS, known camera intrinsics/extrinsics, and barometric altitude constraints to produce metric reconstructions; however, this process was summarized only briefly rather than detailed. In the revised manuscript we will add a dedicated paragraph (and, if space permits, a supplementary figure) describing the scale-recovery pipeline, the role of ground control points, and quantitative error statistics obtained by cross-validation against independent total-station measurements on a held-out subset of scenes. This addition will directly address the concern that the ground truth may be only pseudo-metric. revision: yes
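The scale concern at the centre of this exchange can be made concrete with a small sketch. It assumes a hypothetical prediction that is correct only up to a global factor of 0.5 and compares two scoring protocols: a strictly metric one, and a median-aligned one that rescales the prediction before scoring. The helper names and depth values are illustrative assumptions, not the paper's evaluation code.

```python
from statistics import median


def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt; assumes every gt value is positive."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def median_align(pred, gt):
    """Rescale pred by median(gt) / median(pred), as in scale-invariant evaluation."""
    scale = median(gt) / median(pred)
    return [scale * p for p in pred]


# Hypothetical reference depths in metres (illustration only).
reference = [40.0, 60.0, 80.0, 100.0]

# A reconstruction that is right up to scale: every depth is halved.
up_to_scale = [0.5 * g for g in reference]

metric_error = abs_rel(up_to_scale, reference)
aligned_error = abs_rel(median_align(up_to_scale, reference), reference)

print("metric AbsRel:", metric_error)           # large: the global scale is wrong
print("median-aligned AbsRel:", aligned_error)  # zero: alignment hides the scale error
```

The design point is that only the unaligned protocol penalizes a wrong global scale, which is why an external reference such as the total-station measurements mentioned in the rebuttal matters for a metric claim.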

Circularity Check

0 steps flagged

No circularity: empirical dataset and benchmarking contribution

Full rationale

The paper introduces AerialMetric as a new benchmark dataset with image-depth pairs claimed to have reliable metric ground truth, then evaluates existing models and fine-tunes them to report SOTA performance. No derivation chain, equations, fitted parameters presented as predictions, or self-citation load-bearing steps exist in the provided text. The contribution is data collection and empirical evaluation against external models and imagery; it does not reduce any result to its own inputs by construction. Self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5776 in / 1040 out tokens · 33380 ms · 2026-06-30T06:43:48.064866+00:00 · methodology

Reference graph

Works this paper leans on

89 extracted references · 17 canonical work pages · 6 internal anchors

  1. [1]

    117483 (2026)

    Barbato, F., Caligiuri, M., Zanuttigh, P.: Flyawarev2: A multimodal cross-domain uav dataset for urban scene understanding. Signal Processing: Image Communication p. 117483 (2026)

  2. [2]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Beche, R., Nedevschi, S.: Claravid: A holistic scene reconstruction benchmark from aerial perspective with delentropy-based complexity profiling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 26015–26025 (2025)

  3. [3]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Bhat, S.F., Birkl, R., Wofk, D., Wonka, P., Müller, M.: Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288 (2023)

  4. [4]

    Midas v3.1 – a model zoo for robust monocular relative depth estimation

    Birkl, R., Wofk, D., Müller, M.: Midas v3.1 – a model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460 (2023)

  5. [5]

    In: ICLR (2025)

    Bochkovskii, A., Delaunoy, A., Germain, H., Santos, M., Zhou, Y., Richter, S.R., Koltun, V.: Depth pro: Sharp monocular metric depth in less than a second. In: ICLR (2025)

  6. [6]

    In: CVPR (2020)

    Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. In: CVPR (2020)

  7. [7]

    In: CVPR (2019)

    Chang, M.F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D., Carr, P., Lucey, S., Ramanan, D., Hays, J.: Argoverse: 3d tracking and forecasting with rich maps. In: CVPR (2019)

  8. [8]

    arXiv preprint arXiv:2203.09065 (2022)

    Chen, M., Hu, Q., Yu, Z., Thomas, H., Feng, A., Hou, Y., McCullough, K., Ren, F., Soibelman, L.: Stpls3d: A large-scale synthetic and real aerial photogrammetry 3d point cloud dataset. arXiv preprint arXiv:2203.09065 (2022)

  9. [9]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  10. [10]

    In: CVPR (2017)

    Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: CVPR (2017)

  11. [11]

    IEEE Robotics and Automation Letters 10(4), 3302–3309 (2025)

    Dhrafani, D., Liu, Y., Jong, A., Shin, U., He, Y., Harp, T., Hu, Y., Oh, J., Scherer, S.: Firestereo: Forest infrared stereo dataset for uas depth perception in visually degraded environments. IEEE Robotics and Automation Letters 10(4), 3302–3309 (2025)

  12. [12]

    DJI: Dji terra. https://enterprise.dji.com/dji-terra, accessed: 2025-11-3

  13. [13]

    In: CVPR (2023)

    Du, B., Huang, Y., Chen, J., Huang, D.: Adaptive sparse convolutional networks with global context enhancement for faster object detection on drone images. In: CVPR (2023)

  14. [14]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Du, K., Liao, X., Xia, J., Guo, C., Gu, Y., Guan, Y., Wang, D., Huang, S., Wang, Z.: Uavlight: A benchmark for illumination-robust 3d reconstruction in unmanned aerial vehicle (uav) scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5670–5679 (2026)

  15. [15]

    NIPS (2014)

    Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. NIPS (2014)

  16. [16]

    ESRI: Create a 3d product with arcgis drone2map. https://www.esri.com/zh-cn/arcgis/products/arcgis-reality/resources/sample-drone-datasets, accessed: 2025-10-5

  17. [17]

    In: 2021 IEEE 17th International Conference on Intelligent Computer Communication and Processing (ICCP)

    Florea, H., Miclea, V.C., Nedevschi, S.: Wilduav: Monocular uav dataset for depth estimation tasks. In: 2021 IEEE 17th International Conference on Intelligent Computer Communication and Processing (ICCP). pp. 291–298. IEEE (2021)

  18. [18]

    IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 18, 5445–5459 (2025)

    Florea, H., Nedevschi, S.: Tandepth: Leveraging global dems for metric monocular depth estimation in uavs. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 18, 5445–5459 (2025)

  19. [19]

    In: CVPRW (2019)

    Fonder, M., Van Droogenbroeck, M.: Mid-air: A multi-modal dataset for extremely low altitude drone flights. In: CVPRW (2019)

  20. [20]

    arXiv preprint arXiv:2502.18041 (2025)

    Gao, Y., Li, C., You, Z., Liu, J., Li, Z., Chen, P., Chen, Q., Tang, Z., Wang, L., Yang, P., Tang, Y., Tang, Y., Liang, S., Zhu, S., Xiong, Z., Su, Y., Ye, X., Li, J., Ding, Y., Wang, D., Wang, Z., Zhao, B., Li, X.: Openfly: A comprehensive platform for aerial vision-language navigation. arXiv preprint arXiv:2502.18041 (2025)

  21. [21]

    The international journal of robotics research 32(11), 1231–1237 (2013)

    Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The kitti dataset. The international journal of robotics research 32(11), 1231–1237 (2013)

  22. [22]

    Google: Google earth pro. https://earth.google.com, accessed: 2026-01-05

  23. [23]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Gross, M., Matha, S.B., Fahmy, A., Song, R., Cremers, D., Meeß, H.: Occufly: A 3d vision benchmark for semantic scene completion from the aerial perspective. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21474–21485 (2026)

  24. [24]

    In: CVPR (2020)

    Guizilini, V., Ambrus, R., Pillai, S., Raventos, A., Gaidon, A.: 3d packing for self-supervised monocular depth estimation. In: CVPR (2020)

  25. [25]

    Lotus-2: Advancing Geometric Dense Prediction with Powerful Image Generative Model

    He, J., Li, H., Sheng, M., Chen, Y.C.: Lotus-2: Advancing geometric dense prediction with powerful image generative model. arXiv preprint arXiv:2512.01030 (2025)

  26. [26]

    arXiv preprint arXiv:2409.18124 (2024)

    He, J., Li, H., Yin, W., Liang, Y., Li, L., Zhou, K., Liu, H., Liu, B., Chen, Y.C.: Lotus: Diffusion-based visual foundation model for high-quality dense prediction. arXiv preprint arXiv:2409.18124 (2024)

  27. [27]

    In: ICLR (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: ICLR (2022)

  28. [28]

    TPAMI 46(12), 10579–10596 (2024)

    Hu, M., Yin, W., Zhang, C., Cai, Z., Long, X., Chen, H., Wang, K., Yu, G., Shen, C., Shen, S.: Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. TPAMI 46(12), 10579–10596 (2024)

  29. [29]

    In: CVPR (2018)

    Huang, P.H., Matzen, K., Kopf, J., Ahuja, N., Huang, J.B.: Deepmvs: Learning multi-view stereopsis. In: CVPR (2018)

  30. [30]

    In: AAAI (2022)

    Huang, Y., Chen, J., Huang, D.: Ufpmp-det: Toward accurate and efficient object detection on drone imagery. In: AAAI (2022)

  31. [31]

    In: CVPR (2023)

    Jung, H., Ruhkamp, P., Zhai, G., Brasch, N., Li, Y., Verdie, Y., Song, J., Zhou, Y., Armagan, A., Ilic, S., et al.: On the importance of accurate geometry data for dense 3d vision tasks. In: CVPR (2023)

  32. [32]

    Array 23, 100361 (2024)

    Katkuri, A.V.R., Madan, H., Khatri, N., Abdul-Qawy, A.S.H., Patnaik, K.S.: Autonomous uav navigation using deep learning-based computer vision frameworks: A systematic literature review. Array 23, 100361 (2024)

  33. [33]

    In: 2026 International Conference on 3D Vision (3DV)

    Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., et al.: Mapanything: Universal feed-forward metric 3d reconstruction. In: 2026 International Conference on 3D Vision (3DV). pp. 499–509. IEEE (2026)

  34. [34]

    In: ECCV Workshops (2018)

    Koch, T., Liebel, L., Fraundorfer, F., Korner, M.: Evaluation of cnn-based single-image depth estimation methods. In: ECCV Workshops (2018)

  35. [35]

    In: CVPR (2024)

    Kolbeinsson, B., Mikolajczyk, K.: Ddos: The drone depth and obstacle segmentation dataset. In: CVPR (2024)

  36. [36]

    ISPRS Open Journal of Photogrammetry and Remote Sensing 1, 100001 (2021)

    Kölle, M., Laupheimer, D., Schmohl, S., Haala, N., Rottensteiner, F., Wegner, J.D., Ledoux, H.: The hessigheim 3d (h3d) benchmark on semantic segmentation of high-resolution 3d point clouds and textured meshes from uav lidar and multi-view-stereo. ISPRS Open Journal of Photogrammetry and Remote Sensing 1, 100001 (2021)

  37. [37]

    The International Journal of Robotics Research 43(8), 1114–1127 (2024)

    Li, H., Zou, Y., Chen, N., Lin, J., Liu, X., Xu, W., Zheng, C., Li, R., He, D., Kong, F., et al.: Mars-lvig dataset: A multi-sensor aerial robots slam dataset for lidar-visual-inertial-gnss fusion. The International Journal of Robotics Research 43(8), 1114–1127 (2024)

  38. [38]

    In: ICML (2024)

    Li, Y., Liu, M., Wu, Y., Wang, X., Yang, X., Li, S.: Learning adaptive and view-invariant vision transformer for real-time uav tracking. In: ICML (2024)

  39. [39]

    In: CVPR (2018)

    Li, Z., Snavely, N.: Megadepth: Learning single-view depth prediction from internet photos. In: CVPR (2018)

  40. [40]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

  41. [41]

    In: ECCV (2022)

    Lin, L., Liu, Y., Hu, Y., Yan, X., Xie, K., Huang, H.: Capturing, reconstructing, and simulating: the urbanscene3d dataset. In: ECCV (2022)

  42. [42]

    arXiv preprint arXiv:2512.16913 (2025)

    Lin, X., Song, M., Zhang, D., Lu, W., Li, H., Du, B., Yang, M.H., Nguyen, T., Qi, L.: Depth any panoramas: A foundation model for panoramic depth estimation. arXiv preprint arXiv:2512.16913 (2025)

  43. [43]

    Science Robotics 6(59), eabg5810 (2021)

    Loquercio, A., Kaufmann, E., Ranftl, R., Müller, M., Koltun, V., Scaramuzza, D.: Learning high-speed flight in the wild. Science Robotics 6(59), eabg5810 (2021)

  44. [44]

    ISPRS journal of photogrammetry and remote sensing 165, 108–119 (2020)

    Lyu, Y., Vosselman, G., Xia, G.S., Yilmaz, A., Yang, M.Y.: Uavid: A semantic segmentation dataset for uav imagery. ISPRS journal of photogrammetry and remote sensing 165, 108–119 (2020)

  45. [45]

    ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences 2 (2020)

    Madhuanand, L., Nex, F., Yang, M., et al.: Deep learning for monocular depth estimation from uav images. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences 2 (2020)

  46. [46]

    In: ECCV (2012)

    Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: ECCV (2012)

  47. [47]

    ISPRS Open Journal of Photogrammetry and Remote Sensing 13, 100070 (2024)

    Nex, F., Stathopoulou, E., Remondino, F., Yang, M., Madhuanand, L., Yogender, Y., Alsadik, B., Weinmann, M., Jutzi, B., Qin, R.: Usegeo-a uav-based multi-sensor dataset for geospatial research. ISPRS Open Journal of Photogrammetry and Remote Sensing 13, 100070 (2024)

  48. [48]

    The International Journal of Robotics Research 41(3), 270–280 (2022)

    Nguyen, T.M., Yuan, S., Cao, M., Lyu, Y., Nguyen, T.H., Xie, L.: Ntu viral: A visual-inertial-ranging-lidar dataset, from an aerial vehicle viewpoint. The International Journal of Robotics Research 41(3), 270–280 (2022)

  49. [49]

    OpenDroneMap Authors: Odm - a command line toolkit to generate maps, point clouds, 3d models and dems from drone, balloon or kite images. https://opendronemap.org/odm/datasets/ (2020), accessed: 2025-9-25

  50. [50]

    Piccinelli, L., Sakaridis, C., Yang, Y.H., Segu, M., Li, S., Abbeloos, W., Gool, L.V.: UniDepthV2: Universal monocular metric depth estimation made simpler (2025), https://arxiv.org/abs/2502.20110

  51. [51]

    In: CVPR (2024)

    Piccinelli, L., Yang, Y.H., Sakaridis, C., Segu, M., Li, S., Van Gool, L., Yu, F.: UniDepth: Universal monocular metric depth estimation. In: CVPR (2024)

  52. [52]

    In: ICCV (2021)

    Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: ICCV (2021)

  53. [53]

    TPAMI 44(3), 1623–1637 (2022)

    Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. TPAMI 44(3), 1623–1637 (2022)

  54. [54]

    In: ICCV (2023)

    Rizzoli, G., Barbato, F., Caligiuri, M., Zanuttigh, P.: Syndrone-multi-modal uav dataset for urban scenarios. In: ICCV (2023)

  55. [55]

    In: ICCV (2021)

    Roberts, M., Ramapuram, J., Ranjan, A., Kumar, A., Bautista, M.A., Paczan, N., Webb, R., Susskind, J.M.: Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In: ICCV (2021)

  56. [56]

    In: CVPR (2016)

    Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)

  57. [57]

    In: CVPR (2017)

    Schops, T., Schonberger, J.L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., Geiger, A.: A multi-view stereo benchmark with high-resolution images and multi-camera videos. In: CVPR (2017)

  58. [58]

    Sensors 22(6), 2097 (2022)

    Shimada, T., Nishikawa, H., Kong, X., Tomiyama, H.: Pix2pix-based monocular depth estimation for drones with optical flow on airsim. Sensors 22(6), 2097 (2022)

  59. [59]

    In: CVPR (2020)

    Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., Vasudevan, V., Han, W., Ngiam, J., Zhao, H., Timofeev, A., Ettinger, S., Krivokon, M., Gao, A., Joshi, A., Zhang, Y., Shlens, J., Chen, Z., Anguelov, D.: Scalability in perception for autonomous driving: Waymo open dataset. In: CVPR (2020)

  60. [60]

    The International Journal of Robotics Research 43(12), 1853–1866 (2024)

    Thalagala, R.G., De Silva, O., Jayasiri, A., Gubbels, A., Mann, G.K., Gosine, R.G.: Mun-frl: A visual-inertial-lidar dataset for aerial autonomous navigation and mapping. The International Journal of Robotics Research 43(12), 1853–1866 (2024)

  61. [61]

    In: CVPR (2022)

    Turki, H., Ramanan, D., Satyanarayanan, M.: Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. In: CVPR (2022)

  62. [62]

    arXiv preprint arXiv:1908.00463 (2019)

    Vasiljevic, I., Kolkin, N., Zhang, S., Luo, R., Wang, H., Dai, F.Z., Daniele, A.F., Mostajabi, M., Basart, S., Walter, M.R., et al.: Diode: A dense indoor and outdoor depth dataset. arXiv preprint arXiv:1908.00463 (2019)

  63. [63]

    In: CVPR (2025)

    Vuong, K., Ghosh, A., Ramanan, D., Narasimhan, S., Tulsiani, S.: Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis. In: CVPR (2025)

  64. [64]

    ISPRS Journal of Photogrammetry and Remote Sensing 190, 196–214 (2022)

    Wang, L., Li, R., Zhang, C., Fang, S., Duan, C., Meng, X., Atkinson, P.M.: Unetformer: A unet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS Journal of Photogrammetry and Remote Sensing 190, 196–214 (2022)

  65. [65]

    In: CVPR (2025)

    Wang, R., Xu, S., Dai, C., Xiang, J., Deng, Y., Tong, X., Yang, J.: Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In: CVPR (2025)

  66. [66]

    Advances in Neural Information Processing Systems 38, 35928–35959 (2025)

    Wang, R., Xu, S., Dong, Y., Deng, Y., Xiang, J., Lv, Z., Sun, G., Tong, X., Yang, J.: Moge-2: Accurate monocular geometry with metric scale and sharp details. Advances in Neural Information Processing Systems 38, 35928–35959 (2025)

  67. [67]

    In: ICCV (2025)

    Wang, S., Li, S., Zhang, Y., Yu, S., Yuan, S., She, R., Guo, Q., Zheng, J., Howe, O.K., Chandra, L., et al.: Uavscenes: A multi-modal dataset for uavs. In: ICCV (2025)

  68. [68]

    In: IROS (2020)

    Wang, W., Zhu, D., Wang, X., Hu, Y., Qiu, Y., Wang, C., Hu, Y., Kapoor, A., Scherer, S.: Tartanair: A dataset to push the limits of visual slam. In: IROS (2020)

  69. [69]

    Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting

    Wilson, B., Qi, W., Agarwal, T., Lambert, J., Singh, J., Khandelwal, S., Pan, B., Kumar, R., Hartnett, A., Pontes, J.K., et al.: Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493 (2023)

  70. [70]

    arXiv preprint arXiv:2401.05971 (2024)

    Wu, R., Cheng, X., Zhu, J., Liu, X., Zhang, M., Yan, S.: Uavd4l: A large-scale dataset for uav 6-dof localization. arXiv preprint arXiv:2401.05971 (2024)

  71. [71]

    In: CVPR (2025)

    Wu, Y., Wang, X., Yang, X., Liu, M., Zeng, D., Ye, H., Li, S.: Learning occlusion-robust vision transformers for real-time uav tracking. In: CVPR (2025)

  72. [72]

    arXiv preprint arXiv:2401.14032 (2024)

    Xiong, B., Li, Z., Li, Z.: Gauu-scene: A scene reconstruction benchmark on large scale 3d reconstruction dataset using gaussian splatting. arXiv preprint arXiv:2401.14032 (2024)

  73. [73]

    arXiv preprint arXiv:2404.04880 (2024)

    Xiong, B., Zheng, N., Liu, J., Li, Z.: Gauu-scene v2: Assessing the reliability of image-based metrics with expansive lidar image dataset using 3dgs and nerf. arXiv preprint arXiv:2404.04880 (2024)

  74. [74]

    In: CVPR (2022)

    Yan, Q., Zheng, J., Reding, S., Li, S., Doytchinov, I.: Crossloc: Scalable aerial localization assisted by multimodal synthetic data. In: CVPR (2022)

  75. [75]

    In: SIGGRAPH (2023)

    Yang, G., Xue, F., Zhang, Q., Xie, K., Fu, C.W., Huang, H.: Urbanbis: a large-scale benchmark for fine-grained urban building instance segmentation. In: SIGGRAPH (2023)

  76. [76]

    In: CVPR (2024)

    Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., Zhao, H.: Depth anything: Unleashing the power of large-scale unlabeled data. In: CVPR (2024)

  77. [77]

    NIPS (2024)

    Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. NIPS (2024)

  78. [78]

    In: AAAI (2025)

    Ye, C., Zhuge, Y., Zhang, P.: Towards open-vocabulary remote sensing image semantic segmentation. In: AAAI (2025)

  79. [79]

    In: ICCV (2023)

    Yin, W., Zhang, C., Chen, H., Cai, Z., Yu, G., Wang, K., Chen, X., Shen, C.: Metric3d: Towards zero-shot metric 3d prediction from a single image. In: ICCV (2023)

  80. [80]

    arXiv preprint arXiv:2601.03252 (2026)

    Yu, H., Lin, H., Wang, J., Li, J., Wang, Y., Zhang, X., Wang, Y., Zhou, X., Hu, R., Peng, S.: Infinidepth: Arbitrary-resolution and fine-grained depth estimation with neural implicit fields. arXiv preprint arXiv:2601.03252 (2026)

Showing first 80 references.