AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
Pith reviewed 2026-05-07 11:47 UTC · model grok-4.3
The pith
Fine-tuning on the AirZoo dataset improves the performance of state-of-the-art models on aerial 3D vision benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AirZoo is generated via a scalable pipeline from photogrammetric meshes covering 378 regions in 22 countries, rendering diverse scenes with configurable trajectories and conditions while providing pixel-level metric depth and 6-DoF geo-referenced poses. Rigorous tests in three tracks (aerial image retrieval, cross-view matching, and multi-view 3D reconstruction) demonstrate that it acts as an effective pre-training engine: fine-tuning yields substantial gains for models including MegaLoc, RoMa, VGGT, and Depth Anything 3 on both public and new real-world benchmarks, thus setting a new performance upper bound.
What carries the argument
Scalable generation pipeline that renders vast outdoor environments from freely available photogrammetric 3D meshes with customizable UAV flight trajectories and configurable weather and illumination.
If this is right
- Models pre-trained or fine-tuned on AirZoo achieve higher accuracy in aerial image retrieval.
- Improved cross-view matching between aerial and ground or different aerial views.
- Enhanced multi-view 3D reconstruction from aerial image sets.
- New upper bound established for aerial spatial intelligence tasks.
- Dataset supports transfer to real UAV sensor data despite being synthetic.
Where Pith is reading between the lines
- Similar mesh-based rendering could address data needs in other domains like autonomous driving or indoor robotics.
- The geo-referenced nature of the data may facilitate fusion with GIS systems for large-scale mapping.
- Further experiments could test the dataset's utility for training models on dynamic elements like moving vehicles.
- Combining AirZoo with existing ground-level datasets might yield models robust across viewpoints.
Load-bearing premise
The synthetic images rendered from photogrammetric meshes closely enough match the characteristics of real UAV-captured images, including lighting, textures, and sensor noise, to support effective learning transfer.
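One inexpensive way to probe this premise is to compare low-order image statistics between rendered and real frames before committing to fine-tuning. The sketch below is a hypothetical check, not a method from the paper: it compares gradient-magnitude histograms of two grayscale images with a chi-squared distance, where a large distance flags an appearance gap that learned transfer would have to overcome. Both function names are illustrative.

```python
import numpy as np

def gradient_histogram(img, bins=32):
    """Normalized histogram of gradient magnitudes for a grayscale image."""
    gy, gx = np.gradient(img.astype(np.float64))  # gradients along rows, cols
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(mag, bins=bins, range=(0.0, mag.max() + 1e-9))
    return hist / hist.sum()

def chi2_distance(h1, h2, eps=1e-12):
    """Chi-squared distance between two normalized histograms (0 = identical)."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```

In practice one would run this on matched synthetic/real image pairs of the same region; low-order statistics are only a coarse proxy for the sensor effects (rolling shutter, compression, atmospherics) the referee raises below.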
What would settle it
If state-of-the-art models fine-tuned on AirZoo show no improvement or worse performance than those trained without it when evaluated on independent real-world aerial benchmarks, the effectiveness of the dataset would be disproven.
Original abstract
Despite the rapid progress in data-driven 3D vision, aerial geometric 3D vision remains a formidable challenge due to the severe scarcity of large-scale, high-fidelity training data. Existing benchmarks, predominantly biased toward ground-level or object-centric views, do not account for complex viewpoint transformations and diverse environmental conditions in UAV-based sensing. To bridge this critical gap, we propose AirZoo, a unified large-scale dataset and benchmark for grounding aerial geometric 3D vision. AirZoo possesses three appealing properties: 1) Scalable Generation Pipeline: Leveraging freely available, world-scale photogrammetric 3D meshes, it renders vast outdoor environments with customizable UAV flight trajectories and configurable weather/illumination. 2) Comprehensive Scene Diversity: It provides the most extensive coverage of region types to date (spanning 378 regions across 22 countries), systematically encompassing both highly structured urban landscapes and complex unstructured natural environments. 3) Rich Geometric Annotations: Each frame provides synchronized, pixel-level metric depth and precise 6-DoF geo-referenced poses, essential for geometry-aware learning. Through three rigorous evaluation tracks -- aerial image retrieval, cross-view matching, and multi-view 3D reconstruction -- we demonstrate that AirZoo serves as a powerful pre-training engine. Extensive experiments on both public and newly collected real-world benchmarks reveal that fine-tuning on AirZoo yields substantial performance gains for SoTA models (e.g., MegaLoc, RoMa, VGGT, and Depth Anything 3), establishing a new performance upper bound for aerial spatial intelligence.
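The "rich geometric annotations" the abstract describes (metric depth plus 6-DoF geo-referenced poses) are typically consumed by back-projecting each depth map into a common world frame for geometry-aware learning. The following is a minimal sketch under standard pinhole-camera assumptions; AirZoo's actual file formats and pose conventions are not specified here, so the function and its argument layout are illustrative.

```python
import numpy as np

def backproject_depth(depth, K, T_wc):
    """Lift a metric depth map to world-frame 3D points.

    depth : (H, W) metric depth along the camera z-axis
    K     : (3, 3) pinhole intrinsics
    T_wc  : (4, 4) camera-to-world pose (rotation + translation)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T              # normalized camera rays
    pts_cam = rays * depth.reshape(-1, 1)        # scale rays by metric depth
    pts_hom = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)
    return (pts_hom @ T_wc.T)[:, :3]             # world-frame (N, 3) points
```

With geo-referenced poses, points from different frames land in one shared coordinate system, which is what makes multi-view supervision across trajectories possible.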
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AirZoo, a large-scale dataset for aerial geometric 3D vision generated via a scalable pipeline from freely available photogrammetric 3D meshes. It renders diverse outdoor scenes (378 regions across 22 countries) with customizable UAV trajectories, weather, and illumination, providing synchronized pixel-level metric depth and 6-DoF geo-referenced poses per frame. The work evaluates the dataset through three tracks (aerial image retrieval, cross-view matching, multi-view 3D reconstruction) and claims that fine-tuning state-of-the-art models (MegaLoc, RoMa, VGGT, Depth Anything 3) on AirZoo produces substantial gains on both public and newly collected real-world UAV benchmarks, establishing a new performance upper bound for aerial spatial intelligence.
Significance. If the transfer gains hold after rigorous validation against domain shift, AirZoo would address a genuine data scarcity issue in UAV-based 3D vision by supplying scale, geo-referenced annotations, and scene diversity that existing ground-centric benchmarks lack. The generation pipeline's use of existing meshes and configurability is a practical strength that could enable reproducible pre-training.
major comments (2)
- [Abstract] The headline claim that fine-tuning on AirZoo yields substantial gains for MegaLoc, RoMa, VGGT, and Depth Anything 3 and establishes a new upper bound rests on the unverified assumption that rendered images from photogrammetric meshes sufficiently replicate real UAV sensor statistics (rolling shutter, compression, atmospheric effects, fine-scale geometry). No quantitative details, ablation on rendering fidelity, or domain-gap analysis are supplied to support this transfer; the domain gap between baked-in lighting/reconstruction holes in meshes and actual UAV captures is load-bearing for the generalization conclusion.
- [Abstract] Evaluation tracks: The three rigorous evaluation tracks are invoked to demonstrate AirZoo as a pre-training engine, yet the abstract (and by extension the central claim) provides no metrics, baseline comparisons, or error analysis. Without these, it is impossible to assess whether observed gains exceed what scale alone or test-set selection would produce.
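For context on what such metrics would look like: the standard figure of merit for the aerial image retrieval track is recall@K over descriptor similarities. A minimal sketch follows (illustrative names, not the paper's evaluation code), assuming L2-normalized global descriptors and one ground-truth database match per query.

```python
import numpy as np

def recall_at_k(query_desc, db_desc, gt_index, ks=(1, 5, 10)):
    """Recall@K for image retrieval with L2-normalized descriptors.

    query_desc : (Q, D) query descriptors
    db_desc    : (N, D) database descriptors
    gt_index   : (Q,) index of the correct database image per query
    """
    sims = query_desc @ db_desc.T                  # cosine similarity matrix
    ranks = np.argsort(-sims, axis=1)              # best match first
    hits = ranks == np.asarray(gt_index)[:, None]  # where the GT image lands
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```

Real benchmarks usually count any database image within a distance threshold of the query pose as correct rather than a single index, but the aggregation is the same.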
minor comments (1)
- [Abstract] The abstract refers to 'configurable weather/illumination' without specifying whether the pipeline uses physically based rendering or explicit domain randomization; this detail belongs in the methods section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We provide point-by-point responses to the major comments below and describe the revisions we plan to incorporate.
Point-by-point responses
Referee: [Abstract] The headline claim that fine-tuning on AirZoo yields substantial gains for MegaLoc, RoMa, VGGT, and Depth Anything 3 and establishes a new upper bound rests on the unverified assumption that rendered images from photogrammetric meshes sufficiently replicate real UAV sensor statistics (rolling shutter, compression, atmospheric effects, fine-scale geometry). No quantitative details, ablation on rendering fidelity, or domain-gap analysis are supplied to support this transfer; the domain gap between baked-in lighting/reconstruction holes in meshes and actual UAV captures is load-bearing for the generalization conclusion.
Authors: We appreciate the referee pointing out the importance of validating the domain transfer. The manuscript demonstrates the utility of AirZoo through extensive fine-tuning experiments on both public and newly collected real-world UAV benchmarks, where state-of-the-art models exhibit substantial performance improvements. These results provide empirical evidence that the geometric information in the rendered data transfers effectively to real captures. Nevertheless, we agree that explicit domain-gap analysis and rendering fidelity ablations would further bolster the claims. In the revised manuscript, we will update the abstract to include key quantitative metrics and add a new analysis subsection detailing comparisons between rendered and real image characteristics, along with ablations on rendering parameters such as lighting and weather effects. revision: yes
Referee: [Abstract] Evaluation tracks: The three rigorous evaluation tracks are invoked to demonstrate AirZoo as a pre-training engine, yet the abstract (and by extension the central claim) provides no metrics, baseline comparisons, or error analysis. Without these, it is impossible to assess whether observed gains exceed what scale alone or test-set selection would produce.
Authors: We acknowledge that the abstract, due to length constraints, does not detail specific metrics or comparisons. The full paper includes comprehensive quantitative results, baseline comparisons, and error analyses across the three evaluation tracks in the experiments section. To make the central claims more self-contained, we will revise the abstract to incorporate representative performance gains and baseline references from the evaluation tracks. revision: yes
Circularity Check
No significant circularity in dataset creation and benchmarking
Full rationale
The paper generates a synthetic aerial dataset from external photogrammetric meshes via a rendering pipeline and evaluates fine-tuning gains on separate public benchmarks plus newly collected real UAV data. No equations, self-definitional constructions, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text. The training data source and test benchmarks are explicitly distinct, making the performance claims externally falsifiable and the overall chain self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption World-scale photogrammetric 3D meshes provide accurate geometry for rendering UAV trajectories
- domain assumption Configurable weather and illumination in rendering approximate real environmental conditions
Reference graph
Works this paper leans on
- [1] Ali-Bey, A., Chaib-Draa, B., Giguere, P.: GSV-Cities: Toward appropriate supervised visual place recognition. Neurocomputing 513, 194–203 (2022)
- [2] Berton, G., Junglas, L., Zaccone, R., Pollok, T., Caputo, B., Masone, C.: MeshVPR: Citywide visual place recognition using 3D meshes. In: ECCV (2024)
- [3] Berton, G., Masone, C.: MegaLoc: One retrieval to place them all. In: CVPR. pp. 2861–2867 (2025)
- [4] Brachmann, E., Cavallari, T., Prisacariu, V.A.: Accelerated coordinate encoding: Learning to relocalize in minutes using RGB and poses. In: CVPR. pp. 5044–5053 (2023)
- [5] Cabon, Y., Stoffl, L., Antsfeld, L., Csurka, G., Chidlovskii, B., Revaud, J., Leroy, V.: MUSt3R: Multi-view network for stereo 3D reconstruction. In: CVPR. pp. 1050–1060 (2025)
- [6] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)
- [7] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In: CVPR. pp. 5828–5839 (2017)
- [8] Dai, M., Zheng, E., Feng, Z., Qi, L., Zhuang, J., Yang, W.: Vision-based UAV self-positioning in low-altitude urban environments. IEEE TIP (2023)
- [9] Dhaouadi, O., Marin, R., Meier, J., Kaiser, J., Cremers, D.: OrthoLoc: UAV 6-DoF localization and calibration using orthographic geodata. arXiv preprint arXiv:2509.18350 (2025)
- [10] Edstedt, J., Sun, Q., Bökman, G., Wadenbäck, M., Felsberg, M.: RoMa: Robust dense feature matching. In: CVPR. pp. 19790–19800 (2024)
- [11] Fonder, M., Van Droogenbroeck, M.: Mid-Air: A multi-modal dataset for extremely low altitude drone flights. In: CVPRW (2019)
- [12] Fu, H., Jia, R., Gao, L., Gong, M., Zhao, B., Maybank, S., Tao, D.: 3D-FUTURE: 3D furniture shape with texture. IJCV 129(12), 3313–3337 (2021)
- [13] Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. IJRR (2013)
- [14] Google: Google Maps. https://www.google.com/maps (2026), accessed: 2026-03-01
- [15] Gross, M., Matha, S.B., Fahmy, A., Song, R., Cremers, D., Meess, H.: OccuFly: A 3D vision benchmark for semantic scene completion from the aerial perspective. arXiv preprint arXiv:2512.20770 (2025)
- [16] He, M., Chen, C., Liu, J., Li, C., Lyu, X., Huang, G., Meng, Z.: AerialVL: A dataset, baseline and algorithm framework for aerial-based visual localization with reference map. IEEE Robotics and Automation Letters 9(10), 8210–8217 (2024)
- [17] Izquierdo, S., Civera, J.: Optimal transport aggregation for visual place recognition. In: CVPR. pp. 17658–17668 (2024)
- [18] Ji, Y., He, B., Tan, Z., Wu, L.: Game4Loc: A UAV geo-localization benchmark from game data. In: AAAI (2025)
- [19] Keetha, N., Mishra, A., Karhade, J., Jatavallabhula, K.M., Scherer, S., Krishna, M., Garg, S.: AnyLoc: Towards universal visual place recognition. IEEE Robotics and Automation Letters 9(2), 1286–1293 (2023)
- [20] Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., et al.: MapAnything: Universal feed-forward metric 3D reconstruction. arXiv preprint arXiv:2509.13414 (2025)
- [21] Kendall, A., Grimes, M., Cipolla, R.: PoseNet: A convolutional network for real-time 6-DoF camera relocalization. In: ICCV. pp. 2938–2946 (2015)
- [22] Li, X., Wang, S., Zhao, Y., Verbeek, J., Kannala, J.: Hierarchical scene coordinate classification and regression for visual localization. In: CVPR. pp. 11983–11992 (2020)
- [23] Li, Y., Jiang, L., Xu, L., Xiangli, Y., Wang, Z., Lin, D., Dai, B.: MatrixCity: A large-scale city dataset for city-scale neural rendering and beyond. In: ICCV (2023)
- [24] Li, Z., Snavely, N.: MegaDepth: Learning single-view depth prediction from internet photos. In: CVPR. pp. 2041–2050 (2018)
- [25] Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth Anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)
- [26] Lin, L., Liu, Y., Hu, Y., Yan, X., Xie, K., Huang, H.: Capturing, reconstructing, and simulating: the UrbanScene3D dataset. In: ECCV. pp. 93–109. Springer (2022)
- [27] Liu, L., Xu, W., Fu, H., Qian, S., Yu, Q., Han, Y., Lu, C.: AKB-48: A real-world articulated object knowledge base. In: CVPR. pp. 14809–14818 (2022)
- [28] Loiseau, T., Bourmaud, G.: RUBIK: A structured benchmark for image matching across geometric challenges. In: CVPR. pp. 27070–27080 (2025)
- [29] Luo, X., Wan, X., Gao, Y., Tian, Y., Zhang, W., Shu, L.: JointLoc: A real-time visual localization framework for planetary UAVs based on joint relative and absolute pose estimation. In: IROS. pp. 3348–3355. IEEE (2024)
- [30] Maggio, D., Lim, H., Carlone, L.: VGGT-SLAM: Dense RGB SLAM optimized on the SL(4) manifold. arXiv preprint arXiv:2505.12549 (2025)
- [31] Murai, R., Dexheimer, E., Davison, A.J.: MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors. In: CVPR (2025)
- [32] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
- [33] Reizenstein, J., Shapovalov, R., Henzler, P., Sbordone, L., Labatut, P., Novotny, D.: Common Objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In: ICCV (2021)
- [34] Rizzoli, G., Barbato, F., Caligiuri, M., Zanuttigh, P.: SynDrone: Multi-modal UAV dataset for urban scenarios. In: ICCV. pp. 2210–2220 (2023)
- [35] Sarlin, P.E., Unagar, A., Larsson, M., Germain, H., Toft, C., Larsson, V., Pollefeys, M., Lepetit, V., Hammarstrand, L., Kahl, F., et al.: Back to the feature: Learning robust camera localization from pixels to pose. In: CVPR. pp. 3247–3257 (2021)
- [36] Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR. pp. 4104–4113 (2016)
- [37] Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: ECCV. pp. 501–518. Springer (2016)
- [38] Shen, X., Cai, Z., Yin, W., Müller, M., Li, Z., Wang, K., Chen, X., Wang, C.: GIM: Learning generalizable image matcher from internet videos. In: ICLR (2024)
- [39] Sun, J., Shen, Z., Wang, Y., Bao, H., Zhou, X.: LoFTR: Detector-free local feature matching with transformers. In: CVPR. pp. 8922–8931 (2021)
- [40] Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous driving: Waymo Open Dataset. In: CVPR. pp. 2446–2454 (2020)
- [41] Tung, J., Chou, G., Cai, R., Yang, G., Zhang, K., Wetzstein, G., Hariharan, B., Snavely, N.: MegaScenes: Scene-level view synthesis at scale. In: ECCV. pp. 197–. Springer (2024)
- [43] Vuong, K., Ghosh, A., Ramanan, D., Narasimhan, S., Tulsiani, S.: AerialMegaDepth: Learning aerial-ground reconstruction and view synthesis. In: CVPR. pp. 21674–21684 (2025)
- [44] Wang, F., Jiang, X., Galliani, S., Vogel, C., Pollefeys, M.: GLACE: Global local accelerated coordinate encoding. In: CVPR. pp. 21562–21571 (2024)
- [45] Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: VGGT: Visual geometry grounded transformer. In: CVPR. pp. 5294–5306 (2025)
- [46] Wang, Q., Zhang, Y., Holynski, A., Efros, A.A., Kanazawa, A.: Continuous 3D perception model with persistent state. In: CVPR. pp. 10510–10522 (2025)
- [47] Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: DUSt3R: Geometric 3D vision made easy. In: CVPR. pp. 20697–20709 (2024)
- [48] Wang, S., Li, S., Zhang, Y., Yu, S., Yuan, S., She, R., Guo, Q., Zheng, J., Howe, O.K., Chandra, L., et al.: UAVScenes: A multi-modal dataset for UAVs. In: ICCV. pp. 28946–28958 (2025)
- [49] Wang, Y., He, X., Peng, S., Tan, D., Zhou, X.: Efficient LoFTR: Semi-dense local feature matching with sparse-like speed. In: CVPR. pp. 21666–21675 (2024)
- [50] Warburg, F., Hauberg, S., Lopez-Antequera, M., Gargallo, P., Kuang, Y., Civera, J.: Mapillary street-level sequences: A dataset for lifelong place recognition. In: CVPR. pp. 2626–2635 (2020)
- [51] Wu, R., Cheng, X., Zhu, J., Liu, Y., Zhang, M., Yan, S.: UAVD4L: A large-scale dataset for UAV 6-DoF localization. In: 3DV (2024)
- [52] Wu, R., Huang, Z., He, X., Yan, L., Shen, Y., Peng, S., Zhou, X., Zhang, M.: AerialExtreMatch-Localization dataset (2026), https://huggingface.co/datasets/Xecades/AerialExtreMatch-Localization
- [53] Wu, T., Zhang, J., Fu, X., Wang, Y., Ren, J., Pan, L., Wu, W., Yang, L., Wang, J., Qian, C., Lin, D., Liu, Z.: OmniObject3D: Large-vocabulary 3D object dataset for realistic perception, reconstruction and generation. In: CVPR. pp. 803–814 (2023)
- [54] Xu, W., Yao, Y., Cao, J., Wei, Z., Liu, C., Wang, J., Peng, M.: UAV-VisLoc: A large-scale dataset for UAV visual localization. arXiv (2024)
- [55] Yang, J., Sax, A., Liang, K.J., Henaff, M., Tang, H., Cao, A., Chai, J., Meier, F., Feiszli, M.: Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass. In: CVPR. pp. 21924–21935 (2025)
- [56] Yao, Y., Luo, Z., Li, S., Zhang, J., Ren, Y., Zhou, L., Fang, T., Quan, L.: BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. In: CVPR. pp. 1790–1799 (2020)
- [57] Yeshwanth, C., Liu, Y.C., Nießner, M., Dai, A.: ScanNet++: A high-fidelity dataset of 3D indoor scenes. In: ICCV. pp. 12–22 (2023)
- [58] Zhang, J., Herrmann, C., Hur, J., Jampani, V., Darrell, T., Cole, F., Sun, D., Yang, M.H.: MonST3R: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825 (2024)
- [59] Zheng, Z., Wei, Y., Yang, Y.: University-1652: A multi-view multi-source benchmark for drone-based geo-localization. In: ACM MM. pp. 1395–1403 (2020)
- [60] Zhou, T., Tucker, R., Flynn, J., Fyffe, G., Snavely, N.: Stereo magnification: Learning view synthesis using multiplane images. ACM TOG 37 (2018), https://arxiv.org/abs/1805.09817
- [61] Zhu, R., Yin, L., Yang, M., Wu, F., Yang, Y., Hu, W.: SUES-200: A multi-height multi-scene cross-view image benchmark across drone and satellite. IEEE TCSVT (2023)