AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
Pith reviewed 2026-05-07 11:47 UTC · model grok-4.3
The pith
Fine-tuning on the AirZoo dataset improves the performance of state-of-the-art models on aerial 3D vision benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AirZoo is generated via a scalable pipeline from photogrammetric meshes covering 378 regions in 22 countries, rendering diverse scenes with configurable trajectories and conditions while providing pixel-level metric depth and 6-DoF geo-referenced poses. Rigorous tests in three tracks (aerial image retrieval, cross-view matching, and multi-view 3D reconstruction) demonstrate that it acts as an effective pre-training engine: fine-tuning yields substantial gains for models including MegaLoc, RoMa, VGGT, and Depth Anything 3 on both public and new real-world benchmarks, thus setting a new performance upper bound.
What carries the argument
Scalable generation pipeline that renders vast outdoor environments from freely available photogrammetric 3D meshes with customizable UAV flight trajectories and configurable weather and illumination.
If this is right
- Models pre-trained or fine-tuned on AirZoo achieve higher accuracy in aerial image retrieval.
- Improved cross-view matching between aerial and ground or different aerial views.
- Enhanced multi-view 3D reconstruction from aerial image sets.
- New upper bound established for aerial spatial intelligence tasks.
- Dataset supports transfer to real UAV sensor data despite being synthetic.
Where Pith is reading between the lines
- Similar mesh-based rendering could address data needs in other domains like autonomous driving or indoor robotics.
- The geo-referenced nature of the data may facilitate fusion with GIS systems for large-scale mapping.
- Further experiments could test the dataset's utility for training models on dynamic elements like moving vehicles.
- Combining AirZoo with existing ground-level datasets might yield models robust across viewpoints.
Load-bearing premise
The synthetic images rendered from photogrammetric meshes closely enough match the characteristics of real UAV-captured images, including lighting, textures, and sensor noise, to support effective learning transfer.
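One inexpensive way to probe this premise is to compare low-order image statistics between rendered and real frames before committing to fine-tuning. The sketch below is a hypothetical check, not a method from the paper: it compares gradient-magnitude histograms of two grayscale images with a chi-squared distance, where a large distance flags an appearance gap that learned transfer would have to overcome. Both function names are illustrative.

```python
import numpy as np

def gradient_histogram(img, bins=32):
    """Normalized histogram of gradient magnitudes for a grayscale image."""
    gy, gx = np.gradient(img.astype(np.float64))  # gradients along rows, cols
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(mag, bins=bins, range=(0.0, mag.max() + 1e-9))
    return hist / hist.sum()

def chi2_distance(h1, h2, eps=1e-12):
    """Chi-squared distance between two normalized histograms (0 = identical)."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```

In practice one would run this on matched synthetic/real image pairs of the same region; low-order statistics are only a coarse proxy for the sensor effects (rolling shutter, compression, atmospherics) the referee raises below.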
What would settle it
If state-of-the-art models fine-tuned on AirZoo show no improvement or worse performance than those trained without it when evaluated on independent real-world aerial benchmarks, the effectiveness of the dataset would be disproven.
Original abstract
Despite the rapid progress in data-driven 3D vision, aerial geometric 3D vision remains a formidable challenge due to the severe scarcity of large-scale, high-fidelity training data. Existing benchmarks, predominantly biased toward ground-level or object-centric views, do not account for complex viewpoint transformations and diverse environmental conditions in UAV-based sensing. To bridge this critical gap, we propose AirZoo, a unified large-scale dataset and benchmark for grounding aerial geometric 3D vision. AirZoo possesses three appealing properties: 1) Scalable Generation Pipeline: Leveraging freely available, world-scale photogrammetric 3D meshes, it renders vast outdoor environments with customizable UAV flight trajectories and configurable weather/illumination. 2) Comprehensive Scene Diversity: It provides the most extensive coverage of region types to date (spanning 378 regions across 22 countries), systematically encompassing both highly structured urban landscapes and complex unstructured natural environments. 3) Rich Geometric Annotations: Each frame provides synchronized, pixel-level metric depth and precise 6-DoF geo-referenced poses, essential for geometry-aware learning. Through three rigorous evaluation tracks -- aerial image retrieval, cross-view matching, and multi-view 3D reconstruction -- we demonstrate that AirZoo serves as a powerful pre-training engine. Extensive experiments on both public and newly collected real-world benchmarks reveal that fine-tuning on AirZoo yields substantial performance gains for SoTA models (e.g., MegaLoc, RoMa, VGGT, and Depth Anything 3), establishing a new performance upper bound for aerial spatial intelligence.
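The "rich geometric annotations" the abstract describes (metric depth plus 6-DoF geo-referenced poses) are typically consumed by back-projecting each depth map into a common world frame for geometry-aware learning. The following is a minimal sketch under standard pinhole-camera assumptions; AirZoo's actual file formats and pose conventions are not specified here, so the function and its argument layout are illustrative.

```python
import numpy as np

def backproject_depth(depth, K, T_wc):
    """Lift a metric depth map to world-frame 3D points.

    depth : (H, W) metric depth along the camera z-axis
    K     : (3, 3) pinhole intrinsics
    T_wc  : (4, 4) camera-to-world pose (rotation + translation)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T              # normalized camera rays
    pts_cam = rays * depth.reshape(-1, 1)        # scale rays by metric depth
    pts_hom = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)
    return (pts_hom @ T_wc.T)[:, :3]             # world-frame (N, 3) points
```

With geo-referenced poses, points from different frames land in one shared coordinate system, which is what makes multi-view supervision across trajectories possible.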
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AirZoo, a large-scale dataset for aerial geometric 3D vision generated via a scalable pipeline from freely available photogrammetric 3D meshes. It renders diverse outdoor scenes (378 regions across 22 countries) with customizable UAV trajectories, weather, and illumination, providing synchronized pixel-level metric depth and 6-DoF geo-referenced poses per frame. The work evaluates the dataset through three tracks (aerial image retrieval, cross-view matching, multi-view 3D reconstruction) and claims that fine-tuning state-of-the-art models (MegaLoc, RoMa, VGGT, Depth Anything 3) on AirZoo produces substantial gains on both public and newly collected real-world UAV benchmarks, establishing a new performance upper bound for aerial spatial intelligence.
Significance. If the transfer gains hold after rigorous validation against domain shift, AirZoo would address a genuine data scarcity issue in UAV-based 3D vision by supplying scale, geo-referenced annotations, and scene diversity that existing ground-centric benchmarks lack. The generation pipeline's use of existing meshes and configurability is a practical strength that could enable reproducible pre-training.
major comments (2)
- [Abstract] The headline claim that fine-tuning on AirZoo yields substantial gains for MegaLoc, RoMa, VGGT, and Depth Anything 3 and establishes a new upper bound rests on the unverified assumption that rendered images from photogrammetric meshes sufficiently replicate real UAV sensor statistics (rolling shutter, compression, atmospheric effects, fine-scale geometry). No quantitative details, ablation on rendering fidelity, or domain-gap analysis are supplied to support this transfer; the domain gap between baked-in lighting/reconstruction holes in meshes and actual UAV captures is load-bearing for the generalization conclusion.
- [Abstract] Evaluation tracks: The three rigorous evaluation tracks are invoked to demonstrate AirZoo as a pre-training engine, yet the abstract (and by extension the central claim) provides no metrics, baseline comparisons, or error analysis. Without these, it is impossible to assess whether observed gains exceed what scale alone or test-set selection would produce.
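For context on what such metrics would look like: the standard figure of merit for the aerial image retrieval track is recall@K over descriptor similarities. A minimal sketch follows (illustrative names, not the paper's evaluation code), assuming L2-normalized global descriptors and one ground-truth database match per query.

```python
import numpy as np

def recall_at_k(query_desc, db_desc, gt_index, ks=(1, 5, 10)):
    """Recall@K for image retrieval with L2-normalized descriptors.

    query_desc : (Q, D) query descriptors
    db_desc    : (N, D) database descriptors
    gt_index   : (Q,) index of the correct database image per query
    """
    sims = query_desc @ db_desc.T                  # cosine similarity matrix
    ranks = np.argsort(-sims, axis=1)              # best match first
    hits = ranks == np.asarray(gt_index)[:, None]  # where the GT image lands
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```

Real benchmarks usually count any database image within a distance threshold of the query pose as correct rather than a single index, but the aggregation is the same.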
minor comments (1)
- [Abstract] The abstract refers to 'configurable weather/illumination' without specifying whether the pipeline uses physically based rendering or explicit domain randomization; this detail belongs in the methods section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We provide point-by-point responses to the major comments below and describe the revisions we plan to incorporate.
Point-by-point responses
Referee: [Abstract] The headline claim that fine-tuning on AirZoo yields substantial gains for MegaLoc, RoMa, VGGT, and Depth Anything 3 and establishes a new upper bound rests on the unverified assumption that rendered images from photogrammetric meshes sufficiently replicate real UAV sensor statistics (rolling shutter, compression, atmospheric effects, fine-scale geometry). No quantitative details, ablation on rendering fidelity, or domain-gap analysis are supplied to support this transfer; the domain gap between baked-in lighting/reconstruction holes in meshes and actual UAV captures is load-bearing for the generalization conclusion.
Authors: We appreciate the referee pointing out the importance of validating the domain transfer. The manuscript demonstrates the utility of AirZoo through extensive fine-tuning experiments on both public and newly collected real-world UAV benchmarks, where state-of-the-art models exhibit substantial performance improvements. These results provide empirical evidence that the geometric information in the rendered data transfers effectively to real captures. Nevertheless, we agree that explicit domain-gap analysis and rendering fidelity ablations would further bolster the claims. In the revised manuscript, we will update the abstract to include key quantitative metrics and add a new analysis subsection detailing comparisons between rendered and real image characteristics, along with ablations on rendering parameters such as lighting and weather effects. revision: yes
Referee: [Abstract] Evaluation tracks: The three rigorous evaluation tracks are invoked to demonstrate AirZoo as a pre-training engine, yet the abstract (and by extension the central claim) provides no metrics, baseline comparisons, or error analysis. Without these, it is impossible to assess whether observed gains exceed what scale alone or test-set selection would produce.
Authors: We acknowledge that the abstract, due to length constraints, does not detail specific metrics or comparisons. The full paper includes comprehensive quantitative results, baseline comparisons, and error analyses across the three evaluation tracks in the experiments section. To make the central claims more self-contained, we will revise the abstract to incorporate representative performance gains and baseline references from the evaluation tracks. revision: yes
Circularity Check
No significant circularity in dataset creation and benchmarking
Full rationale
The paper generates a synthetic aerial dataset from external photogrammetric meshes via a rendering pipeline and evaluates fine-tuning gains on separate public benchmarks plus newly collected real UAV data. No equations, self-definitional constructions, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text. The training data source and test benchmarks are explicitly distinct, making the performance claims externally falsifiable and the overall chain self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption World-scale photogrammetric 3D meshes provide accurate geometry for rendering UAV trajectories
- domain assumption Configurable weather and illumination in rendering approximate real environmental conditions
Reference graph
Works this paper leans on
- [1] Ali-Bey, A., Chaib-Draa, B., Giguere, P.: GSV-Cities: Toward appropriate supervised visual place recognition. Neurocomputing 513, 194–203 (2022)
- [2] Berton, G., Junglas, L., Zaccone, R., Pollok, T., Caputo, B., Masone, C.: MeshVPR: Citywide visual place recognition using 3D meshes. In: ECCV (2024)
- [3] Berton, G., Masone, C.: MegaLoc: One retrieval to place them all. In: CVPR. pp. 2861–2867 (2025)
- [4] Brachmann, E., Cavallari, T., Prisacariu, V.A.: Accelerated coordinate encoding: Learning to relocalize in minutes using RGB and poses. In: CVPR. pp. 5044–5053 (2023)
- [5] Cabon, Y., Stoffl, L., Antsfeld, L., Csurka, G., Chidlovskii, B., Revaud, J., Leroy, V.: MUSt3R: Multi-view network for stereo 3D reconstruction. In: CVPR. pp. 1050–1060 (2025)
- [6] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)
- [7] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In: CVPR. pp. 5828–5839 (2017)
- [8] Dai, M., Zheng, E., Feng, Z., Qi, L., Zhuang, J., Yang, W.: Vision-based UAV self-positioning in low-altitude urban environments. IEEE TIP (2023)
- [9] Dhaouadi, O., Marin, R., Meier, J., Kaiser, J., Cremers, D.: OrthoLoc: UAV 6-DoF localization and calibration using orthographic geodata. arXiv preprint arXiv:2509.18350 (2025)
- [10] Edstedt, J., Sun, Q., Bökman, G., Wadenbäck, M., Felsberg, M.: RoMa: Robust dense feature matching. In: CVPR. pp. 19790–19800 (2024)
- [11] Fonder, M., Van Droogenbroeck, M.: Mid-Air: A multi-modal dataset for extremely low altitude drone flights. In: CVPRW (2019)
- [12] Fu, H., Jia, R., Gao, L., Gong, M., Zhao, B., Maybank, S., Tao, D.: 3D-FUTURE: 3D furniture shape with texture. IJCV 129(12), 3313–3337 (2021)
- [13] Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. IJRR (2013)
- [14] Google: Google Maps. https://www.google.com/maps (2026), accessed: 2026-03-01
- [15] Gross, M., Matha, S.B., Fahmy, A., Song, R., Cremers, D., Meess, H.: OccuFly: A 3D vision benchmark for semantic scene completion from the aerial perspective. arXiv preprint arXiv:2512.20770 (2025)
- [16] He, M., Chen, C., Liu, J., Li, C., Lyu, X., Huang, G., Meng, Z.: AerialVL: A dataset, baseline and algorithm framework for aerial-based visual localization with reference map. IEEE Robotics and Automation Letters 9(10), 8210–8217 (2024)
- [17] Izquierdo, S., Civera, J.: Optimal transport aggregation for visual place recognition. In: CVPR. pp. 17658–17668 (2024)
- [18] Ji, Y., He, B., Tan, Z., Wu, L.: Game4Loc: A UAV geo-localization benchmark from game data. In: AAAI (2025)
- [19] Keetha, N., Mishra, A., Karhade, J., Jatavallabhula, K.M., Scherer, S., Krishna, M., Garg, S.: AnyLoc: Towards universal visual place recognition. IEEE Robotics and Automation Letters 9(2), 1286–1293 (2023)
- [20] Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., et al.: MapAnything: Universal feed-forward metric 3D reconstruction. arXiv preprint arXiv:2509.13414 (2025)
- [21] Kendall, A., Grimes, M., Cipolla, R.: PoseNet: A convolutional network for real-time 6-DoF camera relocalization. In: ICCV. pp. 2938–2946 (2015)
- [22] Li, X., Wang, S., Zhao, Y., Verbeek, J., Kannala, J.: Hierarchical scene coordinate classification and regression for visual localization. In: CVPR. pp. 11983–11992 (2020)
- [23] Li, Y., Jiang, L., Xu, L., Xiangli, Y., Wang, Z., Lin, D., Dai, B.: MatrixCity: A large-scale city dataset for city-scale neural rendering and beyond. In: ICCV (2023)
- [24] Li, Z., Snavely, N.: MegaDepth: Learning single-view depth prediction from internet photos. In: CVPR. pp. 2041–2050 (2018)
- [25] Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth Anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)
- [26] Lin, L., Liu, Y., Hu, Y., Yan, X., Xie, K., Huang, H.: Capturing, reconstructing, and simulating: the UrbanScene3D dataset. In: ECCV. pp. 93–109. Springer (2022)
- [27] Liu, L., Xu, W., Fu, H., Qian, S., Yu, Q., Han, Y., Lu, C.: AKB-48: A real-world articulated object knowledge base. In: CVPR. pp. 14809–14818 (2022)
- [28] Loiseau, T., Bourmaud, G.: RUBIK: A structured benchmark for image matching across geometric challenges. In: CVPR. pp. 27070–27080 (2025)
- [29] Luo, X., Wan, X., Gao, Y., Tian, Y., Zhang, W., Shu, L.: JointLoc: A real-time visual localization framework for planetary UAVs based on joint relative and absolute pose estimation. In: IROS. pp. 3348–3355. IEEE (2024)
- [30] Maggio, D., Lim, H., Carlone, L.: VGGT-SLAM: Dense RGB SLAM optimized on the SL(4) manifold. arXiv preprint arXiv:2505.12549 (2025)
- [31] Murai, R., Dexheimer, E., Davison, A.J.: MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors. In: CVPR (2025)
- [32] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
- [33] Reizenstein, J., Shapovalov, R., Henzler, P., Sbordone, L., Labatut, P., Novotny, D.: Common Objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In: ICCV (2021)
- [34] Rizzoli, G., Barbato, F., Caligiuri, M., Zanuttigh, P.: SynDrone: Multi-modal UAV dataset for urban scenarios. In: ICCV. pp. 2210–2220 (2023)
- [35] Sarlin, P.E., Unagar, A., Larsson, M., Germain, H., Toft, C., Larsson, V., Pollefeys, M., Lepetit, V., Hammarstrand, L., Kahl, F., et al.: Back to the feature: Learning robust camera localization from pixels to pose. In: CVPR. pp. 3247–3257 (2021)
- [36] Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR. pp. 4104–4113 (2016)
- [37] Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: ECCV. pp. 501–518. Springer (2016)
- [38] Shen, X., Cai, Z., Yin, W., Müller, M., Li, Z., Wang, K., Chen, X., Wang, C.: GIM: Learning generalizable image matcher from internet videos. In: ICLR (2024)
- [39] Sun, J., Shen, Z., Wang, Y., Bao, H., Zhou, X.: LoFTR: Detector-free local feature matching with transformers. In: CVPR. pp. 8922–8931 (2021)
- [40] Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous driving: Waymo Open Dataset. In: CVPR. pp. 2446–2454 (2020)
- [41] Tung, J., Chou, G., Cai, R., Yang, G., Zhang, K., Wetzstein, G., Hariharan, B., Snavely, N.: MegaScenes: Scene-level view synthesis at scale. In: ECCV. pp. 197–. Springer (2024)
- [43] Vuong, K., Ghosh, A., Ramanan, D., Narasimhan, S., Tulsiani, S.: AerialMegaDepth: Learning aerial-ground reconstruction and view synthesis. In: CVPR. pp. 21674–21684 (2025)
- [44] Wang, F., Jiang, X., Galliani, S., Vogel, C., Pollefeys, M.: GLACE: Global local accelerated coordinate encoding. In: CVPR. pp. 21562–21571 (2024)
- [45] Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: VGGT: Visual geometry grounded transformer. In: CVPR. pp. 5294–5306 (2025)
- [46] Wang, Q., Zhang, Y., Holynski, A., Efros, A.A., Kanazawa, A.: Continuous 3D perception model with persistent state. In: CVPR. pp. 10510–10522 (2025)
- [47] Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: DUSt3R: Geometric 3D vision made easy. In: CVPR. pp. 20697–20709 (2024)
- [48] Wang, S., Li, S., Zhang, Y., Yu, S., Yuan, S., She, R., Guo, Q., Zheng, J., Howe, O.K., Chandra, L., et al.: UAVScenes: A multi-modal dataset for UAVs. In: ICCV. pp. 28946–28958 (2025)
- [49] Wang, Y., He, X., Peng, S., Tan, D., Zhou, X.: Efficient LoFTR: Semi-dense local feature matching with sparse-like speed. In: CVPR. pp. 21666–21675 (2024)
- [50] Warburg, F., Hauberg, S., Lopez-Antequera, M., Gargallo, P., Kuang, Y., Civera, J.: Mapillary street-level sequences: A dataset for lifelong place recognition. In: CVPR. pp. 2626–2635 (2020)
- [51] Wu, R., Cheng, X., Zhu, J., Liu, Y., Zhang, M., Yan, S.: UAVD4L: A large-scale dataset for UAV 6-DoF localization. In: 3DV (2024)
- [52] Wu, R., Huang, Z., He, X., Yan, L., Shen, Y., Peng, S., Zhou, X., Zhang, M.: AerialExtreMatch-Localization dataset (2026), https://huggingface.co/datasets/Xecades/AerialExtreMatch-Localization
- [53] Wu, T., Zhang, J., Fu, X., Wang, Y., Ren, J., Pan, L., Wu, W., Yang, L., Wang, J., Qian, C., Lin, D., Liu, Z.: OmniObject3D: Large-vocabulary 3D object dataset for realistic perception, reconstruction and generation. In: CVPR. pp. 803–814 (2023)
- [54] Xu, W., Yao, Y., Cao, J., Wei, Z., Liu, C., Wang, J., Peng, M.: UAV-VisLoc: A large-scale dataset for UAV visual localization. arXiv (2024)
- [55] Yang, J., Sax, A., Liang, K.J., Henaff, M., Tang, H., Cao, A., Chai, J., Meier, F., Feiszli, M.: Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass. In: CVPR. pp. 21924–21935 (2025)
- [56] Yao, Y., Luo, Z., Li, S., Zhang, J., Ren, Y., Zhou, L., Fang, T., Quan, L.: BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. In: CVPR. pp. 1790–1799 (2020)
- [57] Yeshwanth, C., Liu, Y.C., Nießner, M., Dai, A.: ScanNet++: A high-fidelity dataset of 3D indoor scenes. In: ICCV. pp. 12–22 (2023)
- [58] Zhang, J., Herrmann, C., Hur, J., Jampani, V., Darrell, T., Cole, F., Sun, D., Yang, M.H.: MonST3R: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825 (2024)
- [59] Zheng, Z., Wei, Y., Yang, Y.: University-1652: A multi-view multi-source benchmark for drone-based geo-localization. In: ACM MM. pp. 1395–1403 (2020)
- [60] Zhou, T., Tucker, R., Flynn, J., Fyffe, G., Snavely, N.: Stereo magnification: Learning view synthesis using multiplane images. ACM TOG 37 (2018), https://arxiv.org/abs/1805.09817
- [61] Zhu, R., Yin, L., Yang, M., Wu, F., Yang, Y., Hu, W.: SUES-200: A multi-height multi-scene cross-view image benchmark across drone and satellite. IEEE TCSVT (2023)