Sky2Ground: A Benchmark for Site Modeling under Varying Altitude

Grace Lim; Rajat Modi; Sirshapan Mitra; Yogesh Rawat; Zengyan Wang

arxiv: 2603.13740 · v3 · submitted 2026-03-14 · 💻 cs.CV

Zengyan Wang, Sirshapan Mitra, Rajat Modi, Grace Lim, Yogesh Rawat

Pith reviewed 2026-05-15 12:00 UTC · model grok-4.3

classification 💻 cs.CV

keywords Sky2Ground dataset · varying altitude localization · multi-view alignment · SkyNet model · curriculum training · satellite imagery · pose estimation · 3D reconstruction

The pith

SkyNet improves multi-view alignment for altitude-varying scenes by training progressively on satellite imagery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sky2Ground, a dataset of satellite, aerial, and ground images across 51 sites to test camera localization and 3D reconstruction when viewpoints span wide altitude ranges. Existing models lose accuracy when satellite images are added because the extreme differences in scale and angle disrupt alignment. SkyNet addresses this with a curriculum strategy that starts with easier views and gradually adds satellite ones, yielding stronger cross-view consistency. This setup matters for building reliable 3D models from mixed sources like maps and drone footage in real-world conditions.

Core claim

SkyNet, a model trained with a curriculum strategy that progressively incorporates more satellite views, significantly strengthens multi-view alignment and outperforms existing methods by 9.6% on RRA@5 and 18.1% on RTA@5 in absolute performance.
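This page does not define RRA@5 and RTA@5. In the pose-estimation literature they commonly denote the fraction of image pairs whose relative rotation error (RRA) or relative translation direction error (RTA) falls below 5 degrees. The sketch below follows that common definition. The world-to-camera pose convention, the all-pairs evaluation, and the function names are assumptions, not the paper's confirmed protocol.

```python
import numpy as np

def rotation_angle_deg(R_a, R_b):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_angle_deg(t_a, t_b, eps=1e-12):
    """Angle in degrees between two translation directions (scale is ignored)."""
    cos = np.dot(t_a, t_b) / (np.linalg.norm(t_a) * np.linalg.norm(t_b) + eps)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def relative_pose(R_i, t_i, R_j, t_j):
    """Relative pose from camera i to camera j, assuming x_cam = R @ x_world + t."""
    R_rel = R_j @ R_i.T
    t_rel = t_j - R_rel @ t_i
    return R_rel, t_rel

def rra_rta(pred, gt, tau=5.0):
    """pred, gt: lists of (R, t) per image. Returns (RRA@tau, RTA@tau) in [0, 1]."""
    rot_ok, trans_ok = [], []
    for i in range(len(pred)):
        for j in range(i + 1, len(pred)):
            R_p, t_p = relative_pose(*pred[i], *pred[j])
            R_g, t_g = relative_pose(*gt[i], *gt[j])
            rot_ok.append(rotation_angle_deg(R_p, R_g) < tau)
            trans_ok.append(translation_angle_deg(t_p, t_g) < tau)
    return float(np.mean(rot_ok)), float(np.mean(trans_ok))
```

Under this reading, an absolute gain of 9.6% on RRA@5 means that 9.6 percentage points more image pairs fall within the 5-degree rotation threshold.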

What carries the argument

Curriculum-based training strategy that progressively incorporates satellite views to enhance cross-view consistency in SkyNet.
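The page does not give the curriculum schedule itself, so the following is only an illustration of what "progressively incorporating satellite views" could look like in a training loop. The linear ramp, the `warmup_frac` and `end` values, and the names `satellite_fraction` and `sample_views` are all hypothetical choices, not the authors' implementation.

```python
import random

def satellite_fraction(step, total_steps, start=0.0, end=0.5, warmup_frac=0.2):
    """Hypothetical schedule: hold at `start` during warmup, then ramp linearly to `end`."""
    warmup = int(warmup_frac * total_steps)
    if step < warmup:
        return start
    progress = (step - warmup) / max(1, total_steps - warmup)
    return start + (end - start) * min(1.0, progress)

def sample_views(ground_aerial, satellite, n_views, step, total_steps, rng=random):
    """Draw a training batch of views whose satellite share follows the schedule.

    Assumes `ground_aerial` holds at least as many items as non-satellite slots.
    """
    frac = satellite_fraction(step, total_steps)
    n_sat = min(len(satellite), round(frac * n_views))
    n_other = n_views - n_sat
    return rng.sample(ground_aerial, n_other) + rng.sample(satellite, n_sat)
```

Early steps in this sketch see only ground and aerial views, and later steps see up to half satellite views. The ramp rate is the kind of tunable choice the ledger further down lists as a free parameter.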

If this is right

Satellite imagery degrades pose estimation performance in current models under large altitude changes.
Reconstruction suffers from sparse geometric overlap, orthogonal angles, and noise in real images.
The dataset supports evaluation from global satellite context down to local ground details.
SkyNet supplies a practical baseline for generalizable localization across altitude levels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The curriculum approach could transfer to other scale-varying problems such as combining street-level and overhead imagery for mapping.
Similar benchmarks mixing synthetic and real data might help test robustness in related tasks like object detection across distances.
Better handling of these view differences may improve downstream uses in urban modeling or navigation systems.

Load-bearing premise

The 51 sites and their mix of real and synthetic images represent the range of altitude variations and noise that future models will encounter.

What would settle it

SkyNet falling below baseline performance on a fresh set of sites whose altitude spans or noise levels differ markedly from the original 51.

Figures

Figures reproduced from arXiv: 2603.13740 by Grace Lim, Rajat Modi, Sirshapan Mitra, Yogesh Rawat, Zengyan Wang.

**Figure 1.** Cross-view examples from the Sky2Ground dataset. Satellite, aerial, and ground-level images for a variety of urban scenes in Sky2Ground, where each column corresponds to a unique site. These examples highlight strong viewpoint and appearance variations across modalities, revealing the challenges of cross-view matching and multi-scale scene understanding. Real images additionally introduce diverse lighting …

**Figure 2.** Overview of the Sky2Ground dataset. The middle trajectory illustrates camera poses from one of our collected sites. Dots indicate ground-truth camera positions for synthetic images, while red frustums represent the estimated camera poses for real images. The surrounding images showcase example satellite, aerial, and ground views, where the real images demonstrate more diverse illumination conditions and rea…

**Figure 3.** Benchmark splits and modality setups. (a) Image counts per split for the synthetic splits CR (Core), D1 (Dense 1), D2 (Dense 2), D3 (Dense 3), and D4 (Dense 4), across ground, aerial, and satellite views. (b) View combinations used in each benchmark setup: Ground (G), Ground+Aerial (GA), Ground+Satellite (GS), and Ground+Aerial+Satellite (GAS). … ensures that although the total number of images across D1 − D4 changes, ev…

**Figure 4.** Comparison of RRA@5 and RTA@5 metrics for four methods (Dust3r, Mast3r, Map Anything, and VGGT). Models ‘suffer’ when the number of views is small: In …

**Figure 5.** Comparison of models across view combinations.

**Figure 6.** Comparison of reconstruction quality across ground/aerial/satellite: We report PSNR↑ and DreamSim↓ (lower is better). All methods benefit from increased camera density. 2DGS consistently achieves the best perceptual quality. 2D-GS consistently gives best rendering results across ground/aerial/satellite: Earlier, we noticed that VGGT obtained the best performance out of all localization methods. We use VG…

**Figure 7.** Rendering results across satellite, aerial, and ground viewpoints. Each row shows a different view, with the leftmost column …

**Figure 8.** SkyNet: An architecture for cross-view camera localization. SkyNet consists of two encoders: (1) a Sat-Encoder that processes input satellite images, and (2) a GAS-Encoder that processes ground/aerial/satellite views. Our model first patchifies the input images into tokens with DINO and appends camera tokens for camera prediction. The GAS-Encoder then alternates between self-attention and Masked-Satellite Attention. Camera/De…

**Figure 9.** Sky2Ground Python script used to download satellite images and stitch aerial tiles using quadkeys.

**Figure 10.** (Top row): synthetic satellite images generated from Google Earth Studio. (Bottom row): corresponding real satellite images we collected. The script begins by converting a latitude and longitude into what Bing Maps refers to as tile coordinates. The function latlon_to_tileXY(lat, lon, zoom) takes three inputs: a geographic latitude, a geographic longitude, and a zoom level. Web Mercator, the projection u…

**Figure 11.** (Top row): synthetic satellite images generated from Google Earth Studio. (Bottom row): corresponding real satellite images we collected. … of the module, download_big_bing_image(lat, lon, zoom, grid_size), stitches together an entire grid of tiles around a central geographic coordinate. The user specifies a grid size, such as 3 × 3, 5 × 5, or 10 × 10. The function first determines the tile coordinates of t…

**Figure 12.** Qualitative results of SkyNet vs VGGT: Given one satellite image and two ground views, VGGT fails to localize the pointmaps properly; e.g., the pyramid is formed at the wrong location (marked in red). In contrast, SkyNet correctly localizes the ground views (marked in blue), even when the views are non-overlapping. …plicitly encodes the underlying depth and viewpoint changes. The architecture employs a symm…

**Figure 13.** (Top row): synthetic satellite images generated from Google Earth Studio. (Bottom row): corresponding real satellite images collected from aerial sources.

**Figure 14.** Ferry Building: Visualization of the ground-truth point cloud on our Sky2Ground dataset. The top row shows the reconstructed scene from a satellite view perspective, and the bottom row presents two aerial view renderings for structural details and geometry.

**Figure 15.** Trafalgar Square: Visualization of the ground-truth point clouds on our Sky2Ground dataset. The top row shows the reconstructed scene from a satellite view perspective, and the bottom row presents two aerial view renderings for structural details and geometry.

**Figure 16.** Charles Bridge: Visualization of the ground-truth point cloud on our Sky2Ground dataset. The top row shows the reconstructed scene from a satellite view perspective, and the bottom row presents two aerial view renderings for structural details and geometry.

**Figure 17.** Colosseum: Visualization of the ground-truth point cloud on our Sky2Ground dataset. The top row shows the reconstructed scene from a satellite view perspective, and the bottom row presents two aerial view renderings for structural details and geometry.

**Figure 18.** Piazza Navona: Visualization of the ground-truth point cloud on our Sky2Ground dataset. The top row shows the reconstructed scene from a satellite view perspective, and the bottom row presents two aerial view renderings for structural details and geometry.
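The captions for Figures 9 to 11 mention a script that converts latitude and longitude into Bing Maps tile coordinates with latlon_to_tileXY(lat, lon, zoom), addresses tiles by quadkey, and stitches a grid of tiles around a central coordinate. The sketch below covers only the coordinate math, following the publicly documented Bing Maps Tile System formulas. Only the name latlon_to_tileXY comes from the source. The helpers tileXY_to_quadkey and tile_grid_quadkeys are hypothetical names, the edge handling is an assumption, and nothing is downloaded.

```python
import math

MAX_LAT = 85.05112878  # latitude limit of the Web Mercator square map

def latlon_to_tileXY(lat, lon, zoom):
    """Convert latitude/longitude in degrees to tile (x, y) at the given zoom."""
    lat = max(-MAX_LAT, min(MAX_LAT, lat))
    lon = max(-180.0, min(180.0, lon))
    n = 2 ** zoom  # number of tiles along each axis
    x = (lon + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1.0 + sin_lat) / (1.0 - sin_lat)) / (4.0 * math.pi)
    tile_x = min(n - 1, max(0, int(x * n)))
    tile_y = min(n - 1, max(0, int(y * n)))
    return tile_x, tile_y

def tileXY_to_quadkey(tile_x, tile_y, zoom):
    """Interleave the bits of tile x and y into a base-4 quadkey string."""
    digits = []
    for i in range(zoom, 0, -1):
        mask = 1 << (i - 1)
        digit = 0
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)

def tile_grid_quadkeys(lat, lon, zoom, grid_size):
    """Quadkeys for a grid_size x grid_size block of tiles around (lat, lon).

    Assumption: x wraps around the antimeridian and y is clamped at the map edge.
    """
    cx, cy = latlon_to_tileXY(lat, lon, zoom)
    n = 2 ** zoom
    half = grid_size // 2
    offsets = range(-half, grid_size - half)
    rows = []
    for dy in offsets:
        ty = min(n - 1, max(0, cy + dy))
        rows.append([tileXY_to_quadkey((cx + dx) % n, ty, zoom) for dx in offsets])
    return rows
```

A stitching routine like the download_big_bing_image function named in the caption would then fetch one image per quadkey and paste the tiles into a single canvas in row-major order.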

read the original abstract

We introduce Sky2Ground, a three-view dataset designed for varying altitude camera localization, correspondence learning, and reconstruction. The dataset combines structured synthetic imagery with real, in-the-wild images, providing both controlled multi-view geometry and realistic scene noise. Each of the 51 sites contains thousands of satellite, aerial, and ground images spanning wide altitude ranges and nearly orthogonal viewing angles, enabling rigorous evaluation across global-to-local contexts. We benchmark state-of-the-art pose estimation models, including MASt3R, DUSt3R, Map Anything, and VGGT, and observe that the use of satellite imagery often degrades performance, highlighting the challenges under large altitude variations. We also examine reconstruction methods, highlighting the challenges introduced by sparse geometric overlap, varying perspectives, and the use of real imagery, which often introduces noise and reduces rendering quality. To address some of these challenges, we propose SkyNet, a model which enhances cross-view consistency when incorporating satellite imagery with a curriculum-based training strategy to progressively incorporate more satellite views. SkyNet significantly strengthens multi-view alignment and outperforms existing methods by 9.6% on RRA@5 and 18.1% on RTA@5 in terms of absolute performance. Sky2Ground and SkyNet together establish a comprehensive testbed and baseline for advancing large-scale, multi-altitude 3D perception and generalizable camera localization. Code and models will be released publicly for future research. Project page: https://sky2ground2026.github.io/sky2ground/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Sky2Ground adds a targeted benchmark for multi-altitude localization and a curriculum model with clear gains on the new data.

The main point is a new three-view dataset across 51 sites that mixes satellite, aerial, and ground imagery with wide altitude spans, plus a curriculum-trained model called SkyNet that improves multi-view alignment when satellite data is added. The dataset construction stands out because it pairs controlled synthetic views for geometry with real in-the-wild images for noise, a pairing that prior multi-view work has not scaled to this altitude range and view combination. Benchmarks on existing pose estimators like MASt3R and DUSt3R show the expected drop when satellite imagery enters the mix, and SkyNet delivers the reported absolute lifts of 9.6% on RRA@5 and 18.1% on RTA@5 through progressive training that gradually brings in more satellite views. The reconstruction experiments also document the practical hit from sparse overlap and real-image noise without overclaiming fixes. The evaluation stays empirical on held-out data with no circular definitions in the metrics, and the curriculum is simply a training schedule rather than a fitted-parameter trick. The main limitation is that 51 sites may not fully represent every real-world variation in lighting, terrain, or sensor noise, though this is typical for a first benchmark rather than a load-bearing flaw. The work is aimed at people building localization or 3D perception systems that cross altitude levels, such as drone or satellite-ground fusion tasks. A reader focused on multi-view geometry benchmarks will get direct value from the released data and baselines. It deserves peer review because the new resources and consistent internal results are solid enough to warrant referee attention.

Referee Report

2 major / 2 minor

Summary. The paper introduces Sky2Ground, a benchmark dataset of 51 sites with thousands of synthetic and real satellite, aerial, and ground images spanning wide altitude ranges and orthogonal views. It evaluates pose estimation models (MASt3R, DUSt3R, Map Anything, VGGT) and reconstruction methods, documenting performance degradation from satellite imagery, and proposes SkyNet, which applies a curriculum training strategy to progressively add satellite views and reports absolute gains of 9.6% on RRA@5 and 18.1% on RTA@5 over baselines.

Significance. If the empirical gains and evaluation protocol hold under scrutiny, the work supplies a valuable public testbed for multi-altitude 3D perception and localization, an area where existing benchmarks lack coverage of extreme viewpoint and scale changes. The curriculum-based SkyNet offers a practical, reproducible baseline for cross-view consistency, and the planned code/model release will support follow-on research in generalizable camera pose estimation.

major comments (2)

[§5.2] The absolute gains of 9.6% on RRA@5 and 18.1% on RTA@5 are reported without standard deviations, multiple random seeds, or statistical significance tests. Because the curriculum progression schedule is an explicit free parameter, these omissions make it difficult to determine whether the outperformance is robust or sensitive to hyperparameter choices.
[§4.1] The manuscript provides insufficient detail on the train/validation/test splits across the 51 sites and on the precise balancing of real versus synthetic imagery within each altitude tier. This information is load-bearing for verifying that the reported degradation and SkyNet improvements are not artifacts of site-specific leakage or unbalanced noise distributions.

minor comments (2)

[Figure 3] The caption and legend should explicitly label which rows correspond to satellite versus ground views to help readers immediately connect the visualizations to the altitude-variation challenge.
[Related Work] A brief citation to recent remote-sensing cross-view matching papers (e.g., on satellite-to-ground registration) would strengthen the positioning of Sky2Ground relative to prior benchmarks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive suggestions. We respond to each major comment below and will make the necessary revisions to the manuscript.

Referee: [§5.2] The absolute gains of 9.6% on RRA@5 and 18.1% on RTA@5 are reported without standard deviations, multiple random seeds, or statistical significance tests. Because the curriculum progression schedule is an explicit free parameter, these omissions make it difficult to determine whether the outperformance is robust or sensitive to hyperparameter choices.

Authors: We agree that including standard deviations from multiple random seeds and statistical significance tests would better demonstrate the robustness of the reported gains. Although the curriculum progression schedule was determined through preliminary validation, we will conduct additional experiments with multiple seeds in the revised manuscript and report the mean performance along with standard deviations and p-values where appropriate. (revision: yes)
Referee: [§4.1] The manuscript provides insufficient detail on the train/validation/test splits across the 51 sites and on the precise balancing of real versus synthetic imagery within each altitude tier. This information is load-bearing for verifying that the reported degradation and SkyNet improvements are not artifacts of site-specific leakage or unbalanced noise distributions.

Authors: We appreciate this observation and will provide more detailed information in the revised manuscript. Specifically, we will clarify the site-level splits (ensuring no overlap between train, validation, and test sites), the allocation of images across altitude tiers, and the exact ratios of real to synthetic images within each tier to allow full verification of the experimental setup. (revision: yes)
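The site-level split discussed in this exchange can be illustrated with a short sketch. It assigns whole sites, rather than individual images, to train, validation, and test, then checks that no site appears in more than one split. The 70/15/15 ratios, the seed, and the names `split_sites` and `assert_no_site_leakage` are hypothetical. The page does not state the paper's actual split.

```python
import random

def split_sites(site_ids, ratios=(0.7, 0.15, 0.15), seed=0):
    """Partition unique site IDs into train/val/test so every site lands in one split."""
    sites = sorted(set(site_ids))
    random.Random(seed).shuffle(sites)
    n_train = int(ratios[0] * len(sites))
    n_val = int(ratios[1] * len(sites))
    return {
        "train": sites[:n_train],
        "val": sites[n_train:n_train + n_val],
        "test": sites[n_train + n_val:],  # remainder goes to test
    }

def assert_no_site_leakage(splits):
    """Raise AssertionError if any site ID is shared between two splits."""
    names = list(splits)
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            shared = set(splits[names[a]]) & set(splits[names[b]])
            assert not shared, f"sites shared by {names[a]} and {names[b]}: {sorted(shared)}"

splits = split_sites([f"site_{i:02d}" for i in range(51)])
assert_no_site_leakage(splits)
```

Images would then inherit the split of their site, which is what rules out the site-specific leakage the referee raises.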

Circularity Check

0 steps flagged

No significant circularity; empirical benchmarks on held-out data

full rationale

The paper introduces the Sky2Ground dataset (51 sites mixing synthetic and real imagery) and SkyNet model with a curriculum training procedure. Central claims are absolute performance gains (9.6% RRA@5, 18.1% RTA@5) measured on held-out test splits against external baselines (MASt3R, DUSt3R, etc.). No equations or derivations are present that reduce to fitted inputs by construction; the curriculum is a standard training schedule, not a self-definitional loop. Any self-citations are incidental and non-load-bearing for the empirical results. The evaluation protocol is self-contained against external benchmarks and does not import uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard multi-view geometry assumptions plus a new training curriculum whose hyperparameters are not detailed in the abstract.

free parameters (1)

curriculum progression schedule
The rate at which satellite views are added during training is a tunable choice that affects the reported gains.

axioms (1)

domain assumption: Standard epipolar and multi-view geometry constraints remain valid across large altitude differences
Invoked when using satellite imagery together with aerial and ground views for correspondence learning.

invented entities (1)

SkyNet (no independent evidence)
purpose: Model architecture and curriculum strategy to improve cross-view consistency
New neural network and training procedure introduced to address the observed performance drop with satellite data.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages

[1]

Urban digital twins and metaverses towards city multiplicities: uniting or dividing urban experiences? Ethics and Information Technology, 27 (1):4, 2025

Javier Argota Sánchez-Vaquerizo. Urban digital twins and metaverses towards city multiplicities: uniting or dividing urban experiences? Ethics and Information Technology, 27 (1):4, 2025. 1

work page 2025
[2]

Implications of web mercator and its use in online mapping

Sarah E Battersby, Michael P Finn, E Lynn Usery, and K Yamamoto. Implications of web mercator and its use in online mapping. Cartographica, 49(2):85–101, 2014. 1

work page 2014
[3]

nuscenes: A multi- modal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020. 3

work page 2020
[4]

End- to-end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End- to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer,

work page
[5]

3d gaussian splatting for fine- detailed surface reconstruction in large-scale scene.arXiv preprint arXiv:2506.17636, 2025

Shihan Chen, Zhaojin Li, Zeyu Chen, Qingsong Yan, Gaoyang Shen, and Ran Duan. 3d gaussian splatting for fine- detailed surface reconstruction in large-scale scene. arXiv preprint arXiv:2506.17636, 2025. 2

work page arXiv 2025
[6]

An integrated uav navigation system based on aerial image matching

Gianpaolo Conte and Patrick Doherty. An integrated uav navigation system based on aerial image matching. In 2008 IEEE Aerospace Conference, pages 1–10. IEEE, 2008. 1

work page 2008
[7]

Citygs-x: A scalable architecture for efficient and geometrically accurate large-scale scene reconstruction.arXiv preprint arXiv:2503.23044,

Yuanyuan Gao, Hao Li, Jiaqi Chen, Zhengyu Zou, Zhihang Zhong, Dingwen Zhang, Xiao Sun, and Junwei Han. Citygs- x: A scalable architecture for efficient and geometrically accurate large-scale scene reconstruction. arXiv preprint arXiv:2503.23044, 2025. 2

work page arXiv 2025
[8]

Google Earth Studio

Google. Google Earth Studio. https://earth.google. com/studio/. 1

work page
[9]

Dragon: Drone and ground gaussian splatting for 3d build- ing reconstruction

Yujin Ham, Mateusz Michalkiewicz, and Guha Balakrishnan. Dragon: Drone and ground gaussian splatting for 3d build- ing reconstruction. In 2024 IEEE International Conference on Computational Photography (ICCP), pages 1–12. IEEE,

work page 2024
[10]

Beyond geo-localization: Fine-grained orientation of street-view images by cross-view matching with satellite imagery

Wenmiao Hu, Yichen Zhang, Yuxuan Liang, Yifang Yin, An- drei Georgescu, An Tran, Hannes Kruppa, See-Kiong Ng, and Roger Zimmermann. Beyond geo-localization: Fine-grained orientation of street-view images by cross-view matching with satellite imagery. In Proceedings of the 30th ACM international conference on multimedia, pages 6155–6164,

work page
[11]

Horizon-gs: Unified 3d gaussian splatting for large-scale aerial-to-ground scenes

Lihan Jiang, Kerui Ren, Mulin Yu, Linning Xu, Junting Dong, Tao Lu, Feng Zhao, Dahua Lin, and Bo Dai. Horizon-gs: Unified 3d gaussian splatting for large-scale aerial-to-ground scenes. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26789–26799, 2025. 2

work page 2025
[12]

Image Matching across Wide Baselines: From Paper to Practice

Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image Matching across Wide Baselines: From Paper to Practice. International Journal of Computer Vision, 2020. 4

work page 2020
[13]

Unconstrained large-scale 3d re- construction and rendering across altitudes

Neil Joshi, Joshua Carney, Nathanael Kuo, Homer Li, Cheng Peng, and Myron Brown. Unconstrained large-scale 3d re- construction and rendering across altitudes. arXiv preprint arXiv:2505.00734, 2025. 2

work page arXiv 2025
[14]

3d gaussian splatting for real-time radiance field rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023. 2

work page 2023
[15]

Future landscape visualization using a city digital twin: In- tegration of augmented reality and drones with implemen- tation of 3d model-based occlusion handling

Naoki Kikuchi, Tomohiro Fukuda, and Nobuyoshi Yabuki. Future landscape visualization using a city digital twin: In- tegration of augmented reality and drones with implemen- tation of 3d model-based occlusion handling. Journal of Computational Design and Engineering, 9(2):837–856, 2022. 2

work page 2022
[16]

Photogrammetry: Geometry from Images and Laser Scans

Karl Kraus. Photogrammetry: Geometry from Images and Laser Scans. De Gruyter, 2007. 1

work page 2007
[17]

Digital twin of a city: Review of technology serving city needs

Ville V Lehtola, Mila Koeva, Sander Oude Elberink, Paulo Raposo, Juho-Pekka Virtanen, Faridaddin Vahdatikhaki, and Simone Borsci. Digital twin of a city: Review of technology serving city needs. International Journal of Applied Earth Observation and Geoinformation, 114:102915, 2022. 1

work page 2022
[18]

Ground- ing image matching in 3d with mast3r, 2024

Vincent Leroy, Yohann Cabon, and Jerome Revaud. Ground- ing image matching in 3d with mast3r, 2024. 1, 2, 8, 5

work page 2024
[19]

Learning cross-view visual geo-localization without ground truth

Haoyuan Li, Chang Xu, Wen Yang, Huai Yu, and Gui-Song Xia. Learning cross-view visual geo-localization without ground truth. IEEE Transactions on Geoscience and Remote Sensing, 2024. 1

work page 2024
[20]

Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond

Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3205–3215, 2023. 1, 3

work page 2023
[21]

Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d

Yiyi Liao, Jun Xie, and Andreas Geiger. Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3292–3310, 2022. 3

work page 2022
[22]

Citygaussian: Real-time high-quality large-scale scene rendering with gaussians

Yang Liu, Chuanchen Luo, Lue Fan, Naiyan Wang, Jun- ran Peng, and Zhaoxiang Zhang. Citygaussian: Real-time high-quality large-scale scene rendering with gaussians. In European Conference on Computer Vision, pages 265–282. Springer, 2024. 2

work page 2024
[23]

Citygaussianv2: Efficient and geometrically accurate reconstruction for large- scale scenes.arXiv preprint arXiv:2411.00771, 2024

Yang Liu, Chuanchen Luo, Zhongkai Mao, Junran Peng, and Zhaoxiang Zhang. Citygaussianv2: Efficient and geometri- cally accurate reconstruction for large-scale scenes. arXiv preprint arXiv:2411.00771, 2024. 2

work page arXiv 2024
[24]

Sat- NeRF: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using RPC cam- eras

Roger Marí, Gabriele Facciolo, and Thibaud Ehret. Sat- NeRF: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using RPC cam- eras. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1310– 1320, 2022. 1

work page 2022
[25]

Bing Maps Imagery Services

Microsoft Corporation. Bing Maps Imagery Services. https://www.bing.com/maps. 1

work page
[26]

Nerf: Representing scenes as neural radiance fields for view syn- thesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis. Communications of the ACM, 65(1):99–106, 2021. 2

work page 2021
[27]

Cross-view visual geo-localization for outdoor augmented reality

Niluthpol Chowdhury Mithun, Kshitij S Minhas, Han-Pang Chiu, Taragay Oskiper, Mikhail Sizintsev, Supun Samarasek- era, and Rakesh Kumar. Cross-view visual geo-localization for outdoor augmented reality. In 2023 IEEE Conference Virtual Reality and 3D User Interfaces (VR), pages 493–502. IEEE, 2023. 1

work page 2023
[28]

On occlu- sions in video action detection: Benchmark datasets and training recipes

Rajat Modi, Vibhav Vineet, and Yogesh Rawat. On occlu- sions in video action detection: Benchmark datasets and training recipes. Advances in Neural Information Processing Systems, 36:57306–57335, 2023. 2

work page 2023
[29]

Visual localization with google earth images for robust global pose estimation of uavs

Bhavit Patel, Timothy D Barfoot, and Angela P Schoel- lig. Visual localization with google earth images for robust global pose estimation of uavs. In 2020 IEEE international conference on robotics and automation (ICRA), pages 6491–

work page 2020
[30]

Navigating urban complexity: The trans- formative role of digital twins in smart city development

Dechen Peldon, Saeed Banihashemi, Khuong LeNguyen, and Sybil Derrible. Navigating urban complexity: The trans- formative role of digital twins in smart city development. Sustainable Cities and Society, 2024. 1

work page 2024
[31]

Revealing scenes by inverting structure from motion reconstructions

Francesco Pittaluga, Sanjeev J Koppal, Sing Bing Kang, and Sudipta N Sinha. Revealing scenes by inverting structure from motion reconstructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 145–154, 2019. 1

work page 2019
[32]

Sat2map: Reconstructing 3d building roof from 2d satellite images

Yoones Rezaei and Stephen Lee. Sat2map: Reconstructing 3d building roof from 2d satellite images. ACM Transactions on Cyber-Physical Systems, 8(4):1–25, 2024. 1

work page 2024
[33]

Structure-from-motion revisited

Johannes Lutz Schönberger and Jan Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2

work page 2016
[34]

A vote-and-verify strat- egy for fast spatial verification in image retrieval

Johannes Lutz Schönberger, True Price, Torsten Sattler, Jan- Michael Frahm, and Marc Pollefeys. A vote-and-verify strat- egy for fast spatial verification in image retrieval. In Asian Conference on Computer Vision (ACCV), 2016. 1

work page 2016
[35]

Pixelwise view selection for un- structured multi-view stereo

Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for un- structured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016. 1, 2

work page 2016
[36]

To- wards urban digital twins: A workflow for procedural visual- ization using geospatial data

Sanjay Somanath, Vasilis Naserentin, Orfeas Eleftheriou, Daniel Sjölie, Beata Stahre Wästberg, and Anders Logg. To- wards urban digital twins: A workflow for procedural visual- ization using geospatial data. Remote Sensing, 16(11):1939,

work page 1939
[37]

Dronesplat: 3d gaussian splatting for ro- bust 3d reconstruction from in-the-wild drone imagery

Jiadong Tang, Yu Gao, Dianyi Yang, Liqi Yan, Yufeng Yue, and Yi Yang. Dronesplat: 3d gaussian splatting for ro- bust 3d reconstruction from in-the-wild drone imagery. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 833–843, 2025. 2

work page 2025
[38]

Geometric processing of remote sensing im- ages: models, algorithms and methods

Thierry Toutin. Geometric processing of remote sensing im- ages: models, algorithms and methods. International Journal of Remote Sensing, 25(5):1893–1924, 2004. 1

work page 1924
[39]

Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis

Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, and Shubham Tulsiani. Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 1, 3, 8

work page 2025
[40]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 1, 2, 3, 6, 7, 5

[41] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024. 1, 2, 8, 4

[42] Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Romain Brégier, Yohann Cabon, Vaibhav Arora, Leonid Antsfeld, Boris Chidlovskii, Gabriela Csurka, and Jérôme Revaud. Croco: Self-supervised pre-training for 3d vision tasks by cross-view completion. Advances in Neural Information Processing Systems, 35:3502–3516, 2022. 5

[43] Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy, Yohann Cabon, Vaibhav Arora, Romain Brégier, Gabriela Csurka, Leonid Antsfeld, Boris Chidlovskii, and Jérôme Revaud. Croco v2: Improved cross-view completion pre-training for stereo matching and optical flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17969–17980, 2023. 5

[44] Scott Workman, Richard Souvenir, and Nathan Jacobs. Wide-area image geolocalization with aerial reference imagery. In Proceedings of the IEEE International Conference on Computer Vision, pages 3961–3969, 2015. 1

[45] YuanZheng Wu, Jin Liu, and Shunping Ji. 3d gaussian splatting for large-scale surface reconstruction from aerial images. arXiv preprint arXiv:2409.00381, 2024. 2

[46] Yuanbo Xiangli, Linning Xu, Xingang Pan, Nanxuan Zhao, Anyi Rao, Christian Theobalt, Bo Dai, and Dahua Lin. Bungeenerf: Progressive neural radiance field for extreme multi-scale scene rendering. In European conference on computer vision, pages 106–122. Springer, 2022. 1, 2, 3

[47] Butian Xiong, Zhuo Li, and Zhen Li. Gauu-scene: A scene reconstruction benchmark on large scale 3d reconstruction dataset using gaussian splatting. arXiv preprint arXiv:2401.14032, 2024. 3

[48] Linning Xu, Vasu Agrawal, William Laney, Tony Garcia, Aayush Bansal, Changil Kim, Samuel Rota Bulò, Lorenzo Porzi, Peter Kontschieder, Aljaž Božič, et al. Vr-nerf: High-fidelity virtualized walkable spaces. In SIGGRAPH Asia 2023 Conference Papers, pages 1–12, 2023. 1

[49] Zhensheng Yuan, Haozhi Huang, Zhen Xiong, Di Wang, and Guanghua Yang. Robust and efficient 3d gaussian splatting for urban scene reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 26209–26219, 2025. 2

[50] Menghua Zhai, Zachary Bessinger, Scott Workman, and Nathan Jacobs. Predicting ground-level scene layout from aerial imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 867–875,

[51] Chenhao Zhang, Yuanping Cao, and Lei Zhang. Crossview-gs: Cross-view gaussian splatting for large-scale scene reconstruction. arXiv preprint arXiv:2501.01695, 2025. 2

[52] Huiqing Zhang, Yifei Xue, Ming Liao, and Yizhen Lao. Bird-nerf: Fast neural reconstruction of large-scale scenes from aerial imagery. Scientific Reports, 15(1):37295, 2025. 2

[53] Saining Zhang, Baijun Ye, Xiaoxue Chen, Yuantao Chen, Zongzheng Zhang, Cheng Peng, Yongliang Shi, and Hao Zhao. Drone-assisted road gaussian splatting with cross-view uncertainty. arXiv preprint arXiv:2408.15242, 2024. 3

Sky2Ground: A Benchmark for Site Modeling under Varying Altitude
Supplementary Material

This supplementary document provides additional ...

