Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

Akshay Krishnan; Arno Solin; Daniyar Turmukhambetov; Filippo Aleotti; Gabriel Brostow; Guillermo Garcia-Hernando; Juho Kannala; Matias Turkulainen; Mohamed Sayed

arxiv: 2605.19656 · v1 · pith:HSVQKY2Tnew · submitted 2026-05-19 · 💻 cs.CV

Pith reviewed 2026-05-20 06:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords: view synthesis · Gaussian splatting · satellite imagery · georeferenced images · novel view synthesis · outdoor scenes · feed-forward reconstruction

The pith

Fusing satellite and ground images in one 3D frame improves novel-view synthesis for outdoor scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Cross-View Splatter, a feed-forward model that predicts pixel-aligned Gaussian splats by combining GPS-tagged ground photos with orthorectified satellite imagery. It aligns feature representations from ground-level and bird's-eye views inside a shared coordinate frame to fill coverage gaps that ground images alone leave behind. Training draws on curated georeferenced datasets mined from public mapping services, and the method is tested on a new benchmark that compares against earlier view-synthesis approaches. If the alignment step holds, the result is denser scene reconstructions and more accurate novel views without requiring exhaustive ground capture.

Core claim

Cross-View Splatter predicts pixel-aligned Gaussian splats for outdoor scenes by fusing orthorectified satellite views with GPS-tagged ground photos inside a single 3D coordinate frame; aligning the ground and bird's-eye feature representations produces better scene coverage and novel-view synthesis than ground imagery alone.

What carries the argument

Alignment of ground and bird's-eye feature representations inside a unified 3D coordinate frame that fuses satellite and ground imagery for Gaussian splat prediction.

If this is right

Ground capture campaigns for large outdoor sites can be reduced while still obtaining usable 3D reconstructions.
Novel views become feasible in regions visible only from the satellite vantage.
Publicly available satellite data can serve as a geometric prior for any GPS-tagged ground collection.
The same feed-forward pipeline supports evaluation on a new georeferenced benchmark that includes both image types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to other multi-source capture settings, such as drone footage paired with ground images, if similar feature alignment is applied.
Real-time mapping applications might benefit if the feed-forward prediction is further optimized for speed on mobile hardware.
Errors in public satellite orthorectification would directly limit reconstruction fidelity in practice.

Load-bearing premise

Orthorectified satellite imagery and GPS-tagged ground photos can be aligned into a shared 3D frame without large systematic errors in pose or scale.

What would settle it

Measure novel-view quality on a test scene after deliberately shifting satellite poses or scales by known amounts; if quality gains over ground-only disappear or reverse, the central claim does not hold.
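A minimal sketch of how such a perturbation test might be set up, assuming the satellite alignment can be modelled as a 2D similarity transform on ground-plane coordinates in metres. The function names and the PSNR gains below are hypothetical placeholders for illustration; they are not the paper's code or results.

```python
import math


def perturb_satellite_point(x, y, shift_m=(0.0, 0.0), rot_deg=0.0, scale=1.0):
    """Apply a known 2D similarity transform to a ground-plane point (metres).

    Hypothetical helper: rotate about the origin, scale, then shift.
    """
    theta = math.radians(rot_deg)
    c, s = math.cos(theta), math.sin(theta)
    xr = scale * (c * x - s * y)
    yr = scale * (s * x + c * y)
    return xr + shift_m[0], yr + shift_m[1]


def shifts_where_gain_is_lost(gain_db_by_shift):
    """Return the perturbation sizes at which the fused model no longer beats
    the ground-only model, given {shift in metres: PSNR gain in dB}."""
    return sorted(shift for shift, gain in gain_db_by_shift.items() if gain <= 0.0)


# Placeholder numbers for illustration only; they are not measurements.
example_gains = {0.0: 1.5, 2.0: 0.4, 5.0: -0.3}
print(shifts_where_gain_is_lost(example_gains))  # [5.0]
```

In a real sweep, each entry of the dictionary would come from re-running the model on satellite inputs perturbed by the stated amount and subtracting the ground-only PSNR.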

Figures

Figures reproduced from arXiv: 2605.19656 by Akshay Krishnan, Arno Solin, Daniyar Turmukhambetov, Filippo Aleotti, Gabriel Brostow, Guillermo Garcia-Hernando, Juho Kannala, Matias Turkulainen, Mohamed Sayed.

**Figure 1.** Cross-View Splatter is a feed-forward model that predicts Gaussian splats for GPS-tagged ground level images and corresponding orthorectified satellite imagery from mapping services. It predicts Gaussian splats for both ground level and bird’s-eye views in a unified coordinate system and supports multiple input images with unknown 6DoF poses. Only the GPS location of the ground level images is required. O…

**Figure 2.** Method overview: Given geolocalized ground images and a single orthorectified satellite perspective, our model synthesizes 3D Gaussian splats in a shared coordinate frame. Ground views exchange information with satellite views within bidirectional cross-attention layers. Gaussians are predicted separately from ground and satellite branches, which are then combined into a unified coordinate frame. Although …

**Figure 3.** Coordinate conventions. We consider camera $I^{\text{ground}}_0$ to define the origin of the world coordinates, i.e. the identity pose, as well as the spatial location of the BEV satellite image $I^{\text{sat}}$. The BEV $I^{\text{sat}}$ frame is aligned with the heading of $I^{\text{ground}}_0$ such that the camera look-at direction $z_c$ is pointing up in the satellite view. Gaussian splats $G$ are projected via perspective projection to ground views,…

**Figure 4.** Example reconstruction outputs on scenes not seen during training. Left to right: input ground images, input satellite image, predicted height map, predicted height confidence (black: low, red: high), predicted ground Gaussians, predicted combined Gaussians. Additionally, a camera-head is used to regress the 6DoF relative pose $T_i$ and perspective camera intrinsics $K_i$ for each input ground image using the ca…

**Figure 5.** Illustration of training samples from our BEV aug

**Figure 6.** Satellite-to-ground qualitative results. Column 1: target ground image and satellite image input. Columns (2-4): Predictions. (Top) our $G^{\text{sat}}$ rendered to ground-view RGB, ground-view depth, and BEV depth. (Bottom) Sat2Density [61] with RGB and volume-rendered ground/BEV depths from predicted density. Our method produces sharper, more accurate depth maps. The scene is challenging for Sat2Density [61], as …

**Figure 7.** Qualitative results. We show results for a target image from the Outdoor Tanks and Temples dataset under the sparse-view synthesis setting. Our method extrapolates into unobserved regions by leveraging visual cues from the corresponding satellite image.

**Figure 8.** Stratified evaluation. Bucketed PSNR performance (5% bins) vs. image overlap on our geolocalized Tanks & Temples dataset. Plot axes: IoU (Context vs. Target) against PSNR↑; series: AnySplat and Combined (Ours). Implementation details. We use PyTorch [58] with gsplat (v1.5) [96] for Gaussian rasterization. We initialize our model from AnySplat [36]. We train for 4 days on 2× A100 GPUs with a batch size of 10, using FlashAttention-v2 [10, 11] and mixed-precision. We resize satellite images and …

**Figure 10.** Benchmark geoalignment tool. We manually align COLMAP reconstructions to satellite imagery for 10 scenes from Tanks and Temples and 40 scenes from DL3DV-Benchmark datasets. Top: satellite image. Bottom-left: aligned COLMAP pointcloud. Bottom-right: visualization of points projected to a scene image. camera intrinsics and camera poses provided by the dataset and ran ‘colmap point triangulator’ command to g…

**Figure 11.** Our Tanks and Temples benchmark. Visualization of satellite and ground level images

**Figure 12.** Our DL3DV benchmark. Visualization of satellite and ground level images.

**Figure 13.** Training data visualization. We showcase our training data that consists of satellite images and terrain height maps aligned with ground level images. poses for novel-view synthesis. NoPoSplat first reconstructs a splat and then refines each novel camera pose for 200 iterations to align it with the reconstruction. Long-LRM uses Plücker rays for the target-views. AnySplat performs two forward passes: one …

**Figure 14.** Qualitative comparison to SEVA. SEVA is a generative based model capable of hallucinating unseen areas whereas our Cross-View Splatter is a feed-forward approach that predicts geometry only for visible regions in ground images and satellite image. E.1. Comparison to Diffusion Based Method. We compare our method to Stable Virtual Camera (SEVA) [104], which is a state-of-the-art diffusion model for viewsyn…

**Figure 15.** Limitations of satellite imagery. Notice how a building has been rebuilt and expanded in the right frame compared to the left taken a few years ago. This is the Family scene in Tanks and Temples. G. Limitations. Our method struggles in scenarios where the ground-level camera observes directions that fall outside the satellite orthographic view. For example, when looking upward toward the sky or downward at …

**Figure 16.** Qualitative baseline comparisons. Additional qualitative comparisons of various baselines on our georeferenced Tanks and Temples dataset.

**Figure 17.** Cross-View Splatter Ground-Only vs Full-Model. We visualize the benefit of Cross-View Splatter’s satellite branch on qualitative rendering on the Tanks and Temples and DL3DV benchmarks. Our Full-Model achieves better coverage and completeness compared to ground only imagery in sparse-view settings.

**Figure 18.** Cross-View Splatter satellite qualitative. We show visuals of our full-model satellite head predictions on our benchmark scenes. The first two rows are the inputs to the model, i.e. a BEV perspective and a ground level image. We predict height maps, confidence values, ground level splats, and satellite splats that can then be rendered to ground level views.

**Figure 19.** Sat2Density [61] comparison. We compare our predictions (Columns 1-4) with Sat2Density height estimates and Sat2Density ground renders against the Ground Truth inputs (Columns 5-6). Both models get the same satellite image and ground image as inputs.
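The two projection models named in the Figure 3 caption (perspective for ground views, orthographic for the satellite view) can be illustrated with a small sketch. Everything beyond the caption is an assumption made for illustration: the camera axes (x right, y down, z forward), a BEV image centred on the first ground camera, and the parameter names `metres_per_pixel` and `image_size`. This is not the paper's implementation.

```python
def project_perspective(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates.

    Assumed axes: x right, y down, z forward (the look-at direction).
    Returns None for points at or behind the camera plane.
    """
    x, y, z = point_cam
    if z <= 0.0:
        return None
    return fx * x / z + cx, fy * y / z + cy


def project_orthographic_bev(point_world, metres_per_pixel, image_size):
    """Top-down orthographic projection into a bird's-eye-view image.

    Assumes the world origin (first ground camera) sits at the image centre
    and that the forward axis z points up in the image, so height (y) is
    simply dropped.
    """
    x, _height, z = point_world
    width, height_px = image_size
    u = width / 2.0 + x / metres_per_pixel
    v = height_px / 2.0 - z / metres_per_pixel
    return u, v


# A point 20 m ahead and 10 m to the right of the first camera.
print(project_orthographic_bev((10.0, -3.0, 20.0), 0.5, (256, 256)))  # (148.0, 88.0)
```

The contrast to note is that the perspective projection divides by depth, while the orthographic one applies a constant scale and ignores height.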

Original abstract

We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground level and by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame. By aligning ground and bird's-eye feature representations, our model improves scene coverage and novel-view synthesis, compared to ground imagery alone. We train on curated georeferenced datasets and paired satellite-terrain data, mined from open mapping services. We evaluate our method on a new benchmark for novel-view synthesis with georeferenced imagery allowing comparison to prior state-of-the-art methods. Our code and data preparation will be available at https://nianticspatial.github.io/cross-view-splatter/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper offers a feed-forward Gaussian splatting approach that fuses public satellite imagery with ground photos for better outdoor coverage, but the gains depend on alignment quality that still needs verification.

The core contribution here is a model that takes GPS-tagged ground images plus orthorectified satellite views and outputs pixel-aligned Gaussians in one coordinate frame. This directly targets the coverage problem in large outdoor scenes where ground capture alone is slow and incomplete. Using open mapping data for training and a new georeferenced benchmark is a sensible practical move, and planning to release the code and data prep helps others build on it quickly. The fusion of bird's-eye and ground features to improve novel-view synthesis over ground-only baselines is the part that feels new relative to prior single-view splatting work.

That said, the alignment step is the obvious place to look harder. Public satellite orthorectification and standard GPS tags routinely carry meter-scale translation and possible scale drift, and the abstract does not yet show error statistics or ablations on deliberately misaligned inputs. If residual misalignment is not explicitly handled or shown to be tolerated by the network, the reported improvements could partly reflect careful pair curation rather than a general property of cross-view fusion. The full paper will need to clarify how poses are unified and whether the model is robust to the kinds of errors that appear in real open data.

This is the kind of work that matters for people doing large-scale mapping, 3D reconstruction, or outdoor AR pipelines. Readers who already work with georeferenced imagery or Gaussian splatting will find the benchmark and the fusion architecture useful even if they end up tweaking the alignment details. It deserves a serious referee because the problem is real, the data sources are accessible, and the feed-forward framing is clean enough to evaluate properly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Cross-View Splatter, a feed-forward neural method for novel-view synthesis of outdoor scenes. It predicts pixel-aligned 3D Gaussian splats by fusing features from GPS-tagged ground-level photographs and orthorectified satellite imagery within a single georeferenced 3D coordinate frame. The central claim is that this cross-view fusion improves scene coverage and synthesis quality relative to ground imagery alone. The approach is trained on curated pairs mined from open mapping services and evaluated on a newly introduced benchmark that supports comparison against prior state-of-the-art methods.

Significance. If the alignment between views is shown to be reliable and the reported gains are not artifacts of data curation, the work offers a practical route to scalable outdoor reconstruction by exploiting freely available satellite priors. The feed-forward design and planned release of code and data-preparation scripts would support reproducibility and adoption in computer vision and robotics applications that require large-scale 3D models.

major comments (2)

§3.2 (Cross-View Feature Alignment): The method relies on aligning ground and bird's-eye feature representations into a unified 3D frame using GPS tags and orthorectified satellite data. However, no quantitative alignment-error statistics (e.g., mean translation or rotation residuals, or scale-drift measurements) are reported on the training or test pairs. Given that consumer-grade GPS and public orthorectification typically exhibit meter-scale inconsistencies, it is unclear whether the observed improvements in coverage and PSNR/SSIM (Tables 2 and 3) would persist under realistic residual misalignment.
§5.1 (Benchmark and Baselines): The new georeferenced benchmark is used to claim superiority over ground-only baselines. Without an ablation that perturbs the satellite poses by amounts consistent with reported GPS accuracy (e.g., ±2 m translation, ±1° rotation) and re-measures the synthesis metrics, it remains possible that the gains are specific to the curated, well-aligned pairs rather than a general property of the cross-view fusion.

minor comments (2)

The abstract states that 'code and data preparation will be available'; the final version should include a permanent repository link and confirm that the benchmark dataset (including the satellite-ground pairings) is released under an open license.
[§3.1] Notation for the unified coordinate frame (e.g., the transformation between ground and satellite cameras) should be defined explicitly in §3.1 before being used in the feature-alignment equations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The concerns about alignment reliability and robustness to realistic pose noise are valid and point to useful additions that will strengthen the claims. We address each major comment below and will revise the manuscript accordingly.

Referee: §3.2 (Cross-View Feature Alignment): The method relies on aligning ground and bird's-eye feature representations into a unified 3D frame using GPS tags and orthorectified satellite data. However, no quantitative alignment-error statistics (e.g., mean translation or rotation residuals, or scale-drift measurements) are reported on the training or test pairs. Given that consumer-grade GPS and public orthorectification typically exhibit meter-scale inconsistencies, it is unclear whether the observed improvements in coverage and PSNR/SSIM (Tables 2 and 3) would persist under realistic residual misalignment.

Authors: We agree that explicit quantitative alignment-error statistics are missing from the current version. In the revision we will add a new table (or subsection in §3.2) reporting mean translation and rotation residuals, as well as scale-drift statistics, computed directly from the GPS tags and orthorectified satellite metadata on both the training and test pairs. These numbers will be derived from the same curated data used for all experiments and will allow readers to judge whether the reported gains remain plausible under the meter-scale errors typical of consumer GPS and public orthorectification. revision: yes
Referee: §5.1 (Benchmark and Baselines): The new georeferenced benchmark is used to claim superiority over ground-only baselines. Without an ablation that perturbs the satellite poses by amounts consistent with reported GPS accuracy (e.g., ±2 m translation, ±1° rotation) and re-measures the synthesis metrics, it remains possible that the gains are specific to the curated, well-aligned pairs rather than a general property of the cross-view fusion.

Authors: We accept that an explicit robustness ablation is needed to rule out the possibility that gains are an artifact of unusually clean alignments. In the revised manuscript we will add an ablation in §5.1 (or a new supplementary section) that applies controlled perturbations of ±2 m translation and ±1° rotation to the satellite poses, re-runs the cross-view fusion, and reports the resulting changes in PSNR, SSIM, and coverage metrics on the benchmark. This will directly test whether the cross-view advantage persists under realistic residual misalignment. revision: yes
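As a rough illustration of the alignment-error statistics discussed in the first response, the sketch below computes mean translation and heading residuals between two sets of poses. It assumes a reference alignment is available (for example from a manually aligned reconstruction) and that each pose reduces to a ground-plane position in metres plus a heading in degrees. The function names and pose layout are hypothetical, and scale drift is not covered.

```python
import math


def wrap_deg(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def alignment_residuals(estimated, reference):
    """Mean translation (metres) and heading (degrees) residuals.

    Each pose is a tuple (x_m, y_m, heading_deg) in a shared ground-plane
    frame; the two lists must be in the same order.
    """
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("pose lists must be non-empty and of equal length")
    pairs = list(zip(estimated, reference))
    trans = [math.hypot(e[0] - r[0], e[1] - r[1]) for e, r in pairs]
    # Wrapping keeps 359 deg vs 1 deg at 2 deg apart rather than 358 deg.
    rot = [abs(wrap_deg(e[2] - r[2])) for e, r in pairs]
    n = len(pairs)
    return sum(trans) / n, sum(rot) / n


mean_t, mean_r = alignment_residuals(
    [(3.0, 4.0, 359.0), (0.0, 0.0, 10.0)],
    [(0.0, 0.0, 1.0), (0.0, 0.0, 10.0)],
)
print(mean_t, mean_r)  # 2.5 1.0
```

Reporting medians or percentiles alongside the means would make the statistics less sensitive to a few badly georeferenced pairs.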

Circularity Check

0 steps flagged

No circularity; derivation relies on external training data and empirical evaluation

The paper trains a feed-forward model on curated georeferenced datasets and paired satellite-terrain data mined from open mapping services, then evaluates on a new benchmark for novel-view synthesis. No equations or steps reduce by construction to fitted inputs, self-definitions, or self-citation chains; the alignment of ground and bird's-eye features occurs via learned prediction on independent external data rather than tautological renaming or forced statistical equivalence. The central claim of improved coverage therefore rests on standard supervised training against held-out test views, remaining self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that satellite and ground data can be aligned accurately in a shared frame.

axioms (1)

domain assumption: Satellite imagery supplies a reliable global geometric prior that can be fused with ground photos without large pose or scale errors.
Invoked implicitly when the abstract states that satellite views provide a global geometric prior for unified 3D reconstruction.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · relation: unclear

Paper passage: We introduce Cross-View Splatter, a feed-forward method that uses both ground-level imagery and orthographic satellite imagery... inject cross-attention layers... regress 3D Gaussian splat attributes.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: We adopt the per-batch ℓ2-scaling... height map regression... orthographic projection model

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

108 extracted references · 108 canonical work pages · 8 internal anchors

[1] Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Áron Monszpart, Victor Adrian Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In ECCV, 2022.
[2] Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cavallari, Áron Monszpart, Daniyar Turmukhambetov, and Victor Adrian Prisacariu. Scene coordinate reconstruction: Posing of image collections via incremental learning of a relocalizer. In ECCV, 2024.
[3] Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual KITTI 2. arXiv preprint arXiv:2001.10773, 2020.
[4] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In CVPR, 2022.
[5] David Charatan, Sizhe Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In CVPR,
[6] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. arXiv preprint arXiv:2103.15595, 2021.
[7] Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. In ECCV, 2024.
[8] David J. Crandall, Andrew Owens, Noah Snavely, and Daniel P. Huttenlocher. Sfm with mrfs: Discrete-continuous optimization for large-scale structure from motion. PAMI, 2013.
[9] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.
[10] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In ICLR, 2024.
[11] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022.
[12] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023.
[13] Boyang Deng, Richard Tucker, Zhengqi Li, Leonidas Guibas, Noah Snavely, and Gordon Wetzstein. Streetscapes: Large-scale consistent street view generation using autoregressive video diffusion. In SIGGRAPH, 2024.
[14] Oussema Dhaouadi, Riccardo Marin, Johannes Michael Meier, Jacques Kaiser, and Daniel Cremers. Ortholoc: UAV 6-dof localization and calibration using orthographic geodata. In NeurIPS Datasets and Benchmarks Track, 2025.
[15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR, 2021.
[16] Yilun Du, Cameron Smith, Ayush Tewari, and Vincent Sitzmann. Learning to render novel views from wide-baseline stereo pairs. CVPR, 2023.
[17] Bardienus Pieter Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3r-sfm: a fully-integrated solution for unconstrained structure-from-motion. In 3DV, 2025.
[18] Esri. Esri World Imagery. https://www.arcgis.com/home/item.html?id=10df2279f9684e4a9f6a7f08febac2a9. Accessed: 2025-10-05.
[19] Florian Fervers. Tiled Web Maps. https://github.com/fferflo/tiledwebmaps. Accessed: 2025-10-
[20] Florian Fervers, Sebastian Bullinger, Christoph Bodensteiner, Michael Arens, and Rainer Stiefelhagen. Uncertainty-aware vision-based metric cross-view geolocalization. In CVPR, 2023.
[21] Florian Fervers, Sebastian Bullinger, Christoph Bodensteiner, Michael Arens, and Rainer Stiefelhagen. Statewide visual geolocalization in the wild. In ECCV, 2024.
[22] Flai. Collection of open nation-scale lidar datasets. https://registry.opendata.aws/open-lidar-data. Accessed: 2025-10-19.
[23] Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.
[24] Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In ICCV, 2015.
[25] Zhiyuan Gao, Wenbin Teng, Gonglin Chen, Jinsen Wu, Ningli Xu, Rongjun Qin, Andrew Feng, and Yajie Zhao. Skyeyes: Ground roaming using aerial view images. arXiv preprint arXiv:2409.16685, 2024.
[26] GDAL Developers. Gdal: Geospatial data abstraction library. https://gdal.org, 2024.
[27] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. IJRR,
[28] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In CVPR, 2012.
[29] Google. Google Maps Platform Documentation. https://developers.google.com/maps/documentation. Accessed: 2025-10-04.
[30] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In CVPR,
[31] Antoine Guédon and Vincent Lepetit. Sugar: Surface-aligned gaussian splatting for efficient 3d mesh reconstruction and high-quality mesh rendering. CVPR, 2024.
[32] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press,
[33] Sunghwan Hong, Jaewoo Jung, Heeseong Shin, Jisang Han, Jiaolong Yang, Chong Luo, and Seungryong Kim. Pf3plat: Pose-free feed-forward 3d gaussian splatting. arXiv preprint arXiv:2410.22128, 2024.
[34] Xiaomou Hou, Wanshui Gan, and Naoto Yokoya. Enhancing monocular height estimation from aerial images with street-view images. arXiv preprint arXiv:2311.02121,
[35] Xuejun Huang, Xinyi Liu, Yi Wan, Zhi Zheng, Bin Zhang, Mingtao Xiong, Yingying Pei, and Yongjun Zhang. Skysplat: Generalizable 3d gaussian splatting from multi-temporal sparse satellite images. arXiv preprint arXiv:2508.09479, 2025.
[36] Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. arXiv preprint arXiv:2505.23716, 2025.
[37] Lihan Jiang, Kerui Ren, Mulin Yu, Linning Xu, Junting Dong, Tao Lu, Feng Zhao, Dahua Lin, and Bo Dai. Horizon-gs: Unified 3d gaussian splatting for large-scale aerial-to-ground scenes. In CVPR, 2025.
[38] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, 2020.
[39] Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstructi...
[40] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. TOG, 2023.
[41] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.
[42] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. TOG, 2017.
[43] Jie-Ying Lee, Yi-Ruei Liu, Shr-Ruei Tsai, Wei-Cheng Chang, Chung-Ho Wu, Jiewen Chan, Zhenjun Zhao, Chieh Hubert Lin, and Yu-Lun Liu. Skyfall-gs: Synthesizing immersive 3d urban scenes from satellite imagery. arXiv preprint arXiv:2510.15869, 2025.

work page arXiv 2025
[44]

Leotta, Cheng Long, Bastien Jacquet, Michael Zins, Daniel Lipsa, Jizhe Shan, Boyan Xu, Zhaoyu Li, Xun Zhang, Shih-Fu Chang, Misu Purri, Jia Xue, and Kristin Dana

Matthew J. Leotta, Cheng Long, Bastien Jacquet, Michael Zins, Daniel Lipsa, Jizhe Shan, Boyan Xu, Zhaoyu Li, Xun Zhang, Shih-Fu Chang, Misu Purri, Jia Xue, and Kristin Dana. Urban Semantic 3D Reconstruction from Multiview Satellite Imagery. InCVPRW, 2019. 2, 3

work page 2019
[45]

Zuoyue Li, Zhenqiang Li, Zhaopeng Cui, Marc Pollefeys, and Martin R. Oswald. Sat2Scene: 3D urban scene gener- ation from satellite images with diffusion. InCVPR, 2024. 3

work page 2024
[46]

Zuoyue Li, Zhenqiang Li, Zhaopeng Cui, Rongjun Qin, Marc Pollefeys, and Martin R. Oswald. Sat2Vid: Street- view panoramic video synthesis from a single satellite im- age. InICCV, 2021. 3

work page 2021
[47]

Megadepth: Learning single-view depth prediction from internet photos

Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, 2018. 2, 17

work page 2018
[48]

Sky optimization: Semantically aware image processing of skies in low-light photography

Orly Liba, Longqi Cai, Yun-Ta Tsai, Elad Eban, Yair Movshovitz-Attias, Yael Pritch, Huizhong Chen, and Jonathan T Barron. Sky optimization: Semantically aware image processing of skies in low-light photography. In CVPRW, 2020. 6

work page 2020
[49]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InCVPR, 2024. 2, 6, 7, 16

work page 2024
[50]

Infinite nature: Perpetual view generation of natural scenes from a single image

Andrew Liu, Richard Tucker, Varun Jampani, Ameesh Makadia, Noah Snavely, and Angjoo Kanazawa. Infinite nature: Perpetual view generation of natural scenes from a single image. InICCV, 2021. 3

work page 2021
[51]

SLAM3R: Real-time dense scene reconstruction from monocular RGB videos

Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yingda Yin, Yan- chao Yang, Qingnan Fan, and Baoquan Chen. SLAM3R: Real-time dense scene reconstruction from monocular RGB videos. InCVPR, 2025. 2

work page 2025
[52]

Worldmirror: Universal 3d world reconstruction with any-prior prompting,

Yifan Liu, Zhiyuan Min, Zhenwei Wang, Junta Wu, Tengfei Wang, Yixuan Yuan, Yawei Luo, and Chunchao Guo. Worldmirror: Universal 3d world reconstruction with any- prior prompting.arXiv preprint arXiv:2510.10726, 2025. 2

work page arXiv 2025
[53]

Mapillary Metropolis Dataset.https:// www.mapillary.com/dataset/metropolis

Mapillary. Mapillary Metropolis Dataset.https:// www.mapillary.com/dataset/metropolis. Ac- cessed: 2025-10-18. 5, 6, 16

work page 2025
[54]

Azure Maps.https : / / azure

Microsoft. Azure Maps.https : / / azure . microsoft . com / en - us / products / azure - maps. Accessed: 2025-10-04. 2, 4, 6

work page 2025
[55]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis. InECCV, 2020. 2

work page 2020
[56]

Maxime Oquab, Timoth ´ee Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernan- dez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Ass- ran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patric...

work page internal anchor Pith review Pith/arXiv arXiv
[57]

Global Structure-from-Motion Revisited

Linfei Pan, Daniel Barath, Marc Pollefeys, and Jo- hannes Lutz Sch ¨onberger. Global Structure-from-Motion Revisited. InECCV, 2024. 2

work page 2024
[58]

Py- torch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Nataly Gimelshein, Luca Antiga, Alban Kopf, Fed- erico Metta, Allan Chiley, Brian Stwalley, Sheng Huang, Jiawan Jiang, Yehezkel Chen, Peng Zeng, Xiaobing Li, James Yu, Teteya Li, Andrey Kuchaiev, Kartik Ren, Houdong Zhang, Yanghan Shi, Jani Sin...

work page 2019
[59]

UniK3D: Universal camera monocular 3d estimation

Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung- 10 Hsu Yang, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniK3D: Universal camera monocular 3d estimation. In CVPR, 2025. 6

work page 2025
[60]

See- ing through satellite images at street views.arXiv preprint arXiv:2505.17001, 2025

Ming Qian, Bin Tan, Qiuyu Wang, Xianwei Zheng, Han- jiang Xiong, Gui-Song Xia, Yujun Shen, and Nan Xue. See- ing through satellite images at street views.arXiv preprint arXiv:2505.17001, 2025. 3, 13, 17

work page arXiv 2025
[61]

Sat2density: Faithful density learning from satellite-ground image pairs

Ming Qian, Jincheng Xiong, Gui-Song Xia, and Nan Xue. Sat2density: Faithful density learning from satellite-ground image pairs. InICCV, 2023. 3, 7, 13, 17, 22

work page 2023
[62]

Rongjun Qin. RPC Stereo Processor (RSP)–A Software Package for Digital Surface Model and Orthophoto Gener- ation from Satellite Stereo Imagery.ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2016. 2, 3

work page 2016
[63]

Vi- sion transformers for dense prediction

Ren ´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InICCV, 2021. 4

work page 2021
[64]

Com- mon objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction

Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Com- mon objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. InICCV, 2021. 2

work page 2021
[65]

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. InCVPR, 2022. 3

work page 2022
[66]

Pixelwise view selection for unstructured multi-view stereo

Johannes L Sch ¨onberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. InECCV, 2016. 2

work page 2016
[67]

A multi-view stereo benchmark with high- resolution images and multi-camera videos

Thomas Sch ¨ops, Torsten Sattler, Christian H¨ane, and Marc Pollefeys. A multi-view stereo benchmark with high- resolution images and multi-camera videos. InCVPR,

work page
[68]

Sch ¨onberger and Jan-Michael Frahm

Johannes L. Sch ¨onberger and Jan-Michael Frahm. Structure-from-motion revisited. InCVPR, 2016. 2, 6

work page 2016
[69]

Geometry-guided street-view panorama synthesis from satellite imagery.PAMI, 2022

Yujiao Shi, Dylan Campbell, Xin Yu, and Hongdong Li. Geometry-guided street-view panorama synthesis from satellite imagery.PAMI, 2022. 3

work page 2022
[70]

Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

Brandon Smart, Chuanxia Zheng, Iro Laina, and Vic- tor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs.arXiv preprint arXiv:2408.13912, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[71]

Srinivasan, Richard Tucker, Jonathan T

Pratul P. Srinivasan, Richard Tucker, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng, and Noah Snavely. Pushing the boundaries of view extrapolation with multiplane im- ages.CVPR, 2019. 2

work page 2019
[72]

Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T

Stanislaw Szymanowicz, Jason Y . Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T. Barron, and Philipp Henzler. Bolt3D: Generating 3D Scenes in Seconds. InICCV, 2025. 17

work page 2025
[73]

Nerfstudio: A modular framework for neural radiance field development

Matthew Tancik, Ethan Weber, Evonne Ng, Ruilong Li, Brent Yi, Justin Kerr, Terrance Wang, Alexander Kristof- fersen, Jake Austin, Kamyar Salahi, Abhik Ahuja, David McAllister, and Angjoo Kanazawa. Nerfstudio: A modular framework for neural radiance field development. InSIG- GRAPH, 2023. 7

work page 2023
[74]

Schwing, and Zhicheng Yan

Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander G. Schwing, and Zhicheng Yan. MV-DUSt3R+: Single-stage scene reconstruction from sparse views in 2 seconds. InCVPR, 2025. 2

work page 2025
[75]

Coming Down to Earth: Satellite-to-Street View Synthesis for Geo-Localization

Aysim Toker, Qunjie Zhou, Maxim Maximov, and Laura Leal-Taix´e. Coming Down to Earth: Satellite-to-Street View Synthesis for Geo-Localization. InCVPR, 2021. 3

work page 2021
[76]

Single-view view syn- thesis with multiplane images

Richard Tucker and Noah Snavely. Single-view view syn- thesis with multiplane images. InCVPR, 2020. 2

work page 2020
[77]

Geological Survey

U.S. Geological Survey. USGS Lidar Explorer Map. https : / / apps . nationalmap . gov / lidar - explorer. Accessed: 2025-10-19. 2, 6

work page 2025
[78]

Gomez, Łukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InNeurIPS,

work page
[79]

Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis

Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, and Shubham Tulsiani. Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis. InCVPR, 2025. 2

work page 2025
[80]

Skyseg.https://huggingface

Jianyuan Wang. Skyseg.https://huggingface. co/JianyuanWang/skyseg. Accessed: 2025-08-10. 6

work page 2025

[30] [30]

Cascade cost volume for high- resolution multi-view stereo and stereo matching

Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high- resolution multi-view stereo and stereo matching. InCVPR,

work page

[31] [31]

Sugar: Surface- aligned gaussian splatting for efficient 3d mesh reconstruc- tion and high-quality mesh rendering.CVPR, 2024

Antoine Gu ´edon and Vincent Lepetit. Sugar: Surface- aligned gaussian splatting for efficient 3d mesh reconstruc- tion and high-quality mesh rendering.CVPR, 2024. 3

work page 2024

[32] [32]

Cambridge University Press,

Richard Hartley and Andrew Zisserman.Multiple View Ge- ometry in Computer Vision. Cambridge University Press,

work page

[33] [33]

Pf3plat: Pose-free feed-forward 3d gaussian splatting,

Sunghwan Hong, Jaewoo Jung, Heeseong Shin, Jisang 9 Han, Jiaolong Yang, Chong Luo, and Seungryong Kim. Pf3plat: Pose-free feed-forward 3d gaussian splatting. arXiv preprint arXiv:2410.22128, 2024. 2

work page arXiv 2024

[34] [34]

En- hancing monocular height estimation from aerial images with street-view images.arXiv preprint arXiv:2311.02121,

Xiaomou Hou, Wanshui Gan, and Naoto Yokoya. En- hancing monocular height estimation from aerial images with street-view images.arXiv preprint arXiv:2311.02121,

work page arXiv

[35] [35]

SkySplat: Generalizable 3D Gaussian Splatting from Multi-Temporal Sparse Satellite Images

Xuejun Huang, Xinyi Liu, Yi Wan, Zhi Zheng, Bin Zhang, Mingtao Xiong, Yingying Pei, and Yongjun Zhang. Skysplat: Generalizable 3d gaussian splatting from multi-temporal sparse satellite images.arXiv preprint arXiv:2508.09479, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [36]

AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views, May 2025

Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaus- sian splatting from unconstrained views.arXiv preprint arXiv:2505.23716, 2025. 1, 2, 3, 5, 7, 8, 13, 16

work page arXiv 2025

[37] [37]

Horizon-gs: Unified 3d gaussian splatting for large-scale aerial-to-ground scenes

Lihan Jiang, Kerui Ren, Mulin Yu, Linning Xu, Junt- ing Dong, Tao Lu, Feng Zhao, Dahua Lin, and Bo Dai. Horizon-gs: Unified 3d gaussian splatting for large-scale aerial-to-ground scenes. InCVPR, 2025. 3

work page 2025

[38] [38]

Analyzing and improving the image quality of StyleGAN

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. InCVPR, 2020. 3

work page 2020

[39] [39]

MapAnything: Universal Feed-Forward Metric 3D Reconstruction

Nikhil Keetha, Norman M ¨uller, Johannes Sch ¨onberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bul`o, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Univer- sal feed-forward metric 3D reconstructi...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[40] [40]

3d gaussian splatting for real-time radiance field rendering.TOG, 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.TOG, 2023. 2, 4

work page 2023

[41] [41]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.International Conference on Learning Representations (ICLR), 2015. 13

work page 2015

[42] [42]

Tanks and temples: Benchmarking large-scale scene reconstruction.TOG, 2017

Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction.TOG, 2017. 5, 6, 17

work page 2017

[43] [43]

Skyfall-gs: Synthe- sizing immersive 3d urban scenes from satellite imagery

Jie-Ying Lee, Yi-Ruei Liu, Shr-Ruei Tsai, Wei-Cheng Chang, Chung-Ho Wu, Jiewen Chan, Zhenjun Zhao, Chieh Hubert Lin, and Yu-Lun Liu. Skyfall-gs: Synthe- sizing immersive 3d urban scenes from satellite imagery. arXiv preprint arXiv:2510.15869, 2025. 2, 3

work page arXiv 2025

[44] [44]

Leotta, Cheng Long, Bastien Jacquet, Michael Zins, Daniel Lipsa, Jizhe Shan, Boyan Xu, Zhaoyu Li, Xun Zhang, Shih-Fu Chang, Misu Purri, Jia Xue, and Kristin Dana

Matthew J. Leotta, Cheng Long, Bastien Jacquet, Michael Zins, Daniel Lipsa, Jizhe Shan, Boyan Xu, Zhaoyu Li, Xun Zhang, Shih-Fu Chang, Misu Purri, Jia Xue, and Kristin Dana. Urban Semantic 3D Reconstruction from Multiview Satellite Imagery. InCVPRW, 2019. 2, 3

work page 2019

[45] [45]

Zuoyue Li, Zhenqiang Li, Zhaopeng Cui, Marc Pollefeys, and Martin R. Oswald. Sat2Scene: 3D urban scene gener- ation from satellite images with diffusion. InCVPR, 2024. 3

work page 2024

[46] [46]

Zuoyue Li, Zhenqiang Li, Zhaopeng Cui, Rongjun Qin, Marc Pollefeys, and Martin R. Oswald. Sat2Vid: Street- view panoramic video synthesis from a single satellite im- age. InICCV, 2021. 3

work page 2021

[47] [47]

Megadepth: Learning single-view depth prediction from internet photos

Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, 2018. 2, 17

work page 2018

[48] [48]

Sky optimization: Semantically aware image processing of skies in low-light photography

Orly Liba, Longqi Cai, Yun-Ta Tsai, Elad Eban, Yair Movshovitz-Attias, Yael Pritch, Huizhong Chen, and Jonathan T Barron. Sky optimization: Semantically aware image processing of skies in low-light photography. In CVPRW, 2020. 6

work page 2020

[49] [49]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InCVPR, 2024. 2, 6, 7, 16

work page 2024

[50] [50]

Infinite nature: Perpetual view generation of natural scenes from a single image

Andrew Liu, Richard Tucker, Varun Jampani, Ameesh Makadia, Noah Snavely, and Angjoo Kanazawa. Infinite nature: Perpetual view generation of natural scenes from a single image. InICCV, 2021. 3

work page 2021

[51] [51]

SLAM3R: Real-time dense scene reconstruction from monocular RGB videos

Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yingda Yin, Yan- chao Yang, Qingnan Fan, and Baoquan Chen. SLAM3R: Real-time dense scene reconstruction from monocular RGB videos. InCVPR, 2025. 2

work page 2025

[52] [52]

Worldmirror: Universal 3d world reconstruction with any-prior prompting,

Yifan Liu, Zhiyuan Min, Zhenwei Wang, Junta Wu, Tengfei Wang, Yixuan Yuan, Yawei Luo, and Chunchao Guo. Worldmirror: Universal 3d world reconstruction with any- prior prompting.arXiv preprint arXiv:2510.10726, 2025. 2

work page arXiv 2025

[53] [53]

Mapillary Metropolis Dataset.https:// www.mapillary.com/dataset/metropolis

Mapillary. Mapillary Metropolis Dataset.https:// www.mapillary.com/dataset/metropolis. Ac- cessed: 2025-10-18. 5, 6, 16

work page 2025

[54] [54]

Azure Maps.https : / / azure

Microsoft. Azure Maps.https : / / azure . microsoft . com / en - us / products / azure - maps. Accessed: 2025-10-04. 2, 4, 6

work page 2025

[55] [55]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis. InECCV, 2020. 2

work page 2020

[56] [56]

Maxime Oquab, Timoth ´ee Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernan- dez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Ass- ran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patric...

work page internal anchor Pith review Pith/arXiv arXiv

[57] [57]

Global Structure-from-Motion Revisited

Linfei Pan, Daniel Barath, Marc Pollefeys, and Jo- hannes Lutz Sch ¨onberger. Global Structure-from-Motion Revisited. InECCV, 2024. 2

work page 2024

[58] [58]

Py- torch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Nataly Gimelshein, Luca Antiga, Alban Kopf, Fed- erico Metta, Allan Chiley, Brian Stwalley, Sheng Huang, Jiawan Jiang, Yehezkel Chen, Peng Zeng, Xiaobing Li, James Yu, Teteya Li, Andrey Kuchaiev, Kartik Ren, Houdong Zhang, Yanghan Shi, Jani Sin...

work page 2019

[59] [59]

UniK3D: Universal camera monocular 3d estimation

Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung- 10 Hsu Yang, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniK3D: Universal camera monocular 3d estimation. In CVPR, 2025. 6

work page 2025

[60] [60]

See- ing through satellite images at street views.arXiv preprint arXiv:2505.17001, 2025

Ming Qian, Bin Tan, Qiuyu Wang, Xianwei Zheng, Han- jiang Xiong, Gui-Song Xia, Yujun Shen, and Nan Xue. See- ing through satellite images at street views.arXiv preprint arXiv:2505.17001, 2025. 3, 13, 17

work page arXiv 2025

[61] [61]

Sat2density: Faithful density learning from satellite-ground image pairs

Ming Qian, Jincheng Xiong, Gui-Song Xia, and Nan Xue. Sat2density: Faithful density learning from satellite-ground image pairs. InICCV, 2023. 3, 7, 13, 17, 22

work page 2023

[62] [62]

Rongjun Qin. RPC Stereo Processor (RSP)–A Software Package for Digital Surface Model and Orthophoto Gener- ation from Satellite Stereo Imagery.ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2016. 2, 3

work page 2016

[63] [63]

Vi- sion transformers for dense prediction

Ren ´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InICCV, 2021. 4

work page 2021

[64] [64]

Com- mon objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction

Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Com- mon objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. InICCV, 2021. 2

work page 2021

[65] [65]

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. InCVPR, 2022. 3

work page 2022

[66] [66]

Pixelwise view selection for unstructured multi-view stereo

Johannes L Sch ¨onberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. InECCV, 2016. 2

work page 2016

[67] [67]

A multi-view stereo benchmark with high- resolution images and multi-camera videos

Thomas Sch ¨ops, Torsten Sattler, Christian H¨ane, and Marc Pollefeys. A multi-view stereo benchmark with high- resolution images and multi-camera videos. InCVPR,

work page

[68] [68]

Sch ¨onberger and Jan-Michael Frahm

Johannes L. Sch ¨onberger and Jan-Michael Frahm. Structure-from-motion revisited. InCVPR, 2016. 2, 6

work page 2016

[69] [69]

Geometry-guided street-view panorama synthesis from satellite imagery.PAMI, 2022

Yujiao Shi, Dylan Campbell, Xin Yu, and Hongdong Li. Geometry-guided street-view panorama synthesis from satellite imagery.PAMI, 2022. 3

work page 2022

[70] [70]

Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

Brandon Smart, Chuanxia Zheng, Iro Laina, and Vic- tor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs.arXiv preprint arXiv:2408.13912, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[71] [71]

Srinivasan, Richard Tucker, Jonathan T

Pratul P. Srinivasan, Richard Tucker, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng, and Noah Snavely. Pushing the boundaries of view extrapolation with multiplane im- ages.CVPR, 2019. 2

work page 2019

[72] [72]

Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T

Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T. Barron, and Philipp Henzler. Bolt3D: Generating 3D Scenes in Seconds. In ICCV, 2025. 17

[73]

Nerfstudio: A modular framework for neural radiance field development

Matthew Tancik, Ethan Weber, Evonne Ng, Ruilong Li, Brent Yi, Justin Kerr, Terrance Wang, Alexander Kristoffersen, Jake Austin, Kamyar Salahi, Abhik Ahuja, David McAllister, and Angjoo Kanazawa. Nerfstudio: A modular framework for neural radiance field development. In SIGGRAPH, 2023. 7

[74]

MV-DUSt3R+: Single-stage scene reconstruction from sparse views in 2 seconds

Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander G. Schwing, and Zhicheng Yan. MV-DUSt3R+: Single-stage scene reconstruction from sparse views in 2 seconds. In CVPR, 2025. 2

[75]

Coming Down to Earth: Satellite-to-Street View Synthesis for Geo-Localization

Aysim Toker, Qunjie Zhou, Maxim Maximov, and Laura Leal-Taixé. Coming Down to Earth: Satellite-to-Street View Synthesis for Geo-Localization. In CVPR, 2021. 3

[76]

Single-view view synthesis with multiplane images

Richard Tucker and Noah Snavely. Single-view view synthesis with multiplane images. In CVPR, 2020. 2

[77]

USGS Lidar Explorer Map

U.S. Geological Survey. USGS Lidar Explorer Map. https://apps.nationalmap.gov/lidar-explorer. Accessed: 2025-10-19. 2, 6

[78]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS,

[79]

Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis

Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, and Shubham Tulsiani. Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis. In CVPR, 2025. 2

[80]

Skyseg

Jianyuan Wang. Skyseg. https://huggingface.co/JianyuanWang/skyseg. Accessed: 2025-08-10. 6
