
arxiv: 2605.19656 · v1 · pith:HSVQKY2Tnew · submitted 2026-05-19 · 💻 cs.CV

Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images

Pith reviewed 2026-05-20 06:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords view synthesis · Gaussian splatting · satellite imagery · georeferenced images · novel view synthesis · outdoor scenes · feed-forward reconstruction

The pith

Fusing satellite and ground images in one 3D frame improves novel-view synthesis for outdoor scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Cross-View Splatter, a feed-forward model that predicts pixel-aligned Gaussian splats by combining GPS-tagged ground photos with orthorectified satellite imagery. It aligns feature representations from ground-level and bird's-eye views inside a shared coordinate frame to fill coverage gaps that ground images alone leave behind. Training draws on curated georeferenced datasets mined from public mapping services, and the method is tested on a new benchmark that compares against earlier view-synthesis approaches. If the alignment step holds, the result is denser scene reconstructions and more accurate novel views without requiring exhaustive ground capture.

Core claim

Cross-View Splatter predicts pixel-aligned Gaussian splats for outdoor scenes by fusing orthorectified satellite views with GPS-tagged ground photos inside a single 3D coordinate frame; aligning the ground and bird's-eye feature representations produces better scene coverage and novel-view synthesis than ground imagery alone.

What carries the argument

Alignment of ground and bird's-eye feature representations inside a unified 3D coordinate frame that fuses satellite and ground imagery for Gaussian splat prediction.
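
As a rough illustration of what a shared frame means here, the sketch below maps a 3D point expressed in the first ground camera's frame to a pixel in a bird's-eye-view (BEV) image centred on that camera, following the convention in the Figure 3 caption that the camera look-at direction points up in the satellite view. The OpenCV-style axes (x right, y down, z forward), the level-camera simplification, the `metres_per_pixel` parameter, and the function name are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def camera_point_to_bev_pixel(point_cam, bev_size, metres_per_pixel):
    """Orthographically project a camera-0-frame point into a BEV image.

    Assumes (hypothetically) OpenCV-style axes: x right, y down, z forward,
    with a level camera located at the centre of the BEV image.
    """
    x, _, z = point_cam  # the height (y) is dropped by the orthographic projection
    height, width = bev_size
    col = width / 2.0 + x / metres_per_pixel   # right of the camera -> right in BEV
    row = height / 2.0 - z / metres_per_pixel  # ahead of the camera -> up in BEV
    return row, col

# A point 10 m ahead and 5 m to the right, in a 256x256 BEV at 0.5 m per pixel.
row, col = camera_point_to_bev_pixel(np.array([5.0, 0.0, 10.0]), (256, 256), 0.5)
print(row, col)  # 108.0 138.0
```

Any error in the camera position or heading shifts or rotates every projected point by the same amount, which is why the alignment premise carries so much weight.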

If this is right

  • Ground capture campaigns for large outdoor sites can be reduced while still obtaining usable 3D reconstructions.
  • Novel views become feasible in regions visible only from the satellite vantage.
  • Publicly available satellite data can serve as a geometric prior for any GPS-tagged ground collection.
  • The same feed-forward pipeline supports evaluation on a new georeferenced benchmark that includes both image types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to other multi-source capture settings, such as drone footage paired with ground images, if similar feature alignment is applied.
  • Real-time mapping applications might benefit if the feed-forward prediction is further optimized for speed on mobile hardware.
  • Errors in public satellite orthorectification would directly limit reconstruction fidelity in practice.

Load-bearing premise

Orthorectified satellite imagery and GPS-tagged ground photos can be aligned into a shared 3D frame without large systematic errors in pose or scale.
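
To make the scale of the problem concrete, the sketch below converts a GPS latitude/longitude into east/north offsets in metres from a chosen origin, using a spherical, small-area approximation. This is a generic textbook conversion and not the paper's georeferencing pipeline; the function name and the mean-radius constant are assumptions for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation (assumption)

def gps_to_local_metres(lat_deg, lon_deg, origin_lat_deg, origin_lon_deg):
    """Return (east, north) offsets in metres from an origin.

    Valid only for small areas, where the Earth's curvature can be ignored.
    """
    d_lat = math.radians(lat_deg - origin_lat_deg)
    d_lon = math.radians(lon_deg - origin_lon_deg)
    north = EARTH_RADIUS_M * d_lat
    # Meridians converge towards the poles, so longitude offsets shrink by cos(latitude).
    east = EARTH_RADIUS_M * d_lon * math.cos(math.radians(origin_lat_deg))
    return east, north

# An offset of 0.0001 degrees in both latitude and longitude at 60 degrees north.
east, north = gps_to_local_metres(60.0001, 25.0001, 60.0, 25.0)
print(round(east, 2), round(north, 2))
```

Under this approximation, a latitude error of 0.0001 degrees is already about 11 m on the ground, so small errors in the tags translate into large offsets in the shared frame.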

What would settle it

Measure novel-view quality on a test scene after deliberately shifting satellite poses or scales by known amounts; if quality gains over ground-only disappear or reverse, the central claim does not hold.
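
The bookkeeping for such a test could look like the sketch below: compute PSNR for the fused model at each perturbation size and check whether it still beats the ground-only score. The helper names are hypothetical and the numbers are placeholders chosen for illustration; they are not results from the paper.

```python
import numpy as np

def psnr(rendered, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_value]."""
    mse = float(np.mean((np.asarray(rendered) - np.asarray(target)) ** 2))
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

def gain_over_ground_only(psnr_fused_by_shift, psnr_ground_only):
    """For each perturbation size, return the PSNR gain of the fused model in dB."""
    return {shift: value - psnr_ground_only
            for shift, value in psnr_fused_by_shift.items()}

# Placeholder numbers for illustration only; they are NOT results from the paper.
fused = {0.0: 14.0, 2.0: 13.0, 8.0: 11.5}  # satellite shift in metres -> PSNR in dB
gains = gain_over_ground_only(fused, psnr_ground_only=12.0)
claim_holds = {shift: gain > 0.0 for shift, gain in gains.items()}
print(claim_holds)  # {0.0: True, 2.0: True, 8.0: False}
```

In this made-up example the advantage disappears at an 8 m shift, which is the kind of outcome that would count against the central claim.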

Figures

Figures reproduced from arXiv: 2605.19656 by Akshay Krishnan, Arno Solin, Daniyar Turmukhambetov, Filippo Aleotti, Gabriel Brostow, Guillermo Garcia-Hernando, Juho Kannala, Matias Turkulainen, Mohamed Sayed.

Figure 1: Cross-View Splatter is a feed-forward model that predicts Gaussian splats for GPS-tagged ground level images and corresponding orthorectified satellite imagery from mapping services. It predicts Gaussian splats for both ground level and bird’s-eye views in a unified coordinate system and supports multiple input images with unknown 6DoF poses. Only the GPS location of the ground level images is required. O…
Figure 2: Method overview. Given geolocalized ground images and a single orthorectified satellite perspective, our model synthesizes 3D Gaussian splats in a shared coordinate frame. Ground views exchange information with satellite views within bidirectional cross-attention layers. Gaussians are predicted separately from ground and satellite branches, which are then combined into a unified coordinate frame. Although …
Figure 3: Coordinate conventions. We consider camera $I^{\text{ground}}_0$ to define the origin of the world coordinates, i.e. the identity pose, as well as the spatial location of the BEV satellite image $I^{\text{sat}}$. The BEV $I^{\text{sat}}$ frame is aligned with the heading of $I^{\text{ground}}_0$ such that the camera look-at direction $z_c$ is pointing up in the satellite view. Gaussian splats $G$ are projected via perspective projection to ground views,…
Figure 4: Example reconstruction outputs on scenes not seen during training. Left to right: input ground images, input satellite image, predicted height map, predicted height confidence (black: low, red: high), predicted ground Gaussians, predicted combined Gaussians. Additionally, a camera-head is used to regress the 6DoF relative pose $T_i$ and perspective camera intrinsics $K_i$ for each input ground image using the ca…
Figure 5: Illustration of training samples from our BEV aug
Figure 6: Satellite-to-ground qualitative results. Column 1: target ground image and satellite image input. Columns 2-4: predictions. (Top) our $G^{\text{sat}}$ rendered to ground-view RGB, ground-view depth, and BEV depth. (Bottom) Sat2Density [61] with RGB and volume-rendered ground/BEV depths from predicted density. Our method produces sharper, more accurate depth maps. The scene is challenging for Sat2Density [61], as …
Figure 7: Qualitative results. We show results for a target image from the Outdoor Tanks and Temples dataset under the sparse-view synthesis setting. Our method extrapolates into unobserved regions by leveraging visual cues from the corresponding satellite image. [Plot: PSNR↑ vs. IoU (Context vs. Target); series: AnySplat, Combined (Ours).]
Figure 8: Stratified evaluation. Bucketed PSNR performance (5% bins) vs. image overlap on our geolocalized Tanks & Temples dataset. Implementation details. We use PyTorch [58] with gsplat (v1.5) [96] for Gaussian rasterization. We initialize our model from AnySplat [36]. We train for 4 days on 2× A100 GPUs with a batch size of 10, using FlashAttention-v2 [10, 11] and mixed-precision. We resize satellite images and …
Figure 10: Benchmark geoalignment tool. We manually align COLMAP reconstructions to satellite imagery for 10 scenes from Tanks and Temples and 40 scenes from DL3DV-Benchmark datasets. Top: satellite image. Bottom-left: aligned COLMAP pointcloud. Bottom-right: visualization of points projected to a scene image. camera intrinsics and camera poses provided by the dataset and ran ‘colmap point triangulator’ command to g…
Figure 11: Our Tanks and Temples benchmark. Visualization of satellite and ground level images.
Figure 12: Our DL3DV benchmark. Visualization of satellite and ground level images.
Figure 13: Training data visualization. We showcase our training data that consists of satellite images and terrain height maps aligned with ground level images. poses for novel-view synthesis. NoPoSplat first reconstructs a splat and then refines each novel camera pose for 200 iterations to align it with the reconstruction. Long-LRM uses Plücker rays for the target-views. AnySplat performs two forward passes: one …
Figure 14: Qualitative comparison to SEVA. SEVA is a generative model capable of hallucinating unseen areas, whereas our Cross-View Splatter is a feed-forward approach that predicts geometry only for regions visible in the ground images and the satellite image. E.1. Comparison to Diffusion Based Method We compare our method to Stable Virtual Camera (SEVA) [104], which is a state-of-the-art diffusion model for view-syn…
Figure 15: Limitations of satellite imagery. Notice how a building has been rebuilt and expanded in the right frame compared to the left, taken a few years ago. This is the Family scene in Tanks and Temples. G. Limitations Our method struggles in scenarios where the ground-level camera observes directions that fall outside the satellite orthographic view. For example, when looking upward toward the sky or downward at …
Figure 16: Qualitative baseline comparisons. Additional qualitative comparisons of various baselines on our georeferenced Tanks and Temples dataset.
Figure 17: Cross-View Splatter Ground-Only vs Full-Model. We visualize the benefit of Cross-View Splatter’s satellite branch on qualitative rendering on the Tanks and Temples and DL3DV benchmarks. Our Full-Model achieves better coverage and completeness compared to ground only imagery in sparse-view settings.
Figure 18: Cross-View Splatter satellite qualitative. We show visuals of our full-model satellite head predictions on our benchmark scenes. The first two rows are the inputs to the model, i.e. a BEV perspective and a ground level image. We predict height maps, confidence values, ground level splats, and satellite splats that can then be rendered to ground level views.
Figure 19: Sat2Density [61] comparison. We compare our predictions (Columns 1-4) with Sat2Density height estimates and Sat2Density ground renders against the Ground Truth inputs (Columns 5-6). Both models get the same satellite image and ground image as inputs.
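
The stratified evaluation described in the Figure 8 caption (PSNR bucketed in 5% bins of image overlap) can be sketched as follows. This is only a plausible reading of the caption: the function name is hypothetical, the paper's exact binning rules are not given here, and the sample values are placeholders rather than reported numbers.

```python
import numpy as np

def bucketed_mean_psnr(overlap_iou, psnr_values, bin_width=0.05):
    """Average PSNR per overlap bin; keys are the lower edges of the bins."""
    overlap_iou = np.asarray(overlap_iou, dtype=float)
    psnr_values = np.asarray(psnr_values, dtype=float)
    # Integer bin index per sample; the epsilon guards against round-off at bin edges.
    bin_index = np.floor(overlap_iou / bin_width + 1e-9).astype(int)
    result = {}
    for index in np.unique(bin_index):
        lower_edge = round(float(index) * bin_width, 10)
        result[lower_edge] = float(psnr_values[bin_index == index].mean())
    return result

# Placeholder samples for illustration only; they are NOT values from the paper.
buckets = bucketed_mean_psnr([0.01, 0.04, 0.12], [8.0, 10.0, 12.0])
print(buckets)  # {0.0: 9.0, 0.1: 12.0}
```

Reporting per-bin means rather than a single average makes it visible whether a method's advantage is concentrated in low-overlap views.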
Original abstract

We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground level and by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame. By aligning ground and bird's-eye feature representations, our model improves scene coverage and novel-view synthesis, compared to ground imagery alone. We train on curated georeferenced datasets and paired satellite-terrain data, mined from open mapping services. We evaluate our method on a new benchmark for novel-view synthesis with georeferenced imagery, allowing comparison to prior state-of-the-art methods. Our code and data preparation will be available at https://nianticspatial.github.io/cross-view-splatter/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Cross-View Splatter, a feed-forward neural method for novel-view synthesis of outdoor scenes. It predicts pixel-aligned 3D Gaussian splats by fusing features from GPS-tagged ground-level photographs and orthorectified satellite imagery within a single georeferenced 3D coordinate frame. The central claim is that this cross-view fusion improves scene coverage and synthesis quality relative to ground imagery alone. The approach is trained on curated pairs mined from open mapping services and evaluated on a newly introduced benchmark that supports comparison against prior state-of-the-art methods.

Significance. If the alignment between views is shown to be reliable and the reported gains are not artifacts of data curation, the work offers a practical route to scalable outdoor reconstruction by exploiting freely available satellite priors. The feed-forward design and planned release of code and data-preparation scripts would support reproducibility and adoption in computer vision and robotics applications that require large-scale 3D models.

major comments (2)
  1. [§3.2] Cross-View Feature Alignment: The method relies on aligning ground and bird's-eye feature representations into a unified 3D frame using GPS tags and orthorectified satellite data. However, no quantitative alignment-error statistics (e.g., mean translation or rotation residuals, or scale-drift measurements) are reported on the training or test pairs. Given that consumer-grade GPS and public orthorectification typically exhibit meter-scale inconsistencies, it is unclear whether the observed improvements in coverage and PSNR/SSIM (Tables 2 and 3) would persist under realistic residual misalignment.
  2. [§5.1] Benchmark and Baselines: The new georeferenced benchmark is used to claim superiority over ground-only baselines. Without an ablation that perturbs the satellite poses by amounts consistent with reported GPS accuracy (e.g., ±2 m translation, ±1° rotation) and re-measures the synthesis metrics, it remains possible that the gains are specific to the curated, well-aligned pairs rather than a general property of the cross-view fusion.
minor comments (2)
  1. The abstract states that 'code and data preparation will be available'; the final version should include a permanent repository link and confirm that the benchmark dataset (including the satellite-ground pairings) is released under an open license.
  2. [§3.1] Notation for the unified coordinate frame (e.g., the transformation between ground and satellite cameras) should be defined explicitly in §3.1 before being used in the feature-alignment equations.
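
The alignment-error statistics requested in the first major comment could be computed along the lines of the sketch below, which measures the translation and rotation residual between an estimated pose and a reference pose. The 4x4 camera-to-world matrix convention and the function name are assumptions for illustration; the paper's own pose representation may differ.

```python
import numpy as np

def pose_residual(pose_estimated, pose_reference):
    """Translation (metres) and rotation (degrees) residual between two 4x4 poses.

    Both inputs are assumed to be camera-to-world matrices in the same metric frame.
    """
    delta = np.linalg.inv(pose_reference) @ pose_estimated
    translation_error = float(np.linalg.norm(delta[:3, 3]))
    # The rotation angle follows from the trace of the relative rotation matrix.
    cos_angle = (np.trace(delta[:3, :3]) - 1.0) / 2.0
    rotation_error = float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))
    return translation_error, rotation_error

# Example: the estimate is rotated 90 degrees about z and offset by (3, 4, 0) metres.
reference = np.eye(4)
estimated = np.array([[0.0, -1.0, 0.0, 3.0],
                      [1.0,  0.0, 0.0, 4.0],
                      [0.0,  0.0, 1.0, 0.0],
                      [0.0,  0.0, 0.0, 1.0]])
t_err, r_err = pose_residual(estimated, reference)
print(t_err, r_err)
```

Averaging these two residuals over all training or test pairs would give the mean translation and rotation statistics the referee asks for.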

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The concerns about alignment reliability and robustness to realistic pose noise are valid and point to useful additions that will strengthen the claims. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3.2] Cross-View Feature Alignment: The method relies on aligning ground and bird's-eye feature representations into a unified 3D frame using GPS tags and orthorectified satellite data. However, no quantitative alignment-error statistics (e.g., mean translation or rotation residuals, or scale-drift measurements) are reported on the training or test pairs. Given that consumer-grade GPS and public orthorectification typically exhibit meter-scale inconsistencies, it is unclear whether the observed improvements in coverage and PSNR/SSIM (Tables 2 and 3) would persist under realistic residual misalignment.

    Authors: We agree that explicit quantitative alignment-error statistics are missing from the current version. In the revision we will add a new table (or subsection in §3.2) reporting mean translation and rotation residuals, as well as scale-drift statistics, computed directly from the GPS tags and orthorectified satellite metadata on both the training and test pairs. These numbers will be derived from the same curated data used for all experiments and will allow readers to judge whether the reported gains remain plausible under the meter-scale errors typical of consumer GPS and public orthorectification.

    Revision: yes

  2. Referee: [§5.1] Benchmark and Baselines: The new georeferenced benchmark is used to claim superiority over ground-only baselines. Without an ablation that perturbs the satellite poses by amounts consistent with reported GPS accuracy (e.g., ±2 m translation, ±1° rotation) and re-measures the synthesis metrics, it remains possible that the gains are specific to the curated, well-aligned pairs rather than a general property of the cross-view fusion.

    Authors: We accept that an explicit robustness ablation is needed to rule out the possibility that gains are an artifact of unusually clean alignments. In the revised manuscript we will add an ablation in §5.1 (or a new supplementary section) that applies controlled perturbations of ±2 m translation and ±1° rotation to the satellite poses, re-runs the cross-view fusion, and reports the resulting changes in PSNR, SSIM, and coverage metrics on the benchmark. This will directly test whether the cross-view advantage persists under realistic residual misalignment.

    Revision: yes
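
A controlled perturbation of the kind discussed in this exchange might be drawn as in the sketch below. The planar (east, north, heading) pose parameterisation, the uniform sampling, and the function name are assumptions for illustration; only the ±2 m and ±1° bounds come from the referee's comment.

```python
import numpy as np

def perturb_planar_pose(east, north, heading_deg, rng,
                        max_shift_m=2.0, max_rotation_deg=1.0):
    """Return a planar pose with a bounded random shift and rotation applied.

    Uniform sampling within the bounds is an assumption, not the paper's protocol.
    """
    shift = rng.uniform(-max_shift_m, max_shift_m, size=2)
    rotation = rng.uniform(-max_rotation_deg, max_rotation_deg)
    return east + shift[0], north + shift[1], heading_deg + rotation

rng = np.random.default_rng(0)  # fixed seed so the ablation is repeatable
perturbed = perturb_planar_pose(0.0, 0.0, 90.0, rng)
print(perturbed)
```

Re-running the evaluation over many such draws, and over several bound sizes, would show how quickly the synthesis metrics degrade as misalignment grows.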

Circularity Check

0 steps flagged

No circularity; derivation relies on external training data and empirical evaluation

full rationale

The paper trains a feed-forward model on curated georeferenced datasets and paired satellite-terrain data mined from open mapping services, then evaluates on a new benchmark for novel-view synthesis. No equations or steps reduce by construction to fitted inputs, self-definitions, or self-citation chains; the alignment of ground and bird's-eye features occurs via learned prediction on independent external data rather than tautological renaming or forced statistical equivalence. The central claim of improved coverage therefore rests on standard supervised training against held-out test views, remaining self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that satellite and ground data can be aligned accurately in a shared frame.

axioms (1)
  • domain assumption: Satellite imagery supplies a reliable global geometric prior that can be fused with ground photos without large pose or scale errors.
    Invoked implicitly when the abstract states that satellite views provide a global geometric prior for unified 3D reconstruction.




Reference graph

Works this paper leans on

108 extracted references · 108 canonical work pages · 8 internal anchors

  1. [1]

    Map-free visual relocalization: Metric pose relative to a single image

    Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Áron Monszpart, Victor Adrian Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In ECCV, 2022.

  2. [2]

    Scene coordinate reconstruction: Posing of image collections via incremental learning of a relocalizer

    Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cavallari, Áron Monszpart, Daniyar Turmukhambetov, and Victor Adrian Prisacariu. Scene coordinate reconstruction: Posing of image collections via incremental learning of a relocalizer. In ECCV, 2024.

  3. [3]

    Virtual KITTI 2

    Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual KITTI 2. arXiv preprint arXiv:2001.10773, 2020.

  4. [4]

    Efficient geometry-aware 3d generative adversarial networks

    Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In CVPR, 2022.

  5. [5]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

    David Charatan, Sizhe Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In CVPR,

  6. [6]

    Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo

    Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. arXiv preprint arXiv:2103.15595, 2021.

  7. [7]

    Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

    Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. In ECCV, 2024.

  8. [8]

    Sfm with mrfs: Discrete-continuous optimization for large-scale structure from motion

    David J. Crandall, Andrew Owens, Noah Snavely, and Daniel P. Huttenlocher. Sfm with mrfs: Discrete-continuous optimization for large-scale structure from motion. PAMI, 2013.

  9. [9]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.

  10. [10]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In ICLR, 2024.

  11. [11]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022.

  12. [12]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023.

  13. [13]

    Streetscapes: Large-scale consistent street view generation using autoregressive video diffusion

    Boyang Deng, Richard Tucker, Zhengqi Li, Leonidas Guibas, Noah Snavely, and Gordon Wetzstein. Streetscapes: Large-scale consistent street view generation using autoregressive video diffusion. In SIGGRAPH, 2024.

  14. [14]

    Ortholoc: UAV 6-dof localization and calibration using orthographic geodata

    Oussema Dhaouadi, Riccardo Marin, Johannes Michael Meier, Jacques Kaiser, and Daniel Cremers. Ortholoc: UAV 6-dof localization and calibration using orthographic geodata. In NeurIPS Datasets and Benchmarks Track, 2025.

  15. [15]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR, 2021.

  16. [16]

    Learning to render novel views from wide-baseline stereo pairs

    Yilun Du, Cameron Smith, Ayush Tewari, and Vincent Sitzmann. Learning to render novel views from wide-baseline stereo pairs. CVPR, 2023.

  17. [17]

    MASt3r-sfm: a fully-integrated solution for unconstrained structure-from-motion

    Bardienus Pieter Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3r-sfm: a fully-integrated solution for unconstrained structure-from-motion. In 3DV, 2025.

  18. [18]

    Esri World Imagery

    Esri. Esri World Imagery. https://www.arcgis.com/home/item.html?id=10df2279f9684e4a9f6a7f08febac2a9. Accessed: 2025-10-05.

  19. [19]

    Tiled Web Maps

    Florian Fervers. Tiled Web Maps. https://github.com/fferflo/tiledwebmaps. Accessed: 2025-10-

  20. [20]

    Uncertainty-aware vision-based metric cross-view geolocalization

    Florian Fervers, Sebastian Bullinger, Christoph Bodensteiner, Michael Arens, and Rainer Stiefelhagen. Uncertainty-aware vision-based metric cross-view geolocalization. In CVPR, 2023

  21. [21]

    Statewide visual geolocalization in the wild

    Florian Fervers, Sebastian Bullinger, Christoph Bodensteiner, Michael Arens, and Rainer Stiefelhagen. Statewide visual geolocalization in the wild. In ECCV, 2024.

  22. [22]

    Collection of open nation-scale lidar datasets

    Flai. Collection of open nation-scale lidar datasets. https://registry.opendata.aws/open-lidar-data. Accessed: 2025-10-19.

  23. [23]

    Virtual worlds as proxy for multi-object tracking analysis

    Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.

  24. [24]

    Massively parallel multiview stereopsis by surface normal diffusion

    Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In ICCV, 2015.

  25. [25]

    Skyeyes: Ground roaming using aerial view images

    Zhiyuan Gao, Wenbin Teng, Gonglin Chen, Jinsen Wu, Ningli Xu, Rongjun Qin, Andrew Feng, and Yajie Zhao. Skyeyes: Ground roaming using aerial view images. arXiv preprint arXiv:2409.16685, 2024.

  26. [26]

    Gdal: Geospatial data abstraction library

    GDAL Developers. Gdal: Geospatial data abstraction library. https://gdal.org, 2024.

  27. [27]

    Vision meets robotics: The kitti dataset

    Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. IJRR,

  28. [28]

    Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In CVPR, 2012.

  29. [29]

    Google Maps Platform Documentation

    Google. Google Maps Platform Documentation. https://developers.google.com/maps/documentation. Accessed: 2025-10-04.

  30. [30]

    Cascade cost volume for high-resolution multi-view stereo and stereo matching

    Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In CVPR,

  31. [31]

    Sugar: Surface-aligned gaussian splatting for efficient 3d mesh reconstruction and high-quality mesh rendering

    Antoine Guédon and Vincent Lepetit. Sugar: Surface-aligned gaussian splatting for efficient 3d mesh reconstruction and high-quality mesh rendering. CVPR, 2024.

  32. [32]

    Multiple View Geometry in Computer Vision

    Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press,

  33. [33]

    Pf3plat: Pose-free feed-forward 3d gaussian splatting

    Sunghwan Hong, Jaewoo Jung, Heeseong Shin, Jisang Han, Jiaolong Yang, Chong Luo, and Seungryong Kim. Pf3plat: Pose-free feed-forward 3d gaussian splatting. arXiv preprint arXiv:2410.22128, 2024.

  34. [34]

    Enhancing monocular height estimation from aerial images with street-view images

    Xiaomou Hou, Wanshui Gan, and Naoto Yokoya. Enhancing monocular height estimation from aerial images with street-view images. arXiv preprint arXiv:2311.02121,

  35. [35]

    SkySplat: Generalizable 3D Gaussian Splatting from Multi-Temporal Sparse Satellite Images

    Xuejun Huang, Xinyi Liu, Yi Wan, Zhi Zheng, Bin Zhang, Mingtao Xiong, Yingying Pei, and Yongjun Zhang. Skysplat: Generalizable 3d gaussian splatting from multi-temporal sparse satellite images. arXiv preprint arXiv:2508.09479, 2025.

  36. [36]

    AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views, May 2025

    Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. arXiv preprint arXiv:2505.23716, 2025.

  37. [37]

    Horizon-gs: Unified 3d gaussian splatting for large-scale aerial-to-ground scenes

    Lihan Jiang, Kerui Ren, Mulin Yu, Linning Xu, Junting Dong, Tao Lu, Feng Zhao, Dahua Lin, and Bo Dai. Horizon-gs: Unified 3d gaussian splatting for large-scale aerial-to-ground scenes. In CVPR, 2025.

  38. [38]

    Analyzing and improving the image quality of StyleGAN

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, 2020.

  39. [39]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstructi...

  40. [40]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. TOG, 2023.

  41. [41]

    Adam: A method for stochastic optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.

  42. [42]

    Tanks and temples: Benchmarking large-scale scene reconstruction

    Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. TOG, 2017.

  43. [43]

    Skyfall-gs: Synthesizing immersive 3d urban scenes from satellite imagery

    Jie-Ying Lee, Yi-Ruei Liu, Shr-Ruei Tsai, Wei-Cheng Chang, Chung-Ho Wu, Jiewen Chan, Zhenjun Zhao, Chieh Hubert Lin, and Yu-Lun Liu. Skyfall-gs: Synthesizing immersive 3d urban scenes from satellite imagery. arXiv preprint arXiv:2510.15869, 2025.

  44. [44]

    Urban Semantic 3D Reconstruction from Multiview Satellite Imagery

    Matthew J. Leotta, Cheng Long, Bastien Jacquet, Michael Zins, Daniel Lipsa, Jizhe Shan, Boyan Xu, Zhaoyu Li, Xun Zhang, Shih-Fu Chang, Misu Purri, Jia Xue, and Kristin Dana. Urban Semantic 3D Reconstruction from Multiview Satellite Imagery. In CVPRW, 2019.

  45. [45]

    Zuoyue Li, Zhenqiang Li, Zhaopeng Cui, Marc Pollefeys, and Martin R. Oswald. Sat2Scene: 3D urban scene generation from satellite images with diffusion. In CVPR, 2024.

  46. [46]

    Zuoyue Li, Zhenqiang Li, Zhaopeng Cui, Rongjun Qin, Marc Pollefeys, and Martin R. Oswald. Sat2Vid: Street-view panoramic video synthesis from a single satellite image. In ICCV, 2021.

  47. [47]

    Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, 2018. 2, 17

  48. [48]

    Orly Liba, Longqi Cai, Yun-Ta Tsai, Elad Eban, Yair Movshovitz-Attias, Yael Pritch, Huizhong Chen, and Jonathan T Barron. Sky optimization: Semantically aware image processing of skies in low-light photography. In CVPRW, 2020. 6

  49. [49]

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In CVPR, 2024. 2, 6, 7, 16

  50. [50]

    Andrew Liu, Richard Tucker, Varun Jampani, Ameesh Makadia, Noah Snavely, and Angjoo Kanazawa. Infinite nature: Perpetual view generation of natural scenes from a single image. In ICCV, 2021. 3

  51. [51]

    Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yingda Yin, Yanchao Yang, Qingnan Fan, and Baoquan Chen. SLAM3R: Real-time dense scene reconstruction from monocular RGB videos. In CVPR, 2025. 2

  52. [52]

    Yifan Liu, Zhiyuan Min, Zhenwei Wang, Junta Wu, Tengfei Wang, Yixuan Yuan, Yawei Luo, and Chunchao Guo. Worldmirror: Universal 3d world reconstruction with any-prior prompting. arXiv preprint arXiv:2510.10726, 2025. 2

  53. [53]

    Mapillary. Mapillary Metropolis Dataset. https://www.mapillary.com/dataset/metropolis. Accessed: 2025-10-18. 5, 6, 16

  54. [54]

    Microsoft. Azure Maps. https://azure.microsoft.com/en-us/products/azure-maps. Accessed: 2025-10-04. 2, 4, 6

  55. [55]

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020. 2

  56. [56]

    Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patric...

  57. [57]

    Linfei Pan, Daniel Barath, Marc Pollefeys, and Johannes Lutz Schönberger. Global Structure-from-Motion Revisited. In ECCV, 2024. 2

  58. [58]

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Nataly Gimelshein, Luca Antiga, Alban Kopf, Federico Metta, Allan Chiley, Brian Stwalley, Sheng Huang, Jiawan Jiang, Yehezkel Chen, Peng Zeng, Xiaobing Li, James Yu, Teteya Li, Andrey Kuchaiev, Kartik Ren, Houdong Zhang, Yanghan Shi, Jani Sin...

  59. [59]

    Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung-Hsu Yang, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniK3D: Universal camera monocular 3d estimation. In CVPR, 2025. 6

  60. [60]

    Ming Qian, Bin Tan, Qiuyu Wang, Xianwei Zheng, Hanjiang Xiong, Gui-Song Xia, Yujun Shen, and Nan Xue. Seeing through satellite images at street views. arXiv preprint arXiv:2505.17001, 2025. 3, 13, 17

  61. [61]

    Ming Qian, Jincheng Xiong, Gui-Song Xia, and Nan Xue. Sat2density: Faithful density learning from satellite-ground image pairs. In ICCV, 2023. 3, 7, 13, 17, 22

  62. [62]

    Rongjun Qin. RPC Stereo Processor (RSP)–A Software Package for Digital Surface Model and Orthophoto Generation from Satellite Stereo Imagery. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2016. 2, 3

  63. [63]

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021. 4

  64. [64]

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In ICCV, 2021. 2

  65. [65]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 2022. 3

  66. [66]

    Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016. 2

  67. [67]

    Thomas Schöps, Torsten Sattler, Christian Häne, and Marc Pollefeys. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR,

  68. [68]

    Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016. 2, 6

  69. [69]

    Yujiao Shi, Dylan Campbell, Xin Yu, and Hongdong Li. Geometry-guided street-view panorama synthesis from satellite imagery. PAMI, 2022. 3

  70. [70]

    Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024. 2

  71. [71]

    Pratul P. Srinivasan, Richard Tucker, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng, and Noah Snavely. Pushing the boundaries of view extrapolation with multiplane images. CVPR, 2019. 2

  72. [72]

    Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T. Barron, and Philipp Henzler. Bolt3D: Generating 3D Scenes in Seconds. In ICCV, 2025. 17

  73. [73]

    Matthew Tancik, Ethan Weber, Evonne Ng, Ruilong Li, Brent Yi, Justin Kerr, Terrance Wang, Alexander Kristoffersen, Jake Austin, Kamyar Salahi, Abhik Ahuja, David McAllister, and Angjoo Kanazawa. Nerfstudio: A modular framework for neural radiance field development. In SIGGRAPH, 2023. 7

  74. [74]

    Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander G. Schwing, and Zhicheng Yan. MV-DUSt3R+: Single-stage scene reconstruction from sparse views in 2 seconds. In CVPR, 2025. 2

  75. [75]

    Aysim Toker, Qunjie Zhou, Maxim Maximov, and Laura Leal-Taixé. Coming Down to Earth: Satellite-to-Street View Synthesis for Geo-Localization. In CVPR, 2021. 3

  76. [76]

    Richard Tucker and Noah Snavely. Single-view view synthesis with multiplane images. In CVPR, 2020. 2

  77. [77]

    U.S. Geological Survey. USGS Lidar Explorer Map. https://apps.nationalmap.gov/lidar-explorer. Accessed: 2025-10-19. 2, 6

  78. [78]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS,

  79. [79]

    Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, and Shubham Tulsiani. Aerialmegadepth: Learning aerial-ground reconstruction and view synthesis. In CVPR, 2025. 2

  80. [80]

    Jianyuan Wang. Skyseg. https://huggingface.co/JianyuanWang/skyseg. Accessed: 2025-08-10. 6

Showing first 80 references.