pith. machine review for the scientific record.

arxiv: 2512.07527 · v3 · submitted 2025-12-08 · 💻 cs.CV · cs.GR

From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images

Pith reviewed 2026-05-17 00:54 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords 3D city reconstruction · satellite photogrammetry · novel view synthesis · generative texture restoration · signed distance fields · off-nadir imagery · urban modeling · height maps

The pith

Representing city geometry as a 2.5D height map from satellite views enables synthesis of photorealistic ground-level images over large areas.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper seeks to solve the problem of creating detailed 3D city models from a small number of satellite images captured at extreme angles from orbit. Standard methods fail because these views provide almost no information about building sides and have distorted textures. The authors introduce a specialized geometry representation that treats the city as a height map with monotonic vertical structure, which allows reliable shape recovery despite the limited input. They then use a neural network to fix and enhance the textures on this geometry. If successful, this would let planners and simulators build accurate virtual cities without extensive on-site photography.

Core claim

We show that modeling city geometry as a 2.5D height map via a Z-monotonic signed distance field stabilizes the reconstruction process from sparse extreme off-nadir satellite images. This produces watertight meshes featuring crisp roof lines and clean vertically extruded facades. Appearance is then transferred from the satellite data through differentiable rendering and refined by a generative texture restoration network that recovers plausible high-frequency details from the degraded orbital captures. Experiments confirm the approach reconstructs real-world regions spanning 4 square kilometers and delivers superior photorealistic novel views at ground level compared to prior techniques.

What carries the argument

A Z-monotonic signed distance field representing a 2.5D height map, which enforces the vertical extrusion typical of urban buildings and provides stable optimization targets under sparse satellite viewpoints with minimal parallax.
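
To make the height-map idea concrete, here is a minimal sketch of an implicit field that is monotonic along Z. It assumes the simplest possible form, f(x, y, z) = z - h(x, y), sampled on a regular grid; this is an illustration only, not the parameterization used in the paper, and it is not a true Euclidean signed distance near vertical walls. The function names and the toy scene are hypothetical.

```python
import numpy as np

def height_field_implicit(height_map, z_levels):
    """Implicit field f(x, y, z) = z - h(x, y): <= 0 at or below the surface, > 0 above."""
    return z_levels[None, None, :] - height_map[:, :, None]

def is_z_monotonic(field):
    """True if the field never decreases along the z axis (last axis)."""
    return bool(np.all(np.diff(field, axis=2) >= 0))

def recover_height(field, z_levels):
    """Read the height back as the highest z sample that is still inside (f <= 0)."""
    inside = field <= 0
    # Inside samples form a prefix of each column, so their count gives the top index.
    top_index = inside.sum(axis=2) - 1
    return z_levels[np.clip(top_index, 0, len(z_levels) - 1)]

# Toy scene (assumed for illustration): flat ground at 0 m with one 12 m box building.
height_map = np.zeros((8, 8))
height_map[2:5, 3:6] = 12.0
z_levels = np.arange(0.0, 20.0, 1.0)  # samples at 0, 1, ..., 19 m

field = height_field_implicit(height_map, z_levels)
recovered = recover_height(field, z_levels)
print(is_z_monotonic(field), np.array_equal(recovered, height_map))
```

Because each vertical column crosses zero exactly once, the surface is a single-valued function of (x, y), which is what makes the extracted mesh watertight with vertical walls between neighboring heights.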

If this is right

  • Large-scale urban areas up to 4 km² can be reconstructed from only a few satellite images.
  • The resulting meshes and textures support high-fidelity ground view synthesis for visualization.
  • Models serve directly as assets in urban planning and simulation applications.
  • The technique remains robust across extensive experiments on real-world city data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such reconstructions could accelerate the creation of digital city twins by leveraging freely available satellite archives instead of costly aerial or ground campaigns.
  • Integration with existing simulation tools might improve accuracy in predicting urban phenomena like heat islands or evacuation routes.
  • Future work could test whether the same height-map prior helps with mixed natural and built environments beyond pure cities.

Load-bearing premise

That the geometry of cities is well approximated by vertically extruded structures captured in a 2.5D height map without significant overhangs or intricate roof shapes.

What would settle it

Observing whether the method produces accurate facades and roofs on buildings with known overhangs or sloped complex roofs when compared against high-resolution ground truth imagery or LiDAR scans.
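
One way to quantify how often the premise fails on ground-truth data is to voxelize a reference model and count columns that a height map cannot represent. The sketch below assumes a boolean occupancy grid with z index 0 at ground level; the grid, the function names, and the toy balcony are hypothetical and only illustrate the check.

```python
import numpy as np

def overhang_columns(occupancy):
    """Return (x, y) indices of columns where a solid cell sits above an empty one."""
    occ = occupancy.astype(np.int8)
    # Moving upward, an empty -> solid transition appears as a difference of +1.
    rises = np.diff(occ, axis=2) == 1
    return np.argwhere(rises.any(axis=2))

def is_height_map_representable(occupancy):
    """True if every column is solid from the ground up to a single height."""
    return len(overhang_columns(occupancy)) == 0

# Toy example: a box building, then the same building with a balcony cell.
box = np.zeros((8, 8, 10), dtype=bool)
box[2:5, 2:5, 0:6] = True

with_balcony = box.copy()
with_balcony[5, 3, 4] = True  # solid cell at height 4 with air underneath

print(is_height_map_representable(box))           # True
print(is_height_map_representable(with_balcony))  # False
print(overhang_columns(with_balcony))             # [[5 3]]
```

The fraction of flagged columns in a LiDAR-derived or hand-modeled reference would give a rough measure of how much geometry the 2.5D assumption discards for a given city.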

Figures

Figures reproduced from arXiv: 2512.07527 by Baoquan Chen, Fei Yu, Haisen Zhao, He Sun, Luyang Tang, Mingchao Sun, Mu Xu, Rui Bu, Wenzheng Chen, Yangyan Li, Yuchao Jin, Yu Liu, Zengye Ge.

Figure 1: City-Scale 3D Reconstruction from Satellite Imagery. We reconstruct a 4 km² real-world urban region from 11 sparse-view satellite images captured from orbit that contain extremely limited parallax. The resulting 3D model, featuring crisp geometry and photorealistic appearance, enables extreme viewpoint extrapolation, supporting high-fidelity, close-range rendering from ground-level viewpoints. Please zoom …
Figure 2: Unlike dense street views, satellite images are sparse and …
Figure 3: The framework of our method. Our pipeline first reconstructs city geometry, then refines its appearance. Stage 1 (Geometry): We optimize a Z-Monotonic SDF against sparse MVS points to extract a high-fidelity, watertight mesh with clean vertical facades. Stage 2 (Appearance): Starting with an initial texture (back-projected from source images), we use a restoration network to enhance close-range novel-view …
Figure 4: Z-Monotonic SDF vs. Naive Conversion. (a, b) A naive 2.5D mesh, generated by directly converting sparse MVS points into a voxel grid, suffers from severe “stair-step” artifacts and topological holes. (c) Our Z-Monotonic SDF representation optimizes a continuous field, resulting in a clean, watertight mesh with precise roofs and sharp vertical facades.
Figure 5: Appearance Refinement. (a) The basic texture T_basic, created by naively back-projecting the blurry source satellite images, suffers from low fidelity and “baked-in” artifacts. (b) Our final texture T_final, optimized using supervision from the restoration network, recovers sharp, photorealistic, and globally consistent details.
Figure 6: Visualization of reconstruction results. Compared to baselines, our method successfully achieves high-quality city reconstruction from satellite imagery. “FAIL” denotes the method fails to converge in experiment, manifested as program crashes.
Figure 7: Qualitative results on ablation study. Please refer to …
Original abstract

City-scale 3D reconstruction from satellite imagery presents the challenge of extreme viewpoint extrapolation, where our goal is to synthesize ground-level novel views from sparse orbital images with minimal parallax. This requires inferring nearly $90^\circ$ viewpoint gaps from image sources with severely foreshortened facades and flawed textures, causing state-of-the-art reconstruction engines such as NeRF and 3DGS to fail. To address this problem, we propose two design choices tailored for city structures and satellite inputs. First, we model city geometry as a 2.5D height map, implemented as a Z-monotonic signed distance field (SDF) that matches urban building layouts from top-down viewpoints. This stabilizes geometry optimization under sparse, off-nadir satellite views and yields a watertight mesh with crisp roofs and clean, vertically extruded facades. Second, we paint the mesh appearance from satellite images via differentiable rendering techniques. While the satellite inputs may contain long-range, blurry captures, we further train a generative texture restoration network to enhance the appearance, recovering high-frequency, plausible texture details from degraded inputs. Our method's scalability and robustness are demonstrated through extensive experiments on large-scale urban reconstruction. For example, in our teaser figure, we reconstruct a $4\,\mathrm{km}^2$ real-world region from only a few satellite images, achieving state-of-the-art performance in synthesizing photorealistic ground views. The resulting models are not only visually compelling but also serve as high-fidelity, application-ready assets for downstream tasks like urban planning and simulation. Project page can be found at https://pku-vcl-geometry.github.io/Orbit2Ground/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to solve city-scale 3D reconstruction from sparse extreme off-nadir satellite images by modeling geometry as a 2.5D height map via a Z-monotonic signed distance field (SDF) that produces watertight meshes with crisp roofs and extruded facades, combined with differentiable rendering and a generative texture restoration network to recover plausible high-frequency details from blurry inputs. It demonstrates the approach on a 4 km² real-world region, claiming state-of-the-art photorealistic ground-level novel view synthesis suitable for urban planning and simulation, where NeRF and 3DGS fail due to large viewpoint gaps.

Significance. If the central claims hold with supporting evidence, the work would advance scalable urban photogrammetry by providing an interpretable, application-ready alternative to general-purpose implicit representations for satellite-to-ground extrapolation. The explicit tailoring of the geometry representation and generative restoration to city structures is a positive aspect that could enable downstream uses in simulation.

major comments (2)
  1. [Abstract] Abstract: The Z-monotonic SDF is asserted to 'match urban building layouts from top-down viewpoints' and yield 'crisp roofs and clean, vertically extruded facades,' yet by construction a monotonic height field cannot represent overhangs, balconies, awnings, or non-vertical roof elements. This assumption is load-bearing for the claim of high-fidelity geometry that supports accurate photorealistic ground views and simulation assets; the generative texture network cannot compensate for missing 3D structure.
  2. [Abstract] Abstract: The manuscript states that the method achieves 'state-of-the-art performance in synthesizing photorealistic ground views' on the 4 km² example, but supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis to support this. Without such evidence the superiority claim cannot be evaluated and is central to the paper's contribution.
minor comments (1)
  1. [Abstract] The project page URL is provided but the manuscript should include a brief statement on code or model availability to support reproducibility claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on our design choices and evidence, and indicate where revisions will be made to strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The Z-monotonic SDF is asserted to 'match urban building layouts from top-down viewpoints' and yield 'crisp roofs and clean, vertically extruded facades,' yet by construction a monotonic height field cannot represent overhangs, balconies, awnings, or non-vertical roof elements. This assumption is load-bearing for the claim of high-fidelity geometry that supports accurate photorealistic ground views and simulation assets; the generative texture network cannot compensate for missing 3D structure.

    Authors: We agree that a Z-monotonic SDF is a 2.5D height-map representation and therefore cannot model overhangs, balconies, awnings, or non-vertical roof elements. This is a deliberate modeling choice for city-scale reconstruction from sparse extreme off-nadir satellite imagery: it ensures optimization stability, produces watertight meshes, and matches the dominant structure of urban buildings (vertically extruded volumes with simple roofs). The generative texture restoration network is designed to synthesize plausible high-frequency appearance on these facades for ground-level novel views, which is the primary goal for photorealistic synthesis and downstream simulation use. We acknowledge that this approximation limits geometric fidelity for fine architectural details and will add an explicit limitations paragraph discussing the 2.5D assumption, its suitability for most urban planning applications, and potential extensions to full 3D representations. revision: partial

  2. Referee: [Abstract] Abstract: The manuscript states that the method achieves 'state-of-the-art performance in synthesizing photorealistic ground views' on the 4 km² example, but supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis to support this. Without such evidence the superiority claim cannot be evaluated and is central to the paper's contribution.

    Authors: The full manuscript contains quantitative evaluations in the Experiments section, including PSNR/SSIM/LPIPS comparisons against NeRF and 3DGS baselines, ablations on the Z-monotonic SDF and generative texture components, and error analysis on the 4 km² real-world region. These results support the state-of-the-art claim for ground-view synthesis. However, the abstract does not summarize the numerical evidence. We will revise the abstract to include key quantitative metrics and explicit references to the supporting experiments and baselines, making the superiority claim directly verifiable from the abstract. revision: yes
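
For readers unfamiliar with the metrics named in the response above, PSNR is the simplest of the three and can be computed directly from pixel values. The sketch below assumes images stored as floating-point arrays in [0, 1]; it is a generic definition and not the authors' evaluation code, and the function name and test images are hypothetical. SSIM and LPIPS need additional machinery (windowed statistics and a pretrained network, respectively) and are not shown.

```python
import numpy as np

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    rendered = np.asarray(rendered, dtype=np.float64)
    mse = np.mean((reference - rendered) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy check: a uniform error of 0.1 gives MSE = 0.01, i.e. about 20 dB.
reference = np.zeros((4, 4, 3))
rendered = np.full((4, 4, 3), 0.1)
print(round(psnr(reference, rendered), 6))
```

Higher PSNR means a rendered ground view is closer to the reference photograph pixel by pixel; it does not by itself capture perceptual quality, which is why it is usually reported alongside SSIM and LPIPS.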

Circularity Check

0 steps flagged

No circularity: explicit modeling choices applied to inputs

Full rationale

The paper proposes two explicit design choices—representing city geometry via a 2.5D Z-monotonic SDF height map and applying a generative texture restoration network after differentiable rendering—to address extreme off-nadir satellite inputs. These are presented as tailored modeling decisions that stabilize optimization and enhance appearance, not as quantities derived from or reducing back to the input data by construction. No equations, fitted parameters renamed as predictions, or self-citation chains are shown that would make the central reconstruction claims equivalent to the inputs. The method is therefore self-contained, with results evaluated on real-world 4 km² regions against external visual and application benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on a domain assumption about urban geometry and on the effectiveness of the generative restoration step; the abstract does not quantify any free parameters.

axioms (1)
  • domain assumption: City geometry from top-down satellite views is well approximated by vertically extruded facades and crisp roofs that can be represented as a Z-monotonic signed distance field.
    Invoked to stabilize geometry optimization under sparse off-nadir inputs and produce watertight meshes.
invented entities (1)
  • Generative texture restoration network (no independent evidence)
    purpose: To recover high-frequency plausible texture details from degraded, blurry satellite inputs when painting the mesh appearance.
    Introduced as an additional trained component after differentiable rendering from satellite images.

pith-pipeline@v0.9.0 · 5644 in / 1406 out tokens · 86357 ms · 2026-05-17T00:54:04.211882+00:00 · methodology



Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 5 internal anchors

  1. [1]

    3dgs-to-pc: 3d gaussian splatting to dense point clouds

    Lewis A G Stuart, Andrew Morton, Ian Stavness, and Michael P Pound. 3dgs-to-pc: 3d gaussian splatting to dense point clouds. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 3730–3739, 2025. 5

  2. [2]

    Gaussian splatting for efficient satellite image photogram- metry

    Luca Savant Aira, Gabriele Facciolo, and Thibaud Ehret. Gaussian splatting for efficient satellite image photogram- metry. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5959–5969, 2025. 3, 6, 7, 22

  3. [3]

    Satgs: Remote sensing novel view synthesis using multi-temporal satellite images with appearance-adaptive 3dgs.Remote Sensing, 17(9):1609, 2025

    Nan Bai, Anran Yang, Hao Chen, and Chun Du. Satgs: Remote sensing novel view synthesis using multi-temporal satellite images with appearance-adaptive 3dgs.Remote Sensing, 17(9):1609, 2025. 3, 6

  4. [4]

    Patchmatch: A randomized correspon- dence algorithm for structural image editing.ACM Trans

    Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. Patchmatch: A randomized correspon- dence algorithm for structural image editing.ACM Trans. Graph., 28(3):24, 2009. 2

  5. [5]

    Mvsformer++: Revealing the devil in transformer’s details for multi-view stereo.arXiv preprint arXiv:2401.11673, 2024

    Chenjie Cao, Xinlin Ren, and Yanwei Fu. Mvsformer++: Revealing the devil in transformer’s details for multi-view stereo.arXiv preprint arXiv:2401.11673, 2024. 1

  6. [6]

    Two deterministic half-quadratic regular- ization algorithms for computed imaging

    Pierre Charbonnier, Laure Blanc-Feraud, Gilles Aubert, and Michel Barlaud. Two deterministic half-quadratic regular- ization algorithms for computed imaging. InProceedings of 1st international conference on image processing, pages 168–172. IEEE, 1994. 5

  7. [7]

    Dogs: Distributed-oriented gaus- sian splatting for large-scale 3d reconstruction via gaussian consensus.Advances in Neural Information Processing Sys- tems, 37:34487–34512, 2024

    Yu Chen and Gim Hee Lee. Dogs: Distributed-oriented gaus- sian splatting for large-scale 3d reconstruction via gaussian consensus.Advances in Neural Information Processing Sys- tems, 37:34487–34512, 2024. 2

  8. [8]

    Ziyang Chen, Wenting Li, Zhongwei Cui, and Yongjun Zhang. Surface depth estimation from multi-view stereo satellite images with distribution contrast network.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2024. 3

  9. [9]

    Luciddreamer: Domain-free generation of 3d gaussian splatting scenes

    Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free gen- eration of 3d gaussian splatting scenes.arXiv preprint arXiv:2311.13384, 2023. 5

  10. [10]

    An automatic and modular stereo pipeline for pushbroom images

    Carlo De Franchis, Enric Meinhardt-Llopis, Julien Michel, Jean-Michel Morel, and Gabriele Facciolo. An automatic and modular stereo pipeline for pushbroom images. InISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2014. 3

  11. [11]

    Shadow neural radiance fields for multi-view satellite photogrammetry

    Dawa Derksen and Dario Izzo. Shadow neural radiance fields for multi-view satellite photogrammetry. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1152–1161, 2021. 3, 6

  12. [12]

    Scaling recti- fied flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning,

  13. [13]

    Flashgs: Efficient 3d gaussian splatting for large-scale and high-resolution rendering

    Guofeng Feng, Siyan Chen, Rong Fu, Zimu Liao, Yi Wang, Tao Liu, Boni Hu, Linning Xu, Zhilin Pei, Hengjie Li, et al. Flashgs: Efficient 3d gaussian splatting for large-scale and high-resolution rendering. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26652– 26662, 2025. 2

  14. [14]

    Flowr: Flowing from sparse to dense 3d reconstructions

    Tobias Fischer, Samuel Rota Bul `o, Yung-Hsu Yang, Nikhil Keetha, Lorenzo Porzi, Norman M ¨uller, Katja Schwarz, Jonathon Luiten, Marc Pollefeys, and Peter Kontschieder. Flowr: Flowing from sparse to dense 3d reconstructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27702–27712, 2025. 3

  15. [15]

    Accurate, dense, and robust multiview stereopsis.IEEE transactions on pattern analysis and machine intelligence, 32(8):1362–1376, 2009

    Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis.IEEE transactions on pattern analysis and machine intelligence, 32(8):1362–1376, 2009. 2, 4

  16. [16]

    Rational polyno- mial camera model warping for deep learning based satel- lite multi-view stereo matching

    Jian Gao, Jin Liu, and Shunping Ji. Rational polyno- mial camera model warping for deep learning based satel- lite multi-view stereo matching. InProceedings of the IEEE/CVF international conference on computer vision, pages 6148–6157, 2021. 3

  17. [17]

    A general deep learn- ing based framework for 3d reconstruction from multi-view stereo satellite images.ISPRS Journal of Photogrammetry and Remote Sensing, 195:446–461, 2023

    Jian Gao, Jin Liu, and Shunping Ji. A general deep learn- ing based framework for 3d reconstruction from multi-view stereo satellite images.ISPRS Journal of Photogrammetry and Remote Sensing, 195:446–461, 2023. 3

  18. [18]

    Citygs- x: A scalable architecture for efficient and geometrically accurate large-scale scene reconstruction.arXiv preprint arXiv:2503.23044, 2025

    Yuanyuan Gao, Hao Li, Jiaqi Chen, Zhengyu Zou, Zhihang Zhong, Dingwen Zhang, Xiao Sun, and Junwei Han. Citygs- x: A scalable architecture for efficient and geometrically accurate large-scale scene reconstruction.arXiv preprint arXiv:2503.23044, 2025. 2, 6, 5

  19. [19]

    Sat2city: 3d city generation from a single satellite image with cascaded latent diffusion

    Tongyan Hua, Lutao Jiang, Ying-Cong Chen, and Wufan Zhao. Sat2city: 3d city generation from a single satellite image with cascaded latent diffusion. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27978–27988, 2025. 3

  20. [20]

    2d gaussian splatting for geometrically ac- curate radiance fields

    Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically ac- curate radiance fields. InACM SIGGRAPH 2024 conference papers, pages 1–11, 2024. 6, 7, 5

  21. [21]

    SkySplat: Generalizable 3D Gaussian Splatting from Multi-Temporal Sparse Satellite Images

    Xuejun Huang, Xinyi Liu, Yi Wan, Zhi Zheng, Bin Zhang, Mingtao Xiong, Yingying Pei, and Yongjun Zhang. Skysplat: Generalizable 3d gaussian splatting from multi-temporal sparse satellite images.arXiv preprint arXiv:2508.09479, 2025. 3

  22. [22]

    3d gaussian splatting for real-time radiance field rendering.ACM Trans

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1,

  23. [23]

    A hierarchical 3d gaussian representation for real-time ren- dering of very large datasets.ACM Transactions on Graphics (TOG), 43(4):1–15, 2024

    Bernhard Kerbl, Andreas Meuleman, Georgios Kopanas, Michael Wimmer, Alexandre Lanvin, and George Drettakis. A hierarchical 3d gaussian representation for real-time ren- dering of very large datasets.ACM Transactions on Graphics (TOG), 43(4):1–15, 2024. 2

  24. [24]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma. Adam: A method for stochastic opti- mization.arXiv preprint arXiv:1412.6980, 2014. 5

  25. [25]

    Flux.https://github.com/ black-forest-labs/flux, 2024

    Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, 2024. 2, 5, 3, 20

  26. [26]

    Skyfall-gs: Synthesiz- 9 ing immersive 3d urban scenes from satellite imagery.arXiv preprint arXiv:2510.15869, 2025

    Jie-Ying Lee, Yi-Ruei Liu, Shr-Ruei Tsai, Wei-Cheng Chang, Chung-Ho Wu, Jiewen Chan, Zhenjun Zhao, Chieh Hubert Lin, and Yu-Lun Liu. Skyfall-gs: Synthesiz- 9 ing immersive 3d urban scenes from satellite imagery.arXiv preprint arXiv:2510.15869, 2025. 3, 6, 4, 5, 7, 14, 15, 16

  27. [27]

    Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond

    Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhen- zhi Wang, Dahua Lin, and Bo Dai. Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3205–3215, 2023. 6, 3

  28. [28]

    Magic3d: High-resolution text-to-3d content creation

    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 300–309, 2023. 3

  29. [29]

    Vastgaussian: Vast 3d gaussians for large scene reconstruction

    Jiaqi Lin, Zhihao Li, Xiao Tang, Jianzhuang Liu, Shiyong Liu, Jiayue Liu, Yangdi Lu, Xiaofei Wu, Songcen Xu, You- liang Yan, et al. Vastgaussian: Vast 3d gaussians for large scene reconstruction. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 5166–5175, 2024. 2

  30. [30]

    Zero-1-to- 3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tok- makov, Sergey Zakharov, and Carl V ondrick. Zero-1-to- 3: Zero-shot one image to 3d object. InProceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023. 3

  31. [31]

    3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view- consistent 2d diffusion priors.Advances in Neural Informa- tion Processing Systems, 37:133305–133327, 2024

    Xi Liu, Chaoyi Zhou, and Siyu Huang. 3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view- consistent 2d diffusion priors.Advances in Neural Informa- tion Processing Systems, 37:133305–133327, 2024. 3

  32. [32]

    Citygaussian: Real-time high-quality large-scale scene rendering with gaussians

    Yang Liu, Chuanchen Luo, Lue Fan, Naiyan Wang, Jun- ran Peng, and Zhaoxiang Zhang. Citygaussian: Real-time high-quality large-scale scene rendering with gaussians. In European Conference on Computer Vision, pages 265–282. Springer, 2024. 2

  33. [33]

    Citygaussianv2: Efficient and geometri- cally accurate reconstruction for large-scale scenes.arXiv preprint arXiv:2411.00771, 2024

    Yang Liu, Chuanchen Luo, Zhongkai Mao, Junran Peng, and Zhaoxiang Zhang. Citygaussianv2: Efficient and geometri- cally accurate reconstruction for large-scale scenes.arXiv preprint arXiv:2411.00771, 2024. 2, 6, 5

  34. [34]

    Wonder3d: Sin- gle image to 3d using cross-domain diffusion

    Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Sin- gle image to 3d using cross-domain diffusion. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9970–9980, 2024. 3

  35. [35]

    Marching cubes: A high resolution 3d surface construction algorithm

    William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction algorithm. InSem- inal graphics: pioneering efforts that shaped the field, pages 347–353. 1998. 4, 8

  36. [36]

    Sat-nerf: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using rpc cameras

    Roger Mar ´ı, Gabriele Facciolo, and Thibaud Ehret. Sat-nerf: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using rpc cameras. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1311–1321, 2022. 3, 6

  37. [37]

    Multi- date earth observation nerf: The detail is in the shadows

    Roger Mar ´ı, Gabriele Facciolo, and Thibaud Ehret. Multi- date earth observation nerf: The detail is in the shadows. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 2035–2045, 2023. 3, 6

  38. [38]

    Realfusion: 360deg reconstruction of any object from a single image

    Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. Realfusion: 360deg reconstruction of any object from a single image. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8446–8455, 2023. 3

  39. [39]

    Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 2

  40. [40]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 1

  41. [41]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 4195–4205,

  42. [42]

    Rasterized edge gra- dients: Handling discontinuities differentiably

    Stanislav Pidhorskyi, Tomas Simon, Gabriel Schwartz, He Wen, Yaser Sheikh, and Jason Saragih. Rasterized edge gra- dients: Handling discontinuities differentiably. InEuropean Conference on Computer Vision, pages 335–352. Springer,

  43. [43]

    Example datasets - pix4dmatic.https : //support.pix4d.com/hc/en- us/articles/ 360048957691, 2025

    Pix4D. Example datasets - pix4dmatic.https : //support.pix4d.com/hc/en- us/articles/ 360048957691, 2025. Accessed: 2025-11. 6

  44. [44]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Milden- hall. Dreamfusion: Text-to-3d using 2d diffusion.arXiv preprint arXiv:2209.14988, 2022. 3

  45. [45]

    Rongjun Qin. Rpc stereo processor (rsp)–a software pack- age for digital surface model and orthophoto generation from satellite stereo imagery.ISPRS Annals of the Photogram- metry, Remote Sensing and Spatial Information Sciences, 3: 77–82, 2016. 3

  46. [46]

    Urban radiance fields

    Konstantinos Rematas, Andrew Liu, Pratul P Srini- vasan, Jonathan T Barron, Andrea Tagliasacchi, Thomas Funkhouser, and Vittorio Ferrari. Urban radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 12932–12942, 2022. 2

  47. [47]

    Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians

    Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and Bo Dai. Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians. arXiv preprint arXiv:2403.17898, 2024. 2

  48. [48]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3

  49. [49]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022. 3

  50. [50]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016. 2

  51. [51]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European conference on computer vision, pages 501–518. Springer, 2016. 2

  52. [52]

    Flexible isosurface extraction for gradient-based mesh optimization

    Tianchang Shen, Jacob Munkberg, Jon Hasselgren, Kangxue Yin, Zian Wang, Wenzheng Chen, Zan Gojcic, Sanja Fidler, Nicholas Sharp, and Jun Gao. Flexible isosurface extraction for gradient-based mesh optimization. ACM Transactions on Graphics (TOG), 42(4):1–16, 2023. 2, 4, 8, 1, 3

  53. [53]

    DINOv3

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025. 1

  54. [54]

    City-on-web: Real-time neural rendering of large-scale scenes on the web

    Kaiwen Song, Xiaoyi Zeng, Chenqu Ren, and Juyong Zhang. City-on-web: Real-time neural rendering of large-scale scenes on the web. In European Conference on Computer Vision, pages 385–402. Springer, 2024. 2

  55. [55]

    Block-nerf: Scalable large scene neural view synthesis

    Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron, and Henrik Kretzschmar. Block-nerf: Scalable large scene neural view synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8248–8258, 2022. 2

  56. [56]

    Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels

    HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, et al. Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels. arXiv preprint arXiv:2507.21809, 2025. 3

  57. [57]

    Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs

    Haithem Turki, Deva Ramanan, and Mahadev Satyanarayanan. Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12922–12931, 2022. 2

  58. [58]

    Suds: Scalable urban dynamic scenes

    Haithem Turki, Jason Y Zhang, Francesco Ferroni, and Deva Ramanan. Suds: Scalable urban dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12375–12385, 2023. 2

  59. [59]

    Difix3d+: Improving 3d reconstructions with single-step diffusion models

    Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. Difix3d+: Improving 3d reconstructions with single-step diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26024–26035, 2025. 3

  60. [60]

    Grid-guided neural radiance fields for large urban scenes

    Linning Xu, Yuanbo Xiangli, Sida Peng, Xingang Pan, Nanxuan Zhao, Christian Theobalt, Bo Dai, and Dahua Lin. Grid-guided neural radiance fields for large urban scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8296–8306, 2023. 2

  61. [61]

    Shuting Yang, Hao Chen, Fachuan He, Wen Chen, Ting Chen, and Jianjun He. A learning-based dual-scale enhanced confidence for dsm fusion in 3d reconstruction of multi-view satellite images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025. 3

  62. [62]

    Gsfixer: Improving 3d gaussian splatting with reference-guided video diffusion priors

    Xingyilang Yin, Qi Zhang, Jiahao Chang, Ying Feng, Qingnan Fan, Xi Yang, Chi-Man Pun, Huaqi Zhang, and Xiaodong Cun. Gsfixer: Improving 3d gaussian splatting with reference-guided video diffusion priors. arXiv preprint arXiv:2508.09667, 2025. 3

  63. [63]

    Wonderworld: Interactive 3d scene generation from a single image

    Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T Freeman, and Jiajun Wu. Wonderworld: Interactive 3d scene generation from a single image. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5916–5926, 2025. 3, 5

  64. [64]

    Mip-splatting: Alias-free 3d gaussian splatting

    Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19447–19456, 2024. 6, 5

  65. [65]

    Leveraging vision reconstruction pipelines for satellite imagery

    Kai Zhang, Noah Snavely, and Jin Sun. Leveraging vision reconstruction pipelines for satellite imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019. 3, 1, 4

  66. [66]

    Sparsesat-nerf: Dense depth supervised neural radiance fields for sparse satellite images

    Lulin Zhang and Ewelina Rupnik. Sparsesat-nerf: Dense depth supervised neural radiance fields for sparse satellite images. arXiv preprint arXiv:2309.00277, 2023. 3, 6

  67. [67]

    Satensorf: Fast satellite tensorial radiance field for multidate satellite imagery of large size

    Tongtong Zhang, Yu Zhou, Yuanxiang Li, and Xian Wei. Satensorf: Fast satellite tensorial radiance field for multidate satellite imagery of large size. IEEE Transactions on Geoscience and Remote Sensing, 62:1–15, 2024. 3, 6, 7, 22

  68. [68]

    Efficient large-scale scene representation with a hybrid of high-resolution grid and plane features

    Yuqi Zhang, Guanying Chen, and Shuguang Cui. Efficient large-scale scene representation with a hybrid of high-resolution grid and plane features. Pattern Recognition, 158:111001, 2025. 2

  69. [69]

    On scaling up 3d gaussian splatting training

    Hexu Zhao, Haoyang Weng, Daohan Lu, Ang Li, Jinyang Li, Aurojit Panda, and Saining Xie. On scaling up 3d gaussian splatting training. In European Conference on Computer Vision, pages 14–36. Springer, 2024. 2

  70. [70]

    A review of 3d reconstruction from high-resolution urban satellite images

    Li Zhao, Haiyan Wang, Yi Zhu, and Mei Song. A review of 3d reconstruction from high-resolution urban satellite images. International Journal of Remote Sensing, 44(2):713–748, 2023. 3

  71. [71]

    Switch-nerf: Learning scene decomposition with mixture of experts for large-scale neural radiance fields

    MI Zhenxing and Dan Xu. Switch-nerf: Learning scene decomposition with mixture of experts for large-scale neural radiance fields. In The Eleventh International Conference on Learning Representations, 2022. 2