arxiv: 2512.07527 · v3 · submitted 2025-12-08 · 💻 cs.CV · cs.GR

From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images

Fei Yu, Yu Liu, Luyang Tang, Mingchao Sun, Zengye Ge, Rui Bu, Yuchao Jin, Haisen Zhao, He Sun, Yangyan Li, Mu Xu, Wenzheng Chen, Baoquan Chen

Pith reviewed 2026-05-17 00:54 UTC · model grok-4.3

classification 💻 cs.CV cs.GR

keywords: 3D city reconstruction, satellite photogrammetry, novel view synthesis, generative texture restoration, signed distance fields, off-nadir imagery, urban modeling, height maps

The pith

Representing city geometry as a 2.5D height map from satellite views enables synthesis of photorealistic ground-level images over large areas.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses the creation of detailed 3D city models from a small number of satellite images captured at extreme off-nadir angles. Standard methods fail because these views carry almost no information about building sides and their textures are distorted. The authors introduce a geometry representation that treats the city as a height map with monotonic vertical structure, which allows reliable shape recovery despite the limited input. They then use a generative network to restore and enhance the textures on this geometry. If successful, this would let urban planners and simulation teams build accurate virtual cities without extensive on-site photography.

Core claim

We show that modeling city geometry as a 2.5D height map via a Z-monotonic signed distance field stabilizes the reconstruction process from sparse extreme off-nadir satellite images. This produces watertight meshes featuring crisp roof lines and clean vertically extruded facades. Appearance is then transferred from the satellite data through differentiable rendering and refined by a generative texture restoration network that recovers plausible high-frequency details from the degraded orbital captures. Experiments confirm the approach reconstructs real-world regions spanning 4 square kilometers and delivers superior photorealistic novel views at ground level compared to prior techniques.

What carries the argument

A Z-monotonic signed distance field representing a 2.5D height map, which enforces the vertical extrusion typical of urban buildings and provides stable optimization targets under sparse satellite viewpoints with minimal parallax.
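As a rough illustration of what "Z-monotonic" means in practice, the sketch below builds a field from a toy height map and checks the discrete analogue of ∂s/∂z ≥ 0. It is a minimal sketch under stated assumptions, not the paper's implementation: the field `z - h(x, y)` is a vertical-distance field and not a true Euclidean SDF, the sign convention (negative inside) is assumed, and the function names are hypothetical.

```python
import numpy as np

def height_field_sdf(height_map, z_values):
    """Return s[i, j, k] = z_values[k] - height_map[i, j].

    Assumption: a vertical-distance field, not a true Euclidean SDF.
    Negative below the surface (inside), positive above it (outside).
    """
    return z_values[None, None, :] - height_map[:, :, None]

def is_z_monotonic(field, tol=0.0):
    """Check the discrete analogue of ds/dz >= 0 along the last (z) axis."""
    return bool(np.all(np.diff(field, axis=-1) >= -tol))

# Toy 4x4 "city": flat ground at 0 m with one 10 m block in the middle.
height_map = np.zeros((4, 4))
height_map[1:3, 1:3] = 10.0
z_values = np.linspace(0.0, 20.0, 21)  # 1 m vertical spacing

field = height_field_sdf(height_map, z_values)
print(is_z_monotonic(field))     # True

# Simulate an overhang: the field dips back below zero higher up the column.
overhang = field.copy()
overhang[0, 0, 12] = -1.0
print(is_z_monotonic(overhang))  # False
```

Any field that passes this check has at most one inside-to-outside transition per vertical column, which is what makes it equivalent to a height map.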

If this is right

Large-scale urban areas up to 4 km² can be reconstructed from only a few satellite images.
The resulting meshes and textures support high-fidelity ground view synthesis for visualization.
Models serve directly as assets in urban planning and simulation applications.
The technique remains robust across extensive experiments on real-world city data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such reconstructions could accelerate the creation of digital city twins by drawing on existing satellite archives instead of costly aerial or ground campaigns.
Integration with existing simulation tools might improve accuracy in predicting urban phenomena like heat islands or evacuation routes.
Future work could test whether the same height-map prior helps with mixed natural and built environments beyond pure cities.

Load-bearing premise

That the geometry of cities is well approximated by vertically extruded structures captured in a 2.5D height map without significant overhangs or intricate roof shapes.
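What this premise costs when it fails can be shown on a toy voxel grid. The sketch below is an illustration with hypothetical shapes and function names, not the paper's pipeline: it collapses an occupancy grid containing a balcony into a height map, extrudes it back, and counts the voxels of air that end up filled.

```python
import numpy as np

def to_height_map(occupancy):
    """Collapse a boolean (X, Y, Z) occupancy grid to per-column top height in voxels."""
    nz = occupancy.shape[2]
    any_solid = occupancy.any(axis=2)
    # argmax on the z-reversed grid finds the highest occupied voxel per column.
    top = nz - np.argmax(occupancy[:, :, ::-1], axis=2)
    return np.where(any_solid, top, 0)

def extrude(height_map, nz):
    """Rebuild occupancy by filling every column from the ground up to its height."""
    z = np.arange(nz)
    return z[None, None, :] < height_map[:, :, None]

occ = np.zeros((3, 3, 8), dtype=bool)
occ[1, 1, :6] = True    # tower, 6 voxels tall
occ[1, 2, 4:6] = True   # balcony: solid at z = 4, 5 with air underneath

rebuilt = extrude(to_height_map(occ), occ.shape[2])
falsely_filled = int(np.sum(rebuilt & ~occ))
print(falsely_filled)   # 4
```

The roof outline survives the round trip, but the air below the balcony becomes solid wall. That is the kind of error a texture network can hide visually without correcting the geometry.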

What would settle it

Observing whether the method produces accurate facades and roofs on buildings with known overhangs or sloped complex roofs when compared against high-resolution ground truth imagery or LiDAR scans.
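A comparison against LiDAR of this kind is commonly scored with Chamfer distance and a thresholded precision/recall/F1. The sketch below shows one common definition of each; the exact variants used in the paper are not given in the text here, so the formulas, the threshold, and the toy points are assumptions for illustration.

```python
import numpy as np

def nearest_distances(src, dst):
    """For each point in src (N, 3), Euclidean distance to its nearest point in dst (M, 3)."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

def chamfer_and_fscore(pred, gt, threshold):
    """Symmetric Chamfer distance (sum of mean nearest distances) and thresholded F1."""
    d_pred_to_gt = nearest_distances(pred, gt)  # drives precision
    d_gt_to_pred = nearest_distances(gt, pred)  # drives recall
    chamfer = float(d_pred_to_gt.mean() + d_gt_to_pred.mean())
    precision = float((d_pred_to_gt < threshold).mean())
    recall = float((d_gt_to_pred < threshold).mean())
    total = precision + recall
    f1 = 0.0 if total == 0 else 2 * precision * recall / total
    return chamfer, precision, recall, f1

# Hypothetical ground truth with one point the reconstruction misses entirely.
gt = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 5]], dtype=float)
pred = gt[:3]

chamfer, precision, recall, f1 = chamfer_and_fscore(pred, gt, threshold=0.5)
print(chamfer, precision, recall, round(f1, 3))  # 1.25 1.0 0.75 0.857
```

Precision stays perfect while recall drops, which is the signature one would expect if a height-map model reproduces what it can represent accurately but omits structure it cannot.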

Figures

Figures reproduced from arXiv: 2512.07527 by Baoquan Chen, Fei Yu, Haisen Zhao, He Sun, Luyang Tang, Mingchao Sun, Mu Xu, Rui Bu, Wenzheng Chen, Yangyan Li, Yuchao Jin, Yu Liu, Zengye Ge.

**Figure 1.** City-Scale 3D Reconstruction from Satellite Imagery. We reconstruct a 4 km² real-world urban region from 11 sparse-view satellite images captured from orbit that contain extremely limited parallax. The resulting 3D model, featuring crisp geometry and photorealistic appearance, enables extreme viewpoint extrapolation, supporting high-fidelity, close-range rendering from ground-level viewpoints. Please zoom …

**Figure 2.** Unlike dense street views, satellite images are sparse and …

**Figure 3.** The framework of our method. Our pipeline first reconstructs city geometry, then refines its appearance. Stage 1 (Geometry): We optimize a Z-Monotonic SDF against sparse MVS points to extract a high-fidelity, watertight mesh with clean vertical facades. Stage 2 (Appearance): Starting with an initial texture (back-projected from source images), we use a restoration network to enhance close-range novel-view …

**Figure 4.** Z-Monotonic SDF vs. Naive Conversion. (a, b) A naive 2.5D mesh, generated by directly converting sparse MVS points into a voxel grid, suffers from severe “stair-step” artifacts and topological holes. (c) Our Z-Monotonic SDF representation optimizes a continuous field, resulting in a clean, watertight mesh with precise roofs and sharp vertical facades.

**Figure 5.** Appearance Refinement. (a) The basic texture $T_{\mathrm{basic}}$, created by naively back-projecting the blurry source satellite images, suffers from low fidelity and “baked-in” artifacts. (b) Our final texture $T_{\mathrm{final}}$, optimized using supervision from the restoration network, recovers sharp, photorealistic, and globally consistent details.

**Figure 6.** Visualization of reconstruction results. Compared to baselines, our method successfully achieves high-quality city reconstruction from satellite imagery. “FAIL” denotes the method fails to converge in experiment, manifested as program crashes.

**Figure 7.** Qualitative results on ablation study. Please refer to …

Original abstract

City-scale 3D reconstruction from satellite imagery presents the challenge of extreme viewpoint extrapolation, where our goal is to synthesize ground-level novel views from sparse orbital images with minimal parallax. This requires inferring nearly $90^\circ$ viewpoint gaps from image sources with severely foreshortened facades and flawed textures, causing state-of-the-art reconstruction engines such as NeRF and 3DGS to fail. To address this problem, we propose two design choices tailored for city structures and satellite inputs. First, we model city geometry as a 2.5D height map, implemented as a Z-monotonic signed distance field (SDF) that matches urban building layouts from top-down viewpoints. This stabilizes geometry optimization under sparse, off-nadir satellite views and yields a watertight mesh with crisp roofs and clean, vertically extruded facades. Second, we paint the mesh appearance from satellite images via differentiable rendering techniques. While the satellite inputs may contain long-range, blurry captures, we further train a generative texture restoration network to enhance the appearance, recovering high-frequency, plausible texture details from degraded inputs. Our method's scalability and robustness are demonstrated through extensive experiments on large-scale urban reconstruction. For example, in our teaser figure, we reconstruct a $4\,\mathrm{km}^2$ real-world region from only a few satellite images, achieving state-of-the-art performance in synthesizing photorealistic ground views. The resulting models are not only visually compelling but also serve as high-fidelity, application-ready assets for downstream tasks like urban planning and simulation. Project page can be found at https://pku-vcl-geometry.github.io/Orbit2Ground/.
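To put rough numbers on the "severely foreshortened facades" the abstract mentions, the sketch below uses a simple orthographic-camera approximation. This is an assumption made for illustration and not the paper's camera model. The off-nadir angle is measured from straight down, and the angle, facade height, and roof width are hypothetical values.

```python
import math

def projected_extents(off_nadir_deg, facade_height_m, roof_width_m):
    """Orthographic approximation of how much facade and roof a tilted view sees.

    A vertical facade of height H projects to H * sin(theta) in the image plane,
    while a flat roof of width W (measured along the tilt direction) projects to
    W * cos(theta). theta = 0 is straight down; theta = 90 is a ground-level view.
    """
    theta = math.radians(off_nadir_deg)
    facade = facade_height_m * math.sin(theta)
    roof = roof_width_m * math.cos(theta)
    return facade, roof

facade, roof = projected_extents(off_nadir_deg=30.0, facade_height_m=30.0, roof_width_m=20.0)
print(round(facade, 2), round(roof, 2))  # 15.0 17.32

# A ground-level view (theta = 90 degrees) would see the full 30 m facade.
ground_facade, _ = projected_extents(90.0, 30.0, 20.0)
print(round(ground_facade, 2))           # 30.0
```

Under these assumptions a 30 m facade occupies only half its true extent in the satellite image, and the remaining texture detail must be inferred for ground-level rendering.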

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper stabilizes satellite-to-ground city reconstruction with a Z-monotonic SDF height map plus a separate generative texture network, but the 2.5D assumption risks missing real urban details like overhangs.

The core contribution is a practical way to build 3D city models from just a few extreme off-nadir satellite images. Standard NeRF and 3DGS struggle with the near-90-degree viewpoint shift and foreshortened facades, so the authors model geometry as a Z-monotonic SDF that acts like a height field. This keeps optimization stable and produces a watertight mesh with extruded facades. They then use differentiable rendering to transfer appearance and train a generative network to sharpen textures from the blurry orbital inputs. The 4 km² example shows they can scale this to real city blocks without dense ground imagery, which is a clear step beyond prior satellite NeRF work for this specific extrapolation task. The design choices are explicit and tied to urban structure, which helps explain why it might outperform generic methods here.

That said, the Z-monotonic constraint means the geometry cannot represent balconies, awnings, bridges, or non-vertical roof elements without smoothing or distortion. If those features appear in the test region, the base mesh will be inaccurate and the texture network can only mask the problem visually, not fix the underlying 3D error. This directly affects claims about high-fidelity assets for simulation. The abstract asserts SOTA results but supplies no numbers, ablations, or error breakdowns in the provided text, so the strength of the evidence is still unclear.

This work is aimed at vision researchers focused on remote-sensing 3D and at teams needing quick large-area urban models for planning. It deserves peer review so the experiments can be checked against the geometric limitations and quantitative claims.

Referee Report

2 major / 1 minor

Summary. The paper claims to solve city-scale 3D reconstruction from sparse extreme off-nadir satellite images by modeling geometry as a 2.5D height map via a Z-monotonic signed distance field (SDF) that produces watertight meshes with crisp roofs and extruded facades, combined with differentiable rendering and a generative texture restoration network to recover plausible high-frequency details from blurry inputs. It demonstrates the approach on a 4 km² real-world region, claiming state-of-the-art photorealistic ground-level novel view synthesis suitable for urban planning and simulation, where NeRF and 3DGS fail due to large viewpoint gaps.

Significance. If the central claims hold with supporting evidence, the work would advance scalable urban photogrammetry by providing an interpretable, application-ready alternative to general-purpose implicit representations for satellite-to-ground extrapolation. The explicit tailoring of the geometry representation and generative restoration to city structures is a positive aspect that could enable downstream uses in simulation.

major comments (2)

[Abstract] The Z-monotonic SDF is asserted to 'match urban building layouts from top-down viewpoints' and yield 'crisp roofs and clean, vertically extruded facades,' yet by construction a monotonic height field cannot represent overhangs, balconies, awnings, or non-vertical roof elements. This assumption is load-bearing for the claim of high-fidelity geometry that supports accurate photorealistic ground views and simulation assets; the generative texture network cannot compensate for missing 3D structure.
[Abstract] The manuscript states that the method achieves 'state-of-the-art performance in synthesizing photorealistic ground views' on the 4 km² example, but supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis to support this. Without such evidence the superiority claim cannot be evaluated and is central to the paper's contribution.

minor comments (1)

[Abstract] The project page URL is provided but the manuscript should include a brief statement on code or model availability to support reproducibility claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on our design choices and evidence, and indicate where revisions will be made to strengthen the presentation.

Referee: [Abstract] The Z-monotonic SDF is asserted to 'match urban building layouts from top-down viewpoints' and yield 'crisp roofs and clean, vertically extruded facades,' yet by construction a monotonic height field cannot represent overhangs, balconies, awnings, or non-vertical roof elements. This assumption is load-bearing for the claim of high-fidelity geometry that supports accurate photorealistic ground views and simulation assets; the generative texture network cannot compensate for missing 3D structure.

Authors: We agree that a Z-monotonic SDF is a 2.5D height-map representation and therefore cannot model overhangs, balconies, awnings, or non-vertical roof elements. This is a deliberate modeling choice for city-scale reconstruction from sparse extreme off-nadir satellite imagery: it ensures optimization stability, produces watertight meshes, and matches the dominant structure of urban buildings (vertically extruded volumes with simple roofs). The generative texture restoration network is designed to synthesize plausible high-frequency appearance on these facades for ground-level novel views, which is the primary goal for photorealistic synthesis and downstream simulation use. We acknowledge that this approximation limits geometric fidelity for fine architectural details and will add an explicit limitations paragraph discussing the 2.5D assumption, its suitability for most urban planning applications, and potential extensions to full 3D representations. revision: partial
Referee: [Abstract] The manuscript states that the method achieves 'state-of-the-art performance in synthesizing photorealistic ground views' on the 4 km² example, but supplies no quantitative metrics, baseline comparisons, ablation results, or error analysis to support this. Without such evidence the superiority claim cannot be evaluated and is central to the paper's contribution.

Authors: The full manuscript contains quantitative evaluations in the Experiments section, including PSNR/SSIM/LPIPS comparisons against NeRF and 3DGS baselines, ablations on the Z-monotonic SDF and generative texture components, and error analysis on the 4 km² real-world region. These results support the state-of-the-art claim for ground-view synthesis. However, the abstract does not summarize the numerical evidence. We will revise the abstract to include key quantitative metrics and explicit references to the supporting experiments and baselines, making the superiority claim directly verifiable from the abstract. revision: yes
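For readers unfamiliar with the metrics named in the rebuttal, PSNR is the simplest of the three and can be sketched in a few lines. This is the standard textbook definition. The evaluation protocol (image value range, resolution, which ground views are held out) is not given in the text here, so the `max_value` default and the toy images are assumptions.

```python
import numpy as np

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(rendered, dtype=np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Toy example: a black reference and a render that is uniformly off by 0.1.
reference = np.zeros((4, 4))
rendered = np.full((4, 4), 0.1)
print(round(psnr(reference, rendered), 2))  # 20.0
```

PSNR rewards pixel-wise agreement, so generatively restored textures that look plausible but differ in detail from the reference can score poorly on it. Perceptual metrics such as LPIPS are usually reported alongside it for that reason.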

Circularity Check

0 steps flagged

No circularity: explicit modeling choices applied to inputs

The paper proposes two explicit design choices—representing city geometry via a 2.5D Z-monotonic SDF height map and applying a generative texture restoration network after differentiable rendering—to address extreme off-nadir satellite inputs. These are presented as tailored modeling decisions that stabilize optimization and enhance appearance, not as quantities derived from or reducing back to the input data by construction. No equations, fitted parameters renamed as predictions, or self-citation chains are shown that would make the central reconstruction claims equivalent to the inputs. The method is therefore self-contained, with results evaluated on real-world 4 km² regions against external visual and application benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on domain assumptions about urban geometry and the effectiveness of the generative restoration step; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)

domain assumption: City geometry from top-down satellite views is well approximated by vertically extruded facades and crisp roofs that can be represented as a Z-monotonic signed distance field.
Invoked to stabilize geometry optimization under sparse off-nadir inputs and produce watertight meshes.
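Why this axiom amounts to a height-map assumption can be made explicit with a short derivation from the paper's constraint ∂s(x, y, z)/∂z ≥ 0. It assumes the sign convention that s ≤ 0 marks solid space, which the text here does not state, and the symbol h for the height map is introduced only for illustration.

```latex
% Monotonicity in z: values can only grow as we move upward in a column.
\[
\frac{\partial s}{\partial z}(x, y, z) \ge 0
\;\Longrightarrow\;
s(x, y, z) \le s(x, y, z_0) \quad \text{for all } z \le z_0 .
\]
% So if (x, y, z_0) is solid, i.e. s(x, y, z_0) \le 0, every point below it is solid too.
% Each vertical column is therefore solid from below up to a single height:
\[
h(x, y) = \sup \{\, z : s(x, y, z) \le 0 \,\},
\qquad
\{\, z : s(x, y, z) \le 0 \,\} \subseteq (-\infty,\, h(x, y)] .
\]
```

Because the solid part of every column is closed under moving downward, no column can contain air beneath solid material. That is exactly the exclusion of overhangs that the referee report raises.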

invented entities (1)

Generative texture restoration network (no independent evidence)
purpose: To recover high-frequency plausible texture details from degraded, blurry satellite inputs when painting the mesh appearance.
Introduced as an additional trained component after differentiable rendering from satellite images.

pith-pipeline@v0.9.0 · 5644 in / 1406 out tokens · 86357 ms · 2026-05-17T00:54:04.211882+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: we model city geometry as a 2.5D height map, implemented as a Z-Monotonic signed distance field (SDF) that matches urban building layouts from top-down viewpoints... ∂s(x, y, z)/∂z ≥ 0

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear

Paper passage: Z-Monotonic SDF... yields a watertight mesh with crisp roofs and clean, vertically extruded facades

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 5 internal anchors

[1]

3dgs-to-pc: 3d gaussian splatting to dense point clouds

Lewis A G Stuart, Andrew Morton, Ian Stavness, and Michael P Pound. 3dgs-to-pc: 3d gaussian splatting to dense point clouds. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 3730–3739, 2025. 5

work page 2025
[2]

Gaussian splatting for efficient satellite image photogram- metry

Luca Savant Aira, Gabriele Facciolo, and Thibaud Ehret. Gaussian splatting for efficient satellite image photogram- metry. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5959–5969, 2025. 3, 6, 7, 22

work page 2025
[3]

Satgs: Remote sensing novel view synthesis using multi-temporal satellite images with appearance-adaptive 3dgs.Remote Sensing, 17(9):1609, 2025

Nan Bai, Anran Yang, Hao Chen, and Chun Du. Satgs: Remote sensing novel view synthesis using multi-temporal satellite images with appearance-adaptive 3dgs.Remote Sensing, 17(9):1609, 2025. 3, 6

work page 2025
[4]

Patchmatch: A randomized correspon- dence algorithm for structural image editing.ACM Trans

Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. Patchmatch: A randomized correspon- dence algorithm for structural image editing.ACM Trans. Graph., 28(3):24, 2009. 2

work page 2009
[5]

Mvsformer++: Revealing the devil in transformer’s details for multi-view stereo.arXiv preprint arXiv:2401.11673, 2024

Chenjie Cao, Xinlin Ren, and Yanwei Fu. Mvsformer++: Revealing the devil in transformer’s details for multi-view stereo.arXiv preprint arXiv:2401.11673, 2024. 1

work page arXiv 2024
[6]

Two deterministic half-quadratic regular- ization algorithms for computed imaging

Pierre Charbonnier, Laure Blanc-Feraud, Gilles Aubert, and Michel Barlaud. Two deterministic half-quadratic regular- ization algorithms for computed imaging. InProceedings of 1st international conference on image processing, pages 168–172. IEEE, 1994. 5

work page 1994
[7]

Dogs: Distributed-oriented gaus- sian splatting for large-scale 3d reconstruction via gaussian consensus.Advances in Neural Information Processing Sys- tems, 37:34487–34512, 2024

Yu Chen and Gim Hee Lee. Dogs: Distributed-oriented gaus- sian splatting for large-scale 3d reconstruction via gaussian consensus.Advances in Neural Information Processing Sys- tems, 37:34487–34512, 2024. 2

work page 2024
[8]

Ziyang Chen, Wenting Li, Zhongwei Cui, and Yongjun Zhang. Surface depth estimation from multi-view stereo satellite images with distribution contrast network.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2024. 3

work page 2024
[9]

Luciddreamer: Domain-free generation of 3d gaussian splatting scenes

Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free gen- eration of 3d gaussian splatting scenes.arXiv preprint arXiv:2311.13384, 2023. 5

work page arXiv 2023
[10]

An automatic and modular stereo pipeline for pushbroom images

Carlo De Franchis, Enric Meinhardt-Llopis, Julien Michel, Jean-Michel Morel, and Gabriele Facciolo. An automatic and modular stereo pipeline for pushbroom images. InISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2014. 3

work page 2014
[11]

Shadow neural radiance fields for multi-view satellite photogrammetry

Dawa Derksen and Dario Izzo. Shadow neural radiance fields for multi-view satellite photogrammetry. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1152–1161, 2021. 3, 6

work page 2021
[12]

Scaling recti- fied flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning,

work page
[13]

Flashgs: Efficient 3d gaussian splatting for large-scale and high-resolution rendering

Guofeng Feng, Siyan Chen, Rong Fu, Zimu Liao, Yi Wang, Tao Liu, Boni Hu, Linning Xu, Zhilin Pei, Hengjie Li, et al. Flashgs: Efficient 3d gaussian splatting for large-scale and high-resolution rendering. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26652– 26662, 2025. 2

work page 2025
[14]

Flowr: Flowing from sparse to dense 3d reconstructions

Tobias Fischer, Samuel Rota Bul `o, Yung-Hsu Yang, Nikhil Keetha, Lorenzo Porzi, Norman M ¨uller, Katja Schwarz, Jonathon Luiten, Marc Pollefeys, and Peter Kontschieder. Flowr: Flowing from sparse to dense 3d reconstructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27702–27712, 2025. 3

work page 2025
[15]

Accurate, dense, and robust multiview stereopsis.IEEE transactions on pattern analysis and machine intelligence, 32(8):1362–1376, 2009

Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis.IEEE transactions on pattern analysis and machine intelligence, 32(8):1362–1376, 2009. 2, 4

work page 2009
[16]

Rational polyno- mial camera model warping for deep learning based satel- lite multi-view stereo matching

Jian Gao, Jin Liu, and Shunping Ji. Rational polyno- mial camera model warping for deep learning based satel- lite multi-view stereo matching. InProceedings of the IEEE/CVF international conference on computer vision, pages 6148–6157, 2021. 3

work page 2021
[17]

A general deep learn- ing based framework for 3d reconstruction from multi-view stereo satellite images.ISPRS Journal of Photogrammetry and Remote Sensing, 195:446–461, 2023

Jian Gao, Jin Liu, and Shunping Ji. A general deep learn- ing based framework for 3d reconstruction from multi-view stereo satellite images.ISPRS Journal of Photogrammetry and Remote Sensing, 195:446–461, 2023. 3

work page 2023
[18]

Citygs- x: A scalable architecture for efficient and geometrically accurate large-scale scene reconstruction.arXiv preprint arXiv:2503.23044, 2025

Yuanyuan Gao, Hao Li, Jiaqi Chen, Zhengyu Zou, Zhihang Zhong, Dingwen Zhang, Xiao Sun, and Junwei Han. Citygs- x: A scalable architecture for efficient and geometrically accurate large-scale scene reconstruction.arXiv preprint arXiv:2503.23044, 2025. 2, 6, 5

work page arXiv 2025
[19]

Sat2city: 3d city generation from a single satellite image with cascaded latent diffusion

Tongyan Hua, Lutao Jiang, Ying-Cong Chen, and Wufan Zhao. Sat2city: 3d city generation from a single satellite image with cascaded latent diffusion. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27978–27988, 2025. 3

work page 2025
[20]

2d gaussian splatting for geometrically ac- curate radiance fields

Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically ac- curate radiance fields. InACM SIGGRAPH 2024 conference papers, pages 1–11, 2024. 6, 7, 5

work page 2024
[21]

SkySplat: Generalizable 3D Gaussian Splatting from Multi-Temporal Sparse Satellite Images

Xuejun Huang, Xinyi Liu, Yi Wan, Zhi Zheng, Bin Zhang, Mingtao Xiong, Yingying Pei, and Yongjun Zhang. Skysplat: Generalizable 3d gaussian splatting from multi-temporal sparse satellite images.arXiv preprint arXiv:2508.09479, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1,

work page
[23]

A hierarchical 3d gaussian representation for real-time ren- dering of very large datasets.ACM Transactions on Graphics (TOG), 43(4):1–15, 2024

Bernhard Kerbl, Andreas Meuleman, Georgios Kopanas, Michael Wimmer, Alexandre Lanvin, and George Drettakis. A hierarchical 3d gaussian representation for real-time ren- dering of very large datasets.ACM Transactions on Graphics (TOG), 43(4):1–15, 2024. 2

work page 2024
[24]

Adam: A Method for Stochastic Optimization

Diederik P Kingma. Adam: A method for stochastic opti- mization.arXiv preprint arXiv:1412.6980, 2014. 5

work page internal anchor Pith review Pith/arXiv arXiv 2014
[25]

Flux.https://github.com/ black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, 2024. 2, 5, 3, 20

work page 2024
[26]

Skyfall-gs: Synthesiz- 9 ing immersive 3d urban scenes from satellite imagery.arXiv preprint arXiv:2510.15869, 2025

Jie-Ying Lee, Yi-Ruei Liu, Shr-Ruei Tsai, Wei-Cheng Chang, Chung-Ho Wu, Jiewen Chan, Zhenjun Zhao, Chieh Hubert Lin, and Yu-Lun Liu. Skyfall-gs: Synthesiz- 9 ing immersive 3d urban scenes from satellite imagery.arXiv preprint arXiv:2510.15869, 2025. 3, 6, 4, 5, 7, 14, 15, 16

work page arXiv 2025
[27]

Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond

Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhen- zhi Wang, Dahua Lin, and Bo Dai. Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3205–3215, 2023. 6, 3

work page 2023
[28]

Magic3d: High-resolution text-to-3d content creation

Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 300–309, 2023. 3

work page 2023
[29]

Vastgaussian: Vast 3d gaussians for large scene reconstruction

Jiaqi Lin, Zhihao Li, Xiao Tang, Jianzhuang Liu, Shiyong Liu, Jiayue Liu, Yangdi Lu, Xiaofei Wu, Songcen Xu, You- liang Yan, et al. Vastgaussian: Vast 3d gaussians for large scene reconstruction. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 5166–5175, 2024. 2

work page 2024
[30]

Zero-1-to- 3: Zero-shot one image to 3d object

Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tok- makov, Sergey Zakharov, and Carl V ondrick. Zero-1-to- 3: Zero-shot one image to 3d object. InProceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023. 3

work page 2023
[31]

3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view- consistent 2d diffusion priors.Advances in Neural Informa- tion Processing Systems, 37:133305–133327, 2024

Xi Liu, Chaoyi Zhou, and Siyu Huang. 3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view- consistent 2d diffusion priors.Advances in Neural Informa- tion Processing Systems, 37:133305–133327, 2024. 3

work page 2024
[32]

Citygaussian: Real-time high-quality large-scale scene rendering with gaussians

Yang Liu, Chuanchen Luo, Lue Fan, Naiyan Wang, Jun- ran Peng, and Zhaoxiang Zhang. Citygaussian: Real-time high-quality large-scale scene rendering with gaussians. In European Conference on Computer Vision, pages 265–282. Springer, 2024. 2

work page 2024
[33]

Citygaussianv2: Efficient and geometri- cally accurate reconstruction for large-scale scenes.arXiv preprint arXiv:2411.00771, 2024

Yang Liu, Chuanchen Luo, Zhongkai Mao, Junran Peng, and Zhaoxiang Zhang. Citygaussianv2: Efficient and geometri- cally accurate reconstruction for large-scale scenes.arXiv preprint arXiv:2411.00771, 2024. 2, 6, 5

work page arXiv 2024
[34]

Wonder3d: Sin- gle image to 3d using cross-domain diffusion

Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Sin- gle image to 3d using cross-domain diffusion. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9970–9980, 2024. 3

work page 2024
[35]

Marching cubes: A high resolution 3d surface construction algorithm

William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3d surface construction algorithm. InSem- inal graphics: pioneering efforts that shaped the field, pages 347–353. 1998. 4, 8

work page 1998
[36]

Sat-nerf: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using rpc cameras

Roger Mar ´ı, Gabriele Facciolo, and Thibaud Ehret. Sat-nerf: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using rpc cameras. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1311–1321, 2022. 3, 6

work page 2022
[37]

Multi- date earth observation nerf: The detail is in the shadows

Roger Mar ´ı, Gabriele Facciolo, and Thibaud Ehret. Multi- date earth observation nerf: The detail is in the shadows. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 2035–2045, 2023. 3, 6

work page 2023
[38]

Realfusion: 360° reconstruction of any object from a single image

Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. Realfusion: 360° reconstruction of any object from a single image. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8446–8455, 2023. 3

work page 2023
[39]

Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021. 2

work page 2021
[40]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205,

work page
[42]

Rasterized edge gradients: Handling discontinuities differentiably

Stanislav Pidhorskyi, Tomas Simon, Gabriel Schwartz, He Wen, Yaser Sheikh, and Jason Saragih. Rasterized edge gradients: Handling discontinuities differentiably. In European Conference on Computer Vision, pages 335–352. Springer,

work page
[43]

Example datasets - pix4dmatic. https://support.pix4d.com/hc/en-us/articles/360048957691, 2025

Pix4D. Example datasets - pix4dmatic. https://support.pix4d.com/hc/en-us/articles/360048957691, 2025. Accessed: 2025-11. 6

work page 2025
[44]

DreamFusion: Text-to-3D using 2D Diffusion

Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[45]

Rongjun Qin. Rpc stereo processor (rsp)–a software package for digital surface model and orthophoto generation from satellite stereo imagery. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 3:77–82, 2016. 3

work page 2016
[46]

Urban radiance fields

Konstantinos Rematas, Andrew Liu, Pratul P Srinivasan, Jonathan T Barron, Andrea Tagliasacchi, Thomas Funkhouser, and Vittorio Ferrari. Urban radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12932–12942, 2022. 2

work page 2022
[47]

Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians. arXiv preprint arXiv:2403.17898, 2024

Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and Bo Dai. Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians. arXiv preprint arXiv:2403.17898, 2024. 2

work page arXiv 2024
[48]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3

work page 2022
[49]

Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022. 3

work page 2022
[50]

Structure-from-motion revisited

Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016. 2

work page 2016
[51]

Pixelwise view selection for unstructured multi-view stereo

Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European conference on computer vision, pages 501–518. Springer, 2016. 2

work page 2016
[52]

Flexible isosurface extraction for gradient-based mesh optimization. ACM Transactions on Graphics (TOG), 42(4):1–16, 2023

Tianchang Shen, Jacob Munkberg, Jon Hasselgren, Kangxue Yin, Zian Wang, Wenzheng Chen, Zan Gojcic, Sanja Fidler, Nicholas Sharp, and Jun Gao. Flexible isosurface extraction for gradient-based mesh optimization. ACM Transactions on Graphics (TOG), 42(4):1–16, 2023. 2, 4, 8, 1, 3

work page 2023
[53]

DINOv3

Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

City-on-web: Real-time neural rendering of large-scale scenes on the web

Kaiwen Song, Xiaoyi Zeng, Chenqu Ren, and Juyong Zhang. City-on-web: Real-time neural rendering of large-scale scenes on the web. In European Conference on Computer Vision, pages 385–402. Springer, 2024. 2

work page 2024
[55]

Block-nerf: Scalable large scene neural view synthesis

Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron, and Henrik Kretzschmar. Block-nerf: Scalable large scene neural view synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8248–8258, 2022. 2

work page 2022
[56]

Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels. arXiv preprint arXiv:2507.21809, 2025

HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wenhuan Li, Sheng Zhang, et al. Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels. arXiv preprint arXiv:2507.21809, 2025. 3

work page arXiv 2025
[57]

Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs

Haithem Turki, Deva Ramanan, and Mahadev Satyanarayanan. Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12922–12931, 2022. 2

work page 2022
[58]

Suds: Scalable urban dynamic scenes

Haithem Turki, Jason Y Zhang, Francesco Ferroni, and Deva Ramanan. Suds: Scalable urban dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12375–12385, 2023. 2

work page 2023
[59]

Difix3d+: Improving 3d reconstructions with single-step diffusion models

Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. Difix3d+: Improving 3d reconstructions with single-step diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26024–26035, 2025. 3

work page 2025
[60]

Grid-guided neural radiance fields for large urban scenes

Linning Xu, Yuanbo Xiangli, Sida Peng, Xingang Pan, Nanxuan Zhao, Christian Theobalt, Bo Dai, and Dahua Lin. Grid-guided neural radiance fields for large urban scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8296–8306, 2023. 2

work page 2023
[61]

Shuting Yang, Hao Chen, Fachuan He, Wen Chen, Ting Chen, and Jianjun He. A learning-based dual-scale enhanced confidence for dsm fusion in 3d reconstruction of multi-view satellite images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025. 3

work page 2025
[62]

Gsfixer: Improving 3d gaussian splatting with reference-guided video diffusion priors. arXiv preprint arXiv:2508.09667, 2025

Xingyilang Yin, Qi Zhang, Jiahao Chang, Ying Feng, Qingnan Fan, Xi Yang, Chi-Man Pun, Huaqi Zhang, and Xiaodong Cun. Gsfixer: Improving 3d gaussian splatting with reference-guided video diffusion priors. arXiv preprint arXiv:2508.09667, 2025. 3

work page arXiv 2025
[63]

Wonderworld: Interactive 3d scene generation from a single image

Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T Freeman, and Jiajun Wu. Wonderworld: Interactive 3d scene generation from a single image. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5916–5926, 2025. 3, 5

work page 2025
[64]

Mip-splatting: Alias-free 3d gaussian splatting

Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19447–19456, 2024. 6, 5

work page 2024
[65]

Leveraging vision reconstruction pipelines for satellite imagery

Kai Zhang, Noah Snavely, and Jin Sun. Leveraging vision reconstruction pipelines for satellite imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019. 3, 1, 4

work page 2019
[66]

Sparsesat-nerf: Dense depth supervised neural radiance fields for sparse satellite images. arXiv preprint arXiv:2309.00277, 2023

Lulin Zhang and Ewelina Rupnik. Sparsesat-nerf: Dense depth supervised neural radiance fields for sparse satellite images. arXiv preprint arXiv:2309.00277, 2023. 3, 6

work page arXiv 2023
[67]

Satensorf: Fast satellite tensorial radiance field for multidate satellite imagery of large size. IEEE Transactions on Geoscience and Remote Sensing, 62:1–15, 2024

Tongtong Zhang, Yu Zhou, Yuanxiang Li, and Xian Wei. Satensorf: Fast satellite tensorial radiance field for multidate satellite imagery of large size. IEEE Transactions on Geoscience and Remote Sensing, 62:1–15, 2024. 3, 6, 7, 22

work page 2024
[68]

Efficient large-scale scene representation with a hybrid of high-resolution grid and plane features. Pattern Recognition, 158:111001, 2025

Yuqi Zhang, Guanying Chen, and Shuguang Cui. Efficient large-scale scene representation with a hybrid of high-resolution grid and plane features. Pattern Recognition, 158:111001, 2025. 2

work page 2025
[69]

On scaling up 3d gaussian splatting training

Hexu Zhao, Haoyang Weng, Daohan Lu, Ang Li, Jinyang Li, Aurojit Panda, and Saining Xie. On scaling up 3d gaussian splatting training. In European Conference on Computer Vision, pages 14–36. Springer, 2024. 2

work page 2024
[70]

A review of 3d reconstruction from high-resolution urban satellite images. International Journal of Remote Sensing, 44(2):713–748, 2023

Li Zhao, Haiyan Wang, Yi Zhu, and Mei Song. A review of 3d reconstruction from high-resolution urban satellite images. International Journal of Remote Sensing, 44(2):713–748, 2023. 3

work page 2023
[71]

Switch-nerf: Learning scene decomposition with mixture of experts for large-scale neural radiance fields

Zhenxing Mi and Dan Xu. Switch-nerf: Learning scene decomposition with mixture of experts for large-scale neural radiance fields. In The Eleventh International Conference on Learning Representations, 2022. 2

work page 2022

From Orbit to Ground: Generative City Photogrammetry from Extreme Off-Nadir Satellite Images

Supplementary Material

A. Appendix Overview

In this ...