Pith · machine review for the scientific record

arxiv: 2604.02003 · v2 · submitted 2026-04-02 · 💻 cs.CV

Recognition: no theorem link

ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords aerial to ground reconstruction · Gaussian splatting · diffusion guidance · progressive refinement · view synthesis · 3D geometry consistency · geometry-aware attention · extreme viewpoint change

The pith

ProDiG progressively refines aerial Gaussian representations into ground-level 3D views by synthesizing intermediate-altitude views with diffusion guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ProDiG as a way to generate realistic ground-level renderings and coherent 3D geometry starting only from aerial images. It works by creating successive intermediate-height views and updating the underlying Gaussian splats at each step, using attention that respects epipolar geometry and a module that scales Gaussians according to camera distance. Existing single-step or post-hoc refinement approaches either produce inconsistent geometry or require ground-truth images at multiple heights, which are scarce. A reader would care because this removes the need for multi-altitude capture campaigns and could support site modeling from drone footage alone.

Core claim

ProDiG is a diffusion-guided framework that progressively transforms aerial 3D representations toward ground-level fidelity. It synthesizes intermediate-altitude views and refines the Gaussian representation at each stage using a geometry-aware causal attention module that injects epipolar structure into reference-view diffusion, plus a distance-adaptive Gaussian module that dynamically adjusts scale and opacity based on camera distance.
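The phrase "injects epipolar structure" admits a concrete reading: restrict cross-view attention to pixel pairs that satisfy the two-view epipolar constraint. Below is a minimal sketch of such a mask under a standard pinhole-camera setup; the function names and the pixel tolerance are illustrative stand-ins, not the paper's API.

```python
import numpy as np

def fundamental_matrix(K_ref, K_tgt, R, t):
    """F relating reference and target views, x_tgt^T F x_ref = 0,
    with the target camera at [R | t] relative to the reference."""
    t_cross = np.array([[0.0, -t[2], t[1]],
                        [t[2], 0.0, -t[0]],
                        [-t[1], t[0], 0.0]])
    return np.linalg.inv(K_tgt).T @ t_cross @ R @ np.linalg.inv(K_ref)

def epipolar_attention_mask(pix_ref, pix_tgt, F, tol_px=2.0):
    """Boolean [N_tgt, N_ref] mask: target pixel i may attend to reference
    pixel j only if i lies within tol_px of the epipolar line F @ x_j.
    tol_px is an assumed hyperparameter, not from the paper."""
    ref_h = np.concatenate([pix_ref, np.ones((len(pix_ref), 1))], axis=1)
    tgt_h = np.concatenate([pix_tgt, np.ones((len(pix_tgt), 1))], axis=1)
    lines = ref_h @ F.T                       # one epipolar line per ref pixel
    # Point-to-line distance in pixels, for every (target, reference) pair.
    dist = np.abs(tgt_h @ lines.T) / np.linalg.norm(lines[:, :2], axis=1)
    return dist < tol_px
```

In a diffusion block, such a mask would typically be applied by setting attention logits to negative infinity wherever it is False, so reference-view tokens off the epipolar line contribute nothing.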

What carries the argument

Progressive refinement loop that synthesizes intermediate views and applies geometry-aware causal attention together with distance-adaptive Gaussian scaling to maintain consistency across large viewpoint gaps.
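The distance-adaptive module is characterized only as adjusting scale and opacity with camera distance. One plausible law, a guess rather than the paper's formulation, keeps each splat's projected footprint roughly constant by scaling world-space extent with camera distance and fading Gaussians rendered far from the distance at which they were fit.

```python
import numpy as np

def distance_adaptive_adjust(means, scales, opacities, cam_center,
                             ref_dist, falloff=0.5):
    """Illustrative distance-adaptive update. `ref_dist` is the (assumed)
    distance at which each Gaussian was originally fit; the linear scale
    law and the opacity falloff are stand-ins for the paper's module."""
    d = np.linalg.norm(means - cam_center, axis=1)      # (N,) distances
    ratio = d / ref_dist                                # < 1 when closer
    # Shrink world-space extent as the camera approaches, so aerial-fit
    # splats do not balloon into blobs at ground-level viewing distance.
    new_scales = scales * ratio[:, None]                # (N, 3)
    # Damp opacity for Gaussians far from their training distance regime.
    new_opacities = opacities * np.exp(-falloff * np.abs(np.log(ratio)))
    return new_scales, new_opacities
```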

If this is right

  • Produces ground-level renderings whose visual quality and geometric consistency exceed those of prior single-stage or post-hoc methods.
  • Maintains stable reconstruction when viewpoint change is extreme, without requiring any additional ground-truth viewpoints.
  • Enables coherent 3D site models from aerial-only input on both synthetic and real-world scenes.
  • Supports applications that need ground-level fidelity from drone or satellite imagery alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same progressive intermediate-view strategy might reduce domain gaps in other 3D tasks such as indoor-to-outdoor or day-to-night translation.
  • If the distance-adaptive scaling generalizes, it could stabilize Gaussian splatting under arbitrary camera trajectories beyond aerial-ground pairs.
  • The geometry-aware attention could be tested as a drop-in module for other diffusion-based view-synthesis pipelines to improve epipolar consistency.

Load-bearing premise

Synthesizing and refining through intermediate altitudes will reliably close the gap between aerial and ground viewpoints even when no real ground-truth images exist at any lower height.

What would settle it

Run ProDiG on a dataset containing paired aerial and actual ground-level photographs of the same sites, then measure whether the generated ground renderings match the real photographs in both pixel appearance and 3D geometric alignment within a chosen error threshold.
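Concretely, that test reduces to two thresholded comparisons per site: a pixel-appearance score between rendered and real ground views, and a geometric score between rendered depth and reference depth. A minimal numpy sketch follows; the metric choices (PSNR, mean absolute depth error) and the thresholds are assumed stand-ins for whatever protocol is fixed in advance.

```python
import numpy as np

def psnr(rendered, photo, max_val=1.0):
    """Pixel-appearance agreement between a rendering and a real photo."""
    mse = np.mean((rendered - photo) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def depth_error(depth_rendered, depth_ref, valid_mask):
    """Mean absolute depth disagreement over pixels with reference depth."""
    return np.mean(np.abs(depth_rendered[valid_mask] - depth_ref[valid_mask]))

def passes(rendered, photo, depth_rendered, depth_ref, valid_mask,
           psnr_min=20.0, depth_tol=0.5):
    """Both appearance and geometry must clear pre-registered thresholds;
    the 20 dB and 0.5 m values here are placeholders, not from the paper."""
    return (psnr(rendered, photo) >= psnr_min and
            depth_error(depth_rendered, depth_ref, valid_mask) <= depth_tol)
```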

Figures

Figures reproduced from arXiv: 2604.02003 by Sirshapan Mitra, Yogesh S. Rawat.

Figure 1: Overview of ProDiG. (a) Our framework reconstructs a complete 3D scene using only aerial images. A large distribution shift exists between the aerial training images and the ground-level query images. During evaluation, we render novel views at ground-level camera poses and compare them against ground-truth images. (b) In our Distance-Adaptive Gaussian Splatting module, each Gaussian is dynamically scaled …

Figure 2: Overview of aeroFix. (left) Our diffusion model is fine-tuned on aerial imagery using LoRA. The noisy novel view is fixed using the reference view to produce the fixed novel image. In the diffusion block, the relative camera pose difference is injected into the timestep embedding of the noisy image to encode geometric variation across viewpoints. We additionally include Plücker ray embeddings before the attention mi…

Figure 3: Effectiveness of aeroFix. Comparison of aerial image refinement between Difix3D+ [38] and our aeroFix model. The noisy novel views are outlined in orange, the reference images in green, and the refined (fixed) novel images in pink. Difix3D+ tends to copy content from the reference view when the viewpoint difference is large, leading to inconsistencies and artifacts. In contrast, aeroFix effectively preserve…

Figure 4: Qualitative analysis of ProDiG (ours). Comparison of our method with existing baselines on aerial-to-ground reconstruction. Gaussian Splatting [16] struggles to render complete scenes due to the absence of ground-level viewpoints, while Difix3D+ [38] exhibits noisy artifacts and hallucinated structures. In contrast, ProDiG (ours) produces geometrically consistent and visually coherent reconstructions with f…

Figure 5: Generalization across varying altitudes. We evaluate our method on the AerialMegaDepth [36] dataset, which contains sites captured at diverse altitude ranges.

Figure 6: Ablations. (top) Comparison of different progressive methods. (bottom) Effectiveness of the Distance-Adaptive Gaussian module. 7k denotes evaluation at 7k iterations, after initial training. Combining intermediate-altitude synthesis, geometry-aware causal attention, and distance-adaptive Gaussian refinement, ProDiG produces stable, geometrically consistent 3D representations and realistic ground-level rend…
original abstract

Generating ground-level views and coherent 3D site models from aerial-only imagery is challenging due to extreme viewpoint changes, missing intermediate observations, and large scale variations. Existing methods either refine renderings post-hoc, often producing geometrically inconsistent results, or rely on multi-altitude ground-truth, which is rarely available. Gaussian Splatting and diffusion-based refinements improve fidelity under small variations but fail under wide aerial-to-ground gaps. To address these limitations, we introduce ProDiG (Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction), a diffusion-guided framework that progressively transforms aerial 3D representations toward ground-level fidelity. ProDiG synthesizes intermediate-altitude views and refines the Gaussian representation at each stage using a geometry-aware causal attention module that injects epipolar structure into reference-view diffusion. A distance-adaptive Gaussian module dynamically adjusts Gaussian scale and opacity based on camera distance, ensuring stable reconstruction across large viewpoint gaps. Together, these components enable progressive, geometrically grounded refinement without requiring additional ground-truth viewpoints. Extensive experiments on synthetic and real-world datasets demonstrate that ProDiG produces visually realistic ground-level renderings and coherent 3D geometry, significantly outperforming existing approaches in terms of visual quality, geometric consistency, and robustness to extreme viewpoint changes. Project Page: https://sirsh07.github.io/research/prodig

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents ProDiG, a progressive diffusion-guided Gaussian splatting method for reconstructing ground-level views and 3D models from aerial imagery. The approach synthesizes intermediate-altitude views and iteratively refines the Gaussian representation using a geometry-aware causal attention module that incorporates epipolar geometry and a distance-adaptive Gaussian module that adjusts scale and opacity based on camera distance. It claims to achieve superior visual quality, geometric consistency, and robustness to extreme viewpoint changes on both synthetic and real datasets without requiring additional ground-truth ground-level viewpoints.

Significance. If the empirical results and consistency claims are substantiated, this work would represent a meaningful advance in novel view synthesis for large viewpoint gaps, with potential applications in aerial photogrammetry, virtual tourism, and disaster assessment. The progressive refinement strategy combined with explicit geometric constraints in the diffusion process offers a practical solution where prior methods either require multi-altitude data or produce inconsistent geometry.

major comments (1)
  1. [Method description of progressive refinement] In the description of the geometry-aware causal attention module and distance-adaptive Gaussian module: the paper states that these components inject epipolar structure and dynamically adjust scale/opacity to ensure stable reconstruction, but provides no derivation, bound, or explicit consistency loss (e.g., cycle-consistency or bundle-adjustment term) showing that cumulative geometric drift is prevented across multiple refinement stages. This directly bears on the central claim that the method operates without any ground-truth viewpoints, as diffusion priors could introduce hallucinations not constrained by the initial aerial 3D structure.
minor comments (1)
  1. [Abstract] The abstract asserts outperformance in visual quality and geometric consistency but omits any quantitative metrics, ablation results, or experimental setup details (e.g., dataset names, baseline methods, or evaluation protocols), which would strengthen the summary for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on geometric consistency in our progressive refinement pipeline. We address the single major comment below and outline targeted revisions to strengthen the manuscript.

point-by-point responses
  1. Referee: [Method description of progressive refinement] In the description of the geometry-aware causal attention module and distance-adaptive Gaussian module: the paper states that these components inject epipolar structure and dynamically adjust scale/opacity to ensure stable reconstruction, but provides no derivation, bound, or explicit consistency loss (e.g., cycle-consistency or bundle-adjustment term) showing that cumulative geometric drift is prevented across multiple refinement stages. This directly bears on the central claim that the method operates without any ground-truth viewpoints, as diffusion priors could introduce hallucinations not constrained by the initial aerial 3D structure.

    Authors: We acknowledge that the current manuscript lacks a formal derivation, theoretical bound, or explicit auxiliary loss (such as cycle-consistency) to prove the absence of cumulative drift. The geometry-aware causal attention module enforces epipolar constraints by restricting attention to geometrically corresponding rays derived from the initial aerial Gaussian splats, while the distance-adaptive module modulates Gaussian parameters to preserve scale consistency with camera distance. These mechanisms are intended to ground each diffusion step in the original 3D structure, limiting hallucinations. Our experiments on synthetic data with known ground-truth geometry show low drift in multi-stage metrics (e.g., PSNR and depth error remain stable across refinement stages). We agree a more explicit analysis would improve rigor. In revision we will (1) add a dedicated paragraph deriving the attention mask from epipolar geometry and (2) include an ablation quantifying drift over refinement stages. revision: partial
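The referee's drift worry is also measurable without theory: render a fixed probe view after every refinement stage and track how far its depth map wanders from the stage-0 geometry. A small sketch of that audit, assuming per-stage depth renders are available; nothing here comes from the paper itself.

```python
import numpy as np

def cumulative_drift(depths_per_stage, valid_mask):
    """Mean absolute depth deviation of each refinement stage from the
    initial aerial-only reconstruction (stage 0), at a fixed probe pose.
    A flat curve suggests the diffusion steps stay anchored to the initial
    3D structure; a growing curve is exactly the drift at issue."""
    base = depths_per_stage[0]
    return [np.mean(np.abs(d[valid_mask] - base[valid_mask]))
            for d in depths_per_stage[1:]]
```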

Circularity Check

0 steps flagged

No circularity detected; method is an algorithmic pipeline without load-bearing derivations or self-referential reductions

full rationale

The paper describes ProDiG as a progressive diffusion-guided Gaussian splatting pipeline that synthesizes intermediate views and refines representations via geometry-aware causal attention and distance-adaptive Gaussians. No equations, derivations, or fitted parameters are presented that reduce by construction to the inputs. The central claims rest on the described components and experimental validation on synthetic and real-world datasets rather than on any self-definition, fitted-input prediction, or self-citation chain. The approach is self-contained as an engineering framework without mathematical steps that equate outputs to inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on standard assumptions from Gaussian Splatting and diffusion models plus two newly introduced modules whose effectiveness is asserted but not independently verified in the abstract.

axioms (2)
  • domain assumption Gaussian Splatting can represent 3D scenes from images
    Standard assumption in the field invoked for the base representation
  • domain assumption Diffusion models can synthesize coherent novel views when guided
    Relies on prior diffusion capabilities for view synthesis
invented entities (2)
  • geometry-aware causal attention module no independent evidence
    purpose: injects epipolar structure into reference-view diffusion
    New component introduced to maintain geometric consistency during progressive refinement
  • distance-adaptive Gaussian module no independent evidence
    purpose: dynamically adjusts Gaussian scale and opacity based on camera distance
    New component introduced to ensure stable reconstruction across large viewpoint gaps

pith-pipeline@v0.9.0 · 5542 in / 1397 out tokens · 33749 ms · 2026-05-13T21:58:15.287025+00:00 · methodology


Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 1 internal anchor

  1. [1] Myron Brown, Michael Chan, and Michael Twardowski. Wriva public data, 2024.
  2. [2] Eric Ming Chen, Sidhanth Holalkere, Ruyu Yan, Kai Zhang, and Abe Davis. Ray conditioning: Trading photo-consistency for photo-realism in multi-view image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23242–23251, 2023.
  3. [3] Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. MVSplat: Efficient 3D Gaussian splatting from sparse multi-view images. In European Conference on Computer Vision, pages 370–386. Springer, 2024.
  4. [4] Yuedong Chen, Chuanxia Zheng, Haofei Xu, Bohan Zhuang, Andrea Vedaldi, Tat-Jen Cham, and Jianfei Cai. MVSplat360: Feed-forward 360 scene synthesis from sparse views. Advances in Neural Information Processing Systems, 37:107064–107086, 2024.
  5. [5] Tobias Fischer, Jonas Kulhanek, Samuel Rota Bulo, Lorenzo Porzi, Marc Pollefeys, and Peter Kontschieder. Dynamic 3D Gaussian fields for urban areas. arXiv preprint arXiv:2406.03175, 2024.
  6. [6] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
  7. [7] Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. DreamSim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344, 2023.
  8. [8] Ruiyuan Gao, Kai Chen, Zhihao Li, Lanqing Hong, Zhenguo Li, and Qiang Xu. MagicDrive3D: Controllable 3D generation for any-view rendering in street scenes. arXiv preprint arXiv:2405.14475, 2024.
  9. [9] Zhiyuan Gao, Wenbin Teng, Gonglin Chen, Jinsen Wu, Ningli Xu, Rongjun Qin, Andrew Feng, and Yajie Zhao. SkyEyes: Ground roaming using aerial view images. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3045–3054. IEEE, 2025.
  10. [10] Yujin Ham, Mateusz Michalkiewicz, and Guha Balakrishnan. DRAGON: Drone and ground Gaussian splatting for 3D building reconstruction. In 2024 IEEE International Conference on Computational Photography (ICCP), pages 1–12. IEEE, 2024.
  11. [11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  12. [12] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2D Gaussian splatting for geometrically accurate radiance fields. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.
  13. [13] Zehuan Huang, Hao Wen, Junting Dong, Yaohui Wang, Yangguang Li, Xinyuan Chen, Yan-Pei Cao, Ding Liang, Yu Qiao, Bo Dai, et al. EpiDiff: Enhancing multi-view synthesis via localized epipolar-constrained diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9784–9794, 2024.
  14. [14] Lihan Jiang, Kerui Ren, Mulin Yu, Linning Xu, Junting Dong, Tao Lu, Feng Zhao, Dahua Lin, and Bo Dai. Horizon-GS: Unified 3D Gaussian splatting for large-scale aerial-to-ground scenes. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26789–26799, 2025.
  15. [15] Yash Kant, Aliaksandr Siarohin, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, and Igor Gilitschenski. SPAD: Spatially aware multi-view diffusers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10026–10038, 2024.
  16. [16] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):139, 2023.
  17. [17] Bernhard Kerbl, Andreas Meuleman, Georgios Kopanas, Michael Wimmer, Alexandre Lanvin, and George Drettakis. A hierarchical 3D Gaussian representation for real-time rendering of very large datasets. ACM Transactions on Graphics (TOG), 43(4):1–15, 2024.
  18. [18] Shakiba Kheradmand, Daniel Rebain, Gopal Sharma, Weiwei Sun, Yang-Che Tseng, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. 3D Gaussian splatting as Markov chain Monte Carlo. Advances in Neural Information Processing Systems, 37:80965–80986, 2024.
  19. [19] Jonas Kulhanek, Songyou Peng, Zuzana Kukelova, Marc Pollefeys, and Torsten Sattler. WildGaussians: 3D Gaussian splatting in the wild. arXiv preprint arXiv:2407.08447, 2024.
  20. [20] Jie-Ying Lee, Yi-Ruei Liu, Shr-Ruei Tsai, Wei-Cheng Chang, Chung-Ho Wu, Jiewen Chan, Zhenjun Zhao, Chieh Hubert Lin, and Yu-Lun Liu. SkyFall-GS: Synthesizing immersive 3D urban scenes from satellite imagery. arXiv preprint arXiv:2510.15869, 2025.
  21. [21] Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. MatrixCity: A large-scale city dataset for city-scale neural rendering and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3205–3215, 2023.
  22. [22] Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Yu Qiao, Wanli Ouyang, and Chao Dong. DiffBIR: Toward blind image restoration with generative diffusion prior. In European Conference on Computer Vision, pages 430–448. Springer, 2024.
  23. [23] Yang Liu, Chuanchen Luo, Lue Fan, Naiyan Wang, Junran Peng, and Zhaoxiang Zhang. CityGaussian: Real-time high-quality large-scale scene rendering with Gaussians. In European Conference on Computer Vision, pages 265–282. Springer, 2024.
  24. [24] Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-GS: Structured 3D Gaussians for view-adaptive rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20654–20664, 2024.
  25. [25] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  26. [26] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4296–4304, 2024.
  27. [27] Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, and Jun-Yan Zhu. One-step image translation with text-to-image models. arXiv preprint arXiv:2403.12036, 2024.
  28. [28] Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and Bo Dai. Octree-GS: Towards consistent real-time rendering with LOD-structured 3D Gaussians. arXiv preprint arXiv:2403.17898, 2024.
  29. [29] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  30. [30] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
  31. [31] Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. Advances in Neural Information Processing Systems, 34:19313–19325, 2021.
  32. [32] Mohammed Suhail, Carlos Esteves, Leonid Sigal, and Ameesh Makadia. Generalizable patch-based neural rendering. In European Conference on Computer Vision, pages 156–174. Springer, 2022.
  33. [33] Mohammed Suhail, Carlos Esteves, Leonid Sigal, and Ameesh Makadia. Light field neural rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8269–8279, 2022.
  34. [34] Jiadong Tang, Yu Gao, Dianyi Yang, Liqi Yan, Yufeng Yue, and Yi Yang. DroneSplat: 3D Gaussian splatting for robust 3D reconstruction from in-the-wild drone imagery. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 833–843, 2025.
  35. [35] Haithem Turki, Deva Ramanan, and Mahadev Satyanarayanan. Mega-NeRF: Scalable construction of large-scale NeRFs for virtual fly-throughs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12922–12931, 2022.
  36. [36] Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, and Shubham Tulsiani. AerialMegaDepth: Learning aerial-ground reconstruction and view synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21674–21684, 2025.
  37. [37] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  38. [38] Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. Difix3D+: Improving 3D reconstructions with single-step diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26024–26035, 2025.
  39. [39] Yuanbo Xiangli, Linning Xu, Xingang Pan, Nanxuan Zhao, Anyi Rao, Christian Theobalt, Bo Dai, and Dahua Lin. BungeeNeRF: Progressive neural radiance field for extreme multi-scale scene rendering. In European Conference on Computer Vision, pages 106–122. Springer, 2022.
  40. [40] Butian Xiong, Nanjun Zheng, Junhua Liu, and Zhen Li. GauU-Scene V2: Assessing the reliability of image-based metrics with expansive lidar image dataset using 3DGS and NeRF. arXiv preprint arXiv:2404.04880, 2024.
  41. [41] Jiacong Xu, Yiqun Mei, and Vishal Patel. Wild-GS: Real-time novel view synthesis from unconstrained photo collections. Advances in Neural Information Processing Systems, 37:103334–103355, 2024.
  42. [42] Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street Gaussians: Modeling dynamic urban scenes with Gaussian splatting. In European Conference on Computer Vision, pages 156–173. Springer, 2024.
  43. [43] Vickie Ye, Ruilong Li, Justin Kerr, Matias Turkulainen, Brent Yi, Zhuoyang Pan, Otto Seiskari, Jianbo Ye, Jeffrey Hu, Matthew Tancik, and Angjoo Kanazawa. gsplat: An open-source library for Gaussian splatting. Journal of Machine Learning Research, 26(34):1–17, 2025.
  44. [44] Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. ViewCrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024.
  45. [45] Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-Splatting: Alias-free 3D Gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19447–19456, 2024.
  46. [46] Chenhao Zhang, Yuanping Cao, and Lei Zhang. CrossView-GS: Cross-view Gaussian splatting for large-scale scene reconstruction. arXiv preprint arXiv:2501.01695, 2025.
  47. [47] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  48. [48] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.