Pith · machine review for the scientific record

arxiv: 2604.02003 · v2 · submitted 2026-04-02 · 💻 cs.CV

Recognition: no theorem link

ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords aerial to ground reconstruction · Gaussian splatting · diffusion guidance · progressive refinement · view synthesis · 3D geometry consistency · geometry-aware attention · extreme viewpoint change

The pith

ProDiG progressively refines aerial Gaussian representations into ground-level 3D views by synthesizing intermediate-altitude views with diffusion guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ProDiG as a way to generate realistic ground-level renderings and coherent 3D geometry starting only from aerial images. It works by creating successive intermediate-height views and updating the underlying Gaussian splats at each step, using attention that respects epipolar geometry and a module that scales Gaussians according to camera distance. Existing single-step or post-hoc refinement approaches either produce inconsistent geometry or require ground-truth images at multiple heights, which are scarce. A reader would care because this removes the need for multi-altitude capture campaigns and could support site modeling from drone footage alone.

Core claim

ProDiG is a diffusion-guided framework that progressively transforms aerial 3D representations toward ground-level fidelity. It synthesizes intermediate-altitude views and refines the Gaussian representation at each stage using a geometry-aware causal attention module that injects epipolar structure into reference-view diffusion, plus a distance-adaptive Gaussian module that dynamically adjusts scale and opacity based on camera distance.
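The phrase "injects epipolar structure" admits a concrete reading: restrict cross-view attention to pixel pairs that satisfy the two-view epipolar constraint. Below is a minimal sketch of such a mask under a standard pinhole-camera setup; the function names and the pixel tolerance are illustrative stand-ins, not the paper's API.

```python
import numpy as np

def fundamental_matrix(K_ref, K_tgt, R, t):
    """F relating reference and target views, x_tgt^T F x_ref = 0,
    with the target camera at [R | t] relative to the reference."""
    t_cross = np.array([[0.0, -t[2], t[1]],
                        [t[2], 0.0, -t[0]],
                        [-t[1], t[0], 0.0]])
    return np.linalg.inv(K_tgt).T @ t_cross @ R @ np.linalg.inv(K_ref)

def epipolar_attention_mask(pix_ref, pix_tgt, F, tol_px=2.0):
    """Boolean [N_tgt, N_ref] mask: target pixel i may attend to reference
    pixel j only if i lies within tol_px of the epipolar line F @ x_j.
    tol_px is an assumed hyperparameter, not from the paper."""
    ref_h = np.concatenate([pix_ref, np.ones((len(pix_ref), 1))], axis=1)
    tgt_h = np.concatenate([pix_tgt, np.ones((len(pix_tgt), 1))], axis=1)
    lines = ref_h @ F.T                       # one epipolar line per ref pixel
    # Point-to-line distance in pixels, for every (target, reference) pair.
    dist = np.abs(tgt_h @ lines.T) / np.linalg.norm(lines[:, :2], axis=1)
    return dist < tol_px
```

In a diffusion block, such a mask would typically be applied by setting attention logits to negative infinity wherever it is False, so reference-view tokens off the epipolar line contribute nothing.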

What carries the argument

Progressive refinement loop that synthesizes intermediate views and applies geometry-aware causal attention together with distance-adaptive Gaussian scaling to maintain consistency across large viewpoint gaps.
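The distance-adaptive module is characterized only as adjusting scale and opacity with camera distance. One plausible law, a guess rather than the paper's formulation, keeps each splat's projected footprint roughly constant by scaling world-space extent with camera distance and fading Gaussians rendered far from the distance at which they were fit.

```python
import numpy as np

def distance_adaptive_adjust(means, scales, opacities, cam_center,
                             ref_dist, falloff=0.5):
    """Illustrative distance-adaptive update. `ref_dist` is the (assumed)
    distance at which each Gaussian was originally fit; the linear scale
    law and the opacity falloff are stand-ins for the paper's module."""
    d = np.linalg.norm(means - cam_center, axis=1)      # (N,) distances
    ratio = d / ref_dist                                # < 1 when closer
    # Shrink world-space extent as the camera approaches, so aerial-fit
    # splats do not balloon into blobs at ground-level viewing distance.
    new_scales = scales * ratio[:, None]                # (N, 3)
    # Damp opacity for Gaussians far from their training distance regime.
    new_opacities = opacities * np.exp(-falloff * np.abs(np.log(ratio)))
    return new_scales, new_opacities
```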

If this is right

  • Produces ground-level renderings whose visual quality and geometric consistency exceed those of prior single-stage or post-hoc methods.
  • Maintains stable reconstruction when viewpoint change is extreme, without requiring any additional ground-truth viewpoints.
  • Enables coherent 3D site models from aerial-only input on both synthetic and real-world scenes.
  • Supports applications that need ground-level fidelity from drone or satellite imagery alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same progressive intermediate-view strategy might reduce domain gaps in other 3D tasks such as indoor-to-outdoor or day-to-night translation.
  • If the distance-adaptive scaling generalizes, it could stabilize Gaussian splatting under arbitrary camera trajectories beyond aerial-ground pairs.
  • The geometry-aware attention could be tested as a drop-in module for other diffusion-based view-synthesis pipelines to improve epipolar consistency.

Load-bearing premise

Synthesizing and refining through intermediate altitudes will reliably close the gap between aerial and ground viewpoints even when no real ground-truth images exist at any lower height.

What would settle it

Run ProDiG on a dataset containing paired aerial and actual ground-level photographs of the same sites, then measure whether the generated ground renderings match the real photographs in both pixel appearance and 3D geometric alignment within a chosen error threshold.
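Concretely, that test reduces to two thresholded comparisons per site: a pixel-appearance score between rendered and real ground views, and a geometric score between rendered depth and reference depth. A minimal numpy sketch follows; the metric choices (PSNR, mean absolute depth error) and the thresholds are assumed stand-ins for whatever protocol is fixed in advance.

```python
import numpy as np

def psnr(rendered, photo, max_val=1.0):
    """Pixel-appearance agreement between a rendering and a real photo."""
    mse = np.mean((rendered - photo) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def depth_error(depth_rendered, depth_ref, valid_mask):
    """Mean absolute depth disagreement over pixels with reference depth."""
    return np.mean(np.abs(depth_rendered[valid_mask] - depth_ref[valid_mask]))

def passes(rendered, photo, depth_rendered, depth_ref, valid_mask,
           psnr_min=20.0, depth_tol=0.5):
    """Both appearance and geometry must clear pre-registered thresholds;
    the 20 dB and 0.5 m values here are placeholders, not from the paper."""
    return (psnr(rendered, photo) >= psnr_min and
            depth_error(depth_rendered, depth_ref, valid_mask) <= depth_tol)
```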

Figures

Figures reproduced from arXiv: 2604.02003 by Sirshapan Mitra, Yogesh S. Rawat.

Figure 1: Overview of ProDiG. (a) Our framework reconstructs a complete 3D scene using only aerial images. A large distribution shift exists between the aerial training images and the ground-level query images. During evaluation, we render novel views at ground-level camera poses and compare them against ground-truth images. (b) In our Distance-Adaptive Gaussian Splatting module, each Gaussian is dynamically scaled …

Figure 2: Overview of aeroFix. (left) Our diffusion model is fine-tuned on aerial imagery using LoRA. The noisy novel view is fixed using the reference view to produce the fixed novel image. In the diffusion block, the relative camera pose difference is injected into the timestep embedding of the noisy image to encode geometric variation across viewpoints. We additionally include Plücker ray embeddings before the attention mi…

Figure 3: Effectiveness of aeroFix. Comparison of aerial image refinement between Difix3D+ [38] and our aeroFix model. The noisy novel views are outlined in orange, the reference images in green, and the refined (fixed) novel images in pink. Difix3D+ tends to copy content from the reference view when the viewpoint difference is large, leading to inconsistencies and artifacts. In contrast, aeroFix effectively preserve…

Figure 4: Qualitative analysis of ProDiG (ours). Comparison of our method with existing baselines on aerial-to-ground reconstruction. Gaussian Splatting [16] struggles to render complete scenes due to the absence of ground-level viewpoints, while Difix3D+ [38] exhibits noisy artifacts and hallucinated structures. In contrast, ProDiG (ours) produces geometrically consistent and visually coherent reconstructions with f…

Figure 5: Generalization across varying altitudes. We evaluate our method on the AerialMegaDepth [36] dataset, which contains sites captured at diverse altitude ranges.

Figure 6: Ablations. (top) Comparison of different progressive methods. (bottom) Effectiveness of the Distance-Adaptive Gaussian module. 7k denotes evaluation at 7k iterations, after initial training. Combining intermediate-altitude synthesis, geometry-aware causal attention, and distance-adaptive Gaussian refinement, ProDiG produces stable, geometrically consistent 3D representations and realistic ground-level rend…
original abstract

Generating ground-level views and coherent 3D site models from aerial-only imagery is challenging due to extreme viewpoint changes, missing intermediate observations, and large scale variations. Existing methods either refine renderings post-hoc, often producing geometrically inconsistent results, or rely on multi-altitude ground-truth, which is rarely available. Gaussian Splatting and diffusion-based refinements improve fidelity under small variations but fail under wide aerial-to-ground gaps. To address these limitations, we introduce ProDiG (Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction), a diffusion-guided framework that progressively transforms aerial 3D representations toward ground-level fidelity. ProDiG synthesizes intermediate-altitude views and refines the Gaussian representation at each stage using a geometry-aware causal attention module that injects epipolar structure into reference-view diffusion. A distance-adaptive Gaussian module dynamically adjusts Gaussian scale and opacity based on camera distance, ensuring stable reconstruction across large viewpoint gaps. Together, these components enable progressive, geometrically grounded refinement without requiring additional ground-truth viewpoints. Extensive experiments on synthetic and real-world datasets demonstrate that ProDiG produces visually realistic ground-level renderings and coherent 3D geometry, significantly outperforming existing approaches in terms of visual quality, geometric consistency, and robustness to extreme viewpoint changes. Project Page: https://sirsh07.github.io/research/prodig

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents ProDiG, a progressive diffusion-guided Gaussian splatting method for reconstructing ground-level views and 3D models from aerial imagery. The approach synthesizes intermediate-altitude views and iteratively refines the Gaussian representation using a geometry-aware causal attention module that incorporates epipolar geometry and a distance-adaptive Gaussian module that adjusts scale and opacity based on camera distance. It claims to achieve superior visual quality, geometric consistency, and robustness to extreme viewpoint changes on both synthetic and real datasets without requiring additional ground-truth ground-level viewpoints.

Significance. If the empirical results and consistency claims are substantiated, this work would represent a meaningful advance in novel view synthesis for large viewpoint gaps, with potential applications in aerial photogrammetry, virtual tourism, and disaster assessment. The progressive refinement strategy combined with explicit geometric constraints in the diffusion process offers a practical solution where prior methods either require multi-altitude data or produce inconsistent geometry.

major comments (1)
  1. [Method description of progressive refinement] In the description of the geometry-aware causal attention module and distance-adaptive Gaussian module: the paper states that these components inject epipolar structure and dynamically adjust scale/opacity to ensure stable reconstruction, but provides no derivation, bound, or explicit consistency loss (e.g., cycle-consistency or bundle-adjustment term) showing that cumulative geometric drift is prevented across multiple refinement stages. This directly bears on the central claim that the method operates without any ground-truth viewpoints, as diffusion priors could introduce hallucinations not constrained by the initial aerial 3D structure.
minor comments (1)
  1. [Abstract] The abstract asserts outperformance in visual quality and geometric consistency but omits any quantitative metrics, ablation results, or experimental setup details (e.g., dataset names, baseline methods, or evaluation protocols), which would strengthen the summary for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on geometric consistency in our progressive refinement pipeline. We address the single major comment below and outline targeted revisions to strengthen the manuscript.

point-by-point responses
  1. Referee: [Method description of progressive refinement] In the description of the geometry-aware causal attention module and distance-adaptive Gaussian module: the paper states that these components inject epipolar structure and dynamically adjust scale/opacity to ensure stable reconstruction, but provides no derivation, bound, or explicit consistency loss (e.g., cycle-consistency or bundle-adjustment term) showing that cumulative geometric drift is prevented across multiple refinement stages. This directly bears on the central claim that the method operates without any ground-truth viewpoints, as diffusion priors could introduce hallucinations not constrained by the initial aerial 3D structure.

    Authors: We acknowledge that the current manuscript lacks a formal derivation, theoretical bound, or explicit auxiliary loss (such as cycle-consistency) to prove the absence of cumulative drift. The geometry-aware causal attention module enforces epipolar constraints by restricting attention to geometrically corresponding rays derived from the initial aerial Gaussian splats, while the distance-adaptive module modulates Gaussian parameters to preserve scale consistency with camera distance. These mechanisms are intended to ground each diffusion step in the original 3D structure, limiting hallucinations. Our experiments on synthetic data with known ground-truth geometry show low drift in multi-stage metrics (e.g., PSNR and depth error remain stable across refinement stages). We agree a more explicit analysis would improve rigor. In revision we will (1) add a dedicated paragraph deriving the attention mask from epipolar geometry and (2) include an ablation quantifying drift over refinement stages. revision: partial
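The referee's drift worry is also measurable without theory: render a fixed probe view after every refinement stage and track how far its depth map wanders from the stage-0 geometry. A small sketch of that audit, assuming per-stage depth renders are available; nothing here comes from the paper itself.

```python
import numpy as np

def cumulative_drift(depths_per_stage, valid_mask):
    """Mean absolute depth deviation of each refinement stage from the
    initial aerial-only reconstruction (stage 0), at a fixed probe pose.
    A flat curve suggests the diffusion steps stay anchored to the initial
    3D structure; a growing curve is exactly the drift at issue."""
    base = depths_per_stage[0]
    return [np.mean(np.abs(d[valid_mask] - base[valid_mask]))
            for d in depths_per_stage[1:]]
```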

Circularity Check

0 steps flagged

No circularity detected; method is an algorithmic pipeline without load-bearing derivations or self-referential reductions

full rationale

The paper describes ProDiG as a progressive diffusion-guided Gaussian splatting pipeline that synthesizes intermediate views and refines representations via geometry-aware causal attention and distance-adaptive Gaussians. No equations, derivations, or fitted parameters are presented that reduce by construction to the inputs. The central claims rest on the described components and experimental validation on synthetic and real-world datasets rather than on any self-definition, fitted-input prediction, or self-citation chain. The approach is self-contained as an engineering framework without mathematical steps that equate outputs to inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on standard assumptions from Gaussian Splatting and diffusion models plus two newly introduced modules whose effectiveness is asserted but not independently verified in the abstract.

axioms (2)
  • domain assumption Gaussian Splatting can represent 3D scenes from images
    Standard assumption in the field invoked for the base representation
  • domain assumption Diffusion models can synthesize coherent novel views when guided
    Relies on prior diffusion capabilities for view synthesis
invented entities (2)
  • geometry-aware causal attention module no independent evidence
    purpose: injects epipolar structure into reference-view diffusion
    New component introduced to maintain geometric consistency during progressive refinement
  • distance-adaptive Gaussian module no independent evidence
    purpose: dynamically adjusts Gaussian scale and opacity based on camera distance
    New component introduced to ensure stable reconstruction across large viewpoint gaps

pith-pipeline@v0.9.0 · 5542 in / 1397 out tokens · 33749 ms · 2026-05-13T21:58:15.287025+00:00 · methodology


Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 1 internal anchor

  1. [1] Myron Brown, Michael Chan, and Michael Twardowski. Wriva public data, 2024.
  2. [2] Eric Ming Chen, Sidhanth Holalkere, Ruyu Yan, Kai Zhang, and Abe Davis. Ray conditioning: Trading photo-consistency for photo-realism in multi-view image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23242–23251, 2023.
  3. [3] Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. MVSplat: Efficient 3D Gaussian splatting from sparse multi-view images. In European Conference on Computer Vision, pages 370–386. Springer, 2024.
  4. [4] Yuedong Chen, Chuanxia Zheng, Haofei Xu, Bohan Zhuang, Andrea Vedaldi, Tat-Jen Cham, and Jianfei Cai. MVSplat360: Feed-forward 360 scene synthesis from sparse views. Advances in Neural Information Processing Systems, 37:107064–107086, 2024.
  5. [5] Tobias Fischer, Jonas Kulhanek, Samuel Rota Bulo, Lorenzo Porzi, Marc Pollefeys, and Peter Kontschieder. Dynamic 3D Gaussian fields for urban areas. arXiv preprint arXiv:2406.03175, 2024.
  6. [6] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
  7. [7] Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. DreamSim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344, 2023.
  8. [8] Ruiyuan Gao, Kai Chen, Zhihao Li, Lanqing Hong, Zhenguo Li, and Qiang Xu. MagicDrive3D: Controllable 3D generation for any-view rendering in street scenes. arXiv preprint arXiv:2405.14475, 2024.
  9. [9] Zhiyuan Gao, Wenbin Teng, Gonglin Chen, Jinsen Wu, Ningli Xu, Rongjun Qin, Andrew Feng, and Yajie Zhao. SkyEyes: Ground roaming using aerial view images. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3045–3054. IEEE, 2025.
  10. [10] Yujin Ham, Mateusz Michalkiewicz, and Guha Balakrishnan. DRAGON: Drone and ground Gaussian splatting for 3D building reconstruction. In 2024 IEEE International Conference on Computational Photography (ICCP), pages 1–12. IEEE, 2024.
  11. [11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  12. [12] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2D Gaussian splatting for geometrically accurate radiance fields. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.
  13. [13] Zehuan Huang, Hao Wen, Junting Dong, Yaohui Wang, Yangguang Li, Xinyuan Chen, Yan-Pei Cao, Ding Liang, Yu Qiao, Bo Dai, et al. EpiDiff: Enhancing multi-view synthesis via localized epipolar-constrained diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9784–9794, 2024.
  14. [14] Lihan Jiang, Kerui Ren, Mulin Yu, Linning Xu, Junting Dong, Tao Lu, Feng Zhao, Dahua Lin, and Bo Dai. Horizon-GS: Unified 3D Gaussian splatting for large-scale aerial-to-ground scenes. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26789–26799, 2025.
  15. [15] Yash Kant, Aliaksandr Siarohin, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, and Igor Gilitschenski. SPAD: Spatially aware multi-view diffusers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10026–10038, 2024.
  16. [16] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):139, 2023.
  17. [17] Bernhard Kerbl, Andreas Meuleman, Georgios Kopanas, Michael Wimmer, Alexandre Lanvin, and George Drettakis. A hierarchical 3D Gaussian representation for real-time rendering of very large datasets. ACM Transactions on Graphics (TOG), 43(4):1–15, 2024.
  18. [18] Shakiba Kheradmand, Daniel Rebain, Gopal Sharma, Weiwei Sun, Yang-Che Tseng, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. 3D Gaussian splatting as Markov chain Monte Carlo. Advances in Neural Information Processing Systems, 37:80965–80986, 2024.
  19. [19] Jonas Kulhanek, Songyou Peng, Zuzana Kukelova, Marc Pollefeys, and Torsten Sattler. WildGaussians: 3D Gaussian splatting in the wild. arXiv preprint arXiv:2407.08447, 2024.
  20. [20] Jie-Ying Lee, Yi-Ruei Liu, Shr-Ruei Tsai, Wei-Cheng Chang, Chung-Ho Wu, Jiewen Chan, Zhenjun Zhao, Chieh Hubert Lin, and Yu-Lun Liu. SkyFall-GS: Synthesizing immersive 3D urban scenes from satellite imagery. arXiv preprint arXiv:2510.15869, 2025.
  21. [21] Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. MatrixCity: A large-scale city dataset for city-scale neural rendering and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3205–3215, 2023.
  22. [22] Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Yu Qiao, Wanli Ouyang, and Chao Dong. DiffBIR: Toward blind image restoration with generative diffusion prior. In European Conference on Computer Vision, pages 430–448. Springer, 2024.
  23. [23] Yang Liu, Chuanchen Luo, Lue Fan, Naiyan Wang, Junran Peng, and Zhaoxiang Zhang. CityGaussian: Real-time high-quality large-scale scene rendering with Gaussians. In European Conference on Computer Vision, pages 265–282. Springer, 2024.
  24. [24] Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-GS: Structured 3D Gaussians for view-adaptive rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20654–20664, 2024.
  25. [25] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  26. [26] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4296–4304, 2024.
  27. [27] Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, and Jun-Yan Zhu. One-step image translation with text-to-image models. arXiv preprint arXiv:2403.12036, 2024.
  28. [28] Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and Bo Dai. Octree-GS: Towards consistent real-time rendering with LOD-structured 3D Gaussians. arXiv preprint arXiv:2403.17898, 2024.
  29. [29] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  30. [30] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
  31. [31] Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. Advances in Neural Information Processing Systems, 34:19313–19325, 2021.
  32. [32] Mohammed Suhail, Carlos Esteves, Leonid Sigal, and Ameesh Makadia. Generalizable patch-based neural rendering. In European Conference on Computer Vision, pages 156–174. Springer, 2022.
  33. [33] Mohammed Suhail, Carlos Esteves, Leonid Sigal, and Ameesh Makadia. Light field neural rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8269–8279, 2022.
  34. [34] Jiadong Tang, Yu Gao, Dianyi Yang, Liqi Yan, Yufeng Yue, and Yi Yang. DroneSplat: 3D Gaussian splatting for robust 3D reconstruction from in-the-wild drone imagery. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 833–843, 2025.
  35. [35] Haithem Turki, Deva Ramanan, and Mahadev Satyanarayanan. Mega-NeRF: Scalable construction of large-scale NeRFs for virtual fly-throughs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12922–12931, 2022.
  36. [36] Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, and Shubham Tulsiani. AerialMegaDepth: Learning aerial-ground reconstruction and view synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21674–21684, 2025.
  37. [37] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  38. [38] Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. Difix3D+: Improving 3D reconstructions with single-step diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26024–26035, 2025.
  39. [39] Yuanbo Xiangli, Linning Xu, Xingang Pan, Nanxuan Zhao, Anyi Rao, Christian Theobalt, Bo Dai, and Dahua Lin. BungeeNeRF: Progressive neural radiance field for extreme multi-scale scene rendering. In European Conference on Computer Vision, pages 106–122. Springer, 2022.
  40. [40] Butian Xiong, Nanjun Zheng, Junhua Liu, and Zhen Li. GauU-Scene V2: Assessing the reliability of image-based metrics with expansive lidar image dataset using 3DGS and NeRF. arXiv preprint arXiv:2404.04880, 2024.
  41. [41] Jiacong Xu, Yiqun Mei, and Vishal Patel. Wild-GS: Real-time novel view synthesis from unconstrained photo collections. Advances in Neural Information Processing Systems, 37:103334–103355, 2024.
  42. [42] Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street Gaussians: Modeling dynamic urban scenes with Gaussian splatting. In European Conference on Computer Vision, pages 156–173. Springer, 2024.
  43. [43] Vickie Ye, Ruilong Li, Justin Kerr, Matias Turkulainen, Brent Yi, Zhuoyang Pan, Otto Seiskari, Jianbo Ye, Jeffrey Hu, Matthew Tancik, and Angjoo Kanazawa. gsplat: An open-source library for Gaussian splatting. Journal of Machine Learning Research, 26(34):1–17, 2025.
  44. [44] Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. ViewCrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024.
  45. [45] Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-Splatting: Alias-free 3D Gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19447–19456, 2024.
  46. [46] Chenhao Zhang, Yuanping Cao, and Lei Zhang. CrossView-GS: Cross-view Gaussian splatting for large-scale scene reconstruction. arXiv preprint arXiv:2501.01695, 2025.
  47. [47] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  48. [48] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.