
arxiv: 2605.19949 · v1 · pith:QK2CYNXUnew · submitted 2026-05-19 · 💻 cs.CV

Feed-Forward Gaussian Splatting from Sparse Aerial Views

Pith reviewed 2026-05-20 06:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D Gaussian Splatting · Sparse Aerial Views · Urban Scene Reconstruction · Feed-Forward Reconstruction · Novel View Synthesis · Generative Priors · Aerial Imagery

The pith

AnyCity reconstructs coherent 3D Gaussian urban scenes from sparse aerial views in one feed-forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that sparse aerial captures of cities, which over-observe roofs while leaving facades and occluded areas with little support, can still produce artifact-free 3D reconstructions suitable for novel-view synthesis. It does so by first extracting a geometry latent that stays faithful to the input observations and then applying a controlled generative update only where evidence is weak. A reader would care because conventional direct-regression methods create ghosting and melted surfaces, while slower generative approaches risk structures that contradict the photos; a fast method that respects observed geometry would make large-scale urban modeling practical from limited drone flights.

Core claim

AnyCity first predicts an observation-supported geometry latent to anchor reliable structures from the sparse inputs. It then uses scaffold-conditioned aerial completion tokens to predict a gated residual update for weakly constrained content before Gaussian decoding. Training combines dense-to-sparse distillation to transfer structural cues with an aerial-adapted video diffusion prior that supplies fine-grained appearance through gated token conditioning, while observation-preserving objectives ensure the refined representation remains consistent with input-supported geometry. At inference the model produces the final 3D Gaussian scene in a single forward pass.

What carries the argument

Observation-supported geometry latent followed by scaffold-conditioned gated residual update before 3D Gaussian decoding.
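
A minimal sketch of the gated residual pattern described above, assuming the refined latent is composed as Z = Z_geo + g ⊙ ΔZ with a per-token sigmoid gate g. The exact composition, the tensor shapes, and every name here (`gated_residual_update`, `gate_logits`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def gated_residual_update(z_geo, delta_z, gate_logits):
    """Return z_geo + g * delta_z, where g = sigmoid(gate_logits) lies in (0, 1).

    z_geo:       (N, D) observation-supported latent tokens (the scaffold)
    delta_z:     (N, D) residual proposed by the completion branch
    gate_logits: (N, 1) per-token logits; hypothetical output of a small MLP
    """
    gate = sigmoid(gate_logits)
    return z_geo + gate * delta_z, gate

rng = np.random.default_rng(0)
num_tokens, dim = 6, 4
z_geo = rng.normal(size=(num_tokens, dim))
delta_z = rng.normal(size=(num_tokens, dim))

# Hypothetical logits: the first three tokens are well observed (gate near 0),
# the last three are weakly observed (gate near 1).
gate_logits = np.array([[-20.0]] * 3 + [[20.0]] * 3)

z_refined, gate = gated_residual_update(z_geo, delta_z, gate_logits)
```

With a gate near zero the scaffold tokens pass through essentially unchanged, which is the behavior the observation-preserving objectives are meant to encourage; with a gate near one the residual is applied in full.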

If this is right

  • Reconstruction completes in seconds rather than minutes or hours for large urban scenes.
  • Novel views remain coherent with input geometry and avoid ghosting or stretched textures seen in direct regression baselines.
  • The same pipeline works across synthetic, real aerial, UAV-textured, and ground-level scenes without per-scene optimization.
  • Observation-preserving losses keep generated content from drifting away from measurable input evidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same geometry-first then gated-completion pattern could be tested on other sparse multi-view settings such as street-level or indoor captures.
  • Because inference is feed-forward, the model could support real-time rendering loops once the Gaussians are decoded.
  • The gated token design might allow targeted style or time-of-day adjustments by swapping the diffusion prior without retraining the geometry stage.

Load-bearing premise

Dense-to-sparse distillation and the gated aerial video diffusion prior can supply missing cues without creating geometry or appearance inconsistencies with the parts directly supported by the input views.

What would settle it

Generate novel views from the output Gaussians and inspect them for floating facades, texture stretching on building sides, or visible seams that contradict the original sparse photos; persistent artifacts would show the separation of observed and generated content has failed.
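
Visual inspection can be complemented by a per-view score against held-out photos. The sketch below computes PSNR, one of the metrics the simulated rebuttal mentions; the arrays are toy placeholders, and the function name and the `max_val` default are assumptions for illustration.

```python
import numpy as np

def psnr(rendered, reference, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    diff = rendered.astype(np.float64) - reference.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy example: a rendered view that is uniformly 0.1 brighter than the reference.
reference = np.full((4, 4, 3), 0.5)
rendered = reference + 0.1
score = psnr(rendered, reference)  # about 20 dB, since the MSE is about 0.01
```

A single global score will not localize floating facades or seams, so it is best read alongside the visual checks described above.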

Figures

Figures reproduced from arXiv: 2605.19949 by Dongli Wu, Rongjun Qin, Tongyan Hua, Wufan Zhao, Xiaobao Wei, Yinrui Ren, Zhuoxiao Li.

Figure 1: From sparse aerial observations to generative urban reconstruction. Sparse aerial views show reliable observations on roofs and roads but weak constraints on facades, distant buildings, and occluded structures. AnyCity addresses this imbalance and produces coherent 3D Gaussian urban reconstructions from sparse aerial inputs.
Figure 2: Pose and evidence imbalance in sparse aerial reconstruction. Sparse aerial views provide weak parallax and limited overlap, making facades and occluded structures under-constrained. AnyCity anchors reliable geometry with an observation-supported geometry latent and uses scaffold-conditioned aerial completion tokens to refine weakly constrained content before Gaussian decoding.
Figure 3: Overview of AnyCity. AnyCity reconstructs a 3D Gaussian scene from sparse unposed aerial images. It first builds an observation-supported scaffold $Z_{\text{geo}}$, then uses gated aerial completion tokens and a LoRA-adapted video prior to predict a residual update $\Delta Z$ for weakly constrained content. Stage I stabilizes the scaffold with geometric losses, while Stage II trains residual refinement with dense-to-sparse di…
Figure 4: Geometry stabilizers. To obtain a stable scaffold before residual refinement, the first training stage disables the completion branch and optimizes the observation-supported pathway alone. We decode $Z_{\text{geo}}$ with $D_{\text{gs}}$ and supervise its renderings with photometric and lightweight geometric losses:

$$\mathcal{L}_{\text{stage1}} = \mathcal{L}_{\text{rgb}} + \lambda_{\text{depth}}\,\mathcal{L}_{\text{depth}} + \lambda_{\text{normal}}\,\mathcal{L}_{\text{normal}} \tag{3}$$

Here, $\mathcal{L}_{\text{rgb}}$ is computed from renderings of $D_{\text{gs}}(Z_{\text{geo}})$.
Figure 5: Qualitative comparison of urban novel view synthesis with two context views. For each scene, we show two conditioning images and the synthesized target view from different methods. Compared with prior feed-forward baselines, AnyCity better preserves global layout and facade continuity while reducing floaters and melting artifacts, yielding more realistic renderings. Results are shown on GoogleEarth, CityNe…
Figure 6: Qualitative ablation
Figure 7: Qualitative results on diverse urban topologies. Given only 4 sparse input images, AnyCity successfully generalizes to various unconstrained city layouts, including dense commercial districts, residential blocks, waterfronts, and urban parks. It consistently produces high-fidelity novel views (middle) and physically plausible 3D Gaussian geometries (right).
Figure 8: Failure case in extreme occlusion. When the aerial capture altitude is relatively low in a dense skyscraper cluster, background high-rises suffer from severe foreground occlusion, leading to incomplete geometric reconstruction and blurred textures on distant facades. While AnyCity performs robustly in most urban scenarios, it occasionally encounters difficulties under extreme physical constraints.
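
The Stage I objective quoted in the Figure 4 text (Eq. 3) is a weighted sum of three terms. The sketch below shows that combination only. The individual loss forms (L1 for color and depth, one minus cosine similarity for normals) and the λ values are assumptions chosen for illustration, since the source excerpt does not specify them.

```python
import numpy as np

def stage1_loss(rgb_pred, rgb_gt, depth_pred, depth_gt, normal_pred, normal_gt,
                lambda_depth=0.1, lambda_normal=0.05):
    """L_stage1 = L_rgb + lambda_depth * L_depth + lambda_normal * L_normal (Eq. 3).

    The per-term definitions below are hypothetical stand-ins.
    """
    # Assumed photometric term: mean absolute color error.
    l_rgb = np.mean(np.abs(rgb_pred - rgb_gt))
    # Assumed depth term: mean absolute depth error.
    l_depth = np.mean(np.abs(depth_pred - depth_gt))
    # Assumed normal term: 1 - cosine similarity, averaged over pixels.
    dot = np.sum(normal_pred * normal_gt, axis=-1)
    norms = np.linalg.norm(normal_pred, axis=-1) * np.linalg.norm(normal_gt, axis=-1)
    l_normal = np.mean(1.0 - dot / (norms + 1e-8))
    return l_rgb + lambda_depth * l_depth + lambda_normal * l_normal

# Toy 2x2 example with a constant color error, a constant depth error,
# and perfectly aligned normals.
h, w = 2, 2
rgb_gt = np.zeros((h, w, 3))
rgb_pred = rgb_gt + 0.2
depth_gt = np.ones((h, w))
depth_pred = depth_gt + 0.5
normal_gt = np.tile(np.array([0.0, 0.0, 1.0]), (h, w, 1))
normal_pred = normal_gt.copy()

total = stage1_loss(rgb_pred, rgb_gt, depth_pred, depth_gt, normal_pred, normal_gt)
# Roughly 0.2 + 0.1 * 0.5 + 0.05 * 0 = 0.25
```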
Abstract

Reconstructing large-scale urban scenes from sparse aerial views is a crucial yet challenging task. Due to biased top-down and shallow-oblique camera poses, sparse aerial captures exhibit strong evidence imbalance: roofs and open regions are repeatedly observed, while facades, distant buildings, and occluded structures receive little multi-view support. Existing feed-forward 3D Gaussian Splatting methods directly regress a deterministic representation from sparse inputs, but this often leads to ghosting, melted facades, and stretched textures. Recent pseudo-view and video-based generative reconstruction methods use additional supervision or generative priors. However, they often lack a clear separation between observed geometry and prior-driven content, which can lead to plausible but inconsistent structures. We propose AnyCity, an observation-grounded generative reconstruction framework for sparse aerial urban scenes. AnyCity first predicts an observation-supported geometry latent to anchor reliable structures, and then uses scaffold-conditioned aerial completion tokens to predict a gated residual update for weakly constrained content before Gaussian decoding. During training, dense-to-sparse distillation transfers structural cues from dense-view reconstruction, while an aerial-adapted video diffusion prior provides fine-grained urban appearance cues through gated token conditioning. Observation-preserving objectives keep the refined representation consistent with input-supported geometry. At inference time, AnyCity reconstructs the final 3D Gaussian scene from sparse aerial views in a single feed-forward pass, achieving coherent urban novel-view synthesis with second-level inference. Experiments on synthetic, aerial-domain, UAV-textured, and real-world scenes show consistent improvements over feed-forward baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes AnyCity, an observation-grounded generative reconstruction framework for sparse aerial urban scenes using 3D Gaussian Splatting. It first predicts an observation-supported geometry latent to anchor reliable structures from sparse aerial views, then uses scaffold-conditioned aerial completion tokens to predict a gated residual update for weakly constrained content before Gaussian decoding. Training uses dense-to-sparse distillation to transfer structural cues and an aerial-adapted video diffusion prior for fine-grained appearance through gated token conditioning, with observation-preserving objectives to maintain consistency. The method claims to enable single feed-forward pass reconstruction with second-level inference, showing consistent improvements over feed-forward baselines on synthetic, aerial, UAV, and real-world scenes.

Significance. If the proposed separation between observation-supported geometry and prior-driven content holds without leakage, the work could be significant for practical applications in large-scale urban 3D reconstruction from sparse aerial captures, addressing the evidence imbalance issue that leads to artifacts in existing methods. The combination of distillation and generative priors in a feed-forward setting is a promising direction, and the fast inference time is a practical strength.

major comments (2)
  1. Abstract: The abstract asserts 'consistent improvements over feed-forward baselines' on multiple scene types but supplies no quantitative metrics, error analysis, ablation details, or specific numerical comparisons. Without these, the central claims of coherent urban novel-view synthesis and avoidance of ghosting or melted facades cannot be verified.
  2. Training description paragraph: The framework relies on 'gated token conditioning' from the aerial-adapted video diffusion prior to supply a 'gated residual update' while claiming that 'observation-preserving objectives keep the refined representation consistent with input-supported geometry.' It is not specified whether the gate is hard or learned, nor whether the update is applied at the token level before Gaussian decoding. If the learned gate allows the prior to influence the observation-supported geometry latent, structural changes could propagate to roof and open-region Gaussians despite multi-view support, violating the evidence-imbalance premise.
minor comments (1)
  1. The abstract refers to 'second-level inference' without specifying the hardware, resolution, or exact timing measurement used to support this claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical strengths of our approach for large-scale urban reconstruction. We provide point-by-point responses to the major comments below.

  1. Referee: Abstract: The abstract asserts 'consistent improvements over feed-forward baselines' on multiple scene types but supplies no quantitative metrics, error analysis, ablation details, or specific numerical comparisons. Without these, the central claims of coherent urban novel-view synthesis and avoidance of ghosting or melted facades cannot be verified.

    Authors: We agree that the abstract, as a concise summary, would be strengthened by including a small number of key quantitative results. In the revised manuscript we will add brief numerical comparisons (e.g., average PSNR/SSIM gains on the synthetic and real-world test sets) while remaining within the abstract length limit. revision: yes

  2. Referee: Training description paragraph: The framework relies on 'gated token conditioning' from the aerial-adapted video diffusion prior to supply a 'gated residual update' while claiming that 'observation-preserving objectives keep the refined representation consistent with input-supported geometry.' It is not specified whether the gate is hard or learned, nor whether the update is applied at the token level before Gaussian decoding. If the learned gate allows the prior to influence the observation-supported geometry latent, structural changes could propagate to roof and open-region Gaussians despite multi-view support, violating the evidence-imbalance premise.

    Authors: The gate is a learned soft gate realized by a small MLP followed by a sigmoid activation; it modulates only the residual tokens produced for the aerial completion branch. The observation-supported geometry latent is generated in an earlier stage and is held fixed; the residual update is added exclusively to the completion tokens before they enter the Gaussian decoder. Observation-preserving losses further penalize any deviation in well-supported regions. We will expand the methods section with an explicit description of the gate architecture, a diagram illustrating the latent separation, and quantitative gate-activation statistics showing near-zero influence on supported geometry. revision: yes
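
The gate-activation statistics promised in the second response could be reported along these lines. This is a sketch under stated assumptions: `view_counts` is a hypothetical per-token count of input views observing that token, and `min_views` is an illustrative threshold for calling a token "supported"; neither comes from the paper.

```python
import numpy as np

def gate_stats_by_support(gate, view_counts, min_views=2):
    """Mean gate activation for well-supported versus weakly supported tokens."""
    gate = np.asarray(gate, dtype=np.float64).ravel()
    supported = np.asarray(view_counts).ravel() >= min_views
    weak = ~supported
    return {
        "supported_mean": float(gate[supported].mean()) if supported.any() else float("nan"),
        "weak_mean": float(gate[weak].mean()) if weak.any() else float("nan"),
    }

# Toy values: three tokens seen by two or more views, two tokens seen by fewer.
gate = np.array([0.01, 0.03, 0.02, 0.90, 0.80])
view_counts = np.array([4, 3, 2, 1, 0])
stats = gate_stats_by_support(gate, view_counts)
```

A large gap between the two means would be consistent with the claimed separation; a supported-region mean well above zero would indicate the leakage the referee is worried about.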

Circularity Check

0 steps flagged

No significant circularity; new architecture and training objectives are independent


The paper introduces AnyCity as a feed-forward framework that first predicts an observation-supported geometry latent from sparse aerial views, then applies scaffold-conditioned aerial completion tokens for a gated residual update before Gaussian decoding. Training uses dense-to-sparse distillation to transfer structural cues and an aerial-adapted video diffusion prior for appearance via gated token conditioning, with observation-preserving objectives to maintain consistency. No derivation step reduces by construction to its own inputs, no fitted parameters are relabeled as predictions, and no load-bearing claims rest on self-citations or imported uniqueness theorems. The central claims rest on explicitly described novel components and objectives that are not equivalent to the inputs by definition, making the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on several new architectural pieces and training assumptions whose effectiveness is asserted but not independently evidenced in the provided abstract.

axioms (2)
  • [domain assumption] Dense-to-sparse distillation transfers structural cues from dense-view reconstruction to sparse inputs.
    Invoked in the training description to enable structural learning.
  • [domain assumption] An aerial-adapted video diffusion prior can supply fine-grained urban appearance cues through gated token conditioning while preserving consistency with observed geometry.
    Central to the generative completion step.
invented entities (2)
  • observation-supported geometry latent (no independent evidence)
    purpose: Anchors reliable structures before generative completion.
    First stage of the AnyCity pipeline.
  • scaffold-conditioned aerial completion tokens (no independent evidence)
    purpose: Predict gated residual updates for weakly constrained content.
    Mechanism for adding details to under-observed regions.

pith-pipeline@v0.9.0 · 5818 in / 1600 out tokens · 49098 ms · 2026-05-20T06:47:43.918838+00:00 · methodology


