Feed-Forward Gaussian Splatting from Sparse Aerial Views

Dongli Wu; Rongjun Qin; Tongyan Hua; Wufan Zhao; Xiaobao Wei; Yinrui Ren; Zhuoxiao Li

arxiv: 2605.19949 · v1 · pith:QK2CYNXUnew · submitted 2026-05-19 · 💻 cs.CV

Dongli Wu, Zhuoxiao Li, Tongyan Hua, Yinrui Ren, Xiaobao Wei, Rongjun Qin, Wufan Zhao

Pith reviewed 2026-05-20 06:47 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D Gaussian Splatting · Sparse Aerial Views · Urban Scene Reconstruction · Feed-Forward Reconstruction · Novel View Synthesis · Generative Priors · Aerial Imagery

The pith

AnyCity reconstructs coherent 3D Gaussian urban scenes from sparse aerial views in one feed-forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that sparse aerial captures of cities, which over-observe roofs while leaving facades and occluded areas with little support, can still yield coherent 3D reconstructions suitable for novel-view synthesis. It does so by first extracting a geometry latent that stays faithful to the input observations and then applying a controlled generative update only where evidence is weak. A reader would care because direct-regression methods produce ghosting and melted surfaces, while pseudo-view and video-based generative approaches risk plausible structures that contradict the photos; a fast method that respects observed geometry would make large-scale urban modeling practical from limited drone flights.

Core claim

AnyCity first predicts an observation-supported geometry latent to anchor reliable structures from the sparse inputs. It then uses scaffold-conditioned aerial completion tokens to predict a gated residual update for weakly constrained content before Gaussian decoding. Training combines dense-to-sparse distillation to transfer structural cues with an aerial-adapted video diffusion prior that supplies fine-grained appearance through gated token conditioning, while observation-preserving objectives ensure the refined representation remains consistent with input-supported geometry. At inference the model produces the final 3D Gaussian scene in a single forward pass.

What carries the argument

Observation-supported geometry latent followed by scaffold-conditioned gated residual update before 3D Gaussian decoding.

If this is right

Reconstruction completes in seconds rather than minutes or hours for large urban scenes.
Novel views remain coherent with input geometry and avoid ghosting or stretched textures seen in direct regression baselines.
The same pipeline works across synthetic, aerial-domain, UAV-textured, and real-world scenes without per-scene optimization.
Observation-preserving losses keep generated content from drifting away from measurable input evidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same geometry-first then gated-completion pattern could be tested on other sparse multi-view settings such as street-level or indoor captures.
Because inference is feed-forward, the model could support real-time rendering loops once the Gaussians are decoded.
The gated token design might allow targeted style or time-of-day adjustments by swapping the diffusion prior without retraining the geometry stage.

Load-bearing premise

Dense-to-sparse distillation and the gated aerial video diffusion prior can supply missing cues without creating geometry or appearance inconsistencies with the parts directly supported by the input views.

What would settle it

Generate novel views from the output Gaussians and inspect them for floating facades, texture stretching on building sides, or visible seams that contradict the original sparse photos; persistent artifacts would show the separation of observed and generated content has failed.
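The inspection described above is visual, but a numeric companion check is easy to sketch. The snippet below is not taken from the paper. It computes PSNR between a rendering and a reference photo, assuming both are available as equally sized flat lists of intensities in [0, 1]. The function name `psnr` and the input layout are illustrative assumptions.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio (dB) between two equally sized flat pixel lists."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical toy pixels: a uniform error of 0.1 on a [0, 1] scale.
score = psnr([0.0, 0.0], [0.1, 0.1])
print(f"{score:.2f} dB")
```

A low score on renderings at the input poses would suggest that the refined representation has drifted from the observed evidence, though PSNR alone cannot distinguish floating facades from texture stretching or seams.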

Figures

Figures reproduced from arXiv: 2605.19949 by Dongli Wu, Rongjun Qin, Tongyan Hua, Wufan Zhao, Xiaobao Wei, Yinrui Ren, Zhuoxiao Li.

**Figure 1.** From sparse aerial observations to generative urban reconstruction. Sparse aerial views show reliable observations on roofs and roads but weak constraints on facades, distant buildings, and occluded structures. AnyCity addresses this imbalance and produces coherent 3D Gaussian urban reconstructions from sparse aerial inputs.

**Figure 2.** Pose and evidence imbalance in sparse aerial reconstruction. Sparse aerial views provide weak parallax and limited overlap, making facades and occluded structures under-constrained. AnyCity anchors reliable geometry with an observation-supported geometry latent and uses scaffold-conditioned aerial completion tokens to refine weakly constrained content before Gaussian decoding.

**Figure 3.** Overview of AnyCity. AnyCity reconstructs a 3D Gaussian scene from sparse unposed aerial images. It first builds an observation-supported scaffold $Z_{\text{geo}}$, then uses gated aerial completion tokens and a LoRA-adapted video prior to predict a residual update $\Delta Z$ for weakly constrained content. Stage I stabilizes the scaffold with geometric losses, while Stage II trains residual refinement with dense-to-sparse di…

**Figure 4.** Geometry stabilizers. To obtain a stable scaffold before residual refinement, the first training stage disables the completion branch and optimizes the observation-supported pathway alone. We decode $Z_{\text{geo}}$ with $D_{\text{gs}}$ and supervise its renderings with photometric and lightweight geometric losses: $\mathcal{L}_{\text{stage1}} = \mathcal{L}_{\text{rgb}} + \lambda_{\text{depth}} \mathcal{L}_{\text{depth}} + \lambda_{\text{normal}} \mathcal{L}_{\text{normal}}$ (Eq. 3). Here, $\mathcal{L}_{\text{rgb}}$ is computed from renderings of $D_{\text{gs}}(Z_{\text{geo}})$.

**Figure 5.** Qualitative comparison of urban novel view synthesis with two context views. For each scene, we show two conditioning images and the synthesized target view from different methods. Compared with prior feed-forward baselines, AnyCity better preserves global layout and facade continuity while reducing floaters and melting artifacts, yielding more realistic renderings. Results are shown on GoogleEarth, CityNe…

**Figure 6.** Qualitative ablation.

**Figure 7.** Qualitative results on diverse urban topologies. Given only 4 sparse input images, AnyCity successfully generalizes to various unconstrained city layouts, including dense commercial districts, residential blocks, waterfronts, and urban parks. It consistently produces high-fidelity novel views (middle) and physically plausible 3D Gaussian geometries (right).

**Figure 8.** Failure case in extreme occlusion. When the aerial capture altitude is relatively low in a dense skyscraper cluster, background high-rises suffer from severe foreground occlusion, leading to incomplete geometric reconstruction and blurred textures on distant facades. While AnyCity performs robustly in most urban scenarios, it occasionally encounters difficulties under extreme physical constraints.
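The Stage I objective quoted in the Figure 4 text (Eq. 3) is a plain weighted sum, which a few lines of Python make concrete. This is a minimal sketch: the function name `stage1_loss`, the individual loss values, and the weights are hypothetical, since the excerpt does not state the values the authors use.

```python
def stage1_loss(l_rgb, l_depth, l_normal, lambda_depth, lambda_normal):
    """Weighted sum of Eq. (3): L_rgb + lambda_depth * L_depth + lambda_normal * L_normal."""
    return l_rgb + lambda_depth * l_depth + lambda_normal * l_normal


# Hypothetical per-batch loss values and weights, chosen only for illustration.
total = stage1_loss(
    l_rgb=0.5,
    l_depth=0.25,
    l_normal=0.125,
    lambda_depth=2.0,
    lambda_normal=4.0,
)
print(total)  # 0.5 + 2.0 * 0.25 + 4.0 * 0.125 = 1.5
```

The photometric term carries an implicit weight of one, so the two lambda values set how strongly the depth and normal terms count relative to it.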

Abstract

Reconstructing large-scale urban scenes from sparse aerial views is a crucial yet challenging task. Due to biased top-down and shallow-oblique camera poses, sparse aerial captures exhibit strong evidence imbalance: roofs and open regions are repeatedly observed, while facades, distant buildings, and occluded structures receive little multi-view support. Existing feed-forward 3D Gaussian Splatting methods directly regress a deterministic representation from sparse inputs, but this often leads to ghosting, melted facades, and stretched textures. Recent pseudo-view and video-based generative reconstruction methods use additional supervision or generative priors. However, they often lack a clear separation between observed geometry and prior-driven content, which can lead to plausible but inconsistent structures. We propose AnyCity, an observation-grounded generative reconstruction framework for sparse aerial urban scenes. AnyCity first predicts an observation-supported geometry latent to anchor reliable structures, and then uses scaffold-conditioned aerial completion tokens to predict a gated residual update for weakly constrained content before Gaussian decoding. During training, dense-to-sparse distillation transfers structural cues from dense-view reconstruction, while an aerial-adapted video diffusion prior provides fine-grained urban appearance cues through gated token conditioning. Observation-preserving objectives keep the refined representation consistent with input-supported geometry. At inference time, AnyCity reconstructs the final 3D Gaussian scene from sparse aerial views in a single feed-forward pass, achieving coherent urban novel-view synthesis with second-level inference. Experiments on synthetic, aerial-domain, UAV-textured, and real-world scenes show consistent improvements over feed-forward baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AnyCity anchors the geometry latent first, then adds gated residuals informed by an aerial video diffusion prior to fill facades and occluded areas, but the claimed separation from observed content is the part that needs the closest look.

The main point is that AnyCity reconstructs 3D Gaussians from sparse aerial views in one forward pass. It first predicts an observation-supported geometry latent to anchor the well-observed roofs and open regions, then uses scaffold-conditioned aerial completion tokens to produce gated residual updates for the weakly supported facades and distant structures before decoding to Gaussians. Training uses dense-to-sparse distillation and an aerial-adapted video diffusion prior, plus observation-preserving objectives that keep the refined output consistent with the input views.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes AnyCity, an observation-grounded generative reconstruction framework for sparse aerial urban scenes using 3D Gaussian Splatting. It first predicts an observation-supported geometry latent to anchor reliable structures from sparse aerial views, then uses scaffold-conditioned aerial completion tokens to predict a gated residual update for weakly constrained content before Gaussian decoding. Training uses dense-to-sparse distillation to transfer structural cues and an aerial-adapted video diffusion prior for fine-grained appearance through gated token conditioning, with observation-preserving objectives to maintain consistency. The manuscript claims reconstruction in a single feed-forward pass with second-level inference, and reports consistent improvements over feed-forward baselines on synthetic, aerial-domain, UAV-textured, and real-world scenes.

Significance. If the proposed separation between observation-supported geometry and prior-driven content holds without leakage, the work could be significant for practical applications in large-scale urban 3D reconstruction from sparse aerial captures, addressing the evidence imbalance issue that leads to artifacts in existing methods. The combination of distillation and generative priors in a feed-forward setting is a promising direction, and the fast inference time is a practical strength.

major comments (2)

Abstract: The abstract asserts 'consistent improvements over feed-forward baselines' on multiple scene types but supplies no quantitative metrics, error analysis, ablation details, or specific numerical comparisons. Without these, the central claims of coherent urban novel-view synthesis and avoidance of ghosting or melted facades cannot be verified.
Training description paragraph: The framework relies on 'gated token conditioning' from the aerial-adapted video diffusion prior to supply a 'gated residual update' while claiming that 'observation-preserving objectives keep the refined representation consistent with input-supported geometry.' It is not specified whether the gate is hard or learned, nor whether the update is applied at the token level before Gaussian decoding. If the learned gate allows the prior to influence the observation-supported geometry latent, structural changes could propagate to roof and open-region Gaussians despite multi-view support, violating the evidence-imbalance premise.

minor comments (1)

The abstract refers to 'second-level inference' without specifying the hardware, resolution, or exact timing measurement used to support this claim.
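A timing claim of this kind is usually backed by a small, explicit measurement protocol. The sketch below shows one such protocol in plain Python, assuming the model call can be wrapped in a zero-argument function. The names `measure_latency` and `dummy_forward_pass` are hypothetical, and the stand-in workload is not the AnyCity model. On a GPU, the device would also need to be synchronized before each clock read, which this sketch omits.

```python
import statistics
import time


def measure_latency(run_once, warmup=2, repeats=5):
    """Return the median wall-clock seconds of run_once() after warm-up calls."""
    for _ in range(warmup):
        run_once()  # warm-up runs are not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def dummy_forward_pass():
    # Stand-in workload; replace with the real model call.
    return sum(i * i for i in range(10_000))


latency = measure_latency(dummy_forward_pass)
print(f"median latency: {latency:.6f} s")
```

Reporting the median over several timed runs, together with the hardware and input resolution, would address the referee's request directly.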

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical strengths of our approach for large-scale urban reconstruction. We provide point-by-point responses to the major comments below.

Referee: Abstract: The abstract asserts 'consistent improvements over feed-forward baselines' on multiple scene types but supplies no quantitative metrics, error analysis, ablation details, or specific numerical comparisons. Without these, the central claims of coherent urban novel-view synthesis and avoidance of ghosting or melted facades cannot be verified.

Authors: We agree that the abstract, as a concise summary, would be strengthened by including a small number of key quantitative results. In the revised manuscript we will add brief numerical comparisons (e.g., average PSNR/SSIM gains on the synthetic and real-world test sets) while remaining within the abstract length limit.
Revision: yes

Referee: Training description paragraph: The framework relies on 'gated token conditioning' from the aerial-adapted video diffusion prior to supply a 'gated residual update' while claiming that 'observation-preserving objectives keep the refined representation consistent with input-supported geometry.' It is not specified whether the gate is hard or learned, nor whether the update is applied at the token level before Gaussian decoding. If the learned gate allows the prior to influence the observation-supported geometry latent, structural changes could propagate to roof and open-region Gaussians despite multi-view support, violating the evidence-imbalance premise.

Authors: The gate is a learned soft gate realized by a small MLP followed by a sigmoid activation; it modulates only the residual tokens produced for the aerial completion branch. The observation-supported geometry latent is generated in an earlier stage and is held fixed; the residual update is added exclusively to the completion tokens before they enter the Gaussian decoder. Observation-preserving losses further penalize any deviation in well-supported regions. We will expand the methods section with an explicit description of the gate architecture, a diagram illustrating the latent separation, and quantitative gate-activation statistics showing near-zero influence on supported geometry.
Revision: yes
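The soft gate described in this response can be illustrated with a toy example. The sketch below is one plausible reading, not the authors' implementation: it assumes the gate logits have already been produced by some upstream network (the small MLP is omitted), and it assumes the gated residual is combined additively with the scaffold values. The names `gated_residual_update`, `z_geo`, `delta_z`, and `gate_logits` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real logit to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_residual_update(z_geo, delta_z, gate_logits):
    """Return z_geo + sigmoid(gate_logits) * delta_z, element by element."""
    if not (len(z_geo) == len(delta_z) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    # A new list is built, so the scaffold values in z_geo are never modified.
    return [z + sigmoid(g) * d for z, d, g in zip(z_geo, delta_z, gate_logits)]


# Hypothetical toy values: a strongly negative logit closes the gate
# (well-observed token), a strongly positive logit opens it.
z_geo = [1.0, 1.0]
delta_z = [0.5, 0.5]
gate_logits = [-20.0, 20.0]
refined = gated_residual_update(z_geo, delta_z, gate_logits)
print(refined)
```

With a closed gate the first token stays at its scaffold value to within about 1e-9, while the open gate lets nearly the full residual through on the second. Gate-activation statistics of the kind the authors promise would show how often real tokens fall near each extreme.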

Circularity Check

0 steps flagged

No significant circularity; new architecture and training objectives are independent

The paper introduces AnyCity as a feed-forward framework that first predicts an observation-supported geometry latent from sparse aerial views, then applies scaffold-conditioned aerial completion tokens for a gated residual update before Gaussian decoding. Training uses dense-to-sparse distillation to transfer structural cues and an aerial-adapted video diffusion prior for appearance via gated token conditioning, with observation-preserving objectives to maintain consistency. No derivation step reduces by construction to its own inputs, no fitted parameters are relabeled as predictions, and no load-bearing claims rest on self-citations or imported uniqueness theorems. The central claims rest on explicitly described novel components and objectives that are not equivalent to the inputs by definition, making the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on several new architectural pieces and training assumptions whose effectiveness is asserted but not independently evidenced in the provided abstract.

axioms (2)

domain assumption: Dense-to-sparse distillation transfers structural cues from dense-view reconstruction to sparse inputs.
Invoked in the training description to enable structural learning.

domain assumption: An aerial-adapted video diffusion prior can supply fine-grained urban appearance cues through gated token conditioning while preserving consistency with observed geometry.
Central to the generative completion step.

invented entities (2)

observation-supported geometry latent (no independent evidence)
purpose: Anchors reliable structures before generative completion.
First stage of the AnyCity pipeline.

scaffold-conditioned aerial completion tokens (no independent evidence)
purpose: Predict gated residual updates for weakly constrained content.
Mechanism for adding details to under-observed regions.

pith-pipeline@v0.9.0 · 5818 in / 1600 out tokens · 49098 ms · 2026-05-20T06:47:43.918838+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: AnyCity first predicts an observation-supported geometry latent to anchor reliable structures, and then uses scaffold-conditioned aerial completion tokens to predict a gated residual update for weakly constrained content before Gaussian decoding.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: During training, dense-to-sparse distillation transfers structural cues from dense-view reconstruction, while an aerial-adapted video diffusion prior provides fine-grained urban appearance cues through gated token conditioning.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 7 internal anchors

[1]

Aligning global semantics and local textures in generative video enhancement

Zhikai Chen, Fuchen Long, Zhaofan Qiu, Ting Yao, Wengang Zhou, Jiebo Luo, and Tao Mei. Aligning global semantics and local textures in generative video enhancement. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 17087– 17096, 2025. 6

work page 2025
[2]

Predicting depth, surface normals and semantic labels with a common multi- scale convolutional architecture

David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi- scale convolutional architecture. InProceedings of the IEEE international conference on computer vision, pages 2650–2658, 2015. 6

work page 2015
[3]

Citygs-x: A scalable architecture for efficient and geometrically accurate large-scale scene reconstruc- tion

Yuanyuan Gao, Hao Li, Jiaqi Chen, Zhengyu Zou, Zhi- hang Zhong, Dingwen Zhang, Xiao Sun, and Junwei Han. Citygs-x: A scalable architecture for efficient and geometrically accurate large-scale scene reconstruc- tion. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 27187–27196,

work page
[4]

World recon- struction from inconsistent views.arXiv preprint arXiv:2603.16736, 2026

Lukas Höllein and Matthias Nießner. World recon- struction from inconsistent views.arXiv preprint arXiv:2603.16736, 2026. 3

work page arXiv 2026
[5]

Sat2city: 3d city generation from a single satellite image with cascaded latent diffusion

Tongyan Hua, Lutao Jiang, Ying-Cong Chen, and Wu- fan Zhao. Sat2city: 3d city generation from a single satellite image with cascaded latent diffusion. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 27978–27988, 2025. 2

work page 2025
[6]

Gen3r: 3d scene genera- tion meets feed-forward reconstruction.arXiv preprint arXiv:2601.04090, 2026

Jiaxin Huang, Yuanbo Yang, Bangbang Yang, Lin Ma, Yuewen Ma, and Yiyi Liao. Gen3r: 3d scene genera- tion meets feed-forward reconstruction.arXiv preprint arXiv:2601.04090, 2026. 2, 3

work page arXiv 2026
[7]

Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.ACM Transactions on Graphics (TOG), 44(6):1–16, 2025

Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.ACM Transactions on Graphics (TOG), 44(6):1–16, 2025. 2, 3, 6, 14

work page 2025
[8]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimküh- ler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1, 2023. 2, 3

work page 2023
[9]

Gen- erative sparse-view gaussian splatting

Hanyang Kong, Xingyi Yang, and Xinchao Wang. Gen- erative sparse-view gaussian splatting. InProceedings of the Computer Vision and Pattern Recognition Con- ference, pages 26745–26755, 2025. 2, 3

work page 2025
[10]

arXiv preprint arXiv:2510.21615 (2025) 4

Orest Kupyn, Fabian Manhardt, Federico Tombari, and Christian Rupprecht. Epipolar geometry im- proves video generation models.arXiv preprint arXiv:2510.21615, 2025. 3

work page arXiv 2025
[11]

Skyfall-gs: Synthe- sizing immersive 3d urban scenes from satellite imagery

Jie-Ying Lee, Yi-Ruei Liu, Shr-Ruei Tsai, Wei-Cheng Chang, Chung-Ho Wu, Jiewen Chan, Zhenjun Zhao, Chieh Hubert Lin, and Yu-Lun Liu. Skyfall-gs: Syn- thesizing immersive 3d urban scenes from satellite im- agery.arXiv preprint arXiv:2510.15869, 2025. 2

work page arXiv 2025
[12]

Urbangs: A scalable and efficient architecture for geometrically accurate large-scene reconstruction.arXiv preprint arXiv:2602.02089, 2026

Changbai Li, Haodong Zhu, Hanlin Chen, Xiuping Liang, Tongfei Chen, Shuwei Shao, Linlin Yang, Huobin Tan, and Baochang Zhang. Urbangs: A scalable and efficient architecture for geometrically accurate large-scene reconstruction.arXiv preprint arXiv:2602.02089, 2026. 2

work page arXiv 2026
[13]

Manipdreamer3d: Synthesizing plausible robotic manipulation video with occupancy-aware 3d trajectory

Ying Li, Xiaobao Wei, Xiaowei Chi, Yuming Li, Zhongyu Zhao, Hao Wang, Ningning Ma, Ming Lu, Sirui Han, and Shanghang Zhang. Manipdreamer3d: Synthesizing plausible robotic manipulation video with occupancy-aware 3d trajectory. InProceedings of the AAAI Conference on Artificial Intelligence (AAAI),

work page
[14]

Matrixcity: A large-scale city dataset for city-scale neural render- ing and beyond

Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. Matrixcity: A large-scale city dataset for city-scale neural render- ing and beyond. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3205–3215, 2023. 2, 5, 13

work page 2023
[15]

Wonder- land: Navigating 3d scenes from a single image

Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N Plataniotis, Sergey Tulyakov, and Jian Ren. Wonder- land: Navigating 3d scenes from a single image. In Proceedings of the Computer Vision and Pattern Recog- nition Conference, pages 798–810, 2025. 2, 3

work page 2025
[16]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025. 2, 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Capturing, reconstructing, and simulating: the urbanscene3d dataset

Liqiang Lin, Yilin Liu, Yue Hu, Xingguang Yan, Ke Xie, and Hui Huang. Capturing, reconstructing, and simulating: the urbanscene3d dataset. InECCV,

work page
[18]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024. 5

work page 2024
[19]

Citygaussianv2: Efficient and geometrically accurate reconstruction for large- scale scenes.arXiv preprint arXiv:2411.00771, 2024

Yang Liu, Chuanchen Luo, Zhongkai Mao, Junran Peng, and Zhaoxiang Zhang. Citygaussianv2: Efficient and geometrically accurate reconstruction for large- scale scenes.arXiv preprint arXiv:2411.00771, 2024. 2, 6

work page arXiv 2024
[20]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Nerf: Representing scenes as neural radiance fields for 9 view synthesis.Communications of the ACM, 65(1): 99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for 9 view synthesis.Communications of the ACM, 65(1): 99–106, 2021. 3

work page 2021
[22]

Pixelwise view selection for unstructured multi-view stereo

Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. InEuropean con- ference on computer vision, pages 501–518. Springer,

work page
[23]

Structure-from-motion revisited

Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. InConference on Computer Vision and Pattern Recognition (CVPR),

work page
[24]

Lyra 2.0: Explorable Generative 3D Worlds

Tianchang Shen, Sherwin Bahmani, Kai He, Sangeetha Grama Srinivasan, Tianshi Cao, Jiawei Ren, Ruilong Li, Zian Wang, Nicholas Sharp, Zan Gojcic, et al. Lyra 2.0: Explorable generative 3d worlds.arXiv preprint arXiv:2604.13036, 2026. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splat- ting for efficient 3d content creation.arXiv preprint arXiv:2309.16653, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Roboarmgs: High-quality robotic arm splatting via b \’ezier curve refinement.arXiv preprint arXiv:2511.17961, 2025

Hao Wang, Xiaobao Wei, Ying Li, Qingpo Wuwu, Dongli Wu, Jiajun Cao, Ming Lu, Wenzhao Zheng, and Shanghang Zhang. Roboarmgs: High-quality robotic arm splatting via b \’ezier curve refinement.arXiv preprint arXiv:2511.17961, 2025. 3

work page arXiv 2025
[28]

Embodiedocc++: Boosting embod- ied 3d occupancy prediction with plane regularization and uncertainty sampler

Hao Wang, Xiaobao Wei, Xiaoan Zhang, Jianing Li, Chengyu Bai, Ying Li, Ming Lu, Wenzhao Zheng, and Shanghang Zhang. Embodiedocc++: Boosting embod- ied 3d occupancy prediction with plane regularization and uncertainty sampler. InProceedings of the 33rd ACM International Conference on Multimedia, pages 925–934, 2025. 3

work page 2025
[29]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, An- drea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InPro- ceedings of the Computer Vision and Pattern Recog- nition Conference, pages 5294–5306, 2025. 2, 3, 6, 12

work page 2025
[30]

Chronotailor: Harnessing at- tention guidance for fine-grained video virtual try-on

Jinjuan Wang, Wenzhang Sun, Ming Li, Yun Zheng, Fanyao Li, Zhulin Tao, Donglin Di, Hao Li, Wei Chen, and Xianglin Huang. Chronotailor: Harnessing at- tention guidance for fine-grained video virtual try-on. arXiv preprint arXiv:2506.05858, 2025. 6

work page arXiv 2025
[31]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024. 3

work page 2024
[32]

Image quality assessment: from er- ror visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from er- ror visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004. 6

work page 2004
[33]

Re- confusion: 3d reconstruction with diffusion priors

Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P Srinivasan, Dor Verbin, Jonathan T Barron, Ben Poole, et al. Re- confusion: 3d reconstruction with diffusion priors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21551–21561,

work page
[34]

Bungeenerf: Progressive neural radiance field for extreme multi-scale scene rendering

Yuanbo Xiangli, Linning Xu, Xingang Pan, Nanx- uan Zhao, Anyi Rao, Christian Theobalt, Bo Dai, and Dahua Lin. Bungeenerf: Progressive neural radiance field for extreme multi-scale scene rendering. InEuro- pean conference on computer vision, pages 106–122. Springer, 2022. 2, 7

work page 2022
[35]

Citydreamer: Compositional generative model of unbounded 3d cities

Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, and Ziwei Liu. Citydreamer: Compositional generative model of unbounded 3d cities. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9666–9675, 2024. 3, 5

work page 2024
[36]

Focus: Unified vision-language mod- eling for interactive editing driven by referential seg- mentation.arXiv preprint arXiv:2506.16806, 2025

Fan Yang, Yousong Zhu, Xin Li, Yufei Zhan, Hongyin Zhao, Shurong Zheng, Yaowei Wang, Ming Tang, and Jinqiao Wang. Focus: Unified vision-language mod- eling for interactive editing driven by referential seg- mentation.arXiv preprint arXiv:2506.16806, 2025. 6

work page arXiv 2025
[37]

Blended- mvs: A large-scale dataset for generalized multi-view stereo networks

Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blended- mvs: A large-scale dataset for generalized multi-view stereo networks. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 1790–1799, 2020. 5

[38] Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images. arXiv preprint arXiv:2410.24207, 2024. 3, 6

[39] Fei Yu, Yu Liu, Luyang Tang, Mingchao Sun, Zengye Ge, Rui Bu, Yuchao Jin, Haisen Zhao, He Sun, Yangyan Li, et al. From orbit to ground: Generative city photogrammetry from extreme off-nadir satellite images. arXiv preprint arXiv:2512.07527, 2025. 2, 3

[40] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024. 6

[41] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 6

[42] Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21936–21947, 2025. 3, 6

A. Detailed progressive training ...

