arxiv: 2411.13549 · v2 · submitted 2024-11-20 · 💻 cs.CV

KFC-W: Generating 3D-Consistent Videos from Unposed Internet Photos

Pith reviewed 2026-05-23 16:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D-consistent video generation · self-supervised learning · unposed photos · multiview consistency · video interpolation · camera control · 3D Gaussian Splatting

The pith

A self-supervised model learns to generate 3D-consistent videos from unposed internet photos without any camera parameters or 3D labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to train a video model that can take a few random photos of a scene and produce smooth interpolations that respect the underlying 3D geometry. It does this by combining the natural frame-to-frame consistency found in ordinary videos with the different viewpoints present in unposed multiview internet photos. The training uses no explicit 3D supervision, camera poses, or depth maps. If the approach holds, it demonstrates that large-scale 3D scene understanding can be extracted from everyday 2D data sources alone. The resulting model outperforms prior video generators on measures of geometric and appearance consistency and improves downstream tasks that require camera control.

Core claim

The central claim is that a scalable 3D-aware video model can be trained in a self-supervised manner by exploiting video consistency together with the viewpoint variability of unposed multiview internet photos, without requiring any 3D annotations such as camera parameters, and that this model produces superior geometric and appearance consistency compared with existing video baselines while also benefiting camera-controlled applications such as 3D Gaussian Splatting.

What carries the argument

The self-supervised training procedure that pairs video-frame consistency with multiview photo variability to induce implicit 3D geometric understanding.

If this is right

  • Random internet photos can serve as keyframes for video generation that respects scene layout and identity.
  • The model supports explicit camera control in tasks such as 3D Gaussian Splatting.
  • Scene-level 3D learning becomes feasible at scale using only ordinary 2D video and photo collections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Large uncurated photo collections could replace curated 3D datasets for training video models.
  • The same consistency signal might extend to learning other 3D properties such as lighting or material appearance.
  • Failure modes on highly dynamic or non-rigid scenes would reveal limits of the implicit-geometry approach.

Load-bearing premise

Natural consistency across video frames plus viewpoint differences in unposed photos are enough to produce genuine 3D geometric understanding in the model.

What would settle it

Generated videos that show clear changes in object shape, size, or relative position when the camera path is interpolated between input views.
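
One way to probe this test numerically is a two-view epipolar check, sketched below under stated assumptions. It is not the paper's evaluation protocol. It assumes point correspondences between two generated frames are already available (for example from a feature matcher, which is not shown), fits a fundamental matrix with the normalized eight-point algorithm, and reports the Sampson error. A static, rigid scene seen from a moving camera should give a small residual, while objects that change shape or relative position should not. The function names are hypothetical.

```python
import numpy as np

def _normalize(pts):
    """Hartley normalization: zero mean, mean distance sqrt(2) from origin."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2.0) / np.linalg.norm(pts - mean, axis=1).mean()
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    homog = np.column_stack([pts, np.ones(len(pts))])
    return homog @ T.T, T

def fit_fundamental(p1, p2):
    """Normalized eight-point estimate of F with x2^T F x1 = 0.

    p1, p2: (N, 2) arrays of matched pixel coordinates, N >= 8.
    """
    x1, T1 = _normalize(p1)
    x2, T2 = _normalize(p2)
    # Each row is the flattened outer product of x2 and x1, so that
    # row . vec(F) = x2^T F x1 with F flattened row-major.
    A = np.einsum("ni,nj->nij", x2, x1).reshape(len(p1), 9)
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce rank 2 by zeroing the smallest singular value.
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0
    F = U @ np.diag(S) @ Vt
    # Undo the normalization.
    return T2.T @ F @ T1

def sampson_error(F, p1, p2):
    """First-order geometric error of each correspondence, in pixels squared."""
    x1 = np.column_stack([p1, np.ones(len(p1))])
    x2 = np.column_stack([p2, np.ones(len(p2))])
    Fx1 = x1 @ F.T    # rows hold F x1
    Ftx2 = x2 @ F     # rows hold F^T x2
    num = np.sum(x2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
    return num / den
```

Calling `fit_fundamental(p1, p2)` on matches from two frames and then averaging `sampson_error(F, p1, p2)` gives one scalar per frame pair. The check is degenerate for pure camera rotation or a planar scene, and in practice it would need robust fitting (for example RANSAC) to tolerate bad matches.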

Figures

Figures reproduced from arXiv: 2411.13549 by Bharath Hariharan, Fujun Luan, Gene Chou, Hao Tan, Kai Zhang, Noah Snavely, Sai Bi, Zexiang Xu.

Figure 1: Given n unposed input keyframes, the goal is to generate a video of the scene with a realistic camera trajectory and consistent geometry. From top to bottom: Ours, Luma Dream Machine [43] (a commercial video generation model), FILM [51] (a frame interpolation method). Luma hallucinates new buildings (left scene) and statues (right scene) without understanding the scene layout. FILM is unable to handle wide…

Figure 2: Training objectives. Left: Multiview inpainting. We provide n condition images and one target image to a diffusion model. We add noise to 80% of the target following the diffusion process. The condition images and remaining 20% of the target are kept clean. Note how some regions in the target are not seen in the conditions. The model learns priors such as symmetry to generate a plausible image. Right: View…

Figure 3: Multiview inpainting of internet photos and view interpolation of videos can be unified under the same denoising objective.

Figure 4: Top two rows: We control illumination by conditioning on…

Figure 5: Example scene from our user study interface. We pro…

Figure 6: From top to bottom: Ours (Full), Luma, Ours (Video…

Figure 7: We run an ablation “Long-Video” with only the view…

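
The multiview inpainting setup described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation. Only the 80%/20% split and the clean condition images come from the caption. The patch-wise masking, the epsilon-prediction loss restricted to the noised region, the `alpha_bar_t` noise-schedule value, and all function names are assumptions made for the example.

```python
import numpy as np

def make_inpainting_example(conditions, target, alpha_bar_t,
                            patch=8, noise_frac=0.8, rng=None):
    """Build one training example: clean conditions plus a partly noised target.

    conditions: (n, H, W, C) condition images, returned untouched.
    target:     (H, W, C) target image; H and W are assumed divisible by `patch`.
    alpha_bar_t: cumulative noise-schedule term in (0, 1) for the sampled step.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W, _ = target.shape
    gh, gw = H // patch, W // patch
    num_patches = gh * gw
    num_noised = int(round(noise_frac * num_patches))

    # Pick which patches of the target receive noise (assumed patch-wise).
    flat = np.zeros(num_patches, dtype=bool)
    flat[rng.choice(num_patches, size=num_noised, replace=False)] = True
    grid = flat.reshape(gh, gw)
    mask = np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)[..., None]

    # Forward diffusion on the selected region only; the rest stays clean.
    eps = rng.standard_normal(target.shape)
    noised = np.sqrt(alpha_bar_t) * target + np.sqrt(1.0 - alpha_bar_t) * eps
    noisy_target = np.where(mask, noised, target)
    return conditions, noisy_target, mask, eps

def masked_denoising_loss(eps_pred, eps, mask):
    """Mean squared error on the noised region only (an assumed loss form)."""
    m = np.broadcast_to(mask, eps.shape)
    return float(((eps_pred - eps) ** 2)[m].mean())
```

A denoiser (not shown) would take `conditions` and `noisy_target` and predict `eps`; `masked_denoising_loss` then scores that prediction only where noise was added.
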
Original abstract

We address the problem of generating videos from unposed internet photos. A handful of input images serve as keyframes, and our model interpolates between them to simulate a path moving between the cameras. Given random images, a model's ability to capture underlying geometry, recognize scene identity, and relate frames in terms of camera position and orientation reflects a fundamental understanding of 3D structure and scene layout. However, existing video models such as Luma Dream Machine fail at this task. We design a self-supervised method that takes advantage of the consistency of videos and variability of multiview internet photos to train a scalable, 3D-aware video model without any 3D annotations such as camera parameters. We validate that our method outperforms all baselines in terms of geometric and appearance consistency. We also show our model benefits applications that enable camera control, such as 3D Gaussian Splatting. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce a self-supervised method called KFC-W that generates 3D-consistent videos from unposed internet photos by leveraging video consistency and multiview photo variability. It trains a scalable 3D-aware video model without any 3D annotations such as camera parameters. The method is validated to outperform all baselines in geometric and appearance consistency and is shown to benefit applications like camera control in 3D Gaussian Splatting, suggesting that scene-level 3D learning can be scaled using only 2D data.

Significance. Should the results hold, this would be a significant contribution to computer vision by showing that 3D geometric understanding can emerge from self-supervision on 2D data sources alone, potentially reducing the need for 3D annotations and enabling more accessible training of 3D-aware generative models.

major comments (2)
  1. [Abstract] Abstract: The central empirical claim that the method 'outperforms all baselines in terms of geometric and appearance consistency' provides no quantitative metrics, ablation studies, or description of how geometric consistency was measured or what the baselines were, rendering the validation statement unverifiable.
  2. [Method] Method section: No analysis or experiments are described that distinguish learning of genuine 3D geometry (e.g., consistent depth ordering or camera trajectories in 3D space) from 2D temporal interpolation or appearance matching, which is required to support the '3D-aware' claim in the absence of camera parameters or explicit 3D losses.
minor comments (2)
  1. [Abstract] Abstract: The mention of Luma Dream Machine as a failing example should be expanded to list all baselines used in the reported comparisons.
  2. [Method] Throughout: Ensure any self-supervised loss functions or training objectives are explicitly formulated with equations for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and will revise the manuscript accordingly to strengthen the presentation of results and supporting analysis.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central empirical claim that the method 'outperforms all baselines in terms of geometric and appearance consistency' provides no quantitative metrics, ablation studies, or description of how geometric consistency was measured or what the baselines were, rendering the validation statement unverifiable.

    Authors: We agree that the abstract, as a high-level summary, does not include the quantitative details. The full manuscript reports these in the experiments section, including specific metrics, baseline descriptions, and evaluation protocols for geometric consistency. To address the concern directly, we will revise the abstract to incorporate key quantitative results and a concise note on measurement methodology. revision: yes

  2. Referee: [Method] Method section: No analysis or experiments are described that distinguish learning of genuine 3D geometry (e.g., consistent depth ordering or camera trajectories in 3D space) from 2D temporal interpolation or appearance matching, which is required to support the '3D-aware' claim in the absence of camera parameters or explicit 3D losses.

    Authors: We acknowledge that additional targeted analysis would better isolate 3D geometry learning from 2D effects. While existing results on downstream tasks like camera-controlled 3D Gaussian Splatting provide indirect support, we will add explicit experiments in the revised manuscript, such as depth ordering visualizations, trajectory consistency tests, and comparisons against purely 2D baselines, to strengthen the '3D-aware' claim. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical self-supervised training validated against external baselines

Full rationale

The paper presents a self-supervised training procedure that leverages video consistency and multiview photo variability to produce 3D-aware video generation without camera parameters or 3D losses. No derivation chain, fitted parameter renamed as prediction, or self-citation load-bearing step is described in the abstract or claimed method. The central claim is an empirical outperformance on geometric and appearance consistency metrics against external baselines, which is falsifiable outside the training objective itself. No equations or uniqueness theorems are invoked that reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unstated premise that 2D consistency signals alone suffice for 3D geometry learning.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · cited by 1 Pith paper · 15 internal anchors

  [1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Apostol (Paul) Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv:1609.08675.
  [2] Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, and Inbar Mosseri. Lumiere: A space-time diffusion model for video generation. In Proceedings of the 41st International Conference on Machine Learning, 2024.
  [3] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields, 2021.
  [4] Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Zip-nerf: Anti-aliased grid-based neural radiance fields. ICCV, 2023.
  [5] Wenjing Bian, Zirui Wang, Kejie Li, Jiawang Bian, and Victor Adrian Prisacariu. Nope-nerf: Optimising neural radiance field with no pose prior. In CVPR, 2023.
  [6] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
  [7] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators, 2024.
  [8] Shengqu Cai, Eric Ryan Chan, Songyou Peng, Mohamad Shahbazi, Anton Obukhov, Luc Van Gool, and Gordon Wetzstein. Diffdreamer: Towards consistent unsupervised single-view scene extrapolation with conditional diffusion models. In ICCV, 2023.
  [9] Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.
  [10] Zilong Chen, Yikai Wang, Feng Wang, Zhengyi Wang, and Huaping Liu. V3d: Video diffusion models are effective 3d generators. arXiv preprint arXiv:2403.06738, 2024.
  [11] Zezhou Cheng, Carlos Esteves, Varun Jampani, Abhishek Kar, Subhransu Maji, and Ameesh Makadia. Lu-nerf: Scene and pose estimation by synchronizing local unposed nerfs. In CVPR, 2023.
  [12] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.
  [13] Duolikun Danier, Fan Zhang, and David Bull. LDMVFI: Video frame interpolation with latent diffusion models. In AAAI, 2024.
  [14] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023.
  [15] Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R. Qi, Yin Zhou, Zoey Yang, Aurélien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasudevan, Alexander McCauley, Jonathon Shlens, and Dragomir Anguelov. Large scale interactive motion forecasting for autonomous driving: The waymo open motion da...
  [16] Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, Zhangyang Wang, and Yue Wang. Instantsplat: Unbounded sparse-view pose-free gaussian splatting in 40 seconds, 2024.
  [17] Haiwen Feng, Zheng Ding, Zhihao Xia, Simon Niklaus, Victoria Abrevaya, Michael J. Black, and Zhang Xuaner. Explorative in-betweening of time and space. In European Conference on Computer Vision, 2024.
  [18] Yang Fu, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A. Efros, and Xiaolong Wang. Colmap-free 3d gaussian splatting. In CVPR, 2024.
  [19] Jeong gi Kwak, Erqun Dong, Yuhe Jin, Hanseok Ko, Shweta Mahajan, and Kwang Moo Yi. Vivid-1-to-3: Novel view synthesis with video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6775–6785, 2024.
  [20] Rohit Girdhar, Mannat Singh, Andrew Brown, et al. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
  [21] Junlin Han, Filippos Kokkinos, and Philip Torr. Vfusion3d: Learning scalable 3d generative models from video diffusion models. European Conference on Computer Vision (ECCV).
  [22] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv:2111.06377, 2021.
  [23] Peter Hedman, Pratul P. Srinivasan, Ben Mildenhall, Jonathan T. Barron, and Paul Debevec. Baking neural radiance fields for real-time view synthesis. ICCV, 2021.
  [24] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239.
  [25] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, and David J. Fleet. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  [26] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868.
  [27] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. In ICLR, 2024.
  [28] Siddhant Jain, Daniel Watson, Eric Tabellion, Ben Poole, Janne Kontkanen, et al. Video interpolation with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7341–7351.
  [29] Kaiwen Jiang, Yang Fu, Mukund Varma T, Yash Belhe, Xiaolong Wang, Hao Su, and Ravi Ramamoorthi. A construct-optimize approach to sparse view synthesis without camera pose. SIGGRAPH, 2024.
  [30] Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias, 2024.
  [31] Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image Matching across Wide Baselines: From Paper to Practice. International Journal of Computer Vision, 2020.
  [32] Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling, 2024.
  [33] Tarun Kalluri, Deepak Pathak, Manmohan Chandraker, and Du Tran. Flavr: Flow-agnostic video representations for fast frame interpolation. In WACV, 2023.
  [34] Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model? – a physical law perspective.
  [35] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
  [36] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  [37] Jonas Kulhanek, Songyou Peng, Zuzana Kukelova, Marc Pollefeys, and Torsten Sattler. Wildgaussians: 3d gaussian splatting in the wild. arXiv, 2024.
  [38] Pika Labs. Pika labs: Ai video generation platform. https://pika.art/, 2024. Accessed: 2024-11-10.
  [39] Zhengqi Li, Wenqi Xian, Abe Davis, and Noah Snavely. Crowdsampling the plenoptic function. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 178–196. Springer, 2020.
  [40] Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. arXiv preprint arXiv:2312.16256, 2023.
  [41] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023.
  [42] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023.
  [43] LUMA. Luma dream machine, 2024.
  [44] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7210–7219, 2021.
  [45] Andreas Meuleman, Yu-Lun Liu, Chen Gao, Jia-Bin Huang, Changil Kim, Min H. Kim, and Johannes Kopf. Progressively optimized local radiance fields for robust view synthesis. In CVPR, 2023.
  [46] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. ICCV, 2021.
  [47] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, 2022.
  [48] William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022.
  [49] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. Film: Visual reasoning with a general conditioning layer. In AAAI, 2018.
  [50] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  [51] Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, and Brian Curless. Film: Frame interpolation for large motion. In European Conference on Computer Vision (ECCV), 2022.
  [52] Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger. Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps. In CVPR, 2021.
  [53] Xuanchi Ren and Xiaolong Wang. Look outside the room: Synthesizing a consistent long-term 3d scene video from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  [54] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
  [55] Runway AI, Inc. Introducing gen-3 alpha: A new frontier for video generation. https://runwayml.com/research/introducing-gen-3-alpha, 2024. Accessed: 2024-11-10.
  [56] Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry Lagun, Li Fei-Fei, Deqing Sun, and Jiajun Wu. ZeroNVS: Zero-shot 360-degree view synthesis from a single real image. In CVPR, 2024.
  [57] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  [58] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  [59] Junyoung Seo, Kazumi Fukuda, Takashi Shibuya, Takuya Narihira, Naoki Murata, Shoukang Hu, Chieh-Hsin Lai, Seungryong Kim, and Yuki Mitsufuji. Genwarp: Single image to novel views with semantic-preserving generative warping. arXiv preprint arXiv:2405.17251, 2024.
  [60] Uriel Singer, Adam Polyak, Thomas Hayes, Dafna Shaham, Chitwan Saharia, William Chan, and Mohammad Norouzi. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
  [61] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. Seminal Graphics Papers: Pushing the Boundaries, Volume 2, 2006.
  [62] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, 2020.
  [63] Jiaming Sun, Xi Chen, Qianqian Wang, Zhengqi Li, Hadar Averbuch-Elor, Xiaowei Zhou, and Noah Snavely. Neural 3d reconstruction in the wild. In ACM SIGGRAPH 2022 conference proceedings, pages 1–9, 2022.
  [64] The Movie Gen team @ Meta. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.
  [65] Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, and Johannes Kopf. Consistent view synthesis with pose-guided diffusion models. In CVPR, 2023.
  [66] Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. Megascenes: Scene-level view synthesis at scale. In ECCV, 2024.
  [67] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.
  [68] Dor Verbin, Peter Hedman, Ben Mildenhall, Todd Zickler, Jonathan T. Barron, and Pratul P. Srinivasan. Ref-NeRF: Structured view-dependent appearance for neural radiance fields. In CVPR, 2022.
  [69] Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, and Kai Zhang. Pf-lrm: Pose-free large reconstruction model for joint pose and shape prediction. arXiv preprint arXiv:2311.12024.
  [70] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024.
  [71] Xiaojuan Wang, Boyang Zhou, Brian Curless, Ira Kemelmacher, Aleksander Holynski, and Steve Seitz. Generative inbetweening: Adapting image-to-video models for keyframe interpolation. arXiv preprint arXiv:2408.15239, 2024.
  [72] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  [73] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. Nerf--: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064.
  [74] Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. arXiv preprint arXiv:2210.04628, 2022.
  [75] Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, and David J Fleet. Controlling space and time with diffusion models. arXiv preprint arXiv:2407.07860, 2024.
  [76] Xinyue Wei, Kai Zhang, Sai Bi, Hao Tan, Fujun Luan, Valentin Deschaintre, Kalyan Sunkavalli, Hao Su, and Zexiang Xu. Meshlrm: Large reconstruction model for high-quality mesh. arXiv preprint arXiv:2404.12385, 2024.
  [77] Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy, Yohann Cabon, Vaibhav Arora, Romain Brégier, Gabriela Csurka, Leonid Antsfeld, Boris Chidlovskii, and Jérôme Revaud. CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow. In ICCV, 2023.
  [78] Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Romain Brégier, Yohann Cabon, Vaibhav Arora, Leonid Antsfeld, Boris Chidlovskii, Gabriela Csurka, and Jérôme Revaud. CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion. In NeurIPS, 2022.
  [79] Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, Chong Luo, Yueyi Zhang, and Zhiwei Xiong. Art•v: Auto-regressive text-to-video generation with diffusion models. arXiv preprint arXiv:2311.18834, 2023.
  [80] Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P Srinivasan, Dor Verbin, Jonathan T Barron, Ben Poole, et al. Reconfusion: 3d reconstruction with diffusion priors. arXiv preprint arXiv:2312.02981, 2023.

Showing first 80 references.