KFC-W: Generating 3D-Consistent Videos from Unposed Internet Photos

Bharath Hariharan; Fujun Luan; Gene Chou; Hao Tan; Kai Zhang; Noah Snavely; Sai Bi; Zexiang Xu

arxiv: 2411.13549 · v2 · submitted 2024-11-20 · 💻 cs.CV

Gene Chou, Kai Zhang, Sai Bi, Hao Tan, Zexiang Xu, Fujun Luan, Bharath Hariharan, Noah Snavely

Pith reviewed 2026-05-23 16:44 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D-consistent video generation · self-supervised learning · unposed photos · multiview consistency · video interpolation · camera control · 3D Gaussian Splatting

The pith

A self-supervised model learns to generate 3D-consistent videos from unposed internet photos without any camera parameters or 3D labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to train a video model that can take a few random photos of a scene and produce smooth interpolations that respect the underlying 3D geometry. It does this by combining the natural frame-to-frame consistency found in ordinary videos with the different viewpoints present in unposed multiview internet photos. The training uses no explicit 3D supervision, camera poses, or depth maps. If the approach holds, it demonstrates that large-scale 3D scene understanding can be extracted from everyday 2D data sources alone. The resulting model outperforms prior video generators on measures of geometric and appearance consistency and improves downstream tasks that require camera control.

Core claim

The central claim is that a scalable 3D-aware video model can be trained in a self-supervised manner by exploiting video consistency together with the viewpoint variability of unposed multiview internet photos, without requiring any 3D annotations such as camera parameters, and that this model produces superior geometric and appearance consistency compared with existing video baselines while also benefiting camera-controlled applications such as 3D Gaussian Splatting.

What carries the argument

The self-supervised training procedure that pairs video-frame consistency with multiview photo variability to induce implicit 3D geometric understanding.

If this is right

Random internet photos can serve as keyframes for video generation that respects scene layout and identity.
The model supports explicit camera control in tasks such as 3D Gaussian Splatting.
Scene-level 3D learning becomes feasible at scale using only ordinary 2D video and photo collections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Large uncurated photo collections could replace curated 3D datasets for training video models.
The same consistency signal might extend to learning other 3D properties such as lighting or material appearance.
Failure modes on highly dynamic or non-rigid scenes would reveal limits of the implicit-geometry approach.

Load-bearing premise

Natural consistency across video frames plus viewpoint differences in unposed photos are enough to produce genuine 3D geometric understanding in the model.

What would settle it

Generated videos that show clear changes in object shape, size, or relative position when the camera path is interpolated between input views.

Figures

Figures reproduced from arXiv: 2411.13549 by Bharath Hariharan, Fujun Luan, Gene Chou, Hao Tan, Kai Zhang, Noah Snavely, Sai Bi, Zexiang Xu.

**Figure 1.** Given n unposed input keyframes, the goal is to generate a video of the scene with a realistic camera trajectory and consistent geometry. From top to bottom: Ours, Luma Dream Machine [43] (a commercial video generation model), FILM [51] (a frame interpolation method). Luma hallucinates new buildings (left scene) and statues (right scene) without understanding the scene layout. FILM is unable to handle wide…

**Figure 2.** Training objectives. Left: Multiview inpainting. We provide n condition images and one target image to a diffusion model. We add noise to 80% of the target following the diffusion process. The condition images and remaining 20% of the target are kept clean. Note how some regions in the target are not seen in the conditions. The model learns priors such as symmetry to generate a plausible image. Right: View…

**Figure 3.** Multiview inpainting of internet photos and view interpolation of videos can be unified under the same denoising objective.

**Figure 4.** Top two rows: We control illumination by conditioning on…

**Figure 5.** Example scene from our user study interface. We pro…

**Figure 6.** From top to bottom: Ours (Full), Luma, Ours (Video…

**Figure 7.** We run an ablation “Long-Video” with only the view…
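As a rough illustration of the multiview-inpainting input described in the Figure 2 caption, the sketch below noises a random 80% of a target's tokens and leaves the remaining 20% clean. The flat list-of-token-vectors representation, the function name `noise_target_tokens`, the `alpha_bar` noise level, and the uniformly random choice of which tokens to noise are all assumptions made for illustration; the caption does not say how the paper partitions the target or schedules the noise.

```python
import math
import random


def noise_target_tokens(tokens, noise_frac=0.8, alpha_bar=0.5, seed=None):
    """Noise a random fraction of target tokens; keep the rest clean.

    tokens: list of token vectors (each a list of floats). Hypothetical layout.
    noise_frac: fraction of tokens to noise (0.8 per the Figure 2 caption).
    alpha_bar: assumed cumulative signal level of a standard diffusion
        forward step, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps.
    Returns (noisy_tokens, mask), where mask[i] is True if token i was noised.
    """
    rng = random.Random(seed)
    n = len(tokens)
    n_noised = round(noise_frac * n)
    noised_idx = set(rng.sample(range(n), n_noised))

    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)

    noisy_tokens = []
    for i, tok in enumerate(tokens):
        if i in noised_idx:
            # Forward-diffuse this token with fresh Gaussian noise.
            noisy_tokens.append([signal * x + noise * rng.gauss(0.0, 1.0) for x in tok])
        else:
            # Clean tokens pass through unchanged, like the condition images.
            noisy_tokens.append(list(tok))

    mask = [i in noised_idx for i in range(n)]
    return noisy_tokens, mask


# Toy usage: 20 two-dimensional "tokens" standing in for a target image.
target = [[float(i), float(i) + 0.5] for i in range(20)]
noisy, mask = noise_target_tokens(target, seed=0)
print(sum(mask), "of", len(mask), "tokens noised")
```

In a full training setup the condition images would be concatenated with `noisy` untouched, and the denoiser would be trained to recover the noised tokens. That wiring is omitted here because the page does not describe it.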

Original abstract

We address the problem of generating videos from unposed internet photos. A handful of input images serve as keyframes, and our model interpolates between them to simulate a path moving between the cameras. Given random images, a model's ability to capture underlying geometry, recognize scene identity, and relate frames in terms of camera position and orientation reflects a fundamental understanding of 3D structure and scene layout. However, existing video models such as Luma Dream Machine fail at this task. We design a self-supervised method that takes advantage of the consistency of videos and variability of multiview internet photos to train a scalable, 3D-aware video model without any 3D annotations such as camera parameters. We validate that our method outperforms all baselines in terms of geometric and appearance consistency. We also show our model benefits applications that enable camera control, such as 3D Gaussian Splatting. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The self-supervised mix of video consistency and unposed photo variability is the real contribution, but the abstract gives almost no evidence that it produces actual 3D geometry rather than 2D coherence.

The one thing to take away is that the paper describes a training procedure that tries to get 3D-aware video interpolation from nothing but ordinary videos and internet photo collections, with no camera parameters or 3D labels at all. The claim is that temporal consistency inside videos plus viewpoint spread across photos is enough to induce usable geometry, and that this beats existing video models on consistency metrics while also feeding into downstream tasks like 3D Gaussian Splatting with camera control. That combination of data sources for self-supervision looks like the concrete new piece. The framing around scaling scene-level 3D work with cheap 2D data is also reasonable and points to a practical direction. The paper does a clean job stating the problem as testing whether a model can relate frames by camera position without explicit pose supervision.

On the weak side, the abstract supplies no numbers, no ablation results, and no description of how geometric consistency was actually measured or against which baselines. That leaves the central result resting on an unverified statement. The stress-test worry about the model learning 2D trajectory interpolation and appearance matching instead of scene layout or depth ordering is still live until the experiments show controlled checks like novel-view depth accuracy or consistency on held-out camera paths that are not in the training videos. If those tests are missing or weak, the geometric claim does not land.

This is for groups working on video generation or neural rendering who are already trying to reduce reliance on posed or annotated 3D data. A reader in that space would get value from the training recipe if the quantitative support holds up in the full paper. I would send it to peer review because the problem is well-posed and the self-supervised angle is worth referee scrutiny, even though the current evidence level is too low to judge the result yet.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce a self-supervised method called KFC-W that generates 3D-consistent videos from unposed internet photos by leveraging video consistency and multiview photo variability. It trains a scalable 3D-aware video model without any 3D annotations such as camera parameters. The method is validated to outperform all baselines in geometric and appearance consistency and is shown to benefit applications like camera control in 3D Gaussian Splatting, suggesting that scene-level 3D learning can be scaled using only 2D data.

Significance. Should the results hold, this would be a significant contribution to computer vision by showing that 3D geometric understanding can emerge from self-supervision on 2D data sources alone, potentially reducing the need for 3D annotations and enabling more accessible training of 3D-aware generative models.

major comments (2)

[Abstract] The central empirical claim that the method 'outperforms all baselines in terms of geometric and appearance consistency' provides no quantitative metrics, ablation studies, or description of how geometric consistency was measured or what the baselines were, rendering the validation statement unverifiable.
[Method] No analysis or experiments are described that distinguish learning of genuine 3D geometry (e.g., consistent depth ordering or camera trajectories in 3D space) from 2D temporal interpolation or appearance matching, which is required to support the '3D-aware' claim in the absence of camera parameters or explicit 3D losses.

minor comments (2)

[Abstract] The mention of Luma Dream Machine as a failing example should be expanded to list all baselines used in the reported comparisons.
[Method] Throughout: Ensure any self-supervised loss functions or training objectives are explicitly formulated with equations for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and will revise the manuscript accordingly to strengthen the presentation of results and supporting analysis.

Referee: [Abstract] The central empirical claim that the method 'outperforms all baselines in terms of geometric and appearance consistency' provides no quantitative metrics, ablation studies, or description of how geometric consistency was measured or what the baselines were, rendering the validation statement unverifiable.

Authors: We agree that the abstract, as a high-level summary, does not include the quantitative details. The full manuscript reports these in the experiments section, including specific metrics, baseline descriptions, and evaluation protocols for geometric consistency. To address the concern directly, we will revise the abstract to incorporate key quantitative results and a concise note on measurement methodology.

revision: yes

Referee: [Method] No analysis or experiments are described that distinguish learning of genuine 3D geometry (e.g., consistent depth ordering or camera trajectories in 3D space) from 2D temporal interpolation or appearance matching, which is required to support the '3D-aware' claim in the absence of camera parameters or explicit 3D losses.

Authors: We acknowledge that additional targeted analysis would better isolate 3D geometry learning from 2D effects. While existing results on downstream tasks like camera-controlled 3D Gaussian Splatting provide indirect support, we will add explicit experiments in the revised manuscript, such as depth ordering visualizations, trajectory consistency tests, and comparisons against purely 2D baselines, to strengthen the '3D-aware' claim.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical self-supervised training validated against external baselines

The paper presents a self-supervised training procedure that leverages video consistency and multiview photo variability to produce 3D-aware video generation without camera parameters or 3D losses. No derivation chain, fitted parameter renamed as prediction, or self-citation load-bearing step is described in the abstract or claimed method. The central claim is an empirical outperformance on geometric and appearance consistency metrics against external baselines, which is falsifiable outside the training objective itself. No equations or uniqueness theorems are invoked that reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unstated premise that 2D consistency signals alone suffice for 3D geometry learning.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
cs.CV · 2026-04 · unverdicted · novelty 6.0

CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · cited by 1 Pith paper · 15 internal anchors

[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Apostol (Paul) Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. In arXiv:1609.08675,
[2] Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, and Inbar Mosseri. Lumiere: A space-time diffusion model for video generation. In Proceedings of the 41st International Conference on Machine Learning, 2024.
[3] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields, 2021.
[4] Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Zip-nerf: Anti-aliased grid-based neural radiance fields. ICCV, 2023.
[5] Wenjing Bian, Zirui Wang, Kejie Li, Jiawang Bian, and Victor Adrian Prisacariu. Nope-nerf: Optimising neural radiance field with no pose prior. In CVPR, 2023.
[6] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[7] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators, 2024.
[8] Shengqu Cai, Eric Ryan Chan, Songyou Peng, Mohamad Shahbazi, Anton Obukhov, Luc Van Gool, and Gordon Wetzstein. Diffdreamer: Towards consistent unsupervised single-view scene extrapolation with conditional diffusion models. In ICCV, 2023.
[9] Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion,
[10] Zilong Chen, Yikai Wang, Feng Wang, Zhengyi Wang, and Huaping Liu. V3d: Video diffusion models are effective 3d generators. arXiv preprint arXiv:2403.06738, 2024.
[11] Zezhou Cheng, Carlos Esteves, Varun Jampani, Abhishek Kar, Subhransu Maji, and Ameesh Makadia. Lu-nerf: Scene and pose estimation by synchronizing local unposed nerfs. In CVPR, 2023.
[12] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017.
[13] Duolikun Danier, Fan Zhang, and David Bull. LDMVFI: Video frame interpolation with latent diffusion models. In AAAI, 2024.
[14] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023.
[15] Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R. Qi, Yin Zhou, Zoey Yang, Aurélien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasudevan, Alexander McCauley, Jonathon Shlens, and Dragomir Anguelov. Large scale interactive motion forecasting for autonomous driving: The waymo open motion da...
[16] Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, Zhangyang Wang, and Yue Wang. Instantsplat: Unbounded sparse-view pose-free gaussian splatting in 40 seconds, 2024.
[17] Haiwen Feng, Zheng Ding, Zhihao Xia, Simon Niklaus, Victoria Abrevaya, Michael J. Black, and Zhang Xuaner. Explorative in-betweening of time and space. In European Conference on Computer Vision, 2024.
[18] Yang Fu, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A. Efros, and Xiaolong Wang. Colmap-free 3d gaussian splatting. In CVPR, 2024.
[19] Jeong gi Kwak, Erqun Dong, Yuhe Jin, Hanseok Ko, Shweta Mahajan, and Kwang Moo Yi. Vivid-1-to-3: Novel view synthesis with video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6775–6785, 2024.
[20] Rohit Girdhar, Mannat Singh, Andrew Brown, et al. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
[21] Junlin Han, Filippos Kokkinos, and Philip Torr. Vfusion3d: Learning scalable 3d generative models from video diffusion models. European Conference on Computer Vision (ECCV),
[22] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv:2111.06377, 2021.
[23] Peter Hedman, Pratul P. Srinivasan, Ben Mildenhall, Jonathan T. Barron, and Paul Debevec. Baking neural radiance fields for real-time view synthesis. ICCV, 2021.
[24] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239,
[25] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, and David J. Fleet. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[26] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868,
[27] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. In ICLR, 2024.
[28] Siddhant Jain, Daniel Watson, Eric Tabellion, Ben Poole, Janne Kontkanen, et al. Video interpolation with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7341–7351,
[29] Kaiwen Jiang, Yang Fu, Mukund Varma T, Yash Belhe, Xiaolong Wang, Hao Su, and Ravi Ramamoorthi. A construct-optimize approach to sparse view synthesis without camera pose. SIGGRAPH, 2024.
[30] Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias, 2024.
[31] Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image Matching across Wide Baselines: From Paper to Practice. International Journal of Computer Vision, 2020.
[32] Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling, 2024.
[33] Tarun Kalluri, Deepak Pathak, Manmohan Chandraker, and Du Tran. Flavr: Flow-agnostic video representations for fast frame interpolation. In WACV, 2023.
[34] Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model? – a physical law perspective,
[35] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
[36] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[37] Jonas Kulhanek, Songyou Peng, Zuzana Kukelova, Marc Pollefeys, and Torsten Sattler. Wildgaussians: 3d gaussian splatting in the wild. arXiv, 2024.
[38] Pika Labs. Pika labs: Ai video generation platform. https://pika.art/, 2024. Accessed: 2024-11-10.
[39] Zhengqi Li, Wenqi Xian, Abe Davis, and Noah Snavely. Crowdsampling the plenoptic function. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 178–196. Springer, 2020.
[40] Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. arXiv preprint arXiv:2312.16256, 2023.
[41] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023.
[42] Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023.
[43] LUMA. Luma dream machine, 2024.
[44] Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7210–7219, 2021.
[45] Andreas Meuleman, Yu-Lun Liu, Chen Gao, Jia-Bin Huang, Changil Kim, Min H. Kim, and Johannes Kopf. Progressively optimized local radiance fields for robust view synthesis. In CVPR, 2023.
[46] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. ICCV, 2021.
[47] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, 2022.
[48] William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022.
[49] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. Film: Visual reasoning with a general conditioning layer. In AAAI, 2018.
[50] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[51] Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, and Brian Curless. Film: Frame interpolation for large motion. In European Conference on Computer Vision (ECCV), 2022.
[52] Christian Reiser, Songyou Peng, Yiyi Liao, and Andreas Geiger. Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps. In CVPR, 2021.
[53] Xuanchi Ren and Xiaolong Wang. Look outside the room: Synthesizing a consistent long-term 3d scene video from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[54] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
[55] Inc. Runway AI. Introducing gen-3 alpha: A new frontier for video generation. https://runwayml.com/research/introducing-gen-3-alpha, 2024. Accessed: 2024-11-10.
[56] Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry Lagun, Li Fei-Fei, Deqing Sun, and Jiajun Wu. ZeroNVS: Zero-shot 360-degree view synthesis from a single real image. In CVPR, 2024.
[57] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[58] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[59] Junyoung Seo, Kazumi Fukuda, Takashi Shibuya, Takuya Narihira, Naoki Murata, Shoukang Hu, Chieh-Hsin Lai, Seungryong Kim, and Yuki Mitsufuji. Genwarp: Single image to novel views with semantic-preserving generative warping. arXiv preprint arXiv:2405.17251, 2024.
[60]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Dafna Shaham, Chitwan Saharia, William Chan, and Mohammad Norouzi. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022. 3

[61]

Photo tourism: exploring photo collections in 3d

Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. Seminal Graphics Papers: Pushing the Boundaries, Volume 2, 2006. 3

[62]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, 2020. 5

[63]

Neural 3d reconstruction in the wild

Jiaming Sun, Xi Chen, Qianqian Wang, Zhengqi Li, Hadar Averbuch-Elor, Xiaowei Zhou, and Noah Snavely. Neural 3d reconstruction in the wild. In ACM SIGGRAPH 2022 conference proceedings, pages 1–9, 2022. 2

[64]

Movie Gen: A Cast of Media Foundation Models

The Movie Gen team @ Meta. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024. 3

[65]

Consistent view synthesis with pose-guided diffusion models

Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, and Johannes Kopf. Consistent view synthesis with pose-guided diffusion models. In CVPR, 2023. 3

[66]

Megascenes: Scene-level view synthesis at scale

Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. Megascenes: Scene-level view synthesis at scale. In ECCV, 2024. 2, 3

[67]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. 3

[68]

Ref-NeRF: Structured view-dependent appearance for neural radiance fields

Dor Verbin, Peter Hedman, Ben Mildenhall, Todd Zickler, Jonathan T. Barron, and Pratul P. Srinivasan. Ref-NeRF: Structured view-dependent appearance for neural radiance fields. In CVPR, 2022. 2

[69]

Pf-lrm: Pose-free large reconstruction model for joint pose and shape prediction

Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, and Kai Zhang. Pf-lrm: Pose-free large reconstruction model for joint pose and shape prediction. arXiv preprint arXiv:2311.12024,

[70]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024. 8

[71]

Generative inbetweening: Adapting image-to-video models for keyframe interpolation

Xiaojuan Wang, Boyang Zhou, Brian Curless, Ira Kemelmacher, Aleksander Holynski, and Steve Seitz. Generative inbetweening: Adapting image-to-video models for keyframe interpolation. arXiv preprint arXiv:2408.15239, 2024. 2, 3

[72]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004. 9

[73]

NeRF--: Neural radiance fields without known camera parameters

Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. NeRF--: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064,

[74]

Novel view synthesis with diffusion models

Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. arXiv preprint arXiv:2210.04628, 2022. 3

[75]

Controlling space and time with diffusion models

Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, and David J Fleet. Controlling space and time with diffusion models. arXiv preprint arXiv:2407.07860, 2024. 2, 3

[76]

Meshlrm: Large reconstruction model for high-quality mesh

Xinyue Wei, Kai Zhang, Sai Bi, Hao Tan, Fujun Luan, Valentin Deschaintre, Kalyan Sunkavalli, Hao Su, and Zexiang Xu. Meshlrm: Large reconstruction model for high-quality mesh. arXiv preprint arXiv:2404.12385, 2024. 3

[77]

CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow

Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy, Yohann Cabon, Vaibhav Arora, Romain Brégier, Gabriela Csurka, Leonid Antsfeld, Boris Chidlovskii, and Jérôme Revaud. CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow. In ICCV, 2023. 4

[78]

CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion

Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Romain Brégier, Yohann Cabon, Vaibhav Arora, Leonid Antsfeld, Boris Chidlovskii, Gabriela Csurka, and Jérôme Revaud. CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion. In NeurIPS, 2022. 4

[79]

Art•v: Auto-regressive text-to-video generation with diffusion models

Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, Chong Luo, Yueyi Zhang, and Zhiwei Xiong. Art•v: Auto-regressive text-to-video generation with diffusion models. arXiv preprint arXiv:2311.18834, 2023. 2

[80]

Reconfusion: 3d reconstruction with diffusion priors

Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P Srinivasan, Dor Verbin, Jonathan T Barron, Ben Poole, et al. Reconfusion: 3d reconstruction with diffusion priors. arXiv preprint arXiv:2312.02981, 2023. 3


Showing first 80 references.
