KFC-W: Generating 3D-Consistent Videos from Unposed Internet Photos
Pith reviewed 2026-05-23 16:44 UTC · model grok-4.3
The pith
A self-supervised model learns to generate 3D-consistent videos from unposed internet photos without any camera parameters or 3D labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a scalable 3D-aware video model can be trained in a self-supervised manner, without any 3D annotations such as camera parameters, by exploiting the consistency of video together with the viewpoint variability of unposed multiview internet photos. The paper further claims that this model achieves better geometric and appearance consistency than existing video baselines and that it benefits applications that enable camera control, such as 3D Gaussian Splatting.
What carries the argument
The self-supervised training procedure that pairs video-frame consistency with multiview photo variability to induce implicit 3D geometric understanding.
If this is right
- Random internet photos can serve as keyframes for video generation that respects scene layout and identity.
- The model benefits downstream applications that enable camera control, such as 3D Gaussian Splatting.
- Scene-level 3D learning becomes feasible at scale using only ordinary 2D video and photo collections.
Where Pith is reading between the lines
- Large uncurated photo collections could replace curated 3D datasets for training video models.
- The same consistency signal might extend to learning other 3D properties such as lighting or material appearance.
- Failure modes on highly dynamic or non-rigid scenes would reveal limits of the implicit-geometry approach.
Load-bearing premise
Natural consistency across video frames plus viewpoint differences in unposed photos are enough to produce genuine 3D geometric understanding in the model.
What would settle it
The claim would be undermined by generated videos in which objects clearly change shape, size, or relative position as the camera path is interpolated between input views.
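One way to make such a check concrete is to test matched points in two generated frames against the epipolar constraint that a rigid scene seen by a moving camera must satisfy. The sketch below is not the paper's evaluation protocol. It assumes that a fundamental matrix `F` and a list of point matches have already been obtained from an external estimator, and the function names `sampson_error` and `mean_sampson_error` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]  # 3x3 fundamental matrix, row-major
Point = Tuple[float, float]          # pixel coordinates (x, y)


def sampson_error(F: Matrix3, p1: Point, p2: Point) -> float:
    """First-order squared distance of a match from the constraint x2^T F x1 = 0."""
    x1 = (p1[0], p1[1], 1.0)
    x2 = (p2[0], p2[1], 1.0)
    # F x1 is the epipolar line of p1 in the second frame.
    Fx1 = [sum(F[i][j] * x1[j] for j in range(3)) for i in range(3)]
    # F^T x2 is the epipolar line of p2 in the first frame.
    Ftx2 = [sum(F[j][i] * x2[j] for j in range(3)) for i in range(3)]
    residual = sum(x2[i] * Fx1[i] for i in range(3))
    denom = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    if denom == 0.0:
        # Degenerate configuration: no finite normalisation is available.
        return 0.0 if residual == 0.0 else math.inf
    return residual ** 2 / denom


def mean_sampson_error(F: Matrix3, matches: Sequence[Tuple[Point, Point]]) -> float:
    """Average Sampson error over all matches; lower means more rigid-scene-like."""
    if not matches:
        raise ValueError("need at least one match")
    return sum(sampson_error(F, p1, p2) for p1, p2 in matches) / len(matches)


# Assumed example: a camera translating along x with identity intrinsics,
# for which the constraint reduces to "matched points share the same row".
F_SIDEWAYS = [[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]]
```

A low mean error over many matches is consistent with rigid 3D structure, while frames in which objects drift or deform would tend to produce large residuals. The result is only as reliable as the matches and the estimate of `F`.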
Original abstract
We address the problem of generating videos from unposed internet photos. A handful of input images serve as keyframes, and our model interpolates between them to simulate a path moving between the cameras. Given random images, a model's ability to capture underlying geometry, recognize scene identity, and relate frames in terms of camera position and orientation reflects a fundamental understanding of 3D structure and scene layout. However, existing video models such as Luma Dream Machine fail at this task. We design a self-supervised method that takes advantage of the consistency of videos and variability of multiview internet photos to train a scalable, 3D-aware video model without any 3D annotations such as camera parameters. We validate that our method outperforms all baselines in terms of geometric and appearance consistency. We also show our model benefits applications that enable camera control, such as 3D Gaussian Splatting. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce a self-supervised method called KFC-W that generates 3D-consistent videos from unposed internet photos by leveraging video consistency and multiview photo variability. It trains a scalable 3D-aware video model without any 3D annotations such as camera parameters. The method is validated to outperform all baselines in geometric and appearance consistency and is shown to benefit applications like camera control in 3D Gaussian Splatting, suggesting that scene-level 3D learning can be scaled using only 2D data.
Significance. Should the results hold, this would be a significant contribution to computer vision by showing that 3D geometric understanding can emerge from self-supervision on 2D data sources alone, potentially reducing the need for 3D annotations and enabling more accessible training of 3D-aware generative models.
major comments (2)
- [Abstract] Abstract: The central empirical claim that the method 'outperforms all baselines in terms of geometric and appearance consistency' provides no quantitative metrics, ablation studies, or description of how geometric consistency was measured or what the baselines were, rendering the validation statement unverifiable.
- [Method] Method section: No analysis or experiments are described that distinguish learning of genuine 3D geometry (e.g., consistent depth ordering or camera trajectories in 3D space) from 2D temporal interpolation or appearance matching, which is required to support the '3D-aware' claim in the absence of camera parameters or explicit 3D losses.
minor comments (2)
- [Abstract] Abstract: The mention of Luma Dream Machine as a failing example should be expanded to list all baselines used in the reported comparisons.
- [Method] Throughout: Ensure any self-supervised loss functions or training objectives are explicitly formulated with equations for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below and will revise the manuscript accordingly to strengthen the presentation of results and supporting analysis.
- Referee: [Abstract] Abstract: The central empirical claim that the method 'outperforms all baselines in terms of geometric and appearance consistency' provides no quantitative metrics, ablation studies, or description of how geometric consistency was measured or what the baselines were, rendering the validation statement unverifiable.
  Authors: We agree that the abstract, as a high-level summary, does not include the quantitative details. The full manuscript reports these in the experiments section, including specific metrics, baseline descriptions, and evaluation protocols for geometric consistency. To address the concern directly, we will revise the abstract to incorporate key quantitative results and a concise note on measurement methodology.
  Revision: yes
- Referee: [Method] Method section: No analysis or experiments are described that distinguish learning of genuine 3D geometry (e.g., consistent depth ordering or camera trajectories in 3D space) from 2D temporal interpolation or appearance matching, which is required to support the '3D-aware' claim in the absence of camera parameters or explicit 3D losses.
  Authors: We acknowledge that additional targeted analysis would better isolate 3D geometry learning from 2D effects. While existing results on downstream tasks like camera-controlled 3D Gaussian Splatting provide indirect support, we will add explicit experiments in the revised manuscript, such as depth ordering visualizations, trajectory consistency tests, and comparisons against purely 2D baselines, to strengthen the '3D-aware' claim.
  Revision: yes
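As an illustration of what a purely 2D baseline looks like, the following sketch cross-fades two keyframes pixel by pixel. It is not one of the paper's baselines, and the function name `crossfade` and the grayscale list-of-rows frame format are assumptions chosen for brevity. Because it only blends intensities, it ignores parallax entirely, which is the behavior a 3D-aware model would have to improve on.

```python
from typing import List

Frame = List[List[float]]  # grayscale image stored as rows of pixel intensities


def crossfade(key_a: Frame, key_b: Frame, num_inbetween: int) -> List[Frame]:
    """Return num_inbetween frames that linearly blend key_a into key_b."""
    if num_inbetween < 0:
        raise ValueError("num_inbetween must be non-negative")
    same_rows = len(key_a) == len(key_b)
    if not same_rows or any(len(ra) != len(rb) for ra, rb in zip(key_a, key_b)):
        raise ValueError("keyframes must have the same shape")

    frames: List[Frame] = []
    for k in range(1, num_inbetween + 1):
        # Blend weight moves evenly from 0 (key_a) towards 1 (key_b),
        # excluding the keyframes themselves.
        t = k / (num_inbetween + 1)
        frames.append([
            [(1.0 - t) * a + t * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(key_a, key_b)
        ])
    return frames
```

For example, `crossfade([[0.0, 2.0]], [[4.0, 2.0]], 3)` produces three one-row frames whose first pixel steps evenly from the first keyframe's value towards the second's, while the second pixel, which is identical in both keyframes, stays fixed.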
Circularity Check
No circularity: empirical self-supervised training validated against external baselines
The paper presents a self-supervised training procedure that leverages video consistency and multiview photo variability to produce 3D-aware video generation without camera parameters or 3D losses. No derivation chain, fitted parameter renamed as prediction, or self-citation load-bearing step is described in the abstract or claimed method. The central claim is an empirical outperformance on geometric and appearance consistency metrics against external baselines, which is falsifiable outside the training objective itself. No equations or uniqueness theorems are invoked that reduce to the inputs by construction.
Forward citations
Cited by 1 Pith paper
- CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.