Recognition: 2 theorem links · Lean Theorem
CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation
Pith reviewed 2026-05-16 19:39 UTC · model grok-4.3
The pith
CamCo adds precise camera pose control to image-to-video generation while enforcing 3D consistency across frames.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CamCo equips a pre-trained image-to-video diffusion model with Plücker-coordinate camera pose inputs and an epipolar attention module, placed in each attention block, that enforces epipolar constraints on the feature maps. The resulting system is fine-tuned on real-world videos whose poses were estimated by structure-from-motion, yielding videos that follow user-specified camera trajectories with improved 3D consistency and plausible object motion.
What carries the argument
Epipolar attention module that enforces geometric constraints on feature maps, combined with Plücker coordinate parameterization of camera poses.
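A minimal sketch of how such an epipolar attention bias could be constructed, assuming known per-frame intrinsics and a relative pose between a query frame and a key frame. The function names, the feature-map resolution, and the pixel threshold tau are illustrative, not taken from the paper.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x of a 3-vector t."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K_q, K_k, R, t):
    """F maps query-frame pixels to epipolar lines in the key frame.
    (R, t) is the pose of the key camera relative to the query camera."""
    E = skew(t) @ R                                    # essential matrix
    return np.linalg.inv(K_k).T @ E @ np.linalg.inv(K_q)

def epipolar_attention_mask(K_q, K_k, R, t, hw, tau=2.0):
    """Boolean (H*W, H*W) mask: entry [i, j] is True when key pixel j lies
    within tau pixels of the epipolar line induced by query pixel i."""
    H, W = hw
    F = fundamental_matrix(K_q, K_k, R, t)
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # 3 x HW homogeneous pixels
    lines = F @ pix                                           # one epipolar line per query pixel
    lines = lines / (np.linalg.norm(lines[:2], axis=0, keepdims=True) + 1e-8)
    dist = np.abs(pix.T @ lines)       # point-to-line distances, (key pixel, query pixel)
    return dist.T < tau                # (query pixel, key pixel)

# One way to use the mask: bias = np.where(mask, 0.0, -1e9), added to the
# cross-frame attention logits so features attend only near their epipolar lines.
```

Biasing or restricting cross-frame attention to these near-epipolar pairs is one way to impose on the feature maps the geometric constraint the module is described as enforcing.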
Load-bearing premise
That the epipolar attention module will enforce 3D geometric consistency without introducing new artifacts or lowering visual quality, and that fine-tuning on SfM-estimated poses from real videos will transfer to arbitrary user-specified trajectories at inference time.
What would settle it
Generate videos under complex orbiting or dollying camera paths and check whether multi-view 3D reconstruction from the output frames recovers consistent object depths and positions, or shows systematic drift.
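A minimal sketch of how such a test trajectory could be specified, assuming the generator accepts a list of world-to-camera extrinsics (here an orbit around the origin). The look-at convention, radius, height, and frame count are assumptions for illustration, not values from the paper.

```python
import numpy as np

def look_at(cam_pos, target, world_up=np.array([0.0, 1.0, 0.0])):
    """World-to-camera (R, t) for a camera at cam_pos looking at target.
    Axes follow an OpenCV-style convention (x right, y down, z forward); illustrative only."""
    fwd = target - cam_pos
    fwd = fwd / np.linalg.norm(fwd)
    right = np.cross(fwd, world_up)
    right = right / np.linalg.norm(right)
    down = np.cross(fwd, right)
    R = np.stack([right, down, fwd])   # rows are the camera axes expressed in world coordinates
    return R, -R @ cam_pos

def orbit_trajectory(n_frames=14, radius=2.0, height=0.3, sweep_deg=60.0):
    """One (R, t) extrinsic per generated frame, sweeping an arc around the origin."""
    angles = np.linspace(0.0, np.deg2rad(sweep_deg), n_frames)
    return [look_at(np.array([radius * np.sin(a), height, radius * np.cos(a)]),
                    target=np.zeros(3))
            for a in angles]
```

Running structure-from-motion or multi-view stereo on the generated frames and comparing the recovered poses and depths against these conditioning extrinsics would expose systematic drift if the output is not truly 3D-consistent.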
Original abstract
Recently video diffusion models have emerged as expressive generative tools for high-quality video content creation readily available to general users. However, these models often do not offer precise control over camera poses for video generation, limiting the expression of cinematic language and user control. To address this issue, we introduce CamCo, which allows fine-grained Camera pose Control for image-to-video generation. We equip a pre-trained image-to-video generator with accurately parameterized camera pose input using Plücker coordinates. To enhance 3D consistency in the videos produced, we integrate an epipolar attention module in each attention block that enforces epipolar constraints to the feature maps. Additionally, we fine-tune CamCo on real-world videos with camera poses estimated through structure-from-motion algorithms to better synthesize object motion. Our experiments show that CamCo significantly improves 3D consistency and camera control capabilities compared to previous models while effectively generating plausible object motion. Project page: https://ir1d.github.io/CamCo/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CamCo, a method for adding fine-grained camera pose control to pre-trained image-to-video diffusion models. It parameterizes camera input via Plücker coordinates, inserts an epipolar attention module into each attention block to enforce geometric consistency, and fine-tunes the model on real videos whose poses were recovered by structure-from-motion (SfM). The central claim is that these changes yield videos with measurably better 3D consistency and more accurate camera control while still producing plausible object motion.
Significance. If the quantitative claims hold, CamCo would provide a practical route to controllable cinematic video synthesis from a single image, addressing a clear limitation of current diffusion-based video generators. The epipolar-attention design is a lightweight way to inject 3D inductive bias without retraining from scratch, and the use of SfM poses for fine-tuning is a pragmatic data strategy. However, the absence of any numerical results, ablation tables, or evaluation protocol in the abstract makes it impossible to judge whether the improvements are substantial enough to shift the state of the art.
major comments (3)
- [Abstract] Abstract: the assertion that CamCo 'significantly improves 3D consistency and camera control capabilities' is unsupported by any quantitative metrics, ablation results, or description of how 3D consistency was measured (e.g., reprojection error, multi-view consistency scores, or user studies). Without these numbers the central empirical claim cannot be evaluated.
- [Method] Method section (camera-conditioning and fine-tuning): fine-tuning exclusively on SfM-estimated poses from real videos introduces a domain gap for arbitrary user-specified trajectories at inference. SfM poses contain noise, scale ambiguity, and are drawn from the distribution of handheld/tripod motion; no experiment tests generalization to out-of-distribution paths such as extreme dolly zooms or rapid pans outside the training support. This directly threatens the advertised 'camera control capabilities.'
- [Experiments] Experiments: no details are given on the evaluation protocol for 3D consistency (e.g., whether it uses ground-truth poses, multi-view reconstruction error, or optical-flow consistency), nor on the baselines, datasets, or statistical significance of the reported improvements; one concrete metric of this kind is sketched after this list.
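To make the protocol question concrete, a rough sketch of one possible 3D-consistency proxy follows: the symmetric epipolar distance of matched keypoints between two generated frames, computed against the conditioning poses. The keypoint matcher producing pts1/pts2 and the construction of F are placeholders, not the paper's actual protocol.

```python
import numpy as np

def symmetric_epipolar_distance(F, pts1, pts2):
    """Mean symmetric point-to-epipolar-line distance (pixels) for matches
    pts1[i] <-> pts2[i]. F maps frame-1 points to epipolar lines in frame 2."""
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])   # N x 3 homogeneous points, frame 1
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])   # N x 3 homogeneous points, frame 2
    l2 = x1 @ F.T                                     # epipolar lines in frame 2
    l1 = x2 @ F                                       # epipolar lines in frame 1
    d2 = np.abs(np.sum(x2 * l2, axis=1)) / np.linalg.norm(l2[:, :2], axis=1)
    d1 = np.abs(np.sum(x1 * l1, axis=1)) / np.linalg.norm(l1[:, :2], axis=1)
    return float(np.mean(0.5 * (d1 + d2)))
```

Averaged over frame pairs, trajectories, and seeds, and reported with standard deviations against baselines, a metric of this shape would let readers judge the "significantly improves 3D consistency" claim.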
minor comments (2)
- [Abstract] Abstract: the sentence 'effectively generating plausible object motion' is vague; clarify what motion quality metric or qualitative criterion is intended.
- [Method] Notation: Plücker coordinates are mentioned without an explicit definition or reference to the coordinate convention used; add a short equation or citation in the method section (one common convention is shown below).
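For concreteness, one common convention (an assumption here, not necessarily the paper's) embeds the viewing ray through pixel (u, v) by its normalized world-space direction and its moment about the camera center:

```latex
% Camera with intrinsics K, world-to-camera rotation R, and center o in world coordinates.
% The ray through pixel (u, v) has direction d and Plücker embedding p.
\[
  d_{uv} = \frac{R^{\top} K^{-1} (u,\, v,\, 1)^{\top}}
                {\bigl\lVert R^{\top} K^{-1} (u,\, v,\, 1)^{\top} \bigr\rVert},
  \qquad
  p_{uv} = \bigl(o \times d_{uv},\; d_{uv}\bigr) \in \mathbb{R}^{6}.
\]
```

Stacking p_{uv} over all pixels yields a six-channel, pixel-aligned pose map, which is one natural form in which to concatenate camera information to a video diffusion backbone's input.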
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve the clarity of our claims, evaluation details, and discussion of limitations.
Point-by-point responses
-
Referee: [Abstract] Abstract: the assertion that CamCo 'significantly improves 3D consistency and camera control capabilities' is unsupported by any quantitative metrics, ablation results, or description of how 3D consistency was measured (e.g., reprojection error, multi-view consistency scores, or user studies). Without these numbers the central empirical claim cannot be evaluated.
Authors: We agree that the abstract would be strengthened by including key quantitative results. The experiments section reports specific metrics for 3D consistency (reprojection error and multi-view consistency scores) and camera control accuracy, along with comparisons to baselines. In the revised manuscript we will add a concise summary of these numerical improvements and a brief description of the evaluation metrics to the abstract. revision: yes
-
Referee: [Method] Method section (camera-conditioning and fine-tuning): fine-tuning exclusively on SfM-estimated poses from real videos introduces a domain gap for arbitrary user-specified trajectories at inference. SfM poses contain noise, scale ambiguity, and are drawn from the distribution of handheld/tripod motion; no experiment tests generalization to out-of-distribution paths such as extreme dolly zooms or rapid pans outside the training support. This directly threatens the advertised 'camera control capabilities.'
Authors: SfM-estimated poses from real videos are used because they provide realistic object motion that synthetic data cannot easily replicate. The epipolar attention module is intended to provide robustness to the noise and scale ambiguity inherent in SfM. While our current experiments cover a range of trajectories, we acknowledge the value of explicit OOD testing. We will add a new experiment subsection evaluating performance on extreme paths (e.g., rapid pans and dolly zooms) to better substantiate generalization claims. revision: yes
-
Referee: [Experiments] Experiments: no details are given on the evaluation protocol for 3D consistency (e.g., whether it uses ground-truth poses, multi-view reconstruction error, or optical-flow consistency), nor on the baselines, datasets, or statistical significance of the reported improvements.
Authors: We apologize for the insufficient detail in the current draft. Section 4 specifies that 3D consistency is measured via reprojection error against SfM ground-truth poses and multi-view consistency scores; baselines are evaluated on RealEstate10K and similar datasets; results are reported as means with standard deviations over repeated samples. We will expand the experiments section with a dedicated evaluation-protocol subsection that explicitly describes these elements, the datasets, and the statistical reporting. revision: yes
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper describes an engineering extension to a pre-trained image-to-video diffusion model: injecting Plücker coordinates as camera-pose conditioning and adding an epipolar attention module, followed by fine-tuning on SfM-estimated real-video poses. These are architectural and training choices whose outputs are evaluated empirically against baselines. No equations, uniqueness theorems, or self-citations are presented that would make any claimed prediction or consistency result equivalent to its own inputs by construction. The central claims rest on comparative experiments rather than tautological reductions, so the derivation chain is checked against external benchmarks rather than closing on itself.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Epipolar geometry provides valid constraints between corresponding points in different views of the same 3D scene
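Written out, the axiom is the standard epipolar constraint: for corresponding homogeneous pixel coordinates x and x' in two views,

```latex
\[
  x'^{\top} F \, x = 0,
  \qquad
  F = K'^{-\top} \, [t]_{\times} \, R \, K^{-1},
\]
```

where (R, t) is the relative pose between the two cameras, K and K' their intrinsics, and [t]_x the skew-symmetric cross-product matrix of t. The epipolar attention module is described as enforcing exactly this relation between feature locations across frames.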
Lean theorems connected to this paper
-
Foundation/AlexanderDuality · alexander_duality_circle_linking (echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
We equip a pre-trained image-to-video generator with accurately parameterized camera pose input using Plücker coordinates. To enhance 3D consistency... we integrate an epipolar attention module... that enforces epipolar constraints to the feature maps.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
-
PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset
PicoEyes unifies gaze estimation for mixed reality by jointly predicting 3D eye parameters, segmentation, optical and visual axes, and depth maps from monocular or binocular inputs, supported by a new large-scale mult...
-
PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset
PicoEyes delivers a unified end-to-end model for full 3D gaze estimation including eye parameters, axes, segmentation and depth from monocular or binocular near-eye images, supported by a new large-scale multi-view dataset.
-
Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...
-
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
-
SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
SparseCam4D achieves spatio-temporally consistent high-fidelity 4D reconstruction from sparse cameras via a Spatio-Temporal Distortion Field that corrects inconsistencies in generative observations.
-
StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space
A viewpoint-conditioned diffusion model generates stereo image pairs from monocular input in a canonical rectified space without using depth or explicit warping.
-
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.
-
PhyCo: Learning Controllable Physical Priors for Generative Motion
PhyCo adds continuous physical control to video diffusion models via physics-supervised fine-tuning on a large simulation dataset and VLM-guided rewards, yielding measurable gains in physical realism on the Physics-IQ...
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
-
Lyra 2.0: Explorable Generative 3D Worlds
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
-
Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.
-
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
-
Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians
A unified single-pass framework using dynamic 3D Gaussians generates temporally consistent camera-controlled videos from a single image.
-
ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
ViewCrafter tames video diffusion models with point-based 3D guidance and iterative trajectory planning to produce high-fidelity novel views from single or sparse images.
-
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
-
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
SANA-WM is a 2.6B-parameter efficient world model that synthesizes minute-scale 720p videos with 6-DoF camera control, trained on 213K public clips in 15 days on 64 H100s and runnable on single GPUs at 36x higher thro...
-
Pose-Aware Diffusion for 3D Generation
PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
Reference graph
Works this paper leans on
-
[1]
Stable video diffusion: Scaling latent video diffusion models to large datasets
Stability AI. Stable video diffusion: Scaling latent video diffusion models to large datasets. https://stability.ai/research/stable-video-diffusion-scaling-latent-video-diffusion-models-to-large-datasets, 2023
work page 2023
-
[2]
Frozen in time: A joint video and image encoder for end-to-end retrieval
Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021
work page 2021
-
[3]
Improving image generation with better captions
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023
work page 2023
-
[4]
Align your latents: High-resolution video synthesis with latent diffusion models
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023
work page 2023
-
[5]
Video generation models as world simulators
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024
work page 2024
-
[6]
Coyo-700m: Image-text pair dataset
Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022
work page 2022
-
[7]
pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction
David Charatan, Sizhe Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. arXiv preprint arXiv:2312.12337, 2023
-
[8]
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Objaverse-xl: A universe of 10m+ 3d objects
Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023
-
[10]
Objaverse: A universe of annotated 3d objects
Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023
work page 2023
-
[11]
Depth-supervised nerf: Fewer views and faster training for free
Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12882–12891, 2022
work page 2022
-
[12]
Graphdreamer: Compositional 3d scene synthesis from scene graphs
Gege Gao, Weiyang Liu, Anpei Chen, Andreas Geiger, and Bernhard Schölkopf. Graphdreamer: Compositional 3d scene synthesis from scene graphs. arXiv preprint arXiv:2312.00093, 2023
-
[13]
Sparsectrl: Adding sparse controls to text-to-video diffusion models
Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933, 2023
-
[14]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Moving mirrors: a high-density eeg study investigating the effect of camera movements on motor cortex activation during action observation
Katrin Heimann, Maria Alessandra Umiltà, Michele Guerra, and Vittorio Gallese. Moving mirrors: a high-density eeg study investigating the effect of camera movements on motor cortex activation during action observation. Journal of cognitive neuroscience, 26(9):2087–2101, 2014
work page 2014
-
[17]
Gans trained by a two time-scale update rule converge to a local nash equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017
work page 2017
-
[18]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[19]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[20]
Plücker coordinates for lines in the space
Yan-Bin Jia. Plücker coordinates for lines in the space. Problem Solver Techniques for Applied Computer Science, Com-S-477/577 Course Handout, 3, 2020
work page 2020
-
[21]
Spad: Spatially aware multiview diffusers
Yash Kant, Ziyi Wu, Michael Vasilkovsky, Guocheng Qian, Jian Ren, Riza Alp Guler, Bernard Ghanem, Sergey Tulyakov, Igor Gilitschenski, and Aliaksandr Siarohin. Spad: Spatially aware multiview diffusers. arXiv preprint arXiv:2402.05235, 2024
-
[22]
Elucidating the design space of diffusion-based generative models
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022
work page 2022
-
[23]
Text2video-zero: Text-to-image diffusion models are zero-shot video generators
Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023
-
[24]
Gligen: Open-set grounded text-to-image generation
Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22511–22521, 2023
work page 2023
-
[25]
Magic3d: High-resolution text-to-3d content creation
Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023
work page 2023
-
[26]
Zero-1-to-3: Zero-shot one image to 3d object
Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. arXiv preprint arXiv:2303.11328, 2023
-
[27]
Latte: Latent Diffusion Transformer for Video Generation
Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
Im-3d: Iterative multiview diffusion and reconstruction for high-quality 3d generation
Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, and Filippos Kokkinos. Im-3d: Iterative multiview diffusion and reconstruction for high-quality 3d generation. arXiv preprint arXiv:2402.08682, 2024
-
[29]
Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4296–4304, 2024
work page 2024
-
[30]
Camera movement in narrative cinema: towards a taxonomy of functions
Jakob Isak Nielsen, Edvin Kau, and Richard Raskin. Camera movement in narrative cinema: towards a taxonomy of functions. Department of Inf. & Media Studies, University of Aarhus, 2007
work page 2007
-
[31]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023
work page 2023
-
[32]
Vase: Object-centric appearance and shape manipulation of real videos
Elia Peruzzo, Vidit Goel, Dejia Xu, Xingqian Xu, Yifan Jiang, Zhangyang Wang, Humphrey Shi, and Nicu Sebe. Vase: Object-centric appearance and shape manipulation of real videos. arXiv preprint arXiv:2401.02473, 2024
-
[33]
Compositional 3d scene generation using locally conditioned diffusion
Ryan Po and Gordon Wetzstein. Compositional 3d scene generation using locally conditioned diffusion. arXiv preprint arXiv:2303.12218, 2023
-
[34]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
DreamFusion: Text-to-3D using 2D Diffusion
Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[36]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[37]
Zero-shot text-to-image generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021
work page 2021
-
[38]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022
work page 2022
-
[39]
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo-Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Ky...
work page 2022
-
[40]
Structure-from-motion revisited
Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016
work page 2016
-
[41]
Pixelwise view selection for unstructured multi-view stereo
Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016
work page 2016
-
[42]
LAION-5B: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[43]
Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model
Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, and Hao Su. Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023
work page internal anchor Pith review arXiv 2023
-
[44]
Ed Sikov. Film studies: An introduction. Columbia University Press, 2020
work page 2020
-
[45]
Light field networks: Neural scene representations with single-evaluation rendering
Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. Advances in Neural Information Processing Systems, 34:19313–19325, 2021
work page 2021
-
[46]
DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation
Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[47]
Consistent view synthesis with pose-guided diffusion models
Hung-Yu Tseng, Qinbo Li, Changil Kim, Suhib Alsisan, Jia-Bin Huang, and Johannes Kopf. Consistent view synthesis with pose-guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16773–16783, 2023
work page 2023
-
[48]
Towards Accurate Generative Models of Video: A New Metric & Challenges
Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[49]
Sv3d: Novel multi-view synthesis and 3d generation from a single image using latent video diffusion
Vikram Voleti, Chun-Han Yao, Mark Boss, Adam Letts, David Pankratz, Dmitry Tochilkin, Christian Laforte, Robin Rombach, and Varun Jampani. Sv3d: Novel multi-view synthesis and 3d generation from a single image using latent video diffusion. arXiv preprint arXiv:2403.12008, 2024
-
[50]
Taming mode collapse in score distillation for text-to-3d generation
Peihao Wang, Dejia Xu, Zhiwen Fan, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, et al. Taming mode collapse in score distillation for text-to-3d generation. arXiv preprint arXiv:2401.00909, 2023
-
[51]
Motionctrl: A unified and flexible motion controller for video generation
Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. arXiv preprint arXiv:2312.03641, 2023
-
[52]
Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023
work page 2023
-
[53]
Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360° views
Dejia Xu, Yifan Jiang, Peihao Wang, Zhiwen Fan, Yi Wang, and Zhangyang Wang. Neurallift-360: Lifting an in-the-wild 2d photo to a 3d object with 360° views. arXiv preprint arXiv:2211.16431, 2022
-
[54]
Comp4d: Llm-guided compositional 4d scene generation
Dejia Xu, Hanwen Liang, Neel P Bhatt, Hezhen Hu, Hanxue Liang, Konstantinos N Plataniotis, and Zhangyang Wang. Comp4d: Llm-guided compositional 4d scene generation. arXiv preprint arXiv:2403.16993, 2024
-
[55]
VideoGPT: Video Generation using VQ-VAE and Transformers
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[56]
Mehmet Burak Yilmaz, Elen Lotman, Andres Karjus, and Pia Tikka. An embodiment of the cinematographer: emotional and perceptual responses to different camera movement techniques. Frontiers in Neuroscience, 17:1160843, 2023
work page 2023
-
[57]
Magvit: Masked generative video transformer
Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10459–10469, 2023
work page 2023
-
[58]
Efficient video diffusion models via content-frame motion-latent decomposition
Sihyun Yu, Weili Nie, De-An Huang, Boyi Li, Jinwoo Shin, and Anima Anandkumar. Efficient video diffusion models via content-frame motion-latent decomposition. arXiv preprint arXiv:2403.14148, 2024
-
[59]
Adding Conditional Control to Text-to-Image Diffusion Models
Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023
work page internal anchor Pith review arXiv 2023
-
[60]
Scenewiz3d: Towards text-guided 3d scene composition
Qihang Zhang, Chaoyang Wang, Aliaksandr Siarohin, Peiye Zhuang, Yinghao Xu, Ceyuan Yang, Dahua Lin, Bolei Zhou, Sergey Tulyakov, and Hsin-Ying Lee. Scenewiz3d: Towards text-guided 3d scene composition. arXiv preprint arXiv:2312.08885, 2023
-
[61]
Particlesfm: Exploiting dense point trajectories for localizing moving cameras in the wild
Wang Zhao, Shaohui Liu, Hengkai Guo, Wenping Wang, and Yong-Jin Liu. Particlesfm: Exploiting dense point trajectories for localizing moving cameras in the wild. In European Conference on Computer Vision, pages 523–542. Springer, 2022
work page 2022
-
[62]
Stereo Magnification: Learning View Synthesis using Multiplane Images
Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[63]
Videomv: Consistent multi-view generation based on large video generative model
Qi Zuo, Xiaodong Gu, Lingteng Qiu, Yuan Dong, Zhengyi Zhao, Weihao Yuan, Rui Peng, Siyu Zhu, Zilong Dong, Liefeng Bo, et al. Videomv: Consistent multi-view generation based on large video generative model. arXiv preprint arXiv:2403.12010, 2024