pith. machine review for the scientific record.

arxiv: 2601.00678 · v2 · submitted 2026-01-02 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 18:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords image-to-video generation, 3D Gaussians, camera control, dynamic scene, single-image conditioning, video synthesis, 4D reconstruction

The pith

A single image generates camera-controlled 4D video by building dynamic 3D Gaussians in one forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that takes one static image and produces a video sequence whose camera path can be freely specified by the user. It builds an explicit 3D Gaussian representation of the scene geometry and directly samples plausible object motion inside the same forward pass. This removes the need for separate iterative denoising steps that other methods use to add motion, yielding both faster inference and stronger guarantees of temporal and geometric consistency. A sympathetic reader would care because the approach could make controllable scene animation from ordinary photographs practical for simulation, robotics, and content creation.
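To make that contract concrete, the following is a minimal sketch of the claimed control flow, assuming a constant-acceleration motion model and stub functions (predict_dynamic_gaussians, render) that stand in for the learned networks and the Gaussian rasterizer; none of these names or shapes come from the paper.

```python
# Minimal sketch of the claimed contract: one forward pass builds dynamic
# 3D Gaussians from a single image; every frame is then rendered directly
# along a user-specified camera path, with no per-frame denoising.
# All names and the motion model are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class DynamicGaussians:
    means: np.ndarray          # (N, 3) splat centers at t = 0
    velocities: np.ndarray     # (N, 3) per-splat velocity
    accelerations: np.ndarray  # (N, 3) per-splat acceleration
    colors: np.ndarray         # (N, 3) RGB

    def means_at(self, t: float) -> np.ndarray:
        # Assumed constant-acceleration motion for splat centers.
        return self.means + self.velocities * t + 0.5 * self.accelerations * t**2

def predict_dynamic_gaussians(image: np.ndarray, n: int = 4096) -> DynamicGaussians:
    """Stand-in for the learned encoder/decoder: one forward pass, no iteration."""
    rng = np.random.default_rng(0)
    return DynamicGaussians(
        means=rng.normal(size=(n, 3)),
        velocities=0.01 * rng.normal(size=(n, 3)),
        accelerations=0.001 * rng.normal(size=(n, 3)),
        colors=rng.uniform(size=(n, 3)),
    )

def render(gaussians: DynamicGaussians, t: float, pose: np.ndarray) -> np.ndarray:
    """Toy stand-in for splatting: transform centers into the camera frame.
    A real implementation would rasterize full 3D Gaussians."""
    xyz_h = np.concatenate(
        [gaussians.means_at(t), np.ones((len(gaussians.means), 1))], axis=1)
    return (pose @ xyz_h.T).T[:, :3]  # placeholder for a rendered image

image = np.zeros((256, 256, 3))              # the single input image
trajectory = [np.eye(4) for _ in range(16)]  # user-chosen camera poses

splats = predict_dynamic_gaussians(image)    # one forward pass
frames = [render(splats, t=i / 15, pose=p) for i, p in enumerate(trajectory)]
```

The point of the sketch is the control flow: the network runs once, and each frame comes from rendering the same splat set at a new time and camera pose.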

Core claim

We propose a novel framework that constructs a 3D Gaussian scene representation and samples plausible object motion, given a single image in a single forward pass. This enables fast, camera-guided video generation without the need for iterative denoising to inject object motion into render frames.

What carries the argument

Dynamic 3D Gaussians that jointly encode the scene's static geometry extracted from the input image and the sampled object motions.

Load-bearing premise

A single static image contains enough information to construct an accurate 3D Gaussian representation and to sample plausible, temporally consistent object motions that align with an arbitrary camera trajectory.

What would settle it

If videos generated for large camera movements exhibit visible object trajectory errors or geometric drift from the input image, the single-pass construction would be shown to be insufficient.
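One way to operationalize that test, as a hedged sketch: drive the camera around a large loop that returns to the input pose and measure how far the closing frame drifts from the input image. The PSNR metric and the 25 dB floor below are illustrative choices, not thresholds taken from the paper.

```python
# Hypothetical drift check: if the single-pass construction is sufficient, a
# loop trajectory that returns to the input pose should reproduce the input
# image closely; large drift would support the objection above.
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray, peak: float = 1.0) -> float:
    mse = float(np.mean((a - b) ** 2))
    return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)

def geometric_drift_test(input_image, render_fn, loop_trajectory, psnr_floor=25.0):
    """render_fn(pose) -> image; loop_trajectory starts and ends at the input pose."""
    closing_frame = render_fn(loop_trajectory[-1])
    score = psnr(input_image, closing_frame)
    return {"psnr_at_loop_closure": score, "drift_detected": score < psnr_floor}

# Example with a trivial stand-in renderer that returns the input unchanged:
img = np.random.default_rng(0).uniform(size=(64, 64, 3))
print(geometric_drift_test(img, lambda pose: img, loop_trajectory=[np.eye(4)] * 8))
```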

Figures

Figures reproduced from arXiv: 2601.00678 by Daniela Ivanova, John H. Williamson, Melonie de Almeida, Paul Henderson, Tong Shi.

Figure 1: Pixel-to-4D: Given an input image I_t, enc_s encodes I_t and its estimated depths D_t and fuses features from DINOv2. The combined features are decoded by dec_s to predict static Gaussian parameters d, Δ, r, s, σ, c. Conditioned on the combined features, splat velocities v and accelerations a are generated using dec_vae and dec_d from latent Gaussian noise. These are aggregated over object segmentations to give …
Figure 2: Qualitative comparisons on four datasets. Each block shows the input frame at …
Figure 3: Qualitative ablation results on Waymo: input and predicted frames and depths at …
Figure 4: Qualitative ablation results on KITTI, showing input and predicted frames and depths at …
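The Figure 1 caption says the per-splat velocities and accelerations are aggregated over object segmentations, but it does not state the pooling operator. A minimal sketch, assuming a per-segment mean and the same constant-acceleration update for splat centers (aggregate_motion and displace are hypothetical names):

```python
# Sketch of the "aggregated over object segmentations" step from the Figure 1
# caption: per-splat velocities v and accelerations a are pooled within each
# segment so splats on the same object share one motion. A per-segment mean
# is assumed here; the paper may use a different aggregation.
import numpy as np

def aggregate_motion(v: np.ndarray, a: np.ndarray, seg: np.ndarray):
    """v, a: (N, 3) per-splat motion; seg: (N,) integer object labels."""
    v_out, a_out = np.empty_like(v), np.empty_like(a)
    for label in np.unique(seg):
        mask = seg == label
        v_out[mask] = v[mask].mean(axis=0)   # one velocity per object
        a_out[mask] = a[mask].mean(axis=0)   # one acceleration per object
    return v_out, a_out

def displace(means: np.ndarray, v: np.ndarray, a: np.ndarray, t: float) -> np.ndarray:
    # Assumed constant-acceleration update for splat centers at time t.
    return means + v * t + 0.5 * a * t**2

rng = np.random.default_rng(0)
means, v, a = rng.normal(size=(100, 3)), rng.normal(size=(100, 3)), rng.normal(size=(100, 3))
seg = rng.integers(0, 5, size=100)
v_agg, a_agg = aggregate_motion(v, a, seg)
moved = displace(means, v_agg, a_agg, t=0.5)
```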
read the original abstract

Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation. However, these methods often lack robust user controllability, such as modifying the camera path, limiting their applicability in real-world applications. Most existing camera-controlled image-to-video models struggle with accurately modeling camera motion, maintaining temporal consistency, and preserving geometric integrity. Leveraging explicit intermediate 3D representations offers a promising solution by enabling coherent video generation aligned with a given camera trajectory. Although these methods often use 3D point clouds to render scenes and introduce object motion in a later stage, this two-step process still falls short in achieving full temporal consistency, despite allowing precise control over camera movement. We propose a novel framework that constructs a 3D Gaussian scene representation and samples plausible object motion, given a single image in a single forward pass. This enables fast, camera-guided video generation without the need for iterative denoising to inject object motion into render frames. Extensive experiments on the KITTI, Waymo, RealEstate10K and DL3DV-10K datasets demonstrate that our method achieves state-of-the-art video quality and inference efficiency. The project page is available at https://melonienimasha.github.io/Pixel-to-4D-Website.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Pixel-to-4D, a novel framework that, from a single input image, constructs an explicit 3D Gaussian scene representation and samples plausible object motions in one forward pass. This enables fast, camera-controlled image-to-video generation without iterative denoising steps. The authors claim state-of-the-art video quality and inference efficiency on the KITTI, Waymo, RealEstate10K, and DL3DV-10K datasets.

Significance. If the central claims hold, the work would offer a meaningful advance in controllable video synthesis by combining explicit 3D representations with single-pass dynamics prediction, potentially improving both geometric consistency and speed relative to diffusion-based baselines that rely on iterative refinement.

major comments (2)
  1. [Abstract] Abstract: The assertion of state-of-the-art results on KITTI, Waymo, RealEstate10K and DL3DV-10K is presented without any quantitative metrics, tables, ablation studies, or error analysis, leaving the central empirical claim unsupported in the provided text and preventing verification of the reported gains in quality and efficiency.
  2. [Method] Method (Dynamic 3D Gaussians construction): The single-image prediction of both static Gaussians and object motion lacks explicit discussion of 3D regularization (e.g., depth supervision, cross-view consistency losses, or multi-view rendering terms). Without such constraints the representation risks being under-determined, which directly threatens 3D consistency when rendering along arbitrary camera trajectories that deviate from the training distribution.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'samples plausible object motion' is used without defining the motion parameterization or the loss used to train it; a brief clarification would improve readability.
  2. [Abstract] The project page URL is given but no supplementary video or code link is referenced in the abstract; adding such pointers would aid reproducibility assessment.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below and indicate the revisions made to strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion of state-of-the-art results on KITTI, Waymo, RealEstate10K and DL3DV-10K is presented without any quantitative metrics, tables, ablation studies, or error analysis, leaving the central empirical claim unsupported in the provided text and preventing verification of the reported gains in quality and efficiency.

    Authors: We appreciate the referee highlighting the need for stronger support in the abstract. The full manuscript provides detailed quantitative tables, ablation studies, and error analysis in Section 4 across all listed datasets. To directly address the concern, we will revise the abstract to include a brief reference to key supporting metrics (e.g., superior PSNR/SSIM and inference speed relative to baselines) while maintaining its concise nature. This change ensures the SOTA claim is better grounded even in the summary text. revision: yes

  2. Referee: [Method] Method (Dynamic 3D Gaussians construction): The single-image prediction of both static Gaussians and object motion lacks explicit discussion of 3D regularization (e.g., depth supervision, cross-view consistency losses, or multi-view rendering terms). Without such constraints the representation risks being under-determined, which directly threatens 3D consistency when rendering along arbitrary camera trajectories that deviate from the training distribution.

    Authors: We agree that an explicit discussion of regularization strengthens the method description. The current training already leverages dataset-induced constraints from multi-view video data, but we will expand the Method section with a new paragraph detailing the regularization: monocular depth supervision on Gaussian centers, a cross-view consistency term via auxiliary renderings, and multi-view photometric losses. These additions clarify how the single-pass prediction remains well-constrained for novel trajectories. revision: yes
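A hedged sketch of how the three regularizers named in this response could be combined. The loss forms and weights below are illustrative assumptions, not the paper's actual training objective.

```python
# Sketch of the regularizers described in the rebuttal: depth supervision on
# Gaussian centers, a cross-view consistency term between auxiliary renderings,
# and a multi-view photometric loss. All forms and weights are assumptions.
import torch

def depth_supervision(pred_depth, mono_depth):
    # L1 between depths of predicted Gaussian centers and a monocular depth prior.
    return (pred_depth - mono_depth).abs().mean()

def photometric(rendered, target):
    # Multi-view photometric term: renderings vs. held-out frames of the clip.
    return ((rendered - target) ** 2).mean()

def cross_view_consistency(render_a, render_b_warped):
    # Auxiliary renderings of the same content from two views, one warped into
    # the other's frame, should agree.
    return (render_a - render_b_warped).abs().mean()

def total_loss(pred_depth, mono_depth, rendered, target, render_a, render_b_warped,
               w_depth=0.1, w_photo=1.0, w_consist=0.05):  # weights are assumed
    return (w_depth * depth_supervision(pred_depth, mono_depth)
            + w_photo * photometric(rendered, target)
            + w_consist * cross_view_consistency(render_a, render_b_warped))

# Toy usage with random tensors:
x = torch.rand(2, 3, 64, 64)
d = torch.rand(2, 1, 64, 64)
print(total_loss(d, torch.rand(2, 1, 64, 64), x, x, x, x).item())
```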

Circularity Check

0 steps flagged

No circularity: novel framework presented without self-referential derivations or fitted predictions

full rationale

The provided abstract and description frame the contribution as a new procedural framework that constructs 3D Gaussians and samples motion from a single image in one forward pass. No equations, parameter-fitting steps, or self-citations are exhibited that would reduce any claimed prediction to an input quantity by construction. The method is positioned as independent of prior fitted results from the same authors, with evaluation on external datasets (KITTI, Waymo, RealEstate10K, DL3DV-10K). This satisfies the criteria for a self-contained proposal with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the unstated premise that a learned single-image encoder can produce a complete dynamic 3D Gaussian scene whose motion sampling yields temporally coherent renderings; no explicit free parameters or invented entities are enumerated in the abstract.

axioms (1)
  • domain assumption A single image suffices to infer both static 3D geometry and plausible future object motions
    Implicit in the single-image input and single-forward-pass claim.
invented entities (1)
  • Dynamic 3D Gaussians no independent evidence
    purpose: Unified representation of scene geometry and object motion
    Introduced as the core intermediate structure enabling one-pass generation.

pith-pipeline@v0.9.0 · 5567 in / 1137 out tokens · 42749 ms · 2026-05-16T18:16:01.870709+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
