pith. machine review for the scientific record.
sign in

arxiv: 2511.00503 · v2 · submitted 2025-11-01 · 💻 cs.CV

Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models

Pith reviewed 2026-05-18 01:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D scene generation3D Gaussian splattingvideo diffusion modelslatent transformerdynamic reconstructionfeed-forward generationcontrollable synthesisnovel view synthesis
0
0 comments X

The pith

A single forward pass from one image, camera path and optional text can output a full 4D scene as a deformable 3D Gaussian field.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method that turns a single image, a camera trajectory, and an optional text prompt into an explicit 4D scene representation. It does so by directly producing a set of time-varying 3D Gaussian points that hold appearance, shape, and motion information together. This occurs in one quick step instead of the slow optimization loops common in prior work. If successful, the result would let users create and view dynamic scenes much more quickly for uses such as animation or virtual environments. The approach rests on training a video latent transformer to combine diffusion-based generation with geometric and motion constraints learned from large 4D datasets.

Core claim

Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion, all in a single forward pass, without test-time optimization or post-hoc refinement. At the core of the framework lies a video latent transformer, which augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training is guided by objectives on appearance fidelity, geometric accuracy, and motion consistency, enabling synthesis of high-quality 4D scenes in 30 seconds.

What carries the argument

Video latent transformer that augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives.

If this is right

  • High-quality 4D scenes become available for video generation, novel view synthesis, and geometry extraction tasks.
  • Performance matches or exceeds optimization-based dynamic scene methods while running in roughly 30 seconds.
  • Control is provided through an input camera trajectory and optional text prompt in a single pass.
  • Explicit 3D Gaussian output allows direct extraction of geometry and motion without post-processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the single-pass prediction holds for longer sequences, it could enable interactive 4D editing tools that current slow methods cannot support.
  • The same latent-space augmentation idea might transfer to other diffusion models for faster 4D extensions in robotics simulation.
  • Fast explicit 4D output could simplify downstream tasks such as physics-based editing or real-time rendering in AR applications.

Load-bearing premise

The video latent transformer produces stable and accurate time-varying 3D Gaussian primitives across diverse scenes without needing extra constraints or refinement steps.

What would settle it

Render the generated 4D Gaussian field from new viewpoints and times; visible flickering, drifting geometry, or motion artifacts that grow with sequence length would show the joint prediction is not reliable.

Figures

Figures reproduced from arXiv: 2511.00503 by Chenguo Lin, Chenxin Li, Haopeng Li, Honglei Yan, Jingjing Zhao, Kairun Wen, Panwang Pan, Yadong Mu, Yixuan Yuan, Yuchen Lin, Yunlong Lin.

Figure 1
Figure 1. Figure 1: Given a single image, a specified camera trajectory, and an optional text prompt, our [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Architecture of DIFF4SPLAT. We present a high-fidelity dynamic 3DGS generation method from a single image through four key innovations: (1) video diffusion latents processed by our novel Transformer (Sec. 3.2), (2) a dynamic 3DGS deformation mechanism (Sec. 3.3), (3) unified supervision with photometric, geometric, and motion losses (Sec. 3.4), and (4) a progressive training scheme for robust geometry and … view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison with state-of-the-art methods. DIFF4SPLAT (last column) generates more visually appealing and temporally consistent 4D scenes with superior geometric fidelity compared to baselines. Kindly zoom in for details. camera controllability, drastically reducing the Relative Pose Error (RPE) in both translation and rotation. This ensures that the generated video faithfully adheres to the spe… view at source ↗
Figure 4
Figure 4. Figure 4: Ablation of the Deformation Gaussian Field shows that removing this module (the red bounding boxes) results in ghosting artifacts, particularly in the large motion frames. training strategy yields significantly higher visual quality than direct dynamic training. This result underscores that progressive training not only enhances final performance and visual fidelity but also achieves superior results withi… view at source ↗
Figure 5
Figure 5. Figure 5: Ablation on the progressive training strategy. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Failure Case. DIFF4SPLAT can produce artifacts when rendering novel timestamps, especially from disparate viewpoints. This issue, common to related methods, stems from ambiguity in estimating temporal deformations when propagating 3D Gaussians from multiple reference frames [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: More qualitative of DIFF4SPLAT for 4D Scene generation. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: More qualitative of DIFF4SPLAT for 4D Scene generation [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: More qualitative of DIFF4SPLAT for 4D Scene generation. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
read the original abstract

We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Our approach unifies the generative priors of video diffusion models with geometry and motion constraints learned from large-scale 4D datasets. Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion, all in a single forward pass, without test-time optimization or post-hoc refinement. At the core of our framework lies a video latent transformer, which augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training is guided by objectives on appearance fidelity, geometric accuracy, and motion consistency, enabling Diff4Splat to synthesize high-quality 4D scenes in 30 seconds. We demonstrate the effectiveness of Diff4Splat across video generation, novel view synthesis, and geometry extraction, where it matches or surpasses optimization-based methods for dynamic scene synthesis while being significantly more efficient.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Diff4Splat, a feed-forward method that, given a single input image, camera trajectory, and optional text prompt, directly predicts a deformable 3D Gaussian field encoding appearance, geometry, and motion in a single forward pass. It augments video diffusion models with a video latent transformer to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training uses objectives on appearance fidelity, geometric accuracy, and motion consistency on large-scale 4D datasets, enabling high-quality 4D scene synthesis in 30 seconds. The authors claim the method matches or surpasses optimization-based baselines on video generation, novel view synthesis, and geometry extraction while being significantly more efficient.

Significance. If the results hold, this would be a notable advance in efficient controllable 4D scene generation by removing test-time optimization and post-hoc refinement steps. The explicit deformable 3D Gaussian representation supports direct controllability via trajectories and prompts, and the approach benefits from training on external 4D datasets with standard diffusion objectives, keeping circularity low. This could accelerate applications in dynamic scene modeling for VR/AR and animation.

major comments (2)
  1. [Abstract and core framework description] The training objectives are described only at a high level as covering 'appearance fidelity, geometric accuracy, and motion consistency' with no mention of an explicit 3D reprojection loss or multi-view consistency term. This is load-bearing for the central claim that the video latent transformer produces view-consistent time-varying 3D Gaussians for arbitrary trajectories, as the latent space may capture only 2D spatio-temporal correlations without such a penalty.
  2. [Abstract] The abstract asserts that the method 'matches or surpasses optimization-based methods' for video generation, novel view synthesis, and geometry extraction, yet provides no quantitative tables, metrics, error bars, or dataset details to support the single-pass accuracy claim.
minor comments (1)
  1. [Abstract] The claimed runtime of '30 seconds' would benefit from specification of hardware, input resolution, and output format to support reproducibility claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to improve clarity where needed while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract and core framework description] The training objectives are described only at a high level as covering 'appearance fidelity, geometric accuracy, and motion consistency' with no mention of an explicit 3D reprojection loss or multi-view consistency term. This is load-bearing for the central claim that the video latent transformer produces view-consistent time-varying 3D Gaussians for arbitrary trajectories, as the latent space may capture only 2D spatio-temporal correlations without such a penalty.

    Authors: We agree the abstract is high-level. The full methods section details that the geometric accuracy objective includes an explicit 3D reprojection loss (computed via differentiable rendering of Gaussians onto multiple views using the input camera trajectories) and a multi-view consistency term supervised on the large-scale 4D datasets. These terms directly penalize inconsistencies in 3D positions and appearances, ensuring the latent transformer learns view-consistent outputs rather than pure 2D correlations. We will revise the abstract to explicitly reference the 3D reprojection and multi-view terms. revision: yes

  2. Referee: [Abstract] The abstract asserts that the method 'matches or surpasses optimization-based methods' for video generation, novel view synthesis, and geometry extraction, yet provides no quantitative tables, metrics, error bars, or dataset details to support the single-pass accuracy claim.

    Authors: Abstracts conventionally summarize claims at a high level; the supporting quantitative evidence appears in the results section, including tables with metrics (PSNR/SSIM/LPIPS for video and NVS, Chamfer distance and normal consistency for geometry), error bars over multiple seeds, and dataset details (e.g., specific 4D training corpora and evaluation splits). We will add a brief sentence to the abstract pointing to these quantitative results and consider including one or two key metric values if space permits. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external data and standard training

full rationale

The paper presents Diff4Splat as a trained feed-forward model that augments video diffusion backbones with a latent transformer to output time-varying 3D Gaussians. Training uses external 4D datasets and objectives for appearance, geometry, and motion consistency. No equations or claims in the abstract reduce the output to a self-defined quantity, a fitted parameter renamed as prediction, or a self-citation chain that bears the central load. The single-forward-pass claim follows from the learned mapping rather than tautological construction, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The central claim implicitly rests on the assumption that large-scale 4D datasets provide sufficient supervision for joint appearance-geometry-motion prediction without post-processing.

pith-pipeline@v0.9.0 · 5755 in / 1182 out tokens · 22482 ms · 2026-05-18T01:55:55.635413+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

116 extracted references · 116 canonical work pages · 13 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr \'o n, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

  3. [3]

    Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

    Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. arXiv preprint arXiv:2411.18673, 2024

  4. [4]

    Componerf: Text-guided multi-object compositional nerf with editable 3d scene layout

    Haotian Bai, Yuanhuiyi Lyu, Lutao Jiang, Sijia Li, Haonan Lu, Xiaodong Lin, and Lin Wang. Componerf: Text-guided multi-object compositional nerf with editable 3d scene layout. arXiv preprint arXiv:2303.13843, 2023 a

  5. [5]

    Recammaster: Camera-controlled generative rendering from a single video.arXiv preprint arXiv:2503.11647,

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, and Di Zhang. Recammaster: Camera-controlled generative rendering from a single video. arXiv preprint arXiv:2503.11647, 2025

  6. [6]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023 b

  7. [7]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023 a

  8. [8]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proc. CVPR, 2023 b

  9. [9]

    Virtual KITTI 2

    Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2. arXiv preprint arXiv:2001.10773, 2020

  10. [10]

    Video depth anything: Consistent depth estimation for super-long videos

    Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. In Conference on Computer Vision and Pattern Recognition (CVPR), 2025 a

  11. [11]

    4dnex: Feed-forward 4d generative modeling made easy

    Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, and Ziwei Liu. 4dnex: Feed-forward 4d generative modeling made easy. arXiv preprint arXiv:2508.13154, 2025 b

  12. [12]

    Dreamscene4d: Dynamic multi-object scene generation from monocular videos

    Wen-Hsuan Chu, Lei Ke, and Katerina Fragkiadaki. Dreamscene4d: Dynamic multi-object scene generation from monocular videos. arXiv preprint arXiv:2405.02280, 2024

  13. [13]

    Dreamscene4d: Dynamic multi-object scene generation from monocular videos

    Wen-Hsuan Chu, Lei Ke, and Katerina Fragkiadaki. Dreamscene4d: Dynamic multi-object scene generation from monocular videos. Advances in Neural Information Processing Systems, 37: 0 96181--96206, 2025

  14. [14]

    Luciddreamer: Domain-free generation of 3d gaussian splatting scenes,

    Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv preprint arXiv:2311.13384, 2023

  15. [15]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nie ner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 5828--5839, 2017

  16. [16]

    Structure and content-guided video synthesis with diffusion models

    Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In ICCV, 2023

  17. [17]

    GraphDreamer : Compositional 3D scene synthesis from scene graphs

    Gege Gao, Weiyang Liu, Anpei Chen, Andreas Geiger, and Bernhard Sch \"o lkopf. GraphDreamer : Compositional 3D scene synthesis from scene graphs. Proc. CVPR, 2024

  18. [18]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023

  19. [19]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024

  20. [20]

    Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models, 2025

    Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, and Hongsheng Li. Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models, 2025. URL https://arxiv.org/abs/2503.10592

  21. [21]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 2020

  22. [22]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35: 0 8633--8646, 2022

  23. [23]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022

  24. [24]

    Pl \"u cker coordinates for lines in the space

    Yan-Bin Jia. Pl \"u cker coordinates for lines in the space. Problem Solver Techniques for Applied Computer Science, Com-S-477/577 Course Handout, 2020

  25. [25]

    Stereo4d: Learning how things move in 3d from internet stereo videos

    Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, and Aleksander Holynski. Stereo4d: Learning how things move in 3d from internet stereo videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  26. [26]

    Dynamicstereo: Consistent dynamic depth from stereo videos

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Dynamicstereo: Consistent dynamic depth from stereo videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  27. [27]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 2023 a

  28. [28]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimk \"u hler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. In ACM TOG, 2023 b

  29. [29]

    Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds

    Jiahui Lei, Yijia Weng, Adam Harley, Leonidas Guibas, and Kostas Daniilidis. Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. arXiv preprint arXiv:2405.17421, 2024

  30. [30]

    Grounding Image Matching in 3D with MASt3R.arXiv preprint arXiv:2406.09756,

    Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3d with mast3r. arXiv:2406.09756, 2024

  31. [31]

    4k4dgen: Panoramic 4d generation at 4k resolution

    Renjie Li, Panwang Pan, Bangbang Yang, Dejia Xu, Shijie Zhou, Xuanyang Zhang, Zeming Li, Achuta Kadambi, Zhangyang Wang, Zhengzhong Tu, et al. 4k4dgen: Panoramic 4d generation at 4k resolution. arXiv preprint arXiv:2406.13527, 2024

  32. [32]

    4k4dgen: Panoramic 4d generation at 4k resolution

    Renjie Li, Panwang Pan, Bangbang Yang, Dejia Xu, Shijie Zhou, Xuanyang Zhang, Zeming Li, Achuta Kadambi, Zhangyang Wang, Zhengzhong Tu, et al. 4k4dgen: Panoramic 4d generation at 4k resolution. Proc. ICLR, 2025 a

  33. [33]

    Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond

    Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023 a

  34. [34]

    Gligen: Open-set grounded text-to-image generation

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In CVPR, 2023 b

  35. [35]

    Dynibar: Neural dynamic image-based rendering

    Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, and Noah Snavely. Dynibar: Neural dynamic image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023 c

  36. [36]

    Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos

    Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025 b

  37. [37]

    Wonderland: Navigating 3d scenes from a single image

    Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N Plataniotis, Sergey Tulyakov, and Jian Ren. Wonderland: Navigating 3d scenes from a single image. arXiv preprint arXiv:2412.12091, 2024 a

  38. [38]

    Plataniotis, Sergey Tulyakov, and Jian Ren

    Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N. Plataniotis, Sergey Tulyakov, and Jian Ren. Wonderland: Navigating 3D Scenes from a Single Image , December 2024 b

  39. [39]

    Feed- Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos , December 2024 c

    Hanxue Liang, Jiawei Ren, Ashkan Mirzaei, Antonio Torralba, Ziwei Liu, Igor Gilitschenski, Sanja Fidler, Cengiz Oztireli, Huan Ling, Zan Gojcic, and Jiahui Huang. Feed- Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos , December 2024 c

  40. [40]

    Feed-forward bullet-time reconstruction of dynamic scenes from monocular videos.arXiv preprint arXiv:2412.03526,

    Hanxue Liang, Jiawei Ren, Ashkan Mirzaei, Antonio Torralba, Ziwei Liu, Igor Gilitschenski, Sanja Fidler, Cengiz Oztireli, Huan Ling, Zan Gojcic, et al. Feed-forward bullet-time reconstruction of dynamic scenes from monocular videos. arXiv preprint arXiv:2412.03526, 2024 d

  41. [41]

    Gaufre: Gaussian deformation fields for real-time dynamic novel view synthesis

    Yiqing Liang, Numair Khan, Zhengqin Li, Thu Nguyen-Phuoc, Douglas Lanman, James Tompkin, and Lei Xiao. Gaufre: Gaussian deformation fields for real-time dynamic novel view synthesis. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp.\ 2642--2652. IEEE, 2025

  42. [42]

    Instructscene: Instruction-driven 3d indoor scene synthesis with semantic graph prior

    Chenguo Lin and Yadong Mu. Instructscene: Instruction-driven 3d indoor scene synthesis with semantic graph prior. arXiv preprint arXiv:2402.04717, 2024

  43. [43]

    Instructlayout: Instruction-driven 2d and 3d layout synthesis with semantic graph prior

    Chenguo Lin, Yuchen Lin, Panwang Pan, Xuanyang Zhang, and Yadong Mu. Instructlayout: Instruction-driven 2d and 3d layout synthesis with semantic graph prior. arXiv preprint arXiv:2407.07580, 2024 a

  44. [44]

    Diffsplat: Repurposing image diffusion models for scalable gaussian splat generation,

    Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, and Yadong Mu. Diffsplat: Repurposing image diffusion models for scalable gaussian splat generation. arXiv preprint arXiv:2501.16764, 2025 a

  45. [45]

    Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle

    Youtian Lin, Zuozhuo Dai, Siyu Zhu, and Yao Yao. Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 21136--21145, 2024 b

  46. [46]

    Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers, 2025 b

    Yuchen Lin, Chenguo Lin, Panwang Pan, Honglei Yan, Yiqiang Feng, Yadong Mu, and Katerina Fragkiadaki. Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers, 2025 b . URL https://arxiv.org/abs/2506.05573

  47. [47]

    Omniphysgs: 3d constitutive gaussians for general physics-based dynamics generation

    Yuchen Lin, Chenguo Lin, Jianjin Xu, and Yadong Mu. Omniphysgs: 3d constitutive gaussians for general physics-based dynamics generation. arXiv preprint arXiv:2501.18982, 2025 c

  48. [48]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 22160--22169, 2024

  49. [49]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023

  50. [50]

    Reconx: Reconstruct any scene from sparse views with video diffusion model

    Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, and Yueqi Duan. Reconx: Reconstruct any scene from sparse views with video diffusion model. arXiv preprint arXiv:2408.16767, 2024

  51. [51]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

  52. [52]

    Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo

    Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nalivayko, and Andr \'e s Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 4981--4991, 2023

  53. [53]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    B Mildenhall, PP Srinivasan, M Tancik, JT Barron, R Ramamoorthi, and R Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), 2020

  54. [54]

    T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

    Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In AAAI, 2024

  55. [55]

    u ller, Katja Schwarz, Barbara R \

    Norman M \"u ller, Katja Schwarz, Barbara R \"o ssle, Lorenzo Porzi, Samuel Rota Bul \`o , Matthias Nie ner, and Peter Kontschieder. Multidiff: Consistent novel view synthesis from a single image. In Proc. CVPR, 2024

  56. [56]

    Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models

    Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models. arXiv preprint arXiv:2311.16103, 2023

  57. [57]

    Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model

    Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, and Yinqiang Zheng. Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model. arXiv preprint arXiv:2405.20222, 2024

  58. [58]

    Humansplat: Generalizable single-image human gaussian splatting with structure priors, 2024

    Panwang Pan, Zhuo Su, Chenguo Lin, Zhen Fan, Yongjie Zhang, Zeming Li, Tingting Shen, Yadong Mu, and Yebin Liu. Humansplat: Generalizable single-image human gaussian splatting with structure priors, 2024. URL https://arxiv.org/abs/2406.12459

  59. [59]

    Vase: Object-centric appearance and shape manipulation of real videos

    Elia Peruzzo, Vidit Goel, Dejia Xu, Xingqian Xu, Yifan Jiang, Zhangyang Wang, Humphrey Shi, and Nicu Sebe. Vase: Object-centric appearance and shape manipulation of real videos. arXiv preprint arXiv:2401.02473, 2024

  60. [60]

    Compositional 3D scene generation using locally conditioned diffusion

    Ryan Po and Gordon Wetzstein. Compositional 3D scene generation using locally conditioned diffusion. Proc. 3DV, 2024

  61. [61]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PMLR, 2021

  62. [62]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. In Proc. JMLR, 2020

  63. [63]

    Gen3c: 3d-informed world-consistent video generation with precise camera control

    Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  64. [64]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj \"o rn Ommer. High-resolution image synthesis with latent diffusion models. In Proc. CVPR, 2022 a

  65. [65]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj \"o rn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022 b

  66. [66]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Computer Vision and Pattern Recognition (CVPR), 2016

  67. [67]

    CLIP+MLP Aesthetic Score Predictor

    Christoph Schuhmann. CLIP+MLP Aesthetic Score Predictor . https://github.com/christophschuhmann/improved-aesthetic-predictor, 2023

  68. [68]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 2022

  69. [69]

    Seeing world dynamics in a nutshell, 2025

    Qiuhong Shen, Xuanyu Yi, Mingbao Lin, Hanwang Zhang, Shuicheng Yan, and Xinchao Wang. Seeing world dynamics in a nutshell, 2025. URL https://arxiv.org/abs/2502.03465

  70. [70]

    Light field networks: Neural scene representations with single-evaluation rendering

    Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. In Proc. NeurIPS, 2021

  71. [71]

    A benchmark for the evaluation of rgb-d slam systems

    J \"u rgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of rgb-d slam systems. In IEEE/RSJ international conference on intelligent robots and systems, 2012

  72. [72]

    Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion

    Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhang, and Yikai Wang. Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion. arXiv preprint arXiv:2411.04928, 2024 a

  73. [73]

    Splatter a video: Video gaussian representation for versatile processing, 2024 b

    Yang-Tian Sun, Yi-Hua Huang, Lin Ma, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Splatter a video: Video gaussian representation for versatile processing, 2024 b . URL https://arxiv.org/abs/2406.13870

  74. [74]

    Splatter a video: Video gaussian representation for versatile processing

    Yang-Tian Sun, Yihua Huang, Lin Ma, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Splatter a video: Video gaussian representation for versatile processing. In Advances in Neural Information Processing Systems (NeurIPS), 2024 c

  75. [75]

    Bolt3d: Generating 3d scenes in seconds

    Stanislaw Szymanowicz, Jason Y Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T Barron, and Philipp Henzler. Bolt3d: Generating 3d scenes in seconds. arXiv preprint arXiv:2503.14445, 2025

  76. [76]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  77. [77]

    Fvd: A new metric for video generation

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Rapha \"e l Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. International Conference on Learning Representations (ICLR), 2019

  78. [78]

    Cg3d: Compositional generation for text-to-3d via gaussian splatting,

    Alexander Vilesov, Pradyumna Chari, and Achuta Kadambi. Cg3d: Compositional generation for text-to-3d via gaussian splatting. arXiv preprint arXiv:2311.17907, 2023

  79. [79]

    4real-video: Learning generalizable photo-realistic 4d video diffusion

    Chaoyang Wang, Peiye Zhuang, Tuan Duc Ngo, Willi Menapace, Aliaksandr Siarohin, Michael Vasilkovsky, Ivan Skorokhodov, Sergey Tulyakov, Peter Wonka, and Hsin-Ying Lee. 4real-video: Learning generalizable photo-realistic 4d video diffusion. arXiv preprint arXiv:2412.04462, 2024 a

  80. [80]

    Vggt: Visual geometry grounded transformer, 2025 a

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer, 2025 a . URL https://arxiv.org/abs/2503.11651

Showing first 80 references.