arxiv: 2511.00503 · v2 · submitted 2025-11-01 · 💻 cs.CV

Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models

Panwang Pan , Chenguo Lin , Jingjing Zhao , Chenxin Li , Yuchen Lin , Haopeng Li , Honglei Yan , Kairun Wen

show 3 more authors

Yunlong Lin Yixuan Yuan Yadong Mu

This is my paper

Pith reviewed 2026-05-18 01:55 UTC · model grok-4.3

classification 💻 cs.CV

keywords 4D scene generation3D Gaussian splattingvideo diffusion modelslatent transformerdynamic reconstructionfeed-forward generationcontrollable synthesisnovel view synthesis

0 comments

The pith

A single forward pass from one image, camera path and optional text can output a full 4D scene as a deformable 3D Gaussian field.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method that turns a single image, a camera trajectory, and an optional text prompt into an explicit 4D scene representation. It does so by directly producing a set of time-varying 3D Gaussian points that hold appearance, shape, and motion information together. This occurs in one quick step instead of the slow optimization loops common in prior work. If successful, the result would let users create and view dynamic scenes much more quickly for uses such as animation or virtual environments. The approach rests on training a video latent transformer to combine diffusion-based generation with geometric and motion constraints learned from large 4D datasets.

Core claim

Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion, all in a single forward pass, without test-time optimization or post-hoc refinement. At the core of the framework lies a video latent transformer, which augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training is guided by objectives on appearance fidelity, geometric accuracy, and motion consistency, enabling synthesis of high-quality 4D scenes in 30 seconds.

What carries the argument

Video latent transformer that augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives.

If this is right

High-quality 4D scenes become available for video generation, novel view synthesis, and geometry extraction tasks.
Performance matches or exceeds optimization-based dynamic scene methods while running in roughly 30 seconds.
Control is provided through an input camera trajectory and optional text prompt in a single pass.
Explicit 3D Gaussian output allows direct extraction of geometry and motion without post-processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the single-pass prediction holds for longer sequences, it could enable interactive 4D editing tools that current slow methods cannot support.
The same latent-space augmentation idea might transfer to other diffusion models for faster 4D extensions in robotics simulation.
Fast explicit 4D output could simplify downstream tasks such as physics-based editing or real-time rendering in AR applications.

Load-bearing premise

The video latent transformer produces stable and accurate time-varying 3D Gaussian primitives across diverse scenes without needing extra constraints or refinement steps.

What would settle it

Render the generated 4D Gaussian field from new viewpoints and times; visible flickering, drifting geometry, or motion artifacts that grow with sequence length would show the joint prediction is not reliable.

Figures

Figures reproduced from arXiv: 2511.00503 by Chenguo Lin, Chenxin Li, Haopeng Li, Honglei Yan, Jingjing Zhao, Kairun Wen, Panwang Pan, Yadong Mu, Yixuan Yuan, Yuchen Lin, Yunlong Lin.

**Figure 2.** Figure 2: Architecture of DIFF4SPLAT. We present a high-fidelity dynamic 3DGS generation method from a single image through four key innovations: (1) video diffusion latents processed by our novel Transformer (Sec. 3.2), (2) a dynamic 3DGS deformation mechanism (Sec. 3.3), (3) unified supervision with photometric, geometric, and motion losses (Sec. 3.4), and (4) a progressive training scheme for robust geometry and … view at source ↗

**Figure 3.** Figure 3: Qualitative comparison with state-of-the-art methods. DIFF4SPLAT (last column) generates more visually appealing and temporally consistent 4D scenes with superior geometric fidelity compared to baselines. Kindly zoom in for details. camera controllability, drastically reducing the Relative Pose Error (RPE) in both translation and rotation. This ensures that the generated video faithfully adheres to the spe… view at source ↗

**Figure 4.** Figure 4: Ablation of the Deformation Gaussian Field shows that removing this module (the red bounding boxes) results in ghosting artifacts, particularly in the large motion frames. training strategy yields significantly higher visual quality than direct dynamic training. This result underscores that progressive training not only enhances final performance and visual fidelity but also achieves superior results withi… view at source ↗

**Figure 5.** Figure 5: Ablation on the progressive training strategy. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Failure Case. DIFF4SPLAT can produce artifacts when rendering novel timestamps, especially from disparate viewpoints. This issue, common to related methods, stems from ambiguity in estimating temporal deformations when propagating 3D Gaussians from multiple reference frames [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

**Figure 7.** Figure 7: More qualitative of DIFF4SPLAT for 4D Scene generation. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

**Figure 8.** Figure 8: More qualitative of DIFF4SPLAT for 4D Scene generation [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗

**Figure 9.** Figure 9: More qualitative of DIFF4SPLAT for 4D Scene generation. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗

read the original abstract

We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Our approach unifies the generative priors of video diffusion models with geometry and motion constraints learned from large-scale 4D datasets. Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion, all in a single forward pass, without test-time optimization or post-hoc refinement. At the core of our framework lies a video latent transformer, which augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training is guided by objectives on appearance fidelity, geometric accuracy, and motion consistency, enabling Diff4Splat to synthesize high-quality 4D scenes in 30 seconds. We demonstrate the effectiveness of Diff4Splat across video generation, novel view synthesis, and geometry extraction, where it matches or surpasses optimization-based methods for dynamic scene synthesis while being significantly more efficient.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Diff4Splat offers a feed-forward route to controllable 4D Gaussian scenes but its performance claims rest on thin evidence so far.

read the letter

Diff4Splat presents a feed-forward method for creating controllable 4D scenes by predicting deformable 3D Gaussian fields directly from a single image and camera trajectory. The new element is the video latent transformer that augments standard video diffusion to handle both spatio-temporal patterns and the output of time-varying Gaussians. This lets the model encode appearance, geometry, and motion in one pass. It trains on large 4D datasets with losses for fidelity, accuracy, and consistency, and the authors say it produces results competitive with optimization-heavy methods but in far less time, around 30 seconds. This setup has clear practical appeal for anyone needing quick 4D content without heavy computation per scene. The explicit Gaussian output also makes it easier to extract geometry or render from new views compared to pure video generation approaches. The main concern is the lack of detailed results in the abstract. Without tables or specific metrics, it's difficult to confirm how well it really performs on novel view synthesis or motion accuracy. The stress-test point about potential lack of explicit multi-view constraints is relevant here. The geometric accuracy objective might not be enough to prevent drift in the 3D structure for trajectories outside the training distribution, even if 2D video quality is high. This kind of paper would fit well in a reading group focused on generative 3D and 4D models. It could be valuable for practitioners in graphics who want to move beyond slow optimization loops. I would send it out for peer review to get a closer look at the experiments and any ablations on the transformer design.

Referee Report

2 major / 1 minor

Summary. The paper introduces Diff4Splat, a feed-forward method that, given a single input image, camera trajectory, and optional text prompt, directly predicts a deformable 3D Gaussian field encoding appearance, geometry, and motion in a single forward pass. It augments video diffusion models with a video latent transformer to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training uses objectives on appearance fidelity, geometric accuracy, and motion consistency on large-scale 4D datasets, enabling high-quality 4D scene synthesis in 30 seconds. The authors claim the method matches or surpasses optimization-based baselines on video generation, novel view synthesis, and geometry extraction while being significantly more efficient.

Significance. If the results hold, this would be a notable advance in efficient controllable 4D scene generation by removing test-time optimization and post-hoc refinement steps. The explicit deformable 3D Gaussian representation supports direct controllability via trajectories and prompts, and the approach benefits from training on external 4D datasets with standard diffusion objectives, keeping circularity low. This could accelerate applications in dynamic scene modeling for VR/AR and animation.

major comments (2)

[Abstract and core framework description] The training objectives are described only at a high level as covering 'appearance fidelity, geometric accuracy, and motion consistency' with no mention of an explicit 3D reprojection loss or multi-view consistency term. This is load-bearing for the central claim that the video latent transformer produces view-consistent time-varying 3D Gaussians for arbitrary trajectories, as the latent space may capture only 2D spatio-temporal correlations without such a penalty.
[Abstract] The abstract asserts that the method 'matches or surpasses optimization-based methods' for video generation, novel view synthesis, and geometry extraction, yet provides no quantitative tables, metrics, error bars, or dataset details to support the single-pass accuracy claim.

minor comments (1)

[Abstract] The claimed runtime of '30 seconds' would benefit from specification of hardware, input resolution, and output format to support reproducibility claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to improve clarity where needed while preserving the core contributions.

read point-by-point responses

Referee: [Abstract and core framework description] The training objectives are described only at a high level as covering 'appearance fidelity, geometric accuracy, and motion consistency' with no mention of an explicit 3D reprojection loss or multi-view consistency term. This is load-bearing for the central claim that the video latent transformer produces view-consistent time-varying 3D Gaussians for arbitrary trajectories, as the latent space may capture only 2D spatio-temporal correlations without such a penalty.

Authors: We agree the abstract is high-level. The full methods section details that the geometric accuracy objective includes an explicit 3D reprojection loss (computed via differentiable rendering of Gaussians onto multiple views using the input camera trajectories) and a multi-view consistency term supervised on the large-scale 4D datasets. These terms directly penalize inconsistencies in 3D positions and appearances, ensuring the latent transformer learns view-consistent outputs rather than pure 2D correlations. We will revise the abstract to explicitly reference the 3D reprojection and multi-view terms. revision: yes
Referee: [Abstract] The abstract asserts that the method 'matches or surpasses optimization-based methods' for video generation, novel view synthesis, and geometry extraction, yet provides no quantitative tables, metrics, error bars, or dataset details to support the single-pass accuracy claim.

Authors: Abstracts conventionally summarize claims at a high level; the supporting quantitative evidence appears in the results section, including tables with metrics (PSNR/SSIM/LPIPS for video and NVS, Chamfer distance and normal consistency for geometry), error bars over multiple seeds, and dataset details (e.g., specific 4D training corpora and evaluation splits). We will add a brief sentence to the abstract pointing to these quantitative results and consider including one or two key metric values if space permits. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external data and standard training

full rationale

The paper presents Diff4Splat as a trained feed-forward model that augments video diffusion backbones with a latent transformer to output time-varying 3D Gaussians. Training uses external 4D datasets and objectives for appearance, geometry, and motion consistency. No equations or claims in the abstract reduce the output to a self-defined quantity, a fitted parameter renamed as prediction, or a self-citation chain that bears the central load. The single-forward-pass claim follows from the learned mapping rather than tautological construction, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The central claim implicitly rests on the assumption that large-scale 4D datasets provide sufficient supervision for joint appearance-geometry-motion prediction without post-processing.

pith-pipeline@v0.9.0 · 5755 in / 1182 out tokens · 22482 ms · 2026-05-18T01:55:55.635413+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

video latent transformer... jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives... Training is guided by objectives on appearance fidelity, geometric accuracy, and motion consistency
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

deformable 3D Gaussian field... 8-layer DPT head

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

116 extracted references · 116 canonical work pages · 13 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr \'o n, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. arXiv preprint arXiv:2411.18673, 2024

work page arXiv 2024
[4]

Componerf: Text-guided multi-object compositional nerf with editable 3d scene layout

Haotian Bai, Yuanhuiyi Lyu, Lutao Jiang, Sijia Li, Haonan Lu, Xiaodong Lin, and Lin Wang. Componerf: Text-guided multi-object compositional nerf with editable 3d scene layout. arXiv preprint arXiv:2303.13843, 2023 a

work page arXiv 2023
[5]

Recammaster: Camera-controlled generative rendering from a single video.arXiv preprint arXiv:2503.11647,

Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, and Di Zhang. Recammaster: Camera-controlled generative rendering from a single video. arXiv preprint arXiv:2503.11647, 2025

work page arXiv 2025
[6]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023 b

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023 a

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Align your latents: High-resolution video synthesis with latent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proc. CVPR, 2023 b

work page 2023
[9]

Virtual KITTI 2

Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2. arXiv preprint arXiv:2001.10773, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[10]

Video depth anything: Consistent depth estimation for super-long videos

Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. In Conference on Computer Vision and Pattern Recognition (CVPR), 2025 a

work page 2025
[11]

4dnex: Feed-forward 4d generative modeling made easy

Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, and Ziwei Liu. 4dnex: Feed-forward 4d generative modeling made easy. arXiv preprint arXiv:2508.13154, 2025 b

work page arXiv 2025
[12]

Dreamscene4d: Dynamic multi-object scene generation from monocular videos

Wen-Hsuan Chu, Lei Ke, and Katerina Fragkiadaki. Dreamscene4d: Dynamic multi-object scene generation from monocular videos. arXiv preprint arXiv:2405.02280, 2024

work page arXiv 2024
[13]

Dreamscene4d: Dynamic multi-object scene generation from monocular videos

Wen-Hsuan Chu, Lei Ke, and Katerina Fragkiadaki. Dreamscene4d: Dynamic multi-object scene generation from monocular videos. Advances in Neural Information Processing Systems, 37: 0 96181--96206, 2025

work page 2025
[14]

Luciddreamer: Domain-free generation of 3d gaussian splatting scenes,

Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv preprint arXiv:2311.13384, 2023

work page arXiv 2023
[15]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nie ner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 5828--5839, 2017

work page 2017
[16]

Structure and content-guided video synthesis with diffusion models

Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In ICCV, 2023

work page 2023
[17]

GraphDreamer : Compositional 3D scene synthesis from scene graphs

Gege Gao, Weiyang Liu, Anpei Chen, Andreas Geiger, and Bernhard Sch \"o lkopf. GraphDreamer : Compositional 3D scene synthesis from scene graphs. Proc. CVPR, 2024

work page 2024
[18]

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models, 2025

Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, and Hongsheng Li. Cameractrl ii: Dynamic scene exploration via camera-controlled video diffusion models, 2025. URL https://arxiv.org/abs/2503.10592

work page arXiv 2025
[21]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 2020

work page 2020
[22]

Video diffusion models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35: 0 8633--8646, 2022

work page 2022
[23]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022

work page 2022
[24]

Pl \"u cker coordinates for lines in the space

Yan-Bin Jia. Pl \"u cker coordinates for lines in the space. Problem Solver Techniques for Applied Computer Science, Com-S-477/577 Course Handout, 2020

work page 2020
[25]

Stereo4d: Learning how things move in 3d from internet stereo videos

Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, and Aleksander Holynski. Stereo4d: Learning how things move in 3d from internet stereo videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[26]

Dynamicstereo: Consistent dynamic depth from stereo videos

Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Dynamicstereo: Consistent dynamic depth from stereo videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023
[27]

3d gaussian splatting for real-time radiance field rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 2023 a

work page 2023
[28]

3d gaussian splatting for real-time radiance field rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk \"u hler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. In ACM TOG, 2023 b

work page 2023
[29]

Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds

Jiahui Lei, Yijia Weng, Adam Harley, Leonidas Guibas, and Kostas Daniilidis. Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. arXiv preprint arXiv:2405.17421, 2024

work page arXiv 2024
[30]

Grounding Image Matching in 3D with MASt3R.arXiv preprint arXiv:2406.09756,

Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3d with mast3r. arXiv:2406.09756, 2024

work page arXiv 2024
[31]

4k4dgen: Panoramic 4d generation at 4k resolution

Renjie Li, Panwang Pan, Bangbang Yang, Dejia Xu, Shijie Zhou, Xuanyang Zhang, Zeming Li, Achuta Kadambi, Zhangyang Wang, Zhengzhong Tu, et al. 4k4dgen: Panoramic 4d generation at 4k resolution. arXiv preprint arXiv:2406.13527, 2024

work page arXiv 2024
[32]

4k4dgen: Panoramic 4d generation at 4k resolution

Renjie Li, Panwang Pan, Bangbang Yang, Dejia Xu, Shijie Zhou, Xuanyang Zhang, Zeming Li, Achuta Kadambi, Zhangyang Wang, Zhengzhong Tu, et al. 4k4dgen: Panoramic 4d generation at 4k resolution. Proc. ICLR, 2025 a

work page 2025
[33]

Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond

Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023 a

work page 2023
[34]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In CVPR, 2023 b

work page 2023
[35]

Dynibar: Neural dynamic image-based rendering

Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, and Noah Snavely. Dynibar: Neural dynamic image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023 c

work page 2023
[36]

Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos

Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025 b

work page 2025
[37]

Wonderland: Navigating 3d scenes from a single image

Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N Plataniotis, Sergey Tulyakov, and Jian Ren. Wonderland: Navigating 3d scenes from a single image. arXiv preprint arXiv:2412.12091, 2024 a

work page arXiv 2024
[38]

Plataniotis, Sergey Tulyakov, and Jian Ren

Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N. Plataniotis, Sergey Tulyakov, and Jian Ren. Wonderland: Navigating 3D Scenes from a Single Image , December 2024 b

work page 2024
[39]

Feed- Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos , December 2024 c

Hanxue Liang, Jiawei Ren, Ashkan Mirzaei, Antonio Torralba, Ziwei Liu, Igor Gilitschenski, Sanja Fidler, Cengiz Oztireli, Huan Ling, Zan Gojcic, and Jiahui Huang. Feed- Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos , December 2024 c

work page 2024
[40]

Feed-forward bullet-time reconstruction of dynamic scenes from monocular videos.arXiv preprint arXiv:2412.03526,

Hanxue Liang, Jiawei Ren, Ashkan Mirzaei, Antonio Torralba, Ziwei Liu, Igor Gilitschenski, Sanja Fidler, Cengiz Oztireli, Huan Ling, Zan Gojcic, et al. Feed-forward bullet-time reconstruction of dynamic scenes from monocular videos. arXiv preprint arXiv:2412.03526, 2024 d

work page arXiv 2024
[41]

Gaufre: Gaussian deformation fields for real-time dynamic novel view synthesis

Yiqing Liang, Numair Khan, Zhengqin Li, Thu Nguyen-Phuoc, Douglas Lanman, James Tompkin, and Lei Xiao. Gaufre: Gaussian deformation fields for real-time dynamic novel view synthesis. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp.\ 2642--2652. IEEE, 2025

work page 2025
[42]

Instructscene: Instruction-driven 3d indoor scene synthesis with semantic graph prior

Chenguo Lin and Yadong Mu. Instructscene: Instruction-driven 3d indoor scene synthesis with semantic graph prior. arXiv preprint arXiv:2402.04717, 2024

work page arXiv 2024
[43]

Instructlayout: Instruction-driven 2d and 3d layout synthesis with semantic graph prior

Chenguo Lin, Yuchen Lin, Panwang Pan, Xuanyang Zhang, and Yadong Mu. Instructlayout: Instruction-driven 2d and 3d layout synthesis with semantic graph prior. arXiv preprint arXiv:2407.07580, 2024 a

work page arXiv 2024
[44]

Diffsplat: Repurposing image diffusion models for scalable gaussian splat generation,

Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, and Yadong Mu. Diffsplat: Repurposing image diffusion models for scalable gaussian splat generation. arXiv preprint arXiv:2501.16764, 2025 a

work page arXiv 2025
[45]

Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle

Youtian Lin, Zuozhuo Dai, Siyu Zhu, and Yao Yao. Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 21136--21145, 2024 b

work page 2024
[46]

Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers, 2025 b

Yuchen Lin, Chenguo Lin, Panwang Pan, Honglei Yan, Yiqiang Feng, Yadong Mu, and Katerina Fragkiadaki. Partcrafter: Structured 3d mesh generation via compositional latent diffusion transformers, 2025 b . URL https://arxiv.org/abs/2506.05573

work page arXiv 2025
[47]

Omniphysgs: 3d constitutive gaussians for general physics-based dynamics generation

Yuchen Lin, Chenguo Lin, Jianjin Xu, and Yadong Mu. Omniphysgs: 3d constitutive gaussians for general physics-based dynamics generation. arXiv preprint arXiv:2501.18982, 2025 c

work page arXiv 2025
[48]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 22160--22169, 2024

work page 2024
[49]

Flow matching for generative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023

work page 2023
[50]

Reconx: Reconstruct any scene from sparse views with video diffusion model

Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, and Yueqi Duan. Reconx: Reconstruct any scene from sparse views with video diffusion model. arXiv preprint arXiv:2408.16767, 2024

work page arXiv 2024
[51]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

work page 2019
[52]

Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo

Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nalivayko, and Andr \'e s Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 4981--4991, 2023

work page 2023
[53]

Nerf: Representing scenes as neural radiance fields for view synthesis

B Mildenhall, PP Srinivasan, M Tancik, JT Barron, R Ramamoorthi, and R Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision (ECCV), 2020

work page 2020
[54]

T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In AAAI, 2024

work page 2024
[55]

u ller, Katja Schwarz, Barbara R \

Norman M \"u ller, Katja Schwarz, Barbara R \"o ssle, Lorenzo Porzi, Samuel Rota Bul \`o , Matthias Nie ner, and Peter Kontschieder. Multidiff: Consistent novel view synthesis from a single image. In Proc. CVPR, 2024

work page 2024
[56]

Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models

Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A comprehensive benchmark and toolkit for evaluating video-based large language models. arXiv preprint arXiv:2311.16103, 2023

work page arXiv 2023
[57]

Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model

Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, and Yinqiang Zheng. Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model. arXiv preprint arXiv:2405.20222, 2024

work page arXiv 2024
[58]

Humansplat: Generalizable single-image human gaussian splatting with structure priors, 2024

Panwang Pan, Zhuo Su, Chenguo Lin, Zhen Fan, Yongjie Zhang, Zeming Li, Tingting Shen, Yadong Mu, and Yebin Liu. Humansplat: Generalizable single-image human gaussian splatting with structure priors, 2024. URL https://arxiv.org/abs/2406.12459

work page arXiv 2024
[59]

Vase: Object-centric appearance and shape manipulation of real videos

Elia Peruzzo, Vidit Goel, Dejia Xu, Xingqian Xu, Yifan Jiang, Zhangyang Wang, Humphrey Shi, and Nicu Sebe. Vase: Object-centric appearance and shape manipulation of real videos. arXiv preprint arXiv:2401.02473, 2024

work page arXiv 2024
[60]

Compositional 3D scene generation using locally conditioned diffusion

Ryan Po and Gordon Wetzstein. Compositional 3D scene generation using locally conditioned diffusion. Proc. 3DV, 2024

work page 2024
[61]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PMLR, 2021

work page 2021
[62]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. In Proc. JMLR, 2020

work page 2020
[63]

Gen3c: 3d-informed world-consistent video generation with precise camera control

Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas Müller, Alexander Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

work page 2025
[64]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj \"o rn Ommer. High-resolution image synthesis with latent diffusion models. In Proc. CVPR, 2022 a

work page 2022
[65]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj \"o rn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022 b

work page 2022
[66]

Structure-from-motion revisited

Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Computer Vision and Pattern Recognition (CVPR), 2016

work page 2016
[67]

CLIP+MLP Aesthetic Score Predictor

Christoph Schuhmann. CLIP+MLP Aesthetic Score Predictor . https://github.com/christophschuhmann/improved-aesthetic-predictor, 2023

work page 2023
[68]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 2022

work page 2022
[69]

Seeing world dynamics in a nutshell, 2025

Qiuhong Shen, Xuanyu Yi, Mingbao Lin, Hanwang Zhang, Shuicheng Yan, and Xinchao Wang. Seeing world dynamics in a nutshell, 2025. URL https://arxiv.org/abs/2502.03465

work page arXiv 2025
[70]

Light field networks: Neural scene representations with single-evaluation rendering

Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. In Proc. NeurIPS, 2021

work page 2021
[71]

A benchmark for the evaluation of rgb-d slam systems

J \"u rgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of rgb-d slam systems. In IEEE/RSJ international conference on intelligent robots and systems, 2012

work page 2012
[72]

Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion

Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhang, and Yikai Wang. Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion. arXiv preprint arXiv:2411.04928, 2024 a

work page arXiv 2024
[73]

Splatter a video: Video gaussian representation for versatile processing, 2024 b

Yang-Tian Sun, Yi-Hua Huang, Lin Ma, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Splatter a video: Video gaussian representation for versatile processing, 2024 b . URL https://arxiv.org/abs/2406.13870

work page arXiv 2024
[74]

Splatter a video: Video gaussian representation for versatile processing

Yang-Tian Sun, Yihua Huang, Lin Ma, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Splatter a video: Video gaussian representation for versatile processing. In Advances in Neural Information Processing Systems (NeurIPS), 2024 c

work page 2024
[75]

Bolt3d: Generating 3d scenes in seconds

Stanislaw Szymanowicz, Jason Y Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T Barron, and Philipp Henzler. Bolt3d: Generating 3d scenes in seconds. arXiv preprint arXiv:2503.14445, 2025

work page arXiv 2025
[76]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[77]

Fvd: A new metric for video generation

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Rapha \"e l Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. International Conference on Learning Representations (ICLR), 2019

work page 2019
[78]

Cg3d: Compositional generation for text-to-3d via gaussian splatting,

Alexander Vilesov, Pradyumna Chari, and Achuta Kadambi. Cg3d: Compositional generation for text-to-3d via gaussian splatting. arXiv preprint arXiv:2311.17907, 2023

work page arXiv 2023
[79]

4real-video: Learning generalizable photo-realistic 4d video diffusion

Chaoyang Wang, Peiye Zhuang, Tuan Duc Ngo, Willi Menapace, Aliaksandr Siarohin, Michael Vasilkovsky, Ivan Skorokhodov, Sergey Tulyakov, Peter Wonka, and Hsin-Ying Lee. 4real-video: Learning generalizable photo-realistic 4d video diffusion. arXiv preprint arXiv:2412.04462, 2024 a

work page arXiv 2024
[80]

Vggt: Visual geometry grounded transformer, 2025 a

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer, 2025 a . URL https://arxiv.org/abs/2503.11651

work page arXiv 2025

Showing first 80 references.