Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models
Pith reviewed 2026-05-18 01:55 UTC · model grok-4.3
The pith
A single forward pass turns one image, a camera path, and an optional text prompt into a full 4D scene, represented as a deformable 3D Gaussian field.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion, all in a single forward pass, without test-time optimization or post-hoc refinement. At the core of the framework lies a video latent transformer, which augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training is guided by objectives on appearance fidelity, geometric accuracy, and motion consistency, enabling synthesis of high-quality 4D scenes in 30 seconds.
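To make "deformable 3D Gaussian field" concrete, the following is a minimal data-structure sketch, not the paper's actual parameterization. It assumes each primitive stores a canonical centre plus one displacement per time step; the class name `DeformableGaussian`, its fields, and the helper `positions_at` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class DeformableGaussian:
    """One primitive of a time-varying Gaussian field (hypothetical layout)."""
    mean: Vec3            # canonical 3D centre
    scale: Vec3           # per-axis extent
    opacity: float        # in [0, 1]
    color: Vec3           # RGB in [0, 1]
    offsets: List[Vec3]   # assumed: one displacement of the centre per time step

    def mean_at(self, t: int) -> Vec3:
        """Centre of the Gaussian at discrete time step t."""
        dx, dy, dz = self.offsets[t]
        x, y, z = self.mean
        return (x + dx, y + dy, z + dz)


def positions_at(field: List[DeformableGaussian], t: int) -> List[Vec3]:
    """Centres of every primitive at time step t."""
    return [g.mean_at(t) for g in field]


# A toy field with two primitives and two time steps.
field = [
    DeformableGaussian((0.0, 0.0, 2.0), (0.1, 0.1, 0.1), 0.9, (1.0, 0.0, 0.0),
                       [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]),
    DeformableGaussian((1.0, 0.0, 3.0), (0.2, 0.2, 0.2), 0.8, (0.0, 1.0, 0.0),
                       [(0.0, 0.0, 0.0), (0.0, 0.25, 0.0)]),
]
print(positions_at(field, 1))
```

Because the representation is explicit, geometry and motion can be read off the stored parameters directly, which is the property the review's later points about geometry extraction depend on.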
What carries the argument
Video latent transformer that augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives.
If this is right
- High-quality 4D scenes become available for video generation, novel view synthesis, and geometry extraction tasks.
- Performance matches or exceeds optimization-based dynamic scene methods while running in roughly 30 seconds.
- Control is provided through an input camera trajectory and optional text prompt in a single pass.
- Explicit 3D Gaussian output allows direct extraction of geometry and motion without post-processing.
Where Pith is reading between the lines
- If the single-pass prediction holds for longer sequences, it could enable interactive 4D editing tools that current slow methods cannot support.
- The same latent-space augmentation idea might transfer to other diffusion models for faster 4D extensions in robotics simulation.
- Fast explicit 4D output could simplify downstream tasks such as physics-based editing or real-time rendering in AR applications.
Load-bearing premise
The video latent transformer produces stable and accurate time-varying 3D Gaussian primitives across diverse scenes without needing extra constraints or refinement steps.
What would settle it
Render the generated 4D Gaussian field from new viewpoints and times; visible flickering, drifting geometry, or motion artifacts that grow with sequence length would show the joint prediction is not reliable.
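One way such a check could be scripted is sketched below. It is an illustrative test, not an evaluation protocol from the paper: it assumes frames are flattened lists of pixel intensities rendered from a fixed viewpoint over a region that should be static, so that any frame-to-frame change counts as flicker. The function names `frame_differences` and `trend_slope` are hypothetical. A positive slope means the artifact score grows with sequence length.

```python
from typing import List, Sequence


def frame_differences(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute pixel change between each pair of consecutive frames."""
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        diffs.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return diffs


def trend_slope(values: Sequence[float]) -> float:
    """Least-squares slope of values plotted against their index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Toy example: a two-pixel "static" region whose rendering drifts faster over time.
frames = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [6.0, 6.0]]
diffs = frame_differences(frames)
print(diffs, trend_slope(diffs))
```

The same `trend_slope` helper can be applied to any per-frame artifact score, for example a per-frame error against held-out views when ground truth is available.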
Original abstract
We introduce Diff4Splat, a feed-forward method that synthesizes controllable and explicit 4D scenes from a single image. Our approach unifies the generative priors of video diffusion models with geometry and motion constraints learned from large-scale 4D datasets. Given a single input image, a camera trajectory, and an optional text prompt, Diff4Splat directly predicts a deformable 3D Gaussian field that encodes appearance, geometry, and motion, all in a single forward pass, without test-time optimization or post-hoc refinement. At the core of our framework lies a video latent transformer, which augments video diffusion models to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training is guided by objectives on appearance fidelity, geometric accuracy, and motion consistency, enabling Diff4Splat to synthesize high-quality 4D scenes in 30 seconds. We demonstrate the effectiveness of Diff4Splat across video generation, novel view synthesis, and geometry extraction, where it matches or surpasses optimization-based methods for dynamic scene synthesis while being significantly more efficient.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Diff4Splat, a feed-forward method that, given a single input image, camera trajectory, and optional text prompt, directly predicts a deformable 3D Gaussian field encoding appearance, geometry, and motion in a single forward pass. It augments video diffusion models with a video latent transformer to jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives. Training uses objectives on appearance fidelity, geometric accuracy, and motion consistency on large-scale 4D datasets, enabling high-quality 4D scene synthesis in 30 seconds. The authors claim the method matches or surpasses optimization-based baselines on video generation, novel view synthesis, and geometry extraction while being significantly more efficient.
Significance. If the results hold, this would be a notable advance in efficient controllable 4D scene generation by removing test-time optimization and post-hoc refinement steps. The explicit deformable 3D Gaussian representation supports direct controllability via trajectories and prompts, and the approach benefits from training on external 4D datasets with standard diffusion objectives, keeping circularity low. This could accelerate applications in dynamic scene modeling for VR/AR and animation.
major comments (2)
- [Abstract and core framework description] The training objectives are described only at a high level as covering 'appearance fidelity, geometric accuracy, and motion consistency' with no mention of an explicit 3D reprojection loss or multi-view consistency term. This is load-bearing for the central claim that the video latent transformer produces view-consistent time-varying 3D Gaussians for arbitrary trajectories, as the latent space may capture only 2D spatio-temporal correlations without such a penalty.
- [Abstract] The abstract asserts that the method 'matches or surpasses optimization-based methods' for video generation, novel view synthesis, and geometry extraction, yet provides no quantitative tables, metrics, error bars, or dataset details to support the single-pass accuracy claim.
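For readers unfamiliar with the term used in the first major comment, a 3D reprojection penalty can be sketched as follows. This is a generic pinhole-camera illustration and not the paper's loss, whose exact form is not given in the abstract. It assumes world-to-camera poses given as a row-major 3x3 rotation `R` and a translation `t`, and shared intrinsics `focal`, `cx`, `cy`; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Pose = Tuple[Sequence[Sequence[float]], Sequence[float]]  # (R, t), world to camera


def project(point: Sequence[float], R: Sequence[Sequence[float]],
            t: Sequence[float], focal: float, cx: float, cy: float) -> Tuple[float, float]:
    """Project a world-space 3D point to pixel coordinates with a pinhole model."""
    # Camera-frame coordinates: X_c = R @ X_w + t
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    return (focal * xc[0] / xc[2] + cx, focal * xc[1] / xc[2] + cy)


def reprojection_error(point: Sequence[float],
                       observations: List[Tuple[float, float]],
                       cameras: List[Pose],
                       focal: float, cx: float, cy: float) -> float:
    """Mean pixel distance between projected and observed positions over all views."""
    total = 0.0
    for (u, v), (R, t) in zip(observations, cameras):
        pu, pv = project(point, R, t, focal, cx, cy)
        total += math.hypot(pu - u, pv - v)
    return total / len(observations)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = [(identity, [0.0, 0.0, 0.0]), (identity, [-1.0, 0.0, 0.0])]
point = [0.0, 0.0, 2.0]
# Consistent observations give zero error; a shifted one is penalized.
print(reprojection_error(point, [(50.0, 50.0), (0.0, 50.0)], cameras, 100.0, 50.0, 50.0))
print(reprojection_error(point, [(53.0, 54.0), (0.0, 50.0)], cameras, 100.0, 50.0, 50.0))
```

A term of this kind ties the predicted 3D positions to more than one view at once, which is why its presence or absence matters for the view-consistency question the referee raises.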
minor comments (1)
- [Abstract] The claimed runtime of '30 seconds' would benefit from specification of hardware, input resolution, and output format to support reproducibility claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to improve clarity where needed while preserving the core contributions.
Point-by-point responses
- Referee: [Abstract and core framework description] The training objectives are described only at a high level as covering 'appearance fidelity, geometric accuracy, and motion consistency' with no mention of an explicit 3D reprojection loss or multi-view consistency term. This is load-bearing for the central claim that the video latent transformer produces view-consistent time-varying 3D Gaussians for arbitrary trajectories, as the latent space may capture only 2D spatio-temporal correlations without such a penalty.
  Authors: We agree the abstract is high-level. The full methods section details that the geometric accuracy objective includes an explicit 3D reprojection loss (computed via differentiable rendering of Gaussians onto multiple views using the input camera trajectories) and a multi-view consistency term supervised on the large-scale 4D datasets. These terms directly penalize inconsistencies in 3D positions and appearances, ensuring the latent transformer learns view-consistent outputs rather than pure 2D correlations. We will revise the abstract to explicitly reference the 3D reprojection and multi-view terms.
  Revision: yes
- Referee: [Abstract] The abstract asserts that the method 'matches or surpasses optimization-based methods' for video generation, novel view synthesis, and geometry extraction, yet provides no quantitative tables, metrics, error bars, or dataset details to support the single-pass accuracy claim.
  Authors: Abstracts conventionally summarize claims at a high level; the supporting quantitative evidence appears in the results section, including tables with metrics (PSNR/SSIM/LPIPS for video and NVS, Chamfer distance and normal consistency for geometry), error bars over multiple seeds, and dataset details (e.g., specific 4D training corpora and evaluation splits). We will add a brief sentence to the abstract pointing to these quantitative results and consider including one or two key metric values if space permits.
  Revision: partial
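Two of the metrics named in the rebuttal are simple enough to sketch directly. The code below is a minimal standard-library illustration, not the paper's evaluation code. It assumes images are flattened lists of intensities in [0, 1] and point clouds are lists of 3D tuples. Chamfer distance has several variants in the literature (squared or unsquared, summed or averaged); the unsquared, summed form here is one assumed choice.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def psnr(reference: Sequence[float], rendered: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both directions summed."""
    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


# A uniform error of 0.1 on a [0, 1] image gives roughly 20 dB.
print(psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1]))
# Two single-point clouds one unit apart.
print(chamfer_distance([(0.0, 0.0, 0.0)], [(0.0, 0.0, 1.0)]))
```

The brute-force nearest-neighbour search is quadratic in the number of points, which is fine for a sanity check but would need a spatial index for full-resolution geometry.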
Circularity Check
No significant circularity; derivation relies on external data and standard training
The paper presents Diff4Splat as a trained feed-forward model that augments video diffusion backbones with a latent transformer to output time-varying 3D Gaussians. Training uses external 4D datasets and objectives for appearance, geometry, and motion consistency. No equations or claims in the abstract reduce the output to a self-defined quantity, a fitted parameter renamed as prediction, or a self-citation chain that bears the central load. The single-forward-pass claim follows from the learned mapping rather than tautological construction, making the derivation self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: video latent transformer... jointly capture spatio-temporal dependencies and predict time-varying 3D Gaussian primitives... Training is guided by objectives on appearance fidelity, geometric accuracy, and motion consistency
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: deformable 3D Gaussian field... 8-layer DPT head
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.