PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation
Pith reviewed 2026-05-21 13:22 UTC · model grok-4.3
The pith
PerpetualWonder creates a closed-loop system that links physical states to visual primitives for consistent long-horizon 4D scene generation from a single image.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PerpetualWonder is the first true closed-loop system for this task. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity.
What carries the argument
The novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to update both dynamics and appearance.
If this is right
- The system can simulate complex multi-step interactions from long-horizon actions while keeping physical plausibility.
- Generative refinements can correct both dynamics and appearance in a unified way.
- Multi-viewpoint supervision resolves optimization ambiguity during scene updates.
- Visual consistency is maintained across extended action sequences starting from one image.
Where Pith is reading between the lines
- The closed-loop design could support interactive applications where user actions continuously update both physics and visuals in real time.
- Similar bidirectional links might improve other generative models that currently treat simulation and rendering as separate stages.
- The method opens a path to test whether single-image inputs can drive plausible forecasts in more varied real-world settings beyond controlled experiments.
Load-bearing premise
A bidirectional link between physical state and visual primitives can be maintained consistently over long horizons without introducing new inconsistencies, and multi-viewpoint supervision will reliably resolve optimization ambiguity.
What would settle it
Running the system on a sequence of actions and checking whether objects show physically impossible motions or visual inconsistencies that contradict the initial image after several steps would test whether the closed-loop corrections hold.
Figures
read the original abstract
We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements to update the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions from long-horizon actions, maintaining physical plausibility and visual consistency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PerpetualWonder, a hybrid generative simulator for long-horizon action-conditioned 4D scene generation from a single image. It argues that prior methods fail because physical state is decoupled from visual representation, preventing generative refinements from updating physics. PerpetualWonder addresses this via a novel unified representation creating a bidirectional link between physical state and visual primitives (allowing corrections to both dynamics and appearance) plus a robust update mechanism that gathers multi-viewpoint supervision to resolve optimization ambiguity. Experiments are reported to show successful simulation of complex multi-step interactions while maintaining physical plausibility and visual consistency.
Significance. If the bidirectional unified representation and multi-view update mechanism operate without introducing new inconsistencies, the work would advance 4D generative simulation by enabling closed-loop consistency over long horizons from minimal input. This directly targets a recognized limitation in decoupled physics-visual pipelines common in current video and scene generation literature. The single-image starting point and action-conditioning further increase potential utility for downstream tasks such as robotics planning and interactive content creation, provided the claims receive rigorous empirical support.
major comments (1)
- [§4.2] §4.2 (Update Mechanism): The claim that multi-viewpoint supervision resolves optimization ambiguity is load-bearing for the closed-loop contribution, yet the manuscript does not demonstrate that the synthesized viewpoints are independent of the current ambiguous physical-state estimate. Because all additional views are generated from the single input image via the same model, the supervision signal risks circular reinforcement of drift rather than correction; an ablation isolating the independence of the multi-view signal or a quantitative measure of ambiguity reduction over horizon length is required.
minor comments (2)
- [Abstract / §3] The abstract and early sections would benefit from a concise equation or diagram defining the unified representation and the bidirectional link, as the current prose description leaves the precise interface between physical state and visual primitives underspecified.
- [§5] Quantitative tables or plots reporting physical-consistency metrics (e.g., collision violation rates, trajectory error) across increasing horizon lengths are referenced but not detailed enough to evaluate the long-horizon claim; adding error bars and baseline comparisons would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The concern regarding the independence of the multi-view supervision signal is well-taken and directly relevant to the core closed-loop claim. We address this point below and have revised the manuscript accordingly.
read point-by-point responses
-
Referee: [§4.2] §4.2 (Update Mechanism): The claim that multi-viewpoint supervision resolves optimization ambiguity is load-bearing for the closed-loop contribution, yet the manuscript does not demonstrate that the synthesized viewpoints are independent of the current ambiguous physical-state estimate. Because all additional views are generated from the single input image via the same model, the supervision signal risks circular reinforcement of drift rather than correction; an ablation isolating the independence of the multi-view signal or a quantitative measure of ambiguity reduction over horizon length is required.
Authors: We agree that the manuscript would benefit from a more explicit demonstration that the multi-view supervision is not merely reinforcing the initial estimate. The update mechanism generates additional views by rendering the current unified representation (which encodes both visual primitives and physical state) under novel camera poses, then optimizes the physical parameters to minimize inconsistency across all views. Because the physical state is a latent variable that is updated jointly, the cross-view constraints provide an independent signal that can correct drift even when the views are synthesized from the same model. Nevertheless, we acknowledge that the original submission did not include a dedicated ablation or quantitative tracking of ambiguity reduction. In the revised manuscript we have expanded §4.2 with (i) a step-by-step derivation showing how the bidirectional link decouples the physical update from the initial visual estimate and (ii) a new plot that reports the variance of the estimated physical state before and after each multi-view update across increasing horizon lengths. revision: yes
Circularity Check
No circularity: claims presented as novel system without equations or self-referential reductions
full rationale
The provided abstract and description introduce PerpetualWonder via a unified representation creating a bidirectional link and a multi-view update mechanism, but contain no mathematical equations, derivations, fitted parameters, or self-citations. Without visible formulas or load-bearing citations that reduce to inputs by construction, the central claims cannot be shown to collapse into self-definition or fitted predictions. The system is described as a new closed-loop proposal initialized from a single image, with no evidence of circular steps in the given text.
Axiom & Free-Parameter Ledger
invented entities (1)
-
unified representation
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Tc4d: Trajectory-conditioned text-to-4d generation
Sherwin Bahmani, Xian Liu, Wang Yifan, Ivan Sko- rokhodov, Victor Rong, Ziwei Liu, Xihui Liu, Jeong Joon Park, Sergey Tulyakov, Gordon Wetzstein, et al. Tc4d: Trajectory-conditioned text-to-4d generation. InEuropean Conference on Computer Vision, pages 53–72. Springer,
-
[2]
4d-fy: Text-to-4d generation using hybrid score dis- tillation sampling
Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B Lin- dell. 4d-fy: Text-to-4d generation using hybrid score dis- tillation sampling. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 7996–8006, 2024. 2
work page 2024
-
[3]
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video, March 2025
Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lian- rui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video.arXiv preprint arXiv:2503.11647, 2025. 3
-
[4]
Video generation models as world simulators
Tim Brooks, Bill Peebles, et al. Video generation models as world simulators. OpenAI Technical Report, 2024. 1, 2, 3
work page 2024
-
[5]
Go-with-the-flow: Motion- controllable video diffusion models using real-time warped noise
Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingx- iao Li, Mohsen Mousavi, et al. Go-with-the-flow: Motion- controllable video diffusion models using real-time warped noise. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 13–23, 2025. 5, 6
work page 2025
-
[6]
Physgen3d: Crafting a miniature interactive world from a single image
Boyuan Chen, Hanxiao Jiang, Shaowei Liu, Saurabh Gupta, Yunzhu Li, Hao Zhao, and Shenlong Wang. Physgen3d: Crafting a miniature interactive world from a single image. InProceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 6178–6189, 2025. 1, 3
work page 2025
-
[7]
Motion- Conditioned Diffusion Model for Controllable Video Synthesis,
Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung- Yi Lin, and Ming-Hsuan Yang. Motion-conditioned diffu- sion model for controllable video synthesis.arXiv preprint arXiv:2304.14404, 2023. 3
-
[8]
Worldscore: A unified evaluation benchmark for world generation
Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Ji- ajun Wu. Worldscore: A unified evaluation benchmark for world generation. InICCV, 2025. 6, 7
work page 2025
-
[9]
Chen Geng, Yunzhi Zhang, Shangzhe Wu, and Jiajun Wu. Birth and death of a rose. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26102– 26113, 2025. 2
work page 2025
-
[10]
Motion prompting: Controlling video generation with motion trajec- tories
Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Yusuf Aytar, Michael Rubinstein, Chen Sun, et al. Motion prompting: Controlling video generation with motion trajec- tories. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1–12, 2025. 3
work page 2025
-
[11]
Nate Gillman, Charles Herrmann, Michael Freeman, Daksh Aggarwal, Evan Luo, Deqing Sun, and Chen Sun. Force prompting: Video generation models can learn and gen- eralize physics-based control signals.arXiv preprint arXiv:2505.19386, 2025. 3
-
[12]
Diffusion as shader: 3d-aware video diffusion for ver- satile video generation control
Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, et al. Diffusion as shader: 3d-aware video diffusion for ver- satile video generation control. InProceedings of the Special Interest Group on Computer Graphics and Interactive Tech- niques Conference Conference Papers, pages 1–12, 2025. 6
work page 2025
-
[13]
Haoyang Huang, Guoqing Ma, Nan Duan, Xing Chen, Changyi Wan, Ranchen Ming, Tianyu Wang, Bo Wang, Zhiying Lu, Aojie Li, et al. Step-video-ti2v technical re- port: A state-of-the-art text-driven image-to-video genera- tion model.arXiv preprint arXiv:2503.11251, 2025. 3
-
[14]
Dreamphysics: Learning physics-based 3d dynamics with video diffusion priors
Tianyu Huang, Haoze Zhang, Yihan Zeng, Zhilu Zhang, Hui Li, Wangmeng Zuo, and Rynson WH Lau. Dreamphysics: Learning physics-based 3d dynamics with video diffusion priors. InProceedings of the AAAI Conference on Artificial Intelligence, pages 3733–3741, 2025. 1, 3
work page 2025
-
[15]
Tianyu Huang, Wangguandong Zheng, Tengfei Wang, Yuhao Liu, Zhenwei Wang, Junta Wu, Jie Jiang, Hui Li, Rynson WH Lau, Wangmeng Zuo, and Chunchao Guo. V oyager: Long-range and world-consistent video diffu- sion for explorable 3d scene generation.arXiv preprint arXiv:2506.04225, 2025. 1
-
[16]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. S1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
The material point method for simulating continuum materials.Acm siggraph 2016 courses, 2016
Chenfanfu Jiang, Craig Schroeder, Joseph Teran, Alexey Stomakhin, and Andrew Selle. The material point method for simulating continuum materials.Acm siggraph 2016 courses, 2016. 5
work page 2016
-
[18]
3d gaussian splatting for real-time radiance field rendering.ACM Trans
Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1,
-
[19]
Genesis: a generative approach to substitutes in context
Caterina Lacerra, Rocco Tripodi, Roberto Navigli, et al. Genesis: a generative approach to substitutes in context. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. ACL, 2021. 5, S1
work page 2021
-
[20]
Long Le, Ryan Lucas, Chen Wang, Chuhao Chen, Dinesh Ja- yaraman, Eric Eaton, and Lingjie Liu. Pixie: Fast and gener- 9 alizable supervised learning of 3d physics from pixels.arXiv preprint arXiv:2508.17437, 2025. 2
-
[21]
Wonderplay: Dy- namic 3d scene generation from a single image and actions
Zizhang Li, Hong-Xing Yu, Wei Liu, Yin Yang, Charles Her- rmann, Gordon Wetzstein, and Jiajun Wu. Wonderplay: Dy- namic 3d scene generation from a single image and actions
-
[22]
2, 3, 4, 5, 6, 7, S1
-
[23]
Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models
Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fi- dler, and Karsten Kreis. Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 8576–8588, 2024. 2
work page 2024
-
[24]
Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 2
work page 2021
-
[25]
Matthias M ¨uller, Bruno Heidelberger, Matthias Teschner, and Markus Gross. Meshless deformations based on shape matching.ACM transactions on graphics (TOG), 24(3):471– 478, 2005. 5
work page 2005
-
[26]
Position based dynamics.Journal of Visual Communication and Image Representation, 18(2):109–118,
Matthias M ¨uller, Bruno Heidelberger, Marcus Hennix, and John Ratcliff. Position based dynamics.Journal of Visual Communication and Image Representation, 18(2):109–118,
-
[27]
Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, and Yinqiang Zheng. Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model. InEuropean Con- ference on Computer Vision, pages 111–128. Springer, 2024. 3
work page 2024
-
[28]
Nerfies: Deformable neural radiance fields
Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. InProceedings of the IEEE/CVF international conference on computer vision, pages 5865–5874, 2021. 2
work page 2021
-
[29]
Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields,
Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin- Brualla, and Steven M Seitz. Hypernerf: A higher- dimensional representation for topologically varying neural radiance fields.arXiv preprint arXiv:2106.13228, 2021
-
[30]
D-nerf: Neural radiance fields for dynamic scenes
Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 10318–10327, 2021. 2
work page 2021
-
[31]
Splatsim: Zero- shot sim2real transfer of rgb manipulation policies using gaussian splatting
M Nomaan Qureshi, Sparsh Garg, Francisco Yandun, David Held, George Kantor, and Abhisesh Silwal. Splatsim: Zero- shot sim2real transfer of rgb manipulation policies using gaussian splatting. In2025 IEEE International Confer- ence on Robotics and Automation (ICRA), pages 6502–6509. IEEE, 2025. 1
work page 2025
-
[32]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 4, S1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
Dreamgaussian4d: Genera- tive 4d gaussian splatting
Jiawei Ren, Liang Pan, Jiaxiang Tang, Chi Zhang, Ang Cao, Gang Zeng, and Ziwei Liu. Dreamgaussian4d: Genera- tive 4d gaussian splatting.arXiv preprint arXiv:2312.17142,
-
[34]
Jiawei Ren, Cheng Xie, Ashkan Mirzaei, Karsten Kreis, Zi- wei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim, Huan Ling, et al. L4gm: Large 4d gaussian reconstruction model.Advances in Neural Information Processing Systems, 37:56828–56858, 2024. 2
work page 2024
-
[35]
Gen3c: 3d-informed world-consistent video generation with precise camera con- trol
Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas M ¨uller, Alexan- der Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera con- trol. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, 2025. 3, 4, 5, 6, 7, S1
work page 2025
-
[36]
Structure-from-motion revisited
Johannes Lutz Sch ¨onberger and Jan-Michael Frahm. Structure-from-motion revisited. InConference on Com- puter Vision and Pattern Recognition (CVPR), 2016. 4
work page 2016
-
[37]
Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling
Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, et al. Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. InACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024. 3
work page 2024
-
[38]
Text-to-4d dy- namic scene generation
Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4d dy- namic scene generation.arXiv preprint arXiv:2301.11280,
-
[39]
Physmotion: Physics- grounded dynamics from a single image.arXiv preprint arXiv:2411.17189, 2024
Xiyang Tan, Ying Jiang, Xuan Li, Zeshun Zong, Tianyi Xie, Yin Yang, and Chenfanfu Jiang. Physmotion: Physics- grounded dynamics from a single image.arXiv preprint arXiv:2411.17189, 2024. 2, 3
-
[40]
HunyuanWorld Team, Zhenwei Wang, Yuhao Liu, Junta Wu, Zixiao Gu, Haoyuan Wang, Xuhui Zuo, Tianyu Huang, Wen- huan Li, Sheng Zhang, et al. Hunyuanworld 1.0: Generating immersive, explorable, and interactive 3d worlds from words or pixels.arXiv preprint arXiv:2507.21809, 2025. 1
-
[41]
Wan: Open and Advanced Large-Scale Video Generative Models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video gen- erative models.arXiv preprint arXiv:2503.20314, 2025. 3, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
Cinemaster: A 3d-aware and controllable frame- work for cinematic text-to-video generation
Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, and Kun Gai. Cinemaster: A 3d-aware and controllable frame- work for cinematic text-to-video generation. InProceedings of the Special Interest Group on Computer Graphics and In- teractive Techniques Conference Conference Papers, pages 1–10, 2025. 3
work page 2025
-
[43]
Freetimegs: Free gaussian primitives at anytime any- where for dynamic scene reconstruction
Yifan Wang, Peishan Yang, Zhen Xu, Jiaming Sun, Zhan- hua Zhang, Yong Chen, Hujun Bao, Sida Peng, and Xiaowei Zhou. Freetimegs: Free gaussian primitives at anytime any- where for dynamic scene reconstruction. InCVPR, 2025. 4
work page 2025
-
[44]
Controlling space and time with diffusion models
Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasac- chi, and David J Fleet. Controlling space and time with dif- fusion models.arXiv preprint arXiv:2407.07860, 2024. 2
-
[45]
Video models are zero-shot learners and reasoners
Thadd ¨aus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank 10 Jaini, and Robert Geirhos. Video models are zero-shot learn- ers and reasoners.arXiv preprint arXiv:2509.20328, 2025. 2, 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
4d gaussian splatting for real-time dynamic scene render- ing
Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene render- ing. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 20310– 20320, 2024. 2
work page 2024
-
[47]
Barron, and Aleksander Holynski
Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T. Barron, and Aleksander Holynski. CAT4D: Create Anything in 4D with Multi-View Video Dif- fusion Models.arXiv:2411.18613, 2024. 2
-
[48]
Draganything: Motion control for any- thing using entity representation
Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. Draganything: Motion control for any- thing using entity representation. InEuropean Conference on Computer Vision, pages 331–348. Springer, 2024. 3
work page 2024
-
[49]
Physgaussian: Physics- integrated 3d gaussians for generative dynamics
Tianyi Xie, Zeshun Zong, Yuxing Qiu, Xuan Li, Yutao Feng, Yin Yang, and Chenfanfu Jiang. Physgaussian: Physics- integrated 3d gaussians for generative dynamics. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4389–4398, 2024. 1, 3
work page 2024
-
[50]
4d gaussian splatting: Modeling dynamic scenes with native 4d primitives.arXiv preprint, 2024
Zeyu Yang, Zijie Pan, Xiatian Zhu, Li Zhang, Jianfeng Feng, Yu-Gang Jiang, and Philip HS Torr. 4d gaussian splatting: Modeling dynamic scenes with native 4d primitives.arXiv preprint, 2024. 2
work page 2024
-
[51]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. 3, 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[52]
Real- time photorealistic dynamic scene representation and render- ing with 4d gaussian splatting
Zeyu Yang, Hongye Yang, Zijie Pan, and Li Zhang. Real- time photorealistic dynamic scene representation and render- ing with 4d gaussian splatting. InInternational Conference on Learning Representations (ICLR), 2024. 2
work page 2024
-
[53]
Gaussian grouping: Segment and edit anything in 3d scenes
Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaussian grouping: Segment and edit anything in 3d scenes. InECCV, 2024. 4, S1
work page 2024
-
[54]
Wonderjourney: Going from anywhere to everywhere
Hong-Xing Yu, Haoyi Duan, Junhwa Hur, Kyle Sargent, Michael Rubinstein, William T Freeman, Forrester Cole, De- qing Sun, Noah Snavely, Jiajun Wu, et al. Wonderjourney: Going from anywhere to everywhere. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6658–6667, 2024. 5
work page 2024
-
[55]
Wonderworld: Interactive 3d scene generation from a single image
Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T Freeman, and Jiajun Wu. Wonderworld: Interactive 3d scene generation from a single image. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5916–5926, 2025. 5
work page 2025
-
[56]
3dmatch: Learning local geometric descriptors from rgb-d reconstruc- tions
Andy Zeng, Shuran Song, Matthias Nießner, Matthew Fisher, Jianxiong Xiao, and Thomas Funkhouser. 3dmatch: Learning local geometric descriptors from rgb-d reconstruc- tions. InCVPR, 2017. 4, S1
work page 2017
-
[57]
Stag4d: Spatial-temporal anchored generative 4d gaussians
Yifei Zeng, Yanqin Jiang, Siyu Zhu, Yuanxun Lu, Youtian Lin, Hao Zhu, Weiming Hu, Xun Cao, and Yao Yao. Stag4d: Spatial-temporal anchored generative 4d gaussians. InEu- ropean Conference on Computer Vision, pages 163–179. Springer, 2024. 2
work page 2024
-
[58]
Gaussian vari- ation field diffusion for high-fidelity video-to-4d synthesis
Bowen Zhang, Sicheng Xu, Chuxin Wang, Jiaolong Yang, Feng Zhao, Dong Chen, and Baining Guo. Gaussian vari- ation field diffusion for high-fidelity video-to-4d synthesis. arXiv preprint arXiv:2507.23785, 2025. 2
-
[59]
Feng, Changxi Zheng, Noah Snavely, Jiajun Wu, and William T
Tianyuan Zhang, Hong-Xing Yu, Rundi Wu, Brandon Y . Feng, Changxi Zheng, Noah Snavely, Jiajun Wu, and William T. Freeman. PhysDreamer: Physics-based interac- tion with 3d objects via video generation. InEuropean Con- ference on Computer Vision. Springer, 2024. 1, 3
work page 2024
-
[60]
Tora: Trajectory-oriented diffusion transformer for video genera- tion
Zhenghao Zhang, Junchao Liao, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Tora: Trajectory-oriented diffusion transformer for video genera- tion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2063–2073, 2025. 3, 6
work page 2063
-
[61]
Genxd: Generating any 3d and 4d scenes
Yuyang Zhao, Chung-Ching Lin, Kevin Lin, Zhiwen Yan, Linjie Li, Zhengyuan Yang, Jianfeng Wang, Gim Hee Lee, and Lijuan Wang. Genxd: Generating any 3d and 4d scenes. arXiv preprint arXiv:2411.02319, 2024. 2
-
[62]
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, et al. Hunyuan3d 2.0: Scaling diffu- sion models for high resolution textured 3d assets generation. arXiv preprint arXiv:2501.12202, 2025. S1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[63]
Open-sora: Democratizing efficient video production for all, 2024.URL https://github
Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all, 2024.URL https://github. com/hpcaitech/Open-Sora,
work page 2024
-
[64]
Jensen Jinghao Zhou, Hang Gao, Vikram V oleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Gen- erative view synthesis with diffusion models.arXiv preprint arXiv:2503.14489, 2025. 3 11 PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation Supplementary Material A. D...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.