3DPhysVideo: Consistency-Guided Flow SDE for Video Generation via 3D Scene Reconstruction and Physical Simulation

Hwidong Kim; Tae-Kyun Kim; Yunho Kim

arxiv: 2605.16795 · v1 · pith:6MYRC43Nnew · submitted 2026-05-16 · 💻 cs.CV · cs.AI · cs.GR


Pith reviewed 2026-05-19 21:22 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.GR

keywords: video generation, 3D scene reconstruction, physical simulation, image-to-video, flow models, consistency guidance, training-free, point cloud guidance


The pith

A training-free pipeline turns a single image into a physically realistic video by reconstructing 3D scenes and guiding generation with physics simulations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tries to establish that an existing image-to-video flow model can be reused without any retraining to first build complete 3D scene geometry from one photo and then produce final videos whose motion follows physical laws. The method works by rendering point clouds to guide the model for novel views during reconstruction, running physics solvers on that geometry, and then feeding the simulated point clouds back into the same model for video synthesis. A special decomposition called Consistency-Guided Flow SDE separates the model's velocity prediction into a denoising part and a consistency bias term so that outputs stay faithful to the guiding conditions. A sympathetic reader would care because current video generators frequently produce motion that ignores gravity, collisions, or fluid behavior, and a method that fixes this from minimal input while running on ordinary hardware could make realistic dynamic scene creation far more accessible.
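The three stages described above can be pictured as a small skeleton. This is a minimal sketch only: the `Pipeline` class, its field names, and the `free_fall` integrator are hypothetical stand-ins introduced for illustration, and the toy gravity integrator is not the physics solver used in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]
PointCloud = List[Point]


@dataclass
class Pipeline:
    """Hypothetical three-stage skeleton; every stage is an injected callable."""
    reconstruct: Callable[[object], PointCloud]               # stage 1: image -> 3D points
    simulate: Callable[[PointCloud, int], List[PointCloud]]   # stage 2: points -> per-frame points
    synthesize: Callable[[object, List[PointCloud]], list]    # stage 3: guided video synthesis

    def run(self, image, num_frames):
        geometry = self.reconstruct(image)
        trajectory = self.simulate(geometry, num_frames)
        return self.synthesize(image, trajectory)


def free_fall(points, num_frames, dt=1 / 24, g=9.81, floor=0.0):
    """Toy stand-in for a physics solver: points fall along -y and rest on the floor."""
    velocities = [0.0] * len(points)
    current = list(points)
    frames = []
    for _ in range(num_frames):
        frames.append(list(current))
        next_points = []
        for i, (x, y, z) in enumerate(current):
            velocities[i] -= g * dt           # update velocity first (symplectic Euler)
            y_new = y + velocities[i] * dt    # then position
            if y_new <= floor:                # inelastic contact with the floor
                y_new, velocities[i] = floor, 0.0
            next_points.append((x, y_new, z))
        current = next_points
    return frames
```

In this sketch the same video model would sit behind both `reconstruct` and `synthesize`, which mirrors the dual use the summary describes; only `simulate` is independent of the generative model.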

Core claim

The central claim is that an off-the-shelf image-to-video flow model can be repurposed for both 3D scene reconstruction via rendered point cloud guidance and for physically simulated point cloud-guided video synthesis without fine-tuning, with Consistency-Guided Flow SDE enforcing consistency by decomposing the predicted velocity into denoising and consistency bias components.

What carries the argument

Consistency-Guided Flow SDE, which decomposes the predicted velocity of the I2V flow model into denoising and consistency bias to enforce consistency with conditional inputs such as rendered or simulated point clouds.
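The paper's exact equations are not reproduced on this page, so the following is a purely illustrative sketch of what an additive velocity split could look like. It assumes the rectified-flow convention z_t = (1 - t) z_0 + t ε, under which the velocity that carries the current latent to a clean reference latent is (z_t - z_ref) / t. The function names, the `guidance` weight, and the choice to treat the remainder as the denoising part are all assumptions, not the authors' formulation.

```python
import math
import random


def split_velocity(v_theta, z_t, z_ref, t):
    """Split a predicted velocity into two additive parts (illustrative only).

    Assumes z_t = (1 - t) * z_0 + t * eps, so the velocity that reaches a clean
    latent z_ref from z_t is (z_t - z_ref) / t. Plain lists of floats stand in
    for latent tensors.
    """
    v_c = [(zt - zr) / t for zt, zr in zip(z_t, z_ref)]  # hypothetical consistency part
    v_eps = [v - c for v, c in zip(v_theta, v_c)]         # remainder of the prediction
    return v_c, v_eps


def guided_sde_step(z_t, v_theta, z_ref, t, dt, guidance=1.0, sigma=0.0, rng=random):
    """One Euler-Maruyama step from time t to t - dt with a reweighted drift."""
    v_c, v_eps = split_velocity(v_theta, z_t, z_ref, t)
    scale = sigma * math.sqrt(dt)  # noise scale; zero gives a deterministic ODE step
    return [
        zt - dt * (guidance * c + e) + scale * rng.gauss(0.0, 1.0)
        for zt, c, e in zip(z_t, v_c, v_eps)
    ]
```

With `guidance=1.0` and `sigma=0.0` the step reduces to an ordinary Euler step along the predicted velocity, so the split changes nothing until the two parts are weighted differently or noise is injected.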

If this is right

The approach generates videos that respect physical dynamics in scenes involving multiple objects and fluid interactions.
The pipeline runs efficiently on a single consumer GPU.
Generated videos score higher than state-of-the-art baselines on GPT-based metrics, the VideoPhy benchmark, and human evaluations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same guidance technique might be tested on extending short clips into longer sequences by repeatedly applying physics steps.
Similar repurposing of flow models could be explored for other conditional generation tasks that require geometric or physical fidelity.
Integrating more detailed physics engines could address edge cases like deformable objects or complex lighting changes during motion.

Load-bearing premise

The off-the-shelf image-to-video flow model can be effectively repurposed for both 3D scene reconstruction via point cloud guidance and final video synthesis via physically simulated point cloud guidance without any fine-tuning or additional training.

What would settle it

A direct test would check whether output videos of fluid-object interactions or multi-body collisions exhibit clear physical violations, such as incorrect fluid flow trajectories or interpenetrating objects, when compared against ground-truth physics simulation.
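One way such a test could be scored is sketched below. It assumes that point positions can be tracked in the generated video and matched one-to-one with the simulator's points, and that objects can be approximated by spheres for the overlap check; both helpers and their names are hypothetical, not part of the paper's evaluation.

```python
import math
from itertools import combinations


def mean_trajectory_error(simulated, tracked):
    """Mean Euclidean distance between matched points over all frames.

    Both arguments are lists of frames; each frame lists the positions of the
    same points in the same order.
    """
    total, count = 0.0, 0
    for sim_frame, trk_frame in zip(simulated, tracked):
        for p, q in zip(sim_frame, trk_frame):
            total += math.dist(p, q)
            count += 1
    if count == 0:
        raise ValueError("no points to compare")
    return total / count


def count_interpenetrations(centers, radii, tolerance=0.0):
    """Count sphere pairs whose centres are closer than the sum of their radii."""
    return sum(
        1
        for i, j in combinations(range(len(centers)), 2)
        if math.dist(centers[i], centers[j]) < radii[i] + radii[j] - tolerance
    )
```

A low trajectory error together with zero interpenetrations would support the claim; a video that looks coherent but scores poorly on either would be the kind of physical violation the test is meant to expose.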

Figures

Figures reproduced from arXiv: 2605.16795 by Hwidong Kim, Tae-Kyun Kim, Yunho Kim.

**Figure 1.** We propose 3DPhysVideo, a training-free framework for 3D physics-conditioned video generation, leveraging an off-the-shelf video model. From an input scene (center), our method enables users to apply diverse physical controls to a variety of materials. See Sec. A of the supplementary for details on the simulation setup.

**Figure 2.** Overall Pipeline. Starting from a single image, 3DPhysVideo reconstructs 3D geometry of multiple objects via 360-degree orbit video synthesis, then applies 3D point-based physics simulation to produce photorealistic videos from the simulation results.

**Figure 3.** Consistency-Guided Flow SDE $\Phi_{CF}$. Given an input video latent $z$, we initialize a latent $z_\tau$ at a diffusion step $t = \tau$. Since the standard generation process (left) follows the full velocity $v_\theta$, which includes the consistency bias $v_c$ and the denoising bias $v_\epsilon$, along the flow ODE-path, the brief exposure to $v_c$ leaves the result not aligned with the input image in texture, semantics, or other attributes. …

**Figure 4.** Qualitative comparison across five representative scenarios. From left to right: (i) several …

**Figure 5.** Qualitative results on complex, non-local physical phenomena. Left: high-speed ball impact …

**Figure 6.** Qualitative comparison on Simulation to Video. Tab. 1 summarizes the quantitative results. Our method achieves the highest overall scores in physical realism and semantic consistency. Meanwhile, our photorealism score remains competitive with state-of-the-art methods, demonstrating that improved physical plausibility does not compromise visual quality.

**Figure 7.** Qualitative results under different physical properties. Physical dynamics …

**Figure 8.** Ablation on the number of …

**Figure 9.** Qualitative results of Consistency-Guided Flow SDE given coarse motion priors. The juice …

**Figure 10.** Qualitative results of our SDE with text alignment bias. Left: For a detailed text prompt, …

**Figure 11.** Qualitative result on the Translate Up trajectory.

**Figure 12.** Qualitative result on the Zoom Out trajectory.

**Figure 13.** Qualitative result on the Translate Down trajectory. Panel labels: Go with the Flow; Trajectory Attention; ReCamMaster; Diffusion as Shader; 3DPhysVideo (Ours). Camera trajectory (360 orbit): 0°, 72°, 144°, 216°, 288°.

**Figure 14.** Qualitative results on a static scene under a 360° orbital camera trajectory.

**Figure 15.** Qualitative results on a static scene under a 360° orbital camera trajectory.

**Figure 16.** Comparison of generated videos across various …

**Figure 17.** Latent norm at each SDE optimiza…

**Figure 18.** Comparison of generated videos across various target timesteps

**Figure 19.** Failure case. An apple with an excessively low Young’s modulus ( …

**Figure 20.** Qualitative comparison with state-of-the-art simulation to video models on the book …

**Figure 21.** Qualitative comparison with state-of-the-art simulation to video models on the Snorlax …

**Figure 22.** Qualitative comparison with state-of-the-art simulation to video models on the ball …

**Figure 23.** Qualitative comparison with state-of-the-art simulation to video models on the can-teddy …

**Figure 24.** Qualitative comparison on the block falling scene.

**Figure 25.** Qualitative comparison on the synthesized can-doll collision scene.

**Figure 26.** Qualitative comparison on the ball collision scene.

**Figure 27.** Qualitative comparison on the Snorlax deflating scene.

Original abstract

Video generative models have made remarkable progress, yet they often yield visual artifacts that violate grounding in physical dynamics. Recent works such as PhysGen3D tackle single image-to-3D physics through mesh reconstruction and Physically-Based Rendering, but challenges remain in modeling fluid dynamics, multi-object interactions and photorealism. This work introduces 3DPhysVideo, a novel training-free pipeline that generates physically realistic videos from a single image. We repurpose an off-the-shelf video model for two stages. First, we use it as a novel view synthesizer to reconstruct complete 360-degree 3D scene geometry by guiding the image-to-video (I2V) flow model with rendered point clouds. Second, after applying physics solvers to this geometry, the physically simulated point cloud is used to guide the same I2V flow model to synthesize final, high-quality videos. Consistency-Guided Flow SDE, which decomposes the predicted velocity of the I2V flow model into denoising and consistency bias, enforces consistency to the conditional inputs, allowing us to effectively repurpose the model for both 3D reconstruction and simulation-guided video generation. In the diverse experiments including multi-objects, and fluid interaction scenes, our method successfully bridges the gap from single-images to physically plausible videos, while remaining efficient to run on a single consumer GPU. It outperforms state-of-the-art baselines on GPT-based scores, VideoPhy benchmark and human evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main contribution is a dual repurposing of a pretrained I2V flow model via Consistency-Guided Flow SDE for 3D reconstruction followed by physics simulation, though the effectiveness of the guidance for enforcing physical laws without tuning remains the key open question.


This paper gives a practical training-free method for making single-image video generation respect physical dynamics better. It reconstructs 3D geometry by guiding an off-the-shelf image-to-video flow model with point clouds, runs a physics solver on that geometry, and then guides the same model again with the simulated point clouds to produce the final video. The new part is the Consistency-Guided Flow SDE, which breaks the predicted velocity into a denoising component and a consistency bias. That bias is what lets them repurpose the model for both the reconstruction step and the physics-guided synthesis without fine-tuning. It builds on prior work like PhysGen3D but adds this specific decomposition and dual application.

The approach works reasonably on multi-object and fluid scenes, with reported gains over baselines on VideoPhy, GPT-based metrics, and human evaluations. Running on a single consumer GPU is a plus for usability.

The soft spot is the reliance on the consistency bias to actually transfer the physical constraints from the simulation. If the bias is not dominant enough, or if the point cloud guidance introduces artifacts, the output might look coherent but not truly follow the intended motions, particularly in tricky fluid or collision cases. The paper could use more direct checks on how closely the generated dynamics match the simulator outputs.

This kind of work is for researchers in video generation and physics-informed AI who are looking for ways to improve realism without heavy retraining. A reader focused on applied generative models would find the pipeline details and results worthwhile. It deserves a serious referee because the core idea is clearly laid out and the experiments show concrete improvements, though the validation of the physical accuracy could be tighter. I would recommend putting it through peer review.

Referee Report

2 major / 2 minor

Summary. The paper presents 3DPhysVideo, a novel training-free pipeline that generates physically realistic videos from a single image. It repurposes an off-the-shelf image-to-video flow model in two stages: first as a novel-view synthesizer guided by rendered point clouds to reconstruct complete 360-degree 3D scene geometry, and second to synthesize the final video guided by point clouds from an external physics solver applied to the reconstructed geometry. The key technical component is the Consistency-Guided Flow SDE, which decomposes the model's predicted velocity into a denoising term plus a consistency bias to enforce fidelity to the conditional point-cloud inputs in both stages. Experiments on multi-object and fluid-interaction scenes claim that the method produces physically plausible videos, runs efficiently on a single consumer GPU, and outperforms baselines on GPT-based scores, the VideoPhy benchmark, and human evaluation.

Significance. If the central claims hold, the work would be a useful contribution to video generation by demonstrating how existing pretrained flow models can be repurposed, without fine-tuning, to incorporate explicit 3D reconstruction and physics simulation for improved physical grounding. The training-free design, single-GPU efficiency, and handling of fluids and multi-object interactions address recognized limitations in current generative models. The use of standard physics solvers and point-cloud guidance is a pragmatic strength that could be adopted more broadly if the consistency mechanism proves reliable.

major comments (2)

The manuscript provides no quantitative evaluation or ablation of the strength of the consistency bias term relative to the model's prior when the I2V flow model is conditioned on physics-simulated point clouds (see description of Consistency-Guided Flow SDE). Without such analysis it is unclear whether the bias reliably transfers physical trajectories or whether generated frames achieve only visual coherence while violating dynamics, particularly for fluids and collisions; this directly affects the central claim that the same off-the-shelf model can be successfully repurposed for both reconstruction and simulation-guided synthesis.
No independent metrics (e.g., reconstruction error, multi-view consistency, or Chamfer distance) are reported for the accuracy of the 360-degree geometry obtained in the first stage by guiding the flow model with rendered point clouds. Because this geometry is the input to the subsequent physics solver, the absence of verification undermines confidence that downstream video dynamics match the intended physical simulation.
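For reference, the Chamfer distance named in the second major comment can be computed as in the sketch below. This is one common convention (mean nearest-neighbour Euclidean distance in each direction, summed); other conventions use squared distances or different normalisation, and nothing here reflects how the authors would report it. The brute-force search is quadratic in the number of points and is meant for illustration rather than large clouds.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets."""
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)
```

Applied to the first stage, `points_a` would be the reconstructed geometry and `points_b` a reference scan or proxy geometry, where one is available.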

minor comments (2)

The abstract refers to 'GPT-based scores' without defining the prompt, model, or exact metric used; this should be clarified in the experiments section for reproducibility.
Explicit equations for the velocity decomposition (denoising term plus consistency bias) would improve clarity and allow readers to assess the claimed parameter-free nature of the guidance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comments point by point below, indicating where we agree and will revise the manuscript accordingly.


Referee: The manuscript provides no quantitative evaluation or ablation of the strength of the consistency bias term relative to the model's prior when the I2V flow model is conditioned on physics-simulated point clouds (see description of Consistency-Guided Flow SDE). Without such analysis it is unclear whether the bias reliably transfers physical trajectories or whether generated frames achieve only visual coherence while violating dynamics, particularly for fluids and collisions; this directly affects the central claim that the same off-the-shelf model can be successfully repurposed for both reconstruction and simulation-guided synthesis.

Authors: We agree that an explicit quantitative ablation of the consistency bias strength would strengthen the central claim. In the revised manuscript we will add an ablation that varies the bias weighting hyperparameter and reports VideoPhy benchmark scores together with targeted qualitative checks on trajectory fidelity for fluid and multi-object collision cases. This will directly quantify how the bias term modulates the model's prior. revision: yes
Referee: No independent metrics (e.g., reconstruction error, multi-view consistency, or Chamfer distance) are reported for the accuracy of the 360-degree geometry obtained in the first stage by guiding the flow model with rendered point clouds. Because this geometry is the input to the subsequent physics solver, the absence of verification undermines confidence that downstream video dynamics match the intended physical simulation.

Authors: We acknowledge the value of direct verification for the intermediate 3D reconstruction. While downstream video quality and human evaluations provide indirect support, we will incorporate additional quantitative checks in the revision, including multi-view consistency scores and Chamfer distance where reference geometry or proxy measures are available from the evaluation scenes. revision: yes

Circularity Check

0 steps flagged

No circularity: pipeline uses external off-the-shelf I2V model and standard physics solver with independent guidance mechanism


The derivation chain relies on repurposing a pretrained image-to-video flow model (external) for point-cloud-guided reconstruction and then physics-simulated guidance, with Consistency-Guided Flow SDE presented as a decomposition of the model's existing velocity prediction into denoising plus consistency bias. No parameter is fitted to the target video output, no self-citation chain justifies the core uniqueness or ansatz, and the physics solver operates independently of the generative model. The method is therefore self-contained against external benchmarks rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The pipeline depends on the assumption that point-cloud guidance preserves consistency in the flow model and that physics solvers produce usable inputs for video synthesis; no free parameters or new entities with independent evidence are explicitly introduced in the abstract.

axioms (2)

domain assumption: An off-the-shelf image-to-video flow model can be guided by rendered point clouds to reconstruct complete 360-degree 3D scene geometry without training.
Invoked in the first stage of the pipeline as the basis for repurposing the model.
domain assumption: Physics solvers applied to the reconstructed geometry produce point clouds that, when used to guide the same flow model, yield high-quality and physically plausible videos.
Central to the second stage and the overall claim of physical realism.

invented entities (1)

Consistency-Guided Flow SDE (no independent evidence)
purpose: Decomposes the predicted velocity of the I2V flow model into denoising and consistency bias to enforce consistency with conditional inputs for both reconstruction and simulation guidance.
Newly introduced component that enables the training-free repurposing; no independent evidence provided outside the method itself.

pith-pipeline@v0.9.0 · 5805 in / 1608 out tokens · 41059 ms · 2026-05-19T21:22:02.185690+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Consistency-Guided Flow SDE, which decomposes the predicted velocity of the I2V flow model into denoising and consistency bias

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: $p^* = \arg\max_{p} \mathbb{E}_{z_\tau \sim p}\left[C(z_\tau, z_I)\right] - \frac{1}{\beta} D_{\mathrm{KL}}(p \,\|\, q)$
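The objective in the second passage is a KL-regularized maximization, which has a standard closed-form maximizer. The worked step below is a general fact about objectives of this form, assuming the normalizer $Z$ is finite; whether the paper states or uses this closed form is not shown on this page.

```latex
\begin{aligned}
p^* &= \arg\max_{p}\; \mathbb{E}_{z_\tau \sim p}\!\left[C(z_\tau, z_I)\right]
       - \frac{1}{\beta}\, D_{\mathrm{KL}}(p \,\|\, q) \\[4pt]
\Rightarrow\quad
p^*(z_\tau) &= \frac{1}{Z}\, q(z_\tau)\, \exp\!\big(\beta\, C(z_\tau, z_I)\big),
\qquad
Z = \int q(z)\, \exp\!\big(\beta\, C(z, z_I)\big)\, dz .
\end{aligned}
```

This follows because the objective equals $\frac{1}{\beta}\log Z - \frac{1}{\beta} D_{\mathrm{KL}}(p \,\|\, p^*)$, which is largest when $p = p^*$. Larger $\beta$ weights the consistency score $C$ more heavily relative to the prior $q$.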

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages

[1]

Genesis: A generative and universal physics engine for robotics and beyond, December 2024

Genesis Authors. Genesis: A generative and universal physics engine for robotics and beyond, December 2024. URLhttps://github.com/Genesis-Embodied-AI/Genesis

work page 2024
[2]

Recammaster: Camera-controlled generative rendering from a single video.ICCV, 2025

Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video.ICCV, 2025

work page 2025
[3]

Videophy: Evaluating physical commonsense for video generation.arXiv, 2024

Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation.arXiv, 2024

work page 2024
[4]

Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv, 2023

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv, 2023

work page 2023
[5]

Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise

Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, Michael Ryoo, Paul Debevec, and Ning Yu. Go-with-the-flow: Motion-controllable video diffusion models using real-time warped noise. In CVPR, 2025

work page 2025
[6]

Gic: Gaussian-informed continuum for physical property identification and simulation.NeurIPS, 2024

Junhao Cai, Yuji Yang, Weihao Yuan, Yisheng He, Zilong Dong, Liefeng Bo, Hui Cheng, and Qifeng Chen. Gic: Gaussian-informed continuum for physical property identification and simulation.NeurIPS, 2024

work page 2024
[7]

Physgen3d: Crafting a miniature interactive world from a single image

Boyuan Chen, Hanxiao Jiang, Shaowei Liu, Saurabh Gupta, Yunzhu Li, Hao Zhao, and Shenlong Wang. Physgen3d: Crafting a miniature interactive world from a single image. InCVPR, 2025

work page 2025
[8]

Partgen: Part-level 3d generation and reconstruction with multi-view diffusion models

Minghao Chen, Roman Shapovalov, Iro Laina, Tom Monnier, Jianyuan Wang, David Novotny, and Andrea Vedaldi. Partgen: Part-level 3d generation and reconstruction with multi-view diffusion models. InCVPR, 2025

work page 2025
[9]

Motion-conditioned diffusion model for controllable video synthesis.arXiv, 2023

Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, and Ming-Hsuan Yang. Motion-conditioned diffusion model for controllable video synthesis.arXiv, 2023

work page 2023
[10]

Veo 3.https://deepmind.google/models/veo/, 2025

DeepMind / Google. Veo 3.https://deepmind.google/models/veo/, 2025

work page 2025
[11]

Fischler and Robert C

Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography.Communications of the ACM, 1981

work page 1981
[12]

Physical simulator in-the-loop video generation.CVPR, 2026

Lin Geng Foo, Mark He Huang, Alexandros Lattas, Stylianos Moschoglou, Thabo Beeler, and Christian Theobalt. Physical simulator in-the-loop video generation.CVPR, 2026

work page 2026
[13]

Motion prompting: Controlling video generation with motion trajectories

Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Carl Doersch, Yusuf Aytar, Michael Rubinstein, Chen Sun, et al. Motion prompting: Controlling video generation with motion trajectories. InCVPR, 2025

work page 2025
[14]

Force prompting: Video generation models can learn and generalize physics-based control signals.NeurIPS, 2025

Nate Gillman, Charles Herrmann, Michael Freeman, Daksh Aggarwal, Evan Luo, Deqing Sun, and Chen Sun. Force prompting: Video generation models can learn and generalize physics-based control signals.NeurIPS, 2025

work page 2025
[15]

Diffusion as shader: 3d-aware video diffusion for versatile video generation control.SIGGRAPH 2025, 2025

Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, and Yuan Liu. Diffusion as shader: 3d-aware video diffusion for versatile video generation control.SIGGRAPH 2025, 2025

work page 2025
[16]

Cameractrl: Enabling camera control for text-to-video generation.ICLR, 2025

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation.ICLR, 2025

work page 2025
[17]

Dreamphysics: Learning physics-based 3d dynamics with video diffusion priors

Tianyu Huang, Haoze Zhang, Yihan Zeng, Zhilu Zhang, Hui Li, Wangmeng Zuo, and Ryn- son WH Lau. Dreamphysics: Learning physics-based 3d dynamics with video diffusion priors. InAAAI, 2025. 10

work page 2025
[18]

VBench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. InCVPR, 2024

work page 2024
[19]

Peekaboo: Interactive video generation via masked-diffusion

Yash Jain, Anshul Nasery, Vibhav Vineet, and Harkirat Behl. Peekaboo: Interactive video generation via masked-diffusion. InCVPR, 2024

work page 2024
[20]

Huang, Jong Chul Ye, Niloy J

Hyeonho Jeong, Chun-Hao P. Huang, Jong Chul Ye, Niloy J. Mitra, and Duygu Ceylan. Track4gen: Teaching video diffusion models to track points improves video generation. In CVPR, 2025

work page 2025
[21]

The material point method for simulating continuum materials

Chenfanfu Jiang, Craig Schroeder, Joseph Teran, Alexey Stomakhin, and Andrew Selle. The material point method for simulating continuum materials. InACM SIGGRAPH 2016 Courses, 2016

work page 2016
[22]

Uniedit-flow: Unleashing inversion and editing in the era of flow models.arXiv, 2025

Guanlong Jiao, Biqing Huang, Kuan-Chieh Wang, and Renjie Liao. Uniedit-flow: Unleashing inversion and editing in the era of flow models.arXiv, 2025

work page 2025
[23]

3d gaussian splatting for real-time radiance field rendering.ACM TOG, 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM TOG, 2023

work page 2023
[24]

The numerical solution of stochastic differential equations

Peter E Kloeden and RA Pearson. The numerical solution of stochastic differential equations. The ANZIAM Journal, 1977

work page 1977
[25]

Springer, 1992

Peter E Kloeden and Eckhard Platen.Numerical Solution of Stochastic Differential Equations. Springer, 1992

work page 1992
[26]

On information and sufficiency.Annals of Mathe- matical Statistics, 1951

Solomon Kullback and Richard A Leibler. On information and sufficiency.Annals of Mathe- matical Statistics, 1951

work page 1951
[27]

Magicmotion: Controllable video generation with dense-to-sparse trajectory guidance.ICCV, 2025

Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, and Zuxuan Wu. Magicmotion: Controllable video generation with dense-to-sparse trajectory guidance.ICCV, 2025

work page 2025
[28]

Wonderplay: Dynamic 3d scene generation from a single image and actions

Zizhang Li, Hong-Xing Yu, Wei Liu, Yin Yang, Charles Herrmann, Gordon Wetzstein, and Jiajun Wu. Wonderplay: Dynamic 3d scene generation from a single image and actions. In ICCV, 2025

work page 2025
[29]

Movideo: Motion-aware video generation with diffusion models

Jingyun Liang, Yuchen Fan, Kai Zhang, Radu Timofte, Luc Van Gool, and Rakesh Ranjan. Movideo: Motion-aware video generation with diffusion models. InECCV, 2024

work page 2024
[30]

Phys4dgen: Physics-compliant 4d generation with multi-material composition perception

Jiajing Lin, Zhenzhong Wang, Dejun Xu, Shu Jiang, Yunpeng Gong, and Min Jiang. Phys4dgen: Physics-compliant 4d generation with multi-material composition perception. InACM Multime- dia, 2025

work page 2025
[31]

Omniphysgs: 3d constitutive gaussians for general physics-based dynamics generation.ICLR, 2025

Yuchen Lin, Chenguo Lin, Jianjin Xu, and Yadong Mu. Omniphysgs: 3d constitutive gaussians for general physics-based dynamics generation.ICLR, 2025

work page 2025
[32]

Motionclone: Training-free motion cloning for controllable video generation

Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, and Yi Jin. Motionclone: Training-free motion cloning for controllable video generation. ICLR, 2025

work page 2025
[33]

Flow matching for generative modeling.ICLR, 2023

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.ICLR, 2023

work page 2023
[34]

Zero-1-to-3: Zero-shot one image to 3d object

Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl V ondrick. Zero-1-to-3: Zero-shot one image to 3d object. InICCV, 2023

[35] Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. Physgen: Rigid-body physics-grounded image-to-video generation. In ECCV, 2024.

[36] Wei Liu, Ziyu Chen, Zizhang Li, Yue Wang, Hong-Xing Yu, and Jiajun Wu. Realwonder: Real-time physical action-conditioned video generation, 2026.

[37] Zhuoman Liu, Weicai Ye, Yan Luximon, Pengfei Wan, and Di Zhang. Physflow: Unleashing the potential of multi-modal foundation models and video diffusion for 4d dynamic physical scene simulation. CVPR, 2025.

[38] Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation. arXiv, 2024.

[39] Meta. Render colored pointclouds. https://pytorch3d.org/tutorials/render_colored_points, 2024.

[40] Himangi Mittal, Peiye Zhuang, Hsin-Ying Lee, and Shubham Tulsiani. Uniphy: Learning a unified constitutive model for inverse physics simulation. In CVPR, 2025.

[41] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In CVPR, 2023.

[42] Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models understand physical principles? arXiv, 2025.

[43] Haomiao Ni, Changhao Shi, Kai Li, Sharon X Huang, and Martin Renqiang Min. Conditional image-to-video generation with latent flow diffusion models. In CVPR, 2023.

[44] OpenAI. Sora. https://openai.com/blog/sora, 2024.

[45] Haonan Qiu, Zhaoxi Chen, Zhouxia Wang, Yingqing He, Menghan Xia, and Ziwei Liu. Freetraj: Tuning-free trajectory control in video diffusion models, 2024.

[46] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. NeurIPS, 2023.

[47] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. ICLR, 2025.

[48] Hannes Risken. In The Fokker-Planck equation: methods of solution and applications. Springer, 1989.

[49] Litu Rout, Yujia Chen, Nataniel Ruiz, Constantine Caramanis, Sanjay Shakkottai, and Wen-Sheng Chu. Semantic image inversion and editing using rectified stochastic differential equations. ICLR, 2025.

[50] Runway Research. Introducing gen-3 alpha: A new frontier for video generation. https://runwayml.com/research/introducing-gen-3-alpha, 2024.

[51] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. ICLR, 2024.

[52] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. ICLR, 2021.

[53] Alexey Stomakhin, Craig Schroeder, Lawrence Chai, Joseph Teran, and Andrew Selle. A material point method for snow simulation. ACM TOG, 2013.

[54] Xiyang Tan, Ying Jiang, Xuan Li, Zeshun Zong, Tianyi Xie, Yin Yang, and Chenfanfu Jiang. Physmotion: Physics-grounded dynamics from a single image. 3DV, 2026.

[55] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. In ECCV, 2024.

[56] Chen Wang, Chuhao Chen, Yiming Huang, Zhiyang Dou, Yuan Liu, Jiatao Gu, and Lingjie Liu. Physctrl: Generative physics for controllable and physics-grounded video generation. In NeurIPS, 2025.

[57] Jiangshan Wang, Junfu Pu, Zhongang Qi, Jiayi Guo, Yue Ma, Nisha Huang, Yuxin Chen, Xiu Li, and Ying Shan. Taming rectified flow for inversion and editing. ICML, 2025.

[58] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In CVPR, 2025.

[59] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH, 2024.

[60] Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T Barron, and Aleksander Holynski. Cat4d: Create anything in 4d with multi-view video diffusion models. In CVPR, 2025.

[61] Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, and Di Zhang. Draganything: Motion control for anything using entity representation. In ECCV, 2024.

[62] Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, and Xiaowei Zhou. Spatialtracker: Tracking any 2d pixels in 3d space. In CVPR, 2024.

[63] Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. Spatialtrackerv2: 3d point tracking made easy. In ICCV, 2025.

[64] Zeqi Xiao, Wenqi Ouyang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, and Xingang Pan. Trajectory attention for fine-grained video motion control. In ICLR, 2025.

[65] Tianyi Xie, Zeshun Zong, Yuxing Qiu, Xuan Li, Yutao Feng, Yin Yang, and Chenfanfu Jiang. Physgaussian: Physics-integrated 3d gaussians for generative dynamics. In CVPR, 2024.

[66] Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, and Ying Shan. Instantmesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruction models. arXiv, 2024.

[67] Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, and Jing Liao. Direct-a-video: Customized video generation with user-directed camera movement and object motion. In ACM SIGGRAPH, 2024.

[68] Xiaofeng Yang, Cheng Chen, Xulei Yang, Fayao Liu, and Guosheng Lin. Text-to-image rectified flow as plug-and-play priors. ICLR, 2025.

[69] Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, et al. Vlipp: Towards physically plausible video generation with vision and language informed physical prior. ICCV, 2025.

[70] Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T. Freeman, and Jiajun Wu. Wonderworld: Interactive 3d scene generation from a single image. In CVPR, 2025.

[71] Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. TPAMI, 2025.

[72] Jiahao Zhan, Zizhang Li, Hong-Xing Yu, and Jiajun Wu. Perpetualwonder: Long-horizon action-conditioned 4d scene generation. CVPR, 2026.

[73] Lvmin Zhang, Shengqu Cai, Muyang Li, Gordon Wetzstein, and Maneesh Agrawala. Frame context packing and drift prevention in next-frame-prediction video diffusion models. In NeurIPS, 2025.

[74] Tianyuan Zhang, Hong-Xing Yu, Rundi Wu, Brandon Y Feng, Changxi Zheng, Noah Snavely, Jiajun Wu, and William T Freeman. Physdreamer: Physics-based interaction with 3d objects via video generation. In ECCV, 2024.

[75] Yunzhi Zhang, Carson Murtuza-Lanier, Zizhang Li, Yilun Du, and Jiajun Wu. Product of experts for visual generation. arXiv, 2025.

A Teaser Figure Details

The teaser figure (Fig. 1) shows four physical phenomena (a hanging robe, a duck floating on water, a Franka Panda striking foam, and a rising steam plume), each generated by our full pipeline: 3D scene re...
