Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

Chenwei Xu; Fan Du; Guo Ye; Han Liu; Haoran Lu; Jianshu Zhang; Lie Lu; Manling Li; Maojiang Su; Pranav Maneriker

arxiv: 2603.03485 · v3 · pith:JLFWSRNSnew · submitted 2026-03-03 · 💻 cs.CV · cs.AI · cs.RO

Haoran Lu, Shang Wu, Songling Liu, Jianshu Zhang, Maojiang Su, Guo Ye, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

classification 💻 cs.CV cs.AI cs.RO

keywords models, physical, consistency, diffusion, fine-grained, phys4d, video, world

Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

APT: Atomic Physical Transitions for Causal Video-Language Understanding
cs.CV 2026-06 unverdicted novelty 6.0

Introduces APT chains as ordered causal transition sequences and APT-Tune to improve VLM transition detection while preserving event-level performance.

Physics-IQ Verified
cs.CV 2026-06 unverdicted novelty 5.0

Physics-IQ Verified refines 57.6% of samples and 34.8% of prompts from the original benchmark and produces moderate ranking shifts (Kendall's τ = 0.46) across six image-to-video models.

MagicSim: A Unified Infrastructure for Executable Embodied Interaction
cs.RO 2026-06 unverdicted novelty 5.0

MagicSim is a unified embodied interaction infrastructure built on a deterministic batched runtime and shared MDP that supports diverse world construction, execution, task evaluation, automatic rollout generation, and...