arxiv: 2603.03485 · v3 · pith:JLFWSRNSnew · submitted 2026-03-03 · 💻 cs.CV · cs.AI · cs.RO

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

classification 💻 cs.CV · cs.AI · cs.RO
keywords models, physical, consistency, diffusion, fine-grained, phys4d, video, world

Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. APT: Atomic Physical Transitions for Causal Video-Language Understanding

    cs.CV 2026-06 unverdicted novelty 6.0

    Introduces APT chains as ordered causal transition sequences and APT-Tune to improve VLM transition detection while preserving event-level performance.

  2. Physics-IQ Verified

    cs.CV 2026-06 unverdicted novelty 5.0

    Physics-IQ Verified refines 57.6% of samples and 34.8% of prompts from the original benchmark and produces moderate ranking shifts (Kendall's τ = 0.46) across six image-to-video models.

  3. MagicSim: A Unified Infrastructure for Executable Embodied Interaction

    cs.RO 2026-06 unverdicted novelty 5.0

    MagicSim is a unified embodied interaction infrastructure built on a deterministic batched runtime and shared MDP that supports diverse world construction, execution, task evaluation, automatic rollout generation, and...