PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation
Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods, mainly based on graphics or prompt extension, struggle to generalize beyond simple simulated environments or to learn implicit physical reasoning. Training data with rich physical interactions and phenomena is also scarce. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. We then formulate a principled Physics-aware Groupwise Direct Preference Optimization (PhyGDPO) framework that uses real-world videos as the winning cases to guarantee correct physics learning and builds on the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. Within PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that leverages VLM-based physical rewards to focus the optimization on challenging physics cases. In addition, we propose a LoRA-Switch Reference (LoRA-SR) scheme that avoids duplicating the full model as a reference, making DPO training more efficient. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. More video results are available on our project page at https://caiyuanhao1998.github.io/project/PhyGDPO. Our code, data, and models are publicly available at https://github.com/caiyuanhao1998/Open-PhyGDPO
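The abstract does not spell out the PhyGDPO objective, so the following is only a minimal sketch of the general idea it names: a Plackett-Luce model over a group of candidates in which one item (here, the real-world video) is treated as the winner. The function names, the `beta` parameter, and the use of DPO-style implicit rewards (policy log-likelihood minus reference log-likelihood) are assumptions made for illustration, not the authors' implementation, and the PGR weighting and LoRA-SR scheme are not modeled.

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=1.0):
    # Hypothetical DPO-style implicit reward: scaled log-ratio of policy to reference.
    return beta * (logp_policy - logp_ref)

def log_sum_exp(values):
    # Numerically stable log(sum(exp(v))) over a non-empty list.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def plackett_luce_nll(scores_ranked):
    # Negative log-likelihood of a full ranking under Plackett-Luce.
    # scores_ranked[0] is the most preferred item, scores_ranked[-1] the least.
    nll = 0.0
    for k in range(len(scores_ranked) - 1):
        nll -= scores_ranked[k] - log_sum_exp(scores_ranked[k:])
    return nll

def top1_groupwise_loss(winner_score, loser_scores):
    # Plackett-Luce likelihood that the winner is ranked first in the group;
    # the order among the losers is left unspecified (an assumption of this sketch).
    return log_sum_exp([winner_score] + list(loser_scores)) - winner_score

# Illustrative numbers only: one "real" video and three generated candidates.
winner = implicit_reward(logp_policy=-10.0, logp_ref=-11.0)
losers = [implicit_reward(-12.0, -11.5), implicit_reward(-9.0, -9.0),
          implicit_reward(-13.0, -12.0)]
loss = top1_groupwise_loss(winner, losers)
```

With a single loser, the top-1 loss reduces to the familiar pairwise form `-log(sigmoid(s_w - s_l))`, which is one way to see how a groupwise model generalizes pairwise comparison.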
Forward citations
Cited by 9 Pith papers
- PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing
  PhyEditBench is a new benchmark with real-world and synthetic instances that reveals limitations in current image editing models' physics reasoning and proposes a video-generation-based baseline called PhyWorld.
- PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing
  PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.
- CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
  CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...
- Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation
  Proprio uses flow residuals from latent perturbations in frozen video generators as a self-scoring signal for physical plausibility, yielding reported gains of 16.5% on Physics-IQ and 20.6% on VideoPhy2-hard.
- LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation
  LaMo adds self-supervised latent motion priors via a motion drift loss during training and motion prior guidance during sampling to boost physical fidelity in video diffusion models like CogVideoX.
- NEWTON: Agentic Planning for Physically Grounded Video Generation
  NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.
- Self-Refining Video Sampling
  Self-refining video sampling treats a pre-trained generator as a denoising autoencoder for iterative inference-time refinement guided by self-consistency uncertainty to improve motion coherence and physics alignment.
- PhysRAG: Enhancing Physics-Awareness in Video Generation via Retrieval-Augmented Generation
  PhysRAG curates 7K videos from WISA-80K, builds a physical video database, and injects knowledge via learnable queries into a diffusion model to reach SOTA visual quality and physical compliance on PhyGenBench and VBench.
- Physics-Informed Video Generation via Mixture-of-Experts Latent Alignment
  PILA aligns frozen flow-matching video models to a physics attribute bank via MoE experts and operational residuals, reporting SOTA physical plausibility on VBench-2.0, VideoPhy-2 and PhyGenBench while preserving visu...