arxiv: 2512.24551 · v4 · pith:UZFUOZIUnew · submitted 2025-12-31 · 💻 cs.CV

PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

classification 💻 cs.CV
keywords phygdpo · data · direct · groupwise · optimization · physical · physics · training

Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods, mainly based on graphics or prompt extension, struggle to generalize beyond simple simulated environments or to learn implicit physical reasoning. Training data with rich physical interactions and phenomena is also scarce. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. We then formulate a principled Physics-aware Groupwise Direct Preference Optimization (PhyGDPO) framework that uses real-world video as the winning case to guarantee correct physics learning and builds on the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. Within PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that uses VLM-based physical rewards to focus the optimization on challenging physics cases. In addition, we propose a LoRA-Switch Reference (LoRA-SR) scheme that avoids duplicating the full model as the reference, making DPO training more efficient. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. See our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, data, and models are publicly available at https://github.com/caiyuanhao1998/Open-PhyGDPO
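
For readers unfamiliar with the Plackett-Luce model named in the abstract, the following is a minimal sketch of its generic ranking likelihood. It is not the paper's objective: the function name `plackett_luce_nll` and the plain list of scores are assumptions made for illustration, and in a DPO-style setting the scores would presumably be implicit rewards derived from the policy and reference models, which the abstract does not specify.

```python
import math


def plackett_luce_nll(scores_in_rank_order):
    """Negative log-likelihood of one ranking under the Plackett-Luce model.

    scores_in_rank_order: real-valued scores, most preferred item first.
    (Hypothetical helper for illustration; not taken from the paper's code.)
    """
    nll = 0.0
    n = len(scores_in_rank_order)
    # At each position k, the item ranked k-th is "chosen" from the items
    # that remain, with probability softmax(remaining scores)[chosen].
    # The last position has only one remaining item, so it contributes 0.
    for k in range(n - 1):
        remaining = scores_in_rank_order[k:]
        m = max(remaining)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in remaining))
        nll += log_z - scores_in_rank_order[k]
    return nll
```

With a group of two items this reduces to the pairwise Bradley-Terry loss, `-log(sigmoid(s_win - s_lose))`, which is the sense in which a groupwise model extends pairwise comparisons.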

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing

    cs.CV 2026-06 unverdicted novelty 7.0

    PhyEditBench is a new benchmark with real-world and synthetic instances that reveals limitations in current image editing models' physics reasoning and proposes a video-generation-based baseline called PhyWorld.

  2. PhyEditBench: A Real-World Multi-Stage Benchmark for Physics-Aware Image Editing

    cs.CV 2026-06 unverdicted novelty 7.0

    PhyEditBench is a new benchmark for physics-aware image editing with real and synthetic instances plus a training-free PhyWorld baseline that uses test-time scaling to outperform SOTA models.

  3. CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

    cs.CV 2026-05 unverdicted novelty 7.0

    CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...

  4. Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Proprio uses flow residuals from latent perturbations in frozen video generators as a self-scoring signal for physical plausibility, yielding reported gains of 16.5% on Physics-IQ and 20.6% on VideoPhy2-hard.

  5. LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    LaMo adds self-supervised latent motion priors via a motion drift loss during training and motion prior guidance during sampling to boost physical fidelity in video diffusion models like CogVideoX.

  6. NEWTON: Agentic Planning for Physically Grounded Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.

  7. Self-Refining Video Sampling

    cs.CV 2026-01 conditional novelty 6.0

    Self-refining video sampling treats a pre-trained generator as a denoising autoencoder for iterative inference-time refinement guided by self-consistency uncertainty to improve motion coherence and physics alignment.

  8. PhysRAG: Enhancing Physics-Awareness in Video Generation via Retrieval-Augmented Generation

    cs.CV 2026-06 unverdicted novelty 5.0

    PhysRAG curates 7K videos from WISA-80K, builds a physical video database, and injects knowledge via learnable queries into a diffusion model to reach SOTA visual quality and physical compliance on PhyGenBench and VBench.

  9. Physics-Informed Video Generation via Mixture-of-Experts Latent Alignment

    cs.CV 2026-06 unverdicted novelty 5.0

    PILA aligns frozen flow-matching video models to a physics attribute bank via MoE experts and operational residuals, reporting SOTA physical plausibility on VBench-2.0, VideoPhy-2 and PhyGenBench while preserving visu...