"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Ashwin Nagarajan; Daniel Hong; Fangrui Zhu; Jing Gu; Kaiwen Zhou; Ming-Yu Liu; Qianqi Yan; Xian Liu; Xin Eric Wang; Yue Fan

arxiv 2507.13428 v3 pith:RI6GJKDW submitted 2025-07-17 cs.CV cs.AI

Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang

classification cs.CV cs.AI

keywords models, physical, physics, evaluate, generation, phenomena, prompts, anti-physics

Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles such as object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel Anti-Physics category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that utilizes current multimodal large language models to evaluate physics realism in a zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with detailed comparison and analysis. Through systematic testing across 1050 curated prompts spanning fundamental, composite, and anti-physics scenarios, we identify pivotal challenges these models face in adhering to real-world physics. We further examine their performance under diverse physical phenomena and prompt types, and derive targeted recommendations for crafting prompts that enhance fidelity to physical principles.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PhysInOne: Visual Physics Learning and Reasoning in One Suite
cs.CV 2026-04 unverdicted novelty 8.0

PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...
MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models
cs.AI 2026-05 unverdicted novelty 7.0

MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus,...
Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation
cs.CL 2026-05 unverdicted novelty 7.0

EduVideoBench is a new KSA-grounded benchmark that evaluates five frontier video generation models and finds substantial gaps in educational validity across knowledge, skills, and attitudes.
WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
cs.CV 2026-05 unverdicted novelty 7.0

WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.
CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models
cs.CV 2026-05 unverdicted novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
PhyGround: Benchmarking Physical Reasoning in Generative World Models
cs.CV 2026-05 accept novelty 7.0

PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
Do Joint Audio-Video Generation Models Understand Physics?
cs.SD 2026-05 unverdicted novelty 7.0

Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.
Do Joint Audio-Video Generation Models Understand Physics?
cs.SD 2026-05 unverdicted novelty 7.0

AV-Phys Bench shows that current joint audio-video models lack robust physical commonsense, with major drops on transitions and deliberate anti-physics prompts.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
cs.AI 2026-04 unverdicted novelty 7.0

Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...
OSCBench: Benchmarking Object State Change in Text-to-Video Generation
cs.CV 2026-03 unverdicted novelty 7.0

OSCBench demonstrates that text-to-video models produce inaccurate and temporally inconsistent object state changes, with performance dropping sharply on novel and compositional action scenarios.
Detecting AI-Generated Video: A Vision-Language Dual-View Survey
cs.CV 2026-07 conditional novelty 6.0

AIGC-V detection should be treated as factual fidelity verification and organized by a four-layer vision-language dual-view taxonomy spanning cues, motion, cross-modal consistency, and world-level reasoning.
BadDreamer: Transferable Backdoor Attacks against Video World Models for Autonomous Driving
cs.CV 2026-06 unverdicted novelty 6.0

Introduces BadDreamer, a backdoor attack that poisons the transition dynamics of video world models so that a trigger causes hallucination of obstacle-free futures, transferring to unsafe action predictions in autonom...
Lighting-grounded Video Generation with Renderer-based Agent Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

LiVER conditions video diffusion models on renderer-derived 3D control signals for disentangled, editable control over object layout, lighting, and camera trajectory.
A Definition and Roadmap for World Models
cs.AI 2026-07 conditional novelty 5.0

A perspective article defining world models as finite-resource compression of physical state transitions and outlining a roadmap toward physical AGI via unified representations and interactive simulators.
MPMWorlds: Material-Point-Method Simulations for Inferring and Extrapolating Physical Dynamics
cs.GR 2026-06 unverdicted novelty 5.0

Assembles MPM simulation dataset and compares code generation versus video diffusion for inferring physical parameters and extrapolating dynamics from videos.
Physically Viable World Models: A Case for Query-Conditioned Embodied AI
cs.AI 2026-05 unverdicted novelty 5.0

Embodied AI requires query-conditioned world models that select the simplest physical abstraction sufficient to answer intervention queries.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
cs.AI 2026-04 conditional novelty 4.0

A survey proposing a three-level capability taxonomy (L1 Predictor, L2 Simulator, L3 Evolver) for world models across physical, digital, social, and scientific domains.