Evaluating Newtonian Mechanics in Video Generative Models with Real Physical Systems

Andrii Zadaianchuk; Antonios Tragoudaras; Antonios Vozikis; Chenyu Zhang; Daniil Cherniavskii; Derck W. E. Prinzhorn; Efstratios Gavves; Mark Bodracska; Nicu Sebe; Thijmen Nijdam

arxiv: 2504.02918 · v3 · pith:UDJJKHZLnew · submitted 2025-04-03 · 💻 cs.CV

keywords: models, physical, video, generation, laws, morpheus, newtonian, videos

Recent advances in image and video generation raise hopes that these models possess world modeling capabilities, that is, the ability to generate realistic, physically plausible videos. This could revolutionize applications in robotics, autonomous driving, and scientific simulation. However, before treating these models as world models, we must ask: do they adhere to physical laws? Current evaluation methods rely on subjective judgments or trajectory matching, which limits their usefulness for assessing physical reasoning, where many different generations can be physically plausible. We therefore introduce Morpheus, one of the first physics-informed evaluation frameworks for measuring the ability of video generation models to comprehend Newtonian dynamics. Morpheus features 130 real-world videos capturing physical phenomena, guided by conservation laws. Using these videos as conditioning for video generation, we assess physical plausibility with interpretable metrics, evaluated against the infallible conservation laws known for each physical setting and drawing on advances in physics-informed neural networks and vision-language foundation models. Importantly, Morpheus targets controlled Newtonian rigid-body settings to enable quantitative checks. Our findings reveal that even with advanced prompting and video conditioning, contemporary models struggle to encode physical principles despite generating aesthetically pleasing videos.

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PhysInOne: Visual Physics Learning and Reasoning in One Suite
cs.CV 2026-04 unverdicted novelty 8.0

PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and m...
RoboGaze: Evaluating Robot World Models via Structured Vision-Language Analysis
cs.RO 2026-06 unverdicted novelty 7.0

RoboGaze presents a structured multi-agent VLM pipeline and robotics-specific error taxonomy that improves video evaluation metrics by up to 43 F1 points over zero-shot baselines on a 382-clip dataset.
YoCausal: How Far is Video Generation from World Model? A Causality Perspective
cs.CV 2026-05 unverdicted novelty 7.0

YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.
Benchmarking Single-Factor Physical Video-to-Audio Generation
cs.CV 2026-05 unverdicted novelty 7.0

FlatSounds benchmark shows state-of-the-art V2A models rely more on text captions than visual input for physical and semantic accuracy, with captions improving correctness but degrading temporal alignment.
CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models
cs.CV 2026-05 unverdicted novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
PhyGround: Benchmarking Physical Reasoning in Generative World Models
cs.CV 2026-05 accept novelty 7.0

PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models
cs.CV 2026-05 unverdicted novelty 7.0

ACWM-Phys is a controllable simulator benchmark with in- and out-of-distribution protocols for evaluating action-conditioned world models across rigid, kinematic, deformable, and particle dynamics.
Latent State Design for World Models under Sufficiency Constraints
cs.AI 2026-05 unverdicted novelty 7.0

World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.
APT: Atomic Physical Transitions for Causal Video-Language Understanding
cs.CV 2026-06 unverdicted novelty 6.0

Introduces APT chains as ordered causal transition sequences and APT-Tune to improve VLM transition detection while preserving event-level performance.
Video Models Can Reason with Verifiable Rewards
cs.CV 2026-05 unverdicted novelty 6.0

VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Ma...
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models
cs.CV 2026-05 unverdicted novelty 6.0

ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...
EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving
cs.CV 2026-04 unverdicted novelty 6.0

EgoDyn-Bench reveals a perception bottleneck in vision-centric foundation models: ego-motion logic derives from language while visual input adds negligible signal, with explicit trajectories restoring consistency.
Video models are zero-shot learners and reasoners
cs.LG 2025-09 unverdicted novelty 6.0

Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
OptiWorld: Optimal Control for Video World Generation under Physical Constraints
cs.CV 2026-05 unverdicted novelty 5.0

OptiWorld inserts a classical optimal-control layer that extracts a world state, plans an optimal trajectory on a geometric manifold under physical constraints, and renders the video conditioned on that trajectory.
Physically Viable World Models: A Case for Query-Conditioned Embodied AI
cs.AI 2026-05 unverdicted novelty 5.0

Embodied AI requires query-conditioned world models that select the simplest physical abstraction sufficient to answer intervention queries.
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
cs.CV 2025-11 unverdicted novelty 5.0

MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.