VideoPhy-2: A challenging action-centric physical commonsense evaluation in video generation

Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, Kai-Wei Chang · 2025 · arXiv 2503.06800

Canonical reference. 89% of citing Pith papers cite this work as background.

31 Pith papers citing it

citation-role summary

background: 8 · dataset: 1

citation-polarity summary

background: 8 · use dataset: 1

representative citing papers

PhysInOne: Visual Physics Learning and Reasoning in One Suite

cs.CV · 2026-04-10 · unverdicted · novelty 8.0

PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.

MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

cs.CV · 2026-06-25 · unverdicted · novelty 7.0 · 4 refs

MemoBench is a new diagnostic benchmark with automated and VQA metrics that evaluates memory consistency in video models under disappear-and-reappear events in dynamic environments.

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.

What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

What-If World is a new paired-prompt benchmark showing that nine state-of-the-art video generation models achieve at most 52% on causal intervention tests and cluster near 28% for open-source systems.

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.

PhyGround: Benchmarking Physical Reasoning in Generative World Models

cs.CV · 2026-05-11 · accept · novelty 7.0

PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.

Do Joint Audio-Video Generation Models Understand Physics?

cs.SD · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

AV-Phys Bench shows that current joint audio-video models lack robust physical commonsense, with major drops on transitions and deliberate anti-physics prompts.

BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios

cs.MM · 2026-04-24 · unverdicted · novelty 7.0

BRITE benchmark reveals that leading T2V models handle static object composition well but degrade sharply on object-action binding and audio-visual synchronization for implausible prompts.

MoRight: Motion Control Done Right

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.

Current World Models Lack a Persistent State Core

cs.CV · 2026-06-18 · unverdicted · novelty 6.0

Current world models fail to evolve internal state when unobserved and instead resume scenes at the last observed state, as diagnosed by the new WRBench benchmark across 23 models and 9,600 videos.

Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

Proprio uses flow residuals from latent perturbations in frozen video generators as a self-scoring signal for physical plausibility, yielding reported gains of 16.5% on Physics-IQ and 20.6% on VideoPhy2-hard.

LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

LaMo adds self-supervised latent motion priors via a motion drift loss during training and motion prior guidance during sampling to boost physical fidelity in video diffusion models like CogVideoX.

NEWTON: Agentic Planning for Physically Grounded Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.

SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

cs.CV · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

SARA introduces semantic saliency to guide relational alignment in video diffusion models, improving text following and motion quality over prior alignment methods.

How Far Are Video Models from True Multimodal Reasoning?

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

Self-Refining Video Sampling

cs.CV · 2026-01-26 · conditional · novelty 6.0

Self-refining video sampling treats a pre-trained generator as a denoising autoencoder for iterative inference-time refinement guided by self-consistency uncertainty to improve motion coherence and physics alignment.

ProPhy: Progressive Physical Alignment for Dynamic World Simulation

cs.CV · 2025-12-05 · unverdicted · novelty 6.0

ProPhy adds explicit physics-aware conditioning via semantic and refinement experts plus VLM knowledge transfer to produce more physically coherent dynamic videos than prior methods.

PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models

cs.CV · 2025-12-01 · conditional · novelty 6.0

A new dataset and a fine-tuned VLM detector/explainer called PhyDetEx show that current T2V models still struggle to generate videos that obey physical laws, with open-source models performing worse.

Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility

cs.CV · 2025-09-29 · unverdicted · novelty 6.0

A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving photorealism.

Physics-Informed Video Generation via Mixture-of-Experts Latent Alignment

cs.CV · 2026-06-03 · unverdicted · novelty 5.0

PILA aligns frozen flow-matching video models to a physics attribute bank via MoE experts and operational residuals, reporting SOTA physical plausibility on VBench-2.0, VideoPhy-2 and PhyGenBench while preserving visual quality.

MPMWorlds: Material-Point-Method Simulations for Inferring and Extrapolating Physical Dynamics

cs.GR · 2026-06-01 · unverdicted · novelty 5.0

MPMWorlds assembles an MPM simulation dataset and compares code generation versus video diffusion for inferring physical parameters and extrapolating dynamics from videos.

OptiWorld: Optimal Control for Video World Generation under Physical Constraints

cs.CV · 2026-05-30 · unverdicted · novelty 5.0

OptiWorld inserts a classical optimal-control layer that extracts a world state, plans an optimal trajectory on a geometric manifold under physical constraints, and renders the video conditioned on that trajectory.

citing papers explorer

Showing 28 of 28 citing papers after filters.

PhysInOne: Visual Physics Learning and Reasoning in One Suite

cs.CV · 2026-04-10 · unverdicted · none · ref 7

PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.

MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

cs.CV · 2026-06-25 · unverdicted · none · ref 2 · 4 links

MemoBench is a new diagnostic benchmark with automated and VQA metrics that evaluates memory consistency in video models under disappear-and-reappear events in dynamic environments.

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

cs.CV · 2026-05-28 · unverdicted · none · ref 8

YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.

What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

cs.CV · 2026-05-26 · unverdicted · none · ref 9

What-If World is a new paired-prompt benchmark showing that nine state-of-the-art video generation models achieve at most 52% on causal intervention tests and cluster near 28% for open-source systems.

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

cs.CV · 2026-05-25 · unverdicted · none · ref 76

WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

cs.CV · 2026-05-22 · unverdicted · none · ref 6

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

cs.CV · 2026-05-12 · unverdicted · none · ref 3 · 2 links

CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.

Do Joint Audio-Video Generation Models Understand Physics?

cs.SD · 2026-05-08 · unverdicted · none · ref 3 · 2 links

AV-Phys Bench shows that current joint audio-video models lack robust physical commonsense, with major drops on transitions and deliberate anti-physics prompts.

BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios

cs.MM · 2026-04-24 · unverdicted · none · ref 13

BRITE benchmark reveals that leading T2V models handle static object composition well but degrade sharply on object-action binding and audio-visual synchronization for implausible prompts.

MoRight: Motion Control Done Right

cs.CV · 2026-04-08 · unverdicted · none · ref 5

MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.

Current World Models Lack a Persistent State Core

cs.CV · 2026-06-18 · unverdicted · none · ref 14

Current world models fail to evolve internal state when unobserved and instead resume scenes at the last observed state, as diagnosed by the new WRBench benchmark across 23 models and 9,600 videos.

Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation

cs.CV · 2026-05-27 · unverdicted · none · ref 5

Proprio uses flow residuals from latent perturbations in frozen video generators as a self-scoring signal for physical plausibility, yielding reported gains of 16.5% on Physics-IQ and 20.6% on VideoPhy2-hard.

LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation

cs.CV · 2026-05-22 · unverdicted · none · ref 4

LaMo adds self-supervised latent motion priors via a motion drift loss during training and motion prior guidance during sampling to boost physical fidelity in video diffusion models like CogVideoX.

NEWTON: Agentic Planning for Physically Grounded Video Generation

cs.CV · 2026-05-18 · unverdicted · none · ref 4

NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.

SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

cs.CV · 2026-05-08 · unverdicted · none · ref 17 · 2 links

SARA introduces semantic saliency to guide relational alignment in video diffusion models, improving text following and motion quality over prior alignment methods.

How Far Are Video Models from True Multimodal Reasoning?

cs.CV · 2026-04-21 · unverdicted · none · ref 4

Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

ProPhy: Progressive Physical Alignment for Dynamic World Simulation

cs.CV · 2025-12-05 · unverdicted · none · ref 2

ProPhy adds explicit physics-aware conditioning via semantic and refinement experts plus VLM knowledge transfer to produce more physically coherent dynamic videos than prior methods.

Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility

cs.CV · 2025-09-29 · unverdicted · none · ref 4

A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving photorealism.

Physics-Informed Video Generation via Mixture-of-Experts Latent Alignment

cs.CV · 2026-06-03 · unverdicted · none · ref 43

PILA aligns frozen flow-matching video models to a physics attribute bank via MoE experts and operational residuals, reporting SOTA physical plausibility on VBench-2.0, VideoPhy-2 and PhyGenBench while preserving visual quality.

MPMWorlds: Material-Point-Method Simulations for Inferring and Extrapolating Physical Dynamics

cs.GR · 2026-06-01 · unverdicted · none · ref 3

MPMWorlds assembles an MPM simulation dataset and compares code generation versus video diffusion for inferring physical parameters and extrapolating dynamics from videos.

OptiWorld: Optimal Control for Video World Generation under Physical Constraints

cs.CV · 2026-05-30 · unverdicted · none · ref 23

OptiWorld inserts a classical optimal-control layer that extracts a world state, plans an optimal trajectory on a geometric manifold under physical constraints, and renders the video conditioned on that trajectory.

Tempered Self-Similarity Alignment for Physically Plausible Video Generation

cs.CV · 2026-05-24 · unverdicted · none · ref 3

Tempered Self-similarity Alignment transfers relational structure from foundation-model STSS into video generators via probabilistic correspondence alignment, yielding reported gains in physical plausibility on VideoPhy benchmarks.

PhyWorld: Physics-Faithful World Model for Video Generation

cs.CV · 2026-05-19 · unverdicted · none · ref 52

PhyWorld improves temporal consistency and physical plausibility in video world models via flow matching fine-tuning followed by DPO on physics preference pairs, with reported gains on VBench and a custom physical-faithfulness benchmark.

DriveCtrl: Conditioned Sim-to-Real Driving Video Generation

cs.CV · 2026-05-14 · unverdicted · none · ref 47

DriveCtrl is a depth-conditioned controllable framework that generates realistic driving videos from simulation while preserving annotations and scene dynamics.

Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control

cs.AI · 2025-12-29 · unverdicted · none · ref 27

A compact language model trained on scaled synthetic nuclear reactor control data exhibits variance collapse and emergent concentration on a single actuation strategy driven by physical execution success.

MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

cs.CV · 2025-11-23 · unverdicted · none · ref 4

MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.

World Simulation with Video Foundation Models for Physical AI

cs.CV · 2025-10-28 · unverdicted · none · ref 8

Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

Evolution of Video Generative Foundations

cs.CV · 2026-04-07 · unverdicted · none · ref 170

This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
