VideoPhy-2: A challenging action-centric physical commonsense evaluation in video generation

Bansal, H · 2025 · arXiv 2503.06800

Canonical reference. 89% of citing Pith papers cite this work as background.

33 Pith papers citing it

Background 89% of classified citations

citation-role summary

background 8 · dataset 1

citation-polarity summary

background 8 · use dataset 1

representative citing papers

PhysInOne: Visual Physics Learning and Reasoning in One Suite

cs.CV · 2026-04-10 · unverdicted · novelty 8.0

PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.

MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

cs.CV · 2026-06-25 · unverdicted · novelty 7.0 · 4 refs

MemoBench is a new diagnostic benchmark with automated and VQA metrics that evaluates memory consistency in video models when objects disappear and reappear in dynamic environments.

Each Judge Its Own Yardstick: Discovering Per-VLM Taxonomies for Physical Video Evaluation

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

JudgeFit produces per-VLM physical video evaluation taxonomies that improve held-out accuracy by a mean 32% relative to a single global schema across 16 models from eight families.

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.

What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

What-If World is a new paired-prompt benchmark showing that nine state-of-the-art video generation models achieve at most 52% on causal intervention tests, with open-source systems clustering near 28%.

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

WBench is a benchmark with 289 test cases and 1,058 turns for evaluating interactive world models using 22 automated metrics validated against human judgments.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.

PhyGround: Benchmarking Physical Reasoning in Generative World Models

cs.CV · 2026-05-11 · accept · novelty 7.0

PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.

Do Joint Audio-Video Generation Models Understand Physics?

cs.SD · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

AV-Phys Bench shows that current joint audio-video models lack robust physical commonsense, with major drops on transitions and deliberate anti-physics prompts.

BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios

cs.MM · 2026-04-24 · unverdicted · novelty 7.0

BRITE benchmark reveals that leading T2V models handle static object composition well but degrade sharply on object-action binding and audio-visual synchronization for implausible prompts.

MoRight: Motion Control Done Right

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.

Current World Models Lack a Persistent State Core

cs.CV · 2026-06-18 · unverdicted · novelty 6.0

Current world models fail to evolve internal state when unobserved and instead resume scenes at the last observed state, as diagnosed by the new WRBench benchmark across 23 models and 9600 videos.

Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

Proprio uses flow residuals from latent perturbations in frozen video generators as a self-scoring signal for physical plausibility, yielding reported gains of 16.5% on Physics-IQ and 20.6% on VideoPhy2-hard.

LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

LaMo adds self-supervised latent motion priors via a motion drift loss during training and motion prior guidance during sampling to boost physical fidelity in video diffusion models like CogVideoX.

NEWTON: Agentic Planning for Physically Grounded Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.

SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models

cs.CV · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

SARA introduces semantic saliency to guide relational alignment in video diffusion models, improving text following and motion quality over prior alignment methods.

How Far Are Video Models from True Multimodal Reasoning?

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

Self-Refining Video Sampling

cs.CV · 2026-01-26 · conditional · novelty 6.0

Self-refining video sampling treats a pre-trained generator as a denoising autoencoder for iterative inference-time refinement guided by self-consistency uncertainty to improve motion coherence and physics alignment.

ProPhy: Progressive Physical Alignment for Dynamic World Simulation

cs.CV · 2025-12-05 · unverdicted · novelty 6.0

ProPhy adds explicit physics-aware conditioning via semantic and refinement experts plus VLM knowledge transfer to produce more physically coherent dynamic videos than prior methods.

PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models

cs.CV · 2025-12-01 · conditional · novelty 6.0

A new dataset and a fine-tuned VLM detector/explainer, PhyDetEx, show that current T2V models still struggle to generate videos that obey physical laws, with open-source models performing worse.

Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility

cs.CV · 2025-09-29 · unverdicted · novelty 6.0

A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving photorealism.

PhysRAG: Enhancing Physics-Awareness in Video Generation via Retrieval-Augmented Generation

cs.CV · 2026-06-25 · unverdicted · novelty 5.0

PhysRAG curates 7K videos from WISA-80K, builds a physical video database, and injects knowledge via learnable queries into a diffusion model to reach SOTA visual quality and physical compliance on PhyGenBench and VBench.

Physics-Informed Video Generation via Mixture-of-Experts Latent Alignment

cs.CV · 2026-06-03 · unverdicted · novelty 5.0

PILA aligns frozen flow-matching video models to a physics attribute bank via MoE experts and operational residuals, reporting SOTA physical plausibility on VBench-2.0, VideoPhy-2 and PhyGenBench while preserving visual quality.
