CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GRPO training with temporal/spatial IoU rewards.
Mixed citations
Title resolution pending
Mixed citation behavior. Most common role is background (67%).
citation-role summary
citation-polarity summary
years
2026 10representative citing papers
PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.
The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.
DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
Long-CODE isolates long-context video evaluation with a new benchmark dataset and shot-dynamics metric that correlates better with human judgments on narrative richness and global consistency than short-video metrics.
MMPhysVideo improves physical plausibility in video diffusion models by jointly modeling RGB with perceptual cues in pseudo-RGB format via a bidirectional teacher-student architecture and a new data curation pipeline.
The PhyScore challenge creates the first benchmark requiring metrics to jointly score video quality, physical realism, condition alignment, and temporal consistency while localizing physical anomalies in 1554 videos from seven generative models across text-to-2D, image-to-4D, and video-to-4D tracks.
citing papers explorer
-
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GRPO training with temporal/spatial IoU rewards.
-
PhyGround: Benchmarking Physical Reasoning in Generative World Models
PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
-
RewardHarness: Self-Evolving Agentic Post-Training
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
-
AnimationBench: Are Video Models Good at Character-Centric Animation?
AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.
-
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.
-
Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.
-
How Far Are Video Models from True Multimodal Reasoning?
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
-
Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation
Long-CODE isolates long-context video evaluation with a new benchmark dataset and shot-dynamics metric that correlates better with human judgments on narrative richness and global consistency than short-video metrics.
-
MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling
MMPhysVideo improves physical plausibility in video diffusion models by jointly modeling RGB with perceptual cues in pseudo-RGB format via a bidirectional teacher-student architecture and a new data curation pipeline.
-
LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)
The PhyScore challenge creates the first benchmark requiring metrics to jointly score video quality, physical realism, condition alignment, and temporal consistency while localizing physical anomalies in 1554 videos from seven generative models across text-to-2D, image-to-4D, and video-to-4D tracks.