RoboGaze presents a structured multi-agent VLM pipeline and robotics-specific error taxonomy that improves video evaluation metrics by up to 43 F1 points over zero-shot baselines on a 382-clip dataset.
Huynh-Thu, Q
Mixed citation behavior. Most common role is background (67%).
years: 2026 (15)
representative citing papers
CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.
PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.
SpecLoR rectifies the amplitude spectrum of lookahead-estimated clean latents to natural-video priors during early ODE sampling steps, cutting physical artifacts with only four extra NFEs.
The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.
DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
Long-CODE isolates long-context video evaluation with a new benchmark dataset and shot-dynamics metric that correlates better with human judgments on narrative richness and global consistency than short-video metrics.
MMPhysVideo improves physical plausibility in video diffusion models by jointly modeling RGB with perceptual cues in pseudo-RGB format via a bidirectional teacher-student architecture and a new data curation pipeline.
VisPhyWorld evaluates MLLMs' physical reasoning via executable code generation for video reconstruction, with VisPhyBench showing strong semantics but weak parameter inference and dynamics simulation.
SG-PVR introduces plan-and-verify reasoning grounded in spatio-temporal scene graphs to address verification gaps and implicit evidence in existing T2V reward models.
The paper introduces the CineDance-1M dataset for multi-shot, long-form text-to-audio-video generation, along with CineBench and a model adaptation.
The PhyScore challenge creates the first benchmark requiring metrics to jointly score video quality, physical realism, condition alignment, and temporal consistency while localizing physical anomalies in 1554 videos from seven generative models across text-to-2D, image-to-4D, and video-to-4D tracks.