{"total":15,"items":[{"citing_arxiv_id":"2606.28385","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RoboGaze: Evaluating Robot World Models via Structured Vision-Language Analysis","primary_cat":"cs.RO","submitted_at":"2026-06-22T06:45:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RoboGaze presents a structured multi-agent VLM pipeline and robotics-specific error taxonomy that improves video evaluation metrics by up to 43 F1 points over zero-shot baselines on a 382-clip dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11969","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpecLoR: Spectral Lookahead Rectification for Motion-Coherent Text-to-Video Generation","primary_cat":"cs.CV","submitted_at":"2026-06-10T11:49:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpecLoR rectifies the amplitude spectrum of lookahead-estimated clean latents to natural-video priors during early ODE sampling steps, cutting physical artifacts with only four extra NFEs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11838","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Plan-and-Verify Video Reward Reasoning with Spatio-Temporal Scene Graph Grounding","primary_cat":"cs.CV","submitted_at":"2026-06-10T09:18:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SG-PVR introduces plan-and-verify reasoning grounded in spatio-temporal scene graphs to address verification gaps and implicit evidence in existing T2V reward 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09639","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation","primary_cat":"cs.CV","submitted_at":"2026-06-08T15:35:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces CineDance-1M dataset for multi-shot long-form text-to-audio-video generation along with CineBench and a model adaptation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11723","ref_index":16,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating","primary_cat":"cs.CV","submitted_at":"2026-05-12T08:08:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"generated videos, with expert annotations of temporal segments, anomaly types, andper-frame bboxes. 
Although labor-intensive, these annotations are crucial: 1) they improve CoT ground-truth (GT) construction by allowing video clips with GT spatiotemporal annotations to be fed into the foundation model, effectively reducing hallucinations during label generation compared with prior work [16, 54]; 2) they provide references for attribution and temporal/spatial IoU rewards in the third training stage, guiding the model toward more interpretable reasoning through explicit localization. Extensive experiments on our anomaly benchmark demonstrate that CaC reliably detects sparse anomalies, achieving 81.7% overall accuracy and improving over the strongest baseline by 25."},{"citing_arxiv_id":"2605.10806","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PhyGround: Benchmarking Physical Reasoning in Generative World Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T16:30:51+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Ltx-2: Efficient joint audio-visual foundation model. arXiv preprint arXiv:2601.03233, 2026. [14] Hui Han, Siyuan Li, Jiaqi Chen, Yiwen Yuan, Yuling Wu, Yufan Deng, Chak Tou Leong, Hanwen Du, Junchen Fu, Youhua Li, et al. Video-bench: Human-aligned video generation benchmark. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18858-18868, 2025. [15] Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang, Mingyi Su, Wentao Ma, Junru Lin, Chun Ye, Yi Lu, et al. Videoscore2: Think before you score in generative video evaluation. arXiv preprint arXiv:2509.22799, 2025. 
[16] Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, et al."},{"citing_arxiv_id":"2605.10434","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors","primary_cat":"cs.CV","submitted_at":"2026-05-11T12:06:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"On extending the bradley-terry model to accommodate ties in paired comparison experiments.Journal of the American Statistical Association, 65(329):317-328, 1970. [3] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models.arXiv preprint arXiv:2506.09113, 2025. [4] Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang, Mingyi Su, Wentao Ma, Junru Lin, Chun Ye, Yi Lu, et al. Videoscore2: Think before you score in generative video evaluation.arXiv preprint arXiv:2509.22799, 2025. [5] Xuming He, Zehao Fan, Hengjia Li, Fan Zhuo, Hankun Xu, Senlin Cheng, Di Weng, Haifeng Liu, Can Ye, and Boxi Wu. 
Ruler-bench: Probing rule-based reasoning abilities of next-level"},{"citing_arxiv_id":"2605.08703","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RewardHarness: Self-Evolving Agentic Post-Training","primary_cat":"cs.AI","submitted_at":"2026-05-09T05:32:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"outputs, showing that the REWARDHARNESS-trained variant faithfully executes editing instructions while the base model and EditReward-trained variant frequently fail (see Appendix B.2 for additional examples). 4 Related Work Reward models for visual generation. Existing reward models-ImageReward, PickScore, VisionReward, EditReward, VideoScore2, ImagenWorld-rely on supervised fine-tuning from tens of thousands of human preference comparisons [8, 11, 17, 22, 29, 33, 34]. 
REWARDHARNESS learns from only ∼100 demonstrations by shifting adaptation from parameter updates to explicit library evolution.Self-evolving agents.Context-based self-evolving methods (Reflexion, ExpeL, 7 Table 2: To validate the effectiveness of REWARDHARNESSas a reward model, we use it to RL-tune FLUX.2-klein-base-4B and evaluate on downstream image editing benchmarks (ImgEdit-Bench)."},{"citing_arxiv_id":"2605.05922","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-07T09:30:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"[46] show that ineffective prompts can produce response groups that are uniformly correct or uniformly incorrect, thereby weakening effective gradient signals and increasing training variance. Meanwhile, [11] empirically show that the gradient variance of GRPO grows with sequence length, which leads training instability. These limitations are directly inherited by CoT-based video reward models [9, 39, 36], where reasoning and scoring are coupled in a single sampling chain, causing the final reward prediction to rely heavily on GRPO-based optimization. This motivates us to move beyond the standard GRPO objective and develop a more efficient optimization strategy for video reward modeling. 
3 Method 3.1 Data Collection We build our preference dataset by captioning diverse real-world videos and using the captions"},{"citing_arxiv_id":"2605.05187","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)","primary_cat":"cs.CV","submitted_at":"2026-05-06T17:52:39+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The PhyScore challenge creates the first benchmark requiring metrics to jointly score video quality, physical realism, condition alignment, and temporal consistency while localizing physical anomalies in 1554 videos from seven generative models across text-to-2D, image-to-4D, and video-to-4D tracks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19193","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Far Are Video Models from True Multimodal Reasoning?","primary_cat":"cs.CV","submitted_at":"2026-04-21T08:04:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Second, video quality evaluation has transitioned from fragmented scoring metrics [28] toward fine-grained and unified textual interpretations. Previous evaluation paradigms typically rely on task-specific metric combinations [12, 27,30,45,46,58,77,93] or reward models trained on multi-dimensional prefer- How Far Are Video Models from True Multimodal Reasoning? 3 ence datasets [3,4,22,23,81]. 
However, fragmented and coarse metrics often fail to provide actionable textual feedback, while reward model training is frequently bottlenecked by the scarcity of high-quality data and remains susceptible to reward hacking [13]. To circumvent these issues, recent agent-based frameworks [20,75,83] have been adapted for unified video evaluation, offering both"},{"citing_arxiv_id":"2604.17428","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation","primary_cat":"cs.CV","submitted_at":"2026-04-19T13:17:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Long-CODE isolates long-context video evaluation with a new benchmark dataset and shot-dynamics metric that correlates better with human judgments on narrative richness and global consistency than short-video metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15299","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AnimationBench: Are Video Models Good at Character-Centric Animation?","primary_cat":"cs.CV","submitted_at":"2026-04-16T17:57:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02817","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal 
Modeling","primary_cat":"cs.CV","submitted_at":"2026-04-03T07:32:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MMPhysVideo improves physical plausibility in video diffusion models by jointly modeling RGB with perceptual cues in pseudo-RGB format via a bidirectional teacher-student architecture and a new data curation pipeline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.13294","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-02-09T05:46:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VisPhyWorld evaluates MLLMs' physical reasoning via executable code generation for video reconstruction, with VisPhyBench showing strong semantics but weak parameter inference and dynamics simulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}