{"total":25,"items":[{"citing_arxiv_id":"2605.14269","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-14T02:12:13+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than prior 2D rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13223","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation","primary_cat":"cs.CV","submitted_at":"2026-05-13T09:14:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Skill-aligned annotation improves inter-annotator agreement and evaluation stability in text-to-image generation compared to uniform annotation baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11723","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating","primary_cat":"cs.CV","submitted_at":"2026-05-12T08:08:33+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"image quality assessment via reinforcement learning to rank.arXiv preprint arXiv:2505.14460, 2025. [59] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903-15935, 2023. [60] Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation.arXiv preprint arXiv:2412.21059, 2024. 13 [61] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu,"},{"citing_arxiv_id":"2605.08703","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RewardHarness: Self-Evolving Agentic Post-Training","primary_cat":"cs.AI","submitted_at":"2026-05-09T05:32:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Skillrl: Evolving agents via recursive skill-augmented reinforcement learning.arXiv preprint arXiv:2602.08234, 2026. [33] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36:15903-15935, 2023. [34] Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation.arXiv preprint arXiv:2412.21059, 2024. [35] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo,"},{"citing_arxiv_id":"2604.28185","ref_index":89,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling","primary_cat":"cs.CV","submitted_at":"2026-04-30T17:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19234","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation","primary_cat":"cs.CV","submitted_at":"2026-04-21T08:37:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In such settings, multiple objectives may exhibit conflicting opti- mization directions, and fixed reward weighting fails to adaptively balance the trade-offs among them. Classical gradient-based MOO methods attempt to resolve these conflicts by explicitly operating on per-objective gradients, e.g., MGDA [4] computes a minimum- norm convex combination of task gradients, while PCGrad [44] and CAGrad [22] modify conflicting gradients to reduce interference. However, these approaches require computing and storing separate gradients for each objective at every update step, which becomes infeasible for large-scale diffusion models. To alleviate this limita- tion, MGDA-UB [33] derives an upper bound of the multi-objective gradient norm and proves that minimizing this bound leads to a"},{"citing_arxiv_id":"2604.19193","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"How Far Are Video Models from True Multimodal Reasoning?","primary_cat":"cs.CV","submitted_at":"2026-04-21T08:04:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Second, video quality evaluation has transitioned from fragmented scoring metrics [28] toward fine-grained and unified textual interpretations. Previous evaluation paradigms typically rely on task-specific metric combinations [12, 27,30,45,46,58,77,93] or reward models trained on multi-dimensional prefer- How Far Are Video Models from True Multimodal Reasoning? 3 ence datasets [3,4,22,23,81]. However, fragmented and coarse metrics often fail to provide actionable textual feedback, while reward model training is fre- quently bottlenecked by the scarcity of high-quality data and remains susceptible to reward hacking [13]. To circumvent these issues, recent agent-based frame- works [20,75,83] have been adapted for unified video evaluation, offering both"},{"citing_arxiv_id":"2604.17397","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Speculative Decoding for Autoregressive Video Generation","primary_cat":"cs.CV","submitted_at":"2026-04-19T12:01:57+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% quality retention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17195","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior","primary_cat":"cs.CV","submitted_at":"2026-04-19T01:51:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Shanchuan Lin, Jiepeng Cen, Zhibei Ma, Alan Yuille, and Lu Jiang. Videoauteur: Towards long narrative video gener- ation. InICCV, pages 19163-19173, 2025. 3 [50] Junfei Xiao, Ceyuan Yang, Lvmin Zhang, Shengqu Cai, Yang Zhao, Yuwei Guo, Gordon Wetzstein, Maneesh Agrawala, Alan Yuille, and Lu Jiang. Captain cin- ema: Towards short movie generation.arXiv preprint arXiv:2507.18634, 2025. 2, 3 [51] Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shu- run Li, et al. Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059, 2024. 12 [52] Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, and Ying-Cong Chen."},{"citing_arxiv_id":"2604.14910","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reward-Aware Trajectory Shaping for Few-step Visual Generation","primary_cat":"cs.CV","submitted_at":"2026-04-16T11:52:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[43] Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al . 2025. DanceGRPO: Unleashing GRPO on Visual Generation.arXiv preprint arXiv:2505.07818(2025). [44] Xiaomeng Yang, Zhiyu Tan, and Hao Li. 2025. IPO: Iterative preference optimiza- tion for text-to-video generation.arXiv preprint arXiv:2502.02088(2025). [45] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. 2024. Improved distribution matching distillation for fast image synthesis.Advances in neural information processing systems37 (2024), 47455-47487. [46] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park."},{"citing_arxiv_id":"2604.11490","ref_index":67,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Anthropogenic Regional Adaptation in Multimodal Vision-Language Model","primary_cat":"cs.AI","submitted_at":"2026-04-13T13:56:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"utilization of existing systems for targeted regional applications. Overcoming this limitation, we specifically designed a simple yet effective method, Geographical-generalization-made-easy (GG-EZ), which adapts an ex- isting global model to a regional-specific context with minimal degradation on the global context. Inspired by recent advancements in training strategies of large language models (LLMs) [18,23,33,67], GG-EZ operationalizes regional adaptation through a two-level approach: (1) regional data filtering to curate culturally relevant training subsets, and (2) model merging to integrate region- specific adaptations without catastrophic forgetting of global knowledge. Anthropogenic Regional Adaptation 3 We validate Anthropogenic Regional Adaptation and GG-EZ through rig-"},{"citing_arxiv_id":"2603.08090","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DSH-Bench: A Difficulty- and Scenario-Aware Benchmark with Hierarchical Subject Taxonomy for Subject-Driven Text-to-Image Generation","primary_cat":"cs.CV","submitted_at":"2026-03-09T08:30:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DSH-Bench is a benchmark for subject-driven T2I generation that uses hierarchical taxonomy sampling, difficulty/scenario classification, and a new SICS metric showing 9.4% higher human correlation than prior measures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.02214","ref_index":44,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation","primary_cat":"cs.CV","submitted_at":"2026-02-02T15:19:22+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.16933","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reward-Forcing: Autoregressive Video Generation with Reward Feedback","primary_cat":"cs.CV","submitted_at":"2026-01-23T17:47:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reward-Forcing guides autoregressive video generation with reward feedback to achieve performance comparable to teacher-dependent methods on benchmarks like VBench without relying on distillation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.04068","ref_index":75,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-01-07T16:32:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LocalDPO aligns text-to-video diffusion models with human preferences at the spatio-temporal region level by automatically generating localized preference pairs from corrupted real videos and applying a region-aware DPO loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.04678","ref_index":79,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation","primary_cat":"cs.CV","submitted_at":"2025-12-04T11:12:13+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Reward Forcing combines EMA-Sink tokens and Rewarded Distribution Matching Distillation to deliver state-of-the-art streaming video generation at 23.1 FPS without copying initial frames.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.01236","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards","primary_cat":"cs.CV","submitted_at":"2025-12-01T03:25:49+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A data-generation pipeline plus pairwise subject-consistency rewards in RL improve consistency and prompt adherence for multi-subject personalized image generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.22699","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer","primary_cat":"cs.CV","submitted_at":"2025-11-27T18:52:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and streamlined training.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"model's capability forphotorealistic image generation, alongside improvingaesthetic qualityand nuancedinstruction-following. During the GRPO training loop, we compute a composite advantage function by aggregating the scores from our reward model (e.g., realism, aesthetics, instruction following, etc.). This multi-faceted feedback mechanism enables targeted, fine-grained optimization [84]. By providing distinct signals for different aspects of the generation, GRPO can simultaneously enhance photorealistic image generation, aesthetic quality, improve semantic accuracy, and reduce undesirable artifacts. This integrated approach proved to be significantly more effective than optimizing against a single reward, allowing the model to achieve a"},{"citing_arxiv_id":"2511.18719","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Seeing What Matters: Visual Preference Policy Optimization for Visual Generation","primary_cat":"cs.CV","submitted_at":"2025-11-24T03:21:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViPO enhances GRPO for visual generation by creating spatially and temporally aware advantage maps from pretrained vision models to focus optimization on perceptually important regions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.16888","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback","primary_cat":"cs.CV","submitted_at":"2025-10-19T15:38:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniWorld-V2 applies policy optimization via DiffusionNFT and MLLM logit feedback with group filtering to reach state-of-the-art scores of 4.49 on ImgEdit and 7.83 on GEdit-Bench while remaining model-agnostic.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.02283","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Self-Forcing++: Towards Minute-Scale High-Quality Video Generation","primary_cat":"cs.CV","submitted_at":"2025-10-02T17:55:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Self-Forcing++ scales autoregressive video diffusion to over 4 minutes by using self-generated segments for guidance, reducing error accumulation and outperforming baselines in fidelity and consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.22832","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Listener-Rewarded Thinking in VLMs for Image Preferences","primary_cat":"cs.CV","submitted_at":"2025-06-28T09:53:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Listener-augmented GRPO uses an independent frozen VLM to provide dense confidence scores on reasoning traces, yielding 67.4% accuracy on ImageReward, up to +6% OOD gains on 1.2M-vote human data, and fewer reasoning contradictions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.07818","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DanceGRPO: Unleashing GRPO on Visual Generation","primary_cat":"cs.CV","submitted_at":"2025-05-12T17:59:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Figure 2 We visualize the training curves of motion quality and visual aesthetics quality on HunyuanVideo, motion quality on SkyReels-I2V. Furthermore, constructing an effective video reward model for training alignment poses substantial difficulties. Our experiments evaluated several candidates: the Videoscore [41] model exhibited unstable reward distri- butions, rendering it impractical for optimization, while Visionreward-Video [42], a 29-dimensional metric, yielded semantically coherent rewards but suffered from inaccuracies across individual dimensions. Conse- quently, we adopted VideoAlign [14], a multidimensional framework evaluating three critical aspects: visual aesthetics quality, motion quality, and text-video alignment. Notably, the text-video alignment dimension demonstrated significant instability, prompting its exclusion from our final analysis."},{"citing_arxiv_id":"2503.05236","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unified Reward Model for Multimodal Understanding and Generation","primary_cat":"cs.CV","submitted_at":"2025-03-07T08:36:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"UNIFIEDREWARD IS CAPABLE OF ASSESSING BOTH IMAGE AND VIDEO UNDERSTANDING AND GENERATION. \"PAIR\"AND\"POINT\"REFER TO \"PAIRRANKING\"AND\"POINTSCORING\". Reward Model Method Image Generation Image Understand Video Generation Video Understand PickScore'23 [10] Point✓ HPS'23 [11] Point✓ ImageReward'23 [11] Point✓ LLaV A-Critic'24 [5] Pair/Point✓ VideoScore'24 [4] Point✓ LiFT'24 [4] Point✓ VisionReward'24 [12] Point✓ ✓ VideoReward'25 [7] Point✓ UnifiedRewardPair/Point✓ ✓ ✓ ✓ automatically construct high-quality preference pair data by selecting the outputs of specific baselines, such as Vision Language Models (VLM) and diffusion models, through multi- stage filtering,i.e.,pair ranking and point sifting.(3)Finally, we use these preference pairs to align the outputs of these models"},{"citing_arxiv_id":"2501.13918","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Improving Video Generation with Human Feedback","primary_cat":"cs.CV","submitted_at":"2025-01-23T18:55:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.","context_count":1,"top_context_role":"dataset","top_context_polarity":"baseline","context_text":"Overall Accuracy Overall Accuracy VQ Accuracy MQ Accuracy TA Accuracy w/ Ties w/o Ties w/ Ties w/o Ties w/ Ties w/o Ties w/ Ties w/o Ties w/ Ties w/o Ties Random 33.67 49.84 41.86 50.30 47.42 49.86 59.07 49.64 37.25 50.40 VideoScore [19] 49.03 71.69 41.80 50.22 47.41 47.72 59.05 51.09 37.24 50.34 LiFT [73] 37.06 58.39 39.08 57.26 47.53 55.97 59.04 54.91 33.79 55.43 VisionRewrd [76]51.5672.41 56.77 67.59 47.43 59.03 59.03 60.98 46.56 61.15 Ours 49.41 72.89 61.26 73.59 59.68 75.66 66.03 74.70 53.80 72.20 4.4 Noisy Reward Guidance Recall that the KL-regularized RL objective (Eq. 3) admits a closed-form solution(Eq. 7), which transform the original distribution pref(x0 |y) into the new target distribution pθ(x0 |y) . Since the constantsβandwcan be absorbed intor(x 0,y), the closed-form solution becomes:"}],"limit":50,"offset":0}