{"paper":{"title":"Unified Reward Model for Multimodal Understanding and Generation","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A single reward model trained jointly on image and video tasks improves preference alignment for both understanding and generation.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Cheng Jin, Hao Li, Jiaqi Wang, Yibin Wang, Yuhang Zang","submitted_at":"2025-03-07T08:36:05Z","abstract_excerpt":"Recent advances in human preference alignment have significantly improved multimodal generation and understanding. A key approach is to train reward models that provide supervision signals for preference optimization. However, existing reward models are often task-specific, limiting their adaptability across diverse visual applications. We also argue that a reward model that jointly learning to assess multiple vision tasks may foster a synergistic effect, where improved image understanding enhances image generation assessment, and refined image evaluation benefits video assessment through bett"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"jointly learning to assess diverse visual tasks yields substantial mutual benefits... achieving consistent improvements across each domain.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The large-scale human preference dataset accurately represents human judgments across tasks and the two-stage filtering strategy produces high-quality, unbiased preference pairs without introducing selection artifacts.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A single reward model trained jointly on image and video tasks improves preference alignment for both understanding and generation.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a04782d1d631e82f96d7da719f6f7958ce15ad318815067e1e7a0c81fc365900"},"source":{"id":"2503.05236","kind":"arxiv","version":2},"verdict":{"id":"27542910-c752-4be9-9a1f-1c30da435b53","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T00:39:50.045426Z","strongest_claim":"jointly learning to assess diverse visual tasks yields substantial mutual benefits... achieving consistent improvements across each domain.","one_line_summary":"UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The large-scale human preference dataset accurately represents human judgments across tasks and the two-stage filtering strategy produces high-quality, unbiased preference pairs without introducing selection artifacts.","pith_extraction_headline":"A single reward model trained jointly on image and video tasks improves preference alignment for both understanding and generation."},"references":{"count":64,"sample":[{"doi":"","year":2024,"title":"Diffusion model alignment using direct preference optimization","work_id":"3bbf25f4-3cec-4668-a9d1-8ddccee8afd8","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Videodpo: Omni-preference alignment for video diffusion generation","work_id":"47508379-7568-448f-b406-1b4e9bbb181b","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Lift: Leveraging human feedback for text-to-video model alignment","work_id":"ab339bba-4351-45db-841d-de086f9c0d67","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Llava-critic: Learning to evaluate multimodal models","work_id":"2d8a9688-7f2b-4224-8107-31c7945ce2cd","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Internlm-xcomposer2.5-reward: A simple yet effective multi-modal reward model","work_id":"499855fc-7d44-43d3-ab3a-1038d2ccd0bc","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":64,"snapshot_sha256":"651a51160f69de7cdb1a714c615a241622a86798965a893fba208eafa2cf1838","internal_anchors":21},"formal_canon":{"evidence_count":2,"snapshot_sha256":"bc8e6693c3ba327a9134bd49e0e50656b6f9543d25273089ee312b7696bb487f"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}