hub Mixed citations

Unified Reward Model for Multimodal Understanding and Generation

Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, Jiaqi Wang · 2025 · cs.CV · arXiv 2503.05236

Mixed citation behavior. Most common role is background (44%).

45 Pith papers citing it

Background 44% of classified citations

open full Pith review browse 45 citing papers arXiv PDF

abstract

Recent advances in human preference alignment have significantly improved multimodal generation and understanding. A key approach is to train reward models that provide supervision signals for preference optimization. However, existing reward models are often task-specific, limiting their adaptability across diverse visual applications. We also argue that a reward model that jointly learning to assess multiple vision tasks may foster a synergistic effect, where improved image understanding enhances image generation assessment, and refined image evaluation benefits video assessment through better frame analysis. To this end, this paper proposes UnifiedReward, the first unified reward model for multimodal understanding and generation assessment. It supports both pairwise ranking and pointwise scoring, providing effective reward signals for vision model preference alignment. Specifically, (1) we first train UnifiedReward on our constructed large-scale human preference dataset, which covers both image and video generation/understanding tasks. (2) Then, we leverage it to automatically construct high-quality pairwise preference data from vision models by progressively filtering their outputs through our two-stage strategy, i.e., pair ranking and point sifting. (3) Finally, we use these data to align vision models with human preferences via Direct Preference Optimization (DPO). Experimental results show that jointly learning to assess diverse visual tasks yields substantial mutual benefits. We further apply our pipeline to both vision understanding and generation, achieving consistent improvements across each domain.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9 dataset 4 baseline 3 method 1 other 1

citation-polarity summary

background 8 use dataset 5 baseline 3 unclear 1 use method 1

representative citing papers

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Arena-T2I Hard: Benchmarking and Improving Faithfulness with Dependency-Aware Checklist

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

Arena-T2I Hard benchmark with ~30 decomposed constraints per prompt and a dependency-aware checklist reward yields better faithfulness-aesthetics trade-off than single-reward or weighted-sum baselines on SD3.5-Medium and FLUX.1-dev.

Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

PRISM shows video diffusion models inherently encode preference information in noisy latents, achieving SOTA accuracy and enabling noise-robust early-stage sampling with a correlation to generative performance.

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

cs.CV · 2026-06-08 · unverdicted · novelty 7.0

CapRL++ applies reinforcement learning with verifiable rewards to dense image and video captioning by scoring captions via the accuracy of a vision-free LLM answering MCQs from the caption alone.

Explicit Critic Guidance for Aligning Diffusion Models

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

cs.AI · 2026-05-17 · unverdicted · novelty 7.0 · 2 refs

AutoRubric-T2I learns and selects explicit rubrics from preference pairs to guide VLM judges, producing high-quality interpretable rewards for T2I alignment with far less data than traditional Bradley-Terry models.

DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

DiffusionOPD applies online policy distillation from per-task teachers to a unified diffusion student, with a derived closed-form per-step KL objective that unifies SDE and ODE sampling via mean matching.

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.

RewardHarness: Self-Evolving Agentic Post-Training

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

Probing Visual Planning in Image Editing Models

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.

ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

ParetoSlider conditions diffusion models on continuous preference weights to approximate the full Pareto front, providing dynamic control over multi-objective rewards at inference time.

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

Visual-ERM: Reward Modeling for Visual Equivalence

cs.CV · 2026-03-13 · unverdicted · novelty 7.0

Visual-ERM is a new multimodal reward model that supplies fine-grained visual feedback for training vision-language models on chart-to-code, table, and SVG tasks, yielding measurable gains over prior rewards.

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

cs.CV · 2026-03-01 · unverdicted · novelty 7.0

SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

cs.CV · 2026-02-11 · unverdicted · novelty 7.0

DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.

DiffusionNFT: Online Diffusion Reinforcement with Forward Process

cs.LG · 2025-09-19 · unverdicted · novelty 7.0

DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-process methods.

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

cs.AI · 2025-07-29 · unverdicted · novelty 7.0

MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.

Optimizing Visual Generative Models via Distribution-wise Rewards

cs.LG · 2026-07-02 · unverdicted · novelty 6.0

Distribution-wise rewards with subset-replace strategy and post-hoc merging improve FID-50K on SiT (8.30 to 5.77) and EDM2 (3.74 to 3.52) while preserving diversity.

PortraitGen: Exemplar-Driven GRPO with Dual-Reward Guidance for Photorealistic Portrait Generation

cs.CV · 2026-06-25 · unverdicted · novelty 6.0

PortraitGen integrates real-image exemplars into GRPO sampling and applies dual rewards (OmniReward and AI-Portrait) to improve photorealism, claiming better results than baselines on a new PortraitBench.

DiffusionBench: On Holistic Evaluation of Diffusion Transformers

cs.CV · 2026-06-23 · conditional · novelty 6.0

NanoGen unifies DiT training on ImageNet and T2I, reveals negative Pearson correlations (-0.377 to -0.580) in method rankings across metrics from 21 models, and motivates DiffusionBench for holistic evaluation.

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

cs.CV · 2026-06-16 · unverdicted · novelty 6.0

UniAR uses a shared context-visual tokenizer with bitwise quantization and parallel prediction in an autoregressive framework to unify visual understanding and generation, claiming SOTA on generation and editing tasks.

Exploring the Design Space of Reward Backpropagation for Flow Matching

cs.LG · 2026-06-09 · conditional · novelty 6.0

FlowBP unifies surrogate backward trajectories for reward backpropagation in flow matching, recovering prior methods as special cases and showing metric gains on SD3.5-M and FLUX models via three variants.

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

Z-Reward trains a 27B reasoning teacher VLM on score distributions via GDSO and distills it via RISD into a 9B student, reaching 89.6% and 88.6% human preference accuracy with 41.3% optimization gain over SFT baseline.

citing papers explorer

Showing 0 of 0 citing papers after filters.

No citing papers match the current filters.

Unified Reward Model for Multimodal Understanding and Generation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer