hub Mixed citations

Unified Reward Model for Multimodal Understanding and Generation

Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, Jiaqi Wang · 2025 · cs.CV · arXiv 2503.05236

Mixed citation behavior. Most common role is background (41%).

39 Pith papers citing it

Background 41% of classified citations

open full Pith review browse 39 citing papers arXiv PDF

abstract

Recent advances in human preference alignment have significantly improved multimodal generation and understanding. A key approach is to train reward models that provide supervision signals for preference optimization. However, existing reward models are often task-specific, limiting their adaptability across diverse visual applications. We also argue that a reward model that jointly learning to assess multiple vision tasks may foster a synergistic effect, where improved image understanding enhances image generation assessment, and refined image evaluation benefits video assessment through better frame analysis. To this end, this paper proposes UnifiedReward, the first unified reward model for multimodal understanding and generation assessment. It supports both pairwise ranking and pointwise scoring, providing effective reward signals for vision model preference alignment. Specifically, (1) we first train UnifiedReward on our constructed large-scale human preference dataset, which covers both image and video generation/understanding tasks. (2) Then, we leverage it to automatically construct high-quality pairwise preference data from vision models by progressively filtering their outputs through our two-stage strategy, i.e., pair ranking and point sifting. (3) Finally, we use these data to align vision models with human preferences via Direct Preference Optimization (DPO). Experimental results show that jointly learning to assess diverse visual tasks yields substantial mutual benefits. We further apply our pipeline to both vision understanding and generation, achieving consistent improvements across each domain.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 dataset 4 baseline 3 method 1 other 1

citation-polarity summary

background 7 use dataset 5 baseline 3 unclear 1 use method 1

representative citing papers

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Arena-T2I Hard: Benchmarking and Improving Faithfulness with Dependency-Aware Checklist

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

Arena-T2I Hard benchmark with ~30 decomposed constraints per prompt and a dependency-aware checklist reward yields better faithfulness-aesthetics trade-off than single-reward or weighted-sum baselines on SD3.5-Medium and FLUX.1-dev.

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

cs.CV · 2026-06-08 · unverdicted · novelty 7.0

CapRL++ applies reinforcement learning with verifiable rewards to dense image and video captioning by scoring captions via the accuracy of a vision-free LLM answering MCQs from the caption alone.

Explicit Critic Guidance for Aligning Diffusion Models

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

cs.AI · 2026-05-17 · unverdicted · novelty 7.0 · 2 refs

AutoRubric-T2I learns and selects explicit rubrics from preference pairs to guide VLM judges, producing high-quality interpretable rewards for T2I alignment with far less data than traditional Bradley-Terry models.

DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

DiffusionOPD applies online policy distillation from per-task teachers to a unified diffusion student, with a derived closed-form per-step KL objective that unifies SDE and ODE sampling via mean matching.

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.

RewardHarness: Self-Evolving Agentic Post-Training

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

Probing Visual Planning in Image Editing Models

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.

ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

ParetoSlider conditions diffusion models on continuous preference weights to approximate the full Pareto front, providing dynamic control over multi-objective rewards at inference time.

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

Visual-ERM: Reward Modeling for Visual Equivalence

cs.CV · 2026-03-13 · unverdicted · novelty 7.0

Visual-ERM is a new multimodal reward model that supplies fine-grained visual feedback for training vision-language models on chart-to-code, table, and SVG tasks, yielding measurable gains over prior rewards.

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

cs.CV · 2026-03-01 · unverdicted · novelty 7.0

SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

cs.CV · 2026-02-11 · unverdicted · novelty 7.0

DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.

DiffusionNFT: Online Diffusion Reinforcement with Forward Process

cs.LG · 2025-09-19 · unverdicted · novelty 7.0

DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-process methods.

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

cs.AI · 2025-07-29 · unverdicted · novelty 7.0

MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.

Exploring the Design Space of Reward Backpropagation for Flow Matching

cs.LG · 2026-06-09 · conditional · novelty 6.0

FlowBP unifies surrogate backward trajectories for reward backpropagation in flow matching, recovering prior methods as special cases and showing metric gains on SD3.5-M and FLUX models via three variants.

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

Z-Reward trains a 27B reasoning teacher VLM on score distributions via GDSO and distills it via RISD into a 9B student, reaching 89.6% and 88.6% human preference accuracy with 41.3% optimization gain over SFT baseline.

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

cs.AI · 2026-05-16 · unverdicted · novelty 6.0

Proposes HT-GRPO with sketch-then-paint staged updates, prompt-conditioned importance ratios, and hierarchical credit assignment for dMLLMs, reporting gains on GenEval and DPG plus quality metrics.

RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

RAVEN aligns training and inference for causal autoregressive video diffusion via interleaved rollout repacking and introduces CM-GRPO for direct RL on consistency-model kernels, claiming better quality than recent baselines.

When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.

dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.

Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.

citing papers explorer

Showing 34 of 34 citing papers after filters.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models cs.CV · 2026-04-05 · unverdicted · none · ref 39 · internal anchor
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
Flow-GRPO: Training Flow Matching Models via Online RL cs.CV · 2025-05-08 · unverdicted · none · ref 61 · internal anchor
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Arena-T2I Hard: Benchmarking and Improving Faithfulness with Dependency-Aware Checklist cs.AI · 2026-06-30 · unverdicted · none · ref 28 · internal anchor
Arena-T2I Hard benchmark with ~30 decomposed constraints per prompt and a dependency-aware checklist reward yields better faithfulness-aesthetics trade-off than single-reward or weighted-sum baselines on SD3.5-Medium and FLUX.1-dev.
CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning cs.CV · 2026-06-08 · unverdicted · none · ref 82 · internal anchor
CapRL++ applies reinforcement learning with verifiable rewards to dense image and video captioning by scoring captions via the accuracy of a vision-free LLM answering MCQs from the caption alone.
Explicit Critic Guidance for Aligning Diffusion Models cs.LG · 2026-05-26 · unverdicted · none · ref 75 · internal anchor
Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.
AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment cs.AI · 2026-05-17 · unverdicted · none · ref 26 · 2 links · internal anchor
AutoRubric-T2I learns and selects explicit rubrics from preference pairs to guide VLM judges, producing high-quality interpretable rewards for T2I alignment with far less data than traditional Bradley-Terry models.
DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models cs.LG · 2026-05-14 · unverdicted · none · ref 30 · internal anchor
DiffusionOPD applies online policy distillation from per-task teachers to a unified diffusion student, with a derived closed-form per-step KL objective that unifies SDE and ODE sampling via mean matching.
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating cs.CV · 2026-05-12 · unverdicted · none · ref 53 · 2 links · internal anchor
CaC presents a new spatiotemporal concentrating reward model for video anomalies, built on a novel large-scale dataset and three-stage training with RL and IoU rewards, claiming 25.7% accuracy gains and 11.7% anomaly reduction.
RewardHarness: Self-Evolving Agentic Post-Training cs.AI · 2026-05-09 · unverdicted · none · ref 27 · internal anchor
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
Probing Visual Planning in Image Editing Models cs.CV · 2026-04-23 · unverdicted · none · ref 78 · internal anchor
Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.
ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control cs.LG · 2026-04-22 · unverdicted · none · ref 44 · internal anchor
ParetoSlider conditions diffusion models on continuous preference weights to approximate the full Pareto front, providing dynamic control over multi-objective rewards at inference time.
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories cs.CV · 2026-04-16 · unverdicted · none · ref 48 · internal anchor
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
Visual-ERM: Reward Modeling for Visual Equivalence cs.CV · 2026-03-13 · unverdicted · none · ref 33 · internal anchor
Visual-ERM is a new multimodal reward model that supplies fine-grained visual feedback for training vision-language models on chart-to-code, table, and SVG tasks, yielding measurable gains over prior rewards.
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards cs.CV · 2026-03-01 · unverdicted · none · ref 72 · internal anchor
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling cs.CV · 2026-02-11 · unverdicted · none · ref 43 · internal anchor
DiNa-LRM introduces a diffusion-native latent reward model using a noise-calibrated Thurstone likelihood on noisy states, matching VLM performance at lower compute in image alignment and preference optimization.
DiffusionNFT: Online Diffusion Reinforcement with Forward Process cs.LG · 2025-09-19 · unverdicted · none · ref 22 · internal anchor
DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-process methods.
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE cs.AI · 2025-07-29 · unverdicted · none · ref 41 · internal anchor
MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions cs.CV · 2026-06-08 · unverdicted · none · ref 55 · internal anchor
Z-Reward trains a 27B reasoning teacher VLM on score distributions via GDSO and distills it via RISD into a 9B student, reaching 89.6% and 88.6% human preference accuracy with 41.3% optimization gain over SFT baseline.
Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models cs.AI · 2026-05-16 · unverdicted · none · ref 38 · internal anchor
Proposes HT-GRPO with sketch-then-paint staged updates, prompt-conditioned importance ratios, and hierarchical credit assignment for dMLLMs, reporting gains on GenEval and DPG plus quality metrics.
RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO cs.CV · 2026-05-14 · unverdicted · none · ref 116 · internal anchor
RAVEN aligns training and inference for causal autoregressive video diffusion via interleaved rollout repacking and introduces CM-GRPO for direct RL on consistency-model kernels, claiming better quality than recent baselines.
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy cs.CV · 2026-05-12 · unverdicted · none · ref 65 · internal anchor
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models cs.LG · 2026-05-10 · unverdicted · none · ref 162 · internal anchor
dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria cs.AI · 2026-05-08 · unverdicted · none · ref 39 · internal anchor
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text-to-image and editing benchmarks.
Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models cs.CV · 2026-05-08 · unverdicted · none · ref 35 · internal anchor
Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.
Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling cs.CV · 2026-05-07 · unverdicted · none · ref 41 · internal anchor
DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think cs.LG · 2026-04-25 · unverdicted · none · ref 41 · internal anchor
V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
APE: Agentic Prompt Enhancer for Image Generation and Editing cs.CV · 2026-05-29 · unverdicted · none · ref 42 · internal anchor
APE post-trains small language models as single-agent or multi-agent prompt enhancers that improve visual alignment on image generation and editing benchmarks without altering the downstream visual model.
When Preference Labels Fall Short: Aligning Diffusion Models from Real Data cs.CV · 2026-05-19 · unverdicted · none · ref 9 · 2 links · internal anchor
Real images contrasted with generated samples can supply effective preference signals for aligning diffusion models at performance levels comparable to standard preference-pair methods.
Edit-GRPO: A Locality-Preserving Policy Optimization Framework for Image Editing cs.CV · 2026-05-16 · unverdicted · none · ref 35 · internal anchor
Edit-GRPO decouples editing and preservation objectives via region-specific signals in a policy optimization framework to improve locality in image editing tasks.
A Systematic Post-Train Framework for Video Generation cs.CV · 2026-04-28 · unverdicted · none · ref 48 · internal anchor
A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling cs.AI · 2026-04-21 · unverdicted · none · ref 53 · internal anchor
DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.
Anthropogenic Regional Adaptation in Multimodal Vision-Language Model cs.AI · 2026-04-13 · unverdicted · none · ref 62 · internal anchor
Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration cs.CL · 2026-05-27 · unverdicted · none · ref 19 · internal anchor
OmniVerifier-M1 is a generalist visual verifier using symbolic outputs for meta-verification and decoupled RL to outperform joint optimization for robust verification and agentic self-correction.
Follow-Your-Preference++: Rethinking Preference Alignment for Image Inpainting cs.CV · 2026-06-02 · unverdicted · none · ref 68 · internal anchor
Empirical study shows reward model ensembles mitigate biases like brightness and composition in preference data for image inpainting, yielding better performance than prior methods without architecture changes.

Unified Reward Model for Multimodal Understanding and Generation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer