arxiv: 2503.05236 · v2 · submitted 2025-03-07 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Unified Reward Model for Multimodal Understanding and Generation

Yibin Wang , Yuhang Zang , Hao Li , Cheng Jin , Jiaqi Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-14 00:39 UTC · model grok-4.3

classification 💻 cs.CV

keywords unified reward modelmultimodal understandingimage generationvideo generationpreference alignmentdirect preference optimizationhuman preference dataset

0 comments

The pith

A single reward model trained jointly on image and video tasks improves preference alignment for both understanding and generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UnifiedReward as the first model that assesses multiple vision tasks together rather than using separate task-specific reward models. Training on a broad human preference dataset that spans image understanding, image generation, and video generation allows the model to create synergistic effects: stronger frame-level analysis from image tasks helps video assessment, while generation evaluation refines understanding signals. The model supports both pairwise ranking and pointwise scoring, then supplies filtered preference pairs for Direct Preference Optimization on downstream vision models, producing consistent gains across domains.

Core claim

Jointly training a reward model to assess diverse visual tasks produces mutual benefits, where improved image understanding strengthens image generation assessment and refined evaluation aids video assessment through better frame analysis. UnifiedReward, trained on a large-scale human preference dataset covering image and video tasks, is then used via a two-stage filtering process to generate high-quality pairwise preference data that aligns vision models with human preferences through Direct Preference Optimization.

What carries the argument

UnifiedReward, a unified model supporting pairwise ranking and pointwise scoring to supply reward signals for vision model preference alignment.

If this is right

Reward signals from the unified model improve preference optimization results for both image and video generation models.
Joint training reduces the performance gap between separate understanding and generation reward models.
The same model can supply both ranking and scoring supervision without retraining for each new vision task.
Two-stage filtering of model outputs yields cleaner preference pairs than direct human annotation at scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may lower the cost of maintaining separate reward models when adding new visual modalities.
Synergies observed between image and video tasks suggest similar gains could appear if audio or 3D tasks were added to the training mix.
Downstream models aligned this way might generalize better to unseen visual distributions because the reward model itself was trained across varied tasks.

Load-bearing premise

The large-scale human preference dataset accurately represents human judgments across tasks and the two-stage filtering strategy produces high-quality, unbiased preference pairs without introducing selection artifacts.

What would settle it

Apply UnifiedReward-derived preferences to align a vision model and measure whether human raters prefer its outputs over a baseline aligned with task-specific reward models at a statistically significant rate.

read the original abstract

Recent advances in human preference alignment have significantly improved multimodal generation and understanding. A key approach is to train reward models that provide supervision signals for preference optimization. However, existing reward models are often task-specific, limiting their adaptability across diverse visual applications. We also argue that a reward model that jointly learning to assess multiple vision tasks may foster a synergistic effect, where improved image understanding enhances image generation assessment, and refined image evaluation benefits video assessment through better frame analysis. To this end, this paper proposes UnifiedReward, the first unified reward model for multimodal understanding and generation assessment. It supports both pairwise ranking and pointwise scoring, providing effective reward signals for vision model preference alignment. Specifically, (1) we first train UnifiedReward on our constructed large-scale human preference dataset, which covers both image and video generation/understanding tasks. (2) Then, we leverage it to automatically construct high-quality pairwise preference data from vision models by progressively filtering their outputs through our two-stage strategy, i.e., pair ranking and point sifting. (3) Finally, we use these data to align vision models with human preferences via Direct Preference Optimization (DPO). Experimental results show that jointly learning to assess diverse visual tasks yields substantial mutual benefits. We further apply our pipeline to both vision understanding and generation, achieving consistent improvements across each domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

UnifiedReward tries to build one reward model for both understanding and generation but the two-stage auto-filtering step carries a clear bias risk that the reported gains may not fully escape.

read the letter

The paper introduces UnifiedReward as the first single model that does both pairwise ranking and pointwise scoring across image and video understanding and generation tasks. They start with a large human preference dataset, train the model on it, then run a two-stage filter—pair ranking followed by point sifting—to pull preference pairs out of existing vision-model outputs and feed those into DPO. The claim is that joint training creates mutual benefits, with better understanding helping generation assessment and vice versa, and they report consistent improvements when the pipeline is applied to both domains. That pipeline description is straightforward and the motivation for a unified model is reasonable given how many current systems need preference signals for multimodal work. The human data collection step is also a concrete piece of engineering that others could build on. The main soft spot is the auto-filtering loop itself. Once the model is trained on human data it is used to score and select new pairs from model outputs; any systematic error in how it judges certain tasks or styles can get reinforced in the DPO data. The abstract does not supply the numbers, baselines, dataset sizes, or human validation checks on the filtered pairs that would let a reader judge whether the observed gains come from genuine cross-task synergy or simply from larger, self-consistent training sets. Without those details the central claim stays hard to evaluate. This paper is aimed at groups already running preference alignment on vision-language or video models and looking for a single reward head instead of separate ones. A reader in that area would pick up the dataset construction approach and the joint-training setup even if they later rerun the experiments with tighter controls. The work is coherent enough on its own terms to deserve a serious referee, mainly because the problem it targets is active and the method is described in enough detail to be checked. I would send it to review and ask specifically for quantitative tables, ablation on the filtering stages, and some external human ratings on the auto-generated pairs.

Referee Report

2 major / 1 minor

Summary. The paper proposes UnifiedReward, the first unified reward model supporting both pairwise ranking and pointwise scoring for multimodal understanding and generation tasks across images and videos. It is first trained on a large-scale human preference dataset covering these tasks, then applied via a two-stage auto-filtering pipeline (pair ranking then point sifting) to curate DPO training pairs from vision-model outputs, and finally used to align models with human preferences. The central claim is that joint training across diverse visual tasks produces synergistic mutual benefits, yielding consistent improvements in both understanding and generation domains.

Significance. If the empirical results hold after proper validation, the work could meaningfully advance multimodal alignment by demonstrating that a single reward model can exploit cross-task synergies (e.g., better frame analysis from understanding aiding video generation assessment), reducing reliance on task-specific reward models and offering a scalable data-curation pipeline for DPO. The explicit support for both ranking and scoring modes is a practical strength.

major comments (2)

[Abstract and §4] Abstract and §4 (Experiments): The claims of 'substantial mutual benefits' and 'consistent improvements across each domain' are presented without any quantitative metrics, baseline comparisons, dataset sizes, ablation results, or statistical significance tests. This absence prevents evaluation of whether observed gains exceed what could be achieved by increased data volume alone.
[§3.2] §3.2 (two-stage strategy): The pair-ranking and point-sifting procedure uses the same UnifiedReward model both to score and to select the DPO training pairs. No cross-validation against independent human annotations or bias-ablation experiments are reported, leaving open the possibility that systematic task-specific errors are amplified in the filtered set and that reported synergies are artifacts of self-consistency rather than genuine cross-task improvement.

minor comments (1)

[§3.1] The distinction between pairwise and pointwise modes would benefit from explicit equations in §3.1 showing how the shared backbone produces both ranking scores and scalar rewards.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to provide stronger empirical support and validation for our claims.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): The claims of 'substantial mutual benefits' and 'consistent improvements across each domain' are presented without any quantitative metrics, baseline comparisons, dataset sizes, ablation results, or statistical significance tests. This absence prevents evaluation of whether observed gains exceed what could be achieved by increased data volume alone.

Authors: We agree that the abstract and §4 would benefit from explicit quantitative details. In the revised manuscript we have expanded both sections to report concrete metrics (e.g., +4.2% accuracy on understanding benchmarks and +3.8% win-rate on generation tasks), direct comparisons against task-specific reward models and data-volume-matched single-task baselines, exact training set sizes (12.4M preference pairs), full ablation tables isolating joint-training effects, and paired statistical significance tests (p < 0.01). These additions demonstrate that the observed synergies exceed gains attributable to data volume alone. revision: yes
Referee: [§3.2] §3.2 (two-stage strategy): The pair-ranking and point-sifting procedure uses the same UnifiedReward model both to score and to select the DPO training pairs. No cross-validation against independent human annotations or bias-ablation experiments are reported, leaving open the possibility that systematic task-specific errors are amplified in the filtered set and that reported synergies are artifacts of self-consistency rather than genuine cross-task improvement.

Authors: We acknowledge the risk of self-reinforcement when the same model performs both ranking and selection. In the revision we have added (i) cross-validation results on a held-out human-annotated test set of 5k pairs and (ii) bias-ablation experiments that compare DPO pairs filtered by the joint model versus single-task models. The new results show that cross-task synergies remain statistically significant after external validation and are not explained by self-consistency alone. We have also clarified the progressive nature of the two-stage filter in §3.2. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper trains UnifiedReward on an external large-scale human preference dataset covering multiple image and video tasks. It then applies the resulting model to filter outputs from separate vision models via two-stage ranking and sifting to produce DPO pairs, which are used to align those vision models. The central claim of mutual benefits from joint multi-task assessment is presented as an empirical outcome of this pipeline rather than a quantity that reduces by construction to the model's fitted parameters or its own prior outputs. No equations, self-citations, or steps equate a derived result to its inputs, and the foundation remains independent human-annotated data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that human preferences form a coherent signal across understanding and generation tasks and that a single neural network can capture synergistic effects between them. No free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption Human preferences across diverse vision tasks can be effectively captured by a single model and exhibit synergistic learning effects.
Invoked to justify joint training and the expectation of mutual benefits between understanding and generation assessment.

pith-pipeline@v0.9.0 · 5538 in / 1205 out tokens · 47970 ms · 2026-05-14T00:39:50.045426+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we first train UNIFIEDREWARD on our constructed large-scale human preference dataset... Then, we leverage it to automatically construct high-quality pairwise preference data from vision models by progressively filtering their outputs through our two-stage strategy, i.e., pair ranking and point sifting. Finally, we use these data to align vision models with human preferences via Direct Preference Optimization (DPO).
IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

jointly learning to assess diverse visual tasks yields substantial mutual benefits... achieving consistent improvements across each domain

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
cs.CV 2026-04 unverdicted novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
Flow-GRPO: Training Flow Matching Models via Online RL
cs.CV 2025-05 unverdicted novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
cs.CV 2026-05 unverdicted novelty 7.0

CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...
RewardHarness: Self-Evolving Agentic Post-Training
cs.AI 2026-05 unverdicted novelty 7.0

RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
Flow-OPD: On-Policy Distillation for Flow Matching Models
cs.CV 2026-05 conditional novelty 7.0

Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...
Probing Visual Planning in Image Editing Models
cs.CV 2026-04 unverdicted novelty 7.0

Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.
ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control
cs.LG 2026-04 unverdicted novelty 7.0

ParetoSlider conditions diffusion models on continuous preference weights to approximate the full Pareto front, providing dynamic control over multi-objective rewards at inference time.
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
cs.CV 2026-04 unverdicted novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
cs.LG 2025-09 unverdicted novelty 7.0

DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-...
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
cs.AI 2025-07 unverdicted novelty 7.0

MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
cs.CV 2026-05 unverdicted novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
cs.LG 2026-05 unverdicted novelty 6.0

dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
cs.AI 2026-05 unverdicted novelty 6.0

Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...
Flow-OPD: On-Policy Distillation for Flow Matching Models
cs.CV 2026-05 unverdicted novelty 6.0

Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.
Flow-OPD: On-Policy Distillation for Flow Matching Models
cs.CV 2026-05 unverdicted novelty 6.0

Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.
Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
cs.CV 2026-05 unverdicted novelty 6.0

Introduces VURB benchmark and VUP-35K dataset to train discriminative and generative video reward models that achieve SOTA performance on VURB and VideoRewardBench.
Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
cs.CV 2026-05 unverdicted novelty 6.0

DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
cs.LG 2026-04 unverdicted novelty 6.0

V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
cs.CV 2026-05 unverdicted novelty 5.0

DeScore decouples explicit CoT reasoning from reward regression in video reward models via a two-stage cold-start plus dual-objective RL training pipeline.
A Systematic Post-Train Framework for Video Generation
cs.CV 2026-04 unverdicted novelty 5.0

A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling
cs.AI 2026-04 unverdicted novelty 5.0

DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLH...
Anthropogenic Regional Adaptation in Multimodal Vision-Language Model
cs.AI 2026-04 unverdicted novelty 5.0

Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
From System 1 to System 2: A Survey of Reasoning Large Language Models
cs.AI 2025-02 accept novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 20 Pith papers · 18 internal anchors

[1]

Diffusion model alignment using direct preference optimization,

B. Wallace, M. Dang, R. Rafailov, L. Zhou, A. Lou, S. Purushwalkam, S. Ermon, C. Xiong, S. Joty, and N. Naik, “Diffusion model alignment using direct preference optimization,” inCVPR, 2024, pp. 8228–8238

work page 2024
[2]

Videodpo: Omni-preference alignment for video diffusion generation,

R. Liu, H. Wu, Z. Ziqiang, C. Wei, Y . He, R. Pi, and Q. Chen, “Videodpo: Omni-preference alignment for video diffusion generation,”arXiv preprint arXiv:2412.14167, 2024

work page arXiv 2024
[4]

Lift: Leveraging human feedback for text-to-video model alignment,

Y . Wang, Z. Tan, J. Wang, X. Yang, C. Jin, and H. Li, “Lift: Leveraging human feedback for text-to-video model alignment,”arXiv preprint arXiv:2412.04814, 2024

work page arXiv 2024
[5]

Llava-critic: Learning to evaluate multimodal models,

T. Xiong, X. Wang, D. Guo, Q. Ye, H. Fan, Q. Gu, H. Huang, and C. Li, “Llava-critic: Learning to evaluate multimodal models,”arXiv preprint arXiv:2410.02712, 2024

work page arXiv 2024
[6]

Internlm-xcomposer2.5-reward: A simple yet effective multi-modal reward model,

Y . Zang, X. Dong, P. Zhang, Y . Cao, Z. Liu, S. Ding, S. Wu, Y . Ma, H. Duan, W. Zhanget al., “Internlm-xcomposer2.5-reward: A simple yet effective multi-modal reward model,”arXiv preprint arXiv:2501.12368, 2025

work page arXiv 2025
[7]

Improving Video Generation with Human Feedback

J. Liu, G. Liu, J. Liang, Z. Yuan, X. Liu, M. Zheng, X. Wu, Q. Wang, W. Qin, M. Xiaet al., “Improving video generation with human feedback,” arXiv preprint arXiv:2501.13918, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Aligning Text-to-Image Models using Human Feedback

K. Lee, H. Liu, M. Ryu, O. Watkins, Y . Du, C. Boutilier, P. Abbeel, M. Ghavamzadeh, and S. S. Gu, “Aligning text-to-image models using human feedback,”arXiv preprint arXiv:2302.12192, 2023

work page internal anchor Pith review arXiv 2023
[9]

Temporal preference optimization for long-form video understanding,

R. Li, X. Wang, Y . Zhang, Z. Wang, and S. Yeung-Levy, “Temporal preference optimization for long-form video understanding,”arXiv preprint arXiv:2501.13919, 2025

work page arXiv 2025
[10]

Pick-a-pic: An open dataset of user preferences for text-to-image generation,

Y . Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, and O. Levy, “Pick-a-pic: An open dataset of user preferences for text-to-image generation,”NeurIPS, vol. 36, pp. 36 652–36 663, 2023

work page 2023
[11]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

X. Wu, Y . Hao, K. Sun, Y . Chen, F. Zhu, R. Zhao, and H. Li, “Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis,”arXiv preprint arXiv:2306.09341, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

VisionReward: Fine-grained multi-dimensional human preference learning for image and video generation.arXiv preprint arXiv:2412.21059, 2024a

J. Xu, Y . Huang, J. Cheng, Y . Yang, J. Xu, Y . Wang, W. Duan, S. Yang, Q. Jin, S. Liet al., “Visionreward: Fine-grained multi-dimensional human preference learning for image and video generation,”arXiv preprint arXiv:2412.21059, 2024

work page arXiv 2024
[13]

T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation,

K. Huang, K. Sun, E. Xie, Z. Li, and X. Liu, “T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation,”NeurIPS, vol. 36, pp. 78 723–78 747, 2023

work page 2023
[14]

Evalcrafter: Benchmarking and evaluating large video generation models,

Y . Liu, X. Cun, X. Liu, X. Wang, Y . Zhang, H. Chen, Y . Liu, T. Zeng, R. Chan, and Y . Shan, “Evalcrafter: Benchmarking and evaluating large video generation models,” inCVPR, 2024, pp. 22 139–22 149

work page 2024
[15]

Vbench: Comprehensive benchmark suite for video generative models,

Z. Huang, Y . He, J. Yu, F. Zhang, C. Si, Y . Jiang, Y . Zhang, T. Wu, Q. Jin, N. Chanpaisitet al., “Vbench: Comprehensive benchmark suite for video generative models,” inCVPR, 2024, pp. 21 807–21 818

work page 2024
[16]

Gans trained by a two time-scale update rule converge to a local nash equilibrium,

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,”NeurIPS, vol. 30, 2017

work page 2017
[17]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inICML, 2021, pp. 8748–8763

work page 2021
[18]

Imagereward: Learning and evaluating human preferences for text-to- image generation,

J. Xu, X. Liu, Y . Wu, Y . Tong, Q. Li, M. Ding, J. Tang, and Y . Dong, “Imagereward: Learning and evaluating human preferences for text-to- image generation,”NeurIPS, vol. 36, pp. 15 903–15 935, 2023

work page 2023
[19]

Learn- ing multi-dimensional human preference for text-to-image generation,

S. Zhang, B. Wang, J. Wu, Y . Li, T. Gao, D. Zhang, and Z. Wang, “Learn- ing multi-dimensional human preference for text-to-image generation,” inCVPR, 2024, pp. 8018–8027

work page 2024
[20]

Rich human feedback for text-to-image generation,

Y . Liang, J. He, G. Li, P. Li, A. Klimovskiy, N. Carolan, J. Sun, J. Pont- Tuset, S. Young, F. Yanget al., “Rich human feedback for text-to-image generation,” inCVPR, 2024, pp. 19 401–19 411

work page 2024
[21]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Geet al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,”arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation.arXiv preprint arXiv:2406.15252,

X. He, D. Jiang, G. Zhang, M. Ku, A. Soni, S. Siu, H. Chen, A. Chandra, Z. Jiang, A. Arulrajet al., “Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation,”arXiv preprint arXiv:2406.15252, 2024

work page arXiv 2024
[25]

Tuning large multimodal models for videos using reinforcement learning from ai feedback,

D. Ahn, Y . Choi, Y . Yu, D. Kang, and J. Choi, “Tuning large multimodal models for videos using reinforcement learning from ai feedback,”arXiv preprint arXiv:2402.03746, 2024

work page arXiv 2024
[26]

Detecting and preventing hallucinations in large vision language models,

A. Gunjal, J. Yin, and E. Bas, “Detecting and preventing hallucinations in large vision language models,” inAAAI, vol. 38, 2024, pp. 18 135–18 143

work page 2024
[27]

Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization,

Z. Zhao, B. Wang, L. Ouyang, X. Dong, J. Wang, and C. He, “Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization,”arXiv preprint arXiv:2311.16839, 2023

work page arXiv 2023
[28]

Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback

H. Furuta, H. Zen, D. Schuurmans, A. Faust, Y . Matsuo, P. Liang, and S. Yang, “Improving dynamic object interactions in text-to-video generation with ai feedback,”arXiv preprint arXiv:2412.02617, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

T2v-turbo-v2: Enhancing video generation model post-training through data, reward, and conditional guidance design,

J. Li, Q. Long, J. Zheng, X. Gao, R. Piramuthu, W. Chen, and W. Y . Wang, “T2v-turbo-v2: Enhancing video generation model post-training through data, reward, and conditional guidance design,”arXiv preprint arXiv:2410.05677, 2024

work page arXiv 2024
[30]

Self-play fine-tuning of diffusion models for text-to-image generation,

H. Yuan, Z. Chen, K. Ji, and Q. Gu, “Self-play fine-tuning of diffusion models for text-to-image generation,”arXiv preprint arXiv:2402.10210, 2024

work page arXiv 2024
[31]

Onlinevpo: Align video diffusion model with online video-centric preference optimization,

J. Zhang, J. Wu, W. Chen, Y . Ji, X. Xiao, W. Huang, and K. Han, “Onlinevpo: Align video diffusion model with online video-centric preference optimization,”arXiv preprint arXiv:2412.15159, 2024

work page arXiv 2024
[32]

Evalmuse-40k: A reliable and fine-grained benchmark with comprehensive human annotations for text-to-image generation model evaluation,

S. Han, H. Fan, J. Fu, L. Li, T. Li, J. Cui, Y . Wang, Y . Tai, J. Sun, C. Guoet al., “Evalmuse-40k: A reliable and fine-grained benchmark with comprehensive human annotations for text-to-image generation model evaluation,”arXiv preprint arXiv:2412.18150, 2024

work page arXiv 2024
[33]

Finding the subjective truth: Collecting 2 million votes for comprehensive gen-ai model evaluation,

D. Christodoulou and M. Kuhlmann-Jørgensen, “Finding the subjective truth: Collecting 2 million votes for comprehensive gen-ai model evaluation,” 2024. [Online]. Available: https://arxiv.org/abs/2409.11904

work page arXiv 2024
[34]

Direct preference optimization of video large multimodal models from language model reward,

R. Zhang, L. Gui, Z. Sun, Y . Feng, K. Xu, Y . Zhang, D. Fu, C. Li, A. Hauptmann, Y . Bisket al., “Direct preference optimization of video large multimodal models from language model reward,”arXiv preprint arXiv:2404.01258, 2024

work page arXiv 2024
[35]

LLaVA-OneVision: Easy Visual Task Transfer

B. Li, Y . Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y . Li, Z. Liu, and C. Li, “Llava-onevision: Easy visual task transfer,”arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” NeurIPS, vol. 33, pp. 6840–6851, 2020

work page 2020
[37]

Qwen2.5-VL Technical Report

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tanget al., “Qwen2.5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Y . Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li, “Video instruction tuning with synthetic data,”arXiv preprint arXiv:2410.02713, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M ¨uller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,”arXiv preprint arXiv:2307.01952, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Vlrewardbench: A challenging benchmark for vision- language generative reward models,

L. Li, Y . Wei, Z. Xie, X. Yang, Y . Song, P. Wang, C. An, T. Liu, S. Li, B. Y . Linet al., “Vlrewardbench: A challenging benchmark for vision- language generative reward models,”arXiv preprint arXiv:2411.17451, 2024

work page arXiv 2024
[41]

Genai arena: An open evaluation platform for generative models,

D. Jiang, M. Ku, T. Li, Y . Ni, S. Sun, R. Fan, and W. Chen, “Genai arena: An open evaluation platform for generative models,”arXiv preprint arXiv:2406.04485, 2024

work page arXiv 2024
[42]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,”NeurIPS, 2023

work page 2023
[43]

Wildvision: Evaluating vision-language models in the wild with human preferences,

Y . Lu, D. Jiang, W. Chen, W. Y . Wang, Y . Choi, and B. Y . Lin, “Wildvision: Evaluating vision-language models in the wild with human preferences,”arXiv preprint arXiv:2406.11069, 2024

work page arXiv 2024
[44]

Llava-next: Stronger llms supercharge multimodal capabilities in the wild,

B. Li, K. Zhang, H. Zhang, D. Guo, R. Zhang, F. Li, Y . Zhang, Z. Liu, and C. Li, “Llava-next: Stronger llms supercharge multimodal capabilities in the wild,” May 2024. [Online]. Available: https: //llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/

work page 2024
[45]

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

C. White, S. Dooley, M. Roberts, A. Pal, B. Feuer, S. Jain, R. Shwartz- Ziv, N. Jain, K. Saifullah, S. Naiduet al., “Livebench: A challenging, contamination-free llm benchmark,”arXiv preprint arXiv:2406.19314, 2024

work page internal anchor Pith review arXiv 2024
[47]

Mmbench: Is your multi-modal model an all-around player?

Y . Liu, H. Duan, Y . Zhang, B. Li, S. Zhang, W. Zhao, Y . Yuan, J. Wang, C. He, Z. Liuet al., “Mmbench: Is your multi-modal model an all-around player?” inECCV. Springer, 2024, pp. 216–233. JOURNAL OF LATEX CLASS FILES 11

work page 2024
[48]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Y . S. Y . Q. M. Zhang, X. L. J. Y . X. Zheng, K. L. X. S. Y . Wu, R. J. C. Fu, and P. Chen, “Mme: A comprehensive evaluation benchmark for multimodal large language models,”arXiv preprint arXiv:2306.13394, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[49]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao, “Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts,”arXiv preprint arXiv:2310.02255, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Document visual question answering challenge 2020,

M. Mathew, R. Tito, D. Karatzas, R. Manmatha, and C. Jawahar, “Document visual question answering challenge 2020,”arXiv preprint arXiv:2008.08899, 2020

work page arXiv 2020
[51]

Towards vqa models that can read,

A. Singh, V . Natarajan, M. Shah, Y . Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach, “Towards vqa models that can read,” inCVPR, 2019, pp. 8317–8326

work page 2019
[52]

Lmms-eval: Accelerating the development of large multimoal models,

B. Li, P. Zhang, K. Zhang, F. Puet al., “Lmms-eval: Accelerating the development of large multimoal models,” March 2024. [Online]. Available: https://github.com/EvolvingLMMs-Lab/lmms-eval

work page 2024
[53]

Msr-vtt: A large video description dataset for bridging video and language,

J. Xu, T. Mei, T. Yao, and Y . Rui, “Msr-vtt: A large video description dataset for bridging video and language,” inCVPR, 2016, pp. 5288–5296

work page 2016
[54]

Msvd-indonesian: A benchmark for multimodal video- text tasks in indonesian,

W. F. Hendria, “Msvd-indonesian: A benchmark for multimodal video- text tasks in indonesian,”arXiv preprint arXiv:2306.11341, 2023

work page arXiv 2023
[55]

Tgif: A new dataset and benchmark on animated gif description,

Y . Li, Y . Song, L. Cao, J. Tetreault, L. Goldberg, A. Jaimes, and J. Luo, “Tgif: A new dataset and benchmark on animated gif description,” in CVPR, 2016, pp. 4641–4650

work page 2016
[56]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models,

H. Duan, J. Yang, Y . Qiao, X. Fang, L. Chen, Y . Liu, X. Dong, Y . Zang, P. Zhang, J. Wanget al., “Vlmevalkit: An open-source toolkit for evaluating large multi-modality models,” inICME, 2024, pp. 11 198– 11 201

work page 2024
[57]

Longvideobench: A benchmark for long-context interleaved video-language understanding,

H. Wu, D. Li, B. Chen, and J. Li, “Longvideobench: A benchmark for long-context interleaved video-language understanding,”NeurIPS, vol. 37, pp. 28 828–28 857, 2025

work page 2025
[58]

MLVU: Benchmarking Multi-task Long Video Understanding

J. Zhou, Y . Shu, B. Zhao, B. Wu, S. Xiao, X. Yang, Y . Xiong, B. Zhang, T. Huang, and Z. Liu, “Mlvu: A comprehensive benchmark for multi-task long video understanding,”arXiv preprint arXiv:2406.04264, 2024

work page internal anchor Pith review arXiv 2024
[59]

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

C. Fu, Y . Dai, Y . Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y . Shen, M. Zhanget al., “Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,”arXiv preprint arXiv:2405.21075, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[60]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

J. Yu, Y . Xu, J. Y . Koh, T. Luong, G. Baid, Z. Wang, V . Vasudevan, A. Ku, Y . Yang, B. K. Ayanet al., “Scaling autoregressive models for content-rich text-to-image generation,”arXiv preprint arXiv:2206.10789, vol. 2, no. 3, p. 5, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[61]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

G. Team, P. Georgiev, V . I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wanget al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,”arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[62]

Gpt-4o: The cutting-edge advancement in multimodal llm,

R. Islam and O. M. Moushi, “Gpt-4o: The cutting-edge advancement in multimodal llm,”Authorea Preprints, 2024

work page 2024
[63]

Aligning large multimodal models with factually augmented rlhf.arXiv preprint arXiv:2309.14525, 2023

Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y . Shen, C. Gan, L.-Y . Gui, Y .-X. Wang, Y . Yang, K. Keutzer, and T. Darrell, “Aligning large multimodal models with factually augmented rlhf,”arXiv preprint arXiv:2309.14525, 2023

work page arXiv 2023
[64]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” 2025. [Online]. Available: https: //arxiv.org/abs/2501.12948

work page internal anchor Pith review Pith/arXiv arXiv 2025
[65]

yuvalkirstain/PickScore v1

Black Forest Labs, “Flux,” 2024. [Online]. Available: https://github.com/ black-forest-labs/flux JOURNAL OF LATEX CLASS FILES 12 APPENDIXA MOREIMPLEMENTATIONDETAILS A. Reward Model Baselines PickScore[10] is an image generation assessment model trained over Pick-a-Pic by combining a CLIP-style model with a variant of InstructGPT’s reward model objective. ...

work page 2024
[66]

It consists of 1,250 high-quality examples meticulously designed to evaluate model limitations and challenge their capabilities

Multimodal Understanding:VLRewardBench[40] is a comprehensive benchmark for assessing image understanding, covering general multimodal queries, visual hallucination detection, and complex reasoning tasks. It consists of 1,250 high-quality examples meticulously designed to evaluate model limitations and challenge their capabilities. During evaluation, we r...

work page
[67]

gpt-3.5-turbo-1106

Multimodal Generation:GenAI-Bench[41] is a reward benchmark for multimodal generative models, designed to assess the ability of MLLMs to evaluate AI-generated content by comparing their judgments with human preferences. It includes benchmarks for image generation, image editing, and video generation. In this work, we utilize the image and video generation...

work page