Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis
Pith reviewed 2026-05-11 08:23 UTC · model grok-4.3
The pith
Fine-tuning CLIP on a large bias-reduced dataset of human image choices creates a scorer that aligns better with human judgments on text-to-image outputs than prior metrics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By fine-tuning CLIP on HPD v2, which comprises 798,090 human preference choices on 433,760 pairs of images from diverse sources, we obtain HPS v2 that more accurately predicts human preferences on generated images, generalizes better across various image distributions, and is responsive to algorithmic improvements of text-to-image generative models.
What carries the argument
HPS v2, the scoring model obtained by fine-tuning CLIP on the HPD v2 human preference dataset, used to rank and compare outputs from text-to-image generative models.
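A minimal sketch of how a CLIP-style preference scorer could rank candidate images for a prompt, together with a pairwise loss that fine-tuning on preference choices might minimize. The embeddings are toy vectors rather than real CLIP outputs, and the function names (`cosine`, `preference_loss`, `rank_images`), the temperature value, and the two-way softmax form of the loss are illustrative assumptions, not the paper's published formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def preference_loss(text_emb, preferred_emb, rejected_emb, temperature=0.07):
    """Cross-entropy over a two-way softmax of temperature-scaled similarities.

    Assumed form: the loss is small when the preferred image scores higher
    than the rejected one, and large when the order is reversed.
    """
    s_pref = cosine(text_emb, preferred_emb) / temperature
    s_rej = cosine(text_emb, rejected_emb) / temperature
    # -log softmax(s_pref) equals log(1 + exp(s_rej - s_pref)); computed stably.
    diff = s_rej - s_pref
    if diff > 0:
        return diff + math.log1p(math.exp(-diff))
    return math.log1p(math.exp(diff))

def rank_images(text_emb, image_embs):
    """Return image indices sorted from highest to lowest score."""
    scores = [cosine(text_emb, e) for e in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)

# Toy 2-D embeddings standing in for encoder outputs (hypothetical values).
prompt = [1.0, 0.0]
images = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]

ranking = rank_images(prompt, images)
loss_good = preference_loss(prompt, images[1], images[0])  # human choice agrees with score
loss_bad = preference_loss(prompt, images[0], images[1])   # human choice disagrees with score
```

In this toy setup the loss is lower when the human-preferred image is also the one closer to the prompt embedding, which is the signal a fine-tuning step would push the encoders toward.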
If this is right
- Allows more reliable comparison of recent text-to-image models from academic, community, and industry sources via a shared benchmark.
- Detects when algorithmic changes improve outputs in ways that match human taste rather than proxy scores.
- Supports stable, fair, and easy-to-use evaluation by guiding the design of text prompts used during scoring.
- Provides a dataset and model that can serve as a drop-in replacement for weaker automatic metrics in research pipelines.
Where Pith is reading between the lines
- Researchers could close the loop by using HPS v2 as a training signal inside generative models instead of only for post-hoc evaluation.
- The same preference-collection approach might transfer to related tasks such as text-to-video or image editing where human alignment is also hard to measure.
- Widespread adoption could shift model development away from optimizing for FID or CLIP score toward outputs that survive direct human comparison.
- Periodic retraining of the scorer on new preference data would be needed to keep pace with rapid changes in generative model capabilities.
Load-bearing premise
The collected human preferences are unbiased and representative enough that fine-tuning CLIP on them produces a scorer that continues to align with human judgments on future unseen models and image distributions.
What would settle it
Gather fresh human preference judgments on images from a new text-to-image model released after HPD v2 collection, then measure whether HPS v2 correlates more strongly with those judgments than earlier metrics such as CLIP score or FID.
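The test described above can be sketched as a pairwise agreement check: for each freshly labeled image pair, ask whether a metric scores the human-preferred image higher. The function name `pairwise_accuracy`, the tie-handling rule, and all numbers below are hypothetical and serve only to show the shape of the comparison.

```python
def pairwise_accuracy(scores_a, scores_b, human_prefers_a):
    """Fraction of pairs where the metric picks the image humans preferred.

    scores_a[i] and scores_b[i] are the metric's scores for the two images
    in pair i; human_prefers_a[i] is True when annotators chose image A.
    Metric ties are counted as disagreements (an assumption of this sketch).
    """
    if not (len(scores_a) == len(scores_b) == len(human_prefers_a)):
        raise ValueError("all inputs must have the same length")
    if not scores_a:
        raise ValueError("need at least one pair")
    correct = 0
    for a, b, prefers_a in zip(scores_a, scores_b, human_prefers_a):
        if a == b:
            continue  # tie: the metric expresses no preference
        if (a > b) == prefers_a:
            correct += 1
    return correct / len(scores_a)

# Hypothetical human labels and metric scores for four fresh image pairs.
human = [True, False, True, True]
metric_x = pairwise_accuracy([0.9, 0.2, 0.7, 0.6], [0.1, 0.8, 0.3, 0.4], human)
metric_y = pairwise_accuracy([0.9, 0.8, 0.2, 0.6], [0.1, 0.2, 0.7, 0.4], human)
```

Running the same function over HPS v2 and over a baseline on identical pairs gives two directly comparable agreement rates. Note that FID is a distribution-level statistic and does not score single images, so it would need a different comparison than this per-pair check.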
Original abstract
Recent text-to-image generative models can generate high-fidelity images from text inputs, but the quality of these generated images cannot be accurately evaluated by existing evaluation metrics. To address this issue, we introduce Human Preference Dataset v2 (HPD v2), a large-scale dataset that captures human preferences on images from a wide range of sources. HPD v2 comprises 798,090 human preference choices on 433,760 pairs of images, making it the largest dataset of its kind. The text prompts and images are deliberately collected to eliminate potential bias, which is a common issue in previous datasets. By fine-tuning CLIP on HPD v2, we obtain Human Preference Score v2 (HPS v2), a scoring model that can more accurately predict human preferences on generated images. Our experiments demonstrate that HPS v2 generalizes better than previous metrics across various image distributions and is responsive to algorithmic improvements of text-to-image generative models, making it a preferable evaluation metric for these models. We also investigate the design of the evaluation prompts for text-to-image generative models, to make the evaluation stable, fair and easy-to-use. Finally, we establish a benchmark for text-to-image generative models using HPS v2, which includes a set of recent text-to-image models from the academic, community and industry. The code and dataset is available at https://github.com/tgxs002/HPSv2 .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Human Preference Dataset v2 (HPD v2), comprising 798,090 human preference choices over 433,760 image pairs drawn from diverse text-to-image sources, with deliberate collection to reduce bias. Fine-tuning CLIP on HPD v2 yields Human Preference Score v2 (HPS v2), which the authors claim generalizes better than prior metrics (e.g., CLIP, Aesthetic Score) across image distributions and responds to algorithmic improvements in generative models. The work also examines prompt design for stable evaluation and releases a benchmark ranking recent T2I models from academia, community, and industry.
Significance. If the generalization and responsiveness claims hold under rigorous validation, HPS v2 would supply a human-aligned, practical metric that improves upon distribution-based scores like FID or uncalibrated CLIP similarity for T2I evaluation. The scale of HPD v2 and the public benchmark constitute a concrete resource for the field, provided the scorer's alignment persists on future model families.
Major comments (3)
- [§4] §4 (Experiments on generalization): The central claim that HPS v2 'generalizes better than previous metrics across various image distributions' is supported only by comparisons on image sets drawn from the same pool of source models used to build HPD v2. No temporal or architectural hold-out is reported in which entire model families (e.g., post-2023 diffusion variants or novel architectures) are excluded from training data yet included in test distributions, leaving the responsiveness-to-improvements result vulnerable to distribution shift.
- [§3.2] §3.2 (HPS v2 training) and Table 2: The fine-tuning procedure is described at a high level, but the manuscript provides neither the exact loss formulation nor the learning-rate schedule, and no ablation on the number of negative pairs per prompt. Without these details it is impossible to assess whether the reported gains over baseline CLIP are due to the preference data itself or to hyper-parameter choices.
- [§5] §5 (Benchmark): The ranking of models is presented without error bars, inter-rater agreement statistics on the human labels, or a sensitivity analysis to prompt wording. This weakens the assertion that HPS v2 yields a 'stable, fair and easy-to-use' evaluation protocol.
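The architectural hold-out requested in the first major comment could look roughly like the split below, where whole model families are excluded from training and reserved for testing. The record layout, the family labels, and the rule that a pair is held out if either image comes from an excluded family are assumptions of this sketch, not details taken from the paper.

```python
def family_holdout_split(pairs, held_out_families):
    """Split preference pairs so held-out families never appear in training.

    A pair goes to the test split if either of its images comes from a
    held-out family, and to the training split otherwise.
    """
    held_out = set(held_out_families)
    train, test = [], []
    for pair in pairs:
        if held_out.intersection(pair["families"]):
            test.append(pair)
        else:
            train.append(pair)
    return train, test

# Hypothetical records: each pair lists the source family of both images.
pairs = [
    {"pair_id": 0, "families": ("family_a", "family_b")},
    {"pair_id": 1, "families": ("family_a", "family_c")},
    {"pair_id": 2, "families": ("family_b", "family_b")},
    {"pair_id": 3, "families": ("family_c", "family_c")},
]
train_pairs, test_pairs = family_holdout_split(pairs, ["family_c"])
```

A stricter variant would keep only pairs in which both images come from held-out families as the test set; which rule is appropriate depends on whether mixed pairs are considered contaminated.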
Minor comments (3)
- [Abstract / §2.1] The abstract states that HPD v2 'eliminates potential bias' but does not quantify residual prompt or demographic biases; a short paragraph in §2.1 citing the exact collection protocol would clarify this.
- [Figure 3] Figure 3 (qualitative examples) lacks axis labels and a legend indicating which images correspond to which model; this reduces readability.
- [§3.2] The GitHub link is given, but the manuscript does not specify the exact train/validation split sizes or the random seed used for fine-tuning, hindering reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the paper without altering its core claims.
Point-by-point responses
-
Referee: [§4] §4 (Experiments on generalization): The central claim that HPS v2 'generalizes better than previous metrics across various image distributions' is supported only by comparisons on image sets drawn from the same pool of source models used to build HPD v2. No temporal or architectural hold-out is reported in which entire model families (e.g., post-2023 diffusion variants or novel architectures) are excluded from training data yet included in test distributions, leaving the responsiveness-to-improvements result vulnerable to distribution shift.
Authors: We appreciate this point on rigorous generalization testing. Our Section 4 evaluations do include image sets from diverse sources such as community fine-tunes and industry models (e.g., Midjourney v5, DALL·E variants) whose outputs were not part of HPD v2 training collection, and HPS v2 shows improved correlation with human preferences on these. However, we agree that explicit architectural and temporal hold-outs would further substantiate the claims. In the revised manuscript, we will add new experiments that exclude specific post-2023 model families from HPS v2 training data and evaluate responsiveness on held-out newer architectures, to be included in an expanded Section 4.
Revision: yes
-
Referee: [§3.2] §3.2 (HPS v2 training) and Table 2: The fine-tuning procedure is described at a high level, but the manuscript provides neither the exact loss formulation nor the learning-rate schedule, and no ablation on the number of negative pairs per prompt. Without these details it is impossible to assess whether the reported gains over baseline CLIP are due to the preference data itself or to hyper-parameter choices.
Authors: We agree that the training details in Section 3.2 are insufficient for full reproducibility and attribution of gains. The current description was kept high-level to focus on the dataset contribution, but this was an oversight. In the revised manuscript, we will expand Section 3.2 and update Table 2 to specify the exact loss (a contrastive pairwise ranking loss on preference pairs), the learning-rate schedule (AdamW with cosine decay, initial LR of 1e-5), and include an ablation on the number of negative pairs per prompt. These additions will demonstrate that performance improvements are driven by HPD v2 rather than hyper-parameters alone.
Revision: yes
-
Referee: [§5] §5 (Benchmark): The ranking of models is presented without error bars, inter-rater agreement statistics on the human labels, or a sensitivity analysis to prompt wording. This weakens the assertion that HPS v2 yields a 'stable, fair and easy-to-use' evaluation protocol.
Authors: Thank you for noting these omissions in the benchmark presentation. In the revised Section 5, we will add error bars to the model rankings using bootstrap resampling over evaluation prompts. We will also include a sensitivity analysis varying prompt wording (e.g., adding descriptors or rephrasing) to quantify stability of HPS v2 scores. For inter-rater agreement on the underlying human labels, our collection prioritized scale with single annotations per pair; we will explicitly discuss this as a limitation and note how the dataset size helps average out individual variance.
Revision: partial
- Inter-rater agreement statistics cannot be computed because the HPD v2 collection process used single annotations per image pair to achieve the reported scale of 798k choices.
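The bootstrap resampling over evaluation prompts mentioned in the rebuttal can be sketched with a percentile interval on a model's mean score. The function name `bootstrap_mean_ci`, the number of resamples, the confidence level, and the per-prompt scores below are all assumptions made for illustration.

```python
import random

def bootstrap_mean_ci(per_prompt_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean score over prompts.

    Prompts are resampled with replacement; the interval spans the
    alpha/2 and 1 - alpha/2 quantiles of the resampled means.
    Returns (point_estimate, lower, upper).
    """
    n = len(per_prompt_scores)
    if n == 0:
        raise ValueError("need at least one prompt score")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = []
    for _ in range(n_resamples):
        sample = [per_prompt_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    point = sum(per_prompt_scores) / n
    return point, means[lo_idx], means[hi_idx]

# Hypothetical per-prompt scores for one model on eight evaluation prompts.
scores = [0.27, 0.31, 0.25, 0.29, 0.33, 0.26, 0.30, 0.28]
point, lower, upper = bootstrap_mean_ci(scores)
```

Two models whose intervals overlap heavily would not be reliably separated by the benchmark ranking. This interval reflects variation across prompts only; it says nothing about annotator disagreement, which the single-annotation design cannot measure.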
Circularity Check
No significant circularity; empirical training and held-out testing are independent
Full rationale
The paper explicitly collects HPD v2 human preference data, fine-tunes CLIP to produce HPS v2, and then reports generalization results on various image distributions. This is standard supervised learning with no self-definitional loop, no fitted parameter renamed as a prediction, and no load-bearing self-citation that reduces the central claim to its own inputs. The generalization experiments are presented as tests on independent distributions rather than tautological outputs of the training process.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human preferences over image pairs can be effectively captured and generalized by fine-tuning a pre-trained vision-language model such as CLIP on a large collected dataset.
Lean theorems connected to this paper
-
Foundation.LawOfExistence · defect_zero_iff_one · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Our experiments demonstrate that HPS v2 generalizes better than previous metrics across various image distributions and is responsive to algorithmic improvements of text-to-image generative models.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
-
Pareto-Guided Optimal Transport for Multi-Reward Alignment
PG-OT builds prompt-specific Pareto frontiers and applies distribution-aware optimal transport to improve multi-reward alignment while introducing JDR and JCR metrics to measure synergy and hacking.
-
Asymmetric Flow Models
Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...
-
STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models
STRIDE boosts diversity in one-step diffusion models by injecting PCA-aligned pink noise into transformer features while preserving text alignment and quality.
-
Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
Reinforce Adjoint Matching derives a simple consistency loss for RL post-training of diffusion models by tilting the clean distribution toward higher-reward samples under KL regularization while keeping the noising pr...
-
ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models
ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
-
Attention Sinks in Diffusion Transformers: A Causal Analysis
Suppressing attention sinks in diffusion transformers does not degrade text-image alignment or most preference metrics, revealing a dissociation between generation trajectory changes and semantic output quality.
-
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.
-
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
TMPO replaces scalar reward maximization with trajectory-level matching to a Boltzmann distribution via Softmax-TB, improving generative diversity by 9.1% while keeping competitive reward performance.
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...
-
LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling
LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.
-
Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models
ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.
-
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.
-
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
-
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...
-
Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation
OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.
-
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
-
Depth Adaptive Efficient Visual Autoregressive Modeling
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
-
Comparison Drives Preference: Reference-Aware Modeling for AI-Generated Video Quality Assessment
RefVQA uses a query-centered reference graph and graph-guided difference aggregation to improve AI-generated video quality assessment by incorporating inter-video comparisons.
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
-
OneHOI: Unifying Human-Object Interaction Generation and Editing
OneHOI unifies HOI generation and editing in one conditional diffusion transformer using role-aware tokens, structured attention, and joint training on mixed datasets to reach SOTA on both tasks.
-
SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models
SOAR is a reward-free on-policy method that supplies dense per-timestep supervision to correct exposure bias in diffusion model denoising trajectories, raising GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over ...
-
RewardFlow: Generate Images by Optimizing What You Reward
RewardFlow unifies differentiable rewards including a new VQA-based one and uses a prompt-aware adaptive policy with Langevin dynamics to achieve state-of-the-art image editing and compositional generation.
-
Personalizing Text-to-Image Generation to Individual Taste
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
-
Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling
HiVG introduces hierarchical SVG tokenization with atomic and segment tokens plus HMN initialization to enable more efficient and stable autoregressive generation of vector graphics programs.
-
1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.
-
SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis
SHARP applies a spectrum-aware dynamic RoPE scaling schedule that promotes resolution more strongly in early denoising stages and relaxes it later, outperforming static baselines on quality metrics for remote sensing images.
-
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
-
Unified Reward Model for Multimodal Understanding and Generation
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
-
HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling
HeatKV ranks attention heads by their focus on prior scales using offline calibration data and applies a static per-head pruning schedule, delivering 2x higher KV-cache compression than prior methods on the Infinity-2...
-
Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation
Skill-aligned annotation improves inter-annotator agreement and evaluation stability in text-to-image generation compared to uniform annotation baselines.
-
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
-
EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation
EPIC introduces predicate-guided inference-time search that lifts compositional T2I prompt accuracy from 34% to 71% on GenEval2 with 31-81% lower execution costs.
-
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
-
LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency
LimeCross enables text-guided editing of individual layers in composite images by conditioning on cross-layer context via bi-stream attention while preserving layer integrity and introducing the LayerEditBench benchmark.
-
Attention Sinks in Diffusion Transformers: A Causal Analysis
Suppressing attention sinks in diffusion transformers does not degrade CLIP-T alignment at moderate levels but induces sink-specific perceptual shifts six times larger than equal-budget random masking.
-
Removing the Watermark Is Not Enough: Forensic Stealth in Generative-AI Watermark Removal
Current AI image watermark removal attacks replace the watermark with a different forensic signal, allowing independent detectors to distinguish processed outputs from clean images at over 98% true-positive rate under...
-
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
-
Threshold-Guided Optimization for Visual Generative Models
A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.
-
Advancing Aesthetic Image Generation via Composition Transfer
Composer enables semantic-agnostic composition transfer from references and theme-driven planning via LVLMs to improve aesthetic quality in diffusion-based image generation.
-
Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models
SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.
-
Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization
Semi-DPO applies semi-supervised learning to noisy preference data in diffusion DPO by training first on consensus pairs then iteratively pseudo-labeling conflicts, yielding state-of-the-art alignment with complex hum...
-
POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation
POCA combines Pareto optimization with curriculum alignment to improve multi-reward reinforcement learning for visual text generation without relying on weighted sums.
-
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
-
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.
-
Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models
Reward Score Matching unifies reward-based fine-tuning for flow and diffusion models by recasting alignment as score matching to a value-guided target.
-
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.
-
Bias at the End of the Score
Reward models used as quality scorers in text-to-image generation encode demographic biases that cause reward-guided training to sexualize female subjects, reinforce stereotypes, and reduce diversity.
-
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
-
Generative Photomosaic with Structure-Aligned and Personalized Diffusion
The paper presents the first generative photomosaic framework that synthesizes tiles via structure-aligned diffusion models and few-shot personalization instead of color-based matching from large tile collections.
-
MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation
MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.
-
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
-
Improving Video Generation with Human Feedback
A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
-
Emu3: Next-Token Prediction is All You Need
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
-
Towards General Preference Alignment: Diffusion Models at Nash Equilibrium
Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.
-
A Systematic Post-Train Framework for Video Generation
A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
-
DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling
DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLH...
Reference graph
Works this paper leans on
-
[1]
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO Captions: Data Collection and Evaluation Server, 2015
work page 2015
-
[2]
Cogview2: Faster and better text-to-image generation via hierarchical transformers
Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to-image generation via hierarchical transformers. NeurIPS, 35:16890–16902, 2022
work page 2022
-
[3]
Taming transformers for high-resolution image synthesis
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, pages 12873–12883, 2021
work page 2021
-
[4]
Vector quantized diffusion model for text-to-image synthesis
Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In CVPR, pages 10696–10706, 2022
work page 2022
- [5]
-
[6]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, pages 16000–16009, 2022
work page 2022
-
[7]
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NeurIPS, 2017
work page 2017
-
[8]
Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, July 2021. If you use this software, please cite it as below
work page 2021
-
[9]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[10]
Pick-a-pic: An open dataset of user preferences for text-to-image generation, 2023
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation. arXiv preprint arXiv:2305.01569, 2023
-
[11]
Aligning Text-to-Image Models using Human Feedback
Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023
work page internal anchor Pith review arXiv 2023
-
[12]
AGIQA-3K: An Open Database for AI-Generated Image Quality Assessment, 2023
Chunyi Li, Zicheng Zhang, Haoning Wu, Wei Sun, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, and Weisi Lin. AGIQA-3K: An Open Database for AI-Generated Image Quality Assessment, 2023
-
[13]
FuseDream: Training-free text-to-image generation with improved CLIP+GAN space optimization
Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, and Qiang Liu. FuseDream: Training-free text-to-image generation with improved CLIP+GAN space optimization. arXiv preprint arXiv:2112.01573, 2021
-
[14]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017
-
[15]
Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images, 2023
Zeyu Lu, Di Huang, Lei Bai, Jingjing Qu, Chengyue Wu, Xihui Liu, and Wanli Ouyang. Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images, 2023
-
[16]
AVA: A large-scale database for aesthetic visual analysis
Naila Murray, Luca Marchesotti, and Florent Perronnin. AVA: A large-scale database for aesthetic visual analysis. In CVPR, pages 2408–2415, 2012
-
[17]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In ICML, 2021
-
[18]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. NeurIPS, 35:27730–27744, 2022
-
[19]
John David Pressman, Katherine Crowson, and Simulacra Captions Contributors. Simulacra Aesthetic Captions. Technical Report Version 1.0, Stability AI, 2022
-
[20]
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In ICML, 2021
-
[21]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. ArXiv, abs/2204.06125, 2022
-
[22]
Zero-Shot Text-to-Image Generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. ArXiv, abs/2102.12092, 2021
-
[23]
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, pages 10674–10685, 2022
-
[24]
Photorealistic text-to-image diffusion models with deep language understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 35:36479–36494, 2022
-
[25]
Improved techniques for training GANs
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016
-
[26]
Generating images of rare concepts using pre-trained diffusion models, 2023
Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, and Gal Chechik. It is all about where you start: Text-to-image generation with seed selection. arXiv preprint arXiv:2304.14530, 2023
- [27]
-
[28]
LAION-5B: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022
-
[29]
LAION-5b: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text ...
-
[30]
Proper reuse of image classification features improves object detection
Cristina Vasconcelos, Vighnesh Birodkar, and Vincent Dumoulin. Proper reuse of image classification features improves object detection. In CVPR, pages 13628–13637, 2022
-
[31]
DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models
Zijie J Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models. arXiv preprint arXiv:2210.14896, 2022
-
[32]
Better Aligning Text-to-Image Models with Human Preference, 2023
Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Better Aligning Text-to-Image Models with Human Preference, 2023
-
[33]
ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, 2023
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, 2023
-
[34]
Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile Diffusion: Text, Images and Variations All in One Diffusion Model. arXiv preprint arXiv:2211.08332, 2022
-
[35]
LiT: Zero-Shot Transfer With Locked-Image Text Tuning
Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: Zero-Shot Transfer With Locked-Image Text Tuning. In CVPR, pages 18123–18133, June 2022
-
[36]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586–595, 2018
-
[37]
A Perceptual Quality Assessment Exploration for AIGC Images, 2023
Zicheng Zhang, Chunyi Li, Wei Sun, Xiaohong Liu, Xiongkuo Min, and Guangtao Zhai. A Perceptual Quality Assessment Exploration for AIGC Images, 2023
-
[38]
Hype: A benchmark for human eye perceptual evaluation of generative models
Sharon Zhou, Mitchell Gordon, Ranjay Krishna, Austin Narcomey, Li F Fei-Fei, and Michael Bernstein. Hype: A benchmark for human eye perceptual evaluation of generative models. NeurIPS, 32, 2019
-
[39]
Lafite: Towards language-free training for text-to-image generation
Y Zhou, R Zhang, C Chen, C Li, C Tensmeyer, T Yu, J Gu, J Xu, and T Sun. LAFITE: Towards Language-Free Training for Text-to-Image Generation. arXiv preprint arXiv:2111.13792, 2021
Checklist
The checklist follows the references. Please read the checklist guidelines carefully for information on how to answer these questions. For each question, change...
-
[40]
For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes]
(c) Did you discuss any potential negative societal impacts of your work? [N/A]
(d) Have you read the ethics review guidelines and ensured that your paper con...
-
[41]
If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
-
[42]
If you ran experiments (e.g. for benchmarks)...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Please see the supplemental material.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes...
-
[43]
If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [N/A]
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes] We will release our dataset and pre-train mode...
-
[44]
If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] We will show our instructions given to the workers in the supplemental material.
(b) Did you describe any potential participant risks, with links to Institutional Review Boar...
- [45]
-
[46]
If the picture belongs to the style of “anime and cartoon”, reply only with “anime and cartoon”
-
[47]
If the picture belongs to the style of “real photo”, reply only with “real photo”
-
[48]
If the picture belongs to the style of “concept-art”, reply only with “concept-art”
-
[49]
If the picture doesn’t belong to any styles of above, reply only with “others”; You must reply with only one word.
Even though the prompts of the “Photo” category in HPD v2 are from COCO Captions [1], we retain “Photo” in the classification process to mitigate potential mistakes made by ChatGPT. The category distribution of HPD v2 is illustrated in Fig. 7. Add...
-
[50]
When Image (A) surpasses Image (B) in terms of aesthetic appeal and fidelity, or Image (B) suffers from severe distortion and blurriness, even if Image (B) aligns better with the prompt, Image (A) should take precedence over Image (B). For example, in Fig. 8, Fig. 8(b) lacks clear out...
Figure 8: Prompt: A pair of skis standing up against a gate.
-
[51]
When facing a dilemma in which the images are relatively similar in terms of aesthetics and personal preference, please carefully read and consider the prompt and sort based more on text-image alignment. For example, if you cannot make a choice based on personal preference, as in Fig. 10, please pay attention to the description, which refers to a mouse me...
-
[52]
It is crucial to pay special attention to capitalized names, as these names may lead to misunderstandings during the machine translation process. If there is any incorrectly translated proprietary term or content you are not familiar with, we recommend that you search for sample images and explanations online.
Figure 10: Prompt: A ginger hair...