BalancedDPO: Adaptive Multi-Metric Alignment

Aditya Malusare; Amrit Singh Bedi; Biplab Banerjee; Dipesh Tamboli; Souradip Chakraborty; Vaneet Aggarwal

arXiv: 2503.12575 · v2 · submitted 2025-03-16 · cs.CV · cs.AI

Pith reviewed 2026-05-22 23:35 UTC · model grok-4.3

keywords: diffusion models · preference alignment · Direct Preference Optimization · multi-metric alignment · text-to-image generation · majority vote · Stable Diffusion · human preference

The pith

BalancedDPO aligns text-to-image diffusion models to multiple conflicting metrics by integrating majority-vote consensus directly into the DPO training loop with dynamic reference updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the problem of aligning diffusion models when human preferences involve several, sometimes conflicting, metrics such as semantic consistency, aesthetics, and preference scores. It replaces single-metric optimization or simple reward averaging with a majority vote taken across several independent scorers and folds that vote into the standard DPO loss. Dynamic updates to the reference model are added inside the same loop to keep gradient directions stable across heterogeneous metrics. Experiments report higher win rates than prior DPO variants on three evaluation datasets and three different Stable Diffusion backbones. Ablations isolate the contribution of the voting step and the reference updates.

Core claim

BalancedDPO achieves multi-metric preference alignment within the Direct Preference Optimization paradigm by introducing a majority-vote consensus over multiple preference scorers and integrating it directly into the DPO training loop with dynamic reference model updates; this consensus-based formulation avoids reward-scale conflicts and produces more stable gradient directions across heterogeneous metrics.

What carries the argument

Majority-vote consensus over multiple heterogeneous preference scorers, inserted into the DPO loss together with dynamic reference-model updates.
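A minimal sketch of the voting step may help make this concrete. It assumes each scorer is a callable returning a float on its own scale; `table_scorer`, `majority_vote_pair`, and the three toy score tables are hypothetical stand-ins invented for illustration, not the paper's scorers or code.

```python
from typing import Callable, Optional, Sequence, Tuple

# A scorer maps (prompt, image_id) to a float on its own arbitrary scale.
Scorer = Callable[[str, str], float]


def table_scorer(table: dict) -> Scorer:
    """Hypothetical stand-in for a real scorer: looks scores up in a dict."""
    return lambda prompt, image_id: table[image_id]


def majority_vote_pair(
    prompt: str, image_a: str, image_b: str, scorers: Sequence[Scorer]
) -> Optional[Tuple[str, str]]:
    """Return (winner, loser) by majority vote, or None on an overall tie."""
    votes_a = votes_b = 0
    for scorer in scorers:
        score_a, score_b = scorer(prompt, image_a), scorer(prompt, image_b)
        # Only the sign of the difference matters, never its magnitude.
        if score_a > score_b:
            votes_a += 1
        elif score_b > score_a:
            votes_b += 1
    if votes_a == votes_b:
        return None
    return (image_a, image_b) if votes_a > votes_b else (image_b, image_a)


# Made-up scores on three very different scales.
scorers = [
    table_scorer({"img_a": 0.31, "img_b": 0.29}),
    table_scorer({"img_a": 5.2, "img_b": 6.9}),
    table_scorer({"img_a": 21.4, "img_b": 20.8}),
]
pair = majority_vote_pair("a red bicycle", "img_a", "img_b", scorers)
print(pair)  # ('img_a', 'img_b')
```

Returning `None` on a tie is one possible design choice; with an odd number of scorers and no exact score ties it never triggers.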

If this is right

Preference win rates rise over single-metric DPO baselines on Pick-a-Pic, PartiPrompt, and HPD.
The gains hold across Stable Diffusion 1.5, 2.1, and SDXL.
Ablations show that both the majority-vote aggregation and the dynamic reference updates contribute measurably to the reported stability.
The method generalizes across the three tested alignment datasets without post-hoc adjustments.
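This summary does not say how the reference model is refreshed, so the following is only a sketch of two common options: a periodic hard copy and an exponential moving average (EMA). The function names, the `every` interval, and the `tau` coefficient are illustrative assumptions, not the paper's update rule; parameters are modelled as a plain dict of floats to keep the example self-contained.

```python
from typing import Dict

Params = Dict[str, float]


def hard_refresh(ref: Params, policy: Params, step: int, every: int = 100) -> Params:
    """Assumed rule 1: copy the policy into the reference every `every` steps."""
    if step > 0 and step % every == 0:
        return dict(policy)
    return ref


def ema_update(ref: Params, policy: Params, tau: float = 0.99) -> Params:
    """Assumed rule 2: move the reference a small step toward the policy."""
    return {name: tau * ref[name] + (1.0 - tau) * policy[name] for name in ref}


ref = {"w": 0.0}
policy = {"w": 1.0}
print(hard_refresh(ref, policy, step=150))  # reference unchanged between refreshes
print(hard_refresh(ref, policy, step=200))  # reference replaced by a policy copy
print(ema_update(ref, policy, tau=0.9))     # reference drifts toward the policy
```

Either rule keeps the reference close to the current policy; which one, if either, matches the paper would need to be checked against the full Method section.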

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same voting mechanism could be tested on video or 3D generation where metric conflicts are also common.
If the vote threshold is varied, the method might reveal a sweet spot between consensus strength and exploration of diverse outputs.
Replacing one of the scorers with a learned model could reduce reliance on fixed external evaluators.
The approach might lower the engineering cost of combining new metrics because no manual re-weighting is required.

Load-bearing premise

A majority vote taken across several different preference scorers can be plugged into the DPO loss so that scale conflicts disappear and gradients remain stable without creating fresh selection biases.

What would settle it

Training the same models on the same datasets and backbones with the majority-vote and dynamic-reference components removed or replaced by scalar averaging, then measuring no gain in win rate or the appearance of new systematic biases in the outputs, would falsify the claim.
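The contrast between scalar averaging and voting in such an ablation can be illustrated with a toy comparison. The scores and the rescaling factors below are made up, and `scalar_average_winner` and `majority_vote_winner` are hypothetical helper names; the sketch only shows that rescaling one scorer can flip the averaged label while leaving the vote unchanged.

```python
from typing import List, Sequence


def scalar_average_winner(scores_a: Sequence[float], scores_b: Sequence[float]) -> str:
    """Pick the image with the higher mean score (ties go to 'b')."""
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    return "a" if mean_a > mean_b else "b"


def majority_vote_winner(scores_a: Sequence[float], scores_b: Sequence[float]) -> str:
    """Pick the image preferred by more scorers (ties go to 'b')."""
    votes_a = sum(sa > sb for sa, sb in zip(scores_a, scores_b))
    votes_b = sum(sb > sa for sa, sb in zip(scores_a, scores_b))
    return "a" if votes_a > votes_b else "b"


def rescale(scores: Sequence[float], factors: Sequence[float]) -> List[float]:
    """Multiply each scorer's output by a positive constant."""
    return [s * f for s, f in zip(scores, factors)]


# Made-up scores from three scorers with different numeric ranges.
scores_a = [0.31, 5.2, 21.4]
scores_b = [0.29, 6.9, 20.8]
factors = [1.0, 0.01, 1.0]  # shrink the second scorer's scale

print(scalar_average_winner(scores_a, scores_b))  # b
print(scalar_average_winner(rescale(scores_a, factors), rescale(scores_b, factors)))  # a
print(majority_vote_winner(scores_a, scores_b))  # a
print(majority_vote_winner(rescale(scores_a, factors), rescale(scores_b, factors)))  # a
```

Scale invariance alone does not settle the question of selection bias raised above; it only shows that the vote ignores score magnitudes.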

Figures

Figures reproduced from arXiv: 2503.12575 by Aditya Malusare, Amrit Singh Bedi, Biplab Banerjee, Dipesh Tamboli, Souradip Chakraborty, Vaneet Aggarwal.

**Figure 2.** Comparison of images generated by models trained on image-text pairs from the Pick-a-Pic dataset and preference labels based on different score…

**Figure 3.** Pick-a-Pic [9] comparison for SDXL. Comparison of images generated by various SDXL fine-tunes (specify versions if applicable) and BALANCEDDPO (Ours) on the Pick-a-Pic dataset across diverse prompts. BALANCEDDPO generally creates images that are more realistic and possess finer details. They are also superior in terms of prompt alignment and visual attractiveness. The columns emphasize BALANCEDDPO’s streng…

**Figure 4.** Pick-a-Pic [9] comparison. Comparison of images generated by SD1.5, DiffusionDPO, and BALANCEDDPO (Ours) across various prompts. BALANCEDDPO consistently produces more realistic and detailed outputs, outperforming the other models in aligning with prompts and visual appeal. Each column highlights BALANCEDDPO’s superior performance in aspects like facial detail, dynamic motion, adherence to prompt details, …

**Figure 5.** PartiPrompt [12] comparison. Comparison of images generated by SD1.5, DiffusionDPO, and BALANCEDDPO (Ours) on out-of-distribution prompts from the PartiPrompt dataset. BALANCEDDPO consistently generates more accurate and realistic outputs, including specific elements like a helicopter, microphone, and lifelike dog, while the other models produce incomplete or irrelevant results.

**Figure 6.** Comparison of images generated by SD v1.5, DiffusionDPO, and …

**Figure 7.** Comparison of image generations by SD1.5, DiffusionDPO, and …

**Figure 8.** Pick-a-Pic [9] dataset. For validation, we employ three distinct datasets to provide a comprehensive evaluation: a) Pick-a-Pic [9]: The Pick-a-Pic dataset is an extensive, publicly accessible collection of over 500,000 examples, encompassing more than 35,000 unique prompts, each paired with two AI-generated images and corresponding user preference labels. This dataset was compiled through the Pick-a-Pic w…

**Figure 9.** PartiPrompt dataset [12]. c) Human Preference Dataset v2 (HPD v2) [8]: This dataset serves as the foundation for creating the Human Preference Score v2 (HPS v2), and contains 798,090 human preference choices on 433,760 pairs of images. It is specifically curated to minimize potential biases present in previous datasets and covers a wide range of image sources and styles. It has 400 unique prompts. We show …

**Figure 10.** HPD dataset [8]. IX. EXTENDED RESULTS: This section provides additional experiments to support our claims in the main paper. First, we conduct an ablation study to demonstrate the limitations of the Vanilla Aggregation approach, highlighting why our proposed BALANCEDDPO method is essential for effectively optimizing multiple metrics simultaneously. Second, we present qualitative and quantitative (see Table …

**Figure 11.** Comparison of images generated using BALANCEDDPO across five different seeds for each prompt. Rows correspond to prompts from the PartiPrompts, Pick-a-Pic, and HPD datasets, while columns represent outputs for different seeds. The figure highlights BALANCEDDPO’s ability to consistently generate visually appealing and semantically aligned images across diverse prompts and seeds.

Original abstract

Diffusion models have achieved remarkable progress in text-to-image generation, yet aligning them with human preference remains challenging due to the presence of multiple, sometimes conflicting, evaluation metrics (e.g., semantic consistency, aesthetics, and human preference scores). Existing alignment methods typically optimize for a single metric or rely on scalarized reward aggregation, which can bias the model toward specific evaluation criteria. To address this challenge, we propose BalancedDPO, a framework that achieves multi-metric preference alignment within the Direct Preference Optimization (DPO) paradigm. Unlike prior DPO variants that rely on a single metric, BalancedDPO introduces a majority-vote consensus over multiple preference scorers and integrates it directly into the DPO training loop with dynamic reference model updates. This consensus-based formulation avoids reward-scale conflicts and ensures more stable gradient directions across heterogeneous metrics. Experiments on Pick-a-Pic, PartiPrompt, and HPD datasets demonstrate that BalancedDPO consistently improves preference win rates over the baselines across Stable Diffusion 1.5, Stable Diffusion 2.1 and SDXL backbones. Comprehensive ablations further validate the benefits of majority-vote aggregation and dynamic reference updating, highlighting the method's robustness and generalizability across diverse alignment settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BalancedDPO applies majority voting over multiple scorers plus dynamic references inside DPO and reports win-rate gains on three SD backbones, but the method stays heuristic and the abstract-level evidence leaves the size and robustness of the gains unclear.

BalancedDPO integrates a majority-vote consensus across heterogeneous preference scorers directly into the DPO loss for text-to-image diffusion models, paired with dynamic reference model updates. The goal is to handle conflicting metrics such as aesthetics and semantic consistency without the scale mismatches that come from simple scalar aggregation. Experiments claim consistent preference win-rate improvements over baselines on Pick-a-Pic, PartiPrompt, and HPD using Stable Diffusion 1.5, 2.1, and SDXL, with ablations that isolate the vote and the reference update. This is a reasonable practical step because single-metric DPO is known to be brittle when multiple objectives matter. The dynamic reference piece may also reduce some of the usual reference-model drift problems in DPO. The soft spot is that the description gives no equations for how the vote is folded into the gradient or how the reference is refreshed, so it is difficult to judge whether the approach truly produces more stable directions or simply adds another tunable heuristic. No error bars, statistical tests, or head-to-head numbers against other multi-objective schemes appear in the provided summary, which makes it hard to tell if the reported gains are large enough to matter or sensitive to the particular scorer set. The work is empirical and internally consistent on its own terms. It is aimed at researchers who tune alignment pipelines for generative image models and who already work with DPO variants. A reader looking for concrete tricks to try on their own training runs will find something usable here. The paper deserves a serious referee because the underlying problem is real and the proposed changes are testable, even if the final gains prove modest once the implementation details are examined.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce BalancedDPO, a framework for multi-metric preference alignment in diffusion models within the DPO paradigm. It uses majority-vote consensus over multiple preference scorers integrated into the DPO training loop with dynamic reference model updates to avoid reward-scale conflicts and ensure stable gradients. Experiments on Pick-a-Pic, PartiPrompt, and HPD datasets show consistent improvements in preference win rates over baselines for SD 1.5, 2.1, and SDXL, with ablations validating the approach.

Significance. If the results are substantiated, this method could significantly advance preference alignment for generative models by handling multiple conflicting metrics more effectively than single-metric or scalarized approaches. The majority-vote and dynamic update mechanisms represent a practical innovation that may improve robustness in alignment tasks.

major comments (2)

[Experiments] Experiments section: The reported improvements in preference win rates are presented without statistical significance tests, standard errors, or multi-run variance, which is load-bearing for the central claim of 'consistent improvements' across backbones.
[Method] Method section: The majority-vote consensus is integrated into the DPO loss, but the manuscript provides no explicit modified loss equation or derivation showing how the vote avoids scale conflicts, undermining verification of the 'stable gradient directions' claim.
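The first major comment asks for uncertainty estimates on the win rates. As a rough illustration of what that could look like, the sketch below computes a binomial standard error and an exact two-sided sign test against a 50% win rate. This is a simpler stand-in than a paired t-test or Wilcoxon test, the counts are invented, and it assumes ties were dropped before counting; `win_rate_summary` and `sign_test_p_value` are hypothetical helper names.

```python
import math
from typing import Tuple


def win_rate_summary(wins: int, n: int) -> Tuple[float, float]:
    """Return (win rate, binomial standard error) for `wins` out of `n` prompts."""
    rate = wins / n
    standard_error = math.sqrt(rate * (1.0 - rate) / n)
    return rate, standard_error


def sign_test_p_value(wins: int, n: int) -> float:
    """Exact two-sided binomial test of H0: win probability = 0.5."""
    k = max(wins, n - wins)
    upper_tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * upper_tail)


# Invented example: 60 wins out of 100 prompts (not a result from the paper).
rate, se = win_rate_summary(60, 100)
p_value = sign_test_p_value(60, 100)
print(f"win rate = {rate:.2f} +/- {se:.3f}, sign-test p = {p_value:.3f}")
```

Variance across training seeds, which the comment also asks for, is a separate source of uncertainty that a per-prompt test like this does not capture.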

minor comments (1)

[Abstract] The abstract states that ablations 'validate the benefits' but the paper should include a dedicated table or figure summarizing all ablation variants with quantitative deltas for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the major comments point-by-point below. We agree that both statistical rigor and explicit loss formulation are important for substantiating the claims and will revise the manuscript to incorporate these elements.

Referee: [Experiments] Experiments section: The reported improvements in preference win rates are presented without statistical significance tests, standard errors, or multi-run variance, which is load-bearing for the central claim of 'consistent improvements' across backbones.

Authors: We agree that reporting variance and significance tests strengthens the central claim. In the revised manuscript we will rerun the key experiments with at least three random seeds, report standard errors, and include paired t-tests or Wilcoxon tests against baselines to quantify statistical significance of the win-rate improvements.

Revision: yes

Referee: [Method] Method section: The majority-vote consensus is integrated into the DPO loss, but the manuscript provides no explicit modified loss equation or derivation showing how the vote avoids scale conflicts, undermining verification of the 'stable gradient directions' claim.

Authors: The referee is correct that an explicit equation was omitted. The majority vote produces a binary preference label y_vote that replaces the single-metric label inside the standard DPO loss; because the label is discrete rather than a scalar reward, the gradient direction is determined solely by the sign of the vote and is therefore invariant to the numerical scales of the individual scorers. We will add the modified loss equation L_BalancedDPO = -log σ(β log(π_θ(y_w|x)/π_ref(y_w|x)) - β log(π_θ(y_l|x)/π_ref(y_l|x))) where (y_w, y_l) are chosen by majority vote, together with a short derivation showing scale invariance, in the revised Method section.

Revision: yes
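The loss quoted in the simulated rebuttal can be written out directly. The sketch below follows that equation literally, taking log-probabilities as inputs; for diffusion models these are not available in closed form, and Diffusion-DPO [5] substitutes denoising-error terms, which this example does not attempt. The function names and the default `beta=0.1` are assumptions for illustration.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def balanced_dpo_loss(
    logp_w: float,
    logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """DPO loss for one pair where (winner, loser) was chosen by majority vote.

    logp_* are log pi_theta(y|x); ref_logp_* are log pi_ref(y|x).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


# Policy equal to the reference gives a zero margin, so the loss is log 2.
print(balanced_dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Raising the winner's log-probability relative to the reference lowers the loss.
print(balanced_dpo_loss(-0.5, -1.0, -1.0, -1.0))
```

Note that no scorer output appears among the arguments: the scorers influence the loss only through which sample is labelled the winner, which is the scale-invariance point made in the rebuttal.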

Circularity Check

0 steps flagged

No significant circularity; empirical extension of DPO

The paper introduces BalancedDPO as an empirical framework that integrates majority-vote consensus over multiple scorers and dynamic reference updates directly into the DPO training loop. No derivation chain, first-principles predictions, fitted parameters renamed as outputs, or load-bearing self-citations are described that would reduce any claimed result to its own inputs by construction. All reported gains are presented as experimental outcomes on Pick-a-Pic, PartiPrompt, and HPD datasets across SD backbones, with ablations validating the components independently. The method remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that majority voting across scorers yields stable signals independent of individual metric scales; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: Majority-vote consensus over multiple preference scorers provides stable gradient directions that avoid reward-scale conflicts in DPO training.
Invoked as the core mechanism that enables multi-metric alignment without scalarization biases.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Pareto-Guided Optimal Transport for Multi-Reward Alignment
cs.CV · 2026-05 · unverdicted · novelty 7.0

PG-OT builds prompt-specific Pareto frontiers and applies distribution-aware optimal transport to improve multi-reward alignment while introducing JDR and JCR metrics to measure synergy and hacking.

HP-Edit: A Human-Preference Post-Training Framework for Image Editing
cs.CV · 2026-04 · unverdicted · novelty 7.0

HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.

LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
cs.CV · 2026-04 · unverdicted · novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 3 Pith papers · 11 internal anchors

[1] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,” arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022.
[2] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in Neural Information Processing Systems, vol. 35, pp. 36479–36494, 2022.
[3] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
[4] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021.
[5] B. Wallace, M. Dang, R. Rafailov, L. Zhou, A. Lou, S. Purushwalkam, S. Ermon, C. Xiong, S. Joty, and N. Naik, “Diffusion model alignment using direct preference optimization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8228–8238, 2024.
[6] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[7] C. Schuhmann, “Laion-aesthetics,” 2022. Accessed: 2023-11-10.
[8] X. Wu, Y. Hao, K. Sun, Y. Chen, F. Zhu, R. Zhao, and H. Li, “Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis,” arXiv preprint arXiv:2306.09341, 2023.
[9] Y. Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, and O. Levy, “Pick-a-pic: An open dataset of user preferences for text-to-image generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 36652–36663, 2023.
[10] Z. Zhou, J. Liu, J. Shao, X. Yue, C. Yang, W. Ouyang, and Y. Qiao, “Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization,” in Findings of the Association for Computational Linguistics ACL 2024, pp. 10586–10613, 2024.
[11] M. Prasad, “Social choice and the value alignment problem,” Artificial Intelligence Safety and Security, pp. 291–314, 2018.
[12] J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan, et al., “Scaling autoregressive models for content-rich text-to-image generation,” arXiv preprint arXiv:2206.10789, vol. 2, no. 3, p. 5, 2022.
[13] Y. Fan, O. Watkins, Y. Du, H. Liu, M. Ryu, C. Boutilier, P. Abbeel, M. Ghavamzadeh, K. Lee, and K. Lee, “Reinforcement learning for fine-tuning text-to-image diffusion models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[14] Z. Wang, J. J. Hunt, and M. Zhou, “Diffusion policies as an expressive policy class for offline reinforcement learning,” arXiv preprint arXiv:2208.06193, 2022.
[15] K. Lee, H. Liu, M. Ryu, O. Watkins, Y. Du, C. Boutilier, P. Abbeel, M. Ghavamzadeh, and S. S. Gu, “Aligning text-to-image models using human feedback,” arXiv preprint arXiv:2302.12192, 2023.
[16] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[17] R. Zheng, S. Dou, S. Gao, Y. Hua, W. Shen, B. Wang, Y. Liu, S. Jin, Q. Liu, Y. Zhou, et al., “Secrets of RLHF in large language models part I: PPO,” arXiv preprint arXiv:2307.04964, 2023.
[18] K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine, “Training diffusion models with reinforcement learning,” arXiv preprint arXiv:2305.13301, 2023.
[19] J. Chen, J.-S. Byun, M. Elsner, and A. Perrault, “Reinforcement learning for fine-tuning text-to-speech diffusion models,” arXiv preprint arXiv:2405.14632, 2024.
[20] Z. Dong, Y. Yuan, J. Hao, F. Ni, Y. Mu, Y. Zheng, Y. Hu, T. Lv, C. Fan, and Z. Hu, “Aligndiff: Aligning diverse human preferences via behavior-customisable diffusion model,” arXiv preprint arXiv:2310.02054, 2023.
[21] R. Zheng, S. Dou, S. Gao, Y. Hua, W. Shen, B. Wang, Y. Liu, S. Jin, Y. Zhou, L. Xiong, et al., “Delve into PPO: Implementation matters for stable RLHF,” in NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.
[22] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[23] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “SDXL: Improving latent diffusion models for high-resolution image synthesis,” arXiv preprint arXiv:2307.01952, 2023.
[24] K. Yang, J. Tao, J. Lyu, C. Ge, J. Chen, Q. Li, W. Shen, X. Zhu, and X. Li, “Using human feedback to fine-tune diffusion models without any reward model,” arXiv preprint arXiv:2311.13231, 2023.
[25] Y. Gu, Z. Wang, Y. Yin, Y. Xie, and M. Zhou, “Diffusion-rpo: Aligning diffusion models through relative preference optimization,” arXiv preprint arXiv:2406.06382, 2024.
[26] Z. Liang, Y. Yuan, S. Gu, B. Chen, T. Hang, J. Li, and L. Zheng, “Step-aware preference optimization: Aligning preference with denoising performance at each step,” arXiv preprint arXiv:2406.04314, 2024.
[27] X. Cheng, X. Zhou, Y. Yang, Y. Bao, and Q. Gu, “Decomposed direct preference optimization for structure-based drug design,” CoRR, vol. abs/2407.13981, 2024.
[28] R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. the method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952.
[29] M. J. Li, “Inappropriate text classifier,” 2023.
[30] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “Huggingface stable diffusion v1.5 model.” https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5
[31] I. Loshchilov, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.

VI. APPENDIX This appendix provides extended insights and results to support the claims made in the main paper. • In Section VII, we describe the datasets used in our experiments, including PartiPrompts, Pick-a-Pic, and HPD, and showcase images generated fro...

[32] Image Generation: For each caption from our validation datasets, we generate five images using different random seeds for all models (BALANCEDDPO, SD1.5, and DiffusionDPO).
[33] Metrics Used: We evaluate generated images using four key metrics: Human Preference Score (HPS), CLIP score, PickScore, and Aesthetics score, following DiffusionDPO.
[34] Best Score Selection: For each metric and prompt, we select the highest score from the five generated images for all the models. This method mitigates random variations in image generation, ensures each model demonstrates its best performance, and maintains consistency across all models by automating the selection process.
[35] Win Rate Calculation: We computed the win rate for each model pair comparison (BALANCEDDPO vs. SD1.5, BALANCEDDPO vs. DiffusionDPO, and DiffusionDPO vs. SD1.5). The win rate represents the proportion of times a model’s best score is preferred (i.e., higher) compared to the other model’s best score for the same prompt and metric.

ETHICS While our model...
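The evaluation steps listed above (five seeds per prompt, best score per prompt, pairwise win rate) can be sketched for a single metric as follows. The scores are invented, `best_of_seeds` and `win_rate` are hypothetical helper names, and counting ties as non-wins is an assumption the extracted text does not specify.

```python
from typing import Dict, List

ScoresByPrompt = Dict[str, List[float]]


def best_of_seeds(scores: ScoresByPrompt) -> Dict[str, float]:
    """Keep the highest score among the seeds generated for each prompt."""
    return {prompt: max(values) for prompt, values in scores.items()}


def win_rate(model_a: ScoresByPrompt, model_b: ScoresByPrompt) -> float:
    """Fraction of shared prompts where A's best score beats B's best score."""
    best_a, best_b = best_of_seeds(model_a), best_of_seeds(model_b)
    prompts = best_a.keys() & best_b.keys()
    wins = sum(1 for prompt in prompts if best_a[prompt] > best_b[prompt])
    return wins / len(prompts)  # ties count as non-wins (assumption)


# Invented scores for one metric: four prompts, five seeds each.
model_a = {"p1": [1, 5, 3, 2, 4], "p2": [2, 2, 2, 2, 2], "p3": [9, 1, 1, 1, 1], "p4": [0, 0, 0, 0, 1]}
model_b = {"p1": [4, 4, 4, 4, 4], "p2": [3, 1, 1, 1, 1], "p3": [8, 8, 8, 8, 8], "p4": [2, 0, 0, 0, 0]}

print(win_rate(model_a, model_b))  # 0.5
```

Because each model is represented by its best of five samples, this protocol measures peak rather than typical quality, which is worth keeping in mind when reading the reported win rates.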

[1] [1]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[2] [2]

Photorealistic text-to-image diffusion models with deep language understanding,

C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in neural information processing systems , vol. 35, pp. 36479–36494, 2022

work page 2022

[3] [3]

High- resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pp. 10684–10695, 2022

work page 2022

[4] [4]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[5] [5]

Diffusion model alignment using direct preference optimization,

B. Wallace, M. Dang, R. Rafailov, L. Zhou, A. Lou, S. Purushwalkam, S. Ermon, C. Xiong, S. Joty, and N. Naik, “Diffusion model alignment using direct preference optimization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8228–8238, 2024

work page 2024

[6] [6]

Direct preference optimization: Your language model is secretly a reward model,

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems , vol. 36, 2024

work page 2024

[7] [7]

Laion-aesthetics,

C. Schuhmann, “Laion-aesthetics,” 2022. Accessed: 2023-11-10

work page 2022

[8] [8]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

X. Wu, Y . Hao, K. Sun, Y . Chen, F. Zhu, R. Zhao, and H. Li, “Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis,” arXiv preprint arXiv:2306.09341 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Pick-a-pic: An open dataset of user preferences for text-to-image generation,

Y . Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, and O. Levy, “Pick-a-pic: An open dataset of user preferences for text-to-image generation,” Advances in Neural Information Processing Systems , vol. 36, pp. 36652–36663, 2023

work page 2023

[10] Z. Zhou, J. Liu, J. Shao, X. Yue, C. Yang, W. Ouyang, and Y. Qiao, “Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization,” in Findings of the Association for Computational Linguistics ACL 2024, pp. 10586–10613, 2024

[11] M. Prasad, “Social choice and the value alignment problem,” Artificial intelligence safety and security, pp. 291–314, 2018

[12] J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan, et al., “Scaling autoregressive models for content-rich text-to-image generation,” arXiv preprint arXiv:2206.10789, vol. 2, no. 3, p. 5, 2022

[13] Y. Fan, O. Watkins, Y. Du, H. Liu, M. Ryu, C. Boutilier, P. Abbeel, M. Ghavamzadeh, K. Lee, and K. Lee, “Reinforcement learning for fine-tuning text-to-image diffusion models,” Advances in Neural Information Processing Systems, vol. 36, 2024

[14] Z. Wang, J. J. Hunt, and M. Zhou, “Diffusion policies as an expressive policy class for offline reinforcement learning,” arXiv preprint arXiv:2208.06193, 2022

[15] K. Lee, H. Liu, M. Ryu, O. Watkins, Y. Du, C. Boutilier, P. Abbeel, M. Ghavamzadeh, and S. S. Gu, “Aligning text-to-image models using human feedback,” arXiv preprint arXiv:2302.12192, 2023

[16] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

[17] R. Zheng, S. Dou, S. Gao, Y. Hua, W. Shen, B. Wang, Y. Liu, S. Jin, Q. Liu, Y. Zhou, et al., “Secrets of RLHF in large language models part I: PPO,” arXiv preprint arXiv:2307.04964, 2023

[18] K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine, “Training diffusion models with reinforcement learning,” arXiv preprint arXiv:2305.13301, 2023

[19] J. Chen, J.-S. Byun, M. Elsner, and A. Perrault, “Reinforcement learning for fine-tuning text-to-speech diffusion models,” arXiv preprint arXiv:2405.14632, 2024

[20] Z. Dong, Y. Yuan, J. Hao, F. Ni, Y. Mu, Y. Zheng, Y. Hu, T. Lv, C. Fan, and Z. Hu, “Aligndiff: Aligning diverse human preferences via behavior-customisable diffusion model,” arXiv preprint arXiv:2310.02054, 2023

[21] R. Zheng, S. Dou, S. Gao, Y. Hua, W. Shen, B. Wang, Y. Liu, S. Jin, Y. Zhou, L. Xiong, et al., “Delve into PPO: Implementation matters for stable RLHF,” in NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

[22] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023

[23] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “SDXL: Improving latent diffusion models for high-resolution image synthesis,” arXiv preprint arXiv:2307.01952, 2023

[24] K. Yang, J. Tao, J. Lyu, C. Ge, J. Chen, Q. Li, W. Shen, X. Zhu, and X. Li, “Using human feedback to fine-tune diffusion models without any reward model,” arXiv preprint arXiv:2311.13231, 2023

[25] Y. Gu, Z. Wang, Y. Yin, Y. Xie, and M. Zhou, “Diffusion-rpo: Aligning diffusion models through relative preference optimization,” arXiv preprint arXiv:2406.06382, 2024

[26] Z. Liang, Y. Yuan, S. Gu, B. Chen, T. Hang, J. Li, and L. Zheng, “Step-aware preference optimization: Aligning preference with denoising performance at each step,” arXiv preprint arXiv:2406.04314, 2024

[27] X. Cheng, X. Zhou, Y. Yang, Y. Bao, and Q. Gu, “Decomposed direct preference optimization for structure-based drug design,” CoRR, vol. abs/2407.13981, 2024

[28] R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. the method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952

[29] M. J. Li, “Inappropriate text classifier,” 2023

[30] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “Huggingface stable diffusion v1.5 model.” https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5

[31] I. Loshchilov, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

VI. APPENDIX

This appendix provides extended insights and results to support the claims made in the main paper.

• In Section VII, we describe the datasets used in our experiments, including PartiPrompts, Pick-a-Pic, and HPD, and showcase images generated fro...

Image Generation: For each caption from our validation datasets, we generate five images using different random seeds for all models (BalancedDPO, SD1.5, and DiffusionDPO).

Fig. 11. Comparison of images generated using BalancedDPO across five different seeds for each prompt. Rows correspond to prompts from the PartiPrompts, Pick-a-Pic, and HPD datas...

Metrics Used: We evaluate generated images using four key metrics: Human Preference Score (HPS), CLIP score, PickScore, and Aesthetics score, following DiffusionDPO.

Best Score Selection: For each metric and prompt, we select the highest score among the five generated images, for every model. This reduces the effect of random seed variation, lets each model be judged on its best output, and keeps the comparison consistent across models because the selection is automated.

Win Rate Calculation: We compute the win rate for each model pair (BalancedDPO vs. SD1.5, BalancedDPO vs. DiffusionDPO, and DiffusionDPO vs. SD1.5). The win rate is the proportion of prompts for which one model’s best score is higher than the other model’s best score on the same prompt and metric.

ETHICS

While our model...
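The evaluation protocol from the appendix (five seeds per prompt, best score per prompt, then pairwise win rate) can be illustrated with a small sketch for a single metric. This is not the authors' code: the function names `best_scores` and `win_rate`, the toy prompts, and the score values are all hypothetical, and counting ties as non-wins is an assumption based on the "i.e., higher" wording.

```python
from typing import Dict, List


def best_scores(per_seed_scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Keep the highest score per prompt across the seeds generated for it."""
    return {prompt: max(scores) for prompt, scores in per_seed_scores.items()}


def win_rate(model_a: Dict[str, List[float]],
             model_b: Dict[str, List[float]]) -> float:
    """Fraction of shared prompts where model A's best score beats model B's.

    Assumption: a tie is not a win (strict '>'), matching "i.e., higher".
    """
    best_a = best_scores(model_a)
    best_b = best_scores(model_b)
    prompts = best_a.keys() & best_b.keys()
    if not prompts:
        raise ValueError("no shared prompts to compare")
    wins = sum(best_a[p] > best_b[p] for p in prompts)
    return wins / len(prompts)


# Hypothetical scores for one metric: five seeds per prompt, per model.
balanced = {
    "a cat": [0.21, 0.30, 0.25, 0.28, 0.27],
    "a dog": [0.10, 0.12, 0.11, 0.09, 0.13],
}
baseline = {
    "a cat": [0.22, 0.24, 0.29, 0.20, 0.23],
    "a dog": [0.15, 0.14, 0.10, 0.12, 0.11],
}

print(win_rate(balanced, baseline))  # 0.5: wins on "a cat", loses on "a dog"
```

Repeating the call once per metric (HPS, CLIP score, PickScore, Aesthetics) and per model pair would give the full set of comparisons listed in the appendix.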