BalancedDPO: Adaptive Multi-Metric Alignment
Pith reviewed 2026-05-22 23:35 UTC · model grok-4.3
The pith
BalancedDPO aligns text-to-image diffusion models to multiple conflicting metrics by integrating majority-vote consensus directly into the DPO training loop with dynamic reference updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BalancedDPO achieves multi-metric preference alignment within the Direct Preference Optimization paradigm by introducing a majority-vote consensus over multiple preference scorers and integrating it directly into the DPO training loop with dynamic reference model updates; this consensus-based formulation avoids reward-scale conflicts and produces more stable gradient directions across heterogeneous metrics.
What carries the argument
Majority-vote consensus over multiple heterogeneous preference scorers, inserted into the DPO loss together with dynamic reference-model updates.
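A minimal sketch of the voting step, assuming each scorer is a function that maps a prompt and an image to a number. The `Scorer` signature, the tie rule (drop the pair), and the toy score tables are illustrative assumptions, not details taken from the paper.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical scorer signature: (prompt, image_id) -> score, higher is better.
Scorer = Callable[[str, str], float]


def majority_vote_pair(
    prompt: str, image_a: str, image_b: str, scorers: Sequence[Scorer]
) -> Optional[Tuple[str, str]]:
    """Return (winner, loser) chosen by majority vote, or None on a tie."""
    votes_a = sum(1 for s in scorers if s(prompt, image_a) > s(prompt, image_b))
    votes_b = sum(1 for s in scorers if s(prompt, image_b) > s(prompt, image_a))
    if votes_a == votes_b:
        return None  # assumption: tied pairs are skipped
    return (image_a, image_b) if votes_a > votes_b else (image_b, image_a)


# Toy scores on deliberately different scales (made-up numbers).
toy_tables = [
    {"a": 0.31, "b": 0.28},   # CLIP-like scorer
    {"a": 5.1, "b": 6.4},     # aesthetics-like scorer
    {"a": 0.27, "b": 0.255},  # HPS-like scorer
]
toy_scorers = [lambda prompt, image, table=table: table[image] for table in toy_tables]

pair = majority_vote_pair("a red bicycle", "a", "b", toy_scorers)
print(pair)  # ('a', 'b')
```

Only the ordering within each scorer enters the vote, so the aesthetics-like scorer's larger numeric range does not outweigh the other two votes in this toy case.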
If this is right
- Preference win rates rise over single-metric DPO baselines on Pick-a-Pic, PartiPrompt, and HPD.
- The gains hold across Stable Diffusion 1.5, 2.1, and SDXL.
- Ablations show that both the majority-vote aggregation and the dynamic reference updates contribute measurably to the reported stability.
- The method generalizes across the three tested alignment datasets without post-hoc adjustments.
Where Pith is reading between the lines
- The same voting mechanism could be tested on video or 3D generation where metric conflicts are also common.
- If the vote threshold is varied, the method might reveal a sweet spot between consensus strength and exploration of diverse outputs.
- Replacing one of the scorers with a learned model could reduce reliance on fixed external evaluators.
- The approach might lower the engineering cost of combining new metrics because no manual re-weighting is required.
Load-bearing premise
A majority-vote taken across several different preference scorers can be plugged into the DPO loss so that scale conflicts disappear and gradients remain stable without creating fresh selection biases.
What would settle it
Training the same models on the same datasets and backbones with the majority-vote and dynamic-reference components removed or replaced by scalar averaging, then measuring no gain in win rate or the appearance of new systematic biases in the outputs, would falsify the claim.
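The contrast between scalar averaging and voting that this test relies on can be illustrated on toy numbers. The sketch below assumes three scorers with made-up raw scores; the function names and the rescaling helper are hypothetical and serve only to show how a change of units in one scorer can flip an averaged label while leaving the vote unchanged.

```python
from typing import List


def mean_score_winner(scores_a: List[float], scores_b: List[float]) -> str:
    """Scalarized baseline: pick the image with the higher average raw score."""
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    return "a" if mean_a > mean_b else "b"


def vote_winner(scores_a: List[float], scores_b: List[float]) -> str:
    """Majority vote over per-scorer comparisons; returns 'tie' if votes are equal."""
    votes_a = sum(1 for sa, sb in zip(scores_a, scores_b) if sa > sb)
    votes_b = sum(1 for sa, sb in zip(scores_a, scores_b) if sb > sa)
    if votes_a == votes_b:
        return "tie"
    return "a" if votes_a > votes_b else "b"


def rescale(scores: List[float], index: int, factor: float) -> List[float]:
    """Multiply one scorer's output by a positive factor (a change of units)."""
    return [s * factor if i == index else s for i, s in enumerate(scores)]


# Made-up raw scores from three scorers with different numeric ranges.
scores_a = [0.31, 5.1, 0.27]
scores_b = [0.28, 6.4, 0.255]

print(mean_score_winner(scores_a, scores_b))  # b (the large-range scorer dominates)
print(vote_winner(scores_a, scores_b))        # a

# Report the first scorer on a 0-100 scale instead of 0-1.
scaled_a = rescale(scores_a, 0, 100.0)
scaled_b = rescale(scores_b, 0, 100.0)
print(mean_score_winner(scaled_a, scaled_b))  # a (the averaged label flips)
print(vote_winner(scaled_a, scaled_b))        # a (the vote is unchanged)
```

This only demonstrates the scale argument. It says nothing about whether voting introduces other selection biases, which is the part the proposed experiment would have to measure.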
Original abstract
Diffusion models have achieved remarkable progress in text-to-image generation, yet aligning them with human preference remains challenging due to the presence of multiple, sometimes conflicting, evaluation metrics (e.g., semantic consistency, aesthetics, and human preference scores). Existing alignment methods typically optimize for a single metric or rely on scalarized reward aggregation, which can bias the model toward specific evaluation criteria. To address this challenge, we propose BalancedDPO, a framework that achieves multi-metric preference alignment within the Direct Preference Optimization (DPO) paradigm. Unlike prior DPO variants that rely on a single metric, BalancedDPO introduces a majority-vote consensus over multiple preference scorers and integrates it directly into the DPO training loop with dynamic reference model updates. This consensus-based formulation avoids reward-scale conflicts and ensures more stable gradient directions across heterogeneous metrics. Experiments on Pick-a-Pic, PartiPrompt, and HPD datasets demonstrate that BalancedDPO consistently improves preference win rates over the baselines across Stable Diffusion 1.5, Stable Diffusion 2.1 and SDXL backbones. Comprehensive ablations further validate the benefits of majority-vote aggregation and dynamic reference updating, highlighting the method's robustness and generalizability across diverse alignment settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce BalancedDPO, a framework for multi-metric preference alignment in diffusion models within the DPO paradigm. It uses majority-vote consensus over multiple preference scorers integrated into the DPO training loop with dynamic reference model updates to avoid reward-scale conflicts and ensure stable gradients. Experiments on Pick-a-Pic, PartiPrompt, and HPD datasets show consistent improvements in preference win rates over baselines for SD 1.5, 2.1, and SDXL, with ablations validating the approach.
Significance. If the results are substantiated, this method could significantly advance preference alignment for generative models by handling multiple conflicting metrics more effectively than single-metric or scalarized approaches. The majority-vote and dynamic update mechanisms represent a practical innovation that may improve robustness in alignment tasks.
major comments (2)
- [Experiments] Experiments section: The reported improvements in preference win rates are presented without statistical significance tests, standard errors, or multi-run variance, which is load-bearing for the central claim of 'consistent improvements' across backbones.
- [Method] Method section: The majority-vote consensus is integrated into the DPO loss, but the manuscript provides no explicit modified loss equation or derivation showing how the vote avoids scale conflicts, undermining verification of the 'stable gradient directions' claim.
minor comments (1)
- [Abstract] The abstract states that ablations 'validate the benefits' but the paper should include a dedicated table or figure summarizing all ablation variants with quantitative deltas for clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address the major comments point-by-point below. We agree that both statistical rigor and explicit loss formulation are important for substantiating the claims and will revise the manuscript to incorporate these elements.
-
Referee: [Experiments] Experiments section: The reported improvements in preference win rates are presented without statistical significance tests, standard errors, or multi-run variance, which is load-bearing for the central claim of 'consistent improvements' across backbones.
Authors: We agree that reporting variance and significance tests strengthens the central claim. In the revised manuscript we will rerun the key experiments with at least three random seeds, report standard errors, and include paired t-tests or Wilcoxon tests against baselines to quantify statistical significance of the win-rate improvements.
Revision: yes
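The statistics promised in this response could be computed along the following lines. This is a sketch using only the Python standard library, and it swaps in an exact paired sign-flip permutation test in place of the paired t-test or Wilcoxon test named above. The per-seed win rates are made-up numbers, not results from the paper.

```python
import itertools
import math
import statistics
from typing import Sequence


def standard_error(values: Sequence[float]) -> float:
    """Sample standard deviation divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def paired_sign_flip_pvalue(method: Sequence[float], baseline: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [m - b for m, b in zip(method, baseline)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / total


# Hypothetical win rates over three seeds (illustrative values only).
method_runs = [0.60, 0.62, 0.61]
baseline_runs = [0.50, 0.52, 0.50]

se = standard_error(method_runs)
p_value = paired_sign_flip_pvalue(method_runs, baseline_runs)
print(round(se, 4), p_value)
```

With n paired runs, the smallest p-value this exact test can return is 2 / 2^n, which is 0.25 for three seeds. Under this particular test, three seeds cannot reach a conventional 0.05 threshold, so more seeds or a per-prompt paired analysis may be needed.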
-
Referee: [Method] Method section: The majority-vote consensus is integrated into the DPO loss, but the manuscript provides no explicit modified loss equation or derivation showing how the vote avoids scale conflicts, undermining verification of the 'stable gradient directions' claim.
Authors: The referee is correct that an explicit equation was omitted. The majority vote produces a binary preference label y_vote that replaces the single-metric label inside the standard DPO loss; because the label is discrete rather than a scalar reward, the gradient direction is determined solely by the sign of the vote and is therefore invariant to the numerical scales of the individual scorers. We will add the modified loss equation
L_BalancedDPO = -log σ(β log(π_θ(y_w|x)/π_ref(y_w|x)) - β log(π_θ(y_l|x)/π_ref(y_l|x)))
where (y_w, y_l) are chosen by majority vote, together with a short derivation showing scale invariance, in the revised Method section.
Revision: yes
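The loss written in this response can be evaluated directly once the four log-probabilities are available. The sketch below is a plain-Python rendering of that formula under stated assumptions: the function names and the default `beta=0.1` are placeholders, and how the log-probabilities would be obtained for a diffusion model is not specified in the text above.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def balanced_dpo_loss(
    logp_w: float,      # log pi_theta(y_w | x), y_w = majority-vote winner
    logp_ref_w: float,  # log pi_ref(y_w | x)
    logp_l: float,      # log pi_theta(y_l | x), y_l = majority-vote loser
    logp_ref_l: float,  # log pi_ref(y_l | x)
    beta: float = 0.1,  # placeholder value, not taken from the paper
) -> float:
    """-log sigma(beta * [(logp_w - logp_ref_w) - (logp_l - logp_ref_l)])."""
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    return -log_sigmoid(margin)


# When the policy equals the reference, the margin is zero and the loss is log 2.
loss_at_init = balanced_dpo_loss(-1.0, -1.0, -2.0, -2.0)

# Raising the winner's log-probability relative to the reference lowers the loss.
loss_improved = balanced_dpo_loss(-0.5, -1.0, -2.5, -2.0)
print(loss_at_init, loss_improved)
```

The scorers' raw values never appear in the function signature; they influence training only through which image is passed as the winner. That is the scale-invariance property the response describes, and it holds for the label, not for any downstream bias the vote itself might carry.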
Circularity Check
No significant circularity; empirical extension of DPO
The paper introduces BalancedDPO as an empirical framework that integrates majority-vote consensus over multiple scorers and dynamic reference updates directly into the DPO training loop. No derivation chain, first-principles predictions, fitted parameters renamed as outputs, or load-bearing self-citations are described that would reduce any claimed result to its own inputs by construction. All reported gains are presented as experimental outcomes on Pick-a-Pic, PartiPrompt, and HPD datasets across SD backbones, with ablations validating the components independently. The method remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Majority-vote consensus over multiple preference scorers provides stable gradient directions that avoid reward-scale conflicts in DPO training.
Forward citations
Cited by 3 Pith papers
-
Pareto-Guided Optimal Transport for Multi-Reward Alignment
PG-OT builds prompt-specific Pareto frontiers and applies distribution-aware optimal transport to improve multi-reward alignment while introducing JDR and JCR metrics to measure synergy and hacking.
-
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
Reference graph
Works this paper leans on
-
[1]
Hierarchical Text-Conditional Image Generation with CLIP Latents
A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022
-
[2]
Photorealistic text-to-image diffusion models with deep language understanding,
C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in neural information processing systems, vol. 35, pp. 36479–36494, 2022
-
[3]
High-resolution image synthesis with latent diffusion models,
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022
-
[4]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021
-
[5]
Diffusion model alignment using direct preference optimization,
B. Wallace, M. Dang, R. Rafailov, L. Zhou, A. Lou, S. Purushwalkam, S. Ermon, C. Xiong, S. Joty, and N. Naik, “Diffusion model alignment using direct preference optimization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8228–8238, 2024
-
[6]
Direct preference optimization: Your language model is secretly a reward model,
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems, vol. 36, 2024
- [7]
-
[8]
X. Wu, Y. Hao, K. Sun, Y. Chen, F. Zhu, R. Zhao, and H. Li, “Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis,” arXiv preprint arXiv:2306.09341, 2023
-
[9]
Pick-a-pic: An open dataset of user preferences for text-to-image generation,
Y. Kirstain, A. Polyak, U. Singer, S. Matiana, J. Penna, and O. Levy, “Pick-a-pic: An open dataset of user preferences for text-to-image generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 36652–36663, 2023
-
[10]
Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization,
Z. Zhou, J. Liu, J. Shao, X. Yue, C. Yang, W. Ouyang, and Y. Qiao, “Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization,” in Findings of the Association for Computational Linguistics ACL 2024, pp. 10586–10613, 2024
-
[11]
Social choice and the value alignment problem,
M. Prasad, “Social choice and the value alignment problem,” Artificial intelligence safety and security, pp. 291–314, 2018
-
[12]
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan, et al., “Scaling autoregressive models for content-rich text-to-image generation,” arXiv preprint arXiv:2206.10789, vol. 2, no. 3, p. 5, 2022
-
[13]
Reinforcement learning for fine-tuning text-to-image diffusion models,
Y. Fan, O. Watkins, Y. Du, H. Liu, M. Ryu, C. Boutilier, P. Abbeel, M. Ghavamzadeh, K. Lee, and K. Lee, “Reinforcement learning for fine-tuning text-to-image diffusion models,” Advances in Neural Information Processing Systems, vol. 36, 2024
-
[14]
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning
Z. Wang, J. J. Hunt, and M. Zhou, “Diffusion policies as an expressive policy class for offline reinforcement learning,” arXiv preprint arXiv:2208.06193, 2022
-
[15]
Aligning Text-to-Image Models using Human Feedback
K. Lee, H. Liu, M. Ryu, O. Watkins, Y. Du, C. Boutilier, P. Abbeel, M. Ghavamzadeh, and S. S. Gu, “Aligning text-to-image models using human feedback,” arXiv preprint arXiv:2302.12192, 2023
-
[16]
Proximal Policy Optimization Algorithms
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017
-
[17]
Secrets of RLHF in Large Language Models Part I: PPO
R. Zheng, S. Dou, S. Gao, Y. Hua, W. Shen, B. Wang, Y. Liu, S. Jin, Q. Liu, Y. Zhou, et al., “Secrets of rlhf in large language models part i: Ppo,” arXiv preprint arXiv:2307.04964, 2023
-
[18]
Training Diffusion Models with Reinforcement Learning
K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine, “Training diffusion models with reinforcement learning,” arXiv preprint arXiv:2305.13301, 2023
-
[19]
Reinforcement learning for fine-tuning text-to-speech diffusion models,
J. Chen, J.-S. Byun, M. Elsner, and A. Perrault, “Reinforcement learning for fine-tuning text-to-speech diffusion models,” arXiv preprint arXiv:2405.14632, 2024
-
[20]
Aligndiff: Aligning diverse human preferences via behavior-customisable diffusion model,
Z. Dong, Y. Yuan, J. Hao, F. Ni, Y. Mu, Y. Zheng, Y. Hu, T. Lv, C. Fan, and Z. Hu, “Aligndiff: Aligning diverse human preferences via behavior-customisable diffusion model,” arXiv preprint arXiv:2310.02054, 2023
-
[21]
Delve into ppo: Implementation matters for stable rlhf,
R. Zheng, S. Dou, S. Gao, Y. Hua, W. Shen, B. Wang, Y. Liu, S. Jin, Y. Zhou, L. Xiong, et al., “Delve into ppo: Implementation matters for stable rlhf,” in NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023
-
[22]
Direct preference optimization: Your language model is secretly a reward model,
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023
-
[23]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” arXiv preprint arXiv:2307.01952, 2023
-
[24]
Using human feedback to fine-tune diffusion models without any reward model,
K. Yang, J. Tao, J. Lyu, C. Ge, J. Chen, Q. Li, W. Shen, X. Zhu, and X. Li, “Using human feedback to fine-tune diffusion models without any reward model,” arXiv preprint arXiv:2311.13231, 2023
-
[25]
Diffusion-rpo: Aligning diffusion models through relative preference optimization,
Y. Gu, Z. Wang, Y. Yin, Y. Xie, and M. Zhou, “Diffusion-rpo: Aligning diffusion models through relative preference optimization,” arXiv preprint arXiv:2406.06382, 2024
-
[26]
Step-aware preference optimization: Aligning preference with denoising performance at each step,
Z. Liang, Y. Yuan, S. Gu, B. Chen, T. Hang, J. Li, and L. Zheng, “Step-aware preference optimization: Aligning preference with denoising performance at each step,” arXiv preprint arXiv:2406.04314, 2024
-
[27]
Decomposed direct preference optimization for structure-based drug design,
X. Cheng, X. Zhou, Y. Yang, Y. Bao, and Q. Gu, “Decomposed direct preference optimization for structure-based drug design,” CoRR, vol. abs/2407.13981, 2024
-
[28]
Rank analysis of incomplete block designs: I. the method of paired comparisons,
R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. the method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952
- [29]
-
[30]
Huggingface stable diffusion v1.5 model
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “Huggingface stable diffusion v1.5 model.” https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5
-
[31]
Decoupled Weight Decay Regularization
I. Loshchilov, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
VI. APPENDIX
This appendix provides extended insights and results to support the claims made in the main paper.
• In Section VII, we describe the datasets used in our experiments, including PartiPrompts, Pick-a-Pic, and HPD, and showcase images generated fro...
-
[32]
Image Generation: For each caption from our validation datasets, we generate five images using different random seeds for all models (BalancedDPO, SD1.5, and DiffusionDPO).
Fig. 11. Comparison of images generated using BalancedDPO across five different seeds for each prompt. Rows correspond to prompts from the PartiPrompts, Pick-a-Pic, and HPD datas...
-
[33]
Metrics Used: We evaluate generated images using four key metrics: Human Preference Score (HPS), CLIP score, PickScore, and Aesthetics score following DiffusionDPO
-
[34]
Best Score Selection: For each metric and prompt, we select the highest score from the five generated images for all the models. This method mitigates random variations in image generation, ensures each model demonstrates its best performance, and maintains consistency across all models by automating the selection process
-
[35]
Win Rate Calculation: We compute the win rate for each model pair comparison (BalancedDPO vs. SD1.5, BalancedDPO vs. DiffusionDPO, and DiffusionDPO vs. SD1.5). The win rate represents the proportion of times a model’s best score is preferred (i.e., higher) compared to the other model’s best score for the same prompt and metric.
ETHICS
While our model...
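The evaluation steps listed above (five seeds per prompt, best score per metric, then a pairwise win rate) can be sketched as follows. The dictionary layout, the function name, and the rule that ties count as non-wins are assumptions made for illustration, and the scores are made-up numbers.

```python
from typing import Dict, List


def win_rate(
    model_scores: Dict[str, List[float]],
    other_scores: Dict[str, List[float]],
) -> float:
    """Fraction of shared prompts where the model's best-of-seeds score is higher.

    Each dict maps a prompt to the per-seed scores of one metric.
    Assumption: a tie between best scores is not counted as a win.
    """
    prompts = sorted(model_scores.keys() & other_scores.keys())
    if not prompts:
        raise ValueError("no shared prompts to compare")
    wins = sum(1 for p in prompts if max(model_scores[p]) > max(other_scores[p]))
    return wins / len(prompts)


# Made-up scores for one metric: three prompts, five seeds each.
scores_model = {
    "p1": [0.20, 0.50, 0.40, 0.30, 0.10],
    "p2": [0.60, 0.10, 0.20, 0.30, 0.40],
    "p3": [0.90, 0.10, 0.20, 0.30, 0.40],
}
scores_baseline = {
    "p1": [0.45, 0.10, 0.20, 0.30, 0.40],
    "p2": [0.70, 0.10, 0.20, 0.30, 0.40],
    "p3": [0.80, 0.10, 0.20, 0.30, 0.40],
}

rate = win_rate(scores_model, scores_baseline)
print(round(rate, 3))  # 0.667
```

Taking the maximum over seeds compares each model's best sample, not its typical one, so a win rate computed this way may differ from one based on per-seed or mean scores.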