pith. machine review for the scientific record.

arxiv: 2604.24953 · v2 · submitted 2026-04-27 · 💻 cs.CV · cs.AI

Recognition: unknown

ViPO: Visual Preference Optimization at Scale

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 04:07 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords visual preference optimization · Poly-DPO · preference dataset · image generation · video generation · diffusion models · noise robustness · data quality

The pith

Extending direct preference optimization with a polynomial term handles noisy visual data, and a new million-pair dataset makes the extension unnecessary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that visual preference optimization cannot scale using existing noisy datasets because conflicting signals in winners and losers prevent models from learning consistent preferences. To fix this, the authors add a polynomial term to the standard objective so the training dynamically lowers or raises confidence depending on how noisy the data is. They also release a new dataset of one million high-resolution image pairs and three hundred thousand video pairs built with strong generative models and varied prompts to create balanced, reliable signals. On this cleaner data the polynomial adjustment is not needed and the method reduces to ordinary direct preference optimization, while on messy public datasets the adjustment produces clear gains. If correct, this means future work can focus on curating better data to simplify training, yet still have a tool that works when data remains imperfect.

Core claim

The paper claims that Poly-DPO extends the direct preference optimization objective with an additional polynomial term that dynamically adjusts model confidence based on dataset characteristics, enabling effective learning across diverse data distributions. When applied to the authors' ViPO dataset of one million 1024-pixel image pairs across five categories and three hundred thousand 720p+ video pairs across three categories, the optimal configuration converges to standard direct preference optimization. This convergence validates that the dataset supplies reliable preference signals and that the polynomial term is adaptive, becoming unnecessary with high-quality data yet remaining useful for imperfect datasets.

What carries the argument

Poly-DPO, the extension of direct preference optimization that adds a polynomial term to dynamically adjust model confidence according to dataset noise and characteristics.
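
The paper's exact polynomial form is not reproduced on this page, so the following is a minimal sketch, assuming a PolyLoss-style correction (Leng et al.) added to the logistic DPO term; the names `poly_dpo_loss`, `alpha`, and the degree `k` are illustrative assumptions, not the authors' notation.

```python
import torch
import torch.nn.functional as F

def poly_dpo_loss(delta: torch.Tensor, beta: float = 0.1,
                  alpha: float = 0.0, k: int = 1) -> torch.Tensor:
    """delta: implicit reward gap (winner minus loser) per preference pair."""
    p = torch.sigmoid(beta * delta)        # model's confidence in the winner
    dpo = -F.logsigmoid(beta * delta)      # standard DPO term
    poly = alpha * (1.0 - p) ** k          # confidence-modulating correction
    return (dpo + poly).mean()             # alpha = 0 recovers plain DPO
```

Setting alpha to zero recovers ordinary DPO, which is the convergence behavior the paper reports on the ViPO dataset; a non-zero alpha reshapes the gradient as confidence grows, which is the claimed source of noise robustness.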

If this is right

  • On noisy datasets such as Pick-a-Pic V2 the method produces gains of 6.87 and 2.32 on GenEval over Diffusion-DPO for SD1.5 and SDXL respectively.
  • Models trained on the new dataset outperform those trained on existing open-source preference collections.
  • The approach works across multiple visual generation models.
  • When data quality is high, training reduces to standard direct preference optimization with no loss in final performance.
  • Balanced category and prompt distributions support more consistent preference learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Data quality may be the main limit on scaling preference optimization for images and video, with algorithmic changes serving mostly as a temporary fix.
  • The observed convergence suggests that more advanced preference methods will be most helpful while datasets are still being improved.
  • The same polynomial adjustment could be tested on other generative tasks or architectures to check whether the pattern of simplification on clean data holds more generally.
  • Applying the method to real user-collected preferences at even larger scale could reveal whether the adjustment creates subtle biases not seen in current benchmarks.

Load-bearing premise

The assumption that the constructed dataset supplies reliable and unbiased preference signals through the use of advanced generative models and diverse prompts, and that the polynomial term adjusts confidence without introducing new biases.

What would settle it

Training models on the high-quality dataset and finding that a non-trivial polynomial coefficient still improves performance over standard direct preference optimization would contradict the reported convergence to the basic method.
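
A minimal sketch of that experiment, assuming a hypothetical `train_and_eval` helper that stands in for the full fine-tuning and benchmark pipeline (the sweep values are illustrative):

```python
def train_and_eval(dataset: str, alpha: float) -> float:
    # Hypothetical stand-in: fine-tune with polynomial coefficient `alpha`
    # on `dataset` and return a benchmark score (e.g., GenEval or HPSv2.1).
    raise NotImplementedError

alphas = [-0.5, -0.1, 0.0, 0.1, 0.5]  # illustrative coefficient sweep
scores = {a: train_and_eval("ViPO-Image-1M", a) for a in alphas}
best_alpha = max(scores, key=scores.get)
# A clearly non-zero best_alpha that beats alpha = 0 across seeds would
# contradict the reported convergence to standard DPO.
```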

Figures

Figures reproduced from arXiv: 2604.24953 by Chen Chen, Jie Wu, Justin Cui, Ming Li, Rui Wang, Xiaojie Li.

Figure 1. (a) Preference scaling with our Poly-DPO and ViPO-Image-1M dataset. (b) When training on a… view at source ↗
Figure 2. Overview of our ViPO-Image-1M and ViPO-Video-300K dataset. view at source ↗
Figure 3. Summary of our Poly-DPO. By adjusting only one hyperparameter and introducing only two… view at source ↗
Figure 4. Ablation studies with different α on datasets with varying noise properties. While only the HPSv2.1 score is visualized for clarity, a similar trend is observed across all other evaluation metrics. view at source ↗
Figure 5. Performance Comparison between VLM and Human Raters. Accuracy (or Agreement Rate) is defined as the frequency with which a choice aligns with the consensus label (majority vote among human raters, excluding VLM predictions). (a) Overall: The VLM (81.2%) demonstrates higher consistency with the consensus than the average individual human annotator (74.7%). (b) By Modality: The VLM significantly outperforms… view at source ↗
Figure 6. ViPO-Image-1M and ViPO-Video-300K dataset visualization. view at source ↗
Figure 7. Gradient magnitude of Poly-DPO loss with respect to logits under different… view at source ↗
Figure 8. Distribution of human rater accuracy on our ViPO datasets. view at source ↗
Figure 9. The annotation interface; rater IDs are utilized strictly for tracking and resuming management to guarantee a fully anonymous evaluation process. (Per-rater accuracy: mean 87.2%, median 87.6%.) view at source ↗
Figure 10. Training dynamics of Poly-DPO and Diffusion-DPO on the Pick-a-Pic V2 dataset. Both methods exhibit high training stability, with evaluation metrics steadily increasing to convergence without any signs of model collapse. view at source ↗
Original abstract

While preference optimization is crucial for improving visual generative models, how to effectively scale this paradigm remains largely unexplored. Current open-source preference datasets contain conflicting preference patterns, where winners excel in some dimensions but underperform in others. Naively optimizing on such noisy datasets fails to learn preferences, hindering effective scaling. To enhance robustness against noise, we propose Poly-DPO, which extends the DPO objective with an additional polynomial term that dynamically adjusts model confidence based on dataset characteristics, enabling effective learning across diverse data distributions. Beyond biased patterns, existing datasets suffer from low resolution, limited prompt diversity, and imbalanced distributions. To facilitate large-scale visual preference optimization by tackling data bottlenecks, we construct ViPO, a massive-scale preference dataset with 1M image pairs at 1024px across five categories and 300K video pairs at 720p+ across three categories. State-of-the-art generative models and diverse prompts ensure reliable preference signals with balanced distributions. Remarkably, when applying Poly-DPO to our high-quality dataset, the optimal configuration converges to standard DPO. This convergence validates dataset quality and Poly-DPO's adaptive nature: sophisticated optimization becomes unnecessary with sufficient data quality, yet remains valuable for imperfect datasets. We validate our approach across visual generation models. On noisy datasets like Pick-a-Pic V2, Poly-DPO achieves 6.87 and 2.32 gains over Diffusion-DPO on GenEval for SD1.5 and SDXL, respectively. For ViPO, models achieve performance far exceeding those trained on existing open-source preference datasets. These results confirm that addressing both algorithmic adaptability and data quality is essential for scaling visual preference optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Poly-DPO, an extension of Direct Preference Optimization (DPO) that augments the standard loss with a polynomial term to dynamically modulate model confidence according to dataset noise characteristics, thereby improving robustness on conflicting or low-quality preference data. It introduces ViPO, a new large-scale dataset with 1M image preference pairs at 1024px resolution across five categories and 300K video pairs at 720p+ across three categories, constructed using state-of-the-art generative models and diverse prompts to produce balanced, reliable signals. The central empirical claim is that the optimal Poly-DPO configuration on ViPO converges to vanilla DPO, which the authors interpret as simultaneous validation of ViPO's high quality and the method's adaptivity; they further report that Poly-DPO yields gains of 6.87 and 2.32 on GenEval over Diffusion-DPO for SD1.5 and SDXL respectively when trained on noisier sets such as Pick-a-Pic V2, and that models trained on ViPO substantially outperform those trained on prior open-source preference datasets.

Significance. If the convergence result and performance claims are rigorously substantiated, the work would offer both a practically useful adaptive algorithm for visual preference optimization and a substantial new dataset that could accelerate scaling of generative models. The diagnostic interpretation of convergence as a data-quality signal is novel and potentially generalizable, provided the polynomial term and labeling process are made transparent.

major comments (3)
  1. [Dataset construction] Dataset construction section: the assertion that 'state-of-the-art generative models and diverse prompts ensure reliable preference signals with balanced distributions' is presented without any description of the labeling procedure (human annotation, VLM judge, or model-derived pairwise comparison) or supporting statistics such as agreement rates, bias audits, or prompt-source verification. This detail is load-bearing for the claim that convergence to standard DPO validates dataset quality rather than merely indicating that the polynomial term remains inactive.
  2. [Experiments] Experiments section: the statement that 'the optimal configuration converges to standard DPO' on ViPO does not report the explicit polynomial degree, the learned coefficients, or an ablation demonstrating that the additional term's contribution approaches zero. Without these quantities it is impossible to distinguish genuine adaptivity from optimization dynamics that simply suppress the extra term regardless of data quality.
  3. [Evaluation] Evaluation on noisy datasets: the reported GenEval gains of 6.87 (SD1.5) and 2.32 (SDXL) over Diffusion-DPO are given without variance across random seeds, statistical significance tests, or the precise hyper-parameter settings used for the polynomial term, rendering it difficult to attribute the improvements specifically to Poly-DPO rather than to broader tuning differences.
minor comments (2)
  1. [Abstract] The abstract lists 'five categories' for images and 'three categories' for videos but does not enumerate them; adding this information would improve immediate readability.
  2. [Method] Notation for the polynomial coefficients in the Poly-DPO objective could be clarified with an explicit table or definition to avoid ambiguity when comparing to the standard DPO loss.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback and the opportunity to clarify and strengthen our manuscript. We address each major comment point by point below. Where the comments identify gaps in transparency or rigor, we agree that revisions are warranted and will update the manuscript accordingly.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the assertion that 'state-of-the-art generative models and diverse prompts ensure reliable preference signals with balanced distributions' is presented without any description of the labeling procedure (human annotation, VLM judge, or model-derived pairwise comparison) or supporting statistics such as agreement rates, bias audits, or prompt-source verification. This detail is load-bearing for the claim that convergence to standard DPO validates dataset quality rather than merely indicating that the polynomial term remains inactive.

    Authors: We agree that explicit details on the labeling procedure are necessary to substantiate the dataset quality claims and the interpretation of the convergence result. The initial manuscript focused on the generative models and prompt diversity but omitted the full labeling pipeline. In the revised version we will expand the Dataset construction section to describe the complete labeling process (including the use of VLM-based pairwise comparisons with human verification steps), report agreement rates, bias audits, and prompt-source verification statistics. This addition will directly support the claim that convergence reflects high data quality. revision: yes

  2. Referee: [Experiments] Experiments section: the statement that 'the optimal configuration converges to standard DPO' on ViPO does not report the explicit polynomial degree, the learned coefficients, or an ablation demonstrating that the additional term's contribution approaches zero. Without these quantities it is impossible to distinguish genuine adaptivity from optimization dynamics that simply suppress the extra term regardless of data quality.

    Authors: We concur that additional quantitative details are required to demonstrate the adaptive behavior of Poly-DPO. In the revised manuscript we will report the polynomial degree of the optimal configuration on ViPO, the learned coefficient values, and include a dedicated ablation study quantifying the contribution of the polynomial term (showing it approaches zero). These additions will clarify that the observed convergence is due to the method's adaptivity on high-quality data rather than generic suppression of the extra term. revision: yes

  3. Referee: [Evaluation] Evaluation on noisy datasets: the reported GenEval gains of 6.87 (SD1.5) and 2.32 (SDXL) over Diffusion-DPO are given without variance across random seeds, statistical significance tests, or the precise hyper-parameter settings used for the polynomial term, rendering it difficult to attribute the improvements specifically to Poly-DPO rather than to broader tuning differences.

    Authors: We thank the referee for identifying the need for greater statistical rigor. In the revised manuscript we will augment the evaluation section with results across multiple random seeds (including standard deviations), statistical significance tests (such as paired t-tests), and the exact hyper-parameter settings for the polynomial term in Poly-DPO. These updates will strengthen the attribution of the reported gains to the proposed method. revision: yes
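
As a sketch of the promised analysis (not the authors' code), the paired comparison could be as simple as the following, where each list holds one benchmark score per matched random seed:

```python
from scipy.stats import ttest_rel

def paired_seed_test(poly_scores: list[float], baseline_scores: list[float]):
    """Paired t-test over per-seed benchmark scores under matched seeds."""
    t_stat, p_value = ttest_rel(poly_scores, baseline_scores)
    return t_stat, p_value
```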

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

Full rationale

The paper introduces Poly-DPO as a direct extension of the standard DPO objective via an added polynomial term for noise robustness, then reports an empirical observation that the optimal hyperparameter configuration on the new ViPO dataset reduces to vanilla DPO. This observation is used to support claims about data quality and method adaptivity, but it does not constitute a self-definitional loop, a fitted parameter renamed as a prediction, or any load-bearing self-citation. No uniqueness theorems, ansatzes smuggled via prior work, or renamings of known results are invoked. The dataset construction, experimental comparisons on noisy versus high-quality sets, and performance gains provide independent empirical content outside the convergence claim itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claims rest on the quality of the newly constructed dataset and the effectiveness of the proposed polynomial extension, with the abstract providing limited independent verification of these elements.

free parameters (1)
  • parameters of the polynomial term
    The additional polynomial term in Poly-DPO likely involves parameters that are adjusted based on dataset characteristics to dynamically change model confidence.
axioms (1)
  • [domain assumption] The base DPO objective is a valid way to optimize preferences from paired data.
    The paper builds directly on DPO, assuming its foundational validity for preference learning; the standard objective is reproduced below for reference.
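
For reference, the standard DPO objective that this axiom assumes (Rafailov et al., 2023) is shown below, with y_w and y_l the preferred and dispreferred samples, π_ref a frozen reference model, and β scaling the implicit reward; Diffusion-DPO applies the same form using a denoising-loss surrogate for the log-likelihood ratios.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[
      \log\sigma\left(
        \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
        -\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
      \right)\right]
```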

pith-pipeline@v0.9.0 · 5604 in / 1315 out tokens · 94840 ms · 2026-05-08T04:07:01.607295+00:00 · methodology

