Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models
Pith reviewed 2026-05-21 04:56 UTC · model grok-4.3
The pith
Linear-DPO replaces the sigmoid utility in standard DPO with a linear function and an EMA reference model to better align diffusion and flow-matching image generators.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By generalizing the DPO objective via a unified reverse-time SDE framework to cover both diffusion and flow-matching, and replacing the aggressive sigmoid-based utility with a sustained linear utility while incorporating an EMA-updated reference model, Linear-DPO achieves superior performance over existing baselines in aligning generative models for text-to-image generation.
What carries the argument
Linear-DPO objective that uses a sustained linear utility function instead of sigmoid and an EMA-updated reference model to carry out preference optimization under the unified reverse-time SDE framework.
If this is right
- Linear-DPO applies to both diffusion models like SD1.5 and SDXL and flow-matching models like SD3-Medium with reported gains.
- The linear utility avoids the gradient issues of sigmoid in continuous generative settings.
- EMA reference model updates support stable training throughout the preference optimization process.
- The approach reduces objective mismatch when moving from discrete NLP DPO to regression-based image generation.
Where Pith is reading between the lines
- Linear-DPO might improve alignment stability when preference data contains label noise common in image ratings.
- The same linear utility change could be tested in other continuous generative settings such as audio or video synthesis.
- Combining Linear-DPO with existing reward models might further boost performance on standard text-to-image benchmarks.
- The unified SDE view opens a path to apply similar preference objectives to newer flow-based architectures.
Load-bearing premise
The unified reverse-time SDE framework accurately generalizes the DPO objective to both diffusion and flow-matching without introducing new objective mismatch for regression-based generative tasks.
What would settle it
Training Linear-DPO on SDXL or SD3-Medium and measuring no gain or a loss in human preference win rates or FID scores compared to standard DPO on the same preference dataset would falsify the claimed superiority.
Figures
read the original abstract
Direct Preference Optimization (DPO) is successful for alignment in LLMs but still faces challenges in text-to-image generation. Existing studies are confined to denoising diffusion models while overlooking flow-matching, and suffer from an objective mismatch when applying discrete NLP-based DPO to regression-based generative tasks.\ In this paper, we derive a generalized DPO objective that covers both diffusion and flow-matching via a unified reverse-time SDE framework, and point out from a gradient perspective that the standard DPO objective is suboptimal for text-to-image generation. Consequently, we propose Linear-DPO, which replaces the aggressive sigmoid-based utility function with a sustained linear utility and incorporates an EMA-updated reference model. Qualitative and quantitative experiments on diffusion models (SD1.5, SDXL) and flow-matching model (SD3-Medium) demonstrate the superiority of our approach over existing baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper derives a generalized DPO objective for diffusion and flow-matching models via a unified reverse-time SDE framework. It argues from a gradient perspective that the standard sigmoid-based DPO utility is suboptimal for text-to-image regression tasks, and proposes Linear-DPO which substitutes a linear utility function and adds an EMA-updated reference model. Experiments on SD1.5, SDXL, and SD3-Medium are reported to show superiority over baselines.
Significance. If the gradient analysis and generalization hold without introducing new objective mismatch, the work would usefully extend preference optimization to continuous generative models and address a practical limitation of applying LLM-style DPO to image synthesis. The coverage of flow-matching alongside diffusion and the EMA reference are practical strengths; reproducible experiments on three distinct model families add value.
major comments (2)
- [§3] §3 (unified reverse-time SDE derivation): the manuscript must explicitly derive the gradient of the proposed objective with respect to the denoising network parameters and show where the sigmoid saturates in a manner that harms the regression loss on continuous latents; without this step the suboptimality claim for the diffusion/flow-matching regime remains unverified.
- [§4.2] §4.2 (Linear-DPO formulation): the replacement of the sigmoid by a linear utility together with the EMA reference update is load-bearing for the central claim, yet the paper provides no ablation isolating the contribution of each change nor a direct comparison of the resulting gradients against standard DPO under the same reference-model schedule.
minor comments (2)
- [Table 1] Table 1 and Figure 3: quantitative metrics (e.g., CLIP score, human preference rates) should be reported with standard deviations and statistical significance tests to support the superiority statements.
- [Notation] Notation: the symbol for the EMA reference model should be introduced once and used consistently; currently the text alternates between r_EMA and r_ref without explicit definition.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating the revisions we will incorporate to strengthen the manuscript.
read point-by-point responses
-
Referee: [§3] §3 (unified reverse-time SDE derivation): the manuscript must explicitly derive the gradient of the proposed objective with respect to the denoising network parameters and show where the sigmoid saturates in a manner that harms the regression loss on continuous latents; without this step the suboptimality claim for the diffusion/flow-matching regime remains unverified.
Authors: We agree that an explicit derivation of the gradient with respect to the denoising network parameters would make the suboptimality argument more rigorous. Section 3 currently presents a gradient-based perspective on the limitations of the sigmoid utility for continuous regression tasks, but we acknowledge that the derivation steps are not fully expanded. In the revision we will add the complete gradient derivation under the unified reverse-time SDE framework and explicitly highlight the saturation regime of the sigmoid and its effect on the regression loss for continuous latents. revision: yes
-
Referee: [§4.2] §4.2 (Linear-DPO formulation): the replacement of the sigmoid by a linear utility together with the EMA reference update is load-bearing for the central claim, yet the paper provides no ablation isolating the contribution of each change nor a direct comparison of the resulting gradients against standard DPO under the same reference-model schedule.
Authors: We recognize that isolating the linear utility and EMA reference components would strengthen the central claim. The experiments in Section 4.2 compare the full Linear-DPO objective against baselines, but do not contain separate ablations or gradient comparisons under matched reference schedules. We will add these ablations and the requested gradient comparison in the revised version. revision: yes
Circularity Check
Derivation of generalized DPO objective and Linear-DPO proposal is self-contained with no reductions to inputs by construction.
full rationale
The paper first derives a generalized DPO objective covering diffusion and flow-matching via a unified reverse-time SDE framework, then identifies suboptimality of the standard sigmoid-based DPO from a gradient perspective for text-to-image regression tasks, and proposes Linear-DPO as a replacement using sustained linear utility plus EMA-updated reference. No quoted equations or steps reduce the claimed predictions or uniqueness to fitted parameters, self-citations, or ansatzes by construction. The central claims rest on the independent SDE unification and gradient analysis rather than re-labeling or self-referential inputs, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Unified reverse-time SDE framework covers both diffusion and flow-matching models for DPO derivation
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we replace this sigmoid gating with a smoother linear utility ulinear(x) = 0.2x + 0.5 ... ω′(ΔDθ) = clip(ulinear(β̄ΔDθ), η, 1)
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
the optimization of L(θ) can be interpreted as a form of weighted supervised fine-tuning ... modulated by the weighting function ω(ΔDθ) := β̄ σ(β̄ΔDθ)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Sdxl: Improving latent diffusion models for high-resolution image synthesis , author=. arXiv preprint arXiv:2307.01952 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[3]
Advances in neural information processing systems , volume=
Variational diffusion models , author=. Advances in neural information processing systems , volume=
-
[4]
Proximal Policy Optimization Algorithms
Proximal policy optimization algorithms , author=. arXiv preprint arXiv:1707.06347 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Advances in Neural Information Processing Systems , volume=
Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models , author=. Advances in Neural Information Processing Systems , volume=
-
[6]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Advances in Neural Information Processing Systems , volume=
Simpo: Simple preference optimization with a reference-free reward , author=. Advances in Neural Information Processing Systems , volume=
-
[8]
Kolors: Effective training of diffusion model for photorealistic text-to-image synthesis , author=. arXiv preprint , year=
-
[9]
Stochastic Processes and their Applications , volume=
Reverse-time diffusion equation models , author=. Stochastic Processes and their Applications , volume=. 1982 , publisher=
work page 1982
-
[10]
Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding , author=. arXiv preprint arXiv:2405.08748 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
International conference on machine learning , pages=
Deep unsupervised learning using nonequilibrium thermodynamics , author=. International conference on machine learning , pages=. 2015 , organization=
work page 2015
-
[12]
Advances in neural information processing systems , volume=
Denoising diffusion probabilistic models , author=. Advances in neural information processing systems , volume=
-
[13]
Score-Based Generative Modeling through Stochastic Differential Equations
Score-based generative modeling through stochastic differential equations , author=. arXiv preprint arXiv:2011.13456 , year=
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[14]
Flow Matching for Generative Modeling
Flow matching for generative modeling , author=. arXiv preprint arXiv:2210.02747 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Flow straight and fast: Learning to generate and transfer data with rectified flow , author=. arXiv preprint arXiv:2209.03003 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Building Normalizing Flows with Stochastic Interpolants
Building normalizing flows with stochastic interpolants , author=. arXiv preprint arXiv:2209.15571 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Advances in neural information processing systems , volume=
Elucidating the design space of diffusion-based generative models , author=. Advances in neural information processing systems , volume=
-
[18]
International Conference on Medical image computing and computer-assisted intervention , pages=
U-net: Convolutional networks for biomedical image segmentation , author=. International Conference on Medical image computing and computer-assisted intervention , pages=. 2015 , organization=
work page 2015
-
[19]
Training Diffusion Models with Reinforcement Learning
Training diffusion models with reinforcement learning , author=. arXiv preprint arXiv:2305.13301 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023 , year=
Reinforcement learning for fine-tuning text-to-image diffusion models , author=. Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023 , year=
work page 2023
-
[21]
Advances in neural information processing systems , volume=
Generative modeling by estimating gradients of the data distribution , author=. Advances in neural information processing systems , volume=
-
[22]
Forty-first international conference on machine learning , year=
Scaling rectified flow transformers for high-resolution image synthesis , author=. Forty-first international conference on machine learning , year=
-
[23]
Advances in neural information processing systems , volume=
Deep reinforcement learning from human preferences , author=. Advances in neural information processing systems , volume=
-
[24]
Qwen-image technical report , author=. arXiv preprint arXiv:2508.02324 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
Stochastic interpolants: A unifying framework for flows and diffusions , author=. arXiv preprint arXiv:2303.08797 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
Advances in neural information processing systems , volume=
Pick-a-pic: An open dataset of user preferences for text-to-image generation , author=. Advances in neural information processing systems , volume=
-
[27]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Hpsv3: Towards wide-spectrum human preference score , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
- [28]
-
[29]
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Scaling autoregressive models for content-rich text-to-image generation , author=. arXiv preprint arXiv:2206.10789 , volume=
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis , author=. arXiv preprint arXiv:2306.09341 , year=
work page internal anchor Pith review Pith/arXiv arXiv
- [31]
-
[32]
International conference on machine learning , pages=
Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=
work page 2021
-
[33]
Advances in Neural Information Processing Systems , volume=
Imagereward: Learning and evaluating human preferences for text-to-image generation , author=. Advances in Neural Information Processing Systems , volume=
-
[34]
Classifier-Free Diffusion Guidance
Classifier-free diffusion guidance , author=. arXiv preprint arXiv:2207.12598 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[35]
GitHub repository , howpublished =
Patrick von Platen and Suraj Patil and Anton Lozhkov and Pedro Cuenca and Nathan Lambert and Kashif Rasul and Mishig Davaadorj and Dhruv Nair and Sayak Paul and William Berman and Yiyi Xu and Steven Liu and Thomas Wolf , title =. GitHub repository , howpublished =. 2022 , publisher =
work page 2022
-
[36]
Advances in neural information processing systems , volume=
Direct preference optimization: Your language model is secretly a reward model , author=. Advances in neural information processing systems , volume=
-
[37]
the method of paired comparisons , author=
Rank analysis of incomplete block designs: I. the method of paired comparisons , author=. Biometrika , volume=. 1952 , publisher=
work page 1952
-
[38]
Journal of Risk and uncertainty , volume=
Advances in prospect theory: Cumulative representation of uncertainty , author=. Journal of Risk and uncertainty , volume=. 1992 , publisher=
work page 1992
-
[39]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Diffusion model alignment using direct preference optimization , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[40]
European Conference on Computer Vision , pages=
Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers , author=. European Conference on Computer Vision , pages=. 2024 , organization=
work page 2024
-
[41]
Denoising Diffusion Implicit Models
Denoising diffusion implicit models , author=. arXiv preprint arXiv:2010.02502 , year=
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[42]
Flow-GRPO: Training Flow Matching Models via Online RL
Flow-grpo: Training flow matching models via online rl , author=. arXiv preprint arXiv:2505.05470 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[43]
DanceGRPO: Unleashing GRPO on Visual Generation
DanceGRPO: Unleashing GRPO on Visual Generation , author=. arXiv preprint arXiv:2505.07818 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[44]
Advances in Neural Information Processing Systems , volume=
Aligning diffusion models by optimizing human utility , author=. Advances in Neural Information Processing Systems , volume=
-
[45]
The Thirteenth International Conference on Learning Representations , year=
DSPO: Direct score preference optimization for diffusion model alignment , author=. The Thirteenth International Conference on Learning Representations , year=
-
[46]
arXiv preprint arXiv:2507.07510 , year=
Divergence minimization preference optimization for diffusion model alignment , author=. arXiv preprint arXiv:2507.07510 , year=
-
[47]
First Workshop on Scalable Optimization for Efficient and Adaptive Foundation Models , year=
Margin-aware preference optimization for aligning diffusion models without reference , author=. First Workshop on Scalable Optimization for Efficient and Adaptive Foundation Models , year=
-
[48]
Diffusion-SDPO: Safeguarded Direct Preference Optimization for Diffusion Models , author=
-
[49]
International Conference on Machine Learning , pages=
Sequence tutor: Conservative fine-tuning of sequence generation models with kl-control , author=. International Conference on Machine Learning , pages=. 2017 , organization=
work page 2017
-
[50]
Human-centric dialog training via offline reinforcement learning , author=. Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) , pages=
work page 2020
-
[51]
Decoupled Weight Decay Regularization
Decoupled weight decay regularization , author=. arXiv preprint arXiv:1711.05101 , year=
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.