arxiv: 2606.02521 · v3 · pith:FQZWKAQPnew · submitted 2026-06-01 · 💻 cs.LG · cs.CV

Drifting Preference Optimization for One-Step Generative Models

Pith reviewed 2026-06-28 15:18 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords preference optimization · one-step generative models · text-to-image alignment · DrPO · reward ranking · finetuning · black-box rewards

The pith

DrPO aligns one-step text-to-image generators by synthesizing a dipole preference field from reward rankings alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Drifting Preference Optimization to finetune deterministic one-step generators without relying on policy likelihoods, denoising paths, or differentiable reward gradients. For each prompt it draws multiple samples from the current model, ranks them with any target reward, and constructs an update direction in feature space as a non-parametric dipole between high- and low-scoring examples plus a drift term taken from the frozen base generator. This direction becomes a detached regression target, so the reward model is used only for ranking and never back-propagated. A reader would care because the approach keeps inference at a single forward pass while allowing large or black-box rewards and cutting training cost.
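
The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the helper names (`dipole_plus_drift`, `regression_grad`), the uniform weighting of pairs, the mean-feature form of the reference drift, and the `ref_weight` parameter are all hypothetical, and any kernel weighting or self-repulsion term the paper uses is omitted.

```python
import random

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def scale(a, s):
    return [x * s for x in a]

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dipole_plus_drift(x, feats, rewards, ref_feats, n_pairs=8, ref_weight=0.5, rng=None):
    """Toy update direction for one sample feature `x` (hypothetical form).

    Dipole: average of (winner - loser) over random pairs, where the winner is
    the sample with the higher reward. Rewards are only compared, never
    differentiated.
    Drift: pull from `x` toward the mean feature of the frozen base generator.
    """
    rng = rng or random.Random(0)
    dipoles = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(feats)), 2)
        if rewards[i] == rewards[j]:
            continue  # a tie carries no preference
        pos, neg = (i, j) if rewards[i] > rewards[j] else (j, i)
        dipoles.append(sub(feats[pos], feats[neg]))
    dipole = mean(dipoles) if dipoles else [0.0] * len(x)
    drift = sub(mean(ref_feats), x)
    return add(dipole, scale(drift, ref_weight))

def regression_grad(x, target):
    """Gradient of 0.5 * ||x - target||^2 w.r.t. x, with `target` held constant."""
    return sub(x, target)

# Hypothetical 2-D features for four samples of one prompt.
feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rewards = [0.1, 0.7, 0.2, 0.9]      # black-box scores, used only for comparison
ref_feats = [[0.5, 0.5], [0.5, 0.7]]  # features from the frozen base generator

x = feats[0]
direction = dipole_plus_drift(x, feats, rewards, ref_feats)
target = add(x, direction)          # detached target: treated as a constant
grad = regression_grad(x, target)   # equals -direction
```

Because the target is held constant, the gradient of the regression loss with respect to the feature is exactly the negative of the synthesized direction, so a gradient-descent step moves the feature along that direction without any gradient flowing through the reward.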

Core claim

DrPO constructs a non-parametric dipole preference field from ranked samples together with a reference drift from the frozen base generator; the resulting feature-space direction is regressed onto the one-step model through a detached target, enabling preference alignment when the reward supplies only ranking information.

What carries the argument

The non-parametric dipole preference field plus reference drift, which supplies a stable feature-space update direction derived solely from reward rankings.

If this is right

  • Alignment metrics improve over reward-gradient-free one-step baselines on HPSv3 and GenEval.
  • HPSv3 training computation falls by 3.51× because reward-model backpropagation is eliminated.
  • Any black-box or non-differentiable reward can be used since it supplies only ranking.
  • Inference cost remains one generator forward pass after training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sample-based gradient synthesis might transfer to other deterministic generators beyond text-to-image.
  • Offline variants could pre-compute dipole fields from existing ranked datasets without online sampling.
  • The detached regression target may reduce sensitivity to reward scale compared with direct gradient methods.
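
One way to see the reward-scale point in the last bullet: a signal built from rankings alone depends on the reward only through its ordering, so any strictly increasing transform of the scores leaves it unchanged. The sketch below checks this on made-up scores; `rank_order` is a hypothetical helper, and whether DrPO is less scale-sensitive in practice remains the conjecture stated above.

```python
import math

def rank_order(rewards):
    """Indices sorted from highest to lowest reward (ties broken by index)."""
    return sorted(range(len(rewards)), key=lambda i: (-rewards[i], i))

raw = [0.2, 1.5, -0.3, 0.9]                 # hypothetical reward scores
scaled = [100.0 * r + 7.0 for r in raw]     # affine rescaling
squashed = [math.tanh(r) for r in raw]      # nonlinear but strictly increasing

order = rank_order(raw)
same_after_scaling = rank_order(scaled) == order
same_after_squashing = rank_order(squashed) == order
```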

Load-bearing premise

The dipole field synthesized from ranked samples and the reference drift together give a stable optimization signal for the deterministic one-step generator.
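
As a sanity check on this premise (an illustrative calculation, not a derivation from the paper), suppose the reward is linear in feature space, $r(f) = w^\top f + b$, so that its gradient is $w$. The symbols $f^{+}$, $f^{-}$, $v_{\mathrm{ref}}$, and $\lambda$ are notation assumed here for the higher-ranked feature, the lower-ranked feature, the reference drift, and its weight.

```latex
\begin{align*}
d &= f^{+} - f^{-}, \qquad r(f^{+}) > r(f^{-}) \\
\langle d, \nabla r \rangle &= w^\top (f^{+} - f^{-}) = r(f^{+}) - r(f^{-}) > 0 \\
\langle d + \lambda v_{\mathrm{ref}}, \nabla r \rangle &= \bigl(r(f^{+}) - r(f^{-})\bigr) + \lambda\, w^\top v_{\mathrm{ref}}
\end{align*}
```

Under this assumption every ranked pair, and hence any positively weighted average of pairs, has a positive inner product with the reward gradient. The combined direction can still stop ascending when $\lambda\, w^\top v_{\mathrm{ref}} \le -\bigl(r(f^{+}) - r(f^{-})\bigr)$, and the argument does not carry over to nonlinear rewards without further assumptions.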

What would settle it

Run DrPO on SD-Turbo with a non-differentiable reward and measure whether alignment metrics rise above the reward-gradient-free baselines while training FLOPs drop by roughly 3.5× under matched effective batch size.

Figures

Figures reproduced from arXiv: 2606.02521 by Yandong Wen, Zhen Liu, Zhou Jiang.

Figure 1: Overview of DrPO. Left: construction of the drift field used to finetune the network. For an on-policy sample (black dot), N random pairs are collected. In each pair, the sample with the higher reward is labeled as positive (red dot), while the other is labeled as negative (blue dot). Together with the self-repulsion and attraction forces induced by $p_\theta$ and $p_{\mathrm{ref}}$, respectively, these components synthesize th…

Figure 3: Reward curves on SD-Turbo

Figure 2: Qwen3-VL pairwise preference evaluation. For each prompt, Qwen3-VL compares two matched generations under…

Figure 5: Efficiency comparison on HPSv3. Efficiency with large reward models

Figure 4: Qualitative comparison on Pick-a-Pic v2 prompts for SDXL-Turbo. Rows compare the SDXL-Turbo base, one-step…

Figure 9: Offline DrPO Convergence. Ablation on reference regularization

Figure 6: Qualitative headline comparison on selected Pick-a-Pic v2 prompts for SD-Turbo. The grid follows the same…

Figure 7: Qualitative comparison across reward models. Each row generated with identical prompt; metrics in Table…

Figure 10: Effect of the reference term during finetuning.

Figure 8: Generated images for DrPO on GenEval prompts.

Abstract

One-step text-to-image generators are attractive for deployment because they generate an image with a single forward pass, but preference finetuning them remains difficult: standard alignment methods often rely on policy likelihoods, denoising trajectories, differentiable reward gradients, or test-time optimization. We propose Drifting Preference Optimization (DrPO), an online preference-finetuning method for deterministic one-step generators. For each prompt, DrPO samples candidates from the current generator, ranks them with a target reward, and uses high- and low-scoring samples to synthesize a feature-space update direction. The update is a non-parametric dipole preference field plus a reference drift estimated from the frozen base generator, and is optimized through a detached feature-space regression target. The target reward is used only for ranking, so DrPO can train with large, black-box, or non-differentiable rewards while inference remains a single generator call. We evaluate DrPO on SD-Turbo and SDXL-Turbo with multiple target rewards and benchmarks, including HPSv3 and GenEval. DrPO improves alignment over reward-gradient-free one-step preference baselines and reduces HPSv3 training computation by $3.51\times$ under the matched effective-batch setting by removing reward-model backpropagation. Initial offline experiments suggest that sample-based gradient synthesis can also be used beyond online reward ranking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Drifting Preference Optimization (DrPO), an online preference finetuning method for deterministic one-step text-to-image generators such as SD-Turbo and SDXL-Turbo. For each prompt, the method samples candidates from the current model, ranks them using a target reward model, constructs a non-parametric dipole preference field from high- and low-scoring samples together with a reference drift from the frozen base generator, and optimizes the generator via detached feature-space regression. The reward is used solely for ranking, enabling training with large, black-box, or non-differentiable rewards while keeping inference to a single forward pass. Experiments report improved alignment over reward-gradient-free baselines and a 3.51× reduction in HPSv3 training compute under matched effective-batch size by eliminating reward-model backpropagation; limited offline experiments are also mentioned.

Significance. If the central construction is shown to produce a reliable alignment signal, DrPO would offer a practical route to preference tuning of one-step generators without requiring differentiable rewards or backpropagation through the reward model. This would directly support the use of larger or non-differentiable reward functions and yield the reported computational savings. The approach also opens a direction for sample-based gradient synthesis that may extend beyond online ranking.

major comments (3)
  1. [Abstract] The central claim that regression onto the synthesized non-parametric dipole preference field plus reference drift yields an effective update direction for reward maximization lacks any derivation or analysis showing monotonicity or even positive correlation with the true reward gradient in feature space. No equation or section establishes when the dipole direction could be orthogonal or anti-correlated with the reward signal.
  2. [Abstract] Abstract and evaluation sections: quantitative gains are stated on named models and benchmarks, yet the text provides neither error bars nor an ablation isolating the contribution of the dipole construction versus the reference drift or the detached regression target. This leaves the source of the reported alignment improvement and the 3.51× compute reduction unverified.
  3. [Abstract] The statement that DrPO 'reduces HPSv3 training computation by 3.51× under the matched effective-batch setting' requires an explicit definition of how effective batch size is computed once reward-model backpropagation is removed; without it the compute comparison cannot be reproduced or assessed for fairness.
minor comments (2)
  1. [Abstract] The abstract mentions 'initial offline experiments' but does not indicate whether these results appear in the main body or an appendix, or are omitted; a pointer would improve clarity.
  2. [Abstract] Notation for the 'non-parametric dipole preference field' and 'reference drift' is introduced without an accompanying equation or pseudocode block; adding a compact definition would aid readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We respond to each major comment below and describe the revisions we will make.

  1. Referee: [Abstract] The central claim that regression onto the synthesized non-parametric dipole preference field plus reference drift yields an effective update direction for reward maximization lacks any derivation or analysis showing monotonicity or even positive correlation with the true reward gradient in feature space. No equation or section establishes when the dipole direction could be orthogonal or anti-correlated with the reward signal.

    Authors: We agree that a formal analysis is missing from the current manuscript. The dipole construction is motivated by the geometric intuition that the vector between high- and low-reward samples in feature space approximates a direction of reward increase, stabilized by the base-model drift. In the revision we will add a dedicated analysis subsection deriving the expected inner product between the dipole direction and the true reward gradient under a linear reward assumption in feature space, and explicitly discuss conditions (e.g., non-linear reward or insufficient sample diversity) under which the direction could become orthogonal or anti-correlated. revision: yes

  2. Referee: [Abstract] Abstract and evaluation sections: quantitative gains are stated on named models and benchmarks, yet the text provides neither error bars nor an ablation isolating the contribution of the dipole construction versus the reference drift or the detached regression target. This leaves the source of the reported alignment improvement and the 3.51× compute reduction unverified.

    Authors: The current version indeed reports point estimates without error bars and does not contain ablations that isolate the dipole field, reference drift, and detached regression. We will revise the evaluation section to report means and standard deviations over at least three independent runs for all main metrics. We will also add a new ablation table that measures performance when each component (dipole, drift, detached target) is removed or replaced, thereby clarifying the source of both the alignment gains and the compute savings. revision: yes

  3. Referee: [Abstract] The statement that DrPO 'reduces HPSv3 training computation by 3.51× under the matched effective-batch setting' requires an explicit definition of how effective batch size is computed once reward-model backpropagation is removed; without it the compute comparison cannot be reproduced or assessed for fairness.

    Authors: We acknowledge that the abstract statement is not self-contained. In the revised manuscript we will add a precise definition of effective batch size in the experimental setup (and cross-reference it from the abstract), specifying that it is computed from the number of generator forward passes per optimization step once reward-model backpropagation is eliminated, and we will show the arithmetic that yields the 3.51× factor under the matched setting. revision: yes
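
The kind of accounting the third response promises could look roughly like the sketch below. It is a hypothetical cost model, not the paper's: `step_cost` and `cost_ratio` are illustrative names, the per-sample costs are placeholder units rather than measurements, and the example is not meant to reproduce the 3.51× figure. Both settings use the same number of generator samples per optimization step, which is one reading of "matched effective batch".

```python
def step_cost(n_samples, gen_fwd, gen_bwd, rew_fwd, rew_bwd=0.0):
    """Cost of one optimization step in arbitrary units (hypothetical model).

    Every sample pays a generator forward and backward pass and a reward
    forward pass; `rew_bwd` is paid only when gradients flow through the
    reward model.
    """
    return n_samples * (gen_fwd + gen_bwd + rew_fwd + rew_bwd)

def cost_ratio(n_samples, gen_fwd, gen_bwd, rew_fwd, rew_bwd):
    """Reward-gradient training cost divided by ranking-only training cost."""
    with_grad = step_cost(n_samples, gen_fwd, gen_bwd, rew_fwd, rew_bwd)
    ranking_only = step_cost(n_samples, gen_fwd, gen_bwd, rew_fwd)
    return with_grad / ranking_only

# Placeholder unit costs for a reward model larger than the generator.
ratio = cost_ratio(n_samples=16, gen_fwd=1.0, gen_bwd=2.0, rew_fwd=4.0, rew_bwd=8.0)
```

In this simple model the ratio does not depend on the number of samples per step, so the saving comes entirely from how expensive the reward backward pass is relative to everything else; memory effects and any feature-extractor cost are not modeled.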

Circularity Check

0 steps flagged

No circularity; derivation self-contained via external sample ranking and frozen reference

The abstract and description define DrPO explicitly as sampling from the current generator, ranking with an external target reward, synthesizing a non-parametric dipole field plus reference drift from the frozen base, and optimizing via detached feature-space regression. The reward is used solely for ranking and never back-propagated; no equation, parameter fit, or self-citation is shown that reduces the update direction or claimed gains to a quantity defined inside the paper by construction. The central construction therefore remains independent of its own outputs and is evaluated against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The abstract introduces a new algorithmic construction with no explicit fitted constants or new physical entities; it relies on standard assumptions about deterministic one-step generators and the utility of feature-space regression targets.

axioms (1)
  • domain assumption: The target generator is deterministic and produces an image in a single forward pass.
    Stated directly as the setting for which DrPO is designed.
invented entities (1)
  • non-parametric dipole preference field (no independent evidence)
    purpose: To synthesize an update direction from ranked high- and low-scoring samples.
    New construct introduced to replace gradient-based preference signals.

pith-pipeline@v0.9.1-grok · 5764 in / 1343 out tokens · 27265 ms · 2026-06-28T15:18:07.945473+00:00 · methodology
