
arxiv: 2606.02884 · v1 · pith:NHBG2AQOnew · submitted 2026-06-01 · 💻 cs.LG · cs.AI

Are we really tilting? The mechanics of reward guidance in flow and diffusion models

Pith reviewed 2026-06-28 15:18 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reward guidance · diffusion models · reward hacking · Doob h-function · Gaussian mixtures · flow models · guidance algorithms · plug-in estimator

The pith

Reward hacking in diffusion models comes from finite-particle plug-in estimates of the Doob h-function, even for Gaussian targets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reward hacking arises from the standard finite-particle approximation to the Doob h-function used in reward-guided diffusion and flow models. Closed-form calculations on Gaussian and Gaussian-mixture targets with quadratic rewards isolate two concrete failure modes: the estimator over-optimizes the reward inside each mode, and it fails to favor high-reward modes. These mechanisms are shown to persist in practical high-dimensional generation such as FLUX.1. A closed-form damping schedule removes the within-mode bias at no extra cost, while best-of-n sampling mitigates the mode-selection issue.

Core claim

Reward hacking arises from an approximation made in most practical implementations of reward-guided diffusion -- finite-particle plug-in estimation of the Doob h-function -- even in the simplest non-trivial settings of Gaussian and Gaussian mixture targets with quadratic rewards. In closed form, two distinct failure modes of the plug-in estimator are isolated: it leads to reward hacking within each mode and it cannot select high-reward modes.

What carries the argument

Finite-particle plug-in estimator of the Doob h-function, which supplies the steering term that tilts the generative process toward the reward-weighted target measure.
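
To make the steering term concrete, here is a minimal Python sketch of a one-dimensional toy case. It is an illustration under stated assumptions, not the paper's derivation: the conditional law of the terminal sample is taken to be Gaussian, Z ~ N(m, s^2), the reward is r(z) = -(z - a)^2 with scale lam, and the k = 1 plug-in is assumed to evaluate the reward at the conditional mean m (one common point-estimate variant; the paper's estimator may differ). All names (log_h_exact, grad_log_h_exact, grad_log_h_plugin, m, s, a, lam) are hypothetical.

```python
import numpy as np


def log_h_exact(m, s, a, lam):
    """Closed-form log E[exp(-lam * (Z - a)**2)] for Z ~ N(m, s**2)."""
    c = 1.0 + 2.0 * lam * s**2
    return -0.5 * np.log(c) - lam * (m - a) ** 2 / c


def grad_log_h_exact(m, s, a, lam):
    """Derivative of log_h_exact with respect to the conditional mean m."""
    return -2.0 * lam * (m - a) / (1.0 + 2.0 * lam * s**2)


def grad_log_h_plugin(m, a, lam):
    """Point-estimate plug-in (assumed variant): differentiate lam * r(m)."""
    return -2.0 * lam * (m - a)


# Hypothetical toy values, chosen only for illustration.
m, s, a, lam = 0.0, 1.0, 2.0, 0.5

# Monte Carlo check of the closed-form log h.
rng = np.random.default_rng(0)
z = rng.normal(m, s, size=1_000_000)
log_h_mc = float(np.log(np.mean(np.exp(-lam * (z - a) ** 2))))

exact = grad_log_h_exact(m, s, a, lam)
plugin = grad_log_h_plugin(m, a, lam)
oversteer = plugin / exact  # equals 1 + 2 * lam * s**2 in this toy model

print(log_h_exact(m, s, a, lam), log_h_mc, exact, plugin, oversteer)
```

In this toy setting the plug-in gradient exceeds the exact one by the factor 1 + 2 * lam * s**2, so the plug-in pulls harder toward the reward target than the exact steering term would. Whether the reciprocal of that factor coincides with the paper's damping schedule is not established here.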

If this is right

  • A closed-form reward damping schedule removes the within-mode bias with no additional compute.
  • Best-of-n sampling compensates for the inability to select high-reward modes.
  • The same two mechanisms operate in both diffusion and flow models.
  • The identified biases appear in FLUX.1 text-to-image generation and in 2-D checkerboard targets.
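
The role of best-of-n in mode selection can be sketched on a two-mode toy target. This is a hedged illustration, not the paper's experiment: the mixture (means -2 and 2, standard deviation 0.5, equal weights), the reward r(x) = -(x - 2)^2, and the function names are all assumptions, and the candidates here are unguided draws from the mixture.

```python
import numpy as np


def sample_mixture(rng, size, means=(-2.0, 2.0), std=0.5):
    """Draw from an equal-weight 1-D Gaussian mixture (hypothetical target)."""
    comp = rng.integers(0, len(means), size=size)
    return np.asarray(means)[comp] + std * rng.normal(size=size)


def reward(x, a=2.0):
    """Quadratic reward peaked at a, so the mode at +2 is the high-reward mode."""
    return -(x - a) ** 2


def best_of_n(rng, n, trials):
    """For each trial, draw n candidates and keep the highest-reward one."""
    cands = sample_mixture(rng, (trials, n))
    idx = np.argmax(reward(cands), axis=1)
    return cands[np.arange(trials), idx]


rng = np.random.default_rng(0)
# Fraction of selected samples that land in the positive (high-reward) mode.
frac = {n: float(np.mean(best_of_n(rng, n, 5000) > 0)) for n in (1, 4, 16)}
print(frac)
```

With n = 1 the selected sample falls in either mode about equally often; as n grows, the chance that at least one candidate starts in the high-reward mode rises, which is the sense in which selection over initial draws can supply the mode preference.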

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Replacing the plug-in estimator with a higher-fidelity approximation could reduce hacking without changing the reward function.
  • The analysis suggests examining whether similar finite-particle biases appear in other guidance techniques that rely on score or velocity corrections.
  • Hybrid methods that combine the proposed damping with adaptive particle counts may address both failure modes simultaneously.

Load-bearing premise

That the two failure modes isolated for Gaussian and Gaussian-mixture targets with quadratic rewards dominate reward hacking in high-dimensional practical models.

What would settle it

A direct calculation or simulation on a one-dimensional Gaussian target with quadratic reward showing that the finite-particle plug-in estimator produces samples whose reward distribution exactly matches the true tilted distribution with no within-mode shift or mode-selection error.
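
The reference distribution for such a test has a closed form. The sketch below assumes the tilt is defined as the density proportional to rho(x) * exp(lam * r(x)) with rho = N(mu, sigma^2) and r(x) = -(x - a)^2; the reward scale lam and all names are assumptions, and the paper's normalization may differ. Completing the square gives a Gaussian with precision 1/sigma^2 + 2*lam, which the code checks against self-normalized importance weights.

```python
import numpy as np


def tilted_gaussian(mu, sigma, a, lam):
    """Mean and variance of the density proportional to
    N(x; mu, sigma**2) * exp(-lam * (x - a)**2)."""
    prec = 1.0 / sigma**2 + 2.0 * lam
    mean = (mu / sigma**2 + 2.0 * lam * a) / prec
    return mean, 1.0 / prec


# Hypothetical toy values.
mu, sigma, a, lam = 0.0, 1.0, 2.0, 0.5
mean_t, var_t = tilted_gaussian(mu, sigma, a, lam)

# Numerical check: reweight base samples by exp(lam * r(x)).
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=1_000_000)
w = np.exp(-lam * (x - a) ** 2)
w /= w.sum()
mean_is = float(np.sum(w * x))
var_is = float(np.sum(w * (x - mean_is) ** 2))

print(mean_t, var_t, mean_is, var_is)
```

Samples from any guidance scheme on this target can then be compared against mean_t and var_t: a within-mode shift would show up as a mean or variance that differs from these values.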

Figures

Figures reproduced from arXiv: 2606.02884 by Nicholas M. Boffi, Sanjit Dandapanthula.

Figure 1: Reward damping. We introduce reward damping, a simple and principled guidance schedule to mitigate reward hacking. Base FLUX.1 [1] samples guided with ImageReward [2]; further experimental details in Appendix D.4.
Figure 2: Overview. Compared to analytic reward tilting, practical guidance algorithms over-concentrate within each mode and fail to select high-reward modes. We propose a damped reward scale to mitigate within-mode reward hacking and clarify the role of best-of-n in mode selection; combining these two methods often enables us to approximately recover the reward tilt.
Figure 3: Reward hacking for Gaussian mixtures. Exact guidance (A) faithfully samples the tilted distribution $\tilde{\rho}_1$. The k = 1 plug-in estimator (B) overshoots the mean and shrinks covariance; reward damping (D) corrects the collapse better than k = 8 (C) with 8× less computational cost.
Theorem 3 (∞-Wasserstein bound). Assuming that $\rho_1 = \mathcal{N}(\mu, \Sigma)$ and $r(x) = -\|x - a\|_2^2$, let $\tilde{\rho}_1^{(k)}$ denote the terminal distribution…
Figure 4: Mode selection for Gaussian mixture. Best-of-n increases the correct mode probability compared to k = 1 guidance, better matching the analytic tilt.
5 Experiments. All code to reproduce these experiments is available in the following GitHub repository: https://github.com/sanjitdp/reward-guidance. All experiments run on a single NVIDIA RTX A6000 or L40S GPU, and each image takes less than 1.5 minutes to gene…
Figure 5: Masked intensity reward. The reward is the mean pixel intensity inside a top-right circular mask minus the mean intensity outside. Naive guidance maximizes the masked-region brightness by removing the welder from the frame; reward damping obtains a high-reward sample including the welder.
Figure 6: Blueness reward (artist). The reward is the mean blue channel minus the mean of the red and green channels. Naive guidance produces blue images that forget the artist or candles, while reward damping produces a very blue image that retains the artist and shows the warm candlelight.
Panel columns from the adjacent image grid: Unguided, Guided, Guided (lower λ), Guided (k = 8), Guided (damped). Prompt: “a baby fox wearing a cozy knitted sweater”
Figure 7: Blueness reward (fox). Naive guidance creates overwhelmingly blue outputs that wash out the fox’s orange fur; reward damping recognizes that the fox should remain orange while the sweater and background turn blue.
Figure 8: ImageReward (archaeologist). We guide using ImageReward [2], a learned human-preference reward. Both the unguided and naively guided images produce unnatural images or fail to show a brush, while the damped guidance produces a more natural image with a visible brush.
Panel columns from the adjacent image grid: Unguided, Guided, Guided (lower λ), Guided (k = 8), Guided (damped). Prompt: “a dull, muted, washed-out, desaturated Indian outdoor market with stalls and…
Figure 9: ImageReward (market). The image is scored against the prompt “a vibrant Indian outdoor market with colorful stalls and produce.” Naive guidance brightens the lamps while leaving the rest of the image washed out; reward damping produces a more balanced improvement across brightness and color.
…[36] with 5 inner steps. Reward functions. We use three rewards: the blueness reward $r(x) = \bar{x}_{\mathrm{blue}} - \bar{x}_{\mathrm{red}} - \bar{x}_{\mathrm{green}}$…
Figure 10: ImageReward (miner). Naive guidance hacks ImageReward by ignoring the split diorama constraint; reward damping produces dramatic images that still respect the constraint.
…with the warm candlelight, and that leave the fox’s fur orange while turning the sweater and background blue. For ImageReward, naive guidance hacks the reward in three different ways: it produces an unnatural archaeologist scene with no …
Figure 12: Checkerboard guidance. Unlike plug-in guidance, best-of-n can select modes. With reward damping, best-of-n significantly improves fidelity to the analytic tilt.
5.2.2 FLUX.1: text-to-image. In Figures 13 and 14, we use FLUX.1-dev [1] with a VLM reward using Qwen2.5-VL-3B [47] to demonstrate that plug-in guidance cannot perform mode selection in a realistic setting. Here, the reward is r(x) = log(p(Yes)) −…
Figure 13: FLUX mode selection (ECLIPSE DINER). The VLM reward is derived from the question “Does this image clearly show a neon sign with the word ‘ECLIPSE’ as the main readable text?” Finite-particle guidance hacks the complex reward function, damping slightly improves the reward, and best-of-n substantially improves the reward, confirming the importance of the initial seed for mode selection.
Figure 14: FLUX mode selection (NEXT TRAIN MARS). The VLM reward is derived from the question “Does this image clearly show a display with the text ‘NEXT TRAIN MARS’ as the main readable text?” We see the qualitative pattern of…
Figure 15: Single Gaussian target. The k = 1 plug-in over-concentrates exactly as predicted by Theorem 2; increasing k does not help; reward damping recovers the true tilted distribution.
Panel titles from the adjacent plot: A Exact guidance; B Plug-in (k = 1); C Plug-in (k = 8); D Plug-in (damped). Legend: Analytic tilt, Guided samples, Reward target.
Figure 16: Double-well reward. The plug-in over-concentrates at the reward maxima (marked by red ×); reward damping partially corrects this.
Figure 17: Non-isotropic covariances. Both components are tilted toward the target a = (0, 2.5)⊤ via off-diagonal entries ±0.25.
Panel titles from the adjacent plot: A Exact guidance; B Plug-in (k = 1); C Plug-in (k = 8); D Plug-in (damped). Legend: Analytic tilt, Guided samples, Reward target.
Figure 18: Unequal component weights $(\pi_1, \pi_2) = (0.2, 0.8)$.
Panel titles from the adjacent plot: A Exact guidance; B Plug-in (k = 1); C Plug-in (k = 8); D Plug-in (damped). Legend: Analytic tilt, Guided samples, Reward target.
Figure 19: Uncentered components at (−4.0, 0) and (1.0, 0).
…data. We train it via flow matching against the linear interpolant $I_t = (1 - t)\,\epsilon + t X_1$ with $\epsilon \sim \mathcal{N}(0, I_2)$ and $X_1$ drawn uniformly from the 18 filled checkerboard squares. The model is optimized with Adam (learning rate $10^{-3}$, cosine annealing to 0) for $5 \times 10^5$ steps with batch size 4096, using an exponential moving average of the weights with decay 0.9999 …
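
The flow-matching regression described in the training details above can be sketched as follows. This is a minimal NumPy illustration, not the authors' training code: the uniform sampling of t, the placeholder data (uniform on a square rather than the checkerboard), and the names flow_matching_batch and flow_matching_loss are assumptions. The regression target is the time derivative of the linear interpolant, X_1 - eps.

```python
import numpy as np


def flow_matching_batch(x1, rng):
    """Build (t, I_t, target) for the interpolant I_t = (1 - t) * eps + t * x1."""
    n, d = x1.shape
    t = rng.uniform(size=(n, 1))        # assumed: t ~ Uniform[0, 1]
    eps = rng.normal(size=(n, d))       # eps ~ N(0, I_d)
    x_t = (1.0 - t) * eps + t * x1      # point on the interpolant
    target = x1 - eps                   # d/dt of the interpolant
    return t, x_t, target


def flow_matching_loss(velocity_fn, t, x_t, target):
    """Mean squared error between a candidate velocity field and the target."""
    pred = velocity_fn(t, x_t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))


rng = np.random.default_rng(0)
# Placeholder 2-D data, NOT the checkerboard from the paper.
x1 = rng.uniform(-1.0, 1.0, size=(4096, 2))
t, x_t, target = flow_matching_batch(x1, rng)

# Loss of a trivial zero velocity field, as a baseline a model should beat.
zero_loss = flow_matching_loss(lambda t, x: np.zeros_like(x), t, x_t, target)
print(x_t.shape, zero_loss)
```

A trained network would replace the lambda passed as velocity_fn; the optimizer, learning-rate schedule, and weight averaging quoted above are outside the scope of this sketch.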
Figure 22: Step reward trajectories. Colored by terminal sign. Unguided and plug-in trajectories split evenly between positive (green) and negative (coral) modes; best-of-n selection produces a strong majority of positive-mode trajectories, increasingly so as n grows.
Panel titles from the adjacent plot: A Plug-in (k = 1); B Best-of-4; C Best-of-16. Axes: Time t, Position x. Legend: Positive trajectories, N…
Figure 23: Gaussian reward trajectories. Plug-in already skews toward the positive mode; best-of-n further concentrates it.
…$X_{t,s}(x)$ denoting its solution at time $s$ given $X_t = x$, the flow map reward guidance for the reward $r$ is the feedback control $u_t^{\mathrm{FMRG}}(x) := \lambda\, \nabla X_{t,1}(x)^\top \nabla r(X_{t,1}(x))$, and the guided trajectory solves $d\tilde{x}_t = \big(b_t(\tilde{x}_t) + u_t^{\mathrm{FMRG}}(\tilde{x}_t)\big)\,dt$ (14) with $\tilde{x}_0 \sim \mathcal{N}(0, I_d)$. For the quadratic reward $r(x) = -$…
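
The feedback control defined above is the gradient of r composed with the flow map, scaled by λ. A small NumPy sketch of that formula follows. It is an assumption-laden illustration: the flow map is a hypothetical affine map A x + c standing in for X_{t,1}, the Jacobian is taken by central finite differences rather than automatic differentiation, and the reward is r(y) = -||y - a||^2 with a hypothetical target a.

```python
import numpy as np


def jacobian_fd(f, x, eps=1e-5):
    """Central finite-difference Jacobian J[i, j] = d f_i / d x_j."""
    x = np.asarray(x, dtype=float)
    out_dim = np.asarray(f(x)).size
    J = np.zeros((out_dim, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return J


def fmrg_control(flow_map, grad_r, x, lam):
    """lam * J(x)^T grad_r(flow_map(x)), with J the Jacobian of flow_map."""
    J = jacobian_fd(flow_map, x)
    return lam * J.T @ grad_r(flow_map(x))


# Hypothetical affine stand-in for the flow map X_{t,1} at a fixed time t.
A = np.array([[0.5, 0.2], [0.0, 1.5]])
c = np.array([0.1, -0.3])
a = np.array([0.0, 2.5])                  # hypothetical reward target


def flow_map(x):
    return A @ x + c


def grad_r(y):
    return -2.0 * (y - a)                 # gradient of -||y - a||^2


x = np.array([1.0, -1.0])
lam = 0.1
u = fmrg_control(flow_map, grad_r, x, lam)
u_closed = lam * A.T @ grad_r(A @ x + c)  # closed form for the affine map
print(u, u_closed)
```

For a learned flow map, the Jacobian-transpose product would normally come from reverse-mode automatic differentiation instead of finite differences; the finite-difference version is used here only to keep the sketch dependency-free.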
Figure 24: FMRG vs. plug-in on FLUX.1. Top: unguided samples. Middle: the k = 1 plug-in flow. Bottom: flow map reward guidance. Both guided schemes drive the entire image onto a blue tint, sharing the same reward hacking failure mode.
Panel titles from the adjacent plot: A Plug-in (k = 1), σ = 0.5; B FMRG, σ = 0.5; C Plug-in (k = 1), σ = 16; D FMRG, σ = 16. Legend: Analytic tilt, Guided samples, Reward targe…
Figure 25: σ-regime crossover. Empirical comparison of FMRG vs. the k = 1 plug-in flow on an isotropic Gaussian target. Panels A, B (narrow, σ = 0.5 < π2, λ = 1): FMRG concentrates much more sharply at the reward target than the plug-in. Panels C, D (wide, σ = 16 > π2, λ = 0.1): the relative aggressiveness reverses, with FMRG samples visibly more spread out, exactly as predicted by Theorem 19.
Original abstract

Reward guidance algorithms steer a learned generative process toward the reward-tilted measure at inference time. While empirically powerful, these methods are prone to reward hacking: the guided model over-optimizes the reward at the cost of fidelity to the learned distribution. Prior work has attributed this to the complexity of neural reward functions or implicit biases in diffusion training, but its fundamental origins remain poorly understood. We show that reward hacking arises from an approximation made in most practical implementations of reward-guided diffusion -- finite-particle plug-in estimation of the Doob h-function -- even in the simplest non-trivial settings of Gaussian and Gaussian mixture targets with quadratic rewards. In closed form, we isolate two distinct failure modes of the plug-in estimator: it leads to reward hacking within each mode and it cannot select high-reward modes. We propose a closed-form reward damping schedule that corrects the within-mode bias with no additional compute, and clarify the role of best-of-n sampling in compensating for the mode selection failure. Experiments on Gaussian mixture targets, a 2D checkerboard, and FLUX.1 text-to-image generation confirm that our theoretical insights carry over to practical settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that reward hacking in reward-guided diffusion and flow models stems from the finite-particle plug-in estimation of the Doob h-function, even in simple Gaussian and Gaussian-mixture targets with quadratic rewards. Closed-form derivations isolate two failure modes: within-mode hacking and inability to select high-reward modes. A closed-form reward damping schedule is proposed to correct within-mode bias at no extra cost, and best-of-n sampling is clarified as compensating for mode selection. Experiments on GMM targets, a 2D checkerboard, and FLUX.1 text-to-image generation are presented to show the insights carry over to practical settings.

Significance. The closed-form derivations against standard Gaussian math constitute a clear strength, providing a parameter-free mechanistic account of reward hacking origins that does not rely on neural reward complexity. If the identified mechanisms prove dominant, the damping schedule offers an immediately usable correction, and the analysis shifts the field from empirical attribution to precise diagnosis of the plug-in approximation.

major comments (1)
  1. [FLUX.1 experiments] FLUX.1 experiments section: the claim that the two Gaussian-derived failure modes 'dominate in practical high-dimensional settings' is load-bearing for the paper's scope, yet the reported FLUX results demonstrate only correlation with particle count in the h-estimator; no ablations are described that hold neural reward model and training fixed while varying particle number (or vice versa) to isolate the plug-in estimator as the primary driver.
minor comments (1)
  1. The title and abstract reference both 'flow and diffusion models,' but the closed-form analysis is developed only for the diffusion (Doob h-function) case; a brief remark on whether the same plug-in failure modes apply verbatim to continuous normalizing flows would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and valuable feedback on our manuscript. We address the major comment below and will make appropriate revisions to strengthen the paper.

Point-by-point responses
  1. Referee: [FLUX.1 experiments] FLUX.1 experiments section: the claim that the two Gaussian-derived failure modes 'dominate in practical high-dimensional settings' is load-bearing for the paper's scope, yet the reported FLUX results demonstrate only correlation with particle count in the h-estimator; no ablations are described that hold neural reward model and training fixed while varying particle number (or vice versa) to isolate the plug-in estimator as the primary driver.

    Authors: We agree that a stronger claim of dominance would require more rigorous isolation. However, the FLUX.1 experiments do hold the neural reward model and the base generative model fixed (as they are pre-trained components) while varying the number of particles in the finite-particle approximation of the Doob h-function. This directly tests the impact of the plug-in estimator in a practical setting. The results show that increasing the particle count reduces the observed reward hacking, consistent with our theoretical analysis. We acknowledge that this is correlational evidence in a complex model and does not rule out other contributing factors. To address the concern, we will revise the manuscript to clarify the experimental design, emphasize that the results demonstrate consistency with the identified mechanisms rather than proving they dominate, and add a limitations paragraph discussing the challenges of full ablations in high-dimensional pre-trained models. This will be a partial revision. revision: partial

Circularity Check

0 steps flagged

No circularity: closed-form derivations independent of inputs

full rationale

The paper's central derivations isolate two failure modes of finite-particle plug-in estimation of the Doob h-function for Gaussian and GMM targets with quadratic rewards, performed in closed form against standard Gaussian math. No step reduces a claimed prediction to a fitted parameter by construction, renames a known result, or relies on a self-citation chain that defines the target quantities. The FLUX.1 experiments serve as external confirmation rather than load-bearing inputs to the math. This satisfies the criteria for a self-contained derivation with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Analysis rests on the standard definition of the Doob h-function from diffusion literature and the choice of Gaussian targets with quadratic rewards to obtain closed forms; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: Targets are Gaussian or Gaussian mixtures with quadratic rewards
    Used to obtain the closed-form expressions for the two failure modes.

pith-pipeline@v0.9.1-grok · 5736 in / 1162 out tokens · 31161 ms · 2026-06-28T15:18:41.497011+00:00 · methodology


Reference graph

Works this paper leans on

50 extracted references · 3 canonical work pages

  1. [1]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024...

  2. [2]

    ImageReward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In Advances in Neural Information Processing Systems, volume 36, pages 15903–15935. Curran Associates, Inc., 2023. (pages 1, 2, 4, 9, and 10)

  3. [3]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020. (page 2)

  4. [4]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. (page 2)

  5. [5]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, 2022. (page 2)

  6. [6]

    Equivariant diffusion for molecule generation in 3D

    Emiel Hoogeboom, Victor Garcia Satorras, Clément Vignac, and Max Welling. Equivariant diffusion for molecule generation in 3D. In International Conference on Machine Learning, 2022. (page 2)

  7. [7]

    GeoDiff: A geometric diffusion model for molecular conformation generation

    Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. GeoDiff: A geometric diffusion model for molecular conformation generation. In International Conference on Learning Representations, 2022. (page 2)

  8. [8]

    3D equivariant diffusion for target-aware molecule generation and affinity prediction

    Jiaqi Guan, Wesley Wei Qian, Xingang Peng, Yufeng Su, Jian Peng, and Jianzhu Ma. 3D equivariant diffusion for target-aware molecule generation and affinity prediction. In International Conference on Learning Representations, 2023. (page 2)

  9. [9]

    Structure-based drug design with equivariant diffusion models

    Arne Schneuing, Charles Harris, Yuanqi Du, Kieran Didi, Arian Jamasb, Ilia Igashov, Weitao Du, Tom Blundell, Pietro Liò, Carla Gomes, Max Welling, Michael Bronstein, and Bruno Correia. Structure-based drug design with equivariant diffusion models. Nature Computational Science, 2024. doi: 10.1038/s43588-024-00737-x. (page 2)

  10. [10]

    De novo design of protein structure and function with RFdiffusion

    Joseph L. Watson, David Juergens, Nathaniel R. Bennett, Brian L. Trippe, Jason Yim, Helen E. Eisenach, Woody Ahern, Andrew J. Borst, Robert J. Ragotte, Lukas F. Milles, et al. De novo design of protein structure and function with RFdiffusion. Nature, 2023. doi: 10.1038/s41586-023-06415-8. (pages 2 and 4)

  11. [11]

    SE(3) diffusion model with application to protein backbone generation

    Jason Yim, Brian L. Trippe, Valentin De Bortoli, Emile Mathieu, Arnaud Doucet, Regina Barzilay, and Tommi Jaakkola. SE(3) diffusion model with application to protein backbone generation. In International Conference on Machine Learning, 2023. (page 2)

  12. [12]

    Illuminating protein space with a programmable generative model

    John B. Ingraham, Max Baranov, Zak Costello, Karl W. Barber, Wujie Wang, Ahmed Ismail, Vincent Frappier, Dana M. Lord, Christopher Ng-Thow-Hing, Erik R. Van Vlack, et al. Illuminating protein space with a programmable generative model. Nature, 2023. doi: 10.1038/s41586-023-06728-8. (pages 2 and 4)

  13. [13]

    Protein structure generation via folding diffusion

    Kevin E. Wu, Kevin K. Yang, Rianne van den Berg, Sarah Alamdari, James Y. Zou, Alex X. Lu, and Ava P. Amini. Protein structure generation via folding diffusion. Nature Communications, 2024. doi: 10.1038/s41467-024-45051-2. (page 2)

  14. [14]

    Pick-a-Pic: An open dataset of user preferences for text-to-image generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 2023. (pages 2 and 4)

  15. [15]

    Training diffusion models with reinforcement learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. International Conference on Learning Representations, 2024. (page 2)

  16. [16]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. (page 2)

  17. [17]

    RewardDance: Reward scaling in visual generation

    Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, Yan Zeng, and Weilin Huang. RewardDance: Reward scaling in visual generation. arXiv preprint arXiv:2509.08826, 2025. (page 2)

  18. [18]

    3D equivariant diffusion for target-aware molecule generation and affinity prediction

    Jiaqi Guan, Wesley Wei Qian, Xingang Peng, Yufeng Su, Jian Peng, and Jianzhu Ma. 3D equivariant diffusion for target-aware molecule generation and affinity prediction. International Conference on Learning Representations, 2023. (pages 2 and 4)

  19. [19]

    DecompDiff: Diffusion models with decomposed priors for structure-based drug design

    Jiaqi Guan, Xiangxin Zhou, Yuwei Yang, Yu Bao, Jian Peng, Jianzhu Ma, Qiang Liu, Liang Wang, and Quanquan Gu. DecompDiff: Diffusion models with decomposed priors for structure-based drug design. In International Conference on Machine Learning, 2023. (pages 2 and 4)

  20. [20]

    General binding affinity guidance for diffusion models in structure-based drug design

    Yue Jian, Curtis Wu, Danny Reidenbach, and Aditi S. Krishnapriyan. General binding affinity guidance for diffusion models in structure-based drug design. arXiv preprint arXiv:2406.16821, 2024. (pages 2 and 4)

  21. [21]

    Nate Gruver, Samuel Stanton, Nathan Frey, Tim G. J. Rudner, Isidro Hotzel, Julien Lafrance-Vanasse, Arvind Rajpal, Kyunghyun Cho, and Andrew G. Wilson. Protein design with guided discrete diffusion. In Advances in Neural Information Processing Systems, 2023. (pages 2 and 4)

  22. [22]

    Diffusion models beat GANs on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 2021. (page 2)

  23. [23]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,

  24. [24]

    Fine-tuning of continuous-time diffusion models as entropy-regularized control

    Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Tommaso Biancalani, and Sergey Levine. Fine-tuning of continuous-time diffusion models as entropy-regularized control. arXiv preprint arXiv:2402.15194, 2024. (pages 2 and 4)

  25. [25]

    Inference-time alignment in diffusion models with reward-guided generation: Tutorial and review

    Masatoshi Uehara, Yulai Zhao, Chenyu Wang, Xiner Li, Aviv Regev, Sergey Levine, and Tommaso Biancalani. Inference-time alignment in diffusion models with reward-guided generation: Tutorial and review. arXiv preprint arXiv:2501.09685, 2025. (pages 2, 3, and 4)

  26. [26]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023. (page 2)

  27. [27]

    Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking

    Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D’Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, Peter Shaw, and Jonathan Berant. Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking. In Conference on Language Modeling, 2024. (page 2)

  28. [28]

    Scaling laws for reward model overoptimization in direct alignment algorithms

    Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, W. Bradley Knox, Chelsea Finn, and Scott Niekum. Scaling laws for reward model overoptimization in direct alignment algorithms. In Advances in Neural Information Processing Systems, volume 37, 2024. (page 2)

  29. [29]

    Confronting reward overoptimization for diffusion models: A perspective of inductive and primacy biases

    Ziyi Zhang, Sen Zhang, Yibing Zhan, Yong Luo, Yonggang Wen, and Dacheng Tao. Confronting reward overoptimization for diffusion models: A perspective of inductive and primacy biases. In International Conference on Machine Learning, pages 60396–60413. PMLR, 2024. (page 2)

  30. [30]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023. (page 3)

  31. [31]

    Stochastic interpolants: A unifying framework for flows and diffusions

    Michael Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. Journal of Machine Learning Research, 2025. (pages 3 and 4)

  32. [32]

    L. C. G. Rogers and David Williams. Diffusions, Markov Processes and Martingales. Cambridge Mathematical Library. Cambridge University Press, 2nd edition, 2000. (page 3)

  33. [33]

    Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, and Ricky T. Q. Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control. International Conference on Learning Representations, 2025. (pages 3, 4, and 21)

  34. [34]

    Peter Holderrieth, Uriel Singer, Tommi Jaakkola, Ricky T. Q. Chen, Yaron Lipman, and Brian Karrer. GLASS flows: Efficient inference for reward alignment of flow and diffusion models. International Conference on Learning Representations, 2026. (pages 3, 4, and 19)

  35. [35]

    Meta Flow Maps enable scalable reward alignment

    Peter Potaptchik, Adhi Saravanan, Abbas Mammadov, Alvaro Prat, Michael S. Albergo, and Yee Whye Teh. Meta Flow Maps enable scalable reward alignment. arXiv preprint arXiv:2601.14430, 2026. (pages 3 and 4)

  36. [36]

    Diamond maps: Efficient reward alignment via stochastic flow maps

    Peter Holderrieth, Douglas Chen, Luca Eyring, Ishin Shah, Giri Anantharaman, Yutong He, Zeynep Akata, Tommi Jaakkola, Nicholas Matthew Boffi, and Max Simchowitz. Diamond maps: Efficient reward alignment via stochastic flow maps. arXiv preprint arXiv:2602.05993, 2026. (pages 3, 4, 9, 10, and 35)

  37. [37]

    Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023. (page 4)

  38. [38]

    Nicholas M Boffi, Michael S Albergo, and Eric Vanden-Eijnden. How to build a consistency model: Learning flow maps via self-distillation. Advances in Neural Information Processing Systems, 2025. (page 4)

  39. [39]

    Jerry Huang, Justin Lin, Sheel Shah, Kartik Nair, and Nicholas M. Boffi. How to guide your flow: Steering flow maps for rapid test-time alignment. In International Conference on Machine Learning,

  40. [40]

    (pages 4, 6, 35, 36, and 37)

  41. [41]

    Muthu Chidambaram, Khashayar Gatmiry, Sitan Chen, Holden Lee, and Jianfeng Lu. What does guidance do? A fine-grained analysis in a simple setting. Advances in Neural Information Processing Systems, 2024. (page 4)

  42. [42]

    Yuchen Wu, Minshuo Chen, Zihao Li, Mengdi Wang, and Yuting Wei. Theoretical insights for diffusion guidance: A case study for Gaussian mixture models. International Conference on Machine Learning,

  43. [43]

    Krunoslav Lehman Pavasovic, Jakob Verbeek, Giulio Biroli, and Marc Mezard. Classifier-free guidance: From high-dimensional analysis to generalized guidance forms. arXiv preprint arXiv:2502.07849, 2025. (page 4)

  44. [44]

    Enrico Ventura, Beatrice Achilli, Luca Ambrogioni, and Carlo Lucibello. Emergence of distortions in high-dimensional guided diffusion models. arXiv preprint arXiv:2602.00716, 2026. (page 4)

  45. [45]

    Xi Wang, Nicolas Dufour, Nefeli Andreou, Marie-Paule Cani, Victoria Fernández Abrevaya, David Picard, and Vicky Kalogeiton. Analysis of classifier-free guidance weight schedulers. Transactions on Machine Learning Research, 2024. (page 4)

  46. [46]

    Ankur Moitra, Andrej Risteski, and Dhruv Rohatgi. Steering diffusion models with quadratic rewards: a fine-grained analysis. arXiv preprint arXiv:2602.16570, 2026. (page 4)

  47. [47]

    Ankur Moitra, Andrej Risteski, and Dhruv Rohatgi. The tractability landscape of diffusion alignment: regularization, rewards, and computational primitives. arXiv preprint arXiv:2605.11361, 2026. (page 4)

  48. [48]

    Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923,

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report.ar...

  49. [49]

    Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: a Nonasymptotic Theory of Independence. Oxford University Press, 2013. (page 29)

    A Further background

    A.1 Forward SDE matches the probability flow time-marginals

    Proposition 7 (Time-marginals of the forward SDE). The solution of the forward SDE (2) has the same time-marginal ...

  50. [50]

    For (iii), note that the $n$ independent finite-$k$ plug-in trajectories are i.i.d., so the probability that all $n$ trajectories end up negative is $(1/2)^n$.

    D Experimental details and ablations

    D.1 Damping for Gaussian target

    We use a single Gaussian target $\rho_1 = \mathcal{N}(0, 0.5 I_2)$ with quadratic reward $r(x) = -\|x - a\|_2^2$ centered at $a = (0, 2.5)^\top$ and $\lambda = 3.0$. By Proposition...