Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline
1 Pith paper cites this work. Polarity classification is still indexing.
abstract
We propose a deterministic adjoint matching framework that formulates human preference alignment for flow-based generative models as an optimal control problem over velocity fields. One can directly regress the control toward a value-gradient-induced target under the current policy, leading to a simple and stable training objective. Building on this perspective, we introduce a truncated adjoint scheme that focuses computation on the terminal portion of the trajectory, where reward-relevant signals concentrate, which yields substantial computational savings while preserving alignment quality. We further generalize the framework beyond standard KL-based regularization, allowing more flexible trade-offs between alignment strength and distributional preservation. Experiments on SiT-XL/2 and FLUX.2-Klein-4B demonstrate consistent gains across multiple alignment metrics, along with substantially improved diversity and mode preservation.
fields
cs.LG 1
years
2026 1
verdicts
UNVERDICTED 1
representative citing papers
- OPD+: Rethinking the Advantage Design for On-Policy Distillation
OPD+ removes the bias from stop-gradient in on-policy distillation by deriving correct gradients for f-divergences, outperforming standard KL-based methods on math reasoning and tool-use tasks.