arxiv: 2409.08861 · v5 · pith:FJGMMMB3new · submitted 2024-09-13 · 💻 cs.LG · math.OC · stat.ML

Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with Memoryless Stochastic Optimal Control

classification 💻 cs.LG · math.OC · stat.ML
keywords fine-tuning, models, reward, matching, adjoint, control, diffusion, existing

Dynamical generative models that produce samples through an iterative process, such as Flow Matching and denoising diffusion models, have seen widespread use, but there have not been many theoretically-sound methods for improving these models with reward fine-tuning. In this work, we cast reward fine-tuning as stochastic optimal control (SOC). Critically, we prove that a very specific memoryless noise schedule must be enforced during fine-tuning, in order to account for the dependency between the noise variable and the generated samples. We also propose a new algorithm named Adjoint Matching which outperforms existing SOC algorithms, by casting SOC problems as a regression problem. We find that our approach significantly improves over existing methods for reward fine-tuning, achieving better consistency, realism, and generalization to unseen human preference reward models, while retaining sample diversity.
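
For orientation, the kind of control problem the abstract refers to can be sketched as follows. This is a generic quadratic-cost SOC formulation, not a transcription of the paper's equations: the symbols b (base drift), u (control), \sigma (noise schedule), r (reward), and p^{\mathrm{base}} (base model's sample distribution) are notation assumed here for illustration.

```latex
\begin{aligned}
\min_{u}\;\; & \mathbb{E}\left[\int_0^1 \tfrac{1}{2}\,\lVert u(X_t,t)\rVert^2 \,dt \;-\; r(X_1)\right] \\
\text{s.t.}\;\; & dX_t = \bigl(b(X_t,t) + \sigma(t)\,u(X_t,t)\bigr)\,dt + \sigma(t)\,dB_t,
\qquad X_0 \sim p_0 .
\end{aligned}
% Desired outcome of fine-tuning (a reward-tilted version of the base model):
% p^{*}(x_1) \;\propto\; p^{\mathrm{base}}(x_1)\,\exp\bigl(r(x_1)\bigr)
```

Read this way, the control cost keeps the fine-tuned process close to the base process while the terminal term rewards high-scoring samples. The abstract's claim is that the tilted target is only recovered when \sigma(t) is chosen so that the generated sample X_1 does not depend on the initial noise X_0, which is the role of the memoryless noise schedule.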

Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

    cs.LG 2026-04 unverdicted novelty 8.0

    FMRG reformulates guidance as deterministic optimal control, deriving a single-trajectory method using the flow map that matches or exceeds baselines on reward-guided generation and inverse problems with 3 NFEs at tex...

  2. Variational Optimality of Föllmer Processes in Generative Diffusions

    math.ST 2026-02 unverdicted novelty 8.0

    Föllmer processes are variationally optimal among generative diffusions because they minimize the impact of drift estimation error on path-space KL divergence, rendering different interpolation schedules statistically...

  3. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  4. Constrained Diffusion Models with Primal-Dual Inference

    cs.LG 2026-06 unverdicted novelty 7.0

    Develops primal-dual inference (PDI) that jointly infers optimal primal distributions and dual multipliers during diffusion sampling using a dual-conditioned score network.

  5. Adaptive Order Policies for Masked Diffusion

    cs.LG 2026-05 unverdicted novelty 7.0

    A policy network learns to choose unmasking order in masked diffusion by reweighting the loss, outperforming random and heuristic baselines on ordering-sensitive tasks.

  6. Inference-Time Alignment of Diffusion Models via Trust-Region Iterative Twisted Sequential Monte Carlo

    cs.LG 2026-05 conditional novelty 7.0

    TRI-TSMC is a trust-region framework for learning twisting functions in SMC-based inference-time alignment of diffusion models that yields zero-variance samplers in theory and better alignment on text and image tasks ...

  7. Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion

    cs.LG 2026-05 unverdicted novelty 7.0

    CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.

  8. SURGE: Approximation and Training Free Particle Filter for Diffusion Surrogate

    stat.ML 2026-05 unverdicted novelty 7.0

    URGE performs unbiased path-wise importance reweighting via Girsanov estimation for derivative-free inference-time scaling in diffusion models, proving equivalence to particle-wise SMC and outperforming baselines empirically.

  9. How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

    cs.LG 2026-04 unverdicted novelty 7.0

    FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.

  10. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

    cs.CV 2026-04 unverdicted novelty 7.0

    LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

  11. Scalable Maximum Entropy Reinforcement Learning for Diffusion Policies via Adjoint Matching

    cs.LG 2026-06 unverdicted novelty 6.0

    Presents adjoint matching for scalable max-ent RL training of diffusion policies, enabling simulation-free optimization.

  12. AdvantageFlow: Advantage-Weighted Least Squares for RL in Flow Models

    cs.LG 2026-05 unverdicted novelty 6.0

    AdvantageFlow proposes an advantage-weighted forward-process least-squares loss for RL in rectified flow models, stabilized by rollout policy regularization, and reports better image generation performance than Flow-G...

  13. SURGE: Approximation and Training Free Particle Filter for Diffusion Surrogate

    stat.ML 2026-05 unverdicted novelty 6.0

    SURGE is an unbiased particle filter that fuses diffusion-model simulations with noisy observations via sequential Monte Carlo reweighting over diffusion trajectories.

  14. Simple Approximation and Derivative Free Inference-Time Scaling for Diffusion Models via Sequential Monte Carlo on Path Measures

    stat.ML 2026-05 unverdicted novelty 6.0

    URGE performs unbiased inference-time scaling for diffusion models by attaching multiplicative path weights from Girsanov estimation and resampling trajectories, with a proven equivalence to prior particle-wise SMC schemes.

  15. Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

    cs.RO 2026-05 unverdicted novelty 6.0

    Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.

  16. Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

    cs.RO 2026-05 unverdicted novelty 6.0

    LWD is a fleet-scale offline-to-online RL framework that continually improves pretrained VLA policies using autonomous rollouts and human interventions, reaching 95% average success on real-world manipulation tasks.

  17. How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

    cs.LG 2026-04 unverdicted novelty 6.0

    FMRG is a training-free single-trajectory guidance framework for flow-based models that matches or exceeds baselines on reward-guided tasks and inverse problems using as few as 3 NFEs.

  18. Generative optimal transport via forward-backward HJB matching

    cond-mat.stat-mech 2026-04 unverdicted novelty 6.0

    A forward-backward HJB duality computes the optimal stochastic transport control from easy forward relaxation trajectories alone, expressed as path-space free energy without backward simulation.

  19. FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

    cs.LG 2026-04 unverdicted novelty 6.0

    Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

  20. CMAD: Cooperative Multi-Agent Diffusion via Stochastic Optimal Control

    cs.LG 2026-02 unverdicted novelty 6.0

    CMAD formulates compositional generation as cooperative stochastic optimal control among pre-trained diffusion models, validated on conditional MNIST against a gradient-guidance baseline.

  21. Flow Matching for Measure Transport and Feedback Stabilization of Control-Affine Systems

    math.OC 2025-10 unverdicted novelty 6.0

    Introduces flow matching for measure transport in control-affine systems and a complementary noising-time-reversal method for stabilization, with numerical examples on linear and nonlinear cases.

  22. Sample-Efficient Optimisation over the Outputs of Generative Models

    stat.ML 2025-09 unverdicted novelty 6.0

    O3 uses surrogate latent spaces extracted from generative models to perform sample-efficient black-box optimization over their outputs, outperforming direct sampling and original-latent optimization on image and prote...

  23. Improving Video Generation with Human Feedback

    cs.CV 2025-01 unverdicted novelty 6.0

    A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

  24. Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

    cs.CV 2025-01 conditional novelty 6.0

    Diffusion models improve generation quality via inference-time search over noise candidates guided by verifiers and algorithms, yielding gains beyond denoising step scaling on class- and text-conditioned benchmarks.

  25. DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization

    cs.CV 2024-12 unverdicted novelty 6.0

    DOLLAR combines variational score and consistency distillation for few-step video generation plus latent reward optimization, reporting 82.57 VBench score and up to 278x speedup over the teacher diffusion model for 12...

  26. Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies

    cs.LG 2026-05 unverdicted novelty 5.0

    Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.

  27. TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Guided Optimization

    cs.CV 2026-03 unverdicted novelty 5.0

    TIGFlow-GRPO uses a Trajectory-Interaction-Graph in conditional flow matching plus Flow-GRPO optimization to produce more accurate, socially compliant, and physically feasible trajectory forecasts on ETH/UCY and SDD datasets.