Fine-tuning of continuous-time diffusion models as entropy-regularized control
Canonical reference. 80% of citing Pith papers cite this work as background.
citing papers explorer
- Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
  Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
- Supervised Guidance Training for Infinite-Dimensional Diffusion Models
  Supervised Guidance Training enables conditioning of infinite-dimensional diffusion models via an extended Doob h-transform so that fine-tuned models accurately sample from posteriors in function space.
- Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline
  A new adjoint matching framework formulates flow model alignment as optimal control, enabling direct regression training and terminal-trajectory truncation for efficiency gains on models like SiT-XL and FLUX.
- Personalizing Text-to-Image Generation to Individual Taste
  PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
- Inference-Time Scaling of Diffusion Language Models via Trajectory Refinement
  PG-DLM applies particle Gibbs sampling over full trajectories in diffusion language models to enable iterative refinement, yielding higher accuracy on reward-guided generation with theoretical convergence guarantees.
- Posterior Inference in Latent Space for Scalable Constrained Black-box Optimization
  Reformulates constrained black-box optimization as posterior inference in latent space of flow-based models amortized by outsourced diffusion models, claiming superior performance on synthetic and real tasks.
- Hierarchical Variational Policies for Reward-Guided Diffusion
  A hierarchical variational formulation amortizes test-time guidance in diffusion models to achieve strong quality-speed tradeoffs with significantly reduced inference compute.
- Gradient-Free Noise Optimization for Reward Alignment in Generative Models
  ZeNO frames noise optimization as a path-integral control problem solvable from zeroth-order reward evaluations, connecting to implicit Langevin dynamics for reward-tilted distributions.
- Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
  Derives RAM, a reward-adjusted consistency loss extending diffusion pretraining regression to efficient KL-regularized RL post-training, achieving peak rewards up to 50x faster than Flow-GRPO on Stable Diffusion 3.5M.
- A unified perspective on fine-tuning and sampling with diffusion and flow models
  A unified framework for exponential tilting in diffusion and flow models that includes bias-variance decompositions showing finite gradient variance for some methods, norm bounds on adjoint ODEs, and adapted losses with new Crooks and Jarzynski identities.
- How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
  FMRG is a training-free single-trajectory guidance framework for flow-based models that matches or exceeds baselines on reward-guided tasks and inverse problems using as few as 3 NFEs.
- Robust mean field control: stochastic maximum principle and variational mean field games
  A new min-max robust formulation for mean field control and variational mean field games is introduced, with existence, uniqueness, and a stochastic maximum principle established under convexity-concavity assumptions.
- Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models
  Reward Score Matching unifies reward-based fine-tuning for flow and diffusion models by recasting alignment as score matching to a value-guided target.
- Adjoint Matching through the Lens of the Stochastic Maximum Principle in Optimal Control
  Adjoint matching objectives derived from the Stochastic Maximum Principle have critical points satisfying HJB stationarity conditions for SOC problems with control-dependent drift and diffusion.
- ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
  ETS performs training-free RL alignment for language models by energy-guided test-time scaling with Monte Carlo energy estimation and importance sampling acceleration.
- Stein Diffusion Guidance: Training-Free Posterior Correction for Sampling Beyond High-Density Regions
  Stein Diffusion Guidance corrects approximate posteriors in diffusion sampling via a Stein variational mechanism and surrogate SOC objective to enable effective guidance beyond high-density regimes.
- DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
  DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.