Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
hub Canonical reference
Fine- tuning of continuous-time diffusion models as entropy-regularized control
Canonical reference. 80% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Supervised Guidance Training enables conditioning of infinite-dimensional diffusion models via an extended Doob h-transform so that fine-tuned models accurately sample from posteriors in function space.
A new adjoint matching framework formulates flow model alignment as optimal control, enabling direct regression training and terminal-trajectory truncation for efficiency gains on models like SiT-XL and FLUX.
PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.
PG-DLM applies particle Gibbs sampling over full trajectories in diffusion language models to enable iterative refinement, yielding higher accuracy on reward-guided generation with theoretical convergence guarantees.
Reformulates constrained black-box optimization as posterior inference in latent space of flow-based models amortized by outsourced diffusion models, claiming superior performance on synthetic and real tasks.
A hierarchical variational formulation amortizes test-time guidance in diffusion models to achieve strong quality-speed tradeoffs with significantly reduced inference compute.
ZeNO frames noise optimization as a path-integral control problem solvable from zeroth-order reward evaluations, connecting to implicit Langevin dynamics for reward-tilted distributions.
Derives RAM, a reward-adjusted consistency loss extending diffusion pretraining regression to efficient KL-regularized RL post-training, achieving peak rewards up to 50x faster than Flow-GRPO on Stable Diffusion 3.5M.
A unified framework for exponential tilting in diffusion and flow models that includes bias-variance decompositions showing finite gradient variance for some methods, norm bounds on adjoint ODEs, and adapted losses with new Crooks and Jarzynski identities.
A new min-max robust formulation for mean field control and variational mean field games is introduced, with existence, uniqueness, and a stochastic maximum principle established under convexity-concavity assumptions.
Adjoint matching objectives derived from the Stochastic Maximum Principle have critical points satisfying HJB stationarity conditions for SOC problems with control-dependent drift and diffusion.
ETS performs training-free RL alignment for language models by energy-guided test-time scaling with Monte Carlo energy estimation and importance sampling acceleration.
Stein Diffusion Guidance corrects approximate posteriors in diffusion sampling via a Stein variational mechanism and surrogate SOC objective to enable effective guidance beyond high-density regimes.
DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
citing papers explorer
-
A unified perspective on fine-tuning and sampling with diffusion and flow models
A unified framework for exponential tilting in diffusion and flow models that includes bias-variance decompositions showing finite gradient variance for some methods, norm bounds on adjoint ODEs, and adapted losses with new Crooks and Jarzynski identities.