DiPOD: Diffusion Policy Optimization without Drifting Apart
Pith reviewed 2026-06-27 07:04 UTC · model grok-4.3
The pith
DiPOD stabilizes diffusion policy training by adding an on-policy ELBO regularizer that prevents double-drift misalignment between the variational bound and true likelihood.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiPOD maintains tight-bound behavior throughout training by interleaving self-distillation with policy-improving gradient updates, which prevents the ELBO from separating from the true log-likelihood and aligns the proxy policy gradient with the true gradient of expected return.
What carries the argument
The on-policy ELBO regularizer added to each diffusion policy-gradient update, which works by enforcing alignment between the variational bound and true likelihood during interleaved self-distillation steps.
Load-bearing premise
Double-drift is the dominant source of instability and the ELBO regularizer will reliably prevent misalignment without introducing new optimization problems or performance trade-offs.
What would settle it
Train the same diffusion policies with and without the on-policy ELBO regularizer and check whether the ELBO-log-likelihood gap reappears and whether reward curves become unstable or lower without the regularizer.
Figures
read the original abstract
RL post-training has become increasingly pivotal for improving diffusion policies, but existing diffusion policy-gradient methods are often unstable and cannot achieve reliable policy improvement. We identify the cause as the double-drift phenomenon: optimizing a variational surrogate can let the ELBO separate from the true log-likelihood, which then makes the resulting proxy policy gradient misaligned with the true policy gradient of expected return. We propose \textbf{DiPOD}, a diffusion policy optimization framework that maintains tight-bound behavior throughout training by interleaving self-distillation with policy-improving gradient updates. This leads to a simple and practical algorithm: augmenting each diffusion policy-gradient update with an on-policy ELBO regularizer. Across diffusion language model post-training and continuous-control diffusion policies, DiPOD substantially stabilizes training and reaches higher rewards than previous methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies a double-drift phenomenon in existing diffusion policy-gradient methods for RL post-training, where optimizing a variational surrogate allows the ELBO to separate from the true log-likelihood and produces misaligned proxy policy gradients. It proposes DiPOD, which maintains tight-bound behavior by interleaving self-distillation with policy-improving gradient updates via an on-policy ELBO regularizer. The method is evaluated on diffusion language model post-training and continuous-control diffusion policies, where it is reported to stabilize training and achieve higher rewards than prior approaches.
Significance. If the causal mechanism is confirmed, the result would be significant for stabilizing diffusion-based policy optimization in RL, offering a practical augmentation to existing policy-gradient methods without requiring major architectural changes. The approach leverages standard ELBO concepts in a targeted interleaving scheme, which could generalize across diffusion policy settings if the double-drift diagnosis holds.
major comments (2)
- [Abstract] Abstract: the central claim that double-drift is the dominant cause of instability and that the on-policy ELBO regularizer reliably prevents misalignment rests on an unverified causal link; no derivations, equations, or error analysis are supplied to show how the ELBO-log-likelihood separation produces gradient misalignment or how the regularizer closes the gap without new optimization issues.
- [Experiments] Experiments (implied by abstract claims): performance gains in rewards and stability are reported, yet there is no measurement or plot of the ELBO-true likelihood divergence, cosine similarity between surrogate and true policy gradients, or bound tightness over training steps for DiPOD versus baselines; without this, gains could arise from generic regularization rather than the claimed mechanism.
minor comments (2)
- [Method] Clarify the precise mathematical form of the on-policy ELBO regularizer, its weighting schedule relative to the policy gradient term, and the self-distillation procedure.
- [Experiments] Provide the exact experimental details (hyperparameters, number of runs, statistical significance) that support the stability and reward improvements.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the causal claims and experimental validation of the double-drift phenomenon. We address each major comment below and commit to revisions that strengthen the presentation without altering the core contributions.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that double-drift is the dominant cause of instability and that the on-policy ELBO regularizer reliably prevents misalignment rests on an unverified causal link; no derivations, equations, or error analysis are supplied to show how the ELBO-log-likelihood separation produces gradient misalignment or how the regularizer closes the gap without new optimization issues.
Authors: The manuscript provides a conceptual derivation of double-drift in Section 3 by decomposing the surrogate gradient and showing the ELBO gap term that induces misalignment with the true return gradient. We agree the link can be made more rigorous and will add explicit equations for the gradient difference, a first-order error bound on misalignment, and analysis of how the on-policy regularizer contracts the gap without introducing new instabilities. revision: yes
-
Referee: [Experiments] Experiments (implied by abstract claims): performance gains in rewards and stability are reported, yet there is no measurement or plot of the ELBO-true likelihood divergence, cosine similarity between surrogate and true policy gradients, or bound tightness over training steps for DiPOD versus baselines; without this, gains could arise from generic regularization rather than the claimed mechanism.
Authors: We agree that direct monitoring of the ELBO-log-likelihood divergence, surrogate-true gradient cosine similarity, and bound tightness would provide stronger causal evidence. The current results emphasize end-task rewards and training stability; we will add the requested diagnostic plots and quantitative comparisons versus baselines in the revised experiments section. revision: yes
Circularity Check
No circularity: empirical method proposal without self-referential derivations
full rationale
The paper identifies double-drift via ELBO separation from true log-likelihood and proposes interleaving self-distillation with on-policy ELBO regularization as DiPOD. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described claims. The argument rests on empirical stability/reward gains rather than any derivation that reduces by construction to its inputs, so the work is self-contained.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Arel. Arel’s sudoku generator. URL https://www.ocf.berkeley.edu/~arel/sudoku/main. html. Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,
-
[3]
10 Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine
doi: 10.48550.arXiv preprint ARXIV .2410.24164. 10 Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301,
-
[4]
Dflash: Block diffusion for flash speculative decoding
Jian Chen, Yesheng Liang, and Zhijian Liu. Dflash: Block diffusion for flash speculative decoding. arXiv preprint arXiv:2602.06036,
-
[5]
Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,
-
[6]
Expo: Stable reinforcement learning with expressive policies.arXiv preprint arXiv:2507.07986,
Perry Dong, Qiyang Li, Dorsa Sadigh, and Chelsea Finn. Expo: Stable reinforcement learning with expressive policies.arXiv preprint arXiv:2507.07986,
-
[7]
Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies.arXiv preprint arXiv:2304.10573,
-
[8]
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022a. Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet...
-
[9]
Q-learning with adjoint matching.arXiv preprint arXiv:2601.14234,
11 Qiyang Li and Sergey Levine. Q-learning with adjoint matching.arXiv preprint arXiv:2601.14234,
-
[10]
Reinforcement learning with action chunking.arXiv preprint arXiv:2507.07969,
Qiyang Li, Zhiyuan Zhou, and Sergey Levine. Reinforcement learning with action chunking.arXiv preprint arXiv:2507.07969,
-
[11]
Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, and Aviral Kumar. Policy agnostic rl: Offline rl and online rl fine-tuning of any class and backbone.arXiv preprint arXiv:2412.06685,
-
[12]
Flow matching policy gradients.arXiv preprint arXiv:2507.21053,
David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients.arXiv preprint arXiv:2507.21053,
-
[13]
Large language diffusion models.arXiv preprint arXiv:2502.09992,
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji- Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint arXiv:2502.09992,
-
[14]
Diffusion policy policy optimization
Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588,
-
[15]
Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,
-
[16]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathemat- ical reasoning in open language models.arXiv preprint arXiv:2402.03300,
-
[17]
Parrot: Data-driven behavioral priors for reinforcement learning.arXiv preprint arXiv:2011.10024,
Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, and Sergey Levine. Parrot: Data-driven behavioral priors for reinforcement learning.arXiv preprint arXiv:2011.10024,
arXiv 2011
-
[18]
Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502,
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502,
Pith/arXiv arXiv 2010
-
[19]
12 Yuxuan Song, Zheng Zhang, Cheng Luo, Pengyang Gao, Fan Xia, Hao Luo, Zheng Li, Yuehang Yang, Hongli Yu, Xingwei Qu, et al. Seed diffusion: A large-scale diffusion language model with high-speed inference.arXiv preprint arXiv:2508.02193,
-
[20]
Xiaohang Tang, Rares Dolga, Sangwoong Yoon, and Ilija Bogunovic. wd1: Weighted policy optimization for reasoning in diffusion language models.arXiv preprint arXiv:2507.08838,
-
[21]
Chenyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shan- non Zejiang Shen, Feiyu Chen, Tommi Jaakkola, et al. Spg: Sandwiched policy gradient for masked diffusion language models.arXiv preprint arXiv:2510.09541,
-
[22]
Dream-coder 7b: An open diffusion language model for code
Zhihui Xie, Jiacheng Ye, Lin Zheng, Jiahui Gao, Jingwei Dong, Zirui Wu, Xueliang Zhao, Shansan Gong, Xin Jiang, Zhenguo Li, et al. Dream-coder 7b: An open diffusion language model for code. arXiv preprint arXiv:2509.01142,
-
[23]
Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809,
Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models.arXiv preprint arXiv:2505.15809,
-
[24]
Flow policy gradients for robot control.arXiv preprint arXiv:2602.02481,
Brent Yi, Hongsuk Choi, Himanshu Gaurav Singh, Xiaoyu Huang, Takara E Truong, Carmelo Sferrazza, Yi Ma, Rocky Duan, Pieter Abbeel, Guanya Shi, et al. Flow policy gradients for robot control.arXiv preprint arXiv:2602.02481,
-
[25]
Siyan Zhao, Devaansh Gupta, Qinqing Zheng, and Aditya Grover. d1: Scaling reasoning in diffusion large language models via reinforcement learning.arXiv preprint arXiv:2504.12216,
-
[26]
13 A Additional Related Work Diffusion models.Diffusion models are a powerful class of generative models, with wide applica- tions including image generation [Ho et al., 2020, Song et al., 2020, Song and Ermon, 2019], video generation [Ho et al., 2022b,a], and robotics [Black et al., Bjorck et al., 2025]. More recently, diffu- sion language models have em...
2020
-
[27]
The environment transitions to state st+1 ∼P(·|s t, at), and gives the agent a reward rt =r(s t, at)
At each timestept, the policy receives the current state st and takes an action at ∼π(·|s t). The environment transitions to state st+1 ∼P(·|s t, at), and gives the agent a reward rt =r(s t, at). The episode terminates when the agent reaches a terminal states ∗ ∈ S. An RL algorithm aims to optimize the cumulative reward, defined as J(θ) =E at∼π(·|st),st+1...
2016
-
[28]
SPG [Wang et al., 2025] obtains a practical EUBO surrogate for masked dLLMs by exploiting the special absorbing-mask forward process
(often reducing to per-timestep losses). SPG [Wang et al., 2025] obtains a practical EUBO surrogate for masked dLLMs by exploiting the special absorbing-mask forward process. Let x1:n be a token sequence, where xi is the token at position i, and let m denote the mask token. In this paragraph, t indexes the diffusion noising step, while i indexes the seque...
2025
-
[29]
This proves Equation (26)
Whenη≤1/L J , J(θ k+1)≥ J( ¯θk) +η⟨v, v+e⟩ − η 2 ∥v+e∥ 2 =J( ¯θk) + η 2 ∥v∥2 − ∥e∥2 ≥ J( ¯θk) + η 2 h ∇J( ¯θk) 2 −c 2 g ¯εk i . This proves Equation (26). Strict improvement follows whenever ∥∇J( ¯θk)∥> c g √¯εk. If the oracle is policy preserving, then ρ¯θk =ρ θk, so Equation (25) implies ¯εk =D L ρθk (¯θk)≤ε , and policy preservation also gives J( ¯θk) ...
2025
-
[30]
Algorithm implementation.We implement FPO by updating the model parameters according to Equation (4), and SPG by updating the model parameters according to Equation (5)
The reward function is chosen for clarity of exposition and is without loss of generality. Algorithm implementation.We implement FPO by updating the model parameters according to Equation (4), and SPG by updating the model parameters according to Equation (5). In the SPG implementation, a surrogate for EUBO (the LEUBO in Wang et al. [2025]) is used to tac...
2025
-
[31]
Although SPG+DiPOD does not show a clear improvement at sequence lengths128 and 512, it is the only setting that reaches the near-100%regime in the zero-shot setting
Sudoku exhibits a discrete training behavior, with runs tending to settle around roughly 25%, 40%, or near 100% accuracy. Although SPG+DiPOD does not show a clear improvement at sequence lengths128 and 512, it is the only setting that reaches the near-100%regime in the zero-shot setting. E.2 Ablation on the DiPOD Coefficientβ We ablate the DiPOD coefficie...
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.