D2 Actor Critic: Diffusion Actor Meets Distributional Critic
Pith reviewed 2026-05-25 07:29 UTC · model grok-4.3
The pith
D2AC trains diffusion policies online in RL by pairing them with a distributional critic that fuses distributional RL with clipped double Q-learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
D2AC is a model-free RL algorithm that trains diffusion policies online through a policy improvement objective free of high-variance gradients and backpropagation through time, made stable by a distributional critic obtained from fusing distributional RL with clipped double Q-learning.
What carries the argument
The distributional critic formed by fusing distributional RL and clipped double Q-learning, which supports stable policy improvement for the diffusion actor.
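To make the idea of "fusing" concrete, here is a minimal sketch of one way a clipped-double rule could be applied to categorical return distributions. It is an illustration under stated assumptions, not the paper's method: the source does not say whether the critic is categorical or quantile-based, nor how the two critics are combined. The function names (`expected_value`, `project_target`, `clipped_double_target`, `cross_entropy`) are hypothetical, and the rule of choosing the critic with the lower expected value is an assumption.

```python
import math


def expected_value(probs, atoms):
    """Mean of a categorical return distribution over fixed atoms."""
    return sum(p * z for p, z in zip(probs, atoms))


def project_target(probs, atoms, reward, gamma, done):
    """Project the distribution of r + gamma * Z back onto evenly spaced atoms.

    This is the standard categorical (C51-style) projection; evenly spaced
    atoms are assumed.
    """
    v_min, v_max = atoms[0], atoms[-1]
    delta = (v_max - v_min) / (len(atoms) - 1)
    out = [0.0] * len(atoms)
    for p, z in zip(probs, atoms):
        # Shift and scale each atom, then clip it to the support.
        tz = reward + (0.0 if done else gamma * z)
        tz = min(max(tz, v_min), v_max)
        b = (tz - v_min) / delta
        lo = int(math.floor(b))
        hi = min(lo + 1, len(atoms) - 1)
        w_hi = b - lo
        # Split the probability mass between the two neighbouring atoms.
        out[lo] += p * (1.0 - w_hi)
        out[hi] += p * w_hi
    return out


def clipped_double_target(probs_1, probs_2, atoms, reward, gamma, done=False):
    """Assumed fusion rule: bootstrap from the critic with the lower mean."""
    q1 = expected_value(probs_1, atoms)
    q2 = expected_value(probs_2, atoms)
    pessimistic = probs_1 if q1 <= q2 else probs_2
    return project_target(pessimistic, atoms, reward, gamma, done)


def cross_entropy(target, predicted, eps=1e-12):
    """Loss between the projected target and a critic's predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


if __name__ == "__main__":
    atoms = [0.0, 1.0, 2.0, 3.0, 4.0]
    optimistic = [0.0, 0.0, 0.0, 0.0, 1.0]   # next-state critic 1
    cautious = [0.0, 0.0, 1.0, 0.0, 0.0]     # next-state critic 2
    target = clipped_double_target(optimistic, cautious, atoms, reward=0.5, gamma=0.5)
    print(target, cross_entropy(target, [0.2] * 5))
```

In this sketch the minimum is taken over whole distributions ranked by their means, which keeps the target a valid probability vector. Other fusions (for example, an element-wise rule on quantiles) are equally conceivable, and the source does not settle which one D2AC uses.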
If this is right
- The algorithm reaches state-of-the-art performance on a benchmark of eighteen hard RL tasks spanning Humanoid, Dog, and Shadow Hand domains.
- It applies equally to dense-reward and goal-conditioned RL scenarios.
- It demonstrates behavioral robustness and generalization on a predator-prey task beyond standard benchmarks.
Where Pith is reading between the lines
- The same critic fusion could be tested with other expressive policy representations such as normalizing flows to check whether the stability benefit generalizes.
- Avoiding backpropagation through time may allow the objective to scale to longer-horizon tasks where standard diffusion policy training becomes prohibitive.
- Success in goal-conditioned settings suggests the method could support transfer to new goals without retraining the full policy.
Load-bearing premise
The fusion of distributional RL and clipped double Q-learning produces a robust distributional critic that enables stable policy improvement for diffusion actors without introducing new instability or bias.
What would settle it
Observing stable training and performance that matches the full method on the Humanoid or Shadow Hand tasks when the distributional critic is replaced by a standard critic would falsify the claim that the fused critic is critical for stability. Conversely, instability or degraded performance under that replacement would support it.
Original abstract
We introduce D2AC, a new model-free reinforcement learning (RL) algorithm designed to train expressive diffusion policies online effectively. At its core is a policy improvement objective that avoids the high variance of typical policy gradients and the complexity of backpropagation through time. This stable learning process is critically enabled by our second contribution: a robust distributional critic, which we design through a fusion of distributional RL and clipped double Q-learning. The resulting algorithm is highly effective, achieving state-of-the-art performance on a benchmark of eighteen hard RL tasks, including Humanoid, Dog, and Shadow Hand domains, spanning both dense-reward and goal-conditioned RL scenarios. Beyond standard benchmarks, we also evaluate a biologically motivated predator-prey task to examine the behavioral robustness and generalization capacity of our approach. Code: https://github.com/d2ac-actor-critic/d2ac-public
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces D2AC, a model-free RL algorithm for training expressive diffusion policies online. Its core contributions are a policy improvement objective for the diffusion actor that avoids high-variance policy gradients and backpropagation-through-time, and a robust distributional critic obtained by fusing distributional RL with clipped double Q-learning. The resulting method is reported to achieve state-of-the-art performance on a benchmark of 18 challenging continuous-control tasks (Humanoid, Dog, Shadow Hand, etc.) spanning dense-reward and goal-conditioned settings, with additional evaluation on a predator-prey task; an open-source implementation is provided.
Significance. If the reported results and stability claims hold, the work supplies a practical route to online training of high-capacity diffusion policies in RL without the usual sources of variance or instability. The explicit fusion of distributional RL and clipped double Q-learning, together with the supplied code repository, would constitute a reproducible advance for domains that benefit from expressive, multimodal policies.
minor comments (3)
- [§3.2] The precise form of the distributional critic loss after the fusion with clipped double Q-learning is described only at a high level; adding the explicit Bellman operator and the quantile or categorical parameterization used would improve reproducibility.
- [Table 1, §5] While aggregate SOTA claims are stated, the per-task normalized scores and the number of random seeds are not listed in the main text; moving the full per-task table, or at least the mean±std summary, into the main body would strengthen the empirical section.
- [§4.1] The policy-improvement objective for the diffusion actor is given, but the precise weighting between the actor and critic terms (if any) and the temperature schedule are not stated explicitly; a short paragraph or equation clarifying these hyperparameters would aid readers.
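For orientation on the first comment, the textbook forms of the two ingredients are written out here. These are standard definitions from the distributional RL and clipped double Q-learning literature, not the paper's own equations, which the material above does not reproduce; how D2AC combines them is exactly what the referee asks the authors to spell out.

```latex
% Distributional Bellman operator for a fixed policy \pi
(\mathcal{T}^{\pi} Z)(s, a) \overset{D}{=} R(s, a) + \gamma\, Z(S', A'),
\qquad S' \sim P(\cdot \mid s, a),\; A' \sim \pi(\cdot \mid S')

% Categorical parameterization on N fixed, evenly spaced atoms
Z_{\theta}(s, a) = \sum_{i=0}^{N-1} p_{\theta,i}(s, a)\, \delta_{z_i},
\qquad z_i = V_{\min} + i\,\frac{V_{\max} - V_{\min}}{N - 1}

% Clipped double Q-learning target for two scalar critics
y = r + \gamma \min_{j \in \{1, 2\}} Q_{\theta'_j}(s', a'),
\qquad a' \sim \pi(\cdot \mid s')
```

The first two lines describe a random return $Z$ rather than its mean, while the third operates on scalar estimates. An explicit statement of where the minimum is taken (over means, quantiles, or whole distributions) would remove the ambiguity.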
Simulated Author's Rebuttal
We thank the referee for their positive summary of the paper, recognition of its significance for online training of diffusion policies, and recommendation of minor revision. We are pleased that the core contributions—the policy improvement objective and the fused distributional critic—are accurately captured, along with the benchmark results and open-source code.
Circularity Check
No significant circularity detected
The paper presents a policy improvement objective for diffusion actors and a distributional critic formed by fusing distributional RL with clipped double Q-learning. These elements are introduced as design choices with no equations or steps shown that reduce the claimed stability or SOTA results to fitted parameters or self-definitions by construction. Performance is justified through benchmark experiments on 18 tasks rather than internal tautologies, and no self-citation chains or uniqueness theorems are invoked as load-bearing premises in the provided text. The derivation chain remains self-contained against external empirical validation.
Reference graph
Works this paper leans on
-
[1]
Maximum a Posteriori Policy Optimisation
Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920,
-
[2]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450,
-
[3]
Distributed Distributional Deterministic Policy Gradients
Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva Tb, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617,
-
[4]
A distributional perspective on reinforcement learning
13 Published in Transactions on Machine Learning Research (10/2025) Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International conference on machine learning, pp. 449–458. PMLR,
-
[5]
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301,
-
[6]
Boosting continuous control with consistency policy
Yuhui Chen, Haoran Li, and Dongbin Zhao. Boosting continuous control with consistency policy. arXiv preprint arXiv:2310.06343,
-
[7]
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137,
-
[8]
Maximum entropy reinforcement learning with diffusion policy
Xiaoyi Dong, Jian Cheng, and Xi Sheryl Zhang. Maximum entropy reinforcement learning with diffusion policy. arXiv preprint arXiv:2502.11612,
-
[9]
Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taïga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, et al. Stop regressing: Training value functions via classification for scalable deep RL. arXiv preprint arXiv:2403.03950,
-
[10]
Chen-Xiao Gao, Chenyang Wu, Mingjun Cao, Chenjun Xiao, Yang Yu, and Zongzhang Zhang. Behavior-regularized diffusion policy optimization for offline reinforcement learning. arXiv preprint arXiv:2502.04778,
-
[11]
Mastering Diverse Domains through World Models
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104,
-
[12]
Shuo Han, German Espinosa, Junda Huang, Daniel A Dombeck, Malcolm A MacIver, and Bradly C Stadie. Of mice and machines: A comparison of learning between real world mice and RL agents. arXiv preprint arXiv:2505.12204,
-
[13]
Temporal difference learning for model predictive control
Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. arXiv preprint arXiv:2203.04955,
-
[14]
TD-MPC2: Scalable, Robust World Models for Continuous Control
Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828,
-
[15]
IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies
Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. IDQL: Implicit Q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573,
-
[16]
Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. Advances in neural information processing systems, 28,
-
[17]
Bilinear value networks
Zhang-Wei Hong, Ge Yang, and Pulkit Agrawal. Bilinear value networks. arXiv preprint arXiv:2204.13695,
-
[18]
Investigating the histogram loss in regression
Ehsan Imani, Kai Luedemann, Sam Scholnick-Hughes, Esraa Elelimy, and Martha White. Investigating the histogram loss in regression. arXiv preprint arXiv:2402.13425,
-
[19]
Paul Jeha, Will Grathwohl, Michael Riis Andersen, Carl Henrik Ek, and Jes Frellsen. Variance reduction of diffusion model's gradients with Taylor approximation-based control variate. arXiv preprint arXiv:2408.12270,
-
[20]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114,
-
[21]
Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review
Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909,
-
[22]
Continuous control with deep reinforcement learning
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971,
-
[23]
Distributional soft actor-critic with diffusion policy
Tong Liu, Yinuo Wang, Xujie Song, Wenjun Zou, Liangfa Chen, Likun Wang, Bin Shuai, Jingliang Duan, and Shengbo Eben Li. Distributional soft actor-critic with diffusion policy. arXiv preprint arXiv:2507.01381,
-
[24]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,
-
[25]
Haitong Ma, Tianyi Chen, Kai Wang, Na Li, and Bo Dai. Soft diffusion actor-critic: Efficient online reinforcement learning for diffusion policy. arXiv preprint arXiv:2502.00361,
-
[26]
Human-level control through deep reinforcement learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533,
-
[27]
Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research
Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464,
-
[28]
Safe offline reinforcement learning using trajectory-level diffusion models
Ralf Römer, Lukas Brunke, Martin Schuck, and Angela P Schoellig. Safe offline reinforcement learning using trajectory-level diffusion models. In ICRA 2024 Workshop - Back to the Future: Robot Learning Going Probabilistic,
-
[29]
Equivalence Between Policy Gradients and Soft Q-Learning
John Schulman, Xi Chen, and Pieter Abbeel. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017a. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b. Max Schwarzer, Johan Samir Obando Ceron, Aaron Courville, ...
-
[30]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a. Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189,
-
[31]
Score-Based Generative Modeling through Stochastic Differential Equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b. Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469,
-
[32]
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind control suite. arXiv preprint arXiv:1801.00690,
-
[33]
Double Q-learning
Hado Van Hasselt. Double Q-learning. Advances in neural information processing systems, 23,
-
[34]
Diffusion model alignment using direct preference optimization
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. arXiv preprint arXiv:2311.12908,
-
[35]
Benchmarking Model-Based Reinforcement Learning
Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. arXiv preprint arXiv:1907.02057,
-
[36]
Model Predictive Path Integral Control using Covariance Variable Importance Sampling
Grady Williams, Andrew Aldrich, and Evangelos Theodorou. Model predictive path integral control using covariance variable importance sampling. arXiv preprint arXiv:1509.01149,
-
[37]
Policy representation via diffusion probability model for reinforcement learning
Long Yang, Zhixiong Huang, Fenghao Lei, Yucun Zhong, Yiming Yang, Cong Fang, Shiting Wen, Binbin Zhou, and Zhouchen Lin. Policy representation via diffusion probability model for reinforcement learning. arXiv preprint arXiv:2305.13122,
-
[38]
Mingqi Yuan, Bo Li, Xin Jin, and Wenjun Zeng. ULTHO: Ultra-lightweight yet efficient hyperparameter optimization in deep reinforcement learning. arXiv preprint arXiv:2503.06101,
-
[39]
A. Analysis of Computational Cost. To analyze the computational cost of D2AC, we compare its wall-clock training time against both the model-based TD-MPC2 and the model-free SAC (Figure 10). The results highlight D2AC's efficiency profile and its relationship with sample efficiency, as established in our main results (Figure 3). Computational Profile vs. Ba...
-
[40]
of using more diffusion steps for training and for inference. In addition, we also find that setting π_ref = π_ϕ, as done in the D2AC policy loss (equation 15), rather than just using the actions from the replay buffer (π_ref ≠ π_ϕ), slightly improves results. Table 3: Performance Statistics at 500K Environment Steps. Configuration, Quadruped Walk, Humanoid Walk. K_train = K = 2 778...
-
[41]
can also satisfy those two conditions. Lately, a different type of discrete critic based on a two-hot representation has gained popularity from the success of MuZero (Schrittwieser et al., 2020), Dreamer-v3 (Hafner et al., 2023), and TD-MPC2 (Hansen et al., 2023). To draw the connection between the two types of discrete critics, we first notice the following:...
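The excerpt in [41] mentions two-hot critics without showing the encoding. A minimal sketch of the generic technique follows, with hypothetical names (`two_hot`, `decode`) and arbitrary bin locations. It illustrates only the encoding itself: a scalar is placed on its two neighbouring bins with weights chosen so that the weighted mean recovers the (clipped) scalar. It does not reproduce whatever connection the paper goes on to draw, which is truncated in the excerpt.

```python
import bisect


def two_hot(value, bins):
    """Encode a scalar as weights on its two nearest bins.

    `bins` must be strictly increasing with at least two entries.
    Values outside the range are clipped to the nearest edge bin.
    """
    if len(bins) < 2:
        raise ValueError("need at least two bins")
    v = min(max(value, bins[0]), bins[-1])
    # Index of the first bin strictly above v, capped at the last bin.
    hi = min(bisect.bisect_right(bins, v), len(bins) - 1)
    lo = hi - 1
    w_hi = (v - bins[lo]) / (bins[hi] - bins[lo])
    weights = [0.0] * len(bins)
    weights[lo] = 1.0 - w_hi
    weights[hi] = w_hi
    return weights


def decode(weights, bins):
    """Recover the scalar as the weighted mean of the bin locations."""
    return sum(w * b for w, b in zip(weights, bins))


if __name__ == "__main__":
    bins = [-2.0, -1.0, 0.0, 1.0, 2.0]
    encoded = two_hot(0.25, bins)
    print(encoded, decode(encoded, bins))
```

A critic trained this way would typically output a softmax over the bins and be fitted to the two-hot vector with a cross-entropy loss; that training detail is general background, not something the excerpt states about D2AC.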